[Fix](pyudf) clear stale UDAF state cache on drop#63062
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
ee34297 to
1edf152
Compare
|
/review |
There was a problem hiding this comment.
I found two blocking issues in the new Python UDAF drop cleanup path.
Critical checkpoint conclusions:
- Goal/test: The PR addresses stale inline UDAF class reuse after DROP/CREATE with the same signature, and adds a targeted regression test. The function-id cache key covers the recreate case, but the cleanup path introduces failure and lifecycle risks.
- Scope: The change is mostly focused, but the cleanup action should preserve existing best-effort cache cleanup semantics and in-flight query lifecycles.
- Concurrency: The new DROP cleanup can run concurrently with active Python Flight UDAF exchanges. Current code can clear states still needed by those exchanges.
- Lifecycle: Function-id keys avoid reuse by new functions, but removing/clearing old managers needs reference-safe lifecycle handling.
- Compatibility/config: No new config or storage/protocol incompatibility identified for the BE/Python process pair built from this PR.
- Parallel paths: Module cache cleanup remains best-effort; the new UDAF cleanup should follow the same non-crashing error handling model.
- Tests: Regression coverage exercises stale reuse, but does not cover cleanup failure or concurrent DROP while an old UDAF query is still running.
- Observability/performance/data correctness: No additional observability requirement found; the main correctness risks are in error propagation and active-state invalidation.
- User focus: No additional user-provided review focus was specified.
|
/review |
There was a problem hiding this comment.
Automated review completed for PR head 7bdde75837b919220cc8758638b8bdaf61dd14b7.
No new blocking issues found. I also reviewed the existing inline threads and did not duplicate them; the current head appears to address the previously raised UDAF cleanup exception and in-flight state invalidation concerns by using WARN_IF_ERROR, returning the manager under lock, and detaching managers without clearing active states.
Critical checkpoint conclusions:
- Goal/test: The PR targets stale Python UDAF state/class reuse after DROP/CREATE. The function-id keying plus drop cleanup path matches that goal, with added BE unit tests and regression coverage for inline drop/recreate.
- Scope: The change is small and focused on Python UDF/UDAF metadata, manager cleanup, and related tests.
- Concurrency/lifecycle: Reviewed the async DROP cleanup versus active Flight exchanges. The current implementation avoids clearing in-flight state and prevents lookup/pop races for manager retrieval. No additional concurrency blocker found.
- Configuration: No new configuration items.
- Compatibility/storage: No storage-format or persisted metadata change. The new
idfield is carried in the internal BE-to-Python descriptor and all reviewed production builders populate it. - Parallel paths: UDF, UDAF, and UDTF metadata serialization now include the function id; only UDAF manager keys use it, which is appropriate for this bug.
- Error handling: Cleanup failures are best-effort warnings instead of uncaught task-worker exceptions. Status handling in the reviewed changed C++ paths is acceptable.
- Tests: Added BE unit coverage for no-process, failed-action, action broadcast, and JSON id serialization, plus regression coverage for inline UDAF recreation. I did not run tests locally in this review runner.
- Observability: Existing INFO/WARNING logs around cleanup/broadcast are sufficient for this change.
- Transaction/persistence/data writes: Not applicable.
- Performance: Cleanup scans only the per-process UDAF manager registry under a lock; no obvious performance blocker found.
- User focus: No additional user-provided review focus was specified.
7bdde75 to
7bff96e
Compare
7bff96e to
e321868
Compare
|
run buildall |
TPC-H: Total hot run time: 29888 ms |
TPC-DS: Total hot run time: 169707 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 29628 ms |
TPC-DS: Total hot run time: 170228 ms |
|
/review |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
1 similar comment
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
I found one additional blocking issue: the newly added BE unit test calls a private PythonServerManager method directly, so the test target will not compile. Existing review threads already cover the async UDAF manager cleanup lifecycle and cleanup-error handling concerns; I did not duplicate those.
Critical checkpoint conclusions:
- Goal/test: The PR targets stale Python UDAF cache reuse after DROP/CREATE and adds regression/unit coverage, but the new unit test has a compile-time access violation.
- Scope: The functional change is focused on Python UDAF metadata/cache keys and drop cleanup; no unrelated broad behavior was necessary for this fix.
- Concurrency/lifecycle: The relevant async DROP cleanup and in-flight UDAF lifecycle risks are already tracked in existing review context; the current distinct finding is test compilation.
- Error handling: The C++ cleanup path now uses warning-style handling rather than throwing from the task callback; no new duplicate comment was added.
- Compatibility/protocol: The new
idmetadata is produced by BE and consumed by the local Python server; no additional compatibility issue found in this review. - Tests: Regression coverage was added for drop/recreate, but BE unit coverage must compile before it can validate anything.
- User focus: No additional user-provided review focus was supplied.
52b5132 to
d3d0fe9
Compare
|
/review |
There was a problem hiding this comment.
No additional blocking issues found in the current diff.
Critical checkpoint conclusions:
- Goal/test coverage: The change fixes stale Python UDAF class reuse after DROP/CREATE with the same name/signature by carrying the FE function id into Python metadata and the UDAF manager key. The added regression case covers inline UDAF recreate behavior, and the added unit tests cover metadata serialization and broadcast error/no-process paths.
- Scope: The modification is focused on Python UDF/UDAF metadata, Python server cache keys, and DROP cleanup dispatch.
- Concurrency/lifecycle: The current cleanup detaches managers under
udaf_managers_lockand avoids clearingmanager.states, so in-flight exchanges holding an existing manager are not invalidated by cleanup. The C++ cleanup path is best-effort and no longer throws out of the task callback. - Compatibility/protocol: Adding
idto the BE-generated Python metadata is consumed by the Python server launched from the same BE package; no separate persistent format or FE-BE thrift schema change is introduced. - Parallel paths: UDF, UDAF, and UDTF metadata serialization now includes the existing
TFunction.id; the DROP cleanup is applied when FE sends a function id. - Error handling: Broadcast failures are converted to
Status; DROP UDAF cleanup logs viaWARN_IF_ERROR, matching best-effort cache cleanup behavior. - Memory/observability/performance: No new tracked BE allocations or hot-path scans of concern were found. Cleanup uses existing logs and Python GC only after managers are detached.
- Transaction/persistence/data visibility: Not applicable; this PR does not alter storage, transaction, visible-version, or persistence behavior.
- User focus: No additional user-provided review focus was supplied.
Previously raised inline review threads were treated as known context and not duplicated.
|
run buildall |
TPC-H: Total hot run time: 29600 ms |
TPC-DS: Total hot run time: 171047 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Fix Python UDAF stale cache reuse after dropping and recreating an inline UDAF with the same name/signature.
The Python server previously keyed
UDAF statemanagers byfunction name and argument types, so a recreated inline UDAF could reuse the old loaded Python class. This fix includes the FE function id in the Python UDAF metadata/cache key and clearsUDAF statemanager cache duringDROP FUNCTIONcleanup.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)