[Fix](pyudf) clear stale UDAF state cache on drop#63062
[Fix](pyudf) clear stale UDAF state cache on drop#63062linrrzqqq wants to merge 3 commits intoapache:masterfrom
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
ee34297 to
1edf152
Compare
|
/review |
There was a problem hiding this comment.
I found two blocking issues in the new Python UDAF drop cleanup path.
Critical checkpoint conclusions:
- Goal/test: The PR addresses stale inline UDAF class reuse after DROP/CREATE with the same signature, and adds a targeted regression test. The function-id cache key covers the recreate case, but the cleanup path introduces failure and lifecycle risks.
- Scope: The change is mostly focused, but the cleanup action should preserve existing best-effort cache cleanup semantics and in-flight query lifecycles.
- Concurrency: The new DROP cleanup can run concurrently with active Python Flight UDAF exchanges. Current code can clear states still needed by those exchanges.
- Lifecycle: Function-id keys avoid reuse by new functions, but removing/clearing old managers needs reference-safe lifecycle handling.
- Compatibility/config: No new config or storage/protocol incompatibility identified for the BE/Python process pair built from this PR.
- Parallel paths: Module cache cleanup remains best-effort; the new UDAF cleanup should follow the same non-crashing error handling model.
- Tests: Regression coverage exercises stale reuse, but does not cover cleanup failure or concurrent DROP while an old UDAF query is still running.
- Observability/performance/data correctness: No additional observability requirement found; the main correctness risks are in error propagation and active-state invalidation.
- User focus: No additional user-provided review focus was specified.
| void PythonServerManager::clear_udaf_state_cache(int64_t function_id) { | ||
| std::string body = fmt::format(R"({{"function_id": {}}})", function_id); | ||
| THROW_IF_ERROR(_broadcast_action_to_processes("clear_udaf_state_cache", body, | ||
| fmt::format("function_id={}", function_id))); |
There was a problem hiding this comment.
This can crash the BE from a DROP FUNCTION cleanup task. _broadcast_action_to_processes() returns an error whenever an active Python process fails DoAction/Next, and THROW_IF_ERROR converts that into a doris::Exception. The caller is clean_udf_cache_callback(), which is run by TaskWorkerPool via _callback(task) without any catch boundary, so an uncaught exception terminates the worker thread/process instead of reporting a best-effort cache cleanup failure. The existing module-cache cleanup path logs and continues; this should return Status and be handled with a warning (or use the existing cleanup-style WARN_IF_ERROR) rather than throwing out of the task callback.
| for key in keys_to_remove: | ||
| manager = self.udaf_state_managers.pop(key) | ||
| manager.states.clear() | ||
| cleared += 1 |
There was a problem hiding this comment.
Clearing manager.states can invalidate in-flight UDAF queries. The DROP cleanup task is submitted asynchronously after FE removes the function, while an already-started query can still have a Flight exchange using this same UDAFStateManager for the old function id. If this action runs between that query's CREATE/ACCUMULATE and later SERIALIZE/FINALIZE/DESTROY calls, those operations will find their place_id entries removed and fail with KeyError/failed UDAF results. Since adding function_id to the key already prevents a recreated function from reusing the old class, cleanup should detach the manager from udaf_state_managers without clearing a manager that active exchanges may still reference, or add explicit lifecycle/ref-count coordination. Also consider returning the manager from _get_udaf_state_manager() while still under the lock so a concurrent pop cannot occur between lookup and return.
|
/review |
There was a problem hiding this comment.
Automated review completed for PR head 7bdde75837b919220cc8758638b8bdaf61dd14b7.
No new blocking issues found. I also reviewed the existing inline threads and did not duplicate them; the current head appears to address the previously raised UDAF cleanup exception and in-flight state invalidation concerns by using WARN_IF_ERROR, returning the manager under lock, and detaching managers without clearing active states.
Critical checkpoint conclusions:
- Goal/test: The PR targets stale Python UDAF state/class reuse after DROP/CREATE. The function-id keying plus drop cleanup path matches that goal, with added BE unit tests and regression coverage for inline drop/recreate.
- Scope: The change is small and focused on Python UDF/UDAF metadata, manager cleanup, and related tests.
- Concurrency/lifecycle: Reviewed the async DROP cleanup versus active Flight exchanges. The current implementation avoids clearing in-flight state and prevents lookup/pop races for manager retrieval. No additional concurrency blocker found.
- Configuration: No new configuration items.
- Compatibility/storage: No storage-format or persisted metadata change. The new
idfield is carried in the internal BE-to-Python descriptor and all reviewed production builders populate it. - Parallel paths: UDF, UDAF, and UDTF metadata serialization now include the function id; only UDAF manager keys use it, which is appropriate for this bug.
- Error handling: Cleanup failures are best-effort warnings instead of uncaught task-worker exceptions. Status handling in the reviewed changed C++ paths is acceptable.
- Tests: Added BE unit coverage for no-process, failed-action, action broadcast, and JSON id serialization, plus regression coverage for inline UDAF recreation. I did not run tests locally in this review runner.
- Observability: Existing INFO/WARNING logs around cleanup/broadcast are sufficient for this change.
- Transaction/persistence/data writes: Not applicable.
- Performance: Cleanup scans only the per-process UDAF manager registry under a lock; no obvious performance blocker found.
- User focus: No additional user-provided review focus was specified.
7bdde75 to
7bff96e
Compare
7bff96e to
e321868
Compare
|
run buildall |
TPC-H: Total hot run time: 29888 ms |
TPC-DS: Total hot run time: 169707 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run buildall |
TPC-H: Total hot run time: 29628 ms |
TPC-DS: Total hot run time: 170228 ms |
|
/review |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
1 similar comment
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
/review |
There was a problem hiding this comment.
I found one additional blocking issue: the newly added BE unit test calls a private PythonServerManager method directly, so the test target will not compile. Existing review threads already cover the async UDAF manager cleanup lifecycle and cleanup-error handling concerns; I did not duplicate those.
Critical checkpoint conclusions:
- Goal/test: The PR targets stale Python UDAF cache reuse after DROP/CREATE and adds regression/unit coverage, but the new unit test has a compile-time access violation.
- Scope: The functional change is focused on Python UDAF metadata/cache keys and drop cleanup; no unrelated broad behavior was necessary for this fix.
- Concurrency/lifecycle: The relevant async DROP cleanup and in-flight UDAF lifecycle risks are already tracked in existing review context; the current distinct finding is test compilation.
- Error handling: The C++ cleanup path now uses warning-style handling rather than throwing from the task callback; no new duplicate comment was added.
- Compatibility/protocol: The new
idmetadata is produced by BE and consumed by the local Python server; no additional compatibility issue found in this review. - Tests: Regression coverage was added for drop/recreate, but BE unit coverage must compile before it can validate anything.
- User focus: No additional user-provided review focus was supplied.
|
|
||
| mgr.set_process_pool_for_test(version, {process}); | ||
| auto status = mgr._broadcast_action_to_processes( | ||
| "clear_udaf_state_cache", R"({"function_id": 12345})", "function_id=12345"); |
There was a problem hiding this comment.
This test will not compile as written. _broadcast_action_to_processes() is declared after private: in PythonServerManager, and PythonServerTest is not a friend and does not use a BE_TEST public wrapper for this method. Please either test through the public clear_udaf_state_cache() path or expose a test-only wrapper under #ifdef BE_TEST instead of calling the private method directly.
What problem does this PR solve?
Issue Number: close #xxx
Related PR: #xxx
Problem Summary:
Fix Python UDAF stale cache reuse after dropping and recreating an inline UDAF with the same name/signature.
The Python server previously keyed
UDAF statemanagers byfunction name and argument types, so a recreated inline UDAF could reuse the old loaded Python class. This fix includes the FE function id in the Python UDAF metadata/cache key and clearsUDAF statemanager cache duringDROP FUNCTIONcleanup.Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)