Introduce a process-wide singleton engine for `.collect(engine="gpu")` #22410

madsbk wants to merge 9 commits into rapidsai:main
> Because each call forks a new child, process-wide side-effects
> (the ``_bind_done`` flag, CPU affinity, environment variables) never
> leak between tests or back into the pytest process.
>
> def _run_in_subprocess(target: Callable[[], None]) -> None:
This PR exposed some issues with the "fork" approach, so we now use "spawn" instead. Otherwise, the tests remain the same.
> chunk: TableChunk
> if chunks:
>     chunk = await evaluate_batch(chunks, context, ir, ir_context=ir_context)
> else:
>     # This rank received no input partitions. Produce an empty chunk
>     # with the IR's output schema so the AllGather below still has
>     # something to insert (and other ranks don't deadlock waiting).
>     chunk = empty_table_chunk(ir, context, ir_context.get_cuda_stream())
This fixes a bug exposed by this PR: a multi-rank top-k crashes when a rank receives zero input partitions. An empty `chunks` list flows into `evaluate_batch([], …)` → `concat_batch([], …)` → `_concat()` → `pylibcudf.concatenate.concatenate([])`, which raises `ValueError: input list may not be empty`.

Why now: this branch widens the `explain_engine` fixture in `test_explain.py` from a hand-rolled `pl.GPUEngine(executor="streaming")` (which used `cluster="single"`) to `streaming_engine_factory`, which now parametrizes over `[spmd, spmd-small, dask, ray]`. The fixture switch was needed so the test would not conflict with the session-scoped streaming engines introduced by the new active-engine guard. The expanded matrix is what first exercises a multi-worker top-k where one rank legitimately receives no input partitions, exposing the latent bug.

Fix: when `chunks` is empty, construct an empty `TableChunk` for the rank using the existing `empty_table_chunk(ir, context, stream)` helper and feed that into the AllGather. The non-empty ranks still dominate the merged top-k result, and no rank deadlocks waiting for a message from the empty rank.
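A minimal, library-free sketch of the guard described above. `concat` and `make_empty_chunk` are illustrative stand-ins for `evaluate_batch`/the pylibcudf concatenation path and `empty_table_chunk(ir, context, stream)`; the point is only the control flow, not the real API:

```python
def gather_local_chunk(chunks, concat, make_empty_chunk):
    """Produce this rank's contribution to the AllGather.

    `concat` (stand-in for the evaluate_batch/concatenate path) raises
    on an empty input list, so the empty case must be guarded explicitly.
    """
    if chunks:
        return concat(chunks)
    # Rank received no input partitions: contribute a schema-only empty
    # chunk so the collective still gets one insert from every rank.
    return make_empty_chunk()


# Illustrative use with plain lists standing in for TableChunks:
flatten = lambda cs: [x for c in cs for x in c]
merged = gather_local_chunk([[1], [2]], flatten, lambda: [])   # non-empty rank
empty = gather_local_chunk([], flatten, lambda: [])            # empty rank
```

The empty rank contributes a valid (zero-row) value rather than skipping the collective, which is what keeps the other ranks from deadlocking.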
> executor="streaming",
> executor_options={"max_rows_per_partition": 1_000},
>
> def test_join_in_memory_lazy_stable_id_pickle(streaming_engine_factory):
>     engine = streaming_engine_factory(
We can no longer use the default streaming engine in the tests, so I refactored some of them, including this one, to use either the streaming engines or the in-memory engine instead. Otherwise, the tests remain the same.
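As a hedged illustration of the factory pattern these refactored tests rely on (a plain-Python stand-in, not the PR's actual `streaming_engine_factory` fixture): each test asks the factory for its own engine configuration instead of sharing one default engine, so per-test options like `max_rows_per_partition` never collide with a process-wide singleton.

```python
def make_streaming_engine_factory():
    """Stand-in for a per-test engine factory.

    Each call to the returned factory builds a fresh engine
    configuration rather than reusing one shared default engine.
    """
    def factory(**executor_options):
        # In the real tests this would construct a streaming GPU engine;
        # a plain dict stands in for the engine object here.
        return {"executor": "streaming", "executor_options": executor_options}
    return factory


factory = make_streaming_engine_factory()
engine = factory(max_rows_per_partition=1_000)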
`lf.collect(engine="gpu")` and `pl.GPUEngine(executor="streaming")` using the default cluster now route through a new process-wide `DefaultSingletonEngine` instead of constructing a fresh rapidsmpf `Context`, RMM adaptor, and Python executor for every query. Bootstrap now happens once per process rather than once per query.

`DefaultSingletonEngine` is a process-wide single-GPU singleton specialization of `SPMDEngine`: at most one live instance exists per process, it always uses a single-rank communicator plus default environment-derived settings, and repeated calls reuse the same engine instance until explicit shutdown.

The default cluster enum value is renamed from `Cluster.SINGLE` to `Cluster.DEFAULT_SINGLETON` so the dispatch token better reflects the actual behavior.

This PR also removes the dead inline-context fallback in `evaluate_pipeline`, which was the original `"single"` execution path.
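A minimal sketch of the "at most one live instance per process, reused until explicit shutdown" behavior described above. The class name mirrors the PR, but the body is an illustrative stand-in: the real engine would bootstrap a single-rank communicator, RMM adaptor, and executor where this sketch just sets a flag.

```python
import threading


class DefaultSingletonEngine:
    """Illustrative process-wide singleton (stand-in for the PR's engine)."""

    _instance = None
    _lock = threading.Lock()

    def __init__(self):
        # Real engine: one-time bootstrap of communicator/memory/executor.
        self.alive = True

    @classmethod
    def get(cls):
        # Repeated calls reuse the same live instance until shutdown.
        with cls._lock:
            if cls._instance is None or not cls._instance.alive:
                cls._instance = cls()
            return cls._instance

    @classmethod
    def shutdown(cls):
        # Explicit shutdown; the next get() bootstraps a new instance.
        with cls._lock:
            if cls._instance is not None:
                cls._instance.alive = False
                cls._instance = None
```

This is why "bootstrap happens once per process rather than once per query": every query's dispatch calls `get()` and hits the already-constructed instance.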