Summary
When the streaming engine executes pl.concat([A, B]) (or any Union IR node) with N > 1 ranks, the resulting row order does not match Polars semantics. Polars guarantees that pl.concat([A, B]) yields all rows of A followed by all rows of B. Under multi-rank execution, the client instead observes rank-local concatenations interleaved at rank boundaries.
Reproducer
import polars as pl

from cudf_polars.experimental.rapidsmpf.frontend.ray import RayEngine
from cudf_polars.testing.asserts import assert_gpu_result_equal

with RayEngine(
    num_ranks=2,  # Works when `num_ranks=1`
    engine_options={"allow_gpu_sharing": True},
    executor_options={"max_rows_per_partition": 1_000},
) as streaming_engine:
    df = pl.LazyFrame({
        "x": range(30_000),
        "y": [1, 2, 3] * 10_000,
        "z": [1.0, 2.0, 3.0, 4.0, 5.0] * 6_000,
    })
    df2 = pl.concat([df, df])
    assert_gpu_result_equal(df2, engine=streaming_engine, check_row_order=True)
Currently xfailed in:
python/cudf_polars/tests/experimental/test_dataframescan.py::test_dataframescan_concat
Observed vs. expected order (2 ranks, children A and B)
- Expected (Polars CPU):
A_rank0, A_rank1, B_rank0, B_rank1
- Actual (streaming engine):
A_rank0, B_rank0, A_rank1, B_rank1
Each rank correctly processes child A before B, but the client-side concatenation across ranks does not enforce a barrier between children. As a result, chunks from child A on rank 1 may arrive after chunks from child B on rank 0.
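The two orderings above can be modelled with plain Python lists. This is a toy sketch rather than engine code: the chunk labels and the helper names `expected_order` and `rank_major_order` are invented for illustration, and it assumes the client joins per-rank results in rank order.

```python
# Toy model of the two orderings; string labels stand in for row chunks.
CHILDREN = ["A", "B"]
NUM_RANKS = 2


def expected_order(children, num_ranks):
    """Child-major: every rank's share of a child precedes the next child."""
    return [f"{c}_rank{r}" for c in children for r in range(num_ranks)]


def rank_major_order(children, num_ranks):
    """Rank-major: each rank concatenates its own children, then ranks are joined."""
    return [f"{c}_rank{r}" for r in range(num_ranks) for c in children]


print(expected_order(CHILDREN, NUM_RANKS))
print(rank_major_order(CHILDREN, NUM_RANKS))
```

With `num_ranks=1` both helpers return the same list, which is consistent with the reproducer passing on a single rank.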
Root cause (sketch)
cudf_polars/experimental/rapidsmpf/union.py::union_node iterates over chs_in in order and forwards child[0] then child[1] per rank, but emits chunks asynchronously. The downstream collector concatenates chunks in arrival order across ranks. There is no cross-rank synchronization between children, and chunks carry no child_id metadata to allow reordering on the client.
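The mechanism can be imitated with `asyncio` from the standard library. The sketch below does not use any rapidsmpf or cudf_polars API: `rank_producer`, the shared queue, and the events that delay rank 1 are all hypothetical stand-ins, chosen so that the arrival order is deterministic for the example.

```python
import asyncio


async def rank_producer(rank, queue, start, done):
    """Forward child A then child B for one rank, as each rank does locally."""
    await start.wait()  # hypothetical stand-in for rank latency
    for child in ("A", "B"):
        await queue.put(f"{child}_rank{rank}")
    done.set()


async def main():
    queue = asyncio.Queue()
    go0, go1, end = asyncio.Event(), asyncio.Event(), asyncio.Event()
    go0.set()  # rank 0 starts immediately
    # Rank 1 only starts once rank 0 has emitted both of its children.
    tasks = [
        asyncio.create_task(rank_producer(0, queue, go0, go1)),
        asyncio.create_task(rank_producer(1, queue, go1, end)),
    ]
    await asyncio.gather(*tasks)
    # The collector drains in arrival order, with no notion of child identity.
    return [queue.get_nowait() for _ in range(queue.qsize())]


arrival_order = asyncio.run(main())
print(arrival_order)
```

Each producer keeps its own A-before-B order, yet the drained queue is rank-major because nothing ties a chunk to the child it came from.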
Possible fixes
1. Add a per-child cross-rank barrier in union_node: emit all chunks for child A, synchronize across ranks via the communicator, then proceed to child B. Simple, but serializes cross-child streaming.
2. Tag chunks with child_id metadata and have the client-side collector group and concatenate per child before combining children. Preserves streaming overlap, but requires extending metadata on TableChunk/messages.
Option (2) is preferred unless profiling shows the barrier cost is negligible.
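A minimal sketch of the child_id tagging approach, in plain Python. `TaggedChunk` and `collect_in_child_order` are hypothetical names, not existing rapidsmpf or cudf_polars types, and the `rank` and `seq` fields are additional assumptions about what metadata a collector would need to restore a full order.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TaggedChunk:
    """Hypothetical stand-in for a table chunk plus ordering metadata."""

    child_id: int  # position of the child in the Union's input list
    rank: int      # rank that produced the chunk
    seq: int       # sequence number within one (child, rank) stream
    rows: tuple    # payload; a real chunk would hold a table


def collect_in_child_order(chunks):
    """Reorder chunks received in arbitrary order into child-major order."""
    ordered = sorted(chunks, key=lambda c: (c.child_id, c.rank, c.seq))
    return [row for chunk in ordered for row in chunk.rows]


# A possible arrival order: rank 0 finishes both children before rank 1 starts.
arrived = [
    TaggedChunk(child_id=0, rank=0, seq=0, rows=("a0", "a1")),
    TaggedChunk(child_id=1, rank=0, seq=0, rows=("b0", "b1")),
    TaggedChunk(child_id=0, rank=1, seq=0, rows=("a2", "a3")),
    TaggedChunk(child_id=1, rank=1, seq=0, rows=("b2", "b3")),
]
print(collect_in_child_order(arrived))
```

Sorting requires every chunk to be buffered first. A streaming collector could instead hold back chunks of later children until the earlier child is known to be complete, at the cost of extra bookkeeping.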
Scope
Affects any Union/pl.concat query under the rapidsmpf streaming engine with N > 1 ranks. Not specific to DataFrameScan.