Forward-merge release/26.06 into main#22555
Merged
12 commits merged intoMay 20, 2026
Merged
Conversation
Updated cudf-polars to support Polars 1.39. Summary: * **Dependency pin** updated across conda envs, the recipe, `dependencies.yaml`, and `pyproject.toml`. New `POLARS_VERSION_LT_139` flag gates version specific code. * **Rolling expressions:** polars 1.39 makes `pl.col(...).rolling(...)` accessible again via `AExpr::Rolling`. A new `_translate_rolling` handles it, registered only when the node type exists. Rolling tests use a single `skip_rolling_expr_136_to_138` marker. * **HConcat strict mode:** added a `strict` slot on the `HConcat` IR that raises `pl.exceptions.ShapeError` on height mismatch, threaded through every construction site. * **IsBetween Decimal vs Float:** new `_align_decimal_float_for_comparison` casts Decimal to Float64 on 1.39+, since polars no longer inserts that cast and libcudf would otherwise give wrong results. * **set_sorted:** options shape changed from `(asc_str,)` to `(descending_bool, ...)`; translator branches on type. * **Dynamic predicates:** new `_is_dynamic_pred` helper makes Scan and Filter skip predicates that raise `"dynamic_pred"`. * **IR version ceiling** raised from `(12, 1)` to `(12, 2)`. Sink format check now includes `"Json"`, and a precedence bug in `_sink_to_file` is fixed. Authors: - Matthew Murray (https://github.com/Matt711) - Matthew Roeschke (https://github.com/mroeschke) Approvers: - James Lamb (https://github.com/jameslamb) - Matthew Roeschke (https://github.com/mroeschke) URL: #22048
Contributor
Author
|
FAILURE - Unable to forward-merge due to an error, manual merge is necessary. Do not use the IMPORTANT: When merging this PR, do not use the auto-merger (i.e. the |
#22558) PR #22048 (merged today) added the new `test_hconcat_strict_different_heights` test, which imports `assert_collect_raises`. However, PR #22535 (also merged today) removed that helper. The two PRs landed on `release/26.06` without the conflict being noticed. On `main`, `test_hconcat.py` does not contain the strict-mode test, so the issue is limited to `release/26.06`. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Matthew Murray (https://github.com/Matt711) URL: #22558
This PR fixes the use-after-destroy and stream ordering (with PTDS input) issue (with host buffer source) in the `fetch_byte_ranges_to_device_async` IO utility used by parquet and hybrid scan. See follow up PR #22550 that reduces the locked region size by moving all `host_read_async` outside it. Authors: - Muhammad Haseeb (https://github.com/mhaseeb123) Approvers: - Bradley Dice (https://github.com/bdice) - Amin Aramoon (https://github.com/aminaramoon) - Vukasin Milovanovic (https://github.com/vuule) URL: #22529
This PR is pure moving/renaming.
### New layout
```
cudf_polars/
callback.py
containers/
dsl/
engine/ ← user-facing GPU engine classes (Streaming/Ray/Dask/SPMD/DefaultSingleton)
streaming/ ← multi-partition execution layer (formerly "experimental")
actor_graph/ ← RapidsMPF-backed runtime
collectives/ ← RapidsMPF collective communication primitives
benchmarks/
utils.py ← consolidated benchmark utilities (formerly split between utils.py shim and utils_new_frontends.py)
pdsds.py
...
base.py
dispatch.py
parallel.py
groupby.py
io.py
join.py
...
testing/
typing/
utils/
```
Engine entry points move from deeply nested experimental paths to top-level imports:
```
cudf_polars.experimental.rapidsmpf.frontend.options → cudf_polars.engine.options
cudf_polars.experimental.rapidsmpf.frontend.spmd → cudf_polars.engine.spmd
cudf_polars.experimental.rapidsmpf.frontend.ray → cudf_polars.engine.ray
cudf_polars.experimental.rapidsmpf.frontend.dask → cudf_polars.engine.dask
cudf_polars.experimental.rapidsmpf.frontend.core → cudf_polars.engine.core
```
Benchmarks is now under `streaming`:
```
python -m cudf_polars.streaming.benchmarks.pdsh
```
Authors:
- Mads R. B. Kristensen (https://github.com/madsbk)
Approvers:
- Lawrence Mitchell (https://github.com/wence-)
- Peter Andreas Entschev (https://github.com/pentschev)
- Bradley Dice (https://github.com/bdice)
- Matthew Murray (https://github.com/Matt711)
URL: #22491
This backports a pair of commits for the cudf-polars benchmarking CLI. We're currently running benchmarks against both release/26.06 and main. Authors: - Tom Augspurger (https://github.com/TomAugspurger) - Lawrence Mitchell (https://github.com/wence-) Approvers: - Bradley Dice (https://github.com/bdice) URL: #22572
Fixes a memcheck error introduced by #22452 where an atomic operation on a bool variable is reported by compute-sanitizer as an out-of-bounds access. Changing the variable to an `int32_t` resolves the error. Closes #22570 Authors: - David Wendt (https://github.com/davidwendt) Approvers: - Bradley Dice (https://github.com/bdice) - Yunsong Wang (https://github.com/PointKernel) URL: #22571
- Follow up to #22491 - Moves the `collectives` module under `actor_graph` to break circular dependencies. The "collectives" are **mostly** used to build the actor graph anyway. **Note**: Before merging this, I'd like to get confirmation that others see circular-import errors locally. E.g. ``` pytest -v python/cudf_polars/tests/streaming/test_groupby.py ... E ImportError: cannot import name 'ShuffleManager' from partially initialized module 'cudf_polars.streaming.collectives.shuffle' (most likely due to a circular import) (/raid/rzamora/rapids-26.06/cudf/python/cudf_polars/cudf_polars/streaming/collectives/shuffle.py) ``` Authors: - Richard (Rick) Zamora (https://github.com/rjzamora) Approvers: - Matthew Murray (https://github.com/Matt711) - Mads R. B. Kristensen (https://github.com/madsbk) URL: #22578
Hand-tune `polars_impl` for 19 TPC-DS benchmark queries in `python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/`. Each rewrite preserves query semantics and only changes how the polars LazyFrame is constructed; `duckdb_impl` is unchanged. The optimizations apply a small set of recurring patterns that the polars optimizer does not (yet) perform automatically: - **Predicate pushdown on dimension tables** — pre-filter `date_dim`, `item`, `store`, etc. by literal predicates (year, quarter, month window, category/class/brand) before any join, so the join builds smaller hash tables. - **Semi-join fact-table pre-filtering** — use selective dimension keys (and in some cases `store_returns` (customer, item) pairs) as semi-join probes against the fact tables, shrinking them before the expensive joins. - **Projection pushdown** — `select(...)` only the columns each table contributes before joining, instead of relying on the planner to prune them later. - **Condition-join → equi-join** — replace cross-join + filter and CONDITIONALJOIN-style patterns with constant-key equi-joins where the predicate is equivalent. - **Single-pass bucket aggregation** — collapse multiple independent global-sum group-bys over the same fact table into one pass that emits the values in a single aggregation, replacing N scans with 1. - **Join reordering** — defer non-selective joins (e.g. customer) until after the selective filter chain so the row count entering the deferred join is much smaller. ## Test plan - [ ] Run TPC-DS validation against DuckDB on the 19 modified queries - [ ] Run benchmark sweep and confirm no regressions vs. main on unmodified queries - [ ] Confirm result equality (sorted output) matches DuckDB reference Authors: - Matthew Murray (https://github.com/Matt711) Approvers: - Tom Augspurger (https://github.com/TomAugspurger) URL: #22395
Provide a patch for apache/arrow#48801 Fixes #22540 Authors: - Kyle Edwards (https://github.com/KyleFromNVIDIA) Approvers: - Bradley Dice (https://github.com/bdice) - Paul Mattione (https://github.com/pmattione-nvidia) - MithunR (https://github.com/mythrocks) URL: #22582
Member
Restructure the docs around the new streaming multi-GPU engines and unified configuration model, replacing the legacy execution narrative, add a set of user-facing guides covering usage, engines, configuration, profiling, and legacy workflows. Authors: - Mads R. B. Kristensen (https://github.com/madsbk) Approvers: - Matthew Roeschke (https://github.com/mroeschke) - James Lamb (https://github.com/jameslamb) - Lawrence Mitchell (https://github.com/wence-) URL: #22252
## Summary
- Remove the `CUDF_EXPECTS` check that was comparing the actual token count against an analytical upper bound
- Expand test coverage for malformed JSON recovery with `{\n` and `{"\n` patterns
### Details
The JSON tokenizer was failing on malformed input like repeated `{\n` lines in recovery mode with:
```
CUDF failure at: nested_json_gpu.cu:1683: Generated token count exceeds the expected token count
```
The analytical bound assumed a worst-case ratio of 6 tokens per 5 characters (based on valid JSON like {"":_}), but recovery mode can produce higher ratios—for example, {"\n produces 5 tokens for 3 characters.
Since the buffer allocation uses the exact count from a sizing FST pass (not the analytical bound), this check served only as a debug assertion. Removing it fixes the failure without any functional impact.
Authors:
- Shruti Shivakumar (https://github.com/shrshi)
Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- David Wendt (https://github.com/davidwendt)
- Bradley Dice (https://github.com/bdice)
URL: #22589
Closes #18182 This PR adds `streaming_groupby`, a stateful groupby that accumulates partial aggregates across batches using a single persistent hash table. Users specify `max_groups`, the maximum number of distinct groups expected, and all main data structures are allocated once and reused without resizing. The hash table stores a `size_type` group ID per slot. The ID is global across the stream: each distinct group, in the order it is first seen, is assigned a stable ID in `[0, distinct_count)` that is shared by the result table (used as the row index) and by the companion array (used as the lookup index). The actual keys live in a list of per-batch compacted key tables. The companion array, of length `max_groups`, holds a `{batch_id, row_id}` pair for each group ID, pointing back to where that group's representative key is stored. Equality probes resolve a slot ID through the companion array into the correct preprocessed batch table and compare via an n-table row comparator. Each batch is processed in two steps. The first step calls `insert_and_find` against the hash set. Winners write a transient value `max_groups + batch_idx` into their slot, which is distinguishable from any real group ID since real IDs live in `[0, max_groups)`. Existing slots already hold a final ID and are returned as-is. A side flag array marks which rows won their slot, and a slot-offset array records where each row landed for cheap revisits. The newly inserted rows are stream-compacted and gathered into a fresh compacted key table that is appended to the per-batch list. The second step walks only the new keys, atomically rewrites their transient slot values to stable global IDs starting at the current distinct count, and writes the matching `{batch_id, row_id}` entries into the companion array. A final reread converts any remaining transient slot reads into global IDs, so every row in the batch ends up mapped to its stable group ID. Aggregations are updated atomically into a single result table indexed directly by these IDs. Merging reprobes the other object's compacted keys against this hash table to recover their target group IDs in this object's ID space, then atomically combines the matching result rows. Finalization concatenates the per-batch compacted key tables to produce the distinct-keys output, slices the result table to `[0, distinct_count)` — no gather is needed, since the global IDs are already the row indices — and runs the compound-aggregation finalizers to produce user-facing columns. The internal state is left intact, so further `aggregate` calls remain valid. Certain trade-offs are intentional. For example, the streaming groupby is designed to deep-copy all distinct keys locally. This enables batch-based processing: once a batch has completed the add step, its input data can be released, which helps reduce memory usage. Additionally, the current code path does not support shared memory. As a result, inputs with very low cardinality can suffer from poor runtime performance due to high atomic contention, since many updates target the same key or memory location. This is an accepted trade-off. In practice, downstream users can run a standard groupby to estimate cardinality; if it is low, they can concatenate all input data and use a regular groupby instead, which typically yields better performance. Authors: - Yunsong Wang (https://github.com/PointKernel) Approvers: - Devavret Makkar (https://github.com/devavret) - Lawrence Mitchell (https://github.com/wence-) - Bradley Dice (https://github.com/bdice) URL: #21924
3 tasks
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Forward-merge triggered by push to release/26.06 that creates a PR to keep main up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge. See forward-merger docs for more info.