Skip to content

Forward-merge release/26.06 into main#22555

Merged
12 commits merged into
mainfrom
release/26.06
May 20, 2026
Merged

Forward-merge release/26.06 into main#22555
12 commits merged into
mainfrom
release/26.06

Conversation

@rapids-bot
Copy link
Copy Markdown
Contributor

@rapids-bot rapids-bot Bot commented May 18, 2026

Forward-merge triggered by push to release/26.06 that creates a PR to keep main up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge. See forward-merger docs for more info.

Updated cudf-polars to support Polars 1.39.

Summary:

* **Dependency pin** updated across conda envs, the recipe, `dependencies.yaml`, and `pyproject.toml`. New `POLARS_VERSION_LT_139` flag gates version specific code.
* **Rolling expressions:** polars 1.39 makes `pl.col(...).rolling(...)` accessible again via `AExpr::Rolling`. A new `_translate_rolling` handles it, registered only when the node type exists. Rolling tests use a single `skip_rolling_expr_136_to_138` marker.
* **HConcat strict mode:** added a `strict` slot on the `HConcat` IR that raises `pl.exceptions.ShapeError` on height mismatch, threaded through every construction site.
* **IsBetween Decimal vs Float:** new `_align_decimal_float_for_comparison` casts Decimal to Float64 on 1.39+, since polars no longer inserts that cast and libcudf would otherwise give wrong results.
* **set_sorted:** options shape changed from `(asc_str,)` to `(descending_bool, ...)`; translator branches on type.
* **Dynamic predicates:** new `_is_dynamic_pred` helper makes Scan and Filter skip predicates that raise `"dynamic_pred"`.
* **IR version ceiling** raised from `(12, 1)` to `(12, 2)`. Sink format check now includes `"Json"`, and a precedence bug in `_sink_to_file` is fixed.

Authors:
  - Matthew Murray (https://github.com/Matt711)
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - James Lamb (https://github.com/jameslamb)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: #22048
@rapids-bot rapids-bot Bot requested review from a team as code owners May 18, 2026 19:00
@rapids-bot rapids-bot Bot requested a review from jameslamb May 18, 2026 19:00
@rapids-bot
Copy link
Copy Markdown
Contributor Author

rapids-bot Bot commented May 18, 2026

FAILURE - Unable to forward-merge due to an error, manual merge is necessary. Do not use the Resolve conflicts option in this PR, follow these instructions https://docs.rapids.ai/maintainers/forward-merger/

IMPORTANT: When merging this PR, do not use the auto-merger (i.e. the /merge comment). Instead, an admin must manually merge by changing the merging strategy to Create a Merge Commit. Otherwise, history will be lost and the branches become incompatible.

@rapids-bot rapids-bot Bot requested a review from rjzamora May 18, 2026 19:00
@github-actions github-actions Bot added Python Affects Python cuDF API. cudf-polars Issues specific to cudf-polars labels May 18, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 18, 2026
madsbk and others added 2 commits May 19, 2026 00:19
#22558)

PR #22048 (merged today) added the new `test_hconcat_strict_different_heights` test, which imports `assert_collect_raises`. However, PR #22535 (also merged today) removed that helper.

The two PRs landed on `release/26.06` without the conflict being noticed.

On `main`, `test_hconcat.py` does not contain the strict-mode test, so the issue is limited to `release/26.06`.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Matthew Murray (https://github.com/Matt711)

URL: #22558
This PR fixes the use-after-destroy and stream ordering (with PTDS input) issue (with host buffer source) in the `fetch_byte_ranges_to_device_async` IO utility used by parquet and hybrid scan.

See follow up PR #22550 that reduces the locked region size by moving all `host_read_async` outside it.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Amin Aramoon (https://github.com/aminaramoon)
  - Vukasin Milovanovic (https://github.com/vuule)

URL: #22529
@rapids-bot rapids-bot Bot requested a review from a team as a code owner May 19, 2026 00:21
@rapids-bot rapids-bot Bot requested review from karthikeyann and shrshi May 19, 2026 00:21
@github-actions github-actions Bot added the libcudf Affects libcudf (C++/CUDA) code. label May 19, 2026
madsbk and others added 3 commits May 19, 2026 13:00
This PR is pure moving/renaming.

### New layout
```
cudf_polars/
  callback.py
  containers/
  dsl/
  engine/         ← user-facing GPU engine classes (Streaming/Ray/Dask/SPMD/DefaultSingleton)
  streaming/      ← multi-partition execution layer (formerly "experimental")
    actor_graph/  ← RapidsMPF-backed runtime
    collectives/  ← RapidsMPF collective communication primitives
    benchmarks/
      utils.py    ← consolidated benchmark utilities (formerly split between utils.py shim and utils_new_frontends.py)
      pdsds.py
      ...
    base.py
    dispatch.py
    parallel.py
    groupby.py
    io.py
    join.py
    ...
  testing/
  typing/
  utils/
```

Engine entry points move from deeply nested experimental paths to top-level imports:
```
cudf_polars.experimental.rapidsmpf.frontend.options → cudf_polars.engine.options
cudf_polars.experimental.rapidsmpf.frontend.spmd    → cudf_polars.engine.spmd
cudf_polars.experimental.rapidsmpf.frontend.ray     → cudf_polars.engine.ray
cudf_polars.experimental.rapidsmpf.frontend.dask    → cudf_polars.engine.dask
cudf_polars.experimental.rapidsmpf.frontend.core    → cudf_polars.engine.core
```

Benchmarks is now under `streaming`:
```
python -m cudf_polars.streaming.benchmarks.pdsh
```

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Lawrence Mitchell (https://github.com/wence-)
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Bradley Dice (https://github.com/bdice)
  - Matthew Murray (https://github.com/Matt711)

URL: #22491
This backports a pair of commits for the cudf-polars benchmarking CLI. We're currently running benchmarks against both release/26.06 and main.

Authors:
  - Tom Augspurger (https://github.com/TomAugspurger)
  - Lawrence Mitchell (https://github.com/wence-)

Approvers:
  - Bradley Dice (https://github.com/bdice)

URL: #22572
Fixes a memcheck error introduced by #22452 where an atomic operation on a bool variable is reported by compute-sanitizer as an out-of-bounds access. Changing the variable to an `int32_t` resolves the error.

Closes #22570

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Yunsong Wang (https://github.com/PointKernel)

URL: #22571
rjzamora and others added 3 commits May 19, 2026 19:10
- Follow up to #22491
- Moves the `collectives` module under `actor_graph` to break circular dependencies. The "collectives" are **mostly** used to build the actor graph anyway.

**Note**: Before merging this, I'd like to get confirmation that others see circular-import errors locally. E.g.

```
pytest -v python/cudf_polars/tests/streaming/test_groupby.py

...

E   ImportError: cannot import name 'ShuffleManager' from partially initialized module 'cudf_polars.streaming.collectives.shuffle' (most likely due to a circular import) (/raid/rzamora/rapids-26.06/cudf/python/cudf_polars/cudf_polars/streaming/collectives/shuffle.py)
```

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: #22578
Hand-tune `polars_impl` for 19 TPC-DS benchmark queries in `python/cudf_polars/cudf_polars/experimental/benchmarks/pdsds_queries/`. Each rewrite preserves query semantics and only changes how the polars LazyFrame is constructed; `duckdb_impl` is unchanged.

The optimizations apply a small set of recurring patterns that the polars optimizer does not (yet) perform automatically:

- **Predicate pushdown on dimension tables** — pre-filter `date_dim`, `item`, `store`, etc. by literal predicates (year, quarter, month window, category/class/brand) before any join, so the join builds smaller hash tables.
- **Semi-join fact-table pre-filtering** — use selective dimension keys (and in some cases `store_returns` (customer, item) pairs) as semi-join probes against the fact tables, shrinking them before the expensive joins.
- **Projection pushdown** — `select(...)` only the columns each table contributes before joining, instead of relying on the planner to prune them later.
- **Condition-join → equi-join** — replace cross-join + filter and CONDITIONALJOIN-style patterns with constant-key equi-joins where the predicate is equivalent.
- **Single-pass bucket aggregation** — collapse multiple independent global-sum group-bys over the same fact table into one pass that emits the values in a single aggregation, replacing N scans with 1.
- **Join reordering** — defer non-selective joins (e.g. customer) until after the selective filter chain so the row count entering the deferred join is much smaller.

## Test plan

- [ ] Run TPC-DS validation against DuckDB on the 19 modified queries
- [ ] Run benchmark sweep and confirm no regressions vs. main on unmodified queries
- [ ] Confirm result equality (sorted output) matches DuckDB reference

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Tom Augspurger (https://github.com/TomAugspurger)

URL: #22395
@rapids-bot rapids-bot Bot requested review from a team as code owners May 19, 2026 21:03
@github-actions github-actions Bot added CMake CMake build issue Java Affects Java cuDF API. labels May 19, 2026
@KyleFromNVIDIA
Copy link
Copy Markdown
Member

KyleFromNVIDIA commented May 19, 2026

Superseded by #22590 #22585 and #22603

madsbk and others added 2 commits May 20, 2026 11:40
Restructure the docs around the new streaming multi-GPU engines and unified configuration model, replacing the legacy execution narrative, add a set of user-facing guides covering usage, engines, configuration, profiling, and legacy workflows.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - James Lamb (https://github.com/jameslamb)
  - Lawrence Mitchell (https://github.com/wence-)

URL: #22252
## Summary
- Remove the `CUDF_EXPECTS` check that was comparing the actual token count against an analytical upper bound
- Expand test coverage for malformed JSON recovery with `{\n` and `{"\n` patterns

### Details
The JSON tokenizer was failing on malformed input like repeated `{\n` lines in recovery mode with:
```
CUDF failure at: nested_json_gpu.cu:1683: Generated token count exceeds the expected token count
```
The analytical bound assumed a worst-case ratio of 6 tokens per 5 characters (based on valid JSON like {"":_}), but recovery mode can produce higher ratios—for example, {"\n produces 5 tokens for 3 characters.
Since the buffer allocation uses the exact count from a sizing FST pass (not the analytical bound), this check served only as a debug assertion. Removing it fixes the failure without any functional impact.

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - David Wendt (https://github.com/davidwendt)
  - Bradley Dice (https://github.com/bdice)

URL: #22589
Closes #18182

This PR adds `streaming_groupby`, a stateful groupby that accumulates partial aggregates across batches using a single persistent hash table.

Users specify `max_groups`, the maximum number of distinct groups expected, and all main data structures are allocated once and reused without resizing. The hash table stores a `size_type` group ID per slot.
The ID is global across the stream: each distinct group, in the order it is first seen, is assigned a stable ID in `[0, distinct_count)` that is shared by the result table (used as the row index) and by the companion array (used as the lookup index). The actual keys live in a list of per-batch compacted key tables. The companion array, of length `max_groups`, holds a `{batch_id, row_id}` pair for each group ID, pointing back to where that group's representative key is stored. Equality probes resolve a slot ID through the companion array into the correct preprocessed batch table and compare via an n-table row comparator.

Each batch is processed in two steps. The first step calls `insert_and_find` against the hash set. Winners write a transient value `max_groups + batch_idx` into their slot, which is distinguishable from any real group ID since real IDs live in `[0, max_groups)`. Existing slots already hold a final ID and are returned as-is. A side flag array marks which rows won their slot, and a slot-offset array records where each row landed for cheap revisits. The newly inserted rows are stream-compacted and gathered into a fresh compacted key table that is appended to the per-batch list. The second step walks only the new keys, atomically rewrites their transient slot values to stable global IDs starting at the current distinct count, and writes the matching `{batch_id, row_id}` entries into the companion array. A final reread converts any remaining transient slot reads into global IDs, so every row in the batch ends up mapped to its stable group ID. Aggregations are updated atomically into a single result table indexed directly by these IDs.

Merging reprobes the other object's compacted keys against this hash table to recover their target group IDs in this object's ID space, then atomically combines the matching result rows. Finalization concatenates the per-batch compacted key tables to produce the distinct-keys output, slices the result table to `[0, distinct_count)` — no gather is needed, since the global IDs are already the row indices — and runs the compound-aggregation finalizers to produce user-facing columns. The internal state is left intact, so further `aggregate` calls remain valid.

Certain trade-offs are intentional. For example, the streaming groupby is designed to deep-copy all distinct keys locally. This enables batch-based processing: once a batch has completed the add step, its input data can be released, which helps reduce memory usage.

Additionally, the current code path does not support shared memory. As a result, inputs with very low cardinality can suffer from poor runtime performance due to high atomic contention, since many updates target the same key or memory location. This is an accepted trade-off. In practice, downstream users can run a standard groupby to estimate cardinality; if it is low, they can concatenate all input data and use a regular groupby instead, which typically yields better performance.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - Lawrence Mitchell (https://github.com/wence-)
  - Bradley Dice (https://github.com/bdice)

URL: #21924
@rapids-bot rapids-bot Bot requested a review from a team as a code owner May 20, 2026 17:01
@vyasr vyasr closed this pull request by merging all changes into main in 235f69a May 20, 2026
@github-project-automation github-project-automation Bot moved this from In Progress to Done in cuDF Python May 20, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

CMake CMake build issue cudf-polars Issues specific to cudf-polars Java Affects Java cuDF API. libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.