Implement groupby all/any via bool-coercion + min/max #22371

Merged
galipremsagar merged 41 commits into rapidsai:pandas3 from galipremsagar:groupby_bool_reduce
May 13, 2026

Contributor

@galipremsagar galipremsagar commented May 4, 2026

Summary

Split out from #22289. GroupBy.all and GroupBy.any previously raised NotImplementedError. This PR implements them by reducing to min/max on a bool-coerced copy of the value columns.

Implementation (python/cudf/cudf/core/groupby/groupby.py)

A new _bool_reduce helper:

  • Coerces strings as count_characters > 0 so empty strings become False and nulls remain null (preserved through the aggregation).
  • Coerces numerics as != 0 with the same null preservation.
  • For skipna=False, fills nulls with True before aggregation, so a null cannot flip all to False and trivially makes any True.
  • Empty groups (skipna=True with all-NA values) yield NA from min/max; pandas treats those as vacuously True for all and False for any, so the result is filled accordingly.
  • Applies min_count by counting per-group non-nulls and masking groups whose count is below the threshold.
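
The coercion-and-reduce steps listed above can be sketched in plain pandas. This is a minimal illustration under stated assumptions, not the cuDF `_bool_reduce` code: `bool_reduce_sketch` and `_coerce` are hypothetical names, truth values are held as floats (1.0/0.0, with NaN for nulls) so that `min`/`max` skip nulls, and `min_count` handling is left out.

```python
import numpy as np
import pandas as pd


def _coerce(col):
    """Map a column to 1.0 (truthy), 0.0 (falsy), or NaN (null)."""
    null = col.isna().to_numpy()
    if pd.api.types.is_numeric_dtype(col.dtype):  # also covers bool
        truth = (col.fillna(0) != 0).to_numpy(dtype=bool)
    else:  # assumed to be a string column
        truth = (col.fillna("").str.len() > 0).to_numpy(dtype=bool)
    out = truth.astype("float64")
    out[null] = np.nan  # nulls stay null through the aggregation
    return out


def bool_reduce_sketch(df, key, op, skipna=True):
    """Emulate groupby all/any with min/max on bool-coerced values."""
    value_cols = [c for c in df.columns if c != key]
    values = pd.DataFrame(
        {c: _coerce(df[c]) for c in value_cols}, index=df.index
    )
    if not skipna:
        values = values.fillna(1.0)  # nulls count as True
    grouped = values.groupby(df[key])
    reduced = grouped.min() if op == "all" else grouped.max()
    # All-null groups come back as NaN: True for all, False for any.
    return reduced.fillna(1.0 if op == "all" else 0.0).astype(bool)


df = pd.DataFrame(
    {
        "k": [1, 1, 2, 2, 3],
        "x": [1.0, 0.0, np.nan, 2.0, np.nan],
        "s": ["a", "", None, "c", None],
    }
)
all_result = bool_reduce_sketch(df, "k", "all")
any_result = bool_reduce_sketch(df, "k", "any")
```

Group 3 holds only nulls, so it shows the vacuous case: `all` fills to True and `any` fills to False.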

The new GroupBy is constructed with by=self.grouping (passing the existing _Grouping object) so key columns match the bool-coerced value columns exactly, avoiding label-based lookup when the original key column was excluded.
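
Why the keys are passed as an object rather than by label can be illustrated with pandas. A minimal sketch, assuming a hypothetical frame `df` with key column `a`: once the value columns are bool-coerced into a new frame, the key label no longer exists there, so grouping by name fails while grouping by the key values still works. cuDF passes its internal `_Grouping` object; the pandas `Series` used here is only a stand-in.

```python
import pandas as pd

df = pd.DataFrame({"a": ["x", "x", "y"], "v": [1, 0, 2]})

# Bool-coerced copy of the value columns only; "a" is not part of it.
values = (df[["v"]] != 0).astype("int8")

try:
    values.groupby("a")
    label_lookup_failed = False
except KeyError:
    label_lookup_failed = True

# Grouping by the key values avoids the label lookup entirely.
result = values.groupby(df["a"]).min().astype(bool)
```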

Tests

python/cudf/cudf/tests/groupby/test_reductions.py:

  • test_groupby_all_any over bool/int/float data.
  • test_groupby_all_any_string for string columns.
  • test_groupby_all_any_empty for empty-group behavior.

Conftest

Removes 32 test_string_dtype_all_na[*-all-*] and [*-any-*] entries.

Relationship to #22289

One of the four split PRs requested in the review on #22289. The DataFrame-case test_string_dtype_all_na[*-{all,any}-*] parametrizations (df.groupby(df["a"]).all()) also rely on identity-based grouping-key column exclusion in #22369; both must merge before the 32 conftest removals stop xpassing.

Both methods previously raised ``NotImplementedError``. Reduce ``all``/
``any`` to ``min``/``max`` on a bool-coerced copy of the value columns:

- Strings coerce as ``count_characters > 0`` so empty strings become
  ``False`` and nulls remain null (preserving them through the agg).
- Numerics coerce as ``!= 0`` with the same null preservation.
- ``skipna=False`` replaces nulls with ``True`` before the aggregation,
  so a null cannot flip ``all`` to ``False`` and trivially makes
  ``any`` ``True``.
- Empty groups (all-NA values, skipna=True) yield NA from min/max;
  pandas treats those as vacuously ``True`` for ``all`` and ``False``
  for ``any``, so the result is filled accordingly.
- ``min_count`` masks groups whose non-null count is below the
  threshold.
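
The ``min_count`` masking in the last bullet can be sketched in pandas as follows. `mask_by_min_count` is a hypothetical helper, and the float representation of the bool-coerced values (NaN for null) is an assumption made for this illustration, not the cuDF implementation.

```python
import numpy as np
import pandas as pd


def mask_by_min_count(values, keys, reduced, min_count):
    """Null out groups that have fewer than `min_count` non-null values."""
    counts = values.notna().groupby(keys).sum()
    return reduced.where(counts >= min_count)


keys = pd.Series([1, 1, 2], name="k")
values = pd.DataFrame({"v": [1.0, np.nan, np.nan]})  # bool-coerced, NaN = null
reduced = values.groupby(keys).min()

masked_1 = mask_by_min_count(values, keys, reduced, min_count=1)
masked_2 = mask_by_min_count(values, keys, reduced, min_count=2)
```

Group 1 has one non-null value and group 2 has none, so raising `min_count` from 1 to 2 masks group 1 as well.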

Conftest update for ``test_string_dtype_all_na[*-all-*]`` and
``[*-any-*]`` (32 entries). The string-key DataFrame cases additionally
rely on identity-based grouping-key column exclusion, which lands in
a sibling PR; both must merge before the entries can be removed
without xpassing.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@galipremsagar galipremsagar requested a review from a team as a code owner May 4, 2026 20:05
@galipremsagar galipremsagar requested review from TomAugspurger and brandon-b-miller and removed request for a team May 4, 2026 20:05
copy-pr-bot Bot commented May 4, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@github-actions github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels May 4, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 4, 2026
@galipremsagar galipremsagar added bug Something isn't working non-breaking Non-breaking change labels May 4, 2026
@galipremsagar galipremsagar requested a review from mroeschke May 4, 2026 20:33
@galipremsagar
Contributor Author

/okay to test b288bbc

Comment thread python/cudf/cudf/core/groupby/groupby.py Outdated
Comment thread python/cudf/cudf/core/groupby/groupby.py Outdated
# Empty groups (skipna=True with all-NA values) yield NA from
# min/max — pandas treats these as ``True`` for ``all`` and
# ``False`` for ``any``.
bool_np = np.dtype(np.bool_)
Contributor

Just confirming, is np.dtype(np.bool_) returned regardless of the pandas string type?

Contributor Author

Yes:

In [9]: df = pd.DataFrame({
   ...:       "k": [1, 1, 2, 2],
   ...:       "s": pd.array(["a", "b", pd.NA, "c"], dtype="string"),
   ...:   })

In [10]: df
Out[10]: 
   k     s
0  1     a
1  1     b
2  2  <NA>
3  2     c

In [11]: df.groupby("k").all()
Out[11]: 
      s
k      
1  True
2  True

In [12]: df.groupby("k").all().dtypes
Out[12]: 
s    bool
dtype: object

galipremsagar and others added 13 commits May 6, 2026 15:49
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
…ai#22295)

## Description

In pandas-compatible mode, reject casting nullable string columns that
use `pd.NA` as their missing-value sentinel to numpy `object` dtype.

This came from a pandas 3 compatibility issue in `cudf.pandas`: pandas
preserves `pd.NA` when `StringDtype(na_value=pd.NA)` is cast to
`object`, while cuDF's string-to-object path materializes nulls as
Python `None`. Preserving that sentinel would require carrying source
dtype metadata after the result has become plain `object`, which the
review pointed out is not a good fit for the current column model.

Instead, when `mode.pandas_compatible` is enabled, this PR now raises in
`StringColumn.as_string_column` for:

- `pd.StringDtype(..., na_value=pd.NA)` -> `object`
- string `pd.ArrowDtype` -> `object`

Outside pandas-compatible mode, the existing string-to-object cast
behavior is unchanged. String dtypes that use `np.nan` as their
missing-value sentinel and ordinary object string columns also keep the
existing behavior.
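
For reference, the pandas behaviour described above can be checked directly. A small sketch, assuming only pandas is installed; it demonstrates the pandas side only, not cuDF's cast or the new error.

```python
import pandas as pd

# "string" is the nullable string dtype whose missing-value sentinel is pd.NA.
s = pd.Series(["a", None], dtype="string")
as_object = s.astype(object)

# The missing value is still pd.NA after the cast, not None.
sentinel_preserved = as_object.iloc[1] is pd.NA
```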

## Changes

- Add an explicit pandas-compatible-mode `NotImplementedError` for
nullable `pd.NA` string-to-object casts in
`python/cudf/cudf/core/column/string.py`.
- Add focused coverage in
`python/cudf/cudf/tests/series/methods/test_astype.py` for both
pandas-compatible and non-pandas-compatible behavior.
- Remove the previous per-instance `_PANDAS_NA_VALUE` override path.

## Checklist

- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
Drops the legacy `Cluster.DISTRIBUTED` cluster and the entire `rapidsmpf.integrations.dask` execution path. The new `DaskEngine` (`Cluster.DASK`) is unaffected.

Note: all removed components were under `experimental`, so no deprecation period is required.


**What’s removed**

* `Cluster.DISTRIBUTED` enum value and all dispatch paths (`rapidsmpf/core.py`, `parallel.py:get_scheduler`)
* `experimental/dask_registers.py`, `experimental/spilling.py`, `experimental/rapidsmpf/dask.py`
* `rapidsmpf_distributed_available()`, `StreamingExecutor.rapidsmpf_spill`, and `cluster_kind` plumbing in `shuffle.py` and `sort.py`
* Legacy benchmark harness (`benchmarks/utils_legacy.py`) and the `utils.py` dispatch shim
* Legacy test suite (`tests/experimental/legacy/`) and Dask registration test files

**What stays**

* `Cluster.DASK` / `DaskEngine` (`frontend/dask.py`), the supported Dask backend
* `Cluster.SINGLE`, `SPMD`, and `RAY` streaming frontends
* The task-graph backend (`Runtime.TASKS`).

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Matthew Murray (https://github.com/Matt711)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#22358
…identity (rapidsai#22366)

Uses node directly as the dict key instead of `id(node)`, so nodes reconstructed on workers (introduced in rapidsai#22287) are found correctly by value rather than failing with a `KeyError`.
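
The difference between keying by `id(node)` and by the node itself can be shown with a small stand-in. This sketch assumes a hypothetical hashable `Node` dataclass (not the cudf-polars IR classes) and uses a freshly constructed equal object in place of a node reconstructed on a worker.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Node:
    """Hypothetical IR node with value-based equality and hashing."""

    name: str
    children: tuple = ()


node = Node("scan")
by_id = {id(node): "info"}  # keyed by object identity
by_value = {node: "info"}  # keyed by the node's value

rebuilt = Node("scan")  # equal to `node`, but a different object

found_by_id = id(rebuilt) in by_id
found_by_value = rebuilt in by_value
```

The identity-keyed lookup misses the rebuilt node, while the value-keyed lookup finds it.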

Authors:
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Tom Augspurger (https://github.com/TomAugspurger)
  - Mads R. B. Kristensen (https://github.com/madsbk)

URL: rapidsai#22366
…dsai#22344)

Pass the managed-pool MR directly into each `cudf::datagen::generate_*` call instead of swapping it in as the current device resource and restoring on exit. Also fixes forwarding of the mr parameter down the datagen stack.

There are still a few tiny allocations (KBs) that use the default mr because switching would require a copy. These should not cause OOM errors.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Tianyu Liu (https://github.com/kingcrimsontianyu)

URL: rapidsai#22344
Fixes some compile warnings in the libcudf tests. These are deprecation warnings about the missing alignment parameter for the custom allocators in the `hybrid_scan_io` and `parquet_io` examples.

```
/cudf/cpp/examples/parquet_io/io_source.hpp:61:66: warning: 'void cuda::mr::__4::__ibasic_async_resource< <template-parameter-1-1> >::deallocate(cuda::__4::stream_ref, void*, size_t) [with <template-parameter-1-1> = {cuda::__4::__ireference<cuda::__4::__iset_<cuda::mr::__4::__ibasic_async_resource<>, cuda::mr::__4::__ibasic_resource<>, cuda::mr::__4::__with_property<cuda::mr::__4::dynamic_accessibility_property>::__iproperty<>, cuda::mr::__4::__with_property<cuda::mr::__4::host_accessible>::__iproperty<>, cuda::__4::__icopyable<>, cuda::__4::__iequality_comparable<> > >}; size_t = long unsigned int]' is deprecated: Specify an explicit alignment argument. The default alignment will be removed in a future release. [-Wdeprecated-declarations]
   61 |   void deallocate(T* ptr, std::size_t n) noexcept { mr.deallocate(stream, ptr, n * sizeof(T)); }

```

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#22335
This PR updates the join benchmarks to include a skip axis, allowing users to optionally include large table sizes, which is not possible in the current setup due to its unconditional skip of those sizes.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Vyas Ramasubramani (https://github.com/vyasr)
  - Shruti Shivakumar (https://github.com/shrshi)

URL: rapidsai#22241
## Summary

Pandas' `BaseMaskedDtype` defines `__from_arrow__` for converting a
`pyarrow.Array` (including `NullArray`/`ChunkedArray` of nulls) into the
matching `BaseMaskedArray`. The cudf.pandas final proxy types for
`BooleanDtype`, `Int{8,16,32,64}Dtype`, `UInt{8,16,32,64}Dtype`, and
`Float{32,64}Dtype` did not list `__from_arrow__` in their
`additional_attributes`, so the proxy `__getattr__` raised
`AttributeError` even though the slow object has it.

## Change

Add `"__from_arrow__": _FastSlowAttribute("__from_arrow__")` to all
eleven masked dtype proxy declarations in
`python/cudf/cudf/pandas/_wrappers/pandas.py`, mirroring the existing
pattern on `ArrowDtype`.

## Tests / Conftest

Removes 25 entries from `conftest-patch.py` that were xfailed only
because of the missing attribute:

- 22 parametrizations of
`tests/arrays/masked/test_arrow_compat.py::test_from_arrow_null` (all
four masked dtype families × two arrow array shapes).
-
`tests/arrays/masked/test_arrow_compat.py::test_arrow_from_arrow_uint`.
-
`tests/arrays/masked/test_arrow_compat.py::test_dataframe_from_arrow_types_mapper`.
-
`tests/indexes/multi/test_constructors.py::test_from_frame_missing_values_multiIndex`.

All 22 `test_from_arrow_null` cases pass, the full
`test_arrow_compat.py` file passes (69 passed, 22 unrelated xfails), and
the cudf-side `cudf_pandas_tests/` suite is clean (435 passed).

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
## Description
In pandas, empty datetime inputs default to `s` resolution. This PR fixes
that inconsistency so that `cudf` matches `pandas`. It also fixes
`freq` preservation in `Groupby.size`.

## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
…thodProxy (rapidsai#22374)

## Summary

Three pandas-tests xfail entries surfaced `AttributeError` failures that
were just missing entries in the proxy `additional_attributes` (or on
`_MethodProxy` itself).

## Changes

- **`IntervalArray` proxy** now exposes `_left` and `_right` (private),
matching the existing `_data`/`_mask` plumbing. Fixes
`test_series_from_temporary_intervalindex_readonly_data`.
- **`Styler` proxy** now exposes `_compute`,
`_display_funcs_column_names`, and `_display_funcs_index_names` (all
private). Fixes
`test_format_index_names_clear[_display_funcs_column_names-kwargs1]` and
`[_display_funcs_index_names-kwargs0]`.
- **`_MethodProxy`** now exposes `__func__` (forwarded to the slow
underlying method), mirroring the existing `__name__` and `__doc__`
properties. This is required for callers that introspect classmethod
descriptors via `type(x).method.__func__`.
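
The `__func__` forwarding can be pictured with a toy proxy. `MethodProxySketch` and `Slow` are hypothetical stand-ins, not the cudf.pandas `_MethodProxy`; the sketch only shows why introspection through `__func__` needs the attribute to be forwarded.

```python
class Slow:
    @classmethod
    def build(cls):
        return cls.__name__


class MethodProxySketch:
    """Wraps a bound method and forwards calls plus `__func__`."""

    def __init__(self, slow_method):
        self._slow_method = slow_method

    def __call__(self, *args, **kwargs):
        return self._slow_method(*args, **kwargs)

    @property
    def __func__(self):
        # Forward to the underlying function of the wrapped bound method.
        return self._slow_method.__func__


proxy = MethodProxySketch(Slow.build)
underlying = proxy.__func__
```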

## Conftest

Removed three `NODEIDS_THAT_FAIL` entries whose underlying tests now
pass.

## Notes on remaining `AttributeError` xfails

Audited the remaining 17 `AttributeError` xfail entries; they fall into
a few buckets that need deeper changes (out of scope for this PR):

- **Slow-side `isinstance` failures** (`Styler._compute`,
`'DataFrame'/'SubclassedDataFrame' object has no attribute 'dtype'`):
the slow-side function's `__globals__` was bound at import time before
the proxy classes were installed, so `isinstance(proxy_df,
real_DataFrame)` is `False` inside the slow module. Needs a different
mechanism than `additional_attributes`.
- **Mixed-type Series limitations** (`quantile_box`, `quantile_box_nat`,
`quantile_date_range`, `quantile_ea_scalar`): cuDF documents that it
returns a `DataFrame` instead of a `Series` when the result would be
mixed-type — the proxy preserves that type, breaking downstream
`assert_series_equal`.
- **`.values` returning ndarray for nullable dtypes**
(`test_construct_from_dict_ea_series`): pure pandas returns
`IntegerArray`; cuDF returns `ndarray`.
- **Other one-offs** (`SparseArray.reshape`, abstract
`_from_sequence_of_strings`, custom accessor `xyz`, loc setitem datetime
parsing, `_fsproxy_slow` proxy-conversion failure): each needs its own
targeted fix.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Fix bugs that appear when running with `num_ranks > 1`, where client-side `pl.concat(per_rank_outputs)` exposes assumptions that do not hold under single-rank execution.

These were all discovered while working on multi-rank tests.

**NB:** Please take a close look during review, as I’m still a bit unfamiliar with the IR part of cudf-polars.

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#22361
Fixes a regression in rapidsai#22237 where reading a CSV larger than the internal 64 MiB chunk size dropped all rows past the first chunk. Root cause is a misuse of a clamped value to determine the EOF state.

This PR fixes the EOF transition so it only happens in the last chunk.

Also added a large test - all previous CSV tests were below the chunk threshold.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)

Approvers:
  - Yunsong Wang (https://github.com/PointKernel)
  - Basit Ayantunde (https://github.com/lamarrr)
  - Bradley Dice (https://github.com/bdice)

URL: rapidsai#22375
Closes rapidsai#22154

This PR adds decimal128 values to the groupby_max_cardinality benchmark.

Authors:
  - Yunsong Wang (https://github.com/PointKernel)

Approvers:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - David Wendt (https://github.com/davidwendt)

URL: rapidsai#22162
vyasr and others added 8 commits May 8, 2026 02:49
…apidsai#22384)

The `cudf-polars-ir-signatures` pre-commit hook uses `language: python` but is just a local script (`./ci/check_cudf_polars_ir.py`) that only depends on stdlib modules (`ast`, `argparse`, `sys`, `typing`) and has a `#!/usr/bin/env python3` shebang.

With `language: python`, pre-commit unnecessarily creates a virtualenv for this hook. `language: script` is the correct setting — it runs the entry point directly as an executable, relying on the shebang for interpreter selection, with no virtualenv overhead.

Authors:
  - Vyas Ramasubramani (https://github.com/vyasr)

Approvers:
  - James Lamb (https://github.com/jameslamb)

URL: rapidsai#22384
This PR fixes a potential infinite loop in parquet page header count/decode kernels in case of malformed input.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Paul Mattione (https://github.com/pmattione-nvidia)

URL: rapidsai#22274
…rapidsai#22281)

closes rapidsai#21466
closes rapidsai#21767

Waiting for rapidsai#22212

* Makes rapidsmpf a required dependency of cudf_polars
* Removes the following `StreamingExecutor` options as they were "experimental" with associated code paths
    * `StreamingExecutor.runtime`
    * `StreamingExecutor.shuffle_method`
    * `StreamingExecutor.unique_fraction`
    * `StreamingExecutor.groupby_n_ary`
    * `StreamingExecutor.rapidsmpf_spill`
* Removes the task runtime and associated tests
* Some tests were modified to cover only one specific configuration because of rapidsai#22346, so that they pass for now. We plan to revisit this once rapidsmpf becomes the default.

Ops-Bot-Merge-Barrier: true

Authors:
  - Matthew Roeschke (https://github.com/mroeschke)

Approvers:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Bradley Dice (https://github.com/bdice)
  - Matthew Murray (https://github.com/Matt711)
  - Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#22281
This PR uses the host worker pool to submit hybrid scan's host-read IO tasks so that the mutex can be safely released after submission.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
  - Tianyu Liu (https://github.com/kingcrimsontianyu)
  - Shruti Shivakumar (https://github.com/shrshi)

URL: rapidsai#21992
…#22145)

Follow up rapidsai#22144

Adds Python bindings for the `cudf::apply_deletion_mask` API and adds pytests for stream compaction.

Authors:
  - Muhammad Haseeb (https://github.com/mhaseeb123)
  - Matthew Murray (https://github.com/Matt711)

Approvers:
  - Matthew Roeschke (https://github.com/mroeschke)
  - Bradley Dice (https://github.com/bdice)
  - Matthew Murray (https://github.com/Matt711)

URL: rapidsai#22145
…sai#22350)

- Follow up to rapidsai#22315 - Further revises `sort_actor` in preparation for rapidsai/rapidsmpf#853
- Part of rapidsai#22128
- Breaks apart `sort_actor` logic into modular steps, so we can avoid collecting boundaries when we already know the boundaries (future work).

Authors:
  - Richard (Rick) Zamora (https://github.com/rjzamora)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai#22350
…apidsai#22381)

Builds on the cached `streaming_engines` fixture from rapidsai#22364, which amortizes SPMD bootstrap via `_reset()`, and extends the same pattern to Dask and Ray.

With this change, the test matrix runs against:

`["in-memory", "spmd", "spmd-small", "dask", "ray"]`

subject to package availability and `rrun` gating.

We might change the different setups later, but for now CI runs:

| Engine        | Block Size(s)         | GPU Configuration |
|----------------|-----------------------|-------------------|
| `SPMDEngine`   | `"medium"`, `"small"` | Single GPU        |
| `DaskEngine`   | `"medium"`            | Single GPU        |
| `RayEngine`    | `"medium"`            | Two GPUs          |

Authors:
  - Mads R. B. Kristensen (https://github.com/madsbk)
  - Peter Andreas Entschev (https://github.com/pentschev)

Approvers:
  - Matthew Murray (https://github.com/Matt711)
  - Bradley Dice (https://github.com/bdice)
  - Peter Andreas Entschev (https://github.com/pentschev)
  - Matthew Roeschke (https://github.com/mroeschke)

URL: rapidsai#22381
@galipremsagar galipremsagar requested a review from a team as a code owner May 8, 2026 02:55
@github-actions github-actions Bot added Java Affects Java cuDF API. cudf-polars Issues specific to cudf-polars pylibcudf Issues specific to the pylibcudf package labels May 8, 2026
@galipremsagar galipremsagar requested review from mroeschke and removed request for a team May 8, 2026 02:56
@galipremsagar galipremsagar removed libcudf Affects libcudf (C++/CUDA) code. CMake CMake build issue Java Affects Java cuDF API. labels May 8, 2026
galipremsagar added a commit that referenced this pull request May 11, 2026
…22289)

## Summary

`get_dtype_of_same_kind` was silently downgrading `StringDtype` results
when the source and target were both `StringDtype` but with different
storage/`na_value`. When the source had `na_value=np.nan`, it returned
the bare target dtype; when the source had `pyarrow` storage, it
converted to `large_string[pyarrow]` unless the source equaled the
target exactly.

This caused groupby `min`/`max`/`first`/`last` on `StringDtype` value
columns to return the wrong dtype (e.g., `str[python]` would come back
as `StringDtype(na_value=nan)` with pyarrow storage; `string[pyarrow]`
would come back as `large_string[pyarrow]`).

## Change

If both the source and target dtypes are `pd.StringDtype`, return the
source unchanged. This preserves storage and `na_value` for all four
storage/`na_value` combinations.
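
The rule can be sketched as a single early return. This is an illustration only: `get_dtype_of_same_kind_sketch` is a hypothetical name, and the fallback branch is a placeholder for the rest of the real function, which this description does not show.

```python
import numpy as np
import pandas as pd


def get_dtype_of_same_kind_sketch(source_dtype, target_dtype):
    if isinstance(source_dtype, pd.StringDtype) and isinstance(
        target_dtype, pd.StringDtype
    ):
        # Keep the source's storage and na_value as-is.
        return source_dtype
    # Placeholder: the real function has further branches.
    return target_dtype


source = pd.StringDtype("python")
target = pd.StringDtype("python")
kept = get_dtype_of_same_kind_sketch(source, target)
other = get_dtype_of_same_kind_sketch(np.dtype("int64"), np.dtype("float64"))
```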

## Tests

`test_groupby_string_min_max_preserves_dtype` covers
`min`/`max`/`first`/`last` over the four `StringDtype`
storage/`na_value` combinations and asserts that the result dtype
matches pandas.

## Conftest

Removes 24
`test_string_dtype_all_na[*-{min,max,first,last}-{True,False}-True-0]`
entries (the `Series.groupby(df["a"]).<op>()` parametrizations with
`min_count=0`) that now produce the correct dtype on the first try.

## Relationship to other split PRs

This was originally part of a larger #22289 covering string sum, bool
any/all, min_count, and several dtype-preservation pieces. Per [the
review
request](#22289 (review))
this branch now contains only the `get_dtype_of_same_kind` change. The
remaining work is split into:
- #22369 — extension-type preservation in groupby reductions and
identity-based grouping-key column exclusion
- #22370 — string sum
- #22371 — bool any/all
- #22372 — min_count support

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@galipremsagar
Contributor Author

@mroeschke This one is ready for review.

@galipremsagar
Contributor Author

/okay to test 48c4ccd

@galipremsagar galipremsagar added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team labels May 13, 2026
@galipremsagar galipremsagar merged commit 0c1b66a into rapidsai:pandas3 May 13, 2026
6 of 8 checks passed
@github-project-automation github-project-automation Bot moved this from In Progress to Done in cuDF Python May 13, 2026
@GPUtester GPUtester moved this from Done to In Progress in cuDF Python May 13, 2026
galipremsagar added a commit to galipremsagar/cudf that referenced this pull request May 13, 2026
## Summary

Split out from rapidsai#22289. `GroupBy.all` and `GroupBy.any` previously raised
`NotImplementedError`. This PR implements them by reducing to
`min`/`max` on a bool-coerced copy of the value columns.

## Implementation (`python/cudf/cudf/core/groupby/groupby.py`)

A new `_bool_reduce` helper:
- Coerces strings as `count_characters > 0` so empty strings become
`False` and nulls remain null (preserved through the aggregation).
- Coerces numerics as `!= 0` with the same null preservation.
- For `skipna=False`, fills nulls with `True` before aggregation, so a
null cannot flip `all` to `False` and trivially makes `any` `True`.
- Empty groups (skipna=True with all-NA values) yield NA from min/max;
pandas treats those as vacuously `True` for `all` and `False` for `any`,
so the result is filled accordingly.
- Applies `min_count` by counting per-group non-nulls and masking groups
whose count is below the threshold.

The new GroupBy is constructed with `by=self.grouping` (passing the
existing `_Grouping` object) so key columns match the bool-coerced value
columns exactly, avoiding label-based lookup when the original key
column was excluded.

## Tests

`python/cudf/cudf/tests/groupby/test_reductions.py`:
- `test_groupby_all_any` over bool/int/float data.
- `test_groupby_all_any_string` for string columns.
- `test_groupby_all_any_empty` for empty-group behavior.

## Conftest

Removes 32 `test_string_dtype_all_na[*-all-*]` and `[*-any-*]` entries.

## Relationship to rapidsai#22289

One of the four split PRs requested in [the review on
rapidsai#22289](rapidsai#22289 (review)).
The DataFrame-case `test_string_dtype_all_na[*-{all,any}-*]`
parametrizations (`df.groupby(df["a"]).all()`) also rely on
identity-based grouping-key column exclusion in rapidsai#22369; both must merge
before the 32 conftest removals stop xpassing.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
Co-authored-by: Matthew Murray <41342305+Matt711@users.noreply.github.com>
Co-authored-by: Vukasin Milovanovic <vmilovanovic@nvidia.com>
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
Co-authored-by: Yunsong Wang <12716979+PointKernel@users.noreply.github.com>
Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>
Co-authored-by: Kyle Edwards <kyedwards@nvidia.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Paul Taylor <178183+trxcllnt@users.noreply.github.com>
Co-authored-by: Vyas Ramasubramani <vyasr@nvidia.com>
Co-authored-by: Muhammad Haseeb <14217455+mhaseeb123@users.noreply.github.com>

Labels

5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cudf.pandas Issues specific to cudf.pandas cudf-polars Issues specific to cudf-polars non-breaking Non-breaking change pylibcudf Issues specific to the pylibcudf package Python Affects Python cuDF API.

Projects

Status: In Progress
