Skip to content

Preserve StringDtype storage and na_value in get_dtype_of_same_kind#22289

Open
galipremsagar wants to merge 2 commits intorapidsai:pandas3from
galipremsagar:groupby_fixes
Open

Preserve StringDtype storage and na_value in get_dtype_of_same_kind#22289
galipremsagar wants to merge 2 commits intorapidsai:pandas3from
galipremsagar:groupby_fixes

Conversation

@galipremsagar
Copy link
Copy Markdown
Contributor

@galipremsagar galipremsagar commented Apr 24, 2026

Summary

get_dtype_of_same_kind was silently downgrading StringDtype results when the source and target were both StringDtype but with different storage/na_value. When the source had na_value=np.nan, it returned the bare target dtype; when the source had pyarrow storage, it converted to large_string[pyarrow] unless the source equaled the target exactly.

This caused groupby min/max/first/last on StringDtype value columns to return the wrong dtype (e.g., str[python] would come back as StringDtype(na_value=nan) with pyarrow storage; string[pyarrow] would come back as large_string[pyarrow]).

Change

If both the source and target dtypes are pd.StringDtype, return the source unchanged. This preserves storage and na_value for all four storage/na_value combinations.

Tests

test_groupby_string_min_max_preserves_dtype covers min/max/first/last over the four StringDtype storage/na_value combinations and asserts that the result dtype matches pandas.

Conftest

Removes 24 test_string_dtype_all_na[*-{min,max,first,last}-{True,False}-True-0] entries (the Series.groupby(df[\"a\"]).<op>() parametrizations with min_count=0) that now produce the correct dtype on the first try.

Relationship to other split PRs

This was originally part of a larger #22289 covering string sum, bool any/all, min_count, and several dtype-preservation pieces. Per the review request this branch now contains only the get_dtype_of_same_kind change. The remaining work is split into:

@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 24, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@galipremsagar galipremsagar changed the base branch from main to pandas3 April 24, 2026 15:47
@github-actions github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Apr 24, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Apr 24, 2026
@galipremsagar galipremsagar changed the title Groupby fixes Fix groupby reductions on StringDtype: min_count, all/any, string sum, and dtype preservation Apr 24, 2026
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented Apr 24, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@galipremsagar galipremsagar requested a review from mroeschke April 24, 2026 15:52
@galipremsagar galipremsagar marked this pull request as ready for review April 24, 2026 15:52
@galipremsagar galipremsagar requested a review from a team as a code owner April 24, 2026 15:52
@galipremsagar galipremsagar requested review from bdice and removed request for a team April 24, 2026 15:52
@galipremsagar galipremsagar added bug Something isn't working 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Apr 24, 2026
Copy link
Copy Markdown
Contributor

@mroeschke mroeschke left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting that this is a pretty large PR.

Could it be split up into multiple PRs addressing separate groupby reductions like

  1. string sum
  2. bool any/all
  3. min count support
  4. other extension type preservation

Comment thread python/cudf/cudf/core/groupby/groupby.py Outdated
Comment thread python/cudf/cudf/core/groupby/groupby.py Outdated
Comment thread python/cudf/cudf/core/groupby/groupby.py Outdated
Comment thread python/cudf/cudf/core/groupby/groupby.py Outdated
Comment thread python/cudf/cudf/core/groupby/groupby.py Outdated
Comment thread python/cudf/cudf/core/groupby/groupby.py Outdated
Comment thread python/cudf/cudf/core/groupby/groupby.py Outdated
…ame_kind``

When the source is a ``pd.StringDtype`` and the target is also a
``pd.StringDtype`` (regardless of storage or na_value), return the
source dtype unchanged. This fixes groupby min/max/first/last on
``StringDtype`` value columns silently downgrading the result dtype:

- ``str[python]`` was being converted to ``StringDtype(na_value=nan)``
  with pyarrow storage.
- ``string[pyarrow]`` was being converted to ``large_string[pyarrow]``.

The previous code only preserved the dtype for the pyarrow-storage
``source == target`` case, and replaced ``np.nan`` na_value strings
with the bare target dtype, both of which produced the wrong result
for at least one of the four ``StringDtype`` storage/na_value
combinations.

Conftest update for ``test_string_dtype_all_na``: 24 entries that
exercise ``Series.groupby(df["a"]).<min|max|first|last>()`` (the
``test_series=True``, ``min_count=0`` parametrizations) now produce
the correct dtype on the first try and no longer need an xfail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@galipremsagar galipremsagar changed the title Fix groupby reductions on StringDtype: min_count, all/any, string sum, and dtype preservation Preserve StringDtype storage and na_value in get_dtype_of_same_kind May 4, 2026
@galipremsagar
Copy link
Copy Markdown
Contributor Author

Noting that this is a pretty large PR.

Could it be split up into multiple PRs addressing separate groupby reductions like

  1. string sum
  2. bool any/all
  3. min count support
  4. other extension type preservation

@mroeschke I addressed all the review comments and split up changes into smaller PRs:
#22369 — extension-type preservation in groupby reductions and identity-based grouping-key column exclusion
#22370 — string sum
#22371 — bool any/all
#22372 — min_count support

@galipremsagar galipremsagar requested a review from mroeschke May 4, 2026 20:31
@galipremsagar
Copy link
Copy Markdown
Contributor Author

/okay to test ce5babc

galipremsagar added a commit that referenced this pull request May 7, 2026
## Summary

Split out from #22289. `GroupBy._reduce` previously raised
`NotImplementedError` whenever `min_count != 0`, forcing `cudf.pandas`
to fall back to the slow path for `groupby.sum(min_count=...)` and
similar calls.

## Implementation (`python/cudf/cudf/core/groupby/groupby.py`)

Run the requested aggregation, then mask result rows whose per-group
non-null count (computed via `self.agg(\"count\")`) is below
`min_count`. Supports both `Series` and `DataFrame` results.

## Tests

`python/cudf/cudf/tests/groupby/test_reductions.py`:
- `test_groupby_reduce_min_count` over `sum`, `min`, `max`, `first`,
`last` for `min_count` values 0, 1, 2, 3, 5.
- `test_groupby_series_reduce_min_count` for `Series.groupby` paths.

## Relationship to #22289

One of the four split PRs requested in [the review on
#22289](#22289 (review)).
No conftest removals because the existing pandas-tests entries that fail
with `min_count` errors also need the other split PRs (string sum, bool
any/all, grouping-key exclusion) before they can be unmarked.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

3 - Ready for Review Ready for review by team bug Something isn't working cudf.pandas Issues specific to cudf.pandas non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Status: In Progress

Development

Successfully merging this pull request may close these issues.

3 participants