Preserve StringDtype storage and na_value in get_dtype_of_same_kind by galipremsagar · Pull Request #22289 · rapidsai/cudf

galipremsagar · 2026-04-24T15:47:43Z

Summary

get_dtype_of_same_kind was silently downgrading StringDtype results when the source and target were both StringDtype but with different storage/na_value. When the source had na_value=np.nan, it returned the bare target dtype; when the source had pyarrow storage, it converted to large_string[pyarrow] unless the source equaled the target exactly.

This caused groupby min/max/first/last on StringDtype value columns to return the wrong dtype (e.g., str[python] would come back as StringDtype(na_value=nan) with pyarrow storage; string[pyarrow] would come back as large_string[pyarrow]).

Change

If both the source and target dtypes are pd.StringDtype, return the source unchanged. This preserves storage and na_value for all four storage/na_value combinations.

Tests

test_groupby_string_min_max_preserves_dtype covers min/max/first/last over the four StringDtype storage/na_value combinations and asserts that the result dtype matches pandas.

Conftest

Removes 24 test_string_dtype_all_na[*-{min,max,first,last}-{True,False}-True-0] entries (the Series.groupby(df[\"a\"]).<op>() parametrizations with min_count=0) that now produce the correct dtype on the first try.

Relationship to other split PRs

This was originally part of a larger #22289 covering string sum, bool any/all, min_count, and several dtype-preservation pieces. Per the review request this branch now contains only the get_dtype_of_same_kind change. The remaining work is split into:

Preserve extension dtypes in groupby reductions and exclude grouping-key columns by identity #22369 — extension-type preservation in groupby reductions and identity-based grouping-key column exclusion
Implement groupby sum on StringDtype columns as per-group concatenation #22370 — string sum
Implement groupby all/any via bool-coercion + min/max #22371 — bool any/all
Implement min_count for groupby reductions #22372 — min_count support

copy-pr-bot · 2026-04-24T15:47:49Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

copy-pr-bot · 2026-04-24T15:52:39Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mroeschke

Noting that this is a pretty large PR.

Could it be split up into multiple PRs addressing separate groupby reductions like

string sum
bool any/all
min count support
other extension type preservation

…ame_kind`` When the source is a ``pd.StringDtype`` and the target is also a ``pd.StringDtype`` (regardless of storage or na_value), return the source dtype unchanged. This fixes groupby min/max/first/last on ``StringDtype`` value columns silently downgrading the result dtype: - ``str[python]`` was being converted to ``StringDtype(na_value=nan)`` with pyarrow storage. - ``string[pyarrow]`` was being converted to ``large_string[pyarrow]``. The previous code only preserved the dtype for the pyarrow-storage ``source == target`` case, and replaced ``np.nan`` na_value strings with the bare target dtype, both of which produced the wrong result for at least one of the four ``StringDtype`` storage/na_value combinations. Conftest update for ``test_string_dtype_all_na``: 24 entries that exercise ``Series.groupby(df["a"]).<min|max|first|last>()`` (the ``test_series=True``, ``min_count=0`` parametrizations) now produce the correct dtype on the first try and no longer need an xfail. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

galipremsagar · 2026-05-04T20:31:41Z

Noting that this is a pretty large PR.

Could it be split up into multiple PRs addressing separate groupby reductions like

string sum

bool any/all

min count support

other extension type preservation

@mroeschke I addressed all the review comments and split up changes into smaller PRs:
#22369 — extension-type preservation in groupby reductions and identity-based grouping-key column exclusion
#22370 — string sum
#22371 — bool any/all
#22372 — min_count support

galipremsagar · 2026-05-04T20:32:29Z

/okay to test ce5babc

## Summary Split out from #22289. `GroupBy._reduce` previously raised `NotImplementedError` whenever `min_count != 0`, forcing `cudf.pandas` to fall back to the slow path for `groupby.sum(min_count=...)` and similar calls. ## Implementation (`python/cudf/cudf/core/groupby/groupby.py`) Run the requested aggregation, then mask result rows whose per-group non-null count (computed via `self.agg(\"count\")`) is below `min_count`. Supports both `Series` and `DataFrame` results. ## Tests `python/cudf/cudf/tests/groupby/test_reductions.py`: - `test_groupby_reduce_min_count` over `sum`, `min`, `max`, `first`, `last` for `min_count` values 0, 1, 2, 3, 5. - `test_groupby_series_reduce_min_count` for `Series.groupby` paths. ## Relationship to #22289 One of the four split PRs requested in [the review on #22289](#22289 (review)). No conftest removals because the existing pandas-tests entries that fail with `min_count` errors also need the other split PRs (string sum, bool any/all, grouping-key exclusion) before they can be unmarked. --------- Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions Bot assigned galipremsagar Apr 24, 2026

galipremsagar changed the base branch from main to pandas3 April 24, 2026 15:47

github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels Apr 24, 2026

github-project-automation Bot added this to cuDF Python Apr 24, 2026

GPUtester moved this to In Progress in cuDF Python Apr 24, 2026

galipremsagar changed the title ~~Groupby fixes~~ Fix groupby reductions on StringDtype: min_count, all/any, string sum, and dtype preservation Apr 24, 2026

galipremsagar requested a review from mroeschke April 24, 2026 15:52

galipremsagar marked this pull request as ready for review April 24, 2026 15:52

galipremsagar requested a review from a team as a code owner April 24, 2026 15:52

galipremsagar requested review from bdice and removed request for a team April 24, 2026 15:52

galipremsagar added bug Something isn't working 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels Apr 24, 2026

mroeschke reviewed May 1, 2026

View reviewed changes

galipremsagar force-pushed the groupby_fixes branch from bdae957 to ce5babc Compare May 4, 2026 20:04

galipremsagar changed the title ~~Fix groupby reductions on StringDtype: min_count, all/any, string sum, and dtype preservation~~ Preserve StringDtype storage and na_value in get_dtype_of_same_kind May 4, 2026

galipremsagar requested a review from mroeschke May 4, 2026 20:31

Merge branch 'pandas3' into groupby_fixes

e836a83

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Preserve StringDtype storage and na_value in get_dtype_of_same_kind#22289

Preserve StringDtype storage and na_value in get_dtype_of_same_kind#22289
galipremsagar wants to merge 2 commits intorapidsai:pandas3from
galipremsagar:groupby_fixes

galipremsagar commented Apr 24, 2026 •

edited

Loading

Uh oh!

copy-pr-bot Bot commented Apr 24, 2026

Uh oh!

copy-pr-bot Bot commented Apr 24, 2026

Uh oh!

mroeschke left a comment •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

galipremsagar commented May 4, 2026

Uh oh!

galipremsagar commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

galipremsagar commented Apr 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Change

Tests

Conftest

Relationship to other split PRs

Uh oh!

copy-pr-bot Bot commented Apr 24, 2026

Uh oh!

copy-pr-bot Bot commented Apr 24, 2026

Uh oh!

mroeschke left a comment • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

Uh oh!

galipremsagar commented May 4, 2026

Uh oh!

galipremsagar commented May 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

galipremsagar commented Apr 24, 2026 •

edited

Loading

mroeschke left a comment •

edited

Loading