Preserve StringDtype storage and na_value in get_dtype_of_same_kind #22289
Open

galipremsagar wants to merge 2 commits into rapidsai:pandas3 from
Conversation
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.
mroeschke reviewed May 1, 2026
…ame_kind``

When the source is a ``pd.StringDtype`` and the target is also a ``pd.StringDtype`` (regardless of storage or na_value), return the source dtype unchanged.

This fixes groupby min/max/first/last on ``StringDtype`` value columns silently downgrading the result dtype:

- ``str[python]`` was being converted to ``StringDtype(na_value=nan)`` with pyarrow storage.
- ``string[pyarrow]`` was being converted to ``large_string[pyarrow]``.

The previous code only preserved the dtype for the pyarrow-storage ``source == target`` case, and replaced ``np.nan`` na_value strings with the bare target dtype; both behaviors produced the wrong result for at least one of the four ``StringDtype`` storage/na_value combinations.

Conftest update for ``test_string_dtype_all_na``: 24 entries that exercise ``Series.groupby(df["a"]).<min|max|first|last>()`` (the ``test_series=True``, ``min_count=0`` parametrizations) now produce the correct dtype on the first try and no longer need an xfail.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
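The pass-through behavior described in the commit message can be sketched in a few lines. This is a hedged illustration, not cudf's actual `get_dtype_of_same_kind` implementation (which handles many more dtype kinds); the function name and the plain-pandas fallback are assumptions made for the example:

```python
import pandas as pd


def get_dtype_of_same_kind(source, target):
    # Hedged sketch of the fix described above, not cudf's actual code:
    # when both dtypes are pd.StringDtype, pass the source through so
    # its storage and na_value survive; otherwise fall back to target.
    if isinstance(source, pd.StringDtype) and isinstance(target, pd.StringDtype):
        return source
    return target


# A StringDtype source is returned as-is rather than being collapsed
# onto the target dtype (python storage used to avoid a pyarrow dependency):
src = pd.StringDtype(storage="python")
assert get_dtype_of_same_kind(src, pd.StringDtype(storage="python")) is src
```

The key design point is that the identity of the source dtype object is preserved, so whatever storage and na_value it carries survive the round trip.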
Force-pushed from bdae957 to ce5babc
This was referenced May 4, 2026
Contributor (Author)

@mroeschke I addressed all the review comments and split up changes into smaller PRs:
Contributor (Author)

/okay to test ce5babc
galipremsagar added a commit that referenced this pull request May 7, 2026
## Summary

Split out from #22289. `GroupBy._reduce` previously raised `NotImplementedError` whenever `min_count != 0`, forcing `cudf.pandas` to fall back to the slow path for `groupby.sum(min_count=...)` and similar calls.

## Implementation (`python/cudf/cudf/core/groupby/groupby.py`)

Run the requested aggregation, then mask result rows whose per-group non-null count (computed via `self.agg("count")`) is below `min_count`. Supports both `Series` and `DataFrame` results.

## Tests

`python/cudf/cudf/tests/groupby/test_reductions.py`:

- `test_groupby_reduce_min_count` over `sum`, `min`, `max`, `first`, `last` for `min_count` values 0, 1, 2, 3, 5.
- `test_groupby_series_reduce_min_count` for `Series.groupby` paths.

## Relationship to #22289

One of the four split PRs requested in the review on #22289. No conftest removals because the existing pandas-tests entries that fail with `min_count` errors also need the other split PRs (string sum, bool any/all, grouping-key exclusion) before they can be unmarked.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
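The aggregate-then-mask approach in the Implementation section can be illustrated against plain pandas. This is a hedged sketch under stated assumptions: the PR itself patches cudf's `GroupBy._reduce`, and the helper name `reduce_with_min_count` here is invented for the example:

```python
import pandas as pd


def reduce_with_min_count(df, by, op, min_count=0):
    """Hedged sketch of the masking approach described above, written
    against plain pandas rather than cudf: run the aggregation, then
    null out rows whose per-group non-null count is below min_count."""
    gb = df.groupby(by)
    result = gb.agg(op)
    if min_count > 0:
        counts = gb.count()  # per-group non-null counts, per column
        result = result.mask(counts < min_count)
    return result


df = pd.DataFrame({"key": ["a", "a", "b", "b"], "val": [1.0, None, 3.0, 4.0]})
out = reduce_with_min_count(df, "key", "sum", min_count=2)
# group "a" has only one non-null value, so its sum is masked to NaN
```

Because `counts` has the same index and columns as `result`, a single `DataFrame.mask` call handles every aggregated column at once, which is why the approach extends naturally to both `Series` and `DataFrame` results.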
Summary
`get_dtype_of_same_kind` was silently downgrading `StringDtype` results when the source and target were both `StringDtype` but with different storage/na_value. When the source had `na_value=np.nan`, it returned the bare target dtype; when the source had `pyarrow` storage, it converted to `large_string[pyarrow]` unless the source equaled the target exactly. This caused groupby `min`/`max`/`first`/`last` on `StringDtype` value columns to return the wrong dtype (e.g., `str[python]` would come back as `StringDtype(na_value=nan)` with pyarrow storage; `string[pyarrow]` would come back as `large_string[pyarrow]`).

Change
If both the source and target dtypes are `pd.StringDtype`, return the source unchanged. This preserves storage and `na_value` for all four storage/`na_value` combinations.

Tests
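A dtype-preservation check of this kind can be sketched with plain pandas. This is a hedged illustration, not the PR's actual test (which runs against cudf and covers all four storage/na_value combinations plus `min`/`max`); it uses only python-storage `StringDtype` to avoid a pyarrow dependency, and only `first`/`last`:

```python
import pandas as pd

# Hedged sketch of a groupby dtype-preservation check; the PR's real
# test additionally covers min/max and the pyarrow-storage variants.
df = pd.DataFrame(
    {
        "a": [1, 1, 2],
        "b": pd.array(["x", "y", "z"], dtype=pd.StringDtype(storage="python")),
    }
)
for op in ("first", "last"):
    result = getattr(df.groupby("a")["b"], op)()
    # the aggregated column keeps the exact StringDtype of the input
    assert result.dtype == df["b"].dtype, op
```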
`test_groupby_string_min_max_preserves_dtype` covers `min`/`max`/`first`/`last` over the four `StringDtype` storage/`na_value` combinations and asserts that the result dtype matches pandas.

Conftest
Removes 24 `test_string_dtype_all_na[*-{min,max,first,last}-{True,False}-True-0]` entries (the `Series.groupby(df["a"]).<op>()` parametrizations with `min_count=0`) that now produce the correct dtype on the first try.

Relationship to other split PRs
This was originally part of a larger #22289 covering string sum, bool any/all, min_count, and several dtype-preservation pieces. Per the review request, this branch now contains only the `get_dtype_of_same_kind` change. The remaining work is split into: