Implement groupby sum on StringDtype columns as per-group concatenation by galipremsagar · Pull Request #22370 · rapidsai/cudf

galipremsagar · 2026-05-04T20:05:24Z

Summary

Split out from #22289. Pandas 3 makes DataFrame.groupby(...).sum() on StringDtype columns return a per-group string concatenation rather than raise TypeError. This PR implements that path for cuDF.

Implementation (`python/cudf/cudf/core/groupby/groupby.py`)

GroupBy._reduce dispatches to a new _string_sum helper whenever the value column dtype is pd.StringDtype (and op == "sum"). The dispatch happens before the pre-existing min_count != 0 guard so that string sum supports min_count > 0 independently of the general min_count work in the sibling PR.
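The ordering described above can be illustrated with a toy dispatcher. This is only a sketch: `reduce_dispatch`, its string return values, and the assumption that the pre-existing guard raises `NotImplementedError` are all hypothetical and not taken from the cuDF source.

```python
def reduce_dispatch(op, dtype_kind, *, min_count=0):
    """Toy model of the dispatch order (hypothetical, not cuDF code)."""
    # The string-sum branch runs first, so it receives min_count untouched.
    if op == "sum" and dtype_kind == "string":
        return f"string_sum(min_count={min_count})"
    # Assumed shape of the pre-existing guard that every other path still hits.
    if min_count != 0:
        raise NotImplementedError("min_count is not supported for this op")
    return f"generic_{op}"
```

If the two checks were swapped, a string sum with `min_count > 0` would hit the guard before ever reaching the string branch.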

_string_sum:

- Collects per-group values with plc.aggregation.collect_list.
- Joins each list with plc.strings.combine.join_list_elements, using OutputIfEmptyList.EMPTY_STRING (skipna=True) or NULL_ELEMENT (skipna=False) and the matching per-element narep.
- Applies min_count by counting per-group non-nulls (plc.aggregation.count) and using ColumnBase.copy_if_else with a null scalar where count < min_count.
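The three steps above can be modeled in plain Python to make the intended semantics concrete. This is a minimal sketch, not the cuDF implementation: `string_sum_reference` is a hypothetical name, `None` stands in for a null, and the null handling is an assumption inferred from the description (nulls are skipped when `skipna=True`; any null makes the group result null when `skipna=False`).

```python
from collections import defaultdict


def string_sum_reference(keys, values, *, skipna=True, min_count=0):
    """Pure-Python model of per-group string concatenation (hypothetical)."""
    # Step 1: collect the values of each group into a list (cf. collect_list).
    groups = defaultdict(list)
    for key, value in zip(keys, values):
        groups[key].append(value)

    result = {}
    for key in sorted(groups):
        items = groups[key]
        non_null = [v for v in items if v is not None]
        # Step 2: join each list. Assumed semantics: with skipna=True nulls
        # contribute nothing; with skipna=False any null nulls the result.
        if not skipna and len(non_null) != len(items):
            joined = None
        else:
            joined = "".join(non_null)
        # Step 3: null out groups with fewer than min_count non-null values.
        if len(non_null) < min_count:
            joined = None
        result[key] = joined
    return result
```

Under these assumptions, keys `["a", "a", "b", "b", "c"]` with values `["x", "y", None, "z", None]` give `{"a": "xy", "b": "z", "c": ""}` by default, and `{"a": "xy", "b": None, "c": None}` with `min_count=2`.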

The test_group_by_empty_reduction xfail is updated since str + sum no longer raises TypeError.

Tests

test_groupby_string_sum covers all four StringDtype storage/na_value combinations.

Conftest

Removes 16 test_string_dtype_all_na[*-sum-*] entries.

Relationship to #22289

One of the four split PRs requested in the review on #22289. The DataFrame-case parametrizations in test_string_dtype_all_na[*-sum-*] (df.groupby(df["a"]).sum()) also rely on identity-based grouping-key column exclusion, which lands in #22369. Both PRs must merge before those 16 conftest removals stop xpassing.

Pandas 3 makes ``DataFrame.groupby(...).sum()`` on StringDtype columns return a per-group string concatenation rather than raise. Implement that by dispatching to a new ``_string_sum`` helper from ``GroupBy._reduce`` whenever the value column dtype is ``pd.StringDtype``.

The implementation:
- collects values per group with ``plc.aggregation.collect_list``
- joins each list with ``plc.strings.combine.join_list_elements``, using ``OutputIfEmptyList.EMPTY_STRING`` (skipna=True) or ``NULL_ELEMENT`` (skipna=False) and a null/empty string as the per-element narep to match pandas' all-NA group semantics
- applies ``min_count`` by counting per-group non-nulls and using ``copy_if_else`` with a null scalar where ``count < min_count``

The dispatch happens before the pre-existing ``min_count`` guard so that string sum works with ``min_count > 0`` even before general ``min_count`` support is wired up for non-string ops.

Conftest update for ``test_string_dtype_all_na[*-sum-*]``: those parametrizations exercise ``df.groupby(df["a"]).sum()``, which also relies on identity-based grouping-key column exclusion. The xfail entries are removed here in anticipation of the grouping-key exclusion change landing as a sibling PR.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

copy-pr-bot · 2026-05-04T20:05:28Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

galipremsagar · 2026-05-04T20:33:26Z

/okay to test a7b08d8

galipremsagar · 2026-05-06T15:15:49Z

/okay to test 56c1088

galipremsagar · 2026-05-08T16:01:22Z

/okay to test 3852eb3

galipremsagar · 2026-05-08T17:30:18Z

/okay to test dced349

vyasr · 2026-05-11T16:57:27Z

+                return True
+        return False
+
+    def _string_sum(self, *, skipna: bool, min_count: int):


This function is too complicated. It has two single-use nested functions defined that are called by _group_and_join, which is overwriting a nonlocal variable. We need to find a way to simplify that. It looks like we might be able to inline the helpers into _group_and_join in some sensible ways, for instance by combining the aggregations like I note below. Does _group_and_join really need to be local here, or could we define it outside of _string_sum?

vyasr · 2026-05-11T16:59:47Z

+            if keys_cache is None:
+                keys_cache = keys


What is the point of a keys cache if the value is still recomputed before checking if the cache exists?
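One reading of this comment is that a cache only pays off when the check comes before the computation. The sketch below shows that check-then-compute pattern in isolation; `make_lazy_cache` and `compute` are hypothetical names, and the actual structure of the code under review is not shown in this thread.

```python
def make_lazy_cache(compute):
    """Wrap `compute` so it runs at most once (hypothetical helper)."""
    cache = None

    def get():
        nonlocal cache
        # Check the cache first; only compute on a miss. A compute() that
        # returns None would be re-run, which is acceptable for this sketch.
        if cache is None:
            cache = compute()
        return cache

    return get
```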

vyasr · 2026-05-11T17:06:46Z

+            count_col = ColumnBase.create(
+                count_plc, dtype_from_pylibcudf_column(count_plc)
+            )
+            keep_mask = binaryop.binaryop(
+                count_col,
+                plc.Scalar.from_py(min_count),
+                "__ge__",
+                np.dtype(np.bool_),
+            )


Could we operate directly on count_plc instead here and save a round trip?

vyasr · 2026-05-11T17:09:40Z

+            count_req = [
+                plc.groupby.GroupByRequest(
+                    string_col.plc_column, [plc.aggregation.count()]
+                )
+            ]


Can this groupby request be bundled with the one in _concat_column? It looks like we're not using the result_col here, so I think we could conditionally add this to the list of requests depending on min_count and then conditionally extract it.
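The bundling suggested here can be sketched without any GPU dependency. In the sketch below, plain strings stand in for the pylibcudf aggregation objects, and `build_aggregations` and `unpack_results` are hypothetical names; it only illustrates conditionally adding the count to a single request and conditionally extracting it afterwards.

```python
def build_aggregations(min_count):
    """Aggregations to run in one groupby pass (strings are stand-ins)."""
    aggs = ["collect_list"]
    # Only pay for the count when min_count can actually null out a group.
    if min_count > 0:
        aggs.append("count")
    return aggs


def unpack_results(results, min_count):
    """Split the per-aggregation results returned for the single request."""
    collected = results[0]
    # The count result is present only if it was requested above.
    counts = results[1] if min_count > 0 else None
    return collected, counts
```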

galipremsagar requested a review from a team as a code owner May 4, 2026 20:05

galipremsagar requested review from rjzamora and wence- and removed request for a team May 4, 2026 20:05

github-actions Bot assigned galipremsagar May 4, 2026

github-actions Bot added the Python (Affects Python cuDF API.) and cudf.pandas (Issues specific to cudf.pandas) labels May 4, 2026

github-project-automation Bot added this to cuDF Python May 4, 2026

GPUtester moved this to In Progress in cuDF Python May 4, 2026

galipremsagar mentioned this pull request May 4, 2026

Preserve StringDtype storage and na_value in get_dtype_of_same_kind #22289

Open

galipremsagar requested a review from mroeschke May 4, 2026 20:30

galipremsagar added the bug (Something isn't working) and non-breaking (Non-breaking change) labels May 4, 2026

Merge branch 'pandas3' into groupby_string_sum

56c1088

galipremsagar added 2 commits May 7, 2026 15:17

Merge branch 'pandas3' into groupby_string_sum

b8ddab3

Merge branch 'pandas3' into groupby_string_sum

3852eb3

Update

dced349

vyasr requested changes May 11, 2026

galipremsagar wants to merge 5 commits into rapidsai:pandas3 from galipremsagar:groupby_string_sum

3 participants
