Implement groupby sum on StringDtype columns as per-group concatenation#22370
Implement groupby sum on StringDtype columns as per-group concatenation#22370galipremsagar wants to merge 5 commits into
Conversation
Pandas 3 makes ``DataFrame.groupby(...).sum()`` on StringDtype columns return a per-group string concatenation rather than raise. Implement that by dispatching to a new ``_string_sum`` helper from ``GroupBy._reduce`` whenever the value column dtype is ``pd.StringDtype``. The implementation: - collects values per group with ``plc.aggregation.collect_list`` - joins each list with ``plc.strings.combine.join_list_elements``, using ``OutputIfEmptyList.EMPTY_STRING`` (skipna=True) or ``NULL_ELEMENT`` (skipna=False) and a null/empty string as the per-element narep to match pandas' all-NA group semantics - applies ``min_count`` by counting per-group non-nulls and using ``copy_if_else`` with a null scalar where ``count < min_count`` The dispatch happens before the pre-existing ``min_count`` guard so that string sum works with ``min_count > 0`` even before general ``min_count`` support is wired up for non-string ops. Conftest update for ``test_string_dtype_all_na[*-sum-*]``: those parametrizations exercise ``df.groupby(df["a"]).sum()``, which also relies on identity-based grouping-key column exclusion. The xfail entries are removed here in anticipation of the grouping-key exclusion change landing as a sibling PR. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
/okay to test a7b08d8 |
|
/okay to test 56c1088 |
|
/okay to test 3852eb3 |
|
/okay to test dced349 |
| return True | ||
| return False | ||
|
|
||
| def _string_sum(self, *, skipna: bool, min_count: int): |
There was a problem hiding this comment.
This function is too complicated. It has two single-use nested functions defined that are called by _group_and_join, which is overwriting a nonlocal variable. We need to find a way to simplify that. It looks like we might be able to inline the helpers into _group_and_join in some sensible ways, for instance by combining the aggregations like I note below. Does _group_and_join really need to be local here, or could we define it outside of _string_sum?
| if keys_cache is None: | ||
| keys_cache = keys |
There was a problem hiding this comment.
What is the point of a keys cache if the value is still recomputed before checking if the cache exists?
| count_col = ColumnBase.create( | ||
| count_plc, dtype_from_pylibcudf_column(count_plc) | ||
| ) | ||
| keep_mask = binaryop.binaryop( | ||
| count_col, | ||
| plc.Scalar.from_py(min_count), | ||
| "__ge__", | ||
| np.dtype(np.bool_), | ||
| ) |
There was a problem hiding this comment.
Could we operate directly on count_plc instead here and save a round trip?
| count_req = [ | ||
| plc.groupby.GroupByRequest( | ||
| string_col.plc_column, [plc.aggregation.count()] | ||
| ) | ||
| ] |
There was a problem hiding this comment.
Can this groupby request be bundled with the one in _concat_column? It looks like we're not using the result_col here, so I think we could conditionally add this to the list of requests depending on min_count and then conditionally extract it.
Summary
Split out from #22289. Pandas 3 makes
DataFrame.groupby(...).sum()onStringDtypecolumns return a per-group string concatenation rather than raiseTypeError. This PR implements that path for cuDF.Implementation (
python/cudf/cudf/core/groupby/groupby.py)GroupBy._reducedispatches to a new_string_sumhelper whenever the value column dtype ispd.StringDtype(andop == "sum"). The dispatch happens before the pre-existingmin_count != 0guard so that string sum supportsmin_count > 0independently of the generalmin_countwork in the sibling PR._string_sum:plc.aggregation.collect_list.plc.strings.combine.join_list_elements, usingOutputIfEmptyList.EMPTY_STRING(skipna=True) orNULL_ELEMENT(skipna=False) and the matching per-element narep.min_countby counting per-group non-nulls (plc.aggregation.count) and usingColumnBase.copy_if_elsewith a null scalar wherecount < min_count.The
test_group_by_empty_reductionxfail is updated sincestr+sumno longer raisesTypeError.Tests
test_groupby_string_sumcovers all fourStringDtypestorage/na_value combinations.Conftest
Removes 16
test_string_dtype_all_na[*-sum-*]entries.Relationship to #22289
One of the four split PRs requested in the review on #22289. The DataFrame-case parametrizations in
test_string_dtype_all_na[*-sum-*](df.groupby(df["a"]).sum()) also rely on identity-based grouping-key column exclusion, which lands in #22369. Both PRs must merge before those 16 conftest removals stop xpassing.