Fix AssertionError: DataFrame.columns are different failures in cudf.pandas#22351
Merged
mroeschke reviewed on May 7, 2026
```diff
         ]
-        return self.loc[:, to_select]
+        result = self.loc[:, to_select]
+        if not to_select and self._data.rangeindex:
```
Contributor
Ideally I would hope loc preserved the .rangeindex but that could be for another PR
mroeschke approved these changes on May 13, 2026
mroeschke (Contributor) left a comment:
Minor comment, otherwise LGTM
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
At various times it is useful to check whether two tables are equal. For example, in cudf-polars we use this to check whether two tables are "compatibly" partitioned. Previously there were no such utilities in libcudf proper: the best one could do was to loop over the columns, call `cudf::binary_operation` with `NULL_EQUALS`, and then `cudf::reduce` on the result, which launches many more kernels than necessary. Instead, use the existing row_equality operators to perform a single transform_reduce over the table checking for equality.

Authors:
- Lawrence Mitchell (https://github.com/wence-)
- Muhammad Haseeb (https://github.com/mhaseeb123)

Approvers:
- David Wendt (https://github.com/davidwendt)
- Muhammad Haseeb (https://github.com/mhaseeb123)

URL: rapidsai#22319
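For intuition, the per-column-loop versus single-pass shape can be sketched at the Python level with pandas, whose `DataFrame.equals` already treats nulls in matching positions as equal (the same `NULL_EQUALS` semantics). This is an illustration only, not the new libcudf API:

```python
import pandas as pd

left = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})
right = pd.DataFrame({"a": [1.0, None, 3.0], "b": ["x", "y", None]})

# Per-column loop: the shape of the old "many kernels" approach.
per_column = all(left[c].equals(right[c]) for c in left.columns)

# Single whole-table check: the shape of the new single-pass utility.
whole_table = left.equals(right)

assert per_column and whole_table  # nulls in matching positions compare equal
```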
rapidsai#22410)

`lf.collect(engine="gpu")` and `pl.GPUEngine(executor="streaming")` using the default cluster now route through a new process-wide `DefaultSingletonEngine` instead of constructing a fresh rapidsmpf `Context`, RMM adaptor, and Python executor for every query. Bootstrap now happens once per process rather than once per query.

`DefaultSingletonEngine` is a process-wide single-GPU singleton specialization of `SPMDEngine`: at most one live instance exists per process, it always uses a single-rank communicator plus default environment-derived settings, and repeated calls reuse the same engine instance until explicit shutdown.

The default cluster enum value is renamed from `Cluster.SINGLE` to `Cluster.DEFAULT_SINGLETON` so the dispatch token better reflects the actual behavior. This PR also removes the dead inline-context fallback in `evaluate_pipeline`, which was the original `"single"` execution path.

Authors:
- Mads R. B. Kristensen (https://github.com/madsbk)

Approvers:
- Lawrence Mitchell (https://github.com/wence-)

URL: rapidsai#22410
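A minimal pure-Python sketch of the singleton lifecycle described above, with all names hypothetical (not the real cudf-polars classes):

```python
import threading

class DefaultSingletonEngineSketch:
    """Process-wide singleton sketch: repeated get() calls reuse one
    instance until an explicit shutdown. Illustrative names only."""

    _instance = None
    _lock = threading.Lock()

    @classmethod
    def get(cls):
        with cls._lock:
            if cls._instance is None:
                cls._instance = cls()  # bootstrap happens once per process
            return cls._instance

    @classmethod
    def shutdown(cls):
        with cls._lock:
            cls._instance = None

a = DefaultSingletonEngineSketch.get()
b = DefaultSingletonEngineSketch.get()
assert a is b                # repeated queries reuse the same engine
DefaultSingletonEngineSketch.shutdown()
c = DefaultSingletonEngineSketch.get()
assert c is not a            # only explicit shutdown releases the instance
```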
…nees (rapidsai#22453) I can't tell if this is right, because that repo has weird tags: ``` 🐚 git ls-remote https://github.com/actions-ecosystem/action-add-assignees refs/tags/* 59970ef501a38f91ea9afa2993b44162e33b3eac refs/tags/v1 ce5019e63cc4f35aba27308dc88d19c8f3686747 refs/tags/v1^{} 60aa57ae61b8fc53785076d0fc7327a6ef3a06fd refs/tags/v1.0.0 ce5019e63cc4f35aba27308dc88d19c8f3686747 refs/tags/v1.0.0^{} 48956ae0c11159427139404f968c4686dd245cfd refs/tags/v1.0.1 a5b84af721c4a621eb9c7a4a95ec20a90d0b88e9 refs/tags/v1.0.1^{} ``` Only the `@v1` mutable ref is allow-listed in the org-wide actions settings, so maybe the tag pointing to `v1.0.0` messes it up? The action that allowed is listed as: `actions-ecosystem/action-add-assignee@v1` It is unclear to me if that `@v1` will allow a commit SHA that points to the same location that the `@v1` tag points. I would've thought so, but the current SHA on `main` _does_ point to the same location: ``` gforsyth …/action-add-assignees main 13:29 🐚 git checkout v1 HEAD is now at ce5019e Update action.yml (#3) gforsyth …/action-add-assignees HEAD 13:29 🐚 git rev-parse HEAD ce5019e63cc4f35aba27308dc88d19c8f3686747 ``` It's possible (and what this PR currently changes) that the _commented_ tag corresponding to that SHA is causing the issue here, since `v1.0.0` isn't explicitly allowed (despite being the same commit): ``` * ce5019e - (HEAD, tag: v1.0.0, tag: v1) Update action.yml (#3) (6 years ago) <micnncim> ``` The other option is that the `@v1` only allows resolving the SHA of that git tag object itself (the tag, not what it points to), which is `59970ef501a38f91ea9afa2993b44162e33b3eac`. Does the SHA of a tag object change if the tag is mutated to point to a different commit? I don't know. Authors: - Gil Forsyth (https://github.com/gforsyth) Approvers: - Bradley Dice (https://github.com/bdice) URL: rapidsai#22453
Adds `NumpyExtensionArray` to the bases of the `pd.arrays.StringArray` proxy so it is recognized as a `NumpyExtensionArray` subclass. This unblocks the previously xfailed `test_comparison_methods_array` cases in `tests/arithmetic/test_string.py` and related `test_EA_types` tests, which are now removed from the conftest xfail list. Also adds direct cudf.pandas tests covering the subclass check and object/string array comparison ops.
## Description
This PR fixes the return types of `quantile` to properly preserve the type kinds when pandas nullable extension types are present.

## Checklist
- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
…key columns by identity (rapidsai#22369)

## Summary
Split out from rapidsai#22289. Two related fixes that together restore pandas-3 behavior for groupby reductions on extension-typed value columns:

### Output dtype for int-returning aggregations on `StringDtype`
- `COUNT`/`SIZE`/`ARGMIN`/`ARGMAX` now return `np.int64` (matching pandas 3) instead of `Int64Dtype`/`int64[pyarrow]`.
- `NUNIQUE` always casts to `np.int64`.
- `Series.groupby.size()` on `string[pyarrow]` with `na_value=pd.NA` now returns `Int64Dtype` to match pandas 3's specific behavior for that storage/na_value combination.

### Identity-based exclusion of grouping-key columns
Pandas excludes a value column whose underlying object is the same as the grouping Series' column (i.e., `df.groupby(df["a"])` drops `"a"` from the aggregated values). This was missing in cuDF. A new `_collect_series_key_column_names` helper captures this identity information *before* `nans_to_nulls()` breaks it under `mode.pandas_compatible`, and threads matched column names through `_Grouping` to populate `_named_columns`. The check is restricted to `DataFrame` inputs so that `Series.groupby(self)` (used internally by `Series.value_counts`, `Series.mode`, etc.) doesn't falsely match the Series against itself and empty the aggregation result.

## Tests
`python/cudf/cudf/tests/groupby/test_reductions.py`:
- `test_groupby_string_int_returning_aggs_dtype` covers `count`/`nunique`/`size` across the four `StringDtype` storage/na_value combinations.
- `test_groupby_series_identity_column_exclusion` and `test_groupby_series_copy_no_column_exclusion` exercise the matched/non-matched paths.
- `test_groupby_series_self_does_not_exclude` guards against the regression where `Series.groupby(self)` empties the aggregation.

## Conftest
Removes 75 `NODEIDS_THAT_FAIL` entries that now pass on the regular path:
- `test_string_dtype_all_na[*-{count,size,nunique}-*]` (60 entries) — fixed by the int-returning dtype change.
- 15 nunique- and identity-related entries: `test_size_strings[string=string[pyarrow]]`, `test_groupby_column_index_in_references`, `test_groupby_nonstring_columns`, `test_groupby_series_with_name`, several `test_nunique_*` and `test_duplicate_columns[nunique-*]`, etc.

## Relationship to rapidsai#22289
This is one of the four split PRs requested in [the review on rapidsai#22289](rapidsai#22289 (review)). rapidsai#22289 retains only the `get_dtype_of_same_kind` change; the remaining three split PRs are #string-sum / #bool-any-all / #min-count. Some `test_string_dtype_all_na[*-{sum,all,any,min,max,first,last}-*]` parametrizations exercise both this PR's grouping-key-exclusion logic and another split PR's reduction logic, so the corresponding xfail entries stay until both halves merge.

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
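The identity-based exclusion being ported can be seen in plain pandas (behavior under copy-on-write pandas, which this PR matches):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 1, 2], "b": [10, 20, 30]})

# Grouping by the frame's own column object: pandas treats "a" as an
# in-axis key and drops it from the aggregated values.
by_own_column = df.groupby(df["a"]).sum()
assert list(by_own_column.columns) == ["b"]

# Grouping by a copy breaks the identity link, so "a" stays a value column.
by_copy = df.groupby(df["a"].copy()).sum()
assert list(by_copy.columns) == ["a", "b"]
```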
## Description
This PR adds validations in the `setitem`, `binops` and `fillna` code paths.

## Checklist
- [x] I am familiar with the [Contributing Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.

## Pandas test suite comparison
| Metric | `pandas3` | this PR | Δ (this PR − `pandas3`) |
|---|---:|---:|---:|
| Failed | 3144 | 3002 | −142 |
| Passed | 210751 | 210986 | +235 |
| Skipped | 9884 | 9884 | 0 |
| Xfailed | 6037 | 5944 | −93 |
| Xpassed | 78 | 78 | 0 |
| Warnings | 7288 | 7287 | −1 |

Net: 142 fewer failures, 235 more passes.
…ai#22446)

## Summary
- Validate strftime directive combinations (`%V`/`%G`/`%W`/`%U`/`%j`) and raise `ValueError` for incompatible combos.
- Raise `TypeError` for bool scalars and Decimal list-likes; return `pd.NaT` (not `np.datetime64('NaT')`) on `errors='coerce'` for scalar inputs.
- Raise `NotImplementedError` for `unit='Y'/'M'` so cudf.pandas falls back to pandas (which has correct calendrical-addition semantics).
- Unxfails 42 tests in `pandas-tests/tests/tools/test_to_datetime.py`.
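The `pd.NaT` scalar behavior being matched can be checked directly in pandas:

```python
import pandas as pd
import numpy as np

# An unparseable scalar with errors="coerce" yields the pd.NaT singleton,
# not a np.datetime64("NaT") value:
result = pd.to_datetime("not-a-date", errors="coerce")
assert result is pd.NaT
assert not isinstance(result, np.datetime64)
```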
## Description
Improves test pass rate of the pandas 3 test suite under `cudf.pandas`
by fixing several cudf and cudf.pandas issues surfaced in
`pandas-testing/pandas-tests/tests/extension/test_arrow.py` (and related
test files).
### Pandas test suite comparison
| Metric | pandas3 (baseline) | This PR | Δ |
|---|---|---|---|
| **failed** | 3,710 | 3,461 | **−249** ✅ |
| **passed** | 203,927 | 204,207 | **+280** ✅ |
| **skipped** | 7,361 | 7,361 | 0 |
| **xfailed** | 6,283 | 6,252 | −31 |
| **xpassed** | 77 | 77 | 0 |
| **warnings** | 7,305 | 7,305 | 0 |
| **time** | 2411.09s (40:11) | 2367.20s (39:27) | −43.89s (−1.8%) |
Net improvement: **249 fewer failures, 280 more passes**. The xfailed
drop (−31) is from conftest xfail entries removed for tests that now
pass organically due to cudf fixes (no new xpass strict failures).
## Root causes & fixes
### 1. ``cudf.dtype(str)`` returned ``dtype('<U0')`` instead of a string
dtype
**Symptom:** ``pd.Series([...], dtype=str)`` under cudf.pandas produced
an ``object``-dtype series, so comparisons against
``StringDtype``-backed expected results failed.
**Cause:** ``cudf.dtype`` only special-cased the string ``"str"``;
passing the Python ``str`` class fell through ``np.dtype(str)`` →
``<U0``.
**Fix:** Treat the ``str`` class the same as ``"str"`` and route it
through ``pd.api.types.pandas_dtype``
(``python/cudf/cudf/core/dtypes.py``).
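The ``<U0`` fall-through can be reproduced with numpy alone:

```python
import numpy as np

# Passing the Python str class straight to numpy yields a zero-length
# unicode dtype, not anything string-like in the pandas sense. The fix
# intercepts the str class before it ever reaches np.dtype.
np_str = np.dtype(str)
assert np_str.kind == "U"      # flexible unicode kind
assert np_str.itemsize == 0    # zero-length: the '<U0' symptom
```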
### 2. ``astype(str)`` dropped sub-second precision on Arrow-backed
timestamps
**Symptom:** ``Series(dtype=timestamp[us/ns][pyarrow]).astype(str)``
returned strings like ``"2020-01-01 01:01:01"``, losing
microseconds/nanoseconds.
**Cause:** ``_dtype_to_format_conversion`` in ``DatetimeColumn`` only
had entries for numpy ``datetime64[...]`` names; Arrow-named timestamps
hit the default ``"%Y-%m-%d %H:%M:%S"`` format.
**Fix:** Added ``timestamp[ns|us|ms|s][pyarrow]`` entries to the format
table (``python/cudf/cudf/core/column/datetime.py``).
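The precision loss is purely a property of the format string, as plain `datetime` shows; `%f` here is illustrative of the sub-second-aware variant the new table entries select, not a claim about the exact cudf format strings:

```python
from datetime import datetime

ts = datetime(2020, 1, 1, 1, 1, 1, 123456)

# The seconds-resolution default silently drops sub-second digits:
assert ts.strftime("%Y-%m-%d %H:%M:%S") == "2020-01-01 01:01:01"

# A microsecond-aware format keeps them:
assert ts.strftime("%Y-%m-%d %H:%M:%S.%f") == "2020-01-01 01:01:01.123456"
```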
### 3. Arrow ``bool`` reductions returned the wrong dtype
**Symptom:** ``df.sum()`` on a ``bool[pyarrow]`` column returned
``int64[pyarrow]``; pandas returns ``uint64[pyarrow]``.
**Cause:** ``NumericalColumn._reduction_result_dtype`` treated ``kind ==
"b"`` as signed-int-like.
**Fix:** For Arrow-backed bool columns, route ``sum``/``product`` to
``uint64`` to match pandas
(``python/cudf/cudf/core/column/numerical.py``). Numpy ``bool`` still
returns ``int64``. ``DataFrame._reduce`` was updated to apply the same
handling on the all-null path.
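The numpy half of this split is easy to verify directly; the `uint64[pyarrow]` result for Arrow-backed bools is the pandas behavior stated above:

```python
import numpy as np

# numpy bool sums promote to a signed integer, which is why numpy-backed
# bool columns keep returning int64; Arrow-backed bool columns instead
# map to uint64[pyarrow] per the fix described above.
total = np.array([True, True, False]).sum()
assert total == 2
assert total.dtype.kind == "i"   # signed integer, e.g. int64
```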
### 4. Duration / timestamp reductions lost the source dtype
**Symptom:** ``df.sum(skipna=False)`` on a ``duration[ns][pyarrow]``
column with nulls came back as ``timestamp[s][pyarrow]``;
``min/max/median`` on timestamp Arrow dtypes dropped the original unit.
**Cause:** The all-NaT scalar result was materialized through
``as_column([NaT])`` which defaults to ``datetime64[s]``, and
``get_dtype_of_same_kind`` then folded duration → timestamp.
**Fixes:**
- In ``DataFrame._reduce`` pass ``result_dtype = common_dtype`` for
``m``/``M`` kinds on the ops where the result is the same kind
(``python/cudf/cudf/core/dataframe.py``).
- In ``TemporalBaseColumn.element_indexing`` call
``.as_unit(self.time_unit)`` so reductions preserve the input unit
(``python/cudf/cudf/core/column/temporal_base.py``).
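`Timestamp.as_unit`, which the fix calls in `element_indexing`, pins the scalar's resolution (pandas ≥ 2.0):

```python
import pandas as pd

ts = pd.Timestamp("2020-01-02 03:04:05.123456")

# as_unit converts the stored resolution, so a scalar pulled out of a
# microsecond column keeps "us" instead of being materialized at a
# default unit like "s":
assert ts.as_unit("us").unit == "us"
assert ts.as_unit("ns").unit == "ns"
```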
### 5. ``sum`` of Arrow ``string[pyarrow]`` was widened to
``large_string[pyarrow]``
**Symptom:** ``arr._reduce("sum")`` produced ``string[pyarrow]``;
``df.sum()`` under cudf produced ``large_string[pyarrow]``.
**Cause:** ``get_dtype_of_same_kind`` always routed Arrow-string targets
through ``dtype_to_pandas_arrowdtype`` which hard-codes
``pa.large_string()``.
**Fix:** Preserve the source Arrow string variant (``string`` vs
``large_string``) when the target is also a string-like dtype
(``python/cudf/cudf/utils/dtypes.py``).
### 6. ``_readonly`` attribute did not propagate through ExtensionArray
proxies
**Symptom:** `test_getitem_propagates_readonly_property` (40),
`test_readonly_property` (40), and many readonly tests across other test
files (`test_string.py`, `test_string_arrow.py`, etc.) failed because
setting ``arr._readonly = True`` on a proxy didn't flow through to the
wrapped pandas object, so slicing returned a result with ``_readonly =
False``.
**Cause:** ``_FastSlowProxy.__setattr__`` unconditionally stored any
``_*`` attribute on the proxy instance itself. Pandas attaches
``_readonly`` to the wrapped ``ExtensionArray`` and consults it inside
``__getitem__`` (and similar) to decide whether to propagate
readonly-ness to views — so writes need to reach the wrapped object.
**Fix (two parts):**
- In ``_FastSlowProxy.__setattr__``, if the class declares the private
name via ``_FastSlowAttribute``, forward the write to the wrapped slow
object. Otherwise, retain the old behavior (store on the proxy), so
proxy-internal attributes like ``_method_chain`` are unaffected
(``python/cudf/cudf/pandas/fast_slow_proxy.py``).
- Added ``"_readonly": _FastSlowAttribute("_readonly",
private=True)`` to every ExtensionArray proxy: ``ArrowExtensionArray``,
``ArrowStringArray``, ``StringArray``, ``IntegerArray``,
``FloatingArray``, ``BooleanArray``, ``DatetimeArray``,
``TimedeltaArray``, ``PeriodArray``, ``IntervalArray``, ``SparseArray``,
``NumpyExtensionArray``/``PandasArray``
(``python/cudf/cudf/pandas/_wrappers/pandas.py``).
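A minimal sketch of the forwarding rule (hypothetical names, not the actual `fast_slow_proxy` code): writes to declared private attributes reach the wrapped object; everything else stays on the proxy.

```python
class FastSlowAttributeSketch:
    """Stands in for _FastSlowAttribute: marks a name as forwarded."""
    def __init__(self, name):
        self.name = name

class ProxySketch:
    # Names the class declares as forwarded (cf. the "_readonly" entries).
    _declared = {"_readonly": FastSlowAttributeSketch("_readonly")}

    def __init__(self, wrapped):
        object.__setattr__(self, "_wrapped", wrapped)

    def __setattr__(self, name, value):
        if name in type(self)._declared:
            setattr(self._wrapped, name, value)   # forward to slow object
        else:
            object.__setattr__(self, name, value)  # proxy-internal state

class WrappedArray:
    _readonly = False

w = WrappedArray()
p = ProxySketch(w)
p._readonly = True          # declared: reaches the wrapped object
p._method_chain = ["step"]  # undeclared: stays on the proxy
assert w._readonly is True
assert "_method_chain" not in vars(w)
assert vars(p)["_method_chain"] == ["step"]
```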
### 7. conftest xfail cleanup
Removed entries from
``python/cudf/cudf/pandas/scripts/conftest-patch.py`` for tests that now
pass organically due to the fixes above.
## Checklist
- [x] I am familiar with the [Contributing
Guidelines](https://github.com/rapidsai/cudf/blob/HEAD/CONTRIBUTING.md).
- [x] New or existing tests cover these changes.
- [x] The documentation is up to date with these changes.
---------
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Co-authored-by: Tom Augspurger <toaugspurger@nvidia.com>
Co-authored-by: Vukasin Milovanovic <vmilovanovic@nvidia.com>
## Summary
Split out from rapidsai#22289. `GroupBy.all` and `GroupBy.any` previously raised `NotImplementedError`. This PR implements them by reducing to `min`/`max` on a bool-coerced copy of the value columns.

## Implementation (`python/cudf/cudf/core/groupby/groupby.py`)
A new `_bool_reduce` helper:
- Coerces strings as `count_characters > 0` so empty strings become `False` and nulls remain null (preserved through the aggregation).
- Coerces numerics as `!= 0` with the same null preservation.
- For `skipna=False`, fills nulls with `True` before aggregation so they don't flip `all` to `False` and trivially make `any` `True`.
- Empty groups (skipna=True with all-NA values) yield NA from min/max; pandas treats those as vacuously `True` for `all` and `False` for `any`, so the result is filled accordingly.
- Applies `min_count` by counting per-group non-nulls and masking groups whose count is below the threshold.

The new GroupBy is constructed with `by=self.grouping` (passing the existing `_Grouping` object) so key columns match the bool-coerced value columns exactly, avoiding label-based lookup when the original key column was excluded.

## Tests
`python/cudf/cudf/tests/groupby/test_reductions.py`:
- `test_groupby_all_any` over bool/int/float data.
- `test_groupby_all_any_string` for string columns.
- `test_groupby_all_any_empty` for empty-group behavior.

## Conftest
Removes 32 `test_string_dtype_all_na[*-all-*]` and `[*-any-*]` entries.

## Relationship to rapidsai#22289
One of the four split PRs requested in [the review on rapidsai#22289](rapidsai#22289 (review)). The DataFrame-case `test_string_dtype_all_na[*-{all,any}-*]` parametrizations (`df.groupby(df["a"]).all()`) also rely on identity-based grouping-key column exclusion in rapidsai#22369; both must merge before the 32 conftest removals stop xpassing.
---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: Matthew Roeschke <10647082+mroeschke@users.noreply.github.com>
Co-authored-by: Mads R. B. Kristensen <madsbk@gmail.com>
Co-authored-by: Matthew Murray <41342305+Matt711@users.noreply.github.com>
Co-authored-by: Vukasin Milovanovic <vmilovanovic@nvidia.com>
Co-authored-by: David Wendt <45795991+davidwendt@users.noreply.github.com>
Co-authored-by: Yunsong Wang <12716979+PointKernel@users.noreply.github.com>
Co-authored-by: Richard (Rick) Zamora <rzamora217@gmail.com>
Co-authored-by: Kyle Edwards <kyedwards@nvidia.com>
Co-authored-by: Bradley Dice <bdice@bradleydice.com>
Co-authored-by: Paul Taylor <178183+trxcllnt@users.noreply.github.com>
Co-authored-by: Vyas Ramasubramani <vyasr@nvidia.com>
Co-authored-by: Muhammad Haseeb <14217455+mhaseeb123@users.noreply.github.com>
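The core identity that `_bool_reduce` relies on, `all == min` and `any == max` on bool-coerced data, checks out in plain pandas:

```python
import pandas as pd

df = pd.DataFrame({"k": ["a", "a", "b", "b"], "v": [1, 0, 2, 3]})
coerced = df.assign(v=df["v"] != 0)   # numeric -> bool, as described above

g_min = coerced.groupby("k")["v"].min()
g_all = coerced.groupby("k")["v"].all()
assert (g_min == g_all).all()         # all() is min() over bools

g_max = coerced.groupby("k")["v"].max()
g_any = coerced.groupby("k")["v"].any()
assert (g_max == g_any).all()         # any() is max() over bools
```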
## Description
Fixes 78 occurrences (across 41 unique tests) of
`AssertionError: DataFrame.columns are different` surfaced by the pandas test suite under `cudf.pandas`. All fixes are applied to cuDF classic — no `conftest-patch.py` xfails added.

Areas touched (column-metadata propagation):
- `base_accessor.py` — `str.partition`/`split`/`rpartition`/`rsplit(expand=True)` now produce `RangeIndex` columns.
- `single_column_frame._to_frame` — unnamed `Series.to_frame()` keeps `RangeIndex(1)` columns; tuple names produce `MultiIndex` columns.
- `dataframe.py` — `_make_operands_and_index_for_binop`, `_concat`, `quantile` (single), `select_dtypes`, and `join` now preserve `multiindex`/`level_names`/`rangeindex`/`label_dtype`/`CategoricalIndex` on the result columns.
- `indexed_frame._reindex` — propagates `label_dtype` so categorical-typed reindex targets keep `CategoricalIndex` columns.
- `groupby.agg`/`_scan_fill` — preserve `level_names` on agg results; `ffill`/`bfill` matches pandas' object-typed column labels (with a guard skipping `StringDtype`).
- `reshape.py` — `_normalize_series_and_dataframe` keeps unnamed-Series semantics in `concat`; `pivot` flattens scalar 2-D selections from `MultiIndex` rows and uses `object` dtype for empty-axis results.

No regressions in the cuDF classic test suite (reshape, groupby, dataframe binops/reindex/select_dtypes/ffill_bfill, doctests).

### `cudf.pandas` test suite comparison (vs `pandas3`)
Net: 32 prior failures → pass and 8 prior xfails → pass (+40 passed total). No regressions in failed/xpassed.
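Two of the pandas invariants being matched here, checked directly (recent pandas assumed):

```python
import pandas as pd

# Unnamed Series.to_frame() yields RangeIndex columns, the invariant the
# single_column_frame._to_frame change preserves on the cuDF side:
frame = pd.Series([1, 2, 3]).to_frame()
assert isinstance(frame.columns, pd.RangeIndex)

# str.partition (expand=True by default) likewise produces RangeIndex
# columns, matching the base_accessor.py change:
parts = pd.Series(["a_b", "c_d"]).str.partition("_")
assert isinstance(parts.columns, pd.RangeIndex)
```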
## Checklist