Fix `AssertionError: DataFrame.columns are different` failures in cudf.pandas by galipremsagar · Pull Request #22351 · rapidsai/cudf

galipremsagar · 2026-05-01T15:42:44Z

Description

Fixes 78 occurrences (across 41 unique tests) of AssertionError: DataFrame.columns are different surfaced by the pandas test suite under cudf.pandas. All fixes are applied to cuDF classic — no conftest-patch.py xfails added.

Areas touched (column-metadata propagation):

base_accessor.py — str.partition/split/rpartition/rsplit (expand=True) now produce RangeIndex columns.
single_column_frame._to_frame — unnamed Series.to_frame() keeps RangeIndex(1) columns; tuple names produce MultiIndex columns.
dataframe.py — _make_operands_and_index_for_binop, _concat, quantile (single), select_dtypes, and join now preserve multiindex / level_names / rangeindex / label_dtype / CategoricalIndex on the result columns.
indexed_frame._reindex — propagates label_dtype so categorical-typed reindex targets keep CategoricalIndex columns.
groupby.agg / _scan_fill — preserve level_names on agg results; ffill/bfill matches pandas' object-typed columns labels (with a guard skipping StringDtype).
reshape.py — _normalize_series_and_dataframe keeps unnamed-Series semantic in concat; pivot flattens scalar 2-D selections from MultiIndex rows and uses object dtype for empty-axis results.
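For reference, the pandas-side column labels that several of these items target can be inspected directly. This is a minimal sketch using plain pandas (not cudf.pandas), assuming a recent pandas release; the variable names are illustrative and do not come from the PR.

```python
import pandas as pd

# str.split(expand=True) returns a frame with default (RangeIndex) column labels.
# dtype=object is set explicitly so the example does not depend on the default string dtype.
split_cols = pd.Series(["a b", "c d"], dtype=object).str.split(expand=True).columns

# An unnamed Series converts to a one-column frame with RangeIndex(1) columns.
unnamed_cols = pd.Series([1, 2]).to_frame().columns

# A tuple name becomes a MultiIndex on the resulting columns.
tuple_cols = pd.Series([1, 2], name=("a", "b")).to_frame().columns

print(type(split_cols).__name__, type(unnamed_cols).__name__, type(tuple_cols).__name__)
```

The index type of `columns`, and not just the label values, is what the items above are concerned with.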

No regressions in the cuDF classic test suite (reshape, groupby, dataframe binops/reindex/select_dtypes/ffill_bfill, doctests).

`cudf.pandas` test suite comparison

Metric	`pandas3`	this PR	Δ
failed	2,955	2,923	−32
passed	202,101	202,141	+40
skipped	9,757	9,757	0
xfailed	5,995	5,987	−8
xpassed	78	78	0
warnings	6,298	6,298	0
total collected	220,886	220,886	0
runtime	2,018.26s (33:38)	3,067.20s (51:07)	+1,048.94s (+52.0%)

Net: 32 prior failures → pass and 8 prior xfails → pass (+40 passed total). No regressions in failed/xpassed.
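The figures in the table can be cross-checked with a few lines of arithmetic. The sketch below only re-derives numbers already stated above; the dictionary layout is an illustrative choice.

```python
# Counts copied from the comparison table: (pandas3 branch, this PR).
counts = {
    "failed": (2955, 2923),
    "passed": (202101, 202141),
    "skipped": (9757, 9757),
    "xfailed": (5995, 5987),
    "xpassed": (78, 78),
}
total_collected = 220886

deltas = {name: after - before for name, (before, after) in counts.items()}

# Each collected test lands in exactly one outcome bucket,
# so the buckets should sum to the collected total on both sides.
base_total = sum(before for before, _ in counts.values())
pr_total = sum(after for _, after in counts.values())

# Tests that stopped failing or xfailing should account for the new passes.
newly_passing = -(deltas["failed"] + deltas["xfailed"])

runtime_before, runtime_after = 2018.26, 3067.20
runtime_pct = (runtime_after - runtime_before) / runtime_before * 100

print(deltas, base_total, pr_total, newly_passing, round(runtime_pct, 1))
```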

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

copy-pr-bot · 2026-05-01T15:42:47Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

mroeschke · 2026-05-07T19:14:56Z

                    )
-                    if len(table) == 0:
+                    keys = (
+                        tuple(table.keys()) if hasattr(table, "keys") else ()


Suggested change

-                        tuple(table.keys()) if hasattr(table, "keys") else ()
+                        tuple(table.keys()) if isinstance(table, dict) else ()

Could we use this stricter check?
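To illustrate the difference between the two checks, here is a small sketch in plain Python. `KeysLookalike`, `keys_loose`, and `keys_strict` are hypothetical names introduced for this example and are not part of cuDF.

```python
class KeysLookalike:
    """Hypothetical object that exposes .keys() but is not a dict."""

    def keys(self):
        return ["not", "a", "dict"]


def keys_loose(table):
    # Duck-typed check: anything with a .keys attribute is accepted.
    return tuple(table.keys()) if hasattr(table, "keys") else ()


def keys_strict(table):
    # Type check: only dict instances (including subclasses) are accepted.
    return tuple(table.keys()) if isinstance(table, dict) else ()


lookalike = KeysLookalike()
print(keys_loose(lookalike))   # the lookalike's keys are picked up
print(keys_strict(lookalike))  # the lookalike is ignored
print(keys_loose({0: "x", 1: "y"}) == keys_strict({0: "x", 1: "y"}))
```

One trade-off to keep in mind: the strict form also ignores mapping types that are not dict subclasses, such as `types.MappingProxyType`.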

mroeschke · 2026-05-07T19:17:48Z

+                    if len(table) == 0 or (
+                        keys
+                        and all(isinstance(k, int) for k in keys)
+                        and tuple(keys) == tuple(range(len(keys)))


Suggested change

-                        and tuple(keys) == tuple(range(len(keys)))
+                        and keys == tuple(range(len(keys)))

(Since the keys assignment above creates it as a tuple already)

Also, this assumes that the columns are always 0..n (e.g. the RangeIndex could instead be 1..n+1), but that can be tackled in a follow-up.
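A possible shape for that follow-up, sketched in plain Python: accept any run of consecutive integers instead of only one starting at 0. `is_contiguous_int_range` is a hypothetical helper, not an existing cuDF function, and whether a non-zero start should be accepted at this call site is an assumption.

```python
def is_contiguous_int_range(keys):
    """Return True if keys are ints of the form start, start + 1, ... for any start."""
    if not keys:
        return False
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not all(isinstance(k, int) and not isinstance(k, bool) for k in keys):
        return False
    start = keys[0]
    return tuple(keys) == tuple(range(start, start + len(keys)))


print(is_contiguous_int_range((0, 1, 2)))
print(is_contiguous_int_range((1, 2, 3)))
print(is_contiguous_int_range((0, 2, 4)))
```

This sketch still assumes a step of 1; a RangeIndex with another step would need the step carried along as well.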

mroeschke · 2026-05-07T19:18:50Z

+                    )
+                    if len(table) == 0 or (
+                        keys
+                        and all(isinstance(k, int) for k in keys)


I believe this check is redundant with the one below. The equality comparison below should be false if keys did not contain ints
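A quick way to compare the two forms is to run both on a few label tuples. The function names below are hypothetical and exist only for this sketch. The forms agree for string and int labels; the one case where they differ here is float labels that compare equal to ints, and the thread does not establish whether such labels can reach this code path.

```python
def check_with_isinstance(keys):
    # Form in the diff: type check first, then compare against 0..n-1.
    return (
        bool(keys)
        and all(isinstance(k, int) for k in keys)
        and keys == tuple(range(len(keys)))
    )


def check_equality_only(keys):
    # Form without the isinstance check.
    return bool(keys) and keys == tuple(range(len(keys)))


for labels in [(0, 1), ("a", "b"), (1, 2), (0.0, 1.0)]:
    print(labels, check_with_isinstance(labels), check_equality_only(labels))
```

The float case differs because Python compares `0.0 == 0` as equal, so tuple equality alone does not imply that the elements are ints.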

mroeschke · 2026-05-07T19:22:04Z

+        # axis name (matching pandas behavior).
+        if (
+            not multilevel
+            and isinstance(self.obj, DataFrame)


Suggested change

-            and isinstance(self.obj, DataFrame)
+            and self.obj.ndim == 2

nit (to avoid the DataFrame runtime import)

mroeschke · 2026-05-07T19:30:05Z

+                    positions = [
+                        source_pd_cols.get_loc(c) for c in result._column_names
+                    ]


You might be able to do the same with

    indexer = source_pd_cols.get_indexer(result._column_names)
    if not (indexer == -1).any():
        taken = source_pd_cols.take(indexer)
        ...
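The relationship between the two lookups can be seen on a small pandas Index. This is a sketch with made-up labels; `source_cols` and `wanted` stand in for `source_pd_cols` and `result._column_names` and are assumptions for illustration.

```python
import pandas as pd

source_cols = pd.Index(["a", "b", "c"], dtype=object)
wanted = ["c", "a"]

# Per-label lookup: get_loc raises KeyError if a label is missing.
positions = [source_cols.get_loc(c) for c in wanted]

# Vectorized lookup: get_indexer returns -1 for labels that are missing.
indexer = source_cols.get_indexer(wanted)

# A missing label shows up as -1 instead of an exception.
missing = source_cols.get_indexer(["a", "z"])
all_found = not (missing == -1).any()

# take() reorders the source labels by position.
taken = source_cols.take(indexer)
print(positions, list(indexer), list(missing), all_found, list(taken))
```

Note that `get_indexer` requires the index to be unique, so duplicate column labels would need separate handling.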

mroeschke · 2026-05-07T19:38:22Z

+        elif (
+            all(obj._data.rangeindex for obj in objs)
+            and all(
+                tuple(obj._column_names) == tuple(range(obj._num_columns))


If the first all check confirms that obj has RangeIndex columns, and we're only going to compare against 0..n (which, as mentioned above, might be limiting), we could make this check quicker by just checking obj._column_names[0] == 0 and obj._column_names[-1] == obj._num_columns - 1 instead of comparing all the materialized range values.
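The endpoint shortcut can be sketched in plain Python as follows. Both function names are hypothetical. The shortcut relies on the labels already being known to be evenly spaced integers, which is what the preceding rangeindex check is assumed to guarantee; the empty case is guarded explicitly because indexing `[0]` on empty labels would raise.

```python
def is_default_range_full(names):
    # Materializes and compares every label: O(n).
    return tuple(names) == tuple(range(len(names)))


def is_default_range_endpoints(names):
    # Compares only the first and last label: O(1).
    # Only valid when names are known to be evenly spaced integers.
    n = len(names)
    if n == 0:
        return True  # matches the full check, where () == ()
    return names[0] == 0 and names[-1] == n - 1


for labels in [range(0, 4), range(1, 5), range(0, 8, 2), range(0)]:
    print(labels, is_default_range_full(labels), is_default_range_endpoints(labels))

# Without the evenly spaced precondition the shortcut can be fooled.
print(is_default_range_full((0, 5, 2)), is_default_range_endpoints((0, 5, 2)))
```

For evenly spaced labels, a first value of 0 and a last value of n - 1 across n elements force a step of 1, which is why the two endpoints are sufficient.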

mroeschke · 2026-05-07T20:55:33Z

+            df.columns = pd.CategoricalIndex(
+                list(self_pd_cols) + list(other_pd_cols),
+                dtype=self_pd_cols.dtype,
+                name=self_pd_cols.name
+                if self_pd_cols.name == other_pd_cols.name
+                else None,
+            )


Suggested change

-            df.columns = pd.CategoricalIndex(
-                list(self_pd_cols) + list(other_pd_cols),
-                dtype=self_pd_cols.dtype,
-                name=self_pd_cols.name
-                if self_pd_cols.name == other_pd_cols.name
-                else None,
-            )
+            df.columns = self_pd_cols.append(other_pd_cols)

Should give you the same result
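The equivalence can be checked on a small pair of CategoricalIndex objects. This is a sketch with made-up labels, assuming both sides share the same categories; `left` and `right` stand in for `self_pd_cols` and `other_pd_cols`.

```python
import pandas as pd

cats = ["a", "b", "c"]
left = pd.CategoricalIndex(["a", "b"], categories=cats, name="x")
right = pd.CategoricalIndex(["c"], categories=cats, name="x")

# Explicit construction, as in the original diff.
explicit = pd.CategoricalIndex(
    list(left) + list(right),
    dtype=left.dtype,
    name=left.name if left.name == right.name else None,
)

# Suggested form: Index.append keeps the categorical dtype and
# keeps the name only when both sides agree on it.
appended = left.append(right)
renamed = left.append(right.rename("y"))

print(appended.equals(explicit), appended.name, renamed.name)
```

If the right-hand labels fall outside the left-hand categories, the two forms may no longer agree, so that case would be worth covering in a test.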

mroeschke · 2026-05-07T20:58:21Z

        ]
-        return self.loc[:, to_select]
+        result = self.loc[:, to_select]
+        if not to_select and self._data.rangeindex:


Ideally I would hope loc preserved the .rangeindex but that could be for another PR

mroeschke · 2026-05-07T21:00:30Z

+        if (
+            len(data) == 0
+            and not isinstance(index_data, cudf.MultiIndex)
+            and isinstance(index_data.dtype, pd.StringDtype)


Does this apply to all of the string types, and not just pd.StringDtype?

mroeschke · 2026-05-07T21:00:41Z

+    if (
+        len(data) == 0
+        and not isinstance(column_data, cudf.MultiIndex)
+        and isinstance(column_data.dtype, pd.StringDtype)


Same question here

fix

a3a3dfc

github-actions Bot assigned galipremsagar May 1, 2026

github-actions Bot added Python Affects Python cuDF API. cudf.pandas Issues specific to cudf.pandas labels May 1, 2026

github-project-automation Bot added this to cuDF Python May 1, 2026

GPUtester moved this to In Progress in cuDF Python May 1, 2026

galipremsagar changed the title ~~fix~~ Fix AssertionError: DataFrame.columns are different failures in cudf.pandas May 1, 2026

galipremsagar added bug Something isn't working 3 - Ready for Review Ready for review by team non-breaking Non-breaking change labels May 1, 2026

galipremsagar marked this pull request as ready for review May 1, 2026 16:25

galipremsagar requested a review from a team as a code owner May 1, 2026 16:25

galipremsagar requested review from bdice and vyasr and removed request for a team May 1, 2026 16:25

galipremsagar added 2 commits May 6, 2026 21:25

Merge

361dd88

drop tests

42dbed4

mroeschke reviewed May 7, 2026

galipremsagar wants to merge 3 commits into rapidsai:pandas3 from galipremsagar:colum_assign


3 participants
