Fix AssertionError: DataFrame.columns are different failures in cudf.pandas #22351

Open
galipremsagar wants to merge 3 commits into rapidsai:pandas3 from galipremsagar:colum_assign

@galipremsagar galipremsagar commented May 1, 2026

Description

Fixes 78 occurrences (across 41 unique tests) of AssertionError: DataFrame.columns are different surfaced by the pandas test suite under cudf.pandas. All fixes are applied to cuDF classic — no conftest-patch.py xfails added.

Areas touched (column-metadata propagation):

  • base_accessor.py: str.partition/split/rpartition/rsplit (expand=True) now produce RangeIndex columns.
  • single_column_frame._to_frame: unnamed Series.to_frame() keeps RangeIndex(1) columns; tuple names produce MultiIndex columns.
  • dataframe.py: _make_operands_and_index_for_binop, _concat, quantile (single), select_dtypes, and join now preserve multiindex / level_names / rangeindex / label_dtype / CategoricalIndex on the result columns.
  • indexed_frame._reindex: propagates label_dtype so categorical-typed reindex targets keep CategoricalIndex columns.
  • groupby.agg / _scan_fill: preserve level_names on agg results; ffill/bfill matches pandas' object-typed column labels (with a guard skipping StringDtype).
  • reshape.py: _normalize_series_and_dataframe keeps unnamed-Series semantics in concat; pivot flattens scalar 2-D selections from MultiIndex rows and uses object dtype for empty-axis results.
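
For reference, the pandas behavior targeted by the first bullet can be checked directly. This is a minimal sketch in plain pandas (not cuDF); the sample data is made up for illustration, and object dtype is chosen explicitly as an assumption to keep the result independent of the default string dtype.

```python
import pandas as pd

# Illustrative data only; object dtype is set explicitly (an assumption).
s = pd.Series(["a b", "c d"], dtype=object)

# With expand=True, pandas builds the result DataFrame without explicit
# column labels, so the columns default to a RangeIndex.
split_cols = s.str.split(expand=True).columns
partition_cols = s.str.partition(" ").columns  # expand=True is the default here

print(type(split_cols).__name__, list(split_cols))
print(type(partition_cols).__name__, list(partition_cols))
```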

No regressions in the cuDF classic test suite (reshape, groupby, dataframe binops/reindex/select_dtypes/ffill_bfill, doctests).

cudf.pandas test suite comparison

| Metric | pandas3 | this PR | Δ |
|---|---|---|---|
| failed | 2,955 | 2,923 | −32 |
| passed | 202,101 | 202,141 | +40 |
| skipped | 9,757 | 9,757 | 0 |
| xfailed | 5,995 | 5,987 | −8 |
| xpassed | 78 | 78 | 0 |
| warnings | 6,298 | 6,298 | 0 |
| total collected | 220,886 | 220,886 | 0 |
| runtime | 2,018.26s (33:38) | 3,067.20s (51:07) | +1,048.94s (+52.0%) |

Net: 32 prior failures → pass and 8 prior xfails → pass (+40 passed total). No regressions in failed/xpassed.
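
The arithmetic behind the comparison can be re-derived with a few lines of Python. This is only a sanity-check sketch: the numbers are copied from the description, and the variable names are made up for illustration.

```python
# Counts copied from the comparison: (pandas3 baseline, this PR).
counts = {
    "failed": (2_955, 2_923),
    "passed": (202_101, 202_141),
    "skipped": (9_757, 9_757),
    "xfailed": (5_995, 5_987),
    "xpassed": (78, 78),
}

# Per-outcome change, and the collected total on each side
# (warnings are not test outcomes, so they are excluded from the total).
deltas = {name: after - before for name, (before, after) in counts.items()}
totals = tuple(sum(pair[i] for pair in counts.values()) for i in (0, 1))

# Relative runtime change in percent, rounded to one decimal place.
runtime_before, runtime_after = 2_018.26, 3_067.20
slowdown_pct = round((runtime_after - runtime_before) / runtime_before * 100, 1)

print(deltas, totals, slowdown_pct)
```

The 32 fewer failures and 8 fewer xfails together account for the 40 additional passes, and both columns sum to the same collected total.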

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot Bot commented May 1, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Python and cudf.pandas labels May 1, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python May 1, 2026
@galipremsagar galipremsagar changed the title from "fix" to "Fix AssertionError: DataFrame.columns are different failures in cudf.pandas" May 1, 2026
@galipremsagar galipremsagar added the bug, 3 - Ready for Review, and non-breaking labels May 1, 2026
@galipremsagar galipremsagar marked this pull request as ready for review May 1, 2026 16:25
@galipremsagar galipremsagar requested a review from a team as a code owner May 1, 2026 16:25
@galipremsagar galipremsagar requested review from bdice and vyasr and removed request for a team May 1, 2026 16:25
)
if len(table) == 0:
keys = (
tuple(table.keys()) if hasattr(table, "keys") else ()

Suggested change
- tuple(table.keys()) if hasattr(table, "keys") else ()
+ tuple(table.keys()) if isinstance(table, dict) else ()

Could we use this stricter check?
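
To make the difference between the two checks concrete, here is a small sketch. The helper names are hypothetical and exist only for this illustration; only `table` comes from the diff. `MappingProxyType` stands in for any non-dict object that happens to expose `keys`.

```python
from types import MappingProxyType


def keys_via_hasattr(table):
    # Original form: accepts anything exposing a ``keys`` attribute.
    return tuple(table.keys()) if hasattr(table, "keys") else ()


def keys_via_isinstance(table):
    # Suggested form: only dicts (and dict subclasses) qualify.
    return tuple(table.keys()) if isinstance(table, dict) else ()


plain = {0: "a", 1: "b"}
proxy = MappingProxyType(plain)  # read-only mapping view, not a dict

print(keys_via_hasattr(proxy))     # keys are extracted
print(keys_via_isinstance(proxy))  # treated as having no keys
```

Both helpers agree on a plain dict; they differ only for mapping-like objects that are not dicts.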

if len(table) == 0 or (
keys
and all(isinstance(k, int) for k in keys)
and tuple(keys) == tuple(range(len(keys)))

Suggested change
- and tuple(keys) == tuple(range(len(keys)))
+ and keys == tuple(range(len(keys)))

(Since the keys assignment above creates it as a tuple already)

This also assumes that the columns are always 0..n (e.g. the RangeIndex could instead be 1..n + 1), but that can be tackled in a follow-up.
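
One possible shape for such a follow-up, sketched under assumptions: `contiguous_int_start` is a hypothetical helper, not part of cuDF, and it only handles step-1 runs.

```python
def contiguous_int_start(keys):
    """Return the start of a step-1 integer run, or None if keys are not one."""
    keys = tuple(keys)
    # Reject empty input and anything that is not a plain int (bools included).
    if not keys or not all(type(k) is int for k in keys):
        return None
    start = keys[0]
    # Compare against the run that begins at the first key.
    return start if keys == tuple(range(start, start + len(keys))) else None


print(contiguous_int_start((0, 1, 2)))  # 0
print(contiguous_int_start((1, 2, 3)))  # 1
print(contiguous_int_start((0, 2)))     # None
```

Because a valid start can be 0, callers would need to test the result with `is not None` rather than truthiness.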

)
if len(table) == 0 or (
keys
and all(isinstance(k, int) for k in keys)

I believe this check is redundant with the one below. The equality comparison below should be false if keys did not contain ints
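
A quick way to probe this claim in plain Python. The sketch below assumes nothing about cuDF; whether float or bool keys can actually reach this branch is not stated in the thread.

```python
expected = tuple(range(2))  # (0, 1)

# Floats and bools compare equal to the matching ints, so equality alone
# does not prove that the keys are ints.
float_keys = (0.0, 1.0)
bool_keys = (False, True)
str_keys = ("0", "1")

print(float_keys == expected)                       # True
print(all(isinstance(k, int) for k in float_keys))  # False
print(bool_keys == expected)                        # True
print(all(isinstance(k, int) for k in bool_keys))   # True, bool subclasses int
print(str_keys == expected)                         # False
```

For string keys the equality check does fail on its own, so the two checks only diverge for numeric types that compare equal to ints.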

# axis name (matching pandas behavior).
if (
not multilevel
and isinstance(self.obj, DataFrame)

Suggested change
- and isinstance(self.obj, DataFrame)
+ and self.obj.ndim == 2

nit (to avoid the DataFrame runtime import)
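
The trade-off can be sketched with plain pandas objects standing in for the cuDF ones. `is_two_dimensional` is a hypothetical helper for this illustration.

```python
import pandas as pd


def is_two_dimensional(obj):
    # Duck-typed stand-in for ``isinstance(obj, DataFrame)``:
    # no class import is needed, only an ``ndim`` attribute.
    return getattr(obj, "ndim", None) == 2


frame = pd.DataFrame({"a": [1, 2]})
series = pd.Series([1, 2])

print(is_two_dimensional(frame))   # True
print(is_two_dimensional(series))  # False
```

The `ndim` check is slightly broader than `isinstance`, since any 2-D object with an `ndim` attribute would pass it.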

Comment on lines +2796 to +2798
positions = [
source_pd_cols.get_loc(c) for c in result._column_names
]

You might be able to do the same with

indexer = source_pd_cols.get_indexer(result._column_names)
if not (indexer == -1).any():
    taken = source_pd_cols.take(indexer)
    ...
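
A self-contained version of this idea, as a sketch in plain pandas. The index contents and the `wanted` list are made up; they stand in for `source_pd_cols` and `result._column_names`. Note that `get_indexer` requires a unique index, whereas `get_loc` raises `KeyError` for a missing label.

```python
import pandas as pd

source_cols = pd.Index(["a", "b", "c"], name="letters")  # stand-in for source_pd_cols
wanted = ["c", "a"]  # stand-in for result._column_names

# get_indexer returns one position per label; -1 marks absent labels.
indexer = source_cols.get_indexer(wanted)
if not (indexer == -1).any():
    # take reorders the index by position and keeps its name.
    taken = source_cols.take(indexer)

missing = source_cols.get_indexer(["z"])
print(list(indexer), list(taken), list(missing))
```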

elif (
all(obj._data.rangeindex for obj in objs)
and all(
tuple(obj._column_names) == tuple(range(obj._num_columns))

If the first all check confirms obj has RangeIndex columns, and we're only going to compare against 0..n (which, as mentioned above, might be limiting), we could make this check quicker by just checking obj._column_names[0] == 0 and obj._column_names[-1] == obj._num_columns - 1 instead of all the materialized range values.
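
Why the endpoint check suffices: n evenly spaced integers that start at 0 and end at n − 1 must have step 1. A minimal sketch, where `is_zero_based_range` is a hypothetical helper and Python `range` objects stand in for RangeIndex column names:

```python
def is_zero_based_range(names, num_columns):
    # Assumes the caller already knows ``names`` are evenly spaced integers
    # (i.e. they come from a RangeIndex); otherwise endpoints are not enough.
    if num_columns == 0:
        return True  # matches tuple(()) == tuple(range(0)); avoids IndexError
    return names[0] == 0 and names[-1] == num_columns - 1


candidates = [range(0, 5), range(1, 6), range(0, 10, 2), range(0)]
for r in candidates:
    names = tuple(r)
    # The quick check agrees with comparing against the materialized range.
    print(r, is_zero_based_range(names, len(names)))
```

Without the evenly spaced precondition the shortcut is not safe: `(0, 5, 2)` has the right endpoints but is not `0..n`.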

Comment on lines +4830 to +4836
df.columns = pd.CategoricalIndex(
list(self_pd_cols) + list(other_pd_cols),
dtype=self_pd_cols.dtype,
name=self_pd_cols.name
if self_pd_cols.name == other_pd_cols.name
else None,
)

Suggested change
- df.columns = pd.CategoricalIndex(
- list(self_pd_cols) + list(other_pd_cols),
- dtype=self_pd_cols.dtype,
- name=self_pd_cols.name
- if self_pd_cols.name == other_pd_cols.name
- else None,
- )
+ df.columns = self_pd_cols.append(other_pd_cols)

Should give you the same result
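
The equivalence can be checked in plain pandas for the case where both indexes share the same categories. The sample values are made up, and this sketch does not cover indexes whose categories differ.

```python
import pandas as pd

cats = ["a", "b", "c"]  # shared categories (an assumption of this sketch)
self_pd_cols = pd.CategoricalIndex(["a", "b"], categories=cats, name="key")
other_pd_cols = pd.CategoricalIndex(["c"], categories=cats, name="key")

# Explicit construction, as in the original diff.
explicit = pd.CategoricalIndex(
    list(self_pd_cols) + list(other_pd_cols),
    dtype=self_pd_cols.dtype,
    name=self_pd_cols.name if self_pd_cols.name == other_pd_cols.name else None,
)

# Suggested one-liner.
appended = self_pd_cols.append(other_pd_cols)

# When the names differ, append drops the name, like the explicit version.
renamed = self_pd_cols.append(other_pd_cols.rename("other"))

print(appended.equals(explicit), appended.name, renamed.name)
```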

]
return self.loc[:, to_select]
result = self.loc[:, to_select]
if not to_select and self._data.rangeindex:

Ideally I would hope loc preserved the .rangeindex but that could be for another PR

if (
len(data) == 0
and not isinstance(index_data, cudf.MultiIndex)
and isinstance(index_data.dtype, pd.StringDtype)

Should this apply to any of the string types, and not just pd.StringDtype?
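
The question matters because string data can be carried by more than one dtype, and an `isinstance(..., pd.StringDtype)` guard only matches one of them. A minimal sketch in plain pandas with made-up data; whether the guard should cover the other cases is left to the author.

```python
import pandas as pd

extension_strings = pd.Series(["a"], dtype=pd.StringDtype())
object_strings = pd.Series(["a"], dtype=object)

# Only the extension dtype is an instance of pd.StringDtype.
is_ext = isinstance(extension_strings.dtype, pd.StringDtype)
is_obj = isinstance(object_strings.dtype, pd.StringDtype)

print(is_ext)  # True
print(is_obj)  # False: object-dtype strings would bypass the guard
```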

if (
len(data) == 0
and not isinstance(column_data, cudf.MultiIndex)
and isinstance(column_data.dtype, pd.StringDtype)

Same question here

Labels

  • 3 - Ready for Review: Ready for review by team
  • bug: Something isn't working
  • cudf.pandas: Issues specific to cudf.pandas
  • non-breaking: Non-breaking change
  • Python: Affects Python cuDF API.

Projects

Status: In Progress

3 participants