Fix AssertionError: DataFrame.columns are different failures in cudf.pandas#22351
Fix AssertionError: DataFrame.columns are different failures in cudf.pandas#22351galipremsagar wants to merge 3 commits intorapidsai:pandas3from
AssertionError: DataFrame.columns are different failures in cudf.pandas#22351Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
AssertionError: DataFrame.columns are different failures in cudf.pandas
| ) | ||
| if len(table) == 0: | ||
| keys = ( | ||
| tuple(table.keys()) if hasattr(table, "keys") else () |
There was a problem hiding this comment.
| tuple(table.keys()) if hasattr(table, "keys") else () | |
| tuple(table.keys()) if isinstance(table, dict) else () |
Could we use this stricter check?
| if len(table) == 0 or ( | ||
| keys | ||
| and all(isinstance(k, int) for k in keys) | ||
| and tuple(keys) == tuple(range(len(keys))) |
There was a problem hiding this comment.
| and tuple(keys) == tuple(range(len(keys))) | |
| and keys == tuple(range(len(keys))) |
(Since the keys assignment above creates it as a tuple already)
There was a problem hiding this comment.
Also this does have the assumption that the columns are always 0..n (e.g. maybe the rangeindex could be 1...n + 1), but that can be tackled in a follow up
| ) | ||
| if len(table) == 0 or ( | ||
| keys | ||
| and all(isinstance(k, int) for k in keys) |
There was a problem hiding this comment.
I believe this check is redundant with the one below. The equality comparison below should be false if keys did not contain ints
| # axis name (matching pandas behavior). | ||
| if ( | ||
| not multilevel | ||
| and isinstance(self.obj, DataFrame) |
There was a problem hiding this comment.
| and isinstance(self.obj, DataFrame) | |
| and self.obj.ndim == 2 |
nit (to avoid the DataFrame runtime import)
| positions = [ | ||
| source_pd_cols.get_loc(c) for c in result._column_names | ||
| ] |
There was a problem hiding this comment.
You might be able to to the same with
indexer = source_pd_cols.get_indexer(result._column_names)
if not (indexer == -1).any():
taken = source_pd_cols.take(positions)
...| elif ( | ||
| all(obj._data.rangeindex for obj in objs) | ||
| and all( | ||
| tuple(obj._column_names) == tuple(range(obj._num_columns)) |
There was a problem hiding this comment.
If the first all check confirms obj has a RangeIndex columns, and we're only going to compare against 0..n (which as mentioned above might be limiting), we could make this check quicker by just checking obj._column_names[0] == 0 and obj._column_names[-1] == obj._num_columns - 1) instead of all the materialized range values
| df.columns = pd.CategoricalIndex( | ||
| list(self_pd_cols) + list(other_pd_cols), | ||
| dtype=self_pd_cols.dtype, | ||
| name=self_pd_cols.name | ||
| if self_pd_cols.name == other_pd_cols.name | ||
| else None, | ||
| ) |
There was a problem hiding this comment.
| df.columns = pd.CategoricalIndex( | |
| list(self_pd_cols) + list(other_pd_cols), | |
| dtype=self_pd_cols.dtype, | |
| name=self_pd_cols.name | |
| if self_pd_cols.name == other_pd_cols.name | |
| else None, | |
| ) | |
| df.columns = self_pd_cols.append(other_pd_cols) |
Should give you the same result
| ] | ||
| return self.loc[:, to_select] | ||
| result = self.loc[:, to_select] | ||
| if not to_select and self._data.rangeindex: |
There was a problem hiding this comment.
Ideally I would hope loc preserved the .rangeindex but that could be for another PR
| if ( | ||
| len(data) == 0 | ||
| and not isinstance(index_data, cudf.MultiIndex) | ||
| and isinstance(index_data.dtype, pd.StringDtype) |
There was a problem hiding this comment.
Does this to apply to any of the string types and not just pd.StringDtype?
| if ( | ||
| len(data) == 0 | ||
| and not isinstance(column_data, cudf.MultiIndex) | ||
| and isinstance(column_data.dtype, pd.StringDtype) |
Description
Fixes 78 occurrences (across 41 unique tests) of
AssertionError: DataFrame.columns are differentsurfaced by the pandas test suite undercudf.pandas. All fixes are applied to cuDF classic — noconftest-patch.pyxfails added.Areas touched (column-metadata propagation):
base_accessor.py—str.partition/split/rpartition/rsplit(expand=True) now produceRangeIndexcolumns.single_column_frame._to_frame— unnamedSeries.to_frame()keepsRangeIndex(1)columns; tuple names produceMultiIndexcolumns.dataframe.py—_make_operands_and_index_for_binop,_concat,quantile(single),select_dtypes, andjoinnow preservemultiindex / level_names / rangeindex / label_dtype/CategoricalIndexon the result columns.indexed_frame._reindex— propagateslabel_dtypeso categorical-typed reindex targets keepCategoricalIndexcolumns.groupby.agg/_scan_fill— preservelevel_nameson agg results;ffill/bfillmatches pandas' object-typed columns labels (with a guard skippingStringDtype).reshape.py—_normalize_series_and_dataframekeeps unnamed-Series semantic inconcat;pivotflattens scalar 2-D selections fromMultiIndexrows and usesobjectdtype for empty-axis results.No regressions in the cuDF classic test suite (reshape, groupby, dataframe binops/reindex/select_dtypes/ffill_bfill, doctests).
cudf.pandastest suite comparisonpandas3Net: 32 prior failures → pass and 8 prior xfails → pass (+40 passed total). No regressions in failed/xpassed.
Checklist