Handle all dtypes in convert_dtypes #22202

Open
galipremsagar wants to merge 9 commits into rapidsai:pandas3 from galipremsagar:convert_dtypes

@galipremsagar galipremsagar commented Apr 17, 2026

Description

This PR brings convert_dtypes to parity with pandas 3.

Comparison: pandas3 vs This PR

| Metric | pandas3 | This PR | Δ (PR − pandas3) |
|---|---:|---:|---:|
| failed | 3,710 | 3,547 | −163 |
| passed | 203,927 | 204,095 | +168 |
| skipped | 7,361 | 7,361 | 0 |
| xfailed | 6,283 | 6,278 | −5 |
| xpassed | 77 | 77 | 0 |
| warnings | 7,305 | 7,305 | 0 |
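
As a quick sanity check on the Δ column, the sketch below recomputes the differences from the counts in the comparison. It assumes the rows are pytest outcome totals for the same test suite; the dictionary layout and variable names are illustrative only and do not come from the PR.

```python
# Counts copied from the comparison above: (pandas3 branch, this PR).
results = {
    "failed": (3710, 3547),
    "passed": (203927, 204095),
    "skipped": (7361, 7361),
    "xfailed": (6283, 6278),
    "xpassed": (77, 77),
}

# Per-metric difference, PR minus pandas3.
deltas = {metric: pr - base for metric, (base, pr) in results.items()}

for metric, delta in deltas.items():
    print(f"{metric:<8} {delta:+d}")

# If both runs collected the same tests, the outcome deltas cancel out.
total_base = sum(base for base, _ in results.values())
total_pr = sum(pr for _, pr in results.values())
print("same number of tests in both runs:", total_base == total_pr)
```

Under that assumption the totals match, so the 168 additional passes are accounted for by 163 fewer failures plus 5 fewer xfails.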

Checklist

  • I am familiar with the Contributing Guidelines.
  • New or existing tests cover these changes.
  • The documentation is up to date with these changes.

copy-pr-bot Bot commented Apr 17, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@github-actions github-actions Bot added the Python Affects Python cuDF API. label Apr 17, 2026
@GPUtester GPUtester moved this to In Progress in cuDF Python Apr 17, 2026
@galipremsagar galipremsagar added improvement Improvement / enhancement to an existing function non-breaking Non-breaking change labels Apr 17, 2026
@galipremsagar galipremsagar requested a review from mroeschke April 17, 2026 22:09
@galipremsagar galipremsagar marked this pull request as ready for review April 17, 2026 22:09
@galipremsagar galipremsagar requested a review from a team as a code owner April 17, 2026 22:09
@galipremsagar galipremsagar requested review from bdice and removed request for a team April 17, 2026 22:09
@galipremsagar galipremsagar added the 3 - Ready for Review Ready for review by team label Apr 17, 2026
@galipremsagar galipremsagar changed the title Convert dtypes Handle all dtypes in convert_dtypes Apr 17, 2026

def to_arrow(self) -> pa.Array:
    if isinstance(self.dtype, pd.ArrowDtype) and pa.types.is_null(
        self.dtype.pyarrow_dtype
Contributor

If StringColumn had a pd.ArrowDtype(pa.null()) I would consider this a bug. The pyarrow types should only be pa.string or pa.large_string.

Contributor Author

In all our cudf type conversions, we convert the pa.null type to object dtype. Previously, we mapped pa.null to int8, but we recently changed this to object to match pandas more closely. Do you think we should map it to some other type, or introduce a new null type in cudf?

Contributor

I think converting pa.null to object is correct. I'm mainly wondering what series of calls leads to StringColumn having a pd.ArrowDtype(pa.null()), such that we need a specific check for it in to_arrow?

Contributor Author

ColumnBase.create normalizes the plc buffer but keeps the null dtype:

# For pandas nullable null types (ArrowDtype wrapping pa.null()),
# normalize the column data and dtype before construction.
col, dtype, old_dtype = maybe_normalize_arrow_null(col, dtype)

target_cls = ColumnBase._dispatch_subclass_from_dtype(dtype)  # dispatches on object → StringColumn
self = target_cls.__new__(target_cls)
self.plc_column = _wrap_and_validate(col, dtype) if validate else col
self._dtype = dtype if old_dtype is None else old_dtype  # original ArrowDtype(pa.null()) is restored

So the subclass dispatch sees np.dtype("object") and picks StringColumn, but self._dtype ends up as the original pd.ArrowDtype(pa.null()), which is what the new to_arrow check guards against.
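
The mismatch described above can be reproduced in a small pure-Python model. Everything in this sketch is a hypothetical stand-in (the classes, `maybe_normalize`, `dispatch`, and `create` are not the cudf implementations); it only illustrates how dispatching on a normalized dtype and then restoring the original one leaves a string column carrying a null dtype.

```python
from dataclasses import dataclass

OBJECT_DTYPE = "object"  # stand-in for np.dtype("object")


@dataclass(frozen=True)
class NullArrowDtype:
    """Stand-in for pd.ArrowDtype(pa.null())."""

    name: str = "null[pyarrow]"


class Column:
    """Stand-in for a generic column class."""


class StringColumn(Column):
    """Stand-in for the column class chosen for object dtype."""


def maybe_normalize(dtype):
    """Return (dtype used for dispatch, original dtype or None)."""
    if isinstance(dtype, NullArrowDtype):
        return OBJECT_DTYPE, dtype
    return dtype, None


def dispatch(dtype):
    """Pick a column class from the (normalized) dtype."""
    return StringColumn if dtype == OBJECT_DTYPE else Column


def create(dtype):
    normalized, old_dtype = maybe_normalize(dtype)
    target_cls = dispatch(normalized)  # sees "object", picks StringColumn
    col = target_cls.__new__(target_cls)
    # The original dtype is restored after dispatch, as in the snippet above.
    col._dtype = normalized if old_dtype is None else old_dtype
    return col


col = create(NullArrowDtype())
print(type(col).__name__, col._dtype.name)
```

In this model the class and the stored dtype disagree, which is why a method on the string column would need a special case until the restore step is removed.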

I know we have been wanting to remove this kind of casting in ColumnBase.create. I plan on doing that once we are close to merging to main.

Contributor

Ah ok, if this goes back to the casting we're doing in ColumnBase, I would suggest we hold off on these changes, since this might fix itself once we fix the ColumnBase issue.

Contributor Author

> Ah ok if this goes back to the casting thing we're doing in ColumnBase, I would suggest we hold off on these changes since this might fix itself once we fix the ColumnBase issue

Opened a PR to address this issue and remove the workaround from create: #22396

Comment thread python/cudf/cudf/core/indexed_frame.py Outdated
Comment thread python/cudf/cudf/tests/series/methods/test_convert_dtypes.py Outdated
Comment thread python/cudf/cudf/tests/series/methods/test_convert_dtypes.py Outdated
Comment thread python/cudf/cudf/tests/dataframe/methods/test_convert_dtypes.py Outdated
@galipremsagar
Contributor Author

/okay to test de74324

@galipremsagar galipremsagar requested a review from mroeschke April 23, 2026 16:52
Labels

3 - Ready for Review Ready for review by team improvement Improvement / enhancement to an existing function non-breaking Non-breaking change Python Affects Python cuDF API.

Projects

Status: In Progress

3 participants