Handle all dtypes in convert_dtypes#22202
Handle all dtypes in convert_dtypes#22202galipremsagar wants to merge 9 commits intorapidsai:pandas3from
convert_dtypes#22202Conversation
|
Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually. Contributors can view more details about this message here. |
|
|
||
| def to_arrow(self) -> pa.Array: | ||
| if isinstance(self.dtype, pd.ArrowDtype) and pa.types.is_null( | ||
| self.dtype.pyarrow_dtype |
There was a problem hiding this comment.
If StringColumn had an pd.ArrowDtype(pa.null) I would consider this a bug. The pyarrow types should only be a pa.string or pa.large_string
There was a problem hiding this comment.
In all our cudf type conversions, we convert pa.null type to object dtype. Previously, we used to map pa.null to int8 but we made this recent change to object to match closely with pandas. Do you think we should be mapping to some other type or introduce a new type null type in cudf?
There was a problem hiding this comment.
I think converting pa.null to object is correct. I'm mainly wondering what series of calls leads to StringColumn having a pd.ArrowDtype(pa.null()) where we need specific check for this in to_arrow?
There was a problem hiding this comment.
ColumnBase.create normalizes the plc buffer but keeps the null dtype:
# For pandas nullable null types (ArrowDtype wrapping pa.null()),
# normalize the column data and dtype before construction.
col, dtype, old_dtype = maybe_normalize_arrow_null(col, dtype)
target_cls = ColumnBase._dispatch_subclass_from_dtype(dtype) # dispatches on object → StringColumn
self = target_cls.__new__(target_cls)
self.plc_column = _wrap_and_validate(col, dtype) if validate else col
self._dtype = dtype if old_dtype is None else old_dtype # original ArrowDtype(pa.null()) is restored
So the subclass dispatch sees np.dtype("object") and picks StringColumn, but self._dtype ends up as the original pd.ArrowDtype(pa.null()) — which is what the new
to_arrow check guards against. I know we have been wanting to remove this kind of casting in ColumnBase.create, I plan on doing that once we are close to merging to main
There was a problem hiding this comment.
Ah ok if this goes back to the casting thing we're doing in ColumnBase, I would suggest we hold off on these changes since this might fix itself once fix the ColumnBase issue
There was a problem hiding this comment.
Ah ok if this goes back to the casting thing we're doing in
ColumnBase, I would suggest we hold off on these changes since this might fix itself once fix theColumnBaseissue
Opened a PR to address this issue and remove the workaround from create: #22396
|
/okay to test de74324 |
Description
This PR brings
convert_dtypesin party withpandas3Comparison:
pandas3vs This PRChecklist