GH-33727: [Python] array() errors if pandas categorical column has dictionary as string not object #34289
This seems like a good short-term fix for this specific case. When pandas starts to use pyarrow chunked arrays more and more, we might need to think about what `pa.array(pandas_series)` should do if the pandas object has multiple chunks, but let's leave that for another issue.
Just one small remark on the tests
Co-authored-by: Joris Van den Bossche <[email protected]>
Thanks! Opened #34755 for the broader question about how to handle chunked arrays in
Benchmark runs are scheduled for baseline = 2dbd39c and contender = ca18e6f. ca18e6f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
…has dictionary as string not object (apache#34289)

### Rationale for this change

Currently writing a pandas dataframe with categorical column of dtype `string[pyarrow]` fails. The reason for this is that when category with `string[pyarrow]` dtype is converted to an array in pyarrow it results in a `ChunkedArray`, not `Array`, and then `DictionaryArray.from_arrays` fails.

### What changes are included in this PR?

`_handle_arrow_array_protocol` method in _array.pxi_ is updated so that in case of a `ChunkedArray` with one chunk, the result is a `pyarrow.Array` and not `pa.ChunkedArray`.

### Are these changes tested?

Yes. Tests are added to:
- python/pyarrow/tests/parquet/test_pandas.py
- python/pyarrow/tests/test_pandas.py
- python/pyarrow/tests/test_array.py

### Are there any user-facing changes?

No.

* Closes: apache#33727

Lead-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Rationale for this change
Currently, writing a pandas DataFrame with a categorical column whose categories have dtype `string[pyarrow]` fails. When categories with `string[pyarrow]` dtype are converted to an array in pyarrow, the result is a `ChunkedArray`, not an `Array`, and `DictionaryArray.from_arrays` then fails.
What changes are included in this PR?
The `_handle_arrow_array_protocol` method in array.pxi is updated so that, for a `ChunkedArray` with one chunk, the result is a `pyarrow.Array` and not a `pa.ChunkedArray`.
Are these changes tested?
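Yes. Tests are added to:
- python/pyarrow/tests/parquet/test_pandas.py
- python/pyarrow/tests/test_pandas.py
- python/pyarrow/tests/test_array.py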
Are there any user-facing changes?
No.