GH-33727: [Python] array() errors if pandas categorical column has dictionary as string not object #34289

AlenkaF · 2023-02-22T09:55:20Z

Rationale for this change

Currently, writing a pandas DataFrame with a categorical column whose categories have dtype string[pyarrow] fails. When categories with the string[pyarrow] dtype are converted to an array in pyarrow, the result is a ChunkedArray rather than an Array, and DictionaryArray.from_arrays then fails.

What changes are included in this PR?

The _handle_arrow_array_protocol method in array.pxi is updated so that a ChunkedArray with exactly one chunk is returned as a pyarrow.Array instead of a pyarrow.ChunkedArray.
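The unwrapping step described here can be sketched roughly as follows. This is an illustration only, not the code in array.pxi: `Array`, `ChunkedArray`, `ArrowBackedStrings`, and `handle_arrow_array_protocol` are hypothetical pure-Python stand-ins so the example runs without pyarrow, and the real function may take other arguments and perform additional checks.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Array:
    """Hypothetical stand-in for pyarrow.Array (illustration only)."""
    values: List[str]


@dataclass
class ChunkedArray:
    """Hypothetical stand-in for pyarrow.ChunkedArray (illustration only)."""
    chunks: List[Array] = field(default_factory=list)

    @property
    def num_chunks(self) -> int:
        return len(self.chunks)

    def chunk(self, i: int) -> Array:
        return self.chunks[i]


class ArrowBackedStrings:
    """Hypothetical object whose __arrow_array__ returns a ChunkedArray."""

    def __init__(self, *chunks):
        self._chunks = [list(c) for c in chunks]

    def __arrow_array__(self, type=None):
        return ChunkedArray([Array(c) for c in self._chunks])


def handle_arrow_array_protocol(obj, type=None):
    """Sketch: call the protocol, then unwrap a single-chunk result."""
    res = obj.__arrow_array__(type=type)
    if not isinstance(res, (Array, ChunkedArray)):
        raise TypeError("__arrow_array__ must return an Array or ChunkedArray")
    # The behaviour this PR describes: one chunk becomes a plain Array.
    if isinstance(res, ChunkedArray) and res.num_chunks == 1:
        res = res.chunk(0)
    return res


single = handle_arrow_array_protocol(ArrowBackedStrings(["a", "b"]))
multi = handle_arrow_array_protocol(ArrowBackedStrings(["a"], ["b"]))
```

In this sketch a result with more than one chunk is passed through unchanged, so only the single-chunk case changes type.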

Are these changes tested?

Yes. Tests are added to:

python/pyarrow/tests/parquet/test_pandas.py
python/pyarrow/tests/test_pandas.py
python/pyarrow/tests/test_array.py

Are there any user-facing changes?

No.

Closes: [Python] array() errors if pandas categorical column has dictionary as string not object #33727

github-actions · 2023-02-22T09:55:45Z

Closes: [Python] array() errors if pandas categorical column has dictionary as string not object #33727

jorisvandenbossche

This seems a good short term fix for this specific case. When pandas starts to use pyarrow chunked arrays more and more, we might need to think about what pa.array(pandas_series) should do if the pandas object has multiple chunks, but let's leave that for another issue.

Just one small remark on the tests.

python/pyarrow/tests/test_array.py

Co-authored-by: Joris Van den Bossche <[email protected]>

jorisvandenbossche · 2023-03-28T08:40:23Z

Thanks! Opened #34755 for the broader question about how to handle chunked arrays in pa.array()

ursabot · 2023-03-28T11:33:54Z

Benchmark runs are scheduled for baseline = 2dbd39c and contender = ca18e6f. ca18e6f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.33% ⬆️0.06%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] ca18e6f7 ec2-t3-xlarge-us-east-2
[Failed] ca18e6f7 test-mac-arm
[Failed] ca18e6f7 ursa-i9-9960x
[Finished] ca18e6f7 ursa-thinkcentre-m75q
[Finished] 2dbd39c7 ec2-t3-xlarge-us-east-2
[Failed] 2dbd39c7 test-mac-arm
[Finished] 2dbd39c7 ursa-i9-9960x
[Finished] 2dbd39c7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

…has dictionary as string not object (apache#34289)

### Rationale for this change

Currently writing a pandas dataframe with categorical column of dtype `string[pyarrow]` fails. The reason for this is that when category with `string[pyarrow]` dtype is converted to an array in pyarrow it results in a `ChunkedArray`, not `Array`, and then `DictionaryArray.from_arrays` fails.

### What changes are included in this PR?

`_handle_arrow_array_protocol` method in _array.pxi_ is updated so that in case of a `ChunkedArray` with one chunk, the result is a `pyarrow.Array` and not `pa.ChunkedArray`.

### Are these changes tested?

Yes. Tests are added to:

- python/pyarrow/tests/parquet/test_pandas.py
- python/pyarrow/tests/test_pandas.py
- python/pyarrow/tests/test_array.py

### Are there any user-facing changes?

No.

* Closes: apache#33727

Lead-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>

Update _handle_arrow_array_protocol and add tests

464b917

github-actions bot added the Component: Python label Feb 22, 2023

Fix failing test with pandas 1.0

d35e77a

This was referenced Mar 6, 2023

[Python] to_parquet fails with a category field backed by pyarrow string #34449

Closed

GH-34449 [Python] array(pd.Categorical) raising for arrow-backed cat #34456

Closed

j-bennet mentioned this pull request Mar 6, 2023

Improved support for pyarrow strings dask/dask#10000

Merged

2 tasks

jorisvandenbossche approved these changes Mar 23, 2023

python/pyarrow/tests/test_array.py

github-actions bot added the awaiting merge Awaiting merge label Mar 23, 2023

Update python/pyarrow/tests/test_array.py

7a9febe

Co-authored-by: Joris Van den Bossche <[email protected]>

AlenkaF added this to the 12.0.0 milestone Mar 27, 2023

jorisvandenbossche merged commit ca18e6f into apache:main Mar 28, 2023

AlenkaF deleted the gh-33727-chunkedarray-with-one-chunk branch March 28, 2023 08:33

jorisvandenbossche mentioned this pull request Mar 28, 2023

[Python] How to handle chunked arrays output in pyarrow.array(...) #34755

Open

j-bennet added a commit to j-bennet/dask that referenced this pull request May 1, 2023

apache/arrow#33727 has been fixed in pyarrow via apache/arrow#34289

77a73c9

j-bennet mentioned this pull request May 1, 2023

Un-xfail test_categories with pyarrow strings and pyarrow>=12 dask/dask#10244

Merged

1 task
