GH-33727: [Python] array() errors if pandas categorical column has dictionary as string not object #34289

Merged


@AlenkaF AlenkaF commented Feb 22, 2023

Rationale for this change

Currently, writing a pandas DataFrame that has a categorical column whose categories are of dtype string[pyarrow] fails. When categories with the string[pyarrow] dtype are converted to an array in pyarrow, the result is a ChunkedArray rather than an Array, and DictionaryArray.from_arrays then fails.

What changes are included in this PR?

The _handle_arrow_array_protocol function in array.pxi is updated so that when the conversion yields a ChunkedArray with exactly one chunk, the result is a pyarrow.Array and not a pa.ChunkedArray.
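
As a rough illustration of the unwrapping rule described above (not the actual Cython code from array.pxi), the sketch below uses two hypothetical stand-in classes, `FakeArray` and `FakeChunkedArray`, so that it runs without pyarrow installed. The helper name `unwrap_single_chunk` is also made up for this example. The only parts mirrored from pyarrow are the `num_chunks` attribute and the `chunk(i)` method of `ChunkedArray`.

```python
class FakeArray:
    """Hypothetical stand-in for pyarrow.Array."""

    def __init__(self, values):
        self.values = list(values)


class FakeChunkedArray:
    """Hypothetical stand-in for pyarrow.ChunkedArray."""

    def __init__(self, chunks):
        self.chunks = list(chunks)

    @property
    def num_chunks(self):
        # Same attribute name as pyarrow.ChunkedArray.num_chunks.
        return len(self.chunks)

    def chunk(self, i):
        # Same method name as pyarrow.ChunkedArray.chunk.
        return self.chunks[i]


def unwrap_single_chunk(res):
    """Return the only chunk of a one-chunk chunked array; otherwise return `res` unchanged."""
    if isinstance(res, FakeChunkedArray) and res.num_chunks == 1:
        return res.chunk(0)
    return res


single = FakeChunkedArray([FakeArray(["a", "b"])])
multi = FakeChunkedArray([FakeArray(["a"]), FakeArray(["b"])])

unwrapped = unwrap_single_chunk(single)      # a FakeArray
still_chunked = unwrap_single_chunk(multi)   # left as a FakeChunkedArray
```

In this sketch, inputs with more than one chunk are passed through untouched; only the single-chunk case is unwrapped.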

Are these changes tested?

Yes. Tests are added to:

  • python/pyarrow/tests/parquet/test_pandas.py
  • python/pyarrow/tests/test_pandas.py
  • python/pyarrow/tests/test_array.py

Are there any user-facing changes?

No.

@jorisvandenbossche jorisvandenbossche left a comment

This seems like a good short-term fix for this specific case. As pandas starts to use pyarrow chunked arrays more and more, we might need to think about what pa.array(pandas_series) should do if the pandas object has multiple chunks, but let's leave that for another issue.

Just one small remark on the tests

python/pyarrow/tests/test_array.py (resolved)
@github-actions github-actions bot added the awaiting merge label Mar 23, 2023
Co-authored-by: Joris Van den Bossche <[email protected]>
@AlenkaF AlenkaF added this to the 12.0.0 milestone Mar 27, 2023
@jorisvandenbossche jorisvandenbossche merged commit ca18e6f into apache:main Mar 28, 2023
@AlenkaF AlenkaF deleted the gh-33727-chunkedarray-with-one-chunk branch March 28, 2023 08:33
@jorisvandenbossche

Thanks! Opened #34755 for the broader question about how to handle chunked arrays in pa.array()

@ursabot
ursabot commented Mar 28, 2023

Benchmark runs are scheduled for baseline = 2dbd39c and contender = ca18e6f. ca18e6f is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Conbench compare runs links:
[Finished ⬇️0.0% ⬆️0.0%] ec2-t3-xlarge-us-east-2
[Failed ⬇️0.33% ⬆️0.06%] test-mac-arm
[Failed ⬇️0.0% ⬆️0.0%] ursa-i9-9960x
[Finished ⬇️0.25% ⬆️0.0%] ursa-thinkcentre-m75q
Buildkite builds:
[Finished] ca18e6f7 ec2-t3-xlarge-us-east-2
[Failed] ca18e6f7 test-mac-arm
[Failed] ca18e6f7 ursa-i9-9960x
[Finished] ca18e6f7 ursa-thinkcentre-m75q
[Finished] 2dbd39c7 ec2-t3-xlarge-us-east-2
[Failed] 2dbd39c7 test-mac-arm
[Finished] 2dbd39c7 ursa-i9-9960x
[Finished] 2dbd39c7 ursa-thinkcentre-m75q
Supported benchmarks:
ec2-t3-xlarge-us-east-2: Supported benchmark langs: Python, R. Runs only benchmarks with cloud = True
test-mac-arm: Supported benchmark langs: C++, Python, R
ursa-i9-9960x: Supported benchmark langs: Python, R, JavaScript
ursa-thinkcentre-m75q: Supported benchmark langs: C++, Java

j-bennet added a commit to j-bennet/dask that referenced this pull request May 1, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this pull request May 15, 2023
…has dictionary as string not object (apache#34289)

* Closes: apache#33727

Lead-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Development

Successfully merging this pull request may close these issues.

[Python] array() errors if pandas categorical column has dictionary as string not object
3 participants