[Python] array() errors if pandas categorical column has dictionary as string not object #33727
Comments
Thank you for reporting @crusaderky! The error is triggered in arrow/python/pyarrow/pandas_compat.py Lines 591 to 598 in f769f6b
due to
# Works with string series/column
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
pa.array(df["x"])
# <pyarrow.lib.ChunkedArray object at 0x12dcbbef0>
# [
# [
# "foo",
# "bar",
# "foo"
# ]
# ]
# Works with categorical with dictionary as object type
df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
df = df.astype("category")
pa.array(df["x"])
# <pyarrow.lib.DictionaryArray object at 0x12dbc5ac0>
# -- dictionary:
# [
# "bar",
# "foo"
# ]
# -- indices:
# [
# 1,
# 0,
# 1
# ]
# Errors if dictionary in categorical column is string
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
df = df.astype("category")
pa.array(df["x"])
# Traceback (most recent call last):
# File "<stdin>", line 1, in <module>
# File "pyarrow/array.pxi", line 310, in pyarrow.lib.array
# return DictionaryArray.from_arrays(
# File "pyarrow/array.pxi", line 2608, in pyarrow.lib.DictionaryArray.from_arrays
# _dictionary = array(dictionary, memory_pool=memory_pool)
# File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
# result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
# File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
# chunked = GetResultValue(
# File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
# return check_status(status)
# File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
# raise ArrowInvalid(message)
# pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type
Debugging from
df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
df = df.astype("category")
dataframe_to_arrays(df, schema=None, preserve_index=None)
# > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
# -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
(Pdb) col
# 0 foo
# 1 bar
# 2 foo
# Name: x, dtype: category
# Categories (2, object): ['bar', 'foo']
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
df = df.astype("category")
dataframe_to_arrays(df, schema=None, preserve_index=None)
# > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
# -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
(Pdb) col
# 0 foo
# 1 bar
# 2 foo
# Name: x, dtype: category
# Categories (2, string): [bar, foo]
Inside
But the converted categories result in a ChunkedArray, not a plain Array, and then it is
We should probably ensure that
…ctionary as string not object (#34289)

### Rationale for this change

Currently writing a pandas dataframe with categorical column of dtype `string[pyarrow]` fails. The reason for this is that when category with `string[pyarrow]` dtype is converted to an array in pyarrow it results in a `ChunkedArray`, not `Array`, and then `DictionaryArray.from_arrays` fails.

### What changes are included in this PR?

`_handle_arrow_array_protocol` method in _array.pxi_ is updated so that in case of a `ChunkedArray` with one chunk, the result is a `pyarrow.Array` and not `pa.ChunkedArray`.

### Are these changes tested?

Yes. Tests are added to:

- python/pyarrow/tests/parquet/test_pandas.py
- python/pyarrow/tests/test_pandas.py
- python/pyarrow/tests/test_array.py

### Are there any user-facing changes?

No.

* Closes: #33727

Lead-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
Describe the bug, including details regarding any error messages, version, and platform.
pandas 1.5.2
pyarrow 10.0.1
If you convert a pandas Series with dtype `string[pyarrow]` to `category`, the categories will be `string[pyarrow]`. So far, so good. However, when you try writing the resulting object to parquet, PyArrow fails as it does not recognize its own datatype.
Reproducer
Workaround
Component(s)
Python