[Python] array() errors if pandas categorical column has dictionary as string not object #33727

Closed
crusaderky opened this issue Jan 17, 2023 · 3 comments · Fixed by #34289

crusaderky commented Jan 17, 2023

Describe the bug, including details regarding any error messages, version, and platform.

pandas 1.5.2
pyarrow 10.0.1

If you convert a pandas Series with dtype string[pyarrow] to category, the categories will be string[pyarrow]. So far, so good.
However, when you try writing the resulting object to parquet, PyArrow fails as it does not recognize its own datatype.

Reproducer

>>> import pandas as pd
>>> df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
>>> df.dtypes.x
string[pyarrow]
>>> df = df.astype("category")
>>> df.dtypes.x
CategoricalDtype(categories=['bar', 'foo'], ordered=False)
>>> df.dtypes.x.categories.dtype
string[pyarrow]
>>> df.to_parquet("foo.parquet")
pyarrow.lib.ArrowInvalid: ("Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column x with type category')

Workaround

df = df.astype(
    {
        k: pd.CategoricalDtype(v.categories.astype(object))
        for k, v in df.dtypes.items()
        if isinstance(v, pd.CategoricalDtype)
        and v.categories.dtype == "string[pyarrow]"
    }
)

Component(s)

Python

@crusaderky
Author

FYI @jrbourbeau @ncclementi

@AlenkaF
Member

AlenkaF commented Jan 18, 2023

Thank you for reporting @crusaderky!
It seems the array() method can't handle categorical pandas columns whose dictionary (the categories) has a string dtype rather than object.

The error is triggered in pandas_compat.py

try:
    result = pa.array(col, type=type_, from_pandas=True, safe=safe)
except (pa.ArrowInvalid,
        pa.ArrowNotImplementedError,
        pa.ArrowTypeError) as e:
    e.args += ("Conversion failed for column {!s} with type {!s}"
               .format(col.name, col.dtype),)
    raise e

because the array() method raises ArrowInvalid:

# Works with string series/column
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
pa.array(df["x"])
# <pyarrow.lib.ChunkedArray object at 0x12dcbbef0>
# [
#   [
#     "foo",
#     "bar",
#     "foo"
#   ]
# ]

# Works with categorical with dictionary as object type
df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
df = df.astype("category")
pa.array(df["x"])
# <pyarrow.lib.DictionaryArray object at 0x12dbc5ac0>

# -- dictionary:
#   [
#     "bar",
#     "foo"
#   ]
# -- indices:
#   [
#     1,
#     0,
#     1
#   ]

# Errors if dictionary in categorical column is string
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
df = df.astype("category")
pa.array(df["x"])
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "pyarrow/array.pxi", line 310, in pyarrow.lib.array
#     return DictionaryArray.from_arrays(
#   File "pyarrow/array.pxi", line 2608, in pyarrow.lib.DictionaryArray.from_arrays
#     _dictionary = array(dictionary, memory_pool=memory_pool)
#   File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
#     result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
#   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
#     chunked = GetResultValue(
#   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
#     return check_status(status)
#   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
#     raise ArrowInvalid(message)
# pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type

Debugging from convert_column in dataframe_to_arrays (pandas_compat.py)

df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
df = df.astype("category")
dataframe_to_arrays(df, schema=None, preserve_index=None)
# > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
# -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
(Pdb) col
# 0    foo
# 1    bar
# 2    foo
# Name: x, dtype: category
# Categories (2, object): ['bar', 'foo']

df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
df = df.astype("category")
dataframe_to_arrays(df, schema=None, preserve_index=None)
# > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
# -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
(Pdb) col
# 0    foo
# 1    bar
# 2    foo
# Name: x, dtype: category
# Categories (2, string): [bar, foo]

@AlenkaF changed the title from "pandas string[pyarrow] -> category -> to_parquet fails" to "[Python] array() errors if pandas categorical column has dictionary as string not object" on Jan 18, 2023
@jorisvandenbossche
Member

Inside pa.array(..), we convert a pandas.Categorical by converting its indices and categories to an Array each, and then call pa.DictionaryArray.from_arrays. The first step works:

In [36]: indices = pa.array(df['x'].cat.codes)

In [37]: df["x"].cat.categories.values
Out[37]: 
<ArrowStringArray>
['bar', 'foo']
Length: 2, dtype: string

In [39]: dictionary = pa.array(df["x"].cat.categories.values)

In [40]: dictionary
Out[40]: 
<pyarrow.lib.ChunkedArray object at 0x7f9e87f0e7a0>
[
  [
    "bar",
    "foo"
  ]
]

But the converted categories result in a ChunkedArray, not a plain Array, and then it is DictionaryArray.from_arrays that fails: it expects an Array, and if the passed dictionary is not already an Array, it tries to convert it to one:

In [43]: pa.DictionaryArray.from_arrays(indices, dictionary)
...
ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type

In [44]: pa.array(dictionary)
...
ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type

We should probably ensure that pa.array(..) returns an Array instead of ChunkedArray if there is only one chunk (that is also logic that could live inside pandas' StringArray.__arrow_array__).
I am not sure if our APIs should accept a ChunkedArray (and automatically concatenate the chunks?) in DictionaryArray.from_arrays.

@jorisvandenbossche jorisvandenbossche added this to the 12.0.0 milestone Jan 18, 2023
@AlenkaF AlenkaF self-assigned this Feb 22, 2023
jorisvandenbossche added a commit that referenced this issue Mar 28, 2023
…ctionary as string not object (#34289)

### Rationale for this change
Currently, writing a pandas DataFrame with a categorical column whose categories have dtype `string[pyarrow]` fails. When such a categorical is converted to an array in pyarrow, the categories come back as a `ChunkedArray`, not an `Array`, and `DictionaryArray.from_arrays` then fails.

### What changes are included in this PR?
The `_handle_arrow_array_protocol` function in _array.pxi_ is updated so that a `ChunkedArray` with exactly one chunk is returned as a `pyarrow.Array` instead of a `pyarrow.ChunkedArray`.

### Are these changes tested?
Yes. Tests are added to:

- python/pyarrow/tests/parquet/test_pandas.py
- python/pyarrow/tests/test_pandas.py
- python/pyarrow/tests/test_array.py

### Are there any user-facing changes?
No.
* Closes: #33727

Lead-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>
j-bennet added a commit to j-bennet/dask that referenced this issue May 1, 2023
ArgusLi pushed a commit to Bit-Quill/arrow that referenced this issue May 15, 2023
…has dictionary as string not object (apache#34289)

### Rationale for this change
Currently writing a pandas dataframe with categorical column of dtype `string[pyarrow]` fails. The reason for this is that when category with `string[pyarrow]` dtype is converted to an array in pyarrow it results in a `ChunkedArray,` not `Array`, and then `DictionaryArray.from_arrays`  fails.

### What changes are included in this PR?
`_handle_arrow_array_protocol` method in _array.pxi_ is updated so that in case of a `ChunkedArray` with one chunk, the result is a `pyarrow.Array` and not `pa.ChunkedArray.`

### Are these changes tested?
Yes. Tests are added to:

- python/pyarrow/tests/parquet/test_pandas.py
- python/pyarrow/tests/test_pandas.py
- python/pyarrow/tests/test_array.py

### Are there any user-facing changes?
No.
* Closes: apache#33727

Lead-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>