[Python] array() errors if pandas categorical column has dictionary as string not object #33727

crusaderky · 2023-01-17T17:26:51Z

Describe the bug, including details regarding any error messages, version, and platform.

pandas 1.5.2
pyarrow 10.0.1

If you convert a pandas Series with dtype string[pyarrow] to category, the categories will be string[pyarrow]. So far, so good.
However, when you try writing the resulting object to parquet, PyArrow fails as it does not recognize its own datatype.

Reproducer

>>> import pandas as pd
>>> df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
>>> df.dtypes.x
string[pyarrow]
>>> df = df.astype("category")
>>> df.dtypes.x
CategoricalDtype(categories=['bar', 'foo'], ordered=False)
>>> df.dtypes.x.categories.dtype
string[pyarrow]
>>> df.to_parquet("foo.parquet")
pyarrow.lib.ArrowInvalid: ("Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type", 'Conversion failed for column x with type category')

Workaround

df = df.astype(
    {
        k: pd.CategoricalDtype(v.categories.astype(object))
        for k, v in df.dtypes.items()
        if isinstance(v, pd.CategoricalDtype)
        and v.categories.dtype == "string[pyarrow]"
    }
)

Component(s)

Python


crusaderky · 2023-01-17T17:28:19Z

FYI @jrbourbeau @ncclementi

AlenkaF · 2023-01-18T08:55:34Z

Thank you for reporting @crusaderky!
It seems the array() method can't handle categorical pandas columns if the dictionary is of string type rather than object.

The error is triggered in pandas_compat.py

arrow/python/pyarrow/pandas_compat.py

Lines 591 to 598 in f769f6b

    
    try:
        result = pa.array(col, type=type_, from_pandas=True, safe=safe)
    except (pa.ArrowInvalid,
            pa.ArrowNotImplementedError,
            pa.ArrowTypeError) as e:
        e.args += ("Conversion failed for column {!s} with type {!s}"
                   .format(col.name, col.dtype),)
        raise e

due to array() method erroring with ArrowInvalid:

import pandas as pd
import pyarrow as pa

# Works with string series/column
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
pa.array(df["x"])
# <pyarrow.lib.ChunkedArray object at 0x12dcbbef0>
# [
#   [
#     "foo",
#     "bar",
#     "foo"
#   ]
# ]

# Works with categorical with dictionary as object type
df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
df = df.astype("category")
pa.array(df["x"])
# <pyarrow.lib.DictionaryArray object at 0x12dbc5ac0>

# -- dictionary:
#   [
#     "bar",
#     "foo"
#   ]
# -- indices:
#   [
#     1,
#     0,
#     1
#   ]

# Errors if dictionary in categorical column is string
df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
df = df.astype("category")
pa.array(df["x"])
# Traceback (most recent call last):
#   File "<stdin>", line 1, in <module>
#   File "pyarrow/array.pxi", line 310, in pyarrow.lib.array
#     return DictionaryArray.from_arrays(
#   File "pyarrow/array.pxi", line 2608, in pyarrow.lib.DictionaryArray.from_arrays
#     _dictionary = array(dictionary, memory_pool=memory_pool)
#   File "pyarrow/array.pxi", line 320, in pyarrow.lib.array
#     result = _sequence_to_array(obj, mask, size, type, pool, c_from_pandas)
#   File "pyarrow/array.pxi", line 39, in pyarrow.lib._sequence_to_array
#     chunked = GetResultValue(
#   File "pyarrow/error.pxi", line 144, in pyarrow.lib.pyarrow_internal_check_status
#     return check_status(status)
#   File "pyarrow/error.pxi", line 100, in pyarrow.lib.check_status
#     raise ArrowInvalid(message)
# pyarrow.lib.ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type

Debugging from convert_column in dataframe_to_arrays (pandas_compat.py)

from pyarrow.pandas_compat import dataframe_to_arrays

df = pd.DataFrame({"x": ["foo", "bar", "foo"]})
df = df.astype("category")
dataframe_to_arrays(df, schema=None, preserve_index=None)
# > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
# -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
(Pdb) col
# 0    foo
# 1    bar
# 2    foo
# Name: x, dtype: category
# Categories (2, object): ['bar', 'foo']

df = pd.DataFrame({"x": ["foo", "bar", "foo"]}, dtype="string[pyarrow]")
df = df.astype("category")
dataframe_to_arrays(df, schema=None, preserve_index=None)
# > /Users/alenkafrim/repos/arrow-new/python/pyarrow/pandas_compat.py(593)convert_column()
# -> result = pa.array(col, type=type_, from_pandas=True, safe=safe)
(Pdb) col
# 0    foo
# 1    bar
# 2    foo
# Name: x, dtype: category
# Categories (2, string): [bar, foo]

jorisvandenbossche · 2023-01-18T16:10:08Z

Inside pa.array(..), we convert a pandas.Categorical by converting its indices and categories to an Array, and then call pa.DictionaryArray.from_arrays. The first step works:

In [36]: indices = pa.array(df['x'].cat.codes)

In [37]: df["x"].cat.categories.values
Out[37]: 
<ArrowStringArray>
['bar', 'foo']
Length: 2, dtype: string

In [39]: dictionary = pa.array(df["x"].cat.categories.values)

In [40]: dictionary
Out[40]: 
<pyarrow.lib.ChunkedArray object at 0x7f9e87f0e7a0>
[
  [
    "bar",
    "foo"
  ]
]

But the converted categories result in a ChunkedArray, not a plain Array, and then it is DictionaryArray.from_arrays that fails: it expects an Array, and if the passed dictionary is not already an Array, it tries to convert it to one:

In [43]: pa.DictionaryArray.from_arrays(indices, dictionary)
...
ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type

In [44]: pa.array(dictionary)
...
ArrowInvalid: Could not convert <pyarrow.StringScalar: 'bar'> with type pyarrow.lib.StringScalar: did not recognize Python value type when inferring an Arrow data type

We should probably ensure that pa.array(..) returns an Array instead of ChunkedArray if there is only one chunk (that is also logic that could live inside pandas' StringArray.__arrow_array__).
I am not sure if our APIs should accept a ChunkedArray (and automatically concatenate the chunks?) in DictionaryArray.from_arrays.

…ctionary as string not object (#34289)

### Rationale for this change

Currently writing a pandas dataframe with categorical column of dtype `string[pyarrow]` fails. The reason for this is that when category with `string[pyarrow]` dtype is converted to an array in pyarrow it results in a `ChunkedArray`, not `Array`, and then `DictionaryArray.from_arrays` fails.

### What changes are included in this PR?

`_handle_arrow_array_protocol` method in _array.pxi_ is updated so that in case of a `ChunkedArray` with one chunk, the result is a `pyarrow.Array` and not `pa.ChunkedArray`.

### Are these changes tested?

Yes. Tests are added to:
- python/pyarrow/tests/parquet/test_pandas.py
- python/pyarrow/tests/test_pandas.py
- python/pyarrow/tests/test_array.py

### Are there any user-facing changes?

No.

* Closes: #33727

Lead-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Alenka Frim <[email protected]>
Co-authored-by: Joris Van den Bossche <[email protected]>
Signed-off-by: Joris Van den Bossche <[email protected]>


crusaderky added the Type: bug label Jan 17, 2023

github-actions bot added the Component: Python label Jan 17, 2023

AlenkaF changed the title ~~pandas string[pyarrow] -> category -> to_parquet fails~~ [Python] array() errors if pandas categorical column has dictionary as string not object Jan 18, 2023

jorisvandenbossche added this to the 12.0.0 milestone Jan 18, 2023

crusaderky mentioned this issue Jan 20, 2023

Post-mortem: why an easy workflow was horribly non-performant, and what we could do to make it easier for users to write fast dask code dask/community#301

Open

AlenkaF self-assigned this Feb 22, 2023

github-actions bot mentioned this issue Feb 22, 2023

GH-33727: [Python] array() errors if pandas categorical column has dictionary as string not object #34289

Merged

This was referenced Mar 6, 2023

[Python] to_parquet fails with a category field backed by pyarrow string #34449

Closed

GH-34449 [Python] array(pd.Categorical) raising for arrow-backed cat #34456

Closed

This was referenced Mar 6, 2023

Improved support for pyarrow strings dask/dask#10000

Merged

BUG: converting a string[pyarrow] column to category triggers an error in to_parquet pandas-dev/pandas#51752

Open

jorisvandenbossche closed this as completed in #34289 Mar 28, 2023

jorisvandenbossche mentioned this issue Mar 28, 2023

[Python] How to handle chunked arrays output in pyarrow.array(...) #34755

Open

j-bennet added a commit to j-bennet/dask that referenced this issue May 1, 2023

apache/arrow#33727 has been fixed in pyarrow via apache/arrow#34289

77a73c9

j-bennet mentioned this issue May 1, 2023

Un-xfail test_categories with pyarrow strings and pyarrow>=12 dask/dask#10244

Merged

1 task

jrbourbeau mentioned this issue Jul 19, 2023

Convert to pyarrow strings if proper dependencies are installed dask/dask#10400

Merged

2 tasks
