BUG: Series.str.split broken with pyarrow strings and regex argument #58321

Open
3 tasks done
WillAyd opened this issue Apr 18, 2024 · 3 comments
Labels
Arrow (pyarrow functionality), Bug, Strings (String extension data type and string data)

Comments

@WillAyd
Member

WillAyd commented Apr 18, 2024

Pandas version checks

  • I have checked that this issue has not already been reported.

  • I have confirmed this bug exists on the latest version of pandas.

  • I have confirmed this bug exists on the main branch of pandas.

Reproducible Example

In [69]: import pandas as pd, pyarrow as pa

In [70]: pd.Series(
    ...:     ["230/270/270", "240-290-290"],
    ...:     dtype="string[pyarrow]"
    ...: ).str.split(r"/|-", expand=True)
Out[70]: 
     0    1    2
0  230  270  270
1  240  290  290

In [71]: pd.Series(
    ...:     ["230/270/270", "240-290-290"],
    ...:     dtype=pd.ArrowDtype(pa.string())
    ...: ).str.split(r"/|-", expand=True)
Out[71]: 
             0
0  230/270/270
1  240-290-290

Issue Description

Splitting on a regular expression does not appear to work for Arrow-backed strings: with pd.ArrowDtype(pa.string()) the values come back unsplit in a single column. I am also a bit confused about why string[pyarrow] and pd.ArrowDtype(pa.string()) behave differently.

@phofl in case you know what's going on

Expected Behavior

Values should be split on the regular expression for the Arrow string type as well, matching the string[pyarrow] output.
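
For reference, the expected pieces can be computed without pandas at all. The following is only a sketch using the standard library's `re` module on the same inputs and pattern; `split_rows` is a hypothetical helper name used for illustration, not a pandas function.

```python
import re


def split_rows(values, pattern):
    """Split each string on a regex pattern, returning one list of pieces per input."""
    compiled = re.compile(pattern)
    return [compiled.split(value) for value in values]


# Same inputs and pattern as the reproducible example.
rows = split_rows(["230/270/270", "240-290-290"], r"/|-")
print(rows)  # [['230', '270', '270'], ['240', '290', '290']]
```

Each inner list corresponds to one row of the three-column frame that `expand=True` is expected to produce.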

Installed Versions

In [73]: pd.__version__
Out[73]: '3.0.0.dev0+681.g434fda08cf'

@WillAyd added the Bug, Strings (String extension data type and string data), and Arrow (pyarrow functionality) labels Apr 18, 2024
@asishm
Contributor

asishm commented Apr 18, 2024

iirc
pd.ArrowDtype(pa.string()) goes through ./pandas/core/arrays/arrow/array.py
while string[pyarrow] goes through ./pandas/core/arrays/string_arrow.py

@yuanx749
Contributor

As per https://pandas.pydata.org/docs/user_guide/pyarrow.html

The string alias "string[pyarrow]" maps to pd.StringDtype("pyarrow") which is not equivalent to specifying dtype=pd.ArrowDtype(pa.string())

@yuanx749
Contributor

take

3 participants