[C++][Python] Allow utf8_slice_codeunits to support default start value of None to support strings of different length #34917

Open
rohanjain101 opened this issue Apr 6, 2023 · 1 comment

Comments

@rohanjain101

Describe the enhancement requested

Related to pandas-dev/pandas#52434. Currently, utf8_slice_codeunits doesn't support a default start value. It would be nice to support a default of 0 if step > 0 and len - 1 if step < 0, for parity with pandas.
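
For reference, the requested defaults match what Python's built-in slicing resolves when start is None. The sketch below uses only the standard library (no pyarrow); the helper name default_start is hypothetical and exists only for illustration.

```python
def default_start(length, step):
    """Return the start index Python resolves when a slice's start is None."""
    # slice.indices() normalizes (start, stop, step) for a sequence of the given length.
    start, _stop, _step = slice(None, None, step).indices(length)
    return start


for text in ["abc", "abcdefgh"]:
    n = len(text)
    # Positive step starts at 0; negative step starts at the last index, n - 1.
    print(text, default_start(n, 2), default_start(n, -1))
```

Because the default depends on each string's own length, a single fixed integer start cannot express it for an array whose strings differ in length, which is the gap this issue describes.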

Component(s)

Python

@jorisvandenbossche jorisvandenbossche changed the title Allow utf8_slice_codeunits to support default start value of None to support strings of different length [C++][Python] Allow utf8_slice_codeunits to support default start value of None to support strings of different length Apr 6, 2023
@jorisvandenbossche
Member

@rohanjain101 thanks for the report! I was going to suggest using sys.maxsize as start: as the largest integer, it will always be beyond the end of any single string in the input, and so will always start slicing from the end. But apparently you can easily run into a segfault with that: opened #34928

As a non-ideal workaround, you could check the length of the longest string in your input array with pa.compute.max(pa.compute.utf8_length(arr)) (or just pick a reasonably large value, but not one close to sys.maxsize), and use that as the start value:

In [1]: pa.compute.utf8_slice_codeunits("abcdefghijklmnabcdefghijkln", start=1000, stop=8, step=-9)
Out[1]: <pyarrow.StringScalar: 'nd'>
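
Why this works can be checked against plain Python slicing, which is the reference behaviour the issue asks for. The sketch below is pure Python and only models the semantics; it assumes the Arrow kernel clamps an out-of-range start the same way, as the output above suggests. The sample strings are made up for illustration.

```python
strings = ["abcdefghijklmnabcdefghijkln", "abc", ""]

# Pure-Python stand-in for pa.compute.max(pa.compute.utf8_length(arr)).
max_len = max(len(s) for s in strings)

for s in strings:
    # With a negative step, any start >= len(s) is clamped to the last
    # character, so it behaves exactly like start=None.
    assert s[max_len:8:-9] == s[None:8:-9]
    assert s[max_len:0:-1] == s[None:0:-1]

# Same arguments as the pyarrow call above, with the computed start.
print(repr(strings[0][max_len:8:-9]))
```

Note that the equivalence only holds for a negative step: with a positive step, a start at or beyond the end of the string yields an empty result, so there the explicit start should simply be 0.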
