[C++][Python] Allow utf8_slice_codeunits to support default start value of None to support strings of different length #34917

rohanjain101 · 2023-04-06T01:35:20Z

Describe the enhancement requested

Related to pandas-dev/pandas#52434. Currently utf8_slice_codeunits doesn't support a default start value. It would be nice to support a default of 0 if step > 0 and len - 1 if step < 0, for parity with pandas.
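A minimal pure-Python sketch of the requested defaulting rule, for illustration only. `resolve_start` is a hypothetical helper, not part of pyarrow or pandas; it mirrors how Python slicing picks a start when none is given, which is the behavior pandas follows for plain Python strings.

```python
from typing import Optional


def resolve_start(start: Optional[int], length: int, step: int) -> int:
    """Hypothetical helper: pick the start index as Python slicing does when start is None."""
    if step == 0:
        raise ValueError("step must not be zero")
    if start is not None:
        return start
    # Forward slices begin at the first element, backward slices at the last.
    return 0 if step > 0 else length - 1


# The resolved start depends on each string's own length, which is why a
# single fixed start value cannot cover an array of differently sized strings.
for s in ["abcdef", "xyz", ""]:
    for step in (1, 2, -1, -2):
        start = resolve_start(None, len(s), step)
        assert s[start::step] == s[::step]
```

Because `length - 1` differs per element, the default would have to be resolved per string, which a single scalar start argument cannot express for a negative step.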

Component(s)

Python

jorisvandenbossche · 2023-04-06T12:13:15Z

@rohanjain101 thanks for the report! I was going to mention that you could use sys.maxsize as start (as the largest integer, it will always be beyond the end of any single string in the input, and so will always start slicing from the end), but apparently you can easily run into a segfault with that: opened #34928

As a non-ideal workaround, you could check the length of the longest string in your input array with pa.compute.max(pa.compute.utf8_length(arr)) (or just take a reasonably large value, but not one close to sys.maxsize), and use that as the start value:

In [1]: pa.compute.utf8_slice_codeunits("abcdefghijklmnabcdefghijkln", start=1000, stop=8, step=-9)
Out[1]: <pyarrow.StringScalar: 'nd'>

rohanjain101 added the Type: enhancement label Apr 6, 2023

github-actions bot added the Component: Python label Apr 6, 2023

rohanjain101 mentioned this issue Apr 6, 2023

BUG: String slicing produces different results with pyarrow string datatype compared to python string type pandas-dev/pandas#52434

Closed

3 tasks

jorisvandenbossche changed the title ~~Allow utf8_slice_codeunits to support default start value of None to support strings of different length~~ [C++][Python] Allow utf8_slice_codeunits to support default start value of None to support strings of different length Apr 6, 2023

jorisvandenbossche added the Component: C++ label Apr 6, 2023

jorisvandenbossche mentioned this issue Apr 6, 2023

[C++] Better support optional start/stop in "utf8_slice_codeunits" kernel #34929

Open

westonpace added the good-second-issue label Apr 11, 2023

wirable23 mentioned this issue Jul 11, 2023

GH-36311: [C++] Fix integer overflows in utf8_slice_codeunits #36575

Merged
