[C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths #28940

asfimport · 2021-07-05T15:04:08Z

We're currently trying to write bindings from the C++ function "utf8_slice_codeunits" to R, specifically trying to replicate the behaviour of R's string::str_sub

In both the R and C++ implementations, I can use negative indices to count back from the end of a string (show below in R, but the latter directly invokes the C++ implementation):

# stringr version
> stringr::str_sub("Apache Arrow", -5, -2)
[1] "Arro"

# C++ version
> call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=-1L))
Scalar
Arro

Note that in the C++ implementation, I have to add 1 to the stop value as the final value is non-inclusive.

The problem is when I'm trying to use negative indices to refer to the final values in a string:

stringr version
> stringr::str_sub("Apache Arrow", -5, -1)
[1] "Arrow"

# C++ version
> call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L, stop=0L))
Scalar

The result is blank as the 'stop' value 0 refers to the start of the string, effective walking backwards, which isn't possible (except via the step argument which I can't get working but I don't think is what I want anyway).

I've tried to get around this by attempting to write some code that calculates the length of the string and supply that to the stop argument, but it didn't work.

I do have a possible workaround that involves reversing the string, extracting the substring using inverted values of swapped stop/start values, and then reversing the result, but before I go down that path, I was wondering if there is anything that can (and should! the answer may be a simple "nope!") be changed in the C++ code to make it possible to do this a different way?

Reporter: Nicola Crane / @thisisnic

_{Note: This issue was originally created as ARROW-13259. Please see the migration documentation for further details.}

asfimport · 2021-07-05T15:12:44Z

Maarten Breddels / @maartenbreddels:
Does my comment #9000 (comment) help you out?

asfimport · 2021-07-05T15:41:55Z

Joris Van den Bossche / @jorisvandenbossche:
To copy over the practical example:

In [24]: import sys

In [25]: string = "Apache Arrow"

In [26]: pc.utf8_slice_codeunits(string, start=-5, stop=sys.maxsize)
Out[26]: <pyarrow.StringScalar: 'Arrow'>

In [27]: pc.utf8_slice_codeunits(string, start=-5, stop=-1)
Out[27]: <pyarrow.StringScalar: 'Arro'>

So "a large integer" can be used to indicate "slice until the end" (I suppose because you can never have a scalar string with a longer length than that value?).
In Python this is sys.maxsize, in C++ it's std::numeric_limits<int64_t>::max().

asfimport · 2021-07-05T15:43:53Z

David Li / @lidavidm:
Maybe we could add a SliceOptions::kEnd constant just to make it clear what to do? (Not sure that'd help R?)

asfimport · 2021-07-05T16:25:10Z

Nicola Crane / @thisisnic:
Thanks very much @maartenbreddels and @jorisvandenbossche !

@lidavidm - nah, it's fine, I can just copy from the Python implementation and chuck in some R code like

if(stop==-1)stop = .Machine$integer.max

CC @pachadotdev

asfimport · 2021-07-05T17:16:21Z

Mauricio 'Pachá' Vargas Sepúlveda / @pachadotdev:
thanks a lot, I've edited my PR
since I'm on 21.04, I'm considering doing my work on a virtual machine until the build is fixed

asfimport · 2021-07-08T08:11:46Z

Eduardo Ponce / @edponce:
In C++ by default SliceOptions has the stop option set to std::numeric_limits<int64_t>::max(). Therefore, if you want to slice to end of string simply omit a value for stop or set it to a value >= len(string).

// start=-5, stop=std::numeric_limits<int64_t>::max(), step=1
SliceOptions opts(-5);
auto result = CallFunction("utf8_slice_codeunits", {Datum("Apache Arrow")}, &opts);
if (result.ok()) {
    Datum slice = std::move(result).ValueOrDie();
    // Prints "Arrow"
    std::cout << slice.scalar()->ToString() << std::endl;
} else {
    ARROW_LOG(ERROR) << result.status();
}

In R you should be able to do the following,

# C++ version
> call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L))
[1] "Arrow"

@jorisvandenbossche
The issue in PyArrow arises because the interface for SliceOptions does not sets the default value for stop option (only for step option).
Therefore, these are required arguments.

>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5, stop=len(string))
<pyarrow.StringScalar: 'Arrow'>

By providing sys.maxsize as default stop option, we can do the following:

>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5)
<pyarrow.StringScalar: 'Arrow'>

The question that naturally follows from this JIRA is: Are all the default options in PyArrow and R bindings consistent with C++ defaults?

asfimport · 2021-07-08T08:48:40Z

Eduardo Ponce / @edponce:
Created ARROW-13288 to verify inconsistencies between the C++ and Python kernel options.

asfimport · 2021-07-08T09:21:56Z

Nicola Crane / @thisisnic:
@edponce Thanks for highlighting that, I'd totally missed it!

@pachadotdev - totally missed this in my initial review of the code, but the thing that actually needs changing is the bindings for "utf8_slice_codeunits" in compute.cpp - here, start and stop have been set to 1 and -1 respectively, but instead need to reflect the default values from here:

arrow/cpp/src/arrow/compute/api_scalar.h

Lines 203 to 210 in 7eea2f5

    
           class ARROW_EXPORT SliceOptions : public FunctionOptions { 
        
            public: 
        
             explicit SliceOptions(int64_t start, int64_t stop = std::numeric_limits<int64_t>::max(), 
        
                                   int64_t step = 1); 
        
             SliceOptions(); 
        
             constexpr static char const kTypeName[] = "SliceOptions"; 
        
             int64_t start, stop, step; 
        
           };

I think that the step argument also needs implementing too.

We really should write this up (I can add it to my to-do list!) as it's neither obvious nor trivial to work out the various steps required here.

asfimport closed this as completed Jul 8, 2021

mroeschke mentioned this issue Apr 6, 2023

BUG: String slicing produces different results with pyarrow string datatype compared to python string type pandas-dev/pandas#52434

Closed

3 tasks

jorisvandenbossche mentioned this issue Apr 6, 2023

[C++] Better support optional start/stop in "utf8_slice_codeunits" kernel #34929

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths #28940

[C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths #28940

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 8, 2021

asfimport commented Jul 8, 2021

asfimport commented Jul 8, 2021

[C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths #28940

[C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths #28940

Comments

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 5, 2021

asfimport commented Jul 8, 2021

asfimport commented Jul 8, 2021

asfimport commented Jul 8, 2021