[C++] Enable slicing to end of string using "utf8_slice_codeunits" when string length unknown or different lengths #28940
Comments
Maarten Breddels / @maartenbreddels: |
Joris Van den Bossche / @jorisvandenbossche:
In [24]: import sys
In [25]: string = "Apache Arrow"
In [26]: pc.utf8_slice_codeunits(string, start=-5, stop=sys.maxsize)
Out[26]: <pyarrow.StringScalar: 'Arrow'>
In [27]: pc.utf8_slice_codeunits(string, start=-5, stop=-1)
Out[27]: <pyarrow.StringScalar: 'Arro'>
So "a large integer" can be used to indicate "slice until the end" (I suppose because you can never have a scalar string with a longer length than that value?).
Nicola Crane / @thisisnic: @lidavidm - nah, it's fine, I can just copy from the Python implementation and chuck in some R code like
if (stop == -1) stop <- .Machine$integer.max
CC @pachadotdev
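For illustration, the workaround in this comment can be sketched in Python. This is only a sketch under stated assumptions, not the actual R binding: `str_sub_to_slice` and `str_sub` are hypothetical helpers, Python's built-in string slicing stands in for "utf8_slice_codeunits" (assumed here to handle negative indices and the exclusive stop in the same way), and `sys.maxsize` plays the role of `.Machine$integer.max`.

```python
import sys
from typing import Tuple


def str_sub_to_slice(start: int, end: int) -> Tuple[int, int]:
    """Translate stringr-style bounds (1-based, inclusive end) into
    0-based, half-open slice bounds. Hypothetical helper for illustration."""
    # A positive start shifts from 1-based to 0-based; a negative start
    # already counts back from the end of the string.
    slice_start = start - 1 if start > 0 else start
    if end == -1:
        # -1 means "up to the last character". Adding 1 would give 0,
        # the start of the string, so use a large sentinel instead.
        slice_stop = sys.maxsize
    elif end < 0:
        # Inclusive negative end -> exclusive stop one position later.
        slice_stop = end + 1
    else:
        # An inclusive 1-based end equals an exclusive 0-based stop.
        slice_stop = end
    return slice_start, slice_stop


def str_sub(text: str, start: int, end: int) -> str:
    """Apply the translated bounds using Python's built-in slicing."""
    lo, hi = str_sub_to_slice(start, end)
    return text[lo:hi]


tail = str_sub("Apache Arrow", -5, -1)       # last five characters
tail_trimmed = str_sub("Apache Arrow", -5, -2)
head = str_sub("Apache Arrow", 1, 6)
```

Only the `end == -1` case needs the sentinel; every other negative end maps to a negative stop that still lies to the right of the start.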
Mauricio 'Pachá' Vargas Sepúlveda / @pachadotdev: |
Eduardo Ponce / @edponce:
// start=-5, stop=std::numeric_limits<int64_t>::max(), step=1
SliceOptions opts(-5);
auto result = CallFunction("utf8_slice_codeunits", {Datum("Apache Arrow")}, &opts);
if (result.ok()) {
  Datum slice = std::move(result).ValueOrDie();
  // Prints "Arrow"
  std::cout << slice.scalar()->ToString() << std::endl;
} else {
  ARROW_LOG(ERROR) << result.status();
}
In R you should be able to do the following:
# C++ version
> call_function("utf8_slice_codeunits", Scalar$create("Apache Arrow"), options = list(start=-5L))
[1] "Arrow"
@jorisvandenbossche
>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5, stop=len(string))
<pyarrow.StringScalar: 'Arrow'>
By providing
>>> string = 'Apache Arrow'
>>> pc.utf8_slice_codeunits(string, start=-5)
<pyarrow.StringScalar: 'Arrow'>
The question that naturally follows from this JIRA is: are all the default options in the PyArrow and R bindings consistent with the C++ defaults?
Eduardo Ponce / @edponce: |
Nicola Crane / @thisisnic: @pachadotdev - totally missed this in my initial review of the code, but the thing that actually needs changing is the bindings for "utf8_slice_codeunits" in arrow/cpp/src/arrow/compute/api_scalar.h Lines 203 to 210 in 7eea2f5
I think that the
We really should write this up (I can add it to my to-do list!), as it's neither obvious nor trivial to work out the various steps required here.
We're currently trying to write bindings from the C++ function "utf8_slice_codeunits" to R, specifically trying to replicate the behaviour of R's stringr::str_sub.
In both the R and C++ implementations, I can use negative indices to count back from the end of a string (shown below in R, but the latter directly invokes the C++ implementation):
Note that in the C++ implementation, I have to add 1 to the stop value as the final value is non-inclusive.
The problem is when I'm trying to use negative indices to refer to the final values in a string:
The result is blank because the 'stop' value 0 refers to the start of the string, effectively walking backwards, which isn't possible (except via the step argument, which I can't get working and which I don't think is what I want anyway).
I've tried to get around this by writing some code that calculates the length of the string and supplies it to the stop argument, but it didn't work.
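As an analogy only, Python's built-in string slicing shows the same effect. The snippet below is a sketch that assumes "utf8_slice_codeunits" treats negative indices and the exclusive stop the way Python slices do; it does not call Arrow, and the variable names are made up for the example.

```python
import sys

text = "Apache Arrow"

# An inclusive end of -1 becomes an exclusive stop of -1 + 1 = 0.
# Position 0 is the start of the string, left of start=-5, so the slice is empty.
empty = text[-5:0]

# A stop far beyond any realistic string length means "to the end".
to_end = text[-5:sys.maxsize]

# The same large stop works when the strings have different lengths,
# so no per-string length calculation is needed.
words = ["Apache Arrow", "pyarrow", "R"]
tails = [w[-5:sys.maxsize] for w in words]
```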
I do have a possible workaround that involves reversing the string, extracting the substring using inverted values of swapped stop/start values, and then reversing the result, but before I go down that path, I was wondering if there is anything that can (and should! the answer may be a simple "nope!") be changed in the C++ code to make it possible to do this a different way?
Reporter: Nicola Crane / @thisisnic
Note: This issue was originally created as ARROW-13259. Please see the migration documentation for further details.