GH-43960: [R] fix `str_sub` binding to properly handle negative `end` values #44141
First-time contributor here, so let me know where I can improve!
Rationale for this change
The `str_sub` binding in arrow was not handling negative `end` values properly. The problem was two-fold:

1. When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string.
2. When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R.

See discussion/examples at #43960 for details.
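To make the inclusive/exclusive mismatch concrete, here is a minimal base-R sketch. It is only an illustration under stated assumptions, not the code in this PR: `ref_str_sub`, `slice_exclusive`, and `to_slice_bounds` are hypothetical helpers, and `slice_exclusive` merely imitates a 0-based slice with an exclusive stop rather than calling `utf8_slice_codeunits`.

```r
# Hypothetical helpers for illustration only; none of these are arrow or
# stringr functions.

# R-style reference: 1-based positions, `end` inclusive, negatives count
# back from the end of the string (-1 is the last character).
ref_str_sub <- function(string, start = 1, end = -1) {
  n <- nchar(string)
  if (start < 0) start <- n + start + 1
  if (end < 0) end <- n + end + 1
  substr(string, start, end)
}

# Stand-in for a 0-based slice whose stop position is exclusive. It only
# imitates that convention; it is not utf8_slice_codeunits.
slice_exclusive <- function(string, start, stop_excl) {
  n <- nchar(string)
  if (start < 0) start <- max(n + start, 0)
  if (stop_excl < 0) stop_excl <- max(n + stop_excl, 0)
  stop_excl <- min(stop_excl, n)
  if (start >= stop_excl) return("")
  substr(string, start + 1, stop_excl)
}

# Assumed translation from R-style (start, end) to slice-style bounds.
to_slice_bounds <- function(start, end) {
  if (end == -1) {
    stop_excl <- .Machine$integer.max  # "through the last character"
  } else if (end < 0) {
    stop_excl <- end + 1               # inclusive -> exclusive, from the end
  } else {
    stop_excl <- end                   # 1-based inclusive == 0-based exclusive
  }
  if (start > 0) start <- start - 1    # 1-based -> 0-based
  list(start = start, stop_excl = stop_excl)
}

x <- "Apache Arrow"

# Case 1: end is negative and numerically less than start.
ref_str_sub(x, 1, -3)                      # "Apache Arr"
slice_exclusive(x, 0, 0)                   # "": the result of forcing end to 0
b <- to_slice_bounds(1, -3)
slice_exclusive(x, b$start, b$stop_excl)   # "Apache Arr"

# Case 2: end < -1 but still to the right of start.
ref_str_sub(x, -5, -2)                     # "Arro"
slice_exclusive(x, -5, -2)                 # "Arr": unadjusted end drops a character
b <- to_slice_bounds(-5, -2)
slice_exclusive(x, b$start, b$stop_excl)   # "Arro"
```

The `end == -1` branch is kept separate in this sketch because adding 1 to it would give a stop of 0, which an exclusive slice reads as "stop before the first character" rather than "run to the end".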
What changes are included in this PR?
1. Removed the code in `r/R/dplyr-funcs-string.R` that previously set `end` = 0 when `end < start`, which meant that if the user was counting backwards from the end of the string (with a negative `end` value), an empty string would wrongly be returned. It appears that the case the previous code was trying to address is already handled properly by the underlying C++ function (`utf8_slice_codeunits`).
2. Changed `r/R/dplyr-funcs-string.R` to account for the difference between R's inclusive `end` and C++'s exclusive `end` when `end` is negative.
3. Added a test to `r/tests/testthat/test-dplyr-funcs-string.R` to cover these cases.

Are these changes tested?
Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`, including one which I added (see attached commit), which explicitly tests the case where `end` is negative (-3) and less than the `start` value (1). This also tests the case where `end` < -1 and to the right of the `start` position.

Are there any user-facing changes?
No.
This PR contains a "Critical Fix". Previously:
- When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string.
- When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R.

`str_sub()` silently mishandles negative start/stop values #43960