Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
apacheGH-43960: [R] fix
str_sub
binding to properly handle negative…
… `end` values (apache#44141) First-time contributor here, so let me know where I can improve! ### Rationale for this change The `str_sub` binding in arrow was not handling negative `end` values properly. The problem was two-fold: 1. When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string. 2. When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R. See discussion/examples at apache#43960 for details. ### What changes are included in this PR? 1. The removal of lines from `r/R/dplyr-funcs-string.R` that previously set `end`= 0 when `start < end`, which meant if the user was counting backwards from the end of the string (with a negative `end` value), an empty string would [wrongly] be returned. It appears that the case that the previous code was trying to address is already handled properly by the underlying C++ function (`utf8_slice_codeunits`). 2. Addition of lines to `r/R/dplyr-funcs-string.R` in order to account the difference in between R's inclusive `end` and C++'s exclusive `end` when `end` is negative. 3. The addition of a test (described below) to `r/tests/testthat/test-dplyr-funcs-string.R` to test for these cases. ### Are these changes tested? Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`, including one which I added (see attached commit), which explicitly tests the case where `end` is negative (-3) and less than the `start` value (1). This also tests the case where `end` < -1 and to the right of the `start` position. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** Previously: - When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string. - When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R. * GitHub Issue: apache#43960 Lead-authored-by: Stephen Coussens <[email protected]> Co-authored-by: Nic Crane <[email protected]> Signed-off-by: Nic Crane <[email protected]>
- Loading branch information