Skip to content

Commit

Permalink
apacheGH-43960: [R] fix str_sub binding to properly handle negative…
Browse files Browse the repository at this point in the history
… `end` values (apache#44141)

First-time contributor here, so let me know where I can improve!

### Rationale for this change

The `str_sub` binding in arrow was not handling negative `end` values properly. The problem was two-fold: 
 1.  When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string. 
2.  When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R. 

See discussion/examples at apache#43960 for details.

### What changes are included in this PR?
1. The removal of lines from `r/R/dplyr-funcs-string.R` that previously set `end`= 0 when `start < end`, which meant if the user was counting backwards from the end of the string (with a negative `end` value), an empty string would [wrongly] be returned. It appears that the case that the previous code was trying to address is already handled properly by the underlying C++ function (`utf8_slice_codeunits`).
2. Addition of lines to `r/R/dplyr-funcs-string.R` in order to account the difference in between R's inclusive `end` and C++'s exclusive `end` when `end` is negative.
3. The addition of a test (described below) to `r/tests/testthat/test-dplyr-funcs-string.R` to test for these cases.

### Are these changes tested?

Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`, including one which I added (see attached commit), which explicitly tests the case where `end` is negative (-3) and less than the `start` value (1). This also tests the case where `end`  < -1 and to the right of the `start` position.

### Are there any user-facing changes?

No.

**This PR contains a "Critical Fix".** Previously:
- When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string. 
- When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R. 
* GitHub Issue: apache#43960

Lead-authored-by: Stephen Coussens <[email protected]>
Co-authored-by: Nic Crane <[email protected]>
Signed-off-by: Nic Crane <[email protected]>
  • Loading branch information
coussens and thisisnic authored Sep 21, 2024
1 parent b153791 commit 37f62d0
Show file tree
Hide file tree
Showing 2 changed files with 15 additions and 4 deletions.
10 changes: 6 additions & 4 deletions r/R/dplyr-funcs-string.R
Original file line number Diff line number Diff line change
Expand Up @@ -570,10 +570,12 @@ register_bindings_string_other <- function() {
end <- .Machine$integer.max
}

# An end value lower than a start value returns an empty string in
# stringr::str_sub so set end to 0 here to match this behavior
if (end < start) {
end <- 0
# strings returned by utf8_slice_codeunits are exclusive of the `end` position.
# stringr::str_sub returns strings inclusive of the `end` position, so add 1 to `end`.
# NOTE:this isn't necessary for positive values of `end`, because utf8_slice_codeunits
# is 0-based while R is 1-based, which cancels out the effect of the exclusive `end`
if (end < -1) {
end <- end + 1L
}

# subtract 1 from `start` because C++ is 0-based and R is 1-based
Expand Down
9 changes: 9 additions & 0 deletions r/tests/testthat/test-dplyr-funcs-string.R
Original file line number Diff line number Diff line change
Expand Up @@ -1178,6 +1178,15 @@ test_that("str_sub", {
collect(),
df
)
compare_dplyr_binding(
.input %>%
mutate(
y = str_sub(x, 1, -3),
y2 = stringr::str_sub(x, 1, -3)
) %>%
collect(),
df
)

expect_arrow_eval_error(
str_sub("Apache Arrow", c(1, 2), 3),
Expand Down

0 comments on commit 37f62d0

Please sign in to comment.