From 37f62d0bc5f4d22e7194947963b445225b984558 Mon Sep 17 00:00:00 2001 From: Stephen Coussens Date: Sat, 21 Sep 2024 03:05:10 -0700 Subject: [PATCH] GH-43960: [R] fix `str_sub` binding to properly handle negative `end` values (#44141) First-time contributor here, so let me know where I can improve! ### Rationale for this change The `str_sub` binding in arrow was not handling negative `end` values properly. The problem was two-fold: 1. When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string. 2. When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R. See discussion/examples at https://github.com/apache/arrow/issues/43960 for details. ### What changes are included in this PR? 1. The removal of lines from `r/R/dplyr-funcs-string.R` that previously set `end`= 0 when `start < end`, which meant if the user was counting backwards from the end of the string (with a negative `end` value), an empty string would [wrongly] be returned. It appears that the case that the previous code was trying to address is already handled properly by the underlying C++ function (`utf8_slice_codeunits`). 2. Addition of lines to `r/R/dplyr-funcs-string.R` in order to account the difference in between R's inclusive `end` and C++'s exclusive `end` when `end` is negative. 3. The addition of a test (described below) to `r/tests/testthat/test-dplyr-funcs-string.R` to test for these cases. ### Are these changes tested? Yes, I ran all tests in `r/tests/testthat/test-dplyr-funcs-string.R`, including one which I added (see attached commit), which explicitly tests the case where `end` is negative (-3) and less than the `start` value (1). This also tests the case where `end` < -1 and to the right of the `start` position. ### Are there any user-facing changes? No. **This PR contains a "Critical Fix".** Previously: - When `end` values were negative (and less than the `start` value, which might be positive), `str_sub` would improperly return an empty string. - When `end` values were < -1 but the `end` position was still to the right of the `start` position, `str_sub` failed to return the final character in the substring, since it did not account for the fact that `end` is counted exclusively in the underlying C++ function (`utf8_slice_codeunits`), but inclusively in R. * GitHub Issue: #43960 Lead-authored-by: Stephen Coussens Co-authored-by: Nic Crane Signed-off-by: Nic Crane --- r/R/dplyr-funcs-string.R | 10 ++++++---- r/tests/testthat/test-dplyr-funcs-string.R | 9 +++++++++ 2 files changed, 15 insertions(+), 4 deletions(-) diff --git a/r/R/dplyr-funcs-string.R b/r/R/dplyr-funcs-string.R index 77e1a5405a692..28db78f609309 100644 --- a/r/R/dplyr-funcs-string.R +++ b/r/R/dplyr-funcs-string.R @@ -570,10 +570,12 @@ register_bindings_string_other <- function() { end <- .Machine$integer.max } - # An end value lower than a start value returns an empty string in - # stringr::str_sub so set end to 0 here to match this behavior - if (end < start) { - end <- 0 + # strings returned by utf8_slice_codeunits are exclusive of the `end` position. + # stringr::str_sub returns strings inclusive of the `end` position, so add 1 to `end`. + # NOTE:this isn't necessary for positive values of `end`, because utf8_slice_codeunits + # is 0-based while R is 1-based, which cancels out the effect of the exclusive `end` + if (end < -1) { + end <- end + 1L } # subtract 1 from `start` because C++ is 0-based and R is 1-based diff --git a/r/tests/testthat/test-dplyr-funcs-string.R b/r/tests/testthat/test-dplyr-funcs-string.R index cb1d4675058b6..86966b305368a 100644 --- a/r/tests/testthat/test-dplyr-funcs-string.R +++ b/r/tests/testthat/test-dplyr-funcs-string.R @@ -1178,6 +1178,15 @@ test_that("str_sub", { collect(), df ) + compare_dplyr_binding( + .input %>% + mutate( + y = str_sub(x, 1, -3), + y2 = stringr::str_sub(x, 1, -3) + ) %>% + collect(), + df + ) expect_arrow_eval_error( str_sub("Apache Arrow", c(1, 2), 3),