Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-43960: [R] fix str_sub binding to properly handle negative end values #44141

Merged
merged 2 commits into from
Sep 21, 2024

Conversation

coussens
Copy link
Contributor

@coussens coussens commented Sep 17, 2024

First-time contributor here, so let me know where I can improve!

Rationale for this change

The str_sub binding in arrow was not handling negative end values properly. The problem was two-fold:

  1. When end values were negative (and less than the start value, which might be positive), str_sub would improperly return an empty string.
  2. When end values were < -1 but the end position was still to the right of the start position, str_sub failed to return the final character in the substring, since it did not account for the fact that end is counted exclusively in the underlying C++ function (utf8_slice_codeunits), but inclusively in R.

See discussion/examples at #43960 for details.

What changes are included in this PR?

  1. The removal of lines from r/R/dplyr-funcs-string.R that previously set end= 0 when start < end, which meant if the user was counting backwards from the end of the string (with a negative end value), an empty string would [wrongly] be returned. It appears that the case that the previous code was trying to address is already handled properly by the underlying C++ function (utf8_slice_codeunits).
  2. Addition of lines to r/R/dplyr-funcs-string.R in order to account the difference in between R's inclusive end and C++'s exclusive end when end is negative.
  3. The addition of a test (described below) to r/tests/testthat/test-dplyr-funcs-string.R to test for these cases.

Are these changes tested?

Yes, I ran all tests in r/tests/testthat/test-dplyr-funcs-string.R, including one which I added (see attached commit), which explicitly tests the case where end is negative (-3) and less than the start value (1). This also tests the case where end < -1 and to the right of the start position.

Are there any user-facing changes?

No.

This PR contains a "Critical Fix". Previously:

  • When end values were negative (and less than the start value, which might be positive), str_sub would improperly return an empty string.
  • When end values were < -1 but the end position was still to the right of the start position, str_sub failed to return the final character in the substring, since it did not account for the fact that end is counted exclusively in the underlying C++ function (utf8_slice_codeunits), but inclusively in R.

Copy link

⚠️ GitHub issue #43960 has been automatically assigned in GitHub to PR creator.

@thisisnic
Copy link
Member

This looks great - thanks for updating this @coussens, great to see you join the Arrow dev community!

I've just approved the workflow runs - once this passes, I'm happy to merge. Feel free to ping me as I'm a bit slow checking my notifications these days!

@github-actions github-actions bot added awaiting changes Awaiting changes and removed awaiting review Awaiting review labels Sep 20, 2024
Copy link
Member

@thisisnic thisisnic left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks so much for updating this @coussens!

@thisisnic thisisnic merged commit 37f62d0 into apache:main Sep 21, 2024
11 checks passed
@github-actions github-actions bot added awaiting merge Awaiting merge and removed awaiting changes Awaiting changes labels Sep 21, 2024
Copy link

After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 37f62d0.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 4 possible false positives for unstable benchmarks that are known to sometimes produce them.

@amoeba amoeba removed the awaiting merge Awaiting merge label Oct 22, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants