Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

isValidUtf8 is broken #620

Closed
sol opened this issue Oct 23, 2023 · 4 comments · Fixed by #621
Closed

isValidUtf8 is broken #620

sol opened this issue Oct 23, 2023 · 4 comments · Fixed by #621
Milestone

Comments

@sol
Copy link
Member

sol commented Oct 23, 2023

Repro:

ghci> :set -XOverloadedLists
ghci> import Data.ByteString as B
ghci> let input = [0x80, 0x00, 0x00, 0x00] :: ByteString
ghci> isValidUtf8 input
False
ghci> isValidUtf8 (B.drop 1 input) -- this should be True !!!
False
ghci> isValidUtf8 (B.drop 2 input)
True
@sol
Copy link
Member Author

sol commented Oct 23, 2023

For the minimal case isValidUtf8 (drop 1 "\128\0")

  • both is_valid_utf8_avx2 and is_valid_utf8_ssse3 are broken
  • while is_valid_utf8_sse2 and is_valid_utf8_fallback seem to work

@clyring
Copy link
Member

clyring commented Oct 23, 2023

I see the problem. Here is the relevant bit of the avx2 code:

  // 'Roll back' our pointer a little to prepare for a slow search of the rest.
  uint32_t tokens_blob = _mm256_extract_epi32(prev_input, 7);
  int8_t const *tokens = (int8_t const *)&tokens_blob;
  ptrdiff_t lookahead = 0;
  if (tokens[3] > (int8_t)0xBF) {
    lookahead = 1;
  } else if (tokens[2] > (int8_t)0xBF) {
    lookahead = 2;
  } else if (tokens[1] > (int8_t)0xBF) {
    lookahead = 3;
  }
  uint8_t const *const small_ptr = ptr - lookahead;
  size_t const small_len = remaining + lookahead;
  return is_valid_utf8_fallback(small_ptr, small_len);

When the input is too small for a 128-byte big stride, we reach this code with prev_input containing all zeros.

This lookahead fiddling is meant to ensure we do not call the fallback after just part of a multi-byte code point. But for some reason it looks for any non-continuation-byte in prev_input rather than specifically for valid first bytes of a multi-byte code point. And zero is not a continuation byte, so we end up checking if [start-1,end) is valid rather than [start,end) as intended.

I'll put up a PR soon.

@Bodigrim
Copy link
Contributor

Yeah, that's my conclusion as well (I've been looking at aarch version). Seems wrapping lookahead manipulation in if(big_strides) should be enough.

GHC team has asked for any version bumps to be included in GHC 9.4.8 to be ready by Friday: https://mail.haskell.org/pipermail/ghc-devs/2023-October/021420.html. We probably want to fix this and backport to bytestring-0.11.

@clyring if you get time to put up a PR, fill free to merge and release without my approval, I'll be AFK until next week.

clyring added a commit to clyring/bytestring that referenced this issue Oct 26, 2023
Fixes haskell#620.  We must roll back some if the last SIMD block
contains an incomplete multi-byte code point.  The old logic
for this would roll back by one even if there were zero SIMD
blocks processed, which is exactly the bug.
clyring added a commit that referenced this issue Oct 27, 2023
Fixes #620.  We must roll back some if the last SIMD block
contains an incomplete multi-byte code point.  The old logic
for this would roll back by one even if there were zero SIMD
blocks processed, which is exactly the bug.
@clyring
Copy link
Member

clyring commented Oct 27, 2023

It seems I lack the authority to circumvent the approval requirement before merging to master.

Well, I pushed to bytestring-0.11 and cut a release anyway.

clyring added a commit that referenced this issue Oct 27, 2023
Fixes #620.  We must roll back some if the last SIMD block
contains an incomplete multi-byte code point.  The old logic
for this would roll back by one even if there were zero SIMD
blocks processed, which is exactly the bug.
@clyring clyring modified the milestones: 0.12.1.0, 0.11.5.3 Jan 30, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants