isValidUtf8 is broken #620
For the minimal case
I see the problem. Here is the relevant bit of the
// 'Roll back' our pointer a little to prepare for a slow search of the rest.
uint32_t tokens_blob = _mm256_extract_epi32(prev_input, 7);
int8_t const *tokens = (int8_t const *)&tokens_blob;
ptrdiff_t lookahead = 0;
if (tokens[3] > (int8_t)0xBF) {
  lookahead = 1;
} else if (tokens[2] > (int8_t)0xBF) {
  lookahead = 2;
} else if (tokens[1] > (int8_t)0xBF) {
  lookahead = 3;
}
uint8_t const *const small_ptr = ptr - lookahead;
size_t const small_len = remaining + lookahead;
return is_valid_utf8_fallback(small_ptr, small_len);
When the input is too small for a 128-byte big stride, we reach this code with This I'll put up a PR soon.
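For anyone following along, the signed comparison can be looked at in isolation. This is only a standalone sketch, not code from the library: the helper name `lookahead_for` is hypothetical, and `last4` stands in for the four bytes extracted from `prev_input`. It assumes those bytes are all zero when no SIMD block has been processed, which is an assumption on my part rather than something shown in the snippet above.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper mirroring the quoted lookahead computation.
   `last4` stands in for the last four bytes of the previous SIMD block. */
static ptrdiff_t lookahead_for(const uint8_t last4[4]) {
  int8_t tokens[4];
  memcpy(tokens, last4, sizeof tokens); /* reinterpret the bytes as signed */

  /* As int8_t, continuation bytes 0x80..0xBF are <= (int8_t)0xBF (-65),
     while ASCII bytes and leading bytes compare greater. */
  if (tokens[3] > (int8_t)0xBF) return 1;
  if (tokens[2] > (int8_t)0xBF) return 2;
  if (tokens[1] > (int8_t)0xBF) return 3;
  return 0;
}
```

Under that assumption, an all-zero `last4` takes the first branch, because 0 is greater than -65. The result is a rollback of one byte even though nothing was consumed, so `ptr - lookahead` would point before the start of the input.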
Yeah, that's my conclusion as well (I've been looking at aarch version). Seems wrapping GHC team has asked for any version bumps to be included in GHC 9.4.8 to be ready by Friday: https://mail.haskell.org/pipermail/ghc-devs/2023-October/021420.html. We probably want to fix this and backport to @clyring if you get time to put up a PR, feel free to merge and release without my approval, I'll be AFK until next week.
Fixes haskell#620. We must roll back some if the last SIMD block contains an incomplete multi-byte code point. The old logic for this would roll back by one even if there were zero SIMD blocks processed, which is exactly the bug.
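A minimal sketch of a guarded rollback, written as portable C without intrinsics. This is not the code from the PR: the function name `rollback_bytes` and its parameters are hypothetical, and it reads the trailing bytes from the buffer instead of from a SIMD register. The point is only that the rollback is skipped entirely when zero bytes have been processed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: how many bytes to step back from `buf + processed`
   before handing the tail to a scalar validator. `processed` is the number
   of bytes already consumed by the fast path. */
static size_t rollback_bytes(const uint8_t *buf, size_t processed) {
  assert(buf != NULL || processed == 0);

  /* Nothing consumed yet: there is no previous block to inspect, and
     stepping back would leave the buffer. */
  if (processed == 0) return 0;

  /* Never look further back than what was actually consumed. */
  size_t max_back = processed < 3 ? processed : 3;
  for (size_t back = 1; back <= max_back; back++) {
    /* A byte that is not a continuation byte (10xxxxxx) may start a code
       point that runs past the block, so restart the slow path there. */
    if ((buf[processed - back] & 0xC0) != 0x80) return back;
  }
  return 0; /* the last three bytes are all continuation bytes */
}
```

With this shape, a caller would pass `buf + processed - rollback_bytes(buf, processed)` and the correspondingly extended length to the fallback validator, which matches the `small_ptr` / `small_len` arithmetic in the snippet quoted earlier in the thread.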
It seems I lack the authority to circumvent the approval requirement before merging to Well, I pushed to
Repro: