-
Notifications
You must be signed in to change notification settings - Fork 92
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix infinite loop in replace with AI collations #2849
fix infinite loop in replace with AI collations #2849
Conversation
Pull Request Test Coverage Report for Build 10385862370Details
💛 - Coveralls |
/* ICU bug, When pattern start with a surrogate pair ICU usearch_next stops moving forward entering an infinite loop */ | ||
if (u16_pos == pos_prev_loop) | ||
{ | ||
if (U16_IS_SURROGATE(src_uchar[matched_idx]) && U16_IS_SURROGATE(substr_uchar[0]) && matched_idx + 2 < src_ulen) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does the problem arises only when the first char is surrogate pair? I think better to use U16_IS_SURROGATE(substr_uchar[u16_pos]).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it only happens when substring to find start with a surrogate pair.
In this case ICU doesn't seem to set the offset for the next search correctly.
/* ICU bug, When pattern start with a surrogate pair ICU usearch_next stops moving forward entering an infinite loop */ | ||
if (pos == pos_prev_loop) | ||
{ | ||
if ( U16_IS_SURROGATE(src_uchar[matched_idx]) && U16_IS_SURROGATE(from_uchar[0]) && matched_idx + 2 < src_ulen) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same as above
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
@@ -1491,6 +1492,8 @@ pltsql_strpos_non_determinstic(text *src_text, text *substr_text, Oid collid, in | |||
src_ulen = icu_to_uchar(&src_uchar, VARDATA_ANY(src_text), src_len_utf8); | |||
substr_ulen = icu_to_uchar(&substr_uchar, VARDATA_ANY(substr_text), substr_len_utf8); | |||
|
|||
is_substr_starts_with_surrogate = U16_IS_SURROGATE(substr_uchar[0]); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if search str is empty?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
empty string never reaches pltsql_strpos_non_determinstic.
Even if it did we will error out before we reach here. Confirmed by setting values using GDB.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We also have test cases for null inputs/empty string inputs in charindex, replace and patindex.
|
||
if (is_substr_starts_with_surrogate && next_char_idx < src_ulen) | ||
{ | ||
usearch_setOffset(usearch, next_char_idx, &status); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we need error checking?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Done
02e5141
to
2606dce
Compare
Signed-off-by: Tanzeel Khan <[email protected]>
2606dce
to
3589eef
Compare
Signed-off-by: Tanzeel Khan <[email protected]>
Signed-off-by: Tanzeel Khan <[email protected]>
Signed-off-by: Tanzeel Khan <[email protected]>
4217dbf
into
babelfish-for-postgresql:BABEL_4_X_DEV
…esql#2849) ICU usearch_next() goes into infinite loop when pattern to search starts with a surrogate pair. To get around this we check if output of usearch_next() is stuck and not proceeding forwards and set the offset for next search ourselves. The next offset is simply the next character after the current char in source string. SRC STRING - 'abc🙂defghi🙂🙂' PATTERN TO FIND = '🙂def' usearch_next() gets stuck on "🙂" idx = 3 and repeatedly returns this index. We will intervene and set the offset to "d" idx = 4. So that usearch_next only starts looking from this character. Taks: BABEL-5167 Signed-off-by: Tanzeel Khan <[email protected]>
…esql#2849) ICU usearch_next() goes into infinite loop when pattern to search starts with a surrogate pair. To get around this we check if output of usearch_next() is stuck and not proceeding forwards and set the offset for next search ourselves. The next offset is simply the next character after the current char in source string. SRC STRING - 'abc🙂defghi🙂🙂' PATTERN TO FIND = '🙂def' usearch_next() gets stuck on "🙂" idx = 3 and repeatedly returns this index. We will intervene and set the offset to "d" idx = 4. So that usearch_next only starts looking from this character. Taks: BABEL-5167 Signed-off-by: Tanzeel Khan <[email protected]>
ICU usearch_next() goes into infinite loop when pattern to search starts with a surrogate pair. To get around this we check if output of usearch_next() is stuck and not proceeding forwards and set the offset for next search ourselves. The next offset is simply the next character after the current char in source string. SRC STRING - 'abc🙂defghi🙂🙂' PATTERN TO FIND = '🙂def' usearch_next() gets stuck on "🙂" idx = 3 and repeatedly returns this index. We will intervene and set the offset to "d" idx = 4. So that usearch_next only starts looking from this character. Taks: BABEL-5167 Signed-off-by: Tanzeel Khan <[email protected]>
Description
ICU usearch_next() goes into infinite loop when pattern to search starts with a surrogate pair.
To get around this we check if output of usearch_next() is stuck and not proceeding forwards
and set the offset for next search ourselves.
The next offset is simply the next character after the current char in source string.
Issues Resolved
[BABEL-5169]
Sign Off
Signed-off-by: Tanzeel Khan [email protected]
Check List
By submitting this pull request, I confirm that my contribution is under the terms of the Apache 2.0 and PostgreSQL licenses, and grant any person obtaining a copy of the contribution permission to relicense all or a portion of my contribution to the PostgreSQL License solely to contribute all or a portion of my contribution to the PostgreSQL open source project.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.