fix infinite loop in replace with AI collations #2849

Merged

tanscorpio7 (Contributor) commented Aug 13, 2024

Description

ICU's usearch_next() goes into an infinite loop when the pattern to search for starts with a surrogate pair.
To work around this, we check whether the output of usearch_next() is stuck (not moving forward)
and, if so, set the offset for the next search ourselves.
The next offset is simply the character after the current one in the source string.

SRC STRING - 'abc🙂defghi🙂🙂'    PATTERN TO FIND = '🙂def'

usearch_next() gets stuck on "🙂" (idx = 3) and repeatedly returns this index.
We intervene and set the offset to "d" (idx = 4 counting characters; in UTF-16 code units "🙂" occupies offsets 3 and 4, so "d" is at offset 5),
so that usearch_next() only starts looking from this character.
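
The workaround described above can be sketched without ICU. This is a minimal illustration, not the code from this PR: `mock_search`, `mock_next`, and `count_matches` are hypothetical names, and `mock_next` only imitates the reported behaviour (it does not advance its offset after a match when the pattern starts with a surrogate). The guard compares each result with the previous one and, when they are equal, moves the offset past the character at that position.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MOCK_DONE (-1)

/* Hypothetical stand-in for a search iterator; this is NOT the ICU API. */
typedef struct {
    const uint16_t *src;
    int32_t src_len;
    const uint16_t *pat;
    int32_t pat_len;
    int32_t offset;             /* UTF-16 offset where the next search starts */
} mock_search;

static int is_surrogate(uint16_t c) { return (c & 0xF800) == 0xD800; }
static int is_lead(uint16_t c)      { return (c & 0xFC00) == 0xD800; }
static int is_trail(uint16_t c)     { return (c & 0xFC00) == 0xDC00; }

/* Return the next match at or after s->offset, or MOCK_DONE.
 * Simulated bug: if the pattern starts with a surrogate, the offset is left
 * at the match position, so the same index is returned again. */
static int32_t mock_next(mock_search *s)
{
    for (int32_t i = s->offset; i + s->pat_len <= s->src_len; i++) {
        int32_t j = 0;
        while (j < s->pat_len && s->src[i + j] == s->pat[j])
            j++;
        if (j == s->pat_len) {
            s->offset = is_surrogate(s->pat[0]) ? i : i + s->pat_len;
            return i;
        }
    }
    s->offset = s->src_len;
    return MOCK_DONE;
}

/* Count matches, stepping over the stuck position when the result repeats. */
static int32_t count_matches(const uint16_t *src, int32_t src_len,
                             const uint16_t *pat, int32_t pat_len)
{
    assert(src != NULL && pat != NULL && pat_len > 0);
    mock_search s = { src, src_len, pat, pat_len, 0 };
    int32_t prev = MOCK_DONE;
    int32_t count = 0;

    for (;;) {
        int32_t pos = mock_next(&s);
        if (pos == MOCK_DONE)
            break;
        if (pos == prev) {
            /* Stuck: restart from the character after the one at pos. */
            int32_t next = pos + 1;
            if (is_lead(src[pos]) && next < src_len && is_trail(src[next]))
                next++;         /* a surrogate pair spans two code units */
            if (next >= src_len)
                break;
            s.offset = next;
            continue;
        }
        count++;
        prev = pos;
    }
    return count;
}
```

With the source and pattern from the description, the first call reports offset 3, the second call repeats it, and the guard then restarts the search at UTF-16 offset 5 ("d").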

Issues Resolved

[BABEL-5169]

Sign Off

Signed-off-by: Tanzeel Khan [email protected]

Check List

  • Commits are signed per the DCO using --signoff

By submitting this pull request, I confirm that my contribution is under the terms of the Apache 2.0 and PostgreSQL licenses, and grant any person obtaining a copy of the contribution permission to relicense all or a portion of my contribution to the PostgreSQL License solely to contribute all or a portion of my contribution to the PostgreSQL open source project.

For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@tanscorpio7 tanscorpio7 changed the title from "break infinite loop due to usearch next" to "fix infinite loop in replace with AI collations" Aug 13, 2024
coveralls (Collaborator) commented Aug 13, 2024

Pull Request Test Coverage Report for Build 10385862370

Details

  • 20 of 22 (90.91%) changed or added relevant lines in 1 file are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.005%) to 73.679%

| Changes Missing Coverage | Covered Lines | Changed/Added Lines | % |
|---|---|---|---|
| contrib/babelfishpg_tsql/src/collation.c | 20 | 22 | 90.91% |

Totals
Change from base Build 10326440407: 0.005%
Covered Lines: 44195
Relevant Lines: 59983

💛 - Coveralls

/* ICU bug: when the pattern starts with a surrogate pair, ICU usearch_next stops moving forward and enters an infinite loop */
if (u16_pos == pos_prev_loop)
{
    if (U16_IS_SURROGATE(src_uchar[matched_idx]) && U16_IS_SURROGATE(substr_uchar[0]) && matched_idx + 2 < src_ulen)
Contributor

Does the problem arise only when the first char is a surrogate pair? I think it is better to use U16_IS_SURROGATE(substr_uchar[u16_pos]).

Contributor Author

done

Contributor Author

Yes, it only happens when the substring to find starts with a surrogate pair.
In this case ICU doesn't seem to set the offset for the next search correctly.
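
To make "starts with a surrogate pair" concrete, here is a small sketch of how a code point outside the Basic Multilingual Plane, such as 🙂 (U+1F642), becomes two UTF-16 code units. The helper names `encode_utf16` and `is_surrogate` are illustrative and not part of this PR; `is_surrogate` applies the same bit test that ICU's `U16_IS_SURROGATE` macro uses.

```c
#include <assert.h>
#include <stdint.h>

/* Same bit test as ICU's U16_IS_SURROGATE: true for 0xD800..0xDFFF. */
static int is_surrogate(uint16_t c)
{
    return (c & 0xF800) == 0xD800;
}

/* Encode one Unicode code point as UTF-16.
 * Returns the number of code units written to out (1 or 2). */
static int encode_utf16(uint32_t cp, uint16_t out[2])
{
    /* Only valid scalar values: in range and not itself a surrogate. */
    assert(cp <= 0x10FFFF && !(cp >= 0xD800 && cp <= 0xDFFF));

    if (cp < 0x10000) {
        out[0] = (uint16_t) cp;     /* BMP character: one code unit */
        return 1;
    }
    cp -= 0x10000;
    out[0] = (uint16_t) (0xD800 + (cp >> 10));      /* lead surrogate */
    out[1] = (uint16_t) (0xDC00 + (cp & 0x3FF));    /* trail surrogate */
    return 2;
}
```

A pattern such as '🙂def' therefore begins with the code units 0xD83D 0xDE42, which is the case the reply above refers to, while a pattern starting with 'd' begins with a single non-surrogate unit.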

/* ICU bug: when the pattern starts with a surrogate pair, ICU usearch_next stops moving forward and enters an infinite loop */
if (pos == pos_prev_loop)
{
    if (U16_IS_SURROGATE(src_uchar[matched_idx]) && U16_IS_SURROGATE(from_uchar[0]) && matched_idx + 2 < src_ulen)
Contributor

Same as above

Contributor Author

done

@@ -1491,6 +1492,8 @@ pltsql_strpos_non_determinstic(text *src_text, text *substr_text, Oid collid, in
src_ulen = icu_to_uchar(&src_uchar, VARDATA_ANY(src_text), src_len_utf8);
substr_ulen = icu_to_uchar(&substr_uchar, VARDATA_ANY(substr_text), substr_len_utf8);

is_substr_starts_with_surrogate = U16_IS_SURROGATE(substr_uchar[0]);
Contributor

What if the search string is empty?

Contributor Author

An empty string never reaches pltsql_strpos_non_determinstic.
Even if it did, we would error out before reaching this point. Confirmed by setting values using GDB.

Contributor Author

We also have test cases for null and empty-string inputs in charindex, replace, and patindex.
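
For readers wondering what a defensive version of the first-unit check could look like, here is a minimal sketch. It is not what the PR does (per the replies above, empty input never reaches this code path); `starts_with_surrogate` is a hypothetical helper that simply refuses to read element 0 of an empty or missing buffer.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: true only if the UTF-16 buffer is non-empty and
 * its first code unit is a surrogate (0xD800..0xDFFF). */
static bool starts_with_surrogate(const uint16_t *s, int32_t len)
{
    assert(len >= 0);
    if (s == NULL || len == 0)
        return false;           /* nothing to inspect */
    return (s[0] & 0xF800) == 0xD800;
}
```

Whether such a guard is worth adding depends on whether callers can already guarantee non-empty input, as the author states is the case here.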


if (is_substr_starts_with_surrogate && next_char_idx < src_ulen)
{
usearch_setOffset(usearch, next_char_idx, &status);
Contributor

Do we need error checking?

Contributor Author

Done

@tanscorpio7 tanscorpio7 force-pushed the BABEL_5169 branch 2 times, most recently from 02e5141 to 2606dce August 14, 2024 08:37
Signed-off-by: Tanzeel Khan <[email protected]>
Signed-off-by: Tanzeel Khan <[email protected]>
@Deepesh125 Deepesh125 merged commit 4217dbf into babelfish-for-postgresql:BABEL_4_X_DEV Aug 14, 2024
43 checks passed
tanscorpio7 added a commit to tanscorpio7/babelfish_extensions that referenced this pull request Aug 20, 2024
…esql#2849)

ICU usearch_next() goes into infinite loop when pattern to search starts with a surrogate pair.
To get around this we check if output of usearch_next() is stuck and not proceeding forwards
and set the offset for next search ourselves.
The next offset is simply the next character after the current char in source string.

SRC STRING - 'abc🙂defghi🙂🙂'    PATTERN TO FIND = '🙂def'

usearch_next() gets stuck on "🙂" idx = 3 and repeatedly returns this index.
We will intervene and set the offset to "d" idx = 4. 
So that usearch_next only starts looking from this character.

Task: BABEL-5167
Signed-off-by: Tanzeel Khan <[email protected]>
sharathbp pushed a commit to amazon-aurora/babelfish_extensions that referenced this pull request Aug 20, 2024
jsudrik pushed a commit that referenced this pull request Aug 20, 2024
@tanscorpio7 tanscorpio7 deleted the BABEL_5169 branch October 11, 2024 14:57