-
Notifications
You must be signed in to change notification settings - Fork 819
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix LIKE with escapes #6703
Fix LIKE with escapes #6703
Conversation
b151949
to
e463db6
Compare
@@ -350,7 +346,7 @@ mod tests { | |||
#[test] | |||
fn test_replace_like_wildcards_with_multiple_escape_chars() { | |||
let a_eq = "\\\\%"; | |||
let expected = "^\\\\%$"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after Rust parsing, this is \\%
input. which should match any string starting with \
so the regex should be ^\\
which looks like "r\\\\"
in Rust source code.
I.e. the value expected here wasn't quite correct
Please can you refrain from repeatedly tagging people on your PRs, we will get to your PRs when we have time, but spamming people needlessly is just rude |
Apologies if i tagged you (or anyone else) repeatedly on a single PR, especially if it was in a short period of time. |
Fix LIKE processing for patterns containing escapes - the starts_with / ends_with optimization did not correctly check for escapes when checking rest of the pattern for being literal or not - the pattern to regexp compiler incorrectly processed \ followed by a character other than % or _. In PostgreSQL '\x' pattern matches single 'x'. There are two tests - like_escape_many was generated using PostgreSQL with the code attached below for verification - like_escape is hand-picked test cases that are more interesting. Lower cardinality of hand-picked test cases allows for exercising all scalar/array vs scalar/array combinations. The below script isn't simples possible, because it was attempted to generate more test cases by adding padding. Hence e.g. is_like_without_dangling_escape. Since this is attached for reference, should be attached as-is. ```python import psycopg2 data = r""" \ \\ \\\ \\\\ a \a \\a % \% \\% %% \%% \\%% _ \_ \\_ __ \__ \\__ abc a_c a\bc a\_c %abc \%abc a\\_c% """.split('\n') data = list(dict.fromkeys(data)) conn = psycopg2.connect(host='localhost', port=5432, user='postgres', password='mysecretpassword') conn.set_session(autocommit=True) cursor = conn.cursor() for r in data: try: # PostgreSQL verifies dandling escape only sometimes cursor.execute(f"SELECT %s LIKE %s", (r, r)) is_like, = cursor.fetchone() has_dandling_escape = False pg_pattern = r except Exception as e: if 'LIKE pattern must not end with escape character' not in str(e): raise e has_dandling_escape = True pg_pattern = r + '\\' for l in data: # print() # print(' '.join(str(v) for v in (l, r, has_dandling_escape, postgres_pattern))) cursor.execute(f"SELECT %s LIKE %s", (l, pg_pattern)) is_like, = cursor.fetchone() assert type(is_like) is bool if not is_like and has_dandling_escape: pattern_without_escaped_dandling_escape = pg_pattern[:-2] cursor.execute(f"SELECT %s LIKE %s", (l, pattern_without_escaped_dandling_escape)) is_like_without_dangling_escape, = cursor.fetchone() assert type(is_like_without_dangling_escape) is bool else: is_like_without_dangling_escape = False assert '"' not in l assert '"' not in r print('(r"%s", r"%s", %s),' % ( l, r, str(is_like).lower(), # str(has_dandling_escape).lower(), # str(is_like_without_dangling_escape).lower(), )) ```
Reduce test code boilerplate and make it easier to see what are the test cases.
053bef3
to
6a7c624
Compare
Rebased to resolve conflict with #6662 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks @findepi -- I looked at the code in this PR and it looks quite good to me. I am just running the like
benchmarks now to make sure there isn't any performance regression
&& !pattern.ends_with("\\%") | ||
&& !contains_like_pattern(&pattern[..pattern.len() - 1]) | ||
{ | ||
} else if pattern.ends_with('%') && !contains_like_pattern(&pattern[..pattern.len() - 1]) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍 the contains_like_pattern check now also checks for (not) ending with %
too 👍 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yes, and for (not) ending with \
too (which is a change in that method)
Benchmark results (basically no change in my opinion):
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks (again) @findepi
* Fix LIKE with escapes Fix LIKE processing for patterns containing escapes - the starts_with / ends_with optimization did not correctly check for escapes when checking rest of the pattern for being literal or not - the pattern to regexp compiler incorrectly processed \ followed by a character other than % or _. In PostgreSQL '\x' pattern matches single 'x'. There are two tests - like_escape_many was generated using PostgreSQL with the code attached below for verification - like_escape is hand-picked test cases that are more interesting. Lower cardinality of hand-picked test cases allows for exercising all scalar/array vs scalar/array combinations. The below script isn't simples possible, because it was attempted to generate more test cases by adding padding. Hence e.g. is_like_without_dangling_escape. Since this is attached for reference, should be attached as-is. ```python import psycopg2 data = r""" \ \\ \\\ \\\\ a \a \\a % \% \\% %% \%% \\%% _ \_ \\_ __ \__ \\__ abc a_c a\bc a\_c %abc \%abc a\\_c% """.split('\n') data = list(dict.fromkeys(data)) conn = psycopg2.connect(host='localhost', port=5432, user='postgres', password='mysecretpassword') conn.set_session(autocommit=True) cursor = conn.cursor() for r in data: try: # PostgreSQL verifies dandling escape only sometimes cursor.execute(f"SELECT %s LIKE %s", (r, r)) is_like, = cursor.fetchone() has_dandling_escape = False pg_pattern = r except Exception as e: if 'LIKE pattern must not end with escape character' not in str(e): raise e has_dandling_escape = True pg_pattern = r + '\\' for l in data: # print() # print(' '.join(str(v) for v in (l, r, has_dandling_escape, postgres_pattern))) cursor.execute(f"SELECT %s LIKE %s", (l, pg_pattern)) is_like, = cursor.fetchone() assert type(is_like) is bool if not is_like and has_dandling_escape: pattern_without_escaped_dandling_escape = pg_pattern[:-2] cursor.execute(f"SELECT %s LIKE %s", (l, pattern_without_escaped_dandling_escape)) is_like_without_dangling_escape, = cursor.fetchone() assert type(is_like_without_dangling_escape) is bool else: is_like_without_dangling_escape = False assert '"' not in l assert '"' not in r print('(r"%s", r"%s", %s),' % ( l, r, str(is_like).lower(), # str(has_dandling_escape).lower(), # str(is_like_without_dangling_escape).lower(), )) ``` * Compact tests for regex_like Reduce test code boilerplate and make it easier to see what are the test cases. * Add more test cases for regex_like
Description
Fix LIKE processing for patterns containing escapes
escapes when checking rest of the pattern for being literal or not
character other than % or _. In PostgreSQL '\x' pattern matches single
'x'.
There are two tests
below for verification
Lower cardinality of hand-picked test cases allows for exercising all
scalar/array vs scalar/array combinations.
Test case generation
Show
The below script isn't simples possible, because it was attempted to
generate more test cases by adding padding. Hence e.g.
is_like_without_dangling_escape. Since this is attached for reference,
should be attached as-is.
Which issue does this PR close?
Fixes #6702
Rationale for this change
Correctness fix for LIKE.
What changes are included in this PR?
Correctness fix for LIKE.
Are there any user-facing changes?
Yes
\\
double backslash datafusion#13304