-
Notifications
You must be signed in to change notification settings - Fork 3.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
GH-40968: [C++][Gandiva] add RE2::Options set_dot_nl(true) for Like function #40970
Conversation
|
@kou could you plz review? |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Result<std::shared_ptr<LikeHolder>> LikeHolder::Make(const std::string& sql_pattern) { | ||
std::string pcre_pattern; | ||
ARROW_RETURN_NOT_OK(RegexUtil::SqlLikePatternToPcre(sql_pattern, pcre_pattern)); | ||
|
||
auto lholder = std::shared_ptr<LikeHolder>(new LikeHolder(pcre_pattern)); | ||
ARROW_RETURN_IF(!lholder->regex_.ok(), | ||
Status::Invalid("Building RE2 pattern '", pcre_pattern, | ||
"' failed with: ", lholder->regex_.error())); | ||
|
||
return lholder; | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we need to remove this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
because it's not used anymore
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems that this is an exported API. If we remove this, we break backward compatibility. Is it expected?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good point. Restored
@@ -99,13 +99,14 @@ Result<std::shared_ptr<LikeHolder>> LikeHolder::Make(const FunctionNode& node) { | |||
"'like' function requires a string literal as the second parameter")); | |||
|
|||
RE2::Options regex_op; | |||
regex_op.set_dot_nl(true); // set dotall mode for the regex. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This breaks backward compatibility, right?
Can we keep backward compatibility?
Could you rebase on main to fix CI failures? |
Could you please clarify the example? There are no '\n' in the target string. Because the pattern '%Space1.%' does match the string you give. |
I do believe it is kind of a "bug" in Gandiva, a minimal reproducing example is |
@@ -46,6 +46,7 @@ class TestLikeHolder : public ::testing::Test { | |||
}; | |||
|
|||
TEST_F(TestLikeHolder, TestMatchAny) { | |||
regex_op.set_dot_nl(true); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this set_dot_nl be moved to the constructor, i.e. enable it for all test cases?
EXPECT_OK_AND_ASSIGN(auto const like_holder, LikeHolder::Make(".*ab_", regex_op)); | ||
|
||
auto& like = *like_holder; | ||
EXPECT_TRUE(like(".*abc")); // . and * aren't special in sql regex | ||
EXPECT_FALSE(like("xxabc")); | ||
} | ||
|
||
TEST_F(TestLikeHolder, TestPcreSpecialWithNewLine) { | ||
regex_op.set_dot_nl(true); | ||
EXPECT_OK_AND_ASSIGN(auto const like_holder, LikeHolder::Make("%Space1.%", regex_op)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you also add a simpler test case, e.g. 'abc\nd' LIKE '%abc%' is enough to demonstate this change.
Oh. OK. Let's "fix" this without backward compatibility. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Gandiva function "LIKE" does not always work correctly when the string contains \n.
String value:
[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]
Pattern '%Space1%' match string, but '%Space1.%' - not.
Is this description correct? I think that %Space1%
doesn't match the string because the string includes \n
.
@kou yes, this is how we found a problem. RE2 documentation said that '.' any character, possibly including newline. If in pattern we don't use any special symbols - it work. But with special symbols and with \n in string - not. But flag handle this case. |
kou means that
is not accurate. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1
Besides the LIKE
keyword, I tested postgres 16's regexp_like
function, which matches the new line character by default as well
SELECT regexp_like(E'hello\nfoo', 'hello.*'); ==> true
@kou Could you merge this, plz? |
We use the description for commit message. So I want to use correct description. |
@kou it's already fixed |
Thanks. |
…Like function (apache#40970) ### Rationale for this change Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. ### What changes are included in this PR? added flag set_dot_nl(true) to LikeHolder ### Are these changes tested? add unit tests. ### Are there any user-facing changes? Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <[email protected]> Co-authored-by: Ivan Chesnov <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…Like function (apache#40970) (#68) ### Rationale for this change Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. ### What changes are included in this PR? added flag set_dot_nl(true) to LikeHolder ### Are these changes tested? add unit tests. ### Are there any user-facing changes? Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0affccc. There were no benchmark performance regressions. 🎉 The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them. |
…Like function (apache#40970) ### Rationale for this change Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. ### What changes are included in this PR? added flag set_dot_nl(true) to LikeHolder ### Are these changes tested? add unit tests. ### Are there any user-facing changes? Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <[email protected]> Co-authored-by: Ivan Chesnov <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…Like function (apache#40970) ### Rationale for this change Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. ### What changes are included in this PR? added flag set_dot_nl(true) to LikeHolder ### Are these changes tested? add unit tests. ### Are there any user-facing changes? Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <[email protected]> Co-authored-by: Ivan Chesnov <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…Like function (apache#40970) ### Rationale for this change Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. ### What changes are included in this PR? added flag set_dot_nl(true) to LikeHolder ### Are these changes tested? add unit tests. ### Are there any user-facing changes? Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <[email protected]> Co-authored-by: Ivan Chesnov <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…Like function (apache#40970) (dremio#68) Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. added flag set_dot_nl(true) to LikeHolder add unit tests. Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…Like function (apache#40970) (dremio#68) Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. added flag set_dot_nl(true) to LikeHolder add unit tests. Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
…Like f (#80) * apacheGH-40968: [C++][Gandiva] add RE2::Options set_dot_nl(true) for Like function (apache#40970) (#68) Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. added flag set_dot_nl(true) to LikeHolder add unit tests. Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]> * apacheGH-43119: [CI][Packaging] Update manylinux 2014 CentOS repos that have been deprecated (apache#43121) Jobs are failing to find mirrorlist.centos.org Updating repos based on solution from: apache#43119 (comment) Via archery No * GitHub Issue: apache#43119 Lead-authored-by: Raúl Cumplido <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]> --------- Signed-off-by: Sutou Kouhei <[email protected]> Signed-off-by: Raúl Cumplido <[email protected]> Co-authored-by: Raúl Cumplido <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]> Co-authored-by: Sutou Kouhei <[email protected]>
…Like function (apache#40970) (dremio#68) ### Rationale for this change Gandiva function "LIKE" does not always work correctly when the string contains \n. String value: `[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]` Pattern '%Space1%' nor '%Space1.%' do not match. ### What changes are included in this PR? added flag set_dot_nl(true) to LikeHolder ### Are these changes tested? add unit tests. ### Are there any user-facing changes? Yes **This PR includes breaking changes to public APIs.** * GitHub Issue: apache#40968 Lead-authored-by: Ivan Chesnov <[email protected]> Signed-off-by: Sutou Kouhei <[email protected]>
Rationale for this change
Gandiva function "LIKE" does not always work correctly when the string contains \n.
String value:
[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]
Pattern '%Space1%' nor '%Space1.%' do not match.
What changes are included in this PR?
added flag set_dot_nl(true) to LikeHolder
Are these changes tested?
add unit tests.
Are there any user-facing changes?
Yes
This PR includes breaking changes to public APIs.