GH-40968: [C++][Gandiva] add RE2::Options set_dot_nl(true) for Like function #40970

Merged (6 commits) on Apr 12, 2024

xxlaykxx (Contributor) commented Apr 3, 2024

Rationale for this change

The Gandiva "LIKE" function does not always work correctly when the string contains \n.
String value:
[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]
Neither pattern '%Space1%' nor '%Space1.%' matches it.

What changes are included in this PR?

Added the RE2 option set_dot_nl(true) to LikeHolder, so that '.' in the generated regex also matches a newline.

Are these changes tested?

Yes, unit tests are added.

Are there any user-facing changes?

Yes

This PR includes breaking changes to public APIs.

github-actions bot commented Apr 3, 2024

⚠️ GitHub issue #40968 has been automatically assigned in GitHub to PR creator.

xxlaykxx changed the title from "GH-40968: [C++][Gandiva] add RE2::Options set_dot_nl(true) for Like option" to "GH-40968: [C++][Gandiva] add RE2::Options set_dot_nl(true) for Like function" on Apr 4, 2024
xxlaykxx (Contributor Author) commented Apr 8, 2024

@kou could you please review?

kou (Member) left a comment

@niyue @js8544 What do you think about this?

Comment on lines 125 to 135
Result<std::shared_ptr<LikeHolder>> LikeHolder::Make(const std::string& sql_pattern) {
  std::string pcre_pattern;
  ARROW_RETURN_NOT_OK(RegexUtil::SqlLikePatternToPcre(sql_pattern, pcre_pattern));

  auto lholder = std::shared_ptr<LikeHolder>(new LikeHolder(pcre_pattern));
  ARROW_RETURN_IF(!lholder->regex_.ok(),
                  Status::Invalid("Building RE2 pattern '", pcre_pattern,
                                  "' failed with: ", lholder->regex_.error()));

  return lholder;
}

Member

Why do we need to remove this?

Contributor Author

because it's not used anymore

Member

It seems that this is an exported API. If we remove this, we break backward compatibility. Is it expected?

Contributor Author

Good point. Restored

@@ -99,13 +99,14 @@ Result<std::shared_ptr<LikeHolder>> LikeHolder::Make(const FunctionNode& node) {
      "'like' function requires a string literal as the second parameter"));

  RE2::Options regex_op;
  regex_op.set_dot_nl(true);  // set dotall mode for the regex.

Member

This breaks backward compatibility, right?
Can we keep backward compatibility?

kou (Member) commented Apr 8, 2024

Could you rebase on main to fix CI failures?

github-actions bot added the "awaiting changes" label and removed the "awaiting review" label on Apr 8, 2024

js8544 (Collaborator) commented Apr 9, 2024

> Gandiva function "LIKE" does not always work correctly when the string contains \n.
> String value:
> [function_name: "Space1.protect" args: "passenger_count" column_name: "passenger_count" ]
> Pattern '%Space1%' match string, but '%Space1.%' - not.

Could you please clarify the example? There is no '\n' in the target string as shown, and the pattern '%Space1.%' does match the string you gave.

js8544 (Collaborator) commented Apr 9, 2024

I do believe it is kind of a "bug" in Gandiva. A minimal reproducing example: 'abc\nd' LIKE '%abc%' returns false. I checked with mainstream databases like Postgres, Redshift and Snowflake, and they all return true for this case. But I'm not sure whether this is worth a breaking change.
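
The minimal reproduction above can be sketched outside Gandiva. The block below is only an illustration under stated assumptions: it uses std::regex (ECMAScript grammar, where '.' does not match '\n') as a stand-in for RE2, and the helpers LikeToRegex and Like are hypothetical names, not Gandiva's RegexUtil or LikeHolder. Passing dot_nl = true swaps '.' for [\s\S] to imitate the effect of RE2's set_dot_nl(true).

```cpp
#include <cassert>
#include <regex>
#include <string>

// Hypothetical helper (not Gandiva's RegexUtil): translate a SQL LIKE pattern
// into an ECMAScript regex. '%' becomes "any run of characters" and '_' becomes
// "any single character"; every regex metacharacter is escaped.
// When dot_nl is true the wildcards use [\s\S], which also matches '\n',
// imitating what RE2's set_dot_nl(true) does for '.'.
std::string LikeToRegex(const std::string& like, bool dot_nl) {
  const std::string any = dot_nl ? "[\\s\\S]" : ".";
  const std::string specials = "\\^$.|?*+()[]{}";
  std::string out;
  for (char c : like) {
    if (c == '%') {
      out += any + "*";
    } else if (c == '_') {
      out += any;
    } else {
      if (specials.find(c) != std::string::npos) out += '\\';
      out += c;
    }
  }
  return out;
}

// Full-string match: the translated pattern must cover the whole value.
bool Like(const std::string& value, const std::string& like, bool dot_nl) {
  return std::regex_match(value, std::regex(LikeToRegex(like, dot_nl)));
}
```

With this sketch, Like("abc\nd", "%abc%", false) is false because the trailing wildcard cannot step over the newline, while Like("abc\nd", "%abc%", true) is true. A pattern with no wildcards, such as 'abc\nd' itself, matches in both modes, since no '.' is generated.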

@@ -46,6 +46,7 @@ class TestLikeHolder : public ::testing::Test {
};

TEST_F(TestLikeHolder, TestMatchAny) {
  regex_op.set_dot_nl(true);

Collaborator

Can this set_dot_nl be moved to the constructor, i.e. enable it for all test cases?

  EXPECT_OK_AND_ASSIGN(auto const like_holder, LikeHolder::Make(".*ab_", regex_op));

  auto& like = *like_holder;
  EXPECT_TRUE(like(".*abc"));  // . and * aren't special in sql regex
  EXPECT_FALSE(like("xxabc"));
}

TEST_F(TestLikeHolder, TestPcreSpecialWithNewLine) {
  regex_op.set_dot_nl(true);
  EXPECT_OK_AND_ASSIGN(auto const like_holder, LikeHolder::Make("%Space1.%", regex_op));

Collaborator

Can you also add a simpler test case? E.g. 'abc\nd' LIKE '%abc%' is enough to demonstrate this change.

kou (Member) commented Apr 9, 2024

> I do believe it is kind of a "bug" in Gandiva. A minimal reproducing example: 'abc\nd' LIKE '%abc%' returns false. I checked with mainstream databases like Postgres, Redshift and Snowflake, and they all return true for this case. But I'm not sure whether this is worth a breaking change.

Oh. OK. Let's "fix" this without backward compatibility.

xxlaykxx requested review from kou and js8544 on April 9, 2024 08:58
github-actions bot added the "awaiting change review" label and removed the "awaiting changes" label on Apr 9, 2024

kou (Member) left a comment

+1

> Gandiva function "LIKE" does not always work correctly when the string contains \n.
> String value:
> [function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]
> Pattern '%Space1%' match string, but '%Space1.%' - not.

Is this description correct? I think that %Space1% doesn't match the string because the string includes \n.

github-actions bot added the "awaiting merge" label and removed the "awaiting change review" label on Apr 10, 2024

xxlaykxx (Contributor Author) commented

@kou Yes, this is how we found the problem. The RE2 documentation says that '.' matches any character, possibly including newline. If the pattern doesn't use any special symbols, it works. With special symbols in the pattern and a \n in the string, it does not. The flag handles this case.

js8544 (Collaborator) commented Apr 10, 2024

> @kou Yes, this is how we found the problem. The RE2 documentation says that '.' matches any character, possibly including newline. If the pattern doesn't use any special symbols, it works. With special symbols in the pattern and a \n in the string, it does not. The flag handles this case.

kou means that '[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]' LIKE '%Space1%' also returns false. So the statement

> Pattern '%Space1%' match string,

is not accurate.

niyue (Contributor) left a comment

+1
Besides the LIKE keyword, I tested PostgreSQL 16's regexp_like function, which matches the newline character by default as well:

SELECT regexp_like(E'hello\nfoo', 'hello.*'); ==> true
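
One caveat when comparing engines with a probe like this: an unanchored search for 'hello.*' can succeed even when '.' stops at a newline, because the prefix "hello" already matches. A full-string match is the sharper test. The sketch below illustrates the difference with std::regex (ECMAScript grammar, where '.' does not match '\n'). It is a stand-in only, not PostgreSQL or RE2, and the helper names Search and FullMatch are made up for this example.

```cpp
#include <cassert>
#include <regex>
#include <string>

// Unanchored search: succeeds as soon as some substring matches the pattern.
bool Search(const std::string& text, const std::string& pattern) {
  return std::regex_search(text, std::regex(pattern));
}

// Anchored match: the pattern must consume the entire input.
bool FullMatch(const std::string& text, const std::string& pattern) {
  return std::regex_match(text, std::regex(pattern));
}
```

Here Search("hello\nfoo", "hello.*") is true although '.' never crosses the newline, FullMatch("hello\nfoo", "hello.*") is false, and FullMatch("hello\nfoo", "hello[\\s\\S]*") is true. A LIKE predicate has to cover the whole value, so it behaves like the full-match case.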

xxlaykxx (Contributor Author) commented

@kou Could you merge this, please?

kou (Member) commented Apr 12, 2024

We use the description as the commit message, so I want it to be correct.
Could you answer my question again? I meant what js8544 described in #40970 (comment). (Thanks, js8544!)

xxlaykxx (Contributor Author) commented

@kou it's already fixed

kou merged commit 0affccc into apache:main on Apr 12, 2024
35 of 37 checks passed
kou removed the "awaiting merge" label on Apr 12, 2024

kou (Member) commented Apr 12, 2024

Thanks.
I've merged.

xxlaykxx added a commit to xxlaykxx/arrow that referenced this pull request Apr 12, 2024
…Like function (apache#40970)

### Rationale for this change

Gandiva function "LIKE" does not always work correctly when the string contains \n.
String value:
`[function_name: "Space1.protect"\nargs: "passenger_count"\ncolumn_name: "passenger_count" ]`
Pattern '%Space1%' nor '%Space1.%' do not match.

### What changes are included in this PR?

added flag set_dot_nl(true) to LikeHolder

### Are these changes tested?

add unit tests.

### Are there any user-facing changes?
Yes

**This PR includes breaking changes to public APIs.**

* GitHub Issue: apache#40968

Lead-authored-by: Ivan Chesnov <[email protected]>
Co-authored-by: Ivan Chesnov <[email protected]>
Signed-off-by: Sutou Kouhei <[email protected]>
xxlaykxx added a commit to dremio/arrow that referenced this pull request Apr 12, 2024
…Like function (apache#40970) (#68)


After merging your PR, Conbench analyzed the 4 benchmarking runs that have been run so far on merge-commit 0affccc.

There were no benchmark performance regressions. 🎉

The full Conbench report has more details. It also includes information about 5 possible false positives for unstable benchmarks that are known to sometimes produce them.

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request Apr 15, 2024
…Like function (apache#40970)

tolleybot pushed a commit to tmct/arrow that referenced this pull request May 2, 2024
…Like function (apache#40970)

vibhatha pushed a commit to vibhatha/arrow that referenced this pull request May 25, 2024
…Like function (apache#40970)

xxlaykxx added a commit to xxlaykxx/arrow that referenced this pull request Jul 10, 2024
…Like function (apache#40970) (dremio#68)

xxlaykxx added a commit to xxlaykxx/arrow that referenced this pull request Jul 11, 2024
…Like function (apache#40970) (dremio#68)

xxlaykxx added a commit to dremio/arrow that referenced this pull request Jul 12, 2024
…Like f (#80)

* apacheGH-40968: [C++][Gandiva] add RE2::Options set_dot_nl(true) for Like function (apache#40970) (#68)


* apacheGH-43119: [CI][Packaging] Update manylinux 2014 CentOS repos that have been deprecated (apache#43121)

Jobs are failing to find mirrorlist.centos.org

Updating repos based on solution from: apache#43119 (comment)

Via archery

No
* GitHub Issue: apache#43119

Lead-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>

---------

Signed-off-by: Sutou Kouhei <[email protected]>
Signed-off-by: Raúl Cumplido <[email protected]>
Co-authored-by: Raúl Cumplido <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
Co-authored-by: Sutou Kouhei <[email protected]>
lriggs pushed a commit to lriggs/arrow that referenced this pull request Sep 6, 2024
…Like function (apache#40970) (dremio#68)
