Expand LIKE simplification: cover NULL pattern/expression and constant #13260

Open
findepi wants to merge 2 commits into main from findepi/expand-like-simplification-e96eca

Conversation

findepi
Member

@findepi findepi commented Nov 5, 2024

@github-actions github-actions bot added the optimizer (Optimizer rules) and sqllogictest (SQL Logic Tests (.slt)) labels on Nov 5, 2024
@findepi findepi force-pushed the findepi/expand-like-simplification-e96eca branch from c6285bf to be26107 on November 5, 2024 13:39
@findepi
Member Author

findepi commented Nov 5, 2024

currently depends on #13259

@findepi
Member Author

findepi commented Nov 5, 2024

CI is green, so ready to review
but this will conflict with @adriangb's #13061, because this changes code structure a little bit
conflict resolution will be trivial though.
@adriangb if you want, i can do this, adding your changes to this PR

@findepi
Member Author

findepi commented Nov 5, 2024

seemed easy enough, done.

@findepi findepi force-pushed the findepi/expand-like-simplification-e96eca branch from 88d6df1 to 6c57af7 on November 5, 2024 15:14
@adriangb
Contributor

adriangb commented Nov 5, 2024

I’m happy with my changes being included in this PR :)

@findepi findepi marked this pull request as draft November 5, 2024 15:21
@findepi
Member Author

findepi commented Nov 5, 2024

draft - to be rebased after #13259 lands

still ready to review
cc @crepererum @goldmedal

Contributor

@goldmedal goldmedal left a comment

Just a rough review for now. Overall looks good to me. I will check the test cases tomorrow.

datafusion/optimizer/Cargo.toml (outdated, resolved)
@goldmedal
Contributor

> draft - to be rebased after #13259 lands
>
> still ready to review cc @crepererum @goldmedal

#13259 has been merged. 👍

findepi and others added 2 commits November 6, 2024 08:30
- cover expression known not to be null
- cover NULL pattern
- cover repeated '%%' in pattern
@findepi findepi force-pushed the findepi/expand-like-simplification-e96eca branch from e4eae46 to 520ad2b on November 6, 2024 07:31
@findepi findepi marked this pull request as ready for review November 6, 2024 07:31
@findepi
Member Author

findepi commented Nov 6, 2024

@goldmedal rebased, thanks!

Contributor

@alamb alamb left a comment

Thank you @findepi and @goldmedal

This looks like a great change to me except for the handling of %% which I am not sure about. Otherwise 👍

@@ -2987,41 +3051,41 @@ mod tests {
 })
 }

-fn like(expr: Expr, pattern: &str) -> Expr {
+fn like(expr: Expr, pattern: impl Into<Expr>) -> Expr {
Contributor

I think we could use Expr::like https://docs.rs/datafusion/latest/datafusion/logical_expr/enum.Expr.html#method.like and similar here instead of these functions

This is likely left over from when the Expr API was less expressive

let expr = not_ilike(col("c1"), lit("%"));
assert_eq!(simplify(expr), if_not_null(col("c1"), false));

// expr [NOT] [I]LIKE '%%'
Contributor

I don't understand this simplification

I thought expr LIKE '%%' means the same as expr = '%' (match the actual literal character % )

Specifically that %% is how to escape the wildcard to not be a wildcard

This test looks like it has applied the rewrite needed for expr LIKE '%' which matches any non null input

🤔

Contributor

Also, since the code seems to handle more than two %, could you add a test for patterns like

  • col LIKE '17%%' // 17%
  • col LIKE '%%five%%' // %five%
  • col LIKE '%%%%five%%%%' // %%five%%

Contributor

@goldmedal goldmedal Nov 6, 2024

> I thought expr LIKE '%%' means the same as expr = '%' (match the actual literal character % )
>
> Specifically that %% is how to escape the wildcard to not be a wildcard

I think it depends on how we define it. I assume we prefer to follow the Postgres behavior. 🤔
In Postgres, %% doesn't mean matching the actual literal '%'. The escape character is \.

test=# select '1' like '%%';
 ?column? 
----------
 t
(1 row)

test=# select '%' like '%%';
 ?column? 
----------
 t
(1 row)

test=# select '%' like '\%';
 ?column? 
----------
 t
(1 row)

test=# select '1' like '\%';
 ?column? 
----------
 f
(1 row)

I think %% is just a redundant wildcard. So, we can simplify it to be only one %.
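
As a rough illustration of the redundant-wildcard point, here is a minimal Rust sketch that collapses runs of unescaped `%` into a single `%`. The function name `collapse_percent` and its `escape` parameter are hypothetical and are not taken from this PR; the actual simplifier code may differ.

```rust
/// Collapse runs of unescaped `%` in a LIKE pattern into a single `%`.
/// Hypothetical helper for illustration only; not the function from this PR.
fn collapse_percent(pattern: &str, escape: Option<char>) -> String {
    let mut out = String::with_capacity(pattern.len());
    let mut chars = pattern.chars();
    let mut last_was_wildcard = false;
    while let Some(c) = chars.next() {
        if Some(c) == escape {
            // Copy the escape character and the character it protects verbatim,
            // so an escaped `%` is never merged with a neighbouring wildcard.
            out.push(c);
            if let Some(next) = chars.next() {
                out.push(next);
            }
            last_was_wildcard = false;
        } else if c == '%' {
            // Keep only the first `%` of a run.
            if !last_was_wildcard {
                out.push('%');
            }
            last_was_wildcard = true;
        } else {
            out.push(c);
            last_was_wildcard = false;
        }
    }
    out
}
```

Under these assumptions, `collapse_percent("%%%%five%%%%", Some('\\'))` yields `%five%`, while a pattern such as `\%%` (an escaped literal `%` followed by a wildcard) is left unchanged.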

By the way, in DuckDB, there is a similar behavior for double '%%'.

D select '1' like '%%';
┌───────────────┐
│ ('1' ~~ '%%') │
│    boolean    │
├───────────────┤
│ true          │
└───────────────┘
D select '%' like '%%';
┌───────────────┐
│ ('%' ~~ '%%') │
│    boolean    │
├───────────────┤
│ true          │
└───────────────┘

However, in DuckDB the escape character has to be set with the ESCAPE clause:

D select '%' like '\%';
┌───────────────┐
│ ('%' ~~ '\%') │
│    boolean    │
├───────────────┤
│ false         │
└───────────────┘
D select '%' like '\%' escape '\';
┌──────────────────────────────────┐
│ main.like_escape('%', '\%', '\') │
│             boolean              │
├──────────────────────────────────┤
│ true                             │
└──────────────────────────────────┘
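
The Postgres and DuckDB results above can be reproduced with a small matcher. This is only a sketch in Rust, assuming the usual LIKE rules (`%` matches any run of characters, `_` matches exactly one); `like_match` and `matches` are hypothetical names, and this is not how DataFusion or either database implements LIKE. Passing `Some('\\')` models the Postgres default escape, and `None` models DuckDB without an ESCAPE clause.

```rust
/// Minimal SQL LIKE matcher. `escape`, if set, makes the next pattern
/// character literal. Illustration only; worst-case runtime is exponential.
fn like_match(value: &str, pattern: &str, escape: Option<char>) -> bool {
    let v: Vec<char> = value.chars().collect();
    let p: Vec<char> = pattern.chars().collect();
    matches(&v, &p, escape)
}

fn matches(v: &[char], p: &[char], escape: Option<char>) -> bool {
    match p.split_first() {
        // Pattern exhausted: match only if the value is exhausted too.
        None => v.is_empty(),
        // Escape character followed by something: that character is literal.
        Some((&c, rest)) if Some(c) == escape && !rest.is_empty() => {
            v.first() == Some(&rest[0]) && matches(&v[1..], &rest[1..], escape)
        }
        // `%`: try every possible length for the wildcard, including zero.
        Some((&'%', rest)) => (0..=v.len()).any(|skip| matches(&v[skip..], rest, escape)),
        // `_`: consume exactly one character.
        Some((&'_', rest)) => !v.is_empty() && matches(&v[1..], rest, escape),
        // Any other character (including a trailing escape) matches itself.
        Some((&c, rest)) => v.first() == Some(&c) && matches(&v[1..], rest, escape),
    }
}
```

With this model, `'1' LIKE '%%'` and `'%' LIKE '%%'` both match because the second `%` simply matches the empty string, which is why the repeated wildcard can be dropped.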

Contributor

I was definitely confused -- thank you

})
})
}
Some(pattern_str)
Contributor

I am not sure about this rule (see questions below) -- maybe you could add some comments explaining what it is supposed to do.

Contributor

Suggested change
-Some(pattern_str)
+Some(pattern_str)
+// Repeated occurrences of wildcard are redundant so remove them
+// exp LIKE '%%' --> exp LIKE '%'

@@ -396,8 +396,9 @@ EXPLAIN SELECT
 FROM test;
 ----
 logical_plan
-01)Projection: test.column1_utf8view LIKE Utf8View("foo") AS like, test.column1_utf8view ILIKE Utf8View("foo") AS ilike
-02)--TableScan: test projection=[column1_utf8view]
+01)Projection: __common_expr_1 AS like, __common_expr_1 AS ilike
Contributor

Nice!

Though I think this test's meaning has now changed: it is supposed to verify cast exprs for LIKE.

Perhaps you can change the patterns to like '%foo%'

Contributor

Agreed. I did some tests. I think they still use the native StringView implementation for pattern matching.

query TT
EXPLAIN SELECT
  column1_utf8view like '%foo%' as "like",
  column1_utf8view ilike '%foo%' as "ilike"
FROM test;
----
logical_plan
01)Projection: test.column1_utf8view LIKE Utf8View("%foo%") AS like, test.column1_utf8view ILIKE Utf8View("%foo%") AS ilike
02)--TableScan: test projection=[column1_utf8view]

@alamb alamb changed the title from "Expand LIKE simplification" to "Expand LIKE simplification: cover NULL pattern/expression and constant" on Nov 6, 2024
let null = lit(ScalarValue::Utf8(None));

// expr [NOT] [I]LIKE NULL
let expr = like(col("c1"), null.clone());
Contributor

Suggested change
-let expr = like(col("c1"), null.clone());
+let expr = col("c1").like(null.clone());

As @alamb suggested (#13260 (comment)), we can refactor the tests like this.
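
For context on the NULL cases these tests exercise, the three-valued logic can be modelled with `Option<bool>`, where `None` stands for SQL NULL. This is a sketch under that assumption: `if_not_null` here only mirrors the shape of the test helper seen earlier (`if_not_null(col("c1"), false)`), and all function names are hypothetical stand-ins rather than DataFusion APIs.

```rust
/// SQL boolean with NULL modelled as `None`.
type SqlBool = Option<bool>;

/// Model of the `if_not_null(expr, value)` shape used in the tests:
/// a NULL input stays NULL, any other input yields the constant `value`.
fn if_not_null(expr: Option<&str>, value: bool) -> SqlBool {
    expr.map(|_| value)
}

/// expr LIKE NULL: the result is NULL regardless of `expr`.
fn like_null_pattern(_expr: Option<&str>) -> SqlBool {
    None
}

/// expr LIKE '%': TRUE for every non-NULL input, NULL otherwise.
fn like_any(expr: Option<&str>) -> SqlBool {
    if_not_null(expr, true)
}

/// expr NOT LIKE '%': FALSE for every non-NULL input, NULL otherwise.
fn not_like_any(expr: Option<&str>) -> SqlBool {
    if_not_null(expr, false)
}
```

In this model, an expression known not to be NULL always takes the `Some(..)` branch, so `expr LIKE '%'` can be folded to a plain constant in that case.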

@crepererum
Contributor

Since there are already two reviewers on it, I'm going to skip this PR. However, if you need my input, ping me.

Contributor

@alamb alamb left a comment

Thank you @findepi and @goldmedal

I had some small testing suggestions, but I think we can add that coverage in a follow-on PR as well.

Labels
optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt)
5 participants