Rewrite some rlike expressions to StartsWith/Contains #10715
Conversation
Signed-off-by: Haoyang Li <[email protected]>
I am a little concerned that you are writing your own Regexp parsing code instead of reusing the existing code
Can we please go off of the existing RegexpParser instead of trying to write something new from scratch.
I updated the code to use RegexParser, please take another look. It saves me from writing a regex parser from scratch, but makes the matching logic a bit more complicated. Overall I think reusing it is better than having two parsers. Will add more tests, such as a UT, to verify that it is taking the speedup path.
Generally it looks good to me
assert_gpu_and_cpu_are_equal_collect(
    lambda spark: unary_op_df(spark, gen).selectExpr(
        'a',
        'regexp_like(a, "(abcd)(.*)")',
What about \A and \Z? Is that something that we can support with this?
\Z means "the end of the input but for the final terminator, if any" in Java, so it is not the same as endsWith. Will support \A.
done.
@@ -444,6 +444,28 @@ def test_regexp_like():
        'regexp_like(a, "a[bc]d")'),
    conf=_regexp_conf)

@pytest.mark.skipif(is_before_spark_320(), reason='regexp_like is synonym for RLike starting in Spark 3.2.0')
def test_regexp_rlike_rewrite_optimization():
    gen = mk_str_gen('[abcd]{3,6}')
Can we add in some new line characters to the generated strings? ^ and $ in some cases can match just begin and end of line, not begin and end of string.
Nice catch! The test failed in the $ case; I didn't know that $ means "end of line" in Java regex. Sadly it means we cannot support the endsWith pattern at all, because we haven't supported \w yet, so it will fall back first. (Technically we could handle this case when tagging, but I don't think we need to do that now.) I will remove the endsWith part.
I'm surprised that ^ passed this test with new line characters, because ^ means "the beginning of a line". Will do some investigation.
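The $ pitfall above can be reproduced outside of Spark. Python's re in default (non-MULTILINE) mode behaves like Java here: $ matches at the very end of the input and also just before a trailing newline, so "pattern$" is not equivalent to endsWith for strings that end in "\n". A minimal sketch:

```python
import re

s = "abcd\n"
# "$" matches just before the final newline, so the regex finds a match...
print(bool(re.search("abcd$", s)))  # True
# ...but the string does not actually end with "abcd".
print(s.endswith("abcd"))           # False
```

This is exactly why a pattern ending in $ cannot be rewritten to a plain endsWith check.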
done.
'a',
'regexp_like(a, "(abcd)(.*)")',
'regexp_like(a, "abcd(.*)")',
'regexp_like(a, "(.*)(abcd)(.*)")',
I'm not sure how likely it is to have abcd show up in the generated data for any of these queries. If we look at "a starts with abcd": we have a 25% chance of picking an a as the first char, a 25% chance of picking a b as the second, and so on. That means if we had an input pattern of [abcd]{4} we would likely have only 8 rows in the entire 2048-row data set that match, but we have {3,6}, which makes it likely that we would have no rows in the data set that match.
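The arithmetic behind that estimate can be checked with a quick sketch (assuming 2048 generated rows and a 4-character alphabet, as in the comment above):

```python
# Probability that a string drawn uniformly from [abcd]{4} starts with "abcd":
# each of the 4 positions independently matches with probability 1/4.
p_match = (1 / 4) ** 4
expected_rows = 2048 * p_match
print(expected_rows)  # 8.0 matching rows expected out of 2048
```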
updated.
Co-authored-by: Gera Shegalov <[email protected]>
build
Did a simple performance test. Data: 1,000,000 random strings, each 0-2000 characters; 30% of the strings start with "abcde"*20, 30% contain "abcde"*20, and 40% are fully random.
startsWith: (results table omitted)
contains: (results table omitted)
(Maybe the patterns in
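A hypothetical Python sketch of generating data with the shape described above (the helper names and the scaled-down row count are illustrative, not the actual benchmark harness):

```python
import random
import string

random.seed(0)
PREFIX = "abcde" * 20

def random_str(n):
    return "".join(random.choices(string.ascii_lowercase, k=n))

def make_row():
    n = random.randint(0, 2000)
    r = random.random()
    if r < 0.3:                  # 30%: starts with the literal prefix
        return PREFIX + random_str(n)
    elif r < 0.6:                # 30%: contains the literal somewhere inside
        body = random_str(n)
        pos = random.randint(0, len(body))
        return body[:pos] + PREFIX + body[pos:]
    else:                        # 40%: fully random
        return random_str(n)

rows = [make_row() for _ in range(1000)]  # scaled down from 1,000,000
```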
LGTM
build
build
Just some nits.
val (transpiledAST, _) =
  new CudfRegexTranspiler(RegexFindMode).getTranspiledAST(str.toString, None, None)
originalPattern = str.toString
val (transpiledAST, _) = new CudfRegexTranspiler(RegexFindMode)
nit: Could we have a follow on issue to figure out how to parse the regexp once, instead of multiple times?
Filed #10817, will do it in my next regex rewrite PR.
LGTM, pending Bobby's comments
Left some comments. I think the test fix is required, but I would like others to comment on whether to enable the optimization even when the regexp is disabled on GPU.
case _ => throw new IllegalStateException("Unexpected optimization type")
    }
  }

override def tagExprForGpu(): Unit = {
  GpuRegExpUtils.tagForRegExpEnabled(this)
This method can actually disable regexp on the GPU. This means that these optimizations will never kick in when regexp is disabled. I don't know if that is actually desired. You can look at GpuSplit, where we implemented transpileToSplittableString, and that codepath is not affected by the regexp enable flag.
Good point. I'm fine with either way.
I think the regex rewrite is more like an internal optimization inside the regex engine from the user's perspective: users are still writing a regex in rlike and won't be aware that a rewrite is happening, while in the split case the user would be aware that they are writing a literal string as the split delimiter.
Also, if something goes wrong when using the regex, it could be a bug in the rewrite logic, and disabling the regex config won't make it fall back correctly, especially since spark.rapids.sql.rLikeRegexRewrite.enabled is now an internal config.
I think we can keep it as is. From an end user standpoint, if I say that I want to disable regex, and we go ahead and rewrite the query to do something different it might be kind of confusing. But I think that is minor. The main thing I am worried about is if there are situations where we could convert a regular expression into a custom kernel, but the transpiler cannot support it. We are now stuck. That appears to be simple enough to do, and we can do it when we see a need for it.
@@ -444,6 +444,28 @@ def test_regexp_like():
        'regexp_like(a, "a[bc]d")'),
    conf=_regexp_conf)

@pytest.mark.skipif(is_before_spark_320(), reason='regexp_like is synonym for RLike starting in Spark 3.2.0')
def test_regexp_rlike_rewrite_optimization():
Why don't we rewrite this test using RLIKE so it runs on all Spark versions?
Good idea, updated.
build
Closes #10742
WIP
This PR rewrites RLike in some simple cases that can be replaced by GpuStartsWith/GpuEndsWith/GpuContains/GpuEqualTo. Replacing RLike with GpuContains gains about a 10% e2e speedup in a customer query. Needs further performance testing.
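For illustration only, here is a small Python sketch of the kind of rewrite this PR performs. This is not the plugin's Scala implementation; the pattern shapes handled are just the ones exercised in the tests above, and everything else falls back to the regular regex path:

```python
import re

def rewrite_rlike(pattern):
    """Map a few simple rlike pattern shapes onto cheaper string operations.

    Returns (operation, literal) when the sketch can rewrite the pattern,
    or None to signal a fallback to full regex matching.
    """
    # "abcd(.*)" or "(abcd)(.*)"  ->  startsWith("abcd")
    if m := re.fullmatch(r"\(?([A-Za-z0-9]+)\)?\(\.\*\)", pattern):
        return ("startsWith", m.group(1))
    # "(.*)(abcd)(.*)"  ->  contains("abcd")
    if m := re.fullmatch(r"\(\.\*\)\(([A-Za-z0-9]+)\)\(\.\*\)", pattern):
        return ("contains", m.group(1))
    return None

print(rewrite_rlike("abcd(.*)"))        # ('startsWith', 'abcd')
print(rewrite_rlike("(.*)(abcd)(.*)"))  # ('contains', 'abcd')
print(rewrite_rlike("a[bc]d"))          # None -> fall back to the regex engine
```

The real implementation works on the AST produced by the plugin's RegexParser rather than on the pattern string, which is what makes the matching logic discussed earlier in the review more involved.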