[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior #17887
Conversation
Test build #76540 has started for PR 17887 at commit

retest this please.

Test build #76542 has finished for PR 17887 at commit
    @@ -196,6 +196,14 @@ object SQLConf {
            .booleanConf
            .createWithDefault(true)

          val NO_UNESCAPED_SQL_STRING = buildConf("spark.sql.noUnescapedStringLiteral")
Double negatives are discouraged in conf naming, and this looks like the first parser-specific conf. How about spark.sql.parser.escapeStringLiterals?
Sure.
    @@ -68,6 +68,11 @@ object ParserUtils {
      /** Convert a string node into a string. */
      def string(node: TerminalNode): String = unescapeSQLString(node.getText)

      /** Convert a string node into a string without unescaping. */
      def stringWithoutUnescape(node: TerminalNode): String = {
        node.getText.slice(1, node.getText.size - 1)
For safety, do we still need to check whether the starting and ending characters are quotes?
The string rule in SqlBase.g4 guarantees that the input always has quotes at the start and end. I may add a comment here.
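For reference, a minimal sketch of the helper with such a comment added (the comment wording is an assumption, not necessarily the merged text):

```scala
import org.antlr.v4.runtime.tree.TerminalNode

object ParserUtilsSketch {
  /**
   * Convert a string node into a string without unescaping.
   * The STRING rule in SqlBase.g4 guarantees the node text always starts
   * and ends with a quote character, so dropping the first and last
   * characters is safe here.
   */
  def stringWithoutUnescape(node: TerminalNode): String = {
    node.getText.slice(1, node.getText.size - 1)
  }
}
```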
            .internal()
            .doc("Since Spark 2.0, we use unescaped SQL string for string literals including regex. " +
              "It is different than 1.6 behavior. Enabling this config can use no unescaped SQL string " +
              "literals and mitigate migration problem.")
How about: "When true, string literals (including regex patterns) remains escaped in our SQL parser. The default is false since Spark 2.0. Setting it to true can restore the behavior prior to Spark 2.0."
Sure.
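Combining the rename and the reworded doc, the definition inside object SQLConf would look roughly like this (a sketch; the merged wording may differ slightly):

```scala
val ESCAPED_STRING_LITERALS = buildConf("spark.sql.parser.escapedStringLiterals")
  .internal()
  .doc("When true, string literals (including regex patterns) remain escaped in our SQL " +
    "parser. The default is false since Spark 2.0. Setting it to true can restore the " +
    "behavior prior to Spark 2.0.")
  .booleanConf
  .createWithDefault(false)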
Generally, it looks reasonable to me. Also cc @jodersky who hit this issue before.

Test build #76560 has started for PR 17887 at commit

Test build #76559 has finished for PR 17887 at commit

retest this please.
    @@ -160,6 +166,15 @@ class ExpressionParserSuite extends PlanTest {
        assertEqual("a not regexp 'pattern%'", !('a rlike "pattern%"))
      }

      test("like expressions with ESCAPED_STRING_LITERALS = true") {
        val conf = new SQLConf()
        conf.setConfString("spark.sql.parser.escapedStringLiterals", "true")
Use SQLConf.ESCAPED_STRING_LITERALS.key.
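That is, reference the constant so the test cannot drift out of sync if the conf is ever renamed again:

```scala
val conf = new SQLConf()
// Referencing the constant instead of repeating the raw string keeps the
// test tied to the actual conf definition.
conf.setConfString(SQLConf.ESCAPED_STRING_LITERALS.key, "true")
```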
    @@ -447,6 +462,44 @@ class ExpressionParserSuite extends PlanTest {
        assertEqual("'\\u0057\\u006F\\u0072\\u006C\\u0064\\u0020\\u003A\\u0029'", "World :)")
      }

      test("strings with ESCAPED_STRING_LITERALS = true") {
We have a very similar test case, strings. Can we merge them?
Test build #76563 has finished for PR 17887 at commit
    @@ -1168,6 +1169,18 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
        val ds = Seq(WithMapInOption(Some(Map(1 -> 1)))).toDS()
        checkDataset(ds, WithMapInOption(Some(Map(1 -> 1))))
      }

      test("do not unescaped regex pattern string") {
Add the JIRA id, and spell out when we should not unescape.
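A sketch of the renamed test with the JIRA id and the condition spelled out, assuming it lives in DatasetSuite with the suite's test implicits in scope (the assertions are modeled on the reproduction in the PR description; the exact merged test body may differ):

```scala
test("SPARK-20399: do not unescape regex pattern string when " +
    "spark.sql.parser.escapedStringLiterals is enabled") {
  withSQLConf(SQLConf.ESCAPED_STRING_LITERALS.key -> "true") {
    val df = Seq("\u0020\u0021\u0023", "abc").toDF()
    // With the fallback enabled, the SQL parser no longer unescapes the
    // pattern literal, so it behaves like Column.rlike with the same string.
    val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
    val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
    checkAnswer(rlike1, rlike2.collect())
  }
}
```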
    @@ -413,38 +428,102 @@ class ExpressionParserSuite extends PlanTest {
      }

      test("strings") {
How about something like:

    Seq(true, false).foreach { escape =>
      val conf = new SQLConf()
      // Note: use the loop variable here, not a hard-coded "true".
      conf.setConfString(SQLConf.ESCAPED_STRING_LITERALS.key, escape.toString)
      val parser = new CatalystSqlParser(conf)
      // tests that have the same result whatever the conf is
      assertEqual("\"hello\"", "hello")
      ...
      // tests that have different results depending on the conf
      if (escape) {
        assert(...)
        ...
      } else {
        assert(...)
        ...
      }
    }
Sure.
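For a concrete pair of assertions to drop into that skeleton (my illustration, not taken from the patch): under the default conf the parser unescapes \t into a tab character, while in escaped mode the two characters survive as-is.

```scala
import org.apache.spark.sql.catalyst.expressions.Literal

// `parser` and `escape` come from the surrounding skeleton above.
if (escape) {
  // Escaped mode keeps backslash-t as two literal characters.
  assert(parser.parseExpression("'\\t'") == Literal("\\t"))
} else {
  // Default mode since Spark 2.0 unescapes \t into a tab.
  assert(parser.parseExpression("'\\t'") == Literal("\t"))
}
```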
Test build #76611 has finished for PR 17887 at commit

Test build #76722 has finished for PR 17887 at commit
    @@ -196,6 +196,14 @@ object SQLConf {
            .booleanConf
            .createWithDefault(true)

          val ESCAPED_STRING_LITERALS = buildConf("spark.sql.parser.escapedStringLiterals")
            .internal()
            .doc("When true, string literals (including regex patterns) remains escaped in our SQL " +
Nit: remains -> remain
Sure.
Could you update the involved function descriptions? For example,

@gatorsmile OK. Let me update it.

Could you also add some examples in the function descriptions? It might help users understand how to correctly escape it. Thanks!

OK. I thought about that too after reading the doc of RLike.

LGTM except the document change @gatorsmile suggested.

Test build #76735 has finished for PR 17887 at commit
          true

        See also:
          Use LIKE to match with simple string pattern.
Shall we also update the document of LIKE?
I was afraid of duplicating info there. But OK, let me add a few lines into Like too.
Ah, I think we don't need to update the doc of Like. The two special symbols % and _ are parsed in the same way as in the 1.6 parser.

Rethinking this, we still need to add info about string literal parsing...
I've updated the doc of Like.
Test build #76750 has finished for PR 17887 at commit
        regexp - a string expression. The pattern string should be a Java regular expression.

        Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
        For example, if the `str` parameter is "abc\td", the `regexp` can match it is:
For example, if the `str` parameter is "abc\td", the `regexp` can match it is: "^abc\\td$".

->

For example, to match "abc\td", a regular expression for `regexp` can be "^abc\\td$".
          > SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'
          true

        There is a SQL config 'spark.sql.parser.escapedStringLiterals' can be used to fallback
can be used -> that can be used
          true

        There is a SQL config 'spark.sql.parser.escapedStringLiterals' can be used to fallback
        to Spark 1.6 behavior regarding string literal parsing. For example, if the config is
Spark 1.6 behavior -> the Spark 1.6 behavior
        There is a SQL config 'spark.sql.parser.escapedStringLiterals' can be used to fallback
        to Spark 1.6 behavior regarding string literal parsing. For example, if the config is
        enabled, the `regexp` can match "abc\td" is "^abc\\t$".
can match -> that can match
        Examples:
          > SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'
          true
This is when spark.sql.parser.escapedStringLiterals is set to false. How about moving these two examples to the same place? Then we can clearly explain the behavior differences caused by spark.sql.parser.escapedStringLiterals.
Ok.
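For the behavior the combined examples need to convey, here is a runnable sketch (assuming a local SparkSession; the pattern comes from this PR's reproduction rather than the doc's %SystemDrive% example):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[1]").appName("escape-demo").getOrCreate()
import spark.implicits._

val df = Seq("\u0020\u0021\u0023").toDF("value")

// Default since Spark 2.0: the parser unescapes the literal, so the regex
// needs the extra backslashes.
assert(df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'").count() == 1)

// With the fallback enabled, the Spark 1.6 style pattern works directly.
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
assert(df.filter("value rlike '^\\x20[\\x20-\\x23]+$'").count() == 1)
```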
Test build #76782 has started for PR 17887 at commit

retest this please.

Test build #76786 has finished for PR 17887 at commit
        regexp - a string expression. The pattern string should be a Java regular expression.

        Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
        For example, if to match "abc\td", a regular expression for `regexp` can be "^abc\\\\td$".
I think the example should be based on the SQL shell instead of a Java string literal; here it should be "^abc\\td$".
OK. Let me change it to a SQL shell string.
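The distinction matters because two layers of escaping are involved: the host-language string literal and the SQL string literal. A small illustration (my example, not from the patch):

```scala
// In Scala or Java source, each backslash in the SQL text must itself be
// escaped, so '^abc\\td$' (the SQL shell form) needs four backslashes here:
val sqlFromScala = "SELECT 'abc\\td' RLIKE '^abc\\\\td$'"

// Printing recovers exactly what one would type into a SQL shell:
println(sqlFromScala) // SELECT 'abc\td' RLIKE '^abc\\td$'
```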
Test build #76814 has finished for PR 17887 at commit
        regexp - a string expression. The pattern string should be a Java regular expression.

        Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
        For example, if to match "\abc", a regular expression for `regexp` can be "^\\abc$".
"if to match" has a grammar issue. You need to change it to "to match".
Sure.
LGTM cc @cloud-fan

LGTM

Test build #76837 has finished for PR 17887 at commit
[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior

Author: Liang-Chi Hsieh <[email protected]>

Closes #17887 from viirya/add-config-fallback-string-parsing.

(cherry picked from commit 609ba5f)
Signed-off-by: Wenchen Fan <[email protected]>
thanks, merging to master/2.2!

Thanks @cloud-fan @gatorsmile
What changes were proposed in this pull request?

The new SQL parser was introduced in Spark 2.0. All string literals are unescaped in the parser, which brings an issue regarding regex pattern strings.

The following code can reproduce it:

    val data = Seq("\u0020\u0021\u0023", "abc")
    val df = data.toDF()

    // 1st usage: works in 1.6
    // Let the parser parse the pattern string
    val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
    // 2nd usage: works in 1.6, 2.x
    // Call Column.rlike so the pattern string is a literal which doesn't go through the parser
    val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))
    // In 2.x, we need to add backslashes to make the regex pattern parsed correctly
    val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")

Following the discussion in #17736, this patch adds a config to fall back to 1.6 string literal parsing and mitigate the migration issue.

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.