
[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior #17887

Closed
viirya wants to merge 11 commits into apache:master from viirya:add-config-fallback-string-parsing

Conversation

@viirya (Member) commented May 7, 2017

What changes were proposed in this pull request?

A new SQL parser was introduced in Spark 2.0, and all string literals are unescaped by it. This seems to cause an issue with regex pattern strings.

The following code reproduces it:

val data = Seq("\u0020\u0021\u0023", "abc")
val df = data.toDF()

// 1st usage: works in 1.6
// Let parser parse pattern string
val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
// 2nd usage: works in 1.6, 2.x
// Call Column.rlike so the pattern string is a literal that doesn't go through the parser
val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))

// In 2.x, we need to add backslashes for the regex pattern to be parsed correctly
val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")

Following the discussion in #17736, this patch adds a config to fall back to the 1.6 string literal parsing and mitigate the migration issue.
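As a hedged sketch of the fallback in use (assuming the config name spark.sql.parser.escapedStringLiterals settled on during review, and an active SparkSession named spark), enabling it lets the 1st usage work again:

// Restore 1.6-style string literal parsing, so the single-backslash
// pattern from the 1st usage is accepted as-is.
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
val rlike4 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")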

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@viirya (Member Author) commented May 7, 2017

cc @dbtsai @cloud-fan @hvanhovell

@SparkQA commented May 7, 2017

Test build #76540 has started for PR 17887 at commit d0b2c22.

@viirya (Member Author) commented May 7, 2017

retest this please.

@SparkQA commented May 7, 2017

Test build #76542 has finished for PR 17887 at commit d0b2c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
  • class CatalystSqlParser(conf: SQLConf) extends AbstractSqlParser
  • class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf)

@@ -196,6 +196,14 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val NO_UNESCAPED_SQL_STRING = buildConf("spark.sql.noUnescapedStringLiteral")
Member:
Double negatives are discouraged in conf naming. This also seems to be the first parser conf.

How about spark.sql.parser.escapeStringLiterals?

Member Author:
Sure.

@@ -68,6 +68,11 @@ object ParserUtils {
/** Convert a string node into a string. */
def string(node: TerminalNode): String = unescapeSQLString(node.getText)

/** Convert a string node into a string without unescaping. */
def stringWithoutUnescape(node: TerminalNode): String = {
node.getText.slice(1, node.getText.size - 1)
Member:
For safety, do we still need to check whether the starting and ending characters are quotes?

Member Author:
The string rule in SqlBase.g4 guarantees that the input always has quotes at the start and end. I may add a comment here.
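For illustration, the helper with such a comment might read (a sketch only, relying on the grammar guarantee described above):

/** Convert a string node into a string without unescaping. */
def stringWithoutUnescape(node: TerminalNode): String = {
  // The string rule in SqlBase.g4 guarantees that the input always has
  // quote characters at the start and end, so dropping the first and
  // last characters is safe.
  node.getText.slice(1, node.getText.size - 1)
}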

.internal()
.doc("Since Spark 2.0, we use unescaped SQL string for string literals including regex. " +
"It is different than 1.6 behavior. Enabling this config can use no unescaped SQL string " +
"literals and mitigate migration problem.")
Member:
How about

When true, string literals (including regex patterns) remains escaped in our SQL parser. The default is false since Spark 2.0. Setting it to true can restore the behavior prior to Spark 2.0.

Member Author:
Sure.

@gatorsmile (Member):
Generally, it looks reasonable to me. Also cc @jodersky who hit this issue before.

@SparkQA commented May 8, 2017

Test build #76560 has started for PR 17887 at commit ab77de7.

@SparkQA commented May 8, 2017

Test build #76559 has finished for PR 17887 at commit 8ae0747.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-20399][SQL][WIP] Add a config to fallback string literal parsing consistent with old sql parser behavior [SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior May 8, 2017
@viirya (Member Author) commented May 8, 2017

retest this please.

@@ -160,6 +166,15 @@ class ExpressionParserSuite extends PlanTest {
assertEqual("a not regexp 'pattern%'", !('a rlike "pattern%"))
}

test("like expressions with ESCAPED_STRING_LITERALS = true") {
val conf = new SQLConf()
conf.setConfString("spark.sql.parser.escapedStringLiterals", "true")
Contributor:
Use SQLConf.ESCAPED_STRING_LITERALS.key.
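That is, the setup would use the key constant rather than the raw string (a minimal sketch based on the surrounding diff):

val conf = new SQLConf()
conf.setConfString(SQLConf.ESCAPED_STRING_LITERALS.key, "true")
val parser = new CatalystSqlParser(conf)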

@@ -447,6 +462,44 @@ class ExpressionParserSuite extends PlanTest {
assertEqual("'\\u0057\\u006F\\u0072\\u006C\\u0064\\u0020\\u003A\\u0029'", "World :)")
}

test("strings with ESCAPED_STRING_LITERALS = true") {
Contributor:
We have a very similar test case, "strings"; can we merge them?

@SparkQA commented May 8, 2017

Test build #76563 has finished for PR 17887 at commit ab77de7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1168,6 +1169,18 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
val ds = Seq(WithMapInOption(Some(Map(1 -> 1)))).toDS()
checkDataset(ds, WithMapInOption(Some(Map(1 -> 1))))
}

test("do not unescaped regex pattern string") {
Contributor:
Add the JIRA ID, and mention when we should not unescape.

@@ -413,38 +428,102 @@ class ExpressionParserSuite extends PlanTest {
}

test("strings") {
Contributor:
How about something like:

Seq(true, false).foreach { escape =>
  val conf = new SQLConf()
  conf.setConfString(SQLConf.ESCAPED_STRING_LITERALS.key, escape.toString)
  val parser = new CatalystSqlParser(conf)

  // tests that have the same result however the conf is set
  assertEqual("\"hello\"", "hello")
  ...

  // tests whose result depends on the conf
  if (escape) {
    assert(...)
    ...
  } else {
    assert(...)
    ...
  }
}

Member Author:
Sure.

@SparkQA commented May 9, 2017

Test build #76611 has finished for PR 17887 at commit 04a9fd3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 10, 2017

Test build #76722 has finished for PR 17887 at commit 9ce7eb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -196,6 +196,14 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val ESCAPED_STRING_LITERALS = buildConf("spark.sql.parser.escapedStringLiterals")
.internal()
.doc("When true, string literals (including regex patterns) remains escaped in our SQL " +
Member:
Nit: remains -> remain

Member Author:
Sure.
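With the naming and doc suggestions applied, the conf definition would read roughly as follows (a sketch assembled from the review comments above; default false per the suggested doc text):

val ESCAPED_STRING_LITERALS = buildConf("spark.sql.parser.escapedStringLiterals")
  .internal()
  .doc("When true, string literals (including regex patterns) remain escaped in our SQL " +
    "parser. The default is false since Spark 2.0. Setting it to true can restore the " +
    "behavior prior to Spark 2.0.")
  .booleanConf
  .createWithDefault(false)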

@gatorsmile (Member):
Could you update the descriptions of the affected functions, e.g. RLike? I believe @dbtsai's team is not the only one that hit this issue. It should be documented in the function description. Thanks!

@viirya (Member Author) commented May 10, 2017

@gatorsmile OK. Let me update it.

@gatorsmile (Member):
Could you also add some examples to the function descriptions? It might help users understand how to escape patterns correctly. Thanks!

@viirya (Member Author) commented May 10, 2017

OK. I thought about that too after reading the doc of RLike.

@cloud-fan (Contributor):
LGTM except for the documentation changes @gatorsmile suggested.

@SparkQA commented May 10, 2017

Test build #76735 has finished for PR 17887 at commit c81f030.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

true

See also:
Use LIKE to match with simple string pattern.
Contributor:
Shall we also update the document of LIKE?

Member Author:
I was afraid of duplicating info there. But OK, let me add a few lines to Like too.

Member Author (@viirya) commented May 10, 2017:
Ah, I think we don't need to update the doc of Like. The two special symbols % and _ are parsed in the same way as in the 1.6 parser.

Rethinking this, we still need to add info about string literal parsing...

Member Author:
I've updated the doc of Like.

@SparkQA commented May 10, 2017

Test build #76750 has finished for PR 17887 at commit e854b10.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

regexp - a string expression. The pattern string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
For example, if the `str` parameter is "abc\td", the `regexp` can match it is:
Member (@gatorsmile) commented May 10, 2017:
For example, if the str parameter is "abc\td", the regexp can match it is: "^abc\\td$".

->

For example, to match "abc\td", a regular expression for regexp can be "^abc\\td$".

> SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'
true

There is a SQL config 'spark.sql.parser.escapedStringLiterals' can be used to fallback
Member:
can be used -> that can be used

true

There is a SQL config 'spark.sql.parser.escapedStringLiterals' can be used to fallback
to Spark 1.6 behavior regarding string literal parsing. For example, if the config is
Member:
Spark 1.6 behavior -> the Spark 1.6 behavior


There is a SQL config 'spark.sql.parser.escapedStringLiterals' can be used to fallback
to Spark 1.6 behavior regarding string literal parsing. For example, if the config is
enabled, the `regexp` can match "abc\td" is "^abc\\t$".
Member:
can match -> that can match


Examples:
> SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'
true
Member:
This is when spark.sql.parser.escapedStringLiterals is set to false.

How about moving these two examples to the same place? Then we can clearly explain the behavior differences caused by spark.sql.parser.escapedStringLiterals.

Member Author:
Ok.
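A hedged sketch of what the merged examples could look like (note the doubled backslashes in the first pair: with the default setting, the parser unescapes the literal before the regex engine sees it):

With spark.sql.parser.escapedStringLiterals=false (the default since 2.0):
> SELECT '%SystemDrive%\\Users\\John' _FUNC_ '%SystemDrive%\\\\Users.*'
true

With spark.sql.parser.escapedStringLiterals=true (the 1.6 behavior):
> SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'
true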

@SparkQA commented May 11, 2017

Test build #76782 has started for PR 17887 at commit d8cd670.

@viirya (Member Author) commented May 11, 2017

retest this please.

@SparkQA commented May 11, 2017

Test build #76786 has finished for PR 17887 at commit d8cd670.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

regexp - a string expression. The pattern string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
For example, if to match "abc\td", a regular expression for `regexp` can be "^abc\\\\td$".
Contributor:
I think the example should be based on the SQL shell instead of a Java string literal; here it should be "^abc\\td$".

Member Author:
OK. Let me change it to the SQL shell string.

@SparkQA commented May 11, 2017

Test build #76814 has finished for PR 17887 at commit 8ecb2ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

regexp - a string expression. The pattern string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
For example, if to match "\abc", a regular expression for `regexp` can be "^\\abc$".
Member:
"if to match" has a grammar issue; you need to change it to "to match".

Member Author:
Sure.

@gatorsmile (Member):
LGTM cc @cloud-fan

@cloud-fan (Contributor):
LGTM

@SparkQA commented May 12, 2017

Test build #76837 has finished for PR 17887 at commit 375eb9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 12, 2017
[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior

Author: Liang-Chi Hsieh <[email protected]>

Closes #17887 from viirya/add-config-fallback-string-parsing.

(cherry picked from commit 609ba5f)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan (Contributor):
thanks, merging to master/2.2!

@asfgit asfgit closed this in 609ba5f May 12, 2017
@viirya (Member Author) commented May 12, 2017

Thanks @cloud-fan @gatorsmile

robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#17887 from viirya/add-config-fallback-string-parsing.
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#17887 from viirya/add-config-fallback-string-parsing.
@viirya viirya deleted the add-config-fallback-string-parsing branch December 27, 2023 18:20