
[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior #17887

Closed
viirya wants to merge 11 commits into apache:master from viirya:add-config-fallback-string-parsing

Conversation

@viirya (Member) commented May 7, 2017

What changes were proposed in this pull request?

A new SQL parser was introduced in Spark 2.0, and all string literals are unescaped by it. This seems to cause an issue with regex pattern strings.

The following code reproduces it:

val data = Seq("\u0020\u0021\u0023", "abc")
val df = data.toDF()

// 1st usage: works in 1.6
// Let parser parse pattern string
val rlike1 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")
// 2nd usage: works in 1.6, 2.x
// Call Column.rlike so the pattern string is a literal that doesn't go through the parser
val rlike2 = df.filter($"value".rlike("^\\x20[\\x20-\\x23]+$"))

// In 2.x, we need to add backslashes for the regex pattern to be parsed correctly
val rlike3 = df.filter("value rlike '^\\\\x20[\\\\x20-\\\\x23]+$'")

Following the discussion in #17736, this patch adds a config to fall back to the 1.6 string literal parsing and mitigate the migration issue.
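As a hedged sketch of the fallback in use (assuming the config name spark.sql.parser.escapedStringLiterals settled on during review, and an active SparkSession named spark), enabling it lets the 1st usage work again:

// Restore 1.6-style string literal parsing, so the single-backslash
// pattern from the 1st usage is accepted as-is.
spark.conf.set("spark.sql.parser.escapedStringLiterals", "true")
val rlike4 = df.filter("value rlike '^\\x20[\\x20-\\x23]+$'")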

How was this patch tested?

Jenkins tests.

Please review http://spark.apache.org/contributing.html before opening a pull request.

@viirya (Member Author) commented May 7, 2017

cc @dbtsai @cloud-fan @hvanhovell

@SparkQA commented May 7, 2017

Test build #76540 has started for PR 17887 at commit d0b2c22.

@viirya (Member Author) commented May 7, 2017

retest this please.

@SparkQA commented May 7, 2017

Test build #76542 has finished for PR 17887 at commit d0b2c22.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • class AstBuilder(conf: SQLConf) extends SqlBaseBaseVisitor[AnyRef] with Logging
  • class CatalystSqlParser(conf: SQLConf) extends AbstractSqlParser
  • class SparkSqlAstBuilder(conf: SQLConf) extends AstBuilder(conf)

@@ -196,6 +196,14 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val NO_UNESCAPED_SQL_STRING = buildConf("spark.sql.noUnescapedStringLiteral")
Member:
Double negatives are discouraged in conf naming. This also seems to be the first parser conf.

How about spark.sql.parser.escapeStringLiterals?

Member Author:
Sure.

@@ -68,6 +68,11 @@ object ParserUtils {
/** Convert a string node into a string. */
def string(node: TerminalNode): String = unescapeSQLString(node.getText)

/** Convert a string node into a string without unescaping. */
def stringWithoutUnescape(node: TerminalNode): String = {
node.getText.slice(1, node.getText.size - 1)
Member:
For safety, do we still need to check whether the starting and ending characters are quotes?

Member Author:
The string rule in SqlBase.g4 guarantees that the input always has quotes at the start and end. I may add a comment here.
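For illustration, the helper with such a comment might read (a sketch only, relying on the grammar guarantee described above):

/** Convert a string node into a string without unescaping. */
def stringWithoutUnescape(node: TerminalNode): String = {
  // The string rule in SqlBase.g4 guarantees that the input always has
  // quote characters at the start and end, so dropping the first and
  // last characters is safe.
  node.getText.slice(1, node.getText.size - 1)
}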

.internal()
.doc("Since Spark 2.0, we use unescaped SQL string for string literals including regex. " +
"It is different than 1.6 behavior. Enabling this config can use no unescaped SQL string " +
"literals and mitigate migration problem.")
Member:
How about

When true, string literals (including regex patterns) remains escaped in our SQL parser. The default is false since Spark 2.0. Setting it to true can restore the behavior prior to Spark 2.0.

Member Author:
Sure.

@gatorsmile (Member):
Generally, it looks reasonable to me. Also cc @jodersky who hit this issue before.

@SparkQA commented May 8, 2017

Test build #76560 has started for PR 17887 at commit ab77de7.

@SparkQA commented May 8, 2017

Test build #76559 has finished for PR 17887 at commit 8ae0747.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@viirya viirya changed the title [SPARK-20399][SQL][WIP] Add a config to fallback string literal parsing consistent with old sql parser behavior [SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior May 8, 2017
@viirya (Member Author) commented May 8, 2017

retest this please.

@@ -160,6 +166,15 @@ class ExpressionParserSuite extends PlanTest {
assertEqual("a not regexp 'pattern%'", !('a rlike "pattern%"))
}

test("like expressions with ESCAPED_STRING_LITERALS = true") {
val conf = new SQLConf()
conf.setConfString("spark.sql.parser.escapedStringLiterals", "true")
Contributor:
Use SQLConf.ESCAPED_STRING_LITERALS.key.
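That is, the setup would use the key constant rather than the raw string (a minimal sketch based on the surrounding diff):

val conf = new SQLConf()
conf.setConfString(SQLConf.ESCAPED_STRING_LITERALS.key, "true")
val parser = new CatalystSqlParser(conf)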

@@ -447,6 +462,44 @@ class ExpressionParserSuite extends PlanTest {
assertEqual("'\\u0057\\u006F\\u0072\\u006C\\u0064\\u0020\\u003A\\u0029'", "World :)")
}

test("strings with ESCAPED_STRING_LITERALS = true") {
Contributor:
We have a very similar test case, "strings"; can we merge them?

@SparkQA commented May 8, 2017

Test build #76563 has finished for PR 17887 at commit ab77de7.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -1168,6 +1169,18 @@ class DatasetSuite extends QueryTest with SharedSQLContext {
val ds = Seq(WithMapInOption(Some(Map(1 -> 1)))).toDS()
checkDataset(ds, WithMapInOption(Some(Map(1 -> 1))))
}

test("do not unescaped regex pattern string") {
Contributor:
Add the JIRA ID, and mention when we should not unescape.

@@ -413,38 +428,102 @@ class ExpressionParserSuite extends PlanTest {
}

test("strings") {
Contributor:
How about something like:

Seq(true, false).foreach { escape =>
  val conf = new SQLConf()
  conf.setConfString(SQLConf.ESCAPED_STRING_LITERALS.key, escape.toString)
  val parser = new CatalystSqlParser(conf)

  // tests that have the same result however the conf is set
  assertEqual("\"hello\"", "hello")
  ...

  // tests whose result depends on the conf
  if (escape) {
    assert(...)
    ...
  } else {
    assert(...)
    ...
  }
}

Member Author:
Sure.

@SparkQA commented May 9, 2017

Test build #76611 has finished for PR 17887 at commit 04a9fd3.

  • This patch fails SparkR unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@SparkQA commented May 10, 2017

Test build #76722 has finished for PR 17887 at commit 9ce7eb0.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@@ -196,6 +196,14 @@ object SQLConf {
.booleanConf
.createWithDefault(true)

val ESCAPED_STRING_LITERALS = buildConf("spark.sql.parser.escapedStringLiterals")
.internal()
.doc("When true, string literals (including regex patterns) remains escaped in our SQL " +
Member:
Nit: remains -> remain

Member Author:
Sure.
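With the naming and doc suggestions applied, the conf definition would read roughly as follows (a sketch assembled from the review comments above; default false per the suggested doc text):

val ESCAPED_STRING_LITERALS = buildConf("spark.sql.parser.escapedStringLiterals")
  .internal()
  .doc("When true, string literals (including regex patterns) remain escaped in our SQL " +
    "parser. The default is false since Spark 2.0. Setting it to true can restore the " +
    "behavior prior to Spark 2.0.")
  .booleanConf
  .createWithDefault(false)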

@gatorsmile (Member):
Could you update the descriptions of the affected functions, e.g. RLike? I believe @dbtsai's team is not the only one that hit this issue. It should be documented in the function description. Thanks!

@viirya (Member Author) commented May 10, 2017

@gatorsmile OK. Let me update it.

@gatorsmile (Member):
Could you also add some examples to the function descriptions? It might help users understand how to escape patterns correctly. Thanks!

@viirya (Member Author) commented May 10, 2017

OK. I thought about that too after reading the doc of RLike.

@cloud-fan (Contributor):
LGTM except for the documentation changes @gatorsmile suggested.

@SparkQA commented May 10, 2017

Test build #76735 has finished for PR 17887 at commit c81f030.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

true

See also:
Use LIKE to match with simple string pattern.
Contributor:
Shall we also update the document of LIKE?

Member Author:
I was afraid of duplicating info there. But OK, let me add a few lines to Like too.

Member Author (@viirya) commented May 10, 2017:
Ah, I think we don't need to update the doc of Like. The two special symbols % and _ are parsed in the same way as in the 1.6 parser.

Rethinking this, we still need to add info about string literal parsing...

Member Author:
I've updated the doc of Like.

@SparkQA commented May 10, 2017

Test build #76750 has finished for PR 17887 at commit e854b10.

  • This patch fails from timeout after a configured wait of `250m`.
  • This patch merges cleanly.
  • This patch adds no public classes.

regexp - a string expression. The pattern string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
For example, if the `str` parameter is "abc\td", the `regexp` can match it is:
Member (@gatorsmile) commented May 10, 2017:
For example, if the str parameter is "abc\td", the regexp can match it is: "^abc\\td$".

->

For example, to match "abc\td", a regular expression for regexp can be "^abc\\td$".

> SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'
true

There is a SQL config 'spark.sql.parser.escapedStringLiterals' can be used to fallback
Member:
can be used -> that can be used

true

There is a SQL config 'spark.sql.parser.escapedStringLiterals' can be used to fallback
to Spark 1.6 behavior regarding string literal parsing. For example, if the config is
Member:
Spark 1.6 behavior -> the Spark 1.6 behavior


There is a SQL config 'spark.sql.parser.escapedStringLiterals' can be used to fallback
to Spark 1.6 behavior regarding string literal parsing. For example, if the config is
enabled, the `regexp` can match "abc\td" is "^abc\\t$".
Member:
can match -> that can match


Examples:
> SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'
true
Member:
This is when spark.sql.parser.escapedStringLiterals is set to false.

How about moving these two examples to the same place? Then we can clearly explain the behavior differences caused by spark.sql.parser.escapedStringLiterals.

Member Author:
Ok.
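A hedged sketch of what the merged examples could look like (note the doubled backslashes in the first pair: with the default setting, the parser unescapes the literal before the regex engine sees it):

With spark.sql.parser.escapedStringLiterals=false (the default since 2.0):
> SELECT '%SystemDrive%\\Users\\John' _FUNC_ '%SystemDrive%\\\\Users.*'
true

With spark.sql.parser.escapedStringLiterals=true (the 1.6 behavior):
> SELECT '%SystemDrive%\Users\John' _FUNC_ '%SystemDrive%\\Users.*'
true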

@SparkQA commented May 11, 2017

Test build #76782 has started for PR 17887 at commit d8cd670.

@viirya (Member Author) commented May 11, 2017

retest this please.

@SparkQA commented May 11, 2017

Test build #76786 has finished for PR 17887 at commit d8cd670.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

regexp - a string expression. The pattern string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
For example, if to match "abc\td", a regular expression for `regexp` can be "^abc\\\\td$".
Contributor:
I think the example should be based on the SQL shell instead of a Java string literal; here it should be "^abc\\td$".

Member Author:
OK. Let me change it to the SQL shell string.

@SparkQA commented May 11, 2017

Test build #76814 has finished for PR 17887 at commit 8ecb2ea.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

regexp - a string expression. The pattern string should be a Java regular expression.

Since Spark 2.0, string literals (including regex patterns) are unescaped in our SQL parser.
For example, if to match "\abc", a regular expression for `regexp` can be "^\\abc$".
Member:
"if to match" has a grammar issue; you need to change it to "to match".

Member Author:
Sure.

@gatorsmile (Member):
LGTM cc @cloud-fan

@cloud-fan (Contributor):
LGTM

@SparkQA commented May 12, 2017

Test build #76837 has finished for PR 17887 at commit 375eb9c.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

asfgit pushed a commit that referenced this pull request May 12, 2017
[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior

Author: Liang-Chi Hsieh <[email protected]>

Closes #17887 from viirya/add-config-fallback-string-parsing.

(cherry picked from commit 609ba5f)
Signed-off-by: Wenchen Fan <[email protected]>
@cloud-fan (Contributor):
thanks, merging to master/2.2!

@asfgit asfgit closed this in 609ba5f May 12, 2017
@viirya (Member Author) commented May 12, 2017

Thanks @cloud-fan @gatorsmile

robert3005 pushed a commit to palantir/spark that referenced this pull request May 19, 2017
[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#17887 from viirya/add-config-fallback-string-parsing.
liyichao pushed a commit to liyichao/spark that referenced this pull request May 24, 2017
[SPARK-20399][SQL] Add a config to fallback string literal parsing consistent with old sql parser behavior

Author: Liang-Chi Hsieh <[email protected]>

Closes apache#17887 from viirya/add-config-fallback-string-parsing.
@viirya viirya deleted the add-config-fallback-string-parsing branch December 27, 2023 18:20