[BUG] GetJsonObject does not process escape sequences in returned strings or queries #10196

Closed
revans2 opened this issue Jan 12, 2024 · 3 comments · Fixed by #10581
Labels: bug (Something isn't working), cudf_dependency (An issue or PR with this label depends on a new feature in cudf)

Comments

revans2 (Collaborator) commented Jan 12, 2024

Describe the bug
GetJsonObject on the CPU, when returning a string, processes the escape sequences below and turns them into the desired output. The GPU version does not:

  • \" => "
  • \' => '
  • \/ => /
  • \\ => \
  • \b => ASCII CHAR 0x08
  • \f => ASCII CHAR 0x0C
  • \n => ASCII CHAR 0x0A
  • \t => ASCII CHAR 0x09
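
For reference, here is a minimal sketch of that unescaping, written as a single-pass scan. This is illustrative only, not the actual Spark or cuDF implementation, and \uXXXX and \r handling are omitted:

// Illustrative sketch only: unescape a raw JSON string value per the
// table above. Unknown escapes are passed through unchanged.
def unescapeJsonString(s: String): String = {
  val sb = new StringBuilder(s.length)
  var i = 0
  while (i < s.length) {
    if (s.charAt(i) == '\\' && i + 1 < s.length) {
      s.charAt(i + 1) match {
        case '"'  => sb.append('"')
        case '\'' => sb.append('\'')
        case '/'  => sb.append('/')
        case '\\' => sb.append('\\')
        case 'b'  => sb.append('\b') // 0x08
        case 'f'  => sb.append('\f') // 0x0C
        case 'n'  => sb.append('\n') // 0x0A
        case 't'  => sb.append('\t') // 0x09
        case c    => sb.append('\\').append(c)
      }
      i += 2
    } else {
      sb.append(s.charAt(i))
      i += 1
    }
  }
  sb.toString
}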

The GPU version also does not process these escapes in the query string, or in the keys when matching a query, and it does not tolerate unescaped control characters in the keys at all.

Note that it can be a little hard to tell whether this is happening, because show in Spark adds many of the escapes back in. It is also hard to write an escape sequence without Scala itself processing it first.
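
For example, in the Scala REPL a plain string literal applies the escape before the JSON parser ever sees it, while a triple-quoted literal keeps the backslash:

scala> "t\t".getBytes("UTF-8").toList
res0: List[Byte] = List(116, 9)

scala> """t\t""".getBytes("UTF-8").toList
res1: List[Byte] = List(116, 92, 116)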

Steps/Code to reproduce bug
For escape sequences in the query and in the JSON keys it must match:

scala> Seq("""{"t\t":"t"}""", "{'t\t':'t'}").toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t\t') as t1", "get_json_object(jsonstr,'$.t\\t') as t2").show()

+----+----+
|  t1|  t2|
+----+----+
|null|null|
|null|null|
+----+----+


scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> Seq("""{"t\t":"t"}""", "{'t\t':'t'}").toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t\t') as t1", "get_json_object(jsonstr,'$.t\\t') as t2").show()
+---+---+
| t1| t2|
+---+---+
|  t|  t|
|  t|  t|
+---+---+

For escape sequences in returned values:

scala> val data = Seq("""{"t":"\""}""","""{"t":'\"'}""","""{"t":"\'"}""","""{"t":'\''}""","""{"t":"\/"}""","""{"t":"\\"}""","""{"t":"\b"}""","""{"t":"\f"}""","""{"t":"\n"}""","""{"t":"\t"}""")

scala> data.toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t') as t").collect.foreach{ row => System.err.println(Option(row.getString(0)).map(s => s.getBytes("UTF8").toList))}

Some(List(92, 34))
None
None
None
Some(List(92, 47))
Some(List(92, 92))
Some(List(92, 98))
Some(List(92, 102))
Some(List(92, 110))
Some(List(92, 116))

scala> spark.conf.set("spark.rapids.sql.enabled", false)

scala> data.toDF("jsonstr").repartition(1).selectExpr("get_json_object(jsonstr,'$.t') as t").collect.foreach{ row => System.err.println(Option(row.getString(0)).map(s => s.getBytes("UTF8").toList))}
Some(List(34))
Some(List(34))
Some(List(39))
Some(List(39))
Some(List(47))
Some(List(92))
Some(List(8))
Some(List(12))
Some(List(10))
Some(List(9))
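
A quick way to read those byte lists: 92 is the backslash, so in the first (GPU) run each non-null result is the raw escape sequence kept verbatim, while the CPU run returns the decoded character. For example:

scala> List(92, 116).map(_.toChar).mkString
res0: String = \t

scala> List(9).map(_.toChar).mkString.length
res1: Int = 1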

Expected behavior
We should match Spark exactly in these cases.

revans2 (Collaborator, author) commented Jan 18, 2024

Actually, I have done some more research, and Spark does not support escape sequences at all in JSON path queries; it keeps the quoted string as-is. So $['\''] is actually an invalid path, because the parser only scans until it sees the next single-quote character.
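
A minimal sketch of that scanning behavior (illustrative only, not Spark's actual JsonPath parser; readQuotedField is a hypothetical helper):

// Illustrative sketch: read a single-quoted field name with no escape
// processing, stopping at the first ' — the way Spark's path parser
// effectively behaves. Returns the name and the index past the quote.
def readQuotedField(path: String, start: Int): Option[(String, Int)] = {
  val end = path.indexOf('\'', start)
  if (end < 0) None // unterminated quote: the path is invalid
  else Some((path.substring(start, end), end + 1))
}

// For $['\''] the scan stops at the quote right after the backslash,
// yielding the field "\" and leaving "']" unconsumed, so the whole
// path is rejected as invalid.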

sameerz (Collaborator) commented Jan 20, 2024

Is this similar to #9033?

@sameerz added the cudf_dependency label Jan 23, 2024
@mattahrens removed the "? - Needs Triage" label Jan 30, 2024
res-life (Collaborator) commented

Will be fixed by PR: NVIDIA/spark-rapids-jni#1868
