
[SPARK-48148][CORE] JSON objects should not be modified when read as STRING #46408

Closed · 9 commits

Conversation

@eric-maynard commented May 6, 2024

What changes were proposed in this pull request?

Currently, when reading a JSON like this:

```
{"a": {"b": -999.99999999999999999999999999999999995}}
```

With the schema:

```
a STRING
```

Spark will yield a result like this:

```
{"b": -1000.0}
```

Other changes, such as changes to the input string's whitespace, may also occur. In some cases, we apply scientific notation to an input floating-point number when reading it as STRING.

This applies to reading JSON files (as with `spark.read.json`) as well as the SQL expression `from_json`.
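A minimal way to reproduce this (a sketch assuming a Spark shell session with a `spark` in scope; the temp-path handling is illustrative):

```scala
import spark.implicits._

// Write the problematic JSON as raw text so the exact bytes land on disk.
val path = java.nio.file.Files.createTempDirectory("spark-48148").resolve("data").toString
Seq("""{"a": {"b": -999.99999999999999999999999999999999995}}""")
  .toDF("value")
  .write.text(path)

// Reading the nested object back as STRING re-renders the number without this fix.
spark.read.schema("a STRING").json(path).show(truncate = false)
// Without the fix: {"b":-1000.0}
// With the fix:    {"b": -999.99999999999999999999999999999999995}
```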

Why are the changes needed?

Correctness issues may occur if a field is read as a STRING and then later parsed (e.g. with `from_json`) after the contents have been modified.

Does this PR introduce any user-facing change?

Yes, when reading non-string fields from a JSON object using the STRING type, we will now extract the field exactly as it appears.

How was this patch tested?

Added a test in `JsonSuite.scala`
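A rough sketch of what such a test might look like (`withTempPath` and `checkAnswer` are Spark test-suite helpers and the suite's implicits are assumed in scope; the exact body is an assumption, and fragments of the real test appear below):

```scala
test("SPARK-48148: values are unchanged when read as string") {
  withTempPath { path =>
    val granularFloat = "-999.99999999999999999999999999999999995"
    // Write raw JSON text so the exact input bytes are under our control.
    Seq(s"""{"data": {"v": $granularFloat}}""").toDF("value")
      .write.text(path.getAbsolutePath)

    val df = spark.read.schema("data STRING").json(path.getAbsolutePath)
    // The nested object should come back exactly as written.
    checkAnswer(df, Row(s"""{"v": $granularFloat}"""))
  }
}
```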

Was this patch authored or co-authored using generative AI tooling?

No

@github-actions bot added the SQL label May 6, 2024

```scala
val df = spark.read.schema("data STRING").json(path.getAbsolutePath)

val expected = s"""{"v": ${granularFloat}}"""
```
Contributor:

Can you add more test cases for the following? (A sketch of how these might look follows the list.)

- `{"data": {"v": "abc"}}`, expected: `{"v": "abc"}`
- `{"data": {"v": "0.999"}}`, expected: `{"v": "0.999"}`
- `{"data": [1, 2, 3]}`, expected: `[1, 2, 3]`
- `{"data": }`, expected: the object as a string.
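A sketch of how those cases might be exercised with the `extractData` helper that appears later in this thread (its exact signature here is an assumption):

```scala
// Sketch only: parameter names are guessed from the snippets quoted below.
extractData("""{"data": {"v": "abc"}}""", expectedExactData = Seq("""{"v": "abc"}"""))
extractData("""{"data": {"v": "0.999"}}""", expectedExactData = Seq("""{"v": "0.999"}"""))
extractData("""{"data": [1, 2, 3]}""", expectedExactData = Seq("[1, 2, 3]"))
// The fourth case, {"data": }, is malformed JSON; see the follow-up question below.
```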

Author:

Added more tests -- can you clarify the last example and what we expect that to do? It seems like invalid JSON

```scala
Utils.tryWithResource(factory.createGenerator(writer, JsonEncoding.UTF8)) {
  generator => generator.copyCurrentStructure(parser)

val startLocation = parser.getTokenLocation
startLocation.contentReference().getRawContent match {
```
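To see why the generator-based fallback changes the text while raw-content slicing does not, here is a standalone Jackson sketch (my own illustration, not code from this PR):

```scala
import java.io.ByteArrayOutputStream
import com.fasterxml.jackson.core.{JsonEncoding, JsonFactory}

object CopyVsRaw extends App {
  val json = """{"a": {"b": -999.99999999999999999999999999999999995}}"""
  val factory = new JsonFactory()
  val parser = factory.createParser(json)
  parser.nextToken() // START_OBJECT
  parser.nextToken() // FIELD_NAME "a"
  parser.nextToken() // START_OBJECT of the nested value

  // Fallback path: copy the structure token-by-token through a generator.
  // The float is re-parsed as a double and re-rendered, e.g. {"b":-1000.0}.
  val out = new ByteArrayOutputStream()
  val generator = factory.createGenerator(out, JsonEncoding.UTF8)
  generator.copyCurrentStructure(parser)
  generator.close()
  println(out.toString("UTF-8"))
}
```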
Contributor:

Is there an existing API to get the remaining content as string? Also, would it work with multi-line JSON?

Author:

I was not able to find such an existing API -- there is `JacksonParser.getText`, but that appears to simply get the current value if it's a string value.

@eric-maynard (Author) commented May 8, 2024:

Regarding multiline JSON, I have added a test to cover this. It seems that the content reference is not a byte array when using multiline mode.

@sadikovi (Contributor) commented May 8, 2024:

cc @dongjoon-hyun @HyukjinKwon

@HyukjinKwon (Member):

```
SPARK-48148: values are unchanged when read as string *** FAILED *** (134 milliseconds)
```

Seems it fails.

```scala
  expectedExactData = Seq(s"""{"v": ${granularFloat}}""")
)
// In multiLine, we fall back to the inexact method:
extractData(
  s"""{"data": {"white":\n"space"}}""",
```
Author:

This makes `\n` no longer function as a newline here.
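For context, a small illustration of the Scala string-interpolation behaviour at play (my own sketch, not from the PR):

```scala
// The s-interpolator processes escape sequences even in triple-quoted strings,
// so \n becomes a real newline; the raw interpolator keeps it as two characters.
val interpolated = s"""{"data": {"white":\n"space"}}"""
val rawLiteral   = raw"""{"data": {"white":\n"space"}}"""

println(interpolated.contains('\n')) // true: an actual line break in the JSON
println(rawLiteral.contains('\n'))   // false: a literal backslash followed by 'n'
```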

@eric-maynard (Author):

Hey @HyukjinKwon, can you take another look and possibly re-trigger the tests? I believe multiline should be working now.

@HyukjinKwon (Member):

Merged to master.

@HyukjinKwon (Member):

BTW, you can trigger it on your own: https://github.com/eric-maynard/spark/runs/24789350525 -- I can't trigger it :-).

JacobZheng0927 pushed a commit to JacobZheng0927/spark that referenced this pull request May 11, 2024

[SPARK-48148][CORE] JSON objects should not be modified when read as STRING


Closes apache#46408 from eric-maynard/SPARK-48148.

Lead-authored-by: Eric Maynard <[email protected]>
Co-authored-by: Hyukjin Kwon <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>
@revans2 (Contributor) commented Jun 25, 2024:

Is there a reason this was only done for scan and not `from_json`/`get_json_object`? Especially `from_json`, where in the past it behaved almost identically to the JSON scan (everything except empty-row handling and newline characters).
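One way to check the `from_json` behaviour (a sketch for a Spark shell; not from this PR):

```scala
import org.apache.spark.sql.functions.{from_json, lit}
import org.apache.spark.sql.types.{StringType, StructField, StructType}

val json = """{"a": {"b": -999.99999999999999999999999999999999995}}"""
val schema = StructType(StructField("a", StringType) :: Nil)

// If from_json matched the scan path after this PR, the nested object would
// come back exactly as written rather than as {"b":-1000.0}.
spark.range(1)
  .select(from_json(lit(json), schema).as("parsed"))
  .select("parsed.a")
  .show(truncate = false)
```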

@HyukjinKwon (Member):

@eric-maynard would you mind checking if this is true? We should match the behaviour at least with `from_json`.

@revans2 (Contributor) commented Jul 11, 2024:

@eric-maynard and @HyukjinKwon any update on fixing at least `from_json`, if not also `get_json_object`? If you need me to try to debug this and put up a patch, I am happy to.

@HyukjinKwon (Member):

ping @eric-maynard

@HyukjinKwon (Member):

Sent an offline ping too.

@sandip-db (Contributor):

I did some tests and it looks like this PR doesn't address the same issue with `from_json`. JSON expressions, including `from_json`, use `CreateJacksonParser.utf8String` to create the Jackson parser, which in turn converts the `UTF8String` to an `InputStreamReader`. The latter doesn't allow copying buffers from a random offset. I tried to create the Jackson parser using the underlying byte array of the `UTF8String`, but ran into SPARK-16548 (which was fixed by #17693).
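For reference, `CreateJacksonParser.utf8String` looks roughly like this (paraphrased from the context lines in the diff below; treat it as a sketch):

```scala
import java.io.{ByteArrayInputStream, InputStreamReader}
import java.nio.charset.StandardCharsets
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}
import org.apache.spark.unsafe.types.UTF8String

def utf8String(jsonFactory: JsonFactory, record: UTF8String): JsonParser = {
  val bb = record.getByteBuffer
  assert(bb.hasArray)
  val bain = new ByteArrayInputStream(
    bb.array(), bb.arrayOffset() + bb.position(), bb.remaining())

  // The Reader wrapper hands the parser characters, not the backing bytes,
  // so the byte-offset slicing used by the file-scan path is unavailable here.
  jsonFactory.createParser(new InputStreamReader(bain, StandardCharsets.UTF_8))
}
```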

@HyukjinKwon (Member):

Let me take a look

@HyukjinKwon (Member):

Hm, I think this PR also doesn't cover multiline cases.

Another problem is that `parser.currentLocation.getByteOffset` doesn't seem to report correctly either, so we cannot really get the current byte offset...

@HyukjinKwon (Member):

This is what I tried:

```diff
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
index ba7b54fc04e8..721d611bc414 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/CreateJacksonParser.scala
@@ -28,6 +28,10 @@ import org.apache.hadoop.io.Text
 import org.apache.spark.sql.catalyst.InternalRow
 import org.apache.spark.unsafe.types.UTF8String
 
+private[sql] trait WithByteArrayInputStream {
+  val source: ByteArrayInputStream
+}
+
 object CreateJacksonParser extends Serializable {
   def string(jsonFactory: JsonFactory, record: String): JsonParser = {
     jsonFactory.createParser(record)
@@ -40,7 +44,10 @@ object CreateJacksonParser extends Serializable {
     val bain = new ByteArrayInputStream(
       bb.array(), bb.arrayOffset() + bb.position(), bb.remaining())
 
-    jsonFactory.createParser(new InputStreamReader(bain, StandardCharsets.UTF_8))
+    jsonFactory.createParser(
+      new InputStreamReader(bain, StandardCharsets.UTF_8) with WithByteArrayInputStream {
+        override val source: ByteArrayInputStream = bain
+      })
   }
 
   def text(jsonFactory: JsonFactory, record: Text): JsonParser = {
diff --git a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
index 32a1731a93d4..d98a14302d5a 100644
--- a/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
+++ b/sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/json/JacksonParser.scala
@@ -17,7 +17,7 @@
 
 package org.apache.spark.sql.catalyst.json
 
-import java.io.{ByteArrayOutputStream, CharConversionException}
+import java.io.{ByteArrayOutputStream, CharConversionException, InputStreamReader}
 import java.nio.charset.MalformedInputException
 
 import scala.collection.mutable.ArrayBuffer
@@ -332,6 +332,14 @@ class JacksonParser(
               val buffer = new Array[Byte](size)
               positionedReadable.read(startLocation.getByteOffset, buffer, 0, size)
               UTF8String.fromBytes(buffer, 0, size)
+            case inputStream: InputStreamReader with WithByteArrayInputStream =>
+              skipAhead()
+              val endLocation = parser.currentLocation.getByteOffset
+
+              val size = endLocation.toInt - (startLocation.getByteOffset.toInt)
+              val buffer = new Array[Byte](size)
+              inputStream.source.read(buffer, startLocation.getByteOffset.toInt, size)
+              UTF8String.fromBytes(buffer, 0, size)
             case _ =>
               val writer = new ByteArrayOutputStream()
               Utils.tryWithResource(factory.createGenerator(writer, JsonEncoding.UTF8)) {
diff --git a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
index a23e7f44a48d..12cc8165a39e 100644
--- a/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
+++ b/sql/catalyst/src/test/scala/org/apache/spark/sql/catalyst/expressions/JsonExpressionsSuite.scala
@@ -432,6 +432,16 @@ class JsonExpressionsSuite extends SparkFunSuite with ExpressionEvalHelper with
     )
   }
 
+  test("from_json - nested object to string") {
+    val jsonData = """{"a": {"b": -999.99999999999999999999999999999999995}}"""
+    val schema = StructType(StructField("a", StringType) :: Nil)
+    checkEvaluation(
+      JsonToStructs(schema, Map.empty, Literal(jsonData), UTC_OPT),
+      InternalRow("""{"b": -999.99999999999999999999999999999999995}""")
+    )
+  }
+
+
   test("from_json - invalid data") {
     val jsonData = """{"a" 1}"""
     val schema = StructType(StructField("a", IntegerType) :: Nil)
```

HyukjinKwon pushed a commit that referenced this pull request Dec 1, 2024
### What changes were proposed in this pull request?

#46408 attempts to set the feature flag `INCLUDE_SOURCE_IN_LOCATION` in the JSON parser and then revert the flag to its original value. The reverting code is incorrect and accidentally sets the `AUTO_CLOSE_SOURCE` feature to false. The reason is that `overrideStdFeatures(value, mask)` sets the feature flags selected by `mask` to `value`. `originalMask` is a value of 0/1; when it is 1, it selects `AUTO_CLOSE_SOURCE`, whose ordinal is 0 ([reference](https://github.com/FasterXML/jackson-core/blob/172369cc390ace0f68a5032701634bdc984c2af8/src/main/java/com/fasterxml/jackson/core/JsonParser.java#L112)). The old code doesn't revert `INCLUDE_SOURCE_IN_LOCATION` to its original value either. As a result, when the JSON parser is closed, the underlying input stream is not closed, which can lead to a memory leak.
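To make the mask arithmetic concrete, here is a small sketch against Jackson's public API (my own illustration; the Spark-internal code differs):

```scala
import com.fasterxml.jackson.core.{JsonFactory, JsonParser}

val parser: JsonParser = new JsonFactory().createParser("{}")

// AUTO_CLOSE_SOURCE is the first Feature enum constant (ordinal 0),
// so its bit mask is 1 << 0 == 1.
val autoClose = JsonParser.Feature.AUTO_CLOSE_SOURCE
val includeSource = JsonParser.Feature.INCLUDE_SOURCE_IN_LOCATION
assert(autoClose.getMask == 1)

// Buggy pattern: passing a 0/1 *value* as the mask selects bit 0, i.e.
// AUTO_CLOSE_SOURCE, and overrideStdFeatures(0, 1) force-disables it.
// Correct pattern: save the boolean, then restore using the feature's mask.
val wasEnabled = parser.isEnabled(includeSource)
parser.overrideStdFeatures(includeSource.getMask, includeSource.getMask) // enable
// ... parse ...
parser.overrideStdFeatures(
  if (wasEnabled) includeSource.getMask else 0,
  includeSource.getMask) // restore only that one feature
```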

### Why are the changes needed?

Perform the originally intended feature, and avoid memory leak.

### Does this PR introduce _any_ user-facing change?

No.

### How was this patch tested?

New unit test. It would fail without the change in the PR.

### Was this patch authored or co-authored using generative AI tooling?

No.

Closes #49018 from chenhao-db/fix_json_parser_flag.

Authored-by: Chenhao Li <[email protected]>
Signed-off-by: Hyukjin Kwon <[email protected]>