Make JSON parsing common between JsonToStructs and ScanJson (#10542)
Signed-off-by: Andy Grove <[email protected]>
Signed-off-by: Robert (Bobby) Evans <[email protected]>
Co-authored-by: Andy Grove <[email protected]>
revans2 and andygrove authored Mar 11, 2024
1 parent cf9df50 commit 3d3ade2
Showing 28 changed files with 1,111 additions and 836 deletions.
2 changes: 1 addition & 1 deletion docs/additional-functionality/advanced_configs.md
@@ -126,7 +126,7 @@ Name | Description | Default Value | Applicable at
<a name="sql.join.leftOuter.enabled"></a>spark.rapids.sql.join.leftOuter.enabled|When set to true left outer joins are enabled on the GPU|true|Runtime
<a name="sql.join.leftSemi.enabled"></a>spark.rapids.sql.join.leftSemi.enabled|When set to true left semi joins are enabled on the GPU|true|Runtime
<a name="sql.join.rightOuter.enabled"></a>spark.rapids.sql.join.rightOuter.enabled|When set to true right outer joins are enabled on the GPU|true|Runtime
<a name="sql.json.read.decimal.enabled"></a>spark.rapids.sql.json.read.decimal.enabled|JSON reading is not 100% compatible when reading decimals.|false|Runtime
<a name="sql.json.read.decimal.enabled"></a>spark.rapids.sql.json.read.decimal.enabled|When reading a quoted string as a decimal Spark supports reading non-ascii unicode digits, and the RAPIDS Accelerator does not.|true|Runtime
<a name="sql.json.read.double.enabled"></a>spark.rapids.sql.json.read.double.enabled|JSON reading is not 100% compatible when reading doubles.|true|Runtime
<a name="sql.json.read.float.enabled"></a>spark.rapids.sql.json.read.float.enabled|JSON reading is not 100% compatible when reading floats.|true|Runtime
<a name="sql.json.read.mixedTypesAsString.enabled"></a>spark.rapids.sql.json.read.mixedTypesAsString.enabled|JSON reading is not 100% compatible when reading mixed types as string.|false|Runtime
157 changes: 71 additions & 86 deletions docs/compatibility.md
@@ -316,89 +316,71 @@ case.

## JSON

The JSON format read is a very experimental feature which is expected to have some issues, so we disable
The JSON format read is an experimental feature which is expected to have some issues, so we disable
it by default. If you would like to test it, you need to enable `spark.rapids.sql.format.json.enabled` and
`spark.rapids.sql.format.json.read.enabled`.
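
For example, a minimal sketch of enabling it for a session, assuming a running `spark` session as in
`spark-shell` (the file path and schema here are hypothetical):

```scala
// Enable the experimental GPU JSON read support for this session.
spark.conf.set("spark.rapids.sql.format.json.enabled", "true")
spark.conf.set("spark.rapids.sql.format.json.read.enabled", "true")

// Read a JSON lines file with an explicit schema (hypothetical path).
val people = spark.read
  .schema("name STRING, age INT")
  .json("/data/people.jsonl")
people.show()
```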

Reading input containing invalid JSON (in any row) will throw a runtime exception.
An example of valid input is as follows:
```console
{"name":"Andy", "age":30}
{"name":"Justin", "age":19}
```

The following inputs are invalid and will cause errors:
```console
{"name":"Andy", "age":30} ,,,,
{"name":"Justin", "age":19}
```

```console
{"name": Justin", "age":19}
```

Reading input with duplicate JSON key names is also incompatible with CPU Spark.

### JSON supporting types

In the current version, nested types (array, struct, and map types) are not yet supported in regular JSON parsing.

### `from_json` function
### Invalid JSON

This particular function supports outputting a map or struct type with limited functionality.
In Apache Spark on the CPU, if a line in the JSON file is invalid, the entire row is considered
invalid and nulls are returned for all columns. A line is considered invalid if it violates the
JSON specification, with a few extensions, listed below and illustrated in the sketch that follows.

The `from_json` function is disabled by default because it is experimental and has some known incompatibilities
with Spark, and can be enabled by setting `spark.rapids.sql.expression.JsonToStructs=true`.
* Single quotes are allowed to quote strings and keys
* Unquoted values like NaN and Infinity can be parsed as floating point values
* Control characters do not need to be replaced with the corresponding escape sequences in a
quoted string.
* Garbage at the end of a row is ignored as long as there is valid JSON at the beginning of the row.
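
As an illustration, a sketch of rows that exercise these extensions (hypothetical data; how each row
is treated also depends on the reader options in use and on whether the query runs on the CPU or the GPU):

```scala
import spark.implicits._

// Each line below uses one of the extensions described above (illustrative only).
val lines = Seq(
  """{'name': 'Andy', 'age': 30}""",              // single-quoted keys and strings
  """{"name": "Justin", "score": NaN}""",         // unquoted non-numeric number
  """{"name": "Sam", "age": 19} trailing text"""  // garbage after valid JSON
).toDS()

spark.read
  .schema("name STRING, age INT, score DOUBLE")
  .json(lines)
  .show()
```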

Dates are partially supported but there are some known issues:
The GPU implementation performs the same kinds of validation, but many of the checks are done on a
per-column basis. This means, for example, that if a number is formatted incorrectly, it is likely
that only that value will be considered invalid and returned as null, rather than nulls for the entire row.

- Only the default `dateFormat` of `yyyy-MM-dd` is supported in Spark 3.1.x. The query will fall back to CPU if any other format
is specified ([#9667](https://github.com/NVIDIA/spark-rapids/issues/9667))
- Strings containing integers with more than four digits will be
parsed as null ([#9664](https://github.com/NVIDIA/spark-rapids/issues/9664)) whereas Spark versions prior to 3.4
will parse these numbers as the number of days since the epoch, and in Spark 3.4 and later, an exception will be thrown.
There are options that can be used to enable and disable many of these features; most of them are
listed below.

Timestamps are partially supported but there are some known issues:
### JSON options

- Only the default `timestampFormat` of `yyyy-MM-dd'T'HH:mm:ss[.SSS][XXX]` is supported. The query will fall back to CPU if any other format
is specified ([#9723](https://github.com/NVIDIA/spark-rapids/issues/9723))
- Strings containing integers with more than four digits will be
parsed as null ([#9664](https://github.com/NVIDIA/spark-rapids/issues/9664)) whereas Spark versions prior to 3.4
will parse these numbers as the number of days since the epoch, and in Spark 3.4 and later, an exception will be thrown.
- Strings containing special date constant values such as `now` and `today` will parse as null ([#9724](https://github.com/NVIDIA/spark-rapids/issues/9724)),
which differs from the behavior in Spark 3.1.x
Spark supports passing options to the JSON parser when reading a dataset. In most cases, if the RAPIDS Accelerator
sees one of these options that it does not support, it will fall back to the CPU. In some cases it does not. The
following options are documented below, with a usage sketch after the list.

When reading numeric values, the GPU implementation always supports leading zeros regardless of the setting
for the JSON option `allowNumericLeadingZeros` ([#9588](https://github.com/NVIDIA/spark-rapids/issues/9588)).
- `allowNumericLeadingZeros` - Allows leading zeros in numbers (e.g. 00012). By default this is set to false.
When it is false, Spark considers the JSON invalid if it encounters this type of number. The RAPIDS
Accelerator supports validating the columns that are returned to the user with this option either on or off.

For struct output type, the function only supports struct of struct, array, string, integral, floating-point, and
decimal types. The output is incompatible if duplicate JSON key names are present in the input strings. For schemas
that include IntegerType, if arbitrarily large numbers are specified in the JSON strings, the GPU implementation will
cast the numbers to IntegerType, whereas CPU Spark will return null.
- `allowUnquotedControlChars` - Allows JSON strings to contain unquoted control characters (ASCII characters with
value less than 32, including tab and line feed characters) or not. By default this is set to false. If the schema
is provided while reading the JSON file, then this flag has no impact on the RAPIDS Accelerator, as it always allows
unquoted control characters, but Spark sees these as invalid and returns nulls. However, if the schema is not provided
and this option is false, then the RAPIDS Accelerator's behavior is the same as Spark's, where an exception is thrown
as discussed in the `JSON Schema discovery` section.

In particular, the output map does not result from regular JSON parsing; instead, it just contains the plain text of key-value pairs extracted directly from the input JSON string. Due to these limitations, the input JSON map type schema must be `MAP<STRING,STRING>` and nothing else. Furthermore, no validation, error tolerance, data conversion, or string formatting is performed. This may lead to some minor differences in the output compared to the result of Spark CPU's `from_json`, such as:
* Floating point numbers in the input JSON string such as `1.2000` will not be reformatted to `1.2`. Instead, the output will be the same as the input.
* If the input JSON is given as multiple rows, any row containing invalid JSON format will be parsed as an empty
struct instead of a null value ([#9592](https://github.com/NVIDIA/spark-rapids/issues/9592)).
- `allowNonNumericNumbers` - Allows `NaN` and `Infinity` values to be parsed (note that these are not valid numeric
values in the [JSON specification](https://json.org)). Spark versions prior to 3.3.0 have inconsistent behavior and will
parse some variants of `NaN` and `Infinity` even when this option is disabled
([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
Spark version 3.3.0 and later.
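
These options are passed through the standard reader API; a sketch with a hypothetical path and schema:

```scala
// Pass JSON parser options through the DataFrameReader.
val df = spark.read
  .option("allowNumericLeadingZeros", "true")   // accept numbers like 00012
  .option("allowUnquotedControlChars", "true")  // accept raw control characters in strings
  .option("allowNonNumericNumbers", "true")     // accept NaN and Infinity
  .schema("name STRING, value DOUBLE")
  .json("/data/input.jsonl")
```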

When a JSON attribute contains mixed types (different types in different rows), such as a mix of dictionaries
and lists, Spark will return a string representation of the JSON, but when running on GPU, the default
behavior is to throw an exception. There is an experimental setting
`spark.rapids.sql.json.read.mixedTypesAsString.enabled` that can be set to true to support reading
mixed types as string, but there are known issues where it could also read structs as string in some cases. There
can also be minor formatting differences. Spark will return a parsed and formatted representation, but the
GPU implementation returns the unparsed JSON string.
### Nesting

In versions of Spark before 3.5.0 there is no maximum to how deeply nested JSON can be. As of
3.5.0 the maximum is 1000 by default. The current GPU implementation limits this to 254
no matter what version of Spark is used. If the nesting level exceeds this, the JSON is considered
invalid and all values will be returned as nulls.
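
A sketch that illustrates the limit, assuming a running `spark` session (the data here is built
inline for illustration):

```scala
import org.apache.spark.sql.types._
import spark.implicits._

// Build a schema and matching input nested 300 struct levels deep.
val nested = (1 to 300).foldLeft[DataType](IntegerType) { (t, _) =>
  StructType(Seq(StructField("a", t)))
}
val deep = ("{\"a\":" * 300) + "1" + ("}" * 300)

// 300 levels exceeds the GPU's 254-level limit, so the row comes back as nulls there,
// while Spark 3.5.0+ on the CPU (default limit 1000) parses it successfully.
spark.read.schema(nested.asInstanceOf[StructType]).json(Seq(deep).toDS()).show()
```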

### `to_json` function
Only structs are supported for nested types. There are also some issues with arrays of structs. If
your data includes these, even if you are not reading them, you might get an exception. You can
try setting `spark.rapids.sql.json.read.mixedTypesAsString.enabled` to true to work around this,
as in the sketch below, but that option has some issues of its own.
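
A sketch of that workaround, with a hypothetical column that holds an object in some rows and a
list in others:

```scala
// Experimental: return attributes whose type varies between rows as raw JSON strings.
spark.conf.set("spark.rapids.sql.json.read.mixedTypesAsString.enabled", "true")

// "payload" holds an object in some rows and a list in others (hypothetical data).
val df = spark.read
  .schema("id INT, payload STRING")
  .json("/data/mixed.jsonl")
```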

The `to_json` function is disabled by default because it is experimental and has some known incompatibilities
with Spark, and can be enabled by setting `spark.rapids.sql.expression.StructsToJson=true`.
Dates and timestamps have some issues and may return values for technically invalid inputs.

Known issues are:
Floating point numbers have the same general issues as in the rest of Spark. We can parse them into
a valid floating point number, but the result might not match Spark's output exactly.

- There can be rounding differences when formatting floating-point numbers as strings. For example, Spark may
produce `-4.1243574E26` but the GPU may produce `-4.124357351E26`.
- Not all JSON options are respected
Strings are supported, but the data returned might not be normalized in the same way as in the CPU
implementation. Generally this comes down to the GPU not modifying the input, whereas Spark will
do things like remove extra whitespace and parse numbers before turning them back into strings.

### JSON Floating Point

@@ -413,9 +395,9 @@ consistent with the behavior in Spark 3.3.0 and later.
Another limitation of the GPU JSON reader is that it will parse strings containing non-string boolean or numeric values where
Spark treats them as invalid inputs and just returns `null`.

### JSON Timestamps
### JSON Timestamps/Dates

The JSON parser does not support the `TimestampNTZ` type and will fall back to CPU if `spark.sql.timestampType` is
set to `TIMESTAMP_NTZ` or if an explicit schema is provided that contains the `TimestampNTZ` type.
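
For example, either of the following (a sketch, on Spark versions with `TimestampNTZ` support, using
a hypothetical path) would trigger the fallback:

```scala
import org.apache.spark.sql.types._

// 1. Make TIMESTAMP mean TimestampNTZ session-wide:
spark.conf.set("spark.sql.timestampType", "TIMESTAMP_NTZ")

// 2. Provide an explicit schema containing TimestampNTZType:
val schema = StructType(Seq(StructField("ts", TimestampNTZType)))
val df = spark.read.schema(schema).json("/data/events.jsonl")
```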

There is currently no support for reading numeric values as timestamps and null values are returned instead
@@ -429,28 +411,31 @@ handles schema discovery and there is no GPU acceleration of this. By default Sp
dataset to determine the schema. This means that some options/errors which are ignored by the GPU may still
result in an exception if used with schema discovery.
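
Supplying the schema up front avoids the discovery pass entirely; a sketch with a hypothetical path:

```scala
// With no schema, Spark scans the input on the CPU to infer one.
val inferred = spark.read.json("/data/logs.jsonl")

// With an explicit schema, no inference pass is needed.
val explicit = spark.read
  .schema("ts STRING, level STRING, msg STRING")
  .json("/data/logs.jsonl")
```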

### JSON options
### `from_json` function

Spark supports passing options to the JSON parser when reading a dataset. In most cases if the RAPIDS Accelerator
sees one of these options that it does not support it will fall back to the CPU. In some cases we do not. The
following options are documented below.
`JsonToStructs`, or `from_json`, is based on the same code as reading a JSON lines file. There are
a few differences, noted below.

- `allowNumericLeadingZeros` - Allows leading zeros in numbers (e.g. 00012). By default this is set to false.
When it is false, Spark throws an exception if it encounters this type of number. The RAPIDS Accelerator
strips leading zeros from all numbers, so this config has no impact on it.
The `from_json` function is disabled by default because it is experimental and has some known
incompatibilities with Spark, and can be enabled by setting
`spark.rapids.sql.expression.JsonToStructs=true`. You don't need to set
`spark.rapids.sql.format.json.enabled` and `spark.rapids.sql.format.json.read.enabled` to true.
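
A minimal sketch of using it, assuming a running `spark` session:

```scala
import org.apache.spark.sql.functions.from_json
import org.apache.spark.sql.types.StructType
import spark.implicits._

spark.conf.set("spark.rapids.sql.expression.JsonToStructs", "true")

val df = Seq("""{"name":"Andy","age":30}""").toDF("json")
val schema = StructType.fromDDL("name STRING, age INT")
df.select(from_json($"json", schema).alias("parsed")).show()
```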

- `allowUnquotedControlChars` - Allows JSON strings to contain unquoted control characters (ASCII characters with
value less than 32, including tab and line feed characters) or not. By default this is set to false. If the schema
is provided while reading the JSON file, then this flag has no impact on the RAPIDS Accelerator, as it always allows
unquoted control characters but Spark reads these entries incorrectly as null. However, if the schema is not provided
and the option is false, then the RAPIDS Accelerator's behavior is the same as Spark's, where an exception is thrown
as discussed in the `JSON Schema discovery` section.
There is no schema discovery, as a schema is required as input to `from_json`.

- `allowNonNumericNumbers` - Allows `NaN` and `Infinity` values to be parsed (note that these are not valid numeric
values in the [JSON specification](https://json.org)). Spark versions prior to 3.3.0 have inconsistent behavior and will
parse some variants of `NaN` and `Infinity` even when this option is disabled
([SPARK-38060](https://issues.apache.org/jira/browse/SPARK-38060)). The RAPIDS Accelerator behavior is consistent with
Spark version 3.3.0 and later.
In addition to `structs`, a top-level `map` type is supported, but only if the key and value are
strings.
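
Continuing the sketch above, a top-level map uses a `MAP<STRING,STRING>` schema:

```scala
import org.apache.spark.sql.types.{MapType, StringType}

// Only string keys and string values are supported for a top-level map.
df.select(from_json($"json", MapType(StringType, StringType)).alias("kv")).show()
```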

### `to_json` function

The `to_json` function is disabled by default because it is experimental and has some known incompatibilities
with Spark, and can be enabled by setting `spark.rapids.sql.expression.StructsToJson=true`.

Known issues are:

- There can be rounding differences when formatting floating-point numbers as strings. For example, Spark may
produce `-4.1243574E26` but the GPU may produce `-4.124357351E26`.
- Not all JSON options are respected
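
A minimal sketch of trying it out, assuming a running `spark` session:

```scala
import org.apache.spark.sql.functions.{struct, to_json}
import spark.implicits._

spark.conf.set("spark.rapids.sql.expression.StructsToJson", "true")

val df = Seq(("Andy", 30)).toDF("name", "age")
df.select(to_json(struct($"name", $"age")).alias("json")).show()
```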

### get_json_object

Expand Down
4 changes: 2 additions & 2 deletions docs/supported_ops.md
@@ -20379,9 +20379,9 @@ dates or timestamps, or for a lack of type coercion support.
<td> </td>
<td><b>NS</b></td>
<td> </td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, MAP, UDT</em></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><b>NS</b></td>
<td><em>PS<br/>UTC is only supported TZ for child TIMESTAMP;<br/>unsupported child types BINARY, MAP, UDT</em></td>
<td><b>NS</b></td>
</tr>
<tr>
