[FEA] Add option to read JSON field as unparsed string #14239

andygrove · 2023-09-29T21:37:44Z

Is your feature request related to a problem? Please describe.

When reading JSON in Spark, if a field has mixed types, Spark will infer the type as String to avoid data loss due to the uncertainty of the actual data type.

For example, given this input file, Spark will read column bar as a numeric type and column foo as a string type.

$ cat test.json
{ "foo": [1,2,3], "bar": 123 }
{ "foo": { "a": 1 }, "bar": 456 }

Here is the Spark code that demonstrates this:

scala> val df = spark.read.json("test.json")
df: org.apache.spark.sql.DataFrame = [bar: bigint, foo: string]                 

scala> df.show
+---+-------+
|bar|    foo|
+---+-------+
|123|[1,2,3]|
|456|{"a":1}|
+---+-------+
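
To make the fallback concrete, here is a minimal pure-Python sketch of the behavior described above. It is a simplified model, not Spark's actual inference code: the helper names `kind` and `infer_schema` are hypothetical, and it ignores numeric widening and the merging of nested types. It only shows conflicting kinds collapsing to string.

```python
import json

def kind(value):
    """Map a parsed JSON value to a coarse type name (simplified, hypothetical)."""
    if isinstance(value, bool):  # check bool first: bool is a subclass of int
        return "boolean"
    if isinstance(value, int):
        return "bigint"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    if isinstance(value, list):
        return "array"
    return "struct"  # the only remaining non-null JSON value is an object

def infer_schema(lines):
    """Infer one kind per top-level field; conflicting kinds collapse to string."""
    schema = {}
    for line in lines:
        for name, value in json.loads(line).items():
            if value is None:
                continue  # nulls do not constrain the type in this sketch
            seen = schema.get(name)
            schema[name] = kind(value) if seen in (None, kind(value)) else "string"
    return schema

lines = ['{ "foo": [1,2,3], "bar": 123 }', '{ "foo": { "a": 1 }, "bar": 456 }']
print(infer_schema(lines))  # {'foo': 'string', 'bar': 'bigint'}
```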

Currently, Spark RAPIDS fails for this example because cuDF does not support mixed types in a column:

Caused by: ai.rapids.cudf.CudfException: CUDF failure at: /home/jenkins/agent/workspace/jenkins-spark-rapids-jni_nightly-pre_release-181-cuda11/thirdparty/cudf/cpp/src/io/json/json_column.cu:577: A mix of lists and structs within the same column is not supported
  at ai.rapids.cudf.Table.readJSON(Native Method)

Describe the solution you'd like
I would like an option to specify that certain columns be read as unparsed strings.

Describe alternatives you've considered
I am also exploring some workarounds in the Spark RAPIDS plugin.

Additional context


revans2 · 2023-10-02T21:20:47Z

We have some code that @ttnghia wrote. It will convert a range of tokens to a normalized string that matches what Spark wants. We did this for some Spark specific functionality with JSON parsing related to returning a Map instead of a Struct.

https://github.com/NVIDIA/spark-rapids-jni/blob/54ef9991f46fa873d580315212aeae345da7152a/src/main/cpp/src/map_utils.cu#L63-L112

I am not sure if this is really something that CUDF wants, but it is at least a starting point.

andygrove · 2023-12-07T16:43:47Z

Here are some examples, showing input and expected output.

# Example 1: Mixed primitive types in struct

INPUT:

{ "a": "123" }
{ "a": 123 }

EXPECTED:

+-----------+
|    my_json|
+-----------+
|{"a":"123"}|
|{"a":"123"}|
+-----------+

# Example 2: Mixed structs and lists in struct

INPUT:

{ "a": [1,2,3] }
{ "a": { "b": 1 } }

EXPECTED:

+-----------------+
|          my_json|
+-----------------+
|  {"a":"[1,2,3]"}|
|{"a":"{\"b\":1}"}|
+-----------------+

# Example 3: Mixed structs and primitives in struct

INPUT:

{ "a": "fox" }
{ "a": { "b": 1 } }

EXPECTED:

+-----------------+
|my_json          |
+-----------------+
|{"a":"fox"}      |
|{"a":"{\"b\":1}"}|
+-----------------+

# Example 4: Mixed lists and primitives in struct

INPUT:

{ "a": [1,2,3] }
{ "a": "fox" }

EXPECTED:

+---------------+
|my_json        |
+---------------+
|{"a":"[1,2,3]"}|
|{"a":"fox"}    |
+---------------+
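
The four examples above can be reproduced on the CPU with a small standard-library sketch. This is only an illustration of the expected output, not the cuDF or Spark implementation. The function names are hypothetical, and it assumes that compact re-serialization with `json.dumps` matches the normalized string Spark produces, which holds for these inputs but may not hold for every number format.

```python
import json

def as_unparsed_string(value):
    """Strings pass through unchanged; other values become compact JSON text."""
    if value is None or isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"))

def read_fields_as_strings(lines):
    """Parse JSON lines and render every top-level field as a string."""
    rows = []
    for line in lines:
        record = json.loads(line)
        stringified = {name: as_unparsed_string(v) for name, v in record.items()}
        rows.append(json.dumps(stringified, separators=(",", ":")))
    return rows

# Example 2 from above: mixed structs and lists in the same field.
for row in read_fields_as_strings(['{ "a": [1,2,3] }', '{ "a": { "b": 1 } }']):
    print(row)
```

The nested value is escaped when the row is written back out, which is why the struct case shows up as `"{\"b\":1}"` in the expected tables.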

andygrove · 2023-12-07T16:46:51Z

There is a separate use case for arrays where the array element type differs between records. Spark infers the type as Array<String> in this case.

This is not necessarily a high priority and could be split out into a separate issue, but I'd like to point it out here for visibility.

# Example: Mixed primitive arrays in struct

INPUT:

{ "a": [1,2,3] }
{ "a": [true,false,true] }
{ "a": ["a", "b", "c"] }

EXPECTED:

+-----------------------------+
|my_json                      |
+-----------------------------+
|{"a":["1","2","3"]}          |
|{"a":["true","false","true"]}|
|{"a":["a","b","c"]}          |
+-----------------------------+
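
As a rough CPU-side illustration of the array case, the sketch below converts each array element to a string while keeping the array structure. The function names are hypothetical and this is not the cuDF code path. It assumes every record has the named field and that the field holds a JSON array.

```python
import json

def element_as_string(value):
    """Strings pass through unchanged; other elements become compact JSON text."""
    if isinstance(value, str):
        return value
    return json.dumps(value, separators=(",", ":"))

def arrays_as_string_arrays(lines, field):
    """Rewrite `field` in each JSON line as an array of strings."""
    rows = []
    for line in lines:
        record = json.loads(line)
        record[field] = [element_as_string(v) for v in record[field]]
        rows.append(json.dumps(record, separators=(",", ":")))
    return rows

lines = ['{ "a": [1,2,3] }', '{ "a": [true,false,true] }', '{ "a": ["a", "b", "c"] }']
for row in arrays_as_string_arrays(lines, "a"):
    print(row)
```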

Addresses #14239

This PR adds an option to read mixed types as string columns. It also adds related functional changes to nested JSON reader (libcudf, cuDF-python, Java).

Details:
- Added new option `mixed_types_as_string` bool in json_reader_options
- This feature requires 2 things: finding end of struct/list nodes, parse struct/list type as string.
- For Struct and List, node_range_end was node_range_begin+1 earlier (since it was not used anywhere). Now it is calculated properly by copying only struct and list tokens and their node_range_end is calculated. (Since end token is child of begin token, scattering end token's index to parent token's corresponding node's node_range_end will get the node_range_end of List and Struct nodes).
- In `reduce_to_column_tree()` (which infers the schema), the list and struct node_range_end are changed to node_begin+1 so that it does not copy entire list/struct strings to host for column names.
- `reinitialize_as_string` reinitializes an initialized column as string.
- Mixed type columns are parsed as strings since their column category is changed to `NC_STR`.
- Added tests

Authors:
- Karthikeyan (https://github.com/karthikeyann)
- Andy Grove (https://github.com/andygrove)

Approvers:
- Andy Grove (https://github.com/andygrove)
- Jason Lowe (https://github.com/jlowe)
- Elias Stehle (https://github.com/elstehle)
- Bradley Dice (https://github.com/bdice)
- Shruti Shivakumar (https://github.com/shrshi)

URL: #14572
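
The PR description above says that reading a struct or list as a string requires knowing where the node ends. The following is a sequential, CPU-only analogue of that idea, written as a sketch: it is not the libcudf algorithm, which works on a token stream in parallel on the GPU. The function name `nested_ranges` is hypothetical, and the input is assumed to be well-formed JSON.

```python
def nested_ranges(text):
    """Return sorted (begin, end) offsets, end exclusive, of every struct/list."""
    ranges = []
    stack = []          # offsets of currently open '{' or '['
    in_string = False   # brackets inside string literals must be ignored
    escaped = False
    for i, ch in enumerate(text):
        if in_string:
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
        elif ch in "{[":
            stack.append(i)
        elif ch in "}]":
            # The closing bracket belongs to the most recently opened node,
            # so its offset gives that node's range end.
            begin = stack.pop()
            ranges.append((begin, i + 1))
    return sorted(ranges)

text = '{ "a": { "b": [1,2] }, "s": "}]" }'
for begin, end in nested_ranges(text):
    print(text[begin:end])
```

With both offsets known, the raw text of a nested value can be sliced out and kept as a string instead of being parsed into child columns.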

GregoryKimball · 2024-02-16T21:58:25Z

We made significant progress on this issue with #14572, and I believe we will be able to close it after #14936. @andygrove would you please let us know if there are other cases to consider?

andygrove · 2024-02-22T22:17:14Z

For all the examples in #14239 (comment), I see the correct results with #14936.

For the mixed array example in #14239 (comment) I still do not see the correct results, so I filed a separate issue for this one (#15120).

andygrove added the feature request, Needs Triage, and Spark labels Sep 29, 2023

andygrove added this to libcudf Sep 29, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Sep 29, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Sep 29, 2023

andygrove mentioned this issue Sep 29, 2023

[BUG] [JSON] A mix of lists and structs within the same column is not supported NVIDIA/spark-rapids#9353

Closed

andygrove mentioned this issue Oct 3, 2023

[BUG] from_json generated inconsistent result comparing with CPU for input column with nested json strings NVIDIA/spark-rapids#8558

Closed

GregoryKimball mentioned this issue Oct 4, 2023

[FEA] Provide option to read fields with mixed types #11947

Closed

GregoryKimball added this to the Nested JSON reader milestone Oct 4, 2023

GregoryKimball mentioned this issue Aug 29, 2023

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Open

This was referenced Oct 16, 2023

[FEA] Support parsing JSON data to include a Map type #14288

Closed

[FEA] support MapType in JSON parsing by recursively parsing child columns. NVIDIA/spark-rapids#9450

Open

andygrove mentioned this issue Oct 17, 2023

[FEA] [EPIC] Priority JSON Issues NVIDIA/spark-rapids#9458

Open

26 tasks

GregoryKimball removed this from libcudf Oct 26, 2023

mattahrens assigned andygrove and unassigned andygrove Oct 27, 2023

GregoryKimball added the 2 - In Progress label Nov 9, 2023

GregoryKimball assigned karthikeyann Nov 9, 2023

GregoryKimball added this to libcudf Nov 9, 2023

GregoryKimball moved this to In progress in libcudf Nov 9, 2023

GregoryKimball added the libcudf and cuIO labels and removed the Needs Triage label Nov 9, 2023

karthikeyann mentioned this issue Dec 5, 2023

JSON - Parse mixed types as string in JSON reader #14572

Merged

3 tasks

andygrove mentioned this issue Feb 22, 2024

[FEA] [JSON] Read mixed primitive arrays as string arrays #15120

Closed

andygrove mentioned this issue Feb 22, 2024

Support casting of Map type to string in JSON reader #14936

Merged

3 tasks

This was referenced Mar 8, 2024

[BUG] mixed_type_as_string throws exception for nested data with nested STRING schema request #15260

Closed

[FEA]Support nested types for parsing JSON NVIDIA/spark-rapids#4608

Open

[FEA] JSON input support NVIDIA/spark-rapids#9

Open
