[BUG] Match from_json behaviour on Databricks 14.3. #11711

Open
mythrocks opened this issue Nov 7, 2024 · 6 comments
Labels: bug (Something isn't working)
mythrocks (Collaborator) wrote:
The behaviour of from_json seems to have changed on Databricks 14.3.

This was revealed as part of a test failure (json_matrix_test.py::test_from_json_long_structs) on Databricks. Here is the effective repro (using the test input json file from the test):

import pyspark.sql.functions as f
from pyspark.sql.types import *

input_file = "/home/ubuntu/spark-rapids/integration_tests/src/test/resources/int_struct_formatted.json"
schema = StructType([StructField("data", StructType([StructField("A", LongType()),StructField("B", LongType())]))])

df = spark.read.text(input_file).withColumnRenamed("value", "json")
op_df = df.select(f.col('json'), f.from_json(f.col('json'), schema))

op_df.show(100, False)

The output on Apache Spark 3.5 (and all other Apache Spark versions) is:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{{NULL, NULL}} |
+----------------------------------------------------------------+---------------+

On Databricks 14.3, the last record is NULL, and not {{NULL, NULL}}.

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{NULL}         |
+----------------------------------------------------------------+---------------+
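For context, the differing row is a 64-bit overflow case: both 99999999999999999999 and -9999999999999999999 fall outside LongType's range of [-2^63, 2^63 - 1]. A minimal stdlib sketch of the range check (illustrative only, not the plugin's or Spark's actual code):

```python
import json

# Spark's LongType is a signed 64-bit integer.
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1

def parse_long(value):
    """Return value if it is an in-range integer, else None (overflow or invalid)."""
    if not isinstance(value, int) or isinstance(value, bool):
        return None
    return value if LONG_MIN <= value <= LONG_MAX else None

row = json.loads('{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}')
a = parse_long(row["data"]["A"])  # below LONG_MIN -> None
b = parse_long(row["data"]["B"])  # above LONG_MAX -> None
print(a, b)  # None None
```

Apache Spark 3.5 keeps the enclosing struct and nulls only the overflowed fields ({{NULL, NULL}}), while Databricks 14.3 nullifies the struct itself ({NULL}).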

I fear this will involve a policy change in the CUDF implementation of from_json, and in how it is used from a 350db shim. (I'm not an expert on the JSON parsing end of this.)

@mythrocks mythrocks added ? - Needs Triage Need team to review and classify bug Something isn't working labels Nov 7, 2024
revans2 (Collaborator) commented Nov 12, 2024:

I think we can probably do some of this in post-processing. We have similar issues with overflow on arrays and structs: really old versions of Spark invalidate the entire struct if there was a single overflow in it.
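One way such post-processing could look, sketched in plain Python rather than the plugin's CUDF code (the INVALID sentinel and function names are hypothetical, for illustration only):

```python
# Hypothetical sketch of the post-processing idea: after parsing, the
# Databricks-14.3-style policy nullifies an entire struct row if any of its
# children failed to parse (overflow, bad cast), while the Spark 3.5 policy
# nulls only the failed children. Missing fields (plain None) do not count.

INVALID = object()  # marks a child that failed to parse

def post_process(struct, db143=False):
    """Map INVALID children to None; under db143=True, any INVALID child
    nullifies the whole struct."""
    if struct is None:
        return None
    if db143 and any(v is INVALID for v in struct.values()):
        return None
    return {k: (None if v is INVALID else v) for k, v in struct.items()}

# Row where both fields overflowed:
row = {"A": INVALID, "B": INVALID}
print(post_process(row))              # {'A': None, 'B': None}  (Spark 3.5)
print(post_process(row, db143=True))  # None                    (Databricks 14.3)
```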

I am not sure what the priority for this really is, though.

ttnghia (Collaborator) commented Nov 12, 2024:

Please confirm whether this change applies only to overflow, or to other invalid input as well. In particular, please add these input lines to the test file and post the output:

{"data": {"A": 0, "B": xyz}}
{"data": {"A": 0, "B": "0"}}
{"data": {"A": 0, "B": }}
{"data": {"A": 0, "B": "}}

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Nov 12, 2024
mythrocks (Collaborator, Author) commented:
Here's the input with @ttnghia's corner cases added:

{"data": {"A": 0, "B": 1}}
{"data": {"A": 1}}
{"data": {"B": 50}}
{"data": {"B": -128, "A": 127}}
{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}
{"data": {"A": 0, "B": xyz}}
{"data": {"A": 0, "B": "0"}}
{"data": {"A": 0, "B": }}
{"data": {"A": 0, "B": "}}

Here's the output from Apache Spark 3.5.x, which matches spark-rapids and nearly all Databricks versions:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{{NULL, NULL}} |
|{"data": {"A": 0, "B": xyz}}                                    |{NULL}         |
|{"data": {"A": 0, "B": "0"}}                                    |{{0, NULL}}    |
|{"data": {"A": 0, "B": }}                                       |{NULL}         |
|{"data": {"A": 0, "B": "}}                                      |{NULL}         |
+----------------------------------------------------------------+---------------+

Here's what Databricks 14.3 returns:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{NULL}         |
|{"data": {"A": 0, "B": xyz}}                                    |{NULL}         |
|{"data": {"A": 0, "B": "0"}}                                    |{NULL}         |
|{"data": {"A": 0, "B": }}                                       |{NULL}         |
|{"data": {"A": 0, "B": "}}                                      |{NULL}         |
+----------------------------------------------------------------+---------------+

The 5th row (the long overflow) and the 7th row (the quoted number "0") differ.

mythrocks (Collaborator, Author) commented:
As discussed, I'm not inclined to "solve" the problem at this time. I'll refactor the tests so that the problematic rows are skipped in an xfailed test. We can revisit this for a proper fix.
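The refactor described above might look roughly like the following pytest sketch (the test names and the is_databricks_14_3 flag are hypothetical; the real tests live in json_matrix_test.py and detect the runtime from the environment):

```python
import pytest

# Illustrative sketch of splitting the problematic rows into their own
# xfailed test for Databricks 14.3; names are hypothetical.
is_databricks_14_3 = False  # the real suite derives this from the Spark runtime

def test_from_json_long_structs_well_behaved_rows():
    # Rows that parse identically on all Spark versions would be read here.
    pass

@pytest.mark.xfail(condition=is_databricks_14_3,
                   reason="https://github.com/NVIDIA/spark-rapids/issues/11711")
def test_from_json_long_structs_overflow_rows():
    # Rows whose enclosing structs Databricks 14.3 nullifies on child overflow.
    pass
```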

ttnghia (Collaborator) commented Nov 12, 2024:

It seems that a null row in a child column, caused by a casting failure, always nullifies the top-level column. We need to verify that while working on this issue; if it is the case, the fix will be less complex.

mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Nov 12, 2024
Fixes NVIDIA#11533.

This commit addresses the test failures reported in NVIDIA#11533, for the
following tests:
  - `json_matrix_test.py::test_from_json_long_structs()`
  - `json_matrix_test.py::test_scan_json_long_structs()`

These failures are a result of NVIDIA#11711.  When the JSON parser attempts to
read integral struct members from a JSON file, if the parsing leads to
an overflow, then the `STRUCT` column value is deemed null on Databricks
14.3 (i.e. *without* `spark-rapids` active).  This behaviour differs
from that exhibited by Apache Spark versions exceeding 3.4.1.

This commit breaks out the problematic JSON test rows into a separate
file, whose read is tested in an `xfail` for Databricks 14.3.  The
remaining rows are tested on all versions.

The true fix for NVIDIA#11711 will be addressed later.

Signed-off-by: MithunR <[email protected]>
mythrocks (Collaborator, Author) commented, quoting ttnghia:

> failure in casting will always nullify the top level columns.

One wonders how far up the chain the nullification is transmitted. That's worth digging into at a different time.
