[BUG] Match from_json behaviour on Databricks 14.3. #11711

Open
mythrocks opened this issue Nov 7, 2024 · 6 comments
Labels: bug (Something isn't working)
mythrocks (Collaborator) wrote:
The behaviour of from_json seems to have changed on Databricks 14.3.

This was revealed as part of a test failure (json_matrix_test.py::test_from_json_long_structs) on Databricks. Here is the effective repro (using the test input json file from the test):

import pyspark.sql.functions as f
from pyspark.sql.types import *

input_file = "/home/ubuntu/spark-rapids/integration_tests/src/test/resources/int_struct_formatted.json"
schema = StructType([StructField("data", StructType([StructField("A", LongType()),StructField("B", LongType())]))])

df = spark.read.text(input_file).withColumnRenamed("value", "json")
op_df = df.select(f.col('json'), f.from_json(f.col('json'), schema))

op_df.show(100, False)

The output on Apache Spark 3.5 (and all other Apache Spark versions) is:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{{NULL, NULL}} |
+----------------------------------------------------------------+---------------+

On Databricks 14.3, the last record is NULL, and not {{NULL, NULL}}.

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{NULL}         |
+----------------------------------------------------------------+---------------+
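For context, the differing row is a 64-bit overflow case: both 99999999999999999999 and -9999999999999999999 fall outside LongType's range of [-2^63, 2^63 - 1]. A minimal stdlib sketch of the range check (illustrative only, not the plugin's or Spark's actual code):

```python
import json

# Spark's LongType is a signed 64-bit integer.
LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1

def parse_long(value):
    """Return value if it is an in-range integer, else None (overflow or invalid)."""
    if not isinstance(value, int) or isinstance(value, bool):
        return None
    return value if LONG_MIN <= value <= LONG_MAX else None

row = json.loads('{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}')
a = parse_long(row["data"]["A"])  # below LONG_MIN -> None
b = parse_long(row["data"]["B"])  # above LONG_MAX -> None
print(a, b)  # None None
```

Apache Spark 3.5 keeps the enclosing struct and nulls only the overflowed fields ({{NULL, NULL}}), while Databricks 14.3 nullifies the struct itself ({NULL}).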

I fear this will involve a policy change in the CUDF implementation of from_json, and in how it is used from a 350db shim. (I'm not an expert on the JSON parsing end of this.)

@mythrocks mythrocks added ? - Needs Triage Need team to review and classify bug Something isn't working labels Nov 7, 2024
revans2 (Collaborator) commented Nov 12, 2024:

I think we can probably do some of this in post-processing. We have similar issues with overflow on arrays and structs: really old versions of Spark invalidate the entire struct if there was a single overflow in it.
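One way such post-processing could look, sketched in plain Python rather than the plugin's CUDF code (the INVALID sentinel and function names are hypothetical, for illustration only):

```python
# Hypothetical sketch of the post-processing idea: after parsing, the
# Databricks-14.3-style policy nullifies an entire struct row if any of its
# children failed to parse (overflow, bad cast), while the Spark 3.5 policy
# nulls only the failed children. Missing fields (plain None) do not count.

INVALID = object()  # marks a child that failed to parse

def post_process(struct, db143=False):
    """Map INVALID children to None; under db143=True, any INVALID child
    nullifies the whole struct."""
    if struct is None:
        return None
    if db143 and any(v is INVALID for v in struct.values()):
        return None
    return {k: (None if v is INVALID else v) for k, v in struct.items()}

# Row where both fields overflowed:
row = {"A": INVALID, "B": INVALID}
print(post_process(row))              # {'A': None, 'B': None}  (Spark 3.5)
print(post_process(row, db143=True))  # None                    (Databricks 14.3)
```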

I am not sure what the priority for this really is, though.

ttnghia (Collaborator) commented Nov 12, 2024:

Please confirm whether this change applies only to overflow, or to other invalid input as well. In particular, please add these input lines to the test file and post the output:

{"data": {"A": 0, "B": xyz}}
{"data": {"A": 0, "B": "0"}}
{"data": {"A": 0, "B": }}
{"data": {"A": 0, "B": "}}

@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Nov 12, 2024
mythrocks (Collaborator, Author) commented:
Here's the input with @ttnghia's corner cases added:

{"data": {"A": 0, "B": 1}}
{"data": {"A": 1}}
{"data": {"B": 50}}
{"data": {"B": -128, "A": 127}}
{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}
{"data": {"A": 0, "B": xyz}}
{"data": {"A": 0, "B": "0"}}
{"data": {"A": 0, "B": }}
{"data": {"A": 0, "B": "}}

Here's the output from Apache Spark 3.5.x, which matches spark-rapids and nearly all Databricks versions:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{{NULL, NULL}} |
|{"data": {"A": 0, "B": xyz}}                                    |{NULL}         |
|{"data": {"A": 0, "B": "0"}}                                    |{{0, NULL}}    |
|{"data": {"A": 0, "B": }}                                       |{NULL}         |
|{"data": {"A": 0, "B": "}}                                      |{NULL}         |
+----------------------------------------------------------------+---------------+

Here's what Databricks 14.3 returns:

+----------------------------------------------------------------+---------------+
|json                                                            |from_json(json)|
+----------------------------------------------------------------+---------------+
|{"data": {"A": 0, "B": 1}}                                      |{{0, 1}}       |
|{"data": {"A": 1}}                                              |{{1, NULL}}    |
|{"data": {"B": 50}}                                             |{{NULL, 50}}   |
|{"data": {"B": -128, "A": 127}}                                 |{{127, -128}}  |
|{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}|{NULL}         |
|{"data": {"A": 0, "B": xyz}}                                    |{NULL}         |
|{"data": {"A": 0, "B": "0"}}                                    |{NULL}         |
|{"data": {"A": 0, "B": }}                                       |{NULL}         |
|{"data": {"A": 0, "B": "}}                                      |{NULL}         |
+----------------------------------------------------------------+---------------+

The 5th row (the long overflow) and the 7th row (the quoted number "0") differ.

mythrocks (Collaborator, Author) commented:
As discussed, I'm not inclined to "solve" the problem at this time. I'll refactor the tests so that the problematic rows are skipped in an xfailed test. We can revisit this for a proper fix.
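The refactor described above might look roughly like the following pytest sketch (the test names and the is_databricks_14_3 flag are hypothetical; the real tests live in json_matrix_test.py and detect the runtime from the environment):

```python
import pytest

# Illustrative sketch of splitting the problematic rows into their own
# xfailed test for Databricks 14.3; names are hypothetical.
is_databricks_14_3 = False  # the real suite derives this from the Spark runtime

def test_from_json_long_structs_well_behaved_rows():
    # Rows that parse identically on all Spark versions would be read here.
    pass

@pytest.mark.xfail(condition=is_databricks_14_3,
                   reason="https://github.com/NVIDIA/spark-rapids/issues/11711")
def test_from_json_long_structs_overflow_rows():
    # Rows whose enclosing structs Databricks 14.3 nullifies on child overflow.
    pass
```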

ttnghia (Collaborator) commented Nov 12, 2024:

It seems that a null row in a child column, caused by a casting failure, always nullifies the top-level column. We need to verify that while working on this issue; if it is the case, the fix will be less complex.

mythrocks added a commit to mythrocks/spark-rapids that referenced this issue Nov 12, 2024
Fixes NVIDIA#11533.

This commit addresses the test failures reported in NVIDIA#11533, for the
following tests:
  - `json_matrix_test.py::test_from_json_long_structs()`
  - `json_matrix_test.py::test_scan_json_long_structs()`

These failures are a result of NVIDIA#11711.  When the JSON parser attempts to
read integral struct members from a JSON file, if the parsing leads to
an overflow, then the `STRUCT` column value is deemed null on Databricks
14.3 (i.e. *without* `spark-rapids` active).  This behaviour differs
from that exhibited by Apache Spark versions exceeding 3.4.1.

This commit breaks out the problematic JSON test rows into a separate
file, whose read is tested in an `xfail` for Databricks 14.3.  The
remaining rows are tested on all versions.

The true fix for NVIDIA#11711 will be addressed later.

Signed-off-by: MithunR <[email protected]>
mythrocks (Collaborator, Author) commented, quoting ttnghia:

> failure in casting will always nullify the top level columns.

One wonders how far up the chain the nullification is transmitted. That's worth digging into at a different time.
