[BUG] Match `from_json` behaviour on Databricks 14.3. #11711
Comments
I think we can probably do some of this in post-processing. We have similar issues with overflow in arrays and structs: really old versions of Spark invalidate the entire struct if there is a single overflow in it. I am not sure what the priority for this really is, though.
Please confirm if this change applies to only overflow or other invalid input. In particular, please add these input lines to the test file and post the output:
Here's the input with @ttnghia's corner cases added:
{"data": {"A": 0, "B": 1}}
{"data": {"A": 1}}
{"data": {"B": 50}}
{"data": {"B": -128, "A": 127}}
{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}
{"data": {"A": 0, "B": xyz}}
{"data": {"A": 0, "B": "0"}}
{"data": {"A": 0, "B": }}
{"data": {"A": 0, "B": "}}
Here's the output from Apache Spark 3.5.x, which matches
Here's what Databricks 14.3 returns:
The 5th and 7th rows are different.
As discussed, I'm not inclined to "solve" the problem at this time. I'll refactor the tests so that the problematic rows are skipped in an
It seems that null rows in the child columns caused by cast failures will always nullify the top-level column. We need to check that when working on this issue. If this is the case, the fix will be less complex.
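If that hypothesis holds, the check could be a post-processing pass that compares each child column before and after the cast. The following is a minimal pure-Python sketch of that idea, not the cuDF or `spark-rapids` implementation. The names `cast_failed` and `nullify_top_level` and the dict-of-lists column layout are hypothetical, chosen only for illustration.

```python
from typing import Dict, List, Optional


def cast_failed(raw: Optional[str], cast: Optional[int]) -> bool:
    """A cast failed if the raw value was present but the cast value is null."""
    return raw is not None and cast is None


def nullify_top_level(
    raw_children: Dict[str, List[Optional[str]]],
    cast_children: Dict[str, List[Optional[int]]],
) -> List[Optional[Dict[str, Optional[int]]]]:
    """Return one struct per row, or None where any child failed to cast.

    Assumption: a missing field (raw value None) is not a cast failure,
    so it leaves the parent struct valid.
    """
    num_rows = len(next(iter(raw_children.values())))
    result = []
    for i in range(num_rows):
        failed = any(
            cast_failed(raw_children[name][i], cast_children[name][i])
            for name in raw_children
        )
        if failed:
            result.append(None)  # nullify the whole parent row
        else:
            result.append({name: cast_children[name][i] for name in cast_children})
    return result


# Raw string children as extracted from the JSON, and the same children after
# casting to a 64-bit integer (overflowing values become None).
raw = {"A": ["0", "1", "-9999999999999999999"],
       "B": ["1", None, "99999999999999999999"]}
cast = {"A": [0, 1, None],
        "B": [1, None, None]}
structs = nullify_top_level(raw, cast)
```

Here only the third row is nullified, because its children were present in the input but did not survive the cast. How far up a nested schema the nullification should propagate is not settled in this thread, so the sketch handles a single struct level only.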
Fixes NVIDIA#11533.

This commit addresses the test failures reported in NVIDIA#11533, for the following tests:
- `json_matrix_test.py::test_from_json_long_structs()`
- `json_matrix_test.py::test_scan_json_long_structs()`

These failures are a result of NVIDIA#11711. When the JSON parser attempts to read integral struct members from a JSON file, if the parsing leads to an overflow, then the `STRUCT` column value is deemed null on Databricks 14.3 (i.e. *without* `spark-rapids` active). This behaviour differs from that exhibited by Apache Spark versions exceeding 3.4.1.

This commit breaks out the problematic JSON test rows into a separate file, whose read is tested in an `xfail` for Databricks 14.3. The remaining rows are tested on all versions.

The true fix for NVIDIA#11711 will be addressed later.

Signed-off-by: MithunR <[email protected]>
One wonders how far up the chain the nullification is transmitted. That's worth digging into at a different time.
The behaviour of `from_json` seems to have changed on Databricks 14.3.

This was revealed as part of a test failure (`json_matrix_test.py::test_from_json_long_structs`) on Databricks. Here is the effective repro (using the test input `json` file from the test):

The output on Apache Spark 3.5 (and all other Apache Spark versions) is:

On Databricks 14.3, the last record is `NULL`, and not `{{NULL, NULL}}`.

I fear this will involve a policy change in the CUDF implementation of `from_json`, and using it from a `350db` shim. (I'm not an expert on the JSON parsing end of this.)
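To make the difference concrete, here is a small pure-Python model of the two outcomes described above for a row whose integral fields overflow a 64-bit long. It is only a sketch under stated assumptions: it does not call Spark, `from_json_model` and its `nullify_record_on_overflow` flag are hypothetical names, and the schema `struct<data: struct<A: long, B: long>>` is inferred from the test name and the sample rows. Per-field nulling in the Apache-style branch is also an assumption; the thread only confirms the result for a row where both fields overflow.

```python
import json

LONG_MIN, LONG_MAX = -(2**63), 2**63 - 1


def parse_long(value):
    """Return (parsed_value, overflowed) for one candidate 64-bit field."""
    if isinstance(value, int) and not isinstance(value, bool):
        if LONG_MIN <= value <= LONG_MAX:
            return value, False
        return None, True  # integer literal that does not fit in a long
    # Missing or non-integer input: modelled as null without overflow (assumption).
    return None, False


def from_json_model(line, nullify_record_on_overflow):
    """Model parsing one JSON line into {"data": {"A": ..., "B": ...}}.

    nullify_record_on_overflow=False models the Apache Spark result
    ({{NULL, NULL}}); True models the Databricks 14.3 result (NULL).
    """
    data = json.loads(line).get("data") or {}
    fields, overflowed = {}, False
    for name in ("A", "B"):
        fields[name], field_overflowed = parse_long(data.get(name))
        overflowed = overflowed or field_overflowed
    if overflowed and nullify_record_on_overflow:
        return None  # the whole record is nullified
    return {"data": fields}


overflow_row = '{"data": {"B": 99999999999999999999, "A": -9999999999999999999}}'
apache_style = from_json_model(overflow_row, nullify_record_on_overflow=False)
databricks_style = from_json_model(overflow_row, nullify_record_on_overflow=True)
```

With the overflowing row, `apache_style` keeps the record with both members null, while `databricks_style` is `None`. Rows without overflow are unaffected by the flag.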