[BUG][JSON] from_json to Map type should produce null for invalid entries #9592

Closed
andygrove opened this issue Oct 31, 2023 · 3 comments
Labels
bug Something isn't working

Comments

@andygrove
Contributor

Describe the bug
When converting JSON to Map<String,String> using from_json, invalid input rows are returned as empty maps ({}) instead of nulls.

Steps/Code to reproduce bug

val df = Seq("{}", "BAD", "{\"A\": 100}").toDF.repartition(2)
df.selectExpr("from_json(value, 'MAP<STRING,STRING>')").show()

CPU Output

+----------+
|   entries|
+----------+
|        {}|
|      null|
|{A -> 100}|
+----------+

GPU Output

+----------+
|   entries|
+----------+
|        {}|
|        {}|
|{A -> 100}|
+----------+

Expected behavior
Results should match CPU.

Environment details (please complete the following information)
N/A

Additional context
None

@andygrove added the "bug" (Something isn't working) and "? - Needs Triage" (Need team to review and classify) labels on Oct 31, 2023
@mattahrens removed the "? - Needs Triage" (Need team to review and classify) label on Oct 31, 2023
@andygrove
Contributor Author

I ran a test in spark-rapids-jni:

ColumnVector input = ColumnVector.fromStrings("{}", "BAD", "{\"A\": 100}");
ColumnVector outputMap = MapUtils.extractRawMapFromJsonString(input);
TableDebug.get().debug("outputMap", outputMap);

This shows the following output:

GPU COLUMN outputMap - NC: 0 DATA: null VAL: null
GPU COLUMN outputMap:DATA - NC: 0 DATA: null VAL: null
GPU COLUMN outputMap:DATA:CHILD_0 - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f48a3c01200, length=1, id=-1} VAL: null
GPU COLUMN outputMap:DATA:CHILD_1 - NC: 0 DATA: DeviceMemoryBufferView{address=0x7f48a3c01a00, length=3, id=-1} VAL: null
COLUMN outputMap - LIST
OFFSETS
0 [0 - 0)
1 [0 - 0)
2 [0 - 1)
COLUMN outputMap:DATA - STRUCT
COLUMN outputMap:DATA:CHILD_0 - STRING
0 "A" 41
COLUMN outputMap:DATA:CHILD_1 - STRING
0 "100" 313030

There is no differentiation between the source strings {} and BAD. None of the values are null.

@ttnghia Would it be possible to detect the invalid item and set the corresponding entry to null?
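
To make the debug output above concrete, here is a minimal, self-contained sketch in plain Java of how a LIST<STRUCT<key,value>> column is read back row by row. It does not use cudf or spark-rapids-jni; the class `MapColumnSketch`, the `render` method, and the boolean validity array are hypothetical names introduced only for illustration. With the offsets printed above (0, 0, 0, 1) and no validity mask, rows 0 and 1 are both empty lists, so "{}" and "BAD" look identical. A per-row validity mask is one way the two could be told apart.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only: a plain-Java model of a LIST<STRUCT<key,value>> column.
final class MapColumnSketch {

    // offsets has (rows + 1) entries; row r owns child entries [offsets[r], offsets[r + 1]).
    // valid may be null, which models a column with no validity mask (null count 0).
    static List<String> render(int[] offsets, boolean[] valid, String[] keys, String[] values) {
        List<String> out = new ArrayList<>();
        for (int row = 0; row + 1 < offsets.length; row++) {
            if (valid != null && !valid[row]) {
                out.add("null"); // the row is marked invalid, so it is a null map
                continue;
            }
            StringBuilder sb = new StringBuilder("{");
            for (int i = offsets[row]; i < offsets[row + 1]; i++) {
                if (i > offsets[row]) {
                    sb.append(", ");
                }
                sb.append(keys[i]).append(" -> ").append(values[i]);
            }
            out.add(sb.append("}").toString());
        }
        return out;
    }

    public static void main(String[] args) {
        int[] offsets = {0, 0, 0, 1};   // taken from the OFFSETS section of the debug output
        String[] keys = {"A"};          // CHILD_0
        String[] values = {"100"};      // CHILD_1

        // No validity mask: "{}" and "BAD" both render as an empty map.
        System.out.println(render(offsets, null, keys, values));
        // prints [{}, {}, {A -> 100}]

        // Assumed mask that marks row 1 ("BAD") as null.
        System.out.println(render(offsets, new boolean[] {true, false, true}, keys, values));
        // prints [{}, null, {A -> 100}]
    }
}
```

The second call matches the CPU output in the issue description, which suggests the missing piece is per-row validity information rather than a change to the offsets or child columns.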

@ttnghia
Collaborator

ttnghia commented Nov 2, 2023

Yes, that is possible, but it is not trivial at this point because the underlying JNI code does not process the input as JSON Lines (it concatenates all the input rows into one giant JSON string).

We need to rework it to fix this.
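
As a rough illustration of why row boundaries matter, the sketch below parses each input row independently and maps a row that fails to parse to null. This is not the JNI implementation and not necessarily how the eventual fix works; `RowWiseJsonSketch`, `parseRow`, and `parseRows` are hypothetical names, and the regex-based parser is a toy that only accepts flat objects whose keys and values contain no quotes or backslashes and whose values are simple strings or integers.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical illustration only: row-by-row parsing with null for invalid rows.
final class RowWiseJsonSketch {

    // Toy grammar, not a real JSON parser: "key": "value" or "key": 123
    private static final String ENTRY_REGEX =
        "\"([^\"\\\\]*)\"\\s*:\\s*(?:\"([^\"\\\\]*)\"|(-?\\d+))";
    private static final Pattern ENTRY = Pattern.compile(ENTRY_REGEX);
    private static final Pattern OBJECT = Pattern.compile(
        "\\s*\\{\\s*(?:" + ENTRY_REGEX + "(?:\\s*,\\s*" + ENTRY_REGEX + ")*)?\\s*\\}\\s*");

    // Returns the entries of one row as strings, or null if the row is not a flat object.
    static Map<String, String> parseRow(String row) {
        if (row == null || !OBJECT.matcher(row).matches()) {
            return null;
        }
        Map<String, String> entries = new LinkedHashMap<>();
        Matcher m = ENTRY.matcher(row);
        while (m.find()) {
            // group(2) is a quoted string value, group(3) is an integer value
            String value = m.group(2) != null ? m.group(2) : m.group(3);
            entries.put(m.group(1), value);
        }
        return entries;
    }

    // Each row is handled on its own, so one bad row cannot affect its neighbours.
    static List<Map<String, String>> parseRows(List<String> rows) {
        List<Map<String, String>> out = new ArrayList<>();
        for (String row : rows) {
            out.add(parseRow(row));
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(parseRows(List.of("{}", "BAD", "{\"A\": 100}")));
        // prints [{}, null, {A=100}]
    }
}
```

Because every row is validated separately, the position of the invalid row is known and can be reported as null, which is the information that is lost once all rows are joined into a single string.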

@sameerz
Collaborator

sameerz commented Nov 13, 2024

NVIDIA/spark-rapids-jni#2562 should close this issue. Confirmed with @ttnghia.

@sameerz sameerz closed this as completed Nov 13, 2024
4 participants