[BUG] Table.readJson dropping valid JSON lines #14282

andygrove · 2023-10-13T17:38:47Z

Describe the bug
In the plugin, we have a ColumnVector containing JSON lines (concatenated together in one large string). Here is a sample:

GPU COLUMN combined - NC: 0 DATA: DeviceMemoryBufferView{address=0x30a013800, length=23082, id=-1} VAL: null
COLUMN combined - STRING
0 "{"teacher":null}
{"teacher":null}
...
{"teacher": "Qocwza","student": {"name": "Yiausu", "age": 19}}
{"teacher": "Rfwvcv","student": {"name": "Yxpbtq", "age": 12}}
{"teacher": "Spmydj","student": {"name": "Ggeyhv", "age": 16}}
{"teacher": "Jnbhwy","student": {"name": "Lmnkzw", "age": 19}}
{"teacher": "Gukkmo","student": {"name": "Xyyuy", "age": 19}}
...

We pass this to Table.readJSON using the following code:

          val (names, rawTable) = withResource(combinedHost) { combinedHost =>
            // Locate the bytes of the single combined string stored in row 0
            val data = combinedHost.getData
            val start = combinedHost.getStartListOffset(0)
            val end = combinedHost.getEndListOffset(0)
            val length = end - start

            // Request null rows for invalid JSON lines instead of failing the whole read
            val jsonOptions = cudf.JSONOptions.builder().withRecoverWithNull(true).build()
            withResource(cudf.Table.readJSON(jsonOptions, data, start, length)) { tableWithMeta =>
              val names = tableWithMeta.getColumnNames
              (names, tableWithMeta.releaseTable())
            }
          }

The resulting table is missing some values. Note that the entry for {"teacher": "Spmydj","student": {"name": "Ggeyhv", "age": 16}} is NULL here.

DEBUG rawTable Table{columns=[ColumnVector{rows=512, type=STRING, nullCount=Optional.empty, offHeap=(ID: 260 7f4b7c3e1a00)}, ColumnVector{rows=512, type=STRUCT, nullCount=Optional.empty, offHeap=(ID: 261 7f4b7c3c1d60)}], cudfTable=139962183575680, rows=512}
GPU COLUMN 0 - NC: 228 DATA: DeviceMemoryBufferView{address=0x30a011a00, length=1696, id=-1} VAL: DeviceMemoryBufferView{address=0x30a006a00, length=64, id=-1}
COLUMN 0 - STRING
0 NULL
1 NULL
...
447 "Qocwza" 516f63777a61
448 "Rfwvcv" 526677766376
449 NULL
450 "Jnbhwy" 4a6e62687779
451 "Gukkmo" 47756b6b6d6f

Steps/Code to reproduce bug
The repro case is in NVIDIA/spark-rapids#9423 in test_from_json_struct_of_struct.

Expected behavior
Data should not be dropped.

Environment overview (please complete the following information)
N/A

Environment details
N/A

Additional context

elstehle · 2023-10-14T08:21:54Z

Thanks for reporting the issue, @andygrove.

Unfortunately, I couldn't reproduce the issue. Could you share the full string you're trying to parse? It's surprising that line 449 is getting parsed to NULL, when its format is the same as for other lines that are getting parsed correctly. Could this have some other cause?

andygrove · 2023-10-17T15:57:37Z

I now have a repro case: #14291

elstehle · 2023-10-18T11:50:55Z

Just to follow up, @andygrove, I have finally gotten to the bottom of the issue. It only occurs when withRecoverWithNull is used, and it was introduced by #13344, the PR that originally added that option.

I'm currently evaluating options for resolving the issue and will likely have a resolution by the end of the week.

elstehle · 2023-10-22T16:02:15Z

@andygrove, I've put up #14309 to address this issue. Feel free to check if it properly addresses your issue.

andygrove · 2023-10-24T00:16:16Z

Thank you @elstehle. I have confirmed that this resolves the issue.

andygrove · 2023-10-24T17:17:21Z

@elstehle I found one edge case where the last line will be dropped rather than replaced with null if it is invalid. This results in Table.readJSON returning one row less than the input data.

elstehle · 2023-10-24T17:40:00Z

Thanks for sharing, @andygrove! I'll look into it.

elstehle · 2023-10-26T10:27:31Z

Thanks, Andy. I investigated the issue and it should only be an issue when the last line is both (a) incomplete, e.g., {"a": and (b) does not end with a newline. I think the easiest way is to make sure that the JSON input is always terminated with a newline character. Is this something that could be done on the Spark side?

…JSON lines (#14309)

Addresses #14282.

For the JSON lines format that recovers after an invalid JSON line, we've had two issues when we were generating the stack context that is used downstream in the full JSON pushdown transducer. For that format, we need to make sure that we "reset" the stack context after each JSON line. That is,

1. We need to reset the stack to the empty stack after each JSON line, as the stack may not be empty after an erroneous JSON line. E.g. `{"this opening brace is never closed":123\n{"<=this brace should be on the empty stack":...}`
2. We need to reset that we are outside of a string: `{"no matching end-quote on this line\n{"<=this quote is the beginning of a field name, not the end of the previous line's field name"`

This fixes above requirements as follows:

1. Was already implemented - but with an inappropriate scan operator that is not associative:
   ```
   StackLevelT new_level = (symbol_to_stack_op_type(rhs.value) == stack_op_type::RESET) ? 0 : (lhs.stack_level + rhs.stack_level);
   ```
   E.g. (`{,n,{`,`},n,{`,`{,n,}`,`},n,}` all fail the associativity test). This was replaced with a `ScanByKey` that would start with a "fresh" stack level with each new key segment.
2. Was addressed by changing the transition table of the finite-state transducer that filters out brackets and braces that are enclosed in quotes to go back to the `OOS` (`outside-of-string`) state after every newline.

This behaviour requires that _every_ newline character is treated as a delimiter of a JSON line. This was confirmed by Spark Rapids, who is the targeted user for the recovery option to be the case.

Authors:
- Elias Stehle (https://github.com/elstehle)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)

URL: #14309
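To make the associativity problem from the commit message concrete, here is a minimal Scala sketch of the old operator and of a per-line restart. It is not the libcudf implementation: `StackOp`, `toOp`, `combine`, `isAssociativeOn`, and `levelsByLine` are hypothetical names, and the sketch assumes that the combined element carries the right-hand symbol, which the quoted snippet does not show.

```scala
object StackContextSketch {
  // Hypothetical model of one scan element: a stack-level delta plus the symbol it came from.
  final case class StackOp(level: Int, symbol: Char)

  // '{' pushes, '}' pops, everything else (including '\n') leaves the level unchanged.
  def toOp(c: Char): StackOp = c match {
    case '{' => StackOp(1, c)
    case '}' => StackOp(-1, c)
    case _   => StackOp(0, c)
  }

  // Transliteration of the quoted operator: reset to 0 when the right operand is a newline.
  // Assumption: the result keeps the right-hand symbol.
  def combine(lhs: StackOp, rhs: StackOp): StackOp = {
    val newLevel = if (rhs.symbol == '\n') 0 else lhs.level + rhs.level
    StackOp(newLevel, rhs.symbol)
  }

  // True when (a op b) op c equals a op (b op c) for the three symbols.
  def isAssociativeOn(a: Char, b: Char, c: Char): Boolean = {
    val (x, y, z) = (toOp(a), toOp(b), toOp(c))
    combine(combine(x, y), z) == combine(x, combine(y, z))
  }

  // Sequential stand-in for a scan-by-key: each line is its own segment and starts at level 0.
  // The newline symbols themselves are not part of the output.
  def levelsByLine(symbols: String): Vector[Int] =
    symbols.split("\n", -1).toVector.flatMap { line =>
      line.toVector.scanLeft(0)((level, c) => level + toOp(c).level).tail
    }

  def main(args: Array[String]): Unit = {
    // The four triples listed in the commit message.
    val triples = Seq(('{', '\n', '{'), ('}', '\n', '{'), ('{', '\n', '}'), ('}', '\n', '}'))
    triples.foreach { case (a, _, c) =>
      println(s"associative on ($a, newline, $c): ${isAssociativeOn(a, '\n', c)}")
    }
    println(levelsByLine("{\n{}"))
  }
}
```

In this sketch all four triples from the commit message report `false`, because the grouping decides whether the reset happens before or after the neighbouring levels are added. `levelsByLine("{\n{}")` yields `Vector(1, 1, 0)`: the unclosed brace on the first line no longer leaks into the second.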

elstehle · 2023-10-31T08:21:07Z

Hi @andygrove!
This issue has been addressed by #14309 that was just merged. So, I'm closing this issue for now.

Please let me know whether the remaining edge case you reported, where an invalid last line is dropped rather than replaced with null, can be addressed by appending a newline on the Spark side, or whether that turns out not to be feasible.

andygrove · 2023-10-31T14:05:36Z

Thanks @elstehle. Yes, I have confirmed that adding a newline at the end resolves the issue for us. Thanks for looking into this.
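The workaround agreed on above amounts to a small guard before the data reaches `Table.readJSON`. The following is a minimal sketch, assuming the concatenated JSON lines are still available as a `String`. In the plugin the data sits in a column buffer, so the real change would look different, and `ensureTrailingNewline` is a hypothetical name.

```scala
object JsonLinesInput {
  // Hypothetical helper: append '\n' unless the input is empty or already ends with one,
  // so that an incomplete last line is still terminated like every other line.
  def ensureTrailingNewline(jsonLines: String): String =
    if (jsonLines.isEmpty || jsonLines.endsWith("\n")) jsonLines else jsonLines + "\n"

  def main(args: Array[String]): Unit = {
    // The second line is incomplete and has no trailing newline, as in the reported edge case.
    val input = "{\"a\": 1}\n{\"a\":"
    val terminated = ensureTrailingNewline(input)
    println(terminated.endsWith("\n"))
  }
}
```

The helper is idempotent, so applying it to input that is already newline-terminated leaves the row count unchanged.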

andygrove added the bug, Needs Triage, and Spark labels Oct 13, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Oct 13, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Oct 13, 2023

elstehle self-assigned this Oct 13, 2023

andygrove mentioned this issue Oct 17, 2023

[FEA] [EPIC] Priority JSON Issues NVIDIA/spark-rapids#9458

elstehle mentioned this issue Oct 22, 2023

Fixes stack context for json lines format that recovers from invalid JSON lines #14309

elstehle removed the Needs Triage label Oct 22, 2023

elstehle closed this as completed Oct 31, 2023

github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Oct 31, 2023
