Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[BUG Table.readJson dropping valid JSON lines #14282

Closed
Tracked by #9458
andygrove opened this issue Oct 13, 2023 · 10 comments
Closed
Tracked by #9458

[BUG Table.readJson dropping valid JSON lines #14282

andygrove opened this issue Oct 13, 2023 · 10 comments
Assignees
Labels
bug Something isn't working Spark Functionality that helps Spark RAPIDS

Comments

@andygrove
Copy link
Contributor

andygrove commented Oct 13, 2023

Describe the bug
In the plugin, we have a ColumnVector containing JSON lines (concatenated together in one large string). Here is a sample:

GPU COLUMN combined - NC: 0 DATA: DeviceMemoryBufferView{address=0x30a013800, length=23082, id=-1} VAL: null
COLUMN combined - STRING
0 "{"teacher":null}
{"teacher":null}
...
{"teacher": "Qocwza","student": {"name": "Yiausu", "age": 19}}
{"teacher": "Rfwvcv","student": {"name": "Yxpbtq", "age": 12}}
{"teacher": "Spmydj","student": {"name": "Ggeyhv", "age": 16}}
{"teacher": "Jnbhwy","student": {"name": "Lmnkzw", "age": 19}}
{"teacher": "Gukkmo","student": {"name": "Xyyuy", "age": 19}}
...

We pass this to Table.readJson using the following code:

          val (names, rawTable) = withResource(combinedHost) { combinedHost =>
            val data = combinedHost.getData
            val start = combinedHost.getStartListOffset(0)
            val end = combinedHost.getEndListOffset(0)
            val length = end - start

            val jsonOptions = cudf.JSONOptions.builder().withRecoverWithNull(true).build()
            withResource(cudf.Table.readJSON(jsonOptions, data, start, length)) { tableWithMeta =>
              val names = tableWithMeta.getColumnNames
              (names, tableWithMeta.releaseTable())
            }
          }

The resulting table is missing some values. Note that the entry for {"teacher": "Spmydj","student": {"name": "Ggeyhv", "age": 16}} is NULL here.

DEBUG rawTable Table{columns=[ColumnVector{rows=512, type=STRING, nullCount=Optional.empty, offHeap=(ID: 260 7f4b7c3e1a00)}, ColumnVector{rows=512, type=STRUCT, nullCount=Optional.empty, offHeap=(ID: 261 7f4b7c3c1d60)}], cudfTable=139962183575680, rows=512}
GPU COLUMN 0 - NC: 228 DATA: DeviceMemoryBufferView{address=0x30a011a00, length=1696, id=-1} VAL: DeviceMemoryBufferView{address=0x30a006a00, length=64, id=-1}
COLUMN 0 - STRING
0 NULL
1 NULL
...
447 "Qocwza" 516f63777a61
448 "Rfwvcv" 526677766376
449 NULL
450 "Jnbhwy" 4a6e62687779
451 "Gukkmo" 47756b6b6d6f

Steps/Code to reproduce bug
The repro case is in NVIDIA/spark-rapids#9423 in test_from_json_struct_of_struct.

Expected behavior
Data should not be dropped.

Environment overview (please complete the following information)
N/A

Environment details
N/A

Additional context

@andygrove andygrove added bug Something isn't working Needs Triage Need team to review and classify Spark Functionality that helps Spark RAPIDS labels Oct 13, 2023
@elstehle elstehle self-assigned this Oct 13, 2023
@elstehle
Copy link
Contributor

Thanks for reporting the issue, @andygrove.

Unfortunately, I couldn't reproduce the issue. Could you share the full string you're trying to parse? It's surprising that line 449 is getting parsed to NULL, when its format is the same as for other lines that are getting parsed correctly. Could this have some other cause?

@andygrove
Copy link
Contributor Author

I now have a repro case: #14291

@elstehle
Copy link
Contributor

Just to follow up, @andygrove, I have finally gotten to the bottom of the issue. The issue only occurs when withRecoverWithNull is used. The issue was introduced with the original PR that introduced that option #13344.

I'm currently elaborating options for resolving the issue and will likely have a resolution by the end of the week.

@elstehle
Copy link
Contributor

@andygrove, I've put up #14309 to address this issue. Feel free to check if it properly addresses your issue.

@elstehle elstehle removed the Needs Triage Need team to review and classify label Oct 22, 2023
@andygrove
Copy link
Contributor Author

@andygrove, I've put up #14309 to address this issue. Feel free to check if it properly addresses your issue.

Thank you @elstehle. I have confirmed that this resolves the issue.

@andygrove
Copy link
Contributor Author

@elstehle I found one edge case where the last line will be dropped rather than replaced with null if it is invalid. This results in Table.readJSON returning one row less than the input data.

@elstehle
Copy link
Contributor

Thanks for sharing, @andygrove! I'll look into it.

@elstehle
Copy link
Contributor

@elstehle I found one edge case where the last line will be dropped rather than replaced with null if it is invalid. This results in Table.readJSON returning one row less than the input data.

Thanks, Andy. I investigated the issue and it should only be an issue when the last line is both (a) incomplete, e.g., {"a": and (b) does not end with a newline. I think the easiest way is to make sure that the JSON input is always terminated with a newline character. Is this something that could be done on the Spark side?

rapids-bot bot pushed a commit that referenced this issue Oct 31, 2023
…JSON lines (#14309)

Addresses #14282.

For the JSON lines format that recovers after an invalid JSON line, we've had two issues when we were generating the stack context that is used downstream in the full JSON pushdown transducer.

For that format, we need to make sure that we "reset" the stack context after each JSON line. That is, 

1. We need to reset the stack to the empty stack after each JSON line, as the stack may not be empty after an erroneous  JSON line. E.g. `{"this opening brace is never closed":123\n{"<=this brace should be on the empty stack":...}`
2. We need to reset that we are outside of a string: `{"no matching end-quote on this line\n{"<=this quote is the beginning of a field name, not the end of the previous line's field name"`

This fixes above requirements as follows:
1. Was already implemented - but with an inappropriate scan operator that is not associative:
```
StackLevelT new_level = (symbol_to_stack_op_type(rhs.value) == stack_op_type::RESET)
                              ? 0
                              : (lhs.stack_level + rhs.stack_level);
```
E.g. (`{,n,{`,`},n,{`,`{,n,}`,`},n,}` all fail the associativity test). This was replaced with a `ScanByKey` that would start with a "fresh" stack level with each new key segment. 

2. Was addressed by changing the transition table of the finite-state transducer that filters out brackets and braces that are enclosed in quotes to go back to the `OOS` (`outside-of-string`) state after every newline. This behaviour requires that _every_ newline character is treated as a delimiter of a JSON line. This was confirmed by Spark Rapids, who is the targeted user for the recovery option to be the case.

Authors:
  - Elias Stehle (https://github.com/elstehle)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #14309
@elstehle
Copy link
Contributor

Hi @andygrove!
This issue has been addressed by #14309 that was just merged. So, I'm closing this issue for now.

Please let me know if the following issue for incomplete last lines can be addressed by appending a newline on the Spark side. If that should turn out to not be feasible please let me know.

@elstehle I found one edge case where the last line will be dropped rather than replaced with null if it is invalid. This results in Table.readJSON returning one row less than the input data.

@github-project-automation github-project-automation bot moved this from In Progress to Done in cuDF/Dask/Numba/UCX Oct 31, 2023
@andygrove
Copy link
Contributor Author

Hi @andygrove! This issue has been addressed by #14309 that was just merged. So, I'm closing this issue for now.

Please let me know if the following issue for incomplete last lines can be addressed by appending a newline on the Spark side. If that should turn out to not be feasible please let me know.

@elstehle I found one edge case where the last line will be dropped rather than replaced with null if it is invalid. This results in Table.readJSON returning one row less than the input data.

Thanks @elstehle. Yes, I have confirmed that adding a newline at the end resolves the issue for us. Thanks for looking into this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working Spark Functionality that helps Spark RAPIDS
Projects
Archived in project
Development

No branches or pull requests

2 participants