Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record #14279

elstehle · 2023-10-13T12:43:31Z

Description

With this change, JSON_LINES_RECOVER ignores excess characters after the first valid JSON record on each JSON line. For example, on each of the following lines only the first record, { "number": 1 }, is parsed:

{ "number": 1 } 
{ "number": 1 } xyz
{ "number": 1 } {}
{ "number": 1 } { "number": 4 }
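
A rough CPU-side sketch of the intended per-line result, for illustration only: it keeps the first top-level struct or list on a line and drops whatever follows. The helper name first_record is hypothetical and is not part of libcudf. It only balances brackets and skips string contents, so unlike the real reader it does not validate the record.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not a libcudf API): returns the first top-level struct
// or list found on `line`, or an empty string if none is opened and closed.
// Assumption: bracket balancing only, no JSON validation.
inline std::string first_record(std::string const& line)
{
  std::size_t depth = 0;      // current nesting depth of structs/lists
  std::size_t begin = 0;      // index of the first opening bracket
  bool started      = false;  // true once the first opening bracket was seen
  bool in_string    = false;  // true while inside a quoted string

  for (std::size_t i = 0; i < line.size(); ++i) {
    char const c = line[i];
    if (in_string) {
      if (c == '\\') {
        ++i;  // skip the escaped character
      } else if (c == '"') {
        in_string = false;
      }
      continue;
    }
    if (c == '"') {
      in_string = true;
    } else if (c == '{' || c == '[') {
      if (!started) {
        started = true;
        begin   = i;
      }
      ++depth;
    } else if (c == '}' || c == ']') {
      if (depth == 0) { return {}; }  // closing bracket without an opener
      if (--depth == 0) { return line.substr(begin, i - begin + 1); }
    }
  }
  return {};  // no complete struct or list on this line
}
```

Under these assumptions, each of the four example lines above reduces to { "number": 1 }.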

Implementation details:
The pushdown automaton of the JSON parser was changed for the JSON_LINES_RECOVER format: when it is in state PD_PVL (post-value, "I have just finished parsing a value") and the stack context is ROOT ("I'm not within a list or struct"), it treats every character as white space until it encounters a newline character. Post-value in stack context ROOT is exactly the condition the parser is in after it has parsed the first valid record of a JSON line. Thanks to @karthikeyann for suggesting PD_PVL as the capturing state.
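
The rule described above can be sketched as a tiny symbol classifier. This is only an illustration of the condition, not the actual transition table: PD_PVL and ROOT are taken from the description, while every other name (ParserState, StackContext, SymbolClass, classify_symbol, PD_OTHER, the recover_mode flag) is hypothetical.

```cpp
// Hypothetical types for illustration; only PD_PVL and ROOT come from the PR text.
enum class StackContext { ROOT, STRUCT, LIST };
enum class ParserState { PD_PVL, PD_OTHER };
enum class SymbolClass { WHITE_SPACE, LINE_END, OTHER };

// Decides how a symbol is treated given the current state and stack context.
// In recover mode, everything after the first record of a line (post-value at
// ROOT) is treated like white space until the newline.
inline SymbolClass classify_symbol(ParserState state,
                                   StackContext context,
                                   char symbol,
                                   bool recover_mode)
{
  // A newline always ends the current JSON line.
  if (symbol == '\n') { return SymbolClass::LINE_END; }

  // Post-value at ROOT: the first record of the line has been parsed.
  bool const after_first_record =
    recover_mode && state == ParserState::PD_PVL && context == StackContext::ROOT;
  if (after_first_record) { return SymbolClass::WHITE_SPACE; }

  bool const is_space = (symbol == ' ' || symbol == '\t' || symbol == '\r');
  return is_space ? SymbolClass::WHITE_SPACE : SymbolClass::OTHER;
}
```

The same character is still classified normally when the parser is post-value inside a struct or list, which is why the stack context is part of the condition.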

Because the stack context is generated upfront, the stack context of all these excess characters has to be corrected to ROOT. That is (_ means ROOT stack context, { means within a STRUCT stack context):

in:    {"a":1}{"this is supposed to be ignored"}
stack: _{{{{{{_{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{{

Needs to be fixed up to become:

in:    {"a":1}{"this is supposed to be ignored"}
stack: _{{{{{{__________________________________
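
A sequential sketch of this fix-up, assuming the stack context is given as one character per input symbol exactly as in the diagrams above ('_' for ROOT, anything else for a nested context). The function name fix_up_stack_context is hypothetical, and the real implementation runs on the GPU rather than in a loop like this. The sketch also only detects records that are structs or lists, since it recognises the end of a record by the context returning to ROOT.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper (not a libcudf API): resets the stack context to ROOT
// ('_') for every symbol that follows the first record on each line.
inline std::string fix_up_stack_context(std::string const& input, std::string stack)
{
  assert(input.size() == stack.size());  // one context entry per input symbol

  bool inside_record = false;  // a non-ROOT context was seen on this line
  bool record_done   = false;  // the first record of this line has closed

  for (std::size_t i = 0; i < input.size(); ++i) {
    if (!record_done) {
      if (stack[i] != '_') {
        inside_record = true;
      } else if (inside_record) {
        record_done = true;  // context returned to ROOT: first record ended
      }
    }
    if (record_done) { stack[i] = '_'; }  // excess symbols get ROOT context

    // A newline starts the next JSON line with a clean slate.
    if (input[i] == '\n') {
      inside_record = false;
      record_done   = false;
    }
  }
  return stack;
}
```

Applied to the input and stack shown above, this turns every context entry after the closing brace of {"a":1} into _.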

Checklist

I am familiar with the Contributing Guidelines.
New or existing tests cover these changes.
The documentation is up to date with these changes.

elstehle · 2023-10-13T12:46:31Z

cpp/src/io/fst/lookup_tables.cuh

@@ -753,15 +753,15 @@ class TranslationOp {
                                             RelativeOffsetT const relative_offset,
                                             SymbolT const read_symbol) const
  {
-    return translation_op(*this, state_id, match_id, relative_offset, read_symbol);
+    return translation_op(state_id, match_id, relative_offset, read_symbol);


This is just a minor improvement that I piggybacked onto this PR. The delegate that implements translation_op does not benefit from the extra reference to *this, so it was removed from the arguments.

bdice

Just a couple small pieces of feedback. I think this makes sense. The overall complexity of the FST code is getting higher and higher, but I'm not sure if we can do much about that while continuing to add requested features.

cpp/src/io/json/nested_json_gpu.cu

cpp/tests/io/json_test.cpp

cpp/src/io/json/nested_json_gpu.cu

elstehle · 2023-10-14T07:38:31Z

The overall complexity of the FST code is getting higher and higher, but I'm not sure if we can do much about that while continuing to add requested features.

Yes, agreed. Unfortunately, the Pushdown Transducer for a JSON parser is quite complex by nature already. For most of the remaining Spark requests on our radar we will be introducing extra pre-/post-processing steps rather than changing the Pushdown machinery.

karthikeyann

Great work! 🚀 Especially with the additional processing step without complicating the existing JSON FST table.

cpp/src/io/json/nested_json_gpu.cu

GregoryKimball · 2023-10-18T03:43:42Z

@andygrove would you please evaluate this solution?

andygrove · 2023-10-19T21:46:11Z

@andygrove would you please evaluate this solution?

@GregoryKimball I have confirmed that this resolves the issue. The plugin tests for this issue now pass when run against this PR.

elstehle · 2023-10-20T08:00:04Z

/merge

changes recovery behaviour to ignore excess chars

92be873

elstehle requested a review from a team as a code owner October 13, 2023 12:43

elstehle requested review from bdice and nvdbaranec October 13, 2023 12:43

github-actions bot added the libcudf label Oct 13, 2023

elstehle added the bug, 3 - Ready for Review, cuIO, and non-breaking labels Oct 13, 2023

elstehle commented Oct 13, 2023

elstehle requested review from karthikeyann and vuule October 13, 2023 12:55

elstehle mentioned this pull request Oct 13, 2023

[FEA] Add JSON reader option to ignore all characters after a valid JSON record #14226

Closed

bdice approved these changes Oct 13, 2023

ttnghia reviewed Oct 14, 2023

adds test case for line comments

82b813d

ttnghia approved these changes Oct 14, 2023

karthikeyann reviewed Oct 17, 2023

elstehle added 2 commits October 16, 2023 23:48

Merge remote-tracking branch 'upstream/branch-23.12' into fix/json-li…

01b882f

…nes-recover-superfluous-data

renames transition table entries

ae0cac2

elstehle requested a review from karthikeyann October 17, 2023 06:55

karthikeyann approved these changes Oct 17, 2023

Merge branch 'branch-23.12' into fix/json-lines-recover-superfluous-data

75696ce

karthikeyann requested a review from andygrove October 18, 2023 03:45

Merge remote-tracking branch 'upstream/branch-23.12' into fix/json-li…

b277665

…nes-recover-superfluous-data

andygrove mentioned this pull request Oct 19, 2023

Specify recoverWithNull when reading JSON files NVIDIA/spark-rapids#9304

Merged

Merge remote-tracking branch 'upstream/branch-23.12' into fix/json-li…

caa88b7

…nes-recover-superfluous-data

rapids-bot bot merged commit 50e2211 into rapidsai:branch-23.12 Oct 20, 2023
57 checks passed

vuule mentioned this pull request Nov 7, 2023

[FEA] JSON reader improvements for Spark-RAPIDS #13525

Open
