Changes JSON reader's recovery option's behaviour to ignore all characters after a valid JSON record #14279
@@ -753,15 +753,15 @@ class TranslationOp {
   RelativeOffsetT const relative_offset,
   SymbolT const read_symbol) const
 {
-  return translation_op(*this, state_id, match_id, relative_offset, read_symbol);
+  return translation_op(state_id, match_id, relative_offset, read_symbol);
This is just a minor improvement that I piggy-backed onto this PR. The delegate that implements `translation_op` gains nothing from the extra reference to `*this`, so we removed it from the arguments.
Just a couple small pieces of feedback. I think this makes sense. The overall complexity of the FST code is getting higher and higher, but I'm not sure if we can do much about that while continuing to add requested features.
Yes, agreed. Unfortunately, the Pushdown Transducer for a JSON parser is quite complex by nature already. For most of the remaining Spark requests on our radar we will be introducing extra pre-/post-processing steps rather than changing the Pushdown machinery.
Great work! 🚀 Especially with the additional processing step without complicating the existing JSON FST table.
@andygrove would you please evaluate this solution?
@GregoryKimball I have confirmed that this resolves the issue. The plugin tests for this issue now pass when run against this PR.
/merge
Description
Closes #14226.
With this change, `JSON_LINES_RECOVER` ignores excess characters after the first valid JSON record on each JSON line.
Implementation details:
The JSON parser pushdown automaton was changed for the `JSON_LINES_RECOVER` format such that when in state `PD_PVL` (`post-value`, "I have just finished parsing a value") and when the stack context is `ROOT` ("I'm not somewhere within a list or struct"), we just treat all characters as "white space" until encountering a newline character. `post-value` in stack context `ROOT` is exactly the condition we are in after having parsed the first valid record of a JSON line. Thanks to @karthikeyann for suggesting to use `PD_PVL` as the capturing state.
As the stack context is generated upfront, we have to fix it up so that it is set to `ROOT` for all these excess characters. I.e., (`_` means `ROOT` stack context, `{` means within a `STRUCT` stack context):
Needs to be fixed up to become:
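The fix-up described above can be illustrated with a small, self-contained C++ sketch. This is not the PR's implementation (which operates on the pushdown machinery on the GPU); the function names `naive_stack_context` and `fix_up_stack_context` are hypothetical, and the sketch assumes that each record is a struct, a list, or a string (bare numbers and literals at the root are not handled). It uses the same notation as the description: `_` for `ROOT`, `{` for `STRUCT`, plus `[` for a list, and it assigns each character the context that was on top of the stack before that character was processed.

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: per-character stack context computed upfront,
// without any knowledge of where the first record of a line ends.
std::string naive_stack_context(std::string const& input)
{
  std::string ctx(input.size(), '_');
  std::string stack;  // open brackets seen so far on the current line
  bool in_string = false;
  bool escaped   = false;
  for (std::size_t i = 0; i < input.size(); ++i) {
    char const c = input[i];
    ctx[i]       = stack.empty() ? '_' : stack.back();
    if (c == '\n') {  // a newline always starts a fresh record
      stack.clear();
      in_string = escaped = false;
      continue;
    }
    if (in_string) {
      if (escaped) { escaped = false; }
      else if (c == '\\') { escaped = true; }
      else if (c == '"') { in_string = false; }
      continue;
    }
    if (c == '"') { in_string = true; }
    else if (c == '{' || c == '[') { stack.push_back(c); }
    else if ((c == '}' || c == ']') && !stack.empty()) { stack.pop_back(); }
  }
  return ctx;
}

// Hypothetical fix-up: once the first record of a line is complete,
// force the context of every remaining character on that line to ROOT.
std::string fix_up_stack_context(std::string const& input, std::string ctx)
{
  assert(input.size() == ctx.size());
  int depth        = 0;
  bool in_string   = false;
  bool escaped     = false;
  bool record_done = false;  // "post-value in ROOT" has been reached
  for (std::size_t i = 0; i < input.size(); ++i) {
    char const c = input[i];
    if (record_done) { ctx[i] = '_'; }
    if (c == '\n') {
      depth     = 0;
      in_string = escaped = record_done = false;
      continue;
    }
    if (record_done) { continue; }  // excess characters are ignored
    if (in_string) {
      if (escaped) { escaped = false; }
      else if (c == '\\') { escaped = true; }
      else if (c == '"') {
        in_string = false;
        if (depth == 0) { record_done = true; }  // root-level string ended
      }
      continue;
    }
    if (c == '"') { in_string = true; }
    else if (c == '{' || c == '[') { ++depth; }
    else if ((c == '}' || c == ']') && depth > 0) {
      if (--depth == 0) { record_done = true; }  // root-level record closed
    }
  }
  return ctx;
}
```

For the illustrative input `{"a":1}{"b":2}` (an example made up for this sketch), `naive_stack_context` yields `_{{{{{{_{{{{{{`, and `fix_up_stack_context` turns that into `_{{{{{{_______`, so the second object is seen entirely in `ROOT` context and can be skipped like white space.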
Checklist