
[FEA] JSON validator for json strings given in strings column #12532

Closed
ttnghia opened this issue Jan 11, 2023 · 16 comments
Labels: 2 - In Progress, cuIO, feature request, libcudf, Spark

Comments

@ttnghia (Contributor) commented Jan 11, 2023

Currently, the input to the JSON reader is a single string. If the input JSON is given as a strings column, we have to concatenate the rows of the input column into one unified JSON string before parsing it.

A problem arises when some rows of the input strings column contain invalid JSON. In such situations, applications such as Spark simply output nulls for the invalid rows; the remaining rows are still parsed independently and an output table is still generated. cudf's JSON parser, on the other hand, just throws an exception and the entire application crashes.

We could build a JSON parser that works independently on each string row of the input column. However, according to @karthikeyann, this would require a large effort, so it is only mentioned here, not requested.

A simpler solution to concentrate on is a JSON validator that checks whether each input string is valid JSON. The logic of a validator should be easier to implement than a full parser, so it is doable. After quickly validating all input rows, we can identify which rows contain invalid JSON and simply trim them from the input.
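A rough sketch of that workflow, assuming a hypothetical `is_valid_json` API (the validator requested in this issue; it does not exist in libcudf): validate every row, then use the resulting mask to drop the invalid rows before the remaining rows are concatenated and parsed.

```cpp
// Sketch of the proposed workflow. is_valid_json is hypothetical (it is the
// feature being requested); apply_boolean_mask is an existing libcudf API.
#include <cudf/column/column.hpp>
#include <cudf/stream_compaction.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>

// Hypothetical API: one BOOL8 flag per row, true if the row holds valid JSON.
std::unique_ptr<cudf::column> is_valid_json(cudf::strings_column_view const& input);

std::unique_ptr<cudf::table> drop_invalid_json_rows(cudf::strings_column_view const& input)
{
  auto const valid = is_valid_json(input);
  // Keep only the valid rows; the survivors can then be concatenated and parsed.
  return cudf::apply_boolean_mask(cudf::table_view{{input.parent()}}, valid->view());
}
```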

ttnghia added the "feature request" and "Needs Triage" labels on Jan 11, 2023
@ttnghia (Contributor, Author) commented Jan 11, 2023

CC @karthikeyann, @GregoryKimball and @revans2.

@revans2 (Contributor) commented Jan 11, 2023

We want this for JSON Lines files too. If a line is bad, it should be independent of the other lines in the file. @ttnghia, I think this might be related to how you are configuring the reader and concatenating the lines together. In your code you wrap the entire file in "[" and "]" and insert a "," between each line. If we instead configure the reader as JSON Lines and insert a "\n" between each line, I think it will do exactly what we want.
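A minimal sketch of that configuration (not the actual Spark plugin code): feed the newline-joined rows to libcudf's JSON reader with the JSON Lines option enabled. The `source_info` overload taking a host pointer and size is an assumption and may differ between libcudf versions.

```cpp
// Minimal sketch: parse a newline-joined buffer as JSON Lines so that each line
// is an independent record. The source_info overload used here is an assumption.
#include <cudf/io/json.hpp>

#include <string>

cudf::io::table_with_metadata parse_as_json_lines(std::string const& newline_joined_rows)
{
  auto options = cudf::io::json_reader_options::builder(
                   cudf::io::source_info{newline_joined_rows.data(), newline_joined_rows.size()})
                   .lines(true)  // JSON Lines: one record per '\n'-delimited line
                   .build();
  return cudf::io::read_json(options);
}
```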

@ttnghia (Contributor, Author) commented Jan 11, 2023

So far no customer is screaming about this issue; probably none of them has had to deal with invalid input JSON yet. However, if any of them stumble on it, they will turn to us.

Thus, supporting this FEA should be a high priority. I believe it is beneficial not just for us (Spark) but also for others who want to make wide use of the JSON reader.

@revans2 (Contributor) commented Jan 11, 2023

Yes, I just checked this myself and the experimental parser throws an exception if any of the lines are invalid. This is a major regression compared to the previous JSON Lines parser and is going to be a blocker for us adopting it.

ttnghia added the "Spark" label on Jan 13, 2023
@elstehle (Contributor) commented Jan 18, 2023

If it were sufficient to just concatenate the rows of the input column into a single JSON Lines string, we could provide a mechanism that recovers after an invalid line once it enters a new line (i.e., after seeing a \n).

Currently, the JSON tokenizer component defines a state machine with an error state that is entered on encountering invalid JSON input. The error state is a trap state: once entered, the machine remains in that state for the rest of the input. For your use case, we could provide the possibility to return (i.e., recover) to the beginning-of-JSON-value (PD_BOV) state on encountering a \n, essentially starting to parse the next JSON value. Apart from the state machine, this would require adapting the tree traversal component's handling of the ErrorBegin token, which @karthikeyann can speak to better.
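An illustrative toy, not the actual cuIO finite-state transducer: it shows the difference between a trap error state and one that recovers to a start state on `\n`. Here "valid" simply means a line made only of digits, standing in for valid JSON.

```cpp
// Toy scanner: "valid" means a line made only of digits. A trap error state
// poisons all following lines; recovering to the start state on '\n' (the
// PD_BOV idea above) confines the damage to the offending line.
#include <cctype>
#include <string_view>
#include <vector>

enum class state { start, ok, error };

std::vector<bool> scan_lines(std::string_view input, bool recover_on_newline)
{
  std::vector<bool> line_valid;
  state s = state::start;
  for (char c : input) {
    if (c == '\n') {
      line_valid.push_back(s != state::error);
      if (s != state::error || recover_on_newline) { s = state::start; }  // recover or stay trapped
      continue;
    }
    if (s != state::error) {
      s = std::isdigit(static_cast<unsigned char>(c)) ? state::ok : state::error;
    }
  }
  line_valid.push_back(s != state::error);  // final line without a trailing '\n'
  return line_valid;
}

// scan_lines("123\nabc\n456", false) -> {true, false, false}  (error is a trap state)
// scan_lines("123\nabc\n456", true)  -> {true, false, true}   (recovers after '\n')
```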

@revans2 (Contributor) commented Jan 19, 2023

@elstehle that sounds like a great solution when we are in JSON Lines parsing mode. If you are not in JSON Lines mode, then you don't have any guarantee like that. But we plan on only using JSON Lines for our parsing, so you should check with others to see what requirements they might have for containing errors in other modes.

@karthikeyann (Contributor) commented Jan 19, 2023

@elstehle For that idea, tree traversal does not need any update; tree generation will need an update.
Besides, is there a guarantee that the JSON string will be in "records" (struct) or "values" (list) form uniformly for all rows? If that guarantee fails, tree traversal will fail.

@revans2 Actually, the above idea will work if the input string in each row is not itself in JSON Lines format, because we are using the newline to differentiate between rows. If a JSON string row is itself in JSON Lines format, this may not work.

@revans2 Is it possible for the input JSON in each string to be in "JSON Lines" format? Is there an expected format for the JSON string in each row?

@revans2 (Contributor) commented Jan 19, 2023

To be clear, there are two ways that Spark parses JSON. The first is an input file where we are guaranteed that each record is separated by a line delimiter ('\n' is the default and is used everywhere, but it is configurable), a.k.a. JSON Lines. We also support reading JSON from a string column (which is what this issue is about). In this case there is no explicit guarantee that a '\n' will not show up in the body of a string. But Spark already has bugs if that happens, so for now we are just going to document that it is not supported, and in the worst case we can detect it ahead of time and throw an error.

For Spark, when parsing a file that is not in JSON Lines format, the data is a single JSON record per file. This is so slow that no one does it. When reading from a string column, each string entry is treated as an entire JSON record. So if it were in a format like [{"A": 100},\n{"A": 200}], it would be an error unless the output type requested were a LIST<STRUCT<A: INT>> or something similar. We can cross that bridge when we have a customer who needs something like this.

@GregoryKimball (Contributor) commented Feb 8, 2023

Thank you everyone for this discussion.

It sounds like the issue should be re-scoped to "Return null rows when parsing JSON Lines and a line is invalid". Then we would not need a JSON validator tool, and Spark could configure the strings column as JSON Lines.

Aside from that, @karthikeyann brings up a good point that there are also schema issues for Spark compatibility. One schema issue would be if some lines are object root ({...) and some lines are array root ([...) in the same column. Another would be a type mismatch for the same key (e.g. {"a":[]}\n{"a":{"b":0}}). If the current libcudf behavior is not going to work for Spark, we'll need to write issues for these cases as well.

@ttnghia (Contributor, Author) commented Feb 8, 2023

It sounds like the issue should be re-scoped to "Return null rows when parsing JSON Lines and a line is invalid"

This is not 100% accurate. We also have APIs like from_json which don't parse JSON but just tokenize the input column. With such a use case (and potential future use cases) in mind, maybe the issue should be re-scoped to "Return null rows when tokenizing JSON Lines and a line is invalid".

GregoryKimball added the "1 - On Deck" label and removed the "Needs Triage" label on Mar 11, 2023
GregoryKimball added this to the Nested JSON reader milestone on Apr 2, 2023
GregoryKimball moved this to In progress in libcudf on Apr 2, 2023
GregoryKimball added the "libcudf" and "cuIO" labels on Apr 2, 2023
ttnghia changed the title from "[FEA] Json validator for json strings given in strings column" to "[FEA] JSON validator for json strings given in strings column" on Apr 19, 2023
GregoryKimball added the "2 - In Progress" label and removed the "1 - On Deck" label on May 10, 2023
@elstehle (Contributor)

I've opened this draft PR (#13344) as a first step to address this issue.

I'm currently trying to pin down the exact behaviour for the new option that will recover from invalid JSON lines:

  1. How does Spark currently treat newlines within field names or string values for JSON Lines (multiline=false)? E.g. {"a\nb":"123\n456"}: are these three invalid lines ({"a, b":"123, 456"}, i.e. split at each newline) or just one valid line {"a\nb":"123\n456"}?
  2. Is each empty line parsed to a null record (and what about consecutive empty lines and the last line)? E.g., will {"a":123}\n\n{"a":456} parse to three records {"a":123}, null, {"a":456}? And how about the last JSON line if it is followed by a newline: will {"a":123}\n{"a":456}\n parse to three records {"a":123}, {"a":456}, null?
  3. Do we need to differentiate between an invalid JSON line and an empty (null) JSON line in the parsed data? If yes, how would you like to get the invalid lines / rows?

@elstehle (Contributor)

I ran Spark's JSON reader on the following input:


{"a":123}

{"a
":456}

{"a":123}

Which reads:

+---------------+----+
|_corrupt_record|   a|
+---------------+----+
|           null| 123|
|            {"a|null|
|         ":456}|null|
|           null| 123|
+---------------+----+

So it seems that:

  1. newlines within field names are treated as line delimiters
  2. empty lines are ignored

@revans2 (Contributor) commented May 16, 2023

So to be clear, there are multiple different modes for parsing JSON. By default, when you read JSON from a file, all newlines are line delimiters. This is mostly because the file can be split at any point, and they decided that splitting files is more important than supporting newlines in the middle of a record.

But if you use from_json to parse the data from a string column, it reads it properly. Unfortunately?? I guess. Spark does not really have a way to have white space characters in the names of columns or by default unescaped new lines in string data. So parsing it to get what you want gets a little complicated, which is partly why it is going to be difficult to get everything to match perfectly. I am doubtful without a specific Spark mode or a lot of knobs to turn that we are going to get this to match perfectly.

scala> val df = Seq("""{"a":123}""","{\"a\n\":456}", """{"a":123}""", "{\"a\"\n:456}").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.selectExpr("from_json(value, \"STRUCT<a: STRING>\")", "from_json(value, \"MAP<STRING, STRING>\")", "value").show()

+----------------+----------+-----------+
|from_json(value)|   entries|      value|
+----------------+----------+-----------+
|           {123}|{a -> 123}|  {"a":123}|
|          {null}|      null|{"a\n":456}|
|           {123}|{a -> 123}|  {"a":123}|
|           {456}|{a -> 456}|{"a"\n:456}|
+----------------+----------+-----------+

@elstehle (Contributor)

Thanks a lot, @revans2. I think I do have a better understanding of from_json's behaviour now.

I am doubtful without a specific Spark mode or a lot of knobs to turn that we are going to get this to match perfectly.

Spark has a very intricate behaviour, especially around corner cases. From your examples, I understand that from_json behaves as if we were reading each row with multiline=True (I'm extrapolating from the result of the fourth row), where newlines are treated as whitespace. This raises a conflict with our original idea to concatenate rows with \n and recover from an error after a newline.

I'm thinking of two alternatives that could work around this issue:

  1. Instead of concatenating the rows with \n, we use another row delimiter, like 0x1E (record separator), to concatenate the rows, and we recover from an error after such a delimiter (0x1E). Newlines are treated as whitespace. If such a delimiter could be part of a row's string, Spark would need to pre-process the strings to escape the delimiter (a sketch of this option follows below).
  2. Spark pre-processes the strings, removing whitespace newlines. This would require distinguishing between newlines enclosed in quotes (strings / field names) and newlines that are just whitespace.

I think (1) may be easier, because the pre-processing in (2) isn't trivial.
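A minimal sketch of alternative (1), assuming libcudf gains a "recover after delimiter" mode (that reader option is hypothetical here): join the rows with the 0x1E record separator instead of '\n', so newlines inside a row no longer collide with the row delimiter. Only the concatenation below uses existing libcudf APIs.

```cpp
// Sketch of alternative (1): concatenate the rows with the 0x1E record separator
// instead of '\n'. The corresponding "recover after 0x1E" reader option is
// hypothetical; join_strings is an existing libcudf API.
#include <cudf/column/column.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/combine.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <memory>
#include <string>

std::unique_ptr<cudf::column> join_rows_with_record_separator(cudf::strings_column_view const& rows)
{
  auto const delimiter = cudf::string_scalar(std::string{'\x1e'});  // RS (0x1E)
  auto const narep     = cudf::string_scalar("", false);            // invalid narep: result is null if any row is null
  // Single-row strings column holding all rows joined by 0x1E. Spark would have to
  // escape (or reject) rows that already contain the 0x1E byte.
  return cudf::strings::join_strings(rows, delimiter, narep);
}
```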

Then, we still have this behaviour:

I guess. Spark does not really have a way to have white space characters in the names of columns or by default unescaped new lines in string data.

This is something we don't "fail" for right now. How important would this be for Spark? I'll put some thought to it in the meanwhile.

@revans2 (Contributor) commented May 22, 2023

Spark has a very intricate behaviour, especially around corner cases. From your examples, I understand that from_json behaves as if we were reading each row with multiline=True (I'm extrapolating from the result of the fourth row), where newlines are treated as whitespace. This raises a conflict with our original idea to concatenate rows with \n and recover from an error after a newline.

I agree that we might end up needing a Spark-specific mode, which is not really what I think anyone wants. I am hopeful that we can find a small set of control flags, also common with pandas, that allows us to get really close to what Spark does. Then we can do some extra validation checks/cleanup that is Spark-specific. Sadly, it looks like pandas does not have that many config options for JSON parsing, so it would probably involve more pre-processing and cleanup.

Newlines are the hardest part, so we might need to do something special for them. Even then, I don't think it is as bad as we think. By default in Spark a newline is not allowed inside a value or key.

  • allowUnquotedControlChars defaults to false. Allows JSON Strings to contain unquoted control characters (ASCII characters with value less than 32, including tab and line feed characters) or not.

We can fall back to the CPU if this is set to true. In the common case we can write a custom kernel, or use a regular expression, to scan for the bad case and know which rows contain the error. We can then replace/normalize the whitespace in the string before doing the concat (a simplified sketch of such a scan follows below).

I think the kernel should not be too bad. The hardest part would be making it fast for long strings, but even then we can live with less-than-optimal performance for highly escaped quotes (\\\\\\\\\") as long as the common case is still fast.
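A simplified, single-threaded sketch of that scan (the real thing would be a CUDA kernel or a regex pass): flag rows that contain a raw, unescaped control character inside a quoted string or field name, tracking quote and escape state.

```cpp
// Simplified host-side sketch of the per-row scan described above: report whether
// a row contains a raw, unescaped control character inside a quoted string or
// field name (the case that breaks '\n'-based row concatenation).
#include <string_view>

bool has_raw_control_char_in_quotes(std::string_view row)
{
  bool in_quotes = false;
  bool escaped   = false;
  for (unsigned char c : row) {
    if (escaped) { escaped = false; continue; }       // skip the escaped character
    if (in_quotes && c == '\\') { escaped = true; continue; }
    if (c == '"') { in_quotes = !in_quotes; continue; }
    if (in_quotes && c < 0x20) { return true; }       // raw control char (e.g. '\n') in a string
  }
  return false;
}
```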

I guess. Spark does not really have a way to have white space characters in the names of columns or by default unescaped new lines in string data.

This is something we don't "fail" for right now. How important would this be for Spark? I'll put some thought to it in the meanwhile.

Like I said, I think for each difference with Spark we need to decide whether it should be handled in common code (pandas and/or others are likely to have the same problem) or by the Spark team as pre- and/or post-processing.

https://spark.apache.org/docs/latest/sql-data-sources-json.html lists the configs that Spark supports right now.

For all of the configs around parsing primitive data types, we would just ask CUDF to return the values as strings, and we would handle the parsing/validation with custom kernels ourselves. These include timeZone, allowNumericLeadingZeros, dateFormat, timestampFormat, timestampNTZFormat, enabledDateTimeParsingFallback, and allowNonNumericNumbers.

For many others we can fall back to the CPU if we see them, because the default values are things that I think CUDF can support out of the box (but I need to validate that): allowComments, allowUnquotedFieldNames, multiLine, encoding, lineSep, mode, and columnNameOfCorruptRecord.

Some we can just ignore because they deal with inferring the data types, which we don't support on the GPU right now: primitivesAsString, samplingRatio, dropFieldIfAllNull, and prefersDecimal.

Others I think we can support with a combination of pre/post-processing on the input itself: allowUnquotedControlChars and allowBackslashEscapingAnyCharacter. These both really come down to a kernel that can detect the bad cases in the input rows, which mostly involves tracking quotes and escapes, and possibly a way to fix up the input so CUDF is happy.

And finally there are some that I think we will need help from CUDF on: allowSingleQuotes.

The other places where we need help are recovering from badly formatted JSON and, for some really odd corner cases, a way to know whether the input was quoted or not when it is returned to us as a string, if we want to be 100% identical to Spark. The second part is because if Spark sees a value of 12.3450 with no quotes, but the user asks for a String type, it will first parse the data as a floating-point value and then convert that back to a string. That I think is super minor and is something I am happy to live with as a compatibility gap for now.

rapids-bot pushed a commit that referenced this issue on Jul 14, 2023 (#13344):

This PR adds the option to recover from invalid JSON lines to the JSON tokenizer. 

**New option and behaviour:**
- We add the option `enable_recover_from_error` to `json_reader_options`. When this option is enabled for a JSON lines input, the reader will recover from a parsing error encountered on an invalid JSON line and continue parsing the next line.
- When the new option is not enabled, we expect the behaviour of existing functionality to remain untouched.
- When recovering from invalid JSON lines is enabled, all newline characters that are not enclosed in quotes (i.e., newline characters outside of `strings` and `field names`) are interpreted as delimiters of a JSON line. In a future PR we will introduce a new option that reflects this behaviour for JSON Lines inputs that should not recover from errors; hence, this PR introduces the `JSON_LINES_STRICT` enum but does not yet hook it up.

**Implementation details:**
- When recovering from invalid JSON lines is enabled, `get_token_stream()` will delimit each JSON line with a `LineEnd` token to facilitate the identification of tokens that belong to an invalid JSON line.
- We extend the logical stack and introduce a new operation, `reset()`. A `reset()` operation resets the logical stack to an empty stack. This is necessary to reset the stack of the pushdown automaton (PDA) after an invalid JSON line to make sure the stack in subsequent lines is not corrupted.
- We modify the transition and translation table of the finite-state transducer (FST) that is used to generate the push-down automaton's (PDA) stack context operations to emit such a `reset()` operation, iff `recovery` is enabled.
- We modify the transition and translation table of the finite-state transducer (FST) that is used to simulate the full PDA to (1) recover after an invalid JSON line and (2) emit the `LineEnd` token, iff `recovery` is enabled.
- To clean up JSON lines that contain tokens belonging to an invalid line, a token *post-processing* stage is needed. The *post-processing* will replace sequences of `LineEnd` `token*` `ErrorBegin` with the sequence `StructBegin` `StructEnd` (i.e., effectively a `null` row) for record orient inputs. 
- This post-processing is implemented by running an FST on the reverse token stream, discarding all tokens between `ErrorBegin` and the next `LineEnd`, emitting `StructBegin` `StructEnd` pairs on the end of such an invalid line.

This is an initial PR to address #12532.
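A simplified host-side illustration of that post-processing (the actual implementation runs an FST over the reversed token stream on the GPU, and the token set below is only a subset): the tokens of any line containing `ErrorBegin` collapse to a `StructBegin`/`StructEnd` pair.

```cpp
// Simplified illustration of the token post-processing: every line that contains
// ErrorBegin is replaced by StructBegin, StructEnd (i.e. a null row).
#include <vector>

enum class token { StructBegin, StructEnd, FieldNameBegin, FieldNameEnd,
                   StringBegin, StringEnd, ValueBegin, ValueEnd, ErrorBegin, LineEnd };

std::vector<token> null_out_invalid_lines(std::vector<token> const& tokens)
{
  std::vector<token> out;
  std::vector<token> line;   // tokens of the current JSON line
  bool has_error = false;
  auto flush = [&] {
    if (has_error) {         // replace the whole line with an empty struct (null row)
      out.push_back(token::StructBegin);
      out.push_back(token::StructEnd);
    } else {
      out.insert(out.end(), line.begin(), line.end());
    }
    line.clear();
    has_error = false;
  };
  for (auto t : tokens) {
    if (t == token::LineEnd) { flush(); continue; }
    if (t == token::ErrorBegin) { has_error = true; }
    line.push_back(t);
  }
  flush();                   // handle the final line (no trailing LineEnd)
  return out;
}
```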

Authors:
  - Elias Stehle (https://github.com/elstehle)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13344
@GregoryKimball (Contributor)

I would like to close this issue in favor of #13525

GregoryKimball closed this as not planned on Oct 9, 2023
GregoryKimball removed the status in libcudf on Oct 12, 2023
GregoryKimball removed this from libcudf on Oct 26, 2023