[FEA] JSON validator for json strings given in strings column #12532
CC @karthikeyann, @GregoryKimball and @revans2.
We want this for JSON Lines files too. If a line is bad, it should be handled independently of the other lines in the file. @ttnghia I think this might be related to how you are configuring the reader and concatenating the lines together. In the code you have, you are wrapping the entire file in "[" and "]" and inserting a "," between each line. If we just configure the reader as JSON Lines and insert a "\n" between each line, I think it will do exactly what we want.
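For illustration, a minimal C++ sketch of that configuration against libcudf's public I/O API: the rows are joined with '\n' (on the host, purely to keep the sketch short) and the reader is configured for JSON Lines instead of wrapping the input in "[" and "]" with "," separators. How invalid lines are reported is exactly the open question of this issue.

```cpp
#include <cudf/io/json.hpp>
#include <cudf/io/types.hpp>

#include <string>
#include <vector>

// Join the rows with '\n' (JSON Lines layout) and read them with the reader
// configured for JSON Lines, instead of wrapping everything in "[" and "]"
// with "," separators.
cudf::io::table_with_metadata parse_rows_as_json_lines(std::vector<std::string> const& rows)
{
  std::string joined;
  for (auto const& row : rows) {
    joined += row;
    joined += '\n';  // one record per line
  }

  auto const options = cudf::io::json_reader_options::builder(
                         cudf::io::source_info{joined.data(), joined.size()})
                         .lines(true)  // treat the buffer as JSON Lines
                         .build();

  return cudf::io::read_json(options);
}
```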
So far there is no customer screaming about the issue; probably none of them have had to deal with invalid input JSON yet. However, if any of them stumble on it, they will turn to us, so supporting this FEA should be a high priority. I believe this is beneficial not just for us (Spark) but also for anyone else who wants to make wide use of the JSON reader.
Yes, I just checked this myself and the experimental parser throws an exception if any of the lines are invalid. This is a major regression compared to the previous JSON lines parser and is going to be a blocker for us being able to adopt it.
If it were sufficient to just concatenate rows of the input column into a single JSON Lines string, we could provide a mechanism that recovers after an invalid line once the parser enters a new line (i.e., after seeing a newline). Currently, the JSON tokenizer component defines a state machine that has an error state, so recovering at a newline would require extending that state machine.
@elstehle that sounds like a great solution when we are in the JSON Lines parsing mode. If you are not in JSON Lines, then you don't have any kind of guarantee like that. But we plan on only using JSON Lines for our parsing, so you should check with others to see what kind of requirements they might have for boxing of errors in other modes.
@elstehle For that idea, tree traversal does not need any update; tree generation will need an update. @revans2 Actually, the above idea will work if the input string in each row is not in JSON Lines format, because we are using the newline to differentiate between rows. If the JSON string in a row is itself in JSON Lines format, this may not work. @revans2 is it possible for the input JSON in each string to be in "JSON Lines" format?
To be clear, there are two ways that Spark parses JSON. The first is an input file where we are guaranteed that each record is separated by a line delimiter ('\n' is the default and is used everywhere, but it is configurable), aka JSON Lines. We also support reading JSON from a string column (as this issue is talking about). In this case there are no explicit guarantees that a '\n' will not show up in the body of the string. But Spark already has bugs if that happens, so for now we are just going to document that it is not supported, and in the worst case we can detect ahead of time if it happens and throw an error. For Spark, when parsing a file that is not in JSON Lines format, the data is a single JSON record per file. This is so slow that no one does it. When reading from a string column, each string entry is treated as an entire JSON record. So if a string itself were laid out like JSON Lines, with multiple newline-separated records, that falls under the unsupported case above.
Thank you everyone for this discussion. It sounds like the issue should be re-scoped to "Return null rows when parsing JSON Lines and a line is invalid". Then we would not need a JSON validator tool, and Spark could configure the strings column as JSON Lines. Aside from that, @karthikeyann brings up a good point that there are also schema issues for Spark compatibility. One schema issue would be if some lines have an object root (`{...}`) while other lines have an array root (`[...]`).
This is not 100% accurate. We also have APIs that parse JSON given in a strings column, not just JSON read from files.
I've opened this draft PR (#13344) as a first step to address this issue. I'm currently trying to pin down the exact behaviour for the new option that will recover from invalid JSON lines.
I ran Spark's JSON reader on the following input:
Which reads:
So it seems that:
So to be clear, there are multiple different modes for parsing JSON. By default, when you read JSON from a file, all newlines are line delimiters. This is mostly because the file can be split at any point, and they decided that splitting files is more important than supporting newlines in the middle of a record. But if you use `from_json` on a strings column, you can see how newlines inside a record are handled:

```scala
scala> val df = Seq("""{"a":123}""", "{\"a\n\":456}", """{"a":123}""", "{\"a\"\n:456}").toDF
df: org.apache.spark.sql.DataFrame = [value: string]

scala> df.selectExpr("from_json(value, \"STRUCT<a: STRING>\")", "from_json(value, \"MAP<STRING, STRING>\")", "value").show()
+----------------+----------+-----------+
|from_json(value)|   entries|      value|
+----------------+----------+-----------+
|           {123}|{a -> 123}|  {"a":123}|
|          {null}|      null|{"a\n":456}|
|           {123}|{a -> 123}|  {"a":123}|
|           {456}|{a -> 456}|{"a"\n:456}|
+----------------+----------+-----------+
```
Thanks a lot, @revans2. I think I have a much better understanding of Spark's behaviour now.
Spark has very intricate behaviour, especially around corner cases. From your examples, I understand that how a newline is treated depends on where it appears inside the record. I'm thinking of two alternatives that could work around this issue:
I think (1) may be easier, because (2) isn't trivial to pre-process. Then, we still have this behaviour:
This is something we don't "fail" for right now. How important would this be for Spark? I'll put some thought into it in the meanwhile.
I agree that we might end up needing a Spark-specific mode, which is not really what I think anyone wants. I am hopeful that we can find a small set of control flags that are also common with pandas to allow us to get really close to what Spark does. Then we can do some extra validation checks/cleanup that is Spark-specific. Sadly it looks like pandas does not have that many config options for JSON parsing, so it probably would involve more pre-processing and cleanup. Newlines are the hardest part, so we might need to do some special things for them. Even then, I don't think it is as bad as we think. By default in Spark a newline is not allowed inside of a value or key.
We can fall back to the CPU if the option that allows them is set to true. In the common case we can write a custom kernel, or use a regular expression, to scan for the bad case and know which rows have the error in them. We can then replace/normalize the whitespace in the string before doing the concat. I think the kernel should not be too bad. The hardest part would be making it fast for long strings, but even then we can live with less-than-optimal performance for highly escaped quotes.
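As a rough sketch of that scan step (assuming a simple containment check is enough to flag candidate rows, rather than the custom kernel or regex described above), libcudf's existing `cudf::strings::contains` can mark every row holding a raw newline before the concatenation; what to do with the flagged rows (CPU fallback or whitespace normalization) is left to the caller.

```cpp
#include <cudf/column/column.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/find.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <memory>

// Returns a BOOL8 column with one entry per input row: true means the row
// contains a raw '\n' and needs special handling (CPU fallback or whitespace
// normalization) before the rows are concatenated with '\n' separators.
std::unique_ptr<cudf::column> rows_with_raw_newline(cudf::strings_column_view const& input)
{
  return cudf::strings::contains(input, cudf::string_scalar("\n"));
}
```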
Like I said, I think for each difference with Spark we need to decide if it should be common (pandas and/or others are likely to have the same problem) or if it should be handled by the Spark team as pre- and/or post-processing. https://spark.apache.org/docs/latest/sql-data-sources-json.html lists the configs that Spark supports right now. For all of the configs around parsing primitive data types, we would just ask CUDF to return the values as strings and we would handle any custom kernels to do the parsing/validation ourselves. For many others we can fall back to the CPU if we see them, because the default values are things that I think CUDF can support out of the box, but I need to validate that. Some we can just ignore because they deal with inferring the data types, and we don't support that on the GPU right now. Others I think we can support with a combination of pre/post processing on the input itself. And finally, some I think we might need help from CUDF on. The other places where we need help are recovering from badly formatted JSON and, for some really odd corner cases, a way to know whether the input was quoted or not when it is returned to us as a string, if we want to be 100% identical to Spark. The second part is because Spark treats a quoted value differently from an unquoted one in some corner cases.
This PR (#13344) adds the option to recover from invalid JSON lines to the JSON tokenizer.

**New option and behaviour:**
- We add the option `enable_recover_from_error` to `json_reader_options`. When this option is enabled for a JSON lines input, the reader will recover from a parsing error encountered on an invalid JSON line and continue parsing the next line.
- When the new option is not enabled, we expect the behaviour of existing functionality to remain untouched.
- When recovering from invalid JSON lines is enabled, all newline characters that are not enclosed in quotes (i.e., newline characters outside of `strings` and `field names`) are interpreted as delimiters of a JSON line. We will introduce a new option that reflects this behaviour for JSON lines inputs that should not recover from errors in a future PR. Hence, this PR introduces the `JSON_LINES_STRICT` enum but does not yet hook it up.

**Implementation details:**
- When recovering from invalid JSON lines is enabled, `get_token_stream()` will delimit each JSON line with a `LineEnd` token to facilitate the identification of tokens that belong to an invalid JSON line.
- We extend the logical stack and introduce a new operation, `reset()`. A `reset()` operation resets the logical stack to an empty stack. This is necessary to reset the stack of the pushdown automaton (PDA) after an invalid JSON line to make sure the stack in subsequent lines is not corrupted.
- We modify the transition and translation table of the finite-state transducer (FST) that is used to generate the push-down automaton's (PDA) stack context operations to emit such a `reset()` operation, iff `recovery` is enabled.
- We modify the transition and translation table of the FST that is used to simulate the full PDA to (1) recover after an invalid JSON line and (2) emit the `LineEnd` token, iff `recovery` is enabled.
- To clean up JSON lines that contain tokens belonging to an invalid line, a token *post-processing* stage is needed. The *post-processing* will replace sequences of `LineEnd` `token*` `ErrorBegin` with the sequence `StructBegin` `StructEnd` (i.e., effectively a `null` row) for record orient inputs.
- This post-processing is implemented by running an FST on the reverse token stream, discarding all tokens between `ErrorBegin` and the next `LineEnd`, and emitting `StructBegin` `StructEnd` pairs at the end of such an invalid line.

This is an initial PR to address #12532.

Authors:
- Elias Stehle (https://github.com/elstehle)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Karthikeyan (https://github.com/karthikeyann)

URL: #13344
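As a usage-level sketch of the behaviour described above: the option name is taken from this PR description, but the exact setter spelling on `json_reader_options` is an assumption and may differ from the final API.

```cpp
#include <cudf/io/json.hpp>
#include <cudf/io/types.hpp>

#include <string>

// Read a JSON Lines buffer, asking the reader to recover from invalid lines
// instead of throwing. The setter name below simply mirrors the option name
// from the PR description and is an assumption.
cudf::io::table_with_metadata read_json_lines_with_recovery(std::string const& json_lines)
{
  auto options = cudf::io::json_reader_options::builder(
                   cudf::io::source_info{json_lines.data(), json_lines.size()})
                   .lines(true)
                   .build();

  options.enable_recover_from_error(true);  // assumed setter for the new option

  // With recovery enabled, an input such as
  //   {"a": 1}\n{"a"\n{"a": 3}\n
  // is expected to produce a null row for the malformed middle line rather
  // than failing the whole read.
  return cudf::io::read_json(options);
}
```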
I would like to close this issue in favor of #13525.
Currently, the input to the JSON reader is just one string. If the input JSON is given as a strings column, we have to concatenate the rows of that column into one unified JSON string before parsing it.

A problem emerges when some rows of the input strings column contain an invalid JSON string. In such situations, some applications, such as Spark, just output nulls for these invalid rows; the remaining rows are still parsed independently and an output table is still generated. cudf's JSON parser, on the other hand, just throws an exception and the entire application crashes.
We could opt to build a JSON parser that works independently on each string row of the input strings column. However, according to @karthikeyann, this would take a big effort, so it is only mentioned here, not asked for.

A simpler solution we should concentrate on is a JSON validator that can check whether each input string is a valid JSON string. The logic behind a validator should be easier to implement than a full parser, so it is doable. After quickly validating all input rows, we can identify which rows contain invalid JSON and simply trim them off the input.
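A rough sketch of that validate-then-trim flow: `is_valid_json` below stands in for the proposed validator (it does not exist in libcudf today and is purely hypothetical), while `cudf::apply_boolean_mask` is an existing call that would drop the rows flagged as invalid before the concatenate-and-parse step.

```cpp
#include <cudf/column/column.hpp>
#include <cudf/column/column_view.hpp>
#include <cudf/stream_compaction.hpp>
#include <cudf/strings/strings_column_view.hpp>
#include <cudf/table/table.hpp>
#include <cudf/table/table_view.hpp>

#include <memory>

// Hypothetical validator (not part of libcudf): one BOOL8 entry per row,
// true when the row holds a valid JSON document.
std::unique_ptr<cudf::column> is_valid_json(cudf::strings_column_view const& input);

// Drop the rows flagged as invalid so the remaining rows can be concatenated
// and handed to the JSON reader.
std::unique_ptr<cudf::table> drop_invalid_json_rows(cudf::column_view const& json_strings)
{
  auto const valid_mask = is_valid_json(cudf::strings_column_view{json_strings});
  return cudf::apply_boolean_mask(cudf::table_view{{json_strings}}, valid_mask->view());
}
```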