Is your feature request related to a problem? Please describe.
In Spark we have a requirement to be able to pass in a column of strings and parse them as JSON. Ideally we would just pass this directly to CUDF, but none of the input formats really support this, and neither do any of the pre-processing steps that the JSON reader provides. What we do today is first check whether the line separator (carriage return) appears anywhere in the data set. If it does, we throw an exception. If not, we concatenate the rows into a single buffer with a line separator between the inputs (we do some fixup for NULLs/empty rows too), as sketched below.
The problem is that we throw an exception whenever we see that character in the data, even though it is perfectly valid for Spark data to contain it.
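For illustration, here is a minimal sketch of that workaround in terms of libcudf's public strings APIs (`cudf::strings::contains` and `cudf::strings::join_strings`). It is not the actual Spark plugin code, which goes through the Java bindings, and the null-row placeholder is an assumption:

```cpp
#include <cudf/aggregation.hpp>
#include <cudf/column/column.hpp>
#include <cudf/reduction.hpp>
#include <cudf/scalar/scalar.hpp>
#include <cudf/strings/combine.hpp>
#include <cudf/strings/find.hpp>
#include <cudf/strings/strings_column_view.hpp>

#include <memory>
#include <stdexcept>

// Concatenate a column of JSON strings into a single JSON-lines buffer,
// throwing if any row already contains the separator.
std::unique_ptr<cudf::column> concat_json_rows(cudf::strings_column_view const& input)
{
  auto const separator = cudf::string_scalar("\n");

  // Step 1: reject the batch if the separator appears inside any row.
  auto const row_has_sep = cudf::strings::contains(input, separator);
  auto const any_agg     = cudf::make_any_aggregation<cudf::reduce_aggregation>();
  auto const any_sep     = cudf::reduce(row_has_sep->view(), *any_agg,
                                        cudf::data_type{cudf::type_id::BOOL8});
  if (static_cast<cudf::numeric_scalar<bool> const&>(*any_sep).value()) {
    throw std::invalid_argument("line separator found inside a JSON record");
  }

  // Step 2: join the rows with the separator, replacing null rows with a
  // placeholder so row counts stay aligned (placeholder choice is assumed).
  auto const null_repr = cudf::string_scalar("{}");
  return cudf::strings::join_strings(input, separator, null_repr);
}
```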
I think there are a few options for fixing this kind of problem:
1. Expose the API that removes unneeded whitespace. We could then strip the unneeded characters from the buffer and replace any remaining line separators with `\n`, because at that point they should only appear inside quoted strings. (We might also need single-quote normalization; I am not sure which step comes first.)
2. Provide a way to set a different line separator, ideally something very unlikely to show up in the data, such as NUL (`\0`). This would not fix the problem 100%, but it would make failures extremely rare, and I would feel okay with a solution like this.
3. Do nothing, and take the hit whenever we see a row containing the separator. We would then have to pull those rows back to the CPU, process them there, and push the results back to the GPU afterwards.
I personally like option 2, but I am likely to implement option 3 in the short term unless I hear from the CUDF team that option 2 is simple and can be done really quickly.
…delimiter (#15556)
Addresses #15277
Given a JSON lines buffer whose records are separated by a delimiter passed at runtime, the idea is to modify the JSON tokenization FST so that the passed delimiter, rather than the currently hard-coded newline character, generates the EOL token.
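Conceptually, the change looks like the following sketch: the character-to-symbol-group lookup that feeds the FST's transition table takes the delimiter as a runtime parameter instead of matching a literal `\n`. The names below are hypothetical and are not the actual cudf FST internals:

```cpp
// Hypothetical illustration of a delimiter-parameterized symbol-group lookup.
// Real cudf FSTs build translation tables run on the GPU; this only shows the idea.
enum class symbol_group { OPEN, CLOSE, QUOTE, ESCAPE, DELIMITER, OTHER };

symbol_group lookup_symbol_group(char c, char delimiter)
{
  if (c == delimiter) { return symbol_group::DELIMITER; }  // drives the EOL token
  switch (c) {
    case '{': case '[': return symbol_group::OPEN;
    case '}': case ']': return symbol_group::CLOSE;
    case '"': return symbol_group::QUOTE;
    case '\\': return symbol_group::ESCAPE;
    default: return symbol_group::OTHER;
  }
}
```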
This PR does not modify the whitespace normalization FST to strip out unquoted `\n` and `\r` (see the discussion in #14865). Whitespace normalization will be handled in follow-up work.
Note that this is not a multi-object JSON reader: we are not using the offsets data of the strings column, so there is no resetting of the start state at every row offset.
Current status:
- [X] Semantic bracket/brace DFA
- [X] DFA removing excess characters after record in line
- [X] Pushdown automata generating tokens
- [x] Test passing arbitrary delimiter that does not occur in input to the reader
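For reference, a minimal sketch of how the runtime delimiter is expected to be passed through the reader options; the exact builder method name `delimiter` is assumed from the options-builder pattern used in `cudf::io`:

```cpp
#include <cudf/io/json.hpp>

#include <string>

// Read a JSON-lines buffer whose records are separated by NUL instead of '\n'.
cudf::io::table_with_metadata read_nul_delimited_json(std::string const& buffer)
{
  auto const opts =
    cudf::io::json_reader_options::builder(
      cudf::io::source_info{buffer.data(), buffer.size()})
      .lines(true)
      .delimiter('\0')  // assumed setter added by this PR; NUL rarely occurs in data
      .build();
  return cudf::io::read_json(opts);
}
```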
Authors:
- Shruti Shivakumar (https://github.com/shrshi)
Approvers:
- Paul Mattione (https://github.com/pmattione-nvidia)
- Vukasin Milovanovic (https://github.com/vuule)
- Elias Stehle (https://github.com/elstehle)
- Karthikeyan (https://github.com/karthikeyann)
URL: #15556