[FEA] Find a way to support String column input/fixup for JSON parsing #15277

Open
Tracked by #9
revans2 opened this issue Mar 12, 2024 · 2 comments
Labels
cuIO cuIO issue feature request New feature or request Spark Functionality that helps Spark RAPIDS

Comments

@revans2

revans2 commented Mar 12, 2024

Is your feature request related to a problem? Please describe.
In Spark we have a requirement to be able to pass in a column of strings and parse them as JSON. Ideally we would just pass this directly to CUDF, but none of the input formats really support this, and neither do any of the pre-processing steps that the JSON reader has put in for us. What we do today is first check whether a line separator (carriage return) is in the data set. If there is one, we throw an exception. If not, we concat the lines together into a single buffer with a line separator between the inputs. (We do some fixup for NULLs/empty rows too.)
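To make the current workaround concrete, here is a minimal pure-Python sketch of the check-then-concat step. It is not the actual Spark RAPIDS code: the function name, the `"\n"` default separator, and the `"{}"` placeholder used for NULL/empty rows are all assumptions made for illustration.

```python
from typing import List, Optional

EMPTY_ROW = "{}"  # hypothetical placeholder for NULL/empty rows (assumption)


def concat_rows_for_json_reader(rows: List[Optional[str]], sep: str = "\n") -> str:
    """Join string rows into one buffer, rejecting rows that contain `sep`."""
    fixed = []
    for i, row in enumerate(rows):
        if row is None or row.strip() == "":
            # Fix up NULL/empty rows so every input row still maps to one record.
            fixed.append(EMPTY_ROW)
            continue
        if sep in row:
            # This is the failure mode described in this issue: the row may be
            # perfectly valid for Spark, but it cannot be concatenated safely.
            raise ValueError(f"row {i} contains the line separator")
        fixed.append(row)
    return sep.join(fixed)


buffer = concat_rows_for_json_reader(['{"a": 1}', None, '{"a": 2}'])
```

Any row that contains the separator makes the whole batch fail, which is the behavior this issue is trying to get rid of.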

The problem is that we throw an exception whenever the line separator shows up in the data, even though it is valid for Spark to have it in the data.

I think that there are a few options that we have to fix this kind of a problem.

  1. Expose the API that removes unneeded whitespace. We could then remove the unneeded data from the buffer and replace any remaining line separators with '\n', because at that point they should only be inside quoted strings. (We might need to do single quote normalization too, because I am not sure which one comes first.)
  2. Provide a way to set a different line separator (ideally something really unlikely to show up, like NUL, '\0'). This would not fix the problem 100%, but it would make it super rare, and I would feel okay with a solution like this.
  3. Do nothing, and we just take the hit when we see a line with a line separator in it. We would then have to pull those lines back to the CPU, process them there, and push them back to the GPU afterwards.
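As a rough illustration of option 2, the sketch below joins rows with a NUL delimiter and then parses the buffer record by record. It uses only the Python standard library as a stand-in for the GPU reader; the function names are hypothetical and nothing here is a CUDF API.

```python
import json
from typing import Any, List

NUL = "\0"  # candidate delimiter that is very unlikely to appear in real data


def join_with_delimiter(rows: List[str], delimiter: str = NUL) -> str:
    """Concatenate rows using `delimiter` instead of a newline."""
    if any(delimiter in row for row in rows):
        # Still possible, which is why this does not fix the problem 100%.
        raise ValueError("a row contains the chosen delimiter")
    return delimiter.join(rows)


def parse_delimited(buffer: str, delimiter: str = NUL) -> List[Any]:
    """Parse each delimiter-separated record as one JSON value."""
    return [json.loads(record) for record in buffer.split(delimiter)]


# A row with a newline between tokens is now fine, because the newline is no
# longer the record separator.
rows = ['{"a": 1,\n "b": 2}', '{"a": 3}']
parsed = parse_delimited(join_with_delimiter(rows))
```

The check in `join_with_delimiter` is still needed, but with a delimiter like NUL it should almost never fire.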

I personally like option 2, but I am likely to implement option 3 in the short term unless I hear from CUDF that this is simple to do and can be done really quickly.
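For comparison, the short-term option 3 might look roughly like the sketch below: partition the row indices into those that are safe for the GPU path and those that need the CPU, then stitch the results back together in the original order. All names are hypothetical, and the actual GPU and CPU parsing steps are left out.

```python
from typing import Any, List, Sequence, Tuple


def split_for_cpu_fallback(rows: List[str], sep: str = "\n") -> Tuple[List[int], List[int]]:
    """Return (gpu_indices, cpu_indices); rows containing `sep` go to the CPU."""
    gpu_idx: List[int] = []
    cpu_idx: List[int] = []
    for i, row in enumerate(rows):
        (cpu_idx if sep in row else gpu_idx).append(i)
    return gpu_idx, cpu_idx


def merge_results(n: int,
                  gpu_idx: Sequence[int], gpu_out: Sequence[Any],
                  cpu_idx: Sequence[int], cpu_out: Sequence[Any]) -> List[Any]:
    """Put per-path results back into the original row order."""
    out: List[Any] = [None] * n
    for i, value in zip(gpu_idx, gpu_out):
        out[i] = value
    for i, value in zip(cpu_idx, cpu_out):
        out[i] = value
    return out


rows = ['{"a": 1}', '{"a":\n 2}', '{"a": 3}']
gpu_idx, cpu_idx = split_for_cpu_fallback(rows)
```

The cost is the extra round trip between GPU and CPU for the affected rows, which is the "hit" mentioned in option 3.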

@revans2 revans2 added feature request New feature or request cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Mar 12, 2024
@GregoryKimball GregoryKimball added this to the Nested JSON reader milestone Mar 12, 2024
rapids-bot bot pushed a commit that referenced this issue May 20, 2024
…delimiter (#15556)

Addresses #15277
Given a JSON lines buffer with records separated by a delimiter passed at runtime, the idea is to modify the JSON tokenization FST so that the passed delimiter, rather than the currently hard-coded newline character, generates the EOL token.
This PR does not modify the whitespace normalization FST to [strip out unquoted `\n` and `\r`](#14865 (comment)). Whitespace normalization will be handled in follow-up work.
Note that this is not a multi-object JSON reader since we are not using the offsets data in the string column, and hence there is no resetting of the start state at every row offset.
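A toy model of the idea, written in plain Python rather than as an FST: a single pass over the buffer that ends a record whenever it sees the runtime delimiter. This is only a sketch and not the CUDF implementation. In particular, ignoring the delimiter inside a quoted string is an assumption of this sketch, and the delimiter is assumed to be a single character.

```python
from typing import List


def split_records(buffer: str, delimiter: str = "\n") -> List[str]:
    """Split `buffer` on a single-character `delimiter` seen outside quoted strings."""
    records: List[str] = []
    start = 0
    in_string = False
    escaped = False
    for i, ch in enumerate(buffer):
        if in_string:
            if escaped:
                escaped = False      # character after a backslash is taken literally
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False    # closing quote
        elif ch == '"':
            in_string = True         # opening quote
        elif ch == delimiter:
            records.append(buffer[start:i])  # plays the role of the EOL token
            start = i + 1
    records.append(buffer[start:])
    return records


good = split_records('{"a": "x|y"}|{"b": 1}', "|")
# State is never reset at row boundaries, so an unterminated string in one
# row swallows the delimiter and the row after it.
bad = split_records('{"a": "x}|{"b": 1}', "|")
```

Here `good` has two records, while `bad` comes back as a single record. That is the practical consequence of not resetting the start state at every row offset.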

Current status:
- [x] Semantic bracket/brace DFA
- [x] DFA removing excess characters after record in line
- [x] Pushdown automata generating tokens
- [x] Test passing arbitrary delimiter that does not occur in input to the reader

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)

Approvers:
  - Paul Mattione (https://github.com/pmattione-nvidia)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Elias Stehle (https://github.com/elstehle)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #15556
@GregoryKimball

GregoryKimball commented Aug 27, 2024

@karthikeyann would you please link this issue to your (upcoming) histogram+concat PR in spark-rapids-jni?

@karthikeyann

This is the PR NVIDIA/spark-rapids-jni#2364

Projects
Status: In Progress