[FEA] JSON reader improvements for Spark-RAPIDS #13525

GregoryKimball · 2023-06-07T16:30:32Z

libcudf includes a GPU-accelerated JSON reader that uses a finite-state transducer parser combined with token-processing tree algorithms to transform character buffers into columnar data. This issue tracks the technical work leading up to the launch of libcudf's JSON reader as a default component of the Spark-RAPIDS plugin. Please also refer to the Nested JSON reader milestone and Spark-RAPIDS JSON epic.

Spark compatibility issues: Blockers

| Status | Impact for Spark | Change to libcudf |
| --- | --- | --- |
| ✅ #13344 | #12532, Blocker: if any line has an error, libcudf throws an exception | Rework state machine to include error states and scrub tokens from lines with error |
| ✅ #14252 | #14227, Blocker: Incorrect parsing | Fix bug in error recovery state transitions |
| ✅ #14279 | #14226, Blocker: requesting alternate error recovery behavior from #13344, where valid data before an error state are preserved | Changes in JSON parser pushdown automaton for JSON_LINES_RECOVER option |
| ✅ #14936 | #14288, Blocker: libcudf does not have an efficient representation for map types in Spark | libcudf does not support map types, and modeling the map types as structs results in poor performance due to one child column per unique key. We will return the struct data that represents map types as string and then the plugin can use unify_json_strings to parse tokens |
| ✅ #14572 | #14239, Blocker: fields with mixed types raise an exception | add libcudf reader option to return mixed types as strings. Also see improvements in #15236 and #14939 |
| ✅ #14545 | #10004, Blocker: Can't parse data with single quote variant of JSON when `allowSingleQuotes` is enabled in Spark | Introduce a preprocessing function to normalize single and double quotes as double quotes |
| ✅ #15324 | #15303, escaped single quotes have their escapes dropped during quote normalization | Adjust quote normalization FST |
| 🔄 #15419 | #15390 + #15409, Blocker: race conditions found in nested JSON reader | Solve synchronization problems in nested JSON reader |
| | #15260, Blocker: crash in mixed type support | |
| 🔄 | #15278, Blocker: allow list type to be coerced to string, also see #14239. Without this, Spark-RAPIDS will fallback when user requests a field as "string" | Support List types coercion to string |
| | #15277, Blocker: we need to support multi-line JSON objects. Also see #10267 | libcudf is scoping a "multi-object" reader |
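
To make the first blocker row concrete, here is a minimal CPU sketch of the requested behavior: a line with an error yields a null row instead of raising an exception for the whole buffer. It uses only Python's standard `json` module and is not libcudf's implementation. The function name `read_json_lines_recover` is hypothetical, and skipping blank lines is a simplification made for this sketch only.

```python
import json
from typing import Any, List, Optional


def read_json_lines_recover(buffer: str) -> List[Optional[Any]]:
    """Parse newline-delimited JSON, returning None for each invalid line."""
    rows: List[Optional[Any]] = []
    for line in buffer.split("\n"):
        if not line.strip():
            continue  # assumption: blank lines are skipped in this sketch
        try:
            rows.append(json.loads(line))
        except json.JSONDecodeError:
            # The bad line becomes a null row; the other lines are unaffected.
            rows.append(None)
    return rows


rows = read_json_lines_recover('{"a": 1}\n{"a": \n{"a": 3}')
print(rows)  # [{'a': 1}, None, {'a': 3}]
```

The alternate recovery behavior requested in #14226, where valid data before the error is preserved, is not modeled here.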

Spark compatibility issues: non-blockers

| Status | Impact for Spark | Change to libcudf |
| --- | --- | --- |
| | #15222, compatibility problems with leading zeros, "NAN" and escape options | None for now. This feature should live in Spark-RAPIDS as a post-processing option, based on the approach for `get_json_object` modeled after Spark CPU code (see NVIDIA/spark-rapids-jni#1836). The plugin can then set to null any entries from objects that Spark would treat as invalid. Later we could give Spark-RAPIDS access to the raw tokens, which it could run through a more efficient validator. |
| ✅ #15033 | #14865, Strip whitespace from JSON inputs, otherwise Spark will have to add this in post-processing the coerced strings types | Create new normalization pre-processing tool for whitespace |
| 🔄 #14996 | #13473, Performance: only process columns in the schema | Skip parsing and column creation for keys not specified in the schema |
| 🔄 #15124 | Reader option performance is unknown | #15041, add JSON reader option benchmarking |
| | Performance: Avoid preprocessing to replace empty lines with `{}`. Also see #5712 | libcudf provides strings column data source |
| | #15280, find a solution when whitespace normalization fixes a line that originally was invalid | We could move whitespace normalization after tokenization. We would also like to address #15277 so that we can remove unquoted newline characters as well. |
| | n/a, Spark-RAPIDS doesn't use byte range reading | #15185, reduce IO overhead in JSON byte range reading |
| | n/a, Spark-RAPIDS doesn't use byte range reading | #15186, address data loss edge case for byte range reading |
| | Reduce peak memory usage | Add chunking to the JSON reader |
| | #15222, Spark-RAPIDS must return null if any field is invalid | Provide token stream to Spark-RAPIDS for validation, including checks for leading zeros, special string numbers like `NaN`, `+INF`, `-INF`, and optional limits for which characters can be escaped |
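
The validation checks named in the last row could look roughly like the following CPU sketch, which classifies a single number token. The strict pattern follows the JSON number grammar (no leading zeros, no leading `+`). The function name and the `allow_leading_zeros` / `allow_special` parameters are hypothetical and are not libcudf or Spark-RAPIDS options.

```python
import re

# Strict JSON number grammar: optional minus, no leading zeros, optional
# fraction and exponent.
_STRICT = re.compile(r"-?(0|[1-9][0-9]*)(\.[0-9]+)?([eE][+-]?[0-9]+)?")
# Relaxed variant that tolerates leading zeros in the integer part.
_RELAXED = re.compile(r"-?[0-9]+(\.[0-9]+)?([eE][+-]?[0-9]+)?")
# Special string numbers listed in the table row above.
_SPECIAL = frozenset({"NaN", "+INF", "-INF"})


def is_valid_number_token(token: str,
                          allow_leading_zeros: bool = False,
                          allow_special: bool = False) -> bool:
    """Return True if the token is an acceptable number under the options."""
    if token in _SPECIAL:
        return allow_special
    pattern = _RELAXED if allow_leading_zeros else _STRICT
    return pattern.fullmatch(token) is not None


for tok in ["10", "007", "NaN", "1.5e3"]:
    print(tok, is_valid_number_token(tok))
# 10 True
# 007 False
# NaN False
# 1.5e3 True
```

A caller would set the entry to null whenever the check returns False, which matches the "return null if any field is invalid" requirement only under the assumptions above.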


revans2 · 2024-03-15T15:52:21Z

@GregoryKimball From the Spark perspective, the following are in priority order. The order is based mostly on how likely I think it is that a customer would see these problems/limitations, and on whether we have a workaround that would let us enable the JSON parsing functionality by default without the change, even if with limited functionality.

Blocker:

1. [BUG] mixed_type_as_string throws exception for nested data with nested STRING schema request #15260
2. [FEA] Support casting of LIST type to STRING in JSON #15278
3. [FEA] Find a way to support String column input/fixup for JSON parsing #15277
4. [BUG] JSON white space normalization removes too much for unquoted values #15280
5. [FEA] JSON parsing is not handling escaped single quote the same as Spark #15303
6. [FEA] Options to validate JSON fields #15222 - This is likely going to need to be broken down into smaller pieces, not all of which are going to be blockers. I also think we need to work out the best way to support this, because there will be a performance impact on others who don't want validation like this.
7. [FEA] JSON number normalization when returned as a string #15318 - I don't want to mark this a blocker, but we have a customer that insists on it. We are in the process of trying to develop normalization code that would work, but a lot of the problem is how we would integrate this with the existing JSON parsing code.

Non-Blocker:

[BUG] JSON reader fails to parse files with empty rows #5712 - I think I can work around this, but it will end up being a performance hit if we don't have a better way to deal with it.
[FEA] have an option for the schema to filter the columns read from JSON #14951 / [BUG] JSON reader has no option to return the columns only for the requested schema #13473 - performance optimization (I think these might be dupes of each other)
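
For the escaped single quote item in the blocker list above (#15303), the following CPU sketch shows one way a quote normalization pass can work. It is a small state machine written in plain Python, not libcudf's FST. How escapes should be treated is exactly what #15303 is about, so the choice made here (turn `\'` into `'` inside single-quoted strings, escape bare `"` there, and leave double-quoted strings untouched) is an assumption and is not claimed to match Spark or libcudf.

```python
import json


def normalize_single_quotes(text: str) -> str:
    """Rewrite single-quoted JSON strings as double-quoted strings."""
    out = []
    quote = None  # None outside strings, else the opening quote character
    i = 0
    while i < len(text):
        ch = text[i]
        if quote is None:
            if ch in ("'", '"'):
                quote = ch
                out.append('"')  # every string opens with a double quote
            else:
                out.append(ch)
        elif ch == "\\" and i + 1 < len(text):
            nxt = text[i + 1]
            if quote == "'" and nxt == "'":
                out.append("'")  # \' needs no escape inside double quotes
            else:
                out.append(ch + nxt)  # keep every other escape verbatim
            i += 1  # the escaped character has been consumed
        elif ch == quote:
            quote = None
            out.append('"')  # every string closes with a double quote
        elif ch == '"':
            out.append('\\"')  # bare " inside a single-quoted string
        else:
            out.append(ch)
        i += 1
    return "".join(out)


normalized = normalize_single_quotes("{'a': 'it\\'s'}")
print(normalized)              # {"a": "it's"}
print(json.loads(normalized))  # {'a': "it's"}
```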

GregoryKimball · 2024-03-15T21:36:39Z

Thank you @revans2 for summarizing your investigation. We've been studying these requirements and we would like to continue the discussion with you next week.

libcudf will soon address:
1, 2, 5

libcudf is doing design work on:
emitting raw strings (helps with 6, 7)
moving whitespace normalization after tokenization (helps with 4)

libcudf suggests that 3 is a non-blocker
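
To illustrate why whitespace normalization interacts with validity (item 4), here is a naive, quote-aware whitespace stripper that runs on raw characters before any tokenization. It is a CPU sketch in plain Python with a hypothetical function name, and it assumes quotes have already been normalized to double quotes.

```python
def strip_unquoted_whitespace(line: str) -> str:
    """Remove whitespace that appears outside double-quoted strings."""
    out = []
    in_string = False
    escaped = False
    for ch in line:
        if in_string:
            out.append(ch)  # everything inside a string is kept
            if escaped:
                escaped = False
            elif ch == "\\":
                escaped = True
            elif ch == '"':
                in_string = False
        elif ch == '"':
            in_string = True
            out.append(ch)
        elif ch not in " \t\r\n":
            out.append(ch)  # drop whitespace outside strings
    return "".join(out)


print(strip_unquoted_whitespace('{ "a b" : [1, 2] }'))  # {"a b":[1,2]}
print(strip_unquoted_whitespace('{"a": 1 2}'))          # {"a":12}
```

The second call shows the hazard: the invalid value `1 2` is merged into the valid number `12`, which appears to be the kind of problem #15280 describes. Running the normalization after tokenization would let the parser reject the line first.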

revans2 · 2024-03-18T14:19:50Z

Like I said I can work around 3, but I don't know how to make it performant without help from CUDF, and we have seen this in actual customer data. Perhaps I can write a custom kernel myself that looks at quotes and replaces values in quotes vs outside of quotes as needed. I'll see.
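
As a rough CPU stand-in for the kind of quote-aware kernel mentioned above, the sketch below replaces newline characters differently depending on whether they fall inside or outside a quoted string, then joins the rows into a JSON Lines buffer. The function names and the specific replacements (a space outside quotes, the two-character escape `\n` inside quotes) are assumptions for illustration. It tracks double quotes only.

```python
import json
from typing import List


def fix_newlines(row: str) -> str:
    """Make one row newline-free without changing its JSON meaning."""
    out = []
    in_string = False
    escaped = False
    for ch in row:
        if in_string:
            if escaped:
                escaped = False
                out.append(ch)
            elif ch == "\\":
                escaped = True
                out.append(ch)
            elif ch == '"':
                in_string = False
                out.append(ch)
            elif ch == "\n":
                out.append("\\n")  # quoted newline becomes an escape sequence
            else:
                out.append(ch)
        elif ch == "\n":
            out.append(" ")  # unquoted newline is insignificant whitespace
        else:
            if ch == '"':
                in_string = True
            out.append(ch)
    return "".join(out)


def rows_to_json_lines(rows: List[str]) -> str:
    """Join rows with newline delimiters after fixing embedded newlines."""
    return "\n".join(fix_newlines(row) for row in rows)


buffer = rows_to_json_lines(['{"a":\n 1}', '{"b": "x\ny"}'])
print([json.loads(line) for line in buffer.split("\n")])
# [{'a': 1}, {'b': 'x\ny'}]
```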

GregoryKimball · 2024-03-22T20:01:18Z

We had more discussions on the JSON compatibility issues and identified "multi-line" support as a blocker (relates to 3 above). We don't currently have a way to process a strings column as JSON Lines when the rows contain unquoted newline characters. Also our whitespace normalization can't remove unquoted newline characters. (See #10267 and #15277 for related requests)

GregoryKimball added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue Spark Functionality that helps Spark RAPIDS labels Jun 7, 2023

GregoryKimball added this to the Nested JSON reader milestone Jun 7, 2023

GregoryKimball added this to libcudf Jun 7, 2023

GPUtester added this to cuDF/Dask/Numba/UCX Jun 7, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Jun 7, 2023

GregoryKimball removed this from cuDF/Dask/Numba/UCX Jun 7, 2023

GregoryKimball moved this to In progress in libcudf Jun 7, 2023

GregoryKimball changed the title ~~[FEA] JSON reader improvements for Spark-RAPIDS~~ [FEA] Story - JSON reader improvements for Spark-RAPIDS Jun 7, 2023

GregoryKimball added 0 - Backlog In queue waiting for assignment and removed 2 - In Progress Currently a work in progress labels Aug 2, 2023

GregoryKimball moved this from In progress to Story Issue in libcudf Aug 7, 2023

GregoryKimball mentioned this issue Aug 7, 2023

[FEA][JSON reader] to support parsing with single quotes #10004

Closed

GregoryKimball removed the status in libcudf Aug 8, 2023

GregoryKimball moved this to Story Issue in libcudf Aug 22, 2023

github-project-automation bot added this to cuDF/Dask/Numba/UCX Aug 22, 2023

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Aug 22, 2023

GregoryKimball mentioned this issue Oct 9, 2023

[FEA] JSON validator for json strings given in strings column #12532

Closed

GregoryKimball changed the title ~~[FEA] Story - JSON reader improvements for Spark-RAPIDS~~ [FEA] JSON reader improvements for Spark-RAPIDS Mar 12, 2024
