[FEA] JSON reader improvements for Spark-RAPIDS #13525
@GregoryKimball From the Spark perspective, the following are in priority order. This is based mostly on how likely I think it is that a customer would hit these problems/limitations, and also on whether or not we have a workaround that would let us enable the JSON parsing functionality by default without the change, even if it is limited functionality.
Blocker:
Non-Blocker:
Thank you @revans2 for summarizing your investigation. We've been studying these requirements and we would like to continue the discussion with you next week.
libcudf will soon address:
libcudf is doing design work on:
libcudf suggests that 3 is a non-blocker.
Like I said, I can work around 3, but I don't know how to make it performant without help from cuDF, and we have seen this in actual customer data. Perhaps I can write a custom kernel myself that looks at quotes and replaces characters inside quotes differently from those outside of quotes, as needed. I'll see.
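For illustration, here is a minimal host-side sketch of that quote-aware replacement. The function name is hypothetical, and a performant version would need a GPU implementation (for example, deriving per-character quote state with a scan over unescaped quote characters):

```cpp
#include <string>

// Hypothetical helper: applied to one string (one JSON document) at a time,
// it tracks whether the current character sits inside a double-quoted value
// (honoring backslash escapes) and rewrites only the unquoted newlines.
std::string replace_unquoted_newlines(std::string input)
{
  bool in_quotes = false;
  bool escaped   = false;
  for (char& c : input) {
    if (escaped) {
      escaped = false;          // this character is escaped; leave it alone
    } else if (c == '\\' && in_quotes) {
      escaped = true;           // next character is escaped
    } else if (c == '"') {
      in_quotes = !in_quotes;   // toggle quote state on unescaped quotes
    } else if ((c == '\n' || c == '\r') && !in_quotes) {
      c = ' ';                  // replace newlines outside quoted values
    }
  }
  return input;
}
```

In the Spark-RAPIDS case something like this would run per row of the strings column before the rows are concatenated with newline delimiters for the JSON Lines reader.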
We had more discussions on the JSON compatibility issues and identified "multi-line" support as a blocker (this relates to 3 above). We don't currently have a way to process a strings column as JSON Lines when the rows contain unquoted newline characters, and our whitespace normalization can't remove unquoted newline characters either (see #10267 and #15277 for related requests).
libcudf includes a GPU-accelerated JSON reader that uses a finite-state transducer parser combined with token-processing tree algorithms to transform character buffers into columnar data. This issue tracks the technical work leading up to the launch of libcudf's JSON reader as a default component of the Spark-RAPIDS plugin. Please also refer to the Nested JSON reader milestone and Spark-RAPIDS JSON epic.
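For context, invoking that reader on a JSON Lines buffer looks roughly like the following. This is a minimal sketch against the current `json_reader_options` builder API; details may differ across libcudf versions:

```cpp
#include <cudf/io/json.hpp>

#include <string>

int main()
{
  // Two newline-delimited records become a two-row table with columns "a" and "b".
  std::string const data = "{\"a\": 1, \"b\": \"x\"}\n{\"a\": 2, \"b\": \"y\"}\n";

  auto options = cudf::io::json_reader_options::builder(
                   cudf::io::source_info{data.data(), data.size()})
                   .lines(true)  // treat the buffer as JSON Lines
                   .build();

  // read_json runs the FST tokenizer and token-tree algorithms on the GPU
  // and returns columnar data plus column metadata.
  auto result = cudf::io::read_json(options);
  return result.tbl->num_columns() == 2 ? 0 : 1;
}
```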
Spark compatibility issues: Blockers
- Support for single quotes, since `allowSingleQuotes` is enabled in Spark

Spark compatibility issues: non-blockers
- `get_json_object` modeled after Spark CPU code (see NVIDIA/spark-rapids-jni#1836). Then the plugin can set to null any entries from objects that Spark would treat as invalid. Later we could provide Spark-RAPIDS access to raw tokens that they could run through a more efficient validator.
- `{}`. Also see #5712
- Support for `NaN`, `+INF`, `-INF`, and optional limits for which characters can be escaped (see the sketch below)
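To make these compatibility items concrete, here is a hedged sketch of input that Spark accepts by default, read through the JSON Lines path. The `normalize_single_quotes` option is an assumption about the current builder API and may not exist or behave identically in every libcudf version, and the non-standard literals would still need plugin-side validation or post-processing:

```cpp
#include <cudf/io/json.hpp>

#include <string>

int main()
{
  // Records exercising the items above: single-quoted strings, an empty
  // object, and the non-standard literals NaN / +INF / -INF that Spark
  // tolerates. Depending on reader options and version, these may be read,
  // nulled, or rejected.
  std::string const data =
    "{'a': 'single quoted', 'b': 1}\n"
    "{}\n"
    "{\"a\": \"plain\", \"b\": NaN}\n"
    "{\"a\": \"plain\", \"b\": -INF}\n";

  auto options = cudf::io::json_reader_options::builder(
                   cudf::io::source_info{data.data(), data.size()})
                   .lines(true)
                   // Assumed option: rewrite single quotes to double quotes
                   // before tokenizing, mirroring Spark's allowSingleQuotes.
                   .normalize_single_quotes(true)
                   .build();

  auto result = cudf::io::read_json(options);

  // Rows that Spark would treat as invalid would still need to be nulled
  // out by the plugin after the read.
  return result.tbl != nullptr ? 0 : 1;
}
```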