[FEA] Update JSON reader benchmarks to include JSON lines and normalization #15041

Open
3 of 8 tasks
GregoryKimball opened this issue Feb 13, 2024 · 0 comments
Labels
1 - On Deck To be worked on next cuIO cuIO issue feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS


GregoryKimball commented Feb 13, 2024

Is your feature request related to a problem? Please describe.

First pass changes:

  • I believe this line in the benchmark nested_json.cpp should use max_list_size instead of max_struct_size. We should also add int64 nvbench axes for these two size values, sticking with a standard value of {10}, and adding the ability to sweep these parameters in custom tests.
  • Add JSON versus JSON Lines benchmark. We have a parquet_reader_options benchmark and we could add something similar e.g. json_reader_options. This benchmark can start by choosing a single data type and a device buffer data source. As a follow-on step we would want to allow data type and IO source to be nvbench enum axes.
  • Add _normalize_single_quotes and _normalize_whitespace to the json_reader_options benchmark. Since the JSON writer can't generate single quotes or extra whitespace, these normalization steps will not change the resulting table, but we should track the added runtime.
  • Add _recovery_mode and _mixed_types_as_string to the json_reader_options benchmark as "no-op" tests. The benchmark would use the existing data generator without invalid records and without mixed types.
  • Add post-processing to the generated data to introduce mixed types, and then benchmark against similar data without mixed types. The approach could be using the existing data generator, but then changing one list entry into a struct entry, e.g. [1,2,3] => {"a": [1,2,3]}
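
The mixed-type post-processing in the last bullet could be sketched on the host side as follows. This is a minimal illustration under stated assumptions, not libcudf code: wrap_first_list_in_struct is a hypothetical helper name, it works on one JSON Lines record held in a std::string, and it assumes the record is valid JSON as produced by the existing data generator.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of libcudf): rewrites the first list found in
// a single JSON record as a one-field struct, e.g. [1,2,3] => {"a": [1,2,3]}.
// Brackets that appear inside string literals are ignored.
std::string wrap_first_list_in_struct(std::string const& line, std::string const& key = "a")
{
  std::size_t start = std::string::npos;  // position of the outermost '['
  int depth         = 0;                  // current list nesting depth
  bool in_string    = false;              // true while inside a "..." literal

  for (std::size_t i = 0; i < line.size(); ++i) {
    char const c = line[i];
    if (in_string) {
      if (c == '\\') {
        ++i;  // skip the escaped character
      } else if (c == '"') {
        in_string = false;
      }
    } else if (c == '"') {
      in_string = true;
    } else if (c == '[') {
      if (depth++ == 0) { start = i; }
    } else if (c == ']' && depth > 0) {
      if (--depth == 0) {
        std::size_t const end = i + 1;  // one past the closing ']'
        return line.substr(0, start) + "{\"" + key + "\": " +
               line.substr(start, end - start) + "}" + line.substr(end);
      }
    }
  }
  return line;  // no complete list found: leave the record unchanged
}
```

Applying the helper to only a fraction of the records would give the same column both list and struct values, which is the case _mixed_types_as_string is meant to handle. The unmodified buffer can then serve as the comparison input without mixed types.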

Lower priority ideas. If we have reason to believe these benchmarks would highlight performance issues, then we should raise their priority.

  • For the quote and whitespace normalization options, create a modified data generator or character buffer post-processing to introduce un-normalized data. For instance, we could replace " with ' for quote normalization and : with : plus extra whitespace for whitespace normalization.
  • Update the data generator to introduce invalid JSON lines and exercise the _recovery_mode as nulls code path. We could add a fraction of invalid records as well as valid records followed by invalid characters.
  • Add a normalization benchmark into the benchmarks/io/json/ suite that measures the runtime of detail::normalize_single_quotes and the upcoming detail API for whitespace normalization. This benchmark would not test the overall reader, but only the FST-based normalization functions.
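
The character buffer post-processing ideas above might look roughly like this on the host. Both functions are hypothetical sketches, not libcudf APIs. They assume the input is valid JSON Lines from the existing generator and that string values contain no single-quote characters. The " : " padding and the two corruption styles (dropping the last character, appending a stray x) are illustrative choices only.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: produces un-normalized JSON by swapping the double
// quotes that delimit strings for single quotes and/or padding structural
// colons with spaces. Characters inside string values are copied unchanged.
std::string denormalize_json(std::string const& json, bool single_quotes, bool pad_colons)
{
  std::string out;
  out.reserve(json.size() * 2);
  bool in_string = false;
  for (std::size_t i = 0; i < json.size(); ++i) {
    char const c = json[i];
    if (in_string && c == '\\' && i + 1 < json.size()) {
      out += c;          // keep escape sequences intact
      out += json[++i];
    } else if (c == '"') {
      in_string = !in_string;
      out += single_quotes ? '\'' : '"';
    } else if (c == ':' && !in_string && pad_colons) {
      out += " : ";      // assumed padding style
    } else {
      out += c;
    }
  }
  return out;
}

// Hypothetical helper: corrupts every `stride`-th line of a JSON Lines buffer,
// alternating between an invalid record (last character dropped) and a valid
// record followed by an invalid character.
std::string corrupt_every_nth_line(std::string const& jsonl, std::size_t stride)
{
  std::string out;
  std::size_t line_no   = 0;
  std::size_t corrupted = 0;
  std::size_t pos       = 0;
  while (pos < jsonl.size()) {
    std::size_t nl = jsonl.find('\n', pos);
    if (nl == std::string::npos) { nl = jsonl.size(); }
    std::string line = jsonl.substr(pos, nl - pos);
    ++line_no;
    if (stride != 0 && line_no % stride == 0 && !line.empty()) {
      if (corrupted++ % 2 == 0) {
        line.pop_back();  // truncated record
      } else {
        line += 'x';      // trailing garbage after a valid record
      }
    }
    out += line;
    if (nl < jsonl.size()) { out += '\n'; }
    pos = nl + 1;
  }
  return out;
}
```

Using a fixed stride rather than random sampling keeps the fraction of invalid records reproducible across benchmark runs, and the stride could be exposed as a benchmark parameter.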
GregoryKimball added the feature request, 1 - On Deck, libcudf, cuIO, and Spark labels Feb 13, 2024
GregoryKimball added this to the Nested JSON reader milestone Feb 13, 2024
github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Mar 6, 2024
rapids-bot bot pushed a commit that referenced this issue Apr 9, 2024
The goal of this piece of work is to analyze the performance of the reader for JSON Lines. This PR establishes a baseline for the performance of the single quote normalization, whitespace normalization, mixed types as string parsing, and recovery mode options when the input JSON is valid and does not have any single quotes.
Modifying the data generation to produce inputs with single quotes/mixed types/invalid lines will be the focus of follow-on PRs.
Addresses #15041

Authors:
  - Shruti Shivakumar (https://github.com/shrshi)
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Robert Maynard (https://github.com/robertmaynard)
  - Vukasin Milovanovic (https://github.com/vuule)
  - Yunsong Wang (https://github.com/PointKernel)
  - Nghia Truong (https://github.com/ttnghia)

URL: #15124
jjacobelli pushed a commit to jjacobelli/cudf that referenced this issue Apr 9, 2024