[FEA] Update JSON reader benchmarks to include JSON lines and normalization #15041

GregoryKimball · 2024-02-13T21:15:10Z

Is your feature request related to a problem? Please describe.

First pass changes:

I believe this line in the benchmark nested_json.cpp should use max_list_size instead of max_struct_size. We should also add int64 nvbench axes for these two size values, sticking with a standard value of {10}, and adding the ability to sweep these parameters in custom tests.
Add JSON versus JSON Lines benchmark. We have a parquet_reader_options benchmark and we could add something similar e.g. json_reader_options. This benchmark can start by choosing a single data type and a device buffer data source. As a follow-on step we would want to allow data type and IO source to be nvbench enum axes.
Add _normalize_single_quotes and _normalize_whitespace to the json_reader_options benchmark. Since the JSON writer can't generate single quotes or extra whitespace, these normalization steps will not change the resulting table, but we should track the added runtime.
Add _recovery_mode and _mixed_types_as_string to the json_reader_options benchmark as "no-op" tests. The benchmark would use the existing data generator without invalid records and without mixed types.
Add post-processing to the generated data to introduce mixed types, and then benchmark against similar data without mixed types. The approach could be using the existing data generator, but then changing one list entry into a struct entry, e.g. [1,2,3] => {"a": [1,2,3]}
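The mixed-type post-processing in the last item could be sketched roughly as below. This is only an illustration on a host-side string, not libcudf code: wrap_first_list_in_struct is a hypothetical helper name, the key "a" is taken from the example above, and the sketch assumes the generated data never contains '[' or ']' inside string values.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper (not part of libcudf): wraps the first JSON list found in
// `line` in a single-key struct, e.g. [1,2,3] => {"a": [1,2,3]}.
// Assumption: no '[' or ']' characters appear inside string values.
std::string wrap_first_list_in_struct(std::string const& line, std::string const& key = "a")
{
  auto const open = line.find('[');
  if (open == std::string::npos) { return line; }  // no list on this line

  // Walk forward to the bracket that closes the list, tracking nesting depth.
  int depth         = 0;
  std::size_t close = std::string::npos;
  for (std::size_t i = open; i < line.size(); ++i) {
    if (line[i] == '[') {
      ++depth;
    } else if (line[i] == ']' && --depth == 0) {
      close = i;
      break;
    }
  }
  if (close == std::string::npos) { return line; }  // unbalanced brackets, leave untouched

  std::string out = line.substr(0, open);       // everything before the list
  out += "{\"" + key + "\": ";                  // opening of the wrapping struct
  out += line.substr(open, close - open + 1);   // the original list, unchanged
  out += "}";                                   // closing of the wrapping struct
  out += line.substr(close + 1);                // everything after the list
  return out;
}
```

Applying the helper to a single row of the generated JSON Lines buffer would make that column hold a struct in one row and a list in the others, which is the mixed-type case the benchmark is meant to compare against the unmodified data.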

Lower priority ideas. If we have reason to believe these benchmarks would highlight performance issues, then we should raise their priority.

For the quote and whitespace normalization options, create a modified data generator or a character buffer post-processing step to introduce un-normalized data. For instance, we could replace " with ' for quote normalization and : with a whitespace-padded : for whitespace normalization.
Update the data generator to introduce invalid JSON lines and exercise the _recovery_mode as nulls code path. We could add a fraction of invalid records as well as valid records followed by invalid characters.
Add a normalization benchmark into the benchmarks/io/json/ suite that measures the runtime of detail::normalize_single_quotes and the upcoming detail API for whitespace normalization. This benchmark would not test the overall reader, but only the FST-based normalization functions.
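The character buffer post-processing ideas above might look something like the following sketch. Both function names are hypothetical and nothing here is libcudf API. The sketch assumes string values contain no quotes or colons, uses a fixed stride instead of a random fraction of invalid records, and appends '@' as an arbitrary invalid character.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: swaps every double quote for a single quote and pads
// every ':' with spaces, producing input that needs both normalization steps.
// Assumption: string values contain no quotes or colons.
std::string denormalize(std::string const& json)
{
  std::string out;
  out.reserve(json.size());
  for (char c : json) {
    if (c == '"') {
      out += '\'';
    } else if (c == ':') {
      out += " : ";
    } else {
      out += c;
    }
  }
  return out;
}

// Hypothetical helper: appends an invalid character to every `stride`-th line
// of a JSON Lines buffer (valid record followed by invalid characters).
// A stride of 0 leaves every record valid. The output always ends in '\n'.
std::string corrupt_every_nth_line(std::string const& jsonl, std::size_t stride)
{
  std::string out;
  std::size_t line_no = 0;
  std::size_t begin   = 0;
  while (begin < jsonl.size()) {
    auto end = jsonl.find('\n', begin);
    if (end == std::string::npos) { end = jsonl.size(); }  // last line without newline
    std::string line = jsonl.substr(begin, end - begin);
    ++line_no;
    if (stride != 0 && line_no % stride == 0) { line += '@'; }  // make this record invalid
    out += line;
    out += '\n';
    begin = end + 1;
  }
  return out;
}
```

A deterministic stride keeps the invalid fraction reproducible across benchmark runs; a seeded random choice would be an equally reasonable design if a less regular pattern is wanted.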


The goal of this piece of work is to analyze the performance of the reader for JSON lines. This PR establishes a baseline for the performance of single quote normalization, white space normalization, mixed type as string parsing and recovery mode options when the input JSON is valid, and does not have any single quotes. Modifying the data generation to produce inputs with single quotes/mixed types/invalid lines will be the focus of follow-on PRs.

Addresses #15041

Authors:
- Shruti Shivakumar (https://github.com/shrshi)
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Robert Maynard (https://github.com/robertmaynard)
- Vukasin Milovanovic (https://github.com/vuule)
- Yunsong Wang (https://github.com/PointKernel)
- Nghia Truong (https://github.com/ttnghia)

URL: #15124


GregoryKimball added the feature request, 1 - On Deck, libcudf, cuIO, and Spark labels Feb 13, 2024

GregoryKimball added this to libcudf Feb 13, 2024

GregoryKimball added this to the Nested JSON reader milestone Feb 13, 2024

GregoryKimball assigned shrshi Feb 14, 2024

This was referenced Feb 23, 2024

Introduce benchmark suite for JSON reader options #15124 (Merged)

[PERF] Performance impact of mixed_type_as_string JSON reader option in reading JSON lines #15196 (Closed)

github-project-automation bot added this to cuDF/Dask/Numba/UCX Mar 6, 2024

github-project-automation bot moved this to In Progress in cuDF/Dask/Numba/UCX Mar 6, 2024

GregoryKimball mentioned this issue Mar 13, 2024

[FEA] JSON reader improvements for Spark-RAPIDS #13525 (Open)


GregoryKimball commented Feb 13, 2024 • edited by shrshi
