[FEA] Update JSON reader benchmarks to include JSON lines and normalization #15041
**Status:** Open, 3 of 8 tasks
**Labels:** `1 - On Deck` (to be worked on next), `cuIO` (cuIO issue), `feature request` (new feature or request), `libcudf` (affects libcudf C++/CUDA code), `Spark` (functionality that helps Spark RAPIDS)
**Is your feature request related to a problem? Please describe.**
First pass changes:

- [ ] The `nested_json.cpp` benchmark should use `max_list_size` instead of `max_struct_size`. We should also add `int64` nvbench axes for these two size values, sticking with a standard value of `{10}`, and adding the ability to sweep these parameters in custom tests.
- [ ] We have a `parquet_reader_options` benchmark, and we could add something similar, e.g. `json_reader_options`. This benchmark can start by choosing a single data type and a device buffer data source. As a follow-on step, we would want to allow data type and IO source to be nvbench enum axes.
- [ ] Add `_normalize_single_quotes` and `_normalize_whitespace` to the `json_reader_options` benchmark. Since the JSON writer can't generate single quotes or extra whitespace, these normalization steps will not change the resulting table, but we should track the added runtime.
- [ ] Add `_recovery_mode` and `_mixed_types_as_string` to the `json_reader_options` benchmark as "no-op" tests. The benchmark would use the existing data generator without invalid records and without mixed types (e.g. `[1,2,3]` => `{"a": [1,2,3]}`).
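To make the single-quote normalization step concrete, here is a hypothetical host-side sketch. The actual cudf implementation is a GPU finite-state transducer in C++/CUDA; the Python function below is a toy stand-in (it ignores escape sequences) that only illustrates what the option does to the input bytes before parsing:

```python
import json

def normalize_single_quotes(buf: str) -> str:
    """Toy sketch: rewrite single-quoted JSON strings as double-quoted.

    Illustrative only -- the real cudf pass is an FST that also handles
    escape sequences; this version tracks quoting state character by
    character and ignores escapes.
    """
    out = []
    in_double = False  # inside a "..." string
    in_single = False  # inside a '...' string
    for ch in buf:
        if ch == '"' and not in_single:
            in_double = not in_double
            out.append(ch)
        elif ch == "'" and not in_double:
            in_single = not in_single
            out.append('"')  # emit standard JSON quoting
        else:
            out.append(ch)
    return "".join(out)

# JSON lines input using single quotes, as the reader option is meant to accept:
raw = "{'a': 1}\n{'a': 2}\n"
normalized = normalize_single_quotes(raw)
records = [json.loads(line) for line in normalized.splitlines()]
assert records == [{"a": 1}, {"a": 2}]
```

Because the cudf JSON writer only emits double quotes, feeding writer output through this pass is the identity transformation, which is why the proposed benchmark treats it as a measurable no-op.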
Lower priority ideas. If we have reason to believe these benchmarks would highlight performance issues, then we should raise their priority.

- [ ] Add a data generator option that replaces `"` with `'` for quote normalization, and pads `:` with extra whitespace for whitespace normalization.
- [ ] Benchmark the `_recovery_mode` "as nulls" code path. We could add a fraction of invalid records, as well as valid records followed by invalid characters.
- [ ] Add a benchmark to the `benchmarks/io/json/` suite that measures the runtime of `detail::normalize_single_quotes` and the upcoming detail API for whitespace normalization. This benchmark would not test the overall reader, but only the FST-based normalization functions.