
Use default value for decimal precision in parquet writer when not specified #9963

Merged

Conversation

devavret
Contributor

@devavret devavret commented Jan 3, 2022

Fixes #9962

@devavret devavret requested a review from a team as a code owner January 3, 2022 22:01
@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Jan 3, 2022
@devavret
Contributor Author

devavret commented Jan 4, 2022

After fixing the missing decimal precision, this still fails with "Benchmark did not read the entire table". I think the logic that determines the number of row groups is incorrect:

auto const num_row_groups = data_size / (128 << 20);

because

  1. It uses a hardcoded 128 MB row group size.
  2. The row group size is no longer fixed at 128 MB since [FEA] Control rowgroup and page size when writing Parquet files #9615.
  3. Even if it were fixed, there is no guarantee that the number of row groups equals data size / 128 MB, because 128 MB is only an upper limit; there can be more row groups.

I'm thinking the fix could be to add a libcudf API that reads Parquet metadata, but that could take longer. A sketch of what such an API might look like is below.
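A minimal sketch of such a metadata-reading call, assuming a hypothetical cudf::io::read_parquet_metadata entry point (the name, signature, and num_rowgroups accessor are all assumptions, not a libcudf API that existed at the time of this comment):

#include <cudf/io/parquet.hpp>

#include <vector>

// Hypothetical: read only the Parquet footer and report the actual number of
// row groups the writer produced, instead of estimating data_size / (128 << 20).
int actual_num_row_groups(std::vector<char> const& parquet_data)
{
  // Assumed constructor: source_info over an in-memory host buffer.
  auto const source = cudf::io::source_info(parquet_data.data(), parquet_data.size());
  // Assumed API: returns footer metadata, including the row-group count.
  auto const metadata = cudf::io::read_parquet_metadata(source);
  return static_cast<int>(metadata.num_rowgroups());
}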

Apart from this failure, I see some more failures in the benchmarks for row_selection::NROWS. I think we should just disable the ParquetRead/row_selection benchmarks for now.

cc @vuule

Contributor

@hyperbolic2346 hyperbolic2346 left a comment


Overall this change seems fine, but it sounds like there is more work to be done here as always.

@@ -101,8 +109,17 @@ void BM_parq_read_varying_options(benchmark::State& state)
auto const view = tbl->view();

std::vector<char> parquet_data;
auto table_meta = cudf::io::table_input_metadata(view);
// Precision is required for decimal columns but the value doesn't affect the performance
for (cudf::size_type c = 0; c < view.num_columns(); ++c) {

Suggested change
for (cudf::size_type c = 0; c < view.num_columns(); ++c) {
for (auto c = 0; c < view.num_columns(); ++c) {

This might read easier.
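For context, the body of the loop is elided from the diff excerpt above. A plausible sketch of it, reusing view and table_meta from the excerpt (the body is a reconstruction, not the PR's exact code, and the precision value 18 is illustrative since the diff's comment notes the value doesn't affect performance):

// Plausible body of the loop shown above: give every fixed-point column some
// valid precision so the writer accepts the table.
// Requires <cudf/utilities/traits.hpp> for cudf::is_fixed_point.
for (cudf::size_type c = 0; c < view.num_columns(); ++c) {
  if (cudf::is_fixed_point(view.column(c).type())) {
    table_meta.column_metadata[c].set_decimal_precision(18);  // illustrative value
  }
}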

@vuule vuule added the non-breaking (Non-breaking change) and bug (Something isn't working) labels Jan 4, 2022
@vuule
Copy link
Contributor

vuule commented Jan 4, 2022

I'm fine with temporarily disabling the row_selection benchmarks 👍

Why does the writer require precision to be manually specified? Can we default to the max precision for the input decimal type?

@devavret
Contributor Author

devavret commented Jan 4, 2022

Can we default to the max precision for the input decimal type?

I think this was avoided because there were some Spark rules that limited precision based on the decimal type's width, and I figured other libcudf users might not agree with those rules. The precursor to the precision setting in table_input_metadata was a writer option, and it also threw when precision wasn't specified. @hyperbolic2346 would know why.

Although it wouldn't hurt, because precision is merely a schema-level property: it doesn't affect the data written into the pages, which still contain the underlying rep.
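A minimal sketch of the defaulting vuule suggests, mapping each fixed-point type to the maximum precision its underlying rep can hold (the helper name is illustrative and the PR's actual defaults may differ; 9, 18, and 38 are the standard maxima for int32-, int64-, and int128-backed decimals):

#include <cudf/types.hpp>

#include <cstdint>
#include <stdexcept>

// Illustrative helper: the widest precision each integer-backed decimal rep
// can represent (9 digits in int32, 18 in int64, 38 in __int128).
uint8_t default_decimal_precision(cudf::data_type type)
{
  switch (type.id()) {
    case cudf::type_id::DECIMAL32: return 9;
    case cudf::type_id::DECIMAL64: return 18;
    case cudf::type_id::DECIMAL128: return 38;
    default: throw std::invalid_argument("not a fixed-point type");
  }
}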

@devavret devavret requested a review from ttnghia January 4, 2022 22:08
@codecov

codecov bot commented Jan 4, 2022

Codecov Report

Merging #9963 (0e55611) into branch-22.02 (967a333) will decrease coverage by 0.03%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-22.02    #9963      +/-   ##
================================================
- Coverage         10.49%   10.45%   -0.04%     
================================================
  Files               119      119              
  Lines             20305    20417     +112     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18283     +108     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
... and 17 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a4dc42d...0e55611

@devavret
Contributor Author

devavret commented Jan 7, 2022

rerun tests

@devavret devavret changed the title from "Add required precision for fixed point columns in parquet reader benchmark" to "Use default value for decimal precision in parquet writer when not specified" Jan 7, 2022
@devavret
Contributor Author

devavret commented Jan 7, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 7656277 into rapidsai:branch-22.02 Jan 7, 2022
Labels
bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change)
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet Reader Bench throwing cudf::logic_error
4 participants