
Use default value for decimal precision in parquet writer when not specified #9963

Merged

Conversation

devavret
Contributor

@devavret devavret commented Jan 3, 2022

Fixes #9962

@devavret devavret requested a review from a team as a code owner January 3, 2022 22:01
@github-actions github-actions bot added the libcudf (Affects libcudf (C++/CUDA) code) label Jan 3, 2022
@devavret
Contributor Author

devavret commented Jan 4, 2022

After fixing the missing decimal precision, this still fails with "Benchmark did not read the entire table". I think the logic that determines the number of row groups is incorrect:

auto const num_row_groups = data_size / (128 << 20);

because

  1. It uses a hardcoded 128 MB row group size.
  2. The row group size is no longer fixed at 128 MB since [FEA] Control rowgroup and page size when writing Parquet files #9615.
  3. Even if it were fixed, there is no guarantee that the number of row groups equals data size / 128 MB, because 128 MB is only an upper limit; there can be more row groups.

I'm thinking the fix could be to add a libcudf API that reads Parquet metadata, but that could take longer. A sketch of what such an API might look like is below.
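A minimal sketch of such a metadata-reading call, assuming a hypothetical cudf::io::read_parquet_metadata entry point (the name, signature, and num_rowgroups accessor are all assumptions, not a libcudf API that existed at the time of this comment):

#include <cudf/io/parquet.hpp>

#include <vector>

// Hypothetical: read only the Parquet footer and report the actual number of
// row groups the writer produced, instead of estimating data_size / (128 << 20).
int actual_num_row_groups(std::vector<char> const& parquet_data)
{
  // Assumed constructor: source_info over an in-memory host buffer.
  auto const source = cudf::io::source_info(parquet_data.data(), parquet_data.size());
  // Assumed API: returns footer metadata, including the row-group count.
  auto const metadata = cudf::io::read_parquet_metadata(source);
  return static_cast<int>(metadata.num_rowgroups());
}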

Apart from this failure, I see some more failures in the benchmarks for row_selection::NROWS. I think we should just disable the ParquetRead/row_selection benchmarks for now.

cc @vuule

Contributor

@hyperbolic2346 hyperbolic2346 left a comment


Overall this change seems fine, but it sounds like there is more work to be done here as always.

@@ -101,8 +109,17 @@ void BM_parq_read_varying_options(benchmark::State& state)
auto const view = tbl->view();

std::vector<char> parquet_data;
auto table_meta = cudf::io::table_input_metadata(view);
// Precision is required for decimal columns but the value doesn't affect the performance
for (cudf::size_type c = 0; c < view.num_columns(); ++c) {

Suggested change
for (cudf::size_type c = 0; c < view.num_columns(); ++c) {
for (auto c = 0; c < view.num_columns(); ++c) {

This might read easier.
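For context, the body of the loop is elided from the diff excerpt above. A plausible sketch of it, reusing view and table_meta from the excerpt (the body is a reconstruction, not the PR's exact code, and the precision value 18 is illustrative since the diff's comment notes the value doesn't affect performance):

// Plausible body of the loop shown above: give every fixed-point column some
// valid precision so the writer accepts the table.
// Requires <cudf/utilities/traits.hpp> for cudf::is_fixed_point.
for (cudf::size_type c = 0; c < view.num_columns(); ++c) {
  if (cudf::is_fixed_point(view.column(c).type())) {
    table_meta.column_metadata[c].set_decimal_precision(18);  // illustrative value
  }
}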

@vuule vuule added the non-breaking (Non-breaking change) and bug (Something isn't working) labels Jan 4, 2022
@vuule
Copy link
Contributor

vuule commented Jan 4, 2022

I'm fine with temporarily disabling the row_selection benchmarks 👍

Why does the writer require precision to be manually specified? Can we default to the max precision for the input decimal type?

@devavret
Contributor Author

devavret commented Jan 4, 2022

Can we default to the max precision for the input decimal type?

I think this was avoided because there were some Spark rules that limited precision based on the decimal type's width, and I figured other libcudf users might not agree with those rules. The precursor to the precision setting in table_input_metadata was a writer option, and it also threw when precision wasn't specified. @hyperbolic2346 would know why.

Although it wouldn't hurt, because precision is merely a schema-level property: it doesn't affect the data written into the pages, which still contain the underlying rep.
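A minimal sketch of the defaulting vuule suggests, mapping each fixed-point type to the maximum precision its underlying rep can hold (the helper name is illustrative and the PR's actual defaults may differ; 9, 18, and 38 are the standard maxima for int32-, int64-, and int128-backed decimals):

#include <cudf/types.hpp>

#include <cstdint>
#include <stdexcept>

// Illustrative helper: the widest precision each integer-backed decimal rep
// can represent (9 digits in int32, 18 in int64, 38 in __int128).
uint8_t default_decimal_precision(cudf::data_type type)
{
  switch (type.id()) {
    case cudf::type_id::DECIMAL32: return 9;
    case cudf::type_id::DECIMAL64: return 18;
    case cudf::type_id::DECIMAL128: return 38;
    default: throw std::invalid_argument("not a fixed-point type");
  }
}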

@devavret devavret requested a review from ttnghia January 4, 2022 22:08
@codecov

codecov bot commented Jan 4, 2022

Codecov Report

Merging #9963 (0e55611) into branch-22.02 (967a333) will decrease coverage by 0.03%.
The diff coverage is n/a.


@@               Coverage Diff                @@
##           branch-22.02    #9963      +/-   ##
================================================
- Coverage         10.49%   10.45%   -0.04%     
================================================
  Files               119      119              
  Lines             20305    20417     +112     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18283     +108     
Impacted Files Coverage Δ
python/custreamz/custreamz/kafka.py 29.16% <0.00%> (-0.63%) ⬇️
python/dask_cudf/dask_cudf/sorting.py 92.30% <0.00%> (-0.61%) ⬇️
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <0.00%> (ø)
python/cudf/cudf/io/parquet.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/series.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/utils.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/dtypes.py 0.00% <0.00%> (ø)
python/cudf/cudf/utils/ioutils.py 0.00% <0.00%> (ø)
... and 17 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a4dc42d...0e55611

@devavret
Contributor Author

devavret commented Jan 7, 2022

rerun tests

@devavret devavret changed the title from "Add required precision for fixed point columns in parquet reader benchmark" to "Use default value for decimal precision in parquet writer when not specified" Jan 7, 2022
@devavret
Contributor Author

devavret commented Jan 7, 2022

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 7656277 into rapidsai:branch-22.02 Jan 7, 2022
Labels
bug (Something isn't working), libcudf (Affects libcudf (C++/CUDA) code), non-breaking (Non-breaking change)
Development

Successfully merging this pull request may close these issues.

[BUG] Parquet Reader Bench throwing cudf::logic_error
4 participants