Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Pick smallest decimal type with required precision in ORC reader #9775

Merged
merged 19 commits into from
Dec 8, 2021

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Nov 24, 2021

Depends on #9853

Current behavior is to throw when an ORC column has precision that is too high for decimal64.
This PR changes the behavior to instead read the column as decimal128, when precision is too high for 64 bits. This reduces the need for the use of decimal128_columns option.
Also modified the decimal type inference to use decimal32 when the precision is sufficiently low, reducing memory use in such case.
Adds a temporary option to disable decimal128 use. This option is used in Python to get a readable error message in this case, while allowing decimal128 use by other callers.

@vuule vuule self-assigned this Nov 24, 2021
@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Nov 24, 2021
@vuule vuule added breaking Breaking change cuIO cuIO issue improvement Improvement / enhancement to an existing function and removed Python Affects Python cuDF API. labels Nov 24, 2021
@github-actions github-actions bot added the Python Affects Python cuDF API. label Dec 4, 2021
@vuule
Copy link
Contributor Author

vuule commented Dec 4, 2021

@galipremsagar please review behavior when calling from Python. Since decimal128 is disabled through the new option, users shoudl get a nice error message when reading columns with too high precision, and have an option to read as float (no changes there).

@vuule vuule changed the title Read decimal ORC columns as 128bit when the precision is over 18 Pick smallest decimal type with required precision in ORC reader Dec 4, 2021
@codecov
Copy link

codecov bot commented Dec 4, 2021

Codecov Report

Merging #9775 (349d4e2) into branch-22.02 (967a333) will decrease coverage by 0.05%.
The diff coverage is 5.68%.

Impacted file tree graph

@@               Coverage Diff                @@
##           branch-22.02    #9775      +/-   ##
================================================
- Coverage         10.49%   10.43%   -0.06%     
================================================
  Files               119      119              
  Lines             20305    20449     +144     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18315     +140     
Impacted Files Coverage Δ
python/cudf/cudf/__init__.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/_base_index.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/column.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/column/string.py 0.00% <ø> (ø)
python/cudf/cudf/core/dataframe.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/groupby/groupby.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/index.py 0.00% <ø> (ø)
python/cudf/cudf/core/indexed_frame.py 0.00% <0.00%> (ø)
python/cudf/cudf/core/multiindex.py 0.00% <0.00%> (ø)
... and 12 more

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a5633c2...349d4e2. Read the comment docs.

@github-actions github-actions bot added the Java Affects Java cuDF API. label Dec 4, 2021
@github-actions github-actions bot removed the Java Affects Java cuDF API. label Dec 7, 2021
@vuule vuule marked this pull request as ready for review December 7, 2021 21:17
@vuule vuule requested review from a team as code owners December 7, 2021 21:17
Copy link
Member

@harrism harrism left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Just one comment.

Comment on lines 223 to 228
bool _use_np_dtypes = true;
std::vector<std::string> _decimal_cols_as_float;
std::vector<std::string> decimal128_columns;
bool is_decimal128_enabled;
data_type _timestamp_type{type_id::EMPTY};
reader_column_meta _col_meta;
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inconsistency in what is initialized and how things are initialized here. Two bool members above are initialized to true using =. This bool is not initialized. data_type is initialized using {}. Suggest adding initialization for everything and doing them all the same way.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done, only left the vectors without "visible" initialization.

@hyperbolic2346
Copy link
Contributor

Is there a performance penalty if reading decimal128 data?

@vuule
Copy link
Contributor Author

vuule commented Dec 8, 2021

Is there a performance penalty if reading decimal128 data?

Do you mean reading the same column as decimal128 as opposed to decimal32/64? Shouldn't be a big difference (just larger stride between writes), but I don't have a benchmark set up for this.

What I could do with existing benchmarks is to read N megabytes of decimal32 vs 64 vs 128. Here, decimal128 Is much faster than decimal32/64. Performance is almost proportional to the type width in this case. It's possible that the random number generator does not generate fair data, but at least it does not look like there's an inherent penalty from using decimal128.

Copy link
Contributor

@hyperbolic2346 hyperbolic2346 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reading the initial description I thought this was a multi-pass algorithm if the data was decimal128 or decimal32. I was worried about the performance implications, but I see there is metadata for the precision. Looks good to me.

@vuule
Copy link
Contributor Author

vuule commented Dec 8, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 2e95fb1 into rapidsai:branch-22.02 Dec 8, 2021
@vuule vuule deleted the bug-dec128-from-precision branch December 8, 2021 06:28
rapids-bot bot pushed a commit that referenced this pull request Dec 17, 2021
Closes #9769
Depends on #9775

Benchmarks now include decimal32/64/128 columns for all supported formats.
Also fixes an issue in distribution factory, which caused all normal distributions to generate `upper_bound` in many cases.

Authors:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Jason Lowe (https://github.com/jlowe)

Approvers:
  - Devavret Makkar (https://github.com/devavret)
  - https://github.com/nvdbaranec

URL: #9776
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants