Pick smallest decimal type with required precision in ORC reader #9775

vuule · 2021-11-24T22:59:27Z

Depends on #9853

The current behavior is to throw when an ORC column's precision is too high for decimal64.
This PR changes the reader to read such a column as decimal128 instead, which reduces the need for the `decimal128_columns` option.
It also changes the decimal type inference to use decimal32 when the precision is low enough, reducing memory use in that case.
Finally, it adds a temporary option to disable decimal128 use. Python uses this option to produce a readable error message when the precision is too high for decimal64, while other callers can still use decimal128.
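The selection rule described above can be sketched as follows. This is a minimal illustration, not the code from this PR: `decimal_kind`, `smallest_decimal_kind`, and `allow_decimal128` are hypothetical names, and the decimal32 cutoff of 9 digits is an assumption based on what a 32-bit signed integer can always hold (the description only says "sufficiently low"). The 18-digit cutoff for decimal64 matches the commit message "read as dec128 when precision > 18".

```cpp
#include <stdexcept>

// Hypothetical stand-in for the library's decimal type ids (illustration only).
enum class decimal_kind { DECIMAL32, DECIMAL64, DECIMAL128 };

// Largest number of decimal digits that always fits in each signed integer width.
constexpr int max_precision_32  = 9;   // INT32_MAX is about 2.1e9
constexpr int max_precision_64  = 18;  // INT64_MAX is about 9.2e18
constexpr int max_precision_128 = 38;  // 2^127 - 1 is about 1.7e38

// Pick the narrowest decimal type that can hold `precision` digits.
// `allow_decimal128` models the temporary opt-out mentioned in the description.
inline decimal_kind smallest_decimal_kind(int precision, bool allow_decimal128 = true)
{
  if (precision < 1 || precision > max_precision_128) {
    throw std::invalid_argument("unsupported decimal precision");
  }
  if (precision <= max_precision_32) { return decimal_kind::DECIMAL32; }
  if (precision <= max_precision_64) { return decimal_kind::DECIMAL64; }
  if (!allow_decimal128) {
    throw std::runtime_error("precision too high for decimal64 and decimal128 is disabled");
  }
  return decimal_kind::DECIMAL128;
}
```

Because the precision comes from the file metadata, a rule like this can be applied once per column before any data is decoded.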

vuule · 2021-12-04T00:48:28Z

@galipremsagar please review the behavior when calling from Python. Since decimal128 is disabled through the new option, users should get a clear error message when reading columns whose precision is too high, and they still have the option to read such columns as float (no changes there).

codecov · 2021-12-04T03:44:36Z

Codecov Report

Merging #9775 (349d4e2) into branch-22.02 (967a333) will decrease coverage by 0.05%.
The diff coverage is 5.68%.

@@               Coverage Diff                @@
##           branch-22.02    #9775      +/-   ##
================================================
- Coverage         10.49%   10.43%   -0.06%     
================================================
  Files               119      119              
  Lines             20305    20449     +144     
================================================
+ Hits               2130     2134       +4     
- Misses            18175    18315     +140

Impacted Files	Coverage Δ
python/cudf/cudf/__init__.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/_base_index.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/column/column.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/column/string.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/dataframe.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/groupby/groupby.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/index.py	`0.00% <ø> (ø)`
python/cudf/cudf/core/indexed_frame.py	`0.00% <0.00%> (ø)`
python/cudf/cudf/core/multiindex.py	`0.00% <0.00%> (ø)`
... and 12 more

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update a5633c2...349d4e2.

harrism

Looks good. Just one comment.

harrism · 2021-12-07T23:07:54Z

cpp/src/io/orc/reader_impl.hpp

  bool _use_np_dtypes = true;
  std::vector<std::string> _decimal_cols_as_float;
  std::vector<std::string> decimal128_columns;
+  bool is_decimal128_enabled;
  data_type _timestamp_type{type_id::EMPTY};
  reader_column_meta _col_meta;


There is an inconsistency here in what is initialized and how. The two bool members above are initialized to true using `=`. This bool is not initialized. `data_type` is initialized using `{}`. I suggest adding initialization for everything and doing it all the same way.

vuule · 2021-12-07

Done, only left the vectors without "visible" initialization.
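A minimal sketch of what uniform initialization could look like, assuming brace-style default member initializers throughout. The struct and member names below are hypothetical and only mirror the shape of the snippet under review. The default of `true` for `is_decimal128_enabled` is an assumption, and the vectors are left to their default constructors, as in the reply above.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for the reader's type id enum (illustration only).
enum class type_id { EMPTY, TIMESTAMP_NANOSECONDS };

// Hypothetical options holder: every scalar member gets a brace initializer,
// so no member is left indeterminate and all of them are written the same way.
struct reader_state_sketch {
  bool use_np_dtypes{true};
  std::vector<std::string> decimal_cols_as_float;  // default-constructed, empty
  std::vector<std::string> decimal128_columns;     // default-constructed, empty
  bool is_decimal128_enabled{true};                // assumed default
  type_id timestamp_type{type_id::EMPTY};
};
```

With this layout, a default-constructed `reader_state_sketch` has well-defined values for every member, which is the property the review comment asks for.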

hyperbolic2346 · 2021-12-07T23:31:37Z

Is there a performance penalty if reading decimal128 data?

vuule · 2021-12-08T00:35:41Z

> Is there a performance penalty if reading decimal128 data?

Do you mean reading the same column as decimal128 as opposed to decimal32/64? There shouldn't be a big difference (just a larger stride between writes), but I don't have a benchmark set up for this.

What I can do with the existing benchmarks is read N megabytes of decimal32 vs. decimal64 vs. decimal128. There, decimal128 is much faster than decimal32/64; performance is almost proportional to the type width in this case. It's possible that the random number generator does not generate fair data, but at least there does not appear to be an inherent penalty from using decimal128.
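One possible reading of that result, offered as an assumption rather than something the benchmark confirms: when the data size is fixed in bytes, a wider type means fewer values to decode, so bytes per second can rise with type width even if the per-value cost stays the same. The sketch below only does that arithmetic; the 64 MiB budget and the helper name `values_in_budget` are hypothetical.

```cpp
#include <cstddef>

// Number of fixed-width values that fit in a byte budget.
constexpr std::size_t values_in_budget(std::size_t budget_bytes, std::size_t value_width_bytes)
{
  return budget_bytes / value_width_bytes;
}

// Arbitrary example budget (not the size used by the actual benchmarks).
constexpr std::size_t budget = 64 * 1024 * 1024;  // 64 MiB

constexpr std::size_t n_decimal32  = values_in_budget(budget, 4);   // 4-byte values
constexpr std::size_t n_decimal64  = values_in_budget(budget, 8);   // 8-byte values
constexpr std::size_t n_decimal128 = values_in_budget(budget, 16);  // 16-byte values

// For the same number of bytes, decimal128 has a quarter as many values as decimal32.
static_assert(n_decimal32 == 4 * n_decimal128, "decimal32 has 4x the values of decimal128");
static_assert(n_decimal64 == 2 * n_decimal128, "decimal64 has 2x the values of decimal128");
```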

hyperbolic2346

Reading the initial description I thought this was a multi-pass algorithm if the data was decimal128 or decimal32. I was worried about the performance implications, but I see there is metadata for the precision. Looks good to me.

vuule · 2021-12-08T06:28:47Z

@gpucibot merge

Closes #9769
Depends on #9775

Benchmarks now include decimal32/64/128 columns for all supported formats. Also fixes an issue in distribution factory, which caused all normal distributions to generate `upper_bound` in many cases.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)
- Jason Lowe (https://github.com/jlowe)

Approvers:
- Devavret Makkar (https://github.com/devavret)
- https://github.com/nvdbaranec

URL: #9776

vuule added 2 commits November 24, 2021 11:08

read as dec128 when precision > 18 instead of throwing

d1b09bc

update tests to match the new behavior

3065bac

vuule self-assigned this Nov 24, 2021

github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Nov 24, 2021

vuule added breaking Breaking change cuIO cuIO issue improvement Improvement / enhancement to an existing function and removed Python Affects Python cuDF API. labels Nov 24, 2021

vuule mentioned this pull request Nov 25, 2021

Add decimal types to cuIO benchmarks #9776

Merged

vuule added 4 commits December 1, 2021 16:13

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

c5966a3

…bug-dec128-from-precision

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

1fb2164

…bug-dec128-from-precision

decode as decimal32

761ed54

read as dec32 when possible; update tests; disable dec128 from python

6e11f75

github-actions bot added the Python Affects Python cuDF API. label Dec 4, 2021

vuule changed the title ~~Read decimal ORC columns as 128bit when the precision is over 18~~ Pick smallest decimal type with required precision in ORC reader Dec 4, 2021

vuule added 2 commits December 3, 2021 18:15

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

41ac21c

…bug-dec128-from-precision

set scale for decimal32

7df6e76

adjust Java test

fe03e08

github-actions bot added the Java Affects Java cuDF API. label Dec 4, 2021

java fix try 2

b98b4a8

vuule mentioned this pull request Dec 7, 2021

[BUG] Java bindings for ORC/Parquet writers always set decimal precision #9851

Closed

jlowe and others added 5 commits December 7, 2021 11:05

Remove deprecated methods from Java Table class

1869255

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

509a153

…bug-dec128-from-precision

Merge commit 'refs/pull/9853/head' of https://github.com/rapidsai/cudf …

fd1c6da

…into bug-dec128-from-precision

Merge branch 'bug-dec128-from-precision' of https://github.com/vuule/…

d56b9b9

…cudf into bug-dec128-from-precision

revert Java test changes

b6c9d7b

galipremsagar approved these changes Dec 7, 2021

vuule added 2 commits December 7, 2021 12:36

clean up

58af480

Merge branch 'branch-22.02' of https://github.com/rapidsai/cudf into …

92c689e

…bug-dec128-from-precision

github-actions bot removed the Java Affects Java cuDF API. label Dec 7, 2021

vuule marked this pull request as ready for review December 7, 2021 21:17

vuule requested review from a team as code owners December 7, 2021 21:17

vuule requested review from harrism, hyperbolic2346, cwharris and isVoid December 7, 2021 21:17

harrism approved these changes Dec 7, 2021

clean up initialization

a21bff4

stylin'

349d4e2

hyperbolic2346 approved these changes Dec 8, 2021

rapids-bot bot merged commit 2e95fb1 into rapidsai:branch-22.02 Dec 8, 2021

vuule deleted the bug-dec128-from-precision branch December 8, 2021 06:28
