Pick smallest decimal type with required precision in ORC reader #9775
Conversation
@galipremsagar please review the behavior when calling from Python. Since decimal128 is disabled through the new option, users should get a clear error message when reading columns with too high precision, and still have the option to read such columns as float (no changes there).
Codecov Report

```
@@             Coverage Diff              @@
##        branch-22.02    #9775    +/-  ##
================================================
-  Coverage      10.49%   10.43%   -0.06%
================================================
   Files            119      119
   Lines          20305    20449     +144
================================================
+  Hits            2130     2134       +4
-  Misses         18175    18315     +140
```

Continue to review the full report at Codecov.
Looks good. Just one comment.
cpp/src/io/orc/reader_impl.hpp (outdated)
```cpp
bool _use_np_dtypes = true;
std::vector<std::string> _decimal_cols_as_float;
std::vector<std::string> decimal128_columns;
bool is_decimal128_enabled;
data_type _timestamp_type{type_id::EMPTY};
reader_column_meta _col_meta;
```
Inconsistency in what is initialized and how things are initialized here. Two `bool` members above are initialized to `true` using `=`. This `bool` is not initialized. `data_type` is initialized using `{}`. Suggest adding initialization for everything and doing them all the same way.
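The suggested cleanup might look like the sketch below: every scalar member gets a default member initializer, all written in the same brace style. This is only an illustration, not the code that was merged; `placeholder_data_type` stands in for cudf's `data_type` so the snippet is self-contained, and the default chosen for `is_decimal128_enabled` is an assumption.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Stand-in for cudf::data_type, so this sketch compiles on its own.
struct placeholder_data_type {
  int id{0};
};

struct reader_members {
  // All defaults use the same brace-initializer style.
  bool _use_np_dtypes{true};
  std::vector<std::string> _decimal_cols_as_float;  // implicitly empty
  std::vector<std::string> decimal128_columns;      // implicitly empty
  bool is_decimal128_enabled{true};                 // assumed default
  placeholder_data_type _timestamp_type{};          // value-initialized
};
```

A vector's default constructor already yields an empty container, which is why the review below notes that the vectors can be left without "visible" initialization.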
Done; only the vectors are left without "visible" initialization.
Is there a performance penalty when reading decimal128 data?
Do you mean reading the same column as decimal128 as opposed to decimal32/64? There shouldn't be a big difference (just a larger stride between writes), but I don't have a benchmark set up for this. What I could do with existing benchmarks is read N megabytes of decimal32 vs 64 vs 128. Here, decimal128 is much faster than decimal32/64; performance is almost proportional to the type width in this case. It's possible that the random number generator does not generate fair data, but at least it does not look like there's an inherent penalty from using decimal128.
Reading the initial description, I thought this was a multi-pass algorithm when the data was decimal128 or decimal32. I was worried about the performance implications, but I see there is metadata for the precision. Looks good to me.
@gpucibot merge
Closes #9769. Depends on #9775. Benchmarks now include decimal32/64/128 columns for all supported formats. Also fixes an issue in the distribution factory, which caused all normal distributions to generate `upper_bound` in many cases.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)
- Jason Lowe (https://github.com/jlowe)

Approvers:
- Devavret Makkar (https://github.com/devavret)
- https://github.com/nvdbaranec

URL: #9776
Depends on #9853
Current behavior is to throw when an ORC column has precision that is too high for decimal64.
This PR changes the behavior to instead read the column as decimal128 when the precision is too high for 64 bits, reducing the need for the `decimal128_columns` option. Also modified the decimal type inference to use decimal32 when the precision is sufficiently low, reducing memory use in such cases.
Adds a temporary option to disable decimal128 use. This option is used in Python to get a readable error message in this case, while allowing decimal128 use by other callers.
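The type-inference rule described above can be sketched as a simple precision-to-type mapping. This is an illustration only, not the merged cuDF code: the function name, enum, and error message are hypothetical, while the digit thresholds follow the standard fixed-point capacities (up to 9 decimal digits fit in 32 bits, up to 18 in 64 bits, up to 38 in 128 bits).

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for cudf's decimal type ids.
enum class decimal_kind { DECIMAL32, DECIMAL64, DECIMAL128 };

// Pick the narrowest decimal type that can represent `precision` digits.
// When decimal128 is disabled (the temporary option), throw instead, so a
// caller such as Python can surface a readable error message.
decimal_kind pick_decimal_type(int precision, bool decimal128_enabled)
{
  if (precision <= 9) return decimal_kind::DECIMAL32;   // fits in 32 bits
  if (precision <= 18) return decimal_kind::DECIMAL64;  // fits in 64 bits
  if (!decimal128_enabled) {
    throw std::runtime_error("Decimal precision " + std::to_string(precision) +
                             " exceeds decimal64 and decimal128 is disabled");
  }
  return decimal_kind::DECIMAL128;  // up to 38 digits
}
```

With this rule, a column of precision 7 would be read as decimal32, precision 15 as decimal64, and precision 30 as decimal128 (or an error, if the disable option is set).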