Fix ORC issue with incorrect timestamp nanosecond values #7581

Merged
merged 5 commits into from
Mar 15, 2021

@vuule vuule commented Mar 12, 2021

Closes #7355

Use 64-bit variables and buffers for nanosecond values, since the encoded nanosecond value can overflow 32 bits in some cases.
Removed the overloaded `intrle_minmax` function in favor of templated `numeric_limits` functions (the alternative was to add another overload).

Performance impact evaluation pending, but this fix seems unavoidable regardless of the impact.
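
For readers unfamiliar with why a sub-second value can exceed 32 bits, here is a minimal sketch of the nanosecond encoding as described in the ORC specification (trailing decimal zeros stripped, zero count stored in the low three bits). The function names `encode_nanos` and `encode_nanos_truncated` are hypothetical and this is not the libcudf kernel code; it only illustrates the arithmetic.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical helper, not a libcudf function: encodes a nanosecond value
// (0 <= nanos < 1e9) the way the ORC specification describes. When the value
// has two or more trailing decimal zeros, they are stripped and
// (zero count - 1) is stored in the low 3 bits.
inline uint64_t encode_nanos(uint32_t nanos)
{
  assert(nanos < 1000000000u);
  if (nanos == 0) { return 0; }
  // Fewer than two trailing zeros: only the 3-bit shift is applied.
  if (nanos % 100 != 0) { return static_cast<uint64_t>(nanos) << 3; }
  nanos /= 100;
  uint32_t zeros_code = 1;  // two zeros removed so far, stored as 2 - 1
  while (nanos % 10 == 0 && zeros_code < 7) {
    nanos /= 10;
    ++zeros_code;
  }
  return (static_cast<uint64_t>(nanos) << 3) | zeros_code;
}

// Hypothetical illustration of the failure mode: holding the encoded value
// in a 32-bit variable silently drops the high bits.
inline uint32_t encode_nanos_truncated(uint32_t nanos)
{
  return static_cast<uint32_t>(encode_nanos(nanos));
}
```

Under this scheme, any value of at least 2^29 (536,870,912) ns that has fewer than two trailing decimal zeros needs more than 32 bits once shifted. For example, 999,999,999 ns encodes to 7,999,999,992, which does not fit in a `uint32_t`.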

@vuule vuule added bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue non-breaking Non-breaking change labels Mar 12, 2021
@vuule vuule self-assigned this Mar 12, 2021
@github-actions github-actions bot added the Python Affects Python cuDF API. label Mar 12, 2021
@vuule vuule marked this pull request as ready for review March 12, 2021 20:48
@vuule vuule requested review from a team as code owners March 12, 2021 20:48
@vuule vuule requested review from trxcllnt, jrhemstad, galipremsagar, devavret and kaatish and removed request for trxcllnt and jrhemstad March 12, 2021 20:48
python/cudf/cudf/tests/test_orc.py (review thread: outdated, resolved)

vuule commented Mar 12, 2021

There's no significant perf impact on the writer, but the reader is up to 5% slower in some cases (timestamp columns only). I expected the writer to be the more affected of the two; I will look into why the reader is so much slower.

@devavret devavret left a comment

I find it weird that ORC does not require any change in metadata to signify that the stream has been changed from 32 bit to 64 bit.

Please confirm!?

Approved otherwise.

vuule commented Mar 12, 2021

> I find it weird that ORC does not require any change in metadata to signify that the stream has been changed from 32 bit to 64 bit.
>
> Please confirm!?

ORC uses varint encoding, so the encoded integers do not depend on the size of the source type.
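
To illustrate the point about varints, here is a minimal sketch of a generic base-128 varint encoder, assuming the usual scheme of 7 payload bits per byte with the high bit as a continuation flag. The names `encode_varint` and `decode_varint` are hypothetical and this is not the libcudf implementation; it only shows that the emitted bytes depend on the value, not on the width of the variable holding it.

```cpp
#include <cstdint>
#include <type_traits>
#include <vector>

// Hypothetical helper: base-128 varint encoding of an unsigned integer.
// Each output byte carries 7 bits of the value, least significant group
// first; the high bit is set on every byte except the last.
template <typename T>
std::vector<uint8_t> encode_varint(T value)
{
  static_assert(std::is_unsigned<T>::value, "sketch handles unsigned types only");
  std::vector<uint8_t> out;
  while (value >= 0x80) {
    out.push_back(static_cast<uint8_t>((value & 0x7f) | 0x80));
    value >>= 7;
  }
  out.push_back(static_cast<uint8_t>(value));
  return out;
}

// Hypothetical helper: decodes a single varint from the start of `bytes`.
inline uint64_t decode_varint(std::vector<uint8_t> const& bytes)
{
  uint64_t value = 0;
  int shift      = 0;
  for (uint8_t b : bytes) {
    value |= static_cast<uint64_t>(b & 0x7f) << shift;
    shift += 7;
    if ((b & 0x80) == 0) { break; }  // last byte of this varint
  }
  return value;
}
```

A value that fits in 32 bits produces the same bytes whether it is passed as `uint32_t` or `uint64_t`, which is consistent with the file needing no metadata change; larger values simply take more bytes.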

codecov bot commented Mar 12, 2021

Codecov Report

Merging #7581 (1e1b785) into branch-0.19 (7871e7a) will increase coverage by 0.51%.
The diff coverage is 92.85%.

@@               Coverage Diff               @@
##           branch-0.19    #7581      +/-   ##
===============================================
+ Coverage        81.86%   82.38%   +0.51%     
===============================================
  Files              101      101              
  Lines            16884    17340     +456     
===============================================
+ Hits             13822    14285     +463     
+ Misses            3062     3055       -7     
| Impacted Files | Coverage Δ | |
|---|---|---|
| python/cudf/cudf/core/index.py | 93.34% <ø> (+0.48%) | ⬆️ |
| python/cudf/cudf/core/column/column.py | 87.80% <75.00%> (+0.04%) | ⬆️ |
| python/cudf/cudf/core/column/numerical.py | 94.85% <85.71%> (-0.17%) | ⬇️ |
| python/cudf/cudf/core/frame.py | 89.12% <89.47%> (+0.10%) | ⬆️ |
| python/cudf/cudf/core/column/decimal.py | 93.33% <90.47%> (-1.54%) | ⬇️ |
| python/cudf/cudf/core/dataframe.py | 90.58% <95.00%> (+0.11%) | ⬆️ |
| python/cudf/cudf/core/series.py | 91.57% <95.55%> (+0.78%) | ⬆️ |
| python/cudf/cudf/core/column/string.py | 86.76% <100.00%> (+0.26%) | ⬆️ |
| python/cudf/cudf/core/indexing.py | 96.29% <100.00%> (+0.23%) | ⬆️ |
| python/cudf/cudf/utils/gpu_utils.py | 53.65% <0.00%> (-4.88%) | ⬇️ |
| ... and 51 more | | |

Δ = absolute <relative> (impact), ø = not affected, ? = missing data

@kaatish kaatish left a comment

LGTM

vuule commented Mar 15, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 36f18c8 into rapidsai:branch-0.19 Mar 15, 2021
@vuule vuule deleted the bug-orc-nanos-encode branch March 15, 2021 18:09
hyperbolic2346 pushed a commit to hyperbolic2346/cudf that referenced this pull request Mar 25, 2021
Closes rapidsai#7355

Use 64 bit variables/buffers to handle nanosecond values since nanosecond encode can overflow a 32bit value in some cases.
Removed the overloaded `intrle_minmax` function, using templated `numeric_limits` functions instead (the alternative was to add another overload).

Performance impact evaluation pending, but this fix seems unavoidable regardless of the impact.

Authors:
  - Vukasin Milovanovic (@vuule)

Approvers:
  - GALI PREM SAGAR (@galipremsagar)
  - Devavret Makkar (@devavret)
  - Kumar Aatish (@kaatish)

URL: rapidsai#7581
Labels
bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[BUG] Datetime data is being written incorrectly by orc writer
4 participants