Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix offset of the string dictionary length stream #8515

Merged

Conversation

vuule
Copy link
Contributor

@vuule vuule commented Jun 15, 2021

Fixes #8514

String dictionary length is RLE encoded and rle_data_size and non_rle_data_size take this into account. However, When computing chunk stream offsets, these streams were treated as non-RLE and non_rle_data_size was not added. This caused discrepancy between non-RLE stream sizes and available space, leading to overlap between chunk streams.

Applied the non_rle_data_size to the offset to correct the discrepancy and added a test that uses decimal columns to increase the size of non-RLE encoded data and enable the overflow.

@vuule vuule added libcudf Affects libcudf (C++/CUDA) code. cuIO cuIO issue labels Jun 15, 2021
@vuule vuule self-assigned this Jun 15, 2021
@github-actions github-actions bot added the Python Affects Python cuDF API. label Jun 15, 2021
@vuule vuule added bug Something isn't working non-breaking Non-breaking change labels Jun 15, 2021
@vuule vuule marked this pull request as ready for review June 15, 2021 06:45
@vuule vuule requested review from a team as code owners June 15, 2021 06:45
Copy link
Contributor

@rgsl888prabhu rgsl888prabhu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, may be we can add a comment how rle and non-rle data is composed.

@vuule
Copy link
Contributor Author

vuule commented Jun 15, 2021

LGTM, may be we can add a comment how rle and non-rle data is composed.

Added a few comments, hopefully they are helpful in understanding the layout.

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 4 - Needs cuIO Reviewer labels Jun 15, 2021
@galipremsagar
Copy link
Contributor

rerun tests

@codecov
Copy link

codecov bot commented Jun 16, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@20f08da). Click here to learn what that means.
The diff coverage is n/a.

Impacted file tree graph

@@               Coverage Diff               @@
##             branch-21.08    #8515   +/-   ##
===============================================
  Coverage                ?   82.95%           
===============================================
  Files                   ?      110           
  Lines                   ?    18181           
  Branches                ?        0           
===============================================
  Hits                    ?    15082           
  Misses                  ?     3099           
  Partials                ?        0           

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 20f08da...cfe1da3. Read the comment docs.

@vuule
Copy link
Contributor Author

vuule commented Jun 16, 2021

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 15b18f4 into rapidsai:branch-21.08 Jun 16, 2021
@vuule vuule deleted the bug-orc-writer-dict_len-offset branch June 16, 2021 00:22
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working cuIO cuIO issue libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[BUG] ORC writer writes corrupted data.
5 participants