Fixes issue with null struct columns in ORC reader #8819

Merged
merged 17 commits into from
Jul 22, 2021

Conversation

rgsl888prabhu
Contributor

With liborc, pyarrow, and pyorc:
If the parent has a null element, that element is skipped when the child data is written, and the same goes for the child's null mask.
So the reader has to keep track of the null count and null mask of the parent column in order to merge the parent and child null masks.

With pyspark and Spark:

If the parent has a null element and the child column also has a null element, the explanation above holds.
But if all the child rows are valid, the mask has to be copied from the parent.

Both scenarios are handled by the code changes in this PR.
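
The two cases described above can be illustrated with a small host-side sketch. This is not the libcudf implementation (which works on GPU bitmasks); `merge_child_mask` and its list-of-booleans representation are hypothetical names chosen for illustration, and a missing child mask is modeled as `None` on the assumption that no mask is written when every child row is valid.

```python
from typing import List, Optional


def merge_child_mask(parent_valid: List[bool],
                     child_valid: Optional[List[bool]]) -> List[bool]:
    """Expand a child's validity to one entry per parent row (illustrative only)."""
    if child_valid is None:
        # Assumed case: no child mask was written because all written child
        # rows are valid, so the child is null exactly where the parent is null.
        return list(parent_valid)

    merged = []
    child_pos = 0  # index into the compacted child mask
    for is_valid in parent_valid:
        if is_valid:
            # A valid parent row has a corresponding child entry.
            merged.append(child_valid[child_pos])
            child_pos += 1
        else:
            # A null parent row was skipped when the child was written.
            merged.append(False)
    return merged


parent = [True, False, True, True]
child = [True, False, True]  # 3 entries: the null parent row has no child entry
print(merge_child_mask(parent, child))  # [True, False, False, True]
print(merge_child_mask(parent, None))   # [True, False, True, True]
```

The first call covers the case where both masks exist and must be merged; the second covers the case where the child mask is simply a copy of the parent's.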

Previously a struct column and its child columns were at the same level of nesting. Since the parent null mask is now needed before a child can be decoded, child columns are moved one level down for all types of nested columns.
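
As a rough illustration of why the level assignment matters, the sketch below groups columns by depth so that every parent sits in an earlier level than its children and can be decoded first. The `children` mapping, the column ids, and the function name `columns_by_level` are hypothetical and do not reflect the reader's actual data structures.

```python
from collections import deque
from typing import Dict, List


def columns_by_level(children: Dict[int, List[int]], root: int = 0) -> List[List[int]]:
    """Group column ids by nesting depth using a breadth-first walk."""
    levels: List[List[int]] = []
    queue = deque([(root, 0)])
    while queue:
        col, depth = queue.popleft()
        if depth == len(levels):
            levels.append([])  # first column seen at this depth
        levels[depth].append(col)
        for child in children.get(col, []):
            # Children always land one level below their parent.
            queue.append((child, depth + 1))
    return levels


# Hypothetical schema: column 0 is a struct with children 1 and 2,
# and column 2 is itself a struct with children 3 and 4.
schema = {0: [1, 2], 2: [3, 4]}
print(columns_by_level(schema))  # [[0], [1, 2], [3, 4]]
```

Decoding level by level in this order guarantees that a parent's null mask is available before any of its children are processed.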

closes #8704

@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jul 21, 2021
@rgsl888prabhu rgsl888prabhu changed the title Fixes issue with null struct columns Fixes issue with null struct columns in ORC reader Jul 21, 2021
@rgsl888prabhu rgsl888prabhu self-assigned this Jul 21, 2021
@rgsl888prabhu rgsl888prabhu added 3 - Ready for Review Ready for review by team 4 - Needs cuDF (Python) Reviewer bug Something isn't working non-breaking Non-breaking change and removed Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jul 21, 2021
@rgsl888prabhu rgsl888prabhu marked this pull request as ready for review July 21, 2021 22:09
@rgsl888prabhu rgsl888prabhu requested review from a team as code owners July 21, 2021 22:09
Contributor

@vuule vuule left a comment

Thanks for fixing such a big issue so quickly!
The algorithm looks good, just got a bunch of minor suggestions/questions. Some are definitely optional.

@github-actions github-actions bot added Python Affects Python cuDF API. libcudf Affects libcudf (C++/CUDA) code. labels Jul 22, 2021
@codecov

codecov bot commented Jul 22, 2021

Codecov Report

❗ No coverage uploaded for pull request base (branch-21.08@00f8dfb).
The diff coverage is n/a.

@@               Coverage Diff               @@
##             branch-21.08    #8819   +/-   ##
===============================================
  Coverage                ?   10.59%           
===============================================
  Files                   ?      116           
  Lines                   ?    19033           
  Branches                ?        0           
===============================================
  Hits                    ?     2017           
  Misses                  ?    17016           
  Partials                ?        0           

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data

const size_t level,
const uint32_t id,
bool& has_timestamp_column,
bool& has_nested_column)
Contributor

If these are not in/out params, why not just return them and AND them at the call site?

Contributor Author

It is a recursive function, so passing the flags by reference might be easier for the reader to follow than returning them.
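
For comparison, the return-based alternative raised in the question could look roughly like the sketch below. It is written in Python rather than C++ for brevity; the `Column` type and `scan_flags` function are hypothetical, and the child results are combined with a logical OR, which is an assumption about what the `has_timestamp_column` and `has_nested_column` flags are meant to report.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Column:
    kind: str  # e.g. "int", "timestamp", "struct"
    children: List["Column"] = field(default_factory=list)


def scan_flags(col: Column) -> Tuple[bool, bool]:
    """Return (has_timestamp, has_nested) for the subtree rooted at col."""
    has_timestamp = col.kind == "timestamp"
    has_nested = bool(col.children)
    for child in col.children:
        # Each recursive call returns its own flags, combined at the call site.
        child_timestamp, child_nested = scan_flags(child)
        has_timestamp = has_timestamp or child_timestamp
        has_nested = has_nested or child_nested
    return has_timestamp, has_nested


schema = Column("struct", [Column("int"), Column("struct", [Column("timestamp")])])
print(scan_flags(schema))  # (True, True)
```

The trade-off is the one discussed in this thread: returning values keeps each call free of side effects, while reference parameters avoid the combining step at every level of the recursion.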

Contributor

@devavret devavret left a comment

None of my comments are for required changes.

Contributor

@vuule vuule left a comment

🔥

@vuule vuule added 5 - Ready to Merge Testing and reviews complete, ready to merge and removed 3 - Ready for Review Ready for review by team 4 - Needs cuIO Reviewer labels Jul 22, 2021
@rgsl888prabhu
Contributor Author

@gpucibot merge

@rapids-bot rapids-bot bot merged commit 825f132 into rapidsai:branch-21.08 Jul 22, 2021
Labels
5 - Ready to Merge Testing and reviews complete, ready to merge bug Something isn't working libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Python Affects Python cuDF API.
Development

Successfully merging this pull request may close these issues.

[BUG] ORC reader produces different struct rows than Pandas ORC reader when there are null rows.
4 participants