-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fixes issue with null struct columns in ORC reader #8819
Fixes issue with null struct columns in ORC reader #8819
Conversation
…orc_struct_null_issue
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for fixing such a big issue so quickly!
The algorithm looks good, just got a bunch of minor suggestions/questions. Some are definitely optional.
Codecov Report
@@ Coverage Diff @@
## branch-21.08 #8819 +/- ##
===============================================
Coverage ? 10.59%
===============================================
Files ? 116
Lines ? 19033
Branches ? 0
===============================================
Hits ? 2017
Misses ? 17016
Partials ? 0 Continue to review full report at Codecov.
|
const size_t level, | ||
const uint32_t id, | ||
bool& has_timestamp_column, | ||
bool& has_nested_column) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If these are not in/out params, why not just return them and AND them at callsite
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is a recursive function and it might be easier for reader this way compared to returning.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
None of my comments are for required changes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🔥
@gpucibot merge |
In case of liborc, pyarrow and pyorc:
If the parent has a null element, that element is skipped while writing child data, and same goes with mask
So, you would have to keep track of null count and null mask in parent column, so that you can merge both the parent and child null masks.
In case of pyspark, spark:
If the parent has a null element, and if child column also has null element, then upper explanation holds.
But if all the child rows are valid, then you need to copy the mask from parent.
These scenarios have been take care in the code changes.
Earlier struct column and its child columns used to be in the same level of nesting, but since we need parent null mask before decoding child, changes have been made so that child columns will be moved one level down for all types of nested columns.
closes #8704