-
Notifications
You must be signed in to change notification settings - Fork 902
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
More granular column selection in ORC reader #9496
More granular column selection in ORC reader #9496
Conversation
Codecov Report
@@ Coverage Diff @@
## branch-21.12 #9496 +/- ##
================================================
- Coverage 10.79% 10.66% -0.13%
================================================
Files 116 117 +1
Lines 18869 19725 +856
================================================
+ Hits 2036 2104 +68
- Misses 16833 17621 +788
Continue to review full report at Codecov.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
CMake changes 👍
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
First pass, looks pretty good, just had few questions.
…fea-select-nested-cols-orc
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good to see more code in cpp files, and I like the new feature (although I had a question about nesting depth that I've left inline). One overarching question: there's a lot of int32_t
usage in this PR, is that because of the ORC specification or should the code be making use of cudf::size_type
? Aside from that, I don't know this code too well so I left some questions in places that I didn't understand and comments on things that weren't really changes in this PR. Hopefully they're not too troublesome, feel free to ignore things that you need to.
@vyasr Thank you for the detailed review :)
I did considered this option. My understanding is that size_type is used to represent the number of rows/row index. The uses here are all column indices. I wasn't sure is size_type applied there too, so I left as int32_t. ORC specs suggest uint32_t, which is more error-prone etc. Open for suggestions, maybe there should even be a new alias in use here. |
Co-authored-by: Vyas Ramasubramani <[email protected]>
Just following up even though you did already make the change: I think |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for answering my questions! This PR LGTM, but would probably benefit from an additional CPP review from a cuIO expert.
@rgsl888prabhu is the expert on this code, so this should be covered :) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Code looks good, only have couple of queries on testing.
@gpucibot merge |
Closes #8848
columns
parameter are excluded.columns
parameter.aggregate_orc_metadata
implementation to a separate file (can be.cpp
!)uint32_t
toint32_t
to avoid unsigned arithmetic.