Refactor orc chunked writer #12949
Signed-off-by: Nghia Truong <[email protected]>
# Conflicts: # cpp/src/io/orc/orc.hpp # cpp/src/io/orc/writer_impl.cu # cpp/src/io/orc/writer_impl.hpp
Looks good, just minor suggestions.
It feels like this initial split opens up opportunities for further cleanup, now that the dependencies between steps are obvious.
🔥 🔥
This PR was confusing for a minute until I realized that orc streams were not CUDA streams...
Nice refactor. I have some questions about the code, but AFAIK none of the questions are related to changes in this PR. The moves all look fine.
/merge
Similar to #12949, this refactors the Parquet writer to support a retry mechanism. The internal `writer::impl::write()` function is rewritten so that it is split into multiple pieces:
* A free function that compresses and encodes the input table into intermediate results. These intermediate results are independent of the writer.
* A step that takes the intermediate results from the previous step and applies them to the output data sink, which performs the actual data writing.

Closes:
* #13042

Depends on:
* #13206

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec

URL: #13076
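The split described above can be sketched roughly as follows. This is a minimal illustration and not the libcudf implementation: `memory_sink`, `encoded_chunk`, `encode_table`, `write_to_sink`, and `write_with_retry` are hypothetical names, the "encoding" is a plain little-endian byte copy, and the failure is simulated with a made-up memory budget. The retry policy shown (trying a list of budgets in order) is also an assumption; the PR only states that the refactor is meant to support a retry mechanism.

```cpp
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <vector>

// Hypothetical stand-in for an output data sink (not the libcudf data_sink class).
struct memory_sink {
  std::vector<std::uint8_t> bytes;
  void write(std::vector<std::uint8_t> const& data)
  {
    bytes.insert(bytes.end(), data.begin(), data.end());
  }
};

// Intermediate results of the encode step; they hold no reference to any writer or sink.
struct encoded_chunk {
  std::vector<std::uint8_t> payload;
};

// Step 1: a free function that encodes the input without touching the sink.
// The memory_budget parameter is an assumption used only to simulate a failure.
encoded_chunk encode_table(std::vector<std::int32_t> const& column, std::size_t memory_budget)
{
  std::size_t const needed = column.size() * sizeof(std::int32_t);
  if (needed > memory_budget) { throw std::runtime_error("simulated out-of-memory"); }

  encoded_chunk out;
  out.payload.reserve(needed);
  for (auto const value : column) {
    auto const bits = static_cast<std::uint32_t>(value);
    // Emit each value as four little-endian bytes.
    for (int shift = 0; shift < 32; shift += 8) {
      out.payload.push_back(static_cast<std::uint8_t>((bits >> shift) & 0xFF));
    }
  }
  return out;
}

// Step 2: apply already-computed intermediate results to the sink.
void write_to_sink(encoded_chunk const& chunk, memory_sink& sink) { sink.write(chunk.payload); }

// Because step 1 never touches the sink, a failed attempt leaves the output
// unchanged and the whole encode step can simply be tried again.
bool write_with_retry(std::vector<std::int32_t> const& column,
                      memory_sink& sink,
                      std::vector<std::size_t> const& budgets)
{
  for (auto const budget : budgets) {
    try {
      auto const chunk = encode_table(column, budget);
      write_to_sink(chunk, sink);
      return true;
    } catch (std::runtime_error const&) {
      // Encoding failed before any byte was written; try the next budget.
    }
  }
  return false;
}
```

In this sketch, calling `write_with_retry` with a too-small budget followed by a sufficient one writes the data exactly once, and a call where every budget fails leaves the sink empty. That property is what keeping the encode step separate from the sink is meant to provide.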
The current ORC chunked writer compresses/encodes data and writes it into the output data sink without any safeguard. This PR modifies the internal `writer::impl::write()` function, separating it into multiple pieces.

Some cleanup is also performed on the existing code. That includes moving some member functions into free functions, which helps reduce potential dependencies between translation units.
There is no new implementation added in this work. Only the existing code is moved around.
Partially contributes to #12792.