Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Refactor orc chunked writer #12949

Merged
merged 45 commits into from
Mar 21, 2023
Merged

Conversation

ttnghia
Copy link
Contributor

@ttnghia ttnghia commented Mar 14, 2023

The current ORC chunked writer performs compressing/encoding and writing data into the output data sink without any safeguard. This PR modifies the internal writer::impl::write() function, separating it into multiple pieces:

  • A free function that performs compressing/encoding the input table into intermediate results. These intermediate results are totally independent of the writer. As such, the writer can be isolated from failures of this free function, allowing to retry upon failure.
  • After having the intermediate results in the previous step, these results will be actually applied to the output data sink to start the actual data writing.

Some cleanup is also performed on the existing code. That includes moving some member functions into free functions, which helps reducing potential dependencies between translation units.

There is no new implementation added in this work. Only the existing code is moved around.

Partially contributes to #12792.

@ttnghia ttnghia added feature request New feature or request 2 - In Progress Currently a work in progress libcudf Affects libcudf (C++/CUDA) code. Spark Functionality that helps Spark RAPIDS non-breaking Non-breaking change labels Mar 14, 2023
@ttnghia ttnghia self-assigned this Mar 14, 2023
# Conflicts:
#	cpp/src/io/orc/orc.hpp
#	cpp/src/io/orc/writer_impl.cu
#	cpp/src/io/orc/writer_impl.hpp
@ttnghia ttnghia marked this pull request as ready for review March 17, 2023 23:29
@ttnghia ttnghia requested a review from a team as a code owner March 17, 2023 23:29
@ttnghia ttnghia added 3 - Ready for Review Ready for review by team and removed 2 - In Progress Currently a work in progress labels Mar 17, 2023
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good, just minor suggestions.
It feels like this initial split opens up opportunities for further clean up, now that the dependencies between steps are obvious.

cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/orc/writer_impl.cu Outdated Show resolved Hide resolved
Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔥 🔥

Copy link
Contributor

@vyasr vyasr left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This PR was confusing for a minute until I realized that orc streams were not CUDA streams...

Nice refactor. I have some questions about the code, but AFAIK none of the questions are related to changes in this PR. The moves all look fine.

@ttnghia
Copy link
Contributor Author

ttnghia commented Mar 21, 2023

/merge

@rapids-bot rapids-bot bot merged commit 17a2cdc into rapidsai:branch-23.04 Mar 21, 2023
@ttnghia ttnghia deleted the refac_orc_writer branch March 21, 2023 18:52
rapids-bot bot pushed a commit that referenced this pull request May 1, 2023
Similar to #12949, this refactors Parquet writer to support retry mechanism. The internal `writer::impl::write()` function is rewritten such that it is separated into multiple pieces:
 * A free function that performs compressing/encoding the input table into intermediate results. These intermediate results are totally independent of the writer.
 * After having the intermediate results in the previous step, these results will be actually applied to the output data sink to start the actual data writing.

Closes: 
 * #13042

Depends on:
 * #13206

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - https://github.com/nvdbaranec

URL: #13076
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team feature request New feature or request libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change Spark Functionality that helps Spark RAPIDS
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants