Refactor orc chunked writer #12949

ttnghia · 2023-03-14T23:33:18Z

The current ORC chunked writer compresses/encodes data and writes it into the output data sink without any safeguard. This PR modifies the internal `writer::impl::write()` function, separating it into multiple pieces:

* A free function that compresses/encodes the input table into intermediate results. These intermediate results are totally independent of the writer. As such, the writer is isolated from failures of this free function, which allows retrying upon failure.
* Once the intermediate results from the previous step are available, they are applied to the output data sink to perform the actual data writing.
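
The two-step split described above can be sketched roughly as follows. This is a minimal illustration of the pattern only, not the libcudf implementation: `encoded_chunk`, `encode_chunk`, `chunked_writer`, `commit`, and the `failures_to_simulate` parameter are all hypothetical names, and placing the retry loop inside `write()` is an assumption made to keep the example short (the retry could equally be driven by the caller).

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical stand-in for the compressed/encoded output of one table chunk.
struct encoded_chunk {
  std::vector<std::string> stripes;
};

// Step 1: a free function that touches no writer state. If it throws (for
// example on an out-of-memory condition), the writer and its sink are untouched.
encoded_chunk encode_chunk(std::vector<int> const& rows, bool simulate_failure)
{
  if (simulate_failure) { throw std::runtime_error("simulated encode failure"); }
  encoded_chunk result;
  for (int v : rows) { result.stripes.push_back(std::to_string(v)); }
  return result;
}

// Hypothetical writer that owns the output sink.
class chunked_writer {
 public:
  // Step 2: apply already-computed intermediate results to the sink.
  void commit(encoded_chunk const& chunk)
  {
    for (auto const& s : chunk.stripes) { sink_.push_back(s); }
  }

  // Encode first and commit only on success; retry the encode step on failure.
  // The first `failures_to_simulate` attempts fail, purely for demonstration.
  void write(std::vector<int> const& rows, int failures_to_simulate, int max_attempts = 3)
  {
    for (int attempt = 0; attempt < max_attempts; ++attempt) {
      try {
        auto const chunk = encode_chunk(rows, attempt < failures_to_simulate);
        commit(chunk);
        return;
      } catch (std::runtime_error const&) {
        // Nothing has been written to the sink yet, so retrying is safe.
      }
    }
    throw std::runtime_error("encode failed after all attempts");
  }

  std::vector<std::string> const& sink() const { return sink_; }

 private:
  std::vector<std::string> sink_;
};
```

Because the encode step never sees the sink, a failed attempt leaves the writer in exactly the state it had before the call, which is what makes a retry possible.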

Some cleanup is also performed on the existing code. That includes moving some member functions into free functions, which helps reduce potential dependencies between translation units.

No new implementation is added in this work; existing code is only moved around.

Partially contributes to #12792.

Signed-off-by: Nghia Truong <[email protected]>

# Conflicts:
#   cpp/src/io/orc/orc.hpp
#   cpp/src/io/orc/writer_impl.cu
#   cpp/src/io/orc/writer_impl.hpp

vuule

Looks good, just minor suggestions.
It feels like this initial split opens up opportunities for further cleanup, now that the dependencies between steps are obvious.

cpp/src/io/orc/writer_impl.cu

vuule

🔥 🔥

vyasr

This PR was confusing for a minute until I realized that orc streams were not CUDA streams...

Nice refactor. I have some questions about the code, but AFAIK none of the questions are related to changes in this PR. The moves all look fine.

ttnghia · 2023-03-21T18:52:25Z

/merge

Similar to #12949, this refactors the Parquet writer to support a retry mechanism. The internal `writer::impl::write()` function is rewritten such that it is separated into multiple pieces:

* A free function that compresses/encodes the input table into intermediate results. These intermediate results are totally independent of the writer.
* Once the intermediate results from the previous step are available, they are applied to the output data sink to perform the actual data writing.

Closes:
* #13042

Depends on:
* #13206

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- https://github.com/nvdbaranec

URL: #13076

ttnghia added 22 commits March 1, 2023 11:22

Move const position

9422fd5

Signed-off-by: Nghia Truong <[email protected]>

Get rid of the internal buffer_ state

33d541e

Signed-off-by: Nghia Truong <[email protected]>

Merge branch 'branch-23.04' into refactor_protocol_buffer

1ecca31

Rename variable

aacbbd5

Signed-off-by: Nghia Truong <[email protected]>

Update copyright year

7380cb9

Signed-off-by: Nghia Truong <[email protected]>

Merge branch 'branch-23.04' into refactor_protocol_buffer

cb35432

Write data with bound check

465b569

Signed-off-by: Nghia Truong <[email protected]>

Misc

c827d68

Signed-off-by: Nghia Truong <[email protected]>

Merge branch 'branch-23.04' into refactor_protocol_buffer

73e1ad2

Merge branch 'branch-23.04' into refactor_protocol_buffer

23d7f22

Merge branch 'branch-23.04' into refactor_protocol_buffer

e5226e7

Merge branch 'branch-23.04' into refac_orc_writer

7c522b0

Add static functions

52bb152

Simplify buffer implementation

7dda1e9

No longer use std::unique_ptr

8e3f6ae

Merge from refactor_protocol_buffer

cbd049a

Update copyright year

3fadb59

Merge branch 'branch-23.04' into refactor_protocol_buffer

c0c4c5e

Merge branch 'refactor_protocol_buffer' into refac_orc_writer

4b657d2

Refactoring...

3e107e9

Complete refactor

b793db2

Make table_meta const

237fdd0

ttnghia added the feature request, 2 - In Progress, libcudf, Spark, and non-breaking labels Mar 14, 2023

ttnghia self-assigned this Mar 14, 2023

ttnghia added 2 commits March 15, 2023 15:28

WIP

a959820

Merge branch 'branch-23.04' into refac_orc_writer

cb0f3e7

# Conflicts:
#   cpp/src/io/orc/orc.hpp
#   cpp/src/io/orc/writer_impl.cu
#   cpp/src/io/orc/writer_impl.hpp

ttnghia added 6 commits March 17, 2023 15:26

Further move some static functions into free functions

03d2825

Move functions into namespace

a16475c

Re-organize code

ae1bc9d

Fix headers

147851d

Update docs

5916cf2

Rename function and update docs

98fe426

ttnghia marked this pull request as ready for review March 17, 2023 23:29

ttnghia requested a review from a team as a code owner March 17, 2023 23:29

ttnghia requested review from vyasr, elstehle and vuule March 17, 2023 23:29

ttnghia added the 3 - Ready for Review label and removed the 2 - In Progress label Mar 17, 2023

ttnghia added 2 commits March 17, 2023 16:36

Add comment

5c98a70

Merge branch 'branch-23.04' into refac_orc_writer

13e3b02

vuule requested changes Mar 20, 2023

ttnghia added 4 commits March 20, 2023 13:29

Rename functions

d2f6e0b

Log error if exception was thrown

41fc8da

Remove num_rows parameter

5cce6cc

Merge branch 'branch-23.04' into refac_orc_writer

da814d1

vuule approved these changes Mar 20, 2023

vyasr approved these changes Mar 21, 2023

rapids-bot bot merged commit 17a2cdc into rapidsai:branch-23.04 Mar 21, 2023

ttnghia deleted the refac_orc_writer branch March 21, 2023 18:52

jlowe mentioned this pull request Mar 23, 2023

[BUG] YARN IT test test_optimized_hive_ctas_basic failures NVIDIA/spark-rapids#7922

Closed

This was referenced Mar 27, 2023

[FEA] Retry support for chunked ORC writer #12792

Closed

[FEA] Retry support for chunked Parquet writer #13042

Closed

ttnghia mentioned this pull request Apr 6, 2023

Refactor Parquet chunked writer #13076

Merged
