Refactor Parquet chunked writer #13076
Conversation
Signed-off-by: Nghia Truong <[email protected]>
# Conflicts:
#	cpp/src/io/parquet/writer_impl.cu
#	cpp/src/io/parquet/writer_impl.hpp
This reverts commit 3407b0f.
Of course we could copy the Parquet writer code (a lot of it) into spark-rapids-jni and do this refactor there. However, this PR doesn't implement anything new; it is just a refactor. Since it supports a Spark need, it has to happen in cudf to avoid duplicating a large amount of code in spark-rapids-jni.
The new writer would not be able to append to an existing file (this is only relevant for the chunked writer). The existing writer can at least retry, or close to write the footer.
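To make that concrete, here is a minimal sketch of how a caller might drive the chunked writer with a retry-or-close policy; the output path, retry count, and surrounding function are hypothetical, and this is not the spark-rapids-jni code.

```cpp
#include <cudf/io/parquet.hpp>
#include <cudf/table/table_view.hpp>

#include <exception>
#include <vector>

void write_chunks_with_retry(std::vector<cudf::table_view> const& chunks)
{
  namespace io = cudf::io;
  auto opts =
    io::chunked_parquet_writer_options::builder(io::sink_info{"out.parquet"}).build();
  io::parquet_chunked_writer writer(opts);

  for (auto const& chunk : chunks) {
    bool written = false;
    for (int attempt = 0; attempt < 3 && !written; ++attempt) {
      try {
        writer.write(chunk);  // may throw, e.g. under memory pressure
        written = true;
      } catch (std::exception const&) {
        // release memory / back off, then retry the same chunk
      }
    }
    if (!written) { break; }  // give up on the remaining chunks
  }
  // Always close so the footer covering the chunks already written gets flushed.
  writer.close();
}
```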
Co-authored-by: Alessandro Bellina <[email protected]>
Looks great!
# Conflicts:
#	cpp/src/io/parquet/writer_impl.cu
Small stuff.
});

// Init page fragments
// 5000 is good enough for up to ~200-character strings. Longer strings and deeply nested columns
// will start producing fragments larger than the desired page size, so calculate fragment sizes
// for each leaf column. Skip if the fragment size is not the default.
auto max_page_fragment_size = _max_page_fragment_size.value_or(default_max_page_fragment_size);
size_type max_page_fragment_size =
Suggested change:
size_type max_page_fragment_size =
size_type const max_page_fragment_size =
Oh, this can't be const since it will be modified.
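To illustrate the per-leaf-column fragment-size calculation the code comment above describes, here is a minimal, self-contained sketch; the function and its parameters are illustrative, not the libcudf implementation.

```cpp
#include <algorithm>
#include <cstddef>

// Shrink the fragment size (rows per fragment) for leaf columns whose average
// row is so wide that a default-sized fragment would exceed the page target.
int compute_fragment_size(std::size_t avg_row_bytes,
                          std::size_t max_page_size_bytes,
                          int default_fragment_size)  // e.g. 5000 rows
{
  if (avg_row_bytes == 0) { return default_fragment_size; }
  // Largest row count whose estimated size still fits in one page.
  auto const fitting = static_cast<int>(max_page_size_bytes / avg_row_bytes);
  return std::clamp(fitting, 1, default_fragment_size);
}
```

For example, with a 1 MiB page target and ~200-byte rows this stays at the 5000-row default, while wider or deeply nested rows get proportionally smaller fragments.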
Co-authored-by: nvdbaranec <[email protected]>
/merge
Fix some unused-variable/parameter warnings introduced by #13076. My old nvcc 11.5 compiler found these. Removing some of them also turned up functions that could be removed. Some variables/parameters are now declared with `[[maybe_unused]]`.

```
/cudf/cpp/src/io/parquet/writer_impl.cu(575): error #177-D: variable "data_col_type" was declared but never referenced
/cudf/cpp/src/io/parquet/writer_impl.cu(906): error #177-D: parameter "stream" was declared but never referenced
/cudf/cpp/src/io/parquet/writer_impl.cu(908): error #177-D: variable "col" was declared but never referenced
/cudf/cpp/src/io/parquet/writer_impl.cu(1290): error #177-D: parameter "max_page_uncomp_data_size" was declared but never referenced
/cudf/cpp/src/io/parquet/writer_impl.cu(1411): error #177-D: parameter "input" was declared but never referenced
/cudf/cpp/src/io/parquet/writer_impl.cu(1712): error #177-D: variable "dict_info_owner" was declared but never referenced
```

Authors:
- David Wendt (https://github.com/davidwendt)

Approvers:
- Nghia Truong (https://github.com/ttnghia)
- Karthikeyan (https://github.com/karthikeyann)

URL: #13263
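As a small aside, this is the kind of situation `[[maybe_unused]]` is for; the function below is a made-up example, not cudf code.

```cpp
#include <cassert>

// A parameter that is only read when assertions are enabled: without the
// attribute, a release (NDEBUG) build warns that it is never referenced.
int scaled(int value, [[maybe_unused]] int expected_max)
{
  assert(value <= expected_max);
  return value * 2;
}
```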
In the Parquet writer, the input table is divided into multiple batches (at a 1GB limit), and each batch is processed and flushed to the sink one after another. The buffers used to process each batch are reused among batches in order to reduce peak GPU memory usage. Unfortunately, to support the retry mechanism we have to keep separate buffers for each batch, which is equivalent to always having one batch: the benefit of batch processing is stripped away.

In #13076 we intended to keep data for all batches but failed to do so, causing the bug reported in #13414. This PR fixes the issue introduced in #13076. Since we have to give up the benefit of batch processing, peak memory usage may go up. This is flagged as `breaking` because the higher peak GPU memory usage may cause downstream applications to crash.

Note that this PR is a temporary fix for the outstanding issue. With this fix, the batch-processing mechanism no longer provides any benefit for reducing peak memory usage. We are considering removing all of the batch-processing code in follow-up work, which involves many more changes.

Closes #13414.

Authors:
- Nghia Truong (https://github.com/ttnghia)

Approvers:
- Vukasin Milovanovic (https://github.com/vuule)
- Lawrence Mitchell (https://github.com/wence-)
- Benjamin Zaitlen (https://github.com/quasiben)

URL: #13438
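A minimal sketch of the trade-off described above, with illustrative types and sizes only (not the writer_impl code): reusing one scratch buffer keeps peak memory at roughly one batch, while keeping every batch's buffer alive, which the retry mechanism requires, makes peak memory proportional to the whole input.

```cpp
#include <cstddef>
#include <vector>

struct encoded_batch {
  std::vector<std::byte> bytes;  // stand-in for the device buffers used per batch
};

// Reused-buffer variant: encode a batch, flush it, then overwrite the same storage.
void write_reusing_buffer(std::vector<std::size_t> const& batch_sizes)
{
  encoded_batch scratch;
  for (auto size : batch_sizes) {
    scratch.bytes.assign(size, std::byte{0});  // peak memory ~= one batch
    // ... encode into scratch, then flush it to the sink ...
  }
}

// Retry-friendly variant: every batch keeps its own buffer until the file is done,
// so a failed flush can be retried, at the cost of higher peak memory.
std::vector<encoded_batch> encode_keeping_all_buffers(std::vector<std::size_t> const& batch_sizes)
{
  std::vector<encoded_batch> all;
  for (auto size : batch_sizes) {
    all.push_back(encoded_batch{std::vector<std::byte>(size)});  // peak ~= whole input
    // ... encode into all.back(); flush (and retry on failure) later ...
  }
  return all;
}
```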
Similar to #12949, this refactors the Parquet writer to support a retry mechanism. The internal `writer::impl::write()` function is rewritten so that it is separated into multiple pieces.

Closes:

Depends on:
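For readers unfamiliar with this kind of refactor, the general shape is sketched below with hypothetical names; the real decomposition lives in `writer_impl.cu` and differs in detail.

```cpp
#include <string>
#include <vector>

// The failure-prone work happens in a step with no side effects on the sink;
// only the final step appends bytes, so an exception thrown earlier leaves the
// output in a state where the caller can retry write() or still close() to get
// a valid footer over whatever was already flushed.
class chunked_writer_sketch {
 public:
  void write(std::vector<int> const& rows)
  {
    auto encoded = encode(rows);  // may throw (e.g. out of memory); sink_ untouched
    flush(encoded);               // the only step that mutates the output
  }
  void close() { sink_ += "<footer>"; }

 private:
  std::string encode(std::vector<int> const& rows) { return std::string(rows.size(), 'x'); }
  void flush(std::string const& bytes) { sink_ += bytes; }
  std::string sink_;  // stand-in for the output file
};
```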