
Refactor Parquet chunked writer #13076

Merged 51 commits into rapidsai:branch-23.06 on May 1, 2023

Conversation

ttnghia (Contributor) commented Apr 6, 2023

Similar to #12949, this refactors the Parquet writer to support a retry mechanism. The internal `writer::impl::write()` function is rewritten so that it is split into multiple pieces:

  • A free function that compresses/encodes the input table into intermediate results. These intermediate results are completely independent of the writer.
  • Once the intermediate results from the previous step are available, they are applied to the output data sink to perform the actual write (see the sketch below).
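A minimal sketch of the intended two-phase structure (all type and function names here are placeholders, not the actual libcudf API):

```cpp
// Placeholder types standing in for the real cudf classes.
struct table_view {};          // stand-in for cudf::table_view
struct intermediate_data {};   // encoded/compressed pages, independent of the writer
struct data_sink {
  void write(intermediate_data const&) { /* flush bytes to the output file */ }
};

// Step 1: a free function that encodes/compresses the input into intermediate
// results. It never touches the sink, so a failure here (e.g. out of GPU memory)
// leaves the already-written output intact and the call can be retried.
intermediate_data convert_table_to_parquet_data(table_view const& /*input*/)
{
  return intermediate_data{};
}

// Step 2: only after step 1 succeeds are the results applied to the sink.
class chunked_writer {
 public:
  void write(table_view const& input)
  {
    auto staged = convert_table_to_parquet_data(input);  // may throw; safe to retry
    _sink.write(staged);                                  // the actual output happens here
  }

 private:
  data_sink _sink;
};
```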

Closes:

Depends on:

@ttnghia ttnghia added feature request, 2 - In Progress, libcudf, cuIO, Spark, and non-breaking labels Apr 6, 2023
@ttnghia ttnghia self-assigned this Apr 6, 2023
@ttnghia ttnghia linked an issue Apr 6, 2023 that may be closed by this pull request
ttnghia (Contributor, Author) commented Apr 26, 2023

> Out of curiosity, why does this have to happen inside cudf? Can't the Spark code just create a new writer and use that?

Of course we could copy the Parquet writer code (a lot of it) into spark-rapids-jni and then do this refactor there.

However, this PR doesn't implement anything new; it is just a refactor. Since it supports a Spark need, it should happen in cudf to avoid a large amount of code duplication in spark-rapids-jni.

vuule (Contributor) commented Apr 26, 2023

> Out of curiosity, why does this have to happen inside cudf? Can't the Spark code just create a new writer and use that?

The new writer would not be able to append to the existing file (only relevant for chunked writer). The existing writer can at least retry OR close to write the footer.

vuule (Contributor) left a comment

Looks great!

nvdbaranec (Contributor) left a comment

Small stuff.

});

// Init page fragments
// 5000 is good enough for up to ~200-character strings. Longer strings and deeply nested columns
// will start producing fragments larger than the desired page size, so calculate fragment sizes
// for each leaf column. Skip if the fragment size is not the default.
auto max_page_fragment_size = _max_page_fragment_size.value_or(default_max_page_fragment_size);
size_type max_page_fragment_size =

Suggested change: `size_type max_page_fragment_size =` → `size_type const max_page_fragment_size =`

ttnghia (Contributor, Author) replied:

Oh this can't be const since it will be modified.
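For context, a simplified sketch of the pattern in question (not the actual libcudf code): the fragment size starts from the default or user-provided value and may be lowered afterwards based on per-column estimates, so the declaration cannot be `const`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using size_type = int32_t;
constexpr size_type default_max_page_fragment_size = 5000;

// Simplified sketch: the variable is initialized once and then reduced per
// leaf column, so marking it const would not compile.
size_type compute_fragment_size(std::vector<size_type> const& per_column_estimates)
{
  size_type max_page_fragment_size = default_max_page_fragment_size;
  for (auto est : per_column_estimates) {
    max_page_fragment_size = std::min(max_page_fragment_size, est);  // mutation forbids const
  }
  return max_page_fragment_size;
}
```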

@ttnghia ttnghia requested a review from nvdbaranec April 28, 2023 14:36
@ttnghia ttnghia removed the request for review from harrism May 1, 2023 19:43
ttnghia (Contributor, Author) commented May 1, 2023

/merge

@rapids-bot rapids-bot bot merged commit d7e7c0b into rapidsai:branch-23.06 May 1, 2023
@ttnghia ttnghia deleted the refactor_parquet_writer branch May 1, 2023 21:23
rapids-bot bot pushed a commit that referenced this pull request May 2, 2023
Fix some unused variable/parameter warnings introduced by #13076.
My old nvcc 11.5 compiler found these. Removing some of them also uncovered functions that could be removed.
Some variables/parameters are now declared with `[[maybe_unused]]`.

```
/cudf/cpp/src/io/parquet/writer_impl.cu(575): error #177-D: variable "data_col_type" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(906): error #177-D: parameter "stream" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(908): error #177-D: variable "col" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(1290): error #177-D: parameter "max_page_uncomp_data_size" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(1411): error #177-D: parameter "input" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(1712): error #177-D: variable "dict_info_owner" was declared but never referenced

```
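For reference, a minimal example of how the attribute suppresses such warnings (illustrative only; the function and variable names below are made up, not the actual libcudf declarations):

```cpp
// [[maybe_unused]] tells the compiler an entity may legitimately go unreferenced.
void process_pages([[maybe_unused]] int stream_id)  // parameter kept for interface symmetry
{
  [[maybe_unused]] int debug_counter = 0;  // only referenced in debug builds
#ifndef NDEBUG
  ++debug_counter;
#endif
}
```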

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13263
rapids-bot bot pushed a commit that referenced this pull request May 25, 2023
In the Parquet writer, the input table is divided into multiple batches (with a 1GB limit), and each batch is processed and flushed to the sink one after another. The buffers holding the data for each batch are reused across batches to reduce peak GPU memory usage.

Unfortunately, in order to support the retry mechanism, we have to keep separate buffers for each batch. This is equivalent to always having a single batch, so the benefit of batch processing is stripped away. In #13076, we expected to keep the data for all batches but failed to do so, causing the bug reported in #13414.
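A simplified illustration of the trade-off, using placeholder code rather than the actual writer implementation: reusing one staging buffer keeps peak usage at roughly the largest batch, while keeping a buffer per batch (needed so a failed flush can be retried without re-encoding) makes peak usage grow with the whole table.

```cpp
#include <cstddef>
#include <vector>

struct batch { std::size_t encoded_bytes; };  // placeholder for one 1GB-limited batch

// Before: a single staging buffer reused across batches.
// Peak memory is roughly the size of the largest batch.
void write_with_buffer_reuse(std::vector<batch> const& batches)
{
  std::vector<char> staging;
  for (auto const& b : batches) {
    staging.resize(b.encoded_bytes);  // reuse the same allocation for every batch
    // ... encode the batch into `staging`, then flush it to the sink ...
  }
}

// After (retry support): one buffer per batch, all kept alive.
// Peak memory is roughly the sum over all batches, i.e. effectively one big batch.
void write_with_retry_support(std::vector<batch> const& batches)
{
  std::vector<std::vector<char>> staging(batches.size());
  for (std::size_t i = 0; i < batches.size(); ++i) {
    staging[i].resize(batches[i].encoded_bytes);
    // ... encode batch i into staging[i]; flush only after every batch succeeds ...
  }
}
```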

This PR fixes the issue introduced in #13076. Since we have to strip away the benefit of batch processing, peak memory usage may go up.

This is flagged as `breaking` because peak GPU memory usage may go up and cause downstream applications to crash.

Note that this PR is a temporary fix for the outstanding issue. With this fix, the batch processing mechanism no longer provides any benefit for reducing peak memory usage. We are considering removing the batch processing code completely in follow-up work, which involves a lot more changes.

Closes #13414.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Lawrence Mitchell (https://github.com/wence-)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #13438
Labels
3 - Ready for Review, cuIO, feature request, libcudf, non-breaking, Spark
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Retry support for chunked Parquet writer
4 participants