
Refactor Parquet chunked writer #13076

Merged 51 commits into rapidsai:branch-23.06 on May 1, 2023

Conversation

ttnghia (Contributor) commented Apr 6, 2023

Similar to #12949, this refactors the Parquet writer to support a retry mechanism. The internal `writer::impl::write()` function is rewritten so that it is split into multiple pieces:

  • A free function that compresses/encodes the input table into intermediate results. These intermediate results are completely independent of the writer.
  • Once the intermediate results from the previous step are available, they are applied to the output data sink to perform the actual write (see the sketch below).
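A minimal sketch of the intended two-phase structure (all type and function names here are placeholders, not the actual libcudf API):

```cpp
// Placeholder types standing in for the real cudf classes.
struct table_view {};          // stand-in for cudf::table_view
struct intermediate_data {};   // encoded/compressed pages, independent of the writer
struct data_sink {
  void write(intermediate_data const&) { /* flush bytes to the output file */ }
};

// Step 1: a free function that encodes/compresses the input into intermediate
// results. It never touches the sink, so a failure here (e.g. out of GPU memory)
// leaves the already-written output intact and the call can be retried.
intermediate_data convert_table_to_parquet_data(table_view const& /*input*/)
{
  return intermediate_data{};
}

// Step 2: only after step 1 succeeds are the results applied to the sink.
class chunked_writer {
 public:
  void write(table_view const& input)
  {
    auto staged = convert_table_to_parquet_data(input);  // may throw; safe to retry
    _sink.write(staged);                                  // the actual output happens here
  }

 private:
  data_sink _sink;
};
```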

Closes:

Depends on:

@ttnghia ttnghia added feature request, 2 - In Progress, libcudf, cuIO, Spark, and non-breaking labels Apr 6, 2023
@ttnghia ttnghia self-assigned this Apr 6, 2023
@ttnghia ttnghia linked an issue Apr 6, 2023 that may be closed by this pull request
ttnghia (Contributor, Author) commented Apr 26, 2023

> Out of curiosity, why does this have to happen inside cudf? Can't the Spark code just create a new writer and use that?

Of course we could copy the Parquet writer code (a lot of it) into spark-rapids-jni and then do this refactor there.

However, this PR doesn't implement anything new; it is just a refactor. Since it supports a Spark need, it should happen in cudf to avoid a large amount of code duplication in spark-rapids-jni.

vuule (Contributor) commented Apr 26, 2023

> Out of curiosity, why does this have to happen inside cudf? Can't the Spark code just create a new writer and use that?

The new writer would not be able to append to the existing file (only relevant for chunked writer). The existing writer can at least retry OR close to write the footer.

vuule (Contributor) left a comment

Looks great!

nvdbaranec (Contributor) left a comment

Small stuff.

});

// Init page fragments
// 5000 is good enough for up to ~200-character strings. Longer strings and deeply nested columns
// will start producing fragments larger than the desired page size, so calculate fragment sizes
// for each leaf column. Skip if the fragment size is not the default.
auto max_page_fragment_size = _max_page_fragment_size.value_or(default_max_page_fragment_size);
size_type max_page_fragment_size =

Suggested change: `size_type max_page_fragment_size =` → `size_type const max_page_fragment_size =`

ttnghia (Contributor, Author) replied:

Oh this can't be const since it will be modified.
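For context, a simplified sketch of the pattern in question (not the actual libcudf code): the fragment size starts from the default or user-provided value and may be lowered afterwards based on per-column estimates, so the declaration cannot be `const`.

```cpp
#include <algorithm>
#include <cstdint>
#include <vector>

using size_type = int32_t;
constexpr size_type default_max_page_fragment_size = 5000;

// Simplified sketch: the variable is initialized once and then reduced per
// leaf column, so marking it const would not compile.
size_type compute_fragment_size(std::vector<size_type> const& per_column_estimates)
{
  size_type max_page_fragment_size = default_max_page_fragment_size;
  for (auto est : per_column_estimates) {
    max_page_fragment_size = std::min(max_page_fragment_size, est);  // mutation forbids const
  }
  return max_page_fragment_size;
}
```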

@ttnghia ttnghia requested a review from nvdbaranec April 28, 2023 14:36
@ttnghia ttnghia removed the request for review from harrism May 1, 2023 19:43
ttnghia (Contributor, Author) commented May 1, 2023

/merge

@rapids-bot rapids-bot bot merged commit d7e7c0b into rapidsai:branch-23.06 May 1, 2023
@ttnghia ttnghia deleted the refactor_parquet_writer branch May 1, 2023 21:23
rapids-bot bot pushed a commit that referenced this pull request May 2, 2023
Fix some unused variable/parameter warnings introduced by #13076.
My old nvcc 11.5 compiler found these. Removing some of them also uncovered functions that could be removed.
Some variables/parameters are now declared with `[[maybe_unused]]`.

```
/cudf/cpp/src/io/parquet/writer_impl.cu(575): error #177-D: variable "data_col_type" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(906): error #177-D: parameter "stream" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(908): error #177-D: variable "col" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(1290): error #177-D: parameter "max_page_uncomp_data_size" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(1411): error #177-D: parameter "input" was declared but never referenced

/cudf/cpp/src/io/parquet/writer_impl.cu(1712): error #177-D: variable "dict_info_owner" was declared but never referenced

```
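For reference, a minimal example of how the attribute suppresses such warnings (illustrative only; the function and variable names below are made up, not the actual libcudf declarations):

```cpp
// [[maybe_unused]] tells the compiler an entity may legitimately go unreferenced.
void process_pages([[maybe_unused]] int stream_id)  // parameter kept for interface symmetry
{
  [[maybe_unused]] int debug_counter = 0;  // only referenced in debug builds
#ifndef NDEBUG
  ++debug_counter;
#endif
}
```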

Authors:
  - David Wendt (https://github.com/davidwendt)

Approvers:
  - Nghia Truong (https://github.com/ttnghia)
  - Karthikeyan (https://github.com/karthikeyann)

URL: #13263
rapids-bot bot pushed a commit that referenced this pull request May 25, 2023
In the Parquet writer, the input table is divided into multiple batches (with a 1GB limit), and each batch is processed and flushed to the sink one after another. The buffers holding the data for each batch are reused across batches to reduce peak GPU memory usage.

Unfortunately, in order to support the retry mechanism, we have to keep separate buffers for each batch. This is equivalent to always having a single batch, so the benefit of batch processing is stripped away. In #13076, we expected to keep the data for all batches but failed to do so, causing the bug reported in #13414.
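A simplified illustration of the trade-off, using placeholder code rather than the actual writer implementation: reusing one staging buffer keeps peak usage at roughly the largest batch, while keeping a buffer per batch (needed so a failed flush can be retried without re-encoding) makes peak usage grow with the whole table.

```cpp
#include <cstddef>
#include <vector>

struct batch { std::size_t encoded_bytes; };  // placeholder for one 1GB-limited batch

// Before: a single staging buffer reused across batches.
// Peak memory is roughly the size of the largest batch.
void write_with_buffer_reuse(std::vector<batch> const& batches)
{
  std::vector<char> staging;
  for (auto const& b : batches) {
    staging.resize(b.encoded_bytes);  // reuse the same allocation for every batch
    // ... encode the batch into `staging`, then flush it to the sink ...
  }
}

// After (retry support): one buffer per batch, all kept alive.
// Peak memory is roughly the sum over all batches, i.e. effectively one big batch.
void write_with_retry_support(std::vector<batch> const& batches)
{
  std::vector<std::vector<char>> staging(batches.size());
  for (std::size_t i = 0; i < batches.size(); ++i) {
    staging[i].resize(batches[i].encoded_bytes);
    // ... encode batch i into staging[i]; flush only after every batch succeeds ...
  }
}
```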

This PR fixes the issue introduced in #13076. Since we have to strip away the benefit of batch processing, peak memory usage may go up.

This is flagged as `breaking` because peak GPU memory usage may go up and cause downstream applications to crash.

Note that this PR is a temporary fix for the outstanding issue. With this fix, the batch processing mechanism no longer provides any benefit for reducing peak memory usage. We are considering removing the batch processing code completely in follow-up work, which involves a lot more changes.

Closes #13414.

Authors:
  - Nghia Truong (https://github.com/ttnghia)

Approvers:
  - Vukasin Milovanovic (https://github.com/vuule)
  - Lawrence Mitchell (https://github.com/wence-)
  - Benjamin Zaitlen (https://github.com/quasiben)

URL: #13438
Labels
3 - Ready for Review, cuIO, feature request, libcudf, non-breaking, Spark
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEA] Retry support for chunked Parquet writer
4 participants