Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Parquet writer dictionary encoding refactor #8476

Merged

Conversation

devavret
Copy link
Contributor

@devavret devavret commented Jun 9, 2021

Replaces previous parquet dictionary encoding code with one that uses cuCollections' static map.

Adds cuCollections to libcudf

Closes #7873
Fixes #8890

Currently blocked on Pascal support for static_map in cuCollections

(More details to be added)

cpp/CMakeLists.txt Outdated Show resolved Hide resolved
@devavret devavret requested a review from ttnghia August 11, 2021 21:33
@devavret devavret added 3 - Ready for Review Ready for review by team 4 - Needs cuIO Reviewer and removed 2 - In Progress Currently a work in progress labels Aug 11, 2021
cpp/src/io/parquet/chunk_dict.cu Show resolved Hide resolved
cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved
auto t = threadIdx.x;

size_type start_row = block_x * 5000;
size_type end_row = min(start_row + 5000, num_rows);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Then we need to explicitly add the prefix cuda::min or something similar to that?

cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved

void InitializeChunkHashMaps(device_span<EncColumnChunk> chunks, rmm::cuda_stream_view stream)
{
constexpr int block_size = 1024;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Use int32_t instead, for clarification.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I believe kernel launch parameters are not sensitive to bit size.

cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved
@vuule
Copy link
Contributor

vuule commented Aug 12, 2021

How does this fix #8890?

@devavret
Copy link
Contributor Author

How does this fix #8890?

The problem was that we allocated 256 kb as scratch space for every chunk even if the chunk contained much less values. Now we only allocate any temp memory proportional to data size. There was one instance in this PR where we still allocated 256kb per chunk but that has been limited to min(256kb, chunk size).

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Partial review, still not understanding the kernels that well.

cpp/src/io/parquet/parquet_gpu.hpp Outdated Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/writer_impl.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/page_enc.cu Outdated Show resolved Hide resolved
cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved
@devavret devavret requested review from ttnghia and vuule August 17, 2021 23:00
Causing a problem where early return would result in num_dict_entries == 65536 but that meant that the dict bit calc logic would think it's ok to use dict as 65536 fits in 16 bits
Copy link
Contributor

@ttnghia ttnghia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. Note that there are still magic numbers which can be replaced by constexpr variables:

  • Magic numbers should always be avoided
  • Any constexpr value replacing a magic number should be associated with a full comment/doxygen clarifying why it has such value

Copy link
Contributor

@vuule vuule left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Just a formatting error, otherwise good to go.

cpp/src/io/parquet/chunk_dict.cu Show resolved Hide resolved
cpp/src/io/parquet/chunk_dict.cu Outdated Show resolved Hide resolved
Co-authored-by: Vukasin Milovanovic <[email protected]>
@vuule
Copy link
Contributor

vuule commented Aug 19, 2021

@gpucibot merge

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
3 - Ready for Review Ready for review by team 4 - Needs Review Waiting for reviewer to review or respond CMake CMake build issue cuIO cuIO issue improvement Improvement / enhancement to an existing function libcudf Affects libcudf (C++/CUDA) code. non-breaking Non-breaking change
Projects
None yet
7 participants