Include encode type in the error message when unsupported Parquet encoding is detected #14453

Merged (31 commits) on Dec 20, 2023
Changes from 20 commits
Commits
a56a568
Add functionality to print out list of unsupported encodings if found.
ZelboK Nov 20, 2023
2b11e5f
Merge branch 'branch-24.02' into feat-14209-exception-msg
ZelboK Nov 20, 2023
121aaa2
Clean up code to use more modern C++ & follow code style
ZelboK Nov 20, 2023
8efd769
run pre-commit
ZelboK Nov 20, 2023
16417af
address comments and clean up
ZelboK Nov 20, 2023
2bc0345
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Nov 20, 2023
f6e5fc6
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Nov 20, 2023
0cec213
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Nov 20, 2023
8a67853
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Nov 20, 2023
879e3e1
address comments/cleanup
ZelboK Nov 20, 2023
6264daf
remove unintended change
ZelboK Nov 20, 2023
e7fa44b
Merge branch 'branch-24.02' into feat-14209-exception-msg
ZelboK Nov 21, 2023
e11a697
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Nov 29, 2023
8986ab8
Merge branch 'branch-24.02' into feat-14209-exception-msg
ZelboK Dec 11, 2023
0d8baad
address comments on PR. Use bitset over raw bit manipulation. Give mo…
ZelboK Dec 11, 2023
f8ed58f
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Dec 12, 2023
324cd61
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Dec 14, 2023
501dca9
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Dec 14, 2023
7c97f94
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Dec 14, 2023
651c912
Update cpp/src/io/parquet/reader_impl_preprocess.cu
ZelboK Dec 14, 2023
d6b38cb
Merge branch 'branch-24.02' into feat-14209-exception-msg
ZelboK Dec 18, 2023
4575dde
Merge branch 'branch-24.02' into feat-14209-exception-msg
ZelboK Dec 19, 2023
11a42b6
typo fix
vuule Dec 19, 2023
da710ff
save
ZelboK Dec 19, 2023
cd197cd
have count page headers ignore unsuported encoding errors
ZelboK Dec 19, 2023
dd859de
Merge remote-tracking branch 'refs/remotes/zelbok/feat-14209-exceptio…
ZelboK Dec 19, 2023
1365398
Merge branch 'branch-24.02' into feat-14209-exception-msg
ZelboK Dec 19, 2023
99ddb69
add comment clarifying why count_page_headers skips unsupported encod…
ZelboK Dec 19, 2023
90ae791
clean up
ZelboK Dec 19, 2023
91e6fcb
change to device_span, d_begin() not a member
ZelboK Dec 19, 2023
fbd836a
Merge branch 'branch-24.02' into feat-14209-exception-msg
ttnghia Dec 20, 2023
71 changes: 69 additions & 2 deletions cpp/src/io/parquet/reader_impl_preprocess.cu
@@ -43,6 +43,7 @@

 #include <cuda/functional>

+#include <bitset>
 #include <numeric>

 namespace cudf::io::parquet::detail {
@@ -282,6 +283,67 @@ void generate_depth_remappings(std::map<int, std::pair<std::vector<int>, std::ve
   return total_pages;
 }

+/**
+ * @brief Returns a string representation of known encodings
+ *
+ * @param encoding Given encoding
+ * @return String representation of encoding
+ */
+std::string encoding_to_string(Encoding encoding)
+{
+  switch (encoding) {
+    case Encoding::PLAIN: return "PLAIN";
+    case Encoding::GROUP_VAR_INT: return "GROUP_VAR_INT";
+    case Encoding::PLAIN_DICTIONARY: return "PLAIN_DICTIONARY";
+    case Encoding::RLE: return "RLE";
+    case Encoding::BIT_PACKED: return "BIT_PACKED";
+    case Encoding::DELTA_BINARY_PACKED: return "DELTA_BINARY_PACKED";
+    case Encoding::DELTA_LENGTH_BYTE_ARRAY: return "DELTA_LENGTH_BYTE_ARRAY";
+    case Encoding::DELTA_BYTE_ARRAY: return "DELTA_BYTE_ARRAY";
+    case Encoding::RLE_DICTIONARY: return "RLE_DICTIONARY";
+    case Encoding::BYTE_STREAM_SPLIT: return "BYTE_STREAM_SPLIT";
+    case Encoding::NUM_ENCODINGS:
+    default: return "UNKNOWN(" + std::to_string(static_cast<int>(encoding)) + ")";
+  }
+}
+
+/**
+ * @brief Helper function to convert an encoding bitmask to a readable string
+ *
+ * @param bitmask Bitmask of found unsupported encodings
+ * @returns Human readable string with unsupported encodings
+ */
+std::string encoding_bitmask_to_str(int32_t encoding_bitmask)
+{
+  std::bitset<32> bits(encoding_bitmask);
+  std::string result;
+
+  for (size_t i = 0; i < bits.size(); ++i) {
+    if (bits.test(i)) {
+      auto const current = static_cast<Encoding>(i);
+      if (!is_supported_encoding(current)) { result.append(encoding_to_string(current) + " "); }
+    }
+  }
+  return result;
+}
+/**
+ * @brief Create a readable string for the user that will list out all unsupported encodings found.
+ *
+ * @param pages List of page information
+ * @param stream CUDA stream used for device memory operations and kernel launches
+ * @returns Human readable string with unsupported encodings
+ */
+std::string list_unsupported_encodings(cudf::detail::hostdevice_vector<PageInfo> const& pages,
+                                       rmm::cuda_stream_view stream)
+{
+  auto const to_mask = [] __device__(auto const& page) {
+    return is_supported_encoding(page.encoding) ? 0U : encoding_to_mask(page.encoding);
+  };
+  uint32_t const unsupported = thrust::transform_reduce(
+    rmm::exec_policy(stream), pages.d_begin(), pages.d_end(), to_mask, 0U, thrust::bit_or<uint32_t>());
+  return encoding_bitmask_to_str(unsupported);
+}
+
 /**
  * @brief Decode the page information from the given column chunks.
  *
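The hunk above collects unsupported encodings in two steps: each page contributes a one-bit mask, the masks are OR-reduced on the device, and the host then decodes the set bits into names. A minimal host-only sketch of that round trip follows. It is not the cuDF code: `Enc`, `to_mask`, `is_supported`, `name_of`, `unsupported_mask`, and `mask_to_names` are hypothetical stand-ins, the enum values are assumed, the choice of "supported" encodings is arbitrary, and a plain loop replaces `thrust::transform_reduce`.

```cpp
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical stand-in for the reader's Encoding enum (subset, values assumed).
enum class Enc : std::uint8_t { PLAIN = 0, RLE = 3, BIT_PACKED = 4, BYTE_STREAM_SPLIT = 9 };

// Assumption: each encoding maps to the bit at its enum value.
constexpr std::uint32_t to_mask(Enc e) { return 1u << static_cast<std::uint32_t>(e); }

// Assumption for illustration only: treat PLAIN and RLE as the supported set.
constexpr bool is_supported(Enc e) { return e == Enc::PLAIN || e == Enc::RLE; }

std::string name_of(Enc e)
{
  switch (e) {
    case Enc::PLAIN: return "PLAIN";
    case Enc::RLE: return "RLE";
    case Enc::BIT_PACKED: return "BIT_PACKED";
    case Enc::BYTE_STREAM_SPLIT: return "BYTE_STREAM_SPLIT";
  }
  return "UNKNOWN(" + std::to_string(static_cast<int>(e)) + ")";
}

// Host analogue of the device reduction: OR together the masks of unsupported encodings.
std::uint32_t unsupported_mask(std::vector<Enc> const& page_encodings)
{
  std::uint32_t mask = 0;
  for (Enc e : page_encodings) {
    if (!is_supported(e)) { mask |= to_mask(e); }
  }
  return mask;
}

// Decode the set bits back into space-separated names (no trailing separator).
std::string mask_to_names(std::uint32_t mask)
{
  std::bitset<32> const bits(mask);
  std::string out;
  for (std::size_t i = 0; i < bits.size(); ++i) {
    if (bits.test(i)) {
      if (!out.empty()) { out += ' '; }
      out += name_of(static_cast<Enc>(i));
    }
  }
  return out;
}
```

Because the reduction is a bitwise OR, a repeated encoding sets its bit only once, so each unsupported encoding appears once in the message no matter how many pages use it.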
@@ -307,8 +369,13 @@ int decode_page_headers(cudf::detail::hostdevice_vector<ColumnChunkDesc>& chunks
   DecodePageHeaders(chunks.device_ptr(), chunks.size(), error_code.data(), stream);

   if (error_code.value() != 0) {
-    // TODO(ets): if an unsupported encoding was detected, do extra work to figure out which one
-    CUDF_FAIL("Parquet header parsing failed with code(s)" + error_code.str());
+    if (BitAnd(error_code.value(), decode_error::UNSUPPORTED_ENCODING) != 0) {
+      auto const unsupported_str =
+        ". With unsupported encodings found: " + list_unsupported_encodings(pages, stream);
+      CUDF_FAIL("Parquet header parsing failed with code(s) " + error_code.str() + unsupported_str);
+    } else {
+      CUDF_FAIL("Parquet header parsing failed with code(s) " + error_code.str());
+    }
   }

   // compute max bytes needed for level data
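The branch in this hunk appends the encoding list only when the unsupported-encoding bit is set in the accumulated error code. A small host-only sketch of that message composition is shown here. `decode_flag`, its values, `header_error_message`, and the hexadecimal formatting of the code are assumptions for illustration; the real `error_code.str()` output and `decode_error` values are not shown in this diff.

```cpp
#include <cstdint>
#include <sstream>
#include <string>

// Hypothetical error flags; the actual decode_error values are not part of this diff.
enum decode_flag : std::uint32_t { DATA_OVERRUN = 0x1, UNSUPPORTED_ENCODING = 0x4 };

// Build the failure message, adding the encoding names only when the
// UNSUPPORTED_ENCODING bit is present in the accumulated code.
std::string header_error_message(std::uint32_t code, std::string const& unsupported_names)
{
  std::ostringstream msg;
  msg << "Parquet header parsing failed with code(s) 0x" << std::hex << code;
  if ((code & UNSUPPORTED_ENCODING) != 0) {
    msg << ". With unsupported encodings found: " << unsupported_names;
  }
  return msg.str();
}
```

Testing the bit rather than comparing the whole code keeps the extra detail available when several errors were reported at once.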