Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-31992: [C++][Parquet] Handling the special case when DataPageV2 values buffer is empty #45252

Open
wants to merge 4 commits into
base: main
Choose a base branch
from
Open
Changes from 1 commit
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
21 changes: 14 additions & 7 deletions cpp/src/parquet/column_reader.cc
Original file line number Diff line number Diff line change
Expand Up @@ -580,13 +580,20 @@ std::shared_ptr<Buffer> SerializedPageReader::DecompressIfNeeded(
memcpy(decompressed, page_buffer->data(), levels_byte_len);
}

// Decompress the values
PARQUET_ASSIGN_OR_THROW(
auto decompressed_len,
decompressor_->Decompress(compressed_len - levels_byte_len,
page_buffer->data() + levels_byte_len,
uncompressed_len - levels_byte_len,
decompression_buffer_->mutable_data() + levels_byte_len));
// GH-31992: DataPageV2 may store only levels and no values
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we fix this in the Snappy decompressor instead?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, which one do you prefer?

Copy link
Member

@pitrou pitrou Jan 14, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends whether an empty buffer is a normal compression result or a shortcut taken by parquet-java. Let me see.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, it's really a bug in parquet-java, because a 0-size buffer compressed to a 1-size buffer using Snappy.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I handle the 0,0 case in snappy 🤔?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So should I handle the 0,0 case in snappy 🤔?

No, sorry. We should work around it in Parquet.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I mean parquet-java is able to produce 0-sized compressed data as the example. How can I reproduce your case where 0 input is compressed to 1 byte?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not my case, it's just what Snappy produces when you ask it to compress 0 byte.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

it's just what Snappy produces when you ask it to compress 0 byte.

IMO that's:

  1. "Compressed" Data is 0 byte
  2. Actually, the levels holds k bytes ?

I don't know how parquet-java works, I tried parquet rust and it failed to read

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, a 0-byte compressed data is invalid. It's probably a special case in the Parquet Java implementation.

// when all values are null. In this case, we can avoid decompressing
// the rest of the page.
int64_t decompressed_len = 0;
if (uncompressed_len - levels_byte_len != 0) {
mapleFU marked this conversation as resolved.
Show resolved Hide resolved
// Decompress the values
PARQUET_ASSIGN_OR_THROW(
decompressed_len,
decompressor_->Decompress(
compressed_len - levels_byte_len, page_buffer->data() + levels_byte_len,
uncompressed_len - levels_byte_len,
decompression_buffer_->mutable_data() + levels_byte_len));
}

if (decompressed_len != uncompressed_len - levels_byte_len) {
throw ParquetException("Page didn't decompress to expected size, expected: " +
std::to_string(uncompressed_len - levels_byte_len) +
Expand Down
Loading