GH-45185: Add bad_data file with invalid repetition levels #67
base: master
Conversation
CC @wgtmac
The file size is 1.2K. Could we reduce it as much as possible? For example:

* Reduce row count
* Use int32 values
* Disable dictionary encoding and statistics
* Use correct list structure with logical type annotation
Thanks for the update! Will merge it tomorrow if no objection.
Thanks! I thought I left a comment earlier, but GitHub was having an outage, so maybe it got lost. I made the suggested changes but kept the data uncompressed, as enabling zstd compression actually increased the file size slightly with such small data. For reference, this is the code I used to generate the file:

```cpp
auto element_node = PrimitiveNode::Make("element", Repetition::REQUIRED,
                                        LogicalType::Int(32, true), Type::INT32);
auto list_node = GroupNode::Make("list", Repetition::REPEATED, {element_node});
auto column_node = GroupNode::Make("x", Repetition::REQUIRED, {list_node},
                                   LogicalType::List());
auto root_node = GroupNode::Make("root", Repetition::REQUIRED, {column_node}, nullptr);

WriterProperties::Builder prop_builder;
std::shared_ptr<WriterProperties> writer_properties = prop_builder
    .disable_dictionary()
    ->disable_write_page_index()
    ->disable_statistics()
    ->compression(Compression::UNCOMPRESSED)
    ->build();

PARQUET_ASSIGN_OR_THROW(auto out_file, ::arrow::io::FileOutputStream::Open(file_path));
auto file_writer = ParquetFileWriter::Open(
    out_file, std::static_pointer_cast<GroupNode>(root_node), writer_properties);
auto row_group_writer = file_writer->AppendRowGroup();
auto column_writer =
    static_cast<TypedColumnWriter<Int32Type>*>(row_group_writer->NextColumn());

constexpr size_t num_leaf_values = 10;
std::vector<int32_t> values(num_leaf_values);
std::vector<int16_t> rep_levels(num_leaf_values);
std::vector<int16_t> def_levels(num_leaf_values);
for (size_t i = 0; i < num_leaf_values; ++i) {
  values[i] = static_cast<int32_t>(i);
  // Intentionally invalid: the first repetition level written is 1,
  // but a column chunk must begin a new record, i.e. start at level 0.
  rep_levels[i] = i % 2 == 0 ? 1 : 0;
  def_levels[i] = 1;
}
column_writer->WriteBatch(num_leaf_values, def_levels.data(), rep_levels.data(),
                          values.data());
row_group_writer->Close();
file_writer->Close();
```
Follow-up to #65. For apache/arrow#45185.
This adds a file generated with the same bad logic previously used for generating the encryption test files, but without any encryption. It also contains only the problematic int64 list column rather than all the test data columns, and I've disabled compression so that it can be used in tests without Snappy support enabled.