Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GH-45185: Add bad_data file with invalid repetition levels #67

Open
wants to merge 2 commits into
base: master
Choose a base branch
from

Conversation

adamreeve
Copy link
Contributor

Follow up to #65. For apache/arrow#45185

This adds a file generated with the same bad logic previously used for generating encryption test files but without any encryption. It also contains only the problematic int64 list column rather than all the test data columns, and I've disabled compression so that this can be used for tests without needing Snappy enabled.

@raulcd
Copy link
Member

raulcd commented Jan 7, 2025

CC @wgtmac

@wgtmac
Copy link
Member

wgtmac commented Jan 8, 2025

File path:  bad_data/ARROW-GH-45185.parquet
Created by: parquet-cpp-arrow version 19.0.0-SNAPSHOT
Properties: (none)
Schema:
message schema {
  repeated int64 int64_field;
}

Row group 0:  count: 50  19.10 B records  start: 4  total(compressed): 955 B total(uncompressed):955 B
--------------------------------------------------------------------------------
             type      encodings count     avg size   nulls   min / max
int64_field  INT64     _ _ R     100       9.55 B     0       "0" / "99000000000000"


Column: int64_field
--------------------------------------------------------------------------------
  page   type  enc  count   avg size   size       rows     nulls   min / max
  0-D    dict  _ _  100     8.00 B     800 B
  0-1    data  _ R  100     1.18 B     118 B

      "columnIndexReference" : {
        "offset" : 959,
        "length" : 31
      },
      "offsetIndexReference" : {
        "offset" : 990,
        "length" : 12
      },

The file size is 1.2K. Could we reduce it as much as possible? For example:

  • leverage compression like zstd
  • disable dictionary encoding
  • disable page index
  • reduce row count

BTW, repeated int64 int64_field is a special case of unannotated list type which we should avoid: https://github.com/apache/parquet-format/blob/master/LogicalTypes.md?plain=1#L607-L624. Should we replace it with LIST-annotated type? cc @pitrou @mapleFU

* Reduce row count
* Use int32 values
* Disable dictionary encoding and statistics
* Use correct list structure with logical type annotation
@wgtmac
Copy link
Member

wgtmac commented Jan 9, 2025

Thanks for the update! Will merge it tomorrow if no objection.

@adamreeve
Copy link
Contributor Author

adamreeve commented Jan 9, 2025

Thanks! I thought I left a comment earlier but GitHub was having an outage so maybe it got lost. I made the suggested changes but kept the data uncompressed as enabling zstd compression actually increased the file size slightly with such small data.

For reference, this is the code I used to generate the file:

auto element_node = PrimitiveNode::Make("element", Repetition::REQUIRED, LogicalType::Int(32, true), Type::INT32);
auto list_node = GroupNode::Make("list", Repetition::REPEATED, {element_node});
auto column_node = GroupNode::Make("x", Repetition::REQUIRED, {list_node}, LogicalType::List());
auto root_node = GroupNode::Make("root", Repetition::REQUIRED, {column_node}, nullptr);

WriterProperties::Builder prop_builder;
std::shared_ptr<WriterProperties> writer_properties = prop_builder
  .disable_dictionary()
  ->disable_write_page_index()
  ->disable_statistics()
  ->compression(Compression::UNCOMPRESSED)
  ->build();

PARQUET_ASSIGN_OR_THROW(auto out_file, ::arrow::io::FileOutputStream::Open(file_path));
auto file_writer = ParquetFileWriter::Open(out_file, std::static_pointer_cast<GroupNode>(root_node), writer_properties);

auto row_group_writer = file_writer->AppendRowGroup();
auto column_writer =
    static_cast<TypedColumnWriter<Int32Type>*>(row_group_writer->NextColumn());

constexpr size_t num_leaf_values = 10;
std::vector<int32_t> values(num_leaf_values);
std::vector<int16_t> rep_levels(num_leaf_values);
std::vector<int16_t> def_levels(num_leaf_values);
for (size_t i = 0; i < num_leaf_values; ++i) {
  values[i] = static_cast<int32_t>(i);
  rep_levels[i] = i % 2 == 0 ? 1 : 0;
  def_levels[i] = 1;
}

column_writer->WriteBatch(num_leaf_values, def_levels.data(), rep_levels.data(), values.data());

row_group_writer->Close();
file_writer->Close();

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants