Reduce `__sequence` field size in parquet files #5010

WenyXu · 2024-11-18T06:35:40Z

What type of enhancement is this?

Refactor

What does the enhancement do?

In our Parquet file analysis, the `__sequence` field accounts for approximately 67% of the total file size, a disproportionate share for a single internal column. This results in inefficient storage usage and potential performance bottlenecks.

File: 9bc23ce8-7046-4ff8-a209-1245827a7a89.parquet

| Column Name | Size (Bytes) | Size (Ratio) |
| --- | ---: | --- |
| `__op_type` | 54,825 | 0.00016 (0.016%) |
| `greptime_value` | 39,894,514 | 0.117 (11.75%) |
| `__sequence` | 228,302,552 | 0.672 (67.23%) |
| `__primary_key` | 18,000,415 | 0.053 (5.30%) |
| `greptime_timestamp` | 53,318,216 | 0.157 (15.70%) |

The `__sequence` field clearly dominates the file size, overshadowing other important columns such as `greptime_value` and `greptime_timestamp`.
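As a quick sanity check, the ratios reported for this file can be reproduced from the byte counts alone. The sketch below assumes the "total size" is simply the sum of the five listed column sizes (file footer and other metadata are ignored); the `size_ratios` helper is a hypothetical name used only for illustration.

```python
# Column sizes in bytes, copied from the table in this issue.
column_sizes = {
    "__op_type": 54_825,
    "greptime_value": 39_894_514,
    "__sequence": 228_302_552,
    "__primary_key": 18_000_415,
    "greptime_timestamp": 53_318_216,
}


def size_ratios(sizes):
    """Return each column's share of the summed column sizes.

    Assumption: the total is the sum of the listed columns only.
    """
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


ratios = size_ratios(column_sizes)

# Print columns from largest to smallest share.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {column_sizes[name]:>12,} {ratio:7.2%}")
```

Under that assumption, `__sequence` comes out as the largest column by a wide margin, consistent with the roughly 67% figure reported here.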

Implementation challenges

No response

WenyXu added the C-enhancement Category Enhancements label Nov 18, 2024
