Reduce __sequence field size in parquet files #5010

Open
WenyXu opened this issue Nov 18, 2024 · 0 comments
Labels
C-enhancement Category Enhancements

Comments

Member

WenyXu commented Nov 18, 2024

What type of enhancement is this?

Refactor

What does the enhancement do?

Our analysis of a Parquet file shows that the __sequence field takes up a disproportionate share of the file, about 67% of the total size. This wastes storage and can become a performance bottleneck.

File: 9bc23ce8-7046-4ff8-a209-1245827a7a89.parquet

| Column Name | Size (Bytes) | Size (Ratio) |
|---|---|---|
| __op_type | 54,825 | 0.00016 (0.016%) |
| greptime_value | 39,894,514 | 0.117 (11.75%) |
| __sequence | 228,302,552 | 0.672 (67.23%) |
| __primary_key | 18,000,415 | 0.053 (5.30%) |
| greptime_timestamp | 53,318,216 | 0.157 (15.70%) |

The __sequence field clearly dominates the file: it is more than twice the size of greptime_value and greptime_timestamp combined.
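
For anyone who wants to double-check the ratios, here is a minimal Python sketch that recomputes them from the byte counts reported above. It assumes the ratio is each column's share of the summed column sizes (339,570,522 bytes), which matches the reported percentages; the helper name `size_ratios` is only illustrative and is not part of GreptimeDB.

```python
# Per-column sizes in bytes, copied from the table in this issue.
column_sizes = {
    "__op_type": 54_825,
    "greptime_value": 39_894_514,
    "__sequence": 228_302_552,
    "__primary_key": 18_000_415,
    "greptime_timestamp": 53_318_216,
}


def size_ratios(sizes):
    """Return each column's share of the summed column sizes (hypothetical helper)."""
    total = sum(sizes.values())
    return {name: size / total for name, size in sizes.items()}


ratios = size_ratios(column_sizes)

# Print the columns from largest to smallest share.
for name, ratio in sorted(ratios.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:<20} {column_sizes[name]:>12,} {ratio:7.2%}")
```

The same helper can be pointed at the sizes of any other file to see whether __sequence dominates there as well.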

Implementation challenges

No response

WenyXu added the C-enhancement Category Enhancements label Nov 18, 2024