Tune CADD output size #57

Open
holtgrewe opened this issue May 31, 2023 · 3 comments · Fixed by #58
Labels
enhancement New feature or request

Comments

@holtgrewe
Contributor

Is your feature request related to a problem? Please describe.
With the current compression settings, the RocksDB database for CADD comes to 426GB, while the bgzip-ed TSV files take only 252GB. That is roughly 69% more than the source data, which is clearly too much.

Describe the solution you'd like
Consider other encoding strategies.

Describe alternatives you've considered
N/A

Additional context
N/A

@holtgrewe holtgrewe added the enhancement New feature or request label May 31, 2023
@holtgrewe
Contributor Author

Going through the windows in random order (after shuffling) does not help. I suspect zstd is already smart enough about building local dictionaries. The next attempt will be to store the raw TSV lines and convert them on the fly when reading.
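To illustrate the "store the raw TSV line, convert on the fly" idea, here is a minimal sketch in plain Rust. It is not the code from #58: the `ScoreRecord` struct, its six columns, and the `encode_value`/`decode_value` helpers are hypothetical names chosen for illustration, and the real CADD files have many more columns. The point is only that the stored value stays as the original text and is parsed into typed fields at read time.

```rust
/// Hypothetical record with a small, illustrative subset of columns.
#[derive(Debug, PartialEq)]
pub struct ScoreRecord {
    pub chrom: String,
    pub pos: u32,
    pub reference: String,
    pub alternative: String,
    pub raw_score: f64,
    pub phred: f64,
}

/// Build the value to store: the TSV line as bytes, without the
/// trailing line terminator. No numeric conversion happens on write.
pub fn encode_value(tsv_line: &str) -> Vec<u8> {
    tsv_line
        .trim_end_matches(|c| c == '\n' || c == '\r')
        .as_bytes()
        .to_vec()
}

/// Convert a stored value into a typed record when it is read.
pub fn decode_value(bytes: &[u8]) -> Result<ScoreRecord, String> {
    let line = std::str::from_utf8(bytes).map_err(|e| e.to_string())?;
    let fields: Vec<&str> = line.split('\t').collect();
    if fields.len() != 6 {
        return Err(format!("expected 6 columns, found {}", fields.len()));
    }
    Ok(ScoreRecord {
        chrom: fields[0].to_string(),
        pos: fields[1]
            .parse::<u32>()
            .map_err(|e| format!("pos: {}", e))?,
        reference: fields[2].to_string(),
        alternative: fields[3].to_string(),
        raw_score: fields[4]
            .parse::<f64>()
            .map_err(|e| format!("raw_score: {}", e))?,
        phred: fields[5]
            .parse::<f64>()
            .map_err(|e| format!("phred: {}", e))?,
    })
}
```

The trade-off in this sketch is that every read pays for parsing, while the stored bytes stay as close as possible to what the compressor already handles well in the bgzip-ed files.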

@holtgrewe holtgrewe transferred this issue from varfish-org/varfish-server-worker May 31, 2023
@holtgrewe
Contributor Author

The size goes down from 426GB to 358GB (the bgzip-ed TSV files are 252GB). This is still roughly 40% more than the bgzip-ed data. I'll merge #58 for now but keep this ticket open.
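The overhead figures can be checked with a few lines of arithmetic. The sketch below uses only the three sizes quoted in this thread (426GB, 358GB, 252GB); the function names are made up for illustration.

```rust
/// Percentage by which `size` exceeds `baseline`.
pub fn overhead_percent(size: f64, baseline: f64) -> f64 {
    (size / baseline - 1.0) * 100.0
}

/// Percentage saved when going from `before` to `after`.
pub fn reduction_percent(before: f64, after: f64) -> f64 {
    (before - after) / before * 100.0
}

/// Format the comparison for the sizes quoted in the thread (in GB).
pub fn summary() -> String {
    let bgzip = 252.0;
    let before = 426.0;
    let after = 358.0;
    format!(
        "before: +{:.1}% over bgzip, after: +{:.1}% over bgzip, saved: {:.1}%",
        overhead_percent(before, bgzip),
        overhead_percent(after, bgzip),
        reduction_percent(before, after)
    )
}
```

With these inputs the overhead relative to the bgzip-ed files drops from about 69% to about 42%, and the database itself shrinks by about 16%.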

holtgrewe added a commit that referenced this issue Jun 1, 2023
For CADD, this still uses 40% more than the bgzip-ed downloaded data but is an improvement over storing as native Rust data types (e.g., f64).
@holtgrewe
Contributor Author

Re-opening, as the compression does not work well enough yet.

@holtgrewe holtgrewe reopened this Jun 1, 2023
@holtgrewe holtgrewe moved this to To triage in Release Planning Jan 24, 2024
@holtgrewe holtgrewe moved this from To triage to Backlog in Release Planning Jan 24, 2024
Projects
Status: Backlog

1 participant