Download compression is too slow #717

Closed
chaoran-chen opened this issue Mar 27, 2024 · 0 comments · Fixed by #727
@chaoran-chen

@corneliusroemer posted in a downstream project:

When downloading all Ebola data, I was surprised to find that zstd compressed download was very slow.

The bottleneck is clearly compression on the server, as uncompressed download happened in 2 seconds for me (60 MB raw), whereas zstd compressed download took 8 seconds (60 MB raw -> 340kB compressed).

That's a meager compression throughput (measured on the uncompressed input) of ~8 MB/s.

Running zstd on the CLI on the same uncompressed file gets me a throughput of 2 GiB/s and results in a higher compression ratio (327 KiB vs 340 KiB for the server-compressed file).

Clearly the zstd compression on the server can benefit from some tuning.

Gzip is also slower than it needs to be, running at only 5 MB/s, whereas on my CLI I get 230 MiB/s.
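One way to check numbers like these outside the server is to time compression of an in-memory buffer directly. The following is a minimal sketch, not the code used by this project: the function name `compression_throughput` and the synthetic sample data are assumptions made for illustration, and it only covers gzip (through Python's standard `zlib` module), since zstd is not assumed to be available.

```python
import time
import zlib


def compression_throughput(data: bytes, level: int) -> tuple[float, int]:
    """Return (MB/s of uncompressed input, compressed size in bytes).

    Hypothetical helper for illustration; output is gzip-framed.
    """
    start = time.perf_counter()
    # wbits=31 (16 + 15) asks zlib to emit a gzip container.
    compressor = zlib.compressobj(level, zlib.DEFLATED, 31)
    compressed = compressor.compress(data) + compressor.flush()
    # Guard against a zero interval on very small inputs.
    elapsed = max(time.perf_counter() - start, 1e-9)
    return len(data) / 1e6 / elapsed, len(compressed)


if __name__ == "__main__":
    # Synthetic, highly repetitive stand-in for nucleotide data (assumption);
    # real sequence files will compress differently.
    sample = b"ACGTTGCAAGCT" * 500_000
    for level in (1, 6, 9):
        rate, size = compression_throughput(sample, level)
        print(f"level {level}: {rate:.0f} MB/s, {size} bytes")
```

Comparing the rate at different levels against the rate observed over HTTP gives a rough idea of whether the compression level or something else in the request path (buffer sizes, streaming, flushing) is the limiting factor.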

@chaoran-chen chaoran-chen added this to the COVID-LAPIS milestone Mar 27, 2024
@fengelniederhammer fengelniederhammer self-assigned this Apr 3, 2024
fengelniederhammer added a commit that referenced this issue Apr 3, 2024
and reduce logging noise from data version checker
fengelniederhammer added a commit that referenced this issue Apr 3, 2024
Tests on a medium-sized data set (3552 sequences):
I measured request times of /sample/unalignedNucleotideSequences:
- uncompressed: 1300 ms, 65 MB
- gzip compressed: 3100 ms (before: 14 s), 8 MB
- zstd compressed: 970 ms (before: 7.4 s), 342 kB
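For comparison with the throughput figures in the issue description, the request times above can be converted into speedups and effective throughput. This is a small sketch of the arithmetic only: the helper name `summarize` is hypothetical, and the resulting rates cover the whole request (query, serialization, transfer), not compression in isolation.

```python
def summarize(raw_mb: float, before_s: float, after_s: float) -> dict[str, float]:
    """Derive speedup and effective throughput from request times.

    Hypothetical helper; throughput is uncompressed MB per second of
    total request time, so it is a lower bound on the compressor's rate.
    """
    return {
        "speedup": before_s / after_s,
        "before_mb_per_s": raw_mb / before_s,
        "after_mb_per_s": raw_mb / after_s,
    }


# Figures taken from the measurements above (65 MB uncompressed).
RAW_MB = 65.0
measurements = {"gzip": (14.0, 3.1), "zstd": (7.4, 0.97)}

for codec, (before_s, after_s) in measurements.items():
    stats = summarize(RAW_MB, before_s, after_s)
    print(
        f"{codec}: {stats['speedup']:.1f}x faster, "
        f"{stats['before_mb_per_s']:.1f} -> {stats['after_mb_per_s']:.1f} MB/s"
    )
```

With the reported times this works out to roughly a 4.5x speedup for gzip and 7.6x for zstd, with the zstd request now finishing faster than the uncompressed one.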