bottomless: add xz compression option #780
base: main
Conversation
TODO: I still need to go over the code and check that there are no more hardcoded assumptions about using gzip for backups.
LGTM. How do we choose which compression is used? Is there an env var?
Empirical testing shows that gzip achieves a mere 2x compression ratio even with very simple and repetitive data patterns. Since compression is very important for optimizing our egress traffic and throughput in general, the xz algorithm is hereby implemented as well. Run on the same data set, it achieved a ~50x compression ratio, which is orders of magnitude better than gzip, at the cost of elevated CPU usage.

Note: with more algorithms implemented, we should also consider adding code that detects which compression method was used when restoring a snapshot, to allow restoring from a gzip file but continuing new snapshots with xz. Currently, setting the compression method via the env var assumes that both restore and backup use the same algorithm.
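As a minimal sketch of env-var-driven algorithm selection (bottomless itself is Rust; this uses Python's stdlib `gzip` and `lzma` modules for illustration, and the env var name `BOTTOMLESS_COMPRESSION` is hypothetical, not the variable the project actually reads):

```python
import gzip
import lzma
import os

# Hypothetical env var name for illustration only; the real variable
# used by bottomless may be named differently.
ALGO_ENV_VAR = "BOTTOMLESS_COMPRESSION"

def compress(data: bytes) -> bytes:
    """Compress a backup payload with the algorithm chosen via env var."""
    algo = os.environ.get(ALGO_ENV_VAR, "gzip").lower()
    if algo == "xz":
        return lzma.compress(data)   # higher ratio, more CPU
    if algo == "gzip":
        return gzip.compress(data)   # lower ratio, cheaper
    raise ValueError(f"unknown compression algorithm: {algo}")

os.environ[ALGO_ENV_VAR] = "xz"
payload = b"frame" * 10_000          # simple, repetitive data pattern
blob = compress(payload)
assert lzma.decompress(blob) == payload
```

Note that, as the comment above says, restore currently assumes the same env var value that was used for backup; there is no per-file detection yet.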
The reasoning is as follows: 10000 uncompressed frames weigh 40MiB. Gzip is expected to create a ~20MiB file from them, while xz can compress them down to ~800KiB. The previous limit would make xz create a 50KiB file, which is less than the minimum 128KiB that S3-like services charge for when writing to an object store.
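The arithmetic above can be sketched directly (assuming ~4 KiB per uncompressed frame, which is what "10000 frames weigh 40MiB" implies, and taking the 2x/50x ratios from the description as rough estimates):

```python
# Reproduce the size math from the comment above.
frames = 10_000
frame_bytes = 4096                   # ~4 KiB per uncompressed frame (assumed)
raw = frames * frame_bytes           # ~40 MiB total, as stated
gzip_est = raw // 2                  # ~2x ratio  -> ~20 MiB
xz_est = raw // 50                   # ~50x ratio -> ~800 KiB
s3_min_billable = 128 * 1024         # 128 KiB minimum billable object size

assert raw == 40_960_000
assert xz_est // 1024 == 800         # 800 KiB, comfortably above 128 KiB
assert xz_est > s3_min_billable
```

With the smaller previous frame limit, the same 50x ratio would yield a ~50 KiB object, below the 128 KiB billing floor, which is why the batch size was raised.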
Yes, via an env var.
I'm getting corrupted .xz files produced with this crate at the "Best" compression level. Let me try the default one, but that's odd. The file compressed with the crate didn't properly unpack.
(yep, the regular compression level works, and compresses only ~10% worse than Best)
Best level seems to produce corrupted files.
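For reference, xz exposes preset levels 0-9 plus an "extreme" modifier; the corruption reported above was specific to the Rust crate's "Best" setting, not to the xz format itself. A quick illustration with Python's stdlib `lzma` (both levels round-trip correctly there):

```python
import lzma

data = b"some repetitive bottomless frame data " * 4096

# "Best"-like setting: highest preset plus the extreme modifier.
best = lzma.compress(data, preset=9 | lzma.PRESET_EXTREME)
# Default preset (6): the level that worked reliably in the report above.
default = lzma.compress(data, preset=6)

# Both decompress back to the original payload.
assert lzma.decompress(best) == data
assert lzma.decompress(default) == data
```

On repetitive data the gap between the default preset and the maximum one is small, which matches the ~10% observation above.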
There's one more place where compression isn't correctly autodetected - in loading main db snapshots. I'll add the code.
If the db snapshot is not found with the given compression algo, the other choices are checked too. This code will fire if somebody used to use Gzip, but then decided to restore a database that declares Xz for compressing bottomless.
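The PR's approach probes the object store for each algorithm's snapshot file when the expected one is missing. A complementary way to autodetect the algorithm on restore is to sniff the file's magic bytes instead of trusting the declared setting; a minimal sketch (Python stdlib, not the project's Rust code):

```python
import gzip
import lzma

GZIP_MAGIC = b"\x1f\x8b"          # gzip file header
XZ_MAGIC = b"\xfd7zXZ\x00"        # xz file header

def decompress_auto(blob: bytes) -> bytes:
    """Pick the decompressor from the file's magic bytes."""
    if blob.startswith(XZ_MAGIC):
        return lzma.decompress(blob)
    if blob.startswith(GZIP_MAGIC):
        return gzip.decompress(blob)
    raise ValueError("unrecognized compression format")

payload = b"snapshot-bytes" * 1000
assert decompress_auto(gzip.compress(payload)) == payload
assert decompress_auto(lzma.compress(payload)) == payload
```

Magic-byte sniffing handles the "backed up with gzip, restored under an xz config" case without any extra object-store round trips.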
k, done
Transplanted from libsql/sqld#780
Transplanted to the new repo: tursodatabase/libsql#468