This repository has been archived by the owner on Oct 18, 2023. It is now read-only.

bottomless: add xz compression option #780

Open
wants to merge 4 commits into
base: main



@psarna psarna commented Oct 16, 2023

Empirical testing shows that gzip achieves a mere 2x compression ratio even with very simple and repetitive data patterns. Since compression is very important for optimizing our egress traffic and throughput in general, the xz algorithm is implemented as well. Run on the same data set, it achieved a ~50x compression ratio, which is over an order of magnitude better than gzip, at the cost of elevated CPU usage.

Note: with more algorithms implemented, we should also consider adding code that detects which compression method was used when restoring a snapshot, to allow restoring from a gzip file while continuing new snapshots with xz. Currently, setting the compression method via the env var assumes that both restore and backup use the same algorithm.
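One way such detection could work is to look at the leading bytes of the stored object, since both formats start with a fixed signature. The sketch below is an illustration only and is not taken from this PR: the `Compression` enum and `detect_compression` function are hypothetical names, and only the signature bytes come from the gzip and .xz file format definitions. Uncompressed data could in principle start with the same bytes, so a real implementation would likely want a second check.

```rust
/// Compression formats discussed in this PR (hypothetical enum for illustration).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    None,
    Gzip,
    Xz,
}

// Leading signature bytes defined by the gzip and .xz file formats.
const GZIP_MAGIC: [u8; 2] = [0x1f, 0x8b];
const XZ_MAGIC: [u8; 6] = [0xfd, b'7', b'z', b'X', b'Z', 0x00];

/// Guess the compression format from the first bytes of a stored object.
/// Anything that matches neither signature is treated as uncompressed.
fn detect_compression(header: &[u8]) -> Compression {
    if header.starts_with(&XZ_MAGIC) {
        Compression::Xz
    } else if header.starts_with(&GZIP_MAGIC) {
        Compression::Gzip
    } else {
        Compression::None
    }
}
```

With something like this, restore could pick the decoder per object, independently of the algorithm configured for new backups.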

@psarna psarna requested review from penberg and haaawk October 16, 2023 10:20

psarna commented Oct 16, 2023

TODO: I still need to go over the code and check if there are no more hardcoded assumptions about using gzip for backups.


@haaawk haaawk left a comment


LGTM. How do we choose which compression is used? Is there an env var?

Empirical testing shows that gzip achieves a mere 2x compression ratio
even with very simple and repetitive data patterns.
Since compression is very important for optimizing our egress traffic
and throughput in general, the xz algorithm is implemented
as well. Run on the same data set, it achieved a ~50x compression ratio,
which is over an order of magnitude better than gzip, at the cost of
elevated CPU usage.

Note: with more algorithms implemented, we should also consider adding code
that detects which compression method was used when restoring a snapshot,
to allow restoring from a gzip file while continuing new snapshots with xz.
Currently, setting the compression method via the env var assumes
that both restore and backup use the same algorithm.
The reasoning is as follows: 10000 uncompressed frames weigh 40MiB.
Gzip is expected to create a ~20MiB file from them, while xz
can compress them down to ~800KiB. The previous limit would make xz
create a ~50KiB file, which is less than the 128KiB minimum that S3-like
services charge for when writing to an object store.
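The arithmetic behind these figures can be checked with a few lines. This is a back-of-the-envelope sketch, assuming the 2x and ~50x ratios quoted in the PR description and the 128KiB minimum mentioned above; the function and constant names are made up for illustration.

```rust
const KIB: u64 = 1024;
const MIB: u64 = 1024 * KIB;

/// Minimum object size that S3-like services charge for, per the commit message.
const MIN_BILLABLE: u64 = 128 * KIB;

/// Estimated compressed size in bytes for a given uncompressed size and ratio.
fn compressed_size(uncompressed: u64, ratio: u64) -> u64 {
    uncompressed / ratio
}

/// True if an object of this size would still be billed as the minimum size.
fn below_minimum_billable(size: u64) -> bool {
    size < MIN_BILLABLE
}

/// Returns (gzip estimate, xz estimate) in bytes for a 10000-frame batch.
fn batch_estimates() -> (u64, u64) {
    let batch = 40 * MIB; // 10000 uncompressed frames
    (compressed_size(batch, 2), compressed_size(batch, 50))
}
```

Under these assumptions the gzip estimate is exactly 20MiB and the xz estimate is about 819KiB, which is above the 128KiB minimum, whereas a ~50KiB object would fall below it.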

psarna commented Oct 17, 2023

Yes, an env var: LIBSQL_BOTTOMLESS_COMPRESSION=xz. But before we go ahead with this, I think I need to add code that detects the previous compression scheme on restore. Without that, it will be impossible to restore from a gzip file while using xz for all new backups.
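For illustration, reading that variable could look roughly like the sketch below. Only the variable name and the value `xz` come from this thread; the other accepted spellings, the fallback to gzip, and all type and function names are assumptions rather than the actual bottomless code.

```rust
use std::env;

/// Hypothetical enum for the configured compression method.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    None,
    Gzip,
    Xz,
}

/// Parse a user-supplied value; the accepted spellings are assumptions.
fn parse_compression(value: &str) -> Option<Compression> {
    match value.trim().to_ascii_lowercase().as_str() {
        "none" => Some(Compression::None),
        "gz" | "gzip" => Some(Compression::Gzip),
        "xz" => Some(Compression::Xz),
        _ => None,
    }
}

/// Read LIBSQL_BOTTOMLESS_COMPRESSION, falling back to gzip (assumed default)
/// when the variable is unset or holds an unrecognized value.
fn compression_from_env() -> Compression {
    env::var("LIBSQL_BOTTOMLESS_COMPRESSION")
        .ok()
        .and_then(|v| parse_compression(&v))
        .unwrap_or(Compression::Gzip)
}
```

Note that a setting like this only says what new backups should use; it does not by itself tell restore how an existing object was compressed.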


psarna commented Oct 17, 2023

I'm getting corrupted .xz files produced with this crate at the "Best" compression level. Let me try the default one, but something is off. The file compressed with the crate did not unpack properly with the xz -d shell command, which is suspicious.


psarna commented Oct 17, 2023

(yep, regular compression level works, and looks only ~10% worse than Best)

Best level seems to produce corrupted files.

psarna commented Oct 17, 2023

There's one more place where compression isn't correctly autodetected: loading main db snapshots. I'll add the code.

If the db snapshot is not found with the given compression algorithm,
the other choices are checked too. This code will fire if somebody
used to use Gzip, but then decided to restore a database
that is declared to use Xz for bottomless compression.
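The fallback described in this commit message can be sketched as trying the configured algorithm first and then the remaining ones. This is a minimal illustration, not the code from the PR: the object names (`db.xz`, `db.gz`, `db.db`), the `exists` callback standing in for an object-store lookup, and all type and function names are hypothetical.

```rust
/// Hypothetical enum for the compression method of a stored snapshot.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Compression {
    None,
    Gzip,
    Xz,
}

impl Compression {
    /// Hypothetical object name for a main db snapshot stored in this format.
    fn snapshot_key(self) -> &'static str {
        match self {
            Compression::None => "db.db",
            Compression::Gzip => "db.gz",
            Compression::Xz => "db.xz",
        }
    }
}

/// Configured algorithm first, then every other known algorithm.
fn candidates(preferred: Compression) -> Vec<Compression> {
    let mut order = vec![preferred];
    for algo in [Compression::Xz, Compression::Gzip, Compression::None] {
        if algo != preferred {
            order.push(algo);
        }
    }
    order
}

/// Return the first algorithm whose snapshot object is reported to exist.
fn find_snapshot<F>(preferred: Compression, exists: F) -> Option<Compression>
where
    F: Fn(&str) -> bool,
{
    candidates(preferred)
        .into_iter()
        .find(|algo| exists(algo.snapshot_key()))
}
```

For example, with Xz configured and only a gzip snapshot present, `find_snapshot(Compression::Xz, |key| key == "db.gz")` yields `Some(Compression::Gzip)`, which matches the scenario in the commit message.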

psarna commented Oct 17, 2023

k, done

psarna added a commit to psarna/libsql that referenced this pull request Oct 17, 2023

psarna commented Oct 17, 2023

Transplanted to the new repo: tursodatabase/libsql#468

psarna added a commit to psarna/libsql that referenced this pull request Oct 18, 2023