Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Feature request: generate and use preset compression dictionaries #197

Open
CAFxX opened this issue Aug 12, 2022 · 3 comments
Open

Feature request: generate and use preset compression dictionaries #197

CAFxX opened this issue Aug 12, 2022 · 3 comments
Assignees
Milestone

Comments

@CAFxX
Copy link

CAFxX commented Aug 12, 2022

Some compression algorithms (zstd, lz4, lzo, deflate, ...) support using a custom preset dictionary to achieve higher compression ratios especially for "small" source data sizes.

Given a corpus of known source files (or source blocks), it is thus possible to generate one or more preset dictionaries, and then use these as preset dictionaries during compression and decompression. Depending on the number and size of the source blocks/files, as well as the size and number of dictionaries, it is more or less likely that the resulting compressed size (including the size of the resulting preset dictionaries, that must be available during decompression) will be smaller than if no preset dictionaries were used.

If my understanding of squashfs is correct, squashfs could achieve higher compression ratios, especially when using small block sizes, by (optionally) attempting to generate one or more preset dictionaries based on the blocks that the image is composed of, and then attempting to compress each block using one of the generated preset dictionaries. If the resulting compressed size (+ dictionary size) is smaller than the compressed size without dictionaries, and if the extra memory usage required during decompression is acceptable, then the image compressed using the preset dictionaries would be preferable.

Note that while above I mentioned generating "one or more" dictionaries, It would be likely easier to start with a single one. Generating multiple ones would require somehow grouping "similar" files together, generating a dedicated dictionary for each group, and the use different dictionaries for different files/blocks.

An important use case for this is to efficiently compress executables that are statically compiled, and especially those that embed large runtimes, e.g. Go or Java (native image). In these cases different executables are likely to contain large identical or very similar chunks, something that a custom dictionary would be quite effective at compressing.

@CAFxX CAFxX changed the title Feature request: generate and use compression dictionaries Feature request: generate and use preset compression dictionaries Aug 12, 2022
@plougher plougher self-assigned this Aug 13, 2022
@plougher plougher added this to the Undecided milestone Aug 13, 2022
@plougher
Copy link
Owner

plougher commented Aug 13, 2022

Interesting you should bring that up ... That idea, along with block deduplication (see issue #58) and others has been on the back-burner as improvements since at least 2009 !

2009 was when the Squashfs kernel code was mainlined by myself into the upstream kernel. One of the results of the mainlining was a commitment to keep the filesystem layout stable. This was an understandable request. because between 2002 and 2009 Squashfs had undergone rapid development, producing a whole set of incompatible versions (five layout changes in 24 releases over 6 years).

But never say never, and these improvements are still on my todo list. It all depends on amount of time, priorities and what else people want done. Squashfs is still self-funded and done as unpaid voluntary work, which limits the amount of work I can do.

@CAFxX
Copy link
Author

CAFxX commented Aug 13, 2022

Understood, although I am not sure if this (or at least a baseline version of this) would really require changes to the filesystem layout. Maybe it would be a bit hackish, but I can kinda imagine ways to implement this that would be no more disruptive than adding support for a new compression algorithm (e.g. define a new compressor ID for "zstd + preset dictionary", and store the dictionary somewhere known like in the compressor options1, in a special hidden file, or directly as a dedicated data block2), that as we know is something that has been done since mainlining.

Or are there plans for a future "5.0" version of the layout? (in which case this could be done properly, including support for per-block dictionaries...)

Footnotes

  1. although this would likely limit the dictionary size to <8000 bytes

  2. with its offset/length stored in the compression options

@Sonicadvance1
Copy link

ZSTD has posted some pretty big performance number improvements in their readme when using dictionaries.
https://github.com/facebook/zstd#the-case-for-small-data-compression
Huge decompression speed increases would be awesome when using ZSTD. 645MB/s going to 2772MB/s when using a dictionary, but also compression ratio and compression performance improvements.

For myself this would be interesting to see how much the compression ratio improves for compressing a read-only filesystem (Like a Linux rootfs), since the read-only performance delta between LZ4HC and ZSTD has prevented me from using ZSTD today.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

No branches or pull requests

3 participants