Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Compression ratio example #2765

Closed
aleksazr opened this issue Sep 2, 2021 · 12 comments · Fixed by #2771
Closed

Compression ratio example #2765

aleksazr opened this issue Sep 2, 2021 · 12 comments · Fixed by #2771

Comments

@aleksazr
Copy link

aleksazr commented Sep 2, 2021

Hello,

first time working with zstd, I'm testing it and found a little problem.

I have 3 files: "part1.bin", "part2.bin" and a combined "bothparts.bin".

part1 is compressed to 500 bytes, part2 to 30 bytes, so I expected
the combined file to be compressed to around 530, but it is 1k bytes.

Don't know if you guys are aware of this... the files are here:
http://users.beotel.net/~gwh/zstd/zstd.zip

(both rar and zip are good with these example files)

@Cyan4973
Copy link
Contributor

Cyan4973 commented Sep 2, 2021

part1.bin is strange.

part2.bin can be pretty reliably compressed into 25/26 bytes.
But part1.bin swings widely between ~540 bytes and ~1050 bytes,
with each compression level being in one category or the other,
depending on some random effect, which seems only partially related to compression level.

For example, if you were to compress part1.bin (alone) with level 15, it would result in 1042 bytes.
But at level 16, it's only 513 bytes.
A lower level 12 succeeds at providing 517 bytes, while level 13 is 1042 bytes too.

Same thing for bothparts.bin, depending on luck, it's one category or the other.
At level 12, bothparts.bin compresses to 536 bytes.
But level 13 is 1057 bytes.

It's not clear why this swings exist.
This makes part1.bin a bit special.
Probably worth looking into.

@Cyan4973
Copy link
Contributor

Cyan4973 commented Sep 3, 2021

A few early results, analyzing part1.bin :

level 12 is able to compress a decent portion of part1.bin as a bunch of 3-bytes matches (the smallest possible), at always the same distance (repcode 1).

This explains why level 11 and before are unable to compress this file : they are looking for 4+ bytes matches, and there are none in this file (which suggests a bunch of 32-bit fields which are non-repetitive). As consequence, only the last huffman stage is employed.

Now, level 13-15 also fail to compress this file, even though they are also looking for matches of size 3.
Levels 13-15 employ strategy btultra, while level 12 employs btopt.
This suggests a slight variation in cost estimation, which ends up, in this case, favoring level 12.
The different cost estimation of btultra makes it select a "different path", and it misses the succession of small 3-bytes matches, which are discarded in favor of something "more promising".

Example of cost estimation for level 13 :
better price (3.48<=4.65) using literal
vs level 12 :
rPos:4 : literal would cost more (3.75>1.90) (hist:4,1,8)

So an initial rounding error in cost estimation makes these algorithms select vastly different paths, with one ultimately proving way better than the other. It's obviously rare, in most cases, 2 differents paths are not that different in term of cost, and whatever difference exist, they sort of even out, or there is a sort of broad tendency that tends to favor a set of choices over the other one (aka btultra is generally expected to do better than btopt without an absolutely clear guarantee).

This is the classical local-vs-global optimization challenge, and is a known side-effect of these path-searching algorithms.
It's fairly difficult to improve, though, obviously, that's where progresses are...

@aleksazr
Copy link
Author

aleksazr commented Sep 3, 2021

Now I have an example where bothparts can't be compressed to ~500, even though part1 can.
http://users.beotel.net/~gwh/zstd/zstd.zip (subfolder v2)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants
@Cyan4973 @aleksazr and others