erratic compression rate #4236

Open
Paladynee opened this issue Jan 12, 2025 · 3 comments

Paladynee commented Jan 12, 2025

Describe the bug
For this specific file, tm29.zip (zipped, deflate), the compression rate, compression speed and decompression speed all change erratically across levels 1-22.

To Reproduce
Steps to reproduce the behavior:

  1. Download the tm29.zip file.
  2. Unzip the tm29.zip file for compression with zstd.
  3. Run zstd on the output file tm29: zstd tm29 -f -b1 -e22 -i5
  4. Observe statistics

Expected behavior
Compression rate, compression speed and decompression speed should've changed (approximately) monotonically with the level used (-1 through -22).

Observed behavior

C:\Users\eurydice\Desktop\test>zstd tm29 -f -b1 -e22 -i5
 1#tm29              : 268435456 ->     35034 (x7662.14), 10680.6 MB/s, 10083.5 MB/s
 2#tm29              : 268435456 ->     35032 (x7662.58), 11306.5 MB/s, 10025.0 MB/s
 3#tm29              : 268435456 ->     35032 (x7662.58), 10236.6 MB/s, 10042.0 MB/s
 4#tm29              : 268435456 ->     35032 (x7662.58), 10313.3 MB/s, 9960.2 MB/s
 5#tm29              : 268435456 ->   3548566 (x75.65),  393.0 MB/s, 4591.9 MB/s
 6#tm29              : 268435456 ->   3541741 (x75.79),  367.9 MB/s, 4833.4 MB/s
 7#tm29              : 268435456 ->   4083117 (x65.74),  347.3 MB/s, 6353.3 MB/s
 8#tm29              : 268435456 ->   4083117 (x65.74),  320.5 MB/s, 6299.9 MB/s
 9#tm29              : 268435456 ->   4083117 (x65.74),  318.9 MB/s, 6253.8 MB/s
10#tm29              : 268435456 ->   2505849 (x107.12),  524.1 MB/s, 8443.7 MB/s
11#tm29              : 268435456 ->     26535 (x10116.28), 3927.8 MB/s, 10093.3 MB/s
12#tm29              : 268435456 ->     26535 (x10116.28), 3947.6 MB/s, 9964.2 MB/s
13#tm29              : 268435456 ->   2823617 (x95.07),  237.6 MB/s, 7313.2 MB/s
14#tm29              : 268435456 ->   1797563 (x149.33),  192.3 MB/s, 8575.8 MB/s
15#tm29              : 268435456 ->   1078705 (x248.85),  115.6 MB/s, 9238.2 MB/s
16#tm29              : 268435456 ->   1583171 (x169.56),   55.5 MB/s, 8611.2 MB/s
17#tm29              : 268435456 ->    925769 (x289.96),   55.2 MB/s, 8898.8 MB/s
18#tm29              : 268435456 ->    927815 (x289.32),   55.1 MB/s, 8936.6 MB/s
19#tm29              : 268435456 ->    290902 (x922.77),   62.3 MB/s, 10052.7 MB/s
20#tm29              : 268435456 ->    290902 (x922.77),   62.0 MB/s, 10097.7 MB/s
21#tm29              : 268435456 ->     92431 (x2904.17),   42.0 MB/s, 10204.0 MB/s
22#tm29              : 268435456 ->    228787 (x1173.30),   66.9 MB/s, 10252.8 MB/s

Compression rate: 5 (huge regression), 11 (huge improvement), 13 (regression), 21 (improvement)
Compression speed: 5 (huge regression), 11 (improvement), 13 (regression)
Decompression speed: 5 (regression), 10 (improvement), 11 (improvement)
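
The size regressions can also be picked out mechanically. The following is a minimal sketch, not part of zstd: the helper name size_regressions and the regular expression are illustrative assumptions based on the layout of the benchmark lines above, and the sample is just four of those lines.

```python
import re

# Four lines copied from the benchmark output above (levels 4 to 7).
SAMPLE = """\
 4#tm29              : 268435456 ->     35032 (x7662.58), 10313.3 MB/s, 9960.2 MB/s
 5#tm29              : 268435456 ->   3548566 (x75.65),  393.0 MB/s, 4591.9 MB/s
 6#tm29              : 268435456 ->   3541741 (x75.79),  367.9 MB/s, 4833.4 MB/s
 7#tm29              : 268435456 ->   4083117 (x65.74),  347.3 MB/s, 6353.3 MB/s
"""

# Captures: level, original size, compressed size.
LINE_RE = re.compile(r"(\d+)#\S+\s*:\s*(\d+)\s*->\s*(\d+)")


def size_regressions(text):
    """Return the levels whose compressed size is larger than the previous level's."""
    sizes = [(int(m.group(1)), int(m.group(3))) for m in LINE_RE.finditer(text)]
    return [lvl for (_, prev), (lvl, cur) in zip(sizes, sizes[1:]) if cur > prev]


print(size_regressions(SAMPLE))  # [5, 7]
```

Because it compares each level only with the one directly before it, it also reports smaller steps such as level 7, not just the large jumps listed above.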

Desktop (please complete the following information):

  • OS: Windows 10 x64
  • Version: 21H2
  • Compiler: none (prebuilt binary from releases)
  • Other relevant hardware specs: AMD Ryzen 7 4800H (8 cores, 16 threads)

Zstd version:

C:\Users\eurydice>zstd --version
*** Zstandard CLI (64-bit) v1.5.6, by Yann Collet ***

Relevant information:
The given file tm29 is a compressor benchmark file with hugely repetitive ab patterns; it does not resemble real data.

@Cyan4973
Contributor

This file is synthetic and represents a scenario that is implausible in real-world data.

This is evident from the compression ratio, which can reach an extraordinary x10000, far beyond the expected range.

It is not surprising that behavior becomes unpredictable in these conditions.

The file itself consists largely of the characters a and b.

If the content were random, each byte would carry only 1 bit of information (a or b), so we would expect a ratio of about x8. However, the significantly better ratio suggests that these seemingly nonsensical sequences are actually repeated, with large segments essentially copied and pasted.
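
As a quick sanity check of the x8 figure, here is a small sketch that computes the entropy bound for a stream of independent bytes drawn from only two symbols. The function name ideal_ratio is hypothetical, and the independence assumption is exactly what the file violates.

```python
import math


def ideal_ratio(p_a: float) -> float:
    """Best achievable ratio for independent bytes that are 'a' with
    probability p_a and 'b' otherwise (requires 0 < p_a < 1)."""
    probs = (p_a, 1.0 - p_a)
    # Shannon entropy in bits per byte-sized symbol.
    entropy_bits = -sum(p * math.log2(p) for p in probs)
    # Each symbol is stored in 8 bits but carries only entropy_bits of information.
    return 8.0 / entropy_bits


print(ideal_ratio(0.5))  # 8.0
```

Ratios in the thousands are therefore only possible if the data has structure beyond its two-letter alphabet, such as long repeated segments.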

With so many matches available, searching becomes very challenging, and finding the "best match" becomes a prohibitively costly task. This situation favors different approaches based on probabilistic methods (i.e., random chance), which are more commonly used at lower compression levels. Given that many expectations are defied in this file, it is not surprising that the match-finding algorithms are pushed to their limits, resulting in a scenario where chance plays a disproportionately large role.
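
To give a feel for how many match candidates such data produces, the sketch below indexes every 4-byte window of a synthetic a/b buffer. This is only an illustration under stated assumptions: it is not zstd's match finder, the buffer is generated here rather than taken from tm29, and candidate_chains and the 4-byte window are hypothetical choices.

```python
import random
from collections import defaultdict


def candidate_chains(data: bytes, min_match: int = 4) -> dict:
    """Map each min_match-byte window to the list of positions where it occurs."""
    chains = defaultdict(list)
    for pos in range(len(data) - min_match + 1):
        chains[data[pos:pos + min_match]].append(pos)
    return chains


# Synthetic stand-in: a random 256-byte a/b segment repeated 64 times.
rng = random.Random(0)
segment = bytes(rng.choice(b"ab") for _ in range(256))
data = segment * 64

chains = candidate_chains(data)
longest = max(len(positions) for positions in chains.values())
print(len(chains), "distinct windows; longest candidate list:", longest)
```

With a two-letter alphabet there are at most 16 distinct 4-byte windows, so at least one window has on the order of a thousand earlier positions to consider in this 16 KiB buffer. A search that examines every candidate pays for all of them, while one that keeps only a single recent position per window depends on which position happens to be stored.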

Cyan4973 self-assigned this Jan 13, 2025
@Paladynee
Author

I understand, thank you for the detailed response and explanation.

@tansy

tansy commented Jan 14, 2025

One can always find data like that.

 1060424 required_block_states.nbt
   93043 required_block_states.nbt.zst-02
   86999 required_block_states.nbt.zst-01
   80818 required_block_states.nbt.zst-03
   80814 required_block_states.nbt.zst-04
   67911 required_block_states.nbt.zst-05
   60318 required_block_states.nbt.zst-06
   55341 required_block_states.nbt.zst-07
   54478 required_block_states.nbt.zst-08
   54452 required_block_states.nbt.zst-09
   50664 required_block_states.nbt.zst-10
   47866 required_block_states.nbt.zst-12
   47866 required_block_states.nbt.zst-11
   47171 required_block_states.nbt.zst-13
   45346 required_block_states.nbt.zst-18
   45173 required_block_states.nbt.zst-14
   44637 required_block_states.nbt.zst-17
   44404 required_block_states.nbt.zst-16
   44397 required_block_states.nbt.zst-15
   40761 required_block_states.nbt.zst-19


 1008188 e1m7.dem
  229306 e1m7.dem.zst-01
  228970 e1m7.dem.zst-02
  209440 e1m7.dem.zst-04
  208572 e1m7.dem.zst-03
  207954 e1m7.dem.zst-05
  206875 e1m7.dem.zst-07
  206845 e1m7.dem.zst-06
  206112 e1m7.dem.zst-15
  206098 e1m7.dem.zst-12
  206098 e1m7.dem.zst-11
  206090 e1m7.dem.zst-14
  205949 e1m7.dem.zst-13
  205651 e1m7.dem.zst-10
  205615 e1m7.dem.zst-09
  205600 e1m7.dem.zst-08
  203339 e1m7.dem.zst-16
  202717 e1m7.dem.zst-17
  201127 e1m7.dem.zst-18
  200530 e1m7.dem.zst-19

Maybe not that extreme, but I didn't make them artificially mid/long-term redundant, which would make them 'behave' even more 'erratically'.

They also have something in common: they are not truly real-world data, only game data.

The closest thing that looks similar is a DNA sequence, and although it shows 'anomalies', they are nowhere near that extreme.

10652155 DNA_ScPo
 3241541 DNA_ScPo.zst-01
 3224028 DNA_ScPo.zst-02
 3203281 DNA_ScPo.zst-03
 3165842 DNA_ScPo.zst-04
 3117768 DNA_ScPo.zst-05
 3109233 DNA_ScPo.zst-06
 3078787 DNA_ScPo.zst-07
 3078456 DNA_ScPo.zst-08
 3077464 DNA_ScPo.zst-09
 3050357 DNA_ScPo.zst-13
 3048055 DNA_ScPo.zst-10
 3022373 DNA_ScPo.zst-11
 3021637 DNA_ScPo.zst-12
 2926669 DNA_ScPo.zst-14
 2887919 DNA_ScPo.zst-15
 2763037 DNA_ScPo.zst-17
 2758631 DNA_ScPo.zst-16
 2624450 DNA_ScPo.zst-19
 2624307 DNA_ScPo.zst-18

Your case is extreme, as explained by @Cyan4973.
