
Inaccurate Training Data File Size #400

Closed
dcwardell7 opened this issue Oct 3, 2016 · 7 comments

Comments

@dcwardell7

I am attempting to train a dictionary on a number of mp4 files. When I begin training, zstd reports the total size of these files as significantly lower than the actual size.

Here is my command and some output. The 10 files are actually ~26MB in size.

zstd --train -vvv data/* -o out/dictionary
*** zstd command line interface 64-bits v1.1.0, by Yann Collet ***
sorting 10 files of total size 1 MB ...                               
finding patterns ... 
minimum ratio : 4 

found 490 matches of length >= 7 at pos      12  
Selected ref at position 621832, of length 31 : saves 7641 (ratio: 246.48)  

found  10 matches of length >= 7 at pos      43  
Selected ref at position 393259, of length 63 : saves 540 (ratio: 8.57)  

found  10 matches of length >= 7 at pos     112  

found  30 matches of length >= 7 at pos     121  
Selected ref at position 69807, of length 17 : saves 168 (ratio: 9.88)  

...

found   4 matches of length >= 7 at pos 1299012  

found   4 matches of length >= 7 at pos 1304503  

found   4 matches of length >= 7 at pos 1304588  

 80 segments found, of total size 1771 
list 25 best segments 
  1: 91 bytes at pos   655575, savings   13451 bytes |........................................| 
  2: 31 bytes at pos   621832, savings    7641 bytes |............................!.T| 
  3: 63 bytes at pos   427700, savings    5370 bytes |.................@......................| 
  4: 63 bytes at pos   389781, savings    3462 bytes |.............@..........................| 
  5: 63 bytes at pos   189452, savings    2320 bytes |.................@......................| 
  6: 32 bytes at pos   342374, savings    2178 bytes |..............................!.| 
  7: 90 bytes at pos   127725, savings    2116 bytes |.@......................................| 
  8: 63 bytes at pos   384640, savings    1495 bytes |........B........@......................| 
  9: 33 bytes at pos   865135, savings    1066 bytes |..............................!.T| 
 10: 43 bytes at pos   655882, savings     584 bytes |........................................| 
 11: 63 bytes at pos   393259, savings     540 bytes |........................................| 
 12: 63 bytes at pos   156095, savings     390 bytes |.....................@..................| 
 13: 17 bytes at pos   162461, savings     339 bytes |.................| 
 14: 17 bytes at pos   751637, savings     325 bytes |T................| 
 15: 17 bytes at pos  1021895, savings     248 bytes |.................| 
 16: 17 bytes at pos  1133975, savings     234 bytes |_................| 
 17: 17 bytes at pos  1214643, savings     222 bytes |0p...............| 
 18: 17 bytes at pos   427934, savings     211 bytes |.................| 
 19: 57 bytes at pos   427869, savings     209 bytes |.....A..................................| 
 20: 17 bytes at pos    69807, savings     168 bytes |.F...............| 
 21: 17 bytes at pos   997075, savings     156 bytes |.................| 
 22: 17 bytes at pos   328272, savings     132 bytes |.................| 
 23: 17 bytes at pos   608203, savings     117 bytes |xz...............| 
 24: 51 bytes at pos   438492, savings      89 bytes |........................................| 
 25: 51 bytes at pos   161544, savings      60 bytes |.@......................................| 
!  warning : selected content significantly smaller than requested (1771 < 112640) 
statistics ...                                                        
HUF_writeCTable error 
dictionary training failed : Error (generic) 
@Cyan4973
Contributor

Cyan4973 commented Oct 4, 2016

A dictionary is only useful for small files, where "small" means a few kilobytes.
It's not that there are fewer savings: the savings are basically the same in absolute terms, so in relative terms they become less and less important, down to insignificant, on larger files.

In this sample set, the samples are very large (> 1 MB each).
The dictionary builder knows that a dictionary is only useful during the first few kilobytes.
So it only considers the beginning of each sample.

This is done automatically, because many sample sets contain a lot of small files with the occasional large one. There is no need to stop the training process just for a few outliers; in such cases, selecting just the beginning of large samples is a good enough approach.
Though maybe there should be some kind of warning message to explain this behavior when it's triggered.

Also: note that mp4 files are already compressed.
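
For illustration, here is a minimal sketch of the same training step through the library API, using ZDICT_trainFromBuffer() from zdict.h. It assumes the samples have already been loaded back to back into one contiguous buffer; the sample count and sizes below are hypothetical stand-ins, and as described above, only the beginning of each large sample effectively contributes:

#include <stdio.h>
#include <stdlib.h>
#include <zdict.h>

int main(void)
{
    /* Hypothetical training set: 10 samples laid out back to back
       in one buffer, with their individual sizes listed separately. */
    enum { NB_SAMPLES = 10, SAMPLE_SIZE = 64 * 1024 };
    size_t sampleSizes[NB_SAMPLES];
    for (int i = 0; i < NB_SAMPLES; i++) sampleSizes[i] = SAMPLE_SIZE;

    void* const samples = calloc(NB_SAMPLES, SAMPLE_SIZE);  /* stand-in for real file contents */

    size_t const dictCapacity = 110 * 1024;  /* matches the CLI's default target size */
    void* const dictBuffer = malloc(dictCapacity);

    size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                  samples, sampleSizes, NB_SAMPLES);
    if (ZDICT_isError(dictSize)) {
        /* The failure reported in this issue surfaces here as "Error (generic)". */
        fprintf(stderr, "dictionary training failed: %s\n", ZDICT_getErrorName(dictSize));
    } else {
        printf("dictionary size: %zu bytes\n", dictSize);
    }

    free(samples);
    free(dictBuffer);
    return 0;
}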

@dcwardell7
Author

Great response. Thanks for the help. Would this also explain why the dictionary training failed?

@Cyan4973
Contributor

Cyan4973 commented Oct 4, 2016

I suspect it's a combination of the above and the fact that the source data is mostly incompressible. What ends up in the dictionary is very small, and what remains for statistics is basically noise.
When the noise is close to perfect, the statistics module fails.

This is tracked in #304.
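
As a side note, one quick way to check whether samples are effectively incompressible noise before training is to compress one sample on its own and look at the ratio. This is a hedged sketch, not part of the training API; the function name and compression level are illustrative:

#include <stdlib.h>
#include <zstd.h>

/* Compress one buffer and report the ratio. For already-compressed
   payloads such as mp4, the result is typically close to 1.0. */
static double compression_ratio(const void* src, size_t srcSize)
{
    size_t const dstCap = ZSTD_compressBound(srcSize);
    void* const dst = malloc(dstCap);
    size_t const cSize = ZSTD_compress(dst, dstCap, src, srcSize, 3);
    double const ratio = ZSTD_isError(cSize) ? 0.0
                       : (double)srcSize / (double)cSize;
    free(dst);
    return ratio;
}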

@dcwardell7
Author

Thanks again for the help.

@zhangfuwen

Hello @Cyan4973, could you explain why a dictionary is only useful for small files or for the beginning of large files?

I use zstd to compress the OpenGL command stream of games. I noticed that after I play a game for 10 minutes, the algorithm becomes faster and the compression ratio improves. It is quite apparent, so I think maybe I should train a large dictionary so that it compresses better from the beginning.

@Cyan4973
Contributor

There is probably some confusion about what "dictionary" means.

In the Zstandard context, a "dictionary" is a static asset used to bootstrap the compression process, leading to better initial performance. Hence the predominant impact on small files.

But streaming Zstandard also maintains a state, storing the last-seen data up to a maximum of windowSize bytes, which quickly comes to reflect the content of the data being compressed. This is the classical "sliding window" technique, which other sources sometimes also call a "dictionary". We never use that word here to refer to the sliding window, but the confusion is understandable.

The sliding window is likely the reason why your observed compression ratio increases over time.
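
A minimal sketch of the streaming usage being described, adapted freely from the streaming examples shipped with zstd (file handles and chunking are illustrative): a single ZSTD_CCtx is reused across chunks, so later chunks can reference history kept in the window built up from earlier ones.

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Feed a long stream chunk by chunk through one context.
   The context keeps up to windowSize bytes of history, so
   repetitions across chunks compress better over time. */
static void compress_stream(FILE* in, FILE* out)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t const inCap  = ZSTD_CStreamInSize();
    size_t const outCap = ZSTD_CStreamOutSize();
    void* const inBuf  = malloc(inCap);
    void* const outBuf = malloc(outCap);

    for (;;) {
        size_t const readSz = fread(inBuf, 1, inCap, in);
        int const lastChunk = (readSz < inCap);
        ZSTD_EndDirective const mode = lastChunk ? ZSTD_e_end : ZSTD_e_continue;
        ZSTD_inBuffer input = { inBuf, readSz, 0 };
        int finished;
        do {
            ZSTD_outBuffer output = { outBuf, outCap, 0 };
            size_t const ret = ZSTD_compressStream2(cctx, &output, &input, mode);
            fwrite(outBuf, 1, output.pos, out);
            /* ZSTD_e_end returns 0 once the frame is fully flushed. */
            finished = lastChunk ? (ret == 0) : (input.pos == input.size);
        } while (!finished);
        if (lastChunk) break;
    }
    ZSTD_freeCCtx(cctx);
    free(inBuf);
    free(outBuf);
}

Resetting or freeing the context discards that history, which is why a fresh stream starts out with a worse ratio, and why a static dictionary only helps during the first few kilobytes.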

@eloff

eloff commented Apr 7, 2023

@Cyan4973 sorry to revive a very old thread; I'm just wondering if it's still true that dictionaries are only beneficial for small files? Maybe that could be made clearer in the documentation?
