
Inaccurate Training Data File Size #400

Closed
dcwardell7 opened this issue Oct 3, 2016 · 7 comments

Comments

@dcwardell7

I am attempting to train a dictionary on a number of mp4 files. When I begin training, zstd reports the total size of these files as significantly lower than the actual size.

Here is my command and some output. The 10 files are actually ~26MB in size.

zstd --train -vvv data/* -o out/dictionary
*** zstd command line interface 64-bits v1.1.0, by Yann Collet ***
sorting 10 files of total size 1 MB ...                               
finding patterns ... 
minimum ratio : 4 

found 490 matches of length >= 7 at pos      12  
Selected ref at position 621832, of length 31 : saves 7641 (ratio: 246.48)  

found  10 matches of length >= 7 at pos      43  
Selected ref at position 393259, of length 63 : saves 540 (ratio: 8.57)  

found  10 matches of length >= 7 at pos     112  

found  30 matches of length >= 7 at pos     121  
Selected ref at position 69807, of length 17 : saves 168 (ratio: 9.88)  

...

found   4 matches of length >= 7 at pos 1299012  

found   4 matches of length >= 7 at pos 1304503  

found   4 matches of length >= 7 at pos 1304588  

 80 segments found, of total size 1771 
list 25 best segments 
  1: 91 bytes at pos   655575, savings   13451 bytes |........................................| 
  2: 31 bytes at pos   621832, savings    7641 bytes |............................!.T| 
  3: 63 bytes at pos   427700, savings    5370 bytes |.................@......................| 
  4: 63 bytes at pos   389781, savings    3462 bytes |.............@..........................| 
  5: 63 bytes at pos   189452, savings    2320 bytes |.................@......................| 
  6: 32 bytes at pos   342374, savings    2178 bytes |..............................!.| 
  7: 90 bytes at pos   127725, savings    2116 bytes |.@......................................| 
  8: 63 bytes at pos   384640, savings    1495 bytes |........B........@......................| 
  9: 33 bytes at pos   865135, savings    1066 bytes |..............................!.T| 
 10: 43 bytes at pos   655882, savings     584 bytes |........................................| 
 11: 63 bytes at pos   393259, savings     540 bytes |........................................| 
 12: 63 bytes at pos   156095, savings     390 bytes |.....................@..................| 
 13: 17 bytes at pos   162461, savings     339 bytes |.................| 
 14: 17 bytes at pos   751637, savings     325 bytes |T................| 
 15: 17 bytes at pos  1021895, savings     248 bytes |.................| 
 16: 17 bytes at pos  1133975, savings     234 bytes |_................| 
 17: 17 bytes at pos  1214643, savings     222 bytes |0p...............| 
 18: 17 bytes at pos   427934, savings     211 bytes |.................| 
 19: 57 bytes at pos   427869, savings     209 bytes |.....A..................................| 
 20: 17 bytes at pos    69807, savings     168 bytes |.F...............| 
 21: 17 bytes at pos   997075, savings     156 bytes |.................| 
 22: 17 bytes at pos   328272, savings     132 bytes |.................| 
 23: 17 bytes at pos   608203, savings     117 bytes |xz...............| 
 24: 51 bytes at pos   438492, savings      89 bytes |........................................| 
 25: 51 bytes at pos   161544, savings      60 bytes |.@......................................| 
!  warning : selected content significantly smaller than requested (1771 < 112640) 
statistics ...                                                        
HUF_writeCTable error 
dictionary training failed : Error (generic) 
@Cyan4973
Contributor

Cyan4973 commented Oct 4, 2016

A dictionary is only useful for small files, where "small" means a few kilobytes.
It's not that there are fewer savings: the savings are basically the same in absolute terms, so in relative terms they become less and less important, down to insignificant, on larger files.

In this sample set, the samples are very large (> 1 MB each).
The dictionary builder knows that a dictionary is only useful during the first few kilobytes.
So it only considers the beginning of each sample.

This is done automatically, because many sample sets contain a lot of small files with the occasional large one. There is no need to stop the training process just for a few outliers; in such cases, selecting just the beginning of large samples is a good enough approach.
Though maybe there should be some kind of warning message to explain this behavior when it's triggered.

Also: note that mp4 files are already compressed.
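
For illustration, here is a minimal sketch of the same training step through the library API, using ZDICT_trainFromBuffer() from zdict.h. It assumes the samples have already been loaded back to back into one contiguous buffer; the sample count and sizes below are hypothetical stand-ins, and as described above, only the beginning of each large sample effectively contributes:

#include <stdio.h>
#include <stdlib.h>
#include <zdict.h>

int main(void)
{
    /* Hypothetical training set: 10 samples laid out back to back
       in one buffer, with their individual sizes listed separately. */
    enum { NB_SAMPLES = 10, SAMPLE_SIZE = 64 * 1024 };
    size_t sampleSizes[NB_SAMPLES];
    for (int i = 0; i < NB_SAMPLES; i++) sampleSizes[i] = SAMPLE_SIZE;

    void* const samples = calloc(NB_SAMPLES, SAMPLE_SIZE);  /* stand-in for real file contents */

    size_t const dictCapacity = 110 * 1024;  /* matches the CLI's default target size */
    void* const dictBuffer = malloc(dictCapacity);

    size_t const dictSize = ZDICT_trainFromBuffer(dictBuffer, dictCapacity,
                                                  samples, sampleSizes, NB_SAMPLES);
    if (ZDICT_isError(dictSize)) {
        /* The failure reported in this issue surfaces here as "Error (generic)". */
        fprintf(stderr, "dictionary training failed: %s\n", ZDICT_getErrorName(dictSize));
    } else {
        printf("dictionary size: %zu bytes\n", dictSize);
    }

    free(samples);
    free(dictBuffer);
    return 0;
}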

@dcwardell7
Author

Great response. Thanks for the help. Would this also explain why the dictionary training failed?

@Cyan4973
Contributor

Cyan4973 commented Oct 4, 2016

I suspect it's a combination of the above and the fact that the source data is mostly incompressible. What ends up in the dictionary is very small, and what remains for statistics is basically noise.
When the noise is close to perfect, the statistics module fails.

This is tracked in #304.
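
As a side note, one quick way to check whether samples are effectively incompressible noise before training is to compress one sample on its own and look at the ratio. This is a hedged sketch, not part of the training API; the function name and compression level are illustrative:

#include <stdlib.h>
#include <zstd.h>

/* Compress one buffer and report the ratio. For already-compressed
   payloads such as mp4, the result is typically close to 1.0. */
static double compression_ratio(const void* src, size_t srcSize)
{
    size_t const dstCap = ZSTD_compressBound(srcSize);
    void* const dst = malloc(dstCap);
    size_t const cSize = ZSTD_compress(dst, dstCap, src, srcSize, 3);
    double const ratio = ZSTD_isError(cSize) ? 0.0
                       : (double)srcSize / (double)cSize;
    free(dst);
    return ratio;
}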

@dcwardell7
Author

Thanks again for the help.

@zhangfuwen

Hello @Cyan4973, could you explain why a dictionary is only useful for small files or for the beginning of large files?

I use zstd to compress the OpenGL command stream of games. I noticed that after I play a game for 10 minutes, the algorithm becomes faster and the compression ratio improves. It is quite apparent, so I think maybe I should train a large dictionary so that it compresses better from the beginning.

@Cyan4973
Contributor

There is probably some confusion about what "dictionary" means.

In the Zstandard context, a "dictionary" is a static asset used to bootstrap the compression process, leading to better initial performance. Hence the predominant impact on small files.

But streaming Zstandard also maintains a state, storing the last-seen data up to a maximum of windowSize bytes, which quickly comes to reflect the content of the data being compressed. This is the classical "sliding window" technique, which other sources sometimes also call a "dictionary". We never use that word here to refer to the sliding window, but the confusion is understandable.

The sliding window is likely the reason why your observed compression ratio increases over time.
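
A minimal sketch of the streaming usage being described, adapted freely from the streaming examples shipped with zstd (file handles and chunking are illustrative): a single ZSTD_CCtx is reused across chunks, so later chunks can reference history kept in the window built up from earlier ones.

#include <stdio.h>
#include <stdlib.h>
#include <zstd.h>

/* Feed a long stream chunk by chunk through one context.
   The context keeps up to windowSize bytes of history, so
   repetitions across chunks compress better over time. */
static void compress_stream(FILE* in, FILE* out)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    size_t const inCap  = ZSTD_CStreamInSize();
    size_t const outCap = ZSTD_CStreamOutSize();
    void* const inBuf  = malloc(inCap);
    void* const outBuf = malloc(outCap);

    for (;;) {
        size_t const readSz = fread(inBuf, 1, inCap, in);
        int const lastChunk = (readSz < inCap);
        ZSTD_EndDirective const mode = lastChunk ? ZSTD_e_end : ZSTD_e_continue;
        ZSTD_inBuffer input = { inBuf, readSz, 0 };
        int finished;
        do {
            ZSTD_outBuffer output = { outBuf, outCap, 0 };
            size_t const ret = ZSTD_compressStream2(cctx, &output, &input, mode);
            fwrite(outBuf, 1, output.pos, out);
            /* ZSTD_e_end returns 0 once the frame is fully flushed. */
            finished = lastChunk ? (ret == 0) : (input.pos == input.size);
        } while (!finished);
        if (lastChunk) break;
    }
    ZSTD_freeCCtx(cctx);
    free(inBuf);
    free(outBuf);
}

Resetting or freeing the context discards that history, which is why a fresh stream starts out with a worse ratio, and why a static dictionary only helps during the first few kilobytes.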

@eloff

eloff commented Apr 7, 2023

@Cyan4973 sorry to revive a very old thread; I'm just wondering if it's still true that dictionaries are only beneficial for small files? Maybe that could be made clearer in the documentation?
