Inaccurate Training Data File Size #400
I am attempting to train a dictionary on a number of mp4 files. When I begin training, zstd reports the total size of these files as significantly lower than the actual size.

Here is my command. The 10 files are actually ~26 MB in size.

```
zstd --train -vvv data/* -o out/dictionary
```
Dictionary is only useful for small files, where "small" means a few kilobytes. In this sample set, the samples are very large (> 1 MB each), so only the beginning of each sample is used for training. This truncation is done automatically: many sample sets contain a lot of small files with occasionally a few large ones, and there is no need to stop the training process just for a few outliers; selecting just the beginning of large samples is a good enough process. Also note that mp4 files are already compressed.
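For illustration, here is a minimal C sketch of the same idea using the public `ZDICT_trainFromBuffer()` API from `zdict.h`: each sample contributes only its first few tens of kilobytes, mirroring what the CLI does automatically for oversized samples. The 64 KB cap and the file-loading helper are assumptions for the sketch, not the CLI's actual internals.

```c
/* Minimal sketch (not the CLI's actual implementation) of training on
 * capped samples with the public ZDICT API. The 64 KB per-sample cap and
 * the helper below are assumptions for illustration. Link with -lzstd. */
#include <stdio.h>
#include <stdlib.h>
#include <zdict.h>

#define SAMPLE_CAP (64 * 1024)  /* assumed cap; the CLI picks its own limit */

/* read at most `cap` bytes from the start of `path` into `dst` */
static size_t load_file_head(const char* path, void* dst, size_t cap)
{
    FILE* const f = fopen(path, "rb");
    if (f == NULL) return 0;
    size_t const n = fread(dst, 1, cap, f);
    fclose(f);
    return n;
}

/* returns the dictionary size, or an error code (check ZDICT_isError) */
static size_t train_on_capped_samples(void* dictBuf, size_t dictCap,
                                      const char** paths, unsigned nbFiles)
{
    size_t* const sizes = malloc(nbFiles * sizeof(size_t));
    char* const samples = malloc((size_t)nbFiles * SAMPLE_CAP);
    size_t offset = 0;
    for (unsigned i = 0; i < nbFiles; i++) {
        /* oversized samples contribute only their beginning */
        sizes[i] = load_file_head(paths[i], samples + offset, SAMPLE_CAP);
        offset += sizes[i];
    }
    size_t const dictSize = ZDICT_trainFromBuffer(dictBuf, dictCap,
                                                  samples, sizes, nbFiles);
    free(samples);
    free(sizes);
    return dictSize;
}
```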
Great response. Thanks for the help. Would this also explain why the dictionary training failed?
I suspect it's a combination of the above and the fact that the source data is mostly non-compressible. What ends up in the dictionary is very small, and what remains for statistics is basically noise. This is tracked in #304.
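As a quick way to check this on your own data: compressing a sample once and looking at the ratio shows whether there is anything for the trainer to work with. A sketch, assuming the sample is already in memory:

```c
/* Minimal sketch: estimate whether sample data is compressible at all by
 * compressing it once and inspecting the ratio. Already-compressed inputs
 * such as mp4 typically come out near their original size, which leaves
 * little useful material for dictionary training. Link with -lzstd. */
#include <stdlib.h>
#include <zstd.h>

/* returns compressed/original size ratio, or -1.0 on error (~1.0 means
 * incompressible) */
static double compressed_fraction(const void* src, size_t srcSize)
{
    size_t const bound = ZSTD_compressBound(srcSize);
    void* const dst = malloc(bound);
    if (dst == NULL) return -1.0;
    size_t const cSize = ZSTD_compress(dst, bound, src, srcSize, 3);
    free(dst);
    if (ZSTD_isError(cSize)) return -1.0;
    return (double)cSize / (double)srcSize;
}
```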
Thanks again for the help.
Hello @Cyan4973, could you explain why a dictionary is only useful for small files or the beginning of large files? I use zstd to compress the OpenGL command stream of games. I noticed that after I play a game for 10 minutes, the algorithm becomes faster and the compression ratio becomes better; it is quite apparent. So I think maybe I should train a large dictionary so that it compresses faster from the beginning.
There is probably a confusion about what "dictionary" means. In the Zstandard context, a "dictionary" is a static asset, which is used to bootstrap the compression process, leading to better initial performance. Hence the predominant impact on small files. But streaming Zstandard also saves a state, storing last seen data up to a maximum window size. The sliding window is likely the reason why your observed compression ratio increases over time.
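A sketch of the streaming behavior described above, using the standard `ZSTD_compressStream2()` API (the chunked input is hypothetical): the same compression context is reused across chunks, so its state, including the sliding window, persists, and later chunks benefit from matches against earlier data:

```c
/* Minimal sketch showing that a streaming compression context keeps its
 * state (including the sliding window of recent data) across chunks, so
 * later chunks can reference matches from earlier ones. The chunk arrays
 * are assumed to be supplied by the caller. Link with -lzstd. */
#include <zstd.h>

/* returns total compressed bytes written, or a zstd error code */
static size_t compress_chunks(const void** chunks, const size_t* chunkSizes,
                              int nbChunks, void* dst, size_t dstCapacity)
{
    ZSTD_CCtx* const cctx = ZSTD_createCCtx();
    ZSTD_outBuffer out = { dst, dstCapacity, 0 };
    for (int i = 0; i < nbChunks; i++) {
        ZSTD_inBuffer in = { chunks[i], chunkSizes[i], 0 };
        /* flush intermediate chunks, end the frame on the last one;
         * the window carries over between calls either way */
        ZSTD_EndDirective const mode =
            (i == nbChunks - 1) ? ZSTD_e_end : ZSTD_e_flush;
        size_t remaining;
        do {
            remaining = ZSTD_compressStream2(cctx, &out, &in, mode);
            if (ZSTD_isError(remaining)) { ZSTD_freeCCtx(cctx); return remaining; }
        } while (remaining != 0);
    }
    ZSTD_freeCCtx(cctx);
    return out.pos;
}
```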
@Cyan4973 sorry to revive a very old thread; I'm just wondering if it's still true that dictionaries are only beneficial for small files? Maybe that could be made clearer in the documentation?