Keep hitting memory limits on large training set with samples > 1MB. What's the strategy? #3111
Comments
There are several points in this post; some are misunderstandings, and some are surprising behaviors that should probably be fixed or improved on our side.
This is a hard limit, unfortunately. Moreover, it's expected to bring little gain: when presented with more samples than the training can ingest, the only useful thing we can do is randomize which samples are selected to be part of the training set, in order to improve the representativity of the dictionary by avoiding sample clustering (by date, by name, etc.). Also note that the memory cost of training is typically a multiple of the training set size. It depends on the exact training mode, but it can go as high as 10x the training set size, so a 2 GB training set would need ~20 GB of RAM, which is still a lot by today's standards.
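To illustrate that point, here is a rough sketch of doing the randomization up front on the user side, so that the subset handed to `zstd --train` already fits the ~2 GB loading budget. It assumes GNU coreutils; the paths, the file count, and the ~5 MB average sample size are placeholders taken from the numbers discussed in this thread.

```sh
# Sketch only: randomly pre-select ~400 samples (~2 GB at ~5 MB each)
# so the training invocation stays within the default loading limit.
# Assumes GNU coreutils and filenames without spaces or newlines.
mkdir -p /data/json/subset
find /data/json -maxdepth 1 -name '*.json' | shuf -n 400 | \
    xargs -I{} cp {} /data/json/subset/
```

Pre-selecting at random also serves the representativity point above, since it avoids clustering by date or by name.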
Because a dictionary is only effective at the beginning of a file, we intentionally limit the useful range of a training sample to its first 128 KB. As source size grows, the relative efficiency of a dictionary decreases: it's not useless, but the relative compression ratio benefit becomes much smaller. So it's generally not worth it to employ dictionaries for large files. Even if there is a compression ratio benefit, it is probably tiny, and it's therefore questionable whether it's worth the dependency cost of a dictionary. Anyway, the side effect of this strategy is that, if one provides many large samples as part of the training set, each of them only counts for 128 KB, even when it is many megabytes large.
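A possible user-side consequence of that 128 KB cap (a sketch of the idea, not something the documentation prescribes): pre-truncating large samples lets many more distinct files fit into the same loading budget, since the trainer would ignore the tail of each large file anyway. Paths below are placeholders.

```sh
# Sketch only: keep just the first 128 KiB (131072 bytes) of each sample,
# since training only uses the beginning of a file. Paths are placeholders.
mkdir -p /data/json/truncated
for f in /data/json/*.json; do
    head -c 131072 "$f" > "/data/json/truncated/$(basename "$f")"
done
```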
This leads us to the strange part.
That behavior is not normal, so there are likely some problems here that would deserve some investigation.
Follow-up: I notice the […]. This would match with the following error message: […], so I think the issue in this case is the […].
Thanks a ton, Ian. I'm reading and seeing certain things that you have definitely mentioned elsewhere; sorry for getting you to write those again. I think certain things just don't click with me, e.g. "dictionary is only effective at the beginning of a file", because I don't have the intuition for how this works mechanically. To this end I actually tried to find any sort of publication for zstd but couldn't, except for the website. Is there one out? I feel like there is a great deal of customization that zstd enables, especially for manually tunable params like […]. The resources I found myself mainly referring to were https://www.mankier.com/1/zstd and http://facebook.github.io/zstd/zstd_manual.html. I found myself a little frustrated with the lack of a unit of measure for certain params in the first one (whether it's a bit or a byte or something else). Thanks a ton for the explanation above!
In most (if not all?) cases, values are expressed in bytes by default: https://www.mankier.com/1/zstd#Options-Integer_suffixes_and_special_values
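For example, per that section the default `--maxdict` value can be written either as a raw byte count or with a binary suffix (illustrative commands; the sample paths and dictionary name are placeholders):

```sh
# Both forms request the same dictionary size: 110 * 1024 = 112640 bytes.
zstd --train samples/*.json -o my.dict --maxdict=112640
zstd --train samples/*.json -o my.dict --maxdict=110KiB
```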
Following questions from #3111. Note: only the source markdown has been updated; the actual man page zstd.1 still needs to be processed.
Note that the Rust code sample has a few probable issues: […]
Seems to be answered, closing. Please create a new issue if you have further questions.
Opening an issue because I have exhausted the docs and am still not sure how to go about this.
Context
I'm trying to train a dictionary on a relatively large dataset and I keep bumping into the memory limit. The training set consists of ~100000 .json files, each with a similar schema and between 2 and 8 MB in size, amounting to ~300 GB of training data in total. I have successfully trained a ~10 MB dictionary on 25 GB of such data before, and got decent compression with it, without encountering the memory limit.
In the current case, without touching any parameters other than the dictionary size, I run into the following errors: […]
I think, short of doing something nonsensical, my problem is that I can't quite get the combination of memory-limit knobs right. To my understanding, and according to the docs, these are (a combined invocation is sketched after this list):

- `-M#, --memory=#`: flag used to limit memory for dictionary training. I definitely want to increase that from the default 2 GB (right?).
- `--maxdict=#`: flag to limit the dictionary to the specified size (default: 112640).
- `-B#`: flag to split input files into blocks of size # (default: no split).
- `--size-hint`: […]
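For reference, a hypothetical combination of these knobs in a single training run; the paths, dictionary size, and memory value are placeholders rather than recommendations, and the `-M` value is deliberately kept below the 32-bit ceiling described under "Issue" below:

```sh
# Hypothetical sketch: train on a pre-selected subset, raising both the
# sample-loading limit (-M) and the target dictionary size (--maxdict).
zstd --train /data/json/subset/*.json -o large.dict \
     --maxdict=10MiB -M3000MiB -vvv
```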
Issue
My issues are:

- With `-M5000000000`, which should be roughly ~5 GB, I get `error: numeric value overflows 32-bit unsigned int`. I don't suppose this is related to the build: `zstd --version` yields `*** zstd command line interface 64-bits v1.5.2, by Yann Collet ***`.
- The `-B` flag: it's a little unsatisfying that the only error log there is, even in `-vvv` mode, is `Src size is incorrect`. It's an external package, but incidentally, I kept seeing this error line when using the `zstd::dict` Rust crate separately from the CLI and was confused for a while, when calling […], hence for example: […]
I do realize that Ian explains in this issue that files over ~1 MB are considered large and that the returns probably diminish relative to the size.

My question is whether there is an inherent limit on how large my training set (and, correspondingly, the dictionary) can be, and how to properly feed it into memory given that my samples are on average larger than 1 MB. Again, it's roughly 300 GB worth of JSON files, ~5 MB each, though I'd like to understand this for the general case. Should I chunk them manually, provide `--size-hint`, or do something else entirely? Thanks a lot in advance.
I'm running a fresh Ubuntu and built `zstd` with `make install`: […]