Keep hitting memory limits on large training set with samples > 1MB. What's the strategy? #3111

Closed
rtviii opened this issue Apr 13, 2022 · 6 comments


rtviii commented Apr 13, 2022

Opening an issue because I have exhausted the docs and am still not sure how to go about this.

Context

I'm trying to train a dictionary on a relatively large dataset and I keep bumping into the memory limit. The training set consists of ~100000 .json files, each with a similar schema and between 2 and 8 MB in size, amounting to ~300 GB of training data in total. I have previously trained a ~10 MB dictionary on 25 GB of such data and got decent compression with it, without hitting the memory limit.

In the current case I run into the following errors, without changing any parameters other than the dictionary size:

zstd --train -r /mnt/volume_sfo3_01/messages/training -o dict25MB.zstd --maxdict=200000000 -1

! Warning : some sample(s) are very large
! Note that dictionary is only useful for small samples.
! As a consequence, only the first 131072 bytes of each sample are loaded
Training samples set too large (14342 MB); training on 2048 MB only...

I think, short of doing something nonsensical, my problem is that I can't quite get the combination of memory-limit knobs right. To my understanding, and according to the docs, these are (an example combination follows the list):

  • -M#, --memory=# limits memory for dictionary training. I definitely want to increase that from the default 2 GB (right?).
  • --maxdict=# limits the dictionary to the specified size (default: 112640).
  • -B# splits input files into blocks of size # (default: no split).
  • --size-hint=#
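
Concretely, the kind of invocation I have been aiming for looks roughly like this (a ~200 MB dictionary with ~5 GB of training memory; the values are just the ones I've been experimenting with):

zstd --train -r /mnt/volume_sfo3_01/messages/training -o dict.zstd --maxdict=200000000 -M5000000000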

Issue

My issues are:

  1. When I try to add -M5000000000, which should be roughly ~5 GB, I get the error numeric value overflows 32-bit unsigned int. I don't suppose this is related to the build: zstd --version yields *** zstd command line interface 64-bits v1.5.2, by Yann Collet ***
  2. Given that I can only get the memory limit up to 2048 MB, that severely limits the size of my dictionary.
  3. I'm not sure how to use the -B flag.

zstd --train -r /mnt/volume_sfo3_01/messages/training -vvv -o dict.zstd --maxdict=200000000 -1 -B80000 -M500000 -vvv
Shuffling input files
Found training data 115002 files, 298066576 KB, 3872950 samples
! Warning : setting manual memory limit for dictionary training data at 0 MB
Training samples set too large (291080 MB); training on 0 MB only...
Loaded 468 KB total training data, 6 nb samples
Trying 5 different sets of parameters
d=8
Total number of training samples is 4 and is invalid
Failed to initialize context
dictionary training failed : Src size is incorrect

It's a little unsatisfying that the only error message, even in -vvv mode, is Src size is incorrect.
It's an external package, but incidentally, I kept seeing this error line when using the zstd::dict Rust crate separately from the CLI and was confused for a while, for example when calling:

    // ...
    let in_path = Path::new("~/training_files");
    let zdict     = zstd::dict::from_files([in_path], 1024*1024*10)?;  

I do realize that Ian explains in this issue that files over ~1 MB are considered large and that the returns probably diminish with sample size.

My question is whether there is an inherent limit on how large my training set (and, correspondingly, the dictionary) can be, and how to properly feed it into memory given that my samples are on average larger than 1 MB. Again, it's roughly 300 GB worth of JSON files of ~5 MB each, though I'd like to understand the general case too.

Should I chunk the files manually, provide --size-hint, or do something else entirely?

Thanks a lot in advance.


I'm running a fresh Ubuntu and built zstd with make install:

Linux version 5.4.0-97-generic (buildd@lcy02-amd64-032) 
(gcc version 9.3.0 (Ubuntu 9.3.0-17ubuntu1~20.04)) 
*** zstd command line interface 64-bits v1.5.2, by Yann Collet ***
@Cyan4973
Copy link
Contributor

Cyan4973 commented Apr 13, 2022

There are several points in this post; some are misunderstandings, and some are surprising behaviors that should probably be fixed or improved on our side.

  1. Dictionary training size is limited to 2 GB

This is a hard limit, unfortunately.
It is implementation-related.
Going past this limit would require rewriting some portions of the training algorithms, so that's a large effort.

Moreover, it's expected to bring little gain,
because the primary usage of dictionary compression is to compress small data,
aka the ~KB range,
in which case, if there are common elements among these samples, they should appear pretty quickly,
so we don't expect noticeable improvements beyond a few dozen MB of training samples.

When presented with more samples than the training can ingest, the only useful thing we can do is randomize which samples are selected to be part of the training set, in order to improve the representativeness of the dictionary by avoiding sample clustering (by date, by name, etc.).
As far as I know, this capability is already in place.

Also note that the memory cost of training is typically a multiple of the training set. It depends on the exact training mode, but it can go as high as 10x training set size. So a 2 GB training set would need ~20 GB of RAM, which is still a lot by today's standards.

  2. Training samples are limited to 128 KB

Because a dictionary is only effective at the beginning of a file, we intentionally limit the useful range of a training sample to its first 128 KB.
This is useful for situations where most samples are small but a few of them happen to be large, if not very large. It avoids a situation where a single large sample overwhelms all the small ones, resulting in a dictionary that is not representative of small data, where it matters most.
This limit is hard-coded, so there's no way to change it. Maybe we could add a knob to control it, assuming we find some use cases where this capability brings a measurable benefit.

As source size grows, the relative efficiency of a dictionary decreases. It's not useless, but the relative compression ratio benefit becomes much smaller. So it's generally not worth it to employ dictionaries for large files. Even if there is a compression ratio benefit, it's probably tiny, and therefore it's questionable whether it's worth the cost of depending on a dictionary.
(Note: in some cases, the dictionary dependency is just part of an automated chain, so it's always there anyway and isn't really a "cost".)

Anyway, the side effect of this strategy is that, if one provides many large samples as part of the training set, each of them contributes only its first 128 KB, even when it is many megabytes large.
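
As an aside, if you want more of each large file to participate, the -B# flag from your list should help here: in --train mode it cuts each input file into blocks of the requested size, and each block should then count as its own sample. A rough sketch, reusing your path and a block size matching the 128 KB cap:

zstd --train -r /mnt/volume_sfo3_01/messages/training -o dict.zstd -B131072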

  3. Limited number of samples

This leads us to the strange part.
If I read your report correctly,
in your scenario you provide a lot of large samples,
but the training algorithm ends up keeping only a very small portion of them,
and then complains about not having enough data to train.

Found training data 115002 files, 298066576 KB, 3872950 samples
! Warning : setting manual memory limit for dictionary training data at 0 MB
Training samples set too large (291080 MB); training on 0 MB only...
Loaded 468 KB total training data, 6 nb samples
(...)
Total number of training samples is 4 and is invalid

That behavior is not normal.
Presuming all files are very large, and presuming your system has enough available memory,
the expectation is that it should have accepted 128 KB per training sample,
resulting in an ability to ingest > 10K samples.
So this limit of 6 samples is weird.
A potential culprit is the manual memory limit for dictionary training data at 0 MB, which is very strange and clearly not expected (unless your system's memory is limited, which I assume is not the case).

So there are likely some problems here that deserve investigation.
The challenge will be to reproduce the same issue in order to observe it, and then find a fix for it.

@Cyan4973

Follow-up: in the command:

zstd --train -r /mnt/volume_sfo3_01/messages/training -vvv -o dict.zstd --maxdict=200000000 -1 -B80000 -M500000 -vvv

I notice the -M500000,
which I read as "memory usage limited to ~500 KB".

This would match with the following error message :

Training samples set too large (291080 MB); training on 0 MB only...
Loaded 468 KB total training data, 6 nb samples

so I think the issue in this case is the -M500000.
Remove it.
I don't think that's what you want, since it decreases the number of selected samples, and it seems you would prefer to maximize that number instead.
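
In other words, your command with the -M option dropped:

zstd --train -r /mnt/volume_sfo3_01/messages/training -vvv -o dict.zstd --maxdict=200000000 -1 -B80000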


rtviii commented Apr 13, 2022

Thanks a ton, Ian. Reading this, I'm seeing certain things that you have definitely explained elsewhere -- sorry for making you write them again.

I think certain things just don't click with me, e.g. "dictionary is only effective at the beginning of a file", because I don't have the intuition for how this works mechanically. To this end I actually tried to find any sort of publication on zstd but couldn't, except for the website. Is there one out there? I feel like there is a great deal of customization that zstd enables, especially via manually tunable parameters like k, d, searchLength and many others, but I couldn't find explanations of any reasonable depth on those. Heuristics fitted to a given structure inherent in the data being compressed would be an amazing addition to the resources out there, I feel. I realize that's asking for a lot; mostly I'm wondering if I missed it.

The resources I found myself mainly referring to were https://www.mankier.com/1/zstd and http://facebook.github.io/zstd/zstd_manual.html. I found myself a little frustrated with the lack of a unit of measure for certain parameters in the first one (whether it's a bit or a byte or something else).

Thanks a ton for the explanation above!

@Cyan4973

I found myself a little frustrated with the lack of unit of measure for certain params in the first one (whether it's a bit or a byte or something else).

In most (if not all?) cases, values are expressed in bytes by default.
But it's possible to use suffixes to indicate KB or MB instead.

https://www.mankier.com/1/zstd#Options-Integer_suffixes_and_special_values
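
For example, these two invocations should be equivalent (a quick sketch reusing your training path; MiB-style suffixes multiply by 1048576):

zstd --train -r /mnt/volume_sfo3_01/messages/training -o dict.zstd --maxdict=10MB -M2000MiB
zstd --train -r /mnt/volume_sfo3_01/messages/training -o dict.zstd --maxdict=10485760 -M2097152000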

Cyan4973 added a commit that referenced this issue Apr 14, 2022
following questions from #3111.

Note: only the source markdown has been updated;
the actual man page zstd.1 still needs to be processed.

gyscos commented Jul 6, 2022

Note that the Rust code sample has a few probable issues (a corrected sketch follows):

  • Path doesn't automatically expand ~ into your home directory.
  • The argument to from_files is a list of files to train on, not a file containing the list of files to train on. So the sample would train a dictionary on the single file ~/training_files.
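
A rough corrected version (the directory path and dictionary size are illustrative, and the directory is assumed to contain the individual sample files):

    use std::path::PathBuf;

    fn main() -> std::io::Result<()> {
        // illustrative absolute path; ~ is not expanded by Path, so spell it out (or use a crate like `dirs`)
        let dir = PathBuf::from("/home/user/training_files");
        // collect the individual sample files; from_files expects an iterator of paths
        let files: Vec<PathBuf> = std::fs::read_dir(&dir)?
            .filter_map(|entry| entry.ok())
            .map(|entry| entry.path())
            .collect();
        // train a ~10 MB dictionary from those samples and write it out
        let zdict = zstd::dict::from_files(&files, 1024 * 1024 * 10)?;
        std::fs::write("dict.zstd", &zdict)?;
        Ok(())
    }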

@terrelln

Seems to be answered, closing. Please create a new issue if you have further questions.
