Lower weight compression memory footprint by sorting weights according to their size #2803
Changes
Sort weights for compression:
Reason for changes
During weight compression, the memory footprint grows gradually as new low-bit constants are created. At the same time, there are temporary spikes in memory footprint during compressed weight computation. For example, here:
Multiple temporary full-precision arrays have to be created here. They are garbage-collected afterwards, but while they exist they cause the temporary spikes mentioned above. Given this, it makes sense to compress large constants first, while few low-bit constants are occupying memory yet. The effect comes mostly from embedding matrices.
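The reasoning above can be illustrated with a toy model. This is only a sketch, not the code changed in this PR: `peak_footprint`, `temp_factor` and `compressed_ratio` are hypothetical names and values chosen for illustration, and the model assumes the original full-precision weights do not count towards the footprint.

```python
from typing import Sequence


def peak_footprint(
    sizes: Sequence[float],
    temp_factor: float = 3.0,        # assumed: temporaries cost 3x the weight size
    compressed_ratio: float = 0.25,  # assumed: low-bit constant is 1/4 of the weight size
) -> float:
    """Toy estimate of peak memory when weights are compressed in the given order."""
    resident = 0.0  # memory held by low-bit constants created so far
    peak = 0.0
    for size in sizes:
        # Temporary full-precision arrays exist only while this weight is processed.
        peak = max(peak, resident + temp_factor * size)
        # After processing, only the low-bit constant stays in memory.
        resident += compressed_ratio * size
    return peak


# Arbitrary units: twenty regular weights, then one large embedding-like weight.
sizes = [100.0] * 20 + [1000.0]

unsorted_peak = peak_footprint(sizes)
sorted_peak = peak_footprint(sorted(sizes, reverse=True))  # largest first

print(unsorted_peak, sorted_peak)
```

In this model the largest spike coincides with an empty set of low-bit constants when the large weight goes first, so the peak is lower than when the large weight is processed last.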
Please see the memory figures below. They were obtained during 8-bit weight compression and gathered with memory_logger.py, memory_type=SYSTEM_NORMALIZED.
For example, for the qwen2-7b OV model, peak footprint drops from ~12GB to ~7GB.
The values for the OV backend are much lower than for PT because OV models are read using mmap, which avoids allocating memory for the whole full-precision model.
Related tickets
144501