Lower weight compression memory footprint by sorting weights according to their size #2803

nikita-savelyevv · 2024-07-10T13:38:15Z

Changes

Sort weights by size, largest first, before compression:

```python
all_weight_params = sorted(all_weight_params, key=lambda wp: wp.num_weights, reverse=True)
```
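The one-line change above can be tried outside NNCF with a small stand-in class. This is a sketch: `WeightParam`, `order_for_compression`, the layer names and the shapes are all hypothetical. Only the `num_weights` attribute and the `sorted(..., reverse=True)` call come from the PR.

```python
from dataclasses import dataclass
from math import prod
from typing import List, Tuple


@dataclass
class WeightParam:
    """Hypothetical stand-in for the per-weight objects in all_weight_params."""

    name: str
    shape: Tuple[int, ...]

    @property
    def num_weights(self) -> int:
        # Total number of elements in the weight tensor.
        return prod(self.shape)


def order_for_compression(params: List[WeightParam]) -> List[WeightParam]:
    # Largest first. sorted() is stable (also with reverse=True),
    # so equally sized weights keep their original relative order.
    return sorted(params, key=lambda wp: wp.num_weights, reverse=True)


params = [
    WeightParam("layers.0.q_proj", (64, 64)),
    WeightParam("layers.0.up_proj", (256, 64)),
    WeightParam("layers.0.k_proj", (64, 64)),
    WeightParam("lm_head", (1000, 64)),
]
print([wp.name for wp in order_for_compression(params)])
# ['lm_head', 'layers.0.up_proj', 'layers.0.q_proj', 'layers.0.k_proj']
```

Because the sort is stable, the two equally sized projections stay in their original order, so only the size ranking changes relative to the original traversal order.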

Reason for changes

During weight compression, the memory footprint grows gradually as new low-bit constants are created. On top of that, there are temporary spikes in the footprint while each compressed weight is being computed. For example, here:

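```python
if invert_scale:
    scale = fns.power(scale, -1)
    compressed_weights = weight * scale  # new full-precision array
else:
    compressed_weights = weight / scale  # new full-precision array
if zero_point is not None:
    compressed_weights += zero_point.astype(weight.dtype)
compressed_weights = fns.round(compressed_weights)  # another full-precision array
# Clipping yields one more full-precision array before the cast to the low-bit dtype.
compressed_weights = fns.clip(compressed_weights, level_low, level_high).astype(dtype)
```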
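The same sequence of operations can be reproduced in plain NumPy to see where the full-precision temporaries appear. This is a sketch under assumptions: `compress_weight` is a hypothetical name, the per-row min/max computation of `scale` and `zero_point` is an assumed asymmetric 8-bit scheme rather than NNCF's exact code, and the `fns.*` calls are replaced by their NumPy counterparts.

```python
import numpy as np


def compress_weight(weight: np.ndarray, num_bits: int = 8):
    """Asymmetric per-row quantization mirroring the steps above (illustrative sketch).

    Returns (compressed, scale, zero_point). The min/max-based scale and
    zero point are an assumed scheme, not NNCF's exact implementation.
    """
    level_low, level_high = 0, 2**num_bits - 1
    w_min = weight.min(axis=1, keepdims=True)
    w_max = weight.max(axis=1, keepdims=True)
    scale = (w_max - w_min) / (level_high - level_low)
    scale = np.where(scale == 0, 1, scale).astype(weight.dtype)  # avoid division by zero
    zero_point = np.clip(np.round(-w_min / scale), level_low, level_high).astype(np.uint8)

    # Each step below allocates a new full-precision array the size of `weight`;
    # these short-lived temporaries are what cause the memory spikes.
    compressed = weight / scale  # temporary #1
    compressed += zero_point.astype(weight.dtype)  # in place on temporary #1
    compressed = np.round(compressed)  # temporary #2
    compressed = np.clip(compressed, level_low, level_high)  # temporary #3
    return compressed.astype(np.uint8), scale, zero_point  # final low-bit copy


w = np.random.default_rng(0).uniform(-1.0, 1.0, size=(4, 16)).astype(np.float32)
q, scale, zp = compress_weight(w)
print(q.dtype, q.shape)  # uint8 (4, 16)
```

For a float32 weight with N elements, each temporary costs another 4N bytes while the final uint8 result is only N bytes, so a single large weight can produce a spike several times larger than its compressed size.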

These steps create several temporary full-precision arrays, which are garbage-collected afterwards. This is what causes the temporary spikes in memory footprint, and each spike sits on top of the memory already held by the low-bit constants compressed so far. It therefore makes sense to compress the largest constants first, while only a few low-bit constants have been created. The effect is most pronounced for embedding matrices, which are usually the largest weights in LLMs.
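To illustrate why the order matters, here is a deliberately simplified, hypothetical memory model (not NNCF code, and not calibrated to the measurements in this PR). It assumes that a weight with n elements needs `temp_bytes * n` bytes of temporaries while it is compressed and keeps `compressed_bytes * n` bytes of low-bit constant afterwards; full-precision originals are not counted, roughly as with mmap-backed OV models.

```python
from typing import Sequence


def peak_memory(sizes: Sequence[int], compressed_bytes: float = 1.0, temp_bytes: float = 8.0) -> float:
    """Peak footprint of a toy compression loop over weights of the given sizes.

    Assumptions (hypothetical): a weight with n elements needs n * temp_bytes of
    temporaries while it is compressed and keeps n * compressed_bytes of
    low-bit constant afterwards. Full-precision originals are not counted.
    """
    resident = 0.0  # low-bit constants created so far
    peak = 0.0
    for n in sizes:
        # The spike for this weight sits on top of everything compressed before it.
        peak = max(peak, resident + n * temp_bytes)
        resident += n * compressed_bytes
    return max(peak, resident)


# Hypothetical model: ten 100-element layers plus one 1000-element embedding visited last.
graph_order = [100] * 10 + [1000]
largest_first = sorted(graph_order, reverse=True)

print(peak_memory(graph_order))  # 9000.0
print(peak_memory(largest_first))  # 8000.0
```

In this model, moving a larger weight in front of an adjacent smaller one never raises the peak as long as `temp_bytes >= compressed_bytes`, so descending order minimizes it. Real savings depend on the actual allocator and backend and are not predicted by this sketch.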

See the memory figures below. They were obtained during 8-bit weight compression and gathered with `memory_logger.py` using `memory_type=SYSTEM_NORMALIZED`.

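| Backend | Model | Before | After |
|---|---|---|---|
| OV | qwen2-7b | (memory plot not preserved) | (memory plot not preserved) |
| PT | qwen2-7b | (memory plot not preserved) | (memory plot not preserved) |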

For example, for the qwen2-7b OV model, the peak footprint drops from ~12 GB to ~7 GB.

The values for the OV backend are much lower than for PT because OV models are read using mmap, which avoids allocating memory for the whole full-precision model.
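OpenVINO's own model reader is not shown here. As a rough analogy only, `numpy.memmap` demonstrates the same mechanism: the file is mapped into the address space and the OS loads pages on demand, so reading one row does not require copying the whole file into process memory. The raw file layout, the shape and `read_row_via_mmap` are hypothetical.

```python
import os
import tempfile

import numpy as np


def read_row_via_mmap(path: str, shape: tuple, row: int, dtype=np.float32) -> np.ndarray:
    """Copy a single row out of a raw weight file without reading the whole file."""
    mapped = np.memmap(path, dtype=dtype, mode="r", shape=shape)
    # Only the pages touched here (plus any OS read-ahead) are loaded from disk.
    out = np.array(mapped[row])  # explicit copy, independent of the mapping
    del mapped  # drop our reference to the mapping
    return out


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "weight.bin")
    np.arange(12, dtype=np.float32).reshape(3, 4).tofile(path)
    print(read_row_via_mmap(path, (3, 4), row=1))  # [4. 5. 6. 7.]
```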

Related tickets

144501

nncf/quantization/algorithms/weight_compression/algorithm.py

andreyanufr · 2024-07-11T06:54:17Z

> @andreyanufr Could you please confirm that AWQ and Scale Estimation algorithms are agnostic to the weight ordering? @alexsu52 And also GPTQ?

AWQ and Scale Estimation are agnostic to the weight ordering; GPTQ is not.

nikita-savelyevv · 2024-07-11T08:37:34Z

Weight compression build with id 103 has passed

### Changes

During NNCF weight compression, a `rich` progress bar is used to display the progress. In this PR, the progress bar is changed to be weighted according to the sizes of the model weights. With these changes, each weight contributes a share of the progress proportional to its size. The iteration number was removed from the weight compression progress bar to avoid confusion between the different rates in percent and in iterations. For example, a single weight may now contribute 5-10% of the total progress.

### Reason for changes

The time it takes to compress a weight is roughly proportional to its size, so incrementing the progress by 1 for each weight is not ideal, especially after #2803, which added weight sorting. Now the largest weights come first and the smallest ones come at the end of compression, so giving every weight an equal contribution leads to misleading time estimates.

Weight sizes for tinyllama-1.1b for reference:

![weight_size_hist](https://github.com/user-attachments/assets/30ba1e1b-0fc5-4d6b-84db-948362672bf2)
![weight_size_cumsum_hist](https://github.com/user-attachments/assets/b00e79e8-5000-44a4-97a5-4102c9aed0ae)
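The difference between count-based and size-weighted progress can be sketched in a few lines of plain Python. This illustrates the idea described above and is not the code from #2892; the function names and the example sizes are hypothetical.

```python
from typing import List


def cumulative_progress(sizes: List[int]) -> List[float]:
    """Percent complete after each weight, weighting every weight by its size."""
    total = sum(sizes)
    done = 0
    percents = []
    for size in sizes:
        done += size
        percents.append(100.0 * done / total)
    return percents


def count_based_progress(n: int) -> List[float]:
    """Percent complete when every weight counts as one step, regardless of size."""
    return [100.0 * (i + 1) / n for i in range(n)]


sizes = [600, 300, 100]  # largest first, as after sorting
print(cumulative_progress(sizes))  # [60.0, 90.0, 100.0]
print(count_based_progress(len(sizes)))
```

With sorted weights, the count-based bar reports only about a third done after the largest weight, although roughly 60% of the work (by size) is finished. With `rich`, one way to get the weighted behaviour is to create a task with `total=sum(sizes)` and call `progress.update(task_id, advance=size)` after each weight.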

Sort weights before compression

afa68f0

nikita-savelyevv requested a review from a team as a code owner July 10, 2024 13:38

github-actions bot added the NNCF PTQ Pull requests that updates NNCF PTQ label Jul 10, 2024

nikita-savelyevv requested a review from ljaljushkin July 10, 2024 13:40

ljaljushkin requested changes Jul 10, 2024


nncf/quantization/algorithms/weight_compression/algorithm.py Outdated

nikita-savelyevv added 2 commits July 10, 2024 17:08

Move reordering a few lines lower

e974bef

Move reordering right before weights compression

75376fb

ljaljushkin approved these changes Jul 11, 2024


ljaljushkin merged commit 232b435 into openvinotoolkit:develop Jul 12, 2024
12 checks passed

nikita-savelyevv mentioned this pull request Aug 20, 2024

Add weighted progress tracking for weight compression #2892

Merged
