1 parent 2cd1308, commit bcb469a
Showing 5 changed files with 52 additions and 5 deletions.
6 changes: 6 additions & 0 deletions
docs/optimization_guide/nncf/code/weight_compression_openvino.py
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,6 @@ | ||
#! [compression_8bit] | ||
from nncf import compress_weights | ||
|
||
... | ||
model = compress_weights(model) # model is an openvino.Model object | ||
#! [compression_8bit] |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
# Weight Compression {#weight_compression} | ||
|
||
@sphinxdirective | ||
|
||
Enhancing Model Efficiency with Weight Compression | ||
################################################################## | ||
|
||
Weight compression aims to reduce the memory footprint of a model. It can also lead to a significant performance improvement for large memory-bound models, such as Large Language Models (LLMs). LLMs and other models that require extensive memory to store weights during inference can benefit from weight compression in the following ways: | ||
|
||
- enabling the inference of exceptionally large models that cannot be accommodated in the memory of the device; | ||
- improving the inference performance of the models by reducing memory-access latency when computing operations that use weights, for example, Linear layers. | ||
|
||
Currently, NNCF provides 8-bit weight quantization as a compression method primarily designed to optimize LLMs. The main difference between weight compression and full model quantization (post-training quantization) is that activations remain floating-point in weight compression, which leads to better accuracy. For LLMs, weight compression provides an inference performance improvement on par with that of full model quantization. In addition, weight compression is data-free and does not require a calibration dataset, making it easy to use. | ||
|
||
Compress Model Weights | ||
###################### | ||
|
||
The code snippet below shows how to compress the weights of the model represented in OpenVINO IR using NNCF: | ||
|
||
.. tab-set:: | ||
|
||
.. tab-item:: OpenVINO | ||
:sync: openvino | ||
|
||
.. doxygensnippet:: docs/optimization_guide/nncf/code/weight_compression_openvino.py | ||
:language: python | ||
:fragment: [compression_8bit] | ||
|
||
Now, the model is ready for compilation and inference. It can also be saved in a compressed format, resulting in a smaller binary file. | ||
|
||
Additional Resources | ||
#################### | ||
|
||
- :doc:`Post-training Quantization <ptq_introduction>` | ||
- :doc:`Training-time Optimization <tmo_introduction>` | ||
|
||
@endsphinxdirective |
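
The idea of 8-bit weight quantization with floating-point activations, as described above, can be illustrated with a small conceptual sketch. This is not NNCF's implementation: the symmetric per-row scaling scheme, the helper names (`quantize_row`, `dequantize_row`, `linear`), and the toy values are assumptions chosen only to show why storing weights as 8-bit integers plus a scale keeps the result of a Linear-like operation close to the original.

```python
def quantize_row(row):
    """Map one row of float weights to integers in [-127, 127] plus a scale.

    Assumption: symmetric per-row scaling, used here for illustration only.
    """
    max_abs = max(abs(w) for w in row)
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in row]
    return quantized, scale


def dequantize_row(quantized, scale):
    """Restore approximate float weights from the 8-bit values and the scale."""
    return [q * scale for q in quantized]


def linear(weights, x):
    """Matrix-vector product, standing in for a Linear layer without bias."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


# Toy float weights and a floating-point activation vector (hypothetical values).
weights = [[0.12, -0.50, 0.33], [1.20, 0.07, -0.90]]
x = [0.5, -1.0, 2.0]

# Only the weights are compressed; the activations stay in floating point.
compressed = [quantize_row(row) for row in weights]
restored = [dequantize_row(q, scale) for q, scale in compressed]

reference = linear(weights, x)
approx = linear(restored, x)
max_error = max(abs(a - b) for a, b in zip(reference, approx))
print("max absolute output error:", max_error)
```

Each weight is stored in one byte instead of the four bytes a 32-bit float would take, at the cost of a small rounding error bounded by half the row's scale per weight.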