From c17b6ed00ecccc20a3c1ad51c41ee3cae743cbb2 Mon Sep 17 00:00:00 2001
From: Aleksei Kashapov
Date: Fri, 8 Nov 2024 14:07:40 +0100
Subject: [PATCH] Docs for statistics caching (#3052)

### Changes

Add section regarding statistics caching for WeightCompression

---
 .../weights_compression/Usage.md | 120 ++++++++++++------
 1 file changed, 79 insertions(+), 41 deletions(-)

diff --git a/docs/usage/post_training_compression/weights_compression/Usage.md b/docs/usage/post_training_compression/weights_compression/Usage.md
index 6d2f3a3afea..31beb7dcfeb 100644
--- a/docs/usage/post_training_compression/weights_compression/Usage.md
+++ b/docs/usage/post_training_compression/weights_compression/Usage.md
@@ -1,11 +1,24 @@
-## Weights Compression
-
-[OpenVINO](https://github.com/openvinotoolkit/openvino) is the preferred backend to run Weights Compression with. PyTorch and Torch FX are also supported.
+- [The algorithm description](#the-algorithm-description)
+- [Supported modes](#supported-modes)
+- [User guide](#user-guide)
+  - [Data-free methods](#data-free-methods)
+  - [Data-aware methods](#data-aware-methods)
+  - [Caching Statistics](#caching-statistics)
+- [Evaluation results](#evaluation-results)
+  - [Data-free Mixed-Precision on Lambada OpenAI dataset](#data-free-mixed-precision-on-lambada-openai-dataset)
+  - [Data-aware Mixed-Precision and AWQ methods on Wikitext dataset](#data-aware-mixed-precision-and-awq-methods-on-wikitext-dataset)
+  - [Scale Estimation and GPTQ methods on Lambada OpenAI dataset](#scale-estimation-and-gptq-methods-on-lambada-openai-dataset)
+  - [Accuracy/Footprint trade-off](#accuracyfootprint-trade-off)
+- [Limitations](#limitations)
+- [Additional resources](#additional-resources)
 
 ### The algorithm description
 
 The Weights Compression algorithm is aimed at compressing the weights of the models and can be used to optimize the model footprint and performance of large models where the size of weights is relatively larger than the size of activations, for example, Large Language Models (LLM). The algorithm compresses weights for Linear, Convolution and Embedding layers.
 
+[OpenVINO](https://github.com/openvinotoolkit/openvino) is the preferred backend to run Weights Compression with. PyTorch and Torch FX are also supported.
+
 ### Supported modes
 
 By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.
@@ -16,6 +29,8 @@ Percent of the rest layers compressed to 4-bit can be configured by "ratio" para
 
 ### User guide
 
+#### Data-free methods
+
 - Compress weights asymmetrically to 8-bit integer data type.
 
 ```python
@@ -56,6 +71,8 @@ from nncf import compress_weights, CompressWeightsMode
 compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=0.9) # model is openvino.Model object
 ```
 
+#### Data-aware methods
+
 - Accuracy of the 4-bit compressed models can be improved by using the data-aware mixed-precision algorithm. It is capable of finding outliers in the input activations and assigning them a different quantization precision to minimize accuracy degradation. Below is an example of how to compress 80% of the layers to 4-bit integer with the default data-aware mixed-precision algorithm. It requires just one extra parameter - an NNCF wrapper of the dataset. Refer to the [full example](https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino) of data-aware weight compression for more details. If a dataset is not specified, the data-free mixed-precision algorithm works based on weights only.
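Editor's note: a minimal sketch of the data-aware call described in the bullet above (not part of the patch). `model` is assumed to be an `openvino.Model` and `nncf_dataset` an `nncf.Dataset` built from calibration samples, as in the dataset-preparation snippet shown further below; the 80% ratio matches the example referenced in the text.

```python
from nncf import compress_weights, CompressWeightsMode

# Data-aware mixed-precision compression: with a dataset supplied, the
# mixed-precision algorithm uses activation statistics to decide which
# layers keep 8-bit precision.
compressed_model = compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,
    ratio=0.8,             # 80% of eligible layers go to 4-bit, the rest stay 8-bit
    dataset=nncf_dataset,  # nncf.Dataset wrapper around calibration samples
)
```

If the dataset is omitted, the same call falls back to the data-free mixed-precision criterion described above, which relies on the weights only.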
@@ -80,56 +97,59 @@ nncf_dataset = nncf.Dataset(synthetic_data, transform_fn)
 
 - Accuracy of the 4-bit compressed models can also be improved by using the AWQ, Scale Estimation, GPTQ or Lora Correction algorithms on top of the data-based mixed-precision algorithm. These algorithms work by equalizing a subset of weights to minimize the difference between the original precision and the 4-bit precision. Unlike all the others, the Lora Correction algorithm inserts additional Linear layers to reduce quantization noise and further improve accuracy. Inevitably, this approach introduces memory and runtime overheads, but they are negligible, since the inserted weights are much smaller and can be quantized to 8-bit. The AWQ, Scale Estimation (SE) and Lora Correction (LC) algorithms can be used in any combination together: AWQ + SE, AWQ + LC, SE + LC, AWQ + SE + LC. The GPTQ algorithm can be combined with AWQ and Scale Estimation in any combination: AWQ + GPTQ, GPTQ + SE, AWQ + GPTQ + SE. Below are examples demonstrating how to enable the AWQ, Scale Estimation, GPTQ or Lora Correction algorithms:
 
-  Prepare the calibration dataset for data-based algorithms:
+
+  Prepare the calibration dataset for data-based algorithms
 
-```python
-from datasets import load_dataset
-from functools import partial
-from nncf import compress_weights, CompressWeightsMode, Dataset
-from optimum.intel.openvino import OVModelForCausalLM
-from transformers import AutoTokenizer
+  ```python
+  from datasets import load_dataset
+  from functools import partial
+  from nncf import compress_weights, CompressWeightsMode, Dataset
+  from optimum.intel.openvino import OVModelForCausalLM
+  from transformers import AutoTokenizer
+  import numpy as np  # used below to build the input tensors
 
-def transform_func(item, tokenizer, input_shapes):
-    text = item['text']
-    tokens = tokenizer(text)
+  def transform_func(item, tokenizer, input_shapes):
+      text = item['text']
+      tokens = tokenizer(text)
 
-    res = {'input_ids': np.expand_dims(np.array(tokens['input_ids']), 0),
-           'attention_mask': np.expand_dims(np.array(tokens['attention_mask']), 0)}
+      res = {'input_ids': np.expand_dims(np.array(tokens['input_ids']), 0),
+             'attention_mask': np.expand_dims(np.array(tokens['attention_mask']), 0)}
 
-    if 'position_ids' in input_shapes:
-        position_ids = np.cumsum(res['attention_mask'], axis=1) - 1
-        position_ids[res['attention_mask'] == 0] = 1
-        res['position_ids'] = position_ids
+      if 'position_ids' in input_shapes:
+          position_ids = np.cumsum(res['attention_mask'], axis=1) - 1
+          position_ids[res['attention_mask'] == 0] = 1
+          res['position_ids'] = position_ids
 
-    for name, shape in input_shapes.items():
-        if name in res:
-            continue
-        res[name] = np.zeros(shape)
+      for name, shape in input_shapes.items():
+          if name in res:
+              continue
+          res[name] = np.zeros(shape)
 
-    return res
+      return res
 
-def get_input_shapes(model, batch_size = 1):
-    inputs = {}
+  def get_input_shapes(model, batch_size=1):
+      inputs = {}
 
-    for val in model.model.inputs:
-        name = val.any_name
-        shape = list(val.partial_shape.get_min_shape())
-        shape[0] = batch_size
-        inputs[name] = shape
+      for val in model.model.inputs:
+          name = val.any_name
+          shape = list(val.partial_shape.get_min_shape())
+          shape[0] = batch_size
+          inputs[name] = shape
 
-    return inputs
+      return inputs
 
-# load your model and tokenizer
-model = OVModelForCausalLM.from_pretrained(...)
-tokenizer = AutoTokenizer.from_pretrained(...)
+  # load your model and tokenizer
+  model = OVModelForCausalLM.from_pretrained(...)
+  tokenizer = AutoTokenizer.from_pretrained(...)
 
-# prepare dataset for compression
-dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')
-dataset = dataset.filter(lambda example: len(example["text"]) > 80)
-input_shapes = get_input_shapes(model)
-nncf_dataset = Dataset(dataset, partial(transform_func, tokenizer=tokenizer,
-                                        input_shapes=input_shapes))
-```
+  # prepare dataset for compression
+  dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')
+  dataset = dataset.filter(lambda example: len(example["text"]) > 80)
+  input_shapes = get_input_shapes(model)
+  nncf_dataset = Dataset(dataset, partial(transform_func, tokenizer=tokenizer,
+                                          input_shapes=input_shapes))
+  ```
+
+
 
 - How to compress 80% of layers to 4-bit integer with the default data-based mixed-precision algorithm and AWQ with Scale Estimation. It requires setting `awq` to `True` and `scale_estimation` to `True` in addition to the data-based mixed-precision algorithm.
@@ -180,6 +200,24 @@ from nncf import compress_weights, CompressWeightsMode
 compressed_model = compress_weights(model, mode=CompressWeightsMode.E2M1, group_size=32, all_layers=True)
 ```
 
+#### Caching Statistics
+
+To optimize compression time and reuse statistics across multiple configurations, you can use the `statistics_path` option. This feature enables caching of the calculated statistics, allowing them to be loaded from a specified path rather than recalculated for each configuration. This can significantly reduce compression time during repeated compression runs, making it ideal when searching for the optimal compression parameters.
+
+To enable statistics caching, set the `statistics_path` parameter to your chosen path.
+
+```python
+from nncf.quantization.advanced_parameters import AdvancedCompressionParameters
+from nncf import compress_weights
+
+compressed_model = compress_weights(
+    model,
+    advanced_parameters=AdvancedCompressionParameters(statistics_path="statistics")
+)
+```
+
+When `statistics_path` is provided, NNCF first checks whether the specified path exists. If it does, the statistics are loaded from this path. If the path does not exist, the statistics are computed and saved to this path for future use.
+
 ### Evaluation results
 
 #### Data-free Mixed-Precision on Lambada OpenAI dataset
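Editor's note: a minimal sketch of the parameter-search scenario that the new Caching Statistics section targets (an editorial illustration, not part of the patch). The `model.xml` path, the ratio values and the reuse of `nncf_dataset` from the earlier snippet are assumptions.

```python
import openvino as ov

from nncf import compress_weights, CompressWeightsMode
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

core = ov.Core()

# One cache directory shared by all runs: statistics are computed and saved on
# the first call and loaded from "statistics" on the following calls.
advanced_parameters = AdvancedCompressionParameters(statistics_path="statistics")

for ratio in (0.6, 0.8, 1.0):  # illustrative search over the 4-bit ratio
    model = core.read_model("model.xml")  # start each run from the original weights
    compressed_model = compress_weights(
        model,
        mode=CompressWeightsMode.INT4_SYM,
        ratio=ratio,
        dataset=nncf_dataset,  # calibration dataset prepared as in the snippet above
        advanced_parameters=advanced_parameters,
    )
    ov.save_model(compressed_model, f"model_int4_ratio_{ratio}.xml", compress_to_fp16=False)
```

Removing the cache directory or pointing `statistics_path` to a new location forces the statistics to be recomputed on the next call.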