Docs for statistics caching (#3052)
### Changes

Add section regarding statistics caching for WeightCompression
kshpv authored Nov 8, 2024
1 parent b8b0dee commit c17b6ed
Showing 1 changed file with 79 additions and 41 deletions: docs/usage/post_training_compression/weights_compression/Usage.md
## Weights Compression

- [The algorithm description](#the-algorithm-description)
- [Supported modes](#supported-modes)
- [User guide](#user-guide)
- [Data-free methods](#data-free-methods)
- [Data-aware methods](#data-aware-methods)
- [Caching Statistics](#caching-statistics)
- [Evaluation results](#evaluation-results)
- [Data-free Mixed-Precision on Lambada OpenAI dataset](#data-free-mixed-precision-on-lambada-openai-dataset)
- [Data-aware Mixed-Precision and AWQ methods on Wikitext dataset](#data-aware-mixed-precision-and-awq-methods-on-wikitext-dataset)
- [Scale Estimation and GPTQ methods on Lambada OpenAI dataset](#scale-estimation-and-gptq-methods-on-lambada-openai-dataset)
- [Accuracy/Footprint trade-off](#accuracyfootprint-trade-off)
- [Limitations](#limitations)
- [Additional resources](#additional-resources)

### The algorithm description

The Weights Compression algorithm is aimed at compressing the weights of models and can be used to optimize the model footprint and performance of large models where the size of the weights is considerably larger than the size of the activations, for example, Large Language Models (LLMs). The algorithm compresses weights for Linear, Convolution and Embedding layers.

[OpenVINO](https://github.com/openvinotoolkit/openvino) is the preferred backend to run Weights Compression with. PyTorch and Torch FX are also supported.

### Supported modes

By default, weights are compressed asymmetrically to 8-bit integer data type - "INT8_ASYM" mode.

The percent of the rest of the layers compressed to 4-bit can be configured by the "ratio" parameter.

### User guide

#### Data-free methods

- Compress weights asymmetrically to 8-bit integer data type.
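
  A minimal sketch, relying on `INT8_ASYM` being the default mode, so no `mode` argument is needed:

```python
from nncf import compress_weights

# INT8_ASYM is the default mode, so no extra arguments are required
compressed_model = compress_weights(model)  # model is an openvino.Model object
```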

- Compress weights of 90% of the layers asymmetrically to 4-bit integer data type with a group size of 64, and the rest of the layers to 8-bit integer data type.

```python
from nncf import compress_weights, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_ASYM, group_size=64, ratio=0.9) # model is openvino.Model object
```

#### Data-aware methods

- Accuracy of the 4-bit compressed models can be improved by using the data-aware mixed-precision algorithm. It is capable of finding outliers in the input activations and assigning different quantization precisions to minimize accuracy degradation.
Below is an example of how to compress 80% of layers to 4-bit integer with the default data-aware mixed-precision algorithm.
It requires just one extra parameter, an NNCF wrapper of the dataset. Refer to the [full example](https://github.com/openvinotoolkit/nncf/tree/develop/examples/llm_compression/openvino) of data-aware weight compression for more details. If a dataset is not specified, the data-free mixed-precision algorithm works based on weights only.
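
  A minimal sketch of the data-aware flow, assuming `synthetic_data` is any iterable of raw samples, `transform_fn` converts one sample into the model's input dictionary (both names are placeholders), and the `INT4_SYM` mode is used:

```python
import nncf
from nncf import CompressWeightsMode

# synthetic_data and transform_fn are placeholders: any iterable of raw samples
# and a callable that turns one sample into the model's input dictionary
nncf_dataset = nncf.Dataset(synthetic_data, transform_fn)

# compress 80% of the layers to 4-bit, guided by activation statistics from the dataset
compressed_model = nncf.compress_weights(
    model,
    mode=CompressWeightsMode.INT4_SYM,
    ratio=0.8,
    dataset=nncf_dataset,
)  # model is an openvino.Model object
```
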
- Accuracy of the 4-bit compressed models can also be improved by using the AWQ, Scale Estimation, GPTQ or Lora Correction algorithms on top of the data-aware mixed-precision algorithm. These algorithms work by equalizing a subset of weights to minimize the difference between the original precision and the 4-bit precision.
Unlike the others, the Lora Correction algorithm inserts additional Linear layers to reduce quantization noise and further improve accuracy. Inevitably, this approach introduces memory and runtime overheads, but they are negligible, since the inserted weights are much smaller and can be quantized to 8-bit. The AWQ, Scale Estimation (SE) and Lora Correction (LC) algorithms can be used together in any combination: AWQ + SE, AWQ + LC, SE + LC, AWQ + SE + LC. The GPTQ algorithm can be combined with AWQ and Scale Estimation in any combination: AWQ + GPTQ, GPTQ + SE, AWQ + GPTQ + SE. Below are examples demonstrating how to enable the AWQ, Scale Estimation, GPTQ or Lora Correction algorithms:

<details>
  <summary>Prepare the calibration dataset for data-based algorithms</summary>

```python
import numpy as np

from datasets import load_dataset
from functools import partial
from nncf import compress_weights, CompressWeightsMode, Dataset
from optimum.intel.openvino import OVModelForCausalLM
from transformers import AutoTokenizer

def transform_func(item, tokenizer, input_shapes):
    text = item['text']
    tokens = tokenizer(text)

    res = {'input_ids': np.expand_dims(np.array(tokens['input_ids']), 0),
           'attention_mask': np.expand_dims(np.array(tokens['attention_mask']), 0)}

    if 'position_ids' in input_shapes:
        position_ids = np.cumsum(res['attention_mask'], axis=1) - 1
        position_ids[res['attention_mask'] == 0] = 1
        res['position_ids'] = position_ids

    # pad any remaining model inputs with zeros of the expected shape
    for name, shape in input_shapes.items():
        if name in res:
            continue
        res[name] = np.zeros(shape)

    return res

def get_input_shapes(model, batch_size=1):
    inputs = {}

    for val in model.model.inputs:
        name = val.any_name
        shape = list(val.partial_shape.get_min_shape())
        shape[0] = batch_size
        inputs[name] = shape

    return inputs

# load your model and tokenizer
model = OVModelForCausalLM.from_pretrained(...)
tokenizer = AutoTokenizer.from_pretrained(...)

# prepare dataset for compression
dataset = load_dataset('wikitext', 'wikitext-2-v1', split='train')
dataset = dataset.filter(lambda example: len(example["text"]) > 80)
input_shapes = get_input_shapes(model)
nncf_dataset = Dataset(dataset, partial(transform_func, tokenizer=tokenizer,
                                        input_shapes=input_shapes))
```

</details>
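
  A sketch of enabling the Lora Correction algorithm on top of the data-aware mixed-precision algorithm; it assumes the `nncf_dataset` prepared above, uses `model.model` (the underlying `openvino.Model` of the Optimum wrapper), and the mode and ratio are illustrative:

```python
from nncf import compress_weights, CompressWeightsMode

# lora_correction inserts small additional Linear layers to compensate quantization noise
compressed_model = compress_weights(
    model.model,
    mode=CompressWeightsMode.INT4_SYM,
    ratio=0.8,
    dataset=nncf_dataset,
    lora_correction=True,
)
```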

- Below is an example of how to compress 80% of layers to 4-bit integer with the default data-aware mixed-precision algorithm and AWQ with Scale Estimation. It requires setting `awq` to `True` and `scale_estimation` to `True` in addition to the data-aware mixed-precision settings.
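
  A sketch under these assumptions: `nncf_dataset` is the calibration dataset prepared above, `model.model` is the underlying `openvino.Model` of the Optimum wrapper, and the `INT4_SYM` mode is used:

```python
from nncf import compress_weights, CompressWeightsMode

# data-aware mixed precision (ratio=0.8) refined by AWQ and Scale Estimation;
# GPTQ can be enabled analogously via gptq=True
compressed_model = compress_weights(
    model.model,
    mode=CompressWeightsMode.INT4_SYM,
    ratio=0.8,
    dataset=nncf_dataset,
    awq=True,
    scale_estimation=True,
)
```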

- Compress all layers' weights to the E2M1 data type with a group size of 32:

```python
from nncf import compress_weights, CompressWeightsMode
compressed_model = compress_weights(model, mode=CompressWeightsMode.E2M1, group_size=32, all_layers=True)
```

#### Caching Statistics

To optimize compression time and reuse statistics across multiple configurations, you can use the `statistics_path` option. This feature enables caching of calculated statistics, allowing them to be loaded from a specified path rather than recalculated for each configuration. This approach can significantly reduce compression time during repeated model compression iterations, making it ideal when searching for optimal compression parameters.

To enable statistics caching, set the `statistics_path` parameter to your chosen path.

```python
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters
from nncf import compress_weights

compressed_model = compress_weights(
model,
advanced_parameters=AdvancedCompressionParameters(statistics_path="statistics")
)
```

When `statistics_path` is provided, the system first checks if the specified path exists. If it does, the statistics are loaded from this path. If the path does not exist, the statistics are computed and saved to this path for future use.
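
For illustration, a sketch of reusing cached statistics across two configurations; it assumes a data-aware setup with the `nncf_dataset` prepared earlier, where `model_a` and `model_b` are two fresh copies of the same original model, and the cache directory name "statistics" is arbitrary:

```python
from nncf import compress_weights, CompressWeightsMode
from nncf.quantization.advanced_parameters import AdvancedCompressionParameters

advanced = AdvancedCompressionParameters(statistics_path="statistics")

# first run: statistics are collected from the dataset and saved under "statistics"
candidate_a = compress_weights(
    model_a, mode=CompressWeightsMode.INT4_SYM, ratio=0.8,
    dataset=nncf_dataset, advanced_parameters=advanced,
)

# second run with a different ratio: statistics are loaded from "statistics"
# instead of being recalculated
candidate_b = compress_weights(
    model_b, mode=CompressWeightsMode.INT4_SYM, ratio=1.0,
    dataset=nncf_dataset, advanced_parameters=advanced,
)
```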

### Evaluation results

#### Data-free Mixed-Precision on Lambada OpenAI dataset
