# GPTQ documentation (#2735)
### Changes

Added GPTQ documentation

### Related tickets

126887

### Tests

N/A
alexsu52 authored Jun 13, 2024
Commit b1cc78e (parent 85b3263)
Showing 1 changed file with 35 additions and 7 deletions:
docs/usage/post_training_compression/weights_compression/Usage.md
@@ -61,11 +61,9 @@
nncf_dataset = nncf.Dataset(data_source, transform_fn)
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM, ratio=0.8, dataset=nncf_dataset) # model is openvino.Model object
```

- The accuracy of 4-bit compressed models can also be improved by applying the AWQ, Scale Estimation, or GPTQ algorithm on top of the data-based mixed-precision algorithm. These algorithms work by equalizing a subset of weights to minimize the difference between the original precision and the 4-bit precision. The AWQ algorithm can be used in conjunction with either Scale Estimation or GPTQ. However, Scale Estimation and GPTQ are mutually exclusive and cannot be used together. Below are examples demonstrating how to enable the AWQ, Scale Estimation, or GPTQ algorithms:

Prepare the calibration dataset for data-based algorithms:

```python
from datasets import load_dataset
@@ -114,15 +112,27 @@
dataset = dataset.filter(lambda example: len(example["text"]) > 80)
input_shapes = get_input_shapes(model)
nncf_dataset = Dataset(dataset, partial(transform_func, tokenizer=tokenizer,
                                        input_shapes=input_shapes))
```
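The collapsed portion of the diff hides the definitions of `transform_func` and `get_input_shapes`. A minimal sketch of what such helpers could look like for an Optimum-Intel model wrapping an `openvino.Model` is shown below; the helper bodies here are assumptions for illustration, not the actual definitions from Usage.md.

```python
# Hypothetical helpers, assumed for illustration only -- the real definitions
# are in the part of the file collapsed by the diff view above.
def get_input_shapes(model, batch_size=1):
    # Read static input shapes from the wrapped openvino.Model,
    # fixing the batch dimension to `batch_size`.
    shapes = {}
    for inp in model.model.inputs:
        shape = list(inp.partial_shape.get_min_shape())
        shape[0] = batch_size
        shapes[inp.get_any_name()] = shape
    return shapes

def transform_func(example, tokenizer, input_shapes):
    # Tokenize one calibration sample and pad it to the static
    # sequence length expected by the model.
    seq_len = input_shapes["input_ids"][-1]
    tokens = tokenizer(
        example["text"],
        max_length=seq_len,
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )
    return {
        "input_ids": tokens["input_ids"],
        "attention_mask": tokens["attention_mask"],
    }
```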

- How to compress 80% of layers to 4-bit integer with the default data-based mixed-precision algorithm and AWQ combined with Scale Estimation: in addition to the data-based mixed-precision algorithm, set `awq` to `True` and `scale_estimation` to `True`.

```python
model.model = compress_weights(model.model,
                               mode=CompressWeightsMode.INT4_SYM,
                               ratio=0.8,
                               dataset=nncf_dataset,
                               awq=True,
                               scale_estimation=True)
```

- How to compress 80% of layers to 4-bit integer with the default data-based mixed-precision algorithm and GPTQ: in addition to the data-based mixed-precision algorithm, set `gptq` to `True`.

```python
model.model = compress_weights(model.model,
                               mode=CompressWeightsMode.INT4_SYM,
                               ratio=0.8,
                               dataset=nncf_dataset,
                               gptq=True)
```
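Since AWQ can be combined with GPTQ (unlike Scale Estimation, which is mutually exclusive with GPTQ), both options can be enabled in a single call. A sketch following the same API as the examples above; this combined example is not part of the diff itself:

```python
# Data-based mixed precision for 80% of layers, with AWQ and GPTQ enabled together.
model.model = compress_weights(model.model,
                               mode=CompressWeightsMode.INT4_SYM,
                               ratio=0.8,
                               dataset=nncf_dataset,
                               awq=True,
                               gptq=True)
```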

- `NF4` mode can be considered for improving accuracy, but currently models quantized to nf4 should not be faster models…
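The `NF4` bullet above is truncated by the diff view. For reference, a minimal data-free NF4 compression call might look like the following sketch, which assumes (as in the data-free examples above) that no calibration dataset is required for this mode:

```python
from nncf import compress_weights, CompressWeightsMode

# Compress all weights to the NF4 (4-bit NormalFloat) data type, data-free.
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4)
```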
@@ -396,7 +406,7 @@
This modification applies only for patterns `MatMul-Multiply-MatMul` (for example…
</table>

Here is the perplexity and accuracy with data-free and data-aware mixed-precision INT4-INT8 weight compression for different language models on the [lambada openai dataset](https://huggingface.co/datasets/EleutherAI/lambada_openai).
The `_scale` suffix refers to data-aware mixed-precision with the Scale Estimation algorithm. The `_gptq` suffix refers to data-aware mixed-precision with the GPTQ algorithm.
`r100` means that embeddings and lm_head have INT8 precision and all other linear layers have INT4 precision.
<table>
<tr bgcolor='#B4B5BB'>
@@ -411,6 +421,12 @@
<td>0.5925</td>
<td>6.3024</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs64_gptq</td>
<td>0.5676</td>
<td>7.2391</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs64_scale</td>
@@ -434,6 +450,12 @@
<td>int4_sym_r100_gs64_scale</td>
<td>0.595</td>
<td>7.037</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs64_gptq</td>
<td>0.567</td>
<td>8.6787</td>
</tr>
<tr>
<td></td>
@@ -453,6 +475,12 @@
<td>0.6736</td>
<td>4.4711</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs128_gptq</td>
<td>0.6513</td>
<td>4.8365</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs128</td>
