# GPTQ documentation (#2735)
### Changes

Added GPTQ documentation

### Related tickets

126887

### Tests

N/A
alexsu52 authored Jun 13, 2024
Commit b1cc78e (parent 85b3263)
Showing 1 changed file with 35 additions and 7 deletions:
docs/usage/post_training_compression/weights_compression/Usage.md
@@ -61,11 +61,9 @@
nncf_dataset = nncf.Dataset(data_source, transform_fn)
compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM, ratio=0.8, dataset=nncf_dataset) # model is openvino.Model object
```

- The accuracy of 4-bit compressed models can also be improved by applying the AWQ, Scale Estimation, or GPTQ algorithm on top of the data-based mixed-precision algorithm. These algorithms work by equalizing a subset of weights to minimize the difference between the original precision and the 4-bit precision. The AWQ algorithm can be used in conjunction with either Scale Estimation or GPTQ. However, Scale Estimation and GPTQ are mutually exclusive and cannot be used together. Below are examples demonstrating how to enable the AWQ, Scale Estimation, or GPTQ algorithms:

Prepare the calibration dataset for data-based algorithms:

```python
from datasets import load_dataset
@@ -114,15 +112,27 @@
dataset = dataset.filter(lambda example: len(example["text"]) > 80)
input_shapes = get_input_shapes(model)
nncf_dataset = Dataset(dataset, partial(transform_func, tokenizer=tokenizer,
                                        input_shapes=input_shapes))
```
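The collapsed portion of the diff hides the definitions of `transform_func` and `get_input_shapes`. A minimal sketch of what such helpers could look like for an Optimum-Intel model wrapping an `openvino.Model` is shown below; the helper bodies here are assumptions for illustration, not the actual definitions from Usage.md.

```python
# Hypothetical helpers, assumed for illustration only -- the real definitions
# are in the part of the file collapsed by the diff view above.
def get_input_shapes(model, batch_size=1):
    # Read static input shapes from the wrapped openvino.Model,
    # fixing the batch dimension to `batch_size`.
    shapes = {}
    for inp in model.model.inputs:
        shape = list(inp.partial_shape.get_min_shape())
        shape[0] = batch_size
        shapes[inp.get_any_name()] = shape
    return shapes

def transform_func(example, tokenizer, input_shapes):
    # Tokenize one calibration sample and pad it to the static
    # sequence length expected by the model.
    seq_len = input_shapes["input_ids"][-1]
    tokens = tokenizer(
        example["text"],
        max_length=seq_len,
        padding="max_length",
        truncation=True,
        return_tensors="np",
    )
    return {
        "input_ids": tokens["input_ids"],
        "attention_mask": tokens["attention_mask"],
    }
```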

- How to compress 80% of layers to 4-bit integer with the default data-based mixed-precision algorithm and AWQ combined with Scale Estimation: in addition to the data-based mixed-precision algorithm, set `awq` to `True` and `scale_estimation` to `True`.

```python
model.model = compress_weights(model.model,
                               mode=CompressWeightsMode.INT4_SYM,
                               ratio=0.8,
                               dataset=nncf_dataset,
                               awq=True,
                               scale_estimation=True)
```

- How to compress 80% of layers to 4-bit integer with the default data-based mixed-precision algorithm and GPTQ: in addition to the data-based mixed-precision algorithm, set `gptq` to `True`.

```python
model.model = compress_weights(model.model,
                               mode=CompressWeightsMode.INT4_SYM,
                               ratio=0.8,
                               dataset=nncf_dataset,
                               gptq=True)
```
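Since AWQ can be combined with GPTQ (unlike Scale Estimation, which is mutually exclusive with GPTQ), both options can be enabled in a single call. A sketch following the same API as the examples above; this combined example is not part of the diff itself:

```python
# Data-based mixed precision for 80% of layers, with AWQ and GPTQ enabled together.
model.model = compress_weights(model.model,
                               mode=CompressWeightsMode.INT4_SYM,
                               ratio=0.8,
                               dataset=nncf_dataset,
                               awq=True,
                               gptq=True)
```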

- `NF4` mode can be considered for improving accuracy, but currently models quantized to nf4 should not be faster models…
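The `NF4` bullet above is truncated by the diff view. For reference, a minimal data-free NF4 compression call might look like the following sketch, which assumes (as in the data-free examples above) that no calibration dataset is required for this mode:

```python
from nncf import compress_weights, CompressWeightsMode

# Compress all weights to the NF4 (4-bit NormalFloat) data type, data-free.
compressed_model = compress_weights(model, mode=CompressWeightsMode.NF4)
```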
@@ -396,7 +406,7 @@
This modification applies only for patterns `MatMul-Multiply-MatMul` (for example…
</table>

Here is the perplexity and accuracy with data-free and data-aware mixed-precision INT4-INT8 weight compression for different language models on the [lambada openai dataset](https://huggingface.co/datasets/EleutherAI/lambada_openai).
The `_scale` suffix refers to data-aware mixed-precision with the Scale Estimation algorithm. The `_gptq` suffix refers to data-aware mixed-precision with the GPTQ algorithm.
`r100` means that embeddings and lm_head have INT8 precision and all other linear layers have INT4 precision.
<table>
<tr bgcolor='#B4B5BB'>
@@ -411,6 +421,12 @@
<td>0.5925</td>
<td>6.3024</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs64_gptq</td>
<td>0.5676</td>
<td>7.2391</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs64_scale</td>
@@ -434,6 +450,12 @@
<td>int4_sym_r100_gs64_scale</td>
<td>0.595</td>
<td>7.037</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs64_gptq</td>
<td>0.567</td>
<td>8.6787</td>
</tr>
<tr>
<td></td>
@@ -453,6 +475,12 @@
<td>0.6736</td>
<td>4.4711</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs128_gptq</td>
<td>0.6513</td>
<td>4.8365</td>
</tr>
<tr>
<td></td>
<td>int4_sym_r100_gs128</td>
