From b1cc78e3afc58a4f0d3afd4e68a4db8992a6ba7c Mon Sep 17 00:00:00 2001
From: Alexander Suslov
Date: Thu, 13 Jun 2024 18:05:58 +0400
Subject: [PATCH] GPTQ documentation (#2735)

### Changes

Added GPTQ documentation

### Related tickets

126887

### Tests

N/A
---
 .../weights_compression/Usage.md | 42 +++++++++++++++----
 1 file changed, 35 insertions(+), 7 deletions(-)

diff --git a/docs/usage/post_training_compression/weights_compression/Usage.md b/docs/usage/post_training_compression/weights_compression/Usage.md
index 8d2f56143f5..528ed46ed15 100644
--- a/docs/usage/post_training_compression/weights_compression/Usage.md
+++ b/docs/usage/post_training_compression/weights_compression/Usage.md
@@ -61,11 +61,9 @@ nncf_dataset = nncf.Dataset(data_source, transform_fn)
 compressed_model = compress_weights(model, mode=CompressWeightsMode.INT4_SYM, ratio=0.8, dataset=nncf_dataset) # model is openvino.Model object
 ```
 
-- Accuracy of the 4-bit compressed models also can be improved by using AWQ algorithm or Scale Estimation algorithm over data-based mixed-precision algorithm. It is capable to equalize some subset of weights to minimize difference between
-original precision and 4-bit.
-Below is the example how to compress 80% of layers to 4-bit integer with a default data-based mixed precision algorithm and AWQ with Scale Estimation.
-It requires to set `awq` to `True` and `scale_estimation` to `True` additionally to data-based mixed-precision algorithm.
-Both algorithms, AWQ and Scale Estimation, can be enabled together or separately.
+- The accuracy of 4-bit compressed models can also be improved by applying the AWQ, Scale Estimation, or GPTQ algorithm on top of the data-based mixed-precision algorithm. These algorithms equalize a subset of weights to minimize the difference between the original precision and the 4-bit precision. The AWQ algorithm can be used in conjunction with either Scale Estimation or GPTQ; however, Scale Estimation and GPTQ are mutually exclusive and cannot be used together. Below are examples demonstrating how to enable the AWQ, Scale Estimation, or GPTQ algorithms:
+
+  Prepare the calibration dataset for the data-based algorithms:
 
 ```python
 from datasets import load_dataset
@@ -114,15 +112,27 @@ dataset = dataset.filter(lambda example: len(example["text"]) > 80)
 input_shapes = get_input_shapes(model)
 nncf_dataset = Dataset(dataset, partial(transform_func, tokenizer=tokenizer, input_shapes=input_shapes))
+```
+
+- How to compress 80% of the layers to 4-bit integers with the default data-based mixed-precision algorithm plus AWQ and Scale Estimation: set `awq` to `True` and `scale_estimation` to `True` in addition to the data-based mixed-precision options.
 
+```python
 model.model = compress_weights(model.model,
                                mode=CompressWeightsMode.INT4_SYM,
                                ratio=0.8,
                                dataset=nncf_dataset,
                                awq=True,
                                scale_estimation=True)
+```
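+
+  AWQ can also be enabled on its own, without Scale Estimation or GPTQ. A minimal sketch, assuming the same `nncf_dataset` as above:
+
+```python
+# A sketch of AWQ used separately: only `awq` is set, so no Scale Estimation
+# or GPTQ pass runs on top of the data-based mixed-precision assignment.
+model.model = compress_weights(model.model,
+                               mode=CompressWeightsMode.INT4_SYM,
+                               ratio=0.8,
+                               dataset=nncf_dataset,
+                               awq=True)
+```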
 
-model.save_pretrained(...)
+- How to compress 80% of the layers to 4-bit integers with the default data-based mixed-precision algorithm plus GPTQ: set `gptq` to `True` in addition to the data-based mixed-precision options.
+
+```python
+model.model = compress_weights(model.model,
+                               mode=CompressWeightsMode.INT4_SYM,
+                               ratio=0.8,
+                               dataset=nncf_dataset,
+                               gptq=True)
 ```
 
 - `NF4` mode can be considered for improving accuracy, but currently models quantized to nf4 should not be faster models
@@ -396,7 +406,7 @@ This modification applies only for patterns `MatMul-Multiply-MatMul` (for exampl
 
 Here is the perplexity and accuracy with data-free and data-aware mixed-precision INT4-INT8 weight compression for different language models on the [lambada openai dataset](https://huggingface.co/datasets/EleutherAI/lambada_openai).
 
-`_scale` suffix refers to the data-aware mixed-precision with Scale Estimation algorithm.
+The `_scale` suffix refers to data-aware mixed precision with the Scale Estimation algorithm; the `_gptq` suffix refers to data-aware mixed precision with the GPTQ algorithm.
 
 `r100` means that embeddings and lm_head have INT8 precision and all other linear layers have INT4 precision.
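+
+For example, a row labeled `int4_sym_r100_gs64_gptq` would correspond to a call along these lines (a sketch; mapping the row label to these arguments is an assumption based on the naming convention above):
+
+```python
+model.model = compress_weights(model.model,
+                               mode=CompressWeightsMode.INT4_SYM,  # int4_sym
+                               ratio=1.0,                          # r100: all linear layers except embeddings/lm_head in INT4
+                               group_size=64,                      # gs64
+                               dataset=nncf_dataset,
+                               gptq=True)                          # _gptq suffix
+```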
0.5925 6.3024
int4_sym_r100_gs64_gptq0.56767.2391
int4_sym_r100_gs64_scaleint4_sym_r100_gs64_scale 0.595 7.037
int4_sym_r100_gs64_gptq0.5678.6787
0.6736 4.4711
int4_sym_r100_gs128_gptq0.65134.8365
int4_sym_r100_gs128