Update doc for client-usage and LWQ (#1947)
Signed-off-by: yiliu30 <[email protected]>
yiliu30 authored Jul 24, 2024
1 parent f253d35 commit d254d50
Showing 4 changed files with 30 additions and 15 deletions.
2 changes: 1 addition & 1 deletion README.md
@@ -28,7 +28,7 @@ support AMD CPU, ARM CPU, and NVidia GPU through ONNX Runtime with limited testing

## What's New
* [2024/07] Starting with the 3.0 release, the framework extension API is recommended for quantization.
* [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
* [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).

## Installation

27 changes: 26 additions & 1 deletion docs/source/3x/PT_WeightOnlyQuant.md
@@ -15,6 +15,7 @@ PyTorch Weight Only Quantization
- [HQQ](#hqq)
- [Specify Quantization Rules](#specify-quantization-rules)
- [Saving and Loading](#saving-and-loading)
- [Layer Wise Quantization](#layer-wise-quantization)
- [Efficient Usage on Client-Side](#efficient-usage-on-client-side)
- [Examples](#examples)

@@ -277,9 +278,33 @@ loaded_model = load(
) # Please note that the original_model parameter takes the original (unquantized) model.
```

## Layer Wise Quantization

As the size of LLMs continues to grow, loading the entire model onto a single GPU or into the RAM of a client machine becomes impractical. To address this challenge, we introduce Layer-wise Quantization (LWQ), a method that quantizes LLMs layer by layer or block by block. This approach significantly reduces memory consumption. The diagram below illustrates the LWQ process.

<img src="./imgs/lwq.png" width=780 height=429>

*Figure 1: The process of layer-wise quantization for a PyTorch model. Grey denotes empty parameters and blue denotes parameters that need to be quantized. Each rectangle inside the model represents one layer.*


Currently, we support LWQ for `RTN`, `AutoRound`, and `GPTQ`.

Here, we take the `RTN` algorithm as an example to demonstrate the usage of LWQ.

```python
from neural_compressor.torch.quantization import RTNConfig, convert, prepare
from neural_compressor.torch import load_empty_model

model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)
quant_config = RTNConfig(use_layer_wise=True)
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```
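
The same `use_layer_wise` flag is expected to carry over to the other supported algorithms. Below is a minimal sketch of the GPTQ variant; it assumes that `GPTQConfig` accepts `use_layer_wise` in the same way and that calibration is driven by a user-supplied `run_fn` over a placeholder `calibration_dataloader`, so adapt those pieces to your setup.

```python
from neural_compressor.torch import load_empty_model
from neural_compressor.torch.quantization import GPTQConfig, convert, prepare

model_state_dict_path = "/path/to/model/state/dict"
float_model = load_empty_model(model_state_dict_path)

# Assumption: GPTQConfig exposes the same layer-wise switch as RTNConfig.
quant_config = GPTQConfig(use_layer_wise=True)
prepared_model = prepare(float_model, quant_config)


# GPTQ requires calibration data: run a few batches through the prepared model.
def run_fn(model):
    for example_inputs in calibration_dataloader:  # placeholder dataloader, supply your own
        model(example_inputs)


run_fn(prepared_model)
quantized_model = convert(prepared_model)
```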

## Efficient Usage on Client-Side

For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md).
For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/source/3x/client_quant.md).


## Examples
16 changes: 3 additions & 13 deletions docs/3x/client_quant.md → docs/source/3x/client_quant.md
@@ -2,20 +2,15 @@ Quantization on Client
==========================================

1. [Introduction](#introduction)
2. [Get Started](#get-started) \
2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\
2.2 [Optimal Performance and Peak Memory Usage](#optimal-performance-and-peak-memory-usage)

2. [Get Started](#get-started)

## Introduction

For the `RTN`, `GPTQ`, and `Auto-Round` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.
For the `RTN` and `GPTQ` algorithms, we provide default algorithm configurations for different processor types (`client` and `server`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency.


## Get Started

### Get Default Algorithm Configuration

Here, we take the `RTN` algorithm as an example to demonstrate the usage on a client machine.

```python
# (example body collapsed in this diff view)
```

@@ -42,9 +37,4 @@ python main.py
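
Since the example itself is collapsed in this diff view, a minimal sketch of the client-side usage described in this section is given here. It assumes a `get_default_rtn_config` helper and a `processor_type` argument for selecting the lightweight client profile; the actual snippet in the file may differ.

```python
from neural_compressor.torch import load_empty_model
from neural_compressor.torch.quantization import convert, get_default_rtn_config, prepare

# Load the model structure without materializing the full weights in RAM.
float_model = load_empty_model("/path/to/model/state/dict")

# Assumption: request the default RTN configuration tuned for client processors.
quant_config = get_default_rtn_config(processor_type="client")

prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```
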
> [!TIP]
> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set `OMP_NUM_THREADS` explicitly. For processors with a hybrid architecture (both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`.
### Optimal Performance and Peak Memory Usage

Below are approximate performance and memory usage figures conducted on a client machine with 24 cores and 32GB of RAM. These figures provide a rough estimate for quick reference and may vary based on specific hardware and configurations.

- 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB.
- 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)): the quantization process takes about 20 seconds, with a peak memory usage of around 5GB.
RTN quantization is a quick process, finishing in tens of seconds and using several GB of RAM when working with 7B models, e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf). However, for higher accuracy, the GPTQ algorithm is recommended, but be prepared for a longer quantization time.
Binary file added docs/source/3x/imgs/lwq.png
