From 30a95bff3b45f5d8e75f185f501efdd6fec895ae Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 10 Jul 2024 15:05:55 +0800 Subject: [PATCH 01/18] add client quant doc Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 36 ++++++++++++++++++++++++++++++++++++ 1 file changed, 36 insertions(+) create mode 100644 docs/3x/client_quant.md diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md new file mode 100644 index 00000000000..ce8351d4894 --- /dev/null +++ b/docs/3x/client_quant.md @@ -0,0 +1,36 @@ +Quantization on Client + +========================================== + +1. [Introduction](#introduction) +2. [Support matrix](#supported-matrix) +3. [Get Started](#get-started) \ + 2.1 [Get default lightweight algorithm configuration for client]\ + 2.2 [Override the auto-detect result]\ + 2.3 [Set several environment variables for optimal performance] + +## Introduction +Currently, we supported different default algorithm configuration based on the type of machine for RTN, GPTQ, and Auto-Round on Pytorch framework. + +## Support matrix + + +## Get Started +### Get default algorithm configuration + +Currently, we detect the machine as server if one of below conditions meet, user can override it by setting the `processor_type` explicitly. + +```python +config_for_client = get_default_rtn_config(processor_type="client") +``` +### Compare the default configuration between client and server + + +### Set several environment variables for optimal performance +Takes [Intel® Core™ Ultra 7 Processor 155H](https://www.intel.com/content/www/us/en/products/sku/236847/intel-core-ultra-7-processor-155h-24m-cache-up-to-4-80-ghz/specifications.html) as example, it include 6 P-cores and 10 E-cores. Use `taskset` to bind task on all P-cores to achieve optimal performance. + +```bash +taskset -c 0-11 python ./main.py +``` + +> Note: To detect the E-cores and P-cores in Linux system, please refer [here](https://stackoverflow.com/a/71282744/23445462). From aa19ff09fd27973a32b54192a70a72b2ad0b402a Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 11 Jul 2024 19:31:40 +0800 Subject: [PATCH 02/18] udapte Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 7 +++---- 1 file changed, 3 insertions(+), 4 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index ce8351d4894..659bec73702 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -10,15 +10,14 @@ Quantization on Client 2.3 [Set several environment variables for optimal performance] ## Introduction -Currently, we supported different default algorithm configuration based on the type of machine for RTN, GPTQ, and Auto-Round on Pytorch framework. - -## Support matrix - +Currently, we supported different default algorithm configuration based on the type of processor for `RTN`, `GPTQ`, and `Auto-Round` on Pytorch framework. +We roughly divide processors into two categories, client and server, and provide the lightweight configuration for client. ## Get Started ### Get default algorithm configuration Currently, we detect the machine as server if one of below conditions meet, user can override it by setting the `processor_type` explicitly. 
+- ```python config_for_client = get_default_rtn_config(processor_type="client") From 1f03ad513547705f187a6cb5aa44a163d0a8939b Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 11 Jul 2024 19:55:18 +0800 Subject: [PATCH 03/18] refine doc Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 38 +++++++++++++++++++++++--------------- 1 file changed, 23 insertions(+), 15 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 659bec73702..6fd3c9ef127 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -3,33 +3,41 @@ Quantization on Client ========================================== 1. [Introduction](#introduction) -2. [Support matrix](#supported-matrix) 3. [Get Started](#get-started) \ - 2.1 [Get default lightweight algorithm configuration for client]\ - 2.2 [Override the auto-detect result]\ - 2.3 [Set several environment variables for optimal performance] - + 2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\ + 2.2 [Set Environment Variables for Optimal Performance](#set-environment-variables-for-optimal-performance) + + ## Introduction -Currently, we supported different default algorithm configuration based on the type of processor for `RTN`, `GPTQ`, and `Auto-Round` on Pytorch framework. -We roughly divide processors into two categories, client and server, and provide the lightweight configuration for client. + +Currently, we support different default algorithm configurations based on the type of processor for `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework. We roughly divide processors into two categories, client and server, and provide a lightweight configuration for clients. ## Get Started -### Get default algorithm configuration +### Get Default Algorithm Configuration + +Users can get the default algorithm configuration by passing the processor_type explicitly to the get default configuration API, or leave it empty, and we will return the appropriate configuration according to the hardware information. Currently, the machine is detected as a server if one of the following conditions is met: + +- If there is more than one sockets +- If the brand name includes `Xeon` +- If the DRAM size is greater than 32GB + + +> The last condition may not be very accurate, but models greater than 7B generally need more than 32GB, and we assume that the user won't try these models on a client machine. -Currently, we detect the machine as server if one of below conditions meet, user can override it by setting the `processor_type` explicitly. -- +Below is an example to get the default configuration of RTN. ```python +config_by_auto_detect = get_default_rtn_config() config_for_client = get_default_rtn_config(processor_type="client") +config_for_server = get_default_rtn_config(processor_type="server") ``` -### Compare the default configuration between client and server +### Set Environment Variables for Optimal Performance -### Set several environment variables for optimal performance -Takes [Intel® Core™ Ultra 7 Processor 155H](https://www.intel.com/content/www/us/en/products/sku/236847/intel-core-ultra-7-processor-155h-24m-cache-up-to-4-80-ghz/specifications.html) as example, it include 6 P-cores and 10 E-cores. Use `taskset` to bind task on all P-cores to achieve optimal performance. +To achieve optimal performance, we need to set the right environment variables. 
For example, [Intel® Core™ Ultra 7 Processor 155H](https://www.intel.com/content/www/us/en/products/sku/236847/intel-core-ultra-7-processor-155h-24m-cache-up-to-4-80-ghz/specifications.html) includes 6 P-cores and 10 E-cores. Use `taskset` to bind tasks on all P-cores to achieve optimal performance. ```bash -taskset -c 0-11 python ./main.py +OMP_NUM_THREADS=12 taskset -c 0-11 python ./main.py ``` -> Note: To detect the E-cores and P-cores in Linux system, please refer [here](https://stackoverflow.com/a/71282744/23445462). +> Note: To detect the E-cores and P-cores on a Linux system, please refer [this](https://stackoverflow.com/a/71282744/23445462). From d739e48d0991feb6e9e573757afe5d6438351e13 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 11 Jul 2024 19:58:38 +0800 Subject: [PATCH 04/18] format Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 2 -- 1 file changed, 2 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 6fd3c9ef127..cf5fd5f7f24 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -1,7 +1,5 @@ Quantization on Client - ========================================== - 1. [Introduction](#introduction) 3. [Get Started](#get-started) \ 2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\ From 6ccbed364922dfda5ac386d66695b74a704d5157 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 11 Jul 2024 19:59:50 +0800 Subject: [PATCH 05/18] correct typo Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index cf5fd5f7f24..451e49022db 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -13,7 +13,7 @@ Currently, we support different default algorithm configurations based on the ty ## Get Started ### Get Default Algorithm Configuration -Users can get the default algorithm configuration by passing the processor_type explicitly to the get default configuration API, or leave it empty, and we will return the appropriate configuration according to the hardware information. Currently, the machine is detected as a server if one of the following conditions is met: +Users can get the default algorithm configuration by passing the `processor_type` explicitly to the get configuration API, or leave it empty, and we will return the appropriate configuration according to the hardware information. Currently, the machine is detected as a server if one of the following conditions is met: - If there is more than one sockets - If the brand name includes `Xeon` From 2308c63b5a9f9b240282cbff90ec2f5acf25ef7b Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Thu, 11 Jul 2024 20:00:49 +0800 Subject: [PATCH 06/18] correct typo Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 451e49022db..3326000135b 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -20,7 +20,7 @@ Users can get the default algorithm configuration by passing the `processor_type - If the DRAM size is greater than 32GB -> The last condition may not be very accurate, but models greater than 7B generally need more than 32GB, and we assume that the user won't try these models on a client machine. +> The last condition may not be very accurate, but models greater than 7B generally need more than 32GB DRAM, and we assume that the user won't try these models on a client machine. 
Below is an example to get the default configuration of RTN. From 0ee78cc81078d5d8f07dfd2057a6bd266ccb2bd8 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 15 Jul 2024 16:49:48 +0800 Subject: [PATCH 07/18] udpate win usage Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 36 ++++++++++++++++++++++++++++-------- 1 file changed, 28 insertions(+), 8 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 3326000135b..2553caaacba 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -3,24 +3,25 @@ Quantization on Client 1. [Introduction](#introduction) 3. [Get Started](#get-started) \ 2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\ - 2.2 [Set Environment Variables for Optimal Performance](#set-environment-variables-for-optimal-performance) + 2.2 [Optimal Performance](#optimal-performance) ## Introduction -Currently, we support different default algorithm configurations based on the type of processor for `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework. We roughly divide processors into two categories, client and server, and provide a lightweight configuration for clients. +Currently, we support different default algorithm configurations based on the type of processor type of machine for `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework. Processors are roughly categorized into client and server types, with a lightweight configuration provided for a machine with client processors. + ## Get Started ### Get Default Algorithm Configuration -Users can get the default algorithm configuration by passing the `processor_type` explicitly to the get configuration API, or leave it empty, and we will return the appropriate configuration according to the hardware information. Currently, the machine is detected as a server if one of the following conditions is met: +To obtain the default algorithm configuration, users can either specify the `processor_type` explicitly when calling the configuration API or leave it unspecified. In the latter case, we will automatically determine the appropriate configuration based on hardware information. A machine is identified as a server if it meets one of the following criteria: - If there is more than one sockets - If the brand name includes `Xeon` - If the DRAM size is greater than 32GB - -> The last condition may not be very accurate, but models greater than 7B generally need more than 32GB DRAM, and we assume that the user won't try these models on a client machine. +> [!TIP] +> The last criterion may not always be accurate, but models larger than 7B typically require more than 32GB DRAM. We assume that users won't run these models on client machines. Below is an example to get the default configuration of RTN. @@ -30,12 +31,31 @@ config_for_client = get_default_rtn_config(processor_type="client") config_for_server = get_default_rtn_config(processor_type="server") ``` -### Set Environment Variables for Optimal Performance +### Optimal Performance + + +> [!CAUTION] +> Please use `neural_compressor.torch.load_empty_model` to initialize a empty model to reduce the memory usage. + +#### Windows +On Windows machines, it is recommended to run the application directly. The system will automatically utilize all available cores. + +```bash +python ./main.py +``` +> [!TIP] +> For 7B models, like [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), the quantization process takes about 65 seconds and the peak memory usage is about 6GB. 
+ +> For 1.5B models, like [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), the quantization process takes about 20 seconds and the peak memory usage is about 5GB. + +### Linux -To achieve optimal performance, we need to set the right environment variables. For example, [Intel® Core™ Ultra 7 Processor 155H](https://www.intel.com/content/www/us/en/products/sku/236847/intel-core-ultra-7-processor-155h-24m-cache-up-to-4-80-ghz/specifications.html) includes 6 P-cores and 10 E-cores. Use `taskset` to bind tasks on all P-cores to achieve optimal performance. +For optimal performance on Linux systems, configure the environment variables appropriately. For instance. For example, the 12th Generation and later processors, which is Hybrid Architecture include both P-cores and E-Cores. It is recommended to run the example with all of P-cores to achieve optimal performance. ```bash +# e.g. for Intel® Core™ Ultra 7 Processor 155H, it includes 6 P-cores and 10 E-cores OMP_NUM_THREADS=12 taskset -c 0-11 python ./main.py ``` -> Note: To detect the E-cores and P-cores on a Linux system, please refer [this](https://stackoverflow.com/a/71282744/23445462). +> [!NOTE]: +> To identify E-cores and P-cores on a Linux system,, please refer [this](https://stackoverflow.com/a/71282744/23445462). From b6c83c1e44f950835522489cac4305e157cc3808 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 15 Jul 2024 16:54:32 +0800 Subject: [PATCH 08/18] correct typo Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 13 ++++++------- 1 file changed, 6 insertions(+), 7 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 2553caaacba..9591d4fbe82 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -8,7 +8,7 @@ Quantization on Client ## Introduction -Currently, we support different default algorithm configurations based on the type of processor type of machine for `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework. Processors are roughly categorized into client and server types, with a lightweight configuration provided for a machine with client processors. +Currently, we support different default algorithm configurations based on the type of processor type for `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework. Processors are roughly categorized into client and server types, with a lightweight configuration provided for a machine with client processors. ## Get Started @@ -33,10 +33,6 @@ config_for_server = get_default_rtn_config(processor_type="server") ### Optimal Performance - -> [!CAUTION] -> Please use `neural_compressor.torch.load_empty_model` to initialize a empty model to reduce the memory usage. - #### Windows On Windows machines, it is recommended to run the application directly. The system will automatically utilize all available cores. @@ -45,12 +41,11 @@ python ./main.py ``` > [!TIP] > For 7B models, like [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), the quantization process takes about 65 seconds and the peak memory usage is about 6GB. - > For 1.5B models, like [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), the quantization process takes about 20 seconds and the peak memory usage is about 5GB. ### Linux -For optimal performance on Linux systems, configure the environment variables appropriately. For instance. For example, the 12th Generation and later processors, which is Hybrid Architecture include both P-cores and E-Cores. 
It is recommended to run the example with all of P-cores to achieve optimal performance. +On Linux machines, users need configure the environment variables appropriately. For example, the 12th Generation and later processors, which is Hybrid Architecture include both P-cores and E-Cores. It is recommended to run the example with all of P-cores to achieve optimal performance. ```bash # e.g. for Intel® Core™ Ultra 7 Processor 155H, it includes 6 P-cores and 10 E-cores @@ -59,3 +54,7 @@ OMP_NUM_THREADS=12 taskset -c 0-11 python ./main.py > [!NOTE]: > To identify E-cores and P-cores on a Linux system,, please refer [this](https://stackoverflow.com/a/71282744/23445462). + + +> [!CAUTION] +> Please use `neural_compressor.torch.load_empty_model` to initialize a empty model to reduce the memory usage. From 2ef57c505b74c8d6297b56607921fa4df1361fb6 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 15 Jul 2024 16:55:26 +0800 Subject: [PATCH 09/18] update Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 9591d4fbe82..124f78459fa 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -39,9 +39,9 @@ On Windows machines, it is recommended to run the application directly. The syst ```bash python ./main.py ``` -> [!TIP] -> For 7B models, like [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), the quantization process takes about 65 seconds and the peak memory usage is about 6GB. -> For 1.5B models, like [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), the quantization process takes about 20 seconds and the peak memory usage is about 5GB. +> [!NOTE] +> - For 7B models, like [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), the quantization process takes about 65 seconds and the peak memory usage is about 6GB. +> - For 1.5B models, like [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), the quantization process takes about 20 seconds and the peak memory usage is about 5GB. ### Linux From 08066d854f440ac7303214f79a770445cc662a63 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 15 Jul 2024 16:56:16 +0800 Subject: [PATCH 10/18] update Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 124f78459fa..20832594e1a 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -52,9 +52,10 @@ On Linux machines, users need configure the environment variables appropriately. OMP_NUM_THREADS=12 taskset -c 0-11 python ./main.py ``` -> [!NOTE]: +> [!NOTE] > To identify E-cores and P-cores on a Linux system,, please refer [this](https://stackoverflow.com/a/71282744/23445462). + > [!CAUTION] > Please use `neural_compressor.torch.load_empty_model` to initialize a empty model to reduce the memory usage. 
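The caution above recommends `neural_compressor.torch.load_empty_model` but does not yet show it in context. A minimal sketch of how it fits into the RTN flow, assuming the `prepare`/`convert` APIs from `neural_compressor.torch.quantization` that later patches in this series adopt, and a placeholder checkpoint path, could look like:

```python
from neural_compressor.torch.quantization import get_default_rtn_config, prepare, convert
from neural_compressor.torch import load_empty_model

# Initialize an empty (weight-free) model instead of materializing the full
# FP32 weights, which keeps peak memory low on client machines.
float_model = load_empty_model("/path/to/model/state/dict")  # placeholder path

quant_config = get_default_rtn_config()  # picks client or server defaults automatically
prepared_model = prepare(float_model, quant_config)
quantized_model = convert(prepared_model)
```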
From 1363f11ddd8859089c13082e0f1b1d2dc3bf3dce Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 15 Jul 2024 17:06:12 +0800 Subject: [PATCH 11/18] update Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 29 ++++++++++++++--------------- 1 file changed, 14 insertions(+), 15 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 20832594e1a..9b3a070aa2b 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -8,22 +8,22 @@ Quantization on Client ## Introduction -Currently, we support different default algorithm configurations based on the type of processor type for `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework. Processors are roughly categorized into client and server types, with a lightweight configuration provided for a machine with client processors. +We offer default algorithm configurations tailored for different processor types for `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework. Processors are roughly categorized into client and server types, with a lightweight configuration specifically designed for client machines. ## Get Started ### Get Default Algorithm Configuration -To obtain the default algorithm configuration, users can either specify the `processor_type` explicitly when calling the configuration API or leave it unspecified. In the latter case, we will automatically determine the appropriate configuration based on hardware information. A machine is identified as a server if it meets one of the following criteria: +To obtain the default algorithm configuration, users can either specify the `processor_type` explicitly when calling the configuration API or leave it empty. In the latter case, we will automatically determine the appropriate configuration based on hardware information. A machine is identified as a server if it meets one of the following criteria: -- If there is more than one sockets -- If the brand name includes `Xeon` -- If the DRAM size is greater than 32GB +- It has more than one socket. +- If the brand name includes `Xeon`. +- If the DRAM size is exceeds 32GB. > [!TIP] -> The last criterion may not always be accurate, but models larger than 7B typically require more than 32GB DRAM. We assume that users won't run these models on client machines. +> The DRAM criterion may not always be accurate. However, models larger than 7B typically require more than 32GB of DRAM, and it is assumed that such models will not be used on client machines. -Below is an example to get the default configuration of RTN. +Here’s an example of how to get the default configuration for `RTN`: ```python config_by_auto_detect = get_default_rtn_config() @@ -34,28 +34,27 @@ config_for_server = get_default_rtn_config(processor_type="server") ### Optimal Performance #### Windows -On Windows machines, it is recommended to run the application directly. The system will automatically utilize all available cores. +On Windows machines, simply running the application will allow the system to utilize all available cores automatically. ```bash -python ./main.py +python main.py ``` > [!NOTE] -> - For 7B models, like [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), the quantization process takes about 65 seconds and the peak memory usage is about 6GB. -> - For 1.5B models, like [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), the quantization process takes about 20 seconds and the peak memory usage is about 5GB. 
+> - For 7B models, such as [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), the quantization process takes about 65 seconds, with a peak memory usage of around 6GB. +> - For 1.5B models, like [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), the quantization process takes about 20 seconds, with a peak memory usage of around 5GB. ### Linux -On Linux machines, users need configure the environment variables appropriately. For example, the 12th Generation and later processors, which is Hybrid Architecture include both P-cores and E-Cores. It is recommended to run the example with all of P-cores to achieve optimal performance. +On Linux systems, you need to configure the environment variables appropriately to achieve optimal performance. For instance, with Intel 12th generation and later processors featuring hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores. ```bash # e.g. for Intel® Core™ Ultra 7 Processor 155H, it includes 6 P-cores and 10 E-cores -OMP_NUM_THREADS=12 taskset -c 0-11 python ./main.py +OMP_NUM_THREADS=12 taskset -c 0-11 python main.py ``` > [!NOTE] > To identify E-cores and P-cores on a Linux system,, please refer [this](https://stackoverflow.com/a/71282744/23445462). - > [!CAUTION] -> Please use `neural_compressor.torch.load_empty_model` to initialize a empty model to reduce the memory usage. +> Please use `neural_compressor.torch.load_empty_model` to initialize a empty model and reduce the memory usage. From 8ad38f9f1918e2183cd608cd6dba6ec300367013 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 15 Jul 2024 17:16:29 +0800 Subject: [PATCH 12/18] typo Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 9b3a070aa2b..b7dfb3997cb 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -53,7 +53,7 @@ OMP_NUM_THREADS=12 taskset -c 0-11 python main.py ``` > [!NOTE] -> To identify E-cores and P-cores on a Linux system,, please refer [this](https://stackoverflow.com/a/71282744/23445462). +> To identify E-cores and P-cores on a Linux system, please refer [this](https://stackoverflow.com/a/71282744/23445462). > [!CAUTION] From 60367e66eb0327ceb28f8290359bc637a332702c Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Mon, 15 Jul 2024 18:36:10 +0800 Subject: [PATCH 13/18] update Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 12 ++++++------ 1 file changed, 6 insertions(+), 6 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index b7dfb3997cb..008a4a41320 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -3,12 +3,12 @@ Quantization on Client 1. [Introduction](#introduction) 3. [Get Started](#get-started) \ 2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\ - 2.2 [Optimal Performance](#optimal-performance) + 2.2 [Get Optimal Performance](#get-optimal-performance) ## Introduction -We offer default algorithm configurations tailored for different processor types for `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework. Processors are roughly categorized into client and server types, with a lightweight configuration specifically designed for client machines. +For `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework, we offer default algorithm configurations tailored for different processor types. 
Processors are roughly categorized into client and server types, with a lightweight configuration specifically designed for client machines. ## Get Started @@ -17,8 +17,8 @@ We offer default algorithm configurations tailored for different processor types To obtain the default algorithm configuration, users can either specify the `processor_type` explicitly when calling the configuration API or leave it empty. In the latter case, we will automatically determine the appropriate configuration based on hardware information. A machine is identified as a server if it meets one of the following criteria: - It has more than one socket. -- If the brand name includes `Xeon`. -- If the DRAM size is exceeds 32GB. +- Its brand name includes `Xeon`. +- Its DRAM size is exceeds 32GB. > [!TIP] > The DRAM criterion may not always be accurate. However, models larger than 7B typically require more than 32GB of DRAM, and it is assumed that such models will not be used on client machines. @@ -31,10 +31,10 @@ config_for_client = get_default_rtn_config(processor_type="client") config_for_server = get_default_rtn_config(processor_type="server") ``` -### Optimal Performance +### Get Optimal Performance #### Windows -On Windows machines, simply running the application will allow the system to utilize all available cores automatically. +On Windows machines, simply running the program will allow the system to utilize all available cores automatically. ```bash python main.py From daf37bb173d7e7fd8b3b41d3c1d3f563e7848f7a Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 17 Jul 2024 12:39:58 +0800 Subject: [PATCH 14/18] update doc Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 57 +++++++++++++++++------------------------ 1 file changed, 24 insertions(+), 33 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 008a4a41320..2154c452a54 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -1,60 +1,51 @@ Quantization on Client ========================================== + 1. [Introduction](#introduction) -3. [Get Started](#get-started) \ - 2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\ - 2.2 [Get Optimal Performance](#get-optimal-performance) +2. [Get Started](#get-started) \ + 2.1 [Get Default Algorithm Configuration](#get-default-algorithm-configuration)\ + 2.2 [Optimal Performance and Peak Memory Usage](#optimal-performance-and-peak-memory-usage) ## Introduction -For `RTN`, `GPTQ`, and `Auto-Round` on the PyTorch framework, we offer default algorithm configurations tailored for different processor types. Processors are roughly categorized into client and server types, with a lightweight configuration specifically designed for client machines. +For `RTN`, `GPTQ`, and `Auto-Round`, we offer default algorithm configurations tailored for different processor types (`client` and `sever`). These configurations are specifically designed for both client and server machines, with a lightweight setup optimized for client devices. ## Get Started -### Get Default Algorithm Configuration - -To obtain the default algorithm configuration, users can either specify the `processor_type` explicitly when calling the configuration API or leave it empty. In the latter case, we will automatically determine the appropriate configuration based on hardware information. A machine is identified as a server if it meets one of the following criteria: -- It has more than one socket. -- Its brand name includes `Xeon`. -- Its DRAM size is exceeds 32GB. 
- -> [!TIP] -> The DRAM criterion may not always be accurate. However, models larger than 7B typically require more than 32GB of DRAM, and it is assumed that such models will not be used on client machines. +### Get Default Algorithm Configuration -Here’s an example of how to get the default configuration for `RTN`: +We take the `RTN` algorithm as example to demonstrate the usage on a client machine. ```python -config_by_auto_detect = get_default_rtn_config() -config_for_client = get_default_rtn_config(processor_type="client") -config_for_server = get_default_rtn_config(processor_type="server") +from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare +from neural_compressor.torch.utils import load_empty_model + +model_state_dict_path = "/path/to/model/state/dict" +float_model = load_empty_model(model_state_dict_path) +quant_config = get_default_rtn_config() +prepared_model = prepare(float_model, quant_config) +quantized_model = convert(prepared_model) ``` -### Get Optimal Performance +> [!TIP] +> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as `client` or `server`. + -#### Windows -On Windows machines, simply running the program will allow the system to utilize all available cores automatically. +For Windows machines, simply run the program to automatically utilize all available cores: ```bash python main.py ``` -> [!NOTE] -> - For 7B models, such as [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf), the quantization process takes about 65 seconds, with a peak memory usage of around 6GB. -> - For 1.5B models, like [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct), the quantization process takes about 20 seconds, with a peak memory usage of around 5GB. -### Linux -On Linux systems, you need to configure the environment variables appropriately to achieve optimal performance. For instance, with Intel 12th generation and later processors featuring hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores. +### Optimal Performance and Peak Memory Usage -```bash -# e.g. for Intel® Core™ Ultra 7 Processor 155H, it includes 6 P-cores and 10 E-cores -OMP_NUM_THREADS=12 taskset -c 0-11 python main.py -``` -> [!NOTE] -> To identify E-cores and P-cores on a Linux system, please refer [this](https://stackoverflow.com/a/71282744/23445462). +- 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB. +- 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)), the quantization process takes about 20 seconds, with a peak memory usage of around 5GB. -> [!CAUTION] -> Please use `neural_compressor.torch.load_empty_model` to initialize a empty model and reduce the memory usage. +> [!TIP] +> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set the `OMP_NUM_THREADS` explicitly, For processors with hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`. 
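The tip above describes the Linux setup only in prose. A concrete invocation, reusing the `OMP_NUM_THREADS`/`taskset` command shown earlier in this series, might be the following; the core IDs are machine-specific, and the sysfs paths are an assumption that holds on recent kernels for hybrid Intel CPUs:

```bash
# Assumption: recent Linux kernels expose the hybrid core split under sysfs.
cat /sys/devices/cpu_core/cpus   # P-core logical CPUs, e.g. 0-11
cat /sys/devices/cpu_atom/cpus   # E-core logical CPUs, e.g. 12-21

# Bind the run to the P-cores and match the OpenMP thread count to them.
OMP_NUM_THREADS=12 taskset -c 0-11 python main.py
```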
From fe01f735c3e163b324a30b5fb07820bbbc0e1441 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 17 Jul 2024 12:55:29 +0800 Subject: [PATCH 15/18] update the doc Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 17 +++++++++-------- 1 file changed, 9 insertions(+), 8 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index 2154c452a54..c9de16593c4 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -9,18 +9,18 @@ Quantization on Client ## Introduction -For `RTN`, `GPTQ`, and `Auto-Round`, we offer default algorithm configurations tailored for different processor types (`client` and `sever`). These configurations are specifically designed for both client and server machines, with a lightweight setup optimized for client devices. +For `RTN`, `GPTQ`, and `Auto-Round` algorithms, we provide default algorithm configurations for different processor types (`client` and `sever`). Generally, lightweight configurations are tailored specifically for client devices to enhance performance and efficiency. ## Get Started ### Get Default Algorithm Configuration -We take the `RTN` algorithm as example to demonstrate the usage on a client machine. +Here, we take the `RTN` algorithm as example to demonstrate the usage on a client machine. ```python from neural_compressor.torch.quantization import get_default_rtn_config, convert, prepare -from neural_compressor.torch.utils import load_empty_model +from neural_compressor.torch import load_empty_model model_state_dict_path = "/path/to/model/state/dict" float_model = load_empty_model(model_state_dict_path) @@ -30,15 +30,17 @@ quantized_model = convert(prepared_model) ``` > [!TIP] -> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as `client` or `server`. +> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server`. -For Windows machines, simply run the program to automatically utilize all available cores: +For Windows machines, run the following command to utilize all available cores automatically: ```bash python main.py ``` +> [!TIP] +> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set the `OMP_NUM_THREADS` explicitly. For processors with hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`. ### Optimal Performance and Peak Memory Usage @@ -46,6 +48,5 @@ python main.py - 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB. - 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)), the quantization process takes about 20 seconds, with a peak memory usage of around 5GB. - -> [!TIP] -> For Linux systems, users need to configure the environment variables appropriately to achieve optimal performance. For example, set the `OMP_NUM_THREADS` explicitly, For processors with hybrid architecture (including both P-cores and E-cores), it is recommended to bind tasks to all P-cores using `taskset`. +> [!NOTE] +> The above results are based on testing conducted on a machine with 24 cores and 32GB of RAM. 
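The tip above notes that `processor_type` can be passed explicitly, while the final snippet only shows the auto-detected form. For reference, the explicit variants used earlier in this series look like:

```python
from neural_compressor.torch.quantization import get_default_rtn_config

auto_config = get_default_rtn_config()  # auto-detects client vs. server hardware
client_config = get_default_rtn_config(processor_type="client")  # force the lightweight client defaults
server_config = get_default_rtn_config(processor_type="server")  # force the server defaults
```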
From 65ce1de439aefd86395a3fd6cb72eb014b58d74a Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 17 Jul 2024 13:07:38 +0800 Subject: [PATCH 16/18] add link Signed-off-by: yiliu30 --- README.md | 1 + docs/3x/PT_WeightOnlyQuant.md | 6 ++++++ 2 files changed, 7 insertions(+) diff --git a/README.md b/README.md index 91690432918..31772f4d025 100644 --- a/README.md +++ b/README.md @@ -26,6 +26,7 @@ In particular, the tool provides the key features, typical examples, and open co * Collaborate with cloud marketplaces such as [Google Cloud Platform](https://console.cloud.google.com/marketplace/product/bitnami-launchpad/inc-tensorflow-intel?project=verdant-sensor-286207), [Amazon Web Services](https://aws.amazon.com/marketplace/pp/prodview-yjyh2xmggbmga#pdp-support), and [Azure](https://azuremarketplace.microsoft.com/en-us/marketplace/apps/bitnami.inc-tensorflow-intel), software platforms such as [Alibaba Cloud](https://www.intel.com/content/www/us/en/developer/articles/technical/quantize-ai-by-oneapi-analytics-on-alibaba-cloud.html), [Tencent TACO](https://new.qq.com/rain/a/20221202A00B9S00) and [Microsoft Olive](https://github.com/microsoft/Olive), and open AI ecosystem such as [Hugging Face](https://huggingface.co/blog/intel), [PyTorch](https://pytorch.org/tutorials/recipes/intel_neural_compressor_for_pytorch.html), [ONNX](https://github.com/onnx/models#models), [ONNX Runtime](https://github.com/microsoft/onnxruntime), and [Lightning AI](https://github.com/Lightning-AI/lightning/blob/master/docs/source-pytorch/advanced/post_training_quantization.rst) ## What's New +* [2024/07] Performance optimizations and usability improvements on [client-side](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md). * [2024/03] A new SOTA approach [AutoRound](https://github.com/intel/auto-round) Weight-Only Quantization on [Intel Gaudi2 AI accelerator](https://habana.ai/products/gaudi2/) is available for LLMs. ## Installation diff --git a/docs/3x/PT_WeightOnlyQuant.md b/docs/3x/PT_WeightOnlyQuant.md index 5a84a2d3474..65d9367e9fa 100644 --- a/docs/3x/PT_WeightOnlyQuant.md +++ b/docs/3x/PT_WeightOnlyQuant.md @@ -15,6 +15,7 @@ PyTorch Weight Only Quantization - [HQQ](#hqq) - [Specify Quantization Rules](#specify-quantization-rules) - [Saving and Loading](#saving-and-loading) +- [Efficient Usage on Client-Side](#efficient-usage-on-client-side) - [Examples](#examples) ## Introduction @@ -276,6 +277,11 @@ loaded_model = load( ) # Please note that the original_model parameter passes the original model. ``` +## Efficient Usage on Client-Side + +For client machines with limited RAM and cores, we offer optimizations to reduce computational overhead and minimize memory usage. For detailed information, please refer to [Quantization on Client](https://github.com/intel/neural-compressor/blob/master/docs/3x/client_quant.md). + + ## Examples Users can also refer to [examples](https://github.com/intel/neural-compressor/blob/master/examples/3.x_api/pytorch/nlp/huggingface_models/language-modeling/quantization/weight_only) on how to quantize a model with WeightOnlyQuant. 
From a5c8fc33f96072b3565141cb6bf5cb72ac4a21a2 Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 17 Jul 2024 13:17:26 +0800 Subject: [PATCH 17/18] update Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index c9de16593c4..f11831062d5 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -30,7 +30,7 @@ quantized_model = convert(prepared_model) ``` > [!TIP] -> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server`. +> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server` when calling get_default_rtn_config.. For Windows machines, run the following command to utilize all available cores automatically: @@ -44,9 +44,7 @@ python main.py ### Optimal Performance and Peak Memory Usage +Below are approximate performance and memory usage figures conducted on a client machine with 24 cores and 32GB of RAM. These figures provide a rough estimate for quick reference and may vary based on specific hardware and configurations. - 7B models (e.g., [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)): the quantization process takes about 65 seconds, with a peak memory usage of around 6GB. - 1.5B models (e.g., [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct)), the quantization process takes about 20 seconds, with a peak memory usage of around 5GB. - -> [!NOTE] -> The above results are based on testing conducted on a machine with 24 cores and 32GB of RAM. From 0ab28e497c3bba90f2ead275e2602c481f7580fb Mon Sep 17 00:00:00 2001 From: yiliu30 Date: Wed, 17 Jul 2024 13:22:29 +0800 Subject: [PATCH 18/18] fix typo Signed-off-by: yiliu30 --- docs/3x/client_quant.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/3x/client_quant.md b/docs/3x/client_quant.md index f11831062d5..181834caf23 100644 --- a/docs/3x/client_quant.md +++ b/docs/3x/client_quant.md @@ -30,7 +30,7 @@ quantized_model = convert(prepared_model) ``` > [!TIP] -> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server` when calling get_default_rtn_config.. +> By default, the appropriate configuration is determined based on hardware information, but users can explicitly specify `processor_type` as either `client` or `server` when calling `get_default_rtn_config`. For Windows machines, run the following command to utilize all available cores automatically: