From 14dddfc0d667828a742cc53fc4e530b6755035c3 Mon Sep 17 00:00:00 2001 From: binbin Deng <108676127+plusbang@users.noreply.github.com> Date: Tue, 27 Aug 2024 12:44:58 +0800 Subject: [PATCH] Update NPU example readme (#11931) --- .../HF-Transformers-AutoModels/LLM/README.md | 67 +++++-------------- 1 file changed, 16 insertions(+), 51 deletions(-) diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md index 12bce0de868..52d71ed4c6b 100644 --- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md +++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md @@ -9,7 +9,7 @@ In this directory, you will find examples on how you could apply IPEX-LLM INT4 o | Llama3 | [meta-llama/Meta-Llama-3-8B-Instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) | | Chatglm3 | [THUDM/chatglm3-6b](https://huggingface.co/THUDM/chatglm3-6b) | | Chatglm2 | [THUDM/chatglm2-6b](https://huggingface.co/THUDM/chatglm2-6b) | -| Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct) | +| Qwen2 | [Qwen/Qwen2-7B-Instruct](https://huggingface.co/Qwen/Qwen2-7B-Instruct), [Qwen/Qwen2-1.5B-Instruct](https://huggingface.co/Qwen/Qwen2-1.5B-Instruct) | | MiniCPM | [openbmb/MiniCPM-2B-sft-bf16](https://huggingface.co/openbmb/MiniCPM-2B-sft-bf16) | | Phi-3 | [microsoft/Phi-3-mini-4k-instruct](https://huggingface.co/microsoft/Phi-3-mini-4k-instruct) | | Stablelm | [stabilityai/stablelm-zephyr-3b](https://huggingface.co/stabilityai/stablelm-zephyr-3b) | @@ -23,10 +23,8 @@ Go to https://www.intel.com/content/www/us/en/download/794734/intel-npu-driver-w Then go to **Device Manager**, find **Neural Processors** -> **Intel(R) AI Boost**. Right click and select **Update Driver**. And then manually select the folder unzipped from the driver. -## Example 1: Predict Tokens using `generate()` API -In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs. -### 1. Install -#### 1.1 Installation on Windows +## 1. Install +### 1.1 Installation on Windows We suggest using conda to manage environment: ```bash conda create -n llm python=3.10 @@ -36,9 +34,9 @@ conda activate llm pip install --pre --upgrade ipex-llm[npu] ``` -### 2. Runtime Configurations +## 2. Runtime Configurations For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. -#### 2.1 Configurations for Windows +### 2.1 Configurations for Windows > [!NOTE] > For optimal performance, we recommend running code in `conhost` rather than Windows Terminal: @@ -54,19 +52,20 @@ For optimal performance, it is recommended to set several environment variables. set BIGDL_USE_NPU=1 ``` -### 3. Running examples +## 3. Run models +In the example [generate.py](./generate.py), we show a basic use case for a Llama2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel NPUs. ``` python ./generate.py ``` Arguments info: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf` and `meta-llama/Llama-2-13b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. 
It is default to be `'meta-llama/Llama-2-7b-chat-hf'`, and more verified models please see the list in [Verified Models](#verified-models).
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Llama2 model (e.g. `meta-llama/Llama-2-7b-chat-hf`) to be downloaded, or the path to the huggingface checkpoint folder. It defaults to `'meta-llama/Llama-2-7b-chat-hf'`; for more verified models, please see the list in [Verified Models](#verified-models).
- `--prompt PROMPT`: argument defining the prompt to be inferred (with integrated prompt format for chat). It defaults to `'Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It defaults to `32`.
- `--load_in_low_bit`: argument defining the `load_in_low_bit` format used. It defaults to `sym_int8`; `sym_int4` can also be used.

-#### Sample Output
+### Sample Output
#### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf)

```log
@@ -77,48 +76,14 @@ Inference time: xxxx s
done
```

-## Example 2: Predict Tokens using `generate()` API using multi processes
-In the example [llama2.py](./llama2.py) and [qwen2.py](./qwen2.py), we show an experimental support for a Llama2 / Qwen2 model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimization and fused decoderlayer optimization on Intel NPUs.
-
-> [!IMPORTANT]
-> To run Qwen2 and Llama2 with IPEX-LLM on Intel NPUs, we recommend using version **32.0.100.2540** for the Intel NPU.
->
-> Go to https://www.intel.com/content/www/us/en/download/794734/825735/intel-npu-driver-windows.html to download and unzip the driver. Then follow the same steps on [Requirements](#0-requirements).
-
-### 1. Install
-#### 1.1 Installation on Windows
-We suggest using conda to manage environment:
-```bash
-conda create -n llm python=3.10
-conda activate llm
-
-# install ipex-llm with 'npu' option
-pip install --pre --upgrade ipex-llm[npu]
-```
-
-### 2. Runtime Configurations
-For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-#### 2.1 Configurations for Windows
-
-> [!NOTE]
-> For optimal performance, we recommend running code in `conhost` rather than Windows Terminal:
-> - Press Win+R and input `conhost`, then press Enter to launch `conhost`.
-> - Run following command to use conda in `conhost`. Replace `` with your conda install location.
-> ```
-> call \Scripts\activate
-> ```
-
-**Following envrionment variables are required**:
-
-```cmd
-set BIGDL_USE_NPU=1
-```
-
-### 3. Running examples
+## 4. Run Optimized Models (Experimental)
+The example below shows how to run the **_optimized model implementations_** on Intel NPU, including:
+- [Llama2-7B](./llama2.py)
+- [Qwen2-1.5B](./qwen2.py)

```
# to run Llama-2-7b-chat-hf
-python  llama2.py
+python llama2.py

# to run Qwen2-1.5B-Instruct
python qwen2.py
```
@@ -132,7 +97,7 @@ Arguments info:
- `--max-prompt-len MAX_PROMPT_LEN`: Defines the maximum number of tokens that the input prompt can contain. It defaults to `512`.
- `--disable-transpose-value-cache`: Disable the optimization of transposing value cache.

-### 4. 
Troubleshooting +### Troubleshooting If you encounter output problem, please try to disable the optimization of transposing value cache with following command: ```bash @@ -144,7 +109,7 @@ python qwen2.py --disable-transpose-value-cache ``` -#### Sample Output +### Sample Output #### [meta-llama/Llama-2-7b-chat-hf](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) ```log
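
The `generate.py` flow described in the patched README boils down to a short Python sequence: load the checkpoint through IPEX-LLM's NPU-oriented `AutoModelForCausalLM`, then call `generate()`. The snippet below is only a minimal illustrative sketch of that flow, assuming the `ipex_llm.transformers.npu_model` interface and reusing the argument names documented above; the actual `generate.py` in this directory may differ in its details.

```python
# Illustrative sketch only -- not the actual generate.py from this directory.
# Assumes ipex_llm.transformers.npu_model.AutoModelForCausalLM accepts a
# load_in_low_bit argument matching the README's --load_in_low_bit flag.
import argparse
import time

from transformers import AutoTokenizer
from ipex_llm.transformers.npu_model import AutoModelForCausalLM

parser = argparse.ArgumentParser(description="Predict tokens on an Intel NPU with IPEX-LLM")
parser.add_argument("--repo-id-or-model-path", type=str,
                    default="meta-llama/Llama-2-7b-chat-hf")
parser.add_argument("--prompt", type=str,
                    default="Once upon a time, there existed a little girl who liked to have adventures.")
parser.add_argument("--n-predict", type=int, default=32)
parser.add_argument("--load_in_low_bit", type=str, default="sym_int8")
args = parser.parse_args()

# Load the checkpoint with the requested low-bit format (sym_int8 or sym_int4).
model = AutoModelForCausalLM.from_pretrained(
    args.repo_id_or_model_path,
    load_in_low_bit=args.load_in_low_bit,
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(args.repo_id_or_model_path, trust_remote_code=True)

# Encode the prompt, generate up to --n-predict new tokens, and report timing.
input_ids = tokenizer.encode(args.prompt, return_tensors="pt")
start = time.time()
output = model.generate(input_ids, max_new_tokens=args.n_predict)
print(f"Inference time: {time.time() - start:.2f} s")
print(tokenizer.decode(output[0], skip_special_tokens=True))
```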