From 5e0144997665fdb3b522168d52a6fbd7c41b1f89 Mon Sep 17 00:00:00 2001
From: cranechu <1340390339@qq.com>
Date: Tue, 20 Aug 2024 16:41:34 +0800
Subject: [PATCH 01/11] feat: update readme for ppl test
---
python/llm/dev/benchmark/perplexity/README.md | 68 +++++++++++++++++--
1 file changed, 63 insertions(+), 5 deletions(-)
diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md
index 8e6d5bacb89..d20ccf686f1 100644
--- a/python/llm/dev/benchmark/perplexity/README.md
+++ b/python/llm/dev/benchmark/perplexity/README.md
@@ -1,29 +1,87 @@
# Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py)
-## Run on Wikitext
+## Requirements
+To run the perplexity test with IPEX-LLM on Intel GPUs, there are some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
+### 1. Install IPEX-LLM
+We suggest using conda to manage environment:
```bash
-pip install datasets
+conda create -n llm python=3.11
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
-An example to run perplexity on wikitext:
+
+
+### 2. Configure OneAPI environment variables for Linux
+
+> [!NOTE]
+> Skip this step if you are running on Windows.
+
+This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
+
```bash
+source /opt/intel/oneapi/setvars.sh
+```
-python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
+
+```bash
+export USE_XETLA=OFF
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export SYCL_CACHE_PERSISTENT=1
```
-## Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset
+
+
+
+
+For Intel Data Center GPU Max Series
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export SYCL_CACHE_PERSISTENT=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+
+
+
+For Intel iGPU
+
+```bash
+export SYCL_CACHE_PERSISTENT=1
+export BIGDL_LLM_XMX_DISABLED=1
+```
+
+
+
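As a quick sanity check after the runtime configurations above, the following minimal sketch (illustrative only, assuming `ipex-llm[xpu]` from step 1 is installed) confirms that the XPU device is visible to PyTorch:

```python
# Minimal sketch: confirm the Intel GPU (XPU) backend is visible.
# Assumes ipex-llm[xpu] is installed, which brings intel_extension_for_pytorch.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the torch.xpu backend)

if torch.xpu.is_available():
    print(f"XPU devices found: {torch.xpu.device_count()}")
    print(f"Device 0: {torch.xpu.get_device_name(0)}")
else:
    print("No XPU device detected; re-check the oneAPI and runtime configurations above.")
```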
+### 4. Install Dependencies
+Install the `datasets` dependency to download and load the dataset for the test.
```bash
pip install datasets
```
+## Running the test
+### 1.Run on Wikitext
+An example to run perplexity on wikitext:
+```bash
+python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
+```
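For reference, the command above follows the fixed-length, sliding-window perplexity evaluation described in the linked transformers guide. Below is a minimal sketch of that computation; the model name, stride, and window size simply mirror the example command and are not the exact implementation of `run_wikitext.py`:

```python
# Illustrative sketch of sliding-window perplexity (adapted from the transformers guide).
# Model, dataset, stride, and window size are assumptions mirroring the example command.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"   # assumed; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
encodings = tokenizer(text, return_tensors="pt")

max_length, stride = 4096, 512
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                  # only score tokens not seen in a previous window
    input_ids = encodings.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100           # mask the overlapping context tokens
    with torch.no_grad():
        nlls.append(model(input_ids=input_ids, labels=target_ids).loss * trg_len)
    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"perplexity: {ppl.item():.2f}")
```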
+### 2.Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset
An example to run perplexity on chatglm3-6b using the default Chinese datasets ("multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh"):
```bash
python run_longbench.py --model_path THUDM/chatglm3-6b --precisions float16 sym_int4 --device xpu --language zh
```
+
Notes:
- If you want to test model perplexity on a few selected datasets from the `LongBench` dataset, please use the format below.
```bash
From 6122714f9d454b65111b801a2aba86fa21f6bd7a Mon Sep 17 00:00:00 2001
From: cranechu <1340390339@qq.com>
Date: Tue, 20 Aug 2024 17:04:12 +0800
Subject: [PATCH 02/11] fix: textual adjustments
---
python/llm/dev/benchmark/perplexity/README.md | 63 +++----------------
1 file changed, 8 insertions(+), 55 deletions(-)
diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md
index d20ccf686f1..6f8e721129d 100644
--- a/python/llm/dev/benchmark/perplexity/README.md
+++ b/python/llm/dev/benchmark/perplexity/README.md
@@ -1,80 +1,33 @@
# Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py)
-## Requirements
+## Environment Preparations
To run the perplexity test with IPEX-LLM on Intel GPUs, there are some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
-### 1. Install IPEX-LLM
-We suggest using conda to manage environment:
+We suggest using conda to manage the ipex-llm environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
-
-
-### 2. Configure OneAPI environment variables for Linux
-
-> [!NOTE]
-> Skip this step if you are running on Windows.
-
-This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
-
-```bash
-source /opt/intel/oneapi/setvars.sh
-```
-
-### 3. Runtime Configurations
-For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-
-
-For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
-
-```bash
-export USE_XETLA=OFF
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-```
-
-
-
-
-
-For Intel Data Center GPU Max Series
-
+Install the `datasets` dependency to download and load the dataset for the test.
```bash
-export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-export ENABLE_SDP_FUSION=1
+pip install datasets
```
-> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
-
-
-
-
-For Intel iGPU
+This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
```bash
-export SYCL_CACHE_PERSISTENT=1
-export BIGDL_LLM_XMX_DISABLED=1
+source /opt/intel/oneapi/setvars.sh
```
-
-
-### 4. Install Dependencies
-Install the `datasets` dependency to download and load the dataset for the test.
-```bash
-pip install datasets
-```
## Running the test
-### 1.Run on Wikitext
+### 1. Run on Wikitext
An example to run perplexity on wikitext:
```bash
python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
```
-### 2.Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset
+### 2. Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset
An example to run perplexity on chatglm3-6b using the default Chinese datasets ("multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh"):
```bash
From 9e67b22502337b783c22fff47ec3240002d7de91 Mon Sep 17 00:00:00 2001
From: cranechu <1340390339@qq.com>
Date: Tue, 20 Aug 2024 17:34:42 +0800
Subject: [PATCH 03/11] fix: textual adjustments
---
python/llm/dev/benchmark/perplexity/README.md | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md
index 6f8e721129d..3d824ac570f 100644
--- a/python/llm/dev/benchmark/perplexity/README.md
+++ b/python/llm/dev/benchmark/perplexity/README.md
@@ -1,18 +1,11 @@
# Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py)
-## Environment Preparations
-To run the perplexity test with IPEX-LLM on Intel GPUs, there are some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
-
-We suggest using conda to manage the ipex-llm environment:
+## Environment Preparation
+Install ipex-llm and the `datasets` package.
```bash
-conda create -n llm python=3.11
-conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-```
-Install the `datasets` dependency to download and load the dataset for the test.
-```bash
pip install datasets
```
This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
@@ -21,7 +14,7 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
-## Running the test
+## Running PPL Evaluation
### 1. Run on Wikitext
An example to run perplexity on wikitext:
```bash
From 979c738194d9afa8281878ab8c38dc01d62b64d7 Mon Sep 17 00:00:00 2001
From: SONG Ge <38711238+sgwhat@users.noreply.github.com>
Date: Tue, 20 Aug 2024 17:29:49 +0800
Subject: [PATCH 04/11] Add ipex-llm npu option in setup.py (#11858)
* add ipex-llm npu release
* update example doc
* meet latest release changes
---
.../example/NPU/HF-Transformers-AutoModels/LLM/README.md | 7 ++-----
python/llm/setup.py | 7 +++++++
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
index 31e055b5bea..728617f0a45 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -91,11 +91,8 @@ We suggest using conda to manage environment:
conda create -n llm python=3.10
conda activate llm
-# install ipex-llm with 'all' option
-pip install --pre --upgrade ipex-llm[all]
-pip install --pre --upgrade bigdl-core-npu
-
-pip install transformers==4.40
+# install ipex-llm with 'npu' option
+pip install --pre --upgrade ipex-llm[npu]
```
### 2. Runtime Configurations
diff --git a/python/llm/setup.py b/python/llm/setup.py
index ecb7aea861b..f9adc5f39f8 100644
--- a/python/llm/setup.py
+++ b/python/llm/setup.py
@@ -300,6 +300,12 @@ def setup_package():
serving_requires = ['py-cpuinfo']
serving_requires += SERVING_DEP
+ npu_requires = copy.deepcopy(all_requires)
+ cpu_transformers_version = ['transformers == 4.37.0', 'tokenizers == 0.15.2']
+ for exclude_require in cpu_transformers_version:
+ npu_requires.remove(exclude_require)
+ npu_requires += ["transformers==4.40.0",
+ "bigdl-core-npu==" + CORE_XE_VERSION + ";platform_system=='Windows'"]
metadata = dict(
name='ipex_llm',
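The hunk below then registers the new requirement set under `extras_require`. For clarity, here is a stand-alone sketch of the list manipulation performed above; the base `all` list is shortened and the core version is left as a placeholder (both are assumptions for illustration):

```python
# Stand-alone illustration of how the "npu" extra is derived from the "all" extra:
# drop the CPU transformers/tokenizers pins, then add the NPU-specific packages.
import copy

CORE_XE_VERSION = "x.y.z"  # placeholder; the real value is defined elsewhere in setup.py

all_requires = [
    "transformers == 4.37.0", "tokenizers == 0.15.2",  # CPU pins to be swapped out
    "sentencepiece", "accelerate == 0.23.0",            # shortened for illustration
]

npu_requires = copy.deepcopy(all_requires)
for exclude_require in ["transformers == 4.37.0", "tokenizers == 0.15.2"]:
    npu_requires.remove(exclude_require)
npu_requires += [
    "transformers==4.40.0",
    "bigdl-core-npu==" + CORE_XE_VERSION + ";platform_system=='Windows'",
]

print(npu_requires)
```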
@@ -323,6 +329,7 @@ def setup_package():
},
extras_require={"all": all_requires,
"xpu": xpu_requires, # default to ipex 2.1 for linux and windows
+ "npu": npu_requires,
"xpu-2-1": xpu_21_requires,
"serving": serving_requires,
"cpp": cpp_requires,
From a9ab309690ef1e69e85153c9963f0b6feab011ab Mon Sep 17 00:00:00 2001
From: Yishuo Wang
Date: Tue, 20 Aug 2024 17:32:51 +0800
Subject: [PATCH 05/11] optimize phi3 memory usage (#11867)
---
python/llm/src/ipex_llm/transformers/kv.py | 15 +++++++++++++++
.../llm/src/ipex_llm/transformers/models/phi3.py | 14 +++++++++++---
2 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/python/llm/src/ipex_llm/transformers/kv.py b/python/llm/src/ipex_llm/transformers/kv.py
index 100da837a9e..8b20f546893 100644
--- a/python/llm/src/ipex_llm/transformers/kv.py
+++ b/python/llm/src/ipex_llm/transformers/kv.py
@@ -121,6 +121,21 @@ def update(
return self.key_cache[layer_idx], self.value_cache[layer_idx]
+ @classmethod
+ def from_reserved(cls, layers: int,
+ bsz: int, n_head: int, length: int, head_dim: int,
+ dtype: torch.dtype, device: torch.device):
+ past_key_values = cls()
+ for _i in range(layers):
+ k_cache, v_cache = init_kv_cache(
+ bsz, n_head, head_dim,
+ 0, length + cls.KV_ALLOC_BLOCK_LENGTH,
+ dtype, device
+ )
+ past_key_values.key_cache.append(k_cache)
+ past_key_values.value_cache.append(v_cache)
+ return past_key_values
+
# Copied from transformers.models.llama.modeling_llama.repeat_kv
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
diff --git a/python/llm/src/ipex_llm/transformers/models/phi3.py b/python/llm/src/ipex_llm/transformers/models/phi3.py
index 5c630681cc9..823fb10391a 100644
--- a/python/llm/src/ipex_llm/transformers/models/phi3.py
+++ b/python/llm/src/ipex_llm/transformers/models/phi3.py
@@ -254,9 +254,9 @@ def model_forward(
):
# IPEX-LLM OPT: kv cache and quantize kv cache and sdp
use_cache = use_cache if use_cache is not None else self.config.use_cache
- input = input_ids if input_ids is not None else inputs_embeds
- use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.down_proj, input)
- use_compress_kv = should_use_compresskv(input, input.shape[1])
+ inputs = input_ids if input_ids is not None else inputs_embeds
+ use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.down_proj, inputs)
+ use_compress_kv = should_use_compresskv(inputs, inputs.shape[1])
if use_cache:
if use_compress_kv and not isinstance(past_key_values,
DynamicCompressCache):
@@ -272,6 +272,14 @@ def model_forward(
DynamicCompressCache
)):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
+ if past_key_values.get_seq_length() == 0:
+ n_layer = self.config.num_hidden_layers
+ n_head = self.config.num_attention_heads
+ head_dim = self.config.hidden_size // self.config.num_attention_heads
+ past_key_values = DynamicNormalCache.from_reserved(
+ n_layer, inputs.size(0), n_head, inputs.size(1), head_dim,
+ inputs.dtype, inputs.device
+ )
return origin_model_forward(
self=self,
input_ids=input_ids,
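To make the optimization above concrete: at prefill, when the cache is still empty, the KV buffers for every layer are reserved in one shot instead of being grown step by step during decoding. Below is a simplified stand-alone sketch of that idea; the shapes, the padding constant, and the `reserve_kv_cache` helper are illustrative assumptions, not the `ipex_llm` implementation:

```python
# Simplified illustration of reserving per-layer KV buffers up front at prefill,
# rather than reallocating/concatenating them as the sequence grows.
import torch

KV_ALLOC_BLOCK_LENGTH = 256  # assumed headroom so several decode steps fit without realloc

def reserve_kv_cache(layers, bsz, n_head, length, head_dim, dtype, device):
    key_cache, value_cache = [], []
    for _ in range(layers):
        # Allocate capacity for the prompt plus some headroom in one shot.
        k = torch.empty(bsz, n_head, length + KV_ALLOC_BLOCK_LENGTH, head_dim,
                        dtype=dtype, device=device)
        v = torch.empty_like(k)
        key_cache.append(k)
        value_cache.append(v)
    return key_cache, value_cache

# Example: a phi-3-mini-like configuration at prefill time.
keys, values = reserve_kv_cache(layers=32, bsz=1, n_head=32,
                                length=1024, head_dim=96,
                                dtype=torch.float16, device="cpu")
print(keys[0].shape)  # torch.Size([1, 32, 1280, 96])
```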
From f5f3f19f98efe77c23eb2f5ccadbdaf58643ba8b Mon Sep 17 00:00:00 2001
From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com>
Date: Tue, 20 Aug 2024 17:37:58 +0800
Subject: [PATCH 06/11] Update `ipex-llm` default transformers version to
4.37.0 (#11859)
* Update default transformers version to 4.37.0
* Add dependency requirements for qwen and qwen-vl
* Temp fix transformers version for these not yet verified models
* Skip qwen test in UT for now as it requires transformers<4.37.0
---
.../CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md | 2 ++
.../CPU/HF-Transformers-AutoModels/Model/qwen/README.md | 4 ++++
python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md | 4 ++++
python/llm/example/GPU/HuggingFace/LLM/qwen/README.md | 2 ++
.../llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md | 2 ++
.../GPU/HuggingFace/Multimodal/voiceassistant/README.md | 2 ++
.../llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md | 2 ++
python/llm/example/GPU/PyTorch-Models/Model/llava/README.md | 2 --
python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md | 2 ++
.../llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md | 2 ++
python/llm/setup.py | 2 +-
python/llm/test/inference_gpu/test_transformers_api.py | 2 +-
.../llm/test/inference_gpu/test_transformers_api_RMSNorm.py | 2 +-
.../llm/test/inference_gpu/test_transformers_api_attention.py | 2 +-
python/llm/test/inference_gpu/test_transformers_api_mlp.py | 2 +-
15 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md
index 7dc3dedc5cb..7f5061eccd6 100644
--- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md
+++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md
@@ -20,6 +20,7 @@ conda activate llm
# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
@@ -32,6 +33,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[all]
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib
```
diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md
index cee06098d2d..992ea9ee10e 100644
--- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md
+++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md
@@ -22,6 +22,8 @@ conda activate llm
# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
+
+pip install "transformers<4.37.0"
pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation
```
@@ -32,6 +34,8 @@ conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all]
+
+pip install "transformers<4.37.0"
pip install tiktoken einops transformers_stream_generator
```
diff --git a/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md b/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md
index 25744465c26..f6f5f1ffe8e 100644
--- a/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md
+++ b/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md
@@ -19,6 +19,8 @@ conda activate llm
# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
+
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
@@ -29,6 +31,8 @@ conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all]
+
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib
```
diff --git a/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md
index 500e2b0f2ad..8311f7f1369 100644
--- a/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md
+++ b/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md
@@ -15,6 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation
```
@@ -27,6 +28,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation
```
diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
index fb02816b1f0..737232661fd 100644
--- a/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
+++ b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
@@ -15,6 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
@@ -27,6 +28,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
index 67c0fb26249..7dea109b078 100644
--- a/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
+++ b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
@@ -17,6 +17,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install librosa soundfile datasets
pip install accelerate
pip install SpeechRecognition sentencepiece colorama
@@ -33,6 +34,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install librosa soundfile datasets
pip install accelerate
pip install SpeechRecognition sentencepiece colorama
diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
index 29a4dc4619c..ac664fb0a36 100644
--- a/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
+++ b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
@@ -16,6 +16,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install datasets soundfile librosa # required by audio processing
```
@@ -28,6 +29,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install datasets soundfile librosa # required by audio processing
```
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md b/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md
index 461ae53a8dd..77e0f1cfd9c 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md
@@ -16,7 +16,6 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install einops # install dependencies required by llava
-pip install transformers==4.36.2
git clone https://github.com/haotian-liu/LLaVA.git # clone the llava libary
cp generate.py ./LLaVA/ # copy our example to the LLaVA folder
@@ -34,7 +33,6 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install einops # install dependencies required by llava
-pip install transformers==4.36.2
git clone https://github.com/haotian-liu/LLaVA.git # clone the llava libary
copy generate.py .\LLaVA\ # copy our example to the LLaVA folder
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md
index 5f9a617aaa3..c480c545366 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md
@@ -15,6 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
@@ -27,6 +28,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md b/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md
index 171ff392422..98806eda677 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md
@@ -15,6 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install "datasets<2.18" soundfile # additional package required for SpeechT5 to conduct generation
```
@@ -27,6 +28,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install "datasets<2.18" soundfile # additional package required for SpeechT5 to conduct generation
```
diff --git a/python/llm/setup.py b/python/llm/setup.py
index f9adc5f39f8..4386293cac6 100644
--- a/python/llm/setup.py
+++ b/python/llm/setup.py
@@ -53,7 +53,7 @@
cpu_torch_version = ["torch==2.1.2+cpu;platform_system=='Linux'", "torch==2.1.2;platform_system=='Windows'"]
CONVERT_DEP = ['numpy == 1.26.4', # lastet 2.0.0b1 will cause error
- 'transformers == 4.36.2', 'sentencepiece', 'tokenizers == 0.15.2',
+ 'transformers == 4.37.0', 'sentencepiece', 'tokenizers == 0.15.2',
'accelerate == 0.23.0', 'tabulate'] + cpu_torch_version
SERVING_DEP = ['fschat[model_worker, webui] == 0.2.36', 'protobuf']
diff --git a/python/llm/test/inference_gpu/test_transformers_api.py b/python/llm/test/inference_gpu/test_transformers_api.py
index ae9c6b9bc3e..b29c25997ae 100644
--- a/python/llm/test/inference_gpu/test_transformers_api.py
+++ b/python/llm/test/inference_gpu/test_transformers_api.py
@@ -36,7 +36,7 @@
(AutoModelForCausalLM, AutoTokenizer, os.environ.get('MPT_7B_ORIGIN_PATH')),
# (AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')),
# (AutoModelForCausalLM, AutoTokenizer, os.environ.get('BAICHUAN2_7B_ORIGIN_PATH')),
- # (AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')),
+ # (AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0
])
def test_completion(Model, Tokenizer, model_path, prompt, answer):
with torch.inference_mode():
diff --git a/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py b/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py
index f45f017ef0b..edb2adf1ec0 100644
--- a/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py
+++ b/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py
@@ -32,7 +32,7 @@
("ChatGLM2-6B", AutoModel, AutoTokenizer, os.environ.get('CHATGLM2_6B_ORIGIN_PATH')),
("Mistral-7B-Instruct-v0.1", AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')),
("Baichuan2-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('BAICHUAN2_7B_ORIGIN_PATH')),
- ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')),
+ # ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0
]
class Test_Optimize_Gpu_Model:
diff --git a/python/llm/test/inference_gpu/test_transformers_api_attention.py b/python/llm/test/inference_gpu/test_transformers_api_attention.py
index 4db5ba8b531..84bdcf8e8cb 100644
--- a/python/llm/test/inference_gpu/test_transformers_api_attention.py
+++ b/python/llm/test/inference_gpu/test_transformers_api_attention.py
@@ -34,7 +34,7 @@
("ChatGLM2-6B", AutoModel, AutoTokenizer, os.environ.get('CHATGLM2_6B_ORIGIN_PATH')),
("Mistral-7B-Instruct-v0.1", AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')),
("Baichuan2-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('BAICHUAN2_7B_ORIGIN_PATH')),
- ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')),
+ # ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0
]
class Test_Optimize_Gpu_Model:
diff --git a/python/llm/test/inference_gpu/test_transformers_api_mlp.py b/python/llm/test/inference_gpu/test_transformers_api_mlp.py
index cf0581a50c0..c6229d73fc4 100644
--- a/python/llm/test/inference_gpu/test_transformers_api_mlp.py
+++ b/python/llm/test/inference_gpu/test_transformers_api_mlp.py
@@ -27,7 +27,7 @@
PROMPT = "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
TEST_MODEL_LIST = [
- ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')),
+ # ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0
("Mistral-7B-Instruct-v0.1", AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')),
("Llama2-7B", AutoModelForCausalLM, LlamaTokenizer, os.environ.get('LLAMA2_7B_ORIGIN_PATH'))
]
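Because the Qwen and Qwen-VL examples above now pin `transformers<4.37.0` while the package default moves to 4.37.0, a small runtime guard can catch a mismatched environment early. This is an illustrative sketch only; the `packaging` dependency and the check itself are assumptions, not part of the patch:

```python
# Illustrative guard: fail fast if the installed transformers version does not
# match what the Qwen / Qwen-VL examples above expect (transformers < 4.37.0).
import transformers
from packaging.version import Version

required_upper_bound = Version("4.37.0")
installed = Version(transformers.__version__)

if installed >= required_upper_bound:
    raise RuntimeError(
        f"transformers {installed} is installed, but this example needs "
        f"transformers<{required_upper_bound}; run: pip install \"transformers<4.37.0\""
    )
```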
From cab32ea354f5fa388bb1d11f90913c98f459c594 Mon Sep 17 00:00:00 2001
From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com>
Date: Tue, 20 Aug 2024 17:59:28 +0800
Subject: [PATCH 07/11] Update performance test regarding updated default
`transformers==4.37.0` (#11869)
* Update igpu performance from transformers 4.36.2 to 4.37.0 (#11841)
* upgrade arc perf test to transformers 4.37 (#11842)
* fix load low bit com dtype (#11832)
* feat: add mixed_precision argument on ppl longbench evaluation
* fix: delete extra code
* feat: upgrade arc perf test to transformers 4.37
* fix: add missing codes
* fix: keep perf test for qwen-vl-chat in transformers 4.36
* fix: remove extra space
* fix: resolve pr comment
* fix: add empty line
* fix: add pip install for spr and core test
* fix: delete extra comments
* fix: remove python -m for pip
* Revert "fix load low bit com dtype (#11832)"
This reverts commit 6841a9ac8fc8b3f4eb06e41fa3944f7877fd8f94.
---------
Co-authored-by: Zhao Changmin
Co-authored-by: Jinhe Tang
* add transformers==4.36 for qwen vl in igpu-perf (#11846)
* add transformers==4.36.2 for qwen-vl
* Small update
---------
Co-authored-by: Yuwen Hu
* fix: remove qwen-7b on core test (#11851)
* fix: remove qwen-7b on core test
* fix: change delete to comment
---------
Co-authored-by: Jinhe Tang
* replace filename (#11854)
* fix: remove qwen-7b on core test
* fix: change delete to comment
* fix: replace filename
---------
Co-authored-by: Jinhe Tang
* fix: delete extra comments (#11863)
* Remove transformers installation for temp test purposes
* Small fix
* Small update
---------
Co-authored-by: Chu,Youcheng <70999398+cranechu0131@users.noreply.github.com>
Co-authored-by: Zhao Changmin
Co-authored-by: Jinhe Tang
Co-authored-by: Zijie Li
Co-authored-by: Chu,Youcheng <1340390339@qq.com>
---
.github/workflows/llm_performance_tests.yml | 128 +++++++-----------
.../test/benchmark/arc-perf-test-batch2.yaml | 30 ----
.../test/benchmark/arc-perf-test-batch4.yaml | 36 -----
python/llm/test/benchmark/arc-perf-test.yaml | 32 -----
.../arc-perf-transformers-436-batch2.yaml | 16 +++
.../arc-perf-transformers-436-batch4.yaml | 18 +++
.../benchmark/arc-perf-transformers-436.yaml | 16 +++
.../arc-perf-transformers-437-batch2.yaml | 14 ++
.../arc-perf-transformers-437-batch4.yaml | 18 ++-
.../benchmark/arc-perf-transformers-437.yaml | 14 ++
python/llm/test/benchmark/core-perf-test.yaml | 2 +-
.../test/benchmark/igpu-perf/1024-128.yaml | 8 +-
.../{1024-128_437.yaml => 1024-128_436.yaml} | 8 +-
.../igpu-perf/1024-128_int4_fp16.yaml | 8 +-
...6_437.yaml => 1024-128_int4_fp16_436.yaml} | 8 +-
.../1024-128_int4_fp16_loadlowbit.yaml | 7 +-
...=> 1024-128_int4_fp16_loadlowbit_436.yaml} | 7 +-
.../igpu-perf/2048-256_int4_fp16.yaml | 8 +-
...6_437.yaml => 2048-256_int4_fp16_436.yaml} | 8 +-
.../igpu-perf/3072-384_int4_fp16.yaml | 8 +-
...6_437.yaml => 3072-384_int4_fp16_436.yaml} | 10 +-
.../benchmark/igpu-perf/32-32_int4_fp16.yaml | 8 +-
...fp16_437.yaml => 32-32_int4_fp16_436.yaml} | 8 +-
.../igpu-perf/4096-512_int4_fp16.yaml | 7 +
.../igpu-perf/4096-512_int4_fp16_437.yaml | 19 ---
25 files changed, 202 insertions(+), 244 deletions(-)
delete mode 100644 python/llm/test/benchmark/arc-perf-test-batch2.yaml
delete mode 100644 python/llm/test/benchmark/arc-perf-test-batch4.yaml
delete mode 100644 python/llm/test/benchmark/arc-perf-test.yaml
create mode 100644 python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml
create mode 100644 python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml
create mode 100644 python/llm/test/benchmark/arc-perf-transformers-436.yaml
rename python/llm/test/benchmark/igpu-perf/{1024-128_437.yaml => 1024-128_436.yaml} (65%)
rename python/llm/test/benchmark/igpu-perf/{1024-128_int4_fp16_437.yaml => 1024-128_int4_fp16_436.yaml} (65%)
rename python/llm/test/benchmark/igpu-perf/{1024-128_int4_fp16_loadlowbit_437.yaml => 1024-128_int4_fp16_loadlowbit_436.yaml} (68%)
rename python/llm/test/benchmark/igpu-perf/{2048-256_int4_fp16_437.yaml => 2048-256_int4_fp16_436.yaml} (65%)
rename python/llm/test/benchmark/igpu-perf/{3072-384_int4_fp16_437.yaml => 3072-384_int4_fp16_436.yaml} (52%)
rename python/llm/test/benchmark/igpu-perf/{32-32_int4_fp16_437.yaml => 32-32_int4_fp16_436.yaml} (65%)
delete mode 100644 python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml
diff --git a/.github/workflows/llm_performance_tests.yml b/.github/workflows/llm_performance_tests.yml
index 36b31f23937..736b1dd4540 100644
--- a/.github/workflows/llm_performance_tests.yml
+++ b/.github/workflows/llm_performance_tests.yml
@@ -153,7 +153,8 @@ jobs:
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
- cp python/llm/test/benchmark/arc-perf-test.yaml python/llm/dev/benchmark/all-in-one/config.yaml
+ pip install transformers==4.36.2
+ cp python/llm/test/benchmark/arc-perf-transformers-436.yaml python/llm/dev/benchmark/all-in-one/config.yaml
cd python/llm/dev/benchmark/all-in-one
mkdir test_batch1
mkdir test_batch2
@@ -167,7 +168,7 @@ jobs:
mv *.csv test_batch1
# batch_size 2
cd ../../../../../
- cp python/llm/test/benchmark/arc-perf-test-batch2.yaml python/llm/dev/benchmark/all-in-one/config.yaml
+ cp python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml python/llm/dev/benchmark/all-in-one/config.yaml
cd python/llm/dev/benchmark/all-in-one
# change csv name
sed -i 's/batch1/batch2/g' run.py
@@ -175,7 +176,7 @@ jobs:
mv *.csv test_batch2
# batch_size 4
cd ../../../../../
- cp python/llm/test/benchmark/arc-perf-test-batch4.yaml python/llm/dev/benchmark/all-in-one/config.yaml
+ cp python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml python/llm/dev/benchmark/all-in-one/config.yaml
cd python/llm/dev/benchmark/all-in-one
# change csv name
sed -i 's/batch2/batch4/g' run.py
@@ -188,7 +189,7 @@ jobs:
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
- # upgrade transformers for model Qwen/Qwen1.5-7B-Chat
+ # upgrade for default transformers version
python -m pip install transformers==4.37.0
# batch_size 1
cp python/llm/test/benchmark/arc-perf-transformers-437.yaml python/llm/dev/benchmark/all-in-one/config.yaml
@@ -314,7 +315,7 @@ jobs:
run: |
# batch_size 1
cd python/llm/dev/benchmark/all-in-one/test_batch1
- python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-test.yaml
+ python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-transformers-436.yaml
python ../../../../test/benchmark/check_results.py -c test2 -y ../../../../test/benchmark/arc-perf-transformers-437.yaml
python ../../../../test/benchmark/check_results.py -c test3 -y ../../../../test/benchmark/arc-perf-transformers-440.yaml
find . -name "*test*.csv" -delete
@@ -327,7 +328,7 @@ jobs:
rm -r test_batch1
# batch_size 2
cd test_batch2
- python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-test-batch2.yaml
+ python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-transformers-436-batch2.yaml
python ../../../../test/benchmark/check_results.py -c test2 -y ../../../../test/benchmark/arc-perf-transformers-437-batch2.yaml
find . -name "*test*.csv" -delete
if [[ ${{ github.event_name }} == "schedule" ]]; then
@@ -339,7 +340,7 @@ jobs:
rm -r test_batch2
# batch_size 4
cd test_batch4
- python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-test-batch4.yaml
+ python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-transformers-436-batch4.yaml
python ../../../../test/benchmark/check_results.py -c test2 -y ../../../../test/benchmark/arc-perf-transformers-437-batch4.yaml
find . -name "*test*.csv" -delete
if [[ ${{ github.event_name }} == "schedule" ]]; then
@@ -384,7 +385,6 @@ jobs:
python -m pip install --upgrade einops
python -m pip install --upgrade tiktoken
python -m pip install --upgrade transformers_stream_generator
-
# specific for test on certain commits
- name: Download llm binary
if: ${{ github.event_name == 'workflow_dispatch' && (inputs.checkout-ref != 'main') }}
@@ -653,6 +653,7 @@ jobs:
set BIGDL_LLM_XMX_DISABLED=1
REM for llava
set TRANSFORMERS_OFFLINE=1
+ pip install transformers==4.37.0
cd python\llm\dev\benchmark\all-in-one
move ..\..\..\test\benchmark\igpu-perf\32-32_int4_fp16.yaml config.yaml
@@ -664,23 +665,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (32-32 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (32-32 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml
- - name: Test on igpu for transformers 4.37 (32-32 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (32-32 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\32-32_int4_fp16_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\32-32_int4_fp16_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\32-32_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -771,7 +772,7 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -788,23 +789,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (1024-128 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (1024-128 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml
- - name: Test on igpu for transformers 4.37 (1024-128 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (1024-128 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\1024-128_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -812,7 +813,7 @@ jobs:
if %ERRORLEVEL% neq 0 (exit /b 1)
call conda deactivate
-
+
- name: Prepare igpu perf test for transformers 4.38 (1024-128 int4+fp16)
shell: bash
run: |
@@ -894,7 +895,6 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -911,23 +911,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (2048-256 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (2048-256 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml
- - name: Test on igpu for transformers 4.37 (2048-256 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (2048-256 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\2048-256_int4_fp16_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\2048-256_int4_fp16_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\2048-256_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -935,7 +935,7 @@ jobs:
if %ERRORLEVEL% neq 0 (exit /b 1)
call conda deactivate
-
+
- name: Prepare igpu perf test for transformers 4.38 (2048-256 int4+fp16)
shell: bash
run: |
@@ -1017,7 +1017,7 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -1034,23 +1034,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (3072-384 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (3072-384 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml
- - name: Test on igpu for transformers 4.37 (3072-384 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (3072-384 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\3072-384_int4_fp16_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\3072-384_int4_fp16_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\3072-384_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -1140,7 +1140,7 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -1157,35 +1157,10 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (4096-512 int4+fp16)
- shell: bash
- run: |
- sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml
-
- - name: Test on igpu for transformers 4.37 (4096-512 int4+fp16)
- shell: cmd
- run: |
- call conda activate igpu-perf
- pip install transformers==4.37.0
-
- set SYCL_CACHE_PERSISTENT=1
- set BIGDL_LLM_XMX_DISABLED=1
-
- cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\4096-512_int4_fp16_437.yaml config.yaml
- set PYTHONIOENCODING=utf-8
- python run.py >> %CSV_SAVE_PATH%\4096-512_int4_fp16\log\%LOG_FILE% 2>&1
- if %ERRORLEVEL% neq 0 (exit /b 1)
- python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test2
- if %ERRORLEVEL% neq 0 (exit /b 1)
-
- call conda deactivate
-
- name: Prepare igpu perf test for transformers 4.38 (4096-512 int4+fp16)
shell: bash
run: |
- sed -i 's/{today}_test2/{today}_test3/g' python/llm/dev/benchmark/all-in-one/run.py
+ sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_438.yaml
- name: Test on igpu for transformers 4.38 (4096-512 int4+fp16)
@@ -1202,7 +1177,7 @@ jobs:
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\4096-512_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
- python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test3
+ python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test2
if %ERRORLEVEL% neq 0 (exit /b 1)
call conda deactivate
@@ -1210,7 +1185,7 @@ jobs:
- name: Prepare igpu perf test for transformers 4.43 (4096-512 int4+fp16)
shell: bash
run: |
- sed -i 's/{today}_test3/{today}_test4/g' python/llm/dev/benchmark/all-in-one/run.py
+ sed -i 's/{today}_test2/{today}_test3/g' python/llm/dev/benchmark/all-in-one/run.py
sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_443.yaml
- name: Test on igpu for transformers 4.43 (4096-512 int4+fp16)
@@ -1228,7 +1203,7 @@ jobs:
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\4096-512_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
- python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test4
+ python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test3
if %ERRORLEVEL% neq 0 (exit /b 1)
pip uninstall trl -y
@@ -1256,14 +1231,14 @@ jobs:
shell: bash
run: |
sed -i 's/4096-512/1024-128/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i 's/{today}_test4/{today}_test1/g' python/llm/dev/benchmark/all-in-one/run.py
+ sed -i 's/{today}_test3/{today}_test1/g' python/llm/dev/benchmark/all-in-one/run.py
sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml
- name: Test on igpu (load_low_bit 1024-128 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -1280,23 +1255,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (load_low_bit 1024-128 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (load_low_bit 1024-128 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml
- - name: Test on igpu for transformers 4.37 (load_low_bit 1024-128 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (load_low_bit 1024-128 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_loadlowbit_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_loadlowbit_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\1024-128_int4_fp16_loadlowbit\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -1385,7 +1360,7 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -1402,23 +1377,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (1024-128)
+ - name: Prepare igpu perf test for transformers 4.36 (1024-128)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_436.yaml
- - name: Test on igpu for transformers 4.37 (1024-128)
+ - name: Test on igpu for transformers 4.36 (1024-128)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\1024-128_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\1024-128_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\1024-128\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -1520,4 +1495,3 @@ jobs:
# shell: cmd
# run: |
# call conda env remove -n igpu-perf -y
-
diff --git a/python/llm/test/benchmark/arc-perf-test-batch2.yaml b/python/llm/test/benchmark/arc-perf-test-batch2.yaml
deleted file mode 100644
index 70447fd7f59..00000000000
--- a/python/llm/test/benchmark/arc-perf-test-batch2.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-repo_id:
- - 'meta-llama/Llama-2-7b-chat-hf'
- - 'meta-llama/Llama-2-13b-chat-hf'
- - 'THUDM/chatglm3-6b-4bit'
- - 'baichuan-inc/Baichuan2-7B-Chat'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
- - 'THUDM/glm-4-9b-chat'
- - 'openbmb/MiniCPM-2B-sft-bf16'
- - 'Qwen/Qwen-VL-Chat'
- #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
- - '01-ai/Yi-6B-Chat'
- - 'mistralai/Mistral-7B-Instruct-v0.2'
- - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- - '01-ai/Yi-1.5-6B-Chat'
-local_model_hub: '/mnt/disk1/models'
-warm_up: 1
-num_trials: 3
-num_beams: 1 # default to greedy search
-low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
-batch_size: 2 # default to 1
-in_out_pairs:
- - '32-32'
- - '1024-128'
- - '2048-256'
-test_api:
- - "transformer_int4_fp16_gpu" # on Intel GPU
-cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
-exclude:
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
-task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-test-batch4.yaml b/python/llm/test/benchmark/arc-perf-test-batch4.yaml
deleted file mode 100644
index 3bfd47963a4..00000000000
--- a/python/llm/test/benchmark/arc-perf-test-batch4.yaml
+++ /dev/null
@@ -1,36 +0,0 @@
-repo_id:
- - 'meta-llama/Llama-2-7b-chat-hf'
- - 'meta-llama/Llama-2-13b-chat-hf'
- - 'THUDM/chatglm3-6b-4bit'
- - 'baichuan-inc/Baichuan2-7B-Chat'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
- - 'THUDM/glm-4-9b-chat'
- - 'openbmb/MiniCPM-2B-sft-bf16'
- - 'Qwen/Qwen-VL-Chat'
- #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
- - '01-ai/Yi-6B-Chat'
- - 'mistralai/Mistral-7B-Instruct-v0.2'
- - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- - '01-ai/Yi-1.5-6B-Chat'
-local_model_hub: '/mnt/disk1/models'
-warm_up: 1
-num_trials: 3
-num_beams: 1 # default to greedy search
-low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
-batch_size: 4 # default to 1
-in_out_pairs:
- - '32-32'
- - '1024-128'
- - '2048-256'
-test_api:
- - "transformer_int4_fp16_gpu" # on Intel GPU
-cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
-exclude:
- - 'meta-llama/Llama-2-13b-chat-hf:2048'
- - 'baichuan-inc/Baichuan2-7B-Chat:2048'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit:1024'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
- - 'Qwen/Qwen-VL-Chat:2048'
-# - 'fnlp/moss-moon-003-sft-4bit:1024'
-# - 'fnlp/moss-moon-003-sft-4bit:2048'
-task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-test.yaml b/python/llm/test/benchmark/arc-perf-test.yaml
deleted file mode 100644
index 890b8dbf470..00000000000
--- a/python/llm/test/benchmark/arc-perf-test.yaml
+++ /dev/null
@@ -1,32 +0,0 @@
-repo_id:
- - 'meta-llama/Llama-2-7b-chat-hf'
- - 'meta-llama/Llama-2-13b-chat-hf'
- - 'THUDM/chatglm3-6b-4bit'
- - 'baichuan-inc/Baichuan2-7B-Chat'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
- - 'THUDM/glm-4-9b-chat'
- - 'openbmb/MiniCPM-2B-sft-bf16'
- - 'Qwen/Qwen-VL-Chat'
- #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
- - '01-ai/Yi-6B-Chat'
- - 'mistralai/Mistral-7B-Instruct-v0.2'
- - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- - '01-ai/Yi-1.5-6B-Chat'
-local_model_hub: '/mnt/disk1/models'
-warm_up: 1
-num_trials: 3
-num_beams: 1 # default to greedy search
-low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
-batch_size: 1 # default to 1
-in_out_pairs:
- - '32-32'
- - '1024-128'
- - '2048-256'
-test_api:
- - "transformer_int4_fp16_gpu" # on Intel GPU
-cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
-exclude:
-# - 'fnlp/moss-moon-003-sft-4bit:1024'
-# - 'fnlp/moss-moon-003-sft-4bit:2048'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
-task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml b/python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml
new file mode 100644
index 00000000000..42ef79f344c
--- /dev/null
+++ b/python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml
@@ -0,0 +1,16 @@
+repo_id:
+ - 'Qwen/Qwen-VL-Chat'
+local_model_hub: '/mnt/disk1/models'
+warm_up: 1
+num_trials: 3
+num_beams: 1 # default to greedy search
+low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
+batch_size: 2 # default to 1
+in_out_pairs:
+ - '32-32'
+ - '1024-128'
+ - '2048-256'
+test_api:
+ - "transformer_int4_fp16_gpu" # on Intel GPU
+cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
+task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml b/python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml
new file mode 100644
index 00000000000..606b9c6cf05
--- /dev/null
+++ b/python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml
@@ -0,0 +1,18 @@
+repo_id:
+ - 'Qwen/Qwen-VL-Chat'
+local_model_hub: '/mnt/disk1/models'
+warm_up: 1
+num_trials: 3
+num_beams: 1 # default to greedy search
+low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
+batch_size: 4 # default to 1
+in_out_pairs:
+ - '32-32'
+ - '1024-128'
+ - '2048-256'
+test_api:
+ - "transformer_int4_fp16_gpu" # on Intel GPU
+cpu_embedding: False # whether put embedding to CPU (only available now for gpu win related test_api)
+exclude:
+ - 'Qwen/Qwen-VL-Chat:2048'
+task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-436.yaml b/python/llm/test/benchmark/arc-perf-transformers-436.yaml
new file mode 100644
index 00000000000..efdf14193a3
--- /dev/null
+++ b/python/llm/test/benchmark/arc-perf-transformers-436.yaml
@@ -0,0 +1,16 @@
+repo_id:
+ - 'Qwen/Qwen-VL-Chat'
+local_model_hub: '/mnt/disk1/models'
+warm_up: 1
+num_trials: 3
+num_beams: 1 # default to greedy search
+low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
+batch_size: 1 # default to 1
+in_out_pairs:
+ - '32-32'
+ - '1024-128'
+ - '2048-256'
+test_api:
+ - "transformer_int4_fp16_gpu" # on Intel GPU
+cpu_embedding: False # whether put embedding to CPU (only available now for gpu win related test_api)
+task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
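Editor's note: the new `arc-perf-transformers-436*.yaml` files above reuse the same schema as the existing arc-perf configs (`repo_id`, `in_out_pairs`, `batch_size`, `low_bit`, `exclude`, ...). As a rough illustration of how such a file is laid out to be consumed, here is a minimal sketch of a hypothetical loader; the actual IPEX-LLM benchmark runner is not part of this patch, and all names below are illustrative assumptions.

```python
# Illustrative only: a minimal, hypothetical loader for the benchmark YAML
# schema above. The real IPEX-LLM harness is not shown in this patch series.
import itertools
import yaml  # pip install pyyaml


def iter_benchmark_cases(config_path: str):
    with open(config_path) as f:
        conf = yaml.safe_load(f)
    exclude = set(conf.get("exclude") or [])
    for repo_id, pair in itertools.product(conf["repo_id"], conf["in_out_pairs"]):
        # 'exclude' entries appear to pair a repo id with an input length,
        # e.g. 'Qwen/Qwen-VL-Chat:2048' against the '2048-256' in_out_pair.
        in_len = pair.split("-")[0]
        if f"{repo_id}:{in_len}" in exclude:
            continue
        yield {
            "repo_id": repo_id,
            "in_out_pair": pair,
            "batch_size": conf.get("batch_size", 1),
            "low_bit": conf.get("low_bit", "sym_int4"),
            "num_beams": conf.get("num_beams", 1),
            "test_api": conf["test_api"],
        }


if __name__ == "__main__":
    for case in iter_benchmark_cases("arc-perf-transformers-436.yaml"):
        print(case)
```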
diff --git a/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml b/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml
index d675d506629..9b9ab1f14ae 100644
--- a/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml
+++ b/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml
@@ -6,6 +6,18 @@ repo_id:
- 'microsoft/phi-3-vision-128k-instruct'
- 'Qwen/Qwen2-7B-Instruct'
- 'microsoft/Phi-3-mini-128k-instruct'
+ - 'meta-llama/Llama-2-7b-chat-hf'
+ - 'meta-llama/Llama-2-13b-chat-hf'
+ - 'THUDM/chatglm3-6b-4bit'
+ - 'baichuan-inc/Baichuan2-7B-Chat'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
+ - 'THUDM/glm-4-9b-chat'
+ - 'openbmb/MiniCPM-2B-sft-bf16'
+ #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
+ - '01-ai/Yi-6B-Chat'
+ - 'mistralai/Mistral-7B-Instruct-v0.2'
+ - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
+ - '01-ai/Yi-1.5-6B-Chat'
local_model_hub: '/mnt/disk1/models'
warm_up: 1
num_trials: 3
@@ -19,4 +31,6 @@ in_out_pairs:
test_api:
- "transformer_int4_fp16_gpu" # on Intel GPU
cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
+exclude:
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml b/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml
index f3d55c83e35..368a8c636b5 100644
--- a/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml
+++ b/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml
@@ -6,6 +6,18 @@ repo_id:
- 'microsoft/phi-3-vision-128k-instruct'
- 'Qwen/Qwen2-7B-Instruct'
- 'microsoft/Phi-3-mini-128k-instruct'
+ - 'meta-llama/Llama-2-7b-chat-hf'
+ - 'meta-llama/Llama-2-13b-chat-hf'
+ - 'THUDM/chatglm3-6b-4bit'
+ - 'baichuan-inc/Baichuan2-7B-Chat'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
+ - 'THUDM/glm-4-9b-chat'
+ - 'openbmb/MiniCPM-2B-sft-bf16'
+ #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
+ - '01-ai/Yi-6B-Chat'
+ - 'mistralai/Mistral-7B-Instruct-v0.2'
+ - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
+ - '01-ai/Yi-1.5-6B-Chat'
local_model_hub: '/mnt/disk1/models'
warm_up: 1
num_trials: 3
@@ -22,4 +34,8 @@ cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu w
exclude:
- 'Qwen/Qwen1.5-7B-Chat:2048'
- 'meta-llama/Meta-Llama-3-8B-Instruct:2048'
-task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
\ No newline at end of file
+ - 'meta-llama/Llama-2-13b-chat-hf:2048'
+ - 'baichuan-inc/Baichuan2-7B-Chat:2048'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit:1024'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
+task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-437.yaml b/python/llm/test/benchmark/arc-perf-transformers-437.yaml
index 1c775344c43..bca87891f6b 100644
--- a/python/llm/test/benchmark/arc-perf-transformers-437.yaml
+++ b/python/llm/test/benchmark/arc-perf-transformers-437.yaml
@@ -6,6 +6,18 @@ repo_id:
- 'microsoft/phi-3-vision-128k-instruct'
- 'Qwen/Qwen2-7B-Instruct'
- 'microsoft/Phi-3-mini-128k-instruct'
+ - 'meta-llama/Llama-2-7b-chat-hf'
+ - 'meta-llama/Llama-2-13b-chat-hf'
+ - 'THUDM/chatglm3-6b-4bit'
+ - 'baichuan-inc/Baichuan2-7B-Chat'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
+ - 'THUDM/glm-4-9b-chat'
+ - 'openbmb/MiniCPM-2B-sft-bf16'
+ #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
+ - '01-ai/Yi-6B-Chat'
+ - 'mistralai/Mistral-7B-Instruct-v0.2'
+ - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
+ - '01-ai/Yi-1.5-6B-Chat'
local_model_hub: '/mnt/disk1/models'
warm_up: 1
num_trials: 3
@@ -19,4 +31,6 @@ in_out_pairs:
test_api:
- "transformer_int4_fp16_gpu" # on Intel GPU
cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
+exclude:
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/core-perf-test.yaml b/python/llm/test/benchmark/core-perf-test.yaml
index 55f738de54b..2def68c1494 100644
--- a/python/llm/test/benchmark/core-perf-test.yaml
+++ b/python/llm/test/benchmark/core-perf-test.yaml
@@ -3,7 +3,7 @@ repo_id:
- 'THUDM/chatglm3-6b'
- 'baichuan-inc/Baichuan2-7B-Chat'
- 'internlm/internlm-chat-7b'
- - 'Qwen/Qwen-7B-Chat'
+ # - 'Qwen/Qwen-7B-Chat' # requires transformers < 4.37.0
- 'BAAI/AquilaChat2-7B'
- 'meta-llama/Llama-2-7b-chat-hf'
- 'WisdomShell/CodeShell-7B'
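Editor's note: the comment added above ("requires transformers < 4.37.0"), like the `*_436`/`*_437` YAML splits earlier in this patch (the suffix presumably denoting the pinned transformers version), gates models on the installed transformers release. A hedged sketch of such a runtime check is below; this is not the benchmark harness's real mechanism, only an illustration of the constraint noted in the config comments.

```python
# Illustrative sketch only: skip models whose transformers requirement is not
# met by the installed version. The bound below comes from the comment in
# core-perf-test.yaml; the check itself is an assumption, not project code.
import transformers
from packaging.version import Version

# 'Qwen/Qwen-7B-Chat' is noted above as requiring transformers < 4.37.0
MODEL_MAX_TRANSFORMERS = {
    "Qwen/Qwen-7B-Chat": Version("4.37.0"),
}


def transformers_supports(repo_id: str) -> bool:
    upper = MODEL_MAX_TRANSFORMERS.get(repo_id)
    if upper is None:
        return True
    return Version(transformers.__version__) < upper


if __name__ == "__main__":
    print(transformers_supports("Qwen/Qwen-7B-Chat"))
```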
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128.yaml b/python/llm/test/benchmark/igpu-perf/1024-128.yaml
index b0bd5f30c20..759a7566237 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128.yaml
@@ -10,9 +10,15 @@ repo_id:
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- 'RWKV/v5-Eagle-7B-HF'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_437.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_436.yaml
similarity index 65%
rename from python/llm/test/benchmark/igpu-perf/1024-128_437.yaml
rename to python/llm/test/benchmark/igpu-perf/1024-128_436.yaml
index c6850389b97..c967f66a7ba 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml
index 39d575680ab..f66172d9a39 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml
@@ -9,9 +9,15 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml
similarity index 65%
rename from python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml
rename to python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml
index 68cbaf2a163..c224b65e745 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml
index 2730e465d47..76c35d4dde7 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml
@@ -9,9 +9,14 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml
similarity index 68%
rename from python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml
rename to python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml
index 3839d0d2951..917e6d0ff3c 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml
@@ -1,10 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml
index c53e6283919..bf5fc1e978b 100644
--- a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml
@@ -9,9 +9,15 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml
similarity index 65%
rename from python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml
rename to python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml
index 0eddd403b86..e9566c13250 100644
--- a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml
index 47b9839a789..60202594cba 100644
--- a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml
@@ -8,9 +8,15 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml
similarity index 52%
rename from python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml
rename to python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml
index 087da9773db..6448a358cb5 100644
--- a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
@@ -15,5 +9,5 @@ batch_size: 1 # default to 1
in_out_pairs:
- '3072-384'
test_api:
- - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows (catch GPU peak memory)
+ - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows, use fp16 for non-linear layer
cpu_embedding: True # whether put embedding to CPU (only avaiable now for gpu win related test_api)
diff --git a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml
index 39115e0231b..e70178744a3 100644
--- a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml
@@ -9,9 +9,15 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 3
num_trials: 5
diff --git a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml
similarity index 65%
rename from python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml
rename to python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml
index 1f0d11a2004..8faf43aed97 100644
--- a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 3
num_trials: 5
diff --git a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml
index 26e128a564c..514037a7380 100644
--- a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml
@@ -10,6 +10,13 @@ repo_id:
- '01-ai/Yi-6B-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml
deleted file mode 100644
index 4472b5da1f2..00000000000
--- a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml
+++ /dev/null
@@ -1,19 +0,0 @@
-repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
-local_model_hub: 'path to your local model hub'
-warm_up: 1
-num_trials: 3
-num_beams: 1 # default to greedy search
-low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
-batch_size: 1 # default to 1
-in_out_pairs:
- - '4096-512'
-test_api:
- - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows (catch GPU peak memory)
-cpu_embedding: True # whether put embedding to CPU (only avaiable now for gpu win related test_api)
From 90549883db6143453b4a8c02e131b4adc5b142d6 Mon Sep 17 00:00:00 2001
From: Jinhe
Date: Tue, 20 Aug 2024 18:01:42 +0800
Subject: [PATCH 08/11] Pytorch models transformers version update (#11860)
* yi sync
* delete 4.34 constraint
* delete 4.34 constraint
* delete 4.31 constraint
* delete 4.34 constraint
* delete 4.35 constraint
* added <=4.33.3 constraint
* added <=4.33.3 constraint
* switched to chinese prompt
---
.../llm/example/GPU/HuggingFace/LLM/yi/README.md | 12 ++++++------
.../llm/example/GPU/HuggingFace/LLM/yi/generate.py | 2 +-
.../GPU/PyTorch-Models/Model/codegeex2/README.md | 2 --
.../GPU/PyTorch-Models/Model/codellama/README.md | 4 ----
.../GPU/PyTorch-Models/Model/deciLM-7b/README.md | 2 --
.../GPU/PyTorch-Models/Model/mistral/README.md | 7 -------
.../GPU/PyTorch-Models/Model/replit/README.md | 4 +++-
.../GPU/PyTorch-Models/Model/solar/README.md | 4 ----
.../example/GPU/PyTorch-Models/Model/yi/README.md | 14 ++++++++++++--
.../GPU/PyTorch-Models/Model/yi/generate.py | 2 +-
10 files changed, 23 insertions(+), 30 deletions(-)
diff --git a/python/llm/example/GPU/HuggingFace/LLM/yi/README.md b/python/llm/example/GPU/HuggingFace/LLM/yi/README.md
index 1fb49f21523..080e2676fdc 100644
--- a/python/llm/example/GPU/HuggingFace/LLM/yi/README.md
+++ b/python/llm/example/GPU/HuggingFace/LLM/yi/README.md
@@ -122,18 +122,18 @@ In the example, several arguments can be passed to satisfy your requirements:
```log
Inference time: xxxx s
-------------------- Prompt --------------------
-What is AI?
+AI是什么?
-------------------- Output --------------------
-What is AI?
-Artificial Intelligence (AI) is the simulation of human intelligence in machines. AI is the science and engineering of making intelligent machines, especially intelligent computer programs.
+AI是什么?
+人工智能(Artificial Intelligence),英文缩写为AI。它是研究、开发用于模拟、延伸和扩展人的智能的理论、方法、技术及
```
#### [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
-What is AI?
+AI是什么?
-------------------- Output --------------------
-What is AI?
-Artificial Intelligence (AI) refers to the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self-
+AI是什么?
+人工智能(Artificial Intelligence, AI)是计算机科学的一个分支,它研究如何让计算机模拟人类的智能行为。人工智能可以通过模仿人类的思维过程和
```
\ No newline at end of file
diff --git a/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py b/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
index f32f272c13a..643c5f7b34d 100644
--- a/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
+++ b/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
@@ -27,7 +27,7 @@
parser.add_argument('--repo-id-or-model-path', type=str, default="01-ai/Yi-6B-Chat",
help='The huggingface repo id for the Yi model to be downloaded'
', or the path to the huggingface checkpoint folder')
- parser.add_argument('--prompt', type=str, default="What is AI?",
+ parser.add_argument('--prompt', type=str, default="AI是什么?",
help='Prompt to infer')
parser.add_argument('--n-predict', type=int, default=32,
help='Max tokens to predict')
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md b/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md
index 37f801a28bf..bc8cfa62907 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md
@@ -16,7 +16,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-pip install transformers==4.31.0
```
#### 1.2 Installation on Windows
@@ -27,7 +26,6 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-pip install transformers==4.31.0
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md b/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md
index 497a6828b24..ff68817eca4 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md
@@ -14,8 +14,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers
```
#### 1.2 Installation on Windows
@@ -26,8 +24,6 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md b/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md
index ff8eab5ae09..a9e66f54732 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md
@@ -14,8 +14,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.35.2 # required by DeciLM-7B
```
#### 1.2 Installation on Windows
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md b/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md
index 4fc017e1ba7..4f3e58b045c 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md
@@ -4,7 +4,6 @@ In this directory, you will find examples on how you could use IPEX-LLM `optimiz
## Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
-**Important: According to [Mistral Troubleshooting](https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting), please make sure you have installed `transformers==4.34.0` to run the example.**
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
@@ -16,9 +15,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-# Refer to https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting, please make sure you are using a stable version of Transformers, 4.34.0 or newer.
-pip install transformers==4.34.0
```
#### 1.2 Installation on Windows
@@ -29,9 +25,6 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-# Refer to https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting, please make sure you are using a stable version of Transformers, 4.34.0 or newer.
-pip install transformers==4.34.0
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md b/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md
index 4938682aea2..3bfbf245655 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md
@@ -15,7 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-pip install "transformers<4.35"
+pip install "transformers<=4.33.3"
```
#### 1.2 Installation on Windows
@@ -26,6 +26,8 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+
+pip install "transformers<=4.33.3"
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md b/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md
index 2b718cd4a6a..4d157d19bf3 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md
@@ -14,8 +14,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.35.2 # required by SOLAR
```
#### 1.2 Installation on Windows
@@ -26,8 +24,6 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.35.2 # required by SOLAR
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md b/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md
index b48b95325c3..2b500175575 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md
@@ -1,5 +1,5 @@
# Yi
-In this directory, you will find examples on how you could use IPEX-LLM `optimize_model` API on Yi models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B) as a reference Yi model.
+In this directory, you will find examples on how you could use IPEX-LLM `optimize_model` API on Yi models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B) and [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) as reference Yi models.
## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
@@ -112,7 +112,7 @@ python ./generate.py
In the example, several arguments can be passed to satisfy your requirements:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Yi model (e.g. `01-ai/Yi-6B`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'01-ai/Yi-6B'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Yi model (e.g. `01-ai/Yi-6B` and `01-ai/Yi-6B-Chat`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'01-ai/Yi-6B-Chat'`.
- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
@@ -127,3 +127,13 @@ AI是什么?
AI是什么?
人工智能(Artificial Intelligence),英文缩写为AI。它是研究、开发用于模拟、延伸和扩展人的智能的理论、方法、技术及
```
+
+#### [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
+```log
+Inference time: xxxx s
+-------------------- Prompt --------------------
+AI是什么?
+-------------------- Output --------------------
+AI是什么?
+人工智能(Artificial Intelligence, AI)是计算机科学的一个分支,它研究如何让计算机模拟人类的智能行为。人工智能可以通过模仿人类的思维过程和
+```
\ No newline at end of file
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py
index 31256cda112..871f5f4fbd1 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py
@@ -26,7 +26,7 @@
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Yi model')
- parser.add_argument('--repo-id-or-model-path', type=str, default="01-ai/Yi-6B",
+ parser.add_argument('--repo-id-or-model-path', type=str, default="01-ai/Yi-6B-Chat",
help='The huggingface repo id for the Yi model to be downloaded'
', or the path to the huggingface checkpoint folder')
parser.add_argument('--prompt', type=str, default="AI是什么?",
From 9a8eb10cd6f28c1b78d1cad549472f9b140d9e41 Mon Sep 17 00:00:00 2001
From: Yina Chen <33650826+cyita@users.noreply.github.com>
Date: Tue, 20 Aug 2024 13:11:37 +0300
Subject: [PATCH 09/11] Update compresskv model forward type logic (#11868)
* update
* fix
---
.../src/ipex_llm/transformers/models/llama.py | 18 ++++++++++-----
.../ipex_llm/transformers/models/minicpm.py | 9 ++++----
.../src/ipex_llm/transformers/models/phi3.py | 11 +++++-----
.../src/ipex_llm/transformers/models/qwen2.py | 22 +++++++++----------
4 files changed, 33 insertions(+), 27 deletions(-)
diff --git a/python/llm/src/ipex_llm/transformers/models/llama.py b/python/llm/src/ipex_llm/transformers/models/llama.py
index 5e633da7406..2c9c17e7a58 100644
--- a/python/llm/src/ipex_llm/transformers/models/llama.py
+++ b/python/llm/src/ipex_llm/transformers/models/llama.py
@@ -128,7 +128,9 @@ def llama_model_forward_4_36(
use_quantize = use_quantize_kv_cache(
self.layers[0].mlp.up_proj, input,
self.config.num_attention_heads//self.config.num_key_value_heads)
- if should_use_compresskv(input, input.shape[1]):
+ use_compresskv = should_use_compresskv(input, input.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
+ if use_compresskv:
if not isinstance(past_key_values, DynamicCompressCache):
if use_quantize:
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(
@@ -137,7 +139,7 @@ def llama_model_forward_4_36(
past_key_values = DynamicCompressCache.from_legacy_cache(
past_key_values)
elif use_quantize:
- if not isinstance(past_key_values, (DynamicFp8Cache, DynamicCompressCache)):
+ if not isinstance(past_key_values, DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
return llama_model_forward_4_36_internal(
self=self,
@@ -174,7 +176,9 @@ def llama_model_forward_4_38(
use_quantize = use_quantize_kv_cache(
self.layers[0].mlp.up_proj, input,
self.config.num_attention_heads//self.config.num_key_value_heads)
- if should_use_compresskv(input, input.shape[1]):
+ use_compresskv = should_use_compresskv(input, input.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
+ if use_compresskv:
if not isinstance(past_key_values, DynamicCompressCache):
if use_quantize:
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(
@@ -183,7 +187,7 @@ def llama_model_forward_4_38(
past_key_values = DynamicCompressCache.from_legacy_cache(
past_key_values)
elif use_quantize:
- if not isinstance(past_key_values, (DynamicFp8Cache, DynamicCompressCache)):
+ if not isinstance(past_key_values, DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
return llama_model_forward_4_38_internal(
self=self,
@@ -221,7 +225,9 @@ def llama_model_forward_4_41(
use_quantize = use_quantize_kv_cache(
self.layers[0].mlp.up_proj, input,
self.config.num_attention_heads//self.config.num_key_value_heads)
- if should_use_compresskv(input, input.shape[1]):
+ use_compresskv = should_use_compresskv(input, input.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
+ if use_compresskv:
if not isinstance(past_key_values, DynamicCompressCache):
if use_quantize:
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(
@@ -230,7 +236,7 @@ def llama_model_forward_4_41(
past_key_values = DynamicCompressCache.from_legacy_cache(
past_key_values)
elif use_quantize:
- if not isinstance(past_key_values, (DynamicFp8Cache, DynamicCompressCache)):
+ if not isinstance(past_key_values, DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
return llama_model_forward_4_41_internal(
self=self,
diff --git a/python/llm/src/ipex_llm/transformers/models/minicpm.py b/python/llm/src/ipex_llm/transformers/models/minicpm.py
index afbcde6c657..d248c507773 100644
--- a/python/llm/src/ipex_llm/transformers/models/minicpm.py
+++ b/python/llm/src/ipex_llm/transformers/models/minicpm.py
@@ -182,7 +182,8 @@ def minicpm_model_forward(
use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.up_proj, inputs,
self.config.num_attention_heads //
self.config.num_key_value_heads)
- use_compress_kv = should_use_compresskv(inputs, inputs.shape[1])
+ use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
use_cache = use_cache if use_cache is not None else self.config.use_cache
if use_cache:
@@ -192,11 +193,11 @@ def minicpm_model_forward(
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values)
else:
past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values)
- elif use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache,
- DynamicCompressCache)):
+ elif (use_quantize_kv and not use_compress_kv
+ and not isinstance(past_key_values, DynamicFp8Cache)):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
elif (not use_quantize_kv and not use_compress_kv
- and not isinstance(past_key_values, (DynamicNormalCache, DynamicCompressCache))):
+ and not isinstance(past_key_values, DynamicNormalCache)):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
# ipex-llm changes end
return origin_forward(
diff --git a/python/llm/src/ipex_llm/transformers/models/phi3.py b/python/llm/src/ipex_llm/transformers/models/phi3.py
index 823fb10391a..bfa380c2f51 100644
--- a/python/llm/src/ipex_llm/transformers/models/phi3.py
+++ b/python/llm/src/ipex_llm/transformers/models/phi3.py
@@ -256,7 +256,8 @@ def model_forward(
use_cache = use_cache if use_cache is not None else self.config.use_cache
inputs = input_ids if input_ids is not None else inputs_embeds
use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.down_proj, inputs)
- use_compress_kv = should_use_compresskv(inputs, inputs.shape[1])
+ use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
if use_cache:
if use_compress_kv and not isinstance(past_key_values,
DynamicCompressCache):
@@ -264,13 +265,11 @@ def model_forward(
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values)
else:
past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values)
- if use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache,
- DynamicCompressCache)):
+ if use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
+ DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
if not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
- (DynamicNormalCache,
- DynamicCompressCache
- )):
+ DynamicNormalCache):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
if past_key_values.get_seq_length() == 0:
n_layer = self.config.num_hidden_layers
diff --git a/python/llm/src/ipex_llm/transformers/models/qwen2.py b/python/llm/src/ipex_llm/transformers/models/qwen2.py
index c01488a6fb6..802c5e7ec45 100644
--- a/python/llm/src/ipex_llm/transformers/models/qwen2.py
+++ b/python/llm/src/ipex_llm/transformers/models/qwen2.py
@@ -120,7 +120,8 @@ def qwen2_model_forward(
and use_quantize_kv_cache(self.layers[0].mlp.up_proj, inputs,
self.config.num_attention_heads//self.config.num_key_value_heads)
)
- use_compress_kv = should_use_compresskv(inputs, inputs.shape[1])
+ use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
if use_cache:
if use_compress_kv and not isinstance(past_key_values, DynamicCompressCache):
@@ -128,12 +129,11 @@ def qwen2_model_forward(
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values)
else:
past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values)
- elif use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache,
- DynamicCompressCache)):
+ elif use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
+ DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
if not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
- (DynamicNormalCache,
- DynamicCompressCache)):
+ DynamicNormalCache):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
past_key_values_length = past_key_values.get_usable_length(seq_length)
# ipex-llm changes end
@@ -316,7 +316,8 @@ def qwen2_model_forward_4_42(
and use_quantize_kv_cache(self.layers[0].mlp.up_proj, inputs_embeds,
self.config.num_attention_heads//self.config.num_key_value_heads)
)
- use_compress_kv = should_use_compresskv(inputs_embeds, inputs_embeds.shape[1])
+ use_compress_kv = should_use_compresskv(inputs_embeds, inputs_embeds.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
if use_cache:
if use_compress_kv and not isinstance(past_key_values, DynamicCompressCache):
@@ -324,12 +325,11 @@ def qwen2_model_forward_4_42(
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values)
else:
past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values)
- elif use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache,
- DynamicCompressCache)):
+ elif use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
+ DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
- elif not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
- (DynamicNormalCache,
- DynamicCompressCache)):
+ if not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
+ DynamicNormalCache):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
# ipex-llm changes end
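Editor's note: across llama, minicpm, phi3 and qwen2, this patch converges on one cache-selection rule: once a compressed KV cache is already in use, keep using it; fall back to a quantized (FP8) or normal cache only when compression is not active. The sketch below restates that precedence in isolation. The classes are simplified stand-ins for ipex-llm's cache types (including the assumed subclass relationships), not the library's API, and the real code converts existing caches via `from_legacy_cache` rather than constructing empty ones.

```python
# Simplified sketch (not ipex-llm source) of the cache-selection precedence
# introduced by this patch: compress (possibly fp8) > quantized fp8 > normal.
class NormalCache: ...
class Fp8Cache(NormalCache): ...
class CompressCache(NormalCache): ...
class CompressFp8Cache(CompressCache): ...  # assumed to subclass CompressCache


def select_cache(past_key_values, prefers_compress: bool, use_quantize: bool):
    # Mirrors the patched forwards: an existing compressed cache forces the
    # compress path, so the quantize branch no longer needs to special-case it.
    use_compress = prefers_compress or isinstance(past_key_values, CompressCache)
    if use_compress:
        if not isinstance(past_key_values, CompressCache):
            return CompressFp8Cache() if use_quantize else CompressCache()
        return past_key_values
    if use_quantize:
        return past_key_values if isinstance(past_key_values, Fp8Cache) else Fp8Cache()
    return past_key_values if isinstance(past_key_values, NormalCache) else NormalCache()
```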
From f573959b2a6e4854379905777afaac58aedb908e Mon Sep 17 00:00:00 2001
From: RyuKosei <70006706+RyuKosei@users.noreply.github.com>
Date: Tue, 20 Aug 2024 18:50:00 +0800
Subject: [PATCH 10/11] Update local import for ppl (#11866)
Co-authored-by: jenniew
---
python/llm/dev/benchmark/perplexity/run_wikitext.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/python/llm/dev/benchmark/perplexity/run_wikitext.py b/python/llm/dev/benchmark/perplexity/run_wikitext.py
index 50991558f35..92426a86fb1 100644
--- a/python/llm/dev/benchmark/perplexity/run_wikitext.py
+++ b/python/llm/dev/benchmark/perplexity/run_wikitext.py
@@ -21,7 +21,6 @@
import torch
from tqdm import tqdm
from datasets import load_dataset
-from ipex_llm.utils.common import invalidInputError
parser = argparse.ArgumentParser()
@@ -63,6 +62,7 @@ def parse_kwargs(kwstr):
data = f.read()
encodings = tokenizer(data.decode("utf-8").strip("\n"), return_tensors="pt")
else:
+ from ipex_llm.utils.common import invalidInputError
raise invalidInputError(False, "Must specify either dataset or datapath.")
if not args.max_length:
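Editor's note: the hunk above defers the `ipex_llm.utils.common` import to the error path, so the module-level imports of `run_wikitext.py` no longer depend on it. A generic, hedged sketch of that deferred-import pattern (not the script itself):

```python
# Sketch of the deferred-import pattern used above: the error helper is
# imported only on the failure path, keeping the happy path free of the
# optional dependency.
def load_text(dataset=None, datapath=None):
    if dataset is not None:
        return dataset
    if datapath is not None:
        with open(datapath, "rb") as f:
            return f.read().decode("utf-8").strip("\n")
    # Imported lazily, mirroring run_wikitext.py after this patch;
    # invalidInputError raises when its condition is False.
    from ipex_llm.utils.common import invalidInputError
    invalidInputError(False, "Must specify either dataset or datapath.")
```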
From 52728feb7738ebc29c7cc2afb73fd4d49c30a664 Mon Sep 17 00:00:00 2001
From: cranechu <1340390339@qq.com>
Date: Tue, 20 Aug 2024 19:18:49 +0800
Subject: [PATCH 11/11] fix: textual adjustment
---
python/llm/dev/benchmark/perplexity/README.md | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md
index 3d824ac570f..410358eed34 100644
--- a/python/llm/dev/benchmark/perplexity/README.md
+++ b/python/llm/dev/benchmark/perplexity/README.md
@@ -2,9 +2,7 @@
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py)
## Environment Preparation
-Install ipex-llm and dataset.
```bash
-# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install datasets
```
@@ -14,9 +12,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
-## Running PPL Evaluation
+## PPL Evaluation
### 1. Run on Wikitext
-An example to run perplexity on wikitext:
+An example to run perplexity on [wikitext](https://paperswithcode.com/dataset/wikitext-2):
```bash
python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
```
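Editor's note: the `--stride` and `--max_length` flags shown in this README diff control a fixed-length sliding-window perplexity computation, in the spirit of the transformers perplexity guide the README cites. The condensed sketch below illustrates that metric; it is not `run_wikitext.py` itself, and the model/tokenizer choices are only examples.

```python
# Condensed sliding-window perplexity sketch (illustrative, not the benchmark
# script): each window of max_length tokens is scored, with the overlapping
# context tokens masked out via the -100 label.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model, encodings, max_length=4096, stride=512, device="cpu"):
    nlls, prev_end = [], 0
    seq_len = encodings.input_ids.size(1)
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # tokens actually scored in this window
        input_ids = encodings.input_ids[:, begin:end].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # mask the overlapping context
        with torch.no_grad():
            nlls.append(model(input_ids, labels=target_ids).loss)
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).mean())


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    enc = tok("some long evaluation text ...", return_tensors="pt")
    print(perplexity(model, enc, max_length=4096, stride=512))
```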