From 5e0144997665fdb3b522168d52a6fbd7c41b1f89 Mon Sep 17 00:00:00 2001
From: cranechu <1340390339@qq.com>
Date: Tue, 20 Aug 2024 16:41:34 +0800
Subject: [PATCH 01/11] feat: update readme for ppl test
---
python/llm/dev/benchmark/perplexity/README.md | 68 +++++++++++++++++--
1 file changed, 63 insertions(+), 5 deletions(-)
diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md
index 8e6d5bacb89..d20ccf686f1 100644
--- a/python/llm/dev/benchmark/perplexity/README.md
+++ b/python/llm/dev/benchmark/perplexity/README.md
@@ -1,29 +1,87 @@
# Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py)
-## Run on Wikitext
+## Requirements
+To run the perplexity test with IPEX-LLM on Intel GPUs, there are some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
+### 1. Install IPEX-LLM
+We suggest using conda to manage environment:
```bash
-pip install datasets
+conda create -n llm python=3.11
+conda activate llm
+# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
+pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
-An example to run perplexity on wikitext:
+
+
+### 2. Configure OneAPI environment variables for Linux
+
+> [!NOTE]
+> Skip this step if you are running on Windows.
+
+This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
+
```bash
+source /opt/intel/oneapi/setvars.sh
+```
-python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
+### 3. Runtime Configurations
+For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
+
+For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
+
+```bash
+export USE_XETLA=OFF
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export SYCL_CACHE_PERSISTENT=1
```
-## Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset
+
+
+
+
+For Intel Data Center GPU Max Series
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export SYCL_CACHE_PERSISTENT=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+
+
+
+For Intel iGPU
+
+```bash
+export SYCL_CACHE_PERSISTENT=1
+export BIGDL_LLM_XMX_DISABLED=1
+```
+
+
+
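As a quick sanity check after the runtime configurations above, the following minimal sketch (illustrative only, assuming `ipex-llm[xpu]` from step 1 is installed) confirms that the XPU device is visible to PyTorch:

```python
# Minimal sketch: confirm the Intel GPU (XPU) backend is visible.
# Assumes ipex-llm[xpu] is installed, which brings intel_extension_for_pytorch.
import torch
import intel_extension_for_pytorch as ipex  # noqa: F401  (registers the torch.xpu backend)

if torch.xpu.is_available():
    print(f"XPU devices found: {torch.xpu.device_count()}")
    print(f"Device 0: {torch.xpu.get_device_name(0)}")
else:
    print("No XPU device detected; re-check the oneAPI and runtime configurations above.")
```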
+### 4. Install Dependencies
+Install the `datasets` dependency to download and load the dataset for the test.
```bash
pip install datasets
```
+## Running the test
+### 1.Run on Wikitext
+An example to run perplexity on wikitext:
+```bash
+python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
+```
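For reference, the command above follows the fixed-length, sliding-window perplexity evaluation described in the linked transformers guide. Below is a minimal sketch of that computation; the model name, stride, and window size simply mirror the example command and are not the exact implementation of `run_wikitext.py`:

```python
# Illustrative sketch of sliding-window perplexity (adapted from the transformers guide).
# Model, dataset, stride, and window size are assumptions mirroring the example command.
import torch
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Meta-Llama-3-8B"   # assumed; any causal LM works here
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(model_id)

text = "\n\n".join(load_dataset("wikitext", "wikitext-2-raw-v1", split="test")["text"])
encodings = tokenizer(text, return_tensors="pt")

max_length, stride = 4096, 512
seq_len = encodings.input_ids.size(1)

nlls, prev_end = [], 0
for begin in range(0, seq_len, stride):
    end = min(begin + max_length, seq_len)
    trg_len = end - prev_end                  # only score tokens not seen in a previous window
    input_ids = encodings.input_ids[:, begin:end]
    target_ids = input_ids.clone()
    target_ids[:, :-trg_len] = -100           # mask the overlapping context tokens
    with torch.no_grad():
        nlls.append(model(input_ids=input_ids, labels=target_ids).loss * trg_len)
    prev_end = end
    if end == seq_len:
        break

ppl = torch.exp(torch.stack(nlls).sum() / prev_end)
print(f"perplexity: {ppl.item():.2f}")
```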
+### 2.Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset
An example to run perplexity on chatglm3-6b using the default Chinese datasets ("multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh"):
```bash
python run_longbench.py --model_path THUDM/chatglm3-6b --precisions float16 sym_int4 --device xpu --language zh
```
+
Notes:
- If you want to test model perplexity on a few selected datasets from the `LongBench` dataset, please use the format below.
```bash
From 6122714f9d454b65111b801a2aba86fa21f6bd7a Mon Sep 17 00:00:00 2001
From: cranechu <1340390339@qq.com>
Date: Tue, 20 Aug 2024 17:04:12 +0800
Subject: [PATCH 02/11] fix: textual adjustments
---
python/llm/dev/benchmark/perplexity/README.md | 63 +++----------------
1 file changed, 8 insertions(+), 55 deletions(-)
diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md
index d20ccf686f1..6f8e721129d 100644
--- a/python/llm/dev/benchmark/perplexity/README.md
+++ b/python/llm/dev/benchmark/perplexity/README.md
@@ -1,80 +1,33 @@
# Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py)
-## Requirements
+## Environment Preparations
To run the perplexity test with IPEX-LLM on Intel GPUs, there are some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
-### 1. Install IPEX-LLM
-We suggest using conda to manage environment:
+We suggest using conda to manage the ipex-llm environment:
```bash
conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
```
-
-
-### 2. Configure OneAPI environment variables for Linux
-
-> [!NOTE]
-> Skip this step if you are running on Windows.
-
-This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
-
-```bash
-source /opt/intel/oneapi/setvars.sh
-```
-
-### 3. Runtime Configurations
-For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.
-
-
-For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series
-
-```bash
-export USE_XETLA=OFF
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-```
-
-
-
-
-
-For Intel Data Center GPU Max Series
-
+Install the `datasets` dependency to download and load the dataset for the test.
```bash
-export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-export ENABLE_SDP_FUSION=1
+pip install datasets
```
-> Note: `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
-
-
-
-
-For Intel iGPU
+This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
```bash
-export SYCL_CACHE_PERSISTENT=1
-export BIGDL_LLM_XMX_DISABLED=1
+source /opt/intel/oneapi/setvars.sh
```
-
-
-### 4. Install Dependencies
-Install the `datasets` dependency to download and load the dataset for the test.
-```bash
-pip install datasets
-```
## Running the test
-### 1.Run on Wikitext
+### 1. Run on Wikitext
An example to run perplexity on wikitext:
```bash
python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
```
-### 2.Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset
+### 2. Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset
An example to run perplexity on chatglm3-6b using the default Chinese datasets ("multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh"):
```bash
From 9e67b22502337b783c22fff47ec3240002d7de91 Mon Sep 17 00:00:00 2001
From: cranechu <1340390339@qq.com>
Date: Tue, 20 Aug 2024 17:34:42 +0800
Subject: [PATCH 03/11] fix: textual adjustments
---
python/llm/dev/benchmark/perplexity/README.md | 13 +++----------
1 file changed, 3 insertions(+), 10 deletions(-)
diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md
index 6f8e721129d..3d824ac570f 100644
--- a/python/llm/dev/benchmark/perplexity/README.md
+++ b/python/llm/dev/benchmark/perplexity/README.md
@@ -1,18 +1,11 @@
# Perplexity
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py)
-## Environment Preparations
-To run the perplexity test with IPEX-LLM on Intel GPUs, there are some recommended requirements for your machine; please refer to [here](../../../README.md#requirements) for more information.
-
-We suggest using conda to manage the ipex-llm environment:
+## Environment Preparation
+Install ipex-llm and the `datasets` package.
```bash
-conda create -n llm python=3.11
-conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-```
-Install the `datasets` dependency to download and load the dataset for the test.
-```bash
pip install datasets
```
This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
@@ -21,7 +14,7 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
-## Running the test
+## Running PPL Evaluation
### 1. Run on Wikitext
An example to run perplexity on wikitext:
```bash
From 979c738194d9afa8281878ab8c38dc01d62b64d7 Mon Sep 17 00:00:00 2001
From: SONG Ge <38711238+sgwhat@users.noreply.github.com>
Date: Tue, 20 Aug 2024 17:29:49 +0800
Subject: [PATCH 04/11] Add ipex-llm npu option in setup.py (#11858)
* add ipex-llm npu release
* update example doc
* meet latest release changes
---
.../example/NPU/HF-Transformers-AutoModels/LLM/README.md | 7 ++-----
python/llm/setup.py | 7 +++++++
2 files changed, 9 insertions(+), 5 deletions(-)
diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
index 31e055b5bea..728617f0a45 100644
--- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
+++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md
@@ -91,11 +91,8 @@ We suggest using conda to manage environment:
conda create -n llm python=3.10
conda activate llm
-# install ipex-llm with 'all' option
-pip install --pre --upgrade ipex-llm[all]
-pip install --pre --upgrade bigdl-core-npu
-
-pip install transformers==4.40
+# install ipex-llm with 'npu' option
+pip install --pre --upgrade ipex-llm[npu]
```
### 2. Runtime Configurations
diff --git a/python/llm/setup.py b/python/llm/setup.py
index ecb7aea861b..f9adc5f39f8 100644
--- a/python/llm/setup.py
+++ b/python/llm/setup.py
@@ -300,6 +300,12 @@ def setup_package():
serving_requires = ['py-cpuinfo']
serving_requires += SERVING_DEP
+ npu_requires = copy.deepcopy(all_requires)
+ cpu_transformers_version = ['transformers == 4.37.0', 'tokenizers == 0.15.2']
+ for exclude_require in cpu_transformers_version:
+ npu_requires.remove(exclude_require)
+ npu_requires += ["transformers==4.40.0",
+ "bigdl-core-npu==" + CORE_XE_VERSION + ";platform_system=='Windows'"]
metadata = dict(
name='ipex_llm',
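The hunk below then registers the new requirement set under `extras_require`. For clarity, here is a stand-alone sketch of the list manipulation performed above; the base `all` list is shortened and the core version is left as a placeholder (both are assumptions for illustration):

```python
# Stand-alone illustration of how the "npu" extra is derived from the "all" extra:
# drop the CPU transformers/tokenizers pins, then add the NPU-specific packages.
import copy

CORE_XE_VERSION = "x.y.z"  # placeholder; the real value is defined elsewhere in setup.py

all_requires = [
    "transformers == 4.37.0", "tokenizers == 0.15.2",  # CPU pins to be swapped out
    "sentencepiece", "accelerate == 0.23.0",            # shortened for illustration
]

npu_requires = copy.deepcopy(all_requires)
for exclude_require in ["transformers == 4.37.0", "tokenizers == 0.15.2"]:
    npu_requires.remove(exclude_require)
npu_requires += [
    "transformers==4.40.0",
    "bigdl-core-npu==" + CORE_XE_VERSION + ";platform_system=='Windows'",
]

print(npu_requires)
```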
@@ -323,6 +329,7 @@ def setup_package():
},
extras_require={"all": all_requires,
"xpu": xpu_requires, # default to ipex 2.1 for linux and windows
+ "npu": npu_requires,
"xpu-2-1": xpu_21_requires,
"serving": serving_requires,
"cpp": cpp_requires,
From a9ab309690ef1e69e85153c9963f0b6feab011ab Mon Sep 17 00:00:00 2001
From: Yishuo Wang
Date: Tue, 20 Aug 2024 17:32:51 +0800
Subject: [PATCH 05/11] optimize phi3 memory usage (#11867)
---
python/llm/src/ipex_llm/transformers/kv.py | 15 +++++++++++++++
.../llm/src/ipex_llm/transformers/models/phi3.py | 14 +++++++++++---
2 files changed, 26 insertions(+), 3 deletions(-)
diff --git a/python/llm/src/ipex_llm/transformers/kv.py b/python/llm/src/ipex_llm/transformers/kv.py
index 100da837a9e..8b20f546893 100644
--- a/python/llm/src/ipex_llm/transformers/kv.py
+++ b/python/llm/src/ipex_llm/transformers/kv.py
@@ -121,6 +121,21 @@ def update(
return self.key_cache[layer_idx], self.value_cache[layer_idx]
+ @classmethod
+ def from_reserved(cls, layers: int,
+ bsz: int, n_head: int, length: int, head_dim: int,
+ dtype: torch.dtype, device: torch.device):
+ past_key_values = cls()
+ for _i in range(layers):
+ k_cache, v_cache = init_kv_cache(
+ bsz, n_head, head_dim,
+ 0, length + cls.KV_ALLOC_BLOCK_LENGTH,
+ dtype, device
+ )
+ past_key_values.key_cache.append(k_cache)
+ past_key_values.value_cache.append(v_cache)
+ return past_key_values
+
# Copied from transformers.models.llama.modeling_llama.repeat_kv
def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor:
diff --git a/python/llm/src/ipex_llm/transformers/models/phi3.py b/python/llm/src/ipex_llm/transformers/models/phi3.py
index 5c630681cc9..823fb10391a 100644
--- a/python/llm/src/ipex_llm/transformers/models/phi3.py
+++ b/python/llm/src/ipex_llm/transformers/models/phi3.py
@@ -254,9 +254,9 @@ def model_forward(
):
# IPEX-LLM OPT: kv cache and quantize kv cache and sdp
use_cache = use_cache if use_cache is not None else self.config.use_cache
- input = input_ids if input_ids is not None else inputs_embeds
- use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.down_proj, input)
- use_compress_kv = should_use_compresskv(input, input.shape[1])
+ inputs = input_ids if input_ids is not None else inputs_embeds
+ use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.down_proj, inputs)
+ use_compress_kv = should_use_compresskv(inputs, inputs.shape[1])
if use_cache:
if use_compress_kv and not isinstance(past_key_values,
DynamicCompressCache):
@@ -272,6 +272,14 @@ def model_forward(
DynamicCompressCache
)):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
+ if past_key_values.get_seq_length() == 0:
+ n_layer = self.config.num_hidden_layers
+ n_head = self.config.num_attention_heads
+ head_dim = self.config.hidden_size // self.config.num_attention_heads
+ past_key_values = DynamicNormalCache.from_reserved(
+ n_layer, inputs.size(0), n_head, inputs.size(1), head_dim,
+ inputs.dtype, inputs.device
+ )
return origin_model_forward(
self=self,
input_ids=input_ids,
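To make the optimization above concrete: at prefill, when the cache is still empty, the KV buffers for every layer are reserved in one shot instead of being grown step by step during decoding. Below is a simplified stand-alone sketch of that idea; the shapes, the padding constant, and the `reserve_kv_cache` helper are illustrative assumptions, not the `ipex_llm` implementation:

```python
# Simplified illustration of reserving per-layer KV buffers up front at prefill,
# rather than reallocating/concatenating them as the sequence grows.
import torch

KV_ALLOC_BLOCK_LENGTH = 256  # assumed headroom so several decode steps fit without realloc

def reserve_kv_cache(layers, bsz, n_head, length, head_dim, dtype, device):
    key_cache, value_cache = [], []
    for _ in range(layers):
        # Allocate capacity for the prompt plus some headroom in one shot.
        k = torch.empty(bsz, n_head, length + KV_ALLOC_BLOCK_LENGTH, head_dim,
                        dtype=dtype, device=device)
        v = torch.empty_like(k)
        key_cache.append(k)
        value_cache.append(v)
    return key_cache, value_cache

# Example: a phi-3-mini-like configuration at prefill time.
keys, values = reserve_kv_cache(layers=32, bsz=1, n_head=32,
                                length=1024, head_dim=96,
                                dtype=torch.float16, device="cpu")
print(keys[0].shape)  # torch.Size([1, 32, 1280, 96])
```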
From f5f3f19f98efe77c23eb2f5ccadbdaf58643ba8b Mon Sep 17 00:00:00 2001
From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com>
Date: Tue, 20 Aug 2024 17:37:58 +0800
Subject: [PATCH 06/11] Update `ipex-llm` default transformers version to
4.37.0 (#11859)
* Update default transformers version to 4.37.0
* Add dependency requirements for qwen and qwen-vl
* Temp fix transformers version for these not yet verified models
* Skip qwen test in UT for now as it requires transformers<4.37.0
---
.../CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md | 2 ++
.../CPU/HF-Transformers-AutoModels/Model/qwen/README.md | 4 ++++
python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md | 4 ++++
python/llm/example/GPU/HuggingFace/LLM/qwen/README.md | 2 ++
.../llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md | 2 ++
.../GPU/HuggingFace/Multimodal/voiceassistant/README.md | 2 ++
.../llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md | 2 ++
python/llm/example/GPU/PyTorch-Models/Model/llava/README.md | 2 --
python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md | 2 ++
.../llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md | 2 ++
python/llm/setup.py | 2 +-
python/llm/test/inference_gpu/test_transformers_api.py | 2 +-
.../llm/test/inference_gpu/test_transformers_api_RMSNorm.py | 2 +-
.../llm/test/inference_gpu/test_transformers_api_attention.py | 2 +-
python/llm/test/inference_gpu/test_transformers_api_mlp.py | 2 +-
15 files changed, 27 insertions(+), 7 deletions(-)
diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md
index 7dc3dedc5cb..7f5061eccd6 100644
--- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md
+++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md
@@ -20,6 +20,7 @@ conda activate llm
# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
@@ -32,6 +33,7 @@ conda activate llm
pip install --pre --upgrade ipex-llm[all]
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib
```
diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md
index cee06098d2d..992ea9ee10e 100644
--- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md
+++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md
@@ -22,6 +22,8 @@ conda activate llm
# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
+
+pip install "transformers<4.37.0"
pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation
```
@@ -32,6 +34,8 @@ conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all]
+
+pip install "transformers<4.37.0"
pip install tiktoken einops transformers_stream_generator
```
diff --git a/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md b/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md
index 25744465c26..f6f5f1ffe8e 100644
--- a/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md
+++ b/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md
@@ -19,6 +19,8 @@ conda activate llm
# install the latest ipex-llm nightly build with 'all' option
pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu
+
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
@@ -29,6 +31,8 @@ conda create -n llm python=3.11
conda activate llm
pip install --pre --upgrade ipex-llm[all]
+
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib
```
diff --git a/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md
index 500e2b0f2ad..8311f7f1369 100644
--- a/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md
+++ b/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md
@@ -15,6 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation
```
@@ -27,6 +28,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation
```
diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
index fb02816b1f0..737232661fd 100644
--- a/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
+++ b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md
@@ -15,6 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
@@ -27,6 +28,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
index 67c0fb26249..7dea109b078 100644
--- a/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
+++ b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md
@@ -17,6 +17,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install librosa soundfile datasets
pip install accelerate
pip install SpeechRecognition sentencepiece colorama
@@ -33,6 +34,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install librosa soundfile datasets
pip install accelerate
pip install SpeechRecognition sentencepiece colorama
diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
index 29a4dc4619c..ac664fb0a36 100644
--- a/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
+++ b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md
@@ -16,6 +16,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install datasets soundfile librosa # required by audio processing
```
@@ -28,6 +29,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install datasets soundfile librosa # required by audio processing
```
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md b/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md
index 461ae53a8dd..77e0f1cfd9c 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md
@@ -16,7 +16,6 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install einops # install dependencies required by llava
-pip install transformers==4.36.2
git clone https://github.com/haotian-liu/LLaVA.git # clone the llava libary
cp generate.py ./LLaVA/ # copy our example to the LLaVA folder
@@ -34,7 +33,6 @@ conda activate llm
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install einops # install dependencies required by llava
-pip install transformers==4.36.2
git clone https://github.com/haotian-liu/LLaVA.git # clone the llava libary
copy generate.py .\LLaVA\ # copy our example to the LLaVA folder
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md
index 5f9a617aaa3..c480c545366 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md
@@ -15,6 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
@@ -27,6 +28,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install "transformers<4.37.0"
pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation
```
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md b/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md
index 171ff392422..98806eda677 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md
@@ -15,6 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install "datasets<2.18" soundfile # additional package required for SpeechT5 to conduct generation
```
@@ -27,6 +28,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+pip install transformers==4.36.2
pip install "datasets<2.18" soundfile # additional package required for SpeechT5 to conduct generation
```
diff --git a/python/llm/setup.py b/python/llm/setup.py
index f9adc5f39f8..4386293cac6 100644
--- a/python/llm/setup.py
+++ b/python/llm/setup.py
@@ -53,7 +53,7 @@
cpu_torch_version = ["torch==2.1.2+cpu;platform_system=='Linux'", "torch==2.1.2;platform_system=='Windows'"]
CONVERT_DEP = ['numpy == 1.26.4', # lastet 2.0.0b1 will cause error
- 'transformers == 4.36.2', 'sentencepiece', 'tokenizers == 0.15.2',
+ 'transformers == 4.37.0', 'sentencepiece', 'tokenizers == 0.15.2',
'accelerate == 0.23.0', 'tabulate'] + cpu_torch_version
SERVING_DEP = ['fschat[model_worker, webui] == 0.2.36', 'protobuf']
diff --git a/python/llm/test/inference_gpu/test_transformers_api.py b/python/llm/test/inference_gpu/test_transformers_api.py
index ae9c6b9bc3e..b29c25997ae 100644
--- a/python/llm/test/inference_gpu/test_transformers_api.py
+++ b/python/llm/test/inference_gpu/test_transformers_api.py
@@ -36,7 +36,7 @@
(AutoModelForCausalLM, AutoTokenizer, os.environ.get('MPT_7B_ORIGIN_PATH')),
# (AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')),
# (AutoModelForCausalLM, AutoTokenizer, os.environ.get('BAICHUAN2_7B_ORIGIN_PATH')),
- # (AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')),
+ # (AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0
])
def test_completion(Model, Tokenizer, model_path, prompt, answer):
with torch.inference_mode():
diff --git a/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py b/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py
index f45f017ef0b..edb2adf1ec0 100644
--- a/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py
+++ b/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py
@@ -32,7 +32,7 @@
("ChatGLM2-6B", AutoModel, AutoTokenizer, os.environ.get('CHATGLM2_6B_ORIGIN_PATH')),
("Mistral-7B-Instruct-v0.1", AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')),
("Baichuan2-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('BAICHUAN2_7B_ORIGIN_PATH')),
- ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')),
+ # ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0
]
class Test_Optimize_Gpu_Model:
diff --git a/python/llm/test/inference_gpu/test_transformers_api_attention.py b/python/llm/test/inference_gpu/test_transformers_api_attention.py
index 4db5ba8b531..84bdcf8e8cb 100644
--- a/python/llm/test/inference_gpu/test_transformers_api_attention.py
+++ b/python/llm/test/inference_gpu/test_transformers_api_attention.py
@@ -34,7 +34,7 @@
("ChatGLM2-6B", AutoModel, AutoTokenizer, os.environ.get('CHATGLM2_6B_ORIGIN_PATH')),
("Mistral-7B-Instruct-v0.1", AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')),
("Baichuan2-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('BAICHUAN2_7B_ORIGIN_PATH')),
- ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')),
+ # ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0
]
class Test_Optimize_Gpu_Model:
diff --git a/python/llm/test/inference_gpu/test_transformers_api_mlp.py b/python/llm/test/inference_gpu/test_transformers_api_mlp.py
index cf0581a50c0..c6229d73fc4 100644
--- a/python/llm/test/inference_gpu/test_transformers_api_mlp.py
+++ b/python/llm/test/inference_gpu/test_transformers_api_mlp.py
@@ -27,7 +27,7 @@
PROMPT = "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun"
TEST_MODEL_LIST = [
- ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')),
+ # ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0
("Mistral-7B-Instruct-v0.1", AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')),
("Llama2-7B", AutoModelForCausalLM, LlamaTokenizer, os.environ.get('LLAMA2_7B_ORIGIN_PATH'))
]
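Because the Qwen and Qwen-VL examples above now pin `transformers<4.37.0` while the package default moves to 4.37.0, a small runtime guard can catch a mismatched environment early. This is an illustrative sketch only; the `packaging` dependency and the check itself are assumptions, not part of the patch:

```python
# Illustrative guard: fail fast if the installed transformers version does not
# match what the Qwen / Qwen-VL examples above expect (transformers < 4.37.0).
import transformers
from packaging.version import Version

required_upper_bound = Version("4.37.0")
installed = Version(transformers.__version__)

if installed >= required_upper_bound:
    raise RuntimeError(
        f"transformers {installed} is installed, but this example needs "
        f"transformers<{required_upper_bound}; run: pip install \"transformers<4.37.0\""
    )
```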
From cab32ea354f5fa388bb1d11f90913c98f459c594 Mon Sep 17 00:00:00 2001
From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com>
Date: Tue, 20 Aug 2024 17:59:28 +0800
Subject: [PATCH 07/11] Update performance test regarding updated default
`transformers==4.37.0` (#11869)
* Update igpu performance from transformers 4.36.2 to 4.37.0 (#11841)
* upgrade arc perf test to transformers 4.37 (#11842)
* fix load low bit com dtype (#11832)
* feat: add mixed_precision argument on ppl longbench evaluation
* fix: delete extra code
* feat: upgrade arc perf test to transformers 4.37
* fix: add missing codes
* fix: keep perf test for qwen-vl-chat in transformers 4.36
* fix: remove extra space
* fix: resolve pr comment
* fix: add empty line
* fix: add pip install for spr and core test
* fix: delete extra comments
* fix: remove python -m for pip
* Revert "fix load low bit com dtype (#11832)"
This reverts commit 6841a9ac8fc8b3f4eb06e41fa3944f7877fd8f94.
---------
Co-authored-by: Zhao Changmin
Co-authored-by: Jinhe Tang
* add transformers==4.36 for qwen vl in igpu-perf (#11846)
* add transformers==4.36.2 for qwen-vl
* Small update
---------
Co-authored-by: Yuwen Hu
* fix: remove qwen-7b on core test (#11851)
* fix: remove qwen-7b on core test
* fix: change delete to comment
---------
Co-authored-by: Jinhe Tang
* replace filename (#11854)
* fix: remove qwen-7b on core test
* fix: change delete to comment
* fix: replace filename
---------
Co-authored-by: Jinhe Tang
* fix: delete extra comments (#11863)
* Remove transformers installation for temp test purposes
* Small fix
* Small update
---------
Co-authored-by: Chu,Youcheng <70999398+cranechu0131@users.noreply.github.com>
Co-authored-by: Zhao Changmin
Co-authored-by: Jinhe Tang
Co-authored-by: Zijie Li
Co-authored-by: Chu,Youcheng <1340390339@qq.com>
---
.github/workflows/llm_performance_tests.yml | 128 +++++++-----------
.../test/benchmark/arc-perf-test-batch2.yaml | 30 ----
.../test/benchmark/arc-perf-test-batch4.yaml | 36 -----
python/llm/test/benchmark/arc-perf-test.yaml | 32 -----
.../arc-perf-transformers-436-batch2.yaml | 16 +++
.../arc-perf-transformers-436-batch4.yaml | 18 +++
.../benchmark/arc-perf-transformers-436.yaml | 16 +++
.../arc-perf-transformers-437-batch2.yaml | 14 ++
.../arc-perf-transformers-437-batch4.yaml | 18 ++-
.../benchmark/arc-perf-transformers-437.yaml | 14 ++
python/llm/test/benchmark/core-perf-test.yaml | 2 +-
.../test/benchmark/igpu-perf/1024-128.yaml | 8 +-
.../{1024-128_437.yaml => 1024-128_436.yaml} | 8 +-
.../igpu-perf/1024-128_int4_fp16.yaml | 8 +-
...6_437.yaml => 1024-128_int4_fp16_436.yaml} | 8 +-
.../1024-128_int4_fp16_loadlowbit.yaml | 7 +-
...=> 1024-128_int4_fp16_loadlowbit_436.yaml} | 7 +-
.../igpu-perf/2048-256_int4_fp16.yaml | 8 +-
...6_437.yaml => 2048-256_int4_fp16_436.yaml} | 8 +-
.../igpu-perf/3072-384_int4_fp16.yaml | 8 +-
...6_437.yaml => 3072-384_int4_fp16_436.yaml} | 10 +-
.../benchmark/igpu-perf/32-32_int4_fp16.yaml | 8 +-
...fp16_437.yaml => 32-32_int4_fp16_436.yaml} | 8 +-
.../igpu-perf/4096-512_int4_fp16.yaml | 7 +
.../igpu-perf/4096-512_int4_fp16_437.yaml | 19 ---
25 files changed, 202 insertions(+), 244 deletions(-)
delete mode 100644 python/llm/test/benchmark/arc-perf-test-batch2.yaml
delete mode 100644 python/llm/test/benchmark/arc-perf-test-batch4.yaml
delete mode 100644 python/llm/test/benchmark/arc-perf-test.yaml
create mode 100644 python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml
create mode 100644 python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml
create mode 100644 python/llm/test/benchmark/arc-perf-transformers-436.yaml
rename python/llm/test/benchmark/igpu-perf/{1024-128_437.yaml => 1024-128_436.yaml} (65%)
rename python/llm/test/benchmark/igpu-perf/{1024-128_int4_fp16_437.yaml => 1024-128_int4_fp16_436.yaml} (65%)
rename python/llm/test/benchmark/igpu-perf/{1024-128_int4_fp16_loadlowbit_437.yaml => 1024-128_int4_fp16_loadlowbit_436.yaml} (68%)
rename python/llm/test/benchmark/igpu-perf/{2048-256_int4_fp16_437.yaml => 2048-256_int4_fp16_436.yaml} (65%)
rename python/llm/test/benchmark/igpu-perf/{3072-384_int4_fp16_437.yaml => 3072-384_int4_fp16_436.yaml} (52%)
rename python/llm/test/benchmark/igpu-perf/{32-32_int4_fp16_437.yaml => 32-32_int4_fp16_436.yaml} (65%)
delete mode 100644 python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml
diff --git a/.github/workflows/llm_performance_tests.yml b/.github/workflows/llm_performance_tests.yml
index 36b31f23937..736b1dd4540 100644
--- a/.github/workflows/llm_performance_tests.yml
+++ b/.github/workflows/llm_performance_tests.yml
@@ -153,7 +153,8 @@ jobs:
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
- cp python/llm/test/benchmark/arc-perf-test.yaml python/llm/dev/benchmark/all-in-one/config.yaml
+ pip install transformers==4.36.2
+ cp python/llm/test/benchmark/arc-perf-transformers-436.yaml python/llm/dev/benchmark/all-in-one/config.yaml
cd python/llm/dev/benchmark/all-in-one
mkdir test_batch1
mkdir test_batch2
@@ -167,7 +168,7 @@ jobs:
mv *.csv test_batch1
# batch_size 2
cd ../../../../../
- cp python/llm/test/benchmark/arc-perf-test-batch2.yaml python/llm/dev/benchmark/all-in-one/config.yaml
+ cp python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml python/llm/dev/benchmark/all-in-one/config.yaml
cd python/llm/dev/benchmark/all-in-one
# change csv name
sed -i 's/batch1/batch2/g' run.py
@@ -175,7 +176,7 @@ jobs:
mv *.csv test_batch2
# batch_size 4
cd ../../../../../
- cp python/llm/test/benchmark/arc-perf-test-batch4.yaml python/llm/dev/benchmark/all-in-one/config.yaml
+ cp python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml python/llm/dev/benchmark/all-in-one/config.yaml
cd python/llm/dev/benchmark/all-in-one
# change csv name
sed -i 's/batch2/batch4/g' run.py
@@ -188,7 +189,7 @@ jobs:
source /opt/intel/oneapi/setvars.sh
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
- # upgrade transformers for model Qwen/Qwen1.5-7B-Chat
+ # upgrade for default transformers version
python -m pip install transformers==4.37.0
# batch_size 1
cp python/llm/test/benchmark/arc-perf-transformers-437.yaml python/llm/dev/benchmark/all-in-one/config.yaml
@@ -314,7 +315,7 @@ jobs:
run: |
# batch_size 1
cd python/llm/dev/benchmark/all-in-one/test_batch1
- python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-test.yaml
+ python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-transformers-436.yaml
python ../../../../test/benchmark/check_results.py -c test2 -y ../../../../test/benchmark/arc-perf-transformers-437.yaml
python ../../../../test/benchmark/check_results.py -c test3 -y ../../../../test/benchmark/arc-perf-transformers-440.yaml
find . -name "*test*.csv" -delete
@@ -327,7 +328,7 @@ jobs:
rm -r test_batch1
# batch_size 2
cd test_batch2
- python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-test-batch2.yaml
+ python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-transformers-436-batch2.yaml
python ../../../../test/benchmark/check_results.py -c test2 -y ../../../../test/benchmark/arc-perf-transformers-437-batch2.yaml
find . -name "*test*.csv" -delete
if [[ ${{ github.event_name }} == "schedule" ]]; then
@@ -339,7 +340,7 @@ jobs:
rm -r test_batch2
# batch_size 4
cd test_batch4
- python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-test-batch4.yaml
+ python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-transformers-436-batch4.yaml
python ../../../../test/benchmark/check_results.py -c test2 -y ../../../../test/benchmark/arc-perf-transformers-437-batch4.yaml
find . -name "*test*.csv" -delete
if [[ ${{ github.event_name }} == "schedule" ]]; then
@@ -384,7 +385,6 @@ jobs:
python -m pip install --upgrade einops
python -m pip install --upgrade tiktoken
python -m pip install --upgrade transformers_stream_generator
-
# specific for test on certain commits
- name: Download llm binary
if: ${{ github.event_name == 'workflow_dispatch' && (inputs.checkout-ref != 'main') }}
@@ -653,6 +653,7 @@ jobs:
set BIGDL_LLM_XMX_DISABLED=1
REM for llava
set TRANSFORMERS_OFFLINE=1
+ pip install transformers==4.37.0
cd python\llm\dev\benchmark\all-in-one
move ..\..\..\test\benchmark\igpu-perf\32-32_int4_fp16.yaml config.yaml
@@ -664,23 +665,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (32-32 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (32-32 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml
- - name: Test on igpu for transformers 4.37 (32-32 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (32-32 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\32-32_int4_fp16_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\32-32_int4_fp16_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\32-32_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -771,7 +772,7 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -788,23 +789,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (1024-128 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (1024-128 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml
- - name: Test on igpu for transformers 4.37 (1024-128 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (1024-128 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\1024-128_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -812,7 +813,7 @@ jobs:
if %ERRORLEVEL% neq 0 (exit /b 1)
call conda deactivate
-
+
- name: Prepare igpu perf test for transformers 4.38 (1024-128 int4+fp16)
shell: bash
run: |
@@ -894,7 +895,6 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -911,23 +911,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (2048-256 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (2048-256 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml
- - name: Test on igpu for transformers 4.37 (2048-256 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (2048-256 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\2048-256_int4_fp16_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\2048-256_int4_fp16_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\2048-256_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -935,7 +935,7 @@ jobs:
if %ERRORLEVEL% neq 0 (exit /b 1)
call conda deactivate
-
+
- name: Prepare igpu perf test for transformers 4.38 (2048-256 int4+fp16)
shell: bash
run: |
@@ -1017,7 +1017,7 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -1034,23 +1034,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (3072-384 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (3072-384 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml
- - name: Test on igpu for transformers 4.37 (3072-384 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (3072-384 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\3072-384_int4_fp16_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\3072-384_int4_fp16_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\3072-384_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -1140,7 +1140,7 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -1157,35 +1157,10 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (4096-512 int4+fp16)
- shell: bash
- run: |
- sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml
-
- - name: Test on igpu for transformers 4.37 (4096-512 int4+fp16)
- shell: cmd
- run: |
- call conda activate igpu-perf
- pip install transformers==4.37.0
-
- set SYCL_CACHE_PERSISTENT=1
- set BIGDL_LLM_XMX_DISABLED=1
-
- cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\4096-512_int4_fp16_437.yaml config.yaml
- set PYTHONIOENCODING=utf-8
- python run.py >> %CSV_SAVE_PATH%\4096-512_int4_fp16\log\%LOG_FILE% 2>&1
- if %ERRORLEVEL% neq 0 (exit /b 1)
- python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test2
- if %ERRORLEVEL% neq 0 (exit /b 1)
-
- call conda deactivate
-
- name: Prepare igpu perf test for transformers 4.38 (4096-512 int4+fp16)
shell: bash
run: |
- sed -i 's/{today}_test2/{today}_test3/g' python/llm/dev/benchmark/all-in-one/run.py
+ sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_438.yaml
- name: Test on igpu for transformers 4.38 (4096-512 int4+fp16)
@@ -1202,7 +1177,7 @@ jobs:
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\4096-512_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
- python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test3
+ python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test2
if %ERRORLEVEL% neq 0 (exit /b 1)
call conda deactivate
@@ -1210,7 +1185,7 @@ jobs:
- name: Prepare igpu perf test for transformers 4.43 (4096-512 int4+fp16)
shell: bash
run: |
- sed -i 's/{today}_test3/{today}_test4/g' python/llm/dev/benchmark/all-in-one/run.py
+ sed -i 's/{today}_test2/{today}_test3/g' python/llm/dev/benchmark/all-in-one/run.py
sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_443.yaml
- name: Test on igpu for transformers 4.43 (4096-512 int4+fp16)
@@ -1228,7 +1203,7 @@ jobs:
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\4096-512_int4_fp16\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
- python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test4
+ python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test3
if %ERRORLEVEL% neq 0 (exit /b 1)
pip uninstall trl -y
@@ -1256,14 +1231,14 @@ jobs:
shell: bash
run: |
sed -i 's/4096-512/1024-128/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i 's/{today}_test4/{today}_test1/g' python/llm/dev/benchmark/all-in-one/run.py
+ sed -i 's/{today}_test3/{today}_test1/g' python/llm/dev/benchmark/all-in-one/run.py
sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml
- name: Test on igpu (load_low_bit 1024-128 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -1280,23 +1255,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (load_low_bit 1024-128 int4+fp16)
+ - name: Prepare igpu perf test for transformers 4.36 (load_low_bit 1024-128 int4+fp16)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml
- - name: Test on igpu for transformers 4.37 (load_low_bit 1024-128 int4+fp16)
+ - name: Test on igpu for transformers 4.36 (load_low_bit 1024-128 int4+fp16)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_loadlowbit_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_loadlowbit_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\1024-128_int4_fp16_loadlowbit\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -1385,7 +1360,7 @@ jobs:
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.36.2
+ pip install transformers==4.37.0
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
@@ -1402,23 +1377,23 @@ jobs:
call conda deactivate
- - name: Prepare igpu perf test for transformers 4.37 (1024-128)
+ - name: Prepare igpu perf test for transformers 4.36 (1024-128)
shell: bash
run: |
sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py
- sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_437.yaml
+ sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_436.yaml
- - name: Test on igpu for transformers 4.37 (1024-128)
+ - name: Test on igpu for transformers 4.36 (1024-128)
shell: cmd
run: |
call conda activate igpu-perf
- pip install transformers==4.37.0
+ pip install transformers==4.36.2
set SYCL_CACHE_PERSISTENT=1
set BIGDL_LLM_XMX_DISABLED=1
cd python\llm\dev\benchmark\all-in-one
- move ..\..\..\test\benchmark\igpu-perf\1024-128_437.yaml config.yaml
+ move ..\..\..\test\benchmark\igpu-perf\1024-128_436.yaml config.yaml
set PYTHONIOENCODING=utf-8
python run.py >> %CSV_SAVE_PATH%\1024-128\log\%LOG_FILE% 2>&1
if %ERRORLEVEL% neq 0 (exit /b 1)
@@ -1520,4 +1495,3 @@ jobs:
# shell: cmd
# run: |
# call conda env remove -n igpu-perf -y
-
diff --git a/python/llm/test/benchmark/arc-perf-test-batch2.yaml b/python/llm/test/benchmark/arc-perf-test-batch2.yaml
deleted file mode 100644
index 70447fd7f59..00000000000
--- a/python/llm/test/benchmark/arc-perf-test-batch2.yaml
+++ /dev/null
@@ -1,30 +0,0 @@
-repo_id:
- - 'meta-llama/Llama-2-7b-chat-hf'
- - 'meta-llama/Llama-2-13b-chat-hf'
- - 'THUDM/chatglm3-6b-4bit'
- - 'baichuan-inc/Baichuan2-7B-Chat'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
- - 'THUDM/glm-4-9b-chat'
- - 'openbmb/MiniCPM-2B-sft-bf16'
- - 'Qwen/Qwen-VL-Chat'
- #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
- - '01-ai/Yi-6B-Chat'
- - 'mistralai/Mistral-7B-Instruct-v0.2'
- - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- - '01-ai/Yi-1.5-6B-Chat'
-local_model_hub: '/mnt/disk1/models'
-warm_up: 1
-num_trials: 3
-num_beams: 1 # default to greedy search
-low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
-batch_size: 2 # default to 1
-in_out_pairs:
- - '32-32'
- - '1024-128'
- - '2048-256'
-test_api:
- - "transformer_int4_fp16_gpu" # on Intel GPU
-cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
-exclude:
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
-task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-test-batch4.yaml b/python/llm/test/benchmark/arc-perf-test-batch4.yaml
deleted file mode 100644
index 3bfd47963a4..00000000000
--- a/python/llm/test/benchmark/arc-perf-test-batch4.yaml
+++ /dev/null
@@ -1,36 +0,0 @@
-repo_id:
- - 'meta-llama/Llama-2-7b-chat-hf'
- - 'meta-llama/Llama-2-13b-chat-hf'
- - 'THUDM/chatglm3-6b-4bit'
- - 'baichuan-inc/Baichuan2-7B-Chat'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
- - 'THUDM/glm-4-9b-chat'
- - 'openbmb/MiniCPM-2B-sft-bf16'
- - 'Qwen/Qwen-VL-Chat'
- #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
- - '01-ai/Yi-6B-Chat'
- - 'mistralai/Mistral-7B-Instruct-v0.2'
- - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- - '01-ai/Yi-1.5-6B-Chat'
-local_model_hub: '/mnt/disk1/models'
-warm_up: 1
-num_trials: 3
-num_beams: 1 # default to greedy search
-low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
-batch_size: 4 # default to 1
-in_out_pairs:
- - '32-32'
- - '1024-128'
- - '2048-256'
-test_api:
- - "transformer_int4_fp16_gpu" # on Intel GPU
-cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
-exclude:
- - 'meta-llama/Llama-2-13b-chat-hf:2048'
- - 'baichuan-inc/Baichuan2-7B-Chat:2048'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit:1024'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
- - 'Qwen/Qwen-VL-Chat:2048'
-# - 'fnlp/moss-moon-003-sft-4bit:1024'
-# - 'fnlp/moss-moon-003-sft-4bit:2048'
-task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-test.yaml b/python/llm/test/benchmark/arc-perf-test.yaml
deleted file mode 100644
index 890b8dbf470..00000000000
--- a/python/llm/test/benchmark/arc-perf-test.yaml
+++ /dev/null
@@ -1,32 +0,0 @@
-repo_id:
- - 'meta-llama/Llama-2-7b-chat-hf'
- - 'meta-llama/Llama-2-13b-chat-hf'
- - 'THUDM/chatglm3-6b-4bit'
- - 'baichuan-inc/Baichuan2-7B-Chat'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
- - 'THUDM/glm-4-9b-chat'
- - 'openbmb/MiniCPM-2B-sft-bf16'
- - 'Qwen/Qwen-VL-Chat'
- #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
- - '01-ai/Yi-6B-Chat'
- - 'mistralai/Mistral-7B-Instruct-v0.2'
- - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- - '01-ai/Yi-1.5-6B-Chat'
-local_model_hub: '/mnt/disk1/models'
-warm_up: 1
-num_trials: 3
-num_beams: 1 # default to greedy search
-low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
-batch_size: 1 # default to 1
-in_out_pairs:
- - '32-32'
- - '1024-128'
- - '2048-256'
-test_api:
- - "transformer_int4_fp16_gpu" # on Intel GPU
-cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
-exclude:
-# - 'fnlp/moss-moon-003-sft-4bit:1024'
-# - 'fnlp/moss-moon-003-sft-4bit:2048'
- - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
-task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml b/python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml
new file mode 100644
index 00000000000..42ef79f344c
--- /dev/null
+++ b/python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml
@@ -0,0 +1,16 @@
+repo_id:
+ - 'Qwen/Qwen-VL-Chat'
+local_model_hub: '/mnt/disk1/models'
+warm_up: 1
+num_trials: 3
+num_beams: 1 # default to greedy search
+low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
+batch_size: 2 # default to 1
+in_out_pairs:
+ - '32-32'
+ - '1024-128'
+ - '2048-256'
+test_api:
+ - "transformer_int4_fp16_gpu" # on Intel GPU
+cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
+task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml b/python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml
new file mode 100644
index 00000000000..606b9c6cf05
--- /dev/null
+++ b/python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml
@@ -0,0 +1,18 @@
+repo_id:
+ - 'Qwen/Qwen-VL-Chat'
+local_model_hub: '/mnt/disk1/models'
+warm_up: 1
+num_trials: 3
+num_beams: 1 # default to greedy search
+low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
+batch_size: 4 # default to 1
+in_out_pairs:
+ - '32-32'
+ - '1024-128'
+ - '2048-256'
+test_api:
+ - "transformer_int4_fp16_gpu" # on Intel GPU
+cpu_embedding: False # whether put embedding to CPU (only available now for gpu win related test_api)
+exclude:
+ - 'Qwen/Qwen-VL-Chat:2048'
+task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-436.yaml b/python/llm/test/benchmark/arc-perf-transformers-436.yaml
new file mode 100644
index 00000000000..efdf14193a3
--- /dev/null
+++ b/python/llm/test/benchmark/arc-perf-transformers-436.yaml
@@ -0,0 +1,16 @@
+repo_id:
+ - 'Qwen/Qwen-VL-Chat'
+local_model_hub: '/mnt/disk1/models'
+warm_up: 1
+num_trials: 3
+num_beams: 1 # default to greedy search
+low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
+batch_size: 1 # default to 1
+in_out_pairs:
+ - '32-32'
+ - '1024-128'
+ - '2048-256'
+test_api:
+ - "transformer_int4_fp16_gpu" # on Intel GPU
+cpu_embedding: False # whether put embedding to CPU (only available now for gpu win related test_api)
+task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
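Editor's note: the new `arc-perf-transformers-436*.yaml` files above reuse the same schema as the existing arc-perf configs (`repo_id`, `in_out_pairs`, `batch_size`, `low_bit`, `exclude`, ...). As a rough illustration of how such a file is laid out to be consumed, here is a minimal sketch of a hypothetical loader; the actual IPEX-LLM benchmark runner is not part of this patch, and all names below are illustrative assumptions.

```python
# Illustrative only: a minimal, hypothetical loader for the benchmark YAML
# schema above. The real IPEX-LLM harness is not shown in this patch series.
import itertools
import yaml  # pip install pyyaml


def iter_benchmark_cases(config_path: str):
    with open(config_path) as f:
        conf = yaml.safe_load(f)
    exclude = set(conf.get("exclude") or [])
    for repo_id, pair in itertools.product(conf["repo_id"], conf["in_out_pairs"]):
        # 'exclude' entries appear to pair a repo id with an input length,
        # e.g. 'Qwen/Qwen-VL-Chat:2048' against the '2048-256' in_out_pair.
        in_len = pair.split("-")[0]
        if f"{repo_id}:{in_len}" in exclude:
            continue
        yield {
            "repo_id": repo_id,
            "in_out_pair": pair,
            "batch_size": conf.get("batch_size", 1),
            "low_bit": conf.get("low_bit", "sym_int4"),
            "num_beams": conf.get("num_beams", 1),
            "test_api": conf["test_api"],
        }


if __name__ == "__main__":
    for case in iter_benchmark_cases("arc-perf-transformers-436.yaml"):
        print(case)
```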
diff --git a/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml b/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml
index d675d506629..9b9ab1f14ae 100644
--- a/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml
+++ b/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml
@@ -6,6 +6,18 @@ repo_id:
- 'microsoft/phi-3-vision-128k-instruct'
- 'Qwen/Qwen2-7B-Instruct'
- 'microsoft/Phi-3-mini-128k-instruct'
+ - 'meta-llama/Llama-2-7b-chat-hf'
+ - 'meta-llama/Llama-2-13b-chat-hf'
+ - 'THUDM/chatglm3-6b-4bit'
+ - 'baichuan-inc/Baichuan2-7B-Chat'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
+ - 'THUDM/glm-4-9b-chat'
+ - 'openbmb/MiniCPM-2B-sft-bf16'
+ #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
+ - '01-ai/Yi-6B-Chat'
+ - 'mistralai/Mistral-7B-Instruct-v0.2'
+ - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
+ - '01-ai/Yi-1.5-6B-Chat'
local_model_hub: '/mnt/disk1/models'
warm_up: 1
num_trials: 3
@@ -19,4 +31,6 @@ in_out_pairs:
test_api:
- "transformer_int4_fp16_gpu" # on Intel GPU
cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
+exclude:
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml b/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml
index f3d55c83e35..368a8c636b5 100644
--- a/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml
+++ b/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml
@@ -6,6 +6,18 @@ repo_id:
- 'microsoft/phi-3-vision-128k-instruct'
- 'Qwen/Qwen2-7B-Instruct'
- 'microsoft/Phi-3-mini-128k-instruct'
+ - 'meta-llama/Llama-2-7b-chat-hf'
+ - 'meta-llama/Llama-2-13b-chat-hf'
+ - 'THUDM/chatglm3-6b-4bit'
+ - 'baichuan-inc/Baichuan2-7B-Chat'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
+ - 'THUDM/glm-4-9b-chat'
+ - 'openbmb/MiniCPM-2B-sft-bf16'
+ #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
+ - '01-ai/Yi-6B-Chat'
+ - 'mistralai/Mistral-7B-Instruct-v0.2'
+ - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
+ - '01-ai/Yi-1.5-6B-Chat'
local_model_hub: '/mnt/disk1/models'
warm_up: 1
num_trials: 3
@@ -22,4 +34,8 @@ cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu w
exclude:
- 'Qwen/Qwen1.5-7B-Chat:2048'
- 'meta-llama/Meta-Llama-3-8B-Instruct:2048'
-task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
\ No newline at end of file
+ - 'meta-llama/Llama-2-13b-chat-hf:2048'
+ - 'baichuan-inc/Baichuan2-7B-Chat:2048'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit:1024'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
+task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/arc-perf-transformers-437.yaml b/python/llm/test/benchmark/arc-perf-transformers-437.yaml
index 1c775344c43..bca87891f6b 100644
--- a/python/llm/test/benchmark/arc-perf-transformers-437.yaml
+++ b/python/llm/test/benchmark/arc-perf-transformers-437.yaml
@@ -6,6 +6,18 @@ repo_id:
- 'microsoft/phi-3-vision-128k-instruct'
- 'Qwen/Qwen2-7B-Instruct'
- 'microsoft/Phi-3-mini-128k-instruct'
+ - 'meta-llama/Llama-2-7b-chat-hf'
+ - 'meta-llama/Llama-2-13b-chat-hf'
+ - 'THUDM/chatglm3-6b-4bit'
+ - 'baichuan-inc/Baichuan2-7B-Chat'
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit'
+ - 'THUDM/glm-4-9b-chat'
+ - 'openbmb/MiniCPM-2B-sft-bf16'
+ #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported
+ - '01-ai/Yi-6B-Chat'
+ - 'mistralai/Mistral-7B-Instruct-v0.2'
+ - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
+ - '01-ai/Yi-1.5-6B-Chat'
local_model_hub: '/mnt/disk1/models'
warm_up: 1
num_trials: 3
@@ -19,4 +31,6 @@ in_out_pairs:
test_api:
- "transformer_int4_fp16_gpu" # on Intel GPU
cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api)
+exclude:
+ - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048'
task: 'continuation' # task can be 'continuation', 'QA' and 'summarize'
diff --git a/python/llm/test/benchmark/core-perf-test.yaml b/python/llm/test/benchmark/core-perf-test.yaml
index 55f738de54b..2def68c1494 100644
--- a/python/llm/test/benchmark/core-perf-test.yaml
+++ b/python/llm/test/benchmark/core-perf-test.yaml
@@ -3,7 +3,7 @@ repo_id:
- 'THUDM/chatglm3-6b'
- 'baichuan-inc/Baichuan2-7B-Chat'
- 'internlm/internlm-chat-7b'
- - 'Qwen/Qwen-7B-Chat'
+ # - 'Qwen/Qwen-7B-Chat' # requires transformers < 4.37.0
- 'BAAI/AquilaChat2-7B'
- 'meta-llama/Llama-2-7b-chat-hf'
- 'WisdomShell/CodeShell-7B'
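Editor's note: the comment added above ("requires transformers < 4.37.0"), like the `*_436`/`*_437` YAML splits earlier in this patch (the suffix presumably denoting the pinned transformers version), gates models on the installed transformers release. A hedged sketch of such a runtime check is below; this is not the benchmark harness's real mechanism, only an illustration of the constraint noted in the config comments.

```python
# Illustrative sketch only: skip models whose transformers requirement is not
# met by the installed version. The bound below comes from the comment in
# core-perf-test.yaml; the check itself is an assumption, not project code.
import transformers
from packaging.version import Version

# 'Qwen/Qwen-7B-Chat' is noted above as requiring transformers < 4.37.0
MODEL_MAX_TRANSFORMERS = {
    "Qwen/Qwen-7B-Chat": Version("4.37.0"),
}


def transformers_supports(repo_id: str) -> bool:
    upper = MODEL_MAX_TRANSFORMERS.get(repo_id)
    if upper is None:
        return True
    return Version(transformers.__version__) < upper


if __name__ == "__main__":
    print(transformers_supports("Qwen/Qwen-7B-Chat"))
```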
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128.yaml b/python/llm/test/benchmark/igpu-perf/1024-128.yaml
index b0bd5f30c20..759a7566237 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128.yaml
@@ -10,9 +10,15 @@ repo_id:
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- 'RWKV/v5-Eagle-7B-HF'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_437.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_436.yaml
similarity index 65%
rename from python/llm/test/benchmark/igpu-perf/1024-128_437.yaml
rename to python/llm/test/benchmark/igpu-perf/1024-128_436.yaml
index c6850389b97..c967f66a7ba 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml
index 39d575680ab..f66172d9a39 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml
@@ -9,9 +9,15 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml
similarity index 65%
rename from python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml
rename to python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml
index 68cbaf2a163..c224b65e745 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml
index 2730e465d47..76c35d4dde7 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml
@@ -9,9 +9,14 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml
similarity index 68%
rename from python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml
rename to python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml
index 3839d0d2951..917e6d0ff3c 100644
--- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml
@@ -1,10 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml
index c53e6283919..bf5fc1e978b 100644
--- a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml
@@ -9,9 +9,15 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml
similarity index 65%
rename from python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml
rename to python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml
index 0eddd403b86..e9566c13250 100644
--- a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml
index 47b9839a789..60202594cba 100644
--- a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml
@@ -8,9 +8,15 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml
similarity index 52%
rename from python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml
rename to python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml
index 087da9773db..6448a358cb5 100644
--- a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
@@ -15,5 +9,5 @@ batch_size: 1 # default to 1
in_out_pairs:
- '3072-384'
test_api:
- - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows (catch GPU peak memory)
+ - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows, use fp16 for non-linear layer
cpu_embedding: True # whether put embedding to CPU (only avaiable now for gpu win related test_api)
diff --git a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml
index 39115e0231b..e70178744a3 100644
--- a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml
@@ -9,9 +9,15 @@ repo_id:
- 'mistralai/Mistral-7B-Instruct-v0.2'
- 'deepseek-ai/deepseek-coder-7b-instruct-v1.5'
- '01-ai/Yi-6B-Chat'
- - 'Qwen/Qwen-VL-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 3
num_trials: 5
diff --git a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml
similarity index 65%
rename from python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml
rename to python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml
index 1f0d11a2004..8faf43aed97 100644
--- a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml
+++ b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml
@@ -1,11 +1,5 @@
repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
+ - 'Qwen/Qwen-VL-Chat'
local_model_hub: 'path to your local model hub'
warm_up: 3
num_trials: 5
diff --git a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml
index 26e128a564c..514037a7380 100644
--- a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml
+++ b/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml
@@ -10,6 +10,13 @@ repo_id:
- '01-ai/Yi-6B-Chat'
- 'openbmb/MiniCPM-1B-sft-bf16'
- 'openbmb/MiniCPM-2B-sft-bf16'
+ - 'Qwen/Qwen1.5-7B-Chat'
+ - 'Qwen/Qwen2-1.5B-Instruct'
+ - 'Qwen/Qwen2-7B-Instruct'
+ - 'microsoft/Phi-3-mini-4k-instruct'
+ - 'microsoft/Phi-3-mini-128k-instruct'
+ - 'microsoft/phi-3-vision-128k-instruct'
+ - 'openbmb/MiniCPM-V-2_6'
local_model_hub: 'path to your local model hub'
warm_up: 1
num_trials: 3
diff --git a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml
deleted file mode 100644
index 4472b5da1f2..00000000000
--- a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml
+++ /dev/null
@@ -1,19 +0,0 @@
-repo_id:
- - 'Qwen/Qwen1.5-7B-Chat'
- - 'Qwen/Qwen2-1.5B-Instruct'
- - 'Qwen/Qwen2-7B-Instruct'
- - 'microsoft/Phi-3-mini-4k-instruct'
- - 'microsoft/Phi-3-mini-128k-instruct'
- - 'microsoft/phi-3-vision-128k-instruct'
- - 'openbmb/MiniCPM-V-2_6'
-local_model_hub: 'path to your local model hub'
-warm_up: 1
-num_trials: 3
-num_beams: 1 # default to greedy search
-low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
-batch_size: 1 # default to 1
-in_out_pairs:
- - '4096-512'
-test_api:
- - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows (catch GPU peak memory)
-cpu_embedding: True # whether put embedding to CPU (only avaiable now for gpu win related test_api)
From 90549883db6143453b4a8c02e131b4adc5b142d6 Mon Sep 17 00:00:00 2001
From: Jinhe
Date: Tue, 20 Aug 2024 18:01:42 +0800
Subject: [PATCH 08/11] Pytorch models transformers version update (#11860)
* yi sync
* delete 4.34 constraint
* delete 4.34 constraint
* delete 4.31 constraint
* delete 4.34 constraint
* delete 4.35 constraint
* added <=4.33.3 constraint
* added <=4.33.3 constraint
* switched to chinese prompt
---
.../llm/example/GPU/HuggingFace/LLM/yi/README.md | 12 ++++++------
.../llm/example/GPU/HuggingFace/LLM/yi/generate.py | 2 +-
.../GPU/PyTorch-Models/Model/codegeex2/README.md | 2 --
.../GPU/PyTorch-Models/Model/codellama/README.md | 4 ----
.../GPU/PyTorch-Models/Model/deciLM-7b/README.md | 2 --
.../GPU/PyTorch-Models/Model/mistral/README.md | 7 -------
.../GPU/PyTorch-Models/Model/replit/README.md | 4 +++-
.../GPU/PyTorch-Models/Model/solar/README.md | 4 ----
.../example/GPU/PyTorch-Models/Model/yi/README.md | 14 ++++++++++++--
.../GPU/PyTorch-Models/Model/yi/generate.py | 2 +-
10 files changed, 23 insertions(+), 30 deletions(-)
diff --git a/python/llm/example/GPU/HuggingFace/LLM/yi/README.md b/python/llm/example/GPU/HuggingFace/LLM/yi/README.md
index 1fb49f21523..080e2676fdc 100644
--- a/python/llm/example/GPU/HuggingFace/LLM/yi/README.md
+++ b/python/llm/example/GPU/HuggingFace/LLM/yi/README.md
@@ -122,18 +122,18 @@ In the example, several arguments can be passed to satisfy your requirements:
```log
Inference time: xxxx s
-------------------- Prompt --------------------
-What is AI?
+AI是什么?
-------------------- Output --------------------
-What is AI?
-Artificial Intelligence (AI) is the simulation of human intelligence in machines. AI is the science and engineering of making intelligent machines, especially intelligent computer programs.
+AI是什么?
+人工智能(Artificial Intelligence),英文缩写为AI。它是研究、开发用于模拟、延伸和扩展人的智能的理论、方法、技术及
```
#### [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
```log
Inference time: xxxx s
-------------------- Prompt --------------------
-What is AI?
+AI是什么?
-------------------- Output --------------------
-What is AI?
-Artificial Intelligence (AI) refers to the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self-
+AI是什么?
+人工智能(Artificial Intelligence, AI)是计算机科学的一个分支,它研究如何让计算机模拟人类的智能行为。人工智能可以通过模仿人类的思维过程和
```
\ No newline at end of file
diff --git a/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py b/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
index f32f272c13a..643c5f7b34d 100644
--- a/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
+++ b/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py
@@ -27,7 +27,7 @@
parser.add_argument('--repo-id-or-model-path', type=str, default="01-ai/Yi-6B-Chat",
help='The huggingface repo id for the Yi model to be downloaded'
', or the path to the huggingface checkpoint folder')
- parser.add_argument('--prompt', type=str, default="What is AI?",
+ parser.add_argument('--prompt', type=str, default="AI是什么?",
help='Prompt to infer')
parser.add_argument('--n-predict', type=int, default=32,
help='Max tokens to predict')
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md b/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md
index 37f801a28bf..bc8cfa62907 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md
@@ -16,7 +16,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-pip install transformers==4.31.0
```
#### 1.2 Installation on Windows
@@ -27,7 +26,6 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-pip install transformers==4.31.0
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md b/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md
index 497a6828b24..ff68817eca4 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md
@@ -14,8 +14,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers
```
#### 1.2 Installation on Windows
@@ -26,8 +24,6 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md b/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md
index ff8eab5ae09..a9e66f54732 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md
@@ -14,8 +14,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.35.2 # required by DeciLM-7B
```
#### 1.2 Installation on Windows
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md b/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md
index 4fc017e1ba7..4f3e58b045c 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md
@@ -4,7 +4,6 @@ In this directory, you will find examples on how you could use IPEX-LLM `optimiz
## Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
-**Important: According to [Mistral Troubleshooting](https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting), please make sure you have installed `transformers==4.34.0` to run the example.**
## Example: Predict Tokens using `generate()` API
In the example [generate.py](./generate.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs.
@@ -16,9 +15,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-# Refer to https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting, please make sure you are using a stable version of Transformers, 4.34.0 or newer.
-pip install transformers==4.34.0
```
#### 1.2 Installation on Windows
@@ -29,9 +25,6 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-# Refer to https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting, please make sure you are using a stable version of Transformers, 4.34.0 or newer.
-pip install transformers==4.34.0
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md b/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md
index 4938682aea2..3bfbf245655 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md
@@ -15,7 +15,7 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-pip install "transformers<4.35"
+pip install "transformers<=4.33.3"
```
#### 1.2 Installation on Windows
@@ -26,6 +26,8 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
+
+pip install "transformers<=4.33.3"
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md b/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md
index 2b718cd4a6a..4d157d19bf3 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md
@@ -14,8 +14,6 @@ conda create -n llm python=3.11
conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.35.2 # required by SOLAR
```
#### 1.2 Installation on Windows
@@ -26,8 +24,6 @@ conda activate llm
# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
-
-pip install transformers==4.35.2 # required by SOLAR
```
### 2. Configures OneAPI environment variables for Linux
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md b/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md
index b48b95325c3..2b500175575 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md
@@ -1,5 +1,5 @@
# Yi
-In this directory, you will find examples on how you could use IPEX-LLM `optimize_model` API on Yi models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B) as a reference Yi model.
+In this directory, you will find examples on how you could use IPEX-LLM `optimize_model` API on Yi models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B) and [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) as reference Yi models.
## 0. Requirements
To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information.
@@ -112,7 +112,7 @@ python ./generate.py
In the example, several arguments can be passed to satisfy your requirements:
-- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Yi model (e.g. `01-ai/Yi-6B`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'01-ai/Yi-6B'`.
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Yi model (e.g. `01-ai/Yi-6B` and `01-ai/Yi-6B-Chat`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'01-ai/Yi-6B-Chat'`.
- `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`.
- `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`.
@@ -127,3 +127,13 @@ AI是什么?
AI是什么?
人工智能(Artificial Intelligence),英文缩写为AI。它是研究、开发用于模拟、延伸和扩展人的智能的理论、方法、技术及
```
+
+#### [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat)
+```log
+Inference time: xxxx s
+-------------------- Prompt --------------------
+AI是什么?
+-------------------- Output --------------------
+AI是什么?
+人工智能(Artificial Intelligence, AI)是计算机科学的一个分支,它研究如何让计算机模拟人类的智能行为。人工智能可以通过模仿人类的思维过程和
+```
\ No newline at end of file
diff --git a/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py
index 31256cda112..871f5f4fbd1 100644
--- a/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py
+++ b/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py
@@ -26,7 +26,7 @@
if __name__ == '__main__':
parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Yi model')
- parser.add_argument('--repo-id-or-model-path', type=str, default="01-ai/Yi-6B",
+ parser.add_argument('--repo-id-or-model-path', type=str, default="01-ai/Yi-6B-Chat",
help='The huggingface repo id for the Yi model to be downloaded'
', or the path to the huggingface checkpoint folder')
parser.add_argument('--prompt', type=str, default="AI是什么?",
From 9a8eb10cd6f28c1b78d1cad549472f9b140d9e41 Mon Sep 17 00:00:00 2001
From: Yina Chen <33650826+cyita@users.noreply.github.com>
Date: Tue, 20 Aug 2024 13:11:37 +0300
Subject: [PATCH 09/11] Update compresskv model forward type logic (#11868)
* update
* fix
---
.../src/ipex_llm/transformers/models/llama.py | 18 ++++++++++-----
.../ipex_llm/transformers/models/minicpm.py | 9 ++++----
.../src/ipex_llm/transformers/models/phi3.py | 11 +++++-----
.../src/ipex_llm/transformers/models/qwen2.py | 22 +++++++++----------
4 files changed, 33 insertions(+), 27 deletions(-)
diff --git a/python/llm/src/ipex_llm/transformers/models/llama.py b/python/llm/src/ipex_llm/transformers/models/llama.py
index 5e633da7406..2c9c17e7a58 100644
--- a/python/llm/src/ipex_llm/transformers/models/llama.py
+++ b/python/llm/src/ipex_llm/transformers/models/llama.py
@@ -128,7 +128,9 @@ def llama_model_forward_4_36(
use_quantize = use_quantize_kv_cache(
self.layers[0].mlp.up_proj, input,
self.config.num_attention_heads//self.config.num_key_value_heads)
- if should_use_compresskv(input, input.shape[1]):
+ use_compresskv = should_use_compresskv(input, input.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
+ if use_compresskv:
if not isinstance(past_key_values, DynamicCompressCache):
if use_quantize:
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(
@@ -137,7 +139,7 @@ def llama_model_forward_4_36(
past_key_values = DynamicCompressCache.from_legacy_cache(
past_key_values)
elif use_quantize:
- if not isinstance(past_key_values, (DynamicFp8Cache, DynamicCompressCache)):
+ if not isinstance(past_key_values, DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
return llama_model_forward_4_36_internal(
self=self,
@@ -174,7 +176,9 @@ def llama_model_forward_4_38(
use_quantize = use_quantize_kv_cache(
self.layers[0].mlp.up_proj, input,
self.config.num_attention_heads//self.config.num_key_value_heads)
- if should_use_compresskv(input, input.shape[1]):
+ use_compresskv = should_use_compresskv(input, input.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
+ if use_compresskv:
if not isinstance(past_key_values, DynamicCompressCache):
if use_quantize:
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(
@@ -183,7 +187,7 @@ def llama_model_forward_4_38(
past_key_values = DynamicCompressCache.from_legacy_cache(
past_key_values)
elif use_quantize:
- if not isinstance(past_key_values, (DynamicFp8Cache, DynamicCompressCache)):
+ if not isinstance(past_key_values, DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
return llama_model_forward_4_38_internal(
self=self,
@@ -221,7 +225,9 @@ def llama_model_forward_4_41(
use_quantize = use_quantize_kv_cache(
self.layers[0].mlp.up_proj, input,
self.config.num_attention_heads//self.config.num_key_value_heads)
- if should_use_compresskv(input, input.shape[1]):
+ use_compresskv = should_use_compresskv(input, input.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
+ if use_compresskv:
if not isinstance(past_key_values, DynamicCompressCache):
if use_quantize:
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(
@@ -230,7 +236,7 @@ def llama_model_forward_4_41(
past_key_values = DynamicCompressCache.from_legacy_cache(
past_key_values)
elif use_quantize:
- if not isinstance(past_key_values, (DynamicFp8Cache, DynamicCompressCache)):
+ if not isinstance(past_key_values, DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
return llama_model_forward_4_41_internal(
self=self,
diff --git a/python/llm/src/ipex_llm/transformers/models/minicpm.py b/python/llm/src/ipex_llm/transformers/models/minicpm.py
index afbcde6c657..d248c507773 100644
--- a/python/llm/src/ipex_llm/transformers/models/minicpm.py
+++ b/python/llm/src/ipex_llm/transformers/models/minicpm.py
@@ -182,7 +182,8 @@ def minicpm_model_forward(
use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.up_proj, inputs,
self.config.num_attention_heads //
self.config.num_key_value_heads)
- use_compress_kv = should_use_compresskv(inputs, inputs.shape[1])
+ use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
use_cache = use_cache if use_cache is not None else self.config.use_cache
if use_cache:
@@ -192,11 +193,11 @@ def minicpm_model_forward(
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values)
else:
past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values)
- elif use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache,
- DynamicCompressCache)):
+ elif (use_quantize_kv and not use_compress_kv
+ and not isinstance(past_key_values, DynamicFp8Cache)):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
elif (not use_quantize_kv and not use_compress_kv
- and not isinstance(past_key_values, (DynamicNormalCache, DynamicCompressCache))):
+ and not isinstance(past_key_values, DynamicNormalCache)):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
# ipex-llm changes end
return origin_forward(
diff --git a/python/llm/src/ipex_llm/transformers/models/phi3.py b/python/llm/src/ipex_llm/transformers/models/phi3.py
index 823fb10391a..bfa380c2f51 100644
--- a/python/llm/src/ipex_llm/transformers/models/phi3.py
+++ b/python/llm/src/ipex_llm/transformers/models/phi3.py
@@ -256,7 +256,8 @@ def model_forward(
use_cache = use_cache if use_cache is not None else self.config.use_cache
inputs = input_ids if input_ids is not None else inputs_embeds
use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.down_proj, inputs)
- use_compress_kv = should_use_compresskv(inputs, inputs.shape[1])
+ use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
if use_cache:
if use_compress_kv and not isinstance(past_key_values,
DynamicCompressCache):
@@ -264,13 +265,11 @@ def model_forward(
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values)
else:
past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values)
- if use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache,
- DynamicCompressCache)):
+ if use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
+ DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
if not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
- (DynamicNormalCache,
- DynamicCompressCache
- )):
+ DynamicNormalCache):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
if past_key_values.get_seq_length() == 0:
n_layer = self.config.num_hidden_layers
diff --git a/python/llm/src/ipex_llm/transformers/models/qwen2.py b/python/llm/src/ipex_llm/transformers/models/qwen2.py
index c01488a6fb6..802c5e7ec45 100644
--- a/python/llm/src/ipex_llm/transformers/models/qwen2.py
+++ b/python/llm/src/ipex_llm/transformers/models/qwen2.py
@@ -120,7 +120,8 @@ def qwen2_model_forward(
and use_quantize_kv_cache(self.layers[0].mlp.up_proj, inputs,
self.config.num_attention_heads//self.config.num_key_value_heads)
)
- use_compress_kv = should_use_compresskv(inputs, inputs.shape[1])
+ use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
if use_cache:
if use_compress_kv and not isinstance(past_key_values, DynamicCompressCache):
@@ -128,12 +129,11 @@ def qwen2_model_forward(
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values)
else:
past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values)
- elif use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache,
- DynamicCompressCache)):
+ elif use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
+ DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
if not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
- (DynamicNormalCache,
- DynamicCompressCache)):
+ DynamicNormalCache):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
past_key_values_length = past_key_values.get_usable_length(seq_length)
# ipex-llm changes end
@@ -316,7 +316,8 @@ def qwen2_model_forward_4_42(
and use_quantize_kv_cache(self.layers[0].mlp.up_proj, inputs_embeds,
self.config.num_attention_heads//self.config.num_key_value_heads)
)
- use_compress_kv = should_use_compresskv(inputs_embeds, inputs_embeds.shape[1])
+ use_compress_kv = should_use_compresskv(inputs_embeds, inputs_embeds.shape[1]) or \
+ isinstance(past_key_values, DynamicCompressCache)
if use_cache:
if use_compress_kv and not isinstance(past_key_values, DynamicCompressCache):
@@ -324,12 +325,11 @@ def qwen2_model_forward_4_42(
past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values)
else:
past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values)
- elif use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache,
- DynamicCompressCache)):
+ elif use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
+ DynamicFp8Cache):
past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values)
- elif not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
- (DynamicNormalCache,
- DynamicCompressCache)):
+ if not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values,
+ DynamicNormalCache):
past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values)
# ipex-llm changes end
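Editor's note: across llama, minicpm, phi3 and qwen2, this patch converges on one cache-selection rule: once a compressed KV cache is already in use, keep using it; fall back to a quantized (FP8) or normal cache only when compression is not active. The sketch below restates that precedence in isolation. The classes are simplified stand-ins for ipex-llm's cache types (including the assumed subclass relationships), not the library's API, and the real code converts existing caches via `from_legacy_cache` rather than constructing empty ones.

```python
# Simplified sketch (not ipex-llm source) of the cache-selection precedence
# introduced by this patch: compress (possibly fp8) > quantized fp8 > normal.
class NormalCache: ...
class Fp8Cache(NormalCache): ...
class CompressCache(NormalCache): ...
class CompressFp8Cache(CompressCache): ...  # assumed to subclass CompressCache


def select_cache(past_key_values, prefers_compress: bool, use_quantize: bool):
    # Mirrors the patched forwards: an existing compressed cache forces the
    # compress path, so the quantize branch no longer needs to special-case it.
    use_compress = prefers_compress or isinstance(past_key_values, CompressCache)
    if use_compress:
        if not isinstance(past_key_values, CompressCache):
            return CompressFp8Cache() if use_quantize else CompressCache()
        return past_key_values
    if use_quantize:
        return past_key_values if isinstance(past_key_values, Fp8Cache) else Fp8Cache()
    return past_key_values if isinstance(past_key_values, NormalCache) else NormalCache()
```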
From f573959b2a6e4854379905777afaac58aedb908e Mon Sep 17 00:00:00 2001
From: RyuKosei <70006706+RyuKosei@users.noreply.github.com>
Date: Tue, 20 Aug 2024 18:50:00 +0800
Subject: [PATCH 10/11] Update local import for ppl (#11866)
Co-authored-by: jenniew
---
python/llm/dev/benchmark/perplexity/run_wikitext.py | 2 +-
1 file changed, 1 insertion(+), 1 deletion(-)
diff --git a/python/llm/dev/benchmark/perplexity/run_wikitext.py b/python/llm/dev/benchmark/perplexity/run_wikitext.py
index 50991558f35..92426a86fb1 100644
--- a/python/llm/dev/benchmark/perplexity/run_wikitext.py
+++ b/python/llm/dev/benchmark/perplexity/run_wikitext.py
@@ -21,7 +21,6 @@
import torch
from tqdm import tqdm
from datasets import load_dataset
-from ipex_llm.utils.common import invalidInputError
parser = argparse.ArgumentParser()
@@ -63,6 +62,7 @@ def parse_kwargs(kwstr):
data = f.read()
encodings = tokenizer(data.decode("utf-8").strip("\n"), return_tensors="pt")
else:
+ from ipex_llm.utils.common import invalidInputError
raise invalidInputError(False, "Must specify either dataset or datapath.")
if not args.max_length:
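Editor's note: the hunk above defers the `ipex_llm.utils.common` import to the error path, so the module-level imports of `run_wikitext.py` no longer depend on it. A generic, hedged sketch of that deferred-import pattern (not the script itself):

```python
# Sketch of the deferred-import pattern used above: the error helper is
# imported only on the failure path, keeping the happy path free of the
# optional dependency.
def load_text(dataset=None, datapath=None):
    if dataset is not None:
        return dataset
    if datapath is not None:
        with open(datapath, "rb") as f:
            return f.read().decode("utf-8").strip("\n")
    # Imported lazily, mirroring run_wikitext.py after this patch;
    # invalidInputError raises when its condition is False.
    from ipex_llm.utils.common import invalidInputError
    invalidInputError(False, "Must specify either dataset or datapath.")
```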
From 52728feb7738ebc29c7cc2afb73fd4d49c30a664 Mon Sep 17 00:00:00 2001
From: cranechu <1340390339@qq.com>
Date: Tue, 20 Aug 2024 19:18:49 +0800
Subject: [PATCH 11/11] fix: textual adjustment
---
python/llm/dev/benchmark/perplexity/README.md | 6 ++----
1 file changed, 2 insertions(+), 4 deletions(-)
diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md
index 3d824ac570f..410358eed34 100644
--- a/python/llm/dev/benchmark/perplexity/README.md
+++ b/python/llm/dev/benchmark/perplexity/README.md
@@ -2,9 +2,7 @@
Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py)
## Environment Preparation
-Install ipex-llm and dataset.
```bash
-# below command will install intel_extension_for_pytorch==2.1.10+xpu as default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
pip install datasets
```
@@ -14,9 +12,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this
source /opt/intel/oneapi/setvars.sh
```
-## Running PPL Evaluation
+## PPL Evaluation
### 1. Run on Wikitext
-An example to run perplexity on wikitext:
+An example to run perplexity on [wikitext](https://paperswithcode.com/dataset/wikitext-2):
```bash
python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096
```
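Editor's note: the `--stride` and `--max_length` flags shown in this README diff control a fixed-length sliding-window perplexity computation, in the spirit of the transformers perplexity guide the README cites. The condensed sketch below illustrates that metric; it is not `run_wikitext.py` itself, and the model/tokenizer choices are only examples.

```python
# Condensed sliding-window perplexity sketch (illustrative, not the benchmark
# script): each window of max_length tokens is scored, with the overlapping
# context tokens masked out via the -100 label.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer


def perplexity(model, encodings, max_length=4096, stride=512, device="cpu"):
    nlls, prev_end = [], 0
    seq_len = encodings.input_ids.size(1)
    for begin in range(0, seq_len, stride):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # tokens actually scored in this window
        input_ids = encodings.input_ids[:, begin:end].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # mask the overlapping context
        with torch.no_grad():
            nlls.append(model(input_ids, labels=target_ids).loss)
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).mean())


if __name__ == "__main__":
    tok = AutoTokenizer.from_pretrained("meta-llama/Meta-Llama-3-8B")
    model = AutoModelForCausalLM.from_pretrained("meta-llama/Meta-Llama-3-8B")
    enc = tok("some long evaluation text ...", return_tensors="pt")
    print(perplexity(model, enc, max_length=4096, stride=512))
```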