From 5e0144997665fdb3b522168d52a6fbd7c41b1f89 Mon Sep 17 00:00:00 2001 From: cranechu <1340390339@qq.com> Date: Tue, 20 Aug 2024 16:41:34 +0800 Subject: [PATCH 01/11] feat: update readme for ppl test --- python/llm/dev/benchmark/perplexity/README.md | 68 +++++++++++++++++-- 1 file changed, 63 insertions(+), 5 deletions(-) diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md index 8e6d5bacb89..d20ccf686f1 100644 --- a/python/llm/dev/benchmark/perplexity/README.md +++ b/python/llm/dev/benchmark/perplexity/README.md @@ -1,29 +1,87 @@ # Perplexity Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py) -## Run on Wikitext +## Requirements +To run perplexity test with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. +### 1. Install IPEX +We suggest using conda to manage environment: ```bash -pip install datasets +conda create -n llm python=3.11 +conda activate llm +# below command will install intel_extension_for_pytorch==2.1.10+xpu as default +pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ ``` -An example to run perplexity on wikitext: + + +### 2. Configures OneAPI environment variables for Linux + +> [!NOTE] +> Skip this step if you are running on Windows. + +This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI. + ```bash +source /opt/intel/oneapi/setvars.sh +``` -python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096 +### 3. Runtime Configurations +For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. +
+<details>
+
+<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
+
+```bash
+export USE_XETLA=OFF
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export SYCL_CACHE_PERSISTENT=1
```
-## Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset
+
+</details>
+
+<details>
+
+<summary>For Intel Data Center GPU Max Series</summary>
+
+```bash
+export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
+export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
+export SYCL_CACHE_PERSISTENT=1
+export ENABLE_SDP_FUSION=1
+```
+> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
+
+</details>
+
+<details>
+
+<summary>For Intel iGPU</summary>
+
+```bash
+export SYCL_CACHE_PERSISTENT=1
+export BIGDL_LLM_XMX_DISABLED=1
+```
+
+</details>
+ +### 4. installing dependency +Install the dataset dependency to download and load dataset for the test. ```bash pip install datasets ``` +## Running the test +### 1.Run on Wikitext +An example to run perplexity on wikitext: +```bash +python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096 +``` +### 2.Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset An example to run perplexity on chatglm3-6b using the default Chinese datasets("multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh") ```bash python run_longbench.py --model_path THUDM/chatglm3-6b --precisions float16 sym_int4 --device xpu --language zh ``` + Notes: - If you want to test model perplexity on a few selected datasets from the `LongBench` dataset, please use the format below. ```bash From 6122714f9d454b65111b801a2aba86fa21f6bd7a Mon Sep 17 00:00:00 2001 From: cranechu <1340390339@qq.com> Date: Tue, 20 Aug 2024 17:04:12 +0800 Subject: [PATCH 02/11] fix: textual adjustments --- python/llm/dev/benchmark/perplexity/README.md | 63 +++---------------- 1 file changed, 8 insertions(+), 55 deletions(-) diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md index d20ccf686f1..6f8e721129d 100644 --- a/python/llm/dev/benchmark/perplexity/README.md +++ b/python/llm/dev/benchmark/perplexity/README.md @@ -1,80 +1,33 @@ # Perplexity Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py) -## Requirements +## Environment Preparations To run perplexity test with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. -### 1. Install IPEX -We suggest using conda to manage environment: +We suggest using conda to manage iprx environment: ```bash conda create -n llm python=3.11 conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ ``` - - -### 2. Configures OneAPI environment variables for Linux - -> [!NOTE] -> Skip this step if you are running on Windows. - -This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI. - -```bash -source /opt/intel/oneapi/setvars.sh -``` - -### 3. Runtime Configurations -For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device. -
-<details>
-
-<summary>For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series</summary>
-
-```bash
-export USE_XETLA=OFF
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-```
-
-</details>
-
-<details>
-
-<summary>For Intel Data Center GPU Max Series</summary>
-
+Install the dataset dependency to download and load dataset for the test.
```bash
-export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
-export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
-export SYCL_CACHE_PERSISTENT=1
-export ENABLE_SDP_FUSION=1
+pip install datasets
```
-> Note: Please note that `libtcmalloc.so` can be installed by `conda install -c conda-forge -y gperftools=2.10`.
-
-</details>
-
-<details>
-
-<summary>For Intel iGPU</summary>
+This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI.
```bash
-export SYCL_CACHE_PERSISTENT=1
-export BIGDL_LLM_XMX_DISABLED=1
+source /opt/intel/oneapi/setvars.sh
```
-
-</details>
- -### 4. installing dependency -Install the dataset dependency to download and load dataset for the test. -```bash -pip install datasets -``` ## Running the test -### 1.Run on Wikitext +### 1. Run on Wikitext An example to run perplexity on wikitext: ```bash python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096 ``` -### 2.Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset +### 2. Run on [THUDM/LongBench](https://github.com/THUDM/LongBench) dataset An example to run perplexity on chatglm3-6b using the default Chinese datasets("multifieldqa_zh", "dureader", "vcsum", "lsht", "passage_retrieval_zh") ```bash From 9e67b22502337b783c22fff47ec3240002d7de91 Mon Sep 17 00:00:00 2001 From: cranechu <1340390339@qq.com> Date: Tue, 20 Aug 2024 17:34:42 +0800 Subject: [PATCH 03/11] fix: textual adjustments --- python/llm/dev/benchmark/perplexity/README.md | 13 +++---------- 1 file changed, 3 insertions(+), 10 deletions(-) diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md index 6f8e721129d..3d824ac570f 100644 --- a/python/llm/dev/benchmark/perplexity/README.md +++ b/python/llm/dev/benchmark/perplexity/README.md @@ -1,18 +1,11 @@ # Perplexity Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py) -## Environment Preparations -To run perplexity test with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. - -We suggest using conda to manage iprx environment: +## Environment Preparation +Install ipex-llm and dataset. ```bash -conda create -n llm python=3.11 -conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ -``` -Install the dataset dependency to download and load dataset for the test. -```bash pip install datasets ``` This is a required step on Linux for APT or offline installed oneAPI. Skip this step for PIP-installed oneAPI. @@ -21,7 +14,7 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this source /opt/intel/oneapi/setvars.sh ``` -## Running the test +## Running PPL Evaluation ### 1. 
Run on Wikitext An example to run perplexity on wikitext: ```bash From 979c738194d9afa8281878ab8c38dc01d62b64d7 Mon Sep 17 00:00:00 2001 From: SONG Ge <38711238+sgwhat@users.noreply.github.com> Date: Tue, 20 Aug 2024 17:29:49 +0800 Subject: [PATCH 04/11] Add ipex-llm npu option in setup.py (#11858) * add ipex-llm npu release * update example doc * meet latest release changes --- .../example/NPU/HF-Transformers-AutoModels/LLM/README.md | 7 ++----- python/llm/setup.py | 7 +++++++ 2 files changed, 9 insertions(+), 5 deletions(-) diff --git a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md index 31e055b5bea..728617f0a45 100644 --- a/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md +++ b/python/llm/example/NPU/HF-Transformers-AutoModels/LLM/README.md @@ -91,11 +91,8 @@ We suggest using conda to manage environment: conda create -n llm python=3.10 conda activate llm -# install ipex-llm with 'all' option -pip install --pre --upgrade ipex-llm[all] -pip install --pre --upgrade bigdl-core-npu - -pip install transformers==4.40 +# install ipex-llm with 'npu' option +pip install --pre --upgrade ipex-llm[npu] ``` ### 2. Runtime Configurations diff --git a/python/llm/setup.py b/python/llm/setup.py index ecb7aea861b..f9adc5f39f8 100644 --- a/python/llm/setup.py +++ b/python/llm/setup.py @@ -300,6 +300,12 @@ def setup_package(): serving_requires = ['py-cpuinfo'] serving_requires += SERVING_DEP + npu_requires = copy.deepcopy(all_requires) + cpu_transformers_version = ['transformers == 4.37.0', 'tokenizers == 0.15.2'] + for exclude_require in cpu_transformers_version: + npu_requires.remove(exclude_require) + npu_requires += ["transformers==4.40.0", + "bigdl-core-npu==" + CORE_XE_VERSION + ";platform_system=='Windows'"] metadata = dict( name='ipex_llm', @@ -323,6 +329,7 @@ def setup_package(): }, extras_require={"all": all_requires, "xpu": xpu_requires, # default to ipex 2.1 for linux and windows + "npu": npu_requires, "xpu-2-1": xpu_21_requires, "serving": serving_requires, "cpp": cpp_requires, From a9ab309690ef1e69e85153c9963f0b6feab011ab Mon Sep 17 00:00:00 2001 From: Yishuo Wang Date: Tue, 20 Aug 2024 17:32:51 +0800 Subject: [PATCH 05/11] optimize phi3 memory usage (#11867) --- python/llm/src/ipex_llm/transformers/kv.py | 15 +++++++++++++++ .../llm/src/ipex_llm/transformers/models/phi3.py | 14 +++++++++++--- 2 files changed, 26 insertions(+), 3 deletions(-) diff --git a/python/llm/src/ipex_llm/transformers/kv.py b/python/llm/src/ipex_llm/transformers/kv.py index 100da837a9e..8b20f546893 100644 --- a/python/llm/src/ipex_llm/transformers/kv.py +++ b/python/llm/src/ipex_llm/transformers/kv.py @@ -121,6 +121,21 @@ def update( return self.key_cache[layer_idx], self.value_cache[layer_idx] + @classmethod + def from_reserved(cls, layers: int, + bsz: int, n_head: int, length: int, head_dim: int, + dtype: torch.dtype, device: torch.device): + past_key_values = cls() + for _i in range(layers): + k_cache, v_cache = init_kv_cache( + bsz, n_head, head_dim, + 0, length + cls.KV_ALLOC_BLOCK_LENGTH, + dtype, device + ) + past_key_values.key_cache.append(k_cache) + past_key_values.value_cache.append(v_cache) + return past_key_values + # Copied from transformers.models.llama.modeling_llama.repeat_kv def repeat_kv(hidden_states: torch.Tensor, n_rep: int) -> torch.Tensor: diff --git a/python/llm/src/ipex_llm/transformers/models/phi3.py b/python/llm/src/ipex_llm/transformers/models/phi3.py index 5c630681cc9..823fb10391a 100644 --- 
a/python/llm/src/ipex_llm/transformers/models/phi3.py +++ b/python/llm/src/ipex_llm/transformers/models/phi3.py @@ -254,9 +254,9 @@ def model_forward( ): # IPEX-LLM OPT: kv cache and quantize kv cache and sdp use_cache = use_cache if use_cache is not None else self.config.use_cache - input = input_ids if input_ids is not None else inputs_embeds - use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.down_proj, input) - use_compress_kv = should_use_compresskv(input, input.shape[1]) + inputs = input_ids if input_ids is not None else inputs_embeds + use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.down_proj, inputs) + use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) if use_cache: if use_compress_kv and not isinstance(past_key_values, DynamicCompressCache): @@ -272,6 +272,14 @@ def model_forward( DynamicCompressCache )): past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values) + if past_key_values.get_seq_length() == 0: + n_layer = self.config.num_hidden_layers + n_head = self.config.num_attention_heads + head_dim = self.config.hidden_size // self.config.num_attention_heads + past_key_values = DynamicNormalCache.from_reserved( + n_layer, inputs.size(0), n_head, inputs.size(1), head_dim, + inputs.dtype, inputs.device + ) return origin_model_forward( self=self, input_ids=input_ids, From f5f3f19f98efe77c23eb2f5ccadbdaf58643ba8b Mon Sep 17 00:00:00 2001 From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com> Date: Tue, 20 Aug 2024 17:37:58 +0800 Subject: [PATCH 06/11] Update `ipex-llm` default transformers version to 4.37.0 (#11859) * Update default transformers version to 4.37.0 * Add dependency requirements for qwen and qwen-vl * Temp fix transformers version for these not yet verified models * Skip qwen test in UT for now as it requires transformers<4.37.0 --- .../CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md | 2 ++ .../CPU/HF-Transformers-AutoModels/Model/qwen/README.md | 4 ++++ python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md | 4 ++++ python/llm/example/GPU/HuggingFace/LLM/qwen/README.md | 2 ++ .../llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md | 2 ++ .../GPU/HuggingFace/Multimodal/voiceassistant/README.md | 2 ++ .../llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md | 2 ++ python/llm/example/GPU/PyTorch-Models/Model/llava/README.md | 2 -- python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md | 2 ++ .../llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md | 2 ++ python/llm/setup.py | 2 +- python/llm/test/inference_gpu/test_transformers_api.py | 2 +- .../llm/test/inference_gpu/test_transformers_api_RMSNorm.py | 2 +- .../llm/test/inference_gpu/test_transformers_api_attention.py | 2 +- python/llm/test/inference_gpu/test_transformers_api_mlp.py | 2 +- 15 files changed, 27 insertions(+), 7 deletions(-) diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md index 7dc3dedc5cb..7f5061eccd6 100644 --- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md +++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen-vl/README.md @@ -20,6 +20,7 @@ conda activate llm # install the latest ipex-llm nightly build with 'all' option pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu +pip install "transformers<4.37.0" pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard 
matplotlib # additional package required for Qwen-VL-Chat to conduct generation ``` @@ -32,6 +33,7 @@ conda activate llm pip install --pre --upgrade ipex-llm[all] +pip install "transformers<4.37.0" pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib ``` diff --git a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md index cee06098d2d..992ea9ee10e 100644 --- a/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md +++ b/python/llm/example/CPU/HF-Transformers-AutoModels/Model/qwen/README.md @@ -22,6 +22,8 @@ conda activate llm # install the latest ipex-llm nightly build with 'all' option pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu + +pip install "transformers<4.37.0" pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation ``` @@ -32,6 +34,8 @@ conda create -n llm python=3.11 conda activate llm pip install --pre --upgrade ipex-llm[all] + +pip install "transformers<4.37.0" pip install tiktoken einops transformers_stream_generator ``` diff --git a/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md b/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md index 25744465c26..f6f5f1ffe8e 100644 --- a/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md +++ b/python/llm/example/CPU/PyTorch-Models/Model/qwen-vl/README.md @@ -19,6 +19,8 @@ conda activate llm # install the latest ipex-llm nightly build with 'all' option pip install --pre --upgrade ipex-llm[all] --extra-index-url https://download.pytorch.org/whl/cpu + +pip install "transformers<4.37.0" pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation ``` @@ -29,6 +31,8 @@ conda create -n llm python=3.11 conda activate llm pip install --pre --upgrade ipex-llm[all] + +pip install "transformers<4.37.0" pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib ``` diff --git a/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md b/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md index 500e2b0f2ad..8311f7f1369 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/qwen/README.md @@ -15,6 +15,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install "transformers<4.37.0" pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation ``` @@ -27,6 +28,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install "transformers<4.37.0" pip install tiktoken einops transformers_stream_generator # additional package required for Qwen-7B-Chat to conduct generation ``` diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md index fb02816b1f0..737232661fd 100644 --- 
a/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md +++ b/python/llm/example/GPU/HuggingFace/Multimodal/qwen-vl/README.md @@ -15,6 +15,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install "transformers<4.37.0" pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation ``` @@ -27,6 +28,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install "transformers<4.37.0" pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation ``` diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md index 67c0fb26249..7dea109b078 100644 --- a/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md +++ b/python/llm/example/GPU/HuggingFace/Multimodal/voiceassistant/README.md @@ -17,6 +17,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install transformers==4.36.2 pip install librosa soundfile datasets pip install accelerate pip install SpeechRecognition sentencepiece colorama @@ -33,6 +34,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install transformers==4.36.2 pip install librosa soundfile datasets pip install accelerate pip install SpeechRecognition sentencepiece colorama diff --git a/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md index 29a4dc4619c..ac664fb0a36 100644 --- a/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md +++ b/python/llm/example/GPU/HuggingFace/Multimodal/whisper/readme.md @@ -16,6 +16,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install transformers==4.36.2 pip install datasets soundfile librosa # required by audio processing ``` @@ -28,6 +29,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install transformers==4.36.2 pip install datasets soundfile librosa # required by audio processing ``` diff --git a/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md b/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md index 461ae53a8dd..77e0f1cfd9c 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/llava/README.md @@ -16,7 +16,6 @@ conda 
activate llm pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ pip install einops # install dependencies required by llava -pip install transformers==4.36.2 git clone https://github.com/haotian-liu/LLaVA.git # clone the llava libary cp generate.py ./LLaVA/ # copy our example to the LLaVA folder @@ -34,7 +33,6 @@ conda activate llm pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ pip install einops # install dependencies required by llava -pip install transformers==4.36.2 git clone https://github.com/haotian-liu/LLaVA.git # clone the llava libary copy generate.py .\LLaVA\ # copy our example to the LLaVA folder diff --git a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md index 5f9a617aaa3..c480c545366 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/qwen-vl/README.md @@ -15,6 +15,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install "transformers<4.37.0" pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation ``` @@ -27,6 +28,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install "transformers<4.37.0" pip install accelerate tiktoken einops transformers_stream_generator==0.0.4 scipy torchvision pillow tensorboard matplotlib # additional package required for Qwen-VL-Chat to conduct generation ``` diff --git a/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md b/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md index 171ff392422..98806eda677 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/speech-t5/README.md @@ -15,6 +15,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install transformers==4.36.2 pip install "datasets<2.18" soundfile # additional package required for SpeechT5 to conduct generation ``` @@ -27,6 +28,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ +pip install transformers==4.36.2 pip install "datasets<2.18" soundfile # additional package required for SpeechT5 to conduct generation ``` diff --git a/python/llm/setup.py b/python/llm/setup.py index f9adc5f39f8..4386293cac6 100644 --- a/python/llm/setup.py +++ b/python/llm/setup.py @@ -53,7 +53,7 @@ cpu_torch_version = ["torch==2.1.2+cpu;platform_system=='Linux'", "torch==2.1.2;platform_system=='Windows'"] CONVERT_DEP = ['numpy == 1.26.4', # lastet 2.0.0b1 will cause error - 'transformers == 4.36.2', 'sentencepiece', 'tokenizers == 0.15.2', + 'transformers == 4.37.0', 
'sentencepiece', 'tokenizers == 0.15.2', 'accelerate == 0.23.0', 'tabulate'] + cpu_torch_version SERVING_DEP = ['fschat[model_worker, webui] == 0.2.36', 'protobuf'] diff --git a/python/llm/test/inference_gpu/test_transformers_api.py b/python/llm/test/inference_gpu/test_transformers_api.py index ae9c6b9bc3e..b29c25997ae 100644 --- a/python/llm/test/inference_gpu/test_transformers_api.py +++ b/python/llm/test/inference_gpu/test_transformers_api.py @@ -36,7 +36,7 @@ (AutoModelForCausalLM, AutoTokenizer, os.environ.get('MPT_7B_ORIGIN_PATH')), # (AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')), # (AutoModelForCausalLM, AutoTokenizer, os.environ.get('BAICHUAN2_7B_ORIGIN_PATH')), - # (AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), + # (AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0 ]) def test_completion(Model, Tokenizer, model_path, prompt, answer): with torch.inference_mode(): diff --git a/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py b/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py index f45f017ef0b..edb2adf1ec0 100644 --- a/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py +++ b/python/llm/test/inference_gpu/test_transformers_api_RMSNorm.py @@ -32,7 +32,7 @@ ("ChatGLM2-6B", AutoModel, AutoTokenizer, os.environ.get('CHATGLM2_6B_ORIGIN_PATH')), ("Mistral-7B-Instruct-v0.1", AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')), ("Baichuan2-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('BAICHUAN2_7B_ORIGIN_PATH')), - ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), + # ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0 ] class Test_Optimize_Gpu_Model: diff --git a/python/llm/test/inference_gpu/test_transformers_api_attention.py b/python/llm/test/inference_gpu/test_transformers_api_attention.py index 4db5ba8b531..84bdcf8e8cb 100644 --- a/python/llm/test/inference_gpu/test_transformers_api_attention.py +++ b/python/llm/test/inference_gpu/test_transformers_api_attention.py @@ -34,7 +34,7 @@ ("ChatGLM2-6B", AutoModel, AutoTokenizer, os.environ.get('CHATGLM2_6B_ORIGIN_PATH')), ("Mistral-7B-Instruct-v0.1", AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')), ("Baichuan2-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('BAICHUAN2_7B_ORIGIN_PATH')), - ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), + # ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0 ] class Test_Optimize_Gpu_Model: diff --git a/python/llm/test/inference_gpu/test_transformers_api_mlp.py b/python/llm/test/inference_gpu/test_transformers_api_mlp.py index cf0581a50c0..c6229d73fc4 100644 --- a/python/llm/test/inference_gpu/test_transformers_api_mlp.py +++ b/python/llm/test/inference_gpu/test_transformers_api_mlp.py @@ -27,7 +27,7 @@ PROMPT = "Once upon a time, there existed a little girl who liked to have adventures. 
She wanted to go to places and meet new people, and have fun" TEST_MODEL_LIST = [ - ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), + # ("Qwen-7B-Chat", AutoModelForCausalLM, AutoTokenizer, os.environ.get('QWEN_7B_ORIGIN_PATH')), # qwen requires transformers<4.37.0 ("Mistral-7B-Instruct-v0.1", AutoModelForCausalLM, AutoTokenizer, os.environ.get('MISTRAL_7B_INSTRUCT_V0_1_ORIGIN_PATH')), ("Llama2-7B", AutoModelForCausalLM, LlamaTokenizer, os.environ.get('LLAMA2_7B_ORIGIN_PATH')) ] From cab32ea354f5fa388bb1d11f90913c98f459c594 Mon Sep 17 00:00:00 2001 From: Yuwen Hu <54161268+Oscilloscope98@users.noreply.github.com> Date: Tue, 20 Aug 2024 17:59:28 +0800 Subject: [PATCH 07/11] Update performance test regarding updated default `transformers==4.37.0` (#11869) * Update igpu performance from transformers 4.36.2 to 4.37.0 (#11841) * upgrade arc perf test to transformers 4.37 (#11842) * fix load low bit com dtype (#11832) * feat: add mixed_precision argument on ppl longbench evaluation * fix: delete extra code * feat: upgrade arc perf test to transformers 4.37 * fix: add missing codes * fix: keep perf test for qwen-vl-chat in transformers 4.36 * fix: remove extra space * fix: resolve pr comment * fix: add empty line * fix: add pip install for spr and core test * fix: delete extra comments * fix: remove python -m for pip * Revert "fix load low bit com dtype (#11832)" This reverts commit 6841a9ac8fc8b3f4eb06e41fa3944f7877fd8f94. --------- Co-authored-by: Zhao Changmin Co-authored-by: Jinhe Tang * add transformers==4.36 for qwen vl in igpu-perf (#11846) * add transformers==4.36.2 for qwen-vl * Small update --------- Co-authored-by: Yuwen Hu * fix: remove qwen-7b on core test (#11851) * fix: remove qwen-7b on core test * fix: change delete to comment --------- Co-authored-by: Jinhe Tang * replce filename (#11854) * fix: remove qwen-7b on core test * fix: change delete to comment * fix: replace filename --------- Co-authored-by: Jinhe Tang * fix: delete extra comments (#11863) * Remove transformers installation for temp test purposes * Small fix * Small update --------- Co-authored-by: Chu,Youcheng <70999398+cranechu0131@users.noreply.github.com> Co-authored-by: Zhao Changmin Co-authored-by: Jinhe Tang Co-authored-by: Zijie Li Co-authored-by: Chu,Youcheng <1340390339@qq.com> --- .github/workflows/llm_performance_tests.yml | 128 +++++++----------- .../test/benchmark/arc-perf-test-batch2.yaml | 30 ---- .../test/benchmark/arc-perf-test-batch4.yaml | 36 ----- python/llm/test/benchmark/arc-perf-test.yaml | 32 ----- .../arc-perf-transformers-436-batch2.yaml | 16 +++ .../arc-perf-transformers-436-batch4.yaml | 18 +++ .../benchmark/arc-perf-transformers-436.yaml | 16 +++ .../arc-perf-transformers-437-batch2.yaml | 14 ++ .../arc-perf-transformers-437-batch4.yaml | 18 ++- .../benchmark/arc-perf-transformers-437.yaml | 14 ++ python/llm/test/benchmark/core-perf-test.yaml | 2 +- .../test/benchmark/igpu-perf/1024-128.yaml | 8 +- .../{1024-128_437.yaml => 1024-128_436.yaml} | 8 +- .../igpu-perf/1024-128_int4_fp16.yaml | 8 +- ...6_437.yaml => 1024-128_int4_fp16_436.yaml} | 8 +- .../1024-128_int4_fp16_loadlowbit.yaml | 7 +- ...=> 1024-128_int4_fp16_loadlowbit_436.yaml} | 7 +- .../igpu-perf/2048-256_int4_fp16.yaml | 8 +- ...6_437.yaml => 2048-256_int4_fp16_436.yaml} | 8 +- .../igpu-perf/3072-384_int4_fp16.yaml | 8 +- ...6_437.yaml => 3072-384_int4_fp16_436.yaml} | 10 +- .../benchmark/igpu-perf/32-32_int4_fp16.yaml | 8 +- ...fp16_437.yaml => 32-32_int4_fp16_436.yaml} | 8 +- 
.../igpu-perf/4096-512_int4_fp16.yaml | 7 + .../igpu-perf/4096-512_int4_fp16_437.yaml | 19 --- 25 files changed, 202 insertions(+), 244 deletions(-) delete mode 100644 python/llm/test/benchmark/arc-perf-test-batch2.yaml delete mode 100644 python/llm/test/benchmark/arc-perf-test-batch4.yaml delete mode 100644 python/llm/test/benchmark/arc-perf-test.yaml create mode 100644 python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml create mode 100644 python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml create mode 100644 python/llm/test/benchmark/arc-perf-transformers-436.yaml rename python/llm/test/benchmark/igpu-perf/{1024-128_437.yaml => 1024-128_436.yaml} (65%) rename python/llm/test/benchmark/igpu-perf/{1024-128_int4_fp16_437.yaml => 1024-128_int4_fp16_436.yaml} (65%) rename python/llm/test/benchmark/igpu-perf/{1024-128_int4_fp16_loadlowbit_437.yaml => 1024-128_int4_fp16_loadlowbit_436.yaml} (68%) rename python/llm/test/benchmark/igpu-perf/{2048-256_int4_fp16_437.yaml => 2048-256_int4_fp16_436.yaml} (65%) rename python/llm/test/benchmark/igpu-perf/{3072-384_int4_fp16_437.yaml => 3072-384_int4_fp16_436.yaml} (52%) rename python/llm/test/benchmark/igpu-perf/{32-32_int4_fp16_437.yaml => 32-32_int4_fp16_436.yaml} (65%) delete mode 100644 python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml diff --git a/.github/workflows/llm_performance_tests.yml b/.github/workflows/llm_performance_tests.yml index 36b31f23937..736b1dd4540 100644 --- a/.github/workflows/llm_performance_tests.yml +++ b/.github/workflows/llm_performance_tests.yml @@ -153,7 +153,8 @@ jobs: source /opt/intel/oneapi/setvars.sh export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 - cp python/llm/test/benchmark/arc-perf-test.yaml python/llm/dev/benchmark/all-in-one/config.yaml + pip install transformers==4.36.2 + cp python/llm/test/benchmark/arc-perf-transformers-436.yaml python/llm/dev/benchmark/all-in-one/config.yaml cd python/llm/dev/benchmark/all-in-one mkdir test_batch1 mkdir test_batch2 @@ -167,7 +168,7 @@ jobs: mv *.csv test_batch1 # batch_size 2 cd ../../../../../ - cp python/llm/test/benchmark/arc-perf-test-batch2.yaml python/llm/dev/benchmark/all-in-one/config.yaml + cp python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml python/llm/dev/benchmark/all-in-one/config.yaml cd python/llm/dev/benchmark/all-in-one # change csv name sed -i 's/batch1/batch2/g' run.py @@ -175,7 +176,7 @@ jobs: mv *.csv test_batch2 # batch_size 4 cd ../../../../../ - cp python/llm/test/benchmark/arc-perf-test-batch4.yaml python/llm/dev/benchmark/all-in-one/config.yaml + cp python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml python/llm/dev/benchmark/all-in-one/config.yaml cd python/llm/dev/benchmark/all-in-one # change csv name sed -i 's/batch2/batch4/g' run.py @@ -188,7 +189,7 @@ jobs: source /opt/intel/oneapi/setvars.sh export USE_XETLA=OFF export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1 - # upgrade transformers for model Qwen/Qwen1.5-7B-Chat + # upgrade for default transformers version python -m pip install transformers==4.37.0 # batch_size 1 cp python/llm/test/benchmark/arc-perf-transformers-437.yaml python/llm/dev/benchmark/all-in-one/config.yaml @@ -314,7 +315,7 @@ jobs: run: | # batch_size 1 cd python/llm/dev/benchmark/all-in-one/test_batch1 - python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-test.yaml + python ../../../../test/benchmark/check_results.py -c test1 -y 
../../../../test/benchmark/arc-perf-transformers-436.yaml python ../../../../test/benchmark/check_results.py -c test2 -y ../../../../test/benchmark/arc-perf-transformers-437.yaml python ../../../../test/benchmark/check_results.py -c test3 -y ../../../../test/benchmark/arc-perf-transformers-440.yaml find . -name "*test*.csv" -delete @@ -327,7 +328,7 @@ jobs: rm -r test_batch1 # batch_size 2 cd test_batch2 - python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-test-batch2.yaml + python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-transformers-436-batch2.yaml python ../../../../test/benchmark/check_results.py -c test2 -y ../../../../test/benchmark/arc-perf-transformers-437-batch2.yaml find . -name "*test*.csv" -delete if [[ ${{ github.event_name }} == "schedule" ]]; then @@ -339,7 +340,7 @@ jobs: rm -r test_batch2 # batch_size 4 cd test_batch4 - python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-test-batch4.yaml + python ../../../../test/benchmark/check_results.py -c test1 -y ../../../../test/benchmark/arc-perf-transformers-436-batch4.yaml python ../../../../test/benchmark/check_results.py -c test2 -y ../../../../test/benchmark/arc-perf-transformers-437-batch4.yaml find . -name "*test*.csv" -delete if [[ ${{ github.event_name }} == "schedule" ]]; then @@ -384,7 +385,6 @@ jobs: python -m pip install --upgrade einops python -m pip install --upgrade tiktoken python -m pip install --upgrade transformers_stream_generator - # specific for test on certain commits - name: Download llm binary if: ${{ github.event_name == 'workflow_dispatch' && (inputs.checkout-ref != 'main') }} @@ -653,6 +653,7 @@ jobs: set BIGDL_LLM_XMX_DISABLED=1 REM for llava set TRANSFORMERS_OFFLINE=1 + pip install transformers==4.37.0 cd python\llm\dev\benchmark\all-in-one move ..\..\..\test\benchmark\igpu-perf\32-32_int4_fp16.yaml config.yaml @@ -664,23 +665,23 @@ jobs: call conda deactivate - - name: Prepare igpu perf test for transformers 4.37 (32-32 int4+fp16) + - name: Prepare igpu perf test for transformers 4.36 (32-32 int4+fp16) shell: bash run: | sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py - sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml + sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml - - name: Test on igpu for transformers 4.37 (32-32 int4+fp16) + - name: Test on igpu for transformers 4.36 (32-32 int4+fp16) shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.37.0 + pip install transformers==4.36.2 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 cd python\llm\dev\benchmark\all-in-one - move ..\..\..\test\benchmark\igpu-perf\32-32_int4_fp16_437.yaml config.yaml + move ..\..\..\test\benchmark\igpu-perf\32-32_int4_fp16_436.yaml config.yaml set PYTHONIOENCODING=utf-8 python run.py >> %CSV_SAVE_PATH%\32-32_int4_fp16\log\%LOG_FILE% 2>&1 if %ERRORLEVEL% neq 0 (exit /b 1) @@ -771,7 +772,7 @@ jobs: shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.36.2 + pip install transformers==4.37.0 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 @@ -788,23 +789,23 @@ jobs: call conda deactivate - - name: Prepare igpu perf test for transformers 4.37 (1024-128 int4+fp16) + - name: Prepare igpu perf test for transformers 4.36 (1024-128 int4+fp16) shell: 
bash run: | sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py - sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml + sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml - - name: Test on igpu for transformers 4.37 (1024-128 int4+fp16) + - name: Test on igpu for transformers 4.36 (1024-128 int4+fp16) shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.37.0 + pip install transformers==4.36.2 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 cd python\llm\dev\benchmark\all-in-one - move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_437.yaml config.yaml + move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_436.yaml config.yaml set PYTHONIOENCODING=utf-8 python run.py >> %CSV_SAVE_PATH%\1024-128_int4_fp16\log\%LOG_FILE% 2>&1 if %ERRORLEVEL% neq 0 (exit /b 1) @@ -812,7 +813,7 @@ jobs: if %ERRORLEVEL% neq 0 (exit /b 1) call conda deactivate - + - name: Prepare igpu perf test for transformers 4.38 (1024-128 int4+fp16) shell: bash run: | @@ -894,7 +895,6 @@ jobs: shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.36.2 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 @@ -911,23 +911,23 @@ jobs: call conda deactivate - - name: Prepare igpu perf test for transformers 4.37 (2048-256 int4+fp16) + - name: Prepare igpu perf test for transformers 4.36 (2048-256 int4+fp16) shell: bash run: | sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py - sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml + sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml - - name: Test on igpu for transformers 4.37 (2048-256 int4+fp16) + - name: Test on igpu for transformers 4.36 (2048-256 int4+fp16) shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.37.0 + pip install transformers==4.36.2 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 cd python\llm\dev\benchmark\all-in-one - move ..\..\..\test\benchmark\igpu-perf\2048-256_int4_fp16_437.yaml config.yaml + move ..\..\..\test\benchmark\igpu-perf\2048-256_int4_fp16_436.yaml config.yaml set PYTHONIOENCODING=utf-8 python run.py >> %CSV_SAVE_PATH%\2048-256_int4_fp16\log\%LOG_FILE% 2>&1 if %ERRORLEVEL% neq 0 (exit /b 1) @@ -935,7 +935,7 @@ jobs: if %ERRORLEVEL% neq 0 (exit /b 1) call conda deactivate - + - name: Prepare igpu perf test for transformers 4.38 (2048-256 int4+fp16) shell: bash run: | @@ -1017,7 +1017,7 @@ jobs: shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.36.2 + pip install transformers==4.37.0 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 @@ -1034,23 +1034,23 @@ jobs: call conda deactivate - - name: Prepare igpu perf test for transformers 4.37 (3072-384 int4+fp16) + - name: Prepare igpu perf test for transformers 4.36 (3072-384 int4+fp16) shell: bash run: | sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py - sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml + sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml - - name: Test on igpu for transformers 4.37 (3072-384 int4+fp16) + - name: Test on igpu for 
transformers 4.36 (3072-384 int4+fp16) shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.37.0 + pip install transformers==4.36.2 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 cd python\llm\dev\benchmark\all-in-one - move ..\..\..\test\benchmark\igpu-perf\3072-384_int4_fp16_437.yaml config.yaml + move ..\..\..\test\benchmark\igpu-perf\3072-384_int4_fp16_436.yaml config.yaml set PYTHONIOENCODING=utf-8 python run.py >> %CSV_SAVE_PATH%\3072-384_int4_fp16\log\%LOG_FILE% 2>&1 if %ERRORLEVEL% neq 0 (exit /b 1) @@ -1140,7 +1140,7 @@ jobs: shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.36.2 + pip install transformers==4.37.0 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 @@ -1157,35 +1157,10 @@ jobs: call conda deactivate - - name: Prepare igpu perf test for transformers 4.37 (4096-512 int4+fp16) - shell: bash - run: | - sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py - sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml - - - name: Test on igpu for transformers 4.37 (4096-512 int4+fp16) - shell: cmd - run: | - call conda activate igpu-perf - pip install transformers==4.37.0 - - set SYCL_CACHE_PERSISTENT=1 - set BIGDL_LLM_XMX_DISABLED=1 - - cd python\llm\dev\benchmark\all-in-one - move ..\..\..\test\benchmark\igpu-perf\4096-512_int4_fp16_437.yaml config.yaml - set PYTHONIOENCODING=utf-8 - python run.py >> %CSV_SAVE_PATH%\4096-512_int4_fp16\log\%LOG_FILE% 2>&1 - if %ERRORLEVEL% neq 0 (exit /b 1) - python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test2 - if %ERRORLEVEL% neq 0 (exit /b 1) - - call conda deactivate - - name: Prepare igpu perf test for transformers 4.38 (4096-512 int4+fp16) shell: bash run: | - sed -i 's/{today}_test2/{today}_test3/g' python/llm/dev/benchmark/all-in-one/run.py + sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_438.yaml - name: Test on igpu for transformers 4.38 (4096-512 int4+fp16) @@ -1202,7 +1177,7 @@ jobs: set PYTHONIOENCODING=utf-8 python run.py >> %CSV_SAVE_PATH%\4096-512_int4_fp16\log\%LOG_FILE% 2>&1 if %ERRORLEVEL% neq 0 (exit /b 1) - python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test3 + python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test2 if %ERRORLEVEL% neq 0 (exit /b 1) call conda deactivate @@ -1210,7 +1185,7 @@ jobs: - name: Prepare igpu perf test for transformers 4.43 (4096-512 int4+fp16) shell: bash run: | - sed -i 's/{today}_test3/{today}_test4/g' python/llm/dev/benchmark/all-in-one/run.py + sed -i 's/{today}_test2/{today}_test3/g' python/llm/dev/benchmark/all-in-one/run.py sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_443.yaml - name: Test on igpu for transformers 4.43 (4096-512 int4+fp16) @@ -1228,7 +1203,7 @@ jobs: set PYTHONIOENCODING=utf-8 python run.py >> %CSV_SAVE_PATH%\4096-512_int4_fp16\log\%LOG_FILE% 2>&1 if %ERRORLEVEL% neq 0 (exit /b 1) - python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test4 + python ..\..\..\test\benchmark\igpu-perf\check_csv_results.py --yaml-file config.yaml --suffix test3 if %ERRORLEVEL% neq 0 (exit /b 1) pip uninstall trl -y @@ -1256,14 +1231,14 
@@ jobs: shell: bash run: | sed -i 's/4096-512/1024-128/g' python/llm/dev/benchmark/all-in-one/run.py - sed -i 's/{today}_test4/{today}_test1/g' python/llm/dev/benchmark/all-in-one/run.py + sed -i 's/{today}_test3/{today}_test1/g' python/llm/dev/benchmark/all-in-one/run.py sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml - name: Test on igpu (load_low_bit 1024-128 int4+fp16) shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.36.2 + pip install transformers==4.37.0 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 @@ -1280,23 +1255,23 @@ jobs: call conda deactivate - - name: Prepare igpu perf test for transformers 4.37 (load_low_bit 1024-128 int4+fp16) + - name: Prepare igpu perf test for transformers 4.36 (load_low_bit 1024-128 int4+fp16) shell: bash run: | sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py - sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml + sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml - - name: Test on igpu for transformers 4.37 (load_low_bit 1024-128 int4+fp16) + - name: Test on igpu for transformers 4.36 (load_low_bit 1024-128 int4+fp16) shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.37.0 + pip install transformers==4.36.2 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 cd python\llm\dev\benchmark\all-in-one - move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_loadlowbit_437.yaml config.yaml + move ..\..\..\test\benchmark\igpu-perf\1024-128_int4_fp16_loadlowbit_436.yaml config.yaml set PYTHONIOENCODING=utf-8 python run.py >> %CSV_SAVE_PATH%\1024-128_int4_fp16_loadlowbit\log\%LOG_FILE% 2>&1 if %ERRORLEVEL% neq 0 (exit /b 1) @@ -1385,7 +1360,7 @@ jobs: shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.36.2 + pip install transformers==4.37.0 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 @@ -1402,23 +1377,23 @@ jobs: call conda deactivate - - name: Prepare igpu perf test for transformers 4.37 (1024-128) + - name: Prepare igpu perf test for transformers 4.36 (1024-128) shell: bash run: | sed -i 's/{today}_test1/{today}_test2/g' python/llm/dev/benchmark/all-in-one/run.py - sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_437.yaml + sed -i "s/path to your local model hub/$MODEL_HUB_DIR/g" python/llm/test/benchmark/igpu-perf/1024-128_436.yaml - - name: Test on igpu for transformers 4.37 (1024-128) + - name: Test on igpu for transformers 4.36 (1024-128) shell: cmd run: | call conda activate igpu-perf - pip install transformers==4.37.0 + pip install transformers==4.36.2 set SYCL_CACHE_PERSISTENT=1 set BIGDL_LLM_XMX_DISABLED=1 cd python\llm\dev\benchmark\all-in-one - move ..\..\..\test\benchmark\igpu-perf\1024-128_437.yaml config.yaml + move ..\..\..\test\benchmark\igpu-perf\1024-128_436.yaml config.yaml set PYTHONIOENCODING=utf-8 python run.py >> %CSV_SAVE_PATH%\1024-128\log\%LOG_FILE% 2>&1 if %ERRORLEVEL% neq 0 (exit /b 1) @@ -1520,4 +1495,3 @@ jobs: # shell: cmd # run: | # call conda env remove -n igpu-perf -y - diff --git a/python/llm/test/benchmark/arc-perf-test-batch2.yaml b/python/llm/test/benchmark/arc-perf-test-batch2.yaml deleted file mode 100644 index 70447fd7f59..00000000000 --- 
a/python/llm/test/benchmark/arc-perf-test-batch2.yaml +++ /dev/null @@ -1,30 +0,0 @@ -repo_id: - - 'meta-llama/Llama-2-7b-chat-hf' - - 'meta-llama/Llama-2-13b-chat-hf' - - 'THUDM/chatglm3-6b-4bit' - - 'baichuan-inc/Baichuan2-7B-Chat' - - 'baichuan-inc/Baichuan2-13B-Chat-4bit' - - 'THUDM/glm-4-9b-chat' - - 'openbmb/MiniCPM-2B-sft-bf16' - - 'Qwen/Qwen-VL-Chat' - #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported - - '01-ai/Yi-6B-Chat' - - 'mistralai/Mistral-7B-Instruct-v0.2' - - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' - - '01-ai/Yi-1.5-6B-Chat' -local_model_hub: '/mnt/disk1/models' -warm_up: 1 -num_trials: 3 -num_beams: 1 # default to greedy search -low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4) -batch_size: 2 # default to 1 -in_out_pairs: - - '32-32' - - '1024-128' - - '2048-256' -test_api: - - "transformer_int4_fp16_gpu" # on Intel GPU -cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api) -exclude: - - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048' -task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' diff --git a/python/llm/test/benchmark/arc-perf-test-batch4.yaml b/python/llm/test/benchmark/arc-perf-test-batch4.yaml deleted file mode 100644 index 3bfd47963a4..00000000000 --- a/python/llm/test/benchmark/arc-perf-test-batch4.yaml +++ /dev/null @@ -1,36 +0,0 @@ -repo_id: - - 'meta-llama/Llama-2-7b-chat-hf' - - 'meta-llama/Llama-2-13b-chat-hf' - - 'THUDM/chatglm3-6b-4bit' - - 'baichuan-inc/Baichuan2-7B-Chat' - - 'baichuan-inc/Baichuan2-13B-Chat-4bit' - - 'THUDM/glm-4-9b-chat' - - 'openbmb/MiniCPM-2B-sft-bf16' - - 'Qwen/Qwen-VL-Chat' - #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported - - '01-ai/Yi-6B-Chat' - - 'mistralai/Mistral-7B-Instruct-v0.2' - - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' - - '01-ai/Yi-1.5-6B-Chat' -local_model_hub: '/mnt/disk1/models' -warm_up: 1 -num_trials: 3 -num_beams: 1 # default to greedy search -low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. 
symmetric int4) -batch_size: 4 # default to 1 -in_out_pairs: - - '32-32' - - '1024-128' - - '2048-256' -test_api: - - "transformer_int4_fp16_gpu" # on Intel GPU -cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api) -exclude: - - 'meta-llama/Llama-2-13b-chat-hf:2048' - - 'baichuan-inc/Baichuan2-7B-Chat:2048' - - 'baichuan-inc/Baichuan2-13B-Chat-4bit:1024' - - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048' - - 'Qwen/Qwen-VL-Chat:2048' -# - 'fnlp/moss-moon-003-sft-4bit:1024' -# - 'fnlp/moss-moon-003-sft-4bit:2048' -task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' diff --git a/python/llm/test/benchmark/arc-perf-test.yaml b/python/llm/test/benchmark/arc-perf-test.yaml deleted file mode 100644 index 890b8dbf470..00000000000 --- a/python/llm/test/benchmark/arc-perf-test.yaml +++ /dev/null @@ -1,32 +0,0 @@ -repo_id: - - 'meta-llama/Llama-2-7b-chat-hf' - - 'meta-llama/Llama-2-13b-chat-hf' - - 'THUDM/chatglm3-6b-4bit' - - 'baichuan-inc/Baichuan2-7B-Chat' - - 'baichuan-inc/Baichuan2-13B-Chat-4bit' - - 'THUDM/glm-4-9b-chat' - - 'openbmb/MiniCPM-2B-sft-bf16' - - 'Qwen/Qwen-VL-Chat' - #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported - - '01-ai/Yi-6B-Chat' - - 'mistralai/Mistral-7B-Instruct-v0.2' - - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' - - '01-ai/Yi-1.5-6B-Chat' -local_model_hub: '/mnt/disk1/models' -warm_up: 1 -num_trials: 3 -num_beams: 1 # default to greedy search -low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4) -batch_size: 1 # default to 1 -in_out_pairs: - - '32-32' - - '1024-128' - - '2048-256' -test_api: - - "transformer_int4_fp16_gpu" # on Intel GPU -cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api) -exclude: -# - 'fnlp/moss-moon-003-sft-4bit:1024' -# - 'fnlp/moss-moon-003-sft-4bit:2048' - - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048' -task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' diff --git a/python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml b/python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml new file mode 100644 index 00000000000..42ef79f344c --- /dev/null +++ b/python/llm/test/benchmark/arc-perf-transformers-436-batch2.yaml @@ -0,0 +1,16 @@ +repo_id: + - 'Qwen/Qwen-VL-Chat' +local_model_hub: '/mnt/disk1/models' +warm_up: 1 +num_trials: 3 +num_beams: 1 # default to greedy search +low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4) +batch_size: 2 # default to 1 +in_out_pairs: + - '32-32' + - '1024-128' + - '2048-256' +test_api: + - "transformer_int4_fp16_gpu" # on Intel GPU +cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api) +task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' diff --git a/python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml b/python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml new file mode 100644 index 00000000000..606b9c6cf05 --- /dev/null +++ b/python/llm/test/benchmark/arc-perf-transformers-436-batch4.yaml @@ -0,0 +1,18 @@ +repo_id: + - 'Qwen/Qwen-VL-Chat' +local_model_hub: '/mnt/disk1/models' +warm_up: 1 +num_trials: 3 +num_beams: 1 # default to greedy search +low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. 
symmetric int4) +batch_size: 4 # default to 1 +in_out_pairs: + - '32-32' + - '1024-128' + - '2048-256' +test_api: + - "transformer_int4_fp16_gpu" # on Intel GPU +cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api) +exclude: + - 'Qwen/Qwen-VL-Chat:2048' +task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' diff --git a/python/llm/test/benchmark/arc-perf-transformers-436.yaml b/python/llm/test/benchmark/arc-perf-transformers-436.yaml new file mode 100644 index 00000000000..efdf14193a3 --- /dev/null +++ b/python/llm/test/benchmark/arc-perf-transformers-436.yaml @@ -0,0 +1,16 @@ +repo_id: + - 'Qwen/Qwen-VL-Chat' +local_model_hub: '/mnt/disk1/models' +warm_up: 1 +num_trials: 3 +num_beams: 1 # default to greedy search +low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4) +batch_size: 1 # default to 1 +in_out_pairs: + - '32-32' + - '1024-128' + - '2048-256' +test_api: + - "transformer_int4_fp16_gpu" # on Intel GPU +cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api) +task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' diff --git a/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml b/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml index d675d506629..9b9ab1f14ae 100644 --- a/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml +++ b/python/llm/test/benchmark/arc-perf-transformers-437-batch2.yaml @@ -6,6 +6,18 @@ repo_id: - 'microsoft/phi-3-vision-128k-instruct' - 'Qwen/Qwen2-7B-Instruct' - 'microsoft/Phi-3-mini-128k-instruct' + - 'meta-llama/Llama-2-7b-chat-hf' + - 'meta-llama/Llama-2-13b-chat-hf' + - 'THUDM/chatglm3-6b-4bit' + - 'baichuan-inc/Baichuan2-7B-Chat' + - 'baichuan-inc/Baichuan2-13B-Chat-4bit' + - 'THUDM/glm-4-9b-chat' + - 'openbmb/MiniCPM-2B-sft-bf16' + #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported + - '01-ai/Yi-6B-Chat' + - 'mistralai/Mistral-7B-Instruct-v0.2' + - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' + - '01-ai/Yi-1.5-6B-Chat' local_model_hub: '/mnt/disk1/models' warm_up: 1 num_trials: 3 @@ -19,4 +31,6 @@ in_out_pairs: test_api: - "transformer_int4_fp16_gpu" # on Intel GPU cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api) +exclude: + - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048' task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' diff --git a/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml b/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml index f3d55c83e35..368a8c636b5 100644 --- a/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml +++ b/python/llm/test/benchmark/arc-perf-transformers-437-batch4.yaml @@ -6,6 +6,18 @@ repo_id: - 'microsoft/phi-3-vision-128k-instruct' - 'Qwen/Qwen2-7B-Instruct' - 'microsoft/Phi-3-mini-128k-instruct' + - 'meta-llama/Llama-2-7b-chat-hf' + - 'meta-llama/Llama-2-13b-chat-hf' + - 'THUDM/chatglm3-6b-4bit' + - 'baichuan-inc/Baichuan2-7B-Chat' + - 'baichuan-inc/Baichuan2-13B-Chat-4bit' + - 'THUDM/glm-4-9b-chat' + - 'openbmb/MiniCPM-2B-sft-bf16' + #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported + - '01-ai/Yi-6B-Chat' + - 'mistralai/Mistral-7B-Instruct-v0.2' + - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' + - '01-ai/Yi-1.5-6B-Chat' local_model_hub: '/mnt/disk1/models' warm_up: 1 num_trials: 3 @@ -22,4 +34,8 @@ cpu_embedding: False # 
whether put embedding to CPU (only avaiable now for gpu w exclude: - 'Qwen/Qwen1.5-7B-Chat:2048' - 'meta-llama/Meta-Llama-3-8B-Instruct:2048' -task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' \ No newline at end of file + - 'meta-llama/Llama-2-13b-chat-hf:2048' + - 'baichuan-inc/Baichuan2-7B-Chat:2048' + - 'baichuan-inc/Baichuan2-13B-Chat-4bit:1024' + - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048' +task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' diff --git a/python/llm/test/benchmark/arc-perf-transformers-437.yaml b/python/llm/test/benchmark/arc-perf-transformers-437.yaml index 1c775344c43..bca87891f6b 100644 --- a/python/llm/test/benchmark/arc-perf-transformers-437.yaml +++ b/python/llm/test/benchmark/arc-perf-transformers-437.yaml @@ -6,6 +6,18 @@ repo_id: - 'microsoft/phi-3-vision-128k-instruct' - 'Qwen/Qwen2-7B-Instruct' - 'microsoft/Phi-3-mini-128k-instruct' + - 'meta-llama/Llama-2-7b-chat-hf' + - 'meta-llama/Llama-2-13b-chat-hf' + - 'THUDM/chatglm3-6b-4bit' + - 'baichuan-inc/Baichuan2-7B-Chat' + - 'baichuan-inc/Baichuan2-13B-Chat-4bit' + - 'THUDM/glm-4-9b-chat' + - 'openbmb/MiniCPM-2B-sft-bf16' + #- 'SmerkyG/rwkv-5-world-7b' #this model only fp32 is supported for now, fp16 and bf16 are not supported + - '01-ai/Yi-6B-Chat' + - 'mistralai/Mistral-7B-Instruct-v0.2' + - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' + - '01-ai/Yi-1.5-6B-Chat' local_model_hub: '/mnt/disk1/models' warm_up: 1 num_trials: 3 @@ -19,4 +31,6 @@ in_out_pairs: test_api: - "transformer_int4_fp16_gpu" # on Intel GPU cpu_embedding: False # whether put embedding to CPU (only avaiable now for gpu win related test_api) +exclude: + - 'baichuan-inc/Baichuan2-13B-Chat-4bit:2048' task: 'continuation' # task can be 'continuation', 'QA' and 'summarize' diff --git a/python/llm/test/benchmark/core-perf-test.yaml b/python/llm/test/benchmark/core-perf-test.yaml index 55f738de54b..2def68c1494 100644 --- a/python/llm/test/benchmark/core-perf-test.yaml +++ b/python/llm/test/benchmark/core-perf-test.yaml @@ -3,7 +3,7 @@ repo_id: - 'THUDM/chatglm3-6b' - 'baichuan-inc/Baichuan2-7B-Chat' - 'internlm/internlm-chat-7b' - - 'Qwen/Qwen-7B-Chat' + # - 'Qwen/Qwen-7B-Chat' # requires transformers < 4.37.0 - 'BAAI/AquilaChat2-7B' - 'meta-llama/Llama-2-7b-chat-hf' - 'WisdomShell/CodeShell-7B' diff --git a/python/llm/test/benchmark/igpu-perf/1024-128.yaml b/python/llm/test/benchmark/igpu-perf/1024-128.yaml index b0bd5f30c20..759a7566237 100644 --- a/python/llm/test/benchmark/igpu-perf/1024-128.yaml +++ b/python/llm/test/benchmark/igpu-perf/1024-128.yaml @@ -10,9 +10,15 @@ repo_id: - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' - 'RWKV/v5-Eagle-7B-HF' - '01-ai/Yi-6B-Chat' - - 'Qwen/Qwen-VL-Chat' - 'openbmb/MiniCPM-1B-sft-bf16' - 'openbmb/MiniCPM-2B-sft-bf16' + - 'Qwen/Qwen1.5-7B-Chat' + - 'Qwen/Qwen2-1.5B-Instruct' + - 'Qwen/Qwen2-7B-Instruct' + - 'microsoft/Phi-3-mini-4k-instruct' + - 'microsoft/Phi-3-mini-128k-instruct' + - 'microsoft/phi-3-vision-128k-instruct' + - 'openbmb/MiniCPM-V-2_6' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_437.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_436.yaml similarity index 65% rename from python/llm/test/benchmark/igpu-perf/1024-128_437.yaml rename to python/llm/test/benchmark/igpu-perf/1024-128_436.yaml index c6850389b97..c967f66a7ba 100644 --- a/python/llm/test/benchmark/igpu-perf/1024-128_437.yaml +++ b/python/llm/test/benchmark/igpu-perf/1024-128_436.yaml @@ -1,11 +1,5 
@@ repo_id: - - 'Qwen/Qwen1.5-7B-Chat' - - 'Qwen/Qwen2-1.5B-Instruct' - - 'Qwen/Qwen2-7B-Instruct' - - 'microsoft/Phi-3-mini-4k-instruct' - - 'microsoft/Phi-3-mini-128k-instruct' - - 'microsoft/phi-3-vision-128k-instruct' - - 'openbmb/MiniCPM-V-2_6' + - 'Qwen/Qwen-VL-Chat' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml index 39d575680ab..f66172d9a39 100644 --- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml +++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16.yaml @@ -9,9 +9,15 @@ repo_id: - 'mistralai/Mistral-7B-Instruct-v0.2' - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' - '01-ai/Yi-6B-Chat' - - 'Qwen/Qwen-VL-Chat' - 'openbmb/MiniCPM-1B-sft-bf16' - 'openbmb/MiniCPM-2B-sft-bf16' + - 'Qwen/Qwen1.5-7B-Chat' + - 'Qwen/Qwen2-1.5B-Instruct' + - 'Qwen/Qwen2-7B-Instruct' + - 'microsoft/Phi-3-mini-4k-instruct' + - 'microsoft/Phi-3-mini-128k-instruct' + - 'microsoft/phi-3-vision-128k-instruct' + - 'openbmb/MiniCPM-V-2_6' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml similarity index 65% rename from python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml rename to python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml index 68cbaf2a163..c224b65e745 100644 --- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_437.yaml +++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_436.yaml @@ -1,11 +1,5 @@ repo_id: - - 'Qwen/Qwen1.5-7B-Chat' - - 'Qwen/Qwen2-1.5B-Instruct' - - 'Qwen/Qwen2-7B-Instruct' - - 'microsoft/Phi-3-mini-4k-instruct' - - 'microsoft/Phi-3-mini-128k-instruct' - - 'microsoft/phi-3-vision-128k-instruct' - - 'openbmb/MiniCPM-V-2_6' + - 'Qwen/Qwen-VL-Chat' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml index 2730e465d47..76c35d4dde7 100644 --- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml +++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit.yaml @@ -9,9 +9,14 @@ repo_id: - 'mistralai/Mistral-7B-Instruct-v0.2' - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' - '01-ai/Yi-6B-Chat' - - 'Qwen/Qwen-VL-Chat' - 'openbmb/MiniCPM-1B-sft-bf16' - 'openbmb/MiniCPM-2B-sft-bf16' + - 'Qwen/Qwen1.5-7B-Chat' + - 'Qwen/Qwen2-1.5B-Instruct' + - 'Qwen/Qwen2-7B-Instruct' + - 'microsoft/Phi-3-mini-4k-instruct' + - 'microsoft/Phi-3-mini-128k-instruct' + - 'microsoft/phi-3-vision-128k-instruct' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml similarity index 68% rename from python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml rename to python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml index 3839d0d2951..917e6d0ff3c 100644 --- a/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_437.yaml +++ b/python/llm/test/benchmark/igpu-perf/1024-128_int4_fp16_loadlowbit_436.yaml @@ -1,10 +1,5 @@ repo_id: - - 'Qwen/Qwen1.5-7B-Chat' - - 'Qwen/Qwen2-1.5B-Instruct' - - 
'Qwen/Qwen2-7B-Instruct' - - 'microsoft/Phi-3-mini-4k-instruct' - - 'microsoft/Phi-3-mini-128k-instruct' - - 'microsoft/phi-3-vision-128k-instruct' + - 'Qwen/Qwen-VL-Chat' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml index c53e6283919..bf5fc1e978b 100644 --- a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml +++ b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16.yaml @@ -9,9 +9,15 @@ repo_id: - 'mistralai/Mistral-7B-Instruct-v0.2' - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' - '01-ai/Yi-6B-Chat' - - 'Qwen/Qwen-VL-Chat' - 'openbmb/MiniCPM-1B-sft-bf16' - 'openbmb/MiniCPM-2B-sft-bf16' + - 'Qwen/Qwen1.5-7B-Chat' + - 'Qwen/Qwen2-1.5B-Instruct' + - 'Qwen/Qwen2-7B-Instruct' + - 'microsoft/Phi-3-mini-4k-instruct' + - 'microsoft/Phi-3-mini-128k-instruct' + - 'microsoft/phi-3-vision-128k-instruct' + - 'openbmb/MiniCPM-V-2_6' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml similarity index 65% rename from python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml rename to python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml index 0eddd403b86..e9566c13250 100644 --- a/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_437.yaml +++ b/python/llm/test/benchmark/igpu-perf/2048-256_int4_fp16_436.yaml @@ -1,11 +1,5 @@ repo_id: - - 'Qwen/Qwen1.5-7B-Chat' - - 'Qwen/Qwen2-1.5B-Instruct' - - 'Qwen/Qwen2-7B-Instruct' - - 'microsoft/Phi-3-mini-4k-instruct' - - 'microsoft/Phi-3-mini-128k-instruct' - - 'microsoft/phi-3-vision-128k-instruct' - - 'openbmb/MiniCPM-V-2_6' + - 'Qwen/Qwen-VL-Chat' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml index 47b9839a789..60202594cba 100644 --- a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml +++ b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16.yaml @@ -8,9 +8,15 @@ repo_id: - 'mistralai/Mistral-7B-Instruct-v0.2' - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' - '01-ai/Yi-6B-Chat' - - 'Qwen/Qwen-VL-Chat' - 'openbmb/MiniCPM-1B-sft-bf16' - 'openbmb/MiniCPM-2B-sft-bf16' + - 'Qwen/Qwen1.5-7B-Chat' + - 'Qwen/Qwen2-1.5B-Instruct' + - 'Qwen/Qwen2-7B-Instruct' + - 'microsoft/Phi-3-mini-4k-instruct' + - 'microsoft/Phi-3-mini-128k-instruct' + - 'microsoft/phi-3-vision-128k-instruct' + - 'openbmb/MiniCPM-V-2_6' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml similarity index 52% rename from python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml rename to python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml index 087da9773db..6448a358cb5 100644 --- a/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_437.yaml +++ b/python/llm/test/benchmark/igpu-perf/3072-384_int4_fp16_436.yaml @@ -1,11 +1,5 @@ repo_id: - - 'Qwen/Qwen1.5-7B-Chat' - - 'Qwen/Qwen2-1.5B-Instruct' - - 'Qwen/Qwen2-7B-Instruct' - - 'microsoft/Phi-3-mini-4k-instruct' - - 'microsoft/Phi-3-mini-128k-instruct' - - 'microsoft/phi-3-vision-128k-instruct' - - 'openbmb/MiniCPM-V-2_6' + - 'Qwen/Qwen-VL-Chat' 
local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 @@ -15,5 +9,5 @@ batch_size: 1 # default to 1 in_out_pairs: - '3072-384' test_api: - - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows (catch GPU peak memory) + - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows, use fp16 for non-linear layer cpu_embedding: True # whether put embedding to CPU (only avaiable now for gpu win related test_api) diff --git a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml index 39115e0231b..e70178744a3 100644 --- a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml +++ b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16.yaml @@ -9,9 +9,15 @@ repo_id: - 'mistralai/Mistral-7B-Instruct-v0.2' - 'deepseek-ai/deepseek-coder-7b-instruct-v1.5' - '01-ai/Yi-6B-Chat' - - 'Qwen/Qwen-VL-Chat' - 'openbmb/MiniCPM-1B-sft-bf16' - 'openbmb/MiniCPM-2B-sft-bf16' + - 'Qwen/Qwen1.5-7B-Chat' + - 'Qwen/Qwen2-1.5B-Instruct' + - 'Qwen/Qwen2-7B-Instruct' + - 'microsoft/Phi-3-mini-4k-instruct' + - 'microsoft/Phi-3-mini-128k-instruct' + - 'microsoft/phi-3-vision-128k-instruct' + - 'openbmb/MiniCPM-V-2_6' local_model_hub: 'path to your local model hub' warm_up: 3 num_trials: 5 diff --git a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml similarity index 65% rename from python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml rename to python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml index 1f0d11a2004..8faf43aed97 100644 --- a/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_437.yaml +++ b/python/llm/test/benchmark/igpu-perf/32-32_int4_fp16_436.yaml @@ -1,11 +1,5 @@ repo_id: - - 'Qwen/Qwen1.5-7B-Chat' - - 'Qwen/Qwen2-1.5B-Instruct' - - 'Qwen/Qwen2-7B-Instruct' - - 'microsoft/Phi-3-mini-4k-instruct' - - 'microsoft/Phi-3-mini-128k-instruct' - - 'microsoft/phi-3-vision-128k-instruct' - - 'openbmb/MiniCPM-V-2_6' + - 'Qwen/Qwen-VL-Chat' local_model_hub: 'path to your local model hub' warm_up: 3 num_trials: 5 diff --git a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml b/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml index 26e128a564c..514037a7380 100644 --- a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml +++ b/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16.yaml @@ -10,6 +10,13 @@ repo_id: - '01-ai/Yi-6B-Chat' - 'openbmb/MiniCPM-1B-sft-bf16' - 'openbmb/MiniCPM-2B-sft-bf16' + - 'Qwen/Qwen1.5-7B-Chat' + - 'Qwen/Qwen2-1.5B-Instruct' + - 'Qwen/Qwen2-7B-Instruct' + - 'microsoft/Phi-3-mini-4k-instruct' + - 'microsoft/Phi-3-mini-128k-instruct' + - 'microsoft/phi-3-vision-128k-instruct' + - 'openbmb/MiniCPM-V-2_6' local_model_hub: 'path to your local model hub' warm_up: 1 num_trials: 3 diff --git a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml b/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml deleted file mode 100644 index 4472b5da1f2..00000000000 --- a/python/llm/test/benchmark/igpu-perf/4096-512_int4_fp16_437.yaml +++ /dev/null @@ -1,19 +0,0 @@ -repo_id: - - 'Qwen/Qwen1.5-7B-Chat' - - 'Qwen/Qwen2-1.5B-Instruct' - - 'Qwen/Qwen2-7B-Instruct' - - 'microsoft/Phi-3-mini-4k-instruct' - - 'microsoft/Phi-3-mini-128k-instruct' - - 'microsoft/phi-3-vision-128k-instruct' - - 'openbmb/MiniCPM-V-2_6' -local_model_hub: 'path to your local model hub' -warm_up: 1 -num_trials: 3 -num_beams: 1 # default to greedy search -low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. 
symmetric int4) -batch_size: 1 # default to 1 -in_out_pairs: - - '4096-512' -test_api: - - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows (catch GPU peak memory) -cpu_embedding: True # whether put embedding to CPU (only avaiable now for gpu win related test_api) From 90549883db6143453b4a8c02e131b4adc5b142d6 Mon Sep 17 00:00:00 2001 From: Jinhe Date: Tue, 20 Aug 2024 18:01:42 +0800 Subject: [PATCH 08/11] Pytorch models transformers version update (#11860) * yi sync * delete 4.34 constraint * delete 4.34 constraint * delete 4.31 constraint * delete 4.34 constraint * delete 4.35 constraint * added <=4.33.3 constraint * added <=4.33.3 constraint * switched to chinese prompt --- .../llm/example/GPU/HuggingFace/LLM/yi/README.md | 12 ++++++------ .../llm/example/GPU/HuggingFace/LLM/yi/generate.py | 2 +- .../GPU/PyTorch-Models/Model/codegeex2/README.md | 2 -- .../GPU/PyTorch-Models/Model/codellama/README.md | 4 ---- .../GPU/PyTorch-Models/Model/deciLM-7b/README.md | 2 -- .../GPU/PyTorch-Models/Model/mistral/README.md | 7 ------- .../GPU/PyTorch-Models/Model/replit/README.md | 4 +++- .../GPU/PyTorch-Models/Model/solar/README.md | 4 ---- .../example/GPU/PyTorch-Models/Model/yi/README.md | 14 ++++++++++++-- .../GPU/PyTorch-Models/Model/yi/generate.py | 2 +- 10 files changed, 23 insertions(+), 30 deletions(-) diff --git a/python/llm/example/GPU/HuggingFace/LLM/yi/README.md b/python/llm/example/GPU/HuggingFace/LLM/yi/README.md index 1fb49f21523..080e2676fdc 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/yi/README.md +++ b/python/llm/example/GPU/HuggingFace/LLM/yi/README.md @@ -122,18 +122,18 @@ In the example, several arguments can be passed to satisfy your requirements: ```log Inference time: xxxx s -------------------- Prompt -------------------- -What is AI? +AI是什么? -------------------- Output -------------------- -What is AI? -Artificial Intelligence (AI) is the simulation of human intelligence in machines. AI is the science and engineering of making intelligent machines, especially intelligent computer programs. +AI是什么? +人工智能(Artificial Intelligence),英文缩写为AI。它是研究、开发用于模拟、延伸和扩展人的智能的理论、方法、技术及 ``` #### [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) ```log Inference time: xxxx s -------------------- Prompt -------------------- -What is AI? +AI是什么? -------------------- Output -------------------- -What is AI? -Artificial Intelligence (AI) refers to the simulation of human intelligence processes by machines, especially computer systems. These processes include learning, reasoning, and self- +AI是什么? 
+人工智能(Artificial Intelligence, AI)是计算机科学的一个分支,它研究如何让计算机模拟人类的智能行为。人工智能可以通过模仿人类的思维过程和 ``` \ No newline at end of file diff --git a/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py b/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py index f32f272c13a..643c5f7b34d 100644 --- a/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py +++ b/python/llm/example/GPU/HuggingFace/LLM/yi/generate.py @@ -27,7 +27,7 @@ parser.add_argument('--repo-id-or-model-path', type=str, default="01-ai/Yi-6B-Chat", help='The huggingface repo id for the Yi model to be downloaded' ', or the path to the huggingface checkpoint folder') - parser.add_argument('--prompt', type=str, default="What is AI?", + parser.add_argument('--prompt', type=str, default="AI是什么?", help='Prompt to infer') parser.add_argument('--n-predict', type=int, default=32, help='Max tokens to predict') diff --git a/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md b/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md index 37f801a28bf..bc8cfa62907 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/codegeex2/README.md @@ -16,7 +16,6 @@ conda create -n llm python=3.11 conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ -pip install transformers==4.31.0 ``` #### 1.2 Installation on Windows @@ -27,7 +26,6 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ -pip install transformers==4.31.0 ``` ### 2. Configures OneAPI environment variables for Linux diff --git a/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md b/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md index 497a6828b24..ff68817eca4 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/codellama/README.md @@ -14,8 +14,6 @@ conda create -n llm python=3.11 conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ - -pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers ``` #### 1.2 Installation on Windows @@ -26,8 +24,6 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ - -pip install transformers==4.34.1 # CodeLlamaTokenizer is supported in higher version of transformers ``` ### 2. 
Configures OneAPI environment variables for Linux diff --git a/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md b/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md index ff8eab5ae09..a9e66f54732 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/deciLM-7b/README.md @@ -14,8 +14,6 @@ conda create -n llm python=3.11 conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ - -pip install transformers==4.35.2 # required by DeciLM-7B ``` #### 1.2 Installation on Windows diff --git a/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md b/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md index 4fc017e1ba7..4f3e58b045c 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/mistral/README.md @@ -4,7 +4,6 @@ In this directory, you will find examples on how you could use IPEX-LLM `optimiz ## Requirements To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. -**Important: According to [Mistral Troubleshooting](https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting), please make sure you have installed `transformers==4.34.0` to run the example.** ## Example: Predict Tokens using `generate()` API In the example [generate.py](./generate.py), we show a basic use case for a Mistral model to predict the next N tokens using `generate()` API, with IPEX-LLM INT4 optimizations on Intel GPUs. @@ -16,9 +15,6 @@ conda create -n llm python=3.11 conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ - -# Refer to https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting, please make sure you are using a stable version of Transformers, 4.34.0 or newer. -pip install transformers==4.34.0 ``` #### 1.2 Installation on Windows @@ -29,9 +25,6 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ - -# Refer to https://huggingface.co/mistralai/Mistral-7B-v0.1#troubleshooting, please make sure you are using a stable version of Transformers, 4.34.0 or newer. -pip install transformers==4.34.0 ``` ### 2. 
Configures OneAPI environment variables for Linux diff --git a/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md b/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md index 4938682aea2..3bfbf245655 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/replit/README.md @@ -15,7 +15,7 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ -pip install "transformers<4.35" +pip install transformers<=4.33.3 ``` #### 1.2 Installation on Windows @@ -26,6 +26,8 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ + +pip install transformers<=4.33.3 ``` ### 2. Configures OneAPI environment variables for Linux diff --git a/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md b/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md index 2b718cd4a6a..4d157d19bf3 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/solar/README.md @@ -14,8 +14,6 @@ conda create -n llm python=3.11 conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ - -pip install transformers==4.35.2 # required by SOLAR ``` #### 1.2 Installation on Windows @@ -26,8 +24,6 @@ conda activate llm # below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ - -pip install transformers==4.35.2 # required by SOLAR ``` ### 2. Configures OneAPI environment variables for Linux diff --git a/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md b/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md index b48b95325c3..2b500175575 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md +++ b/python/llm/example/GPU/PyTorch-Models/Model/yi/README.md @@ -1,5 +1,5 @@ # Yi -In this directory, you will find examples on how you could use IPEX-LLM `optimize_model` API on Yi models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B) as a reference Yi model. +In this directory, you will find examples on how you could use IPEX-LLM `optimize_model` API on Yi models on [Intel GPUs](../../../README.md). For illustration purposes, we utilize the [01-ai/Yi-6B](https://huggingface.co/01-ai/Yi-6B) and [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-1.5-6B-Chat) as reference Yi models. ## 0. Requirements To run these examples with IPEX-LLM on Intel GPUs, we have some recommended requirements for your machine, please refer to [here](../../../README.md#requirements) for more information. @@ -112,7 +112,7 @@ python ./generate.py In the example, several arguments can be passed to satisfy your requirements: -- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Yi model (e.g. `01-ai/Yi-6B`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'01-ai/Yi-6B'`. 
+- `--repo-id-or-model-path REPO_ID_OR_MODEL_PATH`: argument defining the huggingface repo id for the Yi model (e.g. `01-ai/Yi-6B` and `01-ai/Yi-6B-Chat`) to be downloaded, or the path to the huggingface checkpoint folder. It is default to be `'01-ai/Yi-6B-Chat'`. - `--prompt PROMPT`: argument defining the prompt to be infered (with integrated prompt format for chat). It is default to be `'AI是什么?'`. - `--n-predict N_PREDICT`: argument defining the max number of tokens to predict. It is default to be `32`. @@ -127,3 +127,13 @@ AI是什么? AI是什么? 人工智能(Artificial Intelligence),英文缩写为AI。它是研究、开发用于模拟、延伸和扩展人的智能的理论、方法、技术及 ``` + +#### [01-ai/Yi-6B-Chat](https://huggingface.co/01-ai/Yi-6B-Chat) +```log +Inference time: xxxx s +-------------------- Prompt -------------------- +AI是什么? +-------------------- Output -------------------- +AI是什么? +人工智能(Artificial Intelligence, AI)是计算机科学的一个分支,它研究如何让计算机模拟人类的智能行为。人工智能可以通过模仿人类的思维过程和 +``` \ No newline at end of file diff --git a/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py b/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py index 31256cda112..871f5f4fbd1 100644 --- a/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py +++ b/python/llm/example/GPU/PyTorch-Models/Model/yi/generate.py @@ -26,7 +26,7 @@ if __name__ == '__main__': parser = argparse.ArgumentParser(description='Predict Tokens using `generate()` API for Yi model') - parser.add_argument('--repo-id-or-model-path', type=str, default="01-ai/Yi-6B", + parser.add_argument('--repo-id-or-model-path', type=str, default="01-ai/Yi-6B-Chat", help='The huggingface repo id for the Yi model to be downloaded' ', or the path to the huggingface checkpoint folder') parser.add_argument('--prompt', type=str, default="AI是什么?", From 9a8eb10cd6f28c1b78d1cad549472f9b140d9e41 Mon Sep 17 00:00:00 2001 From: Yina Chen <33650826+cyita@users.noreply.github.com> Date: Tue, 20 Aug 2024 13:11:37 +0300 Subject: [PATCH 09/11] Update compresskv model forward type logic (#11868) * update * fix --- .../src/ipex_llm/transformers/models/llama.py | 18 ++++++++++----- .../ipex_llm/transformers/models/minicpm.py | 9 ++++---- .../src/ipex_llm/transformers/models/phi3.py | 11 +++++----- .../src/ipex_llm/transformers/models/qwen2.py | 22 +++++++++---------- 4 files changed, 33 insertions(+), 27 deletions(-) diff --git a/python/llm/src/ipex_llm/transformers/models/llama.py b/python/llm/src/ipex_llm/transformers/models/llama.py index 5e633da7406..2c9c17e7a58 100644 --- a/python/llm/src/ipex_llm/transformers/models/llama.py +++ b/python/llm/src/ipex_llm/transformers/models/llama.py @@ -128,7 +128,9 @@ def llama_model_forward_4_36( use_quantize = use_quantize_kv_cache( self.layers[0].mlp.up_proj, input, self.config.num_attention_heads//self.config.num_key_value_heads) - if should_use_compresskv(input, input.shape[1]): + use_compresskv = should_use_compresskv(input, input.shape[1]) or \ + isinstance(past_key_values, DynamicCompressCache) + if use_compresskv: if not isinstance(past_key_values, DynamicCompressCache): if use_quantize: past_key_values = DynamicCompressFp8Cache.from_legacy_cache( @@ -137,7 +139,7 @@ def llama_model_forward_4_36( past_key_values = DynamicCompressCache.from_legacy_cache( past_key_values) elif use_quantize: - if not isinstance(past_key_values, (DynamicFp8Cache, DynamicCompressCache)): + if not isinstance(past_key_values, DynamicFp8Cache): past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values) return llama_model_forward_4_36_internal( self=self, @@ -174,7 +176,9 @@ def 
llama_model_forward_4_38( use_quantize = use_quantize_kv_cache( self.layers[0].mlp.up_proj, input, self.config.num_attention_heads//self.config.num_key_value_heads) - if should_use_compresskv(input, input.shape[1]): + use_compresskv = should_use_compresskv(input, input.shape[1]) or \ + isinstance(past_key_values, DynamicCompressCache) + if use_compresskv: if not isinstance(past_key_values, DynamicCompressCache): if use_quantize: past_key_values = DynamicCompressFp8Cache.from_legacy_cache( @@ -183,7 +187,7 @@ def llama_model_forward_4_38( past_key_values = DynamicCompressCache.from_legacy_cache( past_key_values) elif use_quantize: - if not isinstance(past_key_values, (DynamicFp8Cache, DynamicCompressCache)): + if not isinstance(past_key_values, DynamicFp8Cache): past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values) return llama_model_forward_4_38_internal( self=self, @@ -221,7 +225,9 @@ def llama_model_forward_4_41( use_quantize = use_quantize_kv_cache( self.layers[0].mlp.up_proj, input, self.config.num_attention_heads//self.config.num_key_value_heads) - if should_use_compresskv(input, input.shape[1]): + use_compresskv = should_use_compresskv(input, input.shape[1]) or \ + isinstance(past_key_values, DynamicCompressCache) + if use_compresskv: if not isinstance(past_key_values, DynamicCompressCache): if use_quantize: past_key_values = DynamicCompressFp8Cache.from_legacy_cache( @@ -230,7 +236,7 @@ def llama_model_forward_4_41( past_key_values = DynamicCompressCache.from_legacy_cache( past_key_values) elif use_quantize: - if not isinstance(past_key_values, (DynamicFp8Cache, DynamicCompressCache)): + if not isinstance(past_key_values, DynamicFp8Cache): past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values) return llama_model_forward_4_41_internal( self=self, diff --git a/python/llm/src/ipex_llm/transformers/models/minicpm.py b/python/llm/src/ipex_llm/transformers/models/minicpm.py index afbcde6c657..d248c507773 100644 --- a/python/llm/src/ipex_llm/transformers/models/minicpm.py +++ b/python/llm/src/ipex_llm/transformers/models/minicpm.py @@ -182,7 +182,8 @@ def minicpm_model_forward( use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.up_proj, inputs, self.config.num_attention_heads // self.config.num_key_value_heads) - use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) + use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) or \ + isinstance(past_key_values, DynamicCompressCache) use_cache = use_cache if use_cache is not None else self.config.use_cache if use_cache: @@ -192,11 +193,11 @@ def minicpm_model_forward( past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values) else: past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values) - elif use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache, - DynamicCompressCache)): + elif (use_quantize_kv and not use_compress_kv + and not isinstance(past_key_values, DynamicFp8Cache)): past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values) elif (not use_quantize_kv and not use_compress_kv - and not isinstance(past_key_values, (DynamicNormalCache, DynamicCompressCache))): + and not isinstance(past_key_values, DynamicNormalCache)): past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values) # ipex-llm changes end return origin_forward( diff --git a/python/llm/src/ipex_llm/transformers/models/phi3.py b/python/llm/src/ipex_llm/transformers/models/phi3.py index 823fb10391a..bfa380c2f51 100644 --- 
a/python/llm/src/ipex_llm/transformers/models/phi3.py +++ b/python/llm/src/ipex_llm/transformers/models/phi3.py @@ -256,7 +256,8 @@ def model_forward( use_cache = use_cache if use_cache is not None else self.config.use_cache inputs = input_ids if input_ids is not None else inputs_embeds use_quantize_kv = use_quantize_kv_cache(self.layers[0].mlp.down_proj, inputs) - use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) + use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) or \ + isinstance(past_key_values, DynamicCompressCache) if use_cache: if use_compress_kv and not isinstance(past_key_values, DynamicCompressCache): @@ -264,13 +265,11 @@ def model_forward( past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values) else: past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values) - if use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache, - DynamicCompressCache)): + if use_quantize_kv and not use_compress_kv and not isinstance(past_key_values, + DynamicFp8Cache): past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values) if not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values, - (DynamicNormalCache, - DynamicCompressCache - )): + DynamicNormalCache): past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values) if past_key_values.get_seq_length() == 0: n_layer = self.config.num_hidden_layers diff --git a/python/llm/src/ipex_llm/transformers/models/qwen2.py b/python/llm/src/ipex_llm/transformers/models/qwen2.py index c01488a6fb6..802c5e7ec45 100644 --- a/python/llm/src/ipex_llm/transformers/models/qwen2.py +++ b/python/llm/src/ipex_llm/transformers/models/qwen2.py @@ -120,7 +120,8 @@ def qwen2_model_forward( and use_quantize_kv_cache(self.layers[0].mlp.up_proj, inputs, self.config.num_attention_heads//self.config.num_key_value_heads) ) - use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) + use_compress_kv = should_use_compresskv(inputs, inputs.shape[1]) or \ + isinstance(past_key_values, DynamicCompressCache) if use_cache: if use_compress_kv and not isinstance(past_key_values, DynamicCompressCache): @@ -128,12 +129,11 @@ def qwen2_model_forward( past_key_values = DynamicCompressFp8Cache.from_legacy_cache(past_key_values) else: past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values) - elif use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache, - DynamicCompressCache)): + elif use_quantize_kv and not use_compress_kv and not isinstance(past_key_values, + DynamicFp8Cache): past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values) if not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values, - (DynamicNormalCache, - DynamicCompressCache)): + DynamicNormalCache): past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values) past_key_values_length = past_key_values.get_usable_length(seq_length) # ipex-llm changes end @@ -316,7 +316,8 @@ def qwen2_model_forward_4_42( and use_quantize_kv_cache(self.layers[0].mlp.up_proj, inputs_embeds, self.config.num_attention_heads//self.config.num_key_value_heads) ) - use_compress_kv = should_use_compresskv(inputs_embeds, inputs_embeds.shape[1]) + use_compress_kv = should_use_compresskv(inputs_embeds, inputs_embeds.shape[1]) or \ + isinstance(past_key_values, DynamicCompressCache) if use_cache: if use_compress_kv and not isinstance(past_key_values, DynamicCompressCache): @@ -324,12 +325,11 @@ def qwen2_model_forward_4_42( past_key_values = 
DynamicCompressFp8Cache.from_legacy_cache(past_key_values) else: past_key_values = DynamicCompressCache.from_legacy_cache(past_key_values) - elif use_quantize_kv and not isinstance(past_key_values, (DynamicFp8Cache, - DynamicCompressCache)): + elif use_quantize_kv and not use_compress_kv and not isinstance(past_key_values, + DynamicFp8Cache): past_key_values = DynamicFp8Cache.from_legacy_cache(past_key_values) - elif not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values, - (DynamicNormalCache, - DynamicCompressCache)): + if not use_quantize_kv and not use_compress_kv and not isinstance(past_key_values, + DynamicNormalCache): past_key_values = DynamicNormalCache.from_legacy_cache(past_key_values) # ipex-llm changes end From f573959b2a6e4854379905777afaac58aedb908e Mon Sep 17 00:00:00 2001 From: RyuKosei <70006706+RyuKosei@users.noreply.github.com> Date: Tue, 20 Aug 2024 18:50:00 +0800 Subject: [PATCH 10/11] Update local import for ppl (#11866) Co-authored-by: jenniew --- python/llm/dev/benchmark/perplexity/run_wikitext.py | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/python/llm/dev/benchmark/perplexity/run_wikitext.py b/python/llm/dev/benchmark/perplexity/run_wikitext.py index 50991558f35..92426a86fb1 100644 --- a/python/llm/dev/benchmark/perplexity/run_wikitext.py +++ b/python/llm/dev/benchmark/perplexity/run_wikitext.py @@ -21,7 +21,6 @@ import torch from tqdm import tqdm from datasets import load_dataset -from ipex_llm.utils.common import invalidInputError parser = argparse.ArgumentParser() @@ -63,6 +62,7 @@ def parse_kwargs(kwstr): data = f.read() encodings = tokenizer(data.decode("utf-8").strip("\n"), return_tensors="pt") else: + from ipex_llm.utils.common import invalidInputError raise invalidInputError(False, "Must specify either dataset or datapath.") if not args.max_length: From 52728feb7738ebc29c7cc2afb73fd4d49c30a664 Mon Sep 17 00:00:00 2001 From: cranechu <1340390339@qq.com> Date: Tue, 20 Aug 2024 19:18:49 +0800 Subject: [PATCH 11/11] fix: textual adjustment --- python/llm/dev/benchmark/perplexity/README.md | 6 ++---- 1 file changed, 2 insertions(+), 4 deletions(-) diff --git a/python/llm/dev/benchmark/perplexity/README.md b/python/llm/dev/benchmark/perplexity/README.md index 3d824ac570f..410358eed34 100644 --- a/python/llm/dev/benchmark/perplexity/README.md +++ b/python/llm/dev/benchmark/perplexity/README.md @@ -2,9 +2,7 @@ Perplexity (PPL) is one of the most common metrics for evaluating language models. This benchmark implementation is adapted from [transformers/perplexity](https://huggingface.co/docs/transformers/perplexity#perplexity-of-fixed-length-models) and [benchmark_patch_llm.py](https://github.com/insuhan/hyper-attn/blob/main/benchmark_patch_llm.py) ## Environment Preparation -Install ipex-llm and dataset. ```bash -# below command will install intel_extension_for_pytorch==2.1.10+xpu as default pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/ pip install datasets ``` @@ -14,9 +12,9 @@ This is a required step on Linux for APT or offline installed oneAPI. Skip this source /opt/intel/oneapi/setvars.sh ``` -## Running PPL Evaluation +## PPL Evaluation ### 1. 
Run on Wikitext -An example to run perplexity on wikitext: +An example to run perplexity on [wikitext](https://paperswithcode.com/dataset/wikitext-2): ```bash python run_wikitext.py --model_path meta-llama/Meta-Llama-3-8B --dataset path=wikitext,name=wikitext-2-raw-v1 --precision sym_int4 --device xpu --stride 512 --max_length 4096 ```
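
The `--stride` and `--max_length` options in the Wikitext command above control a fixed-length sliding-window evaluation: the text is scored in overlapping windows of `max_length` tokens, advancing `stride` tokens at a time, and only the non-overlapping tokens of each window contribute to the loss. A minimal sketch of that loop is given below; it assumes a Hugging Face causal LM and tokenizer `encodings` are already loaded, and the function name, defaults, and device handling are illustrative rather than the exact code in `run_wikitext.py`.

```python
# Minimal sketch of a strided (sliding-window) perplexity loop, in the spirit of
# what the --stride/--max_length options control. Function name, defaults, and
# structure are illustrative assumptions, not the benchmark's API.
import torch
from tqdm import tqdm

def strided_perplexity(model, encodings, max_length=4096, stride=512, device="xpu"):
    # encodings: output of tokenizer(text, return_tensors="pt")
    seq_len = encodings.input_ids.size(1)
    nlls = []
    prev_end = 0
    for begin in tqdm(range(0, seq_len, stride)):
        end = min(begin + max_length, seq_len)
        trg_len = end - prev_end  # number of new tokens scored in this window
        input_ids = encodings.input_ids[:, begin:end].to(device)
        target_ids = input_ids.clone()
        target_ids[:, :-trg_len] = -100  # ignore the overlapping context tokens
        with torch.no_grad():
            # Hugging Face causal LMs return mean cross-entropy over unmasked labels
            nlls.append(model(input_ids, labels=target_ids).loss)
        prev_end = end
        if end == seq_len:
            break
    return torch.exp(torch.stack(nlls).mean())
```

Exponentiating the mean per-window loss yields the reported perplexity; a smaller `stride` gives each scored token more left context at the cost of more forward passes.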