EAGLE - Speculative Sampling using IPEX-LLM on Intel GPUs

This directory contains examples showing how IPEX-LLM accelerates inference on Intel GPUs with EAGLE (Extrapolation Algorithm for Greater Language-model Efficiency), a speculative sampling method that improves text generation speed. See the EAGLE paper for details on the method and the EAGLE repository (https://github.com/SafeAILab/EAGLE) for more information on the code.

Requirements

To apply Intel GPU acceleration, several tools must be installed and the environment prepared. See the GPU installation guide for more details.

Step 1: Only Linux is supported for now; Ubuntu 22.04 is preferred.

Step 2: Install the GPU driver. Please refer to our driver installation guide for general-purpose GPU capabilities.

Note: IPEX 2.1.10+xpu requires Intel GPU Driver version >= stable_775_20_20231219.

Step 3: Download and install the Intel® oneAPI Base Toolkit. oneMKL and the DPC++ compiler are required; the other components are optional.

Note: IPEX 2.1.10+xpu requires Intel® oneAPI Base Toolkit version 2024.0.

Verified Hardware Platforms

  • Intel Data Center GPU Max Series
  • Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series

Example - EAGLE-2 Speculative Sampling with IPEX-LLM on MT-bench

In this example, we run inference for a Llama2 model on MT-bench data to showcase the speed of EAGLE with IPEX-LLM on Intel GPUs. We use EAGLE-2, which performs better than EAGLE-1.

1. Install

1.1 Installation on Linux

We suggest using conda to manage the environment:

conda create -n llm python=3.11
conda activate llm
# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt
pip install -e .

1.2 Installation on Windows

We suggest using conda to manage the environment:

conda create -n llm python=3.11 libuv
conda activate llm
# The command below uses pip to install the Intel oneAPI Base Toolkit 2024.0 runtime packages
pip install dpcpp-cpp-rt==2024.0.2 mkl-dpcpp==2024.0.0 onednn==2024.0.0

# The command below installs intel_extension_for_pytorch==2.1.10+xpu by default
pip install --pre --upgrade ipex-llm[xpu] --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/
git clone https://github.com/SafeAILab/EAGLE.git
cd EAGLE
pip install -r requirements.txt
pip install -e .
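
After either installation path, it can help to confirm which versions actually ended up in the conda environment before moving on. The following is a minimal sketch using only the Python standard library; the distribution names in PACKAGES are assumptions inferred from the pip commands above, so adjust them if your environment names the packages differently.

```python
from importlib import metadata
from typing import Dict, Iterable, Optional

# Assumed distribution names, inferred from the pip commands in this README.
PACKAGES = ("ipex-llm", "intel-extension-for-pytorch", "torch")


def installed_version(dist_name: str) -> Optional[str]:
    """Return the installed version of a distribution, or None if it is missing."""
    try:
        return metadata.version(dist_name)
    except metadata.PackageNotFoundError:
        return None


def report(packages: Iterable[str] = PACKAGES) -> Dict[str, Optional[str]]:
    """Map each distribution name to its installed version (None if absent)."""
    return {name: installed_version(name) for name in packages}


if __name__ == "__main__":
    for name, version in report().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

If intel-extension-for-pytorch reports something other than 2.1.10+xpu, the driver and oneAPI version notes in the Requirements section may not apply as written.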

2. Configure oneAPI environment variables for Linux

Note: Skip this step if you are running on Windows.

This step is required on Linux when oneAPI was installed via APT or the offline installer. Skip it if oneAPI was installed with pip.

source /opt/intel/oneapi/setvars.sh

3. Runtime Configurations

For optimal performance, it is recommended to set several environment variables. Please check out the suggestions based on your device.

3.1 Configurations for Linux

For Intel Arc™ A-Series Graphics and Intel Data Center GPU Flex Series:

export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1

For Intel Data Center GPU Max Series:

export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=1
export SYCL_CACHE_PERSISTENT=1
export ENABLE_SDP_FUSION=1

Note: libtcmalloc.so can be installed with conda install -c conda-forge -y gperftools=2.10.
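
If you launch the benchmark from a Python wrapper rather than an interactive shell, the per-device settings above can be kept in one place. The sketch below is only an illustration: the device keys "arc_flex" and "max" and the helper names are hypothetical, and the variable names and values are copied from the export lines above. LD_PRELOAD is deliberately left out, because it has to be set in the shell before the Python process starts.

```python
import os
from typing import Dict, MutableMapping

# Settings shared by both device families (values copied from the export lines).
COMMON = {
    "SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS": "1",
    "SYCL_CACHE_PERSISTENT": "1",
}

# Hypothetical device keys; the values come from the export lines above.
DEVICE_SPECIFIC = {
    "arc_flex": {"USE_XETLA": "OFF"},    # Arc A-Series and Data Center GPU Flex
    "max": {"ENABLE_SDP_FUSION": "1"},   # Data Center GPU Max
}


def runtime_env(device: str) -> Dict[str, str]:
    """Return the recommended environment variables for a device family."""
    if device not in DEVICE_SPECIFIC:
        raise ValueError(f"unknown device family: {device!r}")
    return {**DEVICE_SPECIFIC[device], **COMMON}


def apply_runtime_env(
    device: str, environ: MutableMapping[str, str] = os.environ
) -> Dict[str, str]:
    """Write the settings into `environ` and return what was applied."""
    settings = runtime_env(device)
    environ.update(settings)
    return settings
```

A reasonable precaution is to call apply_runtime_env before importing torch or ipex_llm, since settings like these are usually read when the GPU runtime initializes.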

4. Running Example

You can test the speed of EAGLE speculative sampling with ipex-llm on MT-bench using the following command.

python -m evaluation.gen_ea_answer_llama2chat_e2_ipex_optimize \
                 --ea-model-path [path of EAGLE weight] \
                 --base-model-path [path of the original model] \
                 --enable-ipex-llm

Please refer to the EAGLE repository (https://github.com/SafeAILab/EAGLE) for the complete list of available EAGLE weights.
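
To run the same benchmark over several weight and model pairs, the command can be assembled programmatically. The sketch below only rebuilds the argument list shown above; build_eval_command is a hypothetical helper, and the module name and flags are taken verbatim from the command in this section.

```python
import sys
from typing import List

# Module name copied from the command in this README.
EVAL_MODULE = "evaluation.gen_ea_answer_llama2chat_e2_ipex_optimize"


def build_eval_command(
    ea_model_path: str,
    base_model_path: str,
    enable_ipex_llm: bool = True,
) -> List[str]:
    """Build the argv list for the MT-bench generation run."""
    cmd = [
        sys.executable, "-m", EVAL_MODULE,
        "--ea-model-path", ea_model_path,
        "--base-model-path", base_model_path,
    ]
    if enable_ipex_llm:
        cmd.append("--enable-ipex-llm")
    return cmd
```

The resulting list can be passed to subprocess.run(cmd, cwd=<path to the EAGLE checkout>, check=True). Running it from the EAGLE checkout is an assumption based on the python -m invocation above.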

The above command will generate a .jsonl file that records the generation results and wall time. Then, you can use evaluation/speed.py to calculate the speed.

python -m evaluation.speed \
                 --base-model-path [path of the original model] \
                 --jsonl-file [pathname of the .jsonl file]
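
For a quick sanity check of the .jsonl file without the full evaluation script, a rough tokens-per-second figure can be computed directly. This is a sketch under stated assumptions, not a reimplementation of evaluation/speed.py: it assumes each record holds a "choices" list whose first entry has per-turn "new_tokens" and "wall_time" lists. Inspect a line of your own file to confirm the field names before relying on it.

```python
import json
from typing import Iterable, List


def parse_jsonl(lines: Iterable[str]) -> List[dict]:
    """Parse an iterable of JSON Lines strings, skipping blank lines."""
    return [json.loads(line) for line in lines if line.strip()]


def tokens_per_second(records: Iterable[dict]) -> float:
    """Total generated tokens divided by total wall time, in tokens/second.

    Assumed (unverified) schema per record:
        {"choices": [{"new_tokens": [int, ...], "wall_time": [float, ...]}]}
    """
    total_tokens = 0
    total_time = 0.0
    for record in records:
        choice = record["choices"][0]
        total_tokens += sum(choice["new_tokens"])
        total_time += sum(choice["wall_time"])
    if total_time <= 0:
        raise ValueError("no wall time recorded")
    return total_tokens / total_time
```

Typical use is `with open(path, encoding="utf-8") as f: print(tokens_per_second(parse_jsonl(f)))`. The number it prints is an aggregate throughput and may differ from what evaluation/speed.py reports if that script averages or normalizes differently.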