
Run Tensor-Parallel IPEX-LLM Transformers INT4 Inference with Deepspeed

1. Install Dependencies

Install the necessary packages (Python 3.11 is our test environment):

bash install.sh

The first step in the script installs oneCCL (a wrapper for Intel MPI) to enable distributed communication between deepspeed instances; this step can be skipped if Intel MPI/oneCCL/oneAPI has already been prepared on your machine. Please refer to the oneCCL documentation if you encounter any issues during installation or import.
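If the installation succeeds, the following quick check should import without errors (a minimal sketch, not part of the original scripts; oneccl_bindings_for_pytorch is assumed to be the module name provided by the torch-ccl package):

# Quick sanity check that the distributed stack is importable
import torch
import deepspeed
import oneccl_bindings_for_pytorch  # oneCCL bindings for PyTorch (torch-ccl); assumed module name

print("torch:", torch.__version__)
print("deepspeed:", deepspeed.__version__)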

2. Initialize Deepspeed Distributed Context

As shown in the example code deepspeed_autotp.py, you can construct the parallel model with the Python API:

# Load in HuggingFace Transformers' model
from transformers import AutoModelForCausalLM

model = AutoModelForCausalLM.from_pretrained(...)


# Parallelize the model with deepspeed
import torch
import deepspeed

model = deepspeed.init_inference(
    model,                # an AutoModel of Transformers
    mp_size=world_size,   # instance (process) count
    dtype=torch.float16,
    replace_method="auto")

The returned model is then wrapped into a deepspeed InferenceEngine type.
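Note that world_size and local_rank are not defined by the snippet itself; they are normally provided by the launcher's environment. The following is only a minimal sketch of how they are commonly obtained, assuming the launcher exports the standard WORLD_SIZE/LOCAL_RANK variables and that the oneCCL ('ccl') backend is available for communication:

import os
import deepspeed

# Ranks and sizes are exported by the distributed launcher (e.g. mpirun);
# the defaults below fall back to a single-process run.
local_rank = int(os.getenv("LOCAL_RANK", "0"))
world_size = int(os.getenv("WORLD_SIZE", "1"))

# Initialize the distributed context with the oneCCL backend before init_inference
deepspeed.init_distributed(dist_backend="ccl")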

3. Optimize Model with IPEX-LLM Low Bit

A distributed model managed by deepspeed can be further optimized with the IPEX-LLM low-bit Python API, e.g. sym_int4:

# Apply IPEX-LLM INT4 optimizations on transformers
from ipex_llm import optimize_model

# local_rank is this process's index on the node (e.g. from the LOCAL_RANK environment variable)
model = optimize_model(model.module.to('cpu'), low_bit='sym_int4')
model = model.to(f'cpu:{local_rank}')  # move the partial model to its local rank

An IPEX-LLM optimized transformers model is then returned, which can serve inference in parallel through the native APIs.
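After this optimization, inference is used just like with a regular transformers model, with each process computing its own shard. Below is only a minimal sketch, assuming model_path points to the same checkpoint loaded earlier and local_rank comes from the launcher environment:

from transformers import AutoTokenizer
import torch

tokenizer = AutoTokenizer.from_pretrained(model_path)  # model_path: the checkpoint used in from_pretrained above
input_ids = tokenizer("Once upon a time", return_tensors="pt").input_ids

with torch.inference_mode():
    output = model.generate(input_ids, max_new_tokens=32)

# Print from a single rank only to avoid duplicated output across processes
if local_rank == 0:
    print(tokenizer.decode(output[0], skip_special_tokens=True))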

4. Start Python Code

You can try deepspeed with IPEX-LLM by running:

bash run.sh

If you want to run your own application, the script contains the necessary configurations, which can also be ported to run your custom deepspeed application:

# run.sh
source ipex-llm-init
unset OMP_NUM_THREADS # deepspeed will set it for each instance automatically
source /opt/intel/oneccl/env/setvars.sh
......
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_PROCESS_LAUNCHER=none

Please set the above configurations before running deepspeed to ensure correct parallel communication and high performance.