
Run neural-chat 7b inference with Deepspeed on Flex 140. #10507

Open · Vasud-ha opened this issue Mar 22, 2024 · 6 comments

@Vasud-ha

The Intel GPU Flex 140 has two GPUs per card, with a total memory capacity of 12 GB (6 GB per GPU). Currently, I can run inference only on one GPU device, with limited memory. Could you please guide me on how to run the model inference across both GPUs using DeepSpeed with neural-chat, as done in these samples? https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/Deepspeed-AutoTP

@plusbang (Contributor) commented Mar 25, 2024

Hi, to run neural-chat 7b inference using DeepSpeed AutoTP and our low-bit optimization, you could follow these steps:

  1. Prepare your environment following the installation steps. For the neural-chat-7b model specifically, you additionally need to run pip install transformers==4.34.0. (A rough sketch of the environment setup is shown after the launch script below.)

  2. Currently, you need to modify https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py#L85 to model = optimize_model(model.module.to('cpu'), low_bit=low_bit, optimize_llm=False).to(torch.float16)
     Important: PR #10527 (LLM: fix mistral hidden_size setting for deepspeed autotp) adds support for the default optimize_llm=True case. If you use a later version that includes this fix, you can skip step 2.

  3. Then use the following script to run on two GPUs:

export MASTER_ADDR=127.0.0.1
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets

export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
basekit_root=/opt/intel/oneapi
source $basekit_root/setvars.sh --force
source $basekit_root/ccl/latest/env/vars.sh --force

NUM_GPUS=2 # number of GPUs to use
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0 # different from the PVC (Data Center GPU Max) setting

mpirun -np $NUM_GPUS --prepend-rank \
    python deepspeed_autotp.py --repo-id-or-model-path 'Intel/neural-chat-7b-v3' --low-bit 'sym_int4'
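
For reference, here is a rough sketch of the environment preparation from step 1, assuming a conda-based setup; the exact ipex-llm[xpu] install command, the extra index URL, and the pinned versions of DeepSpeed, intel-extension-for-deepspeed, oneccl_bind_pt and mpi4py should be taken from the installation guide linked in step 1:

# Sketch only -- follow the official installation guide for the authoritative
# commands; the index URL below is an assumption based on the usual XPU wheel index.
conda create -n llm python=3.9 -y
conda activate llm

# ipex-llm with Intel GPU (XPU) support
pip install --pre --upgrade "ipex-llm[xpu]" \
    --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# the Deepspeed-AutoTP example additionally needs deepspeed,
# intel-extension-for-deepspeed, oneccl_bind_pt and mpi4py (versions as pinned in its README)

# neural-chat-7b specifically needs this transformers version (see step 1)
pip install transformers==4.34.0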

Please give it a try and feel free to let me know if you have any other questions.

@Vasud-ha (Author)

Hi @plusbang, I can see 3 GPUs on my system (the two Flex 140 devices with 6 GB of memory each, and one Flex 170) [screenshot]. However, while running neural-chat 7b with DeepSpeed I get an out-of-resource error [screenshot], even though GPU memory utilization is only about 50% on the devices [screenshots].

@plusbang (Contributor)

Devices 0 and 1 are used by default in our script. Please refer to the example's documentation for more details about how to select devices (a sketch is shown below).

In my experiment on two Arc A770s, ~3 GB is used per GPU when running neural-chat-7B with sym_int4 and the default input prompt from the example. Your error message shows that python=3.10 is used; we recommend creating a python=3.9 environment following our steps and additionally running pip install transformers==4.34.0 for this model.
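
For illustration, a minimal sketch of pinning the run to a specific pair of devices via the Level Zero affinity mask; the indices 0,1 below are an assumption for the two Flex 140 devices and should be confirmed (e.g. with sycl-ls) on your machine:

# Assumption: Level Zero devices 0 and 1 are the two Flex 140 devices (confirm with `sycl-ls`);
# setting the affinity mask hides the Flex 170 so only the Flex 140 devices are used.
export ZE_AFFINITY_MASK=0,1

NUM_GPUS=2
mpirun -np $NUM_GPUS --prepend-rank \
    python deepspeed_autotp.py --repo-id-or-model-path 'Intel/neural-chat-7b-v3' --low-bit 'sym_int4'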

@Vasud-ha (Author)

Hi @plusbang, we can now successfully run neural-chat inference with DeepSpeed on the Flex 140. Thanks for your support. However, the customer is also interested in the deployment performance under concurrent usage. Could you please guide us on how to test handling multiple requests on the same instance with DeepSpeed on Flex 140?

@glorysdj (Contributor)

We plan to add a deepspeed+ipex-llm inference backend to FastChat serving and will keep you updated once it's supported. Thanks.
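
Once that FastChat backend is available, one simple way to exercise concurrent requests would be something like the sketch below; the endpoint, port, and model name are hypothetical and assume FastChat's OpenAI-compatible API server is running in front of the DeepSpeed workers:

# Hypothetical load sketch: fire N concurrent completion requests at an
# OpenAI-compatible endpoint (assumed: localhost:8000, model 'neural-chat-7b-v3').
N=8
for i in $(seq 1 $N); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "neural-chat-7b-v3", "prompt": "What is AI?", "max_tokens": 32}' &
done
wait   # wait for all background requests to finish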
