
Run neural-chat 7b inference with Deepspeed on Flex 140. #10507

Open · Vasud-ha opened this issue Mar 22, 2024 · 6 comments

@Vasud-ha

The Intel GPU Flex 140 has two GPUs per card, with a total memory capacity of 12 GB (6 GB per GPU). Currently, I can run inference only on one GPU device, with limited memory. Could you please guide me on how to run the model inference across both GPUs using DeepSpeed with neural-chat, as done in these samples? https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/Deepspeed-AutoTP

@plusbang (Contributor) commented Mar 25, 2024

Hi, to run neural-chat 7b inference using DeepSpeed AutoTP and our low-bit optimization, you could follow these steps:

  1. Prepare your environment following the installation steps. For the neural-chat-7b model specifically, you additionally need to run pip install transformers==4.34.0. (A rough sketch of the environment setup is shown after the launch script below.)

  2. Currently, you need to modify https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/Deepspeed-AutoTP/deepspeed_autotp.py#L85 to model = optimize_model(model.module.to('cpu'), low_bit=low_bit, optimize_llm=False).to(torch.float16)
     Important: PR #10527 (LLM: fix mistral hidden_size setting for deepspeed autotp) adds support for the default optimize_llm=True case. If you use a later version that includes this fix, you can skip step 2.

  3. Then use the following script to run on two GPUs:

export MASTER_ADDR=127.0.0.1
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets

export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so
basekit_root=/opt/intel/oneapi
source $basekit_root/setvars.sh --force
source $basekit_root/ccl/latest/env/vars.sh --force

NUM_GPUS=2 # number of GPUs to use
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0 # different from the PVC (Data Center GPU Max) setting

mpirun -np $NUM_GPUS --prepend-rank \
    python deepspeed_autotp.py --repo-id-or-model-path 'Intel/neural-chat-7b-v3' --low-bit 'sym_int4'
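
For reference, here is a rough sketch of the environment preparation from step 1, assuming a conda-based setup; the exact ipex-llm[xpu] install command, the extra index URL, and the pinned versions of DeepSpeed, intel-extension-for-deepspeed, oneccl_bind_pt and mpi4py should be taken from the installation guide linked in step 1:

# Sketch only -- follow the official installation guide for the authoritative
# commands; the index URL below is an assumption based on the usual XPU wheel index.
conda create -n llm python=3.9 -y
conda activate llm

# ipex-llm with Intel GPU (XPU) support
pip install --pre --upgrade "ipex-llm[xpu]" \
    --extra-index-url https://pytorch-extension.intel.com/release-whl/stable/xpu/us/

# the Deepspeed-AutoTP example additionally needs deepspeed,
# intel-extension-for-deepspeed, oneccl_bind_pt and mpi4py (versions as pinned in its README)

# neural-chat-7b specifically needs this transformers version (see step 1)
pip install transformers==4.34.0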

Please give it a try and feel free to let me know if you have any other questions.

@Vasud-ha (Author)

Hi @plusbang, I can see 3 GPUs on my system (the two Flex 140 devices with 6 GB of memory each, and one Flex 170) [screenshot]. However, while running neural-chat 7b with DeepSpeed I get an out-of-resource error [screenshot], even though GPU memory utilization is only about 50% on the devices [screenshots].

@plusbang (Contributor)

Devices 0 and 1 are used by default in our script. Please refer to the example's documentation for more details about how to select devices (a sketch is shown below).

In my experiment on two Arc A770s, ~3 GB is used per GPU when running neural-chat-7B with sym_int4 and the default input prompt from the example. Your error message shows that python=3.10 is used; we recommend creating a python=3.9 environment following our steps and additionally running pip install transformers==4.34.0 for this model.
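
For illustration, a minimal sketch of pinning the run to a specific pair of devices via the Level Zero affinity mask; the indices 0,1 below are an assumption for the two Flex 140 devices and should be confirmed (e.g. with sycl-ls) on your machine:

# Assumption: Level Zero devices 0 and 1 are the two Flex 140 devices (confirm with `sycl-ls`);
# setting the affinity mask hides the Flex 170 so only the Flex 140 devices are used.
export ZE_AFFINITY_MASK=0,1

NUM_GPUS=2
mpirun -np $NUM_GPUS --prepend-rank \
    python deepspeed_autotp.py --repo-id-or-model-path 'Intel/neural-chat-7b-v3' --low-bit 'sym_int4'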

@Vasud-ha (Author)

Hi @plusbang, we can now successfully run neural-chat inference with DeepSpeed on the Flex 140. Thanks for your support. However, the customer is also interested in the deployment performance under concurrent usage. Could you please guide us on how to test handling multiple requests on the same instance with DeepSpeed on Flex 140?

@glorysdj (Contributor)

We plan to add a deepspeed+ipex-llm inference backend to FastChat serving and will keep you updated once it's supported. Thanks.
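
Once that FastChat backend is available, one simple way to exercise concurrent requests would be something like the sketch below; the endpoint, port, and model name are hypothetical and assume FastChat's OpenAI-compatible API server is running in front of the DeepSpeed workers:

# Hypothetical load sketch: fire N concurrent completion requests at an
# OpenAI-compatible endpoint (assumed: localhost:8000, model 'neural-chat-7b-v3').
N=8
for i in $(seq 1 $N); do
  curl -s http://localhost:8000/v1/completions \
    -H "Content-Type: application/json" \
    -d '{"model": "neural-chat-7b-v3", "prompt": "What is AI?", "max_tokens": 32}' &
done
wait   # wait for all background requests to finish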
