Run neural-chat 7b inference with Deepspeed on Flex 140. #10507
Hi, to run neural-chat 7b inference using DeepSpeed AutoTP and our low-bit optimization, you could follow these steps:
```bash
export MASTER_ADDR=127.0.0.1
export FI_PROVIDER=tcp
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export LD_PRELOAD=${LD_PRELOAD}:${CONDA_PREFIX}/lib/libtcmalloc.so

basekit_root=/opt/intel/oneapi
source $basekit_root/setvars.sh --force
source $basekit_root/ccl/latest/env/vars.sh --force

NUM_GPUS=2 # number of GPUs used
export USE_XETLA=OFF
export SYCL_PI_LEVEL_ZERO_USE_IMMEDIATE_COMMANDLISTS=2
export TORCH_LLM_ALLREDUCE=0 # different from PVC

mpirun -np $NUM_GPUS --prepend-rank \
  python deepspeed_autotp.py --repo-id-or-model-path 'Intel/neural-chat-7b-v3' --low-bit 'sym_int4'
```

Please have a try and feel free to let me know if you have any other questions.
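For reference, below is a rough sketch of the pattern a DeepSpeed AutoTP + low-bit script typically follows. It is not the actual deepspeed_autotp.py from the example; the import paths, the init_inference arguments, and the rank handling are assumptions based on public DeepSpeed and ipex-llm/bigdl-llm documentation.

```python
# Rough sketch only -- see the Deepspeed-AutoTP example in the repo for the
# real script. Assumes LOCAL_RANK/WORLD_SIZE are provided by the launcher and
# that intel-extension-for-pytorch (and intel-extension-for-deepspeed) are
# installed for XPU support.
import os
import torch
import deepspeed
import intel_extension_for_pytorch as ipex  # noqa: F401, registers the 'xpu' device
from transformers import AutoModelForCausalLM, AutoTokenizer
from ipex_llm import optimize_model  # 'from bigdl.llm import optimize_model' on older releases

local_rank = int(os.environ.get("LOCAL_RANK", "0"))
world_size = int(os.environ.get("WORLD_SIZE", "2"))

model_path = "Intel/neural-chat-7b-v3"
model = AutoModelForCausalLM.from_pretrained(
    model_path, torch_dtype=torch.float16, trust_remote_code=True
)

# Shard the model across ranks with DeepSpeed AutoTP (tensor parallelism).
model = deepspeed.init_inference(
    model,
    tensor_parallel={"tp_size": world_size},
    dtype=torch.float16,
    replace_with_kernel_inject=False,
)

# Apply sym_int4 low-bit optimization to this rank's shard, then move it to XPU.
model = optimize_model(model.module.to("cpu"), low_bit="sym_int4")
model = model.to(f"xpu:{local_rank}")

tokenizer = AutoTokenizer.from_pretrained(model_path, trust_remote_code=True)
inputs = tokenizer("What is AI?", return_tensors="pt").to(f"xpu:{local_rank}")
with torch.inference_mode():
    output = model.generate(**inputs, max_new_tokens=32)
if local_rank == 0:
    print(tokenizer.decode(output[0], skip_special_tokens=True))
```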
Hi @plusbang, I can see 3 GPUs on my system: the two GPU devices of a Flex 140 card (6 GB of memory each) and one Flex 170.
Devices 0 and 1 are used by default in our script. Please refer to here for more details about how to select devices. According to my experiment on two A770 GPUs, about 3 GB is used per GPU if you run neural-chat-7B with sym_int4 and the default input prompt in the example. According to your error message,
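As a rough illustration of device selection (not the exact instructions behind the link above), visible Level Zero devices can usually be listed and restricted as shown below, assuming the standard sycl-ls tool from the oneAPI Base Toolkit and the ZE_AFFINITY_MASK environment variable; the indices must be adjusted to match what sycl-ls reports on your machine.

```bash
# List the GPUs that the oneAPI runtime can see.
sycl-ls

# Expose only two specific devices to the run, e.g. the two GPU devices of the
# Flex 140 card. The indices follow the ordering reported by sycl-ls.
export ZE_AFFINITY_MASK=0,1

mpirun -np 2 --prepend-rank \
  python deepspeed_autotp.py --repo-id-or-model-path 'Intel/neural-chat-7b-v3' --low-bit 'sym_int4'
```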
Hi @plusbang, we can successfully run the inference with DeepSpeed for neural-chat on Flex 140. Thanks for your support. However, the customer is also interested in knowing the performance for concurrent-usage cases during deployment. Could you please guide us on how to test handling multiple requests on the same instance with DeepSpeed on Flex 140?
We plan to add a deepspeed+ipex-llm inference backend to FastChat serving and will update you once it's supported. Thanks.
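In the meantime, one rough way to smoke-test concurrent requests against any HTTP serving endpoint (for example, a FastChat OpenAI-compatible API once the deepspeed+ipex-llm backend is available) is a small threaded client. The endpoint URL and model name below are placeholders, not something provided by this repository.

```python
# Hypothetical concurrency smoke test against an OpenAI-compatible completions
# endpoint. The URL and model name are placeholders; adapt them to your deployment.
import concurrent.futures
import time

import requests

ENDPOINT = "http://localhost:8000/v1/completions"  # placeholder
MODEL = "neural-chat-7b-v3"                        # placeholder


def one_request(i: int) -> float:
    """Send one completion request and return its end-to-end latency in seconds."""
    start = time.time()
    resp = requests.post(
        ENDPOINT,
        json={"model": MODEL, "prompt": f"Request {i}: What is AI?", "max_tokens": 64},
        timeout=300,
    )
    resp.raise_for_status()
    return time.time() - start


# Fire 8 requests with up to 4 in flight at a time and report latencies.
with concurrent.futures.ThreadPoolExecutor(max_workers=4) as pool:
    latencies = list(pool.map(one_request, range(8)))

print(f"avg latency: {sum(latencies) / len(latencies):.2f}s, max: {max(latencies):.2f}s")
```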
@Vasud-ha |
The Intel GPU Flex 140 has two GPUs per card, with a total memory capacity of 12 GB (6 GB per GPU). Currently, I can only run the inference on one GPU device with limited memory. Could you please guide us on how to run the model inference on two cards using DeepSpeed with neural-chat, as done in these samples: https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/GPU/Deepspeed-AutoTP