Update Readme for FastChat docker demo #12354
Conversation
@@ -63,6 +63,70 @@ For convenience, we have included a file `/llm/start-pp_serving-service.sh` in t

To run model serving with `IPEX-LLM` as the backend for FastChat, you can refer to this [quickstart](https://ipex-llm.readthedocs.io/en/latest/doc/LLM/Quickstart/fastchat_quickstart.html#).

In short, you need to start a Docker container with `--device=/dev/dri`; a recommended command is:
To set up model serving using `IPEX-LLM` as the backend with FastChat, you can refer to this Quickstart guide or follow these quick steps to deploy a demo.

Quick Setup for FastChat with IPEX-LLM

1. Start the Docker Container
Run the following command to launch a Docker container with device access:
```bash
#!/bin/bash
export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:latest

# -v: example mapping of a host model directory into the container
# -e http_proxy/https_proxy: optional, set a proxy if needed
sudo docker run -itd \
    --net=host \
    --device=/dev/dri \
    --name=demo-container \
    -v /LLM_MODELS/:/llm/models/ \
    --shm-size="16g" \
    -e http_proxy=... \
    -e https_proxy=... \
    -e no_proxy="127.0.0.1,localhost" \
    $DOCKER_IMAGE
```
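If you need a shell inside the running container (the container name below matches the `--name` used above), a standard `docker exec` works:

```bash
# Open an interactive shell in the container started above
sudo docker exec -it demo-container bash
```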
2. Start the FastChat Service

Inside the container, start the FastChat service:
```bash
#!/bin/bash

# Stop any existing FastChat processes
ps -ef | grep "fastchat" | awk '{print $2}' | xargs kill -9

# Install the required Gradio version
pip install -U gradio==4.43.0

# Launch the FastChat controller
python -m fastchat.serve.controller &

# Set environment variables for CCL
export TORCH_LLM_ALLREDUCE=0
export CCL_DG2_ALLREDUCE=1
export CCL_WORKER_COUNT=2
# Optional: pin CCL workers to specific cores
# export CCL_WORKER_AFFINITY=32,33,34,35
export FI_PROVIDER=shm
export CCL_ATL_TRANSPORT=ofi
export CCL_ZE_IPC_EXCHANGE=sockets
export CCL_ATL_SHM=1

# Load Intel CCL settings
source /opt/intel/1ccl-wks/setvars.sh

# Start the model worker (replace "Yi-1.5-34B" with your model name)
python -m ipex_llm.serving.fastchat.vllm_worker \
    --model-path /llm/models/Yi-1.5-34B \
    --device xpu \
    --enforce-eager \
    --dtype float16 \
    --load-in-low-bit fp8 \
    --tensor-parallel-size 4 \
    --gpu-memory-utilization 0.9 \
    --max-model-len 4096 \
    --max-num-batched-tokens 8000 &

# Wait for initialization
sleep 120

# Start the Gradio web server for FastChat
python -m fastchat.serve.gradio_web_server &
```
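A note on the settings above, based on the general meaning of these flags rather than anything specific to this PR: `--tensor-parallel-size 4` splits the model across four XPU devices, so adjust it to the number of Intel GPUs you actually have; `--load-in-low-bit fp8` loads the weights in FP8 to reduce memory use; and `--max-model-len` / `--max-num-batched-tokens` bound context length and batching, so lower them if you run out of GPU memory.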
This quick setup allows you to deploy FastChat with IPEX-LLM efficiently.
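As an optional sanity check, assuming FastChat's default ports (controller on 21001, Gradio web UI on 7860), you can verify the services from inside the container:

```bash
# Ask the FastChat controller which model workers have registered
curl -X POST http://localhost:21001/list_models

# The Gradio web UI should then be reachable at http://<host-ip>:7860
```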
LGTM
LGTM