From 86f9b5a785d519411ef53f1085a8721951d1478a Mon Sep 17 00:00:00 2001 From: liu-shaojun Date: Tue, 2 Apr 2024 09:45:37 +0800 Subject: [PATCH 1/3] refine serving on cpu/xpu --- .github/workflows/manually_build.yml | 4 +- docker/llm/README.md | 257 +++++++++++++++++---------- 2 files changed, 163 insertions(+), 98 deletions(-) diff --git a/.github/workflows/manually_build.yml b/.github/workflows/manually_build.yml index 36525496d70..0d3a01b3938 100644 --- a/.github/workflows/manually_build.yml +++ b/.github/workflows/manually_build.yml @@ -20,7 +20,7 @@ on: tag: description: 'docker image tag (e.g. 2.1.0-SNAPSHOT)' required: true - default: 'latest' + default: '2.1.0-SNAPSHOT' type: string workflow_call: inputs: @@ -32,7 +32,7 @@ on: tag: description: 'docker image tag (e.g. 2.1.0-SNAPSHOT)' required: true - default: 'latest' + default: '2.1.0-SNAPSHOT' type: string env: diff --git a/docker/llm/README.md b/docker/llm/README.md index 71d52b67590..1ee737a79cd 100644 --- a/docker/llm/README.md +++ b/docker/llm/README.md @@ -144,9 +144,9 @@ Then, execute `bash run-spr.sh`, which will generate output results in `results. bash run-spr.sh ``` -For further details and comprehensive functionality of the benchmark tool, please refer to the [all-in-one benchmark tool](https://github.com/intel-analytics/BigDL/tree/main/python/llm/dev/benchmark/all-in-one). +For further details and comprehensive functionality of the benchmark tool, please refer to the [all-in-one benchmark tool](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/dev/benchmark/all-in-one). -Additionally, for examples related to Inference with Speculative Decoding, you can explore [Speculative-Decoding Examples](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/Speculative-Decoding). 
+Additionally, for examples related to Inference with Speculative Decoding, you can explore [Speculative-Decoding Examples](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/CPU/Speculative-Decoding). @@ -202,19 +202,26 @@ For example, if your model is Llama-2-7b-chat-hf and mounted on /llm/models, you python chat.py --model-path /llm/models/Llama-2-7b-chat-hf ``` -To run inference using `IPEX-LLM` using xpu, you could refer to this [documentation](https://github.com/intel-analytics/IPEX/tree/main/python/llm/example/GPU). +To run inference with `IPEX-LLM` on xpu, refer to this [documentation](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm/example/GPU). ## IPEX-LLM Serving on CPU +FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find detailed information on its [homepage](https://github.com/lm-sys/FastChat). -### Boot container +IPEX-LLM is integrated into FastChat so that users can use IPEX-LLM as a serving backend in the deployment. -Pull image: -``` +### 1. Prepare ipex-llm-serving-cpu Docker Image + +Run the following command: + +```bash docker pull intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT ``` -You could use the following bash script to start the container. Please be noted that the CPU config is specified for Xeon CPUs, change it accordingly if you are not using a Xeon CPU. +### 2. Start ipex-llm-serving-cpu Docker Container + +Note that the CPU config is specified for Xeon CPUs; change it accordingly if you are not using a Xeon CPU. + ```bash export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT export CONTAINER_NAME=my_container @@ -229,102 +236,131 @@ docker run -itd \ -v $MODEL_PATH:/llm/models \ $DOCKER_IMAGE ``` -After the container is booted, you could get into the container through `docker exec`. 
- -### Models +Access the container: +``` +docker exec -it $CONTAINER_NAME bash +``` -Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat. +### 3. Serving with FastChat -FastChat determines the Model adapter to use through path matching. Therefore, in order to load models using IPEX-LLM, you need to make some modifications to the model's name. +To serve using the Web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers. -A special case is `ChatGLM` models. For these models, you do not need to do any changes after downloading the model and the `IPEX-LLM` backend will be used automatically. +- #### **Step 1: Launch the Controller** + ```bash + python3 -m fastchat.serve.controller & + ``` + This controller manages the distributed workers. -### Start the service +- #### **Step 2: Launch the model worker(s)** -#### Serving with Web UI + Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat. + ```bash + source ipex-llm-init -t -To serve using the Web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers. + # Available low_bit format including sym_int4, sym_int8, bf16 etc. + python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path path/to/vicuna-7b-v1.5 --low-bit "sym_int4" --trust-remote-code --device "cpu" & + ``` + Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller. 
-##### Launch the Controller -```bash -python3 -m fastchat.serve.controller -``` +- #### **Step 3: Launch Gradio web server or RESTful API server** + You can launch Gradio web server to serve your models using the web UI or launch RESTful API server to serve with cURL. -This controller manages the distributed workers. + - **Option 1: Serving with Web UI** + ```bash + python3 -m fastchat.serve.gradio_web_server & + ``` + This is the user interface that users will interact with. -##### Launch the model worker(s) -```bash -python3 -m ipex_llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu -``` -Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller. + By following these steps, you will be able to serve your models using the web UI with `IPEX-LLM` as the backend. You can open your browser and chat with a model now. -> To run model worker using Intel GPU, simply change the --device cpu option to --device xpu + - **Option 2: Serving with OpenAI-Compatible RESTful APIs** -##### Launch the Gradio web server + Launch the RESTful API server -```bash -python3 -m fastchat.serve.gradio_web_server -``` + ```bash + python3 -m fastchat.serve.openai_api_server --host localhost --port 8000 & + ``` -This is the user interface that users will interact with. + Use curl for testing, an example could be: -By following these steps, you will be able to serve your models using the web UI with `IPEX-LLM` as the backend. You can open your browser and chat with a model now. + ```bash + curl -X POST -H "Content-Type: application/json" -d '{ + "model": "Llama-2-7b-chat-hf", + "prompt": "Once upon a time, there existed a little girl who liked to have adventures. 
She wanted to go to places and meet new people, and have fun", + "n": 1, + "best_of": 1, + "use_beam_search": false, + "stream": false + }' http://localhost:8000/v1/completions + ``` + You can find more details here [Serving using IPEX-LLM and FastChat](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/src/ipex_llm/serving/fastchat/README.md) -#### Serving with OpenAI-Compatible RESTful APIs +### 4. Serving with vLLM Continuous Batching +To fully utilize the continuous batching feature of the vLLM, you can send requests to the service using curl or other similar methods. The requests sent to the engine will be batched at token level. Queries will be executed in the same forward step of the LLM and be removed when they are finished instead of waiting for all sequences to be finished. -To start an OpenAI API server that provides compatible APIs using `IPEX-LLM` backend, you need three main components: an OpenAI API Server that serves the in-coming requests, model workers that host one or more models, and a controller to coordinate the web server and model workers. 
+- #### **Step 1: Launch the api_server** + ```bash + #!/bin/bash + # You may also want to adjust the `--max-num-batched-tokens` argument, it indicates the hard limit + # of batched prompt length the server will accept + numactl -C 0-47 -m 0 python -m ipex_llm.vllm.entrypoints.openai.api_server \ + --model /llm/models/Llama-2-7b-chat-hf/ --port 8000 \ + --load-format 'auto' --device cpu --dtype bfloat16 \ + --max-num-batched-tokens 4096 & + ``` -First, launch the controller +- #### **Step 2: Use curl for testing, access the api server as follows:** -```bash -python3 -m fastchat.serve.controller -``` + ```bash + curl http://localhost:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "/llm/models/Llama-2-7b-chat-hf/", + "prompt": "San Francisco is a", + "max_tokens": 128, + "temperature": 0 + }' & + ``` -Then, launch the model worker(s): + You can find more details here: [Serving with vLLM Continuous Batching](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/CPU/vLLM-Serving/README.md) -```bash -python3 -m ipex_llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device cpu -``` -Finally, launch the RESTful API server +## IPEX-LLM Serving on XPU -```bash -python3 -m fastchat.serve.openai_api_server --host localhost --port 8000 -``` +FastChat is an open platform for training, serving, and evaluating large language model based chatbots. You can find detailed information on its [homepage](https://github.com/lm-sys/FastChat). +IPEX-LLM is integrated into FastChat so that users can use IPEX-LLM as a serving backend in the deployment. -## IPEX-LLM Serving on XPU +### 1. Prepare ipex-llm-serving-xpu Docker Image -### Boot container +Run the following command: -Pull image: -``` +```bash docker pull intelanalytics/ipex-llm-serving-xpu:2.1.0-SNAPSHOT ``` +### 2. Start ipex-llm-serving-xpu Docker Container + To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container. 
-An example could be: ```bash -#/bin/bash -export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT +export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:2.1.0-SNAPSHOT export CONTAINER_NAME=my_container export MODEL_PATH=/llm/models[change to your model path] -export SERVICE_MODEL_PATH=/llm/models/chatglm2-6b[a specified model path for running service] docker run -itd \ --net=host \ - --device=/dev/dri \ - --memory="32G" \ + --cpuset-cpus="0-47" \ + --cpuset-mems="0" \ --name=$CONTAINER_NAME \ - --shm-size="16g" \ -v $MODEL_PATH:/llm/models \ - -e SERVICE_MODEL_PATH=$SERVICE_MODEL_PATH \ - $DOCKER_IMAGE --service-model-path $SERVICE_MODEL_PATH + $DOCKER_IMAGE +``` +Access the container: +``` +docker exec -it $CONTAINER_NAME bash ``` -You can assign specified model path to service-model-path to run the service while booting the container. Also you can manually run the service after entering container. Run `/opt/entrypoint.sh --help` in container to see more information. There are steps below describe how to run service in details as well. - To verify the device is successfully mapped into the container, run `sycl-ls` to check the result. In a machine with Arc A770, the sampled output is: ```bash @@ -334,58 +370,87 @@ root@arda-arc12:/# sycl-ls [opencl:gpu:2] Intel(R) OpenCL Graphics, Intel(R) Arc(TM) A770 Graphics 3.0 [23.17.26241.33] [ext_oneapi_level_zero:gpu:0] Intel(R) Level-Zero, Intel(R) Arc(TM) A770 Graphics 1.3 [1.3.26241] ``` -After the container is booted, you could get into the container through `docker exec`. - -### Start the service -#### Serving with Web UI +### 3. Serving with FastChat To serve using the Web UI, you need three main components: web servers that interface with users, model workers that host one or more models, and a controller to coordinate the web server and model workers. 
-##### Launch the Controller -```bash -python3 -m fastchat.serve.controller -``` +- #### **Step 1: Launch the Controller** + ```bash + python3 -m fastchat.serve.controller & + ``` -This controller manages the distributed workers. + This controller manages the distributed workers. -##### Launch the model worker(s) -```bash -python3 -m ipex_llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device xpu -``` -Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller. +- #### **Step 2: Launch the model worker(s)** -##### Launch the Gradio web server + Using IPEX-LLM in FastChat does not impose any new limitations on model usage. Therefore, all Hugging Face Transformer models can be utilized in FastChat. + ```bash + # Available low_bit format including sym_int4, sym_int8, fp16 etc. + python3 -m ipex_llm.serving.fastchat.ipex_llm_worker --model-path /llm/models/Llama-2-7b-chat-hf/ --low-bit "sym_int4" --trust-remote-code --device "xpu" & + ``` + Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller. -```bash -python3 -m fastchat.serve.gradio_web_server -``` +- #### **Step 3: Launch Gradio web server or RESTful API server** + You can launch Gradio web server to serve your models using the web UI or launch RESTful API server to serve with cURL. -This is the user interface that users will interact with. + - **Option 1: Serving with Web UI** + ```bash + python3 -m fastchat.serve.gradio_web_server & + ``` + This is the user interface that users will interact with. -By following these steps, you will be able to serve your models using the web UI with `IPEX-LLM` as the backend. You can open your browser and chat with a model now. + By following these steps, you will be able to serve your models using the web UI with `IPEX-LLM` as the backend. You can open your browser and chat with a model now. 
-#### Serving with OpenAI-Compatible RESTful APIs + - **Option 2: Serving with OpenAI-Compatible RESTful APIs** -To start an OpenAI API server that provides compatible APIs using `IPEX-LLM` backend, you need three main components: an OpenAI API Server that serves the in-coming requests, model workers that host one or more models, and a controller to coordinate the web server and model workers. + Launch the RESTful API server -First, launch the controller + ```bash + python3 -m fastchat.serve.openai_api_server --host localhost --port 8000 & + ``` -```bash -python3 -m fastchat.serve.controller -``` + Use curl for testing, an example could be: -Then, launch the model worker(s): + ```bash + curl -X POST -H "Content-Type: application/json" -d '{ + "model": "Llama-2-7b-chat-hf", + "prompt": "Once upon a time, there existed a little girl who liked to have adventures. She wanted to go to places and meet new people, and have fun", + "n": 1, + "best_of": 1, + "use_beam_search": false, + "stream": false + }' http://localhost:8000/v1/completions + ``` + You can find more details here [Serving using IPEX-LLM and FastChat](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/src/ipex_llm/serving/fastchat/README.md) -```bash -python3 -m ipex_llm.serving.model_worker --model-path lmsys/vicuna-7b-v1.3 --device xpu -``` +### 4. Serving with vLLM Continuous Batching +To fully utilize the continuous batching feature of the vLLM, you can send requests to the service using curl or other similar methods. The requests sent to the engine will be batched at token level. Queries will be executed in the same forward step of the LLM and be removed when they are finished instead of waiting for all sequences to be finished. 
-Finally, launch the RESTful API server +- #### **Step 1: Launch the api_server** + ```bash + #!/bin/bash + # You may also want to adjust the `--max-num-batched-tokens` argument, it indicates the hard limit + # of batched prompt length the server will accept + python -m ipex_llm.vllm.entrypoints.openai.api_server \ + --model /llm/models/Llama-2-7b-chat-hf/ --port 8000 \ + --load-format 'auto' --device xpu --dtype bfloat16 \ + --max-num-batched-tokens 4096 & + ``` -```bash -python3 -m fastchat.serve.openai_api_server --host localhost --port 8000 -``` +- #### **Step 2: Use curl for testing, access the api server as follows:** + + ```bash + curl http://localhost:8000/v1/completions \ + -H "Content-Type: application/json" \ + -d '{ + "model": "/llm/models/Llama-2-7b-chat-hf/", + "prompt": "San Francisco is a", + "max_tokens": 128, + "temperature": 0 + }' & + ``` + You can find more details here [Serving with vLLM Continuous Batching](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/example/GPU/vLLM-Serving/README.md) ## IPEX-LLM Fine Tuning on CPU From a0a55d0c454228940dadde2a26ee73546b186c32 Mon Sep 17 00:00:00 2001 From: liu-shaojun Date: Tue, 2 Apr 2024 10:07:31 +0800 Subject: [PATCH 2/3] minor fix --- docker/llm/README.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docker/llm/README.md b/docker/llm/README.md index 1ee737a79cd..188975f5f2a 100644 --- a/docker/llm/README.md +++ b/docker/llm/README.md @@ -92,7 +92,7 @@ Here's a demonstration of how to navigate the tutorial in the explorer: **3.3 Performance Benchmark**: We provide a benchmark tool help users to test all the benchmarks and record them in a result CSV. ```bash -cd /llm//benchmark/all-in-one +cd /llm/benchmark/all-in-one ``` Users can provide models and related information in config.yaml. 
From 5825091397df448e96d51fcbb15ae108b230643e Mon Sep 17 00:00:00 2001 From: liu-shaojun Date: Tue, 2 Apr 2024 11:16:08 +0800 Subject: [PATCH 3/3] replace localhost with 0.0.0.0 so that service can be accessed through ip address --- docker/llm/README.md | 22 ++++++++++++---------- 1 file changed, 12 insertions(+), 10 deletions(-) diff --git a/docker/llm/README.md b/docker/llm/README.md index 188975f5f2a..81ed9b4399b 100644 --- a/docker/llm/README.md +++ b/docker/llm/README.md @@ -264,7 +264,7 @@ To serve using the Web UI, you need three main components: web servers that inte Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller. - #### **Step 3: Launch Gradio web server or RESTful API server** - You can launch Gradio web server to serve your models using the web UI or launch RESTful API server to serve with cURL. + You can launch Gradio web server to serve your models using the web UI or launch RESTful API server to serve with HTTP. 
- **Option 1: Serving with Web UI** ```bash @@ -279,7 +279,7 @@ To serve using the Web UI, you need three main components: web servers that inte Launch the RESTful API server ```bash - python3 -m fastchat.serve.openai_api_server --host localhost --port 8000 & + python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 & ``` Use curl for testing, an example could be: @@ -292,7 +292,7 @@ To serve using the Web UI, you need three main components: web servers that inte "best_of": 1, "use_beam_search": false, "stream": false - }' http://localhost:8000/v1/completions + }' http://YOUR_HTTP_HOST:8000/v1/completions ``` You can find more details here [Serving using IPEX-LLM and FastChat](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/src/ipex_llm/serving/fastchat/README.md) @@ -305,7 +305,8 @@ To fully utilize the continuous batching feature of the vLLM, you can send reque # You may also want to adjust the `--max-num-batched-tokens` argument, it indicates the hard limit # of batched prompt length the server will accept numactl -C 0-47 -m 0 python -m ipex_llm.vllm.entrypoints.openai.api_server \ - --model /llm/models/Llama-2-7b-chat-hf/ --port 8000 \ + --model /llm/models/Llama-2-7b-chat-hf/ \ + --host 0.0.0.0 --port 8000 \ --load-format 'auto' --device cpu --dtype bfloat16 \ --max-num-batched-tokens 4096 & ``` @@ -313,7 +314,7 @@ To fully utilize the continuous batching feature of the vLLM, you can send reque - #### **Step 2: Use curl for testing, access the api server as follows:** ```bash - curl http://localhost:8000/v1/completions \ + curl http://YOUR_HTTP_HOST:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "/llm/models/Llama-2-7b-chat-hf/", @@ -392,7 +393,7 @@ To serve using the Web UI, you need three main components: web servers that inte Wait until the process finishes loading the model and you see "Uvicorn running on ...". The model worker will register itself to the controller. 
- #### **Step 3: Launch Gradio web server or RESTful API server** - You can launch Gradio web server to serve your models using the web UI or launch RESTful API server to serve with cURL. + You can launch Gradio web server to serve your models using the web UI or launch RESTful API server to serve with HTTP. - **Option 1: Serving with Web UI** ```bash @@ -407,7 +408,7 @@ To serve using the Web UI, you need three main components: web servers that inte Launch the RESTful API server ```bash - python3 -m fastchat.serve.openai_api_server --host localhost --port 8000 & + python3 -m fastchat.serve.openai_api_server --host 0.0.0.0 --port 8000 & ``` Use curl for testing, an example could be: @@ -420,7 +421,7 @@ To serve using the Web UI, you need three main components: web servers that inte "best_of": 1, "use_beam_search": false, "stream": false - }' http://localhost:8000/v1/completions + }' http://YOUR_HTTP_HOST:8000/v1/completions ``` You can find more details here [Serving using IPEX-LLM and FastChat](https://github.com/intel-analytics/ipex-llm/blob/main/python/llm/src/ipex_llm/serving/fastchat/README.md) @@ -433,7 +434,8 @@ To fully utilize the continuous batching feature of the vLLM, you can send reque # You may also want to adjust the `--max-num-batched-tokens` argument, it indicates the hard limit # of batched prompt length the server will accept python -m ipex_llm.vllm.entrypoints.openai.api_server \ - --model /llm/models/Llama-2-7b-chat-hf/ --port 8000 \ + --model /llm/models/Llama-2-7b-chat-hf/ \ + --host 0.0.0.0 --port 8000 \ --load-format 'auto' --device xpu --dtype bfloat16 \ --max-num-batched-tokens 4096 & ``` @@ -441,7 +443,7 @@ To fully utilize the continuous batching feature of the vLLM, you can send reque - #### **Step 2: Use curl for testing, access the api server as follows:** ```bash - curl http://localhost:8000/v1/completions \ + curl http://YOUR_HTTP_HOST:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": 
"/llm/models/Llama-2-7b-chat-hf/",