From 52f1b541cfd698a80876982ed14530a7f9539718 Mon Sep 17 00:00:00 2001
From: Shaojun Liu <61072813+liu-shaojun@users.noreply.github.com>
Date: Fri, 29 Mar 2024 10:16:10 +0800
Subject: [PATCH] refine and verify ipex-inference-cpu docker document (#10565)
* restructure the index
* refine and verify cpu-inference document
* update
---
docker/llm/README.md | 232 ++++++++++++----------
docker/llm/inference/cpu/docker/README.md | 6 +-
2 files changed, 128 insertions(+), 110 deletions(-)
diff --git a/docker/llm/README.md b/docker/llm/README.md
index 8c58287d41b..1baa751c557 100644
--- a/docker/llm/README.md
+++ b/docker/llm/README.md
@@ -1,148 +1,166 @@
# Getting started with IPEX-LLM in Docker
+This guide provides step-by-step instructions for installing and using IPEX-LLM in a Docker environment. It covers setups for both CPU and XPU (Intel GPU) on different operating systems.
+
### Index
-- [Docker installation guide for IPEX-LLM on CPU](#docker-installation-guide-for-ipex-llm-on-cpu)
- - [IPEX-LLM on Windows](#ipex-llm-on-windows)
- - [IPEX-LLM on Linux/MacOS](#ipex-llm-on-linuxmacos)
-- [Docker installation guide for IPEX LLM on XPU](#docker-installation-guide-for-ipex-llm-on-xpu)
-- [Docker installation guide for IPEX LLM Serving on CPU](#docker-installation-guide-for-ipex-llm-serving-on-cpu)
-- [Docker installation guide for IPEX LLM Serving on XPU](#docker-installation-guide-for-ipex-llm-serving-on-xpu)
-- [Docker installation guide for IPEX LLM Fine Tuning on CPU](#docker-installation-guide-for-ipex-llm-fine-tuning-on-cpu)
-- [Docker installation guide for IPEX LLM Fine Tuning on XPU](#docker-installation-guide-for-ipex-llm-fine-tuning-on-xpu)
+- [Docker Installation](#docker-installation-instructions)
+- [IPEX-LLM Inference](#ipex-llm-inference-on-cpu)
+ - [On CPU](#ipex-llm-inference-on-cpu)
+ - [On XPU](#ipex-llm-inference-on-xpu)
+- [IPEX-LLM Serving](#ipex-llm-serving-on-cpu)
+ - [On CPU](#ipex-llm-serving-on-cpu)
+ - [On XPU](#ipex-llm-serving-on-xpu)
+- [IPEX-LLM Fine Tuning](#ipex-llm-fine-tuning-on-cpu)
+ - [On CPU](#ipex-llm-fine-tuning-on-cpu)
+ - [On XPU](#ipex-llm-fine-tuning-on-xpu)
-## Docker installation guide for IPEX-LLM on CPU
-### IPEX-LLM on Windows
+## Docker Installation Instructions
-#### Install docker
+1. **For New Users:**
+ - Begin by visiting the [official Docker Get Started page](https://www.docker.com/get-started/) for a comprehensive introduction and installation guide.
-New users can quickly get started with Docker using this [official link](https://www.docker.com/get-started/).
+2. **Additional Steps for Windows Users:**
+ - Follow the step-by-step guide: [Docker Installation Instructions for Windows](https://github.com/intel-analytics/ipex-llm/blob/main/docs/readthedocs/source/doc/LLM/Quickstart/docker_windows_gpu.md).
-For Windows users, make sure WSL2 or Hyper-V is enabled on your computer.
-The instructions for installing can be accessed from
-[here](https://docs.docker.com/desktop/install/windows-install/).
+## IPEX-LLM Inference on CPU
-#### Pull ipex-llm-cpu image
+### 1. Prepare ipex-llm-cpu Docker Image
-To pull image from hub, you can execute command on console:
+Run the following command to pull the image from Docker Hub:
```bash
-docker pull intelanalytics/ipex-llm-cpu:2.5.0-SNAPSHOT
-```
-to check if the image is successfully downloaded, you can use:
-```powershell
-docker images | sls intelanalytics/ipex-llm-cpu:2.5.0-SNAPSHOT
+docker pull intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT
```
+### 2. Start ipex-llm-cpu Docker Container
-#### Start ipex-llm-cpu container
-
-To run the image and do inference, you could create and run a bat script on Windows.
-
-An example on Windows could be:
-```bat
-@echo off
-set DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.5.0-SNAPSHOT
-set CONTAINER_NAME=my_container
-set MODEL_PATH=D:/llm/models[change to your model path]
+```bash
+#!/bin/bash
+export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT
+export CONTAINER_NAME=my_container
+export MODEL_PATH=/llm/models  # change to your model path
-:: Run the Docker container
-docker run -itd ^
- -p 12345:12345 ^
- --cpuset-cpus="0-7" ^
- --cpuset-mems="0" ^
- --memory="8G" ^
- --name=%CONTAINER_NAME% ^
- -v %MODEL_PATH%:/llm/models ^
- %DOCKER_IMAGE%
+docker run -itd \
+ --privileged \
+ -p 12345:12345 \
+ --cpuset-cpus="0-47" \
+ --cpuset-mems="0" \
+ --name=$CONTAINER_NAME \
+ -v $MODEL_PATH:/llm/models \
+ $DOCKER_IMAGE
```
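The `--cpuset-cpus="0-47"` value above assumes a host with at least 48 cores. As a rough sketch (the `cpuset_range` helper is hypothetical and not part of IPEX-LLM), the range can be derived from a core count instead of being hard-coded. `nproc` is assumed to be available when no count is passed:

```bash
# Hypothetical helper: print a --cpuset-cpus value covering the first N cores.
# Usage: cpuset_range [N]   (N defaults to the output of nproc)
cpuset_range() {
  cores="${1:-$(nproc)}"
  # Reject anything that is not a plain positive integer.
  case "$cores" in
    ''|*[!0-9]*)
      echo "cpuset_range: expected a positive integer" >&2
      return 1
      ;;
  esac
  if [ "$cores" -lt 1 ]; then
    echo "cpuset_range: expected a positive integer" >&2
    return 1
  fi
  # A single core is written as "0"; otherwise emit the range 0-(N-1).
  if [ "$cores" -eq 1 ]; then
    echo "0"
  else
    echo "0-$((cores - 1))"
  fi
}
```

For example, `--cpuset-cpus="$(cpuset_range 48)"` expands to `--cpuset-cpus="0-47"`, matching the script above.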
-After the container is booted, you could get into the container through `docker exec`.
+Access the container:
```
-docker exec -it my_container bash
+docker exec -it $CONTAINER_NAME bash
```
-To run inference using `IPEX-LLM` using cpu, you could refer to this [documentation](https://github.com/intel-analytics/ipex-llm/tree/main/python/llm#cpu-int4).
+### 3. Start Inference and Tutorials
+**3.1 Chat Interface**: Use `chat.py` for conversational AI.
-
-#### Getting started with chat
-
-chat.py can be used to initiate a conversation with a specified model. The file is under directory '/llm'.
-
-You can download models and bind the model directory from host machine to container when start a container.
-
-After entering the container through `docker exec`, you can run chat.py by:
-```bash
-cd /llm/portable-zip
-python chat.py --model-path YOUR_MODEL_PATH
-```
-If your model is chatglm-6b and mounted on /llm/models, you can excute:
-```bash
-python chat.py --model-path /llm/models/chatglm-6b
-```
+For example, if your model is chatglm2-6b and is mounted on /llm/models, you can execute the following command to initiate a conversation:
+ ```bash
+ cd /llm/portable-zip
+ python chat.py --model-path /llm/models/chatglm2-6b
+ ```
Here is a demostration:
-
+
-#### Getting started with tutorials
+**3.2 Jupyter Lab Tutorials**: Start a Jupyter Lab session for IPEX-LLM tutorials.
-You could start a jupyter-lab serving to explore ipex-llm-tutorial which can help you build a more sophisticated Chatbo.
-
-To start serving, run the script under '/llm':
+Run the following command to start the notebook:
```bash
cd /llm
./start-notebook.sh [--port EXPECTED_PORT]
```
-You could assign a port to serving, or the default port 12345 will be assigned.
+The default port is 12345. You can assign a different port with the `--port` parameter.
-If you use host network mode when booted the container, after successfully running service, you can access http://127.0.0.1:12345/lab to get into tutorial, or you should bind the correct ports between container and host.
+If you started the container in host network mode, you can access the tutorial at http://127.0.0.1:12345/lab once the service is running. Otherwise, make sure the notebook port is published from the container to the host (for example with `-p 12345:12345`).
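To keep the published port and the notebook port in sync when you pick a non-default port, a small sketch like the following may help. The helper names are hypothetical. Only the default port 12345 and the `/lab` path come from the text above:

```bash
# Hypothetical helpers, not shipped with the image.

# Build the docker port-publishing flag for a given notebook port.
notebook_port_flag() {
  port="${1:-12345}"
  echo "-p ${port}:${port}"
}

# Build the URL to open on the host once the notebook is running.
notebook_url() {
  port="${1:-12345}"
  echo "http://127.0.0.1:${port}/lab"
}
```

With port 8888, for instance, you would pass `$(notebook_port_flag 8888)` to `docker run`, start the notebook with `./start-notebook.sh --port 8888`, and open the address printed by `notebook_url 8888`.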
-Here is a demostration of how to use tutorial in explorer:
+Here's a demonstration of how to navigate the tutorial in the browser:
-
+
-### IPEX-LLM on Linux/MacOS
+**3.3 Performance Benchmark**: We provide a benchmark tool that runs the selected benchmarks and records the results in a CSV file.
-To run container on Linux/MacOS:
```bash
-#/bin/bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.5.0-SNAPSHOT
-export CONTAINER_NAME=my_container
-export MODEL_PATH=/llm/models[change to your model path]
+cd /llm/benchmark/all-in-one
+```
-docker run -itd \
- -p 12345:12345 \
- --cpuset-cpus="0-7" \
- --cpuset-mems="0" \
- --memory="8G" \
- --name=$CONTAINER_NAME \
- -v $MODEL_PATH:/llm/models \
- $DOCKER_IMAGE
+Users can provide models and related information in `config.yaml`.
+```yaml
+repo_id:
+ - 'THUDM/chatglm-6b'
+ - 'THUDM/chatglm2-6b'
+ - 'meta-llama/Llama-2-7b-chat-hf'
+ # - 'liuhaotian/llava-v1.5-7b' # requires the LLAVA_REPO_DIR environment variable pointing to the llava dir; added only for gpu win related test_api now
+local_model_hub: 'path to your local model hub'
+warm_up: 1
+num_trials: 3
+num_beams: 1 # default to greedy search
+low_bit: 'sym_int4' # default to use 'sym_int4' (i.e. symmetric int4)
+batch_size: 1 # default to 1
+in_out_pairs:
+ - '32-32'
+ - '1024-128'
+test_api:
+ - "transformer_int4"
+ - "native_int4"
+ - "optimize_model"
+ - "pytorch_autocast_bf16"
+ # - "transformer_autocast_bf16"
+ # - "bigdl_ipex_bf16"
+ # - "bigdl_ipex_int4"
+ # - "bigdl_ipex_int8"
+ # - "ipex_fp16_gpu" # on Intel GPU
+ # - "bigdl_fp16_gpu" # on Intel GPU
+ # - "transformer_int4_gpu" # on Intel GPU
+ # - "optimize_model_gpu" # on Intel GPU
+ # - "deepspeed_transformer_int4_cpu" # on Intel SPR Server
+ # - "transformer_int4_gpu_win" # on Intel GPU for Windows
+ # - "transformer_int4_fp16_gpu_win" # on Intel GPU for Windows, use fp16 for non-linear layer
+ # - "transformer_int4_loadlowbit_gpu_win" # on Intel GPU for Windows using load_low_bit API. Please make sure you have used the save.py to save the converted low bit model
+ # - "deepspeed_optimize_model_gpu" # deepspeed autotp on Intel GPU
+ # - "speculative_cpu"
+ # - "speculative_gpu"
+cpu_embedding: False # whether to put the embedding on CPU (only available now for gpu win related test_api)
+streaming: False # whether to output in a streaming way (only available now for gpu win related test_api)
+```
+
+This benchmark tool offers various test APIs, including `transformer_int4`, `speculative_cpu`, and more.
+
+For instance, if you wish to benchmark **inference with speculative decoding**, enable the `speculative_cpu` test API under `test_api` in the `config.yaml` file.
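As an illustration, the commented-out `speculative_cpu` entry in the `test_api` list can be enabled by removing its leading `#`. The following is a minimal sketch, assuming the entry sits on its own line in the form `# - "speculative_cpu"`. `enable_test_api` is a hypothetical helper, not part of the benchmark tool:

```bash
# Hypothetical helper: uncomment one entry of the test_api list in a YAML file.
# Usage: enable_test_api <config-file> <api-name>
enable_test_api() {
  config="$1"
  api="$2"
  tmp="$(mktemp)" || return 1
  # Strip the leading "#" (and the spaces after it) from the line
  #   # - "<api-name>"
  # while keeping the original indentation. Writing to a temp file avoids
  # relying on the non-portable behaviour of `sed -i`.
  if sed "s/^\([[:space:]]*\)#[[:space:]]*\(- \"${api}\"\)/\1\2/" "$config" > "$tmp"; then
    # Copy the content back so the config file keeps its permissions.
    cat "$tmp" > "$config"
    rm -f "$tmp"
  else
    rm -f "$tmp"
    return 1
  fi
}
```

Calling `enable_test_api config.yaml speculative_cpu` leaves every other entry, commented or not, unchanged.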
+
+Then, execute `bash run-spr.sh`, which will generate output results in `results.csv`.
+```bash
+bash run-spr.sh
```
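Once the run finishes, a quick sanity check of the CSV can confirm that results were recorded. The sketch below only assumes that the first line is a header row. The column names are not specified here, so the hypothetical `summarize_csv` helper just echoes the header and counts the non-empty data rows:

```bash
# Hypothetical helper: print the header line and the number of data rows.
# Usage: summarize_csv <csv-file>
summarize_csv() {
  file="$1"
  if [ ! -f "$file" ]; then
    echo "summarize_csv: file not found: $file" >&2
    return 1
  fi
  # Line 1 is treated as the header; every later non-empty line is one result.
  awk 'NR == 1 { print "columns: " $0; next }
       NF      { rows++ }
       END     { print "rows: " (rows + 0) }' "$file"
}
```

For example, run `summarize_csv results.csv` from the benchmark directory.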
-Also, you could use chat.py and ipex-llm-tutorial for development.
+For further details and comprehensive functionality of the benchmark tool, please refer to the [all-in-one benchmark tool](https://github.com/intel-analytics/BigDL/tree/main/python/llm/dev/benchmark/all-in-one).
+
+Additionally, for examples related to Inference with Speculative Decoding, you can explore [Speculative-Decoding Examples](https://github.com/intel-analytics/BigDL/tree/main/python/llm/example/CPU/Speculative-Decoding).
-[Getting started with chat](#getting-started-with-chat)
-[Getting started with tutorials](#getting-started-with-tutorials)
-## Docker installation guide for IPEX LLM on XPU
+## IPEX-LLM Inference on XPU
First, pull docker image from docker hub:
```
-docker pull intelanalytics/ipex-llm-xpu:2.5.0-SNAPSHOT
+docker pull intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT
```
To map the xpu into the container, you need to specify --device=/dev/dri when booting the container.
An example could be:
```bash
#/bin/bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:2.5.0-SNAPSHOT
+export DOCKER_IMAGE=intelanalytics/ipex-llm-xpu:2.1.0-SNAPSHOT
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
@@ -170,18 +188,18 @@ root@arda-arc12:/# sycl-ls
To run inference using `IPEX-LLM` using xpu, you could refer to this [documentation](https://github.com/intel-analytics/IPEX/tree/main/python/llm/example/GPU).
-## Docker installation guide for IPEX LLM Serving on CPU
+## IPEX-LLM Serving on CPU
### Boot container
Pull image:
```
-docker pull intelanalytics/ipex-llm-serving-cpu:2.5.0-SNAPSHOT
+docker pull intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT
```
You could use the following bash script to start the container. Please be noted that the CPU config is specified for Xeon CPUs, change it accordingly if you are not using a Xeon CPU.
```bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.5.0-SNAPSHOT
+export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.1.0-SNAPSHOT
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
@@ -259,13 +277,13 @@ python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
-## Docker installation guide for IPEX LLM Serving on XPU
+## IPEX-LLM Serving on XPU
### Boot container
Pull image:
```
-docker pull intelanalytics/ipex-llm-serving-xpu:2.5.0-SNAPSHOT
+docker pull intelanalytics/ipex-llm-serving-xpu:2.1.0-SNAPSHOT
```
To map the `xpu` into the container, you need to specify `--device=/dev/dri` when booting the container.
@@ -273,7 +291,7 @@ To map the `xpu` into the container, you need to specify `--device=/dev/dri` whe
An example could be:
```bash
#/bin/bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-cpu:2.5.0-SNAPSHOT
+export DOCKER_IMAGE=intelanalytics/ipex-llm-serving-xpu:2.1.0-SNAPSHOT
export CONTAINER_NAME=my_container
export MODEL_PATH=/llm/models[change to your model path]
export SERVICE_MODEL_PATH=/llm/models/chatglm2-6b[a specified model path for running service]
@@ -352,7 +370,7 @@ Finally, launch the RESTful API server
python3 -m fastchat.serve.openai_api_server --host localhost --port 8000
```
-## Docker installation guide for IPEX LLM Fine Tuning on CPU
+## IPEX-LLM Fine Tuning on CPU
### 1. Prepare Docker Image
@@ -360,10 +378,10 @@ You can download directly from Dockerhub like:
```bash
# For standalone
-docker pull intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.5.0-SNAPSHOT
+docker pull intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.1.0-SNAPSHOT
# For k8s
-docker pull intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.5.0-SNAPSHOT
+docker pull intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.1.0-SNAPSHOT
```
Or build the image from source:
@@ -376,7 +394,7 @@ export HTTPS_PROXY=your_https_proxy
docker build \
--build-arg http_proxy=${HTTP_PROXY} \
--build-arg https_proxy=${HTTPS_PROXY} \
- -t intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.5.0-SNAPSHOT \
+ -t intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.1.0-SNAPSHOT \
-f ./Dockerfile .
# For k8s
@@ -386,7 +404,7 @@ export HTTPS_PROXY=your_https_proxy
docker build \
--build-arg http_proxy=${HTTP_PROXY} \
--build-arg https_proxy=${HTTPS_PROXY} \
- -t intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.5.0-SNAPSHOT \
+ -t intelanalytics/ipex-llm-finetune-qlora-cpu-k8s:2.1.0-SNAPSHOT \
-f ./Dockerfile.k8s .
```
@@ -407,7 +425,7 @@ docker run -itd \
-e https_proxy=${HTTPS_PROXY} \
-v $BASE_MODE_PATH:/ipex_llm/model \
-v $DATA_PATH:/ipex_llm/data/alpaca-cleaned \
- intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.5.0-SNAPSHOT
+ intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.1.0-SNAPSHOT
```
The download and mount of base model and data to a docker container demonstrates a standard fine-tuning process. You can skip this step for a quick start, and in this way, the fine-tuning codes will automatically download the needed files:
@@ -421,7 +439,7 @@ docker run -itd \
--name=ipex-llm-fintune-qlora-cpu \
-e http_proxy=${HTTP_PROXY} \
-e https_proxy=${HTTPS_PROXY} \
- intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.5.0-SNAPSHOT
+ intelanalytics/ipex-llm-finetune-qlora-cpu-standalone:2.1.0-SNAPSHOT
```
However, we do recommend you to handle them manually, because the automatical download can be blocked by Internet access and Huggingface authentication etc. according to different environment, and the manual method allows you to fine-tune in a custom way (with different base model and dataset).
@@ -470,7 +488,7 @@ python ./export_merged_model.py --repo-id-or-model-path REPO_ID_OR_MODEL_PATH --
Then you can use `./outputs/checkpoint-200-merged` as a normal huggingface transformer model to do inference.
-## Docker installation guide for IPEX LLM Fine Tuning on XPU
+## IPEX-LLM Fine Tuning on XPU
The following shows how to fine-tune LLM with Quantization (QLoRA built on IPEX-LLM 4bit optimizations) in a docker environment, which is accelerated by Intel XPU.
@@ -479,7 +497,7 @@ The following shows how to fine-tune LLM with Quantization (QLoRA built on IPEX-
You can download directly from Dockerhub like:
```bash
-docker pull intelanalytics/ipex-llm-finetune-qlora-xpu:2.5.0-SNAPSHOT
+docker pull intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT
```
Or build the image from source:
@@ -491,7 +509,7 @@ export HTTPS_PROXY=your_https_proxy
docker build \
--build-arg http_proxy=${HTTP_PROXY} \
--build-arg https_proxy=${HTTPS_PROXY} \
- -t intelanalytics/ipex-llm-finetune-qlora-xpu:2.5.0-SNAPSHOT \
+ -t intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT \
-f ./Dockerfile .
```
@@ -515,7 +533,7 @@ docker run -itd \
-v $BASE_MODE_PATH:/model \
-v $DATA_PATH:/data/alpaca-cleaned \
--shm-size="16g" \
- intelanalytics/ipex-llm-fintune-qlora-xpu:2.5.0-SNAPSHOT
+ intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT
```
The download and mount of base model and data to a docker container demonstrates a standard fine-tuning process. You can skip this step for a quick start, and in this way, the fine-tuning codes will automatically download the needed files:
@@ -532,7 +550,7 @@ docker run -itd \
-e http_proxy=${HTTP_PROXY} \
-e https_proxy=${HTTPS_PROXY} \
--shm-size="16g" \
- intelanalytics/ipex-llm-fintune-qlora-xpu:2.5.0-SNAPSHOT
+ intelanalytics/ipex-llm-finetune-qlora-xpu:2.1.0-SNAPSHOT
```
However, we do recommend you to handle them manually, because the automatical download can be blocked by Internet access and Huggingface authentication etc. according to different environment, and the manual method allows you to fine-tune in a custom way (with different base model and dataset).
diff --git a/docker/llm/inference/cpu/docker/README.md b/docker/llm/inference/cpu/docker/README.md
index ba81774769f..c8babfacac0 100644
--- a/docker/llm/inference/cpu/docker/README.md
+++ b/docker/llm/inference/cpu/docker/README.md
@@ -6,7 +6,7 @@ docker build \
--build-arg http_proxy=.. \
--build-arg https_proxy=.. \
--build-arg no_proxy=.. \
- --rm --no-cache -t intelanalytics/ipex-llm-cpu:2.5.0-SNAPSHOT .
+ --rm --no-cache -t intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT .
```
@@ -16,7 +16,7 @@ docker build \
An example could be:
```bash
#/bin/bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.5.0-SNAPSHOT
+export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT
sudo docker run -itd \
--net=host \
@@ -41,7 +41,7 @@ You can download models and bind the model directory from host machine to contai
Here is an example:
```bash
-export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.5.0-SNAPSHOT
+export DOCKER_IMAGE=intelanalytics/ipex-llm-cpu:2.1.0-SNAPSHOT
export MODEL_PATH=/home/llm/models
sudo docker run -itd \