[Doc] Update the SkyPilot doc with serving and Llama-3 (#4276)
1 parent ad8d696, commit ceaf4ed
Showing 1 changed file with 264 additions and 23 deletions.
@@ -1,69 +1,310 @@ | ||
.. _on_cloud: | ||
|
||
Running on clouds with SkyPilot | ||
=============================== | ||
Deploying and scaling up with SkyPilot | ||
================================================ | ||
|
||
.. raw:: html | ||
|
||
<p align="center"> | ||
<img src="https://imgur.com/yxtzPEu.png" alt="vLLM"/> | ||
</p> | ||
|
||
vLLM can be run on the cloud to scale to multiple GPUs with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. | ||
vLLM can be **run and scaled to multiple service replicas on clouds and Kubernetes** with `SkyPilot <https://github.com/skypilot-org/skypilot>`__, an open-source framework for running LLMs on any cloud. More examples for various open models, such as Llama-3 and Mixtral, can be found in the `SkyPilot AI gallery <https://skypilot.readthedocs.io/en/latest/gallery/index.html>`__.
|
||
To install SkyPilot and set up your cloud credentials, run:
|
||
Prerequisites | ||
------------- | ||
|
||
- Go to the `HuggingFace model page <https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct>`__ and request access to the model :code:`meta-llama/Meta-Llama-3-8B-Instruct`. | ||
- Check that you have installed SkyPilot (`docs <https://skypilot.readthedocs.io/en/latest/getting-started/installation.html>`__). | ||
- Check that :code:`sky check` shows clouds or Kubernetes are enabled. | ||
|
||
.. code-block:: console | ||
$ pip install skypilot | ||
$ sky check | ||
pip install skypilot-nightly | ||
sky check | ||
Run on a single instance | ||
------------------------ | ||
|
||
See the vLLM SkyPilot YAML for serving, `serving.yaml <https://github.com/skypilot-org/skypilot/blob/master/llm/vllm/serve.yaml>`__. | ||
|
||
.. code-block:: yaml | ||
resources: | ||
accelerators: A100 | ||
accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model. | ||
use_spot: True | ||
disk_size: 512 # Ensure model checkpoints can fit. | ||
disk_tier: best | ||
ports: 8081 # Expose to internet traffic. | ||
envs: | ||
MODEL_NAME: decapoda-research/llama-13b-hf | ||
TOKENIZER: hf-internal-testing/llama-tokenizer | ||
MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct | ||
HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass. | ||
setup: | | ||
conda create -n vllm python=3.9 -y | ||
conda create -n vllm python=3.10 -y | ||
conda activate vllm | ||
git clone https://github.com/vllm-project/vllm.git | ||
cd vllm | ||
pip install . | ||
pip install gradio | ||
pip install vllm==0.4.0.post1 | ||
# Install Gradio for web UI. | ||
pip install gradio openai | ||
pip install flash-attn==2.5.7 | ||
run: | | ||
conda activate vllm | ||
echo 'Starting vllm api server...' | ||
python -u -m vllm.entrypoints.api_server \ | ||
--model $MODEL_NAME \ | ||
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ | ||
--tokenizer $TOKENIZER 2>&1 | tee api_server.log & | ||
python -u -m vllm.entrypoints.openai.api_server \ | ||
--port 8081 \ | ||
--model $MODEL_NAME \ | ||
--trust-remote-code \ | ||
--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \ | ||
2>&1 | tee api_server.log & | ||
echo 'Waiting for vllm api server to start...' | ||
while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done | ||
echo 'Starting gradio server...' | ||
python vllm/examples/gradio_webserver.py | ||
git clone https://github.com/vllm-project/vllm.git || true | ||
python vllm/examples/gradio_openai_chatbot_webserver.py \ | ||
-m $MODEL_NAME \ | ||
--port 8811 \ | ||
--model-url http://localhost:8081/v1 \ | ||
--stop-token-ids 128009,128001 | ||
Start serving the LLaMA-13B model on an A100 GPU:
Start serving the Llama-3 8B model on any of the candidate GPUs listed (L4, A10g, ...):
|
||
.. code-block:: console | ||
$ sky launch serving.yaml | ||
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --env HF_TOKEN | ||
Check the output of the command. It contains a shareable Gradio link (like the last line of the following). Open it in your browser to use the LLaMA model for text completion.
|
||
.. code-block:: console | ||
(task, pid=7431) Running on public URL: https://<gradio-hash>.gradio.live | ||
**Optional**: Serve the 65B model instead of the default 13B and use more GPUs:
**Optional**: Serve the 70B model instead of the default 8B and use more GPUs:
|
||
.. code-block:: console | ||
HF_TOKEN="your-huggingface-token" sky launch serving.yaml --gpus A100:8 --env HF_TOKEN --env MODEL_NAME=meta-llama/Meta-Llama-3-70B-Instruct | ||
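As a rough guide to why the 70B model needs more GPUs than the 8B one, the sketch below estimates the memory taken by the model weights alone. It assumes 16-bit weights (2 bytes per parameter) and round parameter counts of 8 billion and 70 billion. The helper name ``weight_memory_gib`` is hypothetical and not part of vLLM or SkyPilot. The KV cache and activations need additional memory on top of this figure.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int = 2) -> float:
    """Estimate the memory needed for model weights only, in GiB.

    Assumes every parameter is stored with `bytes_per_param` bytes
    (2 for fp16/bf16). KV cache and activations are not included.
    """
    return num_params * bytes_per_param / 2**30


# Round parameter counts, used here only for illustration.
for name, params in [("8B", 8e9), ("70B", 70e9)]:
    print(f"{name}: about {weight_memory_gib(params):.1f} GiB of weights")
```

Under these assumptions the 8B weights fit on a single 24 GB GPU such as an L4, while the 70B weights have to be sharded across several GPUs, which is what :code:`--tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE` does in the run command.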
Scale up to multiple replicas | ||
----------------------------- | ||
|
||
SkyPilot can scale up the service to multiple service replicas with built-in autoscaling, load-balancing and fault-tolerance. You can do it by adding a :code:`service` section to the YAML file.
|
||
.. code-block:: yaml

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1

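Conceptually, a readiness probe is a request that is retried until it succeeds, at which point the replica is treated as ready. The sketch below illustrates that idea only. It is not SkyPilot's implementation, and the function name, timeout and interval values are assumptions. The ``probe`` callable stands in for "send the POST request above and check for a successful response".

```python
import time
from typing import Callable


def wait_until_ready(
    probe: Callable[[], bool],
    timeout_s: float = 600.0,
    interval_s: float = 10.0,
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Call `probe` repeatedly until it returns True or the timeout expires.

    Returns True if the probe succeeded, False if the deadline passed first.
    `sleep` is injectable so the loop can be exercised without real delays.
    """
    deadline = time.monotonic() + timeout_s
    while True:
        if probe():
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

Using an actual chat completion request as the probe, as the YAML does, checks that the model is loaded and able to generate, not just that the HTTP server is listening.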
.. raw:: html | ||
|
||
<details> | ||
<summary>Click to see the full recipe YAML</summary> | ||
|
||
|
||
.. code-block:: yaml

    service:
      replicas: 2
      # An actual request for readiness probe.
      readiness_probe:
        path: /v1/chat/completions
        post_data:
          model: $MODEL_NAME
          messages:
            - role: user
              content: Hello! What is your name?
          max_tokens: 1

    resources:
      accelerators: {L4, A10g, A10, L40, A40, A100, A100-80GB} # We can use cheaper accelerators for 8B model.
      use_spot: True
      disk_size: 512 # Ensure model checkpoints can fit.
      disk_tier: best
      ports: 8081 # Expose to internet traffic.

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-8B-Instruct
      HF_TOKEN: <your-huggingface-token> # Change to your own huggingface token, or use --env to pass.

    setup: |
      conda create -n vllm python=3.10 -y
      conda activate vllm

      pip install vllm==0.4.0.post1
      # Install Gradio for web UI.
      pip install gradio openai
      pip install flash-attn==2.5.7

    run: |
      conda activate vllm
      echo 'Starting vllm api server...'
      python -u -m vllm.entrypoints.openai.api_server \
        --port 8081 \
        --model $MODEL_NAME \
        --trust-remote-code \
        --tensor-parallel-size $SKYPILOT_NUM_GPUS_PER_NODE \
        2>&1 | tee api_server.log &

      echo 'Waiting for vllm api server to start...'
      while ! `cat api_server.log | grep -q 'Uvicorn running on'`; do sleep 1; done

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://localhost:8081/v1 \
        --stop-token-ids 128009,128001

.. raw:: html | ||
|
||
</details> | ||
|
||
Start serving the Llama-3 8B model on multiple replicas:
|
||
.. code-block:: console | ||
HF_TOKEN="your-huggingface-token" sky serve up -n vllm serving.yaml --env HF_TOKEN | ||
Wait until the service is ready: | ||
|
||
.. code-block:: console | ||
sky launch -c vllm-serve-new -s serve.yaml --gpus A100:8 --env MODEL_NAME=decapoda-research/llama-65b-hf | ||
watch -n10 sky serve status vllm | ||
.. raw:: html | ||
|
||
<details> | ||
<summary>Example outputs:</summary> | ||
|
||
.. code-block:: console | ||
Services | ||
NAME VERSION UPTIME STATUS REPLICAS ENDPOINT | ||
vllm 1 35s READY 2/2 xx.yy.zz.100:30001 | ||
Service Replicas | ||
SERVICE_NAME ID VERSION IP LAUNCHED RESOURCES STATUS REGION | ||
vllm 1 1 xx.yy.zz.121 18 mins ago 1x GCP({'L4': 1}) READY us-east4 | ||
vllm 2 1 xx.yy.zz.245 18 mins ago 1x GCP({'L4': 1}) READY us-east4 | ||
.. raw:: html | ||
|
||
</details> | ||
|
||
After the service is READY, retrieve its single endpoint and send requests to it:
|
||
.. code-block:: console

    ENDPOINT=$(sky serve status --endpoint 8081 vllm)
    curl -L http://$ENDPOINT/v1/chat/completions \
      -H "Content-Type: application/json" \
      -d '{
        "model": "meta-llama/Meta-Llama-3-8B-Instruct",
        "messages": [
          {
            "role": "system",
            "content": "You are a helpful assistant."
          },
          {
            "role": "user",
            "content": "Who are you?"
          }
        ],
        "stop_token_ids": [128009, 128001]
      }'

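The same request can be sent from Python. The following is a minimal sketch using only the standard library. The helper name ``build_chat_request`` is hypothetical, and the endpoint value is assumed to be the ``host:port`` string printed by :code:`sky serve status --endpoint 8081 vllm`.

```python
import json
import urllib.request


def build_chat_request(endpoint: str, model: str, user_message: str) -> urllib.request.Request:
    """Build a POST request for the OpenAI-compatible chat completions route.

    `endpoint` is a "host:port" string. The payload mirrors the curl example,
    including the Llama-3 stop token IDs.
    """
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": user_message},
        ],
        "stop_token_ids": [128009, 128001],
    }
    return urllib.request.Request(
        url=f"http://{endpoint}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request: urllib.request.Request) -> dict:
    """Send the request and return the decoded JSON response body."""
    with urllib.request.urlopen(request) as response:
        return json.load(response)
```

To use it, pass the endpoint and model name, for example ``send(build_chat_request(endpoint, "meta-llama/Meta-Llama-3-8B-Instruct", "Who are you?"))``. In an OpenAI-style chat response, the reply text is found under ``["choices"][0]["message"]["content"]``.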
To enable autoscaling, specify additional configs in the :code:`service` section:

.. code-block:: yaml

    service:
      replica_policy:
        min_replicas: 0
        max_replicas: 3
        target_qps_per_replica: 2

This scales the service up when the QPS exceeds 2 per replica.
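One way to read these three settings is as a target-tracking rule: pick enough replicas that each one handles at most the target QPS, then clamp the result to the configured bounds. The sketch below is a simplified model of that rule, not SkyPilot's actual autoscaler, which may also apply delays before scaling up or down. The function name is hypothetical.

```python
import math


def desired_replicas(
    current_qps: float,
    target_qps_per_replica: float,
    min_replicas: int,
    max_replicas: int,
) -> int:
    """Return the replica count that keeps per-replica QPS at or below the target.

    The result is clamped to the range [min_replicas, max_replicas].
    """
    if target_qps_per_replica <= 0:
        raise ValueError("target_qps_per_replica must be positive")
    needed = math.ceil(current_qps / target_qps_per_replica)
    return max(min_replicas, min(max_replicas, needed))


# With the config above (min 0, max 3, target 2 QPS per replica):
for qps in (0, 1, 5, 100):
    print(qps, "QPS ->", desired_replicas(qps, 2, 0, 3), "replicas")
```

Under this model, :code:`min_replicas: 0` lets the service scale to zero when there is no traffic, and :code:`max_replicas: 3` caps the replica count however high the load goes.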
|
||
|
||
**Optional**: Connect a GUI to the endpoint | ||
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ | ||
|
||
|
||
It is also possible to access the Llama-3 service with a separate GUI frontend, so that user requests sent to the GUI are load-balanced across replicas.
|
||
.. raw:: html | ||
|
||
<details> | ||
<summary>Click to see the full GUI YAML</summary> | ||
|
||
.. code-block:: yaml

    envs:
      MODEL_NAME: meta-llama/Meta-Llama-3-70B-Instruct
      ENDPOINT: x.x.x.x:3031 # Address of the API server running vllm.

    resources:
      cpus: 2

    setup: |
      conda activate vllm
      if [ $? -ne 0 ]; then
        conda create -n vllm python=3.10 -y
        conda activate vllm
      fi

      # Install Gradio for web UI.
      pip install gradio openai

    run: |
      conda activate vllm
      export PATH=$PATH:/sbin
      WORKER_IP=$(hostname -I | cut -d' ' -f1)
      CONTROLLER_PORT=21001
      WORKER_PORT=21002

      echo 'Starting gradio server...'
      git clone https://github.com/vllm-project/vllm.git || true
      python vllm/examples/gradio_openai_chatbot_webserver.py \
        -m $MODEL_NAME \
        --port 8811 \
        --model-url http://$ENDPOINT/v1 \
        --stop-token-ids 128009,128001 | tee ~/gradio.log

.. raw:: html | ||
|
||
</details> | ||
|
||
1. Start the chat web UI: | ||
|
||
.. code-block:: console | ||
sky launch -c gui ./gui.yaml --env ENDPOINT=$(sky serve status --endpoint vllm) | ||
2. Then, we can access the GUI at the returned gradio link: | ||
|
||
.. code-block:: console | ||
| INFO | stdout | Running on public URL: https://6141e84201ce0bb4ed.gradio.live | ||