u #1 (Closed)

wants to merge 158 commits into from

158 commits
14f9c72
Update Supported Model List (#825)
zhuohan123 Aug 22, 2023
eedac9d
fix: revert code to avoid no attribute problem (#827)
HermitSun Aug 22, 2023
a41c204
Add compute capability 8.9 to default targets (#829)
WoosukKwon Aug 22, 2023
d64bf16
Implement approximate GELU kernels (#828)
WoosukKwon Aug 22, 2023
85ebcda
Fix typo of Aquila in README.md (#836)
ftgreat Aug 23, 2023
2a4ec90
Fix for breaking changes in xformers 0.0.21 (#834)
WoosukKwon Aug 23, 2023
75c0ca9
Clean up code (#844)
wenjun93 Aug 23, 2023
94d2f59
Set replacement=True in torch.multinomial (#858)
WoosukKwon Aug 25, 2023
791d79d
Bump up the version to v0.1.4 (#846)
WoosukKwon Aug 25, 2023
4b6f069
Add support for CodeLlama (#854)
Yard1 Aug 25, 2023
d2b2eed
[Fix] Fix a condition for ignored sequences (#867)
zhuohan123 Aug 28, 2023
7547138
use flash-attn via xformers (#877)
tmm1 Aug 30, 2023
becd7a5
Enable request body OpenAPI spec for OpenAI endpoints (#865)
Peilun-Li Aug 30, 2023
0d93f15
Accelerate LLaMA model loading (#234)
JF-D Aug 30, 2023
0080d83
Add acknowledgement to a16z grant
zhuohan123 Aug 30, 2023
28873a2
Improve _prune_hidden_states micro-benchmark (#707)
tmm1 Aug 31, 2023
e112223
fix: bug fix when penalties are negative (#913)
pfldy2850 Aug 31, 2023
55b28b1
[Docs] Minor fixes in supported models (#920)
WoosukKwon Aug 31, 2023
c128d69
Fix README.md Link (#927)
zhuohan123 Sep 1, 2023
32b6816
Add tests for models (#922)
WoosukKwon Sep 1, 2023
8ce9c50
Avoid compiling kernels for double data type (#933)
WoosukKwon Sep 2, 2023
bf87484
[BugFix] Fix NaN errors in paged attention kernel (#936)
WoosukKwon Sep 4, 2023
ce741ba
Refactor AsyncLLMEngine (#880)
Yard1 Sep 4, 2023
e15932b
Only emit warning about internal tokenizer if it isn't being used (#939)
nelson-liu Sep 4, 2023
002800f
Align vLLM's beam search implementation with HF generate (#857)
zhuohan123 Sep 5, 2023
1696725
Initialize AsyncLLMEngine bg loop correctly (#943)
Yard1 Sep 5, 2023
22379d5
fix: typo (#948)
HermitSun Sep 5, 2023
fbd80ad
Clean up kernel unit tests (#938)
WoosukKwon Sep 5, 2023
c9927c1
Use queue for finished requests (#957)
Yard1 Sep 6, 2023
320a622
[BugFix] Implement RoPE for GPT-J (#941)
WoosukKwon Sep 6, 2023
005ba45
Set torch default dtype in a context manager (#971)
Yard1 Sep 7, 2023
7a9c20c
Bum up transformers version (#976)
WoosukKwon Sep 7, 2023
c07ece5
Make `AsyncLLMEngine` more robust & fix batched abort (#969)
Yard1 Sep 7, 2023
c957c74
Enable safetensors loading for all models (#974)
zhuohan123 Sep 7, 2023
db09d4a
[FIX] Fix Alibi implementation in PagedAttention kernel (#945)
zhuohan123 Sep 7, 2023
852ef5b
Bump up the version to v0.1.5 (#944)
WoosukKwon Sep 7, 2023
4b5bcf8
faster startup of vLLM (#982)
ri938 Sep 8, 2023
0804384
Start background task in `AsyncLLMEngine.generate` (#988)
Yard1 Sep 8, 2023
1117aa1
Bump up the version to v0.1.6 (#989)
zhuohan123 Sep 8, 2023
4042d19
fix "tansformers_module" ModuleNotFoundError when load model with `tr…
Jingru Sep 9, 2023
a62de9e
Fix wrong dtype in PagedAttentionWithALiBi bias (#996)
Yard1 Sep 9, 2023
898285c
fix: CUDA error when inferencing with Falcon-40B base model (#992)
kyujin-cho Sep 10, 2023
b9cecc2
[Docs] Update installation page (#1005)
WoosukKwon Sep 10, 2023
d6770d1
Update setup.py (#1006)
WoosukKwon Sep 11, 2023
e67b4f2
Use FP32 in RoPE initialization (#1004)
WoosukKwon Sep 11, 2023
90eb3f4
Bump up the version to v0.1.7 (#1013)
WoosukKwon Sep 11, 2023
d6545ad
add option to shorten prompt print in log (#991)
leiwen83 Sep 12, 2023
0bb1e88
Make `max_model_len` configurable (#972)
Yard1 Sep 12, 2023
3272d7a
Fix typo in README.md (#1033)
eltociear Sep 13, 2023
9841d48
Use TGI-like incremental detokenization (#984)
Yard1 Sep 13, 2023
ab019ee
Add Model Revision Support (#1014)
Sep 13, 2023
f04908c
[FIX] Minor bug fixes (#1035)
zhuohan123 Sep 13, 2023
eda1a7c
Announce paper release (#1036)
WoosukKwon Sep 14, 2023
dd54a4b
Fix detokenization leaving special tokens (#1044)
Yard1 Sep 14, 2023
a589369
Add pandas to requirements.txt (#1047)
WoosukKwon Sep 15, 2023
b5f93d0
Only fail if logit_bias has actual values (#1045)
LLukas22 Sep 15, 2023
64ca424
Fix warning message on LLaMA FastTokenizer (#1037)
WoosukKwon Sep 15, 2023
b9fe461
Abort when coroutine is cancelled (#1020)
rucyang Sep 15, 2023
e3e79e9
Implement AWQ quantization support for LLaMA (#1032)
WoosukKwon Sep 16, 2023
ff36139
Remove AsyncLLMEngine busy loop, shield background task (#1059)
Yard1 Sep 17, 2023
e21d768
Fix hanging when prompt exceeds limit (#1029)
chenxu2048 Sep 17, 2023
90979c3
[FIX] Don't initialize parameter by default (#1067)
zhuohan123 Sep 18, 2023
fbe66e1
added support for quantize on LLM module (#1080)
orellavie1212 Sep 18, 2023
95592fa
align llm_engine and async_engine. (#1081)
esmeetu Sep 18, 2023
f029ef9
Fix get_max_num_running_seqs for waiting and swapped seq groups (#1068)
zhuohan123 Sep 18, 2023
cc796b1
Convert before transpose (#1073)
WoosukKwon Sep 18, 2023
2b1c116
Add minimum capability requirement for AWQ (#1064)
WoosukKwon Sep 18, 2023
c102631
[Community] Add vLLM Discord server (#1086)
zhuohan123 Sep 18, 2023
400b828
Add pyarrow to dependencies & Print warning on Ray import error (#1094)
WoosukKwon Sep 19, 2023
bc06445
Add gpu_memory_utilization and swap_space to LLM (#1090)
WoosukKwon Sep 20, 2023
6f2dd6c
Add documentation to Triton server tutorial (#983)
tanmayv25 Sep 20, 2023
3302f0a
rope_theta and max_position_embeddings from config (#1096)
Yard1 Sep 20, 2023
2ac4d5e
Replace DtypeTensor (#1123)
WoosukKwon Sep 21, 2023
1ac4ccf
Add float16 and float32 (#1115)
WoosukKwon Sep 21, 2023
2d1e86f
clean api code, remove redundant background task. (#1102)
esmeetu Sep 21, 2023
f98b745
feat: support stop_token_ids parameter. (#1097)
gesanqiu Sep 21, 2023
7d7e3b7
Use `--ipc=host` in docker run for distributed inference (#1125)
WoosukKwon Sep 22, 2023
4ee52bb
Docs: Fix broken link to openai example (#1145)
nkpz Sep 22, 2023
8d926e9
Announce the First vLLM Meetup (#1148)
WoosukKwon Sep 22, 2023
947b794
[Sampler] Vectorized sampling (simplified) (#1048)
zhuohan123 Sep 23, 2023
f187877
[FIX] Simplify sampler logic (#1156)
zhuohan123 Sep 24, 2023
9f6be86
Fix config for Falcon (#1164)
WoosukKwon Sep 24, 2023
bbbf865
Align `max_tokens` behavior with openai (#852)
HermitSun Sep 24, 2023
a425bd9
[Setup] Enable `TORCH_CUDA_ARCH_LIST` for selecting target GPUs (#1074)
WoosukKwon Sep 26, 2023
03ffd0a
Add comments on RoPE initialization (#1176)
WoosukKwon Sep 26, 2023
cf5cb1e
Allocate more shared memory to attention kernel (#1154)
Yard1 Sep 27, 2023
21877b0
Support Longchat and RoPE scaling (#555)
LiuXiaoxuanPKU Sep 27, 2023
30e7752
fix typo (#1184)
WrRan Sep 27, 2023
28e616c
fix qwen-14b model (#1173)
Sanster Sep 27, 2023
a19bc5c
Automatically configure `max_num_batched_tokens` (#1198)
WoosukKwon Sep 27, 2023
649aa73
Use standard extras for uvicorn (#1166)
danilopeixoto Sep 28, 2023
20f7cc4
Add `skip_special_tokens` sampling params (#1186)
blahblahasdf Sep 28, 2023
7bedab5
Add rope_scaling to Qwen (#1210)
Sanster Sep 28, 2023
bb1ba58
[Mistral] Mistral-7B-v0.1 support (#1196)
Bam4d Sep 28, 2023
a8e98ae
Fix Mistral model (#1220)
WoosukKwon Sep 28, 2023
2e8e49f
[Fix] Remove false assertion (#1222)
WoosukKwon Sep 28, 2023
202351d
Add Mistral to supported model list (#1221)
WoosukKwon Sep 28, 2023
6f88f76
Fix OOM in attention kernel test (#1223)
WoosukKwon Sep 28, 2023
f936657
Provide default max model length (#1224)
WoosukKwon Sep 28, 2023
e2fb71e
Bump up the version to v0.2.0 (#1212)
WoosukKwon Sep 28, 2023
0967102
fixing typo in `tiiuae/falcon-rw-7b` model name (#1226)
0ssamaak0 Sep 29, 2023
b5a10eb
Added `dtype` arg to benchmarks (#1228)
kg6-sleipnir Oct 1, 2023
ebe4d1d
Fix boundary check in paged attention kernel (#1241)
soundOfDestiny Oct 1, 2023
a60b353
support sharding llama2-70b on more than 8 GPUs (#1209)
zhuohan123 Oct 2, 2023
84e4e37
[Minor] Fix type annotations (#1238)
WoosukKwon Oct 2, 2023
ba0bfd4
TP/quantization/weight loading refactor part 1 - Simplify parallel li…
zhuohan123 Oct 2, 2023
66d18a7
add support for tokenizer revision (#1163)
cassanof Oct 3, 2023
acbed3e
Use monotonic time where appropriate (#1249)
Yard1 Oct 3, 2023
09ff7f1
API server support ipv4 / ipv6 dualstack (#1288)
yunfeng-scale Oct 7, 2023
ee92b58
Move bfloat16 check to worker (#1259)
Yard1 Oct 8, 2023
6b5296a
[FIX] Explain why the finished_reason of ignored sequences are length…
zhuohan123 Oct 8, 2023
9eed4d1
Update README.md (#1292)
zhuohan123 Oct 9, 2023
b95ee89
[Minor] Fix comment in mistral.py (#1303)
zhuohan123 Oct 10, 2023
6a61195
lock torch version to 2.0.1 (#1290)
yanxiyue Oct 10, 2023
ac5cf86
Fix `__repr__` of `SequenceOutputs` (#1311)
WrRan Oct 10, 2023
91fce82
change the timing of sorting logits (#1309)
yhlskt23 Oct 11, 2023
8285736
workaround of AWQ for Turing GPUs (#1252)
twaka Oct 11, 2023
980dd4a
Fix overflow in awq kernel (#1295)
chu-tianxiang Oct 11, 2023
ee8217e
Add Mistral to quantization model list (#1278)
AmaleshV Oct 11, 2023
875afe3
Add blacklist in model checkpoint (#1325)
WoosukKwon Oct 12, 2023
6368e77
Add Aquila2 to README (#1331)
ftgreat Oct 12, 2023
ec3b5ce
Improve detokenization performance (#1338)
Yard1 Oct 13, 2023
e7c8555
Bump up transformers version & Remove MistralConfig (#1254)
WoosukKwon Oct 13, 2023
de89472
Fix the issue for AquilaChat2-* models (#1339)
lu-wang-dl Oct 13, 2023
d0740df
Fix error message on `TORCH_CUDA_ARCH_LIST` (#1239)
WoosukKwon Oct 14, 2023
29678cd
Minor fix on AWQ kernel launch (#1356)
WoosukKwon Oct 16, 2023
928de46
Implement PagedAttention V2 (#1348)
WoosukKwon Oct 16, 2023
9d9072a
Implement prompt logprobs & Batched topk for computing logprobs (#1328)
zhuohan123 Oct 16, 2023
348897a
Fix PyTorch version to 2.0.1 in workflow (#1377)
WoosukKwon Oct 16, 2023
e8ef4c0
Fix PyTorch index URL in workflow (#1378)
WoosukKwon Oct 16, 2023
d3a5bd9
Fix sampler test (#1379)
WoosukKwon Oct 16, 2023
651c614
Bump up the version to v0.2.1 (#1355)
zhuohan123 Oct 16, 2023
c1376e0
Change scheduler & input tensor shape (#1381)
WoosukKwon Oct 17, 2023
9524867
Add Mistral 7B to `test_models` (#1366)
WoosukKwon Oct 17, 2023
a132435
Fix typo (#1383)
WrRan Oct 17, 2023
f8a1e39
[BugFix] Define `__eq__` in SequenceGroupOutputs (#1389)
WoosukKwon Oct 17, 2023
f61dc80
Fix type hints (#1427)
lxrite Oct 20, 2023
d189170
remove useless statements (#1408)
WrRan Oct 20, 2023
bf31d36
Pin pydantic dependency versions (#1429)
thiagosalvatore Oct 21, 2023
1f24755
Support SqueezeLLM (#1326)
chooper1 Oct 22, 2023
28b47d1
Add rope_scaling to Aquila model (#1457)
Sanster Oct 29, 2023
beac8dd
fix: don't skip first special token. (#1497)
gesanqiu Oct 29, 2023
69be658
Support repetition_penalty (#1424)
beginlner Oct 29, 2023
aa9af07
Fix bias in InternLM (#1501)
WoosukKwon Oct 29, 2023
15f5632
Delay GPU->CPU sync in sampling (#1337)
Yard1 Oct 30, 2023
ac8d36f
Refactor LLMEngine demo script for clarity and modularity (#1413)
iongpt Oct 30, 2023
2f3d36a
Fix logging so we actually get info level entries in the log. (#1494)
Tostino Oct 30, 2023
79a3091
Add py.typed so consumers of vLLM can get type checking (#1509)
jroesch Oct 30, 2023
7013a80
Add support for `spaces_between_special_tokens`
blahblahasdf Oct 30, 2023
7b895c5
[Fix] Fix duplicated logging messages (#1524)
zhuohan123 Oct 31, 2023
9cabcb7
Add Dockerfile (#1350)
skrider Oct 31, 2023
0ce8647
Fix integer overflows in attention & cache ops (#1514)
WoosukKwon Oct 31, 2023
e575df3
[Small] Formatter only checks lints in changed files (#1528)
cadedaniel Oct 31, 2023
cf8849f
Add `MptForCausalLM` key in model_loader (#1526)
wenfeiy-db Oct 31, 2023
5687d58
[BugFix] Set engine_use_ray=True when TP>1 (#1531)
beginlner Nov 1, 2023
7e90a2d
Add `/health` Endpoint for both Servers (#1540)
Fluder-Paradyne Nov 1, 2023
1fe0990
Remove `MPTConfig` (#1529)
WoosukKwon Nov 1, 2023
9738b84
Force paged attention v2 for long contexts (#1510)
Yard1 Nov 1, 2023
5 changes: 3 additions & 2 deletions .github/workflows/publish.yml
@@ -49,6 +49,7 @@ jobs:
matrix:
os: ['ubuntu-20.04']
python-version: ['3.8', '3.9', '3.10', '3.11']
pytorch-version: ['2.0.1']
cuda-version: ['11.8'] # Github runner can't build anything older than 11.8

steps:
@@ -69,9 +70,9 @@
run: |
bash -x .github/workflows/scripts/cuda-install.sh ${{ matrix.cuda-version }} ${{ matrix.os }}

- name: Install PyTorch-cu${{ matrix.cuda-version }}
- name: Install PyTorch ${{ matrix.pytorch-version }} with CUDA ${{ matrix.cuda-version }}
run: |
bash -x .github/workflows/scripts/pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.cuda-version }}
bash -x .github/workflows/scripts/pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.pytorch-version }} ${{ matrix.cuda-version }}

- name: Build wheel
shell: bash
2 changes: 1 addition & 1 deletion .github/workflows/pylint.yml
@@ -28,4 +28,4 @@ jobs:
pip install pylint==2.8.2
- name: Analysing the code with pylint
run: |
pylint vllm
pylint vllm tests
5 changes: 3 additions & 2 deletions .github/workflows/scripts/pytorch-install.sh
@@ -1,11 +1,12 @@
#!/bin/bash

python_executable=python$1
cuda_version=$2
pytorch_version=$2
cuda_version=$3

# Install torch
$python_executable -m pip install numpy pyyaml scipy ipython mkl mkl-include ninja cython typing pandas typing-extensions dataclasses setuptools && conda clean -ya
$python_executable -m pip install torch -f https://download.pytorch.org/whl/cu${cuda_version//./}/torch_stable.html
$python_executable -m pip install torch==${pytorch_version}+cu${cuda_version//./} --extra-index-url https://download.pytorch.org/whl/cu${cuda_version//./}

# Print version information
$python_executable --version
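
For reference, the script now takes the PyTorch version as its second positional argument, matching the updated workflow step above. A minimal local invocation, using values from the workflow matrix (the specific combination shown is illustrative):

```bash
# Arguments: <python-version> <pytorch-version> <cuda-version>
# Mirrors the workflow step:
#   pytorch-install.sh ${{ matrix.python-version }} ${{ matrix.pytorch-version }} ${{ matrix.cuda-version }}
bash -x .github/workflows/scripts/pytorch-install.sh 3.10 2.0.1 11.8
```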
2 changes: 1 addition & 1 deletion .github/workflows/yapf.yml
@@ -28,4 +28,4 @@ jobs:
pip install toml==0.10.2
- name: Running yapf
run: |
yapf --diff --recursive vllm --exclude 'vllm/model_executor/parallel_utils/**'
yapf --diff --recursive vllm tests
4 changes: 4 additions & 0 deletions .gitignore
@@ -173,3 +173,7 @@ cython_debug/

# Sphinx documentation
_build/

# vim swap files
*.swo
*.swp
2 changes: 1 addition & 1 deletion .pylintrc
@@ -8,7 +8,7 @@
[MASTER]

# Files or directories to be skipped. They should be base names, not paths.
ignore=docs,parallel_utils
ignore=docs

# Files or directories matching the regex patterns are skipped. The regex
# matches against base names, not paths.
72 changes: 72 additions & 0 deletions Dockerfile
@@ -0,0 +1,72 @@
FROM nvidia/cuda:11.8.0-devel-ubuntu22.04 AS dev

RUN apt-get update -y \
&& apt-get install -y python3-pip

WORKDIR /workspace

# install build and runtime dependencies
COPY requirements.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt

# install development dependencies
COPY requirements-dev.txt requirements-dev.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements-dev.txt

# image to build pytorch extensions
FROM dev AS build

# copy input files
COPY csrc csrc
COPY setup.py setup.py
COPY requirements.txt requirements.txt
COPY pyproject.toml pyproject.toml
COPY vllm/__init__.py vllm/__init__.py

# max jobs used by Ninja to build extensions
ENV MAX_JOBS=$max_jobs
RUN python3 setup.py build_ext --inplace

# image to run unit testing suite
FROM dev AS test

# copy pytorch extensions separately to avoid having to rebuild
# when python code changes
COPY --from=build /workspace/vllm/*.so /workspace/vllm/
COPY tests tests
COPY vllm vllm

ENTRYPOINT ["python3", "-m", "pytest", "tests"]

# use CUDA base as CUDA runtime dependencies are already installed via pip
FROM nvidia/cuda:11.8.0-base-ubuntu22.04 AS vllm-base

# libnccl required for ray
RUN apt-get update -y \
&& apt-get install -y python3-pip

WORKDIR /workspace
COPY requirements.txt requirements.txt
RUN --mount=type=cache,target=/root/.cache/pip \
pip install -r requirements.txt

FROM vllm-base AS vllm
COPY --from=build /workspace/vllm/*.so /workspace/vllm/
COPY vllm vllm

EXPOSE 8000
ENTRYPOINT ["python3", "-m", "vllm.entrypoints.api_server"]

# openai api server alternative
FROM vllm-base AS vllm-openai
# install additional dependencies for openai api server
RUN --mount=type=cache,target=/root/.cache/pip \
pip install accelerate fschat

COPY --from=build /workspace/vllm/*.so /workspace/vllm/
COPY vllm vllm

ENTRYPOINT ["python3", "-m", "vllm.entrypoints.openai.api_server"]

56 changes: 23 additions & 33 deletions README.md
@@ -10,13 +10,17 @@ Easy, fast, and cheap LLM serving for everyone
</h3>

<p align="center">
| <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://github.com/vllm-project/vllm/discussions"><b>Discussions</b></a> |
| <a href="https://vllm.readthedocs.io/en/latest/"><b>Documentation</b></a> | <a href="https://vllm.ai"><b>Blog</b></a> | <a href="https://arxiv.org/abs/2309.06180"><b>Paper</b></a> | <a href="https://discord.gg/jz7wjKhh6g"><b>Discord</b></a> |

</p>

---

*Latest News* 🔥
- [2023/10] We hosted [the first vLLM meetup](https://lu.ma/first-vllm-meetup) in SF! Please find the meetup slides [here](https://docs.google.com/presentation/d/1QL-XPFXiFpDBh86DbEegFXBXFXjix4v032GhShbKf3s/edit?usp=sharing).
- [2023/09] We created our [Discord server](https://discord.gg/jz7wjKhh6g)! Join us to discuss vLLM and LLM serving! We will also post the latest announcements and updates there.
- [2023/09] We released our [PagedAttention paper](https://arxiv.org/abs/2309.06180) on arXiv!
- [2023/08] We would like to express our sincere gratitude to [Andreessen Horowitz](https://a16z.com/2023/08/30/supporting-the-open-source-ai-community/) (a16z) for providing a generous grant to support the open-source development and research of vLLM.
- [2023/07] Added support for LLaMA-2! You can run and serve 7B/13B/70B LLaMA-2s on vLLM with a single command!
- [2023/06] Serving vLLM On any Cloud with SkyPilot. Check out a 1-click [example](https://github.com/skypilot-org/skypilot/blob/master/llm/vllm) to start the vLLM demo, and the [blog post](https://blog.skypilot.co/serving-llm-24x-faster-on-the-cloud-with-vllm-and-skypilot/) for the story behind vLLM development on the clouds.
- [2023/06] We officially released vLLM! FastChat-vLLM integration has powered [LMSYS Vicuna and Chatbot Arena](https://chat.lmsys.org) since mid-April. Check out our [blog post](https://vllm.ai).
@@ -34,24 +38,28 @@ vLLM is fast with:

vLLM is flexible and easy to use with:

- Seamless integration with popular HuggingFace models
- Seamless integration with popular Hugging Face models
- High-throughput serving with various decoding algorithms, including *parallel sampling*, *beam search*, and more
- Tensor parallelism support for distributed inference
- Streaming outputs
- OpenAI-compatible API server

vLLM seamlessly supports many Huggingface models, including the following architectures:
vLLM seamlessly supports many Hugging Face models, including the following architectures:

- Aquila & Aquila2 (`BAAI/AquilaChat2-7B`, `BAAI/AquilaChat2-34B`, `BAAI/Aquila-7B`, `BAAI/AquilaChat-7B`, etc.)
- Baichuan (`baichuan-inc/Baichuan-7B`, `baichuan-inc/Baichuan-13B-Chat`, etc.)
- BLOOM (`bigscience/bloom`, `bigscience/bloomz`, etc.)
- Falcon (`tiiuae/falcon-7b`, `tiiuae/falcon-40b`, `tiiuae/falcon-rw-7b`, etc.)
- GPT-2 (`gpt2`, `gpt2-xl`, etc.)
- GPT BigCode (`bigcode/starcoder`, `bigcode/gpt_bigcode-santacoder`, etc.)
- GPT-J (`EleutherAI/gpt-j-6b`, `nomic-ai/gpt4all-j`, etc.)
- GPT-NeoX (`EleutherAI/gpt-neox-20b`, `databricks/dolly-v2-12b`, `stabilityai/stablelm-tuned-alpha-7b`, etc.)
- InternLM (`internlm/internlm-7b`, `internlm/internlm-chat-7b`, etc.)
- LLaMA & LLaMA-2 (`meta-llama/Llama-2-70b-hf`, `lmsys/vicuna-13b-v1.3`, `young-geng/koala`, `openlm-research/open_llama_13b`, etc.)
- Mistral (`mistralai/Mistral-7B-v0.1`, `mistralai/Mistral-7B-Instruct-v0.1`, etc.)
- MPT (`mosaicml/mpt-7b`, `mosaicml/mpt-30b`, etc.)
- OPT (`facebook/opt-66b`, `facebook/opt-iml-max-30b`, etc.)
- Qwen (`Qwen/Qwen-7B`, `Qwen/Qwen-7B-Chat`, etc.)

Install vLLM with pip or [from source](https://vllm.readthedocs.io/en/latest/getting_started/installation.html#build-from-source):

@@ -66,37 +74,19 @@ Visit our [documentation](https://vllm.readthedocs.io/en/latest/) to get started.
- [Quickstart](https://vllm.readthedocs.io/en/latest/getting_started/quickstart.html)
- [Supported Models](https://vllm.readthedocs.io/en/latest/models/supported_models.html)

## Performance

vLLM outperforms HuggingFace Transformers (HF) by up to 24x and Text Generation Inference (TGI) by up to 3.5x, in terms of throughput.
For details, check out our [blog post](https://vllm.ai).

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n1_dark.png">
<img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n1_light.png" width="45%">
</picture>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n1_dark.png">
<img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n1_light.png" width="45%">
</picture>
<br>
<em> Serving throughput when each request asks for 1 output completion. </em>
</p>

<p align="center">
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n3_dark.png">
<img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a10g_n3_light.png" width="45%">
</picture>
<picture>
<source media="(prefers-color-scheme: dark)" srcset="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n3_dark.png">
<img src="https://raw.githubusercontent.com/vllm-project/vllm/main/docs/source/assets/figures/perf_a100_n3_light.png" width="45%">
</picture> <br>
<em> Serving throughput when each request asks for 3 output completions. </em>
</p>

## Contributing

We welcome and value any contributions and collaborations.
Please check out [CONTRIBUTING.md](./CONTRIBUTING.md) for how to get involved.

## Citation

If you use vLLM for your research, please cite our [paper](https://arxiv.org/abs/2309.06180):
```bibtex
@inproceedings{kwon2023efficient,
title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
year={2023}
}
```
32 changes: 26 additions & 6 deletions benchmarks/benchmark_latency.py
@@ -18,10 +18,12 @@ def main(args: argparse.Namespace):
llm = LLM(
model=args.model,
tokenizer=args.tokenizer,
quantization=args.quantization,
tensor_parallel_size=args.tensor_parallel_size,
max_num_seqs=args.batch_size,
max_num_batched_tokens=args.batch_size * args.input_len,
trust_remote_code=args.trust_remote_code,
dtype=args.dtype,
)

sampling_params = SamplingParams(
@@ -38,13 +40,13 @@ def run_to_completion(profile: bool = False):
def run_to_completion(profile: bool = False):
if profile:
torch.cuda.cudart().cudaProfilerStart()
start_time = time.time()
start_time = time.perf_counter()

llm.generate(prompt_token_ids=dummy_prompt_token_ids,
sampling_params=sampling_params,
use_tqdm=False)

end_time = time.time()
end_time = time.perf_counter()
latency = end_time - start_time
if profile:
torch.cuda.cudart().cudaProfilerStop()
@@ -63,19 +65,37 @@ def run_to_completion(profile: bool = False):
if __name__ == '__main__':
parser = argparse.ArgumentParser(
description='Benchmark the latency of processing a single batch of '
'requests till completion.')
'requests till completion.')
parser.add_argument('--model', type=str, default='facebook/opt-125m')
parser.add_argument('--tokenizer', type=str, default=None)
parser.add_argument('--quantization',
'-q',
choices=['awq', 'squeezellm', None],
default=None)
parser.add_argument('--tensor-parallel-size', '-tp', type=int, default=1)
parser.add_argument('--input-len', type=int, default=32)
parser.add_argument('--output-len', type=int, default=128)
parser.add_argument('--batch-size', type=int, default=8)
parser.add_argument('--n', type=int, default=1,
parser.add_argument('--n',
type=int,
default=1,
help='Number of generated sequences per prompt.')
parser.add_argument('--use-beam-search', action='store_true')
parser.add_argument('--num-iters', type=int, default=3,
parser.add_argument('--num-iters',
type=int,
default=3,
help='Number of iterations to run.')
parser.add_argument('--trust-remote-code', action='store_true',
parser.add_argument('--trust-remote-code',
action='store_true',
help='trust remote code from huggingface')
parser.add_argument(
'--dtype',
type=str,
default='auto',
choices=['auto', 'half', 'float16', 'bfloat16', 'float', 'float32'],
help='data type for model weights and activations. '
'The "auto" option will use FP16 precision '
'for FP32 and FP16 models, and BF16 precision '
'for BF16 models.')
args = parser.parse_args()
main(args)
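
For reference, a couple of illustrative invocations exercising the newly added `--quantization`/`-q` and `--dtype` flags. The AWQ example assumes a checkpoint that was actually quantized with AWQ; the angle-bracket name is a placeholder, not a real model ID:

```bash
# Latency benchmark with an explicit dtype on the default model
python3 benchmarks/benchmark_latency.py --model facebook/opt-125m --dtype half \
    --batch-size 8 --input-len 32 --output-len 128

# Same benchmark against an AWQ-quantized checkpoint (placeholder model name)
python3 benchmarks/benchmark_latency.py --model <awq-quantized-model> -q awq
```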
8 changes: 4 additions & 4 deletions benchmarks/benchmark_serving.py
@@ -105,7 +105,7 @@ async def send_request(
best_of: int,
use_beam_search: bool,
) -> None:
request_start_time = time.time()
request_start_time = time.perf_counter()

headers = {"User-Agent": "Benchmark Client"}
if backend == "vllm":
@@ -148,7 +148,7 @@ async def send_request(
if "error" not in output:
break

request_end_time = time.time()
request_end_time = time.perf_counter()
request_latency = request_end_time - request_start_time
REQUEST_LATENCY.append((prompt_len, output_len, request_latency))

@@ -180,10 +180,10 @@ def main(args: argparse.Namespace):
tokenizer = get_tokenizer(args.tokenizer, trust_remote_code=args.trust_remote_code)
input_requests = sample_requests(args.dataset, args.num_prompts, tokenizer)

benchmark_start_time = time.time()
benchmark_start_time = time.perf_counter()
asyncio.run(benchmark(args.backend, api_url, input_requests, args.best_of,
args.use_beam_search, args.request_rate))
benchmark_end_time = time.time()
benchmark_end_time = time.perf_counter()
benchmark_time = benchmark_end_time - benchmark_start_time
print(f"Total time: {benchmark_time:.2f} s")
print(f"Throughput: {args.num_prompts / benchmark_time:.2f} requests/s")