Sync with [email protected] #120

Closed
wants to merge 168 commits
Commits (168):
0eb0757
[Misc] Add ignored layers for `fp8` quantization (#6657)
mgoin Jul 23, 2024
58f5303
[Frontend] Add Usage data in each chunk for chat_serving. #6540 (#6652)
yecohn Jul 23, 2024
507ef78
[Model] Pipeline Parallel Support for DeepSeek v2 (#6519)
tjohnson31415 Jul 23, 2024
1bedf21
Bump `transformers` version for Llama 3.1 hotfix and patch Chameleon …
ywang96 Jul 23, 2024
72fc704
[build] relax wheel size limit (#6704)
youkaichao Jul 23, 2024
01c16ed
[CI] Add smoke test for non-uniform AutoFP8 quantization (#6702)
mgoin Jul 23, 2024
2f808e6
[Bugfix] StatLoggers: cache spec decode metrics when they get collect…
tdoublep Jul 23, 2024
87525fa
[bitsandbytes]: support read bnb pre-quantized model (#5753)
thesues Jul 23, 2024
5e8ca97
[Bugfix] fix flashinfer cudagraph capture for PP (#6708)
SolitaryThinker Jul 24, 2024
c882a7f
[SpecDecoding] Update MLPSpeculator CI tests to use smaller model (#6…
njhill Jul 24, 2024
0a740a1
[Bugfix] Fix token padding for chameleon (#6724)
ywang96 Jul 24, 2024
ccc4a73
[Docs][ROCm] Detailed instructions to build from source (#6680)
WoosukKwon Jul 24, 2024
b570811
[Build/CI] Update run-amd-test.sh. Enable Docker Hub login. (#6711)
Alexei-V-Ivanov-AMD Jul 24, 2024
f4f8a9d
[Bugfix]fix modelscope compatible issue (#6730)
liuyhwangyh Jul 24, 2024
5451463
Adding f-string to validation error which is missing (#6748)
luizanao Jul 24, 2024
2cf0df3
[Bugfix] Fix speculative decode seeded test (#6743)
njhill Jul 24, 2024
40468b1
[Bugfix] Miscalculated latency lead to time_to_first_token_seconds in…
AllenDou Jul 24, 2024
ee81258
[Frontend] split run_server into build_server and run_server (#6740)
dtrifiro Jul 24, 2024
0e63494
Add fp8 support to `reshape_and_cache_flash` (#6667)
Yard1 Jul 24, 2024
5448f67
[Core] Tweaks to model runner/input builder developer APIs (#6712)
Yard1 Jul 24, 2024
421e218
[Bugfix] Bump transformers to 4.43.2 (#6752)
mgoin Jul 24, 2024
d88c458
[Doc][AMD][ROCm]Added tips to refer to mi300x tuning guide for mi300x…
hongxiayang Jul 24, 2024
740374d
[core][distributed] fix zmq hang (#6759)
youkaichao Jul 25, 2024
5689e25
[Frontend] Represent tokens with identifiable strings (#6626)
ezliu Jul 25, 2024
9e169a4
[Model] Adding support for MiniCPM-V (#4087)
HwwwwwwwH Jul 25, 2024
309aaef
[Bugfix] Fix decode tokens w. CUDA graph (#6757)
comaniac Jul 25, 2024
0310029
[Bugfix] Fix awq_marlin and gptq_marlin flags (#6745)
alexm-neuralmagic Jul 25, 2024
316a41a
[Bugfix] Fix encoding_format in examples/openai_embedding_client.py (…
CatherineSue Jul 25, 2024
b75e314
[Bugfix] Add image placeholder for OpenAI Compatible Server of MiniCP…
HwwwwwwwH Jul 25, 2024
889da13
[ Misc ] `fp8-marlin` channelwise via `compressed-tensors` (#6524)
robertgshaw2-neuralmagic Jul 25, 2024
65b1f12
[Bugfix] Fix `kv_cache_dtype=fp8` without scales for FP8 checkpoints …
mgoin Jul 25, 2024
95db75d
[Bugfix] Add synchronize to prevent possible data race (#6788)
tlrmchlsmth Jul 25, 2024
6a1e25b
[Doc] Add documentations for nightly benchmarks (#6412)
KuntaiDu Jul 25, 2024
cd7edc4
[Bugfix] Fix empty (nullptr) channelwise scales when loading wNa16 u…
LucasWilkinson Jul 25, 2024
f3ff63c
[doc][distributed] improve multinode serving doc (#6804)
youkaichao Jul 25, 2024
b7215de
[Docs] Publish 5th meetup slides (#6799)
WoosukKwon Jul 25, 2024
1adddb1
[Core] Fix ray forward_dag error mssg (#6792)
rkooo567 Jul 25, 2024
443c7cf
[ci][distributed] fix flaky tests (#6806)
youkaichao Jul 26, 2024
2eb9f4f
[ci] Mark tensorizer as soft fail and separate from grouped test (#6810)
khluu Jul 26, 2024
062a1d0
Fix ReplicatedLinear weight loading (#6793)
qingquansong Jul 26, 2024
084a01f
[Bugfix] [Easy] Fixed a bug in the multiprocessing GPU executor. (#6770)
eaplatanios Jul 26, 2024
89a84b0
[Core] Use array to speedup padding (#6779)
peng1999 Jul 26, 2024
85ad7e2
[doc][debugging] add known issues for hangs (#6816)
youkaichao Jul 26, 2024
07278c3
[Model] Support Nemotron models (Nemotron-3, Nemotron-4, Minitron) (#…
mgoin Jul 26, 2024
50704f5
[Bugfix][Kernel] Promote another index to int64_t (#6838)
tlrmchlsmth Jul 26, 2024
71734f1
[Build/CI][ROCm] Minor simplification to Dockerfile.rocm (#6811)
WoosukKwon Jul 26, 2024
aa48677
[Misc][TPU] Support TPU in initialize_ray_cluster (#6812)
WoosukKwon Jul 26, 2024
3bbb493
[Hardware] [Intel] Enable Multiprocessing and tensor parallel in CPU …
bigPYJ1151 Jul 26, 2024
281977b
[Doc] Add Nemotron to supported model docs (#6843)
mgoin Jul 26, 2024
150a1ff
[Doc] Update SkyPilot doc for wrong indents and instructions for upda…
Michaelvll Jul 26, 2024
b5f49ee
Update README.md (#6847)
gurpreet-dhami Jul 27, 2024
bb54946
enforce eager mode with bnb quantization temporarily (#6846)
chenqianfzh Jul 27, 2024
d09b94c
[TPU] Support collective communications in XLA devices (#6813)
WoosukKwon Jul 27, 2024
981b0d5
[Frontend] Factor out code for running uvicorn (#6828)
DarkLight1337 Jul 27, 2024
5571294
[Bug Fix] Illegal memory access, FP8 Llama 3.1 405b (#6852)
LucasWilkinson Jul 27, 2024
969d032
[Bugfix]: Fix Tensorizer test failures (#6835)
sangstar Jul 27, 2024
ced36cd
[ROCm] Upgrade PyTorch nightly version (#6845)
WoosukKwon Jul 27, 2024
3c30123
[Doc] add VLLM_TARGET_DEVICE=neuron to documentation for neuron (#6844)
omrishiv Jul 27, 2024
ed94e4f
[Bugfix][Model] Jamba assertions and no chunked prefill by default fo…
tomeras91 Jul 27, 2024
14dbd5a
[Model] H2O Danube3-4b (#6451)
g-eoj Jul 27, 2024
52f07e3
[Hardware][TPU] Implement tensor parallelism with Ray (#5871)
WoosukKwon Jul 27, 2024
c53041a
[Doc] Add missing mock import to docs `conf.py` (#6834)
hmellor Jul 27, 2024
593e79e
[Bugfix] torch.set_num_threads() in multiproc_gpu_executor (#6802)
tjohnson31415 Jul 27, 2024
aa46953
[Misc][VLM][Doc] Consolidate offline examples for vision language mod…
ywang96 Jul 27, 2024
925de97
[Bugfix] Fix VLM example typo (#6859)
ywang96 Jul 27, 2024
a57d758
[bugfix] make args.stream work (#6831)
WrRan Jul 27, 2024
ecb33a2
[CI/Build][Doc] Update CI and Doc for VLM example changes (#6860)
ywang96 Jul 27, 2024
1ad86ac
[Model] Initial support for BLIP-2 (#5920)
DarkLight1337 Jul 27, 2024
f954d07
[Docs] Add RunLLM chat widget (#6857)
cw75 Jul 27, 2024
fad5576
[TPU] Reduce compilation time & Upgrade PyTorch XLA version (#6856)
WoosukKwon Jul 27, 2024
75acdaa
[Kernel] Increase precision of GPTQ/AWQ Marlin kernel (#6795)
alexm-neuralmagic Jul 27, 2024
b1366a9
Add Nemotron to PP_SUPPORTED_MODELS (#6863)
mgoin Jul 27, 2024
3eeb148
[Misc] Pass cutlass_fp8_supported correctly in fbgemm_fp8 (#6871)
zeyugao Jul 28, 2024
7cbd9ec
[Model] Initialize support for InternVL2 series models (#6514)
Isotr0py Jul 29, 2024
766435e
[Kernel] Tuned FP8 Kernels for Ada Lovelace (#6677)
varun-sundar-rabindranath Jul 29, 2024
db9e570
[Core] Reduce unnecessary compute when logprobs=None (#6532)
peng1999 Jul 29, 2024
60d1c6e
[Kernel] Fix deprecation function warnings squeezellm quant_cuda_kern…
tlrmchlsmth Jul 29, 2024
7f8d612
[TPU] Support tensor parallelism in async llm engine (#6891)
etwk Jul 29, 2024
9a7e2d0
[Bugfix] Allow vllm to still work if triton is not installed. (#6786)
tdoublep Jul 29, 2024
9f69d82
[Frontend] New `allowed_token_ids` decoding request parameter (#6753)
njhill Jul 29, 2024
aae6d36
[Kernel] Remove unused variables in awq/gemm_kernels.cu (#6908)
tlrmchlsmth Jul 30, 2024
4fbf4aa
[ci] GHA workflow to remove ready label upon "/notready" comment (#6921)
khluu Jul 30, 2024
61a97c3
[Kernel] Fix marlin divide-by-zero warnings (#6904)
tlrmchlsmth Jul 30, 2024
af647fb
[Kernel] Tuned int8 kernels for Ada Lovelace (#6848)
varun-sundar-rabindranath Jul 30, 2024
6e063ea
[TPU] Fix greedy decoding (#6933)
WoosukKwon Jul 30, 2024
c66c7f8
[Bugfix] Fix PaliGemma MMP (#6930)
ywang96 Jul 30, 2024
f058403
[Doc] Super tiny fix doc typo (#6949)
fzyzcjy Jul 30, 2024
5cf9254
[BugFix] Fix use of per-request seed with pipeline parallel (#6698)
njhill Jul 30, 2024
cbbc904
[Kernel] Squash a few more warnings (#6914)
tlrmchlsmth Jul 30, 2024
5895b24
[OpenVINO] Updated OpenVINO requirements and build docs (#6948)
ilya-lavrenov Jul 30, 2024
052b6f8
[Bugfix] Fix tensorizer memory profiling bug during testing (#6881)
sangstar Jul 30, 2024
d7a299e
[Kernel] Remove scaled_fp8_quant kernel padding footgun (#6842)
tlrmchlsmth Jul 30, 2024
6ca8031
[core][misc] improve free_finished_seq_groups (#6865)
youkaichao Jul 30, 2024
40c27a7
[Build] Temporarily Disable Kernels and LoRA tests (#6961)
simon-mo Jul 30, 2024
79319ce
[Nightly benchmarking suite] Remove pkill python from run benchmark s…
cadedaniel Jul 30, 2024
fb4f530
[CI] [nightly benchmark] Do not re-download sharegpt dataset if exist…
cadedaniel Jul 30, 2024
c32ab8b
[Speculative decoding] Add serving benchmark for llama3 70b + specula…
cadedaniel Jul 31, 2024
da1f7cc
[mypy] Enable following imports for some directories (#6681)
DarkLight1337 Jul 31, 2024
f230cc2
[Bugfix] Fix broadcasting logic for `multi_modal_kwargs` (#6836)
DarkLight1337 Jul 31, 2024
9f0e69b
[CI/Build] Fix mypy errors (#6968)
DarkLight1337 Jul 31, 2024
533d193
[Bugfix][TPU] Set readonly=True for non-root devices (#6980)
WoosukKwon Jul 31, 2024
c0644cf
[Bugfix] fix logit processor excceed vocab size issue (#6927)
FeiDeng Jul 31, 2024
6512937
Support W4A8 quantization for vllm (#5218)
HandH1998 Jul 31, 2024
2f4e108
[Bugfix] Clean up MiniCPM-V (#6939)
HwwwwwwwH Jul 31, 2024
daed30c
[Bugfix] Fix feature size calculation for LLaVA-NeXT (#6982)
DarkLight1337 Jul 31, 2024
2ee8d3b
[Model] use FusedMoE layer in Jamba (#6935)
avshalomman Jul 31, 2024
bd70013
[MISC] Introduce pipeline parallelism partition strategies (#6920)
comaniac Jul 31, 2024
460c188
[Bugfix] Support cpu offloading with fp8 quantization (#6960)
mgoin Jul 31, 2024
93548eb
[Kernel] Enable FP8 Cutlass for Ada Lovelace (#6950)
varun-sundar-rabindranath Jul 31, 2024
35e9c12
[Kernel] Tuned int8 Cutlass Kernels for SM75 (T4) (#6996)
varun-sundar-rabindranath Jul 31, 2024
a0dce93
[Misc] Add compressed-tensors to optimized quant list (#7006)
mgoin Jul 31, 2024
7eb0cb4
Revert "[Frontend] Factor out code for running uvicorn" (#7012)
simon-mo Jul 31, 2024
7ecee34
[Kernel][RFC] Refactor the punica kernel based on Triton (#5036)
jeejeelee Aug 1, 2024
1d2e7fb
[Model] Pipeline parallel support for Qwen2 (#6924)
xuyi Aug 1, 2024
23993a7
[Bugfix][TPU] Do not use torch.Generator for TPUs (#6981)
WoosukKwon Aug 1, 2024
630dd9e
[Bugfix][Model] Skip loading lm_head weights if using tie_word_embedd…
tjohnson31415 Aug 1, 2024
0437492
PP comm optimization: replace send with partial send + allgather (#6695)
aurickq Aug 1, 2024
3c10591
[Bugfix] Set SamplingParams.max_tokens for OpenAI requests if not pro…
zifeitong Aug 1, 2024
c8a7e93
[core][scheduler] simplify and improve scheduler (#6867)
youkaichao Aug 1, 2024
a72a424
[Build/CI] Fixing Docker Hub quota issue. (#7043)
Alexei-V-Ivanov-AMD Aug 1, 2024
7e0861b
[CI/Build] Update PyTorch to 2.4.0 (#6951)
SageMoore Aug 1, 2024
2dd3437
[Bugfix] Fix RMSNorm forward in InternViT attention qk_layernorm (#6992)
Isotr0py Aug 1, 2024
fb3db61
[CI/Build] Remove sparseml requirement from testing (#7037)
mgoin Aug 1, 2024
f4fd390
[Bugfix] Lower gemma's unloaded_params exception to warning (#7002)
mgoin Aug 1, 2024
fc912e0
[Models] Support Qwen model with PP (#6974)
andoorve Aug 1, 2024
562e580
Update run-amd-test.sh (#7044)
okakarpa Aug 1, 2024
805a8a7
[Misc] Support attention logits soft-capping with flash-attn (#7022)
WoosukKwon Aug 1, 2024
6a11fdf
[CI/Build][Bugfix] Fix CUTLASS header-only line (#7034)
tlrmchlsmth Aug 1, 2024
6ce01f3
[Performance] Optimize `get_seqs` (#7051)
WoosukKwon Aug 2, 2024
954f730
[Kernel] Fix input for flashinfer prefill wrapper. (#7008)
LiuXiaoxuanPKU Aug 2, 2024
3bb4b1e
[mypy] Speed up mypy checking (#7056)
DarkLight1337 Aug 2, 2024
2523577
[ci][distributed] try to fix pp test (#7054)
youkaichao Aug 2, 2024
cf2a1a4
Fix tracing.py (#7065)
bong-furiosa Aug 2, 2024
660dea1
[cuda][misc] remove error_on_invalid_device_count_status (#7069)
youkaichao Aug 2, 2024
db35186
[Core] Comment out unused code in sampler (#7023)
peng1999 Aug 2, 2024
c16eaac
[Hardware][Intel CPU] Update torch 2.4.0 for CPU backend (#6931)
DamonFool Aug 2, 2024
8069495
[ci] set timeout for test_oot_registration.py (#7082)
youkaichao Aug 2, 2024
b482b9a
[CI/Build] Add support for Python 3.12 (#7035)
mgoin Aug 2, 2024
a8d604c
[Misc] Disambiguate quantized types via a new ScalarType (#6396)
LucasWilkinson Aug 2, 2024
0530889
[Core] Pipeline parallel with Ray ADAG (#6837)
ruisearch42 Aug 2, 2024
22e718f
[Misc] Revive to use loopback address for driver IP (#7091)
ruisearch42 Aug 2, 2024
7089893
[misc] add a flag to enable compile (#7092)
youkaichao Aug 2, 2024
ed812a7
[ Frontend ] Multiprocessing for OpenAI Server with `zeromq` (#6883)
robertgshaw2-neuralmagic Aug 3, 2024
69ea15e
[ci][distributed] shorten wait time if server hangs (#7098)
youkaichao Aug 3, 2024
8c025fa
[Frontend] Factor out chat message parsing (#7055)
DarkLight1337 Aug 3, 2024
04e5583
[ci][distributed] merge distributed test commands (#7097)
youkaichao Aug 3, 2024
a0d1645
[ci][distributed] disable ray dag tests (#7099)
youkaichao Aug 3, 2024
0c25435
[Model] Refactor and decouple weight loading logic for InternVL2 mode…
Isotr0py Aug 3, 2024
fb2c1c8
[Bugfix] Fix block table for seqs that have prefix cache hits (#7018)
zachzzc Aug 3, 2024
99d7cab
[LoRA] ReplicatedLinear support LoRA (#7081)
jeejeelee Aug 3, 2024
67d745c
[CI] Temporarily turn off H100 performance benchmark (#7104)
KuntaiDu Aug 3, 2024
44dcb52
[ci][test] finalize fork_new_process_for_each_test (#7114)
youkaichao Aug 3, 2024
825b044
[Frontend] Warn if user `max_model_len` is greater than derived `max_…
fialhocoelho Aug 3, 2024
654bc5c
Support for guided decoding for offline LLM (#6878)
kevinbu233 Aug 4, 2024
9fadc7b
[misc] add zmq in collect env (#7119)
youkaichao Aug 4, 2024
83c644f
[core][misc] simply output processing with shortcut code path (#7117)
youkaichao Aug 4, 2024
179a6a3
[Model]Refactor MiniCPMV (#7020)
jeejeelee Aug 4, 2024
b1c9aa3
[Bugfix] [SpecDecode] Default speculative_draft_tensor_parallel_size …
tdoublep Aug 4, 2024
16a1cc9
[misc][distributed] improve libcudart.so finding (#7127)
youkaichao Aug 4, 2024
f80ab35
Clean up remaining Punica C information (#7027)
jeejeelee Aug 4, 2024
7b86e7c
[Model] Add multi-image support for minicpmv (#7122)
HwwwwwwwH Aug 5, 2024
cc08fc7
[Frontend] Reapply "Factor out code for running uvicorn" (#7095)
DarkLight1337 Aug 5, 2024
c0d8f16
[Model] SiglipVisionModel ported from transformers (#6942)
ChristopherCho Aug 5, 2024
82a1b1a
[Speculative decoding] Add periodic log with time spent in proposal/s…
cadedaniel Aug 5, 2024
e963045
[SpecDecode] Support FlashInfer in DraftModelRunner (#6926)
bong-furiosa Aug 5, 2024
003f8ee
[BugFix] Use IP4 localhost form for zmq bind (#7163)
njhill Aug 5, 2024
5efb049
Sync with [email protected] (pre)
dtrifiro Aug 5, 2024
f1575d9
Dockerfile.ubi: bump flashinfer
dtrifiro Aug 5, 2024
2 changes: 1 addition & 1 deletion .buildkite/check-wheel-size.py
@@ -1,7 +1,7 @@
import os
import zipfile

MAX_SIZE_MB = 200
MAX_SIZE_MB = 250


def print_top_10_largest_files(zip_file):
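The body of check-wheel-size.py is collapsed in this diff view; only the `MAX_SIZE_MB` bump (200 to 250, from "[build] relax wheel size limit (#6704)") and the `print_top_10_largest_files` signature are visible. Below is a minimal sketch of how such a wheel-size gate might work; the function body and CLI handling are assumptions, not the actual script.

```python
# Hypothetical sketch of a wheel-size gate like check-wheel-size.py.
# The real script's body is collapsed in this diff, so details here are assumed.
import os
import sys
import zipfile

MAX_SIZE_MB = 250


def print_top_10_largest_files(zip_file):
    """List the ten largest members of the wheel to help debug size regressions."""
    with zipfile.ZipFile(zip_file, "r") as z:
        entries = [(info.file_size, info.filename) for info in z.infolist()]
    for size, name in sorted(entries, reverse=True)[:10]:
        print(f"{name}: {size / (1024 * 1024):.2f} MB")


def check_wheel_size(directory):
    # Fail the build if any wheel under `directory` exceeds the limit.
    for root, _, files in os.walk(directory):
        for f in files:
            if f.endswith(".whl"):
                wheel_path = os.path.join(root, f)
                wheel_size_mb = os.path.getsize(wheel_path) / (1024 * 1024)
                if wheel_size_mb > MAX_SIZE_MB:
                    print(f"Wheel {wheel_path} is {wheel_size_mb:.2f} MB, "
                          f"exceeding the {MAX_SIZE_MB} MB limit.")
                    print_top_10_largest_files(wheel_path)
                    return 1
    return 0


if __name__ == "__main__":
    sys.exit(check_wheel_size(sys.argv[1] if len(sys.argv) > 1 else "dist"))
```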
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Meta-Llama-3-8B-QQQ.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m HandH1998/QQQ-Llama-3-8b-g128 -b 32 -l 1000 -f 5 -t 1
model_name: "HandH1998/QQQ-Llama-3-8b-g128"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.409
- name: "exact_match,flexible-extract"
value: 0.406
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Minitron-4B-Base.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nvidia/Minitron-4B-Base -b auto -l 1000 -f 5 -t 1
model_name: "nvidia/Minitron-4B-Base"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.252
- name: "exact_match,flexible-extract"
value: 0.252
limit: 1000
num_fewshot: 5
11 changes: 11 additions & 0 deletions .buildkite/lm-eval-harness/configs/Qwen2-1.5B-Instruct-FP8W8.yaml
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Qwen2-1.5B-Instruct-FP8W8 -b auto -l 1000 -f 5 -t 1
model_name: "nm-testing/Qwen2-1.5B-Instruct-FP8W8"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.578
- name: "exact_match,flexible-extract"
value: 0.585
limit: 1000
num_fewshot: 5
3 changes: 3 additions & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -4,4 +4,7 @@ Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base.yaml
Qwen2-1.5B-Instruct-INT8-compressed-tensors.yaml
Qwen2-1.5B-Instruct-FP8W8.yaml
Meta-Llama-3-8B-QQQ.yaml
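The YAML files above record expected GSM8K scores for the models newly added to the small-model CI list. The sketch below shows how a harness could load one of these configs and compare it against measured lm-eval results; the helper names, tolerance, and results format are assumptions for illustration (the real comparison lives in run-lm-eval-gsm-vllm-baseline.sh, which is not part of this diff).

```python
# Sketch: load an lm-eval-harness YAML config and check measured scores against
# the recorded reference values. Requires PyYAML. Helper names, the tolerance,
# and the `measured` format are assumptions for illustration only.
import yaml

RTOL = 0.05  # assumed relative tolerance


def load_expected(config_path):
    with open(config_path) as f:
        cfg = yaml.safe_load(f)
    expected = {}
    for task in cfg["tasks"]:
        for metric in task["metrics"]:
            expected[(task["name"], metric["name"])] = metric["value"]
    return cfg["model_name"], cfg["limit"], cfg["num_fewshot"], expected


def check_results(config_path, measured):
    """`measured` maps (task_name, metric_name) -> score from an lm-eval run."""
    model, limit, num_fewshot, expected = load_expected(config_path)
    print(f"Checking {model} (limit={limit}, num_fewshot={num_fewshot})")
    ok = True
    for key, ref in expected.items():
        got = measured.get(key)
        if got is None or abs(got - ref) > RTOL * ref:
            print(f"  MISMATCH {key}: expected ~{ref}, got {got}")
            ok = False
    return ok
```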
80 changes: 64 additions & 16 deletions .buildkite/nightly-benchmarks/README.md
@@ -3,30 +3,51 @@

## Introduction

This directory contains the performance benchmarking CI for vllm.
The goal is to help developers know the impact of their PRs on the performance of vllm.
This directory contains two sets of benchmarks for vllm.
- Performance benchmark: benchmarks vllm's performance under various workloads, for **developers** to gain clarity on whether their PR improves/degrades vllm's performance
- Nightly benchmark: compares vllm's performance against alternatives (tgi, trt-llm and lmdeploy), for **the public** to know when to choose vllm.

This benchmark will be *triggered* upon:

See [vLLM performance dashboard](https://perf.vllm.ai) for the latest performance benchmark results and [vLLM GitHub README](https://github.com/vllm-project/vllm/blob/main/README.md) for latest nightly benchmark results.


## Performance benchmark quick overview

**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!), with different models.

**Benchmarking Duration**: about 1hr.

**For benchmarking developers**: please try your best to constrain the duration of benchmarking to about 1 hr so that it won't take forever to run.


## Nightly benchmark quick overview

**Benchmarking Coverage**: Fix-qps serving on A100 (the support for FP8 benchmark on H100 is coming!) on Llama-3 8B, 70B and Mixtral 8x7B.

**Benchmarking engines**: vllm, TGI, trt-llm and lmdeploy.

**Benchmarking Duration**: about 3.5hrs.



## Trigger the benchmark

Performance benchmark will be triggered when:
- A PR being merged into vllm.
- Every commit for those PRs with `perf-benchmarks` label.

**Benchmarking Coverage**: latency, throughput and fix-qps serving on A100 (the support for more GPUs is comming later), with different models.
Nightly benchmark will be triggered when:
- Every commit for those PRs with `nightly-benchmarks` label.

**Benchmarking Duration**: about 1hr.

**For benchmarking developers**: please try your best to constraint the duration of benchmarking to less than 1.5 hr so that it won't take forever to run.


## Configuring the workload
## Performance benchmark details

The benchmarking workload contains three parts:
- Latency tests in `latency-tests.json`.
- Throughput tests in `throughput-tests.json`.
- Serving tests in `serving-tests.json`.
See [descriptions.md](tests/descriptions.md) for detailed descriptions, and use `tests/latency-tests.json`, `tests/throughput-tests.json`, `tests/serving-tests.json` to configure the test cases.

See [descriptions.md](tests/descriptions.md) for detailed descriptions.

### Latency test
#### Latency test

Here is an example of one test inside `latency-tests.json`:

@@ -54,12 +75,12 @@ Note that the performance numbers are highly sensitive to the value of the param
WARNING: The benchmarking script will save json results by itself, so please do not configure `--output-json` parameter in the json file.


### Throughput test
#### Throughput test
The tests are specified in `throughput-tests.json`. The syntax is similar to `latency-tests.json`, except that the parameters are fed forward to `benchmark_throughput.py`.

The number of this test is also stable -- a slight change on the value of this number might vary the performance numbers by a lot.

### Serving test
#### Serving test
We test the throughput by using `benchmark_serving.py` with request rate = inf to cover the online serving overhead. The corresponding parameters are in `serving-tests.json`, and here is an example:

```
@@ -96,9 +117,36 @@ The number of this test is less stable compared to the delay and latency benchma

WARNING: The benchmarking script will save json results by itself, so please do not configure `--save-results` or other results-saving-related parameters in `serving-tests.json`.

## Visualizing the results
#### Visualizing the results
The `convert-results-json-to-markdown.py` helps you put the benchmarking results inside a markdown table, by formatting [descriptions.md](tests/descriptions.md) with real benchmarking results.
You can find the result presented as a table inside the `buildkite/performance-benchmark` job page.
If you do not see the table, please wait until the benchmark finishes running.
The json version of the table (together with the json version of the benchmark) will also be attached to the markdown file.
The raw benchmarking results (in the format of json files) are in the `Artifacts` tab of the benchmarking job.
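As a rough illustration of the JSON-to-markdown step described above, the sketch below walks a results folder and emits a table; the result-file fields used here are assumptions, not the actual output format of convert-results-json-to-markdown.py.

```python
# Sketch of the JSON-to-markdown step: walk a results folder and emit a table.
# The field names (test_name, mean_ttft_ms, mean_tpot_ms) are assumed for
# illustration; the real conversion script is not part of this diff.
import json
from pathlib import Path


def results_to_markdown(results_dir="results/"):
    rows = []
    for path in sorted(Path(results_dir).glob("*.json")):
        data = json.loads(path.read_text())
        rows.append((data.get("test_name", path.stem),
                     data.get("mean_ttft_ms", "n/a"),
                     data.get("mean_tpot_ms", "n/a")))
    lines = ["| Test | Mean TTFT (ms) | Mean TPOT (ms) |",
             "|------|----------------|----------------|"]
    lines += [f"| {name} | {ttft} | {tpot} |" for name, ttft, tpot in rows]
    return "\n".join(lines)


if __name__ == "__main__":
    print(results_to_markdown())
```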



## Nightly test details

See [nightly-descriptions.md](nightly-descriptions.md) for a detailed description of the test workload, models, and docker containers used when benchmarking other LLM engines.


#### Workflow

- The [nightly-pipeline.yaml](nightly-pipeline.yaml) specifies the docker containers for different LLM serving engines.
- Inside each container, we run [run-nightly-suite.sh](run-nightly-suite.sh), which will probe the serving engine of the current container.
- The `run-nightly-suite.sh` will redirect the request to `tests/run-[llm serving engine name]-nightly.sh`, which parses the workload described in [nightly-tests.json](tests/nightly-tests.json) and performs the benchmark.
- Finally, we run [scripts/plot-nightly-results.py](scripts/plot-nightly-results.py) to collect and plot the final benchmarking results, and upload the results to Buildkite.

#### Nightly tests

In [nightly-tests.json](tests/nightly-tests.json), we include the command line arguments for the benchmarking commands, together with the benchmarking test cases. The format is very similar to the performance benchmark.

#### Docker containers

The docker containers for benchmarking are specified in `nightly-pipeline.yaml`.

WARNING: the docker versions are HARD-CODED and SHOULD BE ALIGNED WITH `nightly-descriptions.md`. The docker versions need to be hard-coded as there are several version-specific bug fixes inside `tests/run-[llm serving engine name]-nightly.sh`.

WARNING: updating `trt-llm` to the latest version is not easy, as it requires updating several protobuf files in [tensorrt-demo](https://github.com/neuralmagic/tensorrt-demo.git).

32 changes: 16 additions & 16 deletions .buildkite/nightly-benchmarks/benchmark-pipeline.yaml
@@ -42,20 +42,20 @@ steps:
- name: devshm
emptyDir:
medium: Memory
- label: "H100"
agents:
queue: H100
plugins:
- docker#v5.11.0:
image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
command:
- bash
- .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
mount-buildkite-agent: true
propagate-environment: true
ipc: host
gpus: all
environment:
- VLLM_USAGE_SOURCE
- HF_TOKEN
# - label: "H100"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT
# command:
# - bash
# - .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
# mount-buildkite-agent: true
# propagate-environment: true
# ipc: host
# gpus: all
# environment:
# - VLLM_USAGE_SOURCE
# - HF_TOKEN

16 changes: 10 additions & 6 deletions .buildkite/nightly-benchmarks/run-benchmarks-suite.sh
@@ -34,6 +34,15 @@ check_hf_token() {
fi
}

ensure_sharegpt_downloaded() {
local FILE=ShareGPT_V3_unfiltered_cleaned_split.json
if [ ! -f "$FILE" ]; then
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/$FILE
else
echo "$FILE already exists."
fi
}

json2args() {
# transforms the JSON string to command line args, and '_' is replaced to '-'
# example:
@@ -73,11 +82,6 @@ kill_gpu_processes() {
echo "All GPU processes have been killed."
fi

# Sometimes kill with pid doesn't work properly, we can also kill all process running python or python3
# since we are in container anyway
pkill -9 -f python
pkill -9 -f python3

# waiting for GPU processes to be fully killed
# loop while nvidia-smi returns any processes
while [ -n "$(nvidia-smi --query-compute-apps=pid --format=csv,noheader)" ]; do
@@ -355,7 +359,7 @@ main() {

# prepare for benchmarking
cd benchmarks || exit 1
wget https://huggingface.co/datasets/anon8231489123/ShareGPT_Vicuna_unfiltered/resolve/main/ShareGPT_V3_unfiltered_cleaned_split.json
ensure_sharegpt_downloaded
declare -g RESULTS_FOLDER=results/
mkdir -p $RESULTS_FOLDER
QUICK_BENCHMARK_ROOT=../.buildkite/nightly-benchmarks/
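For reference, the `json2args` helper whose body is collapsed above converts a JSON parameter object into CLI flags, replacing `_` with `-`. Below is an equivalent transformation sketched in Python, using a subset of the speculative-decoding server parameters added to serving-tests.json in the next diff; treating empty-string values as bare boolean flags is an assumption of this sketch, not necessarily what the bash/jq implementation does.

```python
# Python equivalent of the json2args() bash helper whose body is collapsed
# above: turn a JSON parameter object into CLI flags, replacing '_' with '-'.
# Treating empty-string values as bare boolean flags is an assumption here.
import json


def json2args(json_string):
    params = json.loads(json_string)
    parts = []
    for key, value in params.items():
        flag = "--" + key.replace("_", "-")
        if value == "":
            parts.append(flag)  # e.g. "disable_log_requests": ""
        else:
            parts.append(f"{flag} {value}")
    return " ".join(parts)


# Example with a subset of the llama70B speculative-decoding server parameters
# added to serving-tests.json in this PR:
server_params = json.dumps({
    "model": "meta-llama/Meta-Llama-3-70B-Instruct",
    "tensor_parallel_size": 4,
    "swap_space": 16,
    "num_speculative_tokens": 4,
})
print(json2args(server_params))
# --model meta-llama/Meta-Llama-3-70B-Instruct --tensor-parallel-size 4 --swap-space 16 --num-speculative-tokens 4
```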
23 changes: 22 additions & 1 deletion .buildkite/nightly-benchmarks/tests/serving-tests.json
@@ -55,5 +55,26 @@
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
},
{
"test_name": "serving_llama70B_tp4_sharegpt_specdecode",
"qps_list": [2],
"server_parameters": {
"model": "meta-llama/Meta-Llama-3-70B-Instruct",
"disable_log_requests": "",
"tensor_parallel_size": 4,
"swap_space": 16,
"speculative_model": "turboderp/Qwama-0.5B-Instruct",
"num_speculative_tokens": 4,
"speculative_draft_tensor_parallel_size": 1,
"use_v2_block_manager": ""
},
"client_parameters": {
"model": "meta-llama/Meta-Llama-3-70B-Instruct",
"backend": "vllm",
"dataset_name": "sharegpt",
"dataset_path": "./ShareGPT_V3_unfiltered_cleaned_split.json",
"num_prompts": 200
}
}
]
]
2 changes: 1 addition & 1 deletion .buildkite/run-amd-test.sh
@@ -55,7 +55,7 @@ while true; do
done

echo "--- Pulling container"
image_name="rocmshared/vllm-ci:${BUILDKITE_COMMIT}"
image_name="rocm/vllm-ci:${BUILDKITE_COMMIT}"
container_name="rocm_${BUILDKITE_COMMIT}_$(tr -dc A-Za-z0-9 < /dev/urandom | head -c 10; echo)"
docker pull ${image_name}

30 changes: 21 additions & 9 deletions .buildkite/run-cpu-test.sh
@@ -3,26 +3,38 @@
set -ex

# Try building the docker image
docker build -t cpu-test -f Dockerfile.cpu .
docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .
numactl -C 48-95 -N 1 docker build -t cpu-test -f Dockerfile.cpu .
numactl -C 48-95 -N 1 docker build --build-arg VLLM_CPU_DISABLE_AVX512="true" -t cpu-test-avx2 -f Dockerfile.cpu .

# Setup cleanup
remove_docker_container() { docker rm -f cpu-test cpu-test-avx2 || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image
# Run the image, setting --shm-size=4g for tensor parallel.
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 \
--cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test cpu-test
--cpuset-mems=1 --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test cpu-test
docker run -itd --entrypoint /bin/bash -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 \
--cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test-avx2 cpu-test-avx2
--cpuset-mems=1 --privileged=true --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --shm-size=4g --name cpu-test-avx2 cpu-test-avx2

# offline inference
docker exec cpu-test bash -c "python3 examples/offline_inference.py"
docker exec cpu-test-avx2 bash -c "python3 examples/offline_inference.py"

# Run basic model test
docker exec cpu-test bash -c "cd tests;
docker exec cpu-test bash -c "
pip install pytest Pillow protobuf
cd ../
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py" # Mamba on CPU is not supported
pytest -v -s tests/models -m \"not vlm\" --ignore=tests/models/test_embedding.py --ignore=tests/models/test_registry.py --ignore=tests/models/test_jamba.py --ignore=tests/models/test_danube3_4b.py" # Mamba and Danube3-4B on CPU is not supported

# online inference
docker exec cpu-test bash -c "
export VLLM_CPU_KVCACHE_SPACE=10
export VLLM_CPU_OMP_THREADS_BIND=48-92
python3 -m vllm.entrypoints.openai.api_server --model facebook/opt-125m &
timeout 600 bash -c 'until curl localhost:8000/v1/models; do sleep 1; done' || exit 1
python3 benchmarks/benchmark_serving.py \
--backend vllm \
--dataset-name random \
--model facebook/opt-125m \
--num-prompts 20 \
--endpoint /v1/completions \
--tokenizer facebook/opt-125m"