Upstream merge 24 10 08 (#226)
* [Build/CI] Upgrade to gcc 10 in the base build Docker image (vllm-project#8814)

* [Docs] Add README to the build docker image (vllm-project#8825)

* [CI/Build] Fix missing ci dependencies (vllm-project#8834)

* [misc][installation] build from source without compilation (vllm-project#8818)

* [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (vllm-project#8872)

Signed-off-by: kevin <[email protected]>

* [Bugfix] Include encoder prompts len to non-stream api usage response (vllm-project#8861)

* [Misc] Change dummy profiling and BOS fallback warns to log once (vllm-project#8820)

* [Bugfix] Fix print_warning_once's line info (vllm-project#8867)

* fix validation: Only set tool_choice `auto` if at least one tool is provided (vllm-project#8568)

* [Bugfix] Fixup advance_step.cu warning (vllm-project#8815)

* [BugFix] Fix test breakages from transformers 4.45 upgrade (vllm-project#8829)

* [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (vllm-project#8764)

* [Feature] Add support for Llama 3.1 and 3.2 tool use (vllm-project#8343)

Signed-off-by: Max de Bayser <[email protected]>

* [Core] rename `PromptInputs` and `inputs` (vllm-project#8876)

* [misc] fix collect env (vllm-project#8894)

* [MISC] Fix invalid escape sequence '\' (vllm-project#8830)

Signed-off-by: Peter Pan <[email protected]>

* [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (vllm-project#8892)

* [TPU] Update pallas.py to support trillium (vllm-project#8871)

* [torch.compile] use empty tensor instead of None for profiling (vllm-project#8875)

* [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (vllm-project#7271)

* [Bugfix] fix for deepseek w4a16 (vllm-project#8906)

Co-authored-by: mgoin <[email protected]>

* [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (vllm-project#8378)

Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (vllm-project#8911)

* [Core] Priority-based scheduling in async engine (vllm-project#8850)

* [misc] fix wheel name (vllm-project#8919)

* [Bugfix][Intel] Fix XPU Dockerfile Build (vllm-project#7824)

Signed-off-by: tylertitsworth <[email protected]>
Co-authored-by: youkaichao <[email protected]>

* [Misc] Remove vLLM patch of `BaichuanTokenizer` (vllm-project#8921)

* [Bugfix] Fix code for downloading models from modelscope (vllm-project#8443)

* [Bugfix] Fix PP for Multi-Step (vllm-project#8887)

* [CI/Build] Update models tests & examples (vllm-project#8874)

Co-authored-by: Roger Wang <[email protected]>

* [Frontend] Make beam search emulator temperature modifiable (vllm-project#8928)

Co-authored-by: Eduard Balzin <[email protected]>

* [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (vllm-project#8891)

* [doc] organize installation doc and expose per-commit docker (vllm-project#8931)

* [Core] Improve choice of Python multiprocessing method (vllm-project#8823)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: youkaichao <[email protected]>

* [Bugfix] Block manager v2 with preemption and lookahead slots (vllm-project#8824)

* [Bugfix] Fix Marlin MoE act order when is_k_full == False (vllm-project#8741)

Co-authored-by: Tyler Michael Smith <[email protected]>

* [CI/Build] Add test decorator for minimum GPU memory (vllm-project#8925)

* [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (vllm-project#8930)

* [Model] Support Qwen2.5-Math-RM-72B (vllm-project#8896)

* [Model][LoRA]LoRA support added for MiniCPMV2.5 (vllm-project#7199)

* [BugFix] Fix seeded random sampling with encoder-decoder models (vllm-project#8870)

Co-authored-by: Roger Wang <[email protected]>

* [Misc] Fix typo in BlockSpaceManagerV1 (vllm-project#8944)

* [Frontend] Added support for HF's new `continue_final_message` parameter (vllm-project#8942)

* [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (vllm-project#8533)

* [Model] support input embeddings for qwen2vl (vllm-project#8856)

* [Misc][CI/Build] Include `cv2` via `mistral_common[opencv]`  (vllm-project#8951)

* [Model][LoRA]LoRA support added for MiniCPMV2.6 (vllm-project#8943)

Co-authored-by: DarkLight1337 <[email protected]>

* [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg (vllm-project#8946)

* [Core] Make scheduling policy settable via EngineArgs (vllm-project#8956)

* [Misc] Adjust max_position_embeddings for LoRA compatibility (vllm-project#8957)

* [ci] Add CODEOWNERS for test directories  (vllm-project#8795)

Signed-off-by: kevin <[email protected]>

* [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. (vllm-project#8975)

* [Frontend][Core] Move guided decoding params into sampling params (vllm-project#8252)

Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Nick Hill <[email protected]>

* [CI/Build] Fix machete generated kernel files ordering (vllm-project#8976)

Signed-off-by: kevin <[email protected]>
Co-authored-by: Cody Yu <[email protected]>

* [torch.compile] fix tensor alias (vllm-project#8982)

* [Misc] add process_weights_after_loading for DummyLoader (vllm-project#8969)

* [Bugfix] Fix Fuyu tensor parallel inference (vllm-project#8986)

* [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders (vllm-project#8991)

Signed-off-by: Alex-Brooks <[email protected]>

* [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API (vllm-project#8965)

* [Doc] Update list of supported models (vllm-project#8987)

* Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (vllm-project#8997)

* [Spec Decode] (1/2) Remove batch expansion (vllm-project#8839)

* [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (vllm-project#8804)

Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>

* [Misc] Update Default Image Mapper Error Log (vllm-project#8977)

Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Roger Wang <[email protected]>

* [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (vllm-project#8645)

Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [OpenVINO] Enable GPU support for OpenVINO vLLM backend (vllm-project#8192)

* [Model]  Adding Granite MoE. (vllm-project#8206)

Co-authored-by: Nick Hill <[email protected]>

* [Doc] Update Granite model docs (vllm-project#9025)

* [Bugfix] example template should not add parallel_tool_prompt if tools is none (vllm-project#9007)

* [Misc] log when using default MoE config (vllm-project#8971)

* [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser (vllm-project#9020)

* [Core] Make BlockSpaceManagerV2 the default BlockManager to use. (vllm-project#8678)

* [Frontend] [Neuron] Parse literals out of override-neuron-config (vllm-project#8959)

Co-authored-by: Jerzy Zagorski <[email protected]>

* [misc] add forward context for attention (vllm-project#9029)

* Fix failing spec decode test (vllm-project#9054)

* [Bugfix] Weight loading fix for OPT model (vllm-project#9042)

Co-authored-by: dvres <[email protected]>

* [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model (vllm-project#8405)

* [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (vllm-project#8845)

* [Misc] Enable multi-step output streaming by default (vllm-project#9047)

* [Models] Add remaining model PP support (vllm-project#7168)

Signed-off-by: Muralidhar Andoorveedu <[email protected]>
Signed-off-by: Murali Andoorveedu <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>

* [Misc] Move registry to its own file (vllm-project#9064)

* [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL (vllm-project#9071)

* [Bugfix] Flash attention arches not getting set properly (vllm-project#9062)

* [Model] add a bunch of supported lora modules for mixtral (vllm-project#9008)

Signed-off-by: Prashant Gupta <[email protected]>

* Remove AMD Ray Summit Banner (vllm-project#9075)

* [Hardware][PowerPC] Make oneDNN dependency optional for Power (vllm-project#9039)

Signed-off-by: Varad Ahirwadkar <[email protected]>

* [Core][VLM] Test registration for OOT multimodal models (vllm-project#8717)

Co-authored-by: DarkLight1337 <[email protected]>

* Adds truncate_prompt_tokens param for embeddings creation (vllm-project#8999)

Signed-off-by: Flavia Beo <[email protected]>

* [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (vllm-project#8973)

Co-authored-by: Dipika <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>

* [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (vllm-project#7412)

* [Misc] Improved prefix cache example (vllm-project#9077)

* [Misc] Add random seed for prefix cache benchmark (vllm-project#9081)

* [Misc] Fix CI lint (vllm-project#9085)

* [Hardware][Neuron] Add on-device sampling support for Neuron (vllm-project#8746)

Co-authored-by: Ashraf Mahgoub <[email protected]>

* [torch.compile] improve allreduce registration (vllm-project#9061)

* [Doc] Update README.md with Ray summit slides (vllm-project#9088)

* [Bugfix] use blockmanagerv1 for encoder-decoder (vllm-project#9084)

Co-authored-by: Roger Wang <[email protected]>

* [Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (vllm-project#8979)

* [Model] Support Gemma2 embedding model (vllm-project#9004)

* [Bugfix] Deprecate registration of custom configs to huggingface (vllm-project#9083)

* [Bugfix] Fix order of arguments matters in config.yaml (vllm-project#8960)

* [core] use forward context for flash infer (vllm-project#9097)

* [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model (vllm-project#9101)

* [Frontend] API support for beam search (vllm-project#9087)

Co-authored-by: youkaichao <[email protected]>

* [Misc] Remove user-facing error for removed VLM args (vllm-project#9104)

* [Model] PP support for embedding models and update docs (vllm-project#9090)

Co-authored-by: Roger Wang <[email protected]>

* [Bugfix] fix tool_parser error handling when serve a model not support it (vllm-project#8709)

* [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (vllm-project#9038)

Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [Bugfix][Hardware][CPU] Fix CPU model input for decode (vllm-project#9044)

* [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (vllm-project#9103)

* [core] remove beam search from the core (vllm-project#9105)

* [Model] Explicit interface for vLLM models and support OOT embedding models (vllm-project#9108)

* [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend (vllm-project#9089)

* [Core] Refactor GGUF parameters packing and forwarding (vllm-project#8859)

* [Model] Support NVLM-D and fix QK Norm in InternViT (vllm-project#9045)

Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Doc]: Add deploying_with_k8s guide (vllm-project#8451)

* [CI/Build] Add linting for github actions workflows (vllm-project#7876)

Signed-off-by: Russell Bryant <[email protected]>

* [Doc] Include performance benchmark in README (vllm-project#9135)

* [misc] fix comment and variable name (vllm-project#9139)

* Add Slack to README (vllm-project#9137)

* [misc] update utils to support comparing multiple settings (vllm-project#9140)

* [Intel GPU] Fix xpu decode input  (vllm-project#9145)

* [misc] improve ux on readme (vllm-project#9147)

* [Frontend] API support for beam search for MQLLMEngine (vllm-project#9117)

* [Core][Frontend] Add Support for Inference Time mm_processor_kwargs (vllm-project#9131)

Signed-off-by: Alex-Brooks <[email protected]>

* Factor out common weight loading code

* Fix EAGLE model loading

* [Frontend] Add Early Validation For Chat Template / Tool Call Parser (vllm-project#9151)

Signed-off-by: Alex-Brooks <[email protected]>

* Improve efficiency

* Rename

* Update LLaVA-NeXT-Video

* [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models (vllm-project#8758)

Signed-off-by: Peter Pan <[email protected]>

* [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing (vllm-project#8537)

* Automatic loading and save memory

* Rename

* Update docstring

* Simplify

* Cleanup

* Fully enable recursive loading

* Clarify

* [Doc] Update vlm.rst to include an example on videos (vllm-project#9155)

Co-authored-by: Cyrus Leung <[email protected]>

* Fix incorrect semantics

* Move function

* Update error message

* Fix Ultravox loading

* spacing

* [Doc] Improve contributing and installation documentation (vllm-project#9132)

Signed-off-by: Rafael Vasquez <[email protected]>

* Fix server

* [Bugfix] Try to handle older versions of pytorch (vllm-project#9086)

---------

Signed-off-by: kevin <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Peter Pan <[email protected]>
Signed-off-by: tylertitsworth <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
Signed-off-by: Murali Andoorveedu <[email protected]>
Signed-off-by: Prashant Gupta <[email protected]>
Signed-off-by: Varad Ahirwadkar <[email protected]>
Signed-off-by: Flavia Beo <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: fyuan1316 <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Pernekhan Utemuratov <[email protected]>
Co-authored-by: Chirag Jain <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Peter Pan <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Brittany <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Sebastian Schoennenbeck <[email protected]>
Co-authored-by: Tyler Titsworth <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: tastelikefeet <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Edouard B. <[email protected]>
Co-authored-by: Eduard Balzin <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: Zilin Zhu <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: juncheoll <[email protected]>
Co-authored-by: danieljannai21 <[email protected]>
Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: whyiug <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: vlsav <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: Sergey Shlyapnikov <[email protected]>
Co-authored-by: Shawn Tan <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Guillaume Calmettes <[email protected]>
Co-authored-by: xendo <[email protected]>
Co-authored-by: Jerzy Zagorski <[email protected]>
Co-authored-by: Domen Vreš <[email protected]>
Co-authored-by: dvres <[email protected]>
Co-authored-by: 代君 <[email protected]>
Co-authored-by: Murali Andoorveedu <[email protected]>
Co-authored-by: Prashant Gupta <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Varad Ahirwadkar <[email protected]>
Co-authored-by: Flávia Béo <[email protected]>
Co-authored-by: Dipika <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Andy Dai <[email protected]>
Co-authored-by: Chongming Ni <[email protected]>
Co-authored-by: Ashraf Mahgoub <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: hhzhang16 <[email protected]>
Co-authored-by: Xin Yang <[email protected]>
Co-authored-by: TJian <[email protected]>
Co-authored-by: Brendan Wong <[email protected]>
Co-authored-by: Yanyi Liu <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: TimWang <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Daniele <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Rafael Vasquez <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Showing 411 changed files with 18,718 additions and 9,884 deletions.
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.764
- name: "exact_match,flexible-extract"
value: 0.764
limit: 250
num_fewshot: 5
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,7 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base-FP8.yaml
7 changes: 6 additions & 1 deletion .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -49,10 +49,15 @@ def test_lm_eval_correctness():
results = launch_lm_eval(eval_config)

# Confirm scores match ground truth.
success = True
for task in eval_config["tasks"]:
for metric in task["metrics"]:
ground_truth = metric["value"]
measured_value = results["results"][task["name"]][metric["name"]]
print(f'{task["name"]} | {metric["name"]}: '
f'ground_truth={ground_truth} | measured={measured_value}')
assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
success = success and numpy.isclose(
ground_truth, measured_value, rtol=RTOL)

# Assert at the end, print all scores even on failure for debugging.
assert success
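For readers skimming the diff above: the change defers the assertion so every metric is printed before the test fails. Below is a minimal, self-contained sketch of that collect-then-assert pattern; the RTOL value and the expected/measured numbers are illustrative placeholders, not the CI's actual settings.

```python
# Illustrative sketch of the collect-then-assert pattern from the test above.
# RTOL and the expected/measured values are made up for demonstration.
import numpy

RTOL = 0.05  # assumed relative tolerance for this sketch

expected = {"gsm8k": {"exact_match,strict-match": 0.764, "exact_match,flexible-extract": 0.764}}
measured = {"gsm8k": {"exact_match,strict-match": 0.760, "exact_match,flexible-extract": 0.768}}

success = True
for task, metrics in expected.items():
    for metric, ground_truth in metrics.items():
        value = measured[task][metric]
        print(f"{task} | {metric}: ground_truth={ground_truth} | measured={value}")
        # Accumulate the result instead of asserting immediately, so every
        # score is printed even when one of them is out of tolerance.
        success = success and numpy.isclose(ground_truth, value, rtol=RTOL)

assert success  # fail once, at the end, after all scores were reported
```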
28 changes: 28 additions & 0 deletions .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -0,0 +1,28 @@

## Description

This file contains the downloading link for benchmarking results.

- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
- [benchmarking results](artifact://results.zip)
- [benchmarking code](artifact://nightly-benchmarks.zip)

Please download the visualization scripts in the post


## Results reproduction

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code
```
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.

78 changes: 36 additions & 42 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -1,45 +1,39 @@

# Nightly benchmark

The main goal of this benchmarking is two-fold:
- Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload.
- Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md]().


## Docker images

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1

<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->


## Hardware

One AWS node with 8x NVIDIA A100 GPUs.


## Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

- Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 500 prompts.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).

<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->

## Plots

In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.

<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >

## Results

{nightly_results_benchmarking_table}
This benchmark aims to:
- Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.

Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.

Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)


## Setup

- Docker images:
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
- 8x Nvidia A100 GPUs
- Workload:
- Dataset
- ShareGPT dataset
- Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
- Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
- Models: llama-3 8B, llama-3 70B.
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).

# Known issues

- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
- TGI does not support `ignore-eos` flag.
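The workload description above states that query arrival times follow a Poisson process at a target QPS with a fixed random seed. The following is a hedged sketch of how such arrival times could be generated; it is illustrative only, not the benchmark suite's actual code, and the function name is made up.

```python
# Illustrative only: generate reproducible Poisson-process arrival times
# for a target QPS, as described in the workload section above.
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Cumulative arrival times (seconds) of a Poisson process with rate `qps`.

    Inter-arrival gaps of a Poisson process are i.i.d. exponential with
    mean 1/qps; fixing the seed makes the schedule reproducible across runs.
    """
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# Example: 500 requests at 4 QPS span roughly 125 seconds on average.
times = poisson_arrival_times(num_requests=500, qps=4.0, seed=0)
print(f"first arrival at {times[0]:.2f}s, last at ~{times[-1]:.1f}s")
```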
98 changes: 87 additions & 11 deletions .buildkite/nightly-benchmarks/nightly-pipeline.yaml
Expand Up @@ -13,7 +13,7 @@ common_pod_spec: &common_pod_spec

common_container_settings: &common_container_settings
command:
- bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
- bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
resources:
limits:
nvidia.com/gpu: 8
@@ -37,7 +37,10 @@ common_container_settings: &common_container_settings

steps:
- block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
- label: "A100 trt benchmark"



- label: "A100 vllm step 10"
priority: 100
agents:
queue: A100
@@ -46,7 +49,21 @@ steps:
podSpec:
<<: *common_pod_spec
containers:
- image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- image: vllm/vllm-openai:v0.6.2
<<: *common_container_settings



- label: "A100 sglang benchmark"
priority: 100
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
<<: *common_pod_spec
containers:
- image: lmsysorg/sglang:v0.3.2-cu121
<<: *common_container_settings

- label: "A100 lmdeploy benchmark"
@@ -58,11 +75,13 @@
podSpec:
<<: *common_pod_spec
containers:
- image: openmmlab/lmdeploy:v0.5.0
- image: openmmlab/lmdeploy:v0.6.1-cu12
<<: *common_container_settings


- label: "A100 vllm benchmark"



- label: "A100 trt llama-8B"
priority: 100
agents:
queue: A100
@@ -71,10 +90,25 @@
podSpec:
<<: *common_pod_spec
containers:
- image: vllm/vllm-openai:latest
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_HOME
value: /root/.cache/huggingface
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: TEST_SELECTOR
value: "llama8B"

- label: "A100 tgi benchmark"

- label: "A100 trt llama-70B"
priority: 100
agents:
queue: A100
@@ -83,12 +117,54 @@
podSpec:
<<: *common_pod_spec
containers:
- image: ghcr.io/huggingface/text-generation-inference:2.1
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_HOME
value: /root/.cache/huggingface
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: TEST_SELECTOR
value: "llama70B"


# FIXME(Kuntai): uncomment this after NVIDIA gives us their test docker image
# - label: "A100 trt benchmark"
# priority: 100
# agents:
# queue: A100
# plugins:
# - kubernetes:
# podSpec:
# <<: *common_pod_spec
# containers:
# - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
# <<: *common_container_settings


# FIXME(Kuntai): uncomment this after TGI supports `--ignore-eos`.
# - label: "A100 tgi benchmark"
# priority: 100
# agents:
# queue: A100
# plugins:
# - kubernetes:
# podSpec:
# <<: *common_pod_spec
# containers:
# - image: ghcr.io/huggingface/text-generation-inference:2.2.0
# <<: *common_container_settings

- wait

- label: "Plot"
- label: "Collect the results"
priority: 100
agents:
queue: A100
@@ -117,4 +193,4 @@ steps:
name: hf-token-secret
key: token

- wait
- block: ":rocket: check the results!"
76 changes: 0 additions & 76 deletions .buildkite/nightly-benchmarks/run-nightly-suite.sh

This file was deleted.

