Upstream merge 24 10 08 (#226)
* [Build/CI] Upgrade to gcc 10 in the base build Docker image (vllm-project#8814)

* [Docs] Add README to the build docker image (vllm-project#8825)

* [CI/Build] Fix missing ci dependencies (vllm-project#8834)

* [misc][installation] build from source without compilation (vllm-project#8818)

* [ci] Soft fail Entrypoints, Samplers, LoRA, Decoder-only VLM (vllm-project#8872)

Signed-off-by: kevin <[email protected]>

* [Bugfix] Include encoder prompts len to non-stream api usage response (vllm-project#8861)

* [Misc] Change dummy profiling and BOS fallback warns to log once (vllm-project#8820)

* [Bugfix] Fix print_warning_once's line info (vllm-project#8867)

* fix validation: Only set tool_choice `auto` if at least one tool is provided (vllm-project#8568)

* [Bugfix] Fixup advance_step.cu warning (vllm-project#8815)

* [BugFix] Fix test breakages from transformers 4.45 upgrade (vllm-project#8829)

* [Installation] Allow lower versions of FastAPI to maintain Ray 2.9 compatibility (vllm-project#8764)

* [Feature] Add support for Llama 3.1 and 3.2 tool use (vllm-project#8343)

Signed-off-by: Max de Bayser <[email protected]>

* [Core] rename `PromptInputs` and `inputs` (vllm-project#8876)

* [misc] fix collect env (vllm-project#8894)

* [MISC] Fix invalid escape sequence '\' (vllm-project#8830)

Signed-off-by: Peter Pan <[email protected]>

* [Bugfix][VLM] Fix Fuyu batching inference with `max_num_seqs>1` (vllm-project#8892)

* [TPU] Update pallas.py to support trillium (vllm-project#8871)

* [torch.compile] use empty tensor instead of None for profiling (vllm-project#8875)

* [Kernel] AQ AZP 4/4: Integrate asymmetric quantization to linear method (vllm-project#7271)

* [Bugfix] fix for deepseek w4a16 (vllm-project#8906)

Co-authored-by: mgoin <[email protected]>

* [Core] Multi-Step + Single Step Prefills via Chunked Prefill code path (vllm-project#8378)

Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [misc][distributed] add VLLM_SKIP_P2P_CHECK flag (vllm-project#8911)

* [Core] Priority-based scheduling in async engine (vllm-project#8850)

* [misc] fix wheel name (vllm-project#8919)

* [Bugfix][Intel] Fix XPU Dockerfile Build (vllm-project#7824)

Signed-off-by: tylertitsworth <[email protected]>
Co-authored-by: youkaichao <[email protected]>

* [Misc] Remove vLLM patch of `BaichuanTokenizer` (vllm-project#8921)

* [Bugfix] Fix code for downloading models from modelscope (vllm-project#8443)

* [Bugfix] Fix PP for Multi-Step (vllm-project#8887)

* [CI/Build] Update models tests & examples (vllm-project#8874)

Co-authored-by: Roger Wang <[email protected]>

* [Frontend] Make beam search emulator temperature modifiable (vllm-project#8928)

Co-authored-by: Eduard Balzin <[email protected]>

* [Bugfix] Support testing prefill throughput with benchmark_serving.py --hf-output-len 1 (vllm-project#8891)

* [doc] organize installation doc and expose per-commit docker (vllm-project#8931)

* [Core] Improve choice of Python multiprocessing method (vllm-project#8823)

Signed-off-by: Russell Bryant <[email protected]>
Co-authored-by: youkaichao <[email protected]>

* [Bugfix] Block manager v2 with preemption and lookahead slots (vllm-project#8824)

* [Bugfix] Fix Marlin MoE act order when is_k_full == False (vllm-project#8741)

Co-authored-by: Tyler Michael Smith <[email protected]>

* [CI/Build] Add test decorator for minimum GPU memory (vllm-project#8925)

* [Build/CI] Set FETCHCONTENT_BASE_DIR to one location for better caching (vllm-project#8930)

* [Model] Support Qwen2.5-Math-RM-72B (vllm-project#8896)

* [Model][LoRA]LoRA support added for MiniCPMV2.5 (vllm-project#7199)

* [BugFix] Fix seeded random sampling with encoder-decoder models (vllm-project#8870)

Co-authored-by: Roger Wang <[email protected]>

* [Misc] Fix typo in BlockSpaceManagerV1 (vllm-project#8944)

* [Frontend] Added support for HF's new `continue_final_message` parameter (vllm-project#8942)

* [Kernel][Model] Varlen prefill + Prefill chunking support for mamba kernels and Jamba model (vllm-project#8533)

* [Model] support input embeddings for qwen2vl (vllm-project#8856)

* [Misc][CI/Build] Include `cv2` via `mistral_common[opencv]`  (vllm-project#8951)

* [Model][LoRA]LoRA support added for MiniCPMV2.6 (vllm-project#8943)

Co-authored-by: DarkLight1337 <[email protected]>

* [Model] Expose InternVL2 max_dynamic_patch as a mm_processor_kwarg (vllm-project#8946)

* [Core] Make scheduling policy settable via EngineArgs (vllm-project#8956)

* [Misc] Adjust max_position_embeddings for LoRA compatibility (vllm-project#8957)

* [ci] Add CODEOWNERS for test directories  (vllm-project#8795)

Signed-off-by: kevin <[email protected]>

* [CI][SpecDecode] Fix spec decode tests, use flash attention backend for spec decode CI tests. (vllm-project#8975)

* [Frontend][Core] Move guided decoding params into sampling params (vllm-project#8252)

Signed-off-by: Joe Runde <[email protected]>
Co-authored-by: Nick Hill <[email protected]>

* [CI/Build] Fix machete generated kernel files ordering (vllm-project#8976)

Signed-off-by: kevin <[email protected]>
Co-authored-by: Cody Yu <[email protected]>

* [torch.compile] fix tensor alias (vllm-project#8982)

* [Misc] add process_weights_after_loading for DummyLoader (vllm-project#8969)

* [Bugfix] Fix Fuyu tensor parallel inference (vllm-project#8986)

* [Bugfix] Fix Token IDs Reference for MiniCPM-V When Images are Provided With No Placeholders (vllm-project#8991)

Signed-off-by: Alex-Brooks <[email protected]>

* [Core] [Frontend] Priority scheduling for embeddings and in the OpenAI-API (vllm-project#8965)

* [Doc] Update list of supported models (vllm-project#8987)

* Update benchmark_serving.py to read and write json-datasets, results in UTF8, for better compatibility with Windows (vllm-project#8997)

* [Spec Decode] (1/2) Remove batch expansion (vllm-project#8839)

* [Core] Combined support for multi-step scheduling, chunked prefill & prefix caching (vllm-project#8804)

Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>

* [Misc] Update Default Image Mapper Error Log (vllm-project#8977)

Signed-off-by: Alex-Brooks <[email protected]>
Co-authored-by: Roger Wang <[email protected]>

* [Core] CUDA Graphs for Multi-Step + Chunked-Prefill (vllm-project#8645)

Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [OpenVINO] Enable GPU support for OpenVINO vLLM backend (vllm-project#8192)

* [Model]  Adding Granite MoE. (vllm-project#8206)

Co-authored-by: Nick Hill <[email protected]>

* [Doc] Update Granite model docs (vllm-project#9025)

* [Bugfix] example template should not add parallel_tool_prompt if tools is none (vllm-project#9007)

* [Misc] log when using default MoE config (vllm-project#8971)

* [BugFix] Enforce Mistral ToolCall id constraint when using the Mistral tool call parser (vllm-project#9020)

* [Core] Make BlockSpaceManagerV2 the default BlockManager to use. (vllm-project#8678)

* [Frontend] [Neuron] Parse literals out of override-neuron-config (vllm-project#8959)

Co-authored-by: Jerzy Zagorski <[email protected]>

* [misc] add forward context for attention (vllm-project#9029)

* Fix failing spec decode test (vllm-project#9054)

* [Bugfix] Weight loading fix for OPT model (vllm-project#9042)

Co-authored-by: dvres <[email protected]>

* [Frontend][Feature] support tool calling for internlm/internlm2_5-7b-chat model (vllm-project#8405)

* [CI/Build] Per file CUDA Archs (improve wheel size and dev build times) (vllm-project#8845)

* [Misc] Enable multi-step output streaming by default (vllm-project#9047)

* [Models] Add remaining model PP support (vllm-project#7168)

Signed-off-by: Muralidhar Andoorveedu <[email protected]>
Signed-off-by: Murali Andoorveedu <[email protected]>
Co-authored-by: DarkLight1337 <[email protected]>

* [Misc] Move registry to its own file (vllm-project#9064)

* [Bugfix] Reshape the dimensions of the input image embeddings in Qwen2VL (vllm-project#9071)

* [Bugfix] Flash attention arches not getting set properly (vllm-project#9062)

* [Model] add a bunch of supported lora modules for mixtral (vllm-project#9008)

Signed-off-by: Prashant Gupta <[email protected]>

* Remove AMD Ray Summit Banner (vllm-project#9075)

* [Hardware][PowerPC] Make oneDNN dependency optional for Power (vllm-project#9039)

Signed-off-by: Varad Ahirwadkar <[email protected]>

* [Core][VLM] Test registration for OOT multimodal models (vllm-project#8717)

Co-authored-by: DarkLight1337 <[email protected]>

* Adds truncate_prompt_tokens param for embeddings creation (vllm-project#8999)

Signed-off-by: Flavia Beo <[email protected]>

* [Kernel] Zero point support in fused MarlinMoE kernel + AWQ Fused MoE (vllm-project#8973)

Co-authored-by: Dipika <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>

* [CI] Update performance benchmark: upgrade trt-llm to r24.07, and add SGLang (vllm-project#7412)

* [Misc] Improved prefix cache example (vllm-project#9077)

* [Misc] Add random seed for prefix cache benchmark (vllm-project#9081)

* [Misc] Fix CI lint (vllm-project#9085)

* [Hardware][Neuron] Add on-device sampling support for Neuron (vllm-project#8746)

Co-authored-by: Ashraf Mahgoub <[email protected]>

* [torch.compile] improve allreduce registration (vllm-project#9061)

* [Doc] Update README.md with Ray summit slides (vllm-project#9088)

* [Bugfix] use blockmanagerv1 for encoder-decoder (vllm-project#9084)

Co-authored-by: Roger Wang <[email protected]>

* [Bugfix] Fixes Phi3v & Ultravox Multimodal EmbeddingInputs (vllm-project#8979)

* [Model] Support Gemma2 embedding model (vllm-project#9004)

* [Bugfix] Deprecate registration of custom configs to huggingface (vllm-project#9083)

* [Bugfix] Fix order of arguments matters in config.yaml (vllm-project#8960)

* [core] use forward context for flash infer (vllm-project#9097)

* [Bugfix] Fix try-catch conditions to import correct Flash Attention Backend in Draft Model (vllm-project#9101)

* [Frontend] API support for beam search (vllm-project#9087)

Co-authored-by: youkaichao <[email protected]>

* [Misc] Remove user-facing error for removed VLM args (vllm-project#9104)

* [Model] PP support for embedding models and update docs (vllm-project#9090)

Co-authored-by: Roger Wang <[email protected]>

* [Bugfix] fix tool_parser error handling when serve a model not support it (vllm-project#8709)

* [Bugfix] Fix incorrect updates to num_computed_tokens in multi-step scheduling (vllm-project#9038)

Co-authored-by: Varun Sundar Rabindranath <[email protected]>

* [Bugfix][Hardware][CPU] Fix CPU model input for decode (vllm-project#9044)

* [BugFix][Core] Fix BlockManagerV2 when Encoder Input is None (vllm-project#9103)

* [core] remove beam search from the core (vllm-project#9105)

* [Model] Explicit interface for vLLM models and support OOT embedding models (vllm-project#9108)

* [Hardware][CPU] Cross-attention and Encoder-Decoder models support on CPU backend (vllm-project#9089)

* [Core] Refactor GGUF parameters packing and forwarding (vllm-project#8859)

* [Model] Support NVLM-D and fix QK Norm in InternViT (vllm-project#9045)

Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Isotr0py <[email protected]>

* [Doc]: Add deploying_with_k8s guide (vllm-project#8451)

* [CI/Build] Add linting for github actions workflows (vllm-project#7876)

Signed-off-by: Russell Bryant <[email protected]>

* [Doc] Include performance benchmark in README (vllm-project#9135)

* [misc] fix comment and variable name (vllm-project#9139)

* Add Slack to README (vllm-project#9137)

* [misc] update utils to support comparing multiple settings (vllm-project#9140)

* [Intel GPU] Fix xpu decode input  (vllm-project#9145)

* [misc] improve ux on readme (vllm-project#9147)

* [Frontend] API support for beam search for MQLLMEngine (vllm-project#9117)

* [Core][Frontend] Add Support for Inference Time mm_processor_kwargs (vllm-project#9131)

Signed-off-by: Alex-Brooks <[email protected]>

* Factor out common weight loading code

* Fix EAGLE model loading

* [Frontend] Add Early Validation For Chat Template / Tool Call Parser (vllm-project#9151)

Signed-off-by: Alex-Brooks <[email protected]>

* Improve efficiency

* Rename

* Update LLaVA-NeXT-Video

* [CI/Build] Add examples folder into Docker image so that we can leverage the templates*.jinja when serving models (vllm-project#8758)

Signed-off-by: Peter Pan <[email protected]>

* [Bugfix] fix OpenAI API server startup with --disable-frontend-multiprocessing (vllm-project#8537)

* Automatic loading and save memory

* Rename

* Update docstring

* Simplify

* Cleanup

* Fully enable recursive loading

* Clarify

* [Doc] Update vlm.rst to include an example on videos (vllm-project#9155)

Co-authored-by: Cyrus Leung <[email protected]>

* Fix incorrect semantics

* Move function

* Update error message

* Fix Ultravox loading

* spacing

* [Doc] Improve contributing and installation documentation (vllm-project#9132)

Signed-off-by: Rafael Vasquez <[email protected]>

* Fix server

* [Bugfix] Try to handle older versions of pytorch (vllm-project#9086)

---------

Signed-off-by: kevin <[email protected]>
Signed-off-by: Max de Bayser <[email protected]>
Signed-off-by: Peter Pan <[email protected]>
Signed-off-by: tylertitsworth <[email protected]>
Signed-off-by: Russell Bryant <[email protected]>
Signed-off-by: Joe Runde <[email protected]>
Signed-off-by: Alex-Brooks <[email protected]>
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
Signed-off-by: Murali Andoorveedu <[email protected]>
Signed-off-by: Prashant Gupta <[email protected]>
Signed-off-by: Varad Ahirwadkar <[email protected]>
Signed-off-by: Flavia Beo <[email protected]>
Signed-off-by: Rafael Vasquez <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: fyuan1316 <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: Kevin H. Luu <[email protected]>
Co-authored-by: Pernekhan Utemuratov <[email protected]>
Co-authored-by: Chirag Jain <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Maximilien de Bayser <[email protected]>
Co-authored-by: Peter Pan <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Brittany <[email protected]>
Co-authored-by: Luka Govedič <[email protected]>
Co-authored-by: Lucas Wilkinson <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Sebastian Schoennenbeck <[email protected]>
Co-authored-by: Tyler Titsworth <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: tastelikefeet <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Edouard B. <[email protected]>
Co-authored-by: Eduard Balzin <[email protected]>
Co-authored-by: Chen Zhang <[email protected]>
Co-authored-by: Russell Bryant <[email protected]>
Co-authored-by: sroy745 <[email protected]>
Co-authored-by: ElizaWszola <[email protected]>
Co-authored-by: Zilin Zhu <[email protected]>
Co-authored-by: Jee Jee Li <[email protected]>
Co-authored-by: juncheoll <[email protected]>
Co-authored-by: danieljannai21 <[email protected]>
Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: whyiug <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Joe Runde <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Alex Brooks <[email protected]>
Co-authored-by: vlsav <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Andrew Feldman <[email protected]>
Co-authored-by: Sergey Shlyapnikov <[email protected]>
Co-authored-by: Shawn Tan <[email protected]>
Co-authored-by: Travis Johnson <[email protected]>
Co-authored-by: Guillaume Calmettes <[email protected]>
Co-authored-by: xendo <[email protected]>
Co-authored-by: Jerzy Zagorski <[email protected]>
Co-authored-by: Domen Vreš <[email protected]>
Co-authored-by: dvres <[email protected]>
Co-authored-by: 代君 <[email protected]>
Co-authored-by: Murali Andoorveedu <[email protected]>
Co-authored-by: Prashant Gupta <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Varad Ahirwadkar <[email protected]>
Co-authored-by: Flávia Béo <[email protected]>
Co-authored-by: Dipika <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Andy Dai <[email protected]>
Co-authored-by: Chongming Ni <[email protected]>
Co-authored-by: Ashraf Mahgoub <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: hhzhang16 <[email protected]>
Co-authored-by: Xin Yang <[email protected]>
Co-authored-by: TJian <[email protected]>
Co-authored-by: Brendan Wong <[email protected]>
Co-authored-by: Yanyi Liu <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: TimWang <[email protected]>
Co-authored-by: Kunshang Ji <[email protected]>
Co-authored-by: Daniele <[email protected]>
Co-authored-by: Sayak Paul <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Rafael Vasquez <[email protected]>
Co-authored-by: bnellnm <[email protected]>
Showing 411 changed files with 18,718 additions and 9,884 deletions.
@@ -0,0 +1,11 @@
# bash .buildkite/lm-eval-harness/run-lm-eval-gsm-vllm-baseline.sh -m nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test -b "auto" -l 250 -f 5 -t 1
model_name: "nm-testing/Meta-Llama-3-8B-Instruct-W8-Channel-A8-Dynamic-Asym-Per-Token-Test"
tasks:
- name: "gsm8k"
metrics:
- name: "exact_match,strict-match"
value: 0.764
- name: "exact_match,flexible-extract"
value: 0.764
limit: 250
num_fewshot: 5
1 change: 1 addition & 0 deletions .buildkite/lm-eval-harness/configs/models-small.txt
@@ -1,6 +1,7 @@
Meta-Llama-3-8B-Instruct.yaml
Meta-Llama-3-8B-Instruct-FP8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-INT8-compressed-tensors-asym.yaml
Meta-Llama-3-8B-Instruct-nonuniform-compressed-tensors.yaml
Meta-Llama-3-8B-Instruct-Channelwise-compressed-tensors.yaml
Minitron-4B-Base-FP8.yaml
7 changes: 6 additions & 1 deletion .buildkite/lm-eval-harness/test_lm_eval_correctness.py
@@ -49,10 +49,15 @@ def test_lm_eval_correctness():
results = launch_lm_eval(eval_config)

# Confirm scores match ground truth.
success = True
for task in eval_config["tasks"]:
for metric in task["metrics"]:
ground_truth = metric["value"]
measured_value = results["results"][task["name"]][metric["name"]]
print(f'{task["name"]} | {metric["name"]}: '
f'ground_truth={ground_truth} | measured={measured_value}')
assert numpy.isclose(ground_truth, measured_value, rtol=RTOL)
success = success and numpy.isclose(
ground_truth, measured_value, rtol=RTOL)

# Assert at the end, print all scores even on failure for debugging.
assert success
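For readers skimming the diff above: the change defers the assertion so every metric is printed before the test fails. Below is a minimal, self-contained sketch of that collect-then-assert pattern; the RTOL value and the expected/measured numbers are illustrative placeholders, not the CI's actual settings.

```python
# Illustrative sketch of the collect-then-assert pattern from the test above.
# RTOL and the expected/measured values are made up for demonstration.
import numpy

RTOL = 0.05  # assumed relative tolerance for this sketch

expected = {"gsm8k": {"exact_match,strict-match": 0.764, "exact_match,flexible-extract": 0.764}}
measured = {"gsm8k": {"exact_match,strict-match": 0.760, "exact_match,flexible-extract": 0.768}}

success = True
for task, metrics in expected.items():
    for metric, ground_truth in metrics.items():
        value = measured[task][metric]
        print(f"{task} | {metric}: ground_truth={ground_truth} | measured={value}")
        # Accumulate the result instead of asserting immediately, so every
        # score is printed even when one of them is out of tolerance.
        success = success and numpy.isclose(ground_truth, value, rtol=RTOL)

assert success  # fail once, at the end, after all scores were reported
```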
28 changes: 28 additions & 0 deletions .buildkite/nightly-benchmarks/nightly-annotation.md
@@ -0,0 +1,28 @@

## Description

This file contains the downloading link for benchmarking results.

- [benchmarking pipeline](artifact://nightly-pipeline.yaml)
- [benchmarking results](artifact://results.zip)
- [benchmarking code](artifact://nightly-benchmarks.zip)

Please download the visualization scripts in the post


## Results reproduction

- Find the docker we use in `benchmarking pipeline`
- Deploy the docker, and inside the docker:
- Download `nightly-benchmarks.zip`.
- In the same folder, run the following code
```
export HF_TOKEN=<your HF token>
apt update
apt install -y git
unzip nightly-benchmarks.zip
VLLM_SOURCE_CODE_LOC=./ bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
```

And the results will be inside `./benchmarks/results`.

78 changes: 36 additions & 42 deletions .buildkite/nightly-benchmarks/nightly-descriptions.md
@@ -1,45 +1,39 @@

# Nightly benchmark

The main goal of this benchmarking is two-fold:
- Performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and tgi) leads in performance in what workload.
- Reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions in [reproduce.md]().


## Docker images

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following docker images:
- vllm/vllm-openai:v0.5.0.post1
- nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- openmmlab/lmdeploy:v0.5.0
- ghcr.io/huggingface/text-generation-inference:2.1

<!-- Please check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/nightly-pipeline.yaml">nightly-pipeline.yaml</a> artifact for more details on how we deploy the docker images. -->


## Hardware

One AWS node with 8x NVIDIA A100 GPUs.


## Workload description

We benchmark vllm, tensorrt-llm, lmdeploy and tgi using the following workload:

- Input length: randomly sample 500 prompts from ShareGPT dataset (with fixed random seed).
- Output length: the corresponding output length of these 500 prompts.
- Models: llama-3 8B, llama-3 70B, mixtral 8x7B.
- Average QPS (query per second): 4 for the small model (llama-3 8B) and 2 for other two models. For each QPS, the arrival time of each query is determined using a random Poisson process (with fixed random seed).
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).

<!-- Check <a href="artifact://workspace/build/buildkite/vllm/performance-benchmark/.buildkite/nightly-benchmarks/tests/nightly-tests.json">nightly-tests.json</a> artifact for more details. -->

## Plots

In the following plots, the dot shows the mean and the error bar shows the standard error of the mean. Value 0 means that the corresponding benchmark crashed.

<img src="artifact://nightly_results.png" alt="Benchmarking results" height=250 >

## Results

{nightly_results_benchmarking_table}
This benchmark aims to:
- Provide performance clarity: Provide clarity on which one (vllm, tensorrt-llm, lmdeploy and SGLang) leads in performance in what workload.
- Be reproducible: one can run the exact same set of benchmarking commands inside the exact same docker by following reproducing instructions.

Latest results: [results link](https://blog.vllm.ai/2024/09/05/perf-update.html), scroll to the end.

Latest reproduction guide: [github issue link](https://github.com/vllm-project/vllm/issues/8176)


## Setup

- Docker images:
- vLLM: `vllm/vllm-openai:v0.6.2`
- SGLang: `lmsysorg/sglang:v0.3.2-cu121`
- LMDeploy: `openmmlab/lmdeploy:v0.6.1-cu12`
- TensorRT-LLM: `nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3`
- *NOTE: we use r24.07 as the current implementation only works for this version. We are going to bump this up.*
- Check [nightly-pipeline.yaml](nightly-pipeline.yaml) for the concrete docker images, specs and commands we use for the benchmark.
- Hardware
- 8x Nvidia A100 GPUs
- Workload:
- Dataset
- ShareGPT dataset
- Prefill-heavy dataset (on average 462 input tokens, 16 output tokens)
- Decode-heavy dataset (on average 462 input tokens, 256 output tokens)
- Check [nightly-tests.json](tests/nightly-tests.json) for the concrete configuration of datasets we use.
- Models: llama-3 8B, llama-3 70B.
- We do not use llama 3.1 as it is incompatible with trt-llm r24.07. ([issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105)).
- Average QPS (query per second): 2, 4, 8, 16, 32 and inf.
- Queries are randomly sampled, and arrival patterns are determined via Poisson process, but all with fixed random seed.
- Evaluation metrics: Throughput (higher the better), TTFT (time to the first token, lower the better), ITL (inter-token latency, lower the better).

# Known issues

- TRT-LLM crashes with Llama 3.1 8B [issue](https://github.com/NVIDIA/TensorRT-LLM/issues/2105).
- TGI does not support `ignore-eos` flag.
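The workload description above states that query arrival times follow a Poisson process at a target QPS with a fixed random seed. The following is a hedged sketch of how such arrival times could be generated; it is illustrative only, not the benchmark suite's actual code, and the function name is made up.

```python
# Illustrative only: generate reproducible Poisson-process arrival times
# for a target QPS, as described in the workload section above.
import numpy as np

def poisson_arrival_times(num_requests: int, qps: float, seed: int = 0) -> np.ndarray:
    """Cumulative arrival times (seconds) of a Poisson process with rate `qps`.

    Inter-arrival gaps of a Poisson process are i.i.d. exponential with
    mean 1/qps; fixing the seed makes the schedule reproducible across runs.
    """
    rng = np.random.default_rng(seed)
    gaps = rng.exponential(scale=1.0 / qps, size=num_requests)
    return np.cumsum(gaps)

# Example: 500 requests at 4 QPS span roughly 125 seconds on average.
times = poisson_arrival_times(num_requests=500, qps=4.0, seed=0)
print(f"first arrival at {times[0]:.2f}s, last at ~{times[-1]:.1f}s")
```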
98 changes: 87 additions & 11 deletions .buildkite/nightly-benchmarks/nightly-pipeline.yaml
Expand Up @@ -13,7 +13,7 @@ common_pod_spec: &common_pod_spec

common_container_settings: &common_container_settings
command:
- bash .buildkite/nightly-benchmarks/run-nightly-suite.sh
- bash .buildkite/nightly-benchmarks/scripts/run-nightly-benchmarks.sh
resources:
limits:
nvidia.com/gpu: 8
@@ -37,7 +37,10 @@ common_container_settings: &common_container_settings

steps:
- block: ":rocket: Ready for comparing vllm against alternatives? This will take 4 hours."
- label: "A100 trt benchmark"



- label: "A100 vllm step 10"
priority: 100
agents:
queue: A100
@@ -46,7 +49,21 @@ steps:
podSpec:
<<: *common_pod_spec
containers:
- image: nvcr.io/nvidia/tritonserver:24.04-trtllm-python-py3
- image: vllm/vllm-openai:v0.6.2
<<: *common_container_settings



- label: "A100 sglang benchmark"
priority: 100
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
<<: *common_pod_spec
containers:
- image: lmsysorg/sglang:v0.3.2-cu121
<<: *common_container_settings

- label: "A100 lmdeploy benchmark"
@@ -58,11 +75,13 @@
podSpec:
<<: *common_pod_spec
containers:
- image: openmmlab/lmdeploy:v0.5.0
- image: openmmlab/lmdeploy:v0.6.1-cu12
<<: *common_container_settings


- label: "A100 vllm benchmark"



- label: "A100 trt llama-8B"
priority: 100
agents:
queue: A100
@@ -71,10 +90,25 @@
podSpec:
<<: *common_pod_spec
containers:
- image: vllm/vllm-openai:latest
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_HOME
value: /root/.cache/huggingface
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: TEST_SELECTOR
value: "llama8B"

- label: "A100 tgi benchmark"

- label: "A100 trt llama-70B"
priority: 100
agents:
queue: A100
@@ -83,12 +117,54 @@
podSpec:
<<: *common_pod_spec
containers:
- image: ghcr.io/huggingface/text-generation-inference:2.1
- image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
<<: *common_container_settings
env:
- name: VLLM_USAGE_SOURCE
value: ci-test
- name: HF_HOME
value: /root/.cache/huggingface
- name: VLLM_SOURCE_CODE_LOC
value: /workspace/build/buildkite/vllm/performance-benchmark
- name: HF_TOKEN
valueFrom:
secretKeyRef:
name: hf-token-secret
key: token
- name: TEST_SELECTOR
value: "llama70B"


# FIXME(Kuntai): uncomment this after NVIDIA gives us their test docker image
# - label: "A100 trt benchmark"
# priority: 100
# agents:
# queue: A100
# plugins:
# - kubernetes:
# podSpec:
# <<: *common_pod_spec
# containers:
# - image: nvcr.io/nvidia/tritonserver:24.07-trtllm-python-py3
# <<: *common_container_settings


# FIXME(Kuntai): uncomment this after TGI supports `--ignore-eos`.
# - label: "A100 tgi benchmark"
# priority: 100
# agents:
# queue: A100
# plugins:
# - kubernetes:
# podSpec:
# <<: *common_pod_spec
# containers:
# - image: ghcr.io/huggingface/text-generation-inference:2.2.0
# <<: *common_container_settings

- wait

- label: "Plot"
- label: "Collect the results"
priority: 100
agents:
queue: A100
@@ -117,4 +193,4 @@ steps:
name: hf-token-secret
key: token

- wait
- block: ":rocket: check the results!"
76 changes: 0 additions & 76 deletions .buildkite/nightly-benchmarks/run-nightly-suite.sh

This file was deleted.

