docs: add conclusion #1340

Merged 1 commit on Sep 5, 2024
`benchmark/benchmark_vllm_060/README.md` (74 changes: 38 additions & 36 deletions)
## How to reproduce the benchmark results for SGLang v0.3.0 compared to vLLM v0.6.0

In short, with multi-step enabled, in the online scenarios we benchmarked, vLLM's Median TTFT is **3 times** that of SGLang and its Median ITL is **10 times** that of SGLang; lower Median TTFT and ITL are better. vLLM's multi-step optimization did not manage to improve throughput while keeping Median TTFT and ITL low. Also, in the maximum-throughput benchmark, if vLLM runs with its default configuration instead of explicitly setting GPU memory utilization to 0.95, its maximum throughput is **lower** than that of SGLang.
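The two vLLM settings this conclusion refers to correspond to vLLM v0.6.0 server flags. A hedged sketch for orientation only; the model path and the step count below are assumptions for illustration, not values taken from this PR:

```bash
# Sketch: vLLM with multi-step scheduling enabled; the step count of 10
# is an assumed example value, not the benchmark's configuration.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --num-scheduler-steps 10

# Sketch: raising GPU memory utilization from the default (0.9) to 0.95
# for the maximum-throughput comparison.
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --gpu-memory-utilization 0.95
```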

## Online benchmark results

### Llama 3.1 8B Instruct 1 x A100 80G

| RPS | Num Prompts | Engine | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) |
|------|-------------|--------|-------------------------|------------------|------------------|-----------------|
| 4 | 1200 | SGLang | 1564.17 | **31.98** | 13.17 | **11.93** |
| 4 | 1200 | vLLM | 1691.97 | **100.48** | 14.14 | **129.32** |
| 8 | 2400 | SGLang | 2175.02 | **35.68** | 17.85 | **14.41** |
| 8 | 2400 | vLLM | 2137.16 | **120.39** | 17.09 | **158.63** |

### Llama 3.1 70B Instruct 4 x H100 80G

| RPS | Num Prompts | Engine | Median E2E Latency (ms) | Median TTFT (ms) | Median TPOT (ms) | Median ITL (ms) |
|------|-------------|--------|-------------------------|------------------|------------------|-----------------|
| 4 | 1200 | SGLang | 3005.24 | **53.94** | 25.03 | **21.67** |
| 4 | 1200 | vLLM | 2915.60 | **179.15** | 23.58 | **231.23** |
| 8 | 2400 | SGLang | 4064.98 | **58.11** | 33.07 | **24.45** |
| 8 | 2400 | vLLM | 3752.38 | **207.12** | 29.15 | **275.32** |
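
For reference, these online numbers come from fixed-request-rate runs of `sglang.bench_serving`. A minimal sketch of the client invocation, assuming the same ShareGPT dataset used in the offline commands below; the exact flags behind these tables are not shown in the diff:

```bash
# Sketch: online benchmark at 4 requests per second with 1200 prompts;
# repeat with --request-rate 8 --num-prompts 2400 for the second pair of rows.
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt \
    --num-prompts 1200 --request-rate 4
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt \
    --num-prompts 1200 --request-rate 4
```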

## Offline benchmark results

### Llama 3.1 8B Instruct 1 x A100 80G

| RPS | Num Prompts | Engine | Request throughput (req/s) | Output token throughput (tok/s) |
|------|-------------|--------|----------------------------|---------------------------------|
| inf | 5000 | SGLang | 22.03 | **4281.51** |
| inf | 5000 | vLLM | 21.27 | **4132.37** |

### Llama 3.1 70B Instruct 4 x H100 80G

| RPS | Num Prompts | Engine | Request throughput (req/s) | Output token throughput (tok/s) |
|------|-------------|--------|----------------------------|---------------------------------|
| inf | 5000 | SGLang | 19.84 | **3856.01** |
| inf | 5000 | vLLM | 19.04 | **3700.64** |

## Installation

```bash
# Offline benchmark clients (ShareGPT dataset, 5000 prompts per run)
python3 -m sglang.bench_serving --backend sglang --dataset-name sharegpt --num-prompts 5000
python3 -m sglang.bench_serving --backend vllm --dataset-name sharegpt --num-prompts 5000
```
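
The installation and server-launch steps are collapsed in this diff view. As a hedged sketch only, servers for this kind of comparison are typically started as follows; the model path and flags are assumptions, not taken from the collapsed text:

```bash
# Sketch: SGLang v0.3.0 server (model path assumed for illustration)
python3 -m sglang.launch_server --model-path meta-llama/Meta-Llama-3.1-8B-Instruct

# Sketch: vLLM v0.6.0 OpenAI-compatible server (flags assumed)
python -m vllm.entrypoints.openai.api_server \
    --model meta-llama/Meta-Llama-3.1-8B-Instruct \
    --disable-log-requests
```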
