Add Baseline for SGLang Benchmark Test #602

Draft
wants to merge 20 commits into
base: main

stbaione
Contributor

Description

The SGLang Benchmark Test has been running for a while, but it only benchmarks the shortfin server itself. To get a baseline metric and enable long-term convergence in terms of performance, we need to track metrics of the SGLang server using the same benchmark method.

This adds an sglang_benchmark_test to complement the shortfin_benchmark_test. It also restructures app_tests/benchmark_tests/llm -> app_tests/benchmark_tests/llm/sglang_benchmarks, which keeps the benchmark tests organized and allows the folder to be extended with other types of benchmarks in the future.

Why are we using docker to start the SGLang server?

Currently, the pyproject.toml file inside of SGLang requires vllm==0.6.3.dev13 to run on ROCm. I looked into building vLLM from source for this test, but couldn't find a branch, tag, or release that matched that version. From their own comments inside of pyproject.toml, it appears to be available only inside of a ROCm base image:

# HIP (Heterogeneous-computing Interface for Portability) for AMD
# => base docker rocm/vllm-dev:20241022, not from public vllm whl
srt_hip = ["sglang[runtime_common]", "torch", "vllm==0.6.3.dev13"]

Their instructions for installing and running SGLang on ROCm also appear to suggest the Docker method:

Instructions from their docs for running with ROCm

docker build --build-arg SGL_BRANCH=v0.3.5.post2 -t v0.3.5.post2-rocm620 -f Dockerfile.rocm .

alias drun='docker run -it --rm --network=host --device=/dev/kfd --device=/dev/dri --ipc=host \
    --shm-size 16G --group-add video --cap-add=SYS_PTRACE --security-opt seccomp=unconfined \
    -v $HOME/dockerx:/dockerx -v /data:/data'

drun -p 30000:30000 \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    --env "HF_TOKEN=<secret>" \
    v0.3.5.post2-rocm620 \
    python3 -m sglang.launch_server --model-path meta-llama/Llama-3.1-8B-Instruct --host 0.0.0.0 --port 30000
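
For a CI job, the `drun` invocation above could be assembled programmatically rather than through a shell alias. The following is a minimal sketch, not the workflow's actual implementation: the function name `sglang_docker_argv`, the container name `sglang-baseline`, and the switch from `-it` to `-d --name` (detached, so the job can stop the container later) are assumptions made for illustration. The extra `-v $HOME/dockerx` and `-v /data` mounts from the alias are left out for brevity.

```python
import os


def sglang_docker_argv(image, model_path, port=30000, hf_token=None,
                       name="sglang-baseline"):
    """Build a `docker run` argv mirroring the ROCm flags quoted above."""
    hf_cache = os.path.expanduser("~/.cache/huggingface")
    argv = [
        "docker", "run", "--rm",
        # Assumption: run detached with a fixed name so CI can stop it later,
        # instead of the interactive `-it` used in the docs.
        "-d", "--name", name,
        "--network=host", "--device=/dev/kfd", "--device=/dev/dri", "--ipc=host",
        "--shm-size", "16G", "--group-add", "video",
        "--cap-add=SYS_PTRACE", "--security-opt", "seccomp=unconfined",
        "-p", f"{port}:{port}",
        "-v", f"{hf_cache}:/root/.cache/huggingface",
    ]
    if hf_token:
        # Only pass the token when one is provided.
        argv += ["--env", f"HF_TOKEN={hf_token}"]
    argv += [
        image,
        "python3", "-m", "sglang.launch_server",
        "--model-path", model_path,
        "--host", "0.0.0.0", "--port", str(port),
    ]
    return argv
```

Under these assumptions, the list could be handed to `subprocess.run(argv, check=True)` to start the server, and `docker stop sglang-baseline` would tear it down afterwards; because of `--rm`, stopping the container also removes it.
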

The workflow file handles starting the container and cleaning up once the workflow is done. I set the timeout for waiting for the server to start to 10 minutes to give the SGLang server enough time to load the necessary model weights and start up.
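
The 10-minute startup wait can be pictured as a polling loop with a deadline. This is a rough sketch under stated assumptions, not the code in this PR: the helper names `http_ok` and `wait_for_server`, the 5-second poll interval, and the `/health` path in the usage note are all hypothetical; only the 600-second timeout and port 30000 come from the description above.

```python
import http.client
import time
import urllib.request


def http_ok(url, timeout=5.0):
    """Return True if a GET to `url` answers with HTTP 200, False on any failure."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts, HTTP errors, and malformed responses
        # all mean "not ready yet" while the server is still loading weights.
        return False


def wait_for_server(url, timeout_s=600.0, interval_s=5.0,
                    probe=http_ok, sleep=time.sleep, clock=time.monotonic):
    """Poll `probe(url)` until it succeeds or `timeout_s` seconds have elapsed."""
    deadline = clock() + timeout_s
    while True:
        if probe(url):
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A test fixture might call `wait_for_server("http://localhost:30000/health")` and fail the job when it returns `False`; the endpoint path is an assumption and should be replaced by whatever readiness URL the server actually exposes. The `probe`, `sleep`, and `clock` parameters exist only so the loop can be exercised without a real server.
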

stbaione and others added 20 commits November 22, 2024 01:12
Add sgl server benchmark to workflow file,
Restructure `app_tests/benchmark_tests`
Temporarily comment out shortfin job to verify sglang benchmark job
Update benchmark tests to download model on demand
Add disable-cuda-graph option to allow server to properly run