[Performance]: FP8 KV Cache performance loss on FP16 models in ROCm
Your current environment
See previous upload.
🐛 Describe the bug
When running this command
python vllm/benchmarks/benchmark_throughput.py --input-len 512 --output-len 256 --tensor-parallel-size 4 --gpu-memory-utilization 0.7 --model <model>
with the following model setups, performance is inconsistent.

The 7B 16-bit model slows to almost half speed:
Baseline: 4916.03 tok/s @ 6.4 requests/s
--kv-cache-dtype fp8: 2405.54 tok/s @ 3.13 requests/s
Yet the 120B 4-bit GPTQ model gets slightly faster, though not outside what my average benchmarks show, so I assume it's just variance:
Baseline: 154.30 tok/s @ 0.2 requests/s
--kv-cache-dtype fp8: 181.35 tok/s @ 0.25 requests/s
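For context, here's the relative throughput change implied by the figures above (just arithmetic on the reported numbers):

```python
# Relative throughput change with --kv-cache-dtype fp8,
# computed from the numbers reported above.
results = {
    "7B FP16":      (4916.03, 2405.54),  # (baseline tok/s, fp8 tok/s)
    "120B GPTQ-4b": (154.30, 181.35),
}

for name, (baseline, fp8) in results.items():
    change = (fp8 / baseline - 1) * 100
    print(f"{name}: {change:+.1f}%")

# 7B FP16:      -51.1%  -> roughly half speed
# 120B GPTQ-4b: +17.5%  -> within run-to-run variance
```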
Here is the 7B FP16 model I attempted this with, along with this 120B 4-bit GPTQ model.
This is on a 4x AMD Instinct MI100 system with a GPU bridge, with the fixes applied in Dockerfile.rocm to update the Flash Attention (FA) branch and FA arch, plus the numpy fix, prior to today's PR #3962.
It's possible that the slowdown is due to the lack of native FP8 hardware on the MI100, but I would expect that to impact all models in that case.
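For anyone trying to reproduce this outside the benchmark script, a minimal sketch using vLLM's offline API should hit the same code path; the model path and prompt are placeholders, and kv_cache_dtype is the engine argument the --kv-cache-dtype flag maps to:

```python
from vllm import LLM, SamplingParams

# Placeholder model path; substitute the 7B FP16 or 120B GPTQ model used above.
MODEL = "<model>"

# kv_cache_dtype="fp8" mirrors the --kv-cache-dtype fp8 benchmark flag;
# use "auto" for the baseline run.
llm = LLM(
    model=MODEL,
    tensor_parallel_size=4,
    gpu_memory_utilization=0.7,
    kv_cache_dtype="fp8",
)

params = SamplingParams(max_tokens=256)
outputs = llm.generate(["A prompt of roughly 512 tokens ..."], params)
print(outputs[0].outputs[0].text)
```

Running the same snippet with kv_cache_dtype="auto" gives the baseline for comparison.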