Running on an A100 node #3359
Are the GPUs interconnected using NVLink or PCIe? Is it possible to rebuild with […]
@ggerganov How did you get "143.43 tokens per second" with `CUDA_VISIBLE_DEVICES=0`? Can you share your command, model and settings? I can get "109.17 tokens per second". Thanks.

```
CUDA_VISIBLE_DEVICES=1 ./main -m models/models--TheBloke--Llama-2-7b-Chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q4_K_M.gguf -i --interactive-first -ngl 40 -n 50
```

```
Log start
llama_print_timings: load time = 2196.50 ms
```
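When comparing figures like these across runs, it helps to pull the numbers out of the logs mechanically rather than by eye. A small sketch, assuming the usual `llama_print_timings` line format; the sample line piped in below is illustrative, not taken from either run:

```shell
# Extract the "tokens per second" figure from llama.cpp timing output.
# Assumes the standard llama_print_timings format of this era.
extract_tps() {
  sed 's/.*[ (]\([0-9.]*\) tokens per second.*/\1/'
}

# Illustrative sample line (not from a real run):
echo 'llama_print_timings:        eval time =     458.20 ms /    50 runs   (    9.16 ms per token,   109.17 tokens per second)' \
  | extract_tps
```

Piping a whole run log through `extract_tps` yields one number per timing line, which makes A/B comparisons between `CUDA_VISIBLE_DEVICES` settings a one-liner.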
Beta Was this translation helpful? Give feedback.
[OUTDATED]
I currently have access to a node with 8x A100 GPUs and am running some experiments, so I decided to share some of the results.
Slow without `CUDA_VISIBLE_DEVICES=0`

Not sure why, but if I run `main` without setting the environment variable `CUDA_VISIBLE_DEVICES=0`, the performance is ~8 times worse compared to when setting it. Any ideas what is causing this?
Performance benchmarks

- `LLAMA_CUDA_MMV_Y=2` seems to slightly improve the performance
- `LLAMA_CUDA_DMMV_X=64` also slightly improves the performance
- `-mmq 0` (`-nommq`) significantly improves prefill speed
- `CMAKE_CUDA_ARCHITECTURES=native`
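The build knobs above are compile-time settings, while `-nommq` is a run-time flag. A hedged sketch of how they would be combined, using the flag spellings from llama.cpp builds of this era (check the README of your checkout before relying on them; the model path is hypothetical):

```shell
# Make-based CUDA build with the compile-time knobs from the list above:
make clean
make LLAMA_CUBLAS=1 LLAMA_CUDA_MMV_Y=2 LLAMA_CUDA_DMMV_X=64

# Or via CMake, restricting codegen to the local GPU architecture:
#   cmake -B build -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=native
#   cmake --build build --config Release

# Run-time: disable the mul_mat_q kernels (the prefill speedup noted above);
# model path is a placeholder.
# ./main -m models/llama-2-7b-chat.Q4_K_M.gguf -ngl 100 -nommq -p "Hello"
```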
(benchmark tables not preserved in this excerpt; builds tested: 39ddda2 (1301) and 48edda3 (1330))
For reference, here is the same test on M2 Ultra
(M2 Ultra benchmark table not preserved; build: 99115f3 (1273))
```
real    3m2.119s
user    0m8.147s
sys     0m8.614s
```
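For comparing wall-clock figures like these across runs, it can help to normalize them to seconds. A small sketch, assuming the `NmS.SSSs` format that `time` prints above:

```shell
# Convert a `time`-style duration such as "3m2.119s" to seconds.
# Assumes minutes are always present, as in the output above.
to_seconds() {
  echo "$1" | awk -F'm' '{ sub(/s$/, "", $2); print $1 * 60 + $2 }'
}

to_seconds 3m2.119s
```

For example, `to_seconds 0m8.147s` gives the `user` time above in plain seconds.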