Running on an A100 node #3359

ggerganov · 2023-09-27T11:17:09Z

ggerganov
Sep 27, 2023
Maintainer

[OUTDATED]

I currently have access to a node with 8x A100 and doing some experiments, decided to share some of the results.

Slow without `CUDA_VISIBLE_DEVICES=0`

Not sure why, but if I run main without setting the environment CUDA_VISIBLE_DEVICES=0, the performance is ~8 times worse compared to when setting it:

# with CUDA_VISIBLE_DEVICES=0

ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0

llama_print_timings:        load time =   617.80 ms
llama_print_timings:      sample time =    34.84 ms /    62 runs   (    0.56 ms per token,  1779.72 tokens per second)
llama_print_timings: prompt eval time =   108.45 ms /     8 tokens (   13.56 ms per token,    73.77 tokens per second)
llama_print_timings:        eval time =   425.29 ms /    61 runs   (    6.97 ms per token,   143.43 tokens per second)
llama_print_timings:       total time =   589.89 ms

# without CUDA_VISIBLE_DEVICES=0

ggml_init_cublas: found 8 CUDA devices:
  Device 0: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 1: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 2: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 3: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 4: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 5: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 6: NVIDIA A100-SXM4-40GB, compute capability 8.0
  Device 7: NVIDIA A100-SXM4-40GB, compute capability 8.0

llama_print_timings:        load time =   864.04 ms
llama_print_timings:      sample time =    35.80 ms /    62 runs   (    0.58 ms per token,  1731.75 tokens per second)
llama_print_timings: prompt eval time =   143.87 ms /     8 tokens (   17.98 ms per token,    55.60 tokens per second)
llama_print_timings:        eval time =  3316.43 ms /    61 runs   (   54.37 ms per token,    18.39 tokens per second)
llama_print_timings:       total time =  3517.09 ms

Any ideas what is causing this?

Performance benchmarks

Using LLAMA_CUDA_MMV_Y=2 seems to slightly improve the performance
Using LLAMA_CUDA_DMMV_X=64 also slightly improves the performance
After ggml-cuda : perform cublas mat mul of quantized types as f16 #3412, using -mmq 0 (-nommq) significantly improves prefill speed
Using CUDA 11.7
Building with CMAKE_CUDA_ARCHITECTURES=native

# bench for LLaMA 7B v2
CUDA_VISIBLE_DEVICES=0 ../scripts/run-all-perf.sh llama-7b-v2 "f16 q8_0 q4_0 q4_1 q5_0 q5_1 q6_k q5_k q5_k_s q4_k q4_k_s q3_k q3_k_s" "-ngl 999 -t 1 -n 128 -p 512 -mmq 0"

model	size	params	backend	ngl	th	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	999	1	pp 512	5454.71 ± 7.94
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	999	1	tg 128	72.07 ± 0.06
llama 7B mostly Q8_0	6.67 GiB	6.74 B	CUDA	999	1	pp 512	4129.29 ± 3.80
llama 7B mostly Q8_0	6.67 GiB	6.74 B	CUDA	999	1	tg 128	101.68 ± 0.06
llama 7B mostly Q4_0	3.56 GiB	6.74 B	CUDA	999	1	pp 512	4122.66 ± 0.98
llama 7B mostly Q4_0	3.56 GiB	6.74 B	CUDA	999	1	tg 128	142.93 ± 0.11
llama 7B mostly Q4_1	3.95 GiB	6.74 B	CUDA	999	1	pp 512	4130.02 ± 1.82
llama 7B mostly Q4_1	3.95 GiB	6.74 B	CUDA	999	1	tg 128	141.04 ± 0.10
llama 7B mostly Q5_0	4.33 GiB	6.74 B	CUDA	999	1	pp 512	4029.09 ± 1.92
llama 7B mostly Q5_0	4.33 GiB	6.74 B	CUDA	999	1	tg 128	122.89 ± 0.07
llama 7B mostly Q5_1	4.72 GiB	6.74 B	CUDA	999	1	pp 512	4024.18 ± 1.64
llama 7B mostly Q5_1	4.72 GiB	6.74 B	CUDA	999	1	tg 128	124.55 ± 0.08
llama 7B mostly Q6_K	5.15 GiB	6.74 B	CUDA	999	1	pp 512	4469.75 ± 2.08
llama 7B mostly Q6_K	5.15 GiB	6.74 B	CUDA	999	1	tg 128	103.06 ± 0.06
llama 7B mostly Q5_K - Medium	4.45 GiB	6.74 B	CUDA	999	1	pp 512	4532.36 ± 2.69
llama 7B mostly Q5_K - Medium	4.45 GiB	6.74 B	CUDA	999	1	tg 128	118.11 ± 0.09
llama 7B mostly Q5_K - Small	4.33 GiB	6.74 B	CUDA	999	1	pp 512	4529.98 ± 2.48
llama 7B mostly Q5_K - Small	4.33 GiB	6.74 B	CUDA	999	1	tg 128	121.53 ± 0.09
llama 7B mostly Q4_K - Medium	3.80 GiB	6.74 B	CUDA	999	1	pp 512	4532.25 ± 1.97
llama 7B mostly Q4_K - Medium	3.80 GiB	6.74 B	CUDA	999	1	tg 128	127.15 ± 0.09
llama 7B mostly Q4_K - Small	3.59 GiB	6.74 B	CUDA	999	1	pp 512	4530.28 ± 1.71
llama 7B mostly Q4_K - Small	3.59 GiB	6.74 B	CUDA	999	1	tg 128	132.32 ± 0.10
llama 7B mostly Q3_K - Medium	3.07 GiB	6.74 B	CUDA	999	1	pp 512	4488.63 ± 3.58
llama 7B mostly Q3_K - Medium	3.07 GiB	6.74 B	CUDA	999	1	tg 128	111.70 ± 0.08
llama 7B mostly Q3_K - Small	2.75 GiB	6.74 B	CUDA	999	1	pp 512	4492.20 ± 2.91
llama 7B mostly Q3_K - Small	2.75 GiB	6.74 B	CUDA	999	1	tg 128	99.86 ± 0.06

build: 39ddda2 (1301)

# bench for LLaMA 13B v2
CUDA_VISIBLE_DEVICES=0 time ../scripts/run-all-perf.sh llama-13b-v2 "f16 q8_0 q4_0 q4_1 q5_0 q5_1 q6_k q5_k q5_k_s q4_k q4_k_s q3_k q3_k_s" "-ngl 999 -t 1 -n 128 -p 512 -mmq 0"

model	size	params	backend	ngl	th	test	t/s
llama 13B mostly F16	24.24 GiB	13.02 B	CUDA	999	1	pp 512	3473.38 ± 3.15
llama 13B mostly F16	24.24 GiB	13.02 B	CUDA	999	1	tg 128	39.76 ± 0.01
llama 13B mostly Q8_0	12.88 GiB	13.02 B	CUDA	999	1	pp 512	2463.15 ± 1.60
llama 13B mostly Q8_0	12.88 GiB	13.02 B	CUDA	999	1	tg 128	61.61 ± 0.01
llama 13B mostly Q4_0	6.86 GiB	13.02 B	CUDA	999	1	pp 512	2466.40 ± 0.18
llama 13B mostly Q4_0	6.86 GiB	13.02 B	CUDA	999	1	tg 128	91.55 ± 0.03
llama 13B mostly Q4_1	7.61 GiB	13.02 B	CUDA	999	1	pp 512	2468.36 ± 0.30
llama 13B mostly Q4_1	7.61 GiB	13.02 B	CUDA	999	1	tg 128	90.44 ± 0.03
llama 13B mostly Q5_0	8.36 GiB	13.02 B	CUDA	999	1	pp 512	2395.81 ± 0.31
llama 13B mostly Q5_0	8.36 GiB	13.02 B	CUDA	999	1	tg 128	79.22 ± 0.02
llama 13B mostly Q5_1	9.10 GiB	13.02 B	CUDA	999	1	pp 512	2393.12 ± 1.20
llama 13B mostly Q5_1	9.10 GiB	13.02 B	CUDA	999	1	tg 128	79.45 ± 0.01
llama 13B mostly Q6_K	9.95 GiB	13.02 B	CUDA	999	1	pp 512	2728.76 ± 0.79
llama 13B mostly Q6_K	9.95 GiB	13.02 B	CUDA	999	1	tg 128	65.74 ± 0.01
llama 13B mostly Q5_K - Medium	8.60 GiB	13.02 B	CUDA	999	1	pp 512	2754.25 ± 1.30
llama 13B mostly Q5_K - Medium	8.60 GiB	13.02 B	CUDA	999	1	tg 128	76.23 ± 0.02
llama 13B mostly Q5_K - Small	8.36 GiB	13.02 B	CUDA	999	1	pp 512	2753.87 ± 1.32
llama 13B mostly Q5_K - Small	8.36 GiB	13.02 B	CUDA	999	1	tg 128	78.50 ± 0.03
llama 13B mostly Q4_K - Medium	7.33 GiB	13.02 B	CUDA	999	1	pp 512	2755.60 ± 1.31
llama 13B mostly Q4_K - Medium	7.33 GiB	13.02 B	CUDA	999	1	tg 128	81.86 ± 0.03
llama 13B mostly Q4_K - Small	6.90 GiB	13.02 B	CUDA	999	1	pp 512	2755.07 ± 1.84
llama 13B mostly Q4_K - Small	6.90 GiB	13.02 B	CUDA	999	1	tg 128	85.59 ± 0.02
llama 13B mostly Q3_K - Medium	5.90 GiB	13.02 B	CUDA	999	1	pp 512	2748.00 ± 1.39
llama 13B mostly Q3_K - Medium	5.90 GiB	13.02 B	CUDA	999	1	tg 128	68.66 ± 0.02
llama 13B mostly Q3_K - Small	5.27 GiB	13.02 B	CUDA	999	1	pp 512	2748.35 ± 1.63
llama 13B mostly Q3_K - Small	5.27 GiB	13.02 B	CUDA	999	1	tg 128	59.50 ± 0.01

build: 39ddda2 (1301)

# bench for LLaMA 7B at different batch sizes
LLAMA_CUBLAS=1 make -j && CUDA_VISIBLE_DEVICES=5 ./llama-bench -m models/openllama-7b/ggml-model-f16.gguf -p 1,2,4,8,10,16,32,60,64,128,256,512,1024,2048 -ngl 100 -mmq 0 -t 1 -n 0

model	size	params	backend	ngl	threads	test	t/s
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 1	41.90 ± 0.97
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 2	50.21 ± 0.03
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 4	106.82 ± 0.05
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 8	212.72 ± 1.26
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 10	236.05 ± 1.16
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 16	414.91 ± 2.69
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 32	750.36 ± 6.43
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 60	1303.27 ± 10.70
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 64	1386.89 ± 11.62
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 128	2597.60 ± 26.66
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 256	4100.16 ± 55.35
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 512	5235.21 ± 10.50
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 1024	4857.50 ± 3.01
llama 7B mostly F16	12.55 GiB	6.74 B	CUDA	100	1	pp 2048	3505.47 ± 1.27

build: 48edda3 (1330)

For reference, here is the same test on M2 Ultra

model	size	params	backend	ngl	threads	test	t/s
llama2 7B mostly F16	12.55 GiB	6.74 B	Metal	999	4	pp 512	1490.51 ± 1.33
llama2 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	999	4	pp 512	1326.80 ± 0.68
llama2 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	999	4	pp 512	1355.31 ± 0.77
llama2 7B mostly Q4_1	3.95 GiB	6.74 B	Metal	999	4	pp 512	1352.15 ± 0.85
llama2 7B mostly Q6_K	5.15 GiB	6.74 B	Metal	999	4	pp 512	1106.62 ± 0.19
llama2 7B mostly Q5_K - Medium	4.45 GiB	6.74 B	Metal	999	4	pp 512	1103.32 ± 0.76
llama2 7B mostly Q5_K - Small	4.33 GiB	6.74 B	Metal	999	4	pp 512	1102.13 ± 0.44
llama2 7B mostly Q4_K - Medium	3.80 GiB	6.74 B	Metal	999	4	pp 512	1169.10 ± 0.37
llama2 7B mostly Q4_K - Small	3.59 GiB	6.74 B	Metal	999	4	pp 512	1178.26 ± 0.44
llama2 7B mostly Q3_K - Medium	3.07 GiB	6.74 B	Metal	999	4	pp 512	1142.14 ± 0.56
llama2 7B mostly Q3_K - Small	2.75 GiB	6.74 B	Metal	999	4	pp 512	1119.44 ± 0.24
llama2 7B mostly F16	12.55 GiB	6.74 B	Metal	999	4	tg 128	40.84 ± 0.05
llama2 7B mostly Q8_0	6.67 GiB	6.74 B	Metal	999	4	tg 128	64.54 ± 0.06
llama2 7B mostly Q4_0	3.56 GiB	6.74 B	Metal	999	4	tg 128	90.69 ± 0.17
llama2 7B mostly Q4_1	3.95 GiB	6.74 B	Metal	999	4	tg 128	85.98 ± 0.10
llama2 7B mostly Q6_K	5.15 GiB	6.74 B	Metal	999	4	tg 128	71.38 ± 0.08
llama2 7B mostly Q5_K - Medium	4.45 GiB	6.74 B	Metal	999	4	tg 128	72.57 ± 0.07
llama2 7B mostly Q5_K - Small	4.33 GiB	6.74 B	Metal	999	4	tg 128	74.01 ± 0.06
llama2 7B mostly Q4_K - Medium	3.80 GiB	6.74 B	Metal	999	4	tg 128	83.57 ± 0.15
llama2 7B mostly Q4_K - Small	3.59 GiB	6.74 B	Metal	999	4	tg 128	86.78 ± 0.18
llama2 7B mostly Q3_K - Medium	3.07 GiB	6.74 B	Metal	999	4	tg 128	83.43 ± 0.10
llama2 7B mostly Q3_K - Small	2.75 GiB	6.74 B	Metal	999	4	tg 128	85.22 ± 0.10

build: 99115f3 (1273)

model	size	params	backend	ngl	threads	test	t/s
llama2 13B mostly F16	24.24 GiB	13.02 B	Metal	999	4	pp 512	790.32 ± 0.28
llama2 13B mostly Q8_0	12.88 GiB	13.02 B	Metal	999	4	pp 512	708.47 ± 0.29
llama2 13B mostly Q4_0	6.86 GiB	13.02 B	Metal	999	4	pp 512	722.70 ± 0.12
llama2 13B mostly Q4_1	7.61 GiB	13.02 B	Metal	999	4	pp 512	721.37 ± 0.14
llama2 13B mostly Q6_K	9.95 GiB	13.02 B	Metal	999	4	pp 512	589.94 ± 0.17
llama2 13B mostly Q5_K - Medium	8.60 GiB	13.02 B	Metal	999	4	pp 512	582.08 ± 0.22
llama2 13B mostly Q4_K - Medium	7.33 GiB	13.02 B	Metal	999	4	pp 512	622.52 ± 0.17
llama2 13B mostly Q3_K - Medium	5.90 GiB	13.02 B	Metal	999	4	pp 512	607.34 ± 0.19
llama2 13B mostly F16	24.24 GiB	13.02 B	Metal	999	4	tg 128	22.60 ± 0.03
llama2 13B mostly Q8_0	12.88 GiB	13.02 B	Metal	999	4	tg 128	37.69 ± 0.01
llama2 13B mostly Q4_0	6.86 GiB	13.02 B	Metal	999	4	tg 128	56.66 ± 0.04
llama2 13B mostly Q4_1	7.61 GiB	13.02 B	Metal	999	4	tg 128	52.65 ± 0.02
llama2 13B mostly Q6_K	9.95 GiB	13.02 B	Metal	999	4	tg 128	42.73 ± 0.01
llama2 13B mostly Q5_K - Medium	8.60 GiB	13.02 B	Metal	999	4	tg 128	44.42 ± 0.02
llama2 13B mostly Q4_K - Medium	7.33 GiB	13.02 B	Metal	999	4	tg 128	51.30 ± 0.05
llama2 13B mostly Q3_K - Medium	5.90 GiB	13.02 B	Metal	999	4	tg 128	51.21 ± 0.02

build: 99115f3 (1273)

real 3m2.119s
user 0m8.147s
sys 0m8.614s

alonfaraj · 2023-09-27T20:04:55Z

alonfaraj
Sep 27, 2023

Are the gpus interconnected using NVLink or PCIe?

Is it possible to rebuild with LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 just to see if it makes any difference?

4 replies

ggerganov Sep 28, 2023
Maintainer Author

Are the gpus interconnected using NVLink or PCIe?

How do I find out? I think it's PCIe, but not 100% sure

LLAMA_CUDA_PEER_MAX_BATCH_SIZE=0 did not make a difference. I tried playing with --tensor-split but no effect.

alonfaraj Sep 28, 2023

Ok so if they interconencted using PCIe I guess LLAMA_CUDA_PEER_MAX_BATCH_SIZE not relevant.

How do I find out? I think it's PCIe, but not 100% sure

You can use nvidia-smi topo -m to see the GPUs topology and how they connected to each other.

ggerganov Sep 28, 2023
Maintainer Author

$ nvidia-smi topo -m

	GPU0	GPU1	GPU2	GPU3	GPU4	GPU5	GPU6	GPU7	mlx5_0	mlx5_1	mlx5_2	mlx5_3	mlx5_4	mlx5_5	mlx5_6	mlx5_7	mlx5_8	mlx5_9	mlx5_10	mlx5_11	mlx5_12	mlx5_13	mlx5_14	mlx5_15	mlx5_16	mlx5_17	CPU Affinity	NUMA Affinity
GPU0	X 	NV12	NV12	NV12	NV12	NV12	NV12	NV12	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU1	NV12	X 	NV12	NV12	NV12	NV12	NV12	NV12	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	48-63,176-191	3
GPU2	NV12	NV12	X 	NV12	NV12	NV12	NV12	NV12	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU3	NV12	NV12	NV12	X 	NV12	NV12	NV12	NV12	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	16-31,144-159	1
GPU4	NV12	NV12	NV12	NV12	X 	NV12	NV12	NV12	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PXB	PXB	112-127,240-254	7
GPU5	NV12	NV12	NV12	NV12	NV12	X 	NV12	NV12	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PXB	PXB	112-127,240-254	7
GPU6	NV12	NV12	NV12	NV12	NV12	NV12	X 	NV12	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	80-95,208-223	5
GPU7	NV12	NV12	NV12	NV12	NV12	NV12	NV12	X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PXB	PXB	SYS	SYS	SYS	SYS	SYS	80-95,208-223	5
mlx5_0	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_1	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	X 	PIX	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_2	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	PIX	X 	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_3	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	PXB	PXB	X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_4	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PIX	X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_5	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	X 	PIX	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_6	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	X 	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_7	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	X 	PIX	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_8	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PIX	X 	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS		
mlx5_9	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	X 	PIX	PXB	PXB	SYS	SYS	SYS	SYS	SYS		
mlx5_10	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	X 	PXB	PXB	SYS	SYS	SYS	SYS	SYS		
mlx5_11	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	X 	PIX	SYS	SYS	SYS	SYS	SYS		
mlx5_12	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PIX	X 	SYS	SYS	SYS	SYS	SYS		
mlx5_13	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	X 	SYS	SYS	SYS	SYS		
mlx5_14	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	X 	PIX	PXB	PXB		
mlx5_15	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PIX	X 	PXB	PXB		
mlx5_16	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	X 	PIX		
mlx5_17	SYS	SYS	SYS	SYS	PXB	PXB	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	SYS	PXB	PXB	PIX	X 		

Legend:

  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks

alonfaraj Sep 28, 2023

Looks good.
Can you see anything about active NVLink in nvidia-smi output?

For example:
NVLink GPU Bridge Info: Bridge ID: 0 Link 0: Status: Active Version: NVLink X.X Bandwidth: XX GB/s Link 1: Status: Active Version: NVLink X.X Bandwidth: XX GB/s

Also, just to eliminate there is something wrong with nvidia-persistenced, maybe it worth running nvidia-smi -l in a separate terminal and running the test again while it still runs.

(Just throwing some ideas and trying get a hint..)

yirunwang · 2024-01-11T03:40:09Z

yirunwang
Jan 11, 2024

@ggerganov How did you get "143.43 tokens per second" with CUDA_VISIBLE_DEVICES=0 ? Can you share your command, model and settings? I can get "109.17 tokens per second". thanks

CUDA_VISIBLE_DEVICES=1 ./main -m models/models--TheBloke--Llama-2-7b-Chat-GGUF/snapshots/191239b3e26b2882fb562ffccdd1cf0f65402adb/llama-2-7b-chat.Q4_K_M.gguf -i --interactive-first -ngl 40 -n 50

Log start
main: build = 1748 (83e633c)
main: built with cc (Ubuntu 11.4.0-1ubuntu1~22.04) 11.4.0 for x86_64-linux-gnu
main: seed = 1704943544
ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

llama_print_timings: load time = 2196.50 ms
llama_print_timings: sample time = 28.94 ms / 95 runs ( 0.30 ms per token, 3283.22 tokens per second)
llama_print_timings: prompt eval time = 24.23 ms / 5 tokens ( 4.85 ms per token, 206.36 tokens per second)
llama_print_timings: eval time = 879.33 ms / 96 runs ( 9.16 ms per token, 109.17 tokens per second)

3 replies

phymbert Feb 28, 2024
Collaborator

Thanks for sharing this helpful guideline.

I am running some perf tests on 2 A100 80GB with latest server and mixtral8x7b-v0.1-instruct quantized with Q4_0. I have some questions @ggerganov @ngxson:

CUDA_VISIBLE_DEVICES=0, means the server sees only one device. Does it mean we do not support multi-GPU well ?
For the server, what is the link between --batch-size, --threads, --threads-batch and number of sequences --parallel ? I see in the benchmark you are using only one thread. I did not managed to exceed 70 t/s with 1 slot and 13t/s per slot with 32 sequences (KV Cache size = 500k), all 81 layers are offloaded.
CMAKE_CUDA_ARCHITECTURES=native does not execute with HEAD: ggml-cuda.cu:3211: ERROR: CUDA kernel vec_dot_q5_K_q8_1_impl_vmmq #5686
I will give a try to --no-mul-mat-q, added in the print usage and readme in: cleanup unused --no-mul-mat-q,-nommq, -mmq, --mul-mat-q, mul_mat_q #5772
With 500k KV Cache size I reach 79GB*A100 memory, but yet GPU cores utilisation never exceeds 40%.

Appreciate your help on this 👍 . Using the OAI completions, system prompt is fixed, user prompt varies from 5 to 120, and n_predict varies from 30 to 150.

Note the new prometheus /metrics exporter is useful to monitor all metrics, especially KV Cache usage ratio.

ngxson Feb 28, 2024
Collaborator

Sorry I never tried on A100 so I can't answer most of the question here, but for the options:

--parallel specifies number of seq per batch. If set to 32 for example, that means it can decode upto 32 seq in parallel per batch.
--batch-size means maximum number of tokens per batch, I'm not sure if it's per seq per batch or just per batch, but you can increase this number to see if it gives better performance. Normally, on training, large batch always run faster, but takes up more VRAM.
--threads and --threads-batch are basically the same and only used on CPU, it's default to 4 threads on GPU.

Also, the prompt length seems to be quite small. Maybe you can try with continuous batching.

ggerganov Mar 1, 2024
Maintainer Author

@phymbert The information here is quite outdated. Multi-GPU support has been improved since then. There are more improvements coming (#4918). I have to take the time to redo the tests here and update the information

number of threads no longer affects performance
--no-mul-mat-q does nothing now

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Running on an A100 node #3359

{{title}}

{{editor}}'s edit

{{editor}}'s edit

Replies: 2 comments 7 replies

{{title}}

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

{{editor}}'s edit

{{editor}}'s edit

{{title}}

{{title}}

Select a reply

Running on an A100 node #3359

ggerganov Sep 27, 2023 Maintainer

[OUTDATED]

Slow without CUDA_VISIBLE_DEVICES=0

Performance benchmarks

Replies: 2 comments · 7 replies

alonfaraj Sep 27, 2023

ggerganov Sep 28, 2023 Maintainer Author

alonfaraj Sep 28, 2023

ggerganov Sep 28, 2023 Maintainer Author

alonfaraj Sep 28, 2023

yirunwang Jan 11, 2024

phymbert Feb 28, 2024 Collaborator

ngxson Feb 28, 2024 Collaborator

ggerganov Mar 1, 2024 Maintainer Author

ggerganov
Sep 27, 2023
Maintainer

Slow without `CUDA_VISIBLE_DEVICES=0`

Replies: 2 comments 7 replies

alonfaraj
Sep 27, 2023

ggerganov Sep 28, 2023
Maintainer Author

ggerganov Sep 28, 2023
Maintainer Author

yirunwang
Jan 11, 2024

phymbert Feb 28, 2024
Collaborator

ngxson Feb 28, 2024
Collaborator

ggerganov Mar 1, 2024
Maintainer Author