pipeline parallelism demo #4918

slaren · 2024-01-13T17:42:21Z

There isn't much synchronization required, just splitting the prompt into multiple micro-batches and queueing them in the CUDA streams is enough.

The micro-batch size is not configurable at the moment, it needs to be changed in n_microbatch in llama.cpp.

Incidentally, this also adds the ability to split batches into multiple micro-batches, so it is possible to call llama_decode with a batch larger than n_batch. I think the best way to implement this would be to use n_batch as the microbatch size, and modify the applications to ignore n_batch and submit the entire prompt or batch in a single call to llama_decode.

Offloading tok_embd improves performance significantly in this case.

3090Ti+3080, n_microbatch=256, tok_embd on GPU:

model	size	params	backend	ngl	n_batch	ts	test	t/s
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	1.20/1.00	pp 512	6225.92 ± 298.23
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	1.20/1.00	pp 1024	6744.66 ± 217.00
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	1.20/1.00	pp 2048	6842.97 ± 43.25
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	1.20/1.00	pp 4096	6068.10 ± 91.13

Master

MASTER, SINGLE GPU:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	n_batch	test	t/s
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	pp 512	5451.46 ± 64.45
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	pp 1024	5256.66 ± 28.64
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	pp 2048	4653.10 ± 9.91
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	pp 4096	3510.43 ± 17.87

MASTER, TWO GPU:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	n_batch	test	t/s
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	pp 512	4983.31 ± 130.20
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	pp 1024	4726.04 ± 131.48
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	pp 2048	4199.27 ± 42.80
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	pp 4096	3125.11 ± 183.91

JohannesGaessler · 2024-01-13T18:55:07Z

GPU	Model	Test	t/s master	t/s sl/micro-batching	Speedup
3x P40	llama 7B Q4_0	pp512	922.56	928.91	1.01
3x P40	llama 7B Q4_0	pp1024	896.02	921.61	1.03
3x P40	llama 7B Q4_0	pp2048	837.39	871.38	1.04
3x P40	llama 7B Q4_0	pp4096	705.73	763.01	1.08
3x P40	llama 7B Q4_0	tg128	55.73	53.93	0.97

With 3x P40 the improvement on my system is negligible. There is some improvement for large batch sizes but the best speed at 512 is barely affected. Also token generation seems to become slower.

For comparison, these are the results I get with --split-mode row on master:

model	n_batch	sm	test	t/s
llama 7B Q4_0	4096	row	pp 512	930.92 ± 25.12
llama 7B Q4_0	4096	row	pp 1024	984.85 ± 10.64
llama 7B Q4_0	4096	row	pp 2048	959.05 ± 5.86
llama 7B Q4_0	4096	row	pp 4096	807.01 ± 1.77
llama 7B Q4_0	4096	row	tg 128	55.85 ± 0.06

slaren · 2024-01-13T19:07:56Z

The copy between GPUs is still synchronous, and this limits the parallelism to two GPUs (the last split of the previous micro-batch runs simultaneously with the first split of the next one, but other splits are synchronized). The micro batch size also has a large impact on the performance as expected, 256 works well for me, but the P40 may need larger batches.

slaren · 2024-01-13T19:51:01Z

Should be fixed now. This also improved performance for me with two GPUs.

model	size	params	backend	ngl	n_batch	ts	test	t/s
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	1.20/1.00	pp 512	6605.23 ± 58.37
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	1.20/1.00	pp 1024	7264.06 ± 56.93
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	1.20/1.00	pp 2048	7367.69 ± 8.85
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	4096	1.20/1.00	pp 4096	6611.81 ± 11.77

ggerganov · 2024-01-13T20:16:17Z

Outstanding!

How do you explain that microbatch size of 256 is better than 512 when for these GPUs individually a batch size of 512 is optimal?

What else is required for this to become mergeable, apart from the n_batch changes that you mentioned?

slaren · 2024-01-13T20:25:45Z

How do you explain that microbatch size of 256 is better than 512 when for these GPUs individually a batch size of 512 is optimal?

In my system, the difference between batch size 256 and 512 is very small (master):

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model	size	params	backend	ngl	n_batch	test	t/s
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	256	pp 512	5483.39 ± 15.99
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	256	pp 1024	5039.27 ± 67.36
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	256	pp 2048	4594.42 ± 45.94
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	512	pp 512	5468.76 ± 88.49
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	512	pp 1024	5038.14 ± 99.27
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	512	pp 2048	4660.13 ± 50.90

build: 4be5ef5 (1861)

What else is required for this to become mergeable, apart from the n_batch changes that you mentioned?

We need to figure what to do with the token_embd tensor, do we add a parameter to offload it? There is also the issue of the CPU compute buffers. To avoid overwriting the inputs of the previous micro-batch, each one needs a different CPU backend buffer. Currently it just allocated a fixed number, does this need to be configurable? Is there an optimal number of buffers after which adding more does nothing? I am not sure yet. Also duplicating the entire CPU compute buffer is not great, we should also consider different solutions.

JohannesGaessler · 2024-01-13T20:32:32Z

With the latest commit (af789e7) and n_microbatch = 512:

GPU	Model	Test	t/s master	t/s sl/micro-batching	Speedup
2x P40	llama 7B Q4_0	pp512	931.35	934.61	1.00
2x P40	llama 7B Q4_0	pp1024	900.90	974.55	1.08
2x P40	llama 7B Q4_0	pp2048	840.91	954.64	1.14
2x P40	llama 7B Q4_0	pp4096	708.91	867.86	1.22
2x P40	llama 7B Q4_0	tg128	56.02	55.96	1.00
3x P40	llama 7B Q4_0	pp512	910.35	923.13	1.01
3x P40	llama 7B Q4_0	pp1024	896.07	971.06	1.08
3x P40	llama 7B Q4_0	pp2048	836.93	949.95	1.14
3x P40	llama 7B Q4_0	pp4096	704.61	863.59	1.23
3x P40	llama 7B Q4_0	tg128	55.64	54.14	0.97

How do you explain that microbatch size of 256 is better than 512 when for these GPUs individually a batch size of 512 is optimal?

You get higher GPU utilization with microbatching because the second GPU can start after the first GPU has processed 256 tokens instead of having to wait for the full batch size.

What else is required for this to become mergeable, apart from the n_batch changes that you mentioned?

To make microbatching actually perform well I think it will be necessary to write better dequantization kernels. Presumably the reason slaren used FP16 is because in that case the weight matrices do not need to be dequantized so the performance for small batches is comparatively good. The kernel I wrote in #4895 could presumably be adapted for other formats but templating it will be difficult. With MMQ the weight matrices do not need to be dequantized but then the baseline performance for Volta or newer is lower so the utility is questionable. If #4801 works out it would also help a lot since dequantizing to int8 needs only half as much memory bendwidth as dequantizing to

ggerganov · 2024-01-13T20:42:15Z

@slaren In your results why does the performance drops for pp 4096 compared to pp 2048? My expectation would be that with this parallelism the performance should flat out at some pp and do not drop.

Edit: Ah it's because the attention compute grows since it is a single sequence. Got it.

slaren · 2024-01-13T20:45:58Z

The performance is best with F16, but there is still a good speedup with Q4_0.

model	test	`4be5ef5` t/s	`af789e7` t/s	speedup
llama 7B Q4_0	pp 512	3936.45	4505.63	1.144
llama 7B Q4_0	pp 1024	4269.13	5140.05	1.204
llama 7B Q4_0	pp 2048	3996.18	5324.33	1.332
llama 7B Q4_0	pp 4096	3148.68	4974.44	1.579

In your results why does the performance drops for pp 4096 compared to pp 2048? My expectation would be that with this parallelism the performance should flat out at some pp and do not drop.

Performance always drops when the context is larger, but the speedup relative to master is higher with pp 4096 (it's almost 2x with F16).

ggerganov · 2024-01-13T20:55:17Z

Just for fun, here are results on 8x RTX 4090 with n_microbatch = 256:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 4: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 5: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 6: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 7: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

model	size	params	backend	ngl	n_batch	test	t/s
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 512	10870.38 ± 1347.13
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 1024	16350.59 ± 1461.70
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 2048	22210.58 ± 1207.91
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 4096	23366.68 ± 812.25
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 8192	20061.01 ± 164.88
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	tg 128	59.27 ± 0.03
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 512	7728.90 ± 755.58
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 1024	11940.32 ± 263.53
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 2048	15450.37 ± 103.45
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 4096	16703.70 ± 129.33
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 8192	15263.72 ± 139.05
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	tg 128	93.76 ± 0.23
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 512	7384.39 ± 791.82
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 1024	12294.95 ± 498.54
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 2048	15748.87 ± 202.99
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 4096	17048.52 ± 174.44
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 8192	15382.21 ± 161.02
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	tg 128	140.22 ± 1.89

build: af789e7 (1861)

13B and 34B data

model	size	params	backend	ngl	n_batch	test	t/s
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 512	7220.16 ± 390.88
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 1024	11116.64 ± 288.99
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 2048	12729.47 ± 77.03
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 4096	12990.25 ± 81.00
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 8192	11201.17 ± 96.51
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	tg 128	32.62 ± 0.00
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 512	4415.31 ± 85.10
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 1024	7142.44 ± 50.10
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 2048	8346.23 ± 154.62
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 4096	8612.18 ± 24.13
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 8192	7639.31 ± 39.80
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	tg 128	54.05 ± 0.13
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 512	4638.83 ± 79.10
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 1024	7391.51 ± 98.74
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 2048	8917.02 ± 39.24
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 4096	8826.73 ± 120.69
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 8192	7835.55 ± 47.22
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	tg 128	86.15 ± 0.43

build: af789e7 (1861)

model	size	params	backend	ngl	n_batch	test	t/s
llama 34B F16	62.85 GiB	33.74 B	CUDA	99	4096	pp 512	3192.60 ± 7.01
llama 34B F16	62.85 GiB	33.74 B	CUDA	99	4096	pp 1024	4992.97 ± 23.54
llama 34B F16	62.85 GiB	33.74 B	CUDA	99	4096	pp 2048	5616.59 ± 70.76
llama 34B F16	62.85 GiB	33.74 B	CUDA	99	4096	pp 4096	5473.76 ± 24.30
llama 34B F16	62.85 GiB	33.74 B	CUDA	99	4096	tg 128	13.28 ± 0.00
llama 34B Q8_0	33.39 GiB	33.74 B	CUDA	99	4096	pp 512	1798.18 ± 2.67
llama 34B Q8_0	33.39 GiB	33.74 B	CUDA	99	4096	pp 1024	2239.08 ± 6.61
llama 34B Q8_0	33.39 GiB	33.74 B	CUDA	99	4096	pp 2048	2795.90 ± 1.94
llama 34B Q8_0	33.39 GiB	33.74 B	CUDA	99	4096	pp 4096	2895.76 ± 2.91
llama 34B Q8_0	33.39 GiB	33.74 B	CUDA	99	4096	tg 128	23.52 ± 0.01
llama 34B Q4_0	17.74 GiB	33.74 B	CUDA	99	4096	pp 512	1938.62 ± 6.55
llama 34B Q4_0	17.74 GiB	33.74 B	CUDA	99	4096	pp 1024	2404.66 ± 9.11
llama 34B Q4_0	17.74 GiB	33.74 B	CUDA	99	4096	pp 2048	3021.33 ± 7.27
llama 34B Q4_0	17.74 GiB	33.74 B	CUDA	99	4096	pp 4096	3072.98 ± 8.52
llama 34B Q4_0	17.74 GiB	33.74 B	CUDA	99	4096	tg 128	40.32 ± 0.11

build: af789e7 (1861)

And some more data points on 8x A100:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

model	size	params	backend	ngl	n_batch	test	t/s
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 512	10803.45 ± 885.15
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 1024	16922.62 ± 1231.16
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 2048	21848.49 ± 826.91
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 4096	24816.68 ± 655.57
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	pp 8192	22859.46 ± 305.16
llama 7B F16	12.55 GiB	6.74 B	CUDA	99	8192	tg 128	71.01 ± 1.22
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 512	8442.00 ± 720.08
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 1024	12799.57 ± 675.17
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 2048	16123.86 ± 435.61
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 4096	18358.08 ± 267.10
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	pp 8192	17252.62 ± 85.68
llama 7B Q8_0	6.67 GiB	6.74 B	CUDA	99	8192	tg 128	97.65 ± 2.84
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 512	7589.52 ± 528.69
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 1024	12066.38 ± 217.63
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 2048	15141.24 ± 346.10
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 4096	16798.69 ± 33.26
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	pp 8192	16404.56 ± 77.35
llama 7B Q4_0	3.56 GiB	6.74 B	CUDA	99	8192	tg 128	126.30 ± 5.02

build: af789e7 (1861)

13B data

model	size	params	backend	ngl	n_batch	test	t/s
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 512	7335.75 ± 566.35
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 1024	10975.81 ± 343.34
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 2048	13925.55 ± 344.44
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 4096	15117.55 ± 329.93
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	pp 8192	13981.44 ± 82.05
llama 13B F16	24.25 GiB	13.02 B	CUDA	99	8192	tg 128	44.06 ± 0.57
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 512	5205.63 ± 87.11
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 1024	7898.83 ± 161.54
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 2048	9920.56 ± 81.39
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 4096	10832.47 ± 83.56
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	pp 8192	10154.16 ± 14.45
llama 13B Q8_0	12.88 GiB	13.02 B	CUDA	99	8192	tg 128	63.78 ± 1.38
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 512	4808.40 ± 119.52
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 1024	7345.30 ± 131.71
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 2048	9194.74 ± 132.21
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 4096	10079.28 ± 175.39
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	pp 8192	9664.38 ± 56.10
llama 13B Q4_0	6.86 GiB	13.02 B	CUDA	99	8192	tg 128	86.61 ± 2.78

build: af789e7 (1861)

JohannesGaessler · 2024-01-13T21:18:20Z

@slaren In your results why does the performance drops for pp 4096 compared to pp 2048? My expectation would be that with this parallelism the performance should flat out at some pp and do not drop.

Edit: Ah it's because the attention compute grows since it is a single sequence. Got it.

Very large batches already have this behavior on master. The main problem is that the compute needed for soft max scales with batch size. This could be mitigated by writing a softmax kernel specifically for a diagonal infinite mask. You could potentially also save compute by not computing those elements that are later going to be masked anyways.

ggerganov · 2024-01-13T21:23:44Z

We also lack flash attention which results in 2 extra writes and reads of the KQ data to global memory

cmp-nct · 2024-01-14T17:37:30Z

Awesome, 20000 tokens/sec .. that's quite a change to where we've been a month ago ;) I think it was 280/sec in that configuration.

sorasoras · 2024-01-21T06:20:08Z

This is interesting in my use case.
I am running 13B Q2K on 7900xtx it get like 70T/s TG batch translation. i run two instance,it could hit 43*2=86t/s

sorasoras · 2024-01-21T10:15:25Z

 GET_ROWS(type=q2_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q2_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q2_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q2_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=q3_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q3_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q3_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q3_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=q4_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q4_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q4_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q4_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=q5_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q5_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q5_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q5_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=q6_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q6_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q6_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q6_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=i32,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=i32,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=i32,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=i32,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]

get_rows don't seems to working right now.

slaren · 2024-01-21T15:57:10Z

It is expected, the CUDA get_rows implementation does not support k-quants.

slaren · 2024-01-29T21:32:43Z

Short update about this: I realized that there is a possible data race when copying data between backends, and it is not enough to create multiple copies of the CPU compute buffer. A possible solution would be to create multiple copies of the GPU compute buffers (that's what the current version does), but the cost in VRAM is too high to do this. The only tensors that really need to be duplicated are the tensors copied between backends at the start of each split, and this requires a lot less memory than duplicating the entire compute buffer, so that's what I am working on.

This will require implementing all of the logic to handle this in ggml_backend_sched. This will also allow all applications that use ggml_backend_sched to benefit automatically from pipeline parallelism, and it could even be used with partial offloading to do work simultaneously on the CPU and on the GPU (that would require making the CPU backend asynchronous). However, the allocation logic is going to change significantly after the change to ggml-alloc that is required to fix #5140, and that is going to affect ggml_backend_sched too. So I am implementing the ggml-alloc fix first.

cmp-nct · 2024-01-29T21:41:10Z

and it could even be used with partial offloading to do work simultaneously on the CPU and on the GPU (that would require making the CPU backend asynchronous). However, the allocation logic is going to cha

Sounds very promising, I'm happy my little suggestion goes so big.
I didn't consider that CPU and GPU parallelization is a thing, that could turn out huge for bigger models if this works well.. Especially when a model is "barely" not fitting on GPU, then the CPU would have just a tiny portion of the layers so it might be able to not delay the model beyond memory transfer times.

Very nice work so far

slaren · 2024-01-29T21:46:15Z

Technically it is possible, there is no reason to treat the CPU backend different than any other backend. But realistically, the CPU is so much slower than the GPU that I wouldn't expect any meaningful improvement in performance, and as it is now, most matrix multiplications are always done on the GPU during prompt processing anyway.

JohannesGaessler · 2024-01-29T22:05:27Z

It is true that the GPU is used for matrix multiplications with batch sizes >= 32 anyways. But for those matrix multiplications most of the runtime goes towards CPU<->GPU data transfers which can be executed in parallel with GPU computations. So it should still help quite a lot.

phymbert · 2024-03-01T11:31:39Z

Hi @slaren, thanks a lot for your effort on cuda backend.

Here are the results, on my infra 2 A100 80GB, what do you think ?

build with: -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=native

Models: codellama7b-(f16|q4_0),codellama70b-q4.0,mixtral8x7b-v0.1-q4_0

sl/micro-batching f69ab89

1a. `CUDA_VISIBLE_DEVICES=0`

1b. `CUDA_VISIBLE_DEVICES=0,1`

master f105471

2a. `CUDA_VISIBLE_DEVICES=0`

2b. `CUDA_VISIBLE_DEVICES=0,1`

If it can help, I see lot of not enough buffers, syncing now on your branch (I had to comment the line).

Note: @ggerganov I feel like the model name for mixtral8x7b is misleading, maybe we should include the moe config in the model name. I will have a look.

slaren · 2024-03-01T12:43:38Z

The results look reasonable. Mixtral does not work with pipeline parallelism due to the way the mul_mat_id operation (for MoE) is implemented, it forces a synchronization which stops the asynchronous computation.

This is branch is very outdated and the final implementation will be very different, and at this point there is no need to run more tests on this branch. I'll close this PR to avoid confusion.

pipeline parallelism demo

dbbaf82

slaren added the demo Demonstrate some concept or idea, not intended to be merged label Jan 13, 2024

fix async copy between backends

af789e7

slaren added 8 commits January 14, 2024 18:47

make llama_decode async, sync on get_logits

0068da7

perplexity : ignore n_batch, submit whole chunk in one call

e264f22

minor

e5de370

Merge remote-tracking branch 'origin/master' into sl/micro-batching

09688c7

add n_ubatch (-ub) parameter

bc98eda

ggml : multi-threaded get_rows

a971987

also duplicate gpu compute buffers to avoid races

16e12ab

ggml : limit get_rows threads to the number of rows

940c01e

backend : add event API

963a122

slaren mentioned this pull request Jan 24, 2024

The inference performance of 8xH100+nvlink is worse than that of 4xA100 pcie #4747

Closed

slaren added 2 commits January 24, 2024 21:08

ggml : add tensor flags

cad4652

Merge remote-tracking branch 'origin/master' into sl/micro-batching

f69ab89

This was referenced Feb 1, 2024

ggml : add optional CPU backend context, support reusing threads, async compute ggerganov/ggml#721

Closed

Vulkan Intel Fixes, Optimizations and Debugging Flags #5301

Merged

slaren closed this Mar 1, 2024

slaren deleted the sl/micro-batching branch March 21, 2024 12:44

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

pipeline parallelism demo #4918

pipeline parallelism demo #4918

slaren commented Jan 13, 2024 •

edited

Loading

JohannesGaessler commented Jan 13, 2024 •

edited

Loading

slaren commented Jan 13, 2024

slaren commented Jan 13, 2024

ggerganov commented Jan 13, 2024

slaren commented Jan 13, 2024 •

edited

Loading

JohannesGaessler commented Jan 13, 2024

ggerganov commented Jan 13, 2024 •

edited

Loading

slaren commented Jan 13, 2024 •

edited

Loading

ggerganov commented Jan 13, 2024 •

edited

Loading

JohannesGaessler commented Jan 13, 2024

ggerganov commented Jan 13, 2024

cmp-nct commented Jan 14, 2024

sorasoras commented Jan 21, 2024

sorasoras commented Jan 21, 2024

slaren commented Jan 21, 2024

slaren commented Jan 29, 2024

cmp-nct commented Jan 29, 2024

slaren commented Jan 29, 2024

JohannesGaessler commented Jan 29, 2024