
pipeline parallelism demo #4918

Closed
slaren wants to merge 13 commits into master from sl/micro-batching

Conversation

@slaren (Collaborator) commented Jan 13, 2024

There isn't much synchronization required: just splitting the prompt into multiple micro-batches and queueing them in the CUDA streams is enough.

The micro-batch size is not configurable at the moment; it needs to be changed via n_microbatch in llama.cpp.

Incidentally, this also adds the ability to split batches into multiple micro-batches, so it is possible to call llama_decode with a batch larger than n_batch. I think the best way to implement this would be to use n_batch as the micro-batch size, and to modify the applications to ignore n_batch and submit the entire prompt or batch in a single call to llama_decode.
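For illustration only, here is a rough sketch of the application-side call pattern today, where a long prompt has to be fed to llama_decode in chunks (a hypothetical helper, not code from this PR; it assumes the early-2024 llama_batch_get_one signature). With the change described above, the loop could go away and the whole prompt could be submitted in one call:

```cpp
// Hypothetical helper: feed a long prompt to llama_decode in fixed-size chunks.
// Not the PR implementation -- with this PR the micro-batch splitting happens
// inside llama_decode itself.
#include <algorithm>
#include <vector>
#include "llama.h"

static bool decode_prompt(llama_context * ctx, std::vector<llama_token> & tokens, int32_t n_chunk) {
    for (int32_t i = 0; i < (int32_t) tokens.size(); i += n_chunk) {
        const int32_t n_eval = std::min(n_chunk, (int32_t) tokens.size() - i);
        // pos_0 = i, seq_id = 0 (single sequence); early-2024 signature
        llama_batch batch = llama_batch_get_one(tokens.data() + i, n_eval, i, 0);
        if (llama_decode(ctx, batch) != 0) {
            return false; // decode failed
        }
    }
    return true;
}
```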

Offloading tok_embd improves performance significantly in this case.

3090Ti+3080, n_microbatch=256, tok_embd on GPU:

| model | size | params | backend | ngl | n_batch | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | 1.20/1.00 | pp 512 | 6225.92 ± 298.23 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | 1.20/1.00 | pp 1024 | 6744.66 ± 217.00 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | 1.20/1.00 | pp 2048 | 6842.97 ± 43.25 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | 1.20/1.00 | pp 4096 | 6068.10 ± 91.13 |
Master

MASTER, SINGLE GPU:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | pp 512 | 5451.46 ± 64.45 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | pp 1024 | 5256.66 ± 28.64 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | pp 2048 | 4653.10 ± 9.91 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | pp 4096 | 3510.43 ± 17.87 |

MASTER, TWO GPU:
Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes
Device 1: NVIDIA GeForce RTX 3080, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | pp 512 | 4983.31 ± 130.20 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | pp 1024 | 4726.04 ± 131.48 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | pp 2048 | 4199.27 ± 42.80 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | pp 4096 | 3125.11 ± 183.91 |

@slaren added the demo label (Demonstrate some concept or idea, not intended to be merged) on Jan 13, 2024
@JohannesGaessler (Collaborator) commented Jan 13, 2024

| GPU | Model | Test | t/s master | t/s sl/micro-batching | Speedup |
| --- | --- | --- | --- | --- | --- |
| 3x P40 | llama 7B Q4_0 | pp512 | 922.56 | 928.91 | 1.01 |
| 3x P40 | llama 7B Q4_0 | pp1024 | 896.02 | 921.61 | 1.03 |
| 3x P40 | llama 7B Q4_0 | pp2048 | 837.39 | 871.38 | 1.04 |
| 3x P40 | llama 7B Q4_0 | pp4096 | 705.73 | 763.01 | 1.08 |
| 3x P40 | llama 7B Q4_0 | tg128 | 55.73 | 53.93 | 0.97 |

With 3x P40 the improvement on my system is negligible. There is some improvement for large batch sizes, but the best speed at 512 is barely affected. Also, token generation seems to become slower.

For comparison, these are the results I get with --split-mode row on master:

| model | n_batch | sm | test | t/s |
| --- | --- | --- | --- | --- |
| llama 7B Q4_0 | 4096 | row | pp 512 | 930.92 ± 25.12 |
| llama 7B Q4_0 | 4096 | row | pp 1024 | 984.85 ± 10.64 |
| llama 7B Q4_0 | 4096 | row | pp 2048 | 959.05 ± 5.86 |
| llama 7B Q4_0 | 4096 | row | pp 4096 | 807.01 ± 1.77 |
| llama 7B Q4_0 | 4096 | row | tg 128 | 55.85 ± 0.06 |

@slaren (Collaborator, Author) commented Jan 13, 2024

The copy between GPUs is still synchronous, and this limits the parallelism to two GPUs (the last split of the previous micro-batch runs simultaneously with the first split of the next one, but the other splits are synchronized). The micro-batch size also has a large impact on performance, as expected; 256 works well for me, but the P40 may need larger batches.
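For readers following along, removing that bottleneck essentially means queueing the inter-GPU copy asynchronously and ordering the consumer with an event instead of blocking the host. A rough CUDA-level sketch of the pattern (names and structure here are assumptions for illustration, not the actual ggml-backend code):

```cpp
// Illustration only: hand off a split's output from device 0 to device 1
// without blocking the host, so device 0 can start the next micro-batch
// while the copy and device 1's work are still in flight.
#include <cuda_runtime.h>

void enqueue_split_handoff(cudaStream_t stream0, cudaStream_t stream1,
                           const void * src_dev0, void * dst_dev1, size_t nbytes) {
    cudaEvent_t copy_done;
    cudaEventCreateWithFlags(&copy_done, cudaEventDisableTiming);

    // device 0: queue the peer copy right after the split's kernels on stream0
    cudaSetDevice(0);
    cudaMemcpyPeerAsync(dst_dev1, /*dstDevice=*/1, src_dev0, /*srcDevice=*/0, nbytes, stream0);
    cudaEventRecord(copy_done, stream0);

    // device 1: make stream1 wait on the copy without a host-side sync
    cudaSetDevice(1);
    cudaStreamWaitEvent(stream1, copy_done, 0);

    cudaEventDestroy(copy_done); // released once the event has completed
}
```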

@slaren (Collaborator, Author) commented Jan 13, 2024

Should be fixed now. This also improved performance for me with two GPUs.

| model | size | params | backend | ngl | n_batch | ts | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | 1.20/1.00 | pp 512 | 6605.23 ± 58.37 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | 1.20/1.00 | pp 1024 | 7264.06 ± 56.93 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | 1.20/1.00 | pp 2048 | 7367.69 ± 8.85 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 4096 | 1.20/1.00 | pp 4096 | 6611.81 ± 11.77 |

@ggerganov (Owner) commented:

Outstanding!

How do you explain that a micro-batch size of 256 is better than 512, when for these GPUs individually a batch size of 512 is optimal?

What else is required for this to become mergeable, apart from the n_batch changes that you mentioned?

@slaren (Collaborator, Author) commented Jan 13, 2024

> How do you explain that a micro-batch size of 256 is better than 512, when for these GPUs individually a batch size of 512 is optimal?

On my system, the difference between batch sizes 256 and 512 is very small (master):

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 256 | pp 512 | 5483.39 ± 15.99 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 256 | pp 1024 | 5039.27 ± 67.36 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 256 | pp 2048 | 4594.42 ± 45.94 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 512 | pp 512 | 5468.76 ± 88.49 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 512 | pp 1024 | 5038.14 ± 99.27 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 512 | pp 2048 | 4660.13 ± 50.90 |

build: 4be5ef5 (1861)

> What else is required for this to become mergeable, apart from the n_batch changes that you mentioned?

We need to figure out what to do with the token_embd tensor: do we add a parameter to offload it? There is also the issue of the CPU compute buffers. To avoid overwriting the inputs of the previous micro-batch, each in-flight micro-batch needs a different CPU backend buffer. Currently a fixed number of them is allocated; does this need to be configurable? Is there an optimal number of buffers after which adding more does nothing? I am not sure yet. Also, duplicating the entire CPU compute buffer is not great, so we should consider different solutions as well.
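To make the buffer rotation concrete, here is a hedged sketch of the idea (hypothetical structure and names, not the actual ggml code): each in-flight micro-batch writes its host-side inputs into a different slot, and a slot is only reused once the work that reads it has finished.

```cpp
// Hypothetical sketch of rotating host input buffers, one per in-flight
// micro-batch, so that queueing micro-batch N+1 cannot overwrite inputs that
// micro-batch N is still reading.
#include <cstddef>
#include <cstdint>
#include <vector>

void wait_for_fence(void * fence); // assumed backend primitive (e.g. an event sync)

struct host_input_ring {
    struct slot {
        std::vector<uint8_t> data;   // CPU-side inputs (tokens, positions, ...)
        void *               fence;  // recorded once the upload has been consumed
    };
    std::vector<slot> slots;
    size_t            next = 0;

    host_input_ring(size_t n_slots, size_t nbytes) : slots(n_slots) {
        for (auto & s : slots) { s.data.resize(nbytes); s.fence = nullptr; }
    }

    // With too few slots the pipeline stalls here; past some count, extra
    // slots only cost memory without adding any more overlap.
    slot & acquire() {
        slot & s = slots[next];
        next = (next + 1) % slots.size();
        if (s.fence) {
            wait_for_fence(s.fence);
        }
        return s;
    }
};
```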

@JohannesGaessler (Collaborator) commented:

With the latest commit (af789e7) and n_microbatch = 512:

| GPU | Model | Test | t/s master | t/s sl/micro-batching | Speedup |
| --- | --- | --- | --- | --- | --- |
| 2x P40 | llama 7B Q4_0 | pp512 | 931.35 | 934.61 | 1.00 |
| 2x P40 | llama 7B Q4_0 | pp1024 | 900.90 | 974.55 | 1.08 |
| 2x P40 | llama 7B Q4_0 | pp2048 | 840.91 | 954.64 | 1.14 |
| 2x P40 | llama 7B Q4_0 | pp4096 | 708.91 | 867.86 | 1.22 |
| 2x P40 | llama 7B Q4_0 | tg128 | 56.02 | 55.96 | 1.00 |
| 3x P40 | llama 7B Q4_0 | pp512 | 910.35 | 923.13 | 1.01 |
| 3x P40 | llama 7B Q4_0 | pp1024 | 896.07 | 971.06 | 1.08 |
| 3x P40 | llama 7B Q4_0 | pp2048 | 836.93 | 949.95 | 1.14 |
| 3x P40 | llama 7B Q4_0 | pp4096 | 704.61 | 863.59 | 1.23 |
| 3x P40 | llama 7B Q4_0 | tg128 | 55.64 | 54.14 | 0.97 |

> How do you explain that a micro-batch size of 256 is better than 512, when for these GPUs individually a batch size of 512 is optimal?

You get higher GPU utilization with microbatching because the second GPU can start after the first GPU has processed 256 tokens instead of having to wait for the full batch size.
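A back-of-the-envelope model shows the effect. Assume an idealized two-stage pipeline where each GPU holds half the layers and processing m tokens on one half costs m time units; the numbers below are purely illustrative, not measurements:

```cpp
// Idealized two-stage pipeline (illustration, not a benchmark): with k
// micro-batches, the second GPU starts as soon as the first micro-batch
// clears stage one, so total time = (k + 1) * (n_tokens / k) / 2 time units,
// versus n_tokens time units when the whole batch goes through as one piece (k = 1).
#include <cstdio>

int main() {
    const double n_tokens = 512.0;
    for (int k : {1, 2, 4, 8}) {
        const double mb    = n_tokens / k;
        const double total = (k + 1) * mb / 2.0;
        std::printf("micro-batches=%d (size %.0f): %.1f time units (%.1f%% of unpipelined)\n",
                    k, mb, total, 100.0 * total / n_tokens);
    }
    return 0;
}
```

In this toy model, smaller micro-batches keep improving the overlap, but in practice they also lower per-GPU matrix-multiplication efficiency, which is why an intermediate size such as 256 ends up winning.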

> What else is required for this to become mergeable, apart from the n_batch changes that you mentioned?

To make micro-batching actually perform well, I think it will be necessary to write better dequantization kernels. Presumably the reason slaren used FP16 is that in that case the weight matrices do not need to be dequantized, so the performance for small batches is comparatively good. The kernel I wrote in #4895 could presumably be adapted for other formats, but templating it will be difficult. With MMQ the weight matrices do not need to be dequantized either, but then the baseline performance for Volta or newer is lower, so the utility is questionable. If #4801 works out it would also help a lot, since dequantizing to int8 needs only half as much memory bandwidth as dequantizing to FP16.

@ggerganov (Owner) commented Jan 13, 2024

@slaren In your results, why does the performance drop for pp 4096 compared to pp 2048? My expectation would be that with this parallelism the performance should flatten out at some pp and not drop.

Edit: Ah, it's because the attention compute grows since it is a single sequence. Got it.

@slaren (Collaborator, Author) commented Jan 13, 2024

The performance is best with F16, but there is still a good speedup with Q4_0.

| model | test | 4be5ef5 t/s | af789e7 t/s | speedup |
| --- | --- | --- | --- | --- |
| llama 7B Q4_0 | pp 512 | 3936.45 | 4505.63 | 1.144 |
| llama 7B Q4_0 | pp 1024 | 4269.13 | 5140.05 | 1.204 |
| llama 7B Q4_0 | pp 2048 | 3996.18 | 5324.33 | 1.332 |
| llama 7B Q4_0 | pp 4096 | 3148.68 | 4974.44 | 1.579 |

> In your results, why does the performance drop for pp 4096 compared to pp 2048? My expectation would be that with this parallelism the performance should flatten out at some pp and not drop.

Performance always drops when the context is larger, but the speedup relative to master is higher with pp 4096 (it's almost 2x with F16).

@ggerganov (Owner) commented Jan 13, 2024

Just for fun, here are results on 8x RTX 4090 with n_microbatch = 256:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 1: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 2: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 3: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 4: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 5: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 6: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes
Device 7: NVIDIA GeForce RTX 4090, compute capability 8.9, VMM: yes

| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 512 | 10870.38 ± 1347.13 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 1024 | 16350.59 ± 1461.70 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 2048 | 22210.58 ± 1207.91 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 4096 | 23366.68 ± 812.25 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 8192 | 20061.01 ± 164.88 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | tg 128 | 59.27 ± 0.03 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 512 | 7728.90 ± 755.58 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 1024 | 11940.32 ± 263.53 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 2048 | 15450.37 ± 103.45 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 4096 | 16703.70 ± 129.33 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 8192 | 15263.72 ± 139.05 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | tg 128 | 93.76 ± 0.23 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 512 | 7384.39 ± 791.82 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 1024 | 12294.95 ± 498.54 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 2048 | 15748.87 ± 202.99 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 4096 | 17048.52 ± 174.44 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 8192 | 15382.21 ± 161.02 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | tg 128 | 140.22 ± 1.89 |

build: af789e7 (1861)

13B and 34B data
| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 512 | 7220.16 ± 390.88 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 1024 | 11116.64 ± 288.99 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 2048 | 12729.47 ± 77.03 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 4096 | 12990.25 ± 81.00 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 8192 | 11201.17 ± 96.51 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | tg 128 | 32.62 ± 0.00 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 512 | 4415.31 ± 85.10 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 1024 | 7142.44 ± 50.10 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 2048 | 8346.23 ± 154.62 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 4096 | 8612.18 ± 24.13 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 8192 | 7639.31 ± 39.80 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | tg 128 | 54.05 ± 0.13 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 512 | 4638.83 ± 79.10 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 1024 | 7391.51 ± 98.74 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 2048 | 8917.02 ± 39.24 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 4096 | 8826.73 ± 120.69 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 8192 | 7835.55 ± 47.22 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | tg 128 | 86.15 ± 0.43 |

build: af789e7 (1861)

| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | 4096 | pp 512 | 3192.60 ± 7.01 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | 4096 | pp 1024 | 4992.97 ± 23.54 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | 4096 | pp 2048 | 5616.59 ± 70.76 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | 4096 | pp 4096 | 5473.76 ± 24.30 |
| llama 34B F16 | 62.85 GiB | 33.74 B | CUDA | 99 | 4096 | tg 128 | 13.28 ± 0.00 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | 4096 | pp 512 | 1798.18 ± 2.67 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | 4096 | pp 1024 | 2239.08 ± 6.61 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | 4096 | pp 2048 | 2795.90 ± 1.94 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | 4096 | pp 4096 | 2895.76 ± 2.91 |
| llama 34B Q8_0 | 33.39 GiB | 33.74 B | CUDA | 99 | 4096 | tg 128 | 23.52 ± 0.01 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | 4096 | pp 512 | 1938.62 ± 6.55 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | 4096 | pp 1024 | 2404.66 ± 9.11 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | 4096 | pp 2048 | 3021.33 ± 7.27 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | 4096 | pp 4096 | 3072.98 ± 8.52 |
| llama 34B Q4_0 | 17.74 GiB | 33.74 B | CUDA | 99 | 4096 | tg 128 | 40.32 ± 0.11 |

build: af789e7 (1861)


And some more data points on 8x A100:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ: no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 8 CUDA devices:
Device 0: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 1: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 2: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 3: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 4: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 5: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 6: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes
Device 7: NVIDIA A100 80GB PCIe, compute capability 8.0, VMM: yes

| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 512 | 10803.45 ± 885.15 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 1024 | 16922.62 ± 1231.16 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 2048 | 21848.49 ± 826.91 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 4096 | 24816.68 ± 655.57 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | pp 8192 | 22859.46 ± 305.16 |
| llama 7B F16 | 12.55 GiB | 6.74 B | CUDA | 99 | 8192 | tg 128 | 71.01 ± 1.22 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 512 | 8442.00 ± 720.08 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 1024 | 12799.57 ± 675.17 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 2048 | 16123.86 ± 435.61 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 4096 | 18358.08 ± 267.10 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | pp 8192 | 17252.62 ± 85.68 |
| llama 7B Q8_0 | 6.67 GiB | 6.74 B | CUDA | 99 | 8192 | tg 128 | 97.65 ± 2.84 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 512 | 7589.52 ± 528.69 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 1024 | 12066.38 ± 217.63 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 2048 | 15141.24 ± 346.10 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 4096 | 16798.69 ± 33.26 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | pp 8192 | 16404.56 ± 77.35 |
| llama 7B Q4_0 | 3.56 GiB | 6.74 B | CUDA | 99 | 8192 | tg 128 | 126.30 ± 5.02 |

build: af789e7 (1861)

13B data
| model | size | params | backend | ngl | n_batch | test | t/s |
| --- | --- | --- | --- | --- | --- | --- | --- |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 512 | 7335.75 ± 566.35 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 1024 | 10975.81 ± 343.34 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 2048 | 13925.55 ± 344.44 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 4096 | 15117.55 ± 329.93 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | pp 8192 | 13981.44 ± 82.05 |
| llama 13B F16 | 24.25 GiB | 13.02 B | CUDA | 99 | 8192 | tg 128 | 44.06 ± 0.57 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 512 | 5205.63 ± 87.11 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 1024 | 7898.83 ± 161.54 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 2048 | 9920.56 ± 81.39 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 4096 | 10832.47 ± 83.56 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | pp 8192 | 10154.16 ± 14.45 |
| llama 13B Q8_0 | 12.88 GiB | 13.02 B | CUDA | 99 | 8192 | tg 128 | 63.78 ± 1.38 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 512 | 4808.40 ± 119.52 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 1024 | 7345.30 ± 131.71 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 2048 | 9194.74 ± 132.21 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 4096 | 10079.28 ± 175.39 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | pp 8192 | 9664.38 ± 56.10 |
| llama 13B Q4_0 | 6.86 GiB | 13.02 B | CUDA | 99 | 8192 | tg 128 | 86.61 ± 2.78 |

build: af789e7 (1861)

@JohannesGaessler (Collaborator) commented:

> @slaren In your results, why does the performance drop for pp 4096 compared to pp 2048? My expectation would be that with this parallelism the performance should flatten out at some pp and not drop.
>
> Edit: Ah, it's because the attention compute grows since it is a single sequence. Got it.

Very large batches already have this behavior on master. The main problem is that the compute needed for softmax scales with the batch size. This could be mitigated by writing a softmax kernel specifically for a diagonal infinite mask. You could potentially also save compute by not computing those elements that are later going to be masked anyway.
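As a sketch of that idea (an illustration of skipping masked elements under a causal mask, not an existing ggml kernel): with a causal mask, row i of the KQ matrix has only i + 1 unmasked entries, so a specialized softmax can iterate over just that prefix and skip the rest.

```cpp
// Illustration only: softmax of one attention row under a causal ("diagonal
// -inf") mask. Instead of adding -inf to the tail and exponentiating zeros,
// the masked tail is skipped entirely and written out as 0.
#include <cmath>
#include <cstddef>

void softmax_causal_row(float * row, size_t i_pos /* absolute query position */, size_t n_kv) {
    const size_t n = i_pos + 1;                 // unmasked prefix length
    float max_val = row[0];
    for (size_t j = 1; j < n; ++j) max_val = std::fmax(max_val, row[j]);
    float sum = 0.0f;
    for (size_t j = 0; j < n; ++j) { row[j] = std::exp(row[j] - max_val); sum += row[j]; }
    for (size_t j = 0; j < n; ++j) row[j] /= sum;
    for (size_t j = n; j < n_kv; ++j) row[j] = 0.0f;  // masked entries contribute nothing
}
```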

@ggerganov (Owner) commented:

We also lack flash attention, which results in 2 extra writes and reads of the KQ data to/from global memory.

@cmp-nct (Contributor) commented Jan 14, 2024

Awesome, 20000 tokens/sec... that's quite a change from where we were a month ago ;) I think it was 280/sec in that configuration.

@sorasoras commented:

This is interesting for my use case.
I am running 13B Q2_K on a 7900 XTX and get around 70 t/s TG for batch translation. If I run two instances, it can hit 43*2 = 86 t/s.

@sorasoras commented:

 GET_ROWS(type=q2_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q2_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q2_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q2_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=q3_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q3_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q3_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q3_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=q4_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q4_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q4_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q4_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=q5_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q5_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q5_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q5_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=q6_K,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=q6_K,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=q6_K,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=q6_K,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=iq2_xxs,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=iq2_xs,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]
  GET_ROWS(type=i32,n=256,m=5,r=4,b=1,v=0): not supported [CUDA0]
  GET_ROWS(type=i32,n=256,m=5,r=4,b=1,v=1): not supported [CUDA0]
  GET_ROWS(type=i32,n=256,m=5,r=4,b=7,v=0): not supported [CUDA0]
  GET_ROWS(type=i32,n=256,m=5,r=4,b=7,v=1): not supported [CUDA0]

get_rows doesn't seem to be working right now.

@slaren (Collaborator, Author) commented Jan 21, 2024

That is expected; the CUDA get_rows implementation does not support k-quants.

@slaren (Collaborator, Author) commented Jan 29, 2024

Short update about this: I realized that there is a possible data race when copying data between backends, and it is not enough to create multiple copies of the CPU compute buffer. A possible solution would be to create multiple copies of the GPU compute buffers (that's what the current version does), but the cost in VRAM is too high to do this. The only tensors that really need to be duplicated are the tensors copied between backends at the start of each split, and this requires a lot less memory than duplicating the entire compute buffer, so that's what I am working on.

This will require implementing all of the logic to handle this in ggml_backend_sched. This will also allow all applications that use ggml_backend_sched to benefit automatically from pipeline parallelism, and it could even be used with partial offloading to do work simultaneously on the CPU and on the GPU (that would require making the CPU backend asynchronous). However, the allocation logic is going to change significantly after the change to ggml-alloc that is required to fix #5140, and that is going to affect ggml_backend_sched too. So I am implementing the ggml-alloc fix first.
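A hedged sketch of what duplicating only the split-boundary tensors amounts to (the structure and names below are assumptions for illustration, not the ggml_backend_sched implementation):

```cpp
// Assumed structure, illustration only: instead of N copies of each full
// compute buffer, keep N copies of just the tensors that are copied between
// backends at the start of a split, and cycle through them per micro-batch so
// that micro-batch c+1 never overwrites an input micro-batch c is still using.
struct ggml_tensor;                       // from ggml.h

constexpr int N_COPIES = 4;               // in-flight micro-batches (tunable)

struct split_boundary_input {
    struct ggml_tensor * copy[N_COPIES];  // duplicated destination tensors
    void *               ready[N_COPIES]; // per-copy event: signalled when safe to reuse
};

// Everything else in the compute buffer stays shared, because it is produced
// and consumed within a single split of a single micro-batch.
inline int copy_index(int i_microbatch) {
    return i_microbatch % N_COPIES;
}
```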

@cmp-nct (Contributor) commented Jan 29, 2024

> and it could even be used with partial offloading to do work simultaneously on the CPU and on the GPU (that would require making the CPU backend asynchronous). However, the allocation logic is going to change [...]

Sounds very promising; I'm happy my little suggestion has grown into something this big.
I didn't consider that CPU and GPU parallelization was a possibility. That could turn out to be huge for bigger models if it works well, especially when a model "barely" doesn't fit on the GPU: the CPU would then hold only a tiny portion of the layers, so it might not delay the model beyond the memory transfer times.

Very nice work so far

@slaren (Collaborator, Author) commented Jan 29, 2024

Technically it is possible; there is no reason to treat the CPU backend differently than any other backend. But realistically, the CPU is so much slower than the GPU that I wouldn't expect any meaningful improvement in performance, and as it is now, most matrix multiplications are always done on the GPU during prompt processing anyway.

@JohannesGaessler (Collaborator) commented:

It is true that the GPU is used for matrix multiplications with batch sizes >= 32 anyway. But for those matrix multiplications most of the runtime goes towards CPU<->GPU data transfers, which can be executed in parallel with GPU computations. So it should still help quite a lot.
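For reference, the standard CUDA pattern for that kind of overlap looks roughly like this (a generic sketch, not ggml code; it assumes the host weights live in pinned memory, since otherwise cudaMemcpyAsync degrades to a blocking copy):

```cpp
// Generic overlap pattern (not ggml code): upload the weights needed for the
// next layer on a dedicated copy stream while the compute stream is busy with
// the current layer, and order the compute stream after the upload via an event.
#include <cuda_runtime.h>

void upload_next_layer(cudaStream_t copy_stream, cudaStream_t compute_stream,
                       const void * host_weights /* pinned */, void * dev_weights,
                       size_t nbytes, cudaEvent_t upload_done) {
    cudaMemcpyAsync(dev_weights, host_weights, nbytes, cudaMemcpyHostToDevice, copy_stream);
    cudaEventRecord(upload_done, copy_stream);
    // only the compute stream waits for this upload; the host never blocks
    cudaStreamWaitEvent(compute_stream, upload_done, 0);
}
```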

@phymbert (Collaborator) commented Mar 1, 2024

Hi @slaren, thanks a lot for your effort on the CUDA backend.

Here are the results on my infra (2x A100 80GB), what do you think?

Built with: -DCMAKE_BUILD_TYPE=Release -DLLAMA_CUBLAS=ON -DCMAKE_CUDA_ARCHITECTURES=native

Models: codellama7b-(f16|q4_0), codellama70b-q4_0, mixtral8x7b-v0.1-q4_0

sl/micro-batching f69ab89

1a. CUDA_VISIBLE_DEVICES=0: [screenshot of benchmark results]

1b. CUDA_VISIBLE_DEVICES=0,1: [screenshot of benchmark results]

master f105471

2a. CUDA_VISIBLE_DEVICES=0: [screenshot of benchmark results]

2b. CUDA_VISIBLE_DEVICES=0,1: [screenshot of benchmark results]

If it helps, I see a lot of "not enough buffers, syncing now" messages on your branch (I had to comment out that line).

Note: @ggerganov I feel like the model name for mixtral8x7b is misleading; maybe we should include the MoE config in the model name. I will have a look.

@slaren (Collaborator, Author) commented Mar 1, 2024

The results look reasonable. Mixtral does not work with pipeline parallelism due to the way the mul_mat_id operation (used for MoE) is implemented: it forces a synchronization which stops the asynchronous computation.
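For readers unfamiliar with why an op would force a synchronization: if the kernel launches depend on data that first has to be read back to the host (such as the selected expert ids), the stream has to be drained before the next op can be queued. A generic illustration of the pattern, not the actual mul_mat_id code:

```cpp
// Generic illustration (not the mul_mat_id implementation): any op whose
// launch parameters depend on device data forces a host-side read, and the
// stream synchronization that comes with it breaks the asynchronous pipeline.
#include <cuda_runtime.h>
#include <cstdint>
#include <vector>

void dispatch_per_expert(cudaStream_t stream, const int32_t * dev_ids, int n_ids) {
    std::vector<int32_t> host_ids(n_ids);
    cudaMemcpyAsync(host_ids.data(), dev_ids, n_ids * sizeof(int32_t),
                    cudaMemcpyDeviceToHost, stream);
    cudaStreamSynchronize(stream);            // <-- the pipeline stops here
    for (int32_t id : host_ids) {
        // launch the matrix multiplication for expert `id` ...
        (void) id;
    }
}
```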

This branch is very outdated and the final implementation will be very different, so at this point there is no need to run more tests on it. I'll close this PR to avoid confusion.

@slaren closed this on Mar 1, 2024
@slaren deleted the sl/micro-batching branch on Mar 21, 2024