backend : offload large batches to GPU #6083

Merged
slaren merged 9 commits into master from sl/sched-auto-offload on Mar 18, 2024
Conversation

slaren
Collaborator

@slaren slaren commented Mar 15, 2024

Moves the logic for auto-offloading to the GPU when processing large batches into ggml_backend_sched. Currently, only CUDA and Vulkan support this; moving it to the scheduler will allow any backend to support this feature.

Instead of offloading only the matrix multiplications, the entire computation of the batch is offloaded. This reduces the amount of data that needs to be transferred between the GPU and CPU and improves performance significantly.

The weights are now copied to VRAM in the compute buffer, instead of the private CUDA pool buffer. As a result, the size of the compute buffers will increase significantly when offloading a model partially. However, the total VRAM usage should stay the same, or decrease slightly.

Backends that wish to support this feature need to implement the offload_op function. Only the CUDA backend implements it at this point.

Additionally, the CUDA backend will now attempt to register the model's memory as a host pinned buffer, even when using mmap. Previously, host buffers were only supported with mmap disabled. This further increases the performance of automatic offloading. The use of host pinned memory can be disabled by defining the GGML_CUDA_NO_PINNED environment variable.
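
As a rough illustration of what the pinned-memory registration amounts to, here is a minimal sketch using the CUDA runtime API; the function and variable names are placeholders and the actual ggml-cuda code may differ:

#include <cuda_runtime.h>
#include <stdlib.h>

// Try to register an already-allocated model buffer as pinned (page-locked) host memory,
// so host-to-device copies can use DMA. Falls back silently to pageable memory.
static bool register_model_buffer(void * ptr, size_t size) {
    if (getenv("GGML_CUDA_NO_PINNED") != NULL) {
        return false; // pinned memory explicitly disabled by the user
    }
    cudaError_t err = cudaHostRegister(ptr, size, cudaHostRegisterPortable | cudaHostRegisterReadOnly);
    if (err != cudaSuccess) {
        cudaGetLastError(); // clear the error; transfers still work, just slower
        return false;
    }
    return true;
}

// on free: if registration succeeded, call cudaHostUnregister(ptr)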

RTX 3090 Ti, CUDA under WSL:
bench-7b-pp1024

bench-mixtral-pp1024

Raw data
model size params backend ngl n_batch n_ubatch mmap test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1024 1024 0 pp 1024 388.15 ± 1.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1024 1024 1 pp 1024 348.58 ± 2.46
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 1 1024 1024 0 pp 1024 397.95 ± 1.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 1 1024 1024 1 pp 1024 359.64 ± 1.94
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 2 1024 1024 0 pp 1024 409.85 ± 2.36
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 2 1024 1024 1 pp 1024 370.63 ± 3.55
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 3 1024 1024 0 pp 1024 422.51 ± 2.51
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 3 1024 1024 1 pp 1024 380.48 ± 1.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 4 1024 1024 0 pp 1024 433.78 ± 1.37
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 4 1024 1024 1 pp 1024 392.48 ± 2.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 5 1024 1024 0 pp 1024 447.87 ± 1.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 5 1024 1024 1 pp 1024 404.52 ± 2.98
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 6 1024 1024 0 pp 1024 463.28 ± 1.98
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 6 1024 1024 1 pp 1024 418.75 ± 2.95
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 7 1024 1024 0 pp 1024 478.75 ± 1.40
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 7 1024 1024 1 pp 1024 430.76 ± 2.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 8 1024 1024 0 pp 1024 495.91 ± 1.76
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 8 1024 1024 1 pp 1024 447.96 ± 3.44
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 9 1024 1024 0 pp 1024 515.97 ± 1.26
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 9 1024 1024 1 pp 1024 469.55 ± 2.47
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 10 1024 1024 0 pp 1024 535.50 ± 2.02
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 10 1024 1024 1 pp 1024 485.98 ± 3.47
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 11 1024 1024 0 pp 1024 555.22 ± 3.80
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 11 1024 1024 1 pp 1024 504.74 ± 3.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 12 1024 1024 0 pp 1024 581.50 ± 3.47
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 12 1024 1024 1 pp 1024 529.37 ± 1.10
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 13 1024 1024 0 pp 1024 605.49 ± 3.89
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 13 1024 1024 1 pp 1024 550.70 ± 1.20
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 14 1024 1024 0 pp 1024 636.00 ± 3.93
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 14 1024 1024 1 pp 1024 574.79 ± 1.91
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 15 1024 1024 0 pp 1024 669.54 ± 2.16
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 15 1024 1024 1 pp 1024 611.74 ± 3.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 16 1024 1024 0 pp 1024 696.63 ± 5.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 16 1024 1024 1 pp 1024 638.12 ± 2.97
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 17 1024 1024 0 pp 1024 739.56 ± 3.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 17 1024 1024 1 pp 1024 678.63 ± 3.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 18 1024 1024 0 pp 1024 784.66 ± 2.44
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 18 1024 1024 1 pp 1024 713.97 ± 2.73
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 19 1024 1024 0 pp 1024 828.81 ± 3.28
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 19 1024 1024 1 pp 1024 759.73 ± 3.57
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 20 1024 1024 0 pp 1024 884.96 ± 4.69
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 20 1024 1024 1 pp 1024 806.80 ± 6.71
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 21 1024 1024 0 pp 1024 948.70 ± 5.85
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 21 1024 1024 1 pp 1024 860.19 ± 5.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 22 1024 1024 0 pp 1024 1019.88 ± 3.62
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 22 1024 1024 1 pp 1024 933.79 ± 4.86
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 23 1024 1024 0 pp 1024 1101.54 ± 4.91
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 23 1024 1024 1 pp 1024 1007.86 ± 4.38
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 24 1024 1024 0 pp 1024 1194.18 ± 3.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 24 1024 1024 1 pp 1024 1095.93 ± 15.66
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 25 1024 1024 0 pp 1024 1311.94 ± 8.63
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 25 1024 1024 1 pp 1024 1207.60 ± 10.30
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 26 1024 1024 0 pp 1024 1442.92 ± 14.07
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 26 1024 1024 1 pp 1024 1346.63 ± 15.93
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 27 1024 1024 0 pp 1024 1615.53 ± 15.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 27 1024 1024 1 pp 1024 1490.20 ± 9.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 28 1024 1024 0 pp 1024 1818.64 ± 30.18
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 28 1024 1024 1 pp 1024 1710.29 ± 17.53
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 29 1024 1024 0 pp 1024 2144.10 ± 21.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 29 1024 1024 1 pp 1024 1993.06 ± 27.68
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 30 1024 1024 0 pp 1024 2546.11 ± 19.09
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 30 1024 1024 1 pp 1024 2371.35 ± 35.65
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 31 1024 1024 0 pp 1024 2885.51 ± 115.14
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 31 1024 1024 1 pp 1024 2863.88 ± 132.35
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 32 1024 1024 0 pp 1024 3732.88 ± 206.19
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 32 1024 1024 1 pp 1024 3694.50 ± 119.51
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 33 1024 1024 0 pp 1024 4685.48 ± 5.83
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 33 1024 1024 1 pp 1024 4653.46 ± 45.50

build: 4755afd (2431)

model size params backend ngl n_batch n_ubatch test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1024 1024 pp 1024 1178.01 ± 52.80
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 1 1024 1024 pp 1024 1221.25 ± 20.50
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 2 1024 1024 pp 1024 1251.01 ± 30.48
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 3 1024 1024 pp 1024 1294.10 ± 15.29
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 4 1024 1024 pp 1024 1299.26 ± 36.69
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 5 1024 1024 pp 1024 1313.64 ± 53.28
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 6 1024 1024 pp 1024 1371.72 ± 48.12
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 7 1024 1024 pp 1024 1404.57 ± 38.03
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 8 1024 1024 pp 1024 1467.46 ± 42.10
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 9 1024 1024 pp 1024 1512.92 ± 44.17
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 10 1024 1024 pp 1024 1561.79 ± 32.51
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 11 1024 1024 pp 1024 1546.95 ± 33.21
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 12 1024 1024 pp 1024 1638.92 ± 38.17
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 13 1024 1024 pp 1024 1689.80 ± 66.00
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 14 1024 1024 pp 1024 1770.98 ± 30.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 15 1024 1024 pp 1024 1721.52 ± 79.84
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 16 1024 1024 pp 1024 1806.18 ± 95.38
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 17 1024 1024 pp 1024 1924.98 ± 55.63
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 18 1024 1024 pp 1024 1969.87 ± 81.24
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 19 1024 1024 pp 1024 2023.63 ± 63.53
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 20 1024 1024 pp 1024 2105.42 ± 160.57
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 21 1024 1024 pp 1024 2224.15 ± 130.01
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 22 1024 1024 pp 1024 2274.62 ± 54.49
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 23 1024 1024 pp 1024 2402.49 ± 98.09
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 24 1024 1024 pp 1024 2598.08 ± 99.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 25 1024 1024 pp 1024 2758.21 ± 67.71
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 26 1024 1024 pp 1024 2788.94 ± 168.04
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 27 1024 1024 pp 1024 3061.96 ± 81.72
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 28 1024 1024 pp 1024 3219.39 ± 97.09
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 29 1024 1024 pp 1024 3455.13 ± 77.40
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 30 1024 1024 pp 1024 3603.32 ± 77.86
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 31 1024 1024 pp 1024 3886.03 ± 106.86
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 32 1024 1024 pp 1024 4449.24 ± 4.91
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 33 1024 1024 pp 1024 4622.98 ± 8.33

build: 7664a45b (2441)

model size params backend ngl n_batch n_ubatch mmap test t/s
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 0 1024 1024 0 pp 1024 134.99 ± 0.17
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 0 1024 1024 1 pp 1024 94.98 ± 0.96
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 1 1024 1024 0 pp 1024 137.99 ± 0.18
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 1 1024 1024 1 pp 1024 94.75 ± 6.26
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 2 1024 1024 0 pp 1024 140.96 ± 0.19
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 2 1024 1024 1 pp 1024 99.24 ± 0.99
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 3 1024 1024 0 pp 1024 144.41 ± 0.32
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 3 1024 1024 1 pp 1024 101.79 ± 0.87
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 4 1024 1024 0 pp 1024 147.93 ± 0.42
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 4 1024 1024 1 pp 1024 104.54 ± 1.73
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 5 1024 1024 0 pp 1024 151.50 ± 0.25
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 5 1024 1024 1 pp 1024 108.43 ± 0.68
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 6 1024 1024 0 pp 1024 155.25 ± 0.41
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 6 1024 1024 1 pp 1024 111.04 ± 0.83
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 7 1024 1024 0 pp 1024 159.65 ± 0.39
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 7 1024 1024 1 pp 1024 114.14 ± 1.36
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 8 1024 1024 0 pp 1024 164.27 ± 0.42
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 8 1024 1024 1 pp 1024 118.19 ± 0.47
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 9 1024 1024 0 pp 1024 168.53 ± 0.30
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 9 1024 1024 1 pp 1024 121.97 ± 0.78
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 10 1024 1024 0 pp 1024 173.33 ± 0.66
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 10 1024 1024 1 pp 1024 126.47 ± 0.79
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 11 1024 1024 0 pp 1024 178.72 ± 0.26
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 11 1024 1024 1 pp 1024 132.09 ± 1.04
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 12 1024 1024 0 pp 1024 184.57 ± 0.45
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 12 1024 1024 1 pp 1024 135.79 ± 1.24
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 13 1024 1024 0 pp 1024 190.21 ± 0.58
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 13 1024 1024 1 pp 1024 141.10 ± 1.34
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 14 1024 1024 0 pp 1024 196.32 ± 0.35
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 14 1024 1024 1 pp 1024 147.36 ± 1.18
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 15 1024 1024 0 pp 1024 203.56 ± 0.48
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 15 1024 1024 1 pp 1024 152.44 ± 0.84
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 16 1024 1024 0 pp 1024 209.75 ± 0.60
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 16 1024 1024 1 pp 1024 157.82 ± 1.16
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 17 1024 1024 0 pp 1024 217.25 ± 0.71
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 17 1024 1024 1 pp 1024 165.75 ± 0.76
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 18 1024 1024 0 pp 1024 225.32 ± 0.77
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 18 1024 1024 1 pp 1024 171.47 ± 1.00
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 19 1024 1024 0 pp 1024 233.52 ± 0.36
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 19 1024 1024 1 pp 1024 179.67 ± 1.13
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 20 1024 1024 0 pp 1024 243.01 ± 0.55
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 20 1024 1024 1 pp 1024 189.02 ± 1.60
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 21 1024 1024 0 pp 1024 253.07 ± 0.47
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 21 1024 1024 1 pp 1024 198.75 ± 1.02
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 22 1024 1024 0 pp 1024 263.99 ± 0.49
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 22 1024 1024 1 pp 1024 210.41 ± 0.96
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 23 1024 1024 0 pp 1024 276.09 ± 0.38
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 23 1024 1024 1 pp 1024 221.90 ± 0.81
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 24 1024 1024 0 pp 1024 288.64 ± 0.34
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 24 1024 1024 1 pp 1024 234.89 ± 0.71
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 25 1024 1024 0 pp 1024 303.30 ± 0.45
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 25 1024 1024 1 pp 1024 251.23 ± 0.95
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 26 1024 1024 0 pp 1024 318.69 ± 0.62
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 26 1024 1024 1 pp 1024 267.34 ± 1.26
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 27 1024 1024 0 pp 1024 336.82 ± 1.00
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 27 1024 1024 1 pp 1024 290.10 ± 0.65
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 28 1024 1024 0 pp 1024 357.83 ± 0.52
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 28 1024 1024 1 pp 1024 313.36 ± 1.26
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 29 1024 1024 0 pp 1024 379.97 ± 0.58
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 29 1024 1024 1 pp 1024 342.14 ± 1.99
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 30 1024 1024 0 pp 1024 405.32 ± 0.72
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 30 1024 1024 1 pp 1024 375.44 ± 2.22
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 31 1024 1024 0 pp 1024 435.00 ± 1.21
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 31 1024 1024 1 pp 1024 416.74 ± 1.62
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 32 1024 1024 0 pp 1024 468.47 ± 1.59
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 32 1024 1024 1 pp 1024 466.26 ± 1.79
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 33 1024 1024 0 pp 1024 475.30 ± 0.69
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 33 1024 1024 1 pp 1024 476.06 ± 0.96

build: 46acb36 (2437)

model size params backend ngl n_batch n_ubatch test t/s
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 0 1024 1024 pp 1024 241.84 ± 2.76
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 1 1024 1024 pp 1024 235.62 ± 4.85
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 2 1024 1024 pp 1024 247.34 ± 3.94
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 3 1024 1024 pp 1024 249.00 ± 2.90
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 4 1024 1024 pp 1024 256.21 ± 2.90
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 5 1024 1024 pp 1024 256.75 ± 5.63
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 6 1024 1024 pp 1024 256.88 ± 5.62
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 7 1024 1024 pp 1024 258.29 ± 7.07
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 8 1024 1024 pp 1024 267.07 ± 1.51
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 9 1024 1024 pp 1024 265.86 ± 4.53
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 10 1024 1024 pp 1024 275.18 ± 0.92
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 11 1024 1024 pp 1024 278.09 ± 1.90
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 12 1024 1024 pp 1024 282.25 ± 8.06
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 13 1024 1024 pp 1024 293.48 ± 8.02
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 14 1024 1024 pp 1024 295.53 ± 3.59
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 15 1024 1024 pp 1024 312.34 ± 5.02
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 16 1024 1024 pp 1024 316.29 ± 6.70
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 17 1024 1024 pp 1024 319.90 ± 10.61
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 18 1024 1024 pp 1024 326.59 ± 4.95
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 19 1024 1024 pp 1024 332.98 ± 4.75
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 20 1024 1024 pp 1024 344.74 ± 8.89
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 21 1024 1024 pp 1024 352.98 ± 3.65
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 22 1024 1024 pp 1024 357.80 ± 5.44
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 23 1024 1024 pp 1024 368.76 ± 5.73
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 24 1024 1024 pp 1024 374.94 ± 3.11
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 25 1024 1024 pp 1024 388.92 ± 6.49
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 26 1024 1024 pp 1024 401.82 ± 4.66
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 27 1024 1024 pp 1024 408.44 ± 6.74
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 28 1024 1024 pp 1024 422.58 ± 4.10
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 29 1024 1024 pp 1024 435.36 ± 2.60
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 30 1024 1024 pp 1024 435.46 ± 8.18
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 31 1024 1024 pp 1024 461.35 ± 1.81
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 32 1024 1024 pp 1024 476.20 ± 1.11
llama 7B Q3_K - Large 19.03 GiB 46.70 B CUDA 33 1024 1024 pp 1024 478.29 ± 0.65

build: 7664a45b (2441)

70B Q4_0
GPU layers Model Test t/s master t/s sl/sched-auto-offload Speedup
0 llama 70B Q4_0 pp512 47.42 75.89 1.60
0 llama 70B Q4_0 pp1024 58.25 133.77 2.30
10 llama 70B Q4_0 pp512 53.49 86.37 1.61
10 llama 70B Q4_0 pp1024 65.27 154.15 2.36
20 llama 70B Q4_0 pp512 58.38 95.88 1.64
20 llama 70B Q4_0 pp1024 73.33 167.48 2.28
30 llama 70B Q4_0 pp512 70.39 148.83 2.11
30 llama 70B Q4_0 pp1024 84.76 240.11 2.83
40 llama 70B Q4_0 pp512 85.17 178.25 2.09
40 llama 70B Q4_0 pp1024 102.42 280.74 2.74

@slaren
Collaborator Author

slaren commented Mar 15, 2024

This is the first step to allow the CUDA backend to free its resources when its ggml-backend objects are deleted. Currently, the CUDA backend allocates many resources as globals to support this feature.

@Artefact2
Collaborator

Artefact2 commented Mar 15, 2024

diff --git a/ggml-cuda.cu b/ggml-cuda.cu
index 9e92acc0..13640f98 100644
--- a/ggml-cuda.cu
+++ b/ggml-cuda.cu
@@ -82,6 +82,10 @@
 #define cudaGetDeviceProperties hipGetDeviceProperties
 #define cudaGetErrorString hipGetErrorString
 #define cudaGetLastError hipGetLastError
+#define cudaHostRegister hipHostRegister
+#define cudaHostRegisterPortable hipHostRegisterPortable
+#define cudaHostRegisterReadOnly hipHostRegisterReadOnly
+#define cudaHostUnregister hipHostUnregister
 #define cudaLaunchHostFunc hipLaunchHostFunc
 #ifdef GGML_HIP_UMA
 #define cudaMalloc hipMallocManaged

model size params backend branch ngl test t/s
llama 13B Q4_0 6.88 GiB 13.02 B ROCm pr 0 pp 512 300.14 ± 0.29
llama 13B Q4_0 6.88 GiB 13.02 B ROCm master 0 pp 512 187.59 ± 0.21
llama 7B Q4_K - Small 24.91 GiB 46.70 B ROCm pr 0 pp 512 114.20 ± 0.32
llama 7B Q4_K - Small 24.91 GiB 46.70 B ROCm master 0 pp 512 59.93 ± 0.27

More benches here

@Dampfinchen

Dampfinchen commented Mar 15, 2024

Wow, I'm speechless. This is beyond incredible and a HUGE leap forward!

llama_print_timings:        load time =   25189,16 ms
llama_print_timings:      sample time =      72,49 ms /   180 runs   (    0,40 ms per token,  2483,10 tokens per second)
llama_print_timings: prompt eval time =   31513,38 ms /  3602 tokens (    8,75 ms per token,   114,30 tokens per second)
llama_print_timings:        eval time =   41897,48 ms /   179 runs   (  234,06 ms per token,     4,27 tokens per second)
llama_print_timings:       total time =   73539,82 ms /  3781 tokens 

Speed before this PR:

llama_print_timings:        load time =    2482,92 ms
llama_print_timings:      sample time =      69,55 ms /   180 runs   (    0,39 ms per token,  2587,99 tokens per second)
llama_print_timings: prompt eval time =   51669,64 ms /  3602 tokens (   14,34 ms per token,    69,71 tokens per second)
llama_print_timings:        eval time =   42287,08 ms /   179 runs   (  236,24 ms per token,     4,23 tokens per second)
llama_print_timings:       total time =   94085,31 ms /  3781 tokens

That's indeed double the prompt processing speed! (5 layers offloaded with an RTX 2060 laptop and Mixtral.)

Thank you so much Slaren!!

@USBhost

USBhost commented Mar 15, 2024

On my A6000 (using stock settings) there's a .31 tokens per second eval time regression for a 70b model. This .31 tps is consistent on just about every run.

./main -ngl 99 -m /mnt/40TB/AI/MiquMaid-v2-70B-DPO/ggml-model-Q4_K_M.gguf -p "Write a long story on why the sky is red."

Current HEAD 4e9a7f7f7fb6acbddd1462909c8d696e38edbfcc
llama_print_timings:        load time =   14921.14 ms
llama_print_timings:      sample time =     405.60 ms /   723 runs   (    0.56 ms per token,  1782.53 tokens per second)
llama_print_timings: prompt eval time =     286.73 ms /    12 tokens (   23.89 ms per token,    41.85 tokens per second)
llama_print_timings:        eval time =   50749.65 ms /   722 runs   (   70.29 ms per token,    14.23 tokens per second)
llama_print_timings:       total time =   51652.08 ms /   734 tokens

llama_print_timings:        load time =   14844.16 ms
llama_print_timings:      sample time =     678.35 ms /  1187 runs   (    0.57 ms per token,  1749.83 tokens per second)
llama_print_timings: prompt eval time =     287.25 ms /    12 tokens (   23.94 ms per token,    41.78 tokens per second)
llama_print_timings:        eval time =   83862.37 ms /  1186 runs   (   70.71 ms per token,    14.14 tokens per second)
llama_print_timings:       total time =   85181.27 ms /  1198 tokens

llama_print_timings:        load time =   14820.03 ms
llama_print_timings:      sample time =     671.43 ms /  1194 runs   (    0.56 ms per token,  1778.30 tokens per second)
llama_print_timings: prompt eval time =     287.75 ms /    12 tokens (   23.98 ms per token,    41.70 tokens per second)
llama_print_timings:        eval time =   84489.73 ms /  1193 runs   (   70.82 ms per token,    14.12 tokens per second)
llama_print_timings:       total time =   85801.20 ms /  1205 tokens

PR
llama_print_timings:        load time =   16542.88 ms
llama_print_timings:      sample time =     578.93 ms /  1032 runs   (    0.56 ms per token,  1782.61 tokens per second)
llama_print_timings: prompt eval time =     287.61 ms /    12 tokens (   23.97 ms per token,    41.72 tokens per second)
llama_print_timings:        eval time =   74705.90 ms /  1031 runs   (   72.46 ms per token,    13.80 tokens per second)
llama_print_timings:       total time =   75894.40 ms /  1043 tokens

llama_print_timings:        load time =   16675.73 ms
llama_print_timings:      sample time =     476.32 ms /   831 runs   (    0.57 ms per token,  1744.63 tokens per second)
llama_print_timings: prompt eval time =     289.29 ms /    12 tokens (   24.11 ms per token,    41.48 tokens per second)
llama_print_timings:        eval time =   59736.20 ms /   830 runs   (   71.97 ms per token,    13.89 tokens per second)
llama_print_timings:       total time =   60767.06 ms /   842 tokens

llama_print_timings:        load time =   16597.13 ms
llama_print_timings:      sample time =     392.05 ms /   692 runs   (    0.57 ms per token,  1765.06 tokens per second)
llama_print_timings: prompt eval time =     292.76 ms /    12 tokens (   24.40 ms per token,    40.99 tokens per second)
llama_print_timings:        eval time =   49830.24 ms /   691 runs   (   72.11 ms per token,    13.87 tokens per second)
llama_print_timings:       total time =   50735.73 ms /   703 tokens

A6000 + A4000. Again there's a regression of around 0.30 t/s, and load time is also longer on this PR.
taskset -ac 0 ./main -ngl 99 -m /mnt/40TB/AI/MiquMaid-v2-70B-DPO/ggml-model-Q4_K_M.gguf -p "Write a long story on the reason why the sky is green but make it spicy."

Current HEAD
llama_print_timings:        load time =   15157.09 ms
llama_print_timings:      sample time =     341.39 ms /   587 runs   (    0.58 ms per token,  1719.45 tokens per second)
llama_print_timings: prompt eval time =     554.58 ms /    19 tokens (   29.19 ms per token,    34.26 tokens per second)
llama_print_timings:        eval time =   47648.32 ms /   586 runs   (   81.31 ms per token,    12.30 tokens per second)
llama_print_timings:       total time =   48739.75 ms /   605 tokens

PR
llama_print_timings:        load time =   16780.64 ms
llama_print_timings:      sample time =     477.61 ms /   827 runs   (    0.58 ms per token,  1731.53 tokens per second)
llama_print_timings: prompt eval time =     558.27 ms /    19 tokens (   29.38 ms per token,    34.03 tokens per second)
llama_print_timings:        eval time =   68706.55 ms /   826 runs   (   83.18 ms per token,    12.02 tokens per second)
llama_print_timings:       total time =   70025.41 ms /   845 tokens

@slaren
Collaborator Author

slaren commented Mar 15, 2024

@USBhost should be fixed now.

Interestingly, this was caused by an increase to GGML_SCHED_MAX_SPLITS. Increasing this constant also used to increase the size of a hash table that needs to be cleared on every evaluation, which added enough overhead to be measurable.
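
To illustrate the kind of overhead being described (purely a sketch with hypothetical names and sizes, not the actual ggml_backend_sched data structures): if the hash table is sized from GGML_SCHED_MAX_SPLITS, the per-evaluation reset cost grows with that constant.

#include <string.h>

#define GGML_SCHED_MAX_SPLITS 2048                        // raising this constant ...
#define SCHED_HASH_SIZE       (GGML_SCHED_MAX_SPLITS * 8) // ... used to also grow this table

struct sched_hash_entry {
    const void * key;
    int          backend_id;
};

static struct sched_hash_entry sched_hash[SCHED_HASH_SIZE];

// called once per graph evaluation; its cost is proportional to SCHED_HASH_SIZE,
// which is why a larger GGML_SCHED_MAX_SPLITS showed up as measurable overhead
static void sched_reset(void) {
    memset(sched_hash, 0, sizeof(sched_hash));
}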

@tbocek

tbocek commented Mar 15, 2024

I just tried this PR. I'm not sure what fixed it, but I no longer get the error reported in #5701 with benchmark-matmult; it now completes with ROCm on the 7900 XTX. With master I see the same abort error, while with this PR it works fine.

main: built with cc (GCC) 13.2.1 20230801 for x86_64-pc-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=1
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;            202256;     57.07
        1;       1; 11008;  4096;   128;    11542724608;            202198;     57.09
        2;       1; 11008;  4096;   128;    11542724608;            201140;     57.39
        3;       1; 11008;  4096;   128;    11542724608;            200890;     57.46
        4;       1; 11008;  4096;   128;    11542724608;            201915;     57.17
        5;       1; 11008;  4096;   128;    11542724608;            202146;     57.10
        6;       1; 11008;  4096;   128;    11542724608;            201838;     57.19
        7;       1; 11008;  4096;   128;    11542724608;            202511;     57.00
        8;       1; 11008;  4096;   128;    11542724608;            202692;     56.95
        9;       1; 11008;  4096;   128;    11542724608;            202369;     57.04

Average                                                                         57.14
=====================================================================================

@slaren
Collaborator Author

slaren commented Mar 15, 2024

@tbocek unfortunately that has not really been fixed. benchmark-matmult depends on the ability of the CUDA/HIP backend to offload large matrix multiplications automatically, but that is no longer done; it now requires using ggml-backend with ggml_backend_sched. So what you are measuring there is just the CPU performance.

@USBhost

USBhost commented Mar 15, 2024

@USBhost should be fixed now.

Interestingly, this was caused by an increase to GGML_SCHED_MAX_SPLITS. Increasing this constant also used to increase the size of a hash table, which needs to be cleaned on every evaluation. That was enough to increase the overhead enough to be measurable.

Yeah, that fixed it, thanks. It feels just a tad faster than master, but that load time is still looking sus...
A6000 only.

PR
llama_print_timings:        load time =   16623.04 ms
llama_print_timings:      sample time =     236.75 ms /   419 runs   (    0.57 ms per token,  1769.81 tokens per second)
llama_print_timings: prompt eval time =     456.24 ms /    19 tokens (   24.01 ms per token,    41.64 tokens per second)
llama_print_timings:        eval time =   29054.86 ms /   418 runs   (   69.51 ms per token,    14.39 tokens per second)
llama_print_timings:       total time =   29870.07 ms /   437 tokens

llama_print_timings:        load time =   16588.26 ms
llama_print_timings:      sample time =     470.08 ms /   842 runs   (    0.56 ms per token,  1791.17 tokens per second)
llama_print_timings: prompt eval time =     455.79 ms /    19 tokens (   23.99 ms per token,    41.69 tokens per second)
llama_print_timings:        eval time =   58816.35 ms /   841 runs   (   69.94 ms per token,    14.30 tokens per second)
llama_print_timings:       total time =   59983.48 ms /   860 tokens

llama_print_timings:        load time =   16525.14 ms
llama_print_timings:      sample time =     293.29 ms /   532 runs   (    0.55 ms per token,  1813.88 tokens per second)
llama_print_timings: prompt eval time =     454.93 ms /    19 tokens (   23.94 ms per token,    41.76 tokens per second)
llama_print_timings:        eval time =   36999.50 ms /   531 runs   (   69.68 ms per token,    14.35 tokens per second)
llama_print_timings:       total time =   37902.38 ms /   550 tokens

@Dampfinchen

For some reason, my computer really doesn't like this PR though. After text gen, the terminal doesn't accept any input anymore and I can't start browsers. I have to restart it, which takes much longer than usual. I'm using Linux Pop!_OS 22.04 LTS.

@slaren
Collaborator Author

slaren commented Mar 15, 2024

@Dampfinchen try setting the environment variable GGML_CUDA_NO_PINNED.

@Dampfinchen

@Dampfinchen try setting the environment variable GGML_CUDA_NO_PINNED.

Yep, that fixes it! Thanks!

@slaren
Collaborator Author

slaren commented Mar 15, 2024

You can also try --no-mmap; it will cause less memory to be pinned, while still maintaining the same performance.

@fgdfgfthgr-fox

Using a Radeon VII, I can confirm this does offer a major speedup on prompt processing, although it does seem to reduce the token generation speed by just a bit.

@MaggotHATE
Contributor

Tested with Vulkan, partial offload (7 layers, 7B model, Q6_K version, 478 tokens of prompt). On my low-end GPU (1060 3gb) there seems to be almost no difference:
Eval speed: 34.766766 (main) vs 34.025253 (PR)
Gen speed: 2.813251 vs 2.816204
Tokens generated: 1828 vs 1505 (just for additional context).

Looks like this PR would only help with more layers offloaded (and on better hardware) - but it works so far without problems.

@slaren
Collaborator Author

slaren commented Mar 16, 2024

Vulkan supports offloading large batches automatically, but it has its own implementation; only the CUDA backend supports the functionality added by this PR. Other backends will need to implement a (very simple) offload_op function to choose the operations that the backend wants to handle. This is the offload_op of the CUDA backend:

llama.cpp/ggml-cuda.cu

Lines 11391 to 11401 in dc93f5a

GGML_CALL static bool ggml_backend_cuda_offload_op(ggml_backend_t backend, const ggml_tensor * op) {
    const ggml_tensor * dst = op;
    const int min_batch_size = 32;
    if (dst->ne[1] > min_batch_size && dst->op != GGML_OP_GET_ROWS) {
        return true;
    }
    return false;
}

However, for this to work properly, backends need to be able to execute many graphs with little overhead, since this will result in a very large number of graph splits (hundreds, at least one per weight).

@MaggotHATE
Contributor

Ok, thanks for explaining - I saw

Currently, only CUDA and Vulkan support this.

and decided to test just in case.

@8XXD8

8XXD8 commented Mar 16, 2024

I'm not seeing any meaningful difference in prompt processing, but with -sm row I can't load a 13B Q8 model into 3x Radeon Pro VIIs; only -sm layer works. Error:

llama_model_load: error loading model: failed to allocate buffer
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/home/user/text-generation-webui/models/llama2-13b-tiefighter.Q8_0.gguf'
main: error: unable to load model

Results:

single GPU Master
llama_print_timings:        load time =    3288.75 ms
llama_print_timings:      sample time =      47.17 ms /   256 runs   (    0.18 ms per token,  5427.06 tokens per second)
llama_print_timings: prompt eval time =    1857.60 ms /   563 tokens (    3.30 ms per token,   303.08 tokens per second)
llama_print_timings:        eval time =    7460.75 ms /   255 runs   (   29.26 ms per token,    34.18 tokens per second)
llama_print_timings:       total time =    9417.34 ms /   818 tokens


single GPU PR
llama_print_timings:        load time =    4753.63 ms
llama_print_timings:      sample time =      45.71 ms /   256 runs   (    0.18 ms per token,  5600.04 tokens per second)
llama_print_timings: prompt eval time =    1863.55 ms /   563 tokens (    3.31 ms per token,   302.11 tokens per second)
llama_print_timings:        eval time =    7405.82 ms /   255 runs   (   29.04 ms per token,    34.43 tokens per second)
llama_print_timings:       total time =    9364.84 ms /   818 tokens

3X RVII Master layer split
llama_print_timings:        load time =    4752.12 ms
llama_print_timings:      sample time =      53.87 ms /   256 runs   (    0.21 ms per token,  4752.36 tokens per second)
llama_print_timings: prompt eval time =    1629.90 ms /   563 tokens (    2.90 ms per token,   345.42 tokens per second)
llama_print_timings:        eval time =    7447.81 ms /   255 runs   (   29.21 ms per token,    34.24 tokens per second)
llama_print_timings:       total time =    9193.20 ms /   818 tokens

3X RVII PR layer split
llama_print_timings:        load time =    6615.16 ms
llama_print_timings:      sample time =      59.28 ms /   256 runs   (    0.23 ms per token,  4318.20 tokens per second)
llama_print_timings: prompt eval time =    1632.83 ms /   563 tokens (    2.90 ms per token,   344.80 tokens per second)
llama_print_timings:        eval time =    7435.52 ms /   255 runs   (   29.16 ms per token,    34.29 tokens per second)
llama_print_timings:       total time =    9195.99 ms /   818 tokens

@slaren
Collaborator Author

slaren commented Mar 16, 2024

@8XXD8 this only affects prompt processing with partial offloading. Full offloading is unchanged. The issue with -sm row should be fixed now.

@Artefact2
Collaborator

I think this PR breaks imatrix when partially offloading, I am getting smaller imatrix files with lots of missing info for some tensors.

@slaren
Collaborator Author

slaren commented Mar 16, 2024

I think this PR breaks imatrix when partially offloading

Yes, it does. When the weights are copied to the GPU, the name of the tensor is different (for example, blk.0.attn_k.weight becomes CUDA0#blk.0.attn_k.weight#0), and imatrix fails to recognize them. Not sure what the best way to fix that is yet.
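
For reference, a minimal sketch of the kind of name normalization that would let imatrix recognize the weights again, assuming the affix always has the form <backend>#<original name>#<split index> as in the example above; this is not necessarily the exact fix that was later committed as "imatrix : remove sched affix from weight names":

#include <string>

// "CUDA0#blk.0.attn_k.weight#0" -> "blk.0.attn_k.weight"
static std::string strip_sched_affix(const std::string & name) {
    const size_t first = name.find('#');
    if (first == std::string::npos) {
        return name; // no affix, already a plain weight name
    }
    const size_t last = name.rfind('#');
    if (last == first) {
        return name.substr(first + 1); // only one '#': keep everything after it
    }
    return name.substr(first + 1, last - first - 1);
}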

@slaren slaren marked this pull request as ready for review March 17, 2024 13:56
@slaren
Collaborator Author

slaren commented Mar 17, 2024

I think the ggml-ci cuda-v100 runner has some issue; the logs say no CUDA-capable device is detected. It is also failing on master.

@ggerganov
Owner

I think I fixed the drivers and restarted the job. Will review the PR tomorrow

@ggerganov ggerganov added the high priority (Very important issue) label on Mar 18, 2024
@slaren slaren merged commit 2bf8d0f into master Mar 18, 2024
63 of 69 checks passed
@slaren slaren deleted the sl/sched-auto-offload branch March 18, 2024 10:03
@slaren
Collaborator Author

slaren commented Mar 18, 2024

@0cc4m it should be possible to adapt the Vulkan backend now to use this and remove ggml_vk_free_cpu_assist and the related code in ggml.c.

@JohannesGaessler
Collaborator

Are there plans to also implement pre-loading the data for the next layer while the current one is being processed? Since prompt processing is compute bound, it should theoretically be possible to achieve ~100% GPU speed even at 0 GPU layers. The tradeoff would be that VRAM usage goes up, so you would be able to offload fewer layers, which in turn makes generation slower.

@slaren
Collaborator Author

slaren commented Mar 21, 2024

We should implement that for sure. With a large enough batch size we could get close to the batch performance of full offload, which could have a significant impact. It's not an immediate priority for me right now, but I will work on this eventually if nobody does it first.
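
For context, the overlap being discussed here is essentially classic double buffering: upload the weights of layer i+1 on a copy stream while layer i runs on the compute stream. Below is a conceptual CUDA sketch with hypothetical names (not ggml code, and with error checking omitted):

#include <cuda_runtime.h>
#include <stddef.h>

// hypothetical stand-in for the real per-layer compute; assumed to enqueue work on `stream`
static void compute_layer(int il, const void * d_weights, cudaStream_t stream) {
    (void) il; (void) d_weights; (void) stream; // placeholder
}

static void process_layers(const void * const * h_weights, const size_t * sizes, int n_layers) {
    cudaStream_t compute, copy;
    cudaStreamCreate(&compute);
    cudaStreamCreate(&copy);

    size_t max_size = 0;
    for (int il = 0; il < n_layers; il++) {
        if (sizes[il] > max_size) max_size = sizes[il];
    }

    void *      d_buf[2];
    cudaEvent_t uploaded[2], consumed[2];
    for (int i = 0; i < 2; i++) {
        cudaMalloc(&d_buf[i], max_size);
        cudaEventCreate(&uploaded[i]);
        cudaEventCreate(&consumed[i]);
    }

    // prefetch layer 0
    cudaMemcpyAsync(d_buf[0], h_weights[0], sizes[0], cudaMemcpyHostToDevice, copy);
    cudaEventRecord(uploaded[0], copy);

    for (int il = 0; il < n_layers; il++) {
        const int cur = il % 2;
        const int nxt = (il + 1) % 2;
        if (il + 1 < n_layers) {
            if (il >= 1) {
                // don't overwrite the other buffer until the layer that used it has finished
                cudaStreamWaitEvent(copy, consumed[nxt], 0);
            }
            // upload the next layer's weights while the current layer computes
            cudaMemcpyAsync(d_buf[nxt], h_weights[il + 1], sizes[il + 1], cudaMemcpyHostToDevice, copy);
            cudaEventRecord(uploaded[nxt], copy);
        }
        cudaStreamWaitEvent(compute, uploaded[cur], 0); // wait only for this layer's upload
        compute_layer(il, d_buf[cur], compute);
        cudaEventRecord(consumed[cur], compute);
    }
    cudaStreamSynchronize(compute);
    // cleanup (cudaFree, cudaEventDestroy, cudaStreamDestroy) omitted for brevity
}

The tradeoff mentioned above shows up directly in this sketch: the two staging buffers must each hold a full layer, so VRAM that could otherwise hold permanently offloaded layers is spent on staging.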

@JohannesGaessler
Collaborator

Regarding my previous comment: some profiling data suggests that it won't be quite as simple:

(screenshot: profiler timeline, Screenshot_20240321_014856)

With a Ryzen 5950X, 3200 MHz dual channel RAM, and an RTX 3090 the amount of time spent on memory transfers currently seems to be significantly larger than the amount of time spent on compute. Also there are still significant gaps where the GPU is idling and the CPU seems to be doing some work.

@slaren
Collaborator Author

slaren commented Mar 21, 2024

I don't know what batch size you are using, but with a large enough batch size I can already see over 50% utilization with -ngl 0. The CPU work that you are seeing may be the perplexity tool calculating the perplexity from the logits; when testing pipeline parallelism it was easy to get this to take over 50% of the total time.

@JohannesGaessler
Collaborator

JohannesGaessler commented Mar 21, 2024

I don't know what batch size you are using, but with a large enough batch size, I can already see over 50% utilization with -ngl 0.

I was using a batch size of 512 for the perplexity binary.

The CPU work that you are seeing may be perplexity calculating the perplexity from the logits, when testing pipeline parallelism it was easy to get this to take over 50% of the total time.

No, the area that I was showing was from the middle of the calculation. Also, I am seeing the same gaps with llama-bench. Against my initial expectation I am also seeing that llama-bench pp scales with the number of threads:

model size params backend ngl threads test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1 pp 512 791.24 ± 0.53
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 2 pp 512 883.43 ± 2.58
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 3 pp 512 929.52 ± 2.14
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 4 pp 512 946.93 ± 1.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 5 pp 512 959.62 ± 1.45
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 6 pp 512 963.22 ± 0.65
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 7 pp 512 967.96 ± 0.75
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 8 pp 512 968.98 ± 1.42
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 9 pp 512 970.06 ± 1.31
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 10 pp 512 968.56 ± 1.14
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 11 pp 512 965.29 ± 0.59
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 12 pp 512 960.36 ± 0.25
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 13 pp 512 957.10 ± 1.20
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 14 pp 512 954.39 ± 0.87
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 15 pp 512 950.37 ± 0.47
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 16 pp 512 944.65 ± 0.43

gprof for ./llama-bench --model models/opt/${model_name}-${quantization}.gguf -r 100 -ngl 0 -n 0 -t 1 suggests that the culprit is some tensor duplication:

Each sample counts as 0.01 seconds.
  %   cumulative   self              self     total           
 time   seconds   seconds    calls  ms/call  ms/call  name    
 93.56     17.28    17.28     6464     2.67     2.81  ggml_compute_forward_dup_f32
  4.82     18.17     0.89 52953088     0.00     0.00  ggml_fp32_to_fp16_row
  0.81     18.32     0.15    51712     0.00     0.00  dequantize_row_q4_0
  0.43     18.40     0.08      101     0.79   182.53  llama_decode
  0.05     18.41     0.01  1127987     0.00     0.00  ggml_blck_size
  0.05     18.42     0.01   112846     0.00     0.00  ggml_new_tensor
  0.05     18.43     0.01    42622     0.00     0.00  ggml_backend_sched_get_tensor_backend
  0.05     18.44     0.01      101     0.10     0.10  ggml_gallocr_alloc_graph
  0.05     18.45     0.01       18     0.56     0.56  std::_Hashtable<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> >, std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int>, std::allocator<std::pair<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > const, int> >, std::__detail::_Select1st, std::equal_to<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::hash<std::__cxx11::basic_string<char, std::char_traits<char>, std::allocator<char> > >, std::__detail::_Mod_range_hashing, std::__detail::_Default_ranged_hash, std::__detail::_Prime_rehash_policy, std::__detail::_Hashtable_traits<true, false, true> >::_M_rehash(unsigned long, unsigned long const&)
  0.05     18.46     0.01                             ggml_cuda_mul(ggml_backend_cuda_context&, ggml_tensor const*, ggml_tensor const*, ggml_tensor*)
  0.05     18.47     0.01                             ggml_backend_cuda_set_tensor_async(ggml_backend*, ggml_tensor*, void const*, unsigned long, unsigned long)
  0.00     18.47     0.00  1690128     0.00     0.00  ggml_hash_find

The total runtime was 67.02 s so ggml_compute_forward_dup_f32 took up ~25% of the total runtime.

@slaren
Collaborator Author

slaren commented Mar 21, 2024

It's the ggml_cpy to store the new blocks in the KV cache. An observation is that it would be possible to do the conversion to F16 on the GPU, which would reduce the amount of data that needs to be copied to the CPU and reduce the overhead of the ggml_cpy. I am surprised that you get better performance with more than 1 thread; with batch size 512 the time is probably dominated by the transfer, so the overhead of launching the threads becomes less significant.

Device 0: NVIDIA GeForce RTX 3090 Ti, compute capability 8.6, VMM: yes

model size params backend ngl threads n_ubatch test t/s
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 0 1 4096 pp 4096 1694.34 ± 7.93
llama 7B Q4_0 3.56 GiB 6.74 B CUDA 99 1 4096 pp 4096 3653.03 ± 3.42
diff --git a/llama.cpp b/llama.cpp
index cd7a7b8d..bd0847bb 100644
--- a/llama.cpp
+++ b/llama.cpp
@@ -5428,6 +5428,10 @@ static void llm_build_kv_store(
     cb(v_cache_view, "v_cache_view", il);

     // important: storing RoPE-ed version of K in the KV cache!
+    k_cur = ggml_cast(ctx, k_cur, k_cache_view->type);
+    v_cur_t = ggml_cast(ctx, v_cur_t, v_cache_view->type);
+    ggml_build_forward_expand(graph, k_cur);
+    ggml_build_forward_expand(graph, v_cur_t);
     ggml_build_forward_expand(graph, ggml_cpy(ctx, k_cur,   k_cache_view));
     ggml_build_forward_expand(graph, ggml_cpy(ctx, v_cur_t, v_cache_view));
 }

@slaren slaren mentioned this pull request Mar 23, 2024
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 1, 2024
* backend : offload large batches to GPU

* fix hip

* code cleanup

* fix CUDA split buffers

* Update ggml-backend-impl.h

Co-authored-by: Johannes Gäßler <[email protected]>

* cuda : fix memset without set_device

* imatrix : remove sched affix from weight names

* sched : add a new split if the current one has too many inputs
reduce max inputs per split
more cleanup

* update backends

ggml-ci

---------

Co-authored-by: Johannes Gäßler <[email protected]>
hodlen pushed a commit to hodlen/llama.cpp that referenced this pull request Apr 3, 2024