
Add ReLU and SQR CUDA ops to fix Persimmon offloading #4041

Merged
merged 2 commits into ggerganov:master on Nov 13, 2023

Conversation

@KerfuffleV2 (Collaborator) commented Nov 11, 2023

See #4038: Persimmon uses ReLU and SQR, but those CUDA ops didn't exist. As a note, they already appear to be implemented in Metal.

This pull adds those ops. It still isn't enough for full offloading: you can offload at most n_layers + 1 layers. So for the 8B model with 36 layers, -ngl 37 works but -ngl 38 does not.
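For reference, these are element-wise ops, so they follow the same pattern as the existing SILU kernel in ggml-cuda.cu: a trivial per-element kernel plus a launch wrapper. A minimal sketch of what that looks like (the names and signatures here are approximations, not copied from the merged code):

```cuda
// Sketch of element-wise ReLU and SQR kernels in the ggml-cuda.cu style.
// Approximate names/signatures; the merged code may differ in details.
#include <cuda_runtime.h>

#define CUDA_RELU_BLOCK_SIZE 256
#define CUDA_SQR_BLOCK_SIZE  256

static __global__ void relu_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = fmaxf(x[i], 0.0f);
}

static __global__ void sqr_f32(const float * x, float * dst, const int k) {
    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= k) {
        return;
    }
    dst[i] = x[i] * x[i];
}

static void relu_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int num_blocks = (k + CUDA_RELU_BLOCK_SIZE - 1) / CUDA_RELU_BLOCK_SIZE;
    relu_f32<<<num_blocks, CUDA_RELU_BLOCK_SIZE, 0, stream>>>(x, dst, k);
}

static void sqr_f32_cuda(const float * x, float * dst, const int k, cudaStream_t stream) {
    const int num_blocks = (k + CUDA_SQR_BLOCK_SIZE - 1) / CUDA_SQR_BLOCK_SIZE;
    sqr_f32<<<num_blocks, CUDA_SQR_BLOCK_SIZE, 0, stream>>>(x, dst, k);
}
```

The rest of the change is wiring these into the CUDA op dispatch the same way SILU is handled.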

edit: The next op it fails on seems to be CPY. Is the solution just to add that as well? Actually, CPY already seems to exist, so the problem must be something else; maybe it's a combination of tensor types that can't be copied.

#3  0x00005555556bf394 in ggml_cuda_cpy (src0=0x7ffada8a1100, src1=0x7ffada8a1280, dst=0x0) at ggml-cuda.cu:7576
7576        GGML_ASSERT(src1->backend == GGML_BACKEND_GPU);
(gdb) p *src0
$1 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_GPU, buffer = 0x555567196b40, n_dims = 4, ne = {64, 64, 2, 3}, nb = {4, 768, 49152, 256}, op = GGML_OP_PERMUTE, op_params = {0, 3, 
    1, 2, 0 <repeats 12 times>}, is_param = false, grad = 0x0, src = {0x7ffada8a0f80, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 1, perf_cycles = 0, perf_time_us = 0, 
  view_src = 0x7ffada8a0e00, view_offs = 0, data = 0x7ffab9610160, name = "tmpqkv-0 (permuted)", '\000' <repeats 44 times>, extra = 0x55556b1e3c40, 
  padding = '\000' <repeats 11 times>}
(gdb) p *src1
$2 = {type = GGML_TYPE_F32, backend = GGML_BACKEND_CPU, buffer = 0x555567196b40, n_dims = 4, ne = {64, 64, 2, 3}, nb = {4, 256, 16384, 32768}, op = GGML_OP_CONT, op_params = {
    0 <repeats 16 times>}, is_param = false, grad = 0x0, src = {0x7ffada8a1100, 0x0, 0x0, 0x0, 0x0, 0x0}, perf_runs = 0, perf_cycles = 0, perf_time_us = 0, view_src = 0x0, 
  view_offs = 0, data = 0x7ffab9628160, name = "tmpqkv-0\000(permuted) (cont)", '\000' <repeats 37 times>, extra = 0x0, padding = '\000' <repeats 11 times>}

One of the operands is on the CPU. I don't know how to fix that, though. edit: Well, the next issue after that is that CPY only supports up to 3 dimensions, but those tensors are 4D.
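To illustrate the 3D limit: the CPY-style kernels decompose a flat element index into three coordinates and three strides per tensor, so there is simply no parameter for a fourth dimension. Below is a simplified sketch (modeled loosely on the cpy kernels in ggml-cuda.cu, not copied from them); a 4D tensor like the {64, 64, 2, 3} ones above would need an extra i03/nb03 (and i13/nb13) term in the offset math:

```cuda
// Simplified illustration of a 3D-only strided copy kernel. The flat index i
// is split into (i0, i1, i2) only, so a tensor with ne[3] > 1 cannot be
// addressed by this kernel.
static __global__ void cpy_f32_f32_3d(const char * cx, char * cdst, const int ne,
                                      const int ne00, const int ne01,
                                      const int nb00, const int nb01, const int nb02,
                                      const int ne10, const int ne11,
                                      const int nb10, const int nb11, const int nb12) {
    const int i = blockDim.x*blockIdx.x + threadIdx.x;
    if (i >= ne) {
        return;
    }

    // source offset: note there is no i03*nb03 term
    const int i02 = i / (ne00*ne01);
    const int i01 = (i - i02*ne00*ne01) / ne00;
    const int i00 = i - i02*ne00*ne01 - i01*ne00;
    const int x_offset = i00*nb00 + i01*nb01 + i02*nb02;

    // destination offset, same restriction
    const int i12 = i / (ne10*ne11);
    const int i11 = (i - i12*ne10*ne11) / ne10;
    const int i10 = i - i12*ne10*ne11 - i11*ne10;
    const int dst_offset = i10*nb10 + i11*nb11 + i12*nb12;

    *(float *)(cdst + dst_offset) = *(const float *)(cx + x_offset);
}
```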

@KerfuffleV2 added the bug (Something isn't working), model (Model specific), and Nvidia GPU (Issues specific to Nvidia GPUs) labels on Nov 11, 2023
Comment on lines +436 to +437
#define CUDA_RELU_BLOCK_SIZE 256
#define CUDA_SQR_BLOCK_SIZE 256
@KerfuffleV2 (Collaborator, Author) commented Nov 12, 2023

I'm not really sure what the optimal block sizes are. I just copied from SILU.

@SleepyYui (Contributor) left a comment

Looks good to me.
Block sizes should not matter that much; a thread on the Nvidia forums from about a decade ago suggests 128-256.

@ggerganov (Owner) left a comment

The Persimmon graph seems to be doing some overly-complicated stuff. I haven't looked deeply into the logic, but we should simplify it. If necessary, the convert script can output the data in a more convenient way, if that helps reduce the number and types of ops currently used in the attention.

@KerfuffleV2 (Collaborator, Author)

Thanks for the response.

The Persimmon graph seems to be doing some overly-complicated stuff.

Unfortunately, I don't really know anything about the model; I just downloaded it based on an issue about CUDA offloading and was able to track down the problem.

If you have the time to answer: is there any way we can limit -ngl to just the repeating layers + 1, only for Persimmon? I tried to see if there was a simple way to do that. It would be nice if people could pass -ngl 100 or whatever and it wouldn't crash.

@ggerganov (Owner)

No, it's better to let it crash. Otherwise we will forget about this problem and won't fix it.
We can print a warning that references an issue/comment about this.

@KerfuffleV2 (Collaborator, Author)

I added an #ifdef to the loader for the Persimmon case:

llm_load_tensors: using ROCm for GPU acceleration
llm_load_tensors: CUDA backend missing Persimmon CUDA ops, can offload at most 37 layers. See: https://github.com/ggerganov/llama.cpp/issues/4038
error loading model: Persimmon CUDA offload failed
llama_load_model_from_file: failed to load model
llama_init_from_gpt_params: error: failed to load model '/blah/adept-persimmon-8b-base-Q4_K_M.gguf'
main: error: unable to load model
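Roughly, the guard lives in llm_load_tensors and bails out before loading when too many layers are requested. A hypothetical sketch follows; the identifiers (LLM_ARCH_PERSIMMON, hparams.n_layer, LLAMA_LOG_ERROR) reflect the loader at the time, but the exact check in my change may differ in placement and wording:

```cpp
// Hypothetical sketch of the loader guard, as a fragment inside llm_load_tensors().
// GGML_USE_CUBLAS is also defined for the ROCm/HIP build, which is why the
// ROCm case from the log above is caught as well.
#ifdef GGML_USE_CUBLAS
    if (model.arch == LLM_ARCH_PERSIMMON && n_gpu_layers > (int) hparams.n_layer + 1) {
        LLAMA_LOG_ERROR("%s: CUDA backend missing Persimmon CUDA ops, can offload at most %u layers. "
                        "See: https://github.com/ggerganov/llama.cpp/issues/4038\n",
                        __func__, hparams.n_layer + 1);
        throw std::runtime_error("Persimmon CUDA offload failed");
    }
#endif
```

That way, -ngl values above the supported limit fail with a pointer to the issue instead of hitting the assert shown earlier.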

I also checked CLBlast; it doesn't seem to work with anything more than the repeating layers (I just get garbage output). I was going to add a message for that as well, but I don't know whether it's an error specific to my system or what the underlying cause is.

@KerfuffleV2 merged commit bb50a79 into ggerganov:master on Nov 13, 2023
32 checks passed
olexiyb pushed a commit to Sanctum-AI/llama.cpp that referenced this pull request on Nov 23, 2023 (ggerganov#4041):

* Add ReLU and SQR CUDA ops to fix Persimmon offloading

* Persimmon loader: More helpful error on CUDA/ROCM when offloading too many layers