CUDA cannot generate images #95

Closed · Tracked by #122
wailovet opened this issue Nov 29, 2023 · 29 comments · Fixed by ggerganov/ggml#669

@wailovet commented Nov 29, 2023

I've run into a strange problem: when running with CUDA, I get a pure green picture, but the same setup works fine on another computer.

sd_cuda.exe  -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
    n_threads:       6
    mode:            txt2img
    model_path:      meinamix_meinaV11-f16.gguf
    output_path:     output.png
    init_img:
    prompt:          1girl
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   euler_a
    schedule:        default
    sample_steps:    20
    strength:        0.75
    rng:             cuda
    seed:            42
    batch_count:     1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO]  stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   0:                              sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   1:                             sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   2:                           sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   3:                            sd.vocab.tokens arr
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887  - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 0.92s
[INFO]  stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO]  stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 455 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 415 ms
[INFO]  stable-diffusion.cpp:4681 - get_learned_condition completed, taking 876 ms
[INFO]  stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
  |==================================================| 20/20 - 7.42s/it
[INFO]  stable-diffusion.cpp:4706 - sampling completed, taking 157.10s
[INFO]  stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 157.12s
[INFO]  stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.65s
[INFO]  stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.66s
[INFO]  stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.66s
[INFO]  stable-diffusion.cpp:4735 - txt2img completed in 164.66s
save result image to 'output.png'


[attached image: output.png — the all-green result]

@Green-Sky (Contributor)

Make sure everything is the same; you can check the hashes of the model, for example. You can also try different parameters to help corner the bug.

@FSSRepo (Contributor) commented Nov 29, 2023

I think it could be an issue related to the Tensor Cores.

Try uncommenting line 99 of ggml-cuda.cu:

#define GGML_CUDA_MAX_NODES 8192

// define this if you want to always fallback to MMQ kernels and not use cuBLAS for matrix multiplication
// on modern hardware, using cuBLAS is recommended as it utilizes F16 tensor cores which are very performant
// for large computational tasks. the drawback is that this requires some extra amount of VRAM:
// -  7B quantum model: +100-200 MB
// - 13B quantum model: +200-400 MB
//

#define GGML_CUDA_FORCE_MMQ  // uncomment this line and try again

// TODO: improve this to be correct for more hardware
//       for example, currently fails for GeForce GTX 1660 which is TURING arch (> VOLTA) but does not have tensor cores
//       probably other such cases, and not sure what happens on AMD hardware
#if !defined(GGML_CUDA_FORCE_MMQ)
#define CUDA_USE_TENSOR_CORES
#endif
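
You can confirm the define actually took effect from the CUDA init banner, which should flip from the values in your log above to:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no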

@wailovet (Author)

I think it could be an issue related to the Tensor Cores. Try uncommenting line 99 of ggml-cuda.cu: […]

The running speed is much faster, but the generated image is still pure green.

sd-cuda-mmq.exe  -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
    n_threads:       6
    mode:            txt2img
    model_path:      meinamix_meinaV11-f16.gguf
    output_path:     output.png
    init_img:
    prompt:          1girl
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   euler_a
    schedule:        default
    sample_steps:    20
    strength:        0.75
    rng:             cuda
    seed:            42
    batch_count:     1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO]  stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   0:                              sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   1:                             sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   2:                           sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   3:                            sd.vocab.tokens arr
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887  - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 7.03s
[INFO]  stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO]  stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 18 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 16 ms
[INFO]  stable-diffusion.cpp:4681 - get_learned_condition completed, taking 40 ms
[INFO]  stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
  |==================================================| 20/20 - 1.83s/it
[INFO]  stable-diffusion.cpp:4706 - sampling completed, taking 37.21s
[INFO]  stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 37.23s
[INFO]  stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.58s
[INFO]  stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.58s
[INFO]  stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.58s
[INFO]  stable-diffusion.cpp:4735 - txt2img completed in 43.85s
save result image to 'output.png'

@FSSRepo (Contributor) commented Nov 30, 2023

Try another model and a different prompt, and try generating on the CPU to see whether it produces a coherent image. I can't think of any ideas with the limited information I have.

@wailovet (Author) commented Nov 30, 2023

Make sure everything is the same; you can check the hashes of the model, for example. You can also try different parameters to help corner the bug.

The program and model are stored on my portable hard drive. I tried f32 and f16, and also changed sample_method to LCM, and got the same output image.

@wailovet (Author)

Try another model and a different prompt, and try generating on the CPU to see whether it produces a coherent image. I can't think of any ideas with the limited information I have.

[screenshot: list of the model files tried]
Here are all the models I've tried; everything works fine on the CPU.
Can you tell me where to add logs in the code to provide more information to help locate this problem?

@wailovet (Author) commented Nov 30, 2023

[screenshot: hook location in stable-diffusion.cpp]
I tried hooking here, replacing ggml_conv_2d with ggml_conv_2d_test.
[screenshot: the test hook]
Then I compared the first few values of the output of the CPU version and the CUDA version.

Differences in "result"

CPU result
-0.304439 0.422194 0.0867985 -0.19692 0.223478 0.438775 -0.0987804 0.0194783 -0.650625 0.692133 -0.734613 -0.017556 1.1144 0.0192951 -0.619648 -0.0158069 -0.333611 0.840091 -1.09174 0.428399 0.341398 0.275071 -0.269062 -0.170968 -0.28541 -0.251124 0.208278 -0.29216 0.314511 -0.10386 0.0744066 0.141419

CUDA result
7.46726e-13 -2.17465e-09 8.88322e-10 -2.58623e-09 -4.62539e-10 4.44896e-09 8.31899e-10 -1.79477e-10 2.92877e-10 -1.62645e-09 1.65822e-09 -3.76766e-09 2.47073e-09 2.52588e-09 -2.88589e-10 2.86937e-10 4.25023e-11 2.44531e-10 3.76198e-10 -1.71427e-09 -1.19091e-09 2.9318e-09 -1.80849e-09 4.23559e-10 -1.52254e-09 3.87822e-09 3.07924e-10 -7.92377e-10 -1.29449e-09 -7.56863e-10 1.57558e-11 -1.30719e-09 
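
For reference, a probe like this can be written once against the ggml backend API and reused for any tensor — a minimal sketch, assuming an F32 tensor that has already been computed (exact API names vary across ggml revisions):

#include <cstdio>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// print the first n values of an F32 tensor, whichever backend owns its data
static void dump_tensor_head(const struct ggml_tensor * t, int n) {
    std::vector<float> host(n);
    // for GPU backends this performs the device-to-host copy
    ggml_backend_tensor_get(t, host.data(), 0, n * sizeof(float));
    for (int i = 0; i < n; i++) {
        printf("%g ", host[i]);
    }
    printf("\n");
}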

@FSSRepo (Contributor) commented Nov 30, 2023

Differences in "result" […]

Could you compare the clip outputs (hidden state) of get_learned_condition() to confirm that only the im2col kernel could be causing issues?

@FSSRepo (Contributor) commented Nov 30, 2023

You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.

@wailovet (Author) commented Dec 1, 2023

Could you compare the clip outputs (hidden state) of get_learned_condition() to confirm that only the im2col kernel could be causing issues?

ggml_tensor* positive = sd->get_learned_condition(work_ctx, prompt);
CPU output: -0.387249 0.0171568 -0.054192 -0.183599 -0.0261911 -0.338466 -0.0235674 -0.187387 0.186605 -0.0903851
CUDA output: -0.387245 0.0171541 -0.0541848 -0.18359 -0.026197 -0.338474 -0.0235705 -0.187385 0.186602 -0.0903773
There may be some subtle differences, but I think the impact should be minimal.
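
To put a number on "subtle", the divergence can be quantified rather than eyeballed — a small helper sketch (buffer names are illustrative):

#include <cmath>

// largest element-wise difference between two host buffers of length n
static float max_abs_diff(const float * cpu_out, const float * cuda_out, int n) {
    float m = 0.0f;
    for (int i = 0; i < n; i++) {
        const float d = std::fabs(cpu_out[i] - cuda_out[i]);
        if (d > m) m = d;
    }
    return m;
}

For the ten values above this comes out on the order of 1e-5, i.e. ordinary FP16/FP32 rounding noise, which supports the conclusion that the clip stage is healthy.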

Here is my check of the output

positive: ✔️ [screenshot of values]
negative: ✔️ [screenshot of values]

struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true);
im2col :✔️
struct ggml_tensor * mma = ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]);
mma :✔️
struct ggml_tensor * mmb = ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]);
mmb :✔️
struct ggml_tensor * result = ggml_mul_mat(ctx, mma, mmb);
result:❌
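
So everything up to the reshapes checks out, and only the ggml_mul_mat result is wrong. For what it's worth, that op can be isolated outside stable-diffusion.cpp entirely — a minimal repro sketch against the ggml backend API (illustrative only: API details shift between ggml revisions, and this is not the exact harness used in this thread):

#include <cstdio>
#include <vector>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

// multiply two small F16 matrices on the given backend and print the head of the result
static void run_f16_mul_mat(ggml_backend_t backend, const char * label) {
    struct ggml_init_params ip = { 16 * 1024 * 1024, NULL, true }; // no_alloc: data lives in the backend buffer
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 64, 32);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 64, 32);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // F16 x F16 -> F32: the case that fails above

    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // fill both operands with a fixed, repeatable pattern
    std::vector<ggml_fp16_t> data(64 * 32);
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = ggml_fp32_to_fp16(0.01f * (float)(i % 97) - 0.5f);
    }
    ggml_backend_tensor_set(a, data.data(), 0, data.size() * sizeof(ggml_fp16_t));
    ggml_backend_tensor_set(b, data.data(), 0, data.size() * sizeof(ggml_fp16_t));

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_backend_graph_compute(backend, gf);

    std::vector<float> out(8);
    ggml_backend_tensor_get(c, out.data(), 0, out.size() * sizeof(float));
    printf("%s:", label);
    for (float v : out) printf(" %g", v);
    printf("\n");

    ggml_backend_buffer_free(buf);
    ggml_free(ctx);
}

int main() {
    run_f16_mul_mat(ggml_backend_cpu_init(), "cpu");
    run_f16_mul_mat(ggml_backend_cuda_init(0), "cuda"); // values around 1e-9 here reproduce the bug
    return 0;
}

On an affected GPU the cuda line should come back as near-zero garbage while the cpu line shows sensible values.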

@wailovet (Author) commented Dec 1, 2023

You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.

I tried enabling taesd and got this result:
[attached image: output]

Running it on another laptop of mine generates normal images, and the efficiency is significantly improved.

@FSSRepo (Contributor) commented Dec 1, 2023

That seems quite challenging to debug, as it's the matrix multiplication kernel, and I can't think of a solution since I wasn't the one who wrote it.

@Cyberhan123 (Contributor)

This is usually caused by insufficient GPU memory.

@FSSRepo (Contributor) commented Dec 6, 2023

This is usually caused by insufficient GPU memory.

The user has a GTX 1070, which has 8 GB of VRAM, and I can run without issues on a laptop RTX 3050 with 4 GB of VRAM.

@Cyberhan123 (Contributor)

The user has a GTX 1070, which has 8 GB of VRAM, and I can run without issues on a laptop RTX 3050 with 4 GB of VRAM.

Can you tell us your CUDA version?

@wailovet (Author) commented Dec 6, 2023

Can you tell us your CUDA version?

v11.8

@Cyberhan123 (Contributor)

It doesn't look wrong. How could this happen?

@wailovet (Author) commented Dec 6, 2023

It doesn't look wrong. How could this happen?

I'm confused too. I tried llama.cpp and it worked fine as well. Maybe I should buy a new GPU.

@olumideolufy29

I had the same issue as @wailovet. I ran the inference on an NVIDIA MX150 GPU, CUDA v11.7.

Could there be some compatibility issue with Pascal GPUs?

@FSSRepo (Contributor) commented Dec 11, 2023

I'm not very experienced in CUDA; in fact, I'm struggling to add some features that could significantly accelerate image generation speed in CUDA. However, I'm facing many issues due to my lack of understanding in GPU engineering, so I can't shed light on the matter. I'm sorry that it's not working for some people. If I had equivalent hardware for testing, perhaps I could be of assistance.

@bssrdf (Contributor) commented Dec 28, 2023

Just to provide another data point and a potential fix.

I have a GTX 1070 and also got images with all green pixels. The CUDA version is 12.1.

As @wailovet showed above, the problem seems to come from the CUDA version of mul_mat. One observation: if you run ggml's test-conv2d case, it will most likely fail if your GPU has compute capability <= 7.5.

I suspect the culprit is in
https://github.com/FSSRepo/ggml/blob/70474c6890c015b53dc10a2300ae35246cc73589/src/ggml-cuda.cu#L6953-L6979
Here src0 is converted to FP32 if it is not already, but src1 is never checked or converted. If you add a similar section of code to convert src1 to FP32, test-conv2d passes. My first attempt at a fix crashed sd even though it made test-conv2d pass; I lack the skill to make a bullet-proof fix and will leave that to someone who can do it robustly.
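
A hedged reading of why this produces tiny numbers rather than a crash: in this branch cublasSgemm assumes both inputs are F32 buffers, so when src1 is still F16 its packed half values get reinterpreted as single floats:

// simplified from the call site referenced above, before the fix:
cublasSgemm(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N,
        row_diff, src1_ncols, ne10,
        &alpha, src0_ddf_i, ne00,
                src1_ddf_i, ne10,   // if src1 is F16, these bits get decoded as F32
        &beta,  dst_dd_i,   ldc);

Reinterpreted F16 bit patterns usually decode to wildly wrong, mostly tiny F32 magnitudes — consistent with the 1e-9-scale values dumped earlier in this thread. Presumably newer GPUs are unaffected because they take the F16 tensor-core path instead of this F32 fallback.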

Update: I have got a fix that works. Here is the patch.

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index 0d8b8d1..13e443f 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -6952,7 +6952,9 @@ inline void ggml_cuda_op_mul_mat_cublas(
     }
     else {
         float * src0_ddq_as_f32 = nullptr;
+        float * src1_ddq_as_f32 = nullptr;
         size_t src0_as = 0;
+        size_t src1_as = 0;

         if (src0->type != GGML_TYPE_F32) {
             const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src0->type);
@@ -6960,7 +6962,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             src0_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(row_diff*ne00 * sizeof(float), &src0_as); // NOLINT
             to_fp32_cuda(src0_dd_i, src0_ddq_as_f32, row_diff*ne00, stream);
         }
+        if (src1->type != GGML_TYPE_F32) {
+            // printf(" src1 is not FP32 \n");
+            const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src1->type);
+            GGML_ASSERT(to_fp32_cuda != nullptr);
+            src1_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(src1_ncols*ne10 * sizeof(float), &src1_as); // NOLINT
+            to_fp32_cuda(src1_ddf_i, src1_ddq_as_f32, src1_ncols*ne10, stream);
+        }
         const float * src0_ddf_i = src0->type == GGML_TYPE_F32 ? (const float *) src0_dd_i : src0_ddq_as_f32;
+        const float * src1_ddf1_i = src1->type == GGML_TYPE_F32 ? (const float *) src1_ddf_i : src1_ddq_as_f32;

         const float alpha = 1.0f;
         const float beta = 0.0f;
@@ -6970,12 +6980,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             cublasSgemm(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N,
                     row_diff, src1_ncols, ne10,
                     &alpha, src0_ddf_i, ne00,
-                            src1_ddf_i, ne10,
+                            src1_ddf1_i, ne10,
                     &beta,  dst_dd_i,   ldc));

         if (src0_as != 0) {
             ggml_cuda_pool_free(src0_ddq_as_f32, src0_as);
         }
+        if (src1_as != 0) {
+            ggml_cuda_pool_free(src1_ddq_as_f32, src1_as);
+        }
     }

Anyone with an older NVIDIA GPU can give it a try. It also fixes two failing test cases: test-conv1d and test-conv2d.

@leejet (Owner) commented Dec 28, 2023

Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.

@bssrdf (Contributor) commented Dec 28, 2023

Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.

That'll be great! Glad I can finally try SD on my generations-old 1070. Still, it is much faster than the CPU 😄

@SmallAndSoft

Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.

It has been merged now, and it fixed whisper.cpp for older GPUs. Time to fix this issue too?
Thanks!

@leejet (Owner) commented Jan 3, 2024

I've attempted to update branch #134 to the latest ggml, but I encountered some issues when generating images larger than 512x512. I haven't had time to pinpoint the exact cause yet.

@leejet (Owner) commented Jan 5, 2024

@wailovet @bssrdf @SmallAndSoft I've updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

@SmallAndSoft

@leejet That fixed the issue for my GTX 1060.
Thank you very much!

@bssrdf (Contributor) commented Jan 5, 2024

Thank you, @leejet, for bringing in this update. For some reason, SD runs much faster on the CUDA backend with it, especially the latent decoding step.

@wailovet (Author) commented Jan 8, 2024

@wailovet @bssrdf @SmallAndSoft I've updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

I tried running with CUDA and everything is fine now.
Thank you very much!

wailovet closed this as completed Jan 8, 2024