CUDA cannot generate images #95

Closed · Tracked by #122
wailovet opened this issue Nov 29, 2023 · 29 comments · Fixed by ggerganov/ggml#669

@wailovet commented Nov 29, 2023

I've run into a strange problem: when running with CUDA, I get a pure green picture, but the same setup works fine on another computer.

sd_cuda.exe  -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
    n_threads:       6
    mode:            txt2img
    model_path:      meinamix_meinaV11-f16.gguf
    output_path:     output.png
    init_img:
    prompt:          1girl
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   euler_a
    schedule:        default
    sample_steps:    20
    strength:        0.75
    rng:             cuda
    seed:            42
    batch_count:     1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO]  stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   0:                              sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   1:                             sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   2:                           sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   3:                            sd.vocab.tokens arr
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887  - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 0.92s
[INFO]  stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO]  stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 455 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 415 ms
[INFO]  stable-diffusion.cpp:4681 - get_learned_condition completed, taking 876 ms
[INFO]  stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
  |==================================================| 20/20 - 7.42s/it
[INFO]  stable-diffusion.cpp:4706 - sampling completed, taking 157.10s
[INFO]  stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 157.12s
[INFO]  stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.65s
[INFO]  stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.66s
[INFO]  stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.66s
[INFO]  stable-diffusion.cpp:4735 - txt2img completed in 164.66s
save result image to 'output.png'


[attached image: output.png — the all-green result]

@Green-Sky (Contributor)

Make sure everything is the same; you can check the hashes of the model, for example. You can also try different parameters to help corner the bug.

@FSSRepo (Contributor) commented Nov 29, 2023

I think it could be an issue related to the Tensor Cores.

Try uncommenting line 99 of ggml-cuda.cu:

#define GGML_CUDA_MAX_NODES 8192

// define this if you want to always fallback to MMQ kernels and not use cuBLAS for matrix multiplication
// on modern hardware, using cuBLAS is recommended as it utilizes F16 tensor cores which are very performant
// for large computational tasks. the drawback is that this requires some extra amount of VRAM:
// -  7B quantum model: +100-200 MB
// - 13B quantum model: +200-400 MB
//

#define GGML_CUDA_FORCE_MMQ  // uncomment this line and try again

// TODO: improve this to be correct for more hardware
//       for example, currently fails for GeForce GTX 1660 which is TURING arch (> VOLTA) but does not have tensor cores
//       probably other such cases, and not sure what happens on AMD hardware
#if !defined(GGML_CUDA_FORCE_MMQ)
#define CUDA_USE_TENSOR_CORES
#endif
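
You can confirm the define actually took effect from the CUDA init banner, which should flip from the values in your log above to:

ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no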

@wailovet (Author)

I think it could be an issue related to the Tensor Cores. Try uncommenting line 99 of ggml-cuda.cu: […]

The running speed is much faster, but the generated image is still pure green.

sd-cuda-mmq.exe  -m meinamix_meinaV11-f16.gguf -p "1girl" -v
Option:
    n_threads:       6
    mode:            txt2img
    model_path:      meinamix_meinaV11-f16.gguf
    output_path:     output.png
    init_img:
    prompt:          1girl
    negative_prompt:
    cfg_scale:       7.00
    width:           512
    height:          512
    sample_method:   euler_a
    schedule:        default
    sample_steps:    20
    strength:        0.75
    rng:             cuda
    seed:            42
    batch_count:     1
System Info:
    BLAS = 1
    SSE3 = 1
    AVX = 1
    AVX2 = 1
    AVX512 = 0
    AVX512_VBMI = 0
    AVX512_VNNI = 0
    FMA = 1
    NEON = 0
    ARM_FMA = 0
    F16C = 1
    FP16_VA = 0
    WASM_SIMD = 0
    VSX = 0
[DEBUG] stable-diffusion.cpp:3701 - Using CUDA backend
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   yes
ggml_init_cublas: CUDA_USE_TENSOR_CORES: no
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1070, compute capability 6.1
[INFO]  stable-diffusion.cpp:3715 - loading model from 'meinamix_meinaV11-f16.gguf'
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   0:                              sd.model.name str
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   1:                             sd.model.dtype i32
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   2:                           sd.model.version i8
[DEBUG] stable-diffusion.cpp:3733 - load_from_file: - kv   3:                            sd.vocab.tokens arr
[INFO]  stable-diffusion.cpp:3743 - Stable Diffusion 1.x | meinamix_meinaV11.safetensors
[INFO]  stable-diffusion.cpp:3751 - model data type: f16
[DEBUG] stable-diffusion.cpp:3755 - loading vocab
[DEBUG] stable-diffusion.cpp:3771 - ggml tensor size = 416 bytes
[DEBUG] stable-diffusion.cpp:887  - clip params backend buffer size =  236.18 MB (449 tensors)
[DEBUG] stable-diffusion.cpp:2028 - unet params backend buffer size =  1641.16 MB (706 tensors)
[DEBUG] stable-diffusion.cpp:3118 - vae params backend buffer size =  95.47 MB (164 tensors)
[DEBUG] stable-diffusion.cpp:3780 - preparing memory for the weights
[DEBUG] stable-diffusion.cpp:3798 - loading weights
[DEBUG] stable-diffusion.cpp:3903 - model size = 1969.67MB
[INFO]  stable-diffusion.cpp:3913 - total memory buffer size = 1972.80MB (clip 236.18MB, unet 1641.16MB, vae 95.47MB)
[INFO]  stable-diffusion.cpp:3915 - loading model from 'meinamix_meinaV11-f16.gguf' completed, taking 7.03s
[INFO]  stable-diffusion.cpp:3939 - running in eps-prediction mode
[DEBUG] stable-diffusion.cpp:3966 - finished loaded file
[DEBUG] stable-diffusion.cpp:4647 - prompt after extract and remove lora: "1girl"
[INFO]  stable-diffusion.cpp:4652 - apply_loras completed, taking 0.00s
[DEBUG] stable-diffusion.cpp:1118 - parse '1girl' to [['1girl', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "1girl" to tokens ["1</w>", "girl</w>", ]
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 18 ms
[DEBUG] stable-diffusion.cpp:1118 - parse '' to [['', 1], ]
[DEBUG] stable-diffusion.cpp:521  - split prompt "" to tokens []
[DEBUG] stable-diffusion.cpp:1051 - learned condition compute buffer size: 1.58 MB
[DEBUG] stable-diffusion.cpp:4061 - computing condition graph completed, taking 16 ms
[INFO]  stable-diffusion.cpp:4681 - get_learned_condition completed, taking 40 ms
[INFO]  stable-diffusion.cpp:4691 - sampling using Euler A method
[INFO]  stable-diffusion.cpp:4694 - generating image: 1/1
[DEBUG] stable-diffusion.cpp:2384 - diffusion compute buffer size: 552.57 MB
  |==================================================| 20/20 - 1.83s/it
[INFO]  stable-diffusion.cpp:4706 - sampling completed, taking 37.21s
[INFO]  stable-diffusion.cpp:4714 - generating 1 latent images completed, taking 37.23s
[INFO]  stable-diffusion.cpp:4716 - decoding 1 latents
[DEBUG] stable-diffusion.cpp:3252 - vae compute buffer size: 1664.00 MB
[DEBUG] stable-diffusion.cpp:4605 - computing vae [mode: DECODE] graph completed, taking 6.58s
[INFO]  stable-diffusion.cpp:4724 - latent 1 decoded, taking 6.58s
[INFO]  stable-diffusion.cpp:4728 - decode_first_stage completed, taking 6.58s
[INFO]  stable-diffusion.cpp:4735 - txt2img completed in 43.85s
save result image to 'output.png'

@FSSRepo (Contributor) commented Nov 30, 2023

Try another model and a different prompt, and try generating on the CPU to see whether it produces a coherent image. I can't think of any ideas with the limited information I have.

@wailovet (Author) commented Nov 30, 2023

Make sure everything is the same; you can check the hashes of the model, for example. You can also try different parameters to help corner the bug.

The program and model are stored on my portable hard drive. I tried f32 and f16, and also changed sample_method to LCM, and got the same output image.

@wailovet (Author)

Try another model and a different prompt, and try generating on the CPU to see whether it produces a coherent image. I can't think of any ideas with the limited information I have.

[screenshot: list of the model files tried]
Here are all the models I've tried; everything works fine on the CPU.
Can you tell me where to add logs in the code to provide more information to help locate this problem?

@wailovet (Author) commented Nov 30, 2023

[screenshot: hook location in stable-diffusion.cpp]
I tried hooking here, replacing ggml_conv_2d with ggml_conv_2d_test.
[screenshot: the test hook]
Then I compared the first few values of the output of the CPU version and the CUDA version.

Differences in "result"

CPU result
-0.304439 0.422194 0.0867985 -0.19692 0.223478 0.438775 -0.0987804 0.0194783 -0.650625 0.692133 -0.734613 -0.017556 1.1144 0.0192951 -0.619648 -0.0158069 -0.333611 0.840091 -1.09174 0.428399 0.341398 0.275071 -0.269062 -0.170968 -0.28541 -0.251124 0.208278 -0.29216 0.314511 -0.10386 0.0744066 0.141419

CUDA result
7.46726e-13 -2.17465e-09 8.88322e-10 -2.58623e-09 -4.62539e-10 4.44896e-09 8.31899e-10 -1.79477e-10 2.92877e-10 -1.62645e-09 1.65822e-09 -3.76766e-09 2.47073e-09 2.52588e-09 -2.88589e-10 2.86937e-10 4.25023e-11 2.44531e-10 3.76198e-10 -1.71427e-09 -1.19091e-09 2.9318e-09 -1.80849e-09 4.23559e-10 -1.52254e-09 3.87822e-09 3.07924e-10 -7.92377e-10 -1.29449e-09 -7.56863e-10 1.57558e-11 -1.30719e-09 
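
For reference, a probe like this can be written once against the ggml backend API and reused for any tensor — a minimal sketch, assuming an F32 tensor that has already been computed (exact API names vary across ggml revisions):

#include <cstdio>
#include <vector>
#include "ggml.h"
#include "ggml-backend.h"

// print the first n values of an F32 tensor, whichever backend owns its data
static void dump_tensor_head(const struct ggml_tensor * t, int n) {
    std::vector<float> host(n);
    // for GPU backends this performs the device-to-host copy
    ggml_backend_tensor_get(t, host.data(), 0, n * sizeof(float));
    for (int i = 0; i < n; i++) {
        printf("%g ", host[i]);
    }
    printf("\n");
}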

@FSSRepo (Contributor) commented Nov 30, 2023

Differences in "result" […]

Could you compare the clip outputs (hidden state) of get_learned_condition() to confirm that only the im2col kernel could be causing issues?

@FSSRepo (Contributor) commented Nov 30, 2023

You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.

@wailovet (Author) commented Dec 1, 2023

Could you compare the clip outputs (hidden state) of get_learned_condition() to confirm that only the im2col kernel could be causing issues?

ggml_tensor* positive = sd->get_learned_condition(work_ctx, prompt);
CPU output: -0.387249 0.0171568 -0.054192 -0.183599 -0.0261911 -0.338466 -0.0235674 -0.187387 0.186605 -0.0903851
CUDA output: -0.387245 0.0171541 -0.0541848 -0.18359 -0.026197 -0.338474 -0.0235705 -0.187385 0.186602 -0.0903773
There may be some subtle differences, but I think the impact should be minimal.
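
To put a number on "subtle", the divergence can be quantified rather than eyeballed — a small helper sketch (buffer names are illustrative):

#include <cmath>

// largest element-wise difference between two host buffers of length n
static float max_abs_diff(const float * cpu_out, const float * cuda_out, int n) {
    float m = 0.0f;
    for (int i = 0; i < n; i++) {
        const float d = std::fabs(cpu_out[i] - cuda_out[i]);
        if (d > m) m = d;
    }
    return m;
}

For the ten values above this comes out on the order of 1e-5, i.e. ordinary FP16/FP32 rounding noise, which supports the conclusion that the clip stage is healthy.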

Here is my check of the output

positive: ✔️ [screenshot of values]
negative: ✔️ [screenshot of values]

struct ggml_tensor * im2col = ggml_im2col(ctx, a, b, s0, s1, p0, p1, d0, d1, true);
im2col :✔️
struct ggml_tensor * mma = ggml_reshape_2d(ctx, im2col, im2col->ne[0], im2col->ne[3] * im2col->ne[2] * im2col->ne[1]);
mma :✔️
struct ggml_tensor * mmb = ggml_reshape_2d(ctx, a, (a->ne[0] * a->ne[1] * a->ne[2]), a->ne[3]);
mmb :✔️
struct ggml_tensor * result = ggml_mul_mat(ctx, mma, mmb);
result:❌
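
So everything up to the reshapes checks out, and only the ggml_mul_mat result is wrong. For what it's worth, that op can be isolated outside stable-diffusion.cpp entirely — a minimal repro sketch against the ggml backend API (illustrative only: API details shift between ggml revisions, and this is not the exact harness used in this thread):

#include <cstdio>
#include <vector>
#include "ggml.h"
#include "ggml-alloc.h"
#include "ggml-backend.h"
#include "ggml-cuda.h"

// multiply two small F16 matrices on the given backend and print the head of the result
static void run_f16_mul_mat(ggml_backend_t backend, const char * label) {
    struct ggml_init_params ip = { 16 * 1024 * 1024, NULL, true }; // no_alloc: data lives in the backend buffer
    struct ggml_context * ctx = ggml_init(ip);

    struct ggml_tensor * a = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 64, 32);
    struct ggml_tensor * b = ggml_new_tensor_2d(ctx, GGML_TYPE_F16, 64, 32);
    struct ggml_tensor * c = ggml_mul_mat(ctx, a, b); // F16 x F16 -> F32: the case that fails above

    ggml_backend_buffer_t buf = ggml_backend_alloc_ctx_tensors(ctx, backend);

    // fill both operands with a fixed, repeatable pattern
    std::vector<ggml_fp16_t> data(64 * 32);
    for (size_t i = 0; i < data.size(); ++i) {
        data[i] = ggml_fp32_to_fp16(0.01f * (float)(i % 97) - 0.5f);
    }
    ggml_backend_tensor_set(a, data.data(), 0, data.size() * sizeof(ggml_fp16_t));
    ggml_backend_tensor_set(b, data.data(), 0, data.size() * sizeof(ggml_fp16_t));

    struct ggml_cgraph * gf = ggml_new_graph(ctx);
    ggml_build_forward_expand(gf, c);
    ggml_backend_graph_compute(backend, gf);

    std::vector<float> out(8);
    ggml_backend_tensor_get(c, out.data(), 0, out.size() * sizeof(float));
    printf("%s:", label);
    for (float v : out) printf(" %g", v);
    printf("\n");

    ggml_backend_buffer_free(buf);
    ggml_free(ctx);
}

int main() {
    run_f16_mul_mat(ggml_backend_cpu_init(), "cpu");
    run_f16_mul_mat(ggml_backend_cuda_init(0), "cuda"); // values around 1e-9 here reproduce the bug
    return 0;
}

On an affected GPU the cuda line should come back as near-zero garbage while the cpu line shows sensible values.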

@wailovet (Author) commented Dec 1, 2023

You could also try my pull request #88; I optimized the im2col kernel to make more efficient use of GPU resources.

I tried enabling taesd and got this result:
[attached image: output]

Running it on another laptop of mine generates normal images, and the efficiency is significantly improved.

@FSSRepo (Contributor) commented Dec 1, 2023

That seems quite challenging to debug, as it's the matrix multiplication kernel, and I can't think of a solution since I wasn't the one who wrote it.

@Cyberhan123 (Contributor)

This is usually caused by insufficient GPU memory.

@FSSRepo (Contributor) commented Dec 6, 2023

This is usually caused by insufficient GPU memory.

The user has a GTX 1070, which has 8 GB of VRAM, and I can run without issues on a laptop RTX 3050 with 4 GB of VRAM.

@Cyberhan123 (Contributor)

The user has a GTX 1070, which has 8 GB of VRAM, and I can run without issues on a laptop RTX 3050 with 4 GB of VRAM.

Can you tell us your CUDA version?

@wailovet (Author) commented Dec 6, 2023

Can you tell us your CUDA version?

v11.8

@Cyberhan123 (Contributor)

It doesn't look wrong. How could this happen?

@wailovet (Author) commented Dec 6, 2023

It doesn't look wrong. How could this happen?

I'm confused too. I tried llama.cpp and it worked fine as well. Maybe I should buy a new GPU.

@olumideolufy29

I had the same issue as @wailovet. I ran the inference on an NVIDIA MX150 GPU, CUDA v11.7.

Could there be some compatibility issue with Pascal GPUs?

@FSSRepo (Contributor) commented Dec 11, 2023

I'm not very experienced in CUDA; in fact, I'm struggling to add some features that could significantly accelerate image generation speed in CUDA. However, I'm facing many issues due to my lack of understanding in GPU engineering, so I can't shed light on the matter. I'm sorry that it's not working for some people. If I had equivalent hardware for testing, perhaps I could be of assistance.

@bssrdf (Contributor) commented Dec 28, 2023

Just to provide another data point and a potential fix.

I have a GTX 1070 and also got images with all green pixels. The CUDA version is 12.1.

As @wailovet showed above, the problem seems to come from the CUDA version of mul_mat. One observation: if you run ggml's test-conv2d case, it will most likely fail if your GPU has compute capability <= 7.5.

I suspect the culprit is in
https://github.com/FSSRepo/ggml/blob/70474c6890c015b53dc10a2300ae35246cc73589/src/ggml-cuda.cu#L6953-L6979
Here src0 is converted to FP32 if it is not already, but src1 is never checked or converted. If you add a similar section of code to convert src1 to FP32, test-conv2d passes. My first attempt at a fix crashed sd even though it made test-conv2d pass; I lack the skill to make a bullet-proof fix and will leave that to someone who can do it robustly.
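
A hedged reading of why this produces tiny numbers rather than a crash: in this branch cublasSgemm assumes both inputs are F32 buffers, so when src1 is still F16 its packed half values get reinterpreted as single floats:

// simplified from the call site referenced above, before the fix:
cublasSgemm(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N,
        row_diff, src1_ncols, ne10,
        &alpha, src0_ddf_i, ne00,
                src1_ddf_i, ne10,   // if src1 is F16, these bits get decoded as F32
        &beta,  dst_dd_i,   ldc);

Reinterpreted F16 bit patterns usually decode to wildly wrong, mostly tiny F32 magnitudes — consistent with the 1e-9-scale values dumped earlier in this thread. Presumably newer GPUs are unaffected because they take the F16 tensor-core path instead of this F32 fallback.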

Update: I have got a fix that works. Here is the patch.

diff --git a/src/ggml-cuda.cu b/src/ggml-cuda.cu
index 0d8b8d1..13e443f 100644
--- a/src/ggml-cuda.cu
+++ b/src/ggml-cuda.cu
@@ -6952,7 +6952,9 @@ inline void ggml_cuda_op_mul_mat_cublas(
     }
     else {
         float * src0_ddq_as_f32 = nullptr;
+        float * src1_ddq_as_f32 = nullptr;
         size_t src0_as = 0;
+        size_t src1_as = 0;

         if (src0->type != GGML_TYPE_F32) {
             const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src0->type);
@@ -6960,7 +6962,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             src0_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(row_diff*ne00 * sizeof(float), &src0_as); // NOLINT
             to_fp32_cuda(src0_dd_i, src0_ddq_as_f32, row_diff*ne00, stream);
         }
+        if (src1->type != GGML_TYPE_F32) {
+            // printf(" src1 is not FP32 \n");
+            const to_fp32_cuda_t to_fp32_cuda = ggml_get_to_fp32_cuda(src1->type);
+            GGML_ASSERT(to_fp32_cuda != nullptr);
+            src1_ddq_as_f32 = (float *) ggml_cuda_pool_malloc(src1_ncols*ne10 * sizeof(float), &src1_as); // NOLINT
+            to_fp32_cuda(src1_ddf_i, src1_ddq_as_f32, src1_ncols*ne10, stream);
+        }
         const float * src0_ddf_i = src0->type == GGML_TYPE_F32 ? (const float *) src0_dd_i : src0_ddq_as_f32;
+        const float * src1_ddf1_i = src1->type == GGML_TYPE_F32 ? (const float *) src1_ddf_i : src1_ddq_as_f32;

         const float alpha = 1.0f;
         const float beta = 0.0f;
@@ -6970,12 +6980,15 @@ inline void ggml_cuda_op_mul_mat_cublas(
             cublasSgemm(g_cublas_handles[id], CUBLAS_OP_T, CUBLAS_OP_N,
                     row_diff, src1_ncols, ne10,
                     &alpha, src0_ddf_i, ne00,
-                            src1_ddf_i, ne10,
+                            src1_ddf1_i, ne10,
                     &beta,  dst_dd_i,   ldc));

         if (src0_as != 0) {
             ggml_cuda_pool_free(src0_ddq_as_f32, src0_as);
         }
+        if (src1_as != 0) {
+            ggml_cuda_pool_free(src1_ddq_as_f32, src1_as);
+        }
     }

Anyone with an older NVIDIA GPU can give it a try. It also fixes two failing test cases: test-conv1d and test-conv2d.

@leejet (Owner) commented Dec 28, 2023

Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.

@bssrdf (Contributor) commented Dec 28, 2023

Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.

That'll be great! Glad I can finally try SD on my generations-old 1070. Still, it is much faster than the CPU 😄

@SmallAndSoft

Once the upstream ggml merges your PR, I'll update ggml to the corresponding commit to fix this issue.

It has been merged now, and it fixed whisper.cpp for older GPUs. Time to fix this issue too?
Thanks!

@leejet (Owner) commented Jan 3, 2024

I've attempted to update branch #134 to the latest ggml, but I encountered some issues when generating images larger than 512x512. I haven't had time to pinpoint the exact cause yet.

@leejet (Owner) commented Jan 5, 2024

@wailovet @bssrdf @SmallAndSoft I've updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

@SmallAndSoft

@leejet That fixed the issue for my GTX 1060.
Thank you very much!

@bssrdf (Contributor) commented Jan 5, 2024

Thank you, @leejet, for bringing in this update. For some reason, SD runs much faster on the CUDA backend with it, especially the latent decoding step.

@wailovet (Author) commented Jan 8, 2024

@wailovet @bssrdf @SmallAndSoft I've updated ggml to the latest code. You can try using the latest master branch to see if the issue still persists.

I tried running with CUDA and everything is fine now.
Thank you very much!

wailovet closed this as completed Jan 8, 2024