
ROCm(6.0) benchmark failed #5701

Closed
riverzhou opened this issue Feb 24, 2024 · 5 comments

@riverzhou

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

If the bug concerns the server, please try to reproduce it first using the server test scenario framework.

[river@drfxi bin]$ ./benchmark
main: build = 2252 (525213d2)
main: built with clang version 17.0.6 (Fedora 17.0.6-6.fc40) for x86_64-redhat-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7800 XT, compute capability 11.0, VMM: no
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=1
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;              6695;   1724.08

ABORT - ERROR in Matrix Multiplication result - expected 11542724608.00, got 4294967296.00 (delta 7247757312.00 > allowed_delta 11542.72)
@yoopyman

On my 7800 XT it worked only after I added these CMake options:

-DLLAMA_CUDA_FORCE_DMMV=1 -DLLAMA_CUDA_FORCE_MMQ=1

@sorasoras

Q4_1 has been broken for some time.

@Engininja2
Contributor

Engininja2 commented Feb 24, 2024

I think this comes from fp16 with cuBLAS. The benchmark computes (1.0*2.0)+(1.0*2.0)+...+(1.0*2.0), 11008 times, for each element of the result. In fp16, 4096+2 = 4096, and that is exactly the value stored in the result tensor for me on both a 5700 XT and a GTX 1050 forced to take that code branch. The sum of your incorrect result is twice what I get, so I'd guess a second accumulator is involved, perhaps because your rocBLAS uses half2.

Changing cublasGemmEx to use CUBLAS_COMPUTE_32F and making its alpha and beta arguments floats allows the benchmark to complete successfully, but then you lose the desired speed boost from 16-bit compute.
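The fp16 accumulator saturation described above can be reproduced in isolation, e.g. with NumPy's float16. This is an illustrative sketch of the rounding behavior, not llama.cpp code:

```python
import numpy as np

# Emulate an fp16 accumulator running the benchmark's dot product:
# (1.0 * 2.0) added 11008 times. fp16 has a 10-bit mantissa, so the
# spacing between adjacent representable values at 4096 is 4, and
# 4096 + 2 rounds back down to 4096 -- the accumulator stops growing.
acc = np.float16(0.0)
for _ in range(11008):
    acc += np.float16(1.0) * np.float16(2.0)

print(acc)                                  # 4096.0, not the exact 22016.0
print(np.float16(4096) + np.float16(2))     # 4096.0
```

With two independent half2 accumulators, each one would stall at 4096, giving 8192 per element; summed over the 4096 x 128 result tensor that is 4294967296, matching the value the benchmark reported.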

@riverzhou
Author

riverzhou commented Feb 25, 2024

On my 7800 XT it worked only after I added these CMake options:

-DLLAMA_CUDA_FORCE_DMMV=1 -DLLAMA_CUDA_FORCE_MMQ=1

Thanks.
I just added -DLLAMA_CUDA_FORCE_MMQ=ON and the benchmark passes now.
Forcing MMQ is a switch that avoids using Tensor Cores.
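For reference, the full configure step might look like the following. This is a sketch: LLAMA_HIPBLAS was the ROCm backend flag in llama.cpp builds of this era, and your generator and paths may differ.

```shell
# Configure llama.cpp for ROCm with MMQ forced on, avoiding the
# rocBLAS fp16 (Tensor Core) path that produces the wrong result.
cmake .. -DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build . --config Release
```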


This issue was closed because it has been inactive for 14 days since being marked as stale.

4 participants