
ROCm(6.0) benchmark failed #5701

Closed
riverzhou opened this issue Feb 24, 2024 · 5 comments

@riverzhou

Please include information about your system, the steps to reproduce the bug, and the version of llama.cpp that you are using. If possible, please provide a minimal code example that reproduces the bug.

If the bug concerns the server, please try to reproduce it first using the server test scenario framework.

[river@drfxi bin]$ ./benchmark
main: build = 2252 (525213d2)
main: built with clang version 17.0.6 (Fedora 17.0.6-6.fc40) for x86_64-redhat-linux-gnu
Starting Test
Allocating Memory of size 800194560 bytes, 763 MB
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 ROCm devices:
  Device 0: AMD Radeon RX 7800 XT, compute capability 11.0, VMM: no
Creating new tensors

------ Test 1 - Matrix Mult via F32 code
n_threads=1
            m11: type = 0 (  f32) ne = 11008 x  4096 x     1, nb = (    4, 44032, 180355072) - Sum of tensor m11 is 45088768.00
             m2: type = 0 (  f32) ne = 11008 x   128 x     1, nb = (    4, 44032, 5636096) - Sum of tensor m2 is 2818048.00
   gf->nodes[0]: type = 0 (  f32) ne =  4096 x   128 x     1, nb = (    4, 16384, 2097152) - Sum of tensor gf->nodes[0] is 11542724608.00

------ Test 2 - Matrix Mult via q4_1 code
n_threads=1
Matrix Multiplication of (11008,4096,1) x (11008,128,1) - about  11.54 gFLOPS

Iteration;NThreads; SizeX; SizeY; SizeZ; Required_FLOPS; Elapsed_u_Seconds; gigaFLOPS
=====================================================================================
        0;       1; 11008;  4096;   128;    11542724608;              6695;   1724.08

ABORT - ERROR in Matrix Multiplication result - expected 11542724608.00, got 4294967296.00 (delta 7247757312.00 > allowed_delta 11542.72)
@yoopyman

On my 7800 XT it worked only after I added these CMake options:

-DLLAMA_CUDA_FORCE_DMMV=1 -DLLAMA_CUDA_FORCE_MMQ=1

@sorasoras

Q4_1 has been broken for some time.

@Engininja2
Contributor

Engininja2 commented Feb 24, 2024

I think this comes from fp16 with cuBLAS. The benchmark computes (1.0*2.0)+(1.0*2.0)+...+(1.0*2.0), 11008 times, for each element of the result. In fp16, 4096+2 = 4096, and that is exactly the value stored in the result tensor for me on both a 5700 XT and a GTX 1050 forced to take that code branch. The sum of your incorrect result is twice what I get, so I'd guess a second accumulator is involved, perhaps because your rocBLAS uses half2.

Changing cublasGemmEx to use CUBLAS_COMPUTE_32F and making its alpha and beta arguments floats allows the benchmark to complete successfully, but then you lose the desired speed boost from 16-bit compute.
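The fp16 accumulator saturation described above can be reproduced in isolation, e.g. with NumPy's float16. This is an illustrative sketch of the rounding behavior, not llama.cpp code:

```python
import numpy as np

# Emulate an fp16 accumulator running the benchmark's dot product:
# (1.0 * 2.0) added 11008 times. fp16 has a 10-bit mantissa, so the
# spacing between adjacent representable values at 4096 is 4, and
# 4096 + 2 rounds back down to 4096 -- the accumulator stops growing.
acc = np.float16(0.0)
for _ in range(11008):
    acc += np.float16(1.0) * np.float16(2.0)

print(acc)                                  # 4096.0, not the exact 22016.0
print(np.float16(4096) + np.float16(2))     # 4096.0
```

With two independent half2 accumulators, each one would stall at 4096, giving 8192 per element; summed over the 4096 x 128 result tensor that is 4294967296, matching the value the benchmark reported.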

@riverzhou
Author

riverzhou commented Feb 25, 2024

On my 7800 XT it worked only after I added these CMake options:

-DLLAMA_CUDA_FORCE_DMMV=1 -DLLAMA_CUDA_FORCE_MMQ=1

Thanks.
I just added -DLLAMA_CUDA_FORCE_MMQ=ON and the benchmark passes now.
Forcing MMQ is a switch that avoids using Tensor Cores.
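For reference, the full configure step might look like the following. This is a sketch: LLAMA_HIPBLAS was the ROCm backend flag in llama.cpp builds of this era, and your generator and paths may differ.

```shell
# Configure llama.cpp for ROCm with MMQ forced on, avoiding the
# rocBLAS fp16 (Tensor Core) path that produces the wrong result.
cmake .. -DLLAMA_HIPBLAS=ON -DLLAMA_CUDA_FORCE_MMQ=ON
cmake --build . --config Release
```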


This issue was closed because it has been inactive for 14 days since being marked as stale.

4 participants