ROCm(6.0) benchmark failed #5701
On my 7800 XT it worked only after I added these CMake options: -DLLAMA_CUDA_FORCE_DMMV=1 -DLLAMA_CUDA_FORCE_MMQ=1
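For reference, a minimal configure line with those flags. Only the two FORCE flags come from the comment above; -DLLAMA_HIPBLAS=ON is my assumption, as it was the standard ROCm switch in llama.cpp builds of that era:

```sh
# Assumed ROCm build of llama.cpp; LLAMA_HIPBLAS=ON is an assumption,
# the two FORCE flags are the ones reported to make the 7800 XT work.
cmake .. -DLLAMA_HIPBLAS=ON \
         -DLLAMA_CUDA_FORCE_DMMV=1 \
         -DLLAMA_CUDA_FORCE_MMQ=1
cmake --build . --config Release
```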
Q4_1 stopped working some time ago.
I think this comes from the fp16 cuBLAS path. For each element of the result, the benchmark computes (1.0*2.0)+(1.0*2.0)+...+(1.0*2.0), summed 11008 times. In fp16 the accumulator saturates: once it reaches 4096, adding 2 rounds back to 4096 (above 4096, fp16 can only represent multiples of 4), and 4096 is exactly the value stored in the result tensor for me on both a 5700 XT and a GTX 1050 forced down that code branch. The sum in your incorrect result is twice what I get, so I'd guess there's a second accumulator at work, maybe because your rocBLAS uses half2. Changing the cublasGemmEx call to use CUBLAS_COMPUTE_32F and making its alpha and beta arguments floats allows the benchmark to complete successfully, but then you're not gaining the desired speed boost from 16-bit compute.
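To make the saturation concrete, here is a minimal standalone sketch, assuming a compiler with the _Float16 extension (recent GCC or Clang); it is not llama.cpp code, just a demonstration of the rounding behaviour described above:

```cpp
// Summing 1.0*2.0 a total of 11008 times should give 22016, but the fp16
// accumulator stalls at 4096: above 4096 the fp16 grid spacing is 4, so
// 4096 + 2 rounds back to 4096 under round-to-nearest-even.
// Requires the _Float16 extension (recent GCC/Clang); not llama.cpp code.
#include <cstdio>

int main() {
    _Float16 acc = 0;
    for (int k = 0; k < 11008; ++k)
        acc += (_Float16)1.0f * (_Float16)2.0f;
    std::printf("fp16 accumulator: %g (exact: %d)\n", (double)acc, 11008 * 2);
    return 0;
}
```

And a hedged sketch of the suggested fix, keeping fp16 inputs and outputs while accumulating in fp32 (CUDA >= 11 cuBLAS API); the function name and transpose flags are illustrative assumptions, not llama.cpp's actual call site:

```cpp
#include <cublas_v2.h>
#include <cuda_fp16.h>

// fp16 in/out GEMM that accumulates in fp32. With CUBLAS_COMPUTE_32F,
// alpha and beta must point to host floats rather than halves.
cublasStatus_t gemm_f16_acc_f32(cublasHandle_t handle,
                                int m, int n, int k,
                                const __half *A, int lda,
                                const __half *B, int ldb,
                                __half *C, int ldc) {
    const float alpha = 1.0f, beta = 0.0f;   // floats, not halves
    return cublasGemmEx(handle, CUBLAS_OP_T, CUBLAS_OP_N, m, n, k,
                        &alpha,
                        A, CUDA_R_16F, lda,
                        B, CUDA_R_16F, ldb,
                        &beta,
                        C, CUDA_R_16F, ldc,
                        CUBLAS_COMPUTE_32F,  // was CUBLAS_COMPUTE_16F
                        CUBLAS_GEMM_DEFAULT);
}
```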
Thanks.
This issue was closed because it has been inactive for 14 days since being marked as stale. |