Performance investigation using AMD BLIS instead of OpenBLAS on 16 core AMD Zen1 #637
Comments
Just checked OpenBLAS. Same behaviour.
Is that the right file name? Probably the real issue here is that when -f is used with a non-existent file it doesn't show any error.
On a side note, keep in mind that using BLAS to evaluate the perplexity may give misleading values, since BLAS appears to do matrix multiplication with higher precision, but it is not used when generating, only for the prompt.
Good catch. Running now. TVM.
I only installed BLIS and did the same as you did, but system_info from main does not show BLAS = 1, even though there was a multi-thread speed boost:

```
make LLAMA_OPENBLAS=1
g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/main/main.cpp ggml.o llama.o common.o -o main -lblis

====  Run ./main -h for help.  ====

g++ -I. -I./examples -O3 -DNDEBUG -std=c++11 -fPIC -Wall -Wextra -Wpedantic -Wcast-qual -Wno-unused-function -pthread examples/quantize/quantize.cpp ggml.o llama.o -o quantize -lblis

main: warning: model does not support context sizes greater than 2048 tokens (5377 specified); expect poor results
system_info: n_threads = 16 / 16 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 0 | SSE3 = 1 | VSX = 0 |
```
I think you need to increase the batch size (the `-b` option) to cause it to use BLAS. Note that you also have to use the following configure option when building BLIS to enable CBLAS support in BLIS:
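(The snippet referred to here didn't survive extraction; a minimal sketch, assuming it matched the `--enable-cblas` switch quoted later in this thread:)

```sh
# Assumed reconstruction: enable the CBLAS compatibility layer, which is
# off by default in BLIS. "auto" auto-detects the target sub-configuration.
./configure --enable-cblas auto
```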
If you get it to work, keep an eye on your total CPU%.
I checked my BLIS config.mk and it shows MK_ENABLE_BLAS = yes, but the CBLAS one you mentioned is no. Do you think I need to change that value from no to yes?
One major inconsistency is that the text generated is of different lengths with the different BLAS libs, so the total time for BLIS was less simply because it generated less text. "When it is too good to be true, it is probably not true!" I'll see if I can get an apples-to-apples perplexity run working.
For a BLIS clean build:
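(The original build steps were lost in extraction; a minimal sketch of a from-scratch rebuild, assuming the stock BLIS build system:)

```sh
# Assumed reconstruction of a clean BLIS rebuild and reinstall.
make distclean                   # wipe any previous configuration
./configure --enable-cblas auto  # re-detect the host sub-configuration
make -j16
sudo make install                # installs to /usr/local by default
```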
I rebuilt BLIS with ./configure --enable-cblas zen3, then rebuilt llama.cpp, and nothing changed... Besides, I changed -b to 256; still BLAS = 0. Seems I need to install OpenBLAS? 😅😂
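(The llama.cpp rebuild command in the comment above was dropped in extraction; presumably, matching the earlier comment, something like:)

```sh
# Assumed, not verbatim: full rebuild of llama.cpp with the BLAS path enabled.
make clean
make LLAMA_OPENBLAS=1
```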
Did you change the Makefile to link against BLIS instead of OpenBLAS? I wouldn't worry about it. There's clearly some weird threading interaction between BLIS and OpenBLAS.
Did you compile BLIS with multithreading enabled? It defaults to off. I haven't tested to see if that's the threading interaction yet, though.
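(None of this is quoted from the comment above; a sketch of what enabling BLIS multithreading entails, assuming the stock configure options and the BLIS-specific environment variable:)

```sh
# Threading is disabled by default; pick an implementation at configure time.
./configure --enable-cblas --enable-threading=openmp auto
# Then request a thread count at run time.
export BLIS_NUM_THREADS=16
```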
I think I did post it: -DGGML_USE_OPENBLAS -I/usr/local/include/blis. So BLIS should work. It still somehow doesn't make sense.
Good idea. Tried it, but it didn't seem to change anything.
With BLIS (even though BLAS = 0 is shown):

```
llama_print_timings: load time = 4804.64 ms

real    1m23.066s
```

Without BLIS:

```
llama_print_timings: load time = 4730.88 ms

real    1m23.950s
```

Conclusion: I did more tests at -b 128, but the LLAMA_OPENBLAS=1 builds still perform slower... even though I thought the speed had increased. Maybe the problem is my system structure, since I use apx to manage my system apps.
@FNsi I just bypassed the whole LLAMA_OPENBLAS flag by forcing the flags into the defaults in the Makefile. Mine looks like
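(The snippet didn't survive extraction; a plausible reconstruction, assuming it combined the `-DGGML_USE_OPENBLAS -I/usr/local/include/blis` and `-lblis` flags quoted elsewhere in this thread:)

```makefile
# Assumed reconstruction, not the verbatim edit: force the BLIS flags into
# the default build so the LLAMA_OPENBLAS conditional no longer matters.
CFLAGS  += -DGGML_USE_OPENBLAS -I/usr/local/include/blis
LDFLAGS += -lblis
```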
around line 35 or so. BLAS = 1 is shown when I run inference.
I'll re-open it if people are interested in playing around with BLIS. Similar to OpenBLAS, …
I think I realized the problem I made. I just figured out that with ABRoot, I need to change -DGGML_USE_OPENBLAS -I/usr/local/include/blis to -DGGML_USE_OPENBLAS -I/.system/usr/local/include/blis. So the llama.cpp I built just bypassed BLIS, even with LLAMA_OPENBLAS=1. Thank you guys 😂
For what it's worth, there seem to be two BLIS repos: the AMD-maintained fork at https://github.com/amd/blis, and the original at https://github.com/flame/blis, which is updated far more frequently. I'm not sure if the original repo maintainers are incorporating AMD's changes, but it might be worth comparing the two if someone's doing performance testing anyway.
🤷‍♂️
@gjmulder Same threading issues too?
@omarkazmi it is nearly twice as fast when doing perplexity! Woohoo! Before it was sitting at 100% CPU, now 187% 🥳 🥳 🥳
EDIT: That was sarcasm. 2200+% CPU with OpenBLAS.
Funny things happened again:

```
blis or blas (blis)

real    9m10.882s
```

```
blis or blas (blys), USA pronunciation adj., adv.

real    5m34.481s
```
@FNsi it is only six tokens. The difference in performance is likely due to the shortness of the sample.
Note that longer runs seem to take progressively longer for each additional token generated, so some of the 25% gain might be due to the fact that the BLAS run generated 81 fewer tokens.
I agree. And I saw your comment about a 2000% increase? How did you make that happen? I also tried building BLIS with multithreading, but nothing seems different.
I have 16 AMD cores (i.e. 32 hypercores). With BLAS …
I assume the first 'with' you said is without? That's a huge improvement!
BLAS seems to want to multithread independently of what I set.
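(The comment doesn't say which knob was set; for reference, a sketch of the usual thread-count controls, all of them assumptions about this particular setup:)

```sh
# llama.cpp's own compute threads are set with -t:
./main -t 16 -m ./models/7B/ggml-model-q4_0.bin -p "hello"   # hypothetical paths
# BLAS libraries pick their thread count from their own variables instead:
export OPENBLAS_NUM_THREADS=16   # OpenBLAS
export OMP_NUM_THREADS=16        # any OpenMP-backed BLAS
```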
And if BLAS could be made to run on 16 threads......
Prerequisites
Please answer the following questions for yourself before submitting an issue.
Expected Behavior
Compiling against the AMD-optimized BLIS implementation of BLAS allows me to run perplexity tests.
Current Behavior
Compiling against the AMD-optimized BLIS implementation of BLAS causes the perplexity command to process 0 chunks.
Steps to Reproduce
1. 174 second run just calling `./main` linked against OpenBLAS.
2. 47 second run calling `./main` linked against the AMD BLIS BLAS libs.
3. Perplexity run with BLIS doesn't process any chunks (a representative invocation is sketched after this list).
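(The logs themselves were lost in extraction; for context, a typical perplexity invocation of that era, with hypothetical model and dataset paths:)

```sh
# Hypothetical paths; note that if the -f file doesn't exist, perplexity
# reports no error and (per the discussion above) processes 0 chunks.
./perplexity -m ./models/7B/ggml-model-q4_0.bin -f ./wikitext-2-raw/wiki.test.raw -t 16
```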