test-backend-ops : add performance eval mode + improve CUDA repeat and binary broadcast ops performance #636
Conversation
The performance that I see when broadcasting with add/mul seems pretty good, so I am not sure what needs to be optimized. @FSSRepo can you give me some test cases for sd? I.e. the types and dimensions of the parameters.
At the moment, I'm not at home. Could you wait for a few hours?
Whenever you can - there is no rush at all.
Some dimensions used by stable diffusion:

```
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 21 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 34 us
add: A[1280, 16, 16, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 33 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 17 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 73 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 72 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 21 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 33 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 41 us
add: A[16, 16, 2560, 1] B[1, 1, 2560, 1] - 42 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 46 us
add: A[1280, 1, 1, 1] B[1280, 1, 1, 1] - 14 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 30 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 24 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 32 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 20 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 33 us
add: A[1280, 16, 16, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 16 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 73 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 85 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 19 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 44 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 75 us
add: A[16, 16, 1920, 1] B[1, 1, 1920, 1] - 37 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 24 us
add: A[1280, 1, 1, 1] B[1280, 1, 1, 1] - 9 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 24 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 20 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 23 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 20 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 21 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 17 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 32 us
add: A[1280, 16, 16, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 40 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 19 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 88 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 80 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 26 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 34 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 21 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 32 us
add: A[32, 32, 1280, 1] B[1, 1, 1280, 1] - 73 us
add: A[32, 32, 1920, 1] B[1, 1, 1920, 1] - 417 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 42 us
add: A[640, 1, 1, 1] B[640, 1, 1, 1] - 9 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 55 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 59 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 353 us
```
@slaren I'm not sure what `nr` means here - the dimensions I provided are for the tensors themselves.
It's the number of repetitions, so the dimensions of the tensor `a` are `ne` multiplied element-wise by `nr`:

```cpp
ggml_tensor * a   = ggml_new_tensor_4d(ctx, type, ne[0]*nr[0], ne[1]*nr[1], ne[2]*nr[2], ne[3]*nr[3]);
ggml_tensor * b   = ggml_new_tensor(ctx, type, 4, ne.data());
ggml_tensor * out = op(ctx, a, b);
```

where `op` is the op being tested.
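To make the mapping concrete, here is a minimal sketch (assuming the ggml C API; the memory size and use of `ggml_add` are arbitrary choices for illustration) of how one of the stable diffusion cases above would be expressed as `ne`/`nr`:

```cpp
#include "ggml.h"
#include <array>
#include <cstdint>

int main() {
    struct ggml_init_params params = {
        /*.mem_size   =*/ 64 * 1024 * 1024,
        /*.mem_buffer =*/ NULL,
        /*.no_alloc   =*/ false,
    };
    struct ggml_context * ctx = ggml_init(params);

    // B[1, 1, 1280, 1] broadcast into A[16, 16, 1280, 1]:
    std::array<int64_t, 4> ne = {1, 1, 1280, 1};  // dimensions of b
    std::array<int64_t, 4> nr = {16, 16, 1, 1};   // repetitions per dimension

    ggml_tensor * a   = ggml_new_tensor_4d(ctx, GGML_TYPE_F32,
                            ne[0]*nr[0], ne[1]*nr[1], ne[2]*nr[2], ne[3]*nr[3]);
    ggml_tensor * b   = ggml_new_tensor(ctx, GGML_TYPE_F32, 4, ne.data());
    ggml_tensor * out = ggml_add(ctx, a, b);  // op = ggml_add in this case

    (void) out;
    ggml_free(ctx);
    return 0;
}
```

So `b` supplies the base shape `ne`, and the broadcast target `a` is `ne*nr`.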
I get this with these test cases:
Seems good
Adjusting the block dims depending on the dimensions of the tensors improves some results:
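The adjusted configuration itself isn't shown in this transcript; a hedged sketch of the general idea (the function name and the 256-thread budget are assumptions, not the PR's code):

```cpp
#include <algorithm>
#include <cstdint>
#include <cuda_runtime.h>

// Spend threads on dim 0 first, then give any remaining budget to
// dims 1 and 2, so small trailing dims don't waste block capacity.
static dim3 pick_block_dims(int64_t ne0, int64_t ne1, int64_t ne2) {
    const unsigned bx = (unsigned) std::min<int64_t>(ne0, 256);
    const unsigned by = (unsigned) std::min<int64_t>(ne1, 256 / bx);
    const unsigned bz = (unsigned) std::min<int64_t>(ne2, 256 / (bx * by));
    return dim3(bx, by, bz); // total threads per block stays <= 256
}
```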
Processing two elements per thread:
This should already be good enough for these ops; they are usually a very small fraction of the overall time anyway.
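For reference, the two-elements-per-thread idea looks roughly like this (a sketch with assumed names, not the PR's kernel):

```cuda
// Each thread computes two consecutive output elements, halving the
// number of threads needed for a flat element-wise add.
__global__ void add_f32_x2(const float * x, const float * y, float * dst, const int k) {
    const int i = 2 * (blockIdx.x * blockDim.x + threadIdx.x);
    if (i + 1 < k) {
        dst[i]     = x[i]     + y[i];
        dst[i + 1] = x[i + 1] + y[i + 1];
    } else if (i < k) {
        dst[i] = x[i] + y[i]; // odd tail element
    }
}

// launch with half as many threads as elements, rounded up:
// const int num_blocks = ((k + 1) / 2 + 255) / 256;
// add_f32_x2<<<num_blocks, 256>>>(x, y, dst, k);
```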
@slaren I will test your performance improvement in stable-diffusion.
Hm, whisper seems broken with CUDA atm?

Edit: whisper tests that are currently failing:

```diff
diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 8540ebd..abd4de6 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -1152,6 +1152,10 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
     add_test_bin_bcast(GGML_TYPE_F32, {5120, 1, 1, 1}, {1, 256, 1, 1});
     add_test_bin_bcast(GGML_TYPE_F32, {640, 1, 1, 1}, {1, 1, 1, 1});
+    // whisper
+    add_test_bin_bcast(GGML_TYPE_F32, {1500, 512, 1, 1}, {1, 512, 1, 1});
+    add_test_bin_bcast(GGML_TYPE_F32, {3000, 512, 1, 1}, {1, 512, 1, 1});
+
     test_cases.emplace_back(new test_scale());
     for (float eps : {1e-6f, 1e-5f, 1e-3f, 1e-1f}) {
```

Edit 2: whisper is now fixed, though these tests run OOM - not sure if expected.
@slaren With these dimensions in CUDA, it crashes (using stable diffusion, LoRA computation): `CUDA OP: ADD A[1, 1, 320, 320] B[1, 1, 320, 320]`
Both issues should be fixed now.
I don't think so; the tests shouldn't be so big as to cause OOM issues. When does that happen?
```
CUDA OP: ADD [3, 3, 2560, 1280] [3, 3, 2560, 1280]
blocks num(1, 1, 78020)
block dim(1, 3, 42)
CUDA error 9 at C:\proyectos\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:7238: invalid configuration argument
current device: 0
```
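For context: CUDA caps `gridDim.y` and `gridDim.z` at 65535 (only `gridDim.x` may be larger), so the 78020 blocks requested in the z dimension above make the launch fail with `cudaErrorInvalidConfiguration` (error 9). A sketch of the kind of check that motivates a fallback path for large tensors (the function name is an assumption):

```cpp
#include <cuda_runtime.h>

// Returns true if the launch configuration is within CUDA's limits;
// if not, the caller should fall back to a flattened 1D launch.
static bool grid_fits(const dim3 & grid, const dim3 & block) {
    return grid.y <= 65535 && grid.z <= 65535 &&
           block.z <= 64 && block.x * block.y * block.z <= 1024;
}
```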
With that patch adding the whisper tests, it fails like this:

```
make -j && ./bin/test-backend-ops -b CUDA0 -o ADD
```
Is that
In this case
No, it's not. I got confused about the meaning of the test arguments - ignore these tests. (Presumably `{1500, 512, 1, 1}` was meant as the final shape, not as `ne` to be multiplied by `nr`; with `nr[1] = 512` the tensor ends up 512x larger than intended, hence the OOM.) Not sure about @FSSRepo's case.
@slaren I think it could check whether the shapes are the same, and swap the larger dimensions to the front and the smaller ones to the back.
I tried a few different solutions with a single kernel, but they resulted in decreased performance, so instead I added a fallback kernel for large tensors. I also added dimension collapsing, so that multi-dimensional tensors are processed as 1D tensors when possible; that may improve performance slightly in some cases.
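The collapsing mentioned above could look roughly like this (a hedged sketch assuming contiguous tensors; the name and exact conditions are illustrative, not the PR's code):

```cpp
#include <cstdint>

// Fuse adjacent dims of dst into one whenever src1 either fully matches
// or is fully broadcast across both, so the kernel sees fewer dims
// (ideally just one). ne0 = dims of dst/src0, ne1 = dims of src1.
static void collapse_dims(int64_t ne0[4], int64_t ne1[4]) {
    for (int i = 3; i > 0; --i) {
        const bool match = ne1[i] == ne0[i] && ne1[i - 1] == ne0[i - 1];
        const bool bcast = ne1[i] == 1      && ne1[i - 1] == 1;
        if (match || bcast) {
            ne0[i - 1] *= ne0[i]; // fold the outer dim into the inner one
            ne0[i] = 1;
            if (match) {
                ne1[i - 1] *= ne1[i];
                ne1[i] = 1;
            }
        }
    }
}
```

A real implementation would also compact the remaining non-unit dims together; the condition is the interesting part: a pair of dims can only be fused when the broadcast pattern is uniform across both.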
This implementation:

```
[       ADD] -  22.503000 ms - 419 - 0.053706 ms
[       MUL] -   8.619000 ms - 125 - 0.068952 ms
[    CONCAT] -   2.235000 ms -  12 - 0.186250 ms
[      NORM] -   2.643000 ms -  48 - 0.055062 ms
[GROUP_NORM] -   5.309000 ms -  61 - 0.087033 ms
[   MUL_MAT] - 120.028000 ms - 361 - 0.332488 ms
[     SCALE] -   2.323000 ms -  32 - 0.072594 ms
[      CONT] -  11.394000 ms - 160 - 0.071213 ms
[  SOFT_MAX] - 101.511002 ms -  32 - 3.172219 ms
[    IM2COL] -  62.881001 ms -  97 - 0.648258 ms
[   UPSCALE] -   0.566000 ms -   3 - 0.188667 ms
[     UNARY] -   4.700000 ms -  84 - 0.055952 ms
Total Time: 344.712036 ms
```

My implementation:
It seems that the performance is almost identical to my implementation, and it doesn't pose any issues. Everything is working well. Good job!
@slaren Planning to make a sync with llama.cpp/whisper.cpp after we merge this PR and ggerganov/llama.cpp#4309. Any concerns?
No, I don't expect any significant issues.
Use with `test-backend-ops perf [-o op] [-b backend]`. Repeats the ops a number of times and calculates the memory throughput. The size of an op can be customized by overriding `op_size` in the op test class; the binary broadcast tests use `ggml_nbytes(dst) * 3` to account for broadcasting.
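Pieced together from the description above, the throughput calculation would look roughly like this (a sketch; function and variable names are assumptions, not the test harness's actual code):

```cpp
#include "ggml.h"
#include <cstdio>

// Bytes moved per run are approximated as ggml_nbytes(dst) * 3
// (src0 + broadcast-expanded src1 + dst); throughput is bytes over time.
static void report_throughput(const ggml_tensor * dst, int n_runs, double total_sec) {
    const double bytes = (double) ggml_nbytes(dst) * 3 * n_runs;
    printf("%.2f GB/s\n", bytes / total_sec / 1e9);
}
```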