test-backend-ops : add performance eval mode + improve CUDA repeat and binary broadcast ops performance #636

Merged: 18 commits merged from test-backend-perf into master on Dec 7, 2023

Conversation

slaren (Collaborator) commented Dec 5, 2023

Use with test-backend-ops perf [-o op] [-b backend].

Repeats the ops a number of times and calculates the memory throughput.

  • The memory transfer size of the op is given by the sum of the sizes of the destination tensor and all the sources, but it can be overridden by implementing op_size in the op test class
  • Binary ops with broadcasting calculate the size as ggml_nbytes(dst) * 3 to account for broadcasting
  • Matrix multiplication tries to calculate the size by considering all the memory accesses required for a standard $O(N^3)$ matrix multiplication
  • None of this takes into account cache effects, so it is possible to get a throughput higher than the system memory bandwidth
  • The number of repetitions (runs) per op depends on the memory size of the op. It tries to repeat the op enough times to get a total memory transfer of at least 8 GB for CPU, or 32 GB for GPU backends, with a maximum of 8192 repetitions.
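
A minimal sketch of the run-count heuristic described in the last point, assuming the op's memory size in bytes is already known (illustrative names, not necessarily the exact code in test-backend-ops.cpp):

    #include <cstddef>

    // Illustrative sketch only: derive the number of repetitions from the op's
    // memory size so that roughly 8 GiB (CPU) or 32 GiB (GPU) is transferred
    // in total, capped at 8192 runs.
    static size_t calc_runs(size_t op_size_bytes, bool is_cpu) {
        const size_t target = (is_cpu ? 8ull : 32ull) * 1024ull * 1024ull * 1024ull;
        size_t runs = (target + op_size_bytes - 1) / op_size_bytes; // round up
        return runs > 8192 ? 8192 : runs;
    }

For example, a 3145728 kB (3 GiB) ADD on a GPU backend gives ceil(32 GiB / 3 GiB) = 11 runs, matching the run counts in the results below.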

slaren (Collaborator, Author) commented Dec 5, 2023

The performance I see when broadcasting with add/mul seems pretty good, so I am not sure what needs to be optimized. @FSSRepo, can you give me some test cases for sd? I.e., the types and dimensions of the parameters.

  ADD(type=f32,ne=[16,10,1,1],nr=[1,1,1,1]):                               8192 runs -    10.28 us/run -        1 kB/run -    0.17 GB/s
  ADD(type=f32,ne=[16,10,10,1],nr=[1,1,1,1]):                              8192 runs -     3.28 us/run -       18 kB/run -    5.45 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,1]):                             8192 runs -     4.05 us/run -      187 kB/run -   44.20 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[2,1,1,1]):                             8192 runs -     4.04 us/run -      375 kB/run -   88.56 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,2,1,1]):                             8192 runs -     4.90 us/run -      375 kB/run -   72.95 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,1]):                             8192 runs -     4.77 us/run -      375 kB/run -   74.90 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,2]):                             8192 runs -     4.52 us/run -      375 kB/run -   79.11 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,2]):                             8192 runs -     5.84 us/run -      750 kB/run -  122.41 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[1,2,2,2]):                             8192 runs -     8.34 us/run -     1500 kB/run -  171.61 GB/s
  ADD(type=f32,ne=[16,10,10,10],nr=[2,2,2,2]):                             8192 runs -     8.31 us/run -     3000 kB/run -  344.47 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[1,1,1,1]):                            171 runs -   223.71 us/run -   196608 kB/run -  838.15 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[2,1,1,1]):                             86 runs -   371.72 us/run -   393216 kB/run - 1008.82 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[2,2,1,1]):                             43 runs -   738.05 us/run -   786432 kB/run - 1016.20 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[2,2,2,1]):                             22 runs -  1472.18 us/run -  1572864 kB/run - 1018.90 GB/s
  ADD(type=f32,ne=[4096,4096,1,1],nr=[2,2,2,2]):                             11 runs -  2938.55 us/run -  3145728 kB/run - 1020.91 GB/s

FSSRepo (Collaborator) commented Dec 5, 2023

At the moment, I'm not at home. Could you wait for a few hours?

slaren (Collaborator, Author) commented Dec 5, 2023

Whenever you can, there is no rush at all.

FSSRepo (Collaborator) commented Dec 6, 2023

Some dimensions from stable diffusion:

add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 21 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 34 us
add: A[1280, 16, 16, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 33 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 17 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 73 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 72 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 21 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 33 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 41 us
add: A[16, 16, 2560, 1] B[1, 1, 2560, 1] - 42 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 46 us
add: A[1280, 1, 1, 1] B[1280, 1, 1, 1] - 14 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 30 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 24 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 32 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 20 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 33 us
add: A[1280, 16, 16, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 16 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 73 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 85 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 19 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 44 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 75 us
add: A[16, 16, 1920, 1] B[1, 1, 1920, 1] - 37 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 24 us
add: A[1280, 1, 1, 1] B[1280, 1, 1, 1] - 9 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 24 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 20 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 23 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 22 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 20 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 21 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 17 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 32 us
add: A[1280, 16, 16, 1] B[1280, 1, 1, 1] - 16 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 20 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 40 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 19 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 88 us
add: A[5120, 256, 1, 1] B[5120, 1, 1, 1] - 80 us
add: A[1280, 256, 1, 1] B[1280, 1, 1, 1] - 26 us
add: A[1280, 16, 16, 1] B[1280, 16, 16, 1] - 34 us
add: A[16, 16, 1280, 1] B[1, 1, 1280, 1] - 21 us
add: A[16, 16, 1280, 1] B[16, 16, 1280, 1] - 32 us
add: A[32, 32, 1280, 1] B[1, 1, 1280, 1] - 73 us
add: A[32, 32, 1920, 1] B[1, 1, 1920, 1] - 417 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 42 us
add: A[640, 1, 1, 1] B[640, 1, 1, 1] - 9 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 55 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 59 us
add: A[32, 32, 640, 1] B[1, 1, 640, 1] - 353 us

FSSRepo (Collaborator) commented Dec 6, 2023

@slaren I understand that ne= is the number of elements, but what is nr=?

The dimensions I provided are for the tensors a and b in the ggml_add function, along with their time in microseconds, using CUDA.

slaren (Collaborator, Author) commented Dec 6, 2023

It's the number of repetitions in each dimension, so the dimensions of the tensor a are ne * nr and the dimensions of b are ne.

    ggml_tensor * a = ggml_new_tensor_4d(ctx, type, ne[0]*nr[0], ne[1]*nr[1], ne[2]*nr[2], ne[3]*nr[3]); // larger tensor: ne scaled by nr in every dim
    ggml_tensor * b = ggml_new_tensor(ctx, type, 4, ne.data());                                          // smaller tensor that gets broadcast
    ggml_tensor * out = op(ctx, a, b);

Where op is any binary op that supports broadcasting, like ggml_add or ggml_mul.
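
For reference, FSSRepo's A[16, 16, 1280, 1] + B[1, 1, 1280, 1] case corresponds to ne=[1,1,1280,1], nr=[16,16,1,1] in this notation; built with the snippet above it would look roughly like:

    // Illustrative instantiation for ne=[1,1,1280,1], nr=[16,16,1,1]:
    ggml_tensor * a   = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 1*16, 1*16, 1280*1, 1*1); // A[16,16,1280,1]
    ggml_tensor * b   = ggml_new_tensor_4d(ctx, GGML_TYPE_F32, 1, 1, 1280, 1);           // B[1,1,1280,1]
    ggml_tensor * out = ggml_add(ctx, a, b); // b is broadcast across the first two dims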

slaren (Collaborator, Author) commented Dec 6, 2023

I get this with these test cases:

  ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]):                    8192 runs -    10.27 us/run -       15 kB/run -    1.39 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]):                  8192 runs -     5.95 us/run -     3840 kB/run -  615.30 GB/s
  ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]):                  8192 runs -     5.97 us/run -     3840 kB/run -  613.33 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]):                  8192 runs -     5.93 us/run -     3840 kB/run -  617.64 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]):                  8192 runs -    16.64 us/run -     3840 kB/run -  220.05 GB/s
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]):                  8192 runs -    16.67 us/run -     3840 kB/run -  219.72 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]):                  5826 runs -    23.62 us/run -     5760 kB/run -  232.55 GB/s
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]):                  4370 runs -    30.58 us/run -     7680 kB/run -  239.52 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]):                  2185 runs -    42.24 us/run -    15360 kB/run -  346.77 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]):                  1457 runs -    62.31 us/run -    23040 kB/run -  352.63 GB/s
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]):                   4370 runs -    16.64 us/run -     7680 kB/run -  440.18 GB/s
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]):                  2185 runs -    15.57 us/run -    15360 kB/run -  940.96 GB/s
  ADD(type=f32,ne=[640,1,1,1],nr=[1,1,1,1]):                     8192 runs -     3.15 us/run -        7 kB/run -    2.27 GB/s

FSSRepo (Collaborator) commented Dec 6, 2023

Seems good

slaren (Collaborator, Author) commented Dec 6, 2023

Adjusting the block dims depending on the dimensions of the tensors improves some results:

  ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]):                    8192 runs -   122.68 us/run -       15 kB/run -    0.12 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]):                  8192 runs -     6.36 us/run -     3840 kB/run -  575.77 GB/s
  ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]):                  8192 runs -     6.31 us/run -     3840 kB/run -  580.16 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]):                  8192 runs -     6.34 us/run -     3840 kB/run -  577.96 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]):                  8192 runs -     5.96 us/run -     3840 kB/run -  614.85 GB/s
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]):                  8192 runs -     5.87 us/run -     3840 kB/run -  623.86 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]):                  5826 runs -     7.20 us/run -     5760 kB/run -  763.11 GB/s
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]):                  4370 runs -     8.67 us/run -     7680 kB/run -  845.24 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]):                  2185 runs -    16.32 us/run -    15360 kB/run -  897.71 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]):                  1457 runs -    22.88 us/run -    23040 kB/run -  960.41 GB/s
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]):                   4370 runs -     8.69 us/run -     7680 kB/run -  843.31 GB/s
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]):                  2185 runs -    16.23 us/run -    15360 kB/run -  902.41 GB/s
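
As a rough illustration of what adjusting the block dims means here (a hypothetical helper with made-up clamp values, not the ggml-cuda.cu logic): cap each block dimension by the corresponding tensor dimension, so tensors with small dimensions do not launch mostly idle threads along those dimensions.

    #include <cstdint>
    #include <cuda_runtime.h>

    // Illustrative heuristic only: clamp block dims to the tensor dims,
    // keeping at most 64*4*2 = 512 threads per block.
    static dim3 pick_block_dims(int64_t ne0, int64_t ne1, int64_t ne2) {
        const unsigned bx = (unsigned) (ne0 < 64 ? ne0 : 64);
        const unsigned by = (unsigned) (ne1 <  4 ? ne1 :  4);
        const unsigned bz = (unsigned) (ne2 <  2 ? ne2 :  2);
        return dim3(bx, by, bz);
    }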

slaren (Collaborator, Author) commented Dec 6, 2023

Processing two elements per thread:

  ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]):                    8192 runs -    10.76 us/run -       15 kB/run -    1.33 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]):                  8192 runs -     5.33 us/run -     3840 kB/run -  687.38 GB/s
  ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]):                  8192 runs -     5.25 us/run -     3840 kB/run -  697.43 GB/s
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]):                  8192 runs -     5.32 us/run -     3840 kB/run -  688.97 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]):                  8192 runs -     5.21 us/run -     3840 kB/run -  703.33 GB/s
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]):                  8192 runs -     4.89 us/run -     3840 kB/run -  749.53 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]):                  5826 runs -     5.68 us/run -     5760 kB/run -  966.92 GB/s
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]):                  4370 runs -     6.53 us/run -     7680 kB/run - 1121.87 GB/s
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]):                  2185 runs -    15.27 us/run -    15360 kB/run -  959.04 GB/s
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]):                  1457 runs -    21.18 us/run -    23040 kB/run - 1037.30 GB/s
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]):                   4370 runs -     6.51 us/run -     7680 kB/run - 1124.47 GB/s
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]):                  2185 runs -    15.09 us/run -    15360 kB/run -  970.82 GB/s

This should be good enough for these ops already; they are usually a very small fraction of the overall time anyway.
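
For illustration, a minimal sketch of the "two elements per thread" idea for a same-shape f32 add (not the actual ggml-cuda.cu kernel; the broadcast index math is omitted for brevity):

    #include <cstdint>

    // Each thread handles two consecutive elements of the flattened output,
    // halving the number of blocks needed. Launch with enough threads to
    // cover (n + 1) / 2 elements.
    __global__ void add_f32_x2(const float * a, const float * b, float * dst, const int64_t n) {
        const int64_t i = 2 * ((int64_t) blockIdx.x * blockDim.x + threadIdx.x);
        if (i >= n) {
            return;
        }
        dst[i] = a[i] + b[i];
        if (i + 1 < n) {
            dst[i + 1] = a[i + 1] + b[i + 1];
        }
    }

Each thread issues two loads and stores from consecutive addresses, which helps saturate memory bandwidth for these purely memory-bound ops.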

slaren marked this pull request as ready for review on December 6, 2023 at 15:25.
slaren changed the title from "test-backend-ops : add performance eval mode" to "test-backend-ops : add performance eval mode + improve CUDA repeat and binary broadcast ops performance" on Dec 6, 2023.
FSSRepo (Collaborator) commented Dec 6, 2023

@slaren I will test your performance improvement in stable-diffusion.

ggerganov (Owner) commented Dec 6, 2023

Hm, whisper seems broken with CUDA atm?


Edit: whisper tests that are currently failing:

diff --git a/tests/test-backend-ops.cpp b/tests/test-backend-ops.cpp
index 8540ebd..abd4de6 100644
--- a/tests/test-backend-ops.cpp
+++ b/tests/test-backend-ops.cpp
@@ -1152,6 +1152,10 @@ static bool test_backend(ggml_backend_t backend, test_mode mode, const char * op
     add_test_bin_bcast(GGML_TYPE_F32, {5120, 1, 1, 1}, {1, 256, 1, 1});
     add_test_bin_bcast(GGML_TYPE_F32, {640, 1, 1, 1}, {1, 1, 1, 1});
 
+    // whisper
+    add_test_bin_bcast(GGML_TYPE_F32, {1500, 512, 1, 1}, {1, 512, 1, 1});
+    add_test_bin_bcast(GGML_TYPE_F32, {3000, 512, 1, 1}, {1, 512, 1, 1});
+
     test_cases.emplace_back(new test_scale());
 
     for (float eps : {1e-6f, 1e-5f, 1e-3f, 1e-1f}) {

Edit 2: whisper is now fixed, though these tests run OOM - not sure if expected

FSSRepo (Collaborator) commented Dec 6, 2023

@slaren with these dimensions in CUDA, it crashes (using stable diffusion, computing a LoRA).

CUDA OP: ADD  A[1, 1, 320, 320] B[1, 1, 320, 320]

slaren (Collaborator, Author) commented Dec 6, 2023

Both issues should be fixed now.

slaren (Collaborator, Author) commented Dec 6, 2023

whisper is now fixed, though these tests run OOM - not sure if expected

I don't think so, the tests shouldn't be so big as to cause OOM issues. When does that happen?

FSSRepo (Collaborator) commented Dec 6, 2023

CUDA OP: ADD [3, 3, 2560, 1280] [3, 3, 2560, 1280]
blocks num(1, 1, 78020)
block dim(1, 3, 42)

CUDA error 9 at C:\proyectos\stable-diffusion.cpp\ggml\src\ggml-cuda.cu:7238: invalid configuration argument
current device: 0

ggerganov (Owner) commented:

I don't think so, the tests shouldn't be so big as to cause OOM issues. When does that happen?

With that patch adding the whisper tests, it fails like this:

make -j && ./bin/test-backend-ops -b CUDA0 -o ADD
Testing 2 backends

Backend 1/2 (CPU)
  Skipping
Backend 2/2 (CUDA0)
ggml_init_cublas: GGML_CUDA_FORCE_MMQ:   no
ggml_init_cublas: CUDA_USE_TENSOR_CORES: yes
ggml_init_cublas: found 1 CUDA devices:
  Device 0: NVIDIA GeForce GTX 1660, compute capability 7.5
  Backend name: CUDA
  ADD(type=f32,ne=[1,1,8,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1,1,320,320],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[16,10,1,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[16,10,10,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[2,1,1,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,2,1,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,1]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,2]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,2]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[1,2,2,2]): OK
  ADD(type=f32,ne=[16,10,10,10],nr=[2,2,2,2]): OK
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]): OK
  ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]): OK
  ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]): OK
  ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]): OK
  ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]): OK
  ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]): OK
  ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]): OK
  ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]): OK
  ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]): OK
  ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]): OK
  ADD(type=f32,ne=[640,1,1,1],nr=[1,1,1,1]): OK

CUDA error 9 at /home/ggerganov/development/github/ggml/src/ggml-cuda.cu:6998: invalid configuration argument
current device: 0
GGML_ASSERT: /home/ggerganov/development/github/ggml/src/ggml-cuda.cu:6998: !"CUDA error"
Could not attach to process.  If your uid matches the uid of the target
process, check the setting of /proc/sys/kernel/yama/ptrace_scope, or try
again as the root user.  For more details, see /etc/sysctl.d/10-ptrace.conf
ptrace: Operation not permitted.
No stack.
The program is not being run.

slaren (Collaborator, Author) commented Dec 6, 2023

Is that nr correct though? That would cause 262144 rows. Still, the issue is that CUDA limits grid sizes in the y and z dims to 65535. So I guess we need to move the entire grid size to the x dim and compute the indices from that, which is a bit annoying.
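
A minimal sketch of that fallback idea for contiguous f32 tensors (a hypothetical kernel, not the code that ended up in ggml-cuda.cu): launch a 1D grid over all destination elements and recover the 4D indices from the flat index, so no launch dimension hits the y/z limit.

    #include <cstdint>

    // ne0..ne3: dst dims (ne0 is the fastest-varying), ne10..ne13: dims of the
    // broadcast source b. Contiguous data assumed.
    __global__ void add_f32_flat(const float * a, const float * b, float * dst,
                                 const int64_t ne0, const int64_t ne1, const int64_t ne2, const int64_t ne3,
                                 const int64_t ne10, const int64_t ne11, const int64_t ne12, const int64_t ne13) {
        const int64_t n = ne0*ne1*ne2*ne3;
        const int64_t i = (int64_t) blockIdx.x * blockDim.x + threadIdx.x;
        if (i >= n) {
            return;
        }
        // unflatten the destination index
        const int64_t i0 =  i % ne0;
        const int64_t i1 = (i / ne0) % ne1;
        const int64_t i2 = (i / (ne0*ne1)) % ne2;
        const int64_t i3 =  i / (ne0*ne1*ne2);
        // wrap each index by b's size in that dimension to implement broadcasting
        const int64_t ib = (i0 % ne10) + ne10*((i1 % ne11) + ne11*((i2 % ne12) + ne12*(i3 % ne13)));
        dst[i] = a[i] + b[ib];
    }

A launch of (n + block_size - 1) / block_size blocks in x covers the whole tensor; gridDim.x allows up to 2^31 - 1 blocks, so shapes like [3, 3, 2560, 1280] fit easily.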

FSSRepo (Collaborator) commented Dec 6, 2023

Is that nr correct though?

In this case (CUDA OP: ADD [3, 3, 2560, 1280] [3, 3, 2560, 1280]), nr should be [1, 1, 1, 1], without any broadcasting.

ggerganov (Owner) commented:

Is that nr correct though?

No, it's not. I got confused about the meaning of the test arguments; ignore these tests.

Not sure about @FSSRepo's case

FSSRepo (Collaborator) commented Dec 6, 2023

@slaren I think we could check whether the shapes are the same and swap the larger dimensions to the front and the smaller ones to the back:

[3,3,2560,1280] -> [2560, 1280, 3, 3]

to avoid the CUDA grid limits; afterwards, the result is returned to the original order.

slaren (Collaborator, Author) commented Dec 7, 2023

I tried a few different solutions with a single kernel, but they resulted in decreased performance, so instead I added a fallback kernel for large tensors. I also added dimension collapsing so that multi-dimensional tensors are processed as 1D tensors when possible, which may improve performance slightly in some cases (see the sketch after the results below).

ADD(type=f32,ne=[1,1,8,1],nr=[1,1,1,1]):          8192 runs -     3.37 us/run -        0 kB/run -    0.03 GB/s
ADD(type=f32,ne=[1,1,320,320],nr=[1,1,1,1]):      8192 runs -     3.86 us/run -     1200 kB/run -  296.47 GB/s
ADD(type=f32,ne=[16,10,1,1],nr=[1,1,1,1]):        8192 runs -     3.42 us/run -        1 kB/run -    0.52 GB/s
ADD(type=f32,ne=[16,10,10,1],nr=[1,1,1,1]):       8192 runs -     3.43 us/run -       18 kB/run -    5.22 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,1]):      8192 runs -     3.46 us/run -      187 kB/run -   51.72 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[2,1,1,1]):      8192 runs -     3.68 us/run -      375 kB/run -   97.16 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,2,1,1]):      8192 runs -     3.69 us/run -      375 kB/run -   96.85 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,1]):      8192 runs -     3.69 us/run -      375 kB/run -   97.04 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,1,1,2]):      8192 runs -     3.61 us/run -      375 kB/run -   99.07 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,1,2,2]):      8192 runs -     3.77 us/run -      750 kB/run -  189.94 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[1,2,2,2]):      8192 runs -     4.00 us/run -     1500 kB/run -  357.74 GB/s
ADD(type=f32,ne=[16,10,10,10],nr=[2,2,2,2]):      8192 runs -     4.59 us/run -     3000 kB/run -  623.11 GB/s
ADD(type=f32,ne=[1280,1,1,1],nr=[1,1,1,1]):       8192 runs -     3.31 us/run -       15 kB/run -    4.32 GB/s
ADD(type=f32,ne=[1280,1,1,1],nr=[1,16,16,1]):     8192 runs -     4.91 us/run -     3840 kB/run -  745.62 GB/s
ADD(type=f32,ne=[1280,16,16,1],nr=[1,1,1,1]):     8192 runs -     4.88 us/run -     3840 kB/run -  750.94 GB/s
ADD(type=f32,ne=[1280,1,1,1],nr=[1,256,1,1]):     8192 runs -     4.91 us/run -     3840 kB/run -  746.06 GB/s
ADD(type=f32,ne=[1,1,1280,1],nr=[16,16,1,1]):     8192 runs -     4.86 us/run -     3840 kB/run -  752.84 GB/s
ADD(type=f32,ne=[16,16,1280,1],nr=[1,1,1,1]):     8192 runs -     4.87 us/run -     3840 kB/run -  751.22 GB/s
ADD(type=f32,ne=[1,1,1920,1],nr=[16,16,1,1]):     5826 runs -     5.68 us/run -     5760 kB/run -  966.45 GB/s
ADD(type=f32,ne=[1,1,2560,1],nr=[16,16,1,1]):     4370 runs -     6.50 us/run -     7680 kB/run - 1127.64 GB/s
ADD(type=f32,ne=[1,1,1280,1],nr=[32,32,1,1]):     2185 runs -    15.25 us/run -    15360 kB/run -  960.36 GB/s
ADD(type=f32,ne=[1,1,1920,1],nr=[32,32,1,1]):     1457 runs -    21.13 us/run -    23040 kB/run - 1039.86 GB/s
ADD(type=f32,ne=[1,1,640,1],nr=[32,32,1,1]):      4370 runs -     6.51 us/run -     7680 kB/run - 1125.93 GB/s
ADD(type=f32,ne=[5120,1,1,1],nr=[1,256,1,1]):     2185 runs -    15.09 us/run -    15360 kB/run -  970.90 GB/s
ADD(type=f32,ne=[640,1,1,1],nr=[1,1,1,1]):        8192 runs -     3.32 us/run -        7 kB/run -    2.16 GB/s
ADD(type=f32,ne=[3,3,2560,1280],nr=[1,1,1,1]):      98 runs -   392.74 us/run -   345600 kB/run -  839.20 GB/s
ADD(type=f32,ne=[3,3,2560,1280],nr=[2,1,1,1]):      49 runs -   860.69 us/run -   691200 kB/run -  765.87 GB/s
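
As a rough sketch of the dimension collapsing mentioned above (a hypothetical helper, not the exact implementation): two adjacent dimensions can be folded together whenever the broadcast pattern is the same in both, so a same-shape contiguous add collapses all the way down to one dimension.

    #include <cstdint>

    // ne_dst: dims of the destination, ne_src: dims of the broadcast source
    // (each entry is either 1 or equal to the dst dim). ggml order: index 0 is
    // the fastest-varying dim; contiguous data assumed. Returns the number of
    // effective dimensions left after collapsing.
    static int collapse_dims(int64_t ne_dst[4], int64_t ne_src[4]) {
        int nd = 4;
        for (int d = 1; d < nd; ) {
            const bool both_plain = ne_src[d-1] == ne_dst[d-1] && ne_src[d] == ne_dst[d];
            const bool both_bcast = ne_src[d-1] == 1           && ne_src[d] == 1;
            if (both_plain || both_bcast) {
                ne_dst[d-1] *= ne_dst[d];              // fold dim d into dim d-1
                ne_src[d-1] *= ne_src[d];
                for (int k = d; k < nd - 1; ++k) {     // shift the remaining dims down
                    ne_dst[k] = ne_dst[k+1];
                    ne_src[k] = ne_src[k+1];
                }
                ne_dst[nd-1] = ne_src[nd-1] = 1;
                --nd;
            } else {
                ++d;
            }
        }
        return nd;
    }

For example, ne_dst=[16,16,1280,1] with ne_src=[1,1,1280,1] collapses to a [256,1280] destination with a [1,1280] source, and the same-shape [3,3,2560,1280] add collapses to a single 1D tensor.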

FSSRepo (Collaborator) commented Dec 7, 2023

@slaren

This implementation:

[       ADD] - 22.503000 ms - 419 - 0.053706 ms
[       MUL] - 8.619000 ms - 125 - 0.068952 ms
[    CONCAT] - 2.235000 ms - 12 - 0.186250 ms
[      NORM] - 2.643000 ms - 48 - 0.055062 ms
[GROUP_NORM] - 5.309000 ms - 61 - 0.087033 ms
[   MUL_MAT] - 120.028000 ms - 361 - 0.332488 ms
[     SCALE] - 2.323000 ms - 32 - 0.072594 ms
[      CONT] - 11.394000 ms - 160 - 0.071213 ms
[  SOFT_MAX] - 101.511002 ms - 32 - 3.172219 ms
[    IM2COL] - 62.881001 ms - 97 - 0.648258 ms
[   UPSCALE] - 0.566000 ms - 3 - 0.188667 ms
[     UNARY] - 4.700000 ms - 84 - 0.055952 ms
Total Time: 344.712036 ms

My implementation:

[       ADD] - 21.808001 ms - 419 - 0.052048 ms
[       MUL] - 8.414000 ms - 125 - 0.067312 ms
[    CONCAT] - 2.197000 ms - 12 - 0.183083 ms
[      NORM] - 2.564000 ms - 48 - 0.053417 ms
[GROUP_NORM] - 5.223000 ms - 61 - 0.085623 ms
[   MUL_MAT] - 117.841003 ms - 361 - 0.326429 ms
[     SCALE] - 2.251000 ms - 32 - 0.070344 ms
[      CONT] - 11.176000 ms - 160 - 0.069850 ms
[  SOFT_MAX] - 101.967003 ms - 32 - 3.186469 ms
[    IM2COL] - 62.348999 ms - 97 - 0.642773 ms
[   UPSCALE] - 0.544000 ms - 3 - 0.181333 ms
[     UNARY] - 4.892000 ms - 84 - 0.058238 ms
Total Time: 341.226013 ms

It seems that the performance is almost identical to my implementation, and it doesn't pose any issues. Everything is working well. Good job!

ggerganov (Owner) commented:

@slaren Planning to make a sync with llama.cpp/whisper.cpp after we merge this PR and ggerganov/llama.cpp#4309. Any concerns?

slaren (Collaborator, Author) commented Dec 7, 2023

No, I don't expect any significant issues.

slaren merged commit 990f931 into master on Dec 7, 2023 (4 checks passed).
slaren deleted the test-backend-perf branch on December 7, 2023 at 17:20.