
update OneMKL gemm_batch call inside dpnp.matmul and column_major version of gemm #1793

Merged
merged 13 commits into master on May 8, 2024

Conversation

@vtavana (Collaborator) commented Apr 19, 2024

In this PR, oneapi::mkl::blas::column_major::gemm and oneapi::mkl::blas::column_major::gemm_batch are added and used in dpnp.matmul when the base of the input arrays is F-contiguous.
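
As a rough illustration of the dispatch idea (a minimal sketch only, not the dpnp implementation; `choose_gemm_layout` is a hypothetical helper, it assumes dpnp's NumPy-like `.flags` attribute, and the real code inspects the base of the inputs rather than the arrays themselves):

```python
import dpnp

def choose_gemm_layout(a, b):
    """Illustrative only: pick which OneMKL gemm variant could be used
    based on the memory layout of the operands."""
    if a.flags.f_contiguous and b.flags.f_contiguous:
        return "oneapi::mkl::blas::column_major::gemm"
    return "oneapi::mkl::blas::row_major::gemm"

a = dpnp.ones((4, 4), order="F", device="gpu")
b = dpnp.ones((4, 4), order="F", device="gpu")
print(choose_gemm_layout(a, b))  # F-contiguous inputs -> column_major path
```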

In addition, gemm_batch is updated to improve performance in some cases by keeping the base of intermediate arrays the same as that of the input arrays (working with array views instead of making copies).
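
For instance (a small sketch just to show what working with views means; it is not code from this PR and assumes dpnp's NumPy-like transpose/view semantics):

```python
import dpnp

a = dpnp.ones((3, 2, 2))          # C-contiguous base
b = a.transpose(0, 2, 1)          # a view over the same data, no copy
a[0, 0, 1] = 7.0
print(b[0, 1, 0])                 # 7.0 -> b shares a's buffer
```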

Also, for large values of batch_size in gemm_batch, the batch_size is split into smaller chunks to avoid a OneMKL error.
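
A minimal sketch of the chunking idea (the threshold `MAX_BATCH` and the helper below are hypothetical; the actual limit and the internal gemm_batch wrapper live in the dpnp backend):

```python
import dpnp

MAX_BATCH = 65536  # assumed upper bound on batch_size per gemm_batch call

def batched_matmul_in_chunks(a, b):
    """Illustrative only: split a large batch of small square matmuls
    along the leading (batch) axis so each piece stays below MAX_BATCH."""
    out = dpnp.empty_like(a)
    batch_size = a.shape[0]
    for start in range(0, batch_size, MAX_BATCH):
        stop = min(start + MAX_BATCH, batch_size)
        # each slice corresponds to one gemm_batch-sized piece of work
        dpnp.matmul(a[start:stop], b[start:stop], out=out[start:stop])
    return out
```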

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you filing the PR as a draft?

@vtavana vtavana self-assigned this Apr 19, 2024
github-actions bot (Contributor) commented Apr 19, 2024

View rendered docs @ https://intelpython.github.io/dpnp/pull//index.html

@vtavana vtavana marked this pull request as ready for review April 19, 2024 23:03
@antonwolfy (Contributor) left a comment

This is a great improvement. Thank you @vtavana !

vtavana added commits:
  • split batch_size to smaller chunks
  • gemm with column major when both input arrays are F-contig
@vtavana vtavana requested a review from antonwolfy May 3, 2024 22:12
@vtavana vtavana changed the title from "update OneMKL gemm_batch call inside dpnp.matmul" to "update OneMKL gemm_batch call inside dpnp.matmul and column_major version of gemm" May 3, 2024
@vtavana (Collaborator, Author) commented May 5, 2024

Timing for calling gemm in different scenarios

  • dpnp timing on Intel Xeon is slower than NumPy, but that is independent of this PR and should be addressed separately.
| Case # | NumPy | old (Iris Xe) | new (Iris Xe) | old (Intel Core) | new (Intel Core) |
|---|---|---|---|---|---|
| 1 | 1.74 s ± 443 ms | 118 ms ± 22.5 ms | 110 ms ± 6.12 ms | 1.71 s ± 439 ms | 1.65 s ± 438 ms |
| 2 | 2 s ± 81.7 ms | 123 ms ± 9.9 ms | 145 ms ± 26.3 ms | 1.88 s ± 427 ms | 1.84 s ± 359 ms |
| 3 | 1.82 s ± 472 ms | 125 ms ± 7.88 ms | 149 ms ± 14.2 ms | 2.01 s ± 106 ms | 1.76 s ± 372 ms |
| 4 | 2.31 s ± 363 ms | 135 ms ± 28.6 ms | 121 ms ± 18 ms | 1.99 s ± 393 ms | 1.79 s ± 375 ms |

| Case # | NumPy | old (PVC) | new (PVC) | old (Intel Xeon) | new (Intel Xeon) |
|---|---|---|---|---|---|
| 1 | 81.8 ms ± 1.32 ms | 8.35 ms ± 20.6 µs | 8.47 ms ± 73.1 µs | 159 ms ± 7.38 ms | 155 ms ± 6.39 ms |
| 2 | 81.4 ms ± 1.14 ms | 8.8 ms ± 62.2 µs | 8.73 ms ± 43.2 µs | 157 ms ± 8.78 ms | 158 ms ± 5.76 ms |
| 3 | 83.4 ms ± 1.13 ms | 8.42 ms ± 67.1 µs | 8.46 ms ± 64 µs | 156 ms ± 3.87 ms | 157 ms ± 5.03 ms |
| 4 | 83.4 ms ± 1.04 ms | 8.64 ms ± 2.76 µs | 8.45 ms ± 78.6 µs | 153 ms ± 8.33 ms | 152 ms ± 3.45 ms |
```python
import dpnp

size = 4096
device = "gpu"
a = dpnp.ones((size, size), order="C", device=device)
b = dpnp.ones((size, size), order="C", device=device)
%timeit dpnp.matmul(a, b)  # Case 1

a = dpnp.ones((size, size), order="C", device=device)
b = dpnp.ones((size, size), order="F", device=device)
%timeit dpnp.matmul(a, b)  # Case 2

a = dpnp.ones((size, size), order="F", device=device)
b = dpnp.ones((size, size), order="C", device=device)
%timeit dpnp.matmul(a, b)  # Case 3

a = dpnp.ones((size, size), order="F", device=device)
b = dpnp.ones((size, size), order="F", device=device)
%timeit dpnp.matmul(a, b)  # Case 4
```

@vtavana (Collaborator, Author) commented May 5, 2024

Timing for calling gemm_batch in different scenarios

  • dpnp timing for cases where one input is C-contiguous and the other is F-contiguous is not better than the old version on PVC.
  • Regardless of the new changes, CPU timing is mostly better than GPU (Iris Xe) timing. This is an issue with OneMKL itself.
| Case # | NumPy | old (Iris Xe) | new (Iris Xe) | old (Intel Core) | new (Intel Core) |
|---|---|---|---|---|---|
| 1 | 3.66 s ± 426 ms | 1.91 s ± 234 ms | 1.89 s ± 224 ms | 903 ms ± 109 ms | 1.02 s ± 75.6 ms |
| 2 | 3.31 s ± 392 ms | 5.65 s ± 3.89 s | 1.74 s ± 268 ms | 2.19 s ± 129 ms | 899 ms ± 336 ms |
| 3 | 3.4 s ± 481 ms | 2.81 s ± 341 ms | 443 ms ± 36.9 ms | 1.91 s ± 45.1 ms | 825 ms ± 296 ms |
| 4 | 3.71 s ± 588 ms | 4.71 s ± 2.27 s | 1.97 s ± 264 ms | 2.94 s ± 356 ms | 871 ms ± 193 ms |

| Case # | NumPy | old (PVC) | new (PVC) | old (Intel Xeon) | new (Intel Xeon) |
|---|---|---|---|---|---|
| 1 | 3.67 s ± 2.25 ms | 57.1 ms ± 862 µs | 57.6 ms ± 579 µs | 586 ms ± 47.6 ms | 497 ms ± 30.9 ms |
| 2 | 3.78 s ± 1.69 ms | 68.6 ms ± 471 µs | 116 ms ± 588 µs | 850 ms ± 36.1 ms | 488 ms ± 16.1 ms |
| 3 | 3.56 s ± 329 µs | 68.6 ms ± 477 µs | 75.3 ms ± 566 µs | 869 ms ± 42.7 ms | 510 ms ± 30.2 ms |
| 4 | 3.63 s ± 1.42 ms | 77 ms ± 38.8 µs | 58.2 ms ± 589 µs | 1.06 s ± 28.1 ms | 472 ms ± 25.2 ms |
```python
import dpnp

size = 4095
device = "gpu"
a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
%timeit dpnp.matmul(a, b)  # Case 1, both C

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
b = b.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, b)  # Case 2, a is C, b is F

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
a = a.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, b)  # Case 3, a is F, b is C

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
a = a.transpose(0, 1, 3, 2)
b = b.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, b)  # Case 4, both F
```

@vtavana (Collaborator, Author) commented May 5, 2024

Timing for calling gemm_batch in different scenarios.

  • size on Iris Xe and Intel Core: (2000, 2000, 4, 4)
  • size on PVC and Intel Xeon: (4095, 4095, 4, 4)
  • timing is mostly comparable or better with the new changes (on PVC some cases become slightly slower)
| Case # | NumPy | old (Iris Xe) | new (Iris Xe) | old (Intel Core) | new (Intel Core) |
|---|---|---|---|---|---|
| 1 | 609 ms ± 21.3 ms | 377 ms ± 9.01 ms | 373 ms ± 2.43 ms | 248 ms ± 24.2 ms | 262 ms ± 25.6 ms |
| 2 | 934 ms ± 87.7 ms | 325 ms ± 1.43 ms | 326 ms ± 1.26 ms | 197 ms ± 24.9 ms | 174 ms ± 10.1 ms |
| 3 | 219 ms ± 46.3 ms | 126 ms ± 2.49 ms | 132 ms ± 2.37 ms | 259 ms ± 43.1 ms | 258 ms ± 13.4 ms |
| 4 | 201 ms ± 33.5 ms | 179 ms ± 36.8 ms | 144 ms ± 6.26 ms | 292 ms ± 21.6 ms | 318 ms ± 50.5 ms |
| 5 | 293 ms ± 10.4 ms | 258 ms ± 17.9 ms | 198 ms ± 20.1 ms | 264 ms ± 17.3 ms | 174 ms ± 9.41 ms |
| 6 | 281 ms ± 13.5 ms | 287 ms ± 7.02 ms | 193 ms ± 20.3 ms | 329 ms ± 36.9 ms | 144 ms ± 15.5 ms |
| 7 | 611 ms ± 60.3 ms | 550 ms ± 26.5 ms | 373 ms ± 3.17 ms | 547 ms ± 150 ms | 260 ms ± 12 ms |
| 8 | 833 ms ± 154 ms | 626 ms ± 87.5 ms | 465 ms ± 48.9 ms | 625 ms ± 40.5 ms | 466 ms ± 44.6 ms |
| 9 | 221 ms ± 27.1 ms | 127 ms ± 1.85 ms | 131 ms ± 2.14 ms | 258 ms ± 10 ms | 269 ms ± 28.1 ms |
| 10 | 185 ms ± 31.4 ms | 171 ms ± 32.4 ms | 142 ms ± 4.01 ms | 302 ms ± 19.2 ms | 238 ms ± 44.2 ms |
| 11 | 293 ms ± 8.25 ms | 265 ms ± 23.7 ms | 189 ms ± 1.24 ms | 198 ms ± 13.7 ms | 140 ms ± 25.3 ms |
| 12 | 284 ms ± 14.7 ms | 302 ms ± 32.7 ms | 207 ms ± 5.14 ms | 397 ms ± 15.8 ms | 315 ms ± 17.1 ms |

| Case # | NumPy | old (PVC) | new (PVC) | old (Intel Xeon) | new (Intel Xeon) |
|---|---|---|---|---|---|
| 1 | 3.67 s ± 1.61 ms | 57 ms ± 898 µs | 57.7 ms ± 602 µs | 610 ms ± 36 ms | 503 ms ± 41.8 ms |
| 2 | 3.41 s ± 322 µs | 54.7 ms ± 8.02 µs | 55.9 ms ± 37.8 µs | 370 ms ± 3.84 ms | 275 ms ± 36.9 ms |
| 3 | 935 ms ± 24.8 ms | 20.2 ms ± 147 µs | 22.9 ms ± 141 µs | 328 ms ± 34 ms | 257 ms ± 28.1 ms |
| 4 | 881 ms ± 59.7 ms | 23.2 ms ± 133 µs | 28.9 ms ± 157 µs | 333 ms ± 33.6 ms | 326 ms ± 26.3 ms |
| 5 | 1.87 s ± 253 µs | 39.2 ms ± 469 µs | 42.7 ms ± 430 µs | 577 ms ± 27.5 ms | 471 ms ± 14.1 ms |
| 6 | 1.73 s ± 183 µs | 44.4 ms ± 526 µs | 51.9 ms ± 143 µs | 624 ms ± 31.9 ms | 580 ms ± 13.3 ms |
| 7 | 3.65 s ± 1.94 ms | 72.7 ms ± 213 µs | 58.1 ms ± 630 µs | 1.11 s ± 36.6 ms | 494 ms ± 29.9 ms |
| 8 | 3.38 s ± 176 µs | 74.1 ms ± 322 µs | 69.9 ms ± 84.9 µs | 914 ms ± 35 ms | 531 ms ± 11.5 ms |
| 9 | 922 ms ± 22.4 ms | 21.5 ms ± 76.1 µs | 23.3 ms ± 61.7 µs | 335 ms ± 31.3 ms | 319 ms ± 30.6 ms |
| 10 | 850 ms ± 129 µs | 24.3 ms ± 272 µs | 29.2 ms ± 128 µs | 362 ms ± 33.8 ms | 354 ms ± 10.8 ms |
| 11 | 1.83 s ± 623 µs | 42 ms ± 439 µs | 42.9 ms ± 401 µs | 606 ms ± 27.3 ms | 534 ms ± 13 ms |
| 12 | 1.7 s ± 200 µs | 47 ms ± 468 µs | 52.2 ms ± 103 µs | 637 ms ± 28.3 ms | 625 ms ± 13.6 ms |
```python
import dpnp

# the base is C-contiguous for cases 1-6
size = 4095
a = dpnp.ones((size, size, 4, 4), device="gpu")
OUT = dpnp.empty((size, size, 4, 4), device="gpu")

%timeit dpnp.matmul(a, a)  # Case 1
out = OUT
%timeit dpnp.matmul(a, a, out=out)  # Case 2
b = a[::2, ::2]
%timeit dpnp.matmul(b, b)  # Case 3
out = OUT[::2, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 4
b = a[:, ::2]
%timeit dpnp.matmul(b, b)  # Case 5
out = OUT[:, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 6

# the base is F-contiguous for cases 7-12
a = a.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, a)  # Case 7
out = OUT
%timeit dpnp.matmul(a, a, out=out)  # Case 8
b = a[::2, ::2]
%timeit dpnp.matmul(b, b)  # Case 9
out = OUT[::2, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 10
b = a[:, ::2]
%timeit dpnp.matmul(b, b)  # Case 11
out = OUT[:, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 12
```

@antonwolfy (Contributor) left a comment

Thank you @vtavana , no more comments from me

@vtavana vtavana merged commit 4c4b5c5 into master May 8, 2024
56 checks passed
@vtavana vtavana deleted the update-gemm-batch branch May 8, 2024 16:43