
update OneMKL gemm_batch call inside dpnp.matmul and column_major version of gemm #1793

Merged
merged 13 commits into master on May 8, 2024

Conversation

@vtavana (Collaborator) commented Apr 19, 2024

In this PR, oneapi::mkl::blas::column_major::gemm and oneapi::mkl::blas::column_major::gemm_batch are added and used in dpnp.matmul when the base of the input arrays is F-contiguous.
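
As a rough illustration of the dispatch idea (a minimal sketch only, not the dpnp implementation; `choose_gemm_layout` is a hypothetical helper, it assumes dpnp's NumPy-like `.flags` attribute, and the real code inspects the base of the inputs rather than the arrays themselves):

```python
import dpnp

def choose_gemm_layout(a, b):
    """Illustrative only: pick which OneMKL gemm variant could be used
    based on the memory layout of the operands."""
    if a.flags.f_contiguous and b.flags.f_contiguous:
        return "oneapi::mkl::blas::column_major::gemm"
    return "oneapi::mkl::blas::row_major::gemm"

a = dpnp.ones((4, 4), order="F", device="gpu")
b = dpnp.ones((4, 4), order="F", device="gpu")
print(choose_gemm_layout(a, b))  # F-contiguous inputs -> column_major path
```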

In addition, gemm_batch is updated to improve performance in some cases by keeping the base of intermediate arrays the same as that of the input arrays (working with array views instead of making copies).
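
For instance (a small sketch just to show what working with views means; it is not code from this PR and assumes dpnp's NumPy-like transpose/view semantics):

```python
import dpnp

a = dpnp.ones((3, 2, 2))          # C-contiguous base
b = a.transpose(0, 2, 1)          # a view over the same data, no copy
a[0, 0, 1] = 7.0
print(b[0, 1, 0])                 # 7.0 -> b shares a's buffer
```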

Also, for large values of batch_size in gemm_batch, the batch_size is split into smaller chunks to avoid a OneMKL error.
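
A minimal sketch of the chunking idea (the threshold `MAX_BATCH` and the helper below are hypothetical; the actual limit and the internal gemm_batch wrapper live in the dpnp backend):

```python
import dpnp

MAX_BATCH = 65536  # assumed upper bound on batch_size per gemm_batch call

def batched_matmul_in_chunks(a, b):
    """Illustrative only: split a large batch of small square matmuls
    along the leading (batch) axis so each piece stays below MAX_BATCH."""
    out = dpnp.empty_like(a)
    batch_size = a.shape[0]
    for start in range(0, batch_size, MAX_BATCH):
        stop = min(start + MAX_BATCH, batch_size)
        # each slice corresponds to one gemm_batch-sized piece of work
        dpnp.matmul(a[start:stop], b[start:stop], out=out[start:stop])
    return out
```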

  • Have you provided a meaningful PR description?
  • Have you added a test, reproducer or referred to issue with a reproducer?
  • Have you tested your changes locally for CPU and GPU devices?
  • Have you made sure that new changes do not introduce compiler warnings?
  • Have you checked performance impact of proposed changes?
  • If this PR is a work in progress, are you filing the PR as a draft?

@vtavana vtavana self-assigned this Apr 19, 2024
github-actions bot (Contributor) commented Apr 19, 2024

View rendered docs @ https://intelpython.github.io/dpnp/pull//index.html

@vtavana vtavana marked this pull request as ready for review April 19, 2024 23:03
@antonwolfy (Contributor) left a comment

This is a great improvement. Thank you @vtavana !

vtavana added commits:
  • split batch_size to smaller chunks
  • gemm with column major when both input arrays are F-contig
@vtavana vtavana requested a review from antonwolfy May 3, 2024 22:12
@vtavana vtavana changed the title from "update OneMKL gemm_batch call inside dpnp.matmul" to "update OneMKL gemm_batch call inside dpnp.matmul and column_major version of gemm" May 3, 2024
@vtavana (Collaborator, Author) commented May 5, 2024

Timing for calling gemm in different scenarios

  • dpnp timing on Intel Xeon is slower than NumPy, but that is independent of this PR and should be addressed separately.
| Case # | NumPy | old (Iris Xe) | new (Iris Xe) | old (Intel Core) | new (Intel Core) |
|---|---|---|---|---|---|
| 1 | 1.74 s ± 443 ms | 118 ms ± 22.5 ms | 110 ms ± 6.12 ms | 1.71 s ± 439 ms | 1.65 s ± 438 ms |
| 2 | 2 s ± 81.7 ms | 123 ms ± 9.9 ms | 145 ms ± 26.3 ms | 1.88 s ± 427 ms | 1.84 s ± 359 ms |
| 3 | 1.82 s ± 472 ms | 125 ms ± 7.88 ms | 149 ms ± 14.2 ms | 2.01 s ± 106 ms | 1.76 s ± 372 ms |
| 4 | 2.31 s ± 363 ms | 135 ms ± 28.6 ms | 121 ms ± 18 ms | 1.99 s ± 393 ms | 1.79 s ± 375 ms |

| Case # | NumPy | old (PVC) | new (PVC) | old (Intel Xeon) | new (Intel Xeon) |
|---|---|---|---|---|---|
| 1 | 81.8 ms ± 1.32 ms | 8.35 ms ± 20.6 µs | 8.47 ms ± 73.1 µs | 159 ms ± 7.38 ms | 155 ms ± 6.39 ms |
| 2 | 81.4 ms ± 1.14 ms | 8.8 ms ± 62.2 µs | 8.73 ms ± 43.2 µs | 157 ms ± 8.78 ms | 158 ms ± 5.76 ms |
| 3 | 83.4 ms ± 1.13 ms | 8.42 ms ± 67.1 µs | 8.46 ms ± 64 µs | 156 ms ± 3.87 ms | 157 ms ± 5.03 ms |
| 4 | 83.4 ms ± 1.04 ms | 8.64 ms ± 2.76 µs | 8.45 ms ± 78.6 µs | 153 ms ± 8.33 ms | 152 ms ± 3.45 ms |
```python
import dpnp

size = 4096
device = "gpu"
a = dpnp.ones((size, size), order="C", device=device)
b = dpnp.ones((size, size), order="C", device=device)
%timeit dpnp.matmul(a, b)  # Case 1

a = dpnp.ones((size, size), order="C", device=device)
b = dpnp.ones((size, size), order="F", device=device)
%timeit dpnp.matmul(a, b)  # Case 2

a = dpnp.ones((size, size), order="F", device=device)
b = dpnp.ones((size, size), order="C", device=device)
%timeit dpnp.matmul(a, b)  # Case 3

a = dpnp.ones((size, size), order="F", device=device)
b = dpnp.ones((size, size), order="F", device=device)
%timeit dpnp.matmul(a, b)  # Case 4
```

@vtavana (Collaborator, Author) commented May 5, 2024

Timing for calling gemm_batch in different scenarios

  • dpnp timing for cases where one input is C-contiguous and the other is F-contiguous is not better than the old version on PVC.
  • Regardless of the new changes, CPU timing is mostly better than GPU (Iris Xe) timing. This is an issue with OneMKL itself.
| Case # | NumPy | old (Iris Xe) | new (Iris Xe) | old (Intel Core) | new (Intel Core) |
|---|---|---|---|---|---|
| 1 | 3.66 s ± 426 ms | 1.91 s ± 234 ms | 1.89 s ± 224 ms | 903 ms ± 109 ms | 1.02 s ± 75.6 ms |
| 2 | 3.31 s ± 392 ms | 5.65 s ± 3.89 s | 1.74 s ± 268 ms | 2.19 s ± 129 ms | 899 ms ± 336 ms |
| 3 | 3.4 s ± 481 ms | 2.81 s ± 341 ms | 443 ms ± 36.9 ms | 1.91 s ± 45.1 ms | 825 ms ± 296 ms |
| 4 | 3.71 s ± 588 ms | 4.71 s ± 2.27 s | 1.97 s ± 264 ms | 2.94 s ± 356 ms | 871 ms ± 193 ms |

| Case # | NumPy | old (PVC) | new (PVC) | old (Intel Xeon) | new (Intel Xeon) |
|---|---|---|---|---|---|
| 1 | 3.67 s ± 2.25 ms | 57.1 ms ± 862 µs | 57.6 ms ± 579 µs | 586 ms ± 47.6 ms | 497 ms ± 30.9 ms |
| 2 | 3.78 s ± 1.69 ms | 68.6 ms ± 471 µs | 116 ms ± 588 µs | 850 ms ± 36.1 ms | 488 ms ± 16.1 ms |
| 3 | 3.56 s ± 329 µs | 68.6 ms ± 477 µs | 75.3 ms ± 566 µs | 869 ms ± 42.7 ms | 510 ms ± 30.2 ms |
| 4 | 3.63 s ± 1.42 ms | 77 ms ± 38.8 µs | 58.2 ms ± 589 µs | 1.06 s ± 28.1 ms | 472 ms ± 25.2 ms |
```python
import dpnp

size = 4095
device = "gpu"
a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
%timeit dpnp.matmul(a, b)  # Case 1, both C

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
b = b.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, b)  # Case 2, a is C, b is F

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
a = a.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, b)  # Case 3, a is F, b is C

a = dpnp.ones((size, size, 4, 4), device=device)
b = dpnp.ones((size, size, 4, 4), device=device)
a = a.transpose(0, 1, 3, 2)
b = b.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, b)  # Case 4, both F
```

@vtavana (Collaborator, Author) commented May 5, 2024

Timing for calling gemm_batch in different scenarios.

  • size on Iris Xe and Intel Core: (2000, 2000, 4, 4)
  • size on PVC and Intel Xeon: (4095, 4095, 4, 4)
  • timing is mostly comparable or better with the new changes (on PVC some cases become slightly slower)
| Case # | NumPy | old (Iris Xe) | new (Iris Xe) | old (Intel Core) | new (Intel Core) |
|---|---|---|---|---|---|
| 1 | 609 ms ± 21.3 ms | 377 ms ± 9.01 ms | 373 ms ± 2.43 ms | 248 ms ± 24.2 ms | 262 ms ± 25.6 ms |
| 2 | 934 ms ± 87.7 ms | 325 ms ± 1.43 ms | 326 ms ± 1.26 ms | 197 ms ± 24.9 ms | 174 ms ± 10.1 ms |
| 3 | 219 ms ± 46.3 ms | 126 ms ± 2.49 ms | 132 ms ± 2.37 ms | 259 ms ± 43.1 ms | 258 ms ± 13.4 ms |
| 4 | 201 ms ± 33.5 ms | 179 ms ± 36.8 ms | 144 ms ± 6.26 ms | 292 ms ± 21.6 ms | 318 ms ± 50.5 ms |
| 5 | 293 ms ± 10.4 ms | 258 ms ± 17.9 ms | 198 ms ± 20.1 ms | 264 ms ± 17.3 ms | 174 ms ± 9.41 ms |
| 6 | 281 ms ± 13.5 ms | 287 ms ± 7.02 ms | 193 ms ± 20.3 ms | 329 ms ± 36.9 ms | 144 ms ± 15.5 ms |
| 7 | 611 ms ± 60.3 ms | 550 ms ± 26.5 ms | 373 ms ± 3.17 ms | 547 ms ± 150 ms | 260 ms ± 12 ms |
| 8 | 833 ms ± 154 ms | 626 ms ± 87.5 ms | 465 ms ± 48.9 ms | 625 ms ± 40.5 ms | 466 ms ± 44.6 ms |
| 9 | 221 ms ± 27.1 ms | 127 ms ± 1.85 ms | 131 ms ± 2.14 ms | 258 ms ± 10 ms | 269 ms ± 28.1 ms |
| 10 | 185 ms ± 31.4 ms | 171 ms ± 32.4 ms | 142 ms ± 4.01 ms | 302 ms ± 19.2 ms | 238 ms ± 44.2 ms |
| 11 | 293 ms ± 8.25 ms | 265 ms ± 23.7 ms | 189 ms ± 1.24 ms | 198 ms ± 13.7 ms | 140 ms ± 25.3 ms |
| 12 | 284 ms ± 14.7 ms | 302 ms ± 32.7 ms | 207 ms ± 5.14 ms | 397 ms ± 15.8 ms | 315 ms ± 17.1 ms |

| Case # | NumPy | old (PVC) | new (PVC) | old (Intel Xeon) | new (Intel Xeon) |
|---|---|---|---|---|---|
| 1 | 3.67 s ± 1.61 ms | 57 ms ± 898 µs | 57.7 ms ± 602 µs | 610 ms ± 36 ms | 503 ms ± 41.8 ms |
| 2 | 3.41 s ± 322 µs | 54.7 ms ± 8.02 µs | 55.9 ms ± 37.8 µs | 370 ms ± 3.84 ms | 275 ms ± 36.9 ms |
| 3 | 935 ms ± 24.8 ms | 20.2 ms ± 147 µs | 22.9 ms ± 141 µs | 328 ms ± 34 ms | 257 ms ± 28.1 ms |
| 4 | 881 ms ± 59.7 ms | 23.2 ms ± 133 µs | 28.9 ms ± 157 µs | 333 ms ± 33.6 ms | 326 ms ± 26.3 ms |
| 5 | 1.87 s ± 253 µs | 39.2 ms ± 469 µs | 42.7 ms ± 430 µs | 577 ms ± 27.5 ms | 471 ms ± 14.1 ms |
| 6 | 1.73 s ± 183 µs | 44.4 ms ± 526 µs | 51.9 ms ± 143 µs | 624 ms ± 31.9 ms | 580 ms ± 13.3 ms |
| 7 | 3.65 s ± 1.94 ms | 72.7 ms ± 213 µs | 58.1 ms ± 630 µs | 1.11 s ± 36.6 ms | 494 ms ± 29.9 ms |
| 8 | 3.38 s ± 176 µs | 74.1 ms ± 322 µs | 69.9 ms ± 84.9 µs | 914 ms ± 35 ms | 531 ms ± 11.5 ms |
| 9 | 922 ms ± 22.4 ms | 21.5 ms ± 76.1 µs | 23.3 ms ± 61.7 µs | 335 ms ± 31.3 ms | 319 ms ± 30.6 ms |
| 10 | 850 ms ± 129 µs | 24.3 ms ± 272 µs | 29.2 ms ± 128 µs | 362 ms ± 33.8 ms | 354 ms ± 10.8 ms |
| 11 | 1.83 s ± 623 µs | 42 ms ± 439 µs | 42.9 ms ± 401 µs | 606 ms ± 27.3 ms | 534 ms ± 13 ms |
| 12 | 1.7 s ± 200 µs | 47 ms ± 468 µs | 52.2 ms ± 103 µs | 637 ms ± 28.3 ms | 625 ms ± 13.6 ms |
```python
import dpnp

# the base is C-contiguous for cases 1-6
size = 4095
a = dpnp.ones((size, size, 4, 4), device="gpu")
OUT = dpnp.empty((size, size, 4, 4), device="gpu")

%timeit dpnp.matmul(a, a)  # Case 1
out = OUT
%timeit dpnp.matmul(a, a, out=out)  # Case 2
b = a[::2, ::2]
%timeit dpnp.matmul(b, b)  # Case 3
out = OUT[::2, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 4
b = a[:, ::2]
%timeit dpnp.matmul(b, b)  # Case 5
out = OUT[:, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 6

# the base is F-contiguous for cases 7-12
a = a.transpose(0, 1, 3, 2)
%timeit dpnp.matmul(a, a)  # Case 7
out = OUT
%timeit dpnp.matmul(a, a, out=out)  # Case 8
b = a[::2, ::2]
%timeit dpnp.matmul(b, b)  # Case 9
out = OUT[::2, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 10
b = a[:, ::2]
%timeit dpnp.matmul(b, b)  # Case 11
out = OUT[:, ::2]
%timeit dpnp.matmul(b, b, out=out)  # Case 12
```

@antonwolfy (Contributor) left a comment

Thank you @vtavana , no more comments from me

@vtavana vtavana merged commit 4c4b5c5 into master May 8, 2024
56 checks passed
@vtavana vtavana deleted the update-gemm-batch branch May 8, 2024 16:43