This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
feat: oneDNN wrapper based on oneDNN_jll #156
Draft
avik-pal wants to merge 2 commits into main from ap/onednn
avik-pal force-pushed the ap/onednn branch 2 times, most recently from 475ac00 to 7812347 (September 9, 2024 21:04)
LuxLib Benchmarks
Benchmark suite | Current: 844beaf | Previous: 7ba127a | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6167 ns |
4667 ns |
1.32 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
6229.5 ns |
6666.5 ns |
0.93 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6042 ns |
7500 ns |
0.81 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
7416 ns |
5750 ns |
1.29 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
119269 ns |
117321 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
2830199 ns |
2723919 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
835125 ns |
3008750 ns |
0.28 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
455544 ns |
404195 ns |
1.13 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9791 ns |
9896 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9895.5 ns |
9833 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9875 ns |
9979 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
10000 ns |
9958.5 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
595639 ns |
533872 ns |
1.12 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
18414070 ns |
18512917 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
2633292 ns |
2324292 ns |
1.13 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
656066 ns |
674968 ns |
0.97 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
1458 ns |
1437.5 ns |
1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
1500 ns |
2875 ns |
0.52 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
1500 ns |
2083 ns |
0.72 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
2208 ns |
1437.5 ns |
1.54 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA |
22524 ns |
21479 ns |
1.05 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI |
1309757 ns |
1282166 ns |
1.02 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal |
207541 ns |
190209 ns |
1.09 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU |
29130 ns |
29540 ns |
0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
3958 ns |
4250 ns |
0.93 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
4417 ns |
4167 ns |
1.06 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
4083 ns |
4145.5 ns |
0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
3834 ns |
4375 ns |
0.88 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA |
146914 ns |
144438.5 ns |
1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI |
8793357 ns |
9108147.5 ns |
0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal |
1531042 ns |
1604875 ns |
0.95 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
147022 ns |
145092 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57708 ns |
55875 ns |
1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
46916 ns |
39209 ns |
1.20 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
45167 ns |
46625 ns |
0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82145.5 ns |
84167 ns |
0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
38357 ns |
36824 ns |
1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
642389 ns |
542002 ns |
1.19 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1024291 ns |
1333104 ns |
0.77 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
83071 ns |
81391 ns |
1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2041646 ns |
2024917 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2084416 ns |
2079125 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2084937 ns |
2081625 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1993500 ns |
1993125 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
232297.5 ns |
226688 ns |
1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
8026795 ns |
7623752 ns |
1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
4457750 ns |
7427958 ns |
0.60 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1566555 ns |
1252074 ns |
1.25 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
149083.5 ns |
174750 ns |
0.85 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
151250 ns |
164541.5 ns |
0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
146062.5 ns |
148812.5 ns |
0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
145166 ns |
144375 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
166368.5 ns |
165480 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
7633809.5 ns |
7680925 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1607645.5 ns |
1457521 ns |
1.10 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
199292 ns |
204852 ns |
0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1115833 ns |
1117250 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1109042 ns |
1109375.5 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1116042 ns |
1113334 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1112584 ns |
1112187.5 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
685909.5 ns |
694582 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
33611727.5 ns |
33705507.5 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
5937958.5 ns |
6238375 ns |
0.95 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1023739.5 ns |
1026961 ns |
1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5062 ns |
4417 ns |
1.15 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4917 ns |
5041 ns |
0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4708.5 ns |
5208 ns |
0.90 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4083 ns |
4583 ns |
0.89 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
92394 ns |
93299.5 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI |
5364370 ns |
5368327 ns |
1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
449521 ns |
634041.5 ns |
0.71 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
67281 ns |
69460 ns |
0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8625 ns |
8375 ns |
1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8687.5 ns |
8542 ns |
1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8542 ns |
8833 ns |
0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8667 ns |
8833 ns |
0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
598599 ns |
604485 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
36192580 ns |
36365543 ns |
1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
5961458.5 ns |
5669937.5 ns |
1.05 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
392053.5 ns |
388374 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17250.5 ns |
17000 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
17708.5 ns |
17709 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
17937.5 ns |
18021 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17729 ns |
16895.5 ns |
1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
68975.5 ns |
66654.5 ns |
1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3001183 ns |
2923981.5 ns |
1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1319417 ns |
477833 ns |
2.76 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
79131 ns |
78451 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
216000 ns |
216834 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
223000 ns |
219896 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
212792 ns |
225583.5 ns |
0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
220083 ns |
217625 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
357846.5 ns |
356473 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
14192257.5 ns |
14201022 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
5791666 ns |
5644395.5 ns |
1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
468694 ns |
465005 ns |
1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
625 ns |
667 ns |
0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
750 ns |
750 ns |
1 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
833 ns |
812.5 ns |
1.03 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
625 ns |
625 ns |
1 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA |
21039 ns |
20462 ns |
1.03 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI |
1158458 ns |
1162134.5 ns |
1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal |
287041.5 ns |
302625 ns |
0.95 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU |
30980 ns |
32870 ns |
0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
1375 ns |
1417 ns |
0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
1458.5 ns |
1458 ns |
1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
1458 ns |
1417 ns |
1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
1333 ns |
1416 ns |
0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA |
125621 ns |
125127 ns |
1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI |
8779470 ns |
8831211 ns |
0.99 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal |
1564250 ns |
1526500 ns |
1.02 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
135731 ns |
136521 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7292 ns |
7208 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6042 ns |
5416 ns |
1.12 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6000 ns |
6125 ns |
0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9959 ns |
10666 ns |
0.93 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
24706 ns |
23625 ns |
1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1229054.5 ns |
1207481 ns |
1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
432792 ns |
356458 ns |
1.21 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
47191 ns |
48881 ns |
0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
228937.5 ns |
226166 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
245125 ns |
265333 ns |
0.92 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
228416.5 ns |
234854 ns |
0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
215562 ns |
219500 ns |
0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
199703 ns |
192027 ns |
1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
29760693 ns |
31211143.5 ns |
0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9054125 ns |
9046313 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
643376 ns |
649247 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4083 ns |
4125 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4084 ns |
4083 ns |
1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4125 ns |
4084 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4166 ns |
4083 ns |
1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA |
23957 ns |
23477 ns |
1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI |
2055936 ns |
2001417 ns |
1.03 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal |
224166.5 ns |
214833 ns |
1.04 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU |
46275.5 ns |
47261 ns |
0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16917 ns |
17083 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16916 ns |
17000 ns |
1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
17083 ns |
16833 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16792 ns |
17334 ns |
0.97 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA |
196624 ns |
195303 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI |
10472777 ns |
14536946 ns |
0.72 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal |
966209 ns |
918208 ns |
1.05 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
174212 ns |
174652 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
510000 ns |
508750 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
405500 ns |
330583 ns |
1.23 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
404500 ns |
404666 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
865000 ns |
864791 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA |
113431 ns |
113620 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI |
401239 ns |
401393 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal |
429875 ns |
490979 ns |
0.88 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
241452 ns |
242133 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
2320583 ns |
2313834 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
2030250 ns |
1747479 ns |
1.16 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
2017459 ns |
2035208 ns |
0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
3273458 ns |
3272708.5 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA |
244185 ns |
241207 ns |
1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
11598243 ns |
10021457.5 ns |
1.16 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal |
1916000 ns |
2011770.5 ns |
0.95 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
738512 ns |
743443 ns |
0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6604 ns |
4708.5 ns |
1.40 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
7042 ns |
7625 ns |
0.92 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
6083.5 ns |
7708 ns |
0.79 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
7250 ns |
5479.5 ns |
1.32 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
92991 ns |
92351.5 ns |
1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
5513071 ns |
5442998 ns |
1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
770854.5 ns |
783479 ns |
0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
65480 ns |
65411 ns |
1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
12083.5 ns |
10333.5 ns |
1.17 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
11750 ns |
11875 ns |
0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12042 ns |
11750 ns |
1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
11375 ns |
12062.5 ns |
0.94 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
631759 ns |
634956 ns |
0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
38968123.5 ns |
40400531.5 ns |
0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
5644020.5 ns |
5457291.5 ns |
1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
408519 ns |
409979.5 ns |
1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
500 ns |
541 ns |
0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
500 ns |
583 ns |
0.86 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
500 ns |
500 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
500 ns |
500 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA |
23457 ns |
23181 ns |
1.01 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI |
2129985 ns |
2216579 ns |
0.96 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal |
235625 ns |
332584 ns |
0.71 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU |
46801 ns |
47221 ns |
0.99 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2083 ns |
2166 ns |
0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2084 ns |
2167 ns |
0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
2166 ns |
2084 ns |
1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2084 ns |
2084 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA |
223097 ns |
215755 ns |
1.03 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI |
11589618 ns |
11357397.5 ns |
1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal |
2049375 ns |
1978417 ns |
1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
174291.5 ns |
172626.5 ns |
1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
8458 ns |
8937.5 ns |
0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
9125 ns |
9729.5 ns |
0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
9646 ns |
9459 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
8875 ns |
8958 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
103644.5 ns |
96639 ns |
1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI |
3192803 ns |
3207607 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal |
805042 ns |
876000 ns |
0.92 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
71721 ns |
71941 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
17834 ns |
18521 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
17896 ns |
19104.5 ns |
0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
18479 ns |
17625 ns |
1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
18291 ns |
18812.5 ns |
0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
583162.5 ns |
554001 ns |
1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
17664128 ns |
16517942.5 ns |
1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal |
5192125 ns |
5180916.5 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
380613.5 ns |
378539 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
542 ns |
458 ns |
1.18 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
459 ns |
625 ns |
0.73 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
709 ns |
666 ns |
1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
541 ns |
500 ns |
1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
36527.5 ns |
35213 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
1184419 ns |
1186873 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
295334 ns |
466396 ns |
0.63 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
45821 ns |
46270 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10812.5 ns |
9312.5 ns |
1.16 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
8708 ns |
9916.5 ns |
0.88 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10000 ns |
9167 ns |
1.09 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
8417 ns |
9458.5 ns |
0.89 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
258735 ns |
267136 ns |
0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
18465714.5 ns |
18948901 ns |
0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
5142083.5 ns |
4572250 ns |
1.12 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
369734 ns |
367694 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
397625 ns |
395333 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
287687.5 ns |
214416 ns |
1.34 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
287542 ns |
288292 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
755625 ns |
756291 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA |
112363 ns |
111882 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI |
334860 ns |
329474.5 ns |
1.02 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal |
364667 ns |
300208.5 ns |
1.21 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU |
76050 ns |
77331 ns |
0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
1459479 ns |
1453791.5 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
1133958.5 ns |
852583 ns |
1.33 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
1132729 ns |
1132645.5 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
2439375 ns |
2440625 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA |
210126 ns |
207032 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI |
8834558 ns |
10204120 ns |
0.87 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal |
1568229 ns |
1668041.5 ns |
0.94 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU |
320624 ns |
324428.5 ns |
0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
7187 ns |
7041.5 ns |
1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7375 ns |
7750 ns |
0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7625 ns |
9396 ns |
0.81 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6812.5 ns |
7791.5 ns |
0.87 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
141360.5 ns |
144806.5 ns |
0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI |
5654882.5 ns |
5813106.5 ns |
0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal |
450292 ns |
437250 ns |
1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
70081 ns |
66071 ns |
1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
16312.5 ns |
13083 ns |
1.25 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15042 ns |
14479 ns |
1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
15375 ns |
15709 ns |
0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14458.5 ns |
15354.5 ns |
0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
968205 ns |
956377 ns |
1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI |
42860911.5 ns |
42729213 ns |
1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal |
5831875 ns |
5700250 ns |
1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
430104 ns |
428955 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
26375 ns |
24000 ns |
1.10 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
26958 ns |
24875 ns |
1.08 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
28083 ns |
29292 ns |
0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
28750 ns |
27667 ns |
1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
204088 ns |
199144 ns |
1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7654854 ns |
7744284 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
634979 ns |
999584 ns |
0.64 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
114891.5 ns |
116931 ns |
0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
156083 ns |
103583 ns |
1.51 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
104375 ns |
152687 ns |
0.68 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
150750 ns |
153583 ns |
0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
118312.5 ns |
151000 ns |
0.78 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1088120 ns |
1075746 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
43090708 ns |
43042130 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
5963375 ns |
5733792 ns |
1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
593256 ns |
590946.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
80167 ns |
75000 ns |
1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
85167 ns |
77084 ns |
1.10 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
82750 ns |
86333.5 ns |
0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
75167 ns |
74875 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
211316.5 ns |
205585 ns |
1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7489316 ns |
8027595.5 ns |
0.93 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
539791.5 ns |
519187.5 ns |
1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
127081 ns |
127562 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
276562.5 ns |
293542 ns |
0.94 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
208458 ns |
308750 ns |
0.68 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
305791.5 ns |
315187.5 ns |
0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
268333.5 ns |
304208 ns |
0.88 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1151232 ns |
1108118 ns |
1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
41375439 ns |
40422383 ns |
1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6693729.5 ns |
6276458 ns |
1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
691937 ns |
695017 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16875 ns | 15875 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 17417 ns | 17521 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 17604 ns | 18500 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 17084 ns | 16958 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 153754.5 ns | 146489 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5655250 ns | 5586208 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 450709 ns | 723083.5 ns | 0.62 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 232832 ns | 232683 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 27958 ns | 26667 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26958 ns | 26687.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 28271 ns | 28208.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 28104 ns | 27708.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 979026 ns | 982068.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 41077814 ns | 40344043 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5908521 ns | 5743229 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 690436 ns | 686807.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11250 ns | 11083 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11292 ns | 12042 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12041 ns | 12334 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10625 ns | 10791 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 126931.5 ns | 124134 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3475371 ns | 3473152 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 808854 ns | 880000 ns | 0.92 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 238583 ns | 234213 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 23000 ns | 21958 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 22500 ns | 22729.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22667 ns | 21895.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 22000 ns | 22000 ns | 1 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 698360 ns | 701831.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 23080681 ns | 21157140 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5570042 ns | 5204750 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 673757 ns | 674667 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 62875.5 ns | 63437.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 63000 ns | 65521 ns | 0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 64042 ns | 66750 ns | 0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 63500 ns | 63042 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 110364 ns | 106345.5 ns | 1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3393751.5 ns | 3373870 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1301084 ns | 480667 ns | 2.71 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 233322 ns | 233433 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 445041.5 ns | 437896 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 437667 ns | 456000 ns | 0.96 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 479167 ns | 450542 ns | 1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 450375 ns | 444000 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 524383 ns | 515188 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 20565993 ns | 21597008 ns | 0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6117292 ns | 6095791.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 704751.5 ns | 717017.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7396 ns | 6792 ns | 1.09 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8187.5 ns | 8000 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7417 ns | 8583.5 ns | 0.86 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7229 ns | 6917 ns | 1.05 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 147886.5 ns | 146052.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5538159 ns | 5510181.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 449250 ns | 726500 ns | 0.62 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 64911 ns | 65301 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16084 ns | 14292 ns | 1.13 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15354 ns | 15292 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16354.5 ns | 14084 ns | 1.16 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14166.5 ns | 16209 ns | 0.87 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 954692 ns | 947670 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 38171523 ns | 39845105 ns | 0.96 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5788167 ns | 5499875 ns | 1.05 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 402704 ns | 399764 ns | 1.01 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6155208 ns | 6131500 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6371792 ns | 3224875 ns | 1.98 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6370958 ns | 6379229.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11912687 ns | 11911084 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 302156 ns | 349856 ns | 0.86 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 303693 ns | 303248 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19141583 ns | 19059708.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 19944479.5 ns | 11090437.5 ns | 1.80 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19924250 ns | 20005646 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36530333.5 ns | 36446770.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1017640 ns | 1081781.5 ns | 0.94 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1172952 ns | 1153782 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1000 ns | 958 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 917 ns | 1000 ns | 0.92 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1000 ns | 958 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 917 ns | 917 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23463 ns | 23071 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2180811.5 ns | 2085318 ns | 1.05 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 235959 ns | 332541.5 ns | 0.71 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 208162 ns | 207622 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3666 ns | 3667 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3709 ns | 3750 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 3708 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3708 ns | 3667 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 284285.5 ns | 281551.5 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 11253813 ns | 12095727 ns | 0.93 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2167729 ns | 2129583 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 625786 ns | 626307 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7541.5 ns | 8042 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9125 ns | 8145.5 ns | 1.12 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8729 ns | 9042 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8708 ns | 7937.5 ns | 1.10 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 123165 ns | 121104 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3667231 ns | 3679976 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 744666.5 ns | 802541.5 ns | 0.93 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 65270 ns | 65471 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12062 ns | 13125 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11667 ns | 12875 ns | 0.91 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11959 ns | 11417 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 12562.5 ns | 12708 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 651060 ns | 638151 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22798911 ns | 22685670 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5299417 ns | 4390333 ns | 1.21 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 358783 ns | 355644 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 333 ns | 0.75 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 291 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22961 ns | 22337 ns | 1.03 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2071449 ns | 2195388.5 ns | 0.94 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 228708 ns | 207833 ns | 1.10 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 46370 ns | 47401 ns | 0.98 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2917 ns | 3042 ns | 0.96 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3000 ns | 3375 ns | 0.89 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3042 ns | 2916 ns | 1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2875 ns | 3333 ns | 0.86 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 207109 ns | 204047 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9152954 ns | 14763707.5 ns | 0.62 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1607270.5 ns | 1611395.5 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 157106.5 ns | 157641.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11917 ns | 10250 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 12208 ns | 12167 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11584 ns | 12187.5 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11666.5 ns | 10604 ns | 1.10 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 124551.5 ns | 121713.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3309022 ns | 3281210 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 918375 ns | 904791.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 233552 ns | 233512.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20979.5 ns | 21104.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 22334 ns | 22583 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21542 ns | 21083 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 21604 ns | 21708 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 599594 ns | 595173 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20120761.5 ns | 20531194.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4710917 ns | 4095583 ns | 1.15 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 648136.5 ns | 638246.5 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4334 ns | 4417 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4375 ns | 4375 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4500 ns | 4375 ns | 1.03 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4417 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24357 ns | 24193.5 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2211901.5 ns | 2211530 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 228354 ns | 215041 ns | 1.06 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 47410 ns | 47690 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16583 ns | 16292 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16708 ns | 16291 ns | 1.03 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16416 ns | 16667 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16625 ns | 16416 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 332885.5 ns | 330020.5 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 12676739 ns | 12280627 ns | 1.03 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1102416.5 ns | 1639709 ns | 0.67 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 208862 ns | 206457.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 1917 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2000 ns | 2167 ns | 0.92 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2208 ns | 2084 ns | 1.06 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2083 ns | 2084 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 37203 ns | 35891 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1177427 ns | 1213015 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 427479.5 ns | 474917 ns | 0.90 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 203212 ns | 204052 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 19583 ns | 19687.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 19125 ns | 17187.5 ns | 1.11 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21125 ns | 17750 ns | 1.19 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 16500 ns | 16667 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 296769 ns | 293976.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20540351 ns | 21212198 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5401041 ns | 4767354.5 ns | 1.13 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 684347 ns | 686777 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59416.5 ns | 55771 ns | 1.07 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 65209 ns | 62792 ns | 1.04 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 62542 ns | 65604.5 ns | 0.95 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51291 ns | 51333 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66719 ns | 66418 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 116606.5 ns | 114241 ns | 1.02 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 184521 ns | 202896 ns | 0.91 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 159771 ns | 135104 ns | 1.18 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 155166.5 ns | 130083 ns | 1.19 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 297562.5 ns | 245666 ns | 1.21 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 220006.5 ns | 215296 ns | 1.02 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 606336 ns | 607861 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 83042 ns | 79709 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 83875 ns | 107104 ns | 0.78 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 84958 ns | 85167 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81500 ns | 124166.5 ns | 0.66 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190562.5 ns | 192861 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5798141 ns | 5531381 ns | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1997417 ns | 1816084 ns | 1.10 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 204577 ns | 203512 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1920709 ns | 1869895.5 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1913792 ns | 1901084 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1920000 ns | 1917666.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1920125 ns | 1889333 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 539927 ns | 531825 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 27537419.5 ns | 32650285 ns | 0.84 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8864020.5 ns | 8859584 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 920879 ns | 925670 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22072.5 ns | 21389 ns | 1.03 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2090569 ns | 2065883 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 338958 ns | 336229.5 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 40850 ns | 42770.5 ns | 0.96 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1833 ns | 1834 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1834 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 257371 ns | 253832 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 9709585 ns | 10417238 ns | 0.93 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1020646 ns | 1009479 ns | 1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 179951 ns | 184376.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9042 ns | 8000 ns | 1.13 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9667 ns | 10042 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9125 ns | 10375 ns | 0.88 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9750 ns | 8167 ns | 1.19 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 121818.5 ns | 119090.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3357352 ns | 3309191 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 891542 ns | 876708 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 232702 ns | 232622 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9708 ns | 9083 ns | 1.07 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10208 ns | 10625 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9541 ns | 9542 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8292 ns | 10125 ns | 0.82 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 534389 ns | 527209 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19775589 ns | 22247571 ns | 0.89 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4303291 ns | 3949187.5 ns | 1.09 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 622296 ns | 624237 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57875 ns | 56166 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46458 ns | 38916 ns | 1.19 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 45375 ns | 46125 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83000 ns | 83958 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 41762 ns | 40233 ns | 1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1338270 ns | 1343252 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1165604 ns | 1123667 ns | 1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76810.5 ns | 76266 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1930875 ns | 1923750 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1971625 ns | 1952750.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1976958.5 ns | 1982854 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1877438 ns | 1850708.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 226570.5 ns | 221906.5 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33427846 ns | 33376877 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11278417 ns | 11408021 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1183436 ns | 1191052 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 419562.5 ns | 416333 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 419375.5 ns | 421645.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 419041 ns | 421208.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 416209 ns | 417667 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 215261 ns | 208798 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7681114 ns | 7659621 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 541416.5 ns | 518208 ns | 1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 281483 ns | 282883 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 751479.5 ns | 747916.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 738396 ns | 671583 ns | 1.10 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 740208 ns | 673562.5 ns | 1.10 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 781437.5 ns | 748021 ns | 1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1064114.5 ns | 1048327.5 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 44853248 ns | 45569778.5 ns | 0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6467854 ns | 6335208.5 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 909209 ns | 914290 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3438520.5 ns | 3428937.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3375083 ns | 3384709 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3406958 ns | 3435000 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3397249.5 ns | 3417875 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 189879.5 ns | 175238.5 ns | 1.08 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8059161.5 ns | 8069034 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1406375 ns | 1424083 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 424534.5 ns | 426124 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6192958.5 ns | 6191270.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6200125 ns | 6170041 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6188979 ns | 6167416.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6196396.5 ns | 6190792 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1020322 ns | 994959 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 65545133.5 ns | 50094330 ns | 1.31 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7436125 ns | 7413750 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1702616.5 ns | 1549811 ns | 1.10 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 472500 ns | 470666 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 341709 ns | 252458 ns | 1.35 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 341208 ns | 342417 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 897875 ns | 901125 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47313.5 ns | 46139 ns | 1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 393377 ns | 884569 ns | 0.44 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 461042 ns | 368208 ns | 1.25 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 243782 ns | 243602 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2322666 ns | 2334750 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2037395.5 ns | 1752562 ns | 1.16 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2032542 ns | 2041187.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3278750 ns | 3280124.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 275549 ns | 255952 ns | 1.08 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 13477426 ns | 12850913 ns | 1.05 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2214708.5 ns | 2244770.5 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 765227 ns | 770018 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57291 ns | 55708 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46084 ns | 39041 ns | 1.18 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 44791 ns | 46020.5 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83042 ns | 84125 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 29341 ns | 28321 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1428289 ns | 1407008 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1147291 ns | 1106875 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 75001 ns | 76505.5 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2052187 ns | 2029708 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2068187.5 ns | 2082292 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2084042 ns | 2090958 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2003021 ns | 1949604 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 239105 ns | 232547 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 39375724 ns | 35887652 ns | 1.10 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11371687 ns | 11649979 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1036740 ns | 1052311 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57750 ns | 55833 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46792 ns | 39083.5 ns | 1.20 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 45334 ns | 46375 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82917 ns | 84042 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 51653.5 ns | 49287 ns | 1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 797393 ns | 790006.5 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1122209 ns | 1049084 ns | 1.07 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 70150.5 ns | 69820 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1927354 ns | 1919458 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1981874.5 ns | 1955416.5 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1968271 ns | 1946334 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1887083 ns | 1890750 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 245788 ns | 239685 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 17099740 ns | 17609091 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9616750 ns | 9788042 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1031735 ns | 918859 ns | 1.12 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 417 ns | 0.70 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 35895 ns | 34717 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1231936 ns | 1181143 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 295541 ns | 263500 ns | 1.12 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 45920 ns | 46211 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6125 ns | 6333 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6417 ns | 7500 ns | 0.86 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6666.5 ns | 6583 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6145.5 ns | 7000 ns | 0.88 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 217032.5 ns | 208392.5 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20544840 ns | 20162243 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4868375 ns | 4479667 ns | 1.09 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 372728.5 ns | 365124 ns | 1.02 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 291 ns | 0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 292 ns | 0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32714 ns | 32562 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1251803 ns | 1251080 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 258291.5 ns | 258000 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 36890 ns | 37000 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2666 ns | 2750 ns | 0.97 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3041 ns | 3625 ns | 0.84 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2834 ns | 2709 ns | 1.05 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2666 ns | 2917 ns | 0.91 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 192919.5 ns | 189309.5 ns | 1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7137848 ns | 7798739 ns | 0.92 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 975833.5 ns | 905666.5 ns | 1.08 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 150721 ns | 151136.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 420792 ns | 467667 ns | 0.90 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 421958 ns | 444750 ns | 0.95 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 424625 ns | 425999.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 455854 ns | 421833.5 ns | 1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 140632 ns | 137895 ns | 1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5978562 ns | 5774821 ns | 1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2094041 ns | 2386500 ns | 0.88 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 378363 ns | 367024 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3795708 ns | 3802521 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3744542 ns | 3765917 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3805146 ns | 3811417 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3800458 ns | 3799541.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 718592 ns | 709425 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33036728 ns | 33554230 ns | 0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10823333 ns | 10457896 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1465698.5 ns | 1471404 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49868000.5 ns | 49735229.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 35522791.5 ns | 25984959 ns | 1.37 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35434437.5 ns | 35560875 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96915395.5 ns | 96902041.5 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1603373 ns | 1616773 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1046780 ns | 1045271 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 154569979 ns | 153907333 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 112355625.5 ns | 89247291.5 ns | 1.26 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 111830209 ns | 112379750 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 294823520.5 ns | 294166500 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6476750.5 ns | 6515848 ns | 0.99 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5572630 ns | 5562255.5 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 18791.5 ns | 14521 ns | 1.29 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 17584 ns | 14958 ns | 1.18 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 15458 ns | 16833 ns | 0.92 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15000 ns | 14854.5 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 21075 ns | 20539.5 ns | 1.03 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI | 1101717 ns | 1114507 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 220479 ns | 206959 ns | 1.07 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 25790 ns | 26060 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 10937.5 ns | 10625 ns | 1.03 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 9167 ns | 7771 ns | 1.18 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 9250 ns | 9208 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17042 ns | 17437.5 ns | 0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 264109 ns | 260548 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI | 10241493.5 ns | 9528073.5 ns | 1.07 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1536583 ns | 1587125 ns | 0.97 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 147571 ns | 149326.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8500 ns | 7958 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7958 ns | 9292 ns | 0.86 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9833 ns | 9500 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 9979 ns | 7958.5 ns | 1.25 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 126526 ns | 116273.5 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3507407 ns | 3476228 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 778521 ns | 810375 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 233617 ns | 233683 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9458.5 ns | 9208.5 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9417 ns | 10645.5 ns | 0.88 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9875 ns | 10208 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9229.5 ns | 10375 ns | 0.89 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 628037.5 ns | 619508.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22754941 ns | 22906068.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 4815958 ns | 4432792 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 653036 ns | 654786 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10188 ns | 8291.5 ns | 1.23 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9750 ns | 10459 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9812.5 ns | 10042 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10396 ns | 9250 ns | 1.12 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 123440.5 ns | 120531 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3376982 ns | 3436472 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 912146 ns | 901792 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72011 ns | 71071 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13875 ns | 13250 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13146 ns | 16042 ns | 0.82 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13583 ns | 17208 ns | 0.79 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 14041 ns | 15167 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 597349 ns | 592138 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19218920 ns | 18951458.5 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4784916 ns | 4027062.5 ns | 1.19 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 344493 ns | 345753 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 458 ns | 459 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 541 ns | 0.92 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 36382 ns | 34521 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1239135 ns | 1191899 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 420458 ns | 371562.5 ns | 1.13 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 204071 ns | 206352 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7166 ns | 7062.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7125 ns | 8333.5 ns | 0.85 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7645.5 ns | 8583 ns | 0.89 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7292 ns | 8000 ns | 0.91 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 236575 ns | 233771 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 21174965 ns | 23357164 ns | 0.91 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5710375 ns | 4885833 ns | 1.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 655981 ns | 662116 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 15625 ns | 12292 ns | 1.27 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 15958 ns | 13229 ns | 1.21 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 13750 ns | 15125 ns | 0.91 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10166.5 ns | 10167 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 22442 ns | 22042 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI | 1141056 ns | 1119591.5 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 212166.5 ns | 189125 ns | 1.12 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 184201 ns | 189132 ns | 0.97 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 32250 ns | 31875 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 32083.5 ns | 32333.5 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32208 ns | 32291.5 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 32250 ns | 32000 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 277672.5 ns | 276327 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 10900465 ns | 12201192 ns | 0.89 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1658291.5 ns | 1697542 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 589205 ns | 595015.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 440750 ns | 480875 ns | 0.92 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 440624.5 ns | 441083 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 448208 ns | 450250 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 444042 ns | 490979 ns | 0.90 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193833.5 ns | 194024 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5864901 ns | 5766516 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2017041.5 ns | 2629708 ns | 0.77 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 367053 ns | 368063.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3831813 ns | 3822958 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3823292 ns | 3807354 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3830041 ns | 3827834 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3829437.5 ns | 3826167 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 552859 ns | 544349 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 28143374 ns | 29050298 ns | 0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9281334 ns | 9196542 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1354052 ns | 1359983 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 785276417 ns | 838219667 ns | 0.94 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 540295917 ns | 415052604.5 ns | 1.30 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 554667500 ns | 543102500 ns | 1.02 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1560118395.5 ns | 1525021500 ns | 1.02 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22754234.5 ns | 22764607.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14753214 ns | 14772276 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2530250667 ns | 3570164958 ns | 0.71 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1789641792 ns | 1502049709 ns | 1.19 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 2778601041 ns | 2269221042 ns | 1.22 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 5294222958 ns | 4773617583 ns | 1.11 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 346639040 ns | 369302709 ns | 0.94 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 88499181.5 ns | 87924411 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 77083.5 ns | 79646 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 78333 ns | 78895.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 79208.5 ns | 78667 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 78979.5 ns | 77583 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 213193 ns | 207237 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7637698 ns | 7871351 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 545083 ns | 520375 ns | 1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 106751 ns | 107601 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 197666.5 ns | 250834 ns | 0.79 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 191708 ns | 294583.5 ns | 0.65 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 244167 ns | 285708.5 ns | 0.85 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 266625 ns | 222333.5 ns | 1.20 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1057411 ns | 1049109.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42829342 ns | 43337417.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6225479 ns | 6122958 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 633805 ns | 640576 ns | 0.99 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199626249.5 ns | 199656458.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 138818666 ns | 103769666.5 ns | 1.34 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 138760500 ns | 139342042 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 388835292 ns | 388182208 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5838846 ns | 5838796 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3565003 ns | 3577840.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 619631416.5 ns | 616451521 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 439117667 ns | 351188291.5 ns | 1.25 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 438492541.5 ns | 439680896 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1178157416 ns | 1178137125 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26506796.5 ns | 26651952 ns | 0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 22062982 ns | 22092888 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7333 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6167 ns | 5292 ns | 1.17 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6125 ns | 6084 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10417 ns | 10167 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 29047 ns | 27714.5 ns | 1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1220374 ns | 1202781 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 427542 ns | 351458 ns | 1.22 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 46640 ns | 48481 ns | 0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212542 ns | 218291.5 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220375 ns | 222250 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221417 ns | 221209 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213354 ns | 213708.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 227983 ns | 222292 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 32340750 ns | 31765824 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9106020.5 ns | 9125125 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 526590 ns | 529665 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9020.5 ns | 7271 ns | 1.24 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8541 ns | 9541.5 ns | 0.90 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9000 ns | 9791 ns | 0.92 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9396 ns | 8187.5 ns | 1.15 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 121446.5 ns | 117715.5 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3283620.5 ns | 3188633 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 904979.5 ns | 885458 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69901 ns | 69700 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7833.5 ns | 7479 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 10479.5 ns | 0.72 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7854.5 ns | 10875 ns | 0.72 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7895.5 ns | 8875 ns | 0.89 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 528756.5 ns | 519786.5 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 19746247 ns | 18597573.5 ns | 1.06 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4591708.5 ns | 3961208 ns | 1.16 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 319453 ns | 316073 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 416 ns | 1.40 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 417 ns | 750 ns | 0.56 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 500 ns | 459 ns | 1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 500 ns | 1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 27227 ns | 26338 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 1166419.5 ns | 1200694 ns | 0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 456541.5 ns | 488604.5 ns | 0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 46370 ns | 46820 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9416.5 ns | 9291 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9020.5 ns | 10416 ns | 0.87 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9667 ns | 9208.5 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9562.5 ns | 11583 ns | 0.83 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 255444 ns | 253612 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 23081728 ns | 25803867.5 ns | 0.89 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5235833 ns | 5171833.5 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 391054 ns | 388624 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 107375 ns | 104834 ns | 1.02 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 98667 ns | 84834 ns | 1.16 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 99833 ns | 99500 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 146812 ns | 146333 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 25167.5 ns | 24613 ns | 1.02 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI | 1173177.5 ns | 1194962 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 263000 ns | 246062.5 ns | 1.07 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 189882 ns | 192062 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 477917 ns | 526854 ns | 0.91 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 478541 ns | 478875 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 515396 ns | 500416.5 ns | 1.03 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 528917 ns | 478958.5 ns | 1.10 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 235264 ns | 232619 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 11541234.5 ns | 11733131 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2156229 ns | 1709625 ns | 1.26 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 606146 ns | 610896 ns | 0.99 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5375 ns | 5125 ns | 1.05 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 6000 ns | 7167 ns | 0.84 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 5292 ns | 6791 ns | 0.78 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4749.5 ns | 4042 ns | 1.18 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17066 ns | 16580 ns | 1.03 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 79131 ns | 79701 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 12041 ns | 11708 ns | 1.03 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 10375 ns | 11584 ns | 0.90 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 11375 ns | 10792 ns | 1.05 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 16646 ns | 17687.5 ns | 0.94 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 217111.5 ns | 214143.5 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 366413 ns | 366964 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 39167 ns | 35792 ns | 1.09 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 51875 ns | 50791 ns | 1.02 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 49541 ns | 51833.5 ns | 0.96 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13625 ns | 13542 ns | 1.01 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 20639 ns | 21568 ns | 0.96 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 86411 ns | 87241 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 36958 ns | 38979.5 ns | 0.95 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 30916 ns | 30708 ns | 1.01 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 31749.5 ns | 30416 ns | 1.04 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57312.5 ns | 58458 ns | 0.98 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 196718 ns | 192010 ns | 1.02 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 415084 ns | 395119 ns | 1.05 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1979.5 ns | 1729.5 ns | 1.14 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1792 ns | 1875 ns | 0.96 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2167 ns | 2146 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1812.5 ns | 1709 ns | 1.06 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 21203.5 ns | 20594 ns | 1.03 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI | 1106310 ns | 1163029.5 ns | 0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 307459 ns | 326833 ns | 0.94 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 33890 ns | 33120 ns | 1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2125 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2291 ns | 2333 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2458 ns | 2250 ns | 1.09 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2250 ns | 2042 ns | 1.10 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 207088 ns | 204587 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI | 8807143 ns | 9292587 ns | 0.95 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1522270.5 ns | 1518500 ns | 1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 143331 ns | 136826.5 ns | 1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5270.5 ns | 4417 ns | 1.19 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5125 ns | 5250 ns | 0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5750 ns | 6375.5 ns | 0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5166.5 ns | 4041.5 ns | 1.28 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 146943.5 ns | 145077 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 5705076 ns | 5424296 ns | 1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 520708 ns | 725208 ns | 0.72 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 68485.5 ns | 69471 ns | 0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8333 ns | 8041 ns | 1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8250 ns | 8958 ns | 0.92 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8375 ns | 8416 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8083 ns | 9208 ns | 0.88 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 888729.5 ns | 875812.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 39615920.5 ns | 40742928.5 ns | 0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5705937.5 ns | 5580917 ns | 1.02 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 401149 ns | 389804 ns | 1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56875 ns | 56792 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57583 ns | 56875 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57792 ns | 57584 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58208 ns | 58375 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 38777 ns | 37054 ns | 1.05 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 2031645.5 ns | 1234596.5 ns | 1.65 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 355042 ns | 336000 ns | 1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 207012 ns | 203242 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 448625 ns | 485813 ns | 0.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 464333 ns | 499958.5 ns | 0.93 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 499625 ns | 468208 ns | 1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 434291 ns | 438854.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 273231 ns | 268055 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 29758946.5 ns | 27322975 ns | 1.09 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8044833 ns | 8122166.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 825417 ns | 832729 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
3330916.5 ns |
3291250 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
2338208 ns |
1764708 ns |
1.32 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
2311375 ns |
2339021 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
6316875 ns |
6260292 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
207709 ns |
204625 ns |
1.02 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU |
212792 ns |
209992 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
11447500 ns |
11332208 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
8336208 ns |
6550833 ns |
1.27 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
8225083 ns |
8325250 ns |
0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
21090292 ns |
20937125 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
739705 ns |
734916 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU |
1050190 ns |
1048155.5 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6375 ns |
4291 ns |
1.49 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
4750 ns |
5875 ns |
0.81 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
5833 ns |
6583 ns |
0.89 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
7083 ns |
4896 ns |
1.45 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
140910 ns |
137991.5 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
5627243 ns |
5581467 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
787812.5 ns |
785625 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
56271 ns |
56390 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7208 ns |
7042 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7375 ns |
10562.5 ns |
0.70 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7334 ns |
7104.5 ns |
1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7417 ns |
7833 ns |
0.95 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
762144 ns |
754679 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
35087672 ns |
34960226 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5326625 ns |
5245042 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
371874 ns |
371414 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
122750 ns |
127625 ns |
0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
122020.5 ns |
95624.5 ns |
1.28 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
98459 ns |
100000 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
140792 ns |
95708 ns |
1.47 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
153890 ns |
152137 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5926081 ns |
5871279.5 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2123000 ns |
2635166.5 ns |
0.81 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
204552 ns |
203242 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1988542 ns |
2017959 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2003812.5 ns |
2027771 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2024875 ns |
2021167 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2026833 ns |
1987167 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
716956 ns |
703925.5 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
32083849 ns |
31965494 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10716792 ns |
11055292 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1107075 ns |
1255893 ns |
0.88 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
32625 ns |
29375 ns |
1.11 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
35833 ns |
34500 ns |
1.04 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
33812.5 ns |
35250 ns |
0.96 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
520.5 ns |
583 ns |
0.89 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
15839 ns |
15622 ns |
1.01 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU |
78371 ns |
80130 ns |
0.98 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
2583 ns |
2542 ns |
1.02 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
2792 ns |
3125 ns |
0.89 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
2958 ns |
2834 ns |
1.04 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
2208 ns |
3000 ns |
0.74 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
140578 ns |
141408 ns |
0.99 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU |
339833 ns |
343344 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7292 ns |
7125 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5937.5 ns |
5375 ns |
1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6041 ns |
6000 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10209 ns |
10209 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
38269.5 ns |
36671 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1192024.5 ns |
1208337 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
367166 ns |
331459 ns |
1.11 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
47825.5 ns |
48221 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
215792 ns |
217479 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220500 ns |
229625 ns |
0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
247125 ns |
225000 ns |
1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
206042 ns |
212875 ns |
0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
249845.5 ns |
244929 ns |
1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26197878.5 ns |
26091309.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7952645.5 ns |
7984187.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
574095 ns |
574266 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3958 ns |
3959 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3958 ns |
3917 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
4000 ns |
3917 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3959 ns |
3917 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22218 ns |
21419 ns |
1.04 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI |
2104977 ns |
2118188.5 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal |
247146 ns |
234583 ns |
1.05 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU |
41980 ns |
42620 ns |
0.98 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14958 ns |
14791 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14917 ns |
14750 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14917 ns |
14875 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15000 ns |
14833 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
314700 ns |
311492 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI |
10856377 ns |
10906139 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal |
997166 ns |
982000 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
196832 ns |
192231.5 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
102708 ns |
140834 ns |
0.73 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
141708 ns |
127417 ns |
1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
104749.5 ns |
105167 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
99291 ns |
141000 ns |
0.70 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
140349 ns |
152595 ns |
0.92 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5866105 ns |
6050834 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2092834 ns |
2057334 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
205677 ns |
213297 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1930458 ns |
1917833 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1878145.5 ns |
1898875 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1920854 ns |
1922083 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1922917 ns |
1898854 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
704122 ns |
692137 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
32247238 ns |
31139112 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10354916.5 ns |
10436541 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1062040 ns |
1217872 ns |
0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
18416 ns |
18250 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
17666.5 ns |
18625 ns |
0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
19500 ns |
20750 ns |
0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
20792 ns |
17749.5 ns |
1.17 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
113947.5 ns |
110137 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3467976 ns |
3282416 ns |
1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1373875 ns |
480541.5 ns |
2.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
80120.5 ns |
79421 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
222604.5 ns |
252041.5 ns |
0.88 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
221583 ns |
217541.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
223084 ns |
219687.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
217250 ns |
222729.5 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
526883 ns |
519298 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
20789992 ns |
20051825.5 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6207729 ns |
6194812.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
476799.5 ns |
478425 ns |
1.00 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
23417 ns |
23291.5 ns |
1.01 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
31750 ns |
28583 ns |
1.11 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
27479.5 ns |
28792 ns |
0.95 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
1584 ns |
1229.5 ns |
1.29 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
16598 ns |
16210 ns |
1.02 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU |
81101 ns |
82241 ns |
0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
4917 ns |
4292 ns |
1.15 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
4833 ns |
4729 ns |
1.02 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
5292 ns |
5042 ns |
1.05 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
5167 ns |
5771 ns |
0.90 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
210207 ns |
207444.5 ns |
1.01 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU |
389554 ns |
378084 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
306875 ns |
305417 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
306291 ns |
306250 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
306645.5 ns |
308084 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
305542 ns |
305750 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
242939.5 ns |
228609 ns |
1.06 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7807299 ns |
7545946 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
929959 ns |
604584 ns |
1.54 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
272783 ns |
273963 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
542084 ns |
532917 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
573291 ns |
538167 ns |
1.07 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
541250 ns |
539125 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
591834 ns |
572709 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1118094 ns |
1074383 ns |
1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
43107357.5 ns |
44755027.5 ns |
0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6266417 ns |
6115208.5 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
856818 ns |
858603.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
20083 ns |
19291 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19354.5 ns |
20708 ns |
0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
20958 ns |
22375.5 ns |
0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
20041 ns |
19875 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
119252.5 ns |
114907 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3858542 ns |
3614583 ns |
1.07 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1468541.5 ns |
593916 ns |
2.47 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
79501 ns |
79421 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213708 ns |
215708 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
212250 ns |
220584 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
219792 ns |
213625 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
215917 ns |
215875 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
773895 ns |
762395 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
23689258.5 ns |
25444001 ns |
0.93 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7400708 ns |
7232562.5 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
536315 ns |
542290.5 ns |
0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6708 ns |
6125 ns |
1.10 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6541.5 ns |
7083 ns |
0.92 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
6916.5 ns |
7917 ns |
0.87 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
6417 ns |
6208 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
143905 ns |
140165.5 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI |
5650984 ns |
5168559 ns |
1.09 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
802042 ns |
799291 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
65691 ns |
65270 ns |
1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10875 ns |
9542 ns |
1.14 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10437.5 ns |
10333.5 ns |
1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10854.5 ns |
10375 ns |
1.05 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10917 ns |
11145.5 ns |
0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
834374 ns |
826456 ns |
1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI |
37471811 ns |
37337383 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
5454042 ns |
5311708 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
388739 ns |
387474 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5167 ns |
4875 ns |
1.06 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5562.5 ns |
6917 ns |
0.80 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6271 ns |
7250 ns |
0.86 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
6542 ns |
4812.5 ns |
1.36 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
148014 ns |
144262 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
5536748 ns |
5426091.5 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
810000 ns |
808375 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
69510 ns |
66621 ns |
1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7334 ns |
7458 ns |
0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7708 ns |
8083 ns |
0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7667 ns |
7541.5 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7625 ns |
7833 ns |
0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
793246.5 ns |
783702 ns |
1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
38329662 ns |
37497088 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
5698792 ns |
5566229 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
391653 ns |
395004 ns |
0.99 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
14496146 ns |
14350584 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
10152125 ns |
7693688 ns |
1.32 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
10003874.5 ns |
10127042 ns |
0.99 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
27734000.5 ns |
27615959 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
530595 ns |
548306 ns |
0.97 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU |
398084 ns |
393134 ns |
1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
46337667 ns |
45943208 ns |
1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
33437833.5 ns |
26437417 ns |
1.26 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
33081375 ns |
33454833 ns |
0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
85226375 ns |
84782667 ns |
1.01 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2816535 ns |
2657066 ns |
1.06 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU |
3305850.5 ns |
3290613 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
65667 ns |
66375 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
66083 ns |
68584 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
68875 ns |
69333.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
66042 ns |
65979 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
110394 ns |
121920.5 ns |
0.91 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3591406.5 ns |
3593431.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1454916 ns |
508166 ns |
2.86 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
238347.5 ns |
229397.5 ns |
1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
441166 ns |
446833 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
441125 ns |
452437.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
448167 ns |
446375 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
453333 ns |
445834 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
746940 ns |
728139 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
26902056 ns |
26912797 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7664166 ns |
7552104 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
783197 ns |
790108 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
584 ns |
500 ns |
1.17 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
500 ns |
666 ns |
0.75 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
666 ns |
500 ns |
1.33 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
583 ns |
667 ns |
0.87 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
33544 ns |
32311 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
1171151 ns |
1198752.5 ns |
0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
290625 ns |
473500 ns |
0.61 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
47120.5 ns |
47340 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9334 ns |
8666 ns |
1.08 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
8729.5 ns |
9208 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9708.5 ns |
8458 ns |
1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9500 ns |
17104 ns |
0.56 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
291398 ns |
286358 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
21969530 ns |
20778583 ns |
1.06 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
5359292 ns |
4681395.5 ns |
1.14 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
378433 ns |
375004 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
9833 ns |
9875 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
9792 ns |
9875 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
9875 ns |
9792 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
9875 ns |
9833 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
23300.5 ns |
23012 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2128073 ns |
2014844 ns |
1.06 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal |
226208 ns |
215645.5 ns |
1.05 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
205132 ns |
205762 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
46417 ns |
45958 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
46042 ns |
46042 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
46084 ns |
46041 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
46083 ns |
46250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
292954 ns |
290878 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
11189770.5 ns |
9152947 ns |
1.22 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal |
969625 ns |
942542 ns |
1.03 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
599185 ns |
607695 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56375 ns |
56250 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57083 ns |
56458 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57166 ns |
57083 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
57917 ns |
57709 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
30029 ns |
28552 ns |
1.05 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1238260 ns |
1253508.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
532417 ns |
663666.5 ns |
0.80 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
202612 ns |
203541.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
449646 ns |
448583 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
471541.5 ns |
465562 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
471750 ns |
465458.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
483645.5 ns |
454041.5 ns |
1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
252167 ns |
245887 ns |
1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
31302385.5 ns |
33424426 ns |
0.94 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9573625 ns |
9545520.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
886119 ns |
887779 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
635979.5 ns |
645812.5 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
642375 ns |
575959 ns |
1.12 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
637500 ns |
640542 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
647729 ns |
646271 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
212654.5 ns |
208584 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8043371 ns |
8406939 ns |
0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1391104 ns |
1406395.5 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
304122 ns |
315503 ns |
0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2229500 ns |
2214979 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2220500 ns |
2211999.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2222167 ns |
2220812.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2244750 ns |
2227958 ns |
1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
994128.5 ns |
978439 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
48896347 ns |
47363900 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
7263875 ns |
10481646 ns |
0.69 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1355913 ns |
1213952 ns |
1.12 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
20083 ns |
18625 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
21709 ns |
20729 ns |
1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
20792 ns |
21583 ns |
0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
20375 ns |
18875 ns |
1.08 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
117053.5 ns |
113850.5 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
3544768.5 ns |
3565557.5 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1358000 ns |
497958 ns |
2.73 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
78741 ns |
79731 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221459 ns |
227375 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
263854 ns |
259417 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
227875 ns |
225541 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
224917 ns |
227084 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
737761.5 ns |
729838 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27674702 ns |
26163617 ns |
1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7887646 ns |
7560500 ns |
1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
545460 ns |
554315 ns |
0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 584 ns | 0.86 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 541 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23883 ns | 23274 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1185582 ns | 1191789 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 441250 ns | 484250 ns | 0.91 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47751 ns | 48040 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9500 ns | 9083 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9500 ns | 10437.5 ns | 0.91 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10271 ns | 9541 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9292 ns | 9500 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 270173 ns | 268183 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24317049 ns | 24685731.5 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6185229 ns | 5000875 ns | 1.24 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 400384 ns | 398234 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 9167 ns | 7250 ns | 1.26 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8709 ns | 9187.5 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9083 ns | 9645.5 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9500 ns | 8041 ns | 1.18 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 122465 ns | 118921.5 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3361391 ns | 3382327 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 904145.5 ns | 886791.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 69921 ns | 71801 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7625 ns | 7604 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7459 ns | 8125 ns | 0.92 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7937.5 ns | 7500 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7749.5 ns | 7562.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 511679 ns | 507494 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17778250 ns | 17189656.5 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4293062.5 ns | 3782375 ns | 1.14 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 321303 ns | 320313 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1500 ns | 1500 ns | 1 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1584 ns | 1708.5 ns | 0.93 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2041 ns | 1791 ns | 1.14 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1458 ns | 1375 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21786 ns | 21598 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1144522 ns | 1189888 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 304958 ns | 313375 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 188582 ns | 190932 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3458 ns | 3541 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3375 ns | 3583 ns | 0.94 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3542 ns | 3458 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3416 ns | 3292 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 224607.5 ns | 218452 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 11476067.5 ns | 9603283 ns | 1.20 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1662542 ns | 1797375 ns | 0.92 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 578026 ns | 583116 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 148020.5 ns | 148104.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 127750 ns | 106833 ns | 1.20 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 128333 ns | 128562.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 226084 ns | 225000 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24758 ns | 23975 ns | 1.03 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1058093 ns | 1165725 ns | 0.91 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 275646 ns | 254292 ns | 1.08 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 39911 ns | 41470 ns | 0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 143937 ns | 157645.5 ns | 0.91 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 111000 ns | 87625 ns | 1.27 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 125875 ns | 112000 ns | 1.12 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 250750 ns | 250708.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 222474 ns | 218220.5 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10717171.5 ns | 10460438 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2035208.5 ns | 1096666 ns | 1.86 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 265987 ns | 269773 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7167 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6083 ns | 5333 ns | 1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 6000 ns | 1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10458 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34367 ns | 32755 ns | 1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1202902 ns | 1178842 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 589750 ns | 330458 ns | 1.78 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50671 ns | 50720 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220833 ns | 253104 ns | 0.87 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 233708.5 ns | 229041.5 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 263729.5 ns | 234187.5 ns | 1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 228334 ns | 227938 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 271007 ns | 263186.5 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27775567 ns | 27448206 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8334917 ns | 8237750 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 597416 ns | 594190.5 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15375 ns | 13792 ns | 1.11 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15708 ns | 15166 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 15437.5 ns | 16499.5 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15000 ns | 14667 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 142352 ns | 139540 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5444705.5 ns | 5436668.5 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 812500 ns | 786729 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 230862 ns | 232963 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23375 ns | 23000 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24187.5 ns | 23937.5 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23917 ns | 23875 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23625 ns | 23979.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 876776.5 ns | 870094.5 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39659188.5 ns | 40010466.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5781625 ns | 5595708 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 678837 ns | 679366 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9958 ns | 8750 ns | 1.14 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9208 ns | 10312.5 ns | 0.89 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10041.5 ns | 11271 ns | 0.89 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9791 ns | 9584 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 126368.5 ns | 123388.5 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3435538 ns | 3563169 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 522645.5 ns | 858292 ns | 0.61 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 73501 ns | 74460 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14167 ns | 13375 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13312.5 ns | 14458.5 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14667 ns | 13958 ns | 1.05 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14687.5 ns | 13625 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 675513 ns | 667308 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 20841620 ns | 21257602 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5175667 ns | 4997708 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 367658 ns | 365743 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10250 ns | 8583 ns | 1.19 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9417 ns | 10333 ns | 0.91 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9916 ns | 10312.5 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9167 ns | 9166 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 125489.5 ns | 121770.5 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3383910 ns | 3365145.5 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 931334 ns | 906625 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 73070 ns | 75170 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12708 ns | 12292 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12395.5 ns | 13437.5 ns | 0.92 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13395.5 ns | 12916 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13000 ns | 12458 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 562655.5 ns | 553718.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 19374906 ns | 18868109 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4456125 ns | 3865125.5 ns | 1.15 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 348694 ns | 341293 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 29271 ns | 26354.5 ns | 1.11 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34646 ns | 30645.5 ns | 1.13 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 30334 ns | 31541 ns | 0.96 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2000 ns | 1833 ns | 1.09 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16548 ns | 16183 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 86501 ns | 81001 ns | 1.07 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5333 ns | 5209 ns | 1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 4916 ns | 5021 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5291.5 ns | 5417 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6375 ns | 6604 ns | 0.97 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 142100 ns | 140577.5 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 386404 ns | 370423.5 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 291 ns | 1.29 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 26630 ns | 25697 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1201023 ns | 1197018 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 440458 ns | 465667 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48650.5 ns | 47180 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6417 ns | 6125 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6166 ns | 6729 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6958 ns | 6333 ns | 1.10 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6584 ns | 6312.5 ns | 1.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 191484 ns | 187721.5 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22194887 ns | 23736279.5 ns | 0.94 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5924646 ns | 4952833.5 ns | 1.20 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 394354 ns | 386429 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 1959 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 1917 ns | 2042 ns | 0.94 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2084 ns | 2000 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2042 ns | 1959 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 27533 ns | 26463 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1279119 ns | 1170027.5 ns | 1.09 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 316792 ns | 479625 ns | 0.66 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 205082 ns | 206252 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16917 ns | 16250 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16625 ns | 16666 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16750 ns | 16208.5 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16167 ns | 16417 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 277550 ns | 276067 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24740731.5 ns | 24921263 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5558667 ns | 5326083 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 701702 ns | 700836 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 152271 ns | 173875 ns | 0.88 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 152312.5 ns | 148750 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 156229.5 ns | 155708 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 147978.5 ns | 147458 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 217398 ns | 203847 ns | 1.07 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8109553 ns | 8347024.5 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1510125 ns | 1561917 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 222002 ns | 232482 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1325333 ns | 1328917 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1318416 ns | 1311771 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1327063 ns | 1320791 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1336000 ns | 1322500 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 922007.5 ns | 909940.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48499912 ns | 44667022 ns | 1.09 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6407833.5 ns | 7124333 ns | 0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1114440 ns | 995559.5 ns | 1.12 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24792 ns | 22958 ns | 1.08 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 27167 ns | 26833 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25479 ns | 27625 ns | 0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25375 ns | 24667 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 247798.5 ns | 234608.5 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7567395.5 ns | 7924652 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 917625 ns | 576541 ns | 1.59 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 113721 ns | 116011 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 119208.5 ns | 118166.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 152104 ns | 122375 ns | 1.24 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 128875 ns | 158041.5 ns | 0.82 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 180624.5 ns | 123833.5 ns | 1.46 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1031705 ns | 1073695 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43993019 ns | 44153968 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6305625 ns | 6127166 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 613041 ns | 612925 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 250 ns | 375 ns | 0.67 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 291 ns | 1.29 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23487 ns | 23160 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1213583.5 ns | 1212472 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 448229 ns | 478542 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 47080 ns | 47471 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6750 ns | 6291 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6833.5 ns | 0.93 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6833 ns | 6458 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6792 ns | 6584 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 207982 ns | 204382.5 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 26359574 ns | 24496787 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5739500 ns | 5334937.5 ns | 1.08 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 396654 ns | 388703 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6500 ns | 5208 ns | 1.25 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5834 ns | 7021 ns | 0.83 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6833 ns | 7458 ns | 0.92 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5958 ns | 5667 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 148693 ns | 145933.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5749490 ns | 5745568 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 456771 ns | 753959 ns | 0.61 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 231332 ns | 234802 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9937.5 ns | 9583 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10083 ns | 10375 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10250 ns | 10125 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9875 ns | 10042 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 915053 ns | 903827 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 40523619 ns | 42297357 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5971833 ns | 5826479 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 670296 ns | 668457 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 667 ns | 667 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 709 ns | 0.88 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 667 ns | 625 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 708 ns | 625 ns | 1.13 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22978 ns | 22371 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2038038 ns | 2015786 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 224875 ns | 208416 ns | 1.08 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 205752 ns | 207552 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4625 ns | 4584 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4584 ns | 4833 ns | 0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4667 ns | 4666 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4667 ns | 4584 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 231780.5 ns | 228749 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9922230 ns | 10461831 ns | 0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1617896 ns | 1654416.5 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 577656 ns | 580735 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8750 ns | 7750 ns | 1.13 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8208 ns | 9166.5 ns | 0.90 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8666.5 ns | 8834 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 9042 ns | 8291 ns | 1.09 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 126004.5 ns | 121959 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3566791 ns | 3411255 ns | 1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 780521 ns | 827916 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 73741 ns | 74011 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 8625 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8520.5 ns | 9041.5 ns | 0.94 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8916.5 ns | 8583.5 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8833 ns | 8375 ns | 1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 600610 ns | 591884.5 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 21487620 ns | 20708574.5 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4955625 ns | 4264875 ns | 1.16 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 344374 ns | 342784 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 127354 ns | 122750 ns | 1.04 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129958 ns | 96459 ns | 1.35 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 126833.5 ns | 130187.5 ns | 0.97 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183417 ns | 180875 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46282.5 ns | 45830 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 100990 ns | 101721 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 329833 ns | 328000 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 313667 ns | 166666 ns | 1.88 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 341042 ns | 347541.5 ns | 0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 610771 ns | 608646 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 195542 ns | 192063 ns | 1.02 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 504960 ns | 505519.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397500 ns | 395916 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287958 ns | 214250 ns | 1.34 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287459 ns | 288167 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756291 ns | 756500 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44714 ns | 43676.5 ns | 1.02 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1410517 ns | 1411321 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 410687.5 ns | 429792 ns | 0.96 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 80101 ns | 82131 ns | 0.98 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1460042 ns | 1458834 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1135937.5 ns | 857583 ns | 1.32 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1133124.5 ns | 1134333 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2442875 ns | 2441958.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 263533.5 ns | 249859 ns | 1.05 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 11351221 ns | 10370982 ns | 1.09 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1766000 ns | 1909646 ns | 0.92 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 352028 ns | 352903 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 647458 ns | 616500 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 630625 ns | 598250 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 641459 ns | 648916.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 661042 ns | 642667 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 208592 ns | 200586.5 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8373757 ns | 7794534 ns | 1.07 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1383542 ns | 1363291 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 310873 ns | 313733 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2458709 ns | 2445375 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2452000 ns | 2426917 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2439917 ns | 2441500 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2458188 ns | 2440750 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1014801 ns | 994961 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 51359759 ns | 50766350 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7516625 ns | 9661291 ns | 0.78 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1314993 ns | 1307388 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 32375.5 ns | 28521 ns | 1.14 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 35687.5 ns | 34625 ns | 1.03 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 32625 ns | 33916.5 ns | 0.96 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 792 ns | 875 ns | 0.91 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15915 ns | 15425.5 ns | 1.03 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 79021 ns | 79381 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3125 ns | 3062.5 ns | 1.02 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3167 ns | 3416 ns | 0.93 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3500 ns | 3208 ns | 1.09 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3208 ns | 3209 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 141537 ns | 139741 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 337773 ns | 338953 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 406750 ns | 404500 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 408208 ns | 402125 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 407000 ns | 408334 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 420041 ns | 422458 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 44689 ns | 43145 ns | 1.04 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1425430 ns | 1417291 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1163333 ns | 1128750.5 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 239212 ns | 239562 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3871979.5 ns | 3863292 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3986250 ns | 3971625 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3984708 ns | 3996791 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3777084 ns | 3757979.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 249522 ns | 242826 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 37376599 ns | 38623864 ns | 0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11951583 ns | 11673750 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1430284 ns | 1433229 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3917 ns |
3959 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3916 ns |
3917 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
4000 ns |
3916 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3958 ns |
3917 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
34785 ns |
33968 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI |
1242586 ns |
1232483 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal |
232208 ns |
167334 ns |
1.39 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU |
38191 ns |
38620 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15792 ns |
15666 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15750 ns |
15750 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
16042 ns |
15625 ns |
1.03 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15833 ns |
15625 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
258998 ns |
255128 ns |
1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI |
9462058 ns |
8717525 ns |
1.09 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal |
879875 ns |
843520.5 ns |
1.04 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU |
171032 ns |
169816.5 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
404875 ns |
402625 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
295750 ns |
220209 ns |
1.34 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
295208 ns |
295959 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
760417 ns |
760791.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
113901 ns |
113239 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI |
1031250.5 ns |
1047524 ns |
0.98 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal |
400833 ns |
348895.5 ns |
1.15 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU |
87691 ns |
89300.5 ns |
0.98 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1487250 ns |
1474958.5 ns |
1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1160000 ns |
881146 ns |
1.32 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1154146 ns |
1159083.5 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2466542 ns |
2461917 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
257263 ns |
241292 ns |
1.07 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI |
12586074 ns |
9318727.5 ns |
1.35 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal |
1873583 ns |
1946459 ns |
0.96 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
350633 ns |
354883 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
542 ns |
500 ns |
1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
500 ns |
542 ns |
0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
584 ns |
500 ns |
1.17 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
583 ns |
500 ns |
1.17 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
26361 ns |
25844 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
1271336 ns |
1200537.5 ns |
1.06 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
440542 ns |
496709 ns |
0.89 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
205822 ns |
209382 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7584 ns |
7375 ns |
1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7292 ns |
8104.5 ns |
0.90 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
7833 ns |
7500 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7917 ns |
7375 ns |
1.07 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
214765.5 ns |
217033.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
25503667 ns |
25754399 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
5647042 ns |
5254333.5 ns |
1.07 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
690301.5 ns |
685977 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
836041 ns |
825125.5 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
614834 ns |
468584 ns |
1.31 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
614083 ns |
621500 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
1539125 ns |
1536542 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
129881 ns |
130845.5 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU |
180786.5 ns |
229862 ns |
0.79 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
2696417 ns |
2661979 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1999750 ns |
1535250.5 ns |
1.30 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1981917 ns |
2000792 ns |
0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
4944584 ns |
4906416 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
240669 ns |
242304 ns |
0.99 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU |
764397 ns |
841449 ns |
0.91 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
292 ns |
1.28 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
292 ns |
375 ns |
0.78 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
250 ns |
1.50 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
334 ns |
291 ns |
1.15 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
31975 ns |
32216 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1271692 ns |
1218492 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
280292 ns |
464375 ns |
0.60 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
47100 ns |
47630 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6417 ns |
6125 ns |
1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6083.5 ns |
6708 ns |
0.91 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6500 ns |
6500 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6542 ns |
6375 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
223021.5 ns |
224154.5 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
20811368 ns |
21407773 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
5080916 ns |
4615291 ns |
1.10 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
362464 ns |
357793.5 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2380833 ns |
2392708 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2381375 ns |
2371959 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2382833 ns |
2404416 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2427417 ns |
2370084 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
202015 ns |
200035.5 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
8009212 ns |
7868335 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1480334 ns |
1597041.5 ns |
0.93 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
376468.5 ns |
373933 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4661770.5 ns |
4648292 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4647875 ns |
4644250 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4654042 ns |
4636708 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4641666.5 ns |
4642750 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
896525 ns |
891890 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
47660997 ns |
46027858 ns |
1.04 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
6349937 ns |
6938541.5 ns |
0.92 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1389723 ns |
1391633 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
7042 ns |
7187.5 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
7417 ns |
7542 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7334 ns |
7125 ns |
1.03 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
7000 ns |
6875 ns |
1.02 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
23533 ns |
23289 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI |
1183451.5 ns |
1167669 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal |
267979.5 ns |
243458.5 ns |
1.10 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU |
39501 ns |
39800 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
49271 ns |
46396.5 ns |
1.06 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
70500 ns |
32917 ns |
2.14 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
33667 ns |
45875.5 ns |
0.73 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
46563 ns |
67312 ns |
0.69 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
216537 ns |
214725 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI |
10983818 ns |
10485830 ns |
1.05 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal |
2080125 ns |
1121562 ns |
1.85 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
266342 ns |
269102.5 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
20458 ns |
19604.5 ns |
1.04 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
24875 ns |
24021 ns |
1.04 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
23334 ns |
23750 ns |
0.98 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
5417 ns |
5084 ns |
1.07 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
17721 ns |
17227 ns |
1.03 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU |
83171 ns |
83741 ns |
0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
12229.5 ns |
11916 ns |
1.03 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
10167 ns |
9354.5 ns |
1.09 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
10709 ns |
10417 ns |
1.03 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
18084 ns |
17958 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
227773 ns |
225890 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU |
370583 ns |
371753 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
405959 ns |
404000 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
296833 ns |
222584 ns |
1.33 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
296250 ns |
296875 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
762833 ns |
762667 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46165 ns |
46288 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI |
1416604 ns |
1401617.5 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal |
481917 ns |
358375 ns |
1.34 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU |
88501 ns |
89491 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
1477375 ns |
1480896 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
1167458.5 ns |
888250 ns |
1.31 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
1163208 ns |
1164959 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
2469709 ns |
2465417 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
290042 ns |
288016 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI |
11583417 ns |
12678894 ns |
0.91 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal |
2071041 ns |
2117375 ns |
0.98 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU |
375873 ns |
381744 ns |
0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
433750 ns |
432125 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
436959 ns |
430333 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
434875 ns |
436917 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
447417 ns |
448604.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
54489 ns |
54122.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
999697 ns |
1002212 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1138729 ns |
1059021 ns |
1.08 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
232887.5 ns |
234952 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3882375.5 ns |
3895042 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4013333 ns |
4004458 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4016667 ns |
4030291.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3803708 ns |
3789979 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
264090.5 ns |
260055 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31304151.5 ns |
30675954 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10435291.5 ns |
10349458.5 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1354298 ns |
1223712 ns |
1.11 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
8750 ns |
8750 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
7625 ns |
6917 ns |
1.10 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
7667 ns |
7583 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
12458 ns |
12416 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
23665 ns |
23553.5 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI |
2164697 ns |
2134096 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal |
229042 ns |
214667 ns |
1.07 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
208572 ns |
211142 ns |
0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
45458 ns |
44958 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
45750 ns |
45083 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
45250 ns |
45000 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
45250 ns |
44958 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
339376.5 ns |
344550 ns |
0.98 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
12744869 ns |
14001329.5 ns |
0.91 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal |
1713646 ns |
1862458 ns |
0.92 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
655776.5 ns |
659011.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
88250.5 ns |
122729 ns |
0.72 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
85167 ns |
83521 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
125250 ns |
87354.5 ns |
1.43 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
105979 ns |
105375 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
189286.5 ns |
190055 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5807215 ns |
5969481 ns |
0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2011458 ns |
1972791.5 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
224843 ns |
214447 ns |
1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2027291.5 ns |
2012458.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2020042 ns |
1980000 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2018750.5 ns |
2023917 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2024770.5 ns |
2011645.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
534082.5 ns |
529776 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
30062921 ns |
29142428 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9367958 ns |
9305500.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
946039 ns |
1088680 ns |
0.87 |
This comment was automatically generated by workflow using github-action-benchmark.
avik-pal force-pushed the ap/onednn branch 2 times, most recently from d883497 to b5511e7 on September 15, 2024 23:50
src/onednn/types.jl
Outdated
@@ -0,0 +1,19 @@
@wrap_type MemoryPtr dnnl_memory_t dnnl_memory_destroy

function MemoryPtrNoFinalizer(A::AbstractArray, desc = memory_descriptor(A))
[JuliaFormatter] reported by reviewdog 🐶
Suggested change
- function MemoryPtrNoFinalizer(A::AbstractArray, desc = memory_descriptor(A))
+ function MemoryPtrNoFinalizer(A::AbstractArray, desc=memory_descriptor(A))
src/onednn/types.jl
Outdated
@wrap_type Engine dnnl_engine_t dnnl_engine_destroy

function EngineNoFinalizer(kind = Lib.dnnl_cpu, index = 0)
[JuliaFormatter] reported by reviewdog 🐶
Suggested change
- function EngineNoFinalizer(kind = Lib.dnnl_cpu, index = 0)
+ function EngineNoFinalizer(kind=Lib.dnnl_cpu, index=0)
Code is mostly based on https://github.com/hildebrandmw/OneDNN.jl