This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
feat: instancenorm with running statistics #152
Merged
avik-pal force-pushed the ap/in_stat_track branch 4 times, most recently from 386c753 to a2993bd (September 4, 2024 23:11)
LuxLib Benchmarks
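The Ratio column in the table below appears to be the Current time divided by the Previous time, rounded to two decimals, so values above 1 mean the current commit is slower. The following is a minimal sketch under that assumption; the helper names `ratio` and `flag_regressions` and the 1.5 threshold are hypothetical and not part of the benchmark tooling. The sample rows are copied from the table, and the oneAPI row has no previous measurement.

```python
# Sketch: recompute the Ratio column and flag large slowdowns.
# Assumption: Ratio = Current / Previous, rounded to two decimals.
# The helper names and the 1.5 threshold are hypothetical.

def ratio(current_ns, previous_ns):
    """Return current / previous rounded to two decimals."""
    return round(current_ns / previous_ns, 2)


def flag_regressions(rows, threshold=1.5):
    """Return names of rows whose current/previous ratio meets the threshold.

    Rows without a previous measurement (None) are skipped.
    """
    return [
        name
        for name, cur, prev in rows
        if prev is not None and cur / prev >= threshold
    ]


# (benchmark name, current ns, previous ns) taken from the table.
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 5917.0, 5750.0),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)", 2959.0, 1687.5),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)", 3000.0, 1250.0),
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI", 2638983.0, None),
]

for name, cur, prev in rows:
    if prev is not None:
        print(name, ratio(cur, prev))

print(flag_regressions(rows))
```

With these sample rows, only the two `bias_activation(32, act=relu)` CPU entries cross the hypothetical threshold. Sub-microsecond to few-microsecond CPU timings like these are noisy, so a flagged row is a prompt to re-run the benchmark, not proof of a regression.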
Benchmark suite | Current: a2993bd | Previous: 9d522c5 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5917 ns | 5750 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6375 ns | 6187.5 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6875 ns | 7979 ns | 0.86 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6354.5 ns | 6958.5 ns | 0.91 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 119595 ns | 119461 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 2638983 ns | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 841208 ns | 723417 ns | 1.16 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 416294 ns | 417664 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9771 ns | 9834 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10084 ns | 9792 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9958 ns | 9916 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9709 ns | 10166 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 556032 ns | 551816 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17113852 ns | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 2407250 ns | 2364708 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 679437 ns | 695047 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1500 ns | 1458 ns | 1.03 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 2959 ns | 1687.5 ns | 1.75 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1834 ns | 1917 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3000 ns | 1250 ns | 2.40 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 22035 ns | 21782 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI | 1290376 ns | | |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 206917 ns | 189208 ns | 1.09 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 31001 ns | 30960 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4208.5 ns | 3958.5 ns | 1.06 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4270.5 ns | 4167 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4229.5 ns | 4000 ns | 1.06 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4146 ns | 4334 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 147211 ns | 148046.5 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI | 9003436.5 ns | | |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 1634875 ns | 1745084 ns | 0.94 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 152001 ns | 148342 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57750 ns | 56083 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46458 ns | 39917 ns | 1.16 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46833 ns | 47000 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82000 ns | 82750 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36979 ns | 37366 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 545608 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1037042 ns | 1348187.5 ns | 0.77 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 82860 ns | 80291 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2035334 ns | 2017708 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2076166 ns | 2083959 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2083083 ns | 2090792 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2002146 ns | 1999604 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 230796 ns | 232635 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 7641898 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7003625 ns | 7104833 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1090251 ns | 1540007 ns | 0.71 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 150000 ns | 143708 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 149791 ns | 173750.5 ns | 0.86 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 176250 ns | 165562.5 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 173250 ns | 165979 ns | 1.04 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166820.5 ns | 166570 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7577001 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1533291.5 ns | 1701792 ns | 0.90 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 173682 ns | 205502.5 ns | 0.85 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1087229 ns | 1100292 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1124500 ns | 1114709 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1111062.5 ns | 1122042 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1108417 ns | 1119916 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 716617 ns | 713685 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 35528650.5 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6150958 ns | 7357125 ns | 0.84 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1025161 ns | 1039502 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4750 ns | 4458 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5104 ns | 4291 ns | 1.19 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 6208 ns | 0.97 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4250 ns | 4416 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 94368.5 ns | 94296 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5654517 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 444916 ns | 782083.5 ns | 0.57 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 62585.5 ns | 69431 ns | 0.90 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 8542 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8792 ns | 8834 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8542 ns | 9083 ns | 0.94 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8583 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 611231.5 ns | 608245 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 38660115 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6129729.5 ns | 5666604.5 ns | 1.08 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 388614 ns | 384864 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17729 ns | 17229 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17625 ns | 17250 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21417 ns | 22250 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19291.5 ns | 18312.5 ns | 1.05 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 67023 ns | 68096 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 2950314 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1293375 ns | 1292667 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 77260.5 ns | 74070.5 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223709 ns | 218583 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 212459 ns | 244459 ns | 0.87 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227500 ns | 213333 ns | 1.07 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219917 ns | 220875 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 360786 ns | 359693 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 14019876 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5619749.5 ns | 7278917 ns | 0.77 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 476405 ns | 475315 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 708 ns | 0.77 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 667 ns | 584 ns | 1.14 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 792 ns | 916.5 ns | 0.86 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 666 ns | 583 ns | 1.14 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 20833 ns | 20807.5 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI | 1137434 ns | | |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 300334 ns | 297208 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 34260 ns | 33001 ns | 1.04 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1417 ns | 1375 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1458 ns | 1458 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1666 ns | 1583 ns | 1.05 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1417 ns | 1417 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 126906.5 ns | 126203 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI | 8506726 ns | | |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 1628625 ns | 1457625 ns | 1.12 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 127611 ns | 138172 ns | 0.92 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7292 ns | 7333 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 5375 ns | 1.14 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 6083 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10084 ns | 10291 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23821 ns | 24430 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1260059 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 671167 ns | 351229 ns | 1.91 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49060 ns | 47101 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225000 ns | 219208 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 227833 ns | 261791 ns | 0.87 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 240062.5 ns | 228625 ns | 1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222479 ns | 223750 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 192842.5 ns | 194664 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31473361 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8703333 ns | 11964250 ns | 0.73 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 620596 ns | 617187 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4083 ns | 4125 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4084 ns | 4167 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4125 ns | 4125 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4083 ns | 4084 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23794 ns | 23689 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI | 1974229 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 222875 ns | 203375 ns | 1.10 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48831 ns | 48541 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 17041 ns | 16958 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16875 ns | 16583 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17167 ns | 17250 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 17167 ns | 16917 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 197595 ns | 196884 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI | 9863825 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 969833 ns | 1560667 ns | 0.62 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 177642 ns | 174782 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 511000 ns | 509333 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 405812.5 ns | 332250 ns | 1.22 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 406125 ns | 404250 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 866000 ns | 865708 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113024 ns | 114284.5 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI | 393046 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 464375 ns | 392875 ns | 1.18 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 247752 ns | 248273 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2306958 ns | 2318021 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2028375 ns | 1745083 ns | 1.16 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2028666 ns | 2021000 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3274208.5 ns | 3274791.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 243951 ns | 244508 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 9852070 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 1941875 ns | 2001875 ns | 0.97 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 760828 ns | 763478 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6542 ns | 5833 ns | 1.12 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7124.5 ns | 7167 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8416.5 ns | 7271 ns | 1.16 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6542 ns | 6124.5 ns | 1.07 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 92767.5 ns | 92855.5 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5331169 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 798833 ns | 861271 ns | 0.93 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 60351 ns | 60401 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11479 ns | 11375 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12062.5 ns | 11750 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12042 ns | 12229 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11666.5 ns | 11125 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 654548 ns | 638820 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 38795434 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5639791.5 ns | 6435375 ns | 0.88 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 415434 ns | 416514.5 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 541 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 541 ns | 541 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 541 ns | 541 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23597 ns | 23671 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI | 2266003 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 325292 ns | 318791 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 51190 ns | 53351 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2083 ns | 2167 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2209 ns | 2084 ns | 1.06 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2166 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2125 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 223010 ns | 222818.5 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI | 11010123 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 2006416 ns | 1967167 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 177622 ns | 180782 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9166 ns | 8708 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9791 ns | 8833 ns | 1.11 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11687.5 ns | 9895.5 ns | 1.18 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 8708 ns | 8709 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 106417 ns | 100619 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3372568.5 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 844499.5 ns | 898521 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 74990 ns | 74410.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17750 ns | 17375 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 18645.5 ns | 17167 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18479 ns | 19375 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17250 ns | 18250 ns | 0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 595296.5 ns | 574738 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 17174592 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5296333 ns | 5654917 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 386994 ns | 389229 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 625 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 500 ns | 1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 667 ns | 0.87 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 35962 ns | 36237 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 1201763 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 463458 ns | 463667 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47860 ns | 48401 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8416 ns | 8437.5 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10042 ns | 9312 ns | 1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9750 ns | 9875 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9062.5 ns | 9708 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 256831 ns | 254845 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18555060 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5261083 ns | 5087792 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 373643.5 ns | 375784 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396667 ns | 395833.5 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287875 ns | 215750 ns | 1.33 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287834 ns | 288166 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 755792 ns | 756000 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111567.5 ns | 112957 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI | 322878.5 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 450958.5 ns | 299833 ns | 1.50 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 76691 ns | 76681 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1443229 ns | 1455646 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1135146 ns | 862000 ns | 1.32 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1132416.5 ns | 1130021 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2438083 ns | 2442563 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 208450 ns | 210541 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI | 9796183 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1573708 ns | 1636104.5 ns | 0.96 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 327283 ns | 325573.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7437.5 ns | 7000 ns | 1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8062.5 ns | 7084 ns | 1.14 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8479 ns | 8125 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7145.5 ns | 7041 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 145243 ns | 136948 ns | 1.06 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5512391 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 446312.5 ns | 760125 ns | 0.59 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 62560 ns | 68820 ns | 0.91 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15041.5 ns | 14625 ns | 1.03 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14771 ns | 15042 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15291 ns | 14958.5 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13750 ns | 15625 ns | 0.88 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 965397 ns | 931253.5 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 44376533 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6134875 ns | 6306249.5 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 438064 ns | 436305 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25958.5 ns | 25542 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 27125 ns | 27334 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 29750 ns | 28354 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23979.5 ns | 31542 ns | 0.76 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 202823 ns | 200462.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 8058826 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 996250 ns | 1129500 ns | 0.88 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 116721 ns | 112942 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 110875 ns | 149250 ns | 0.74 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 151458 ns | 131583.5 ns | 1.15 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 150625 ns | 106479 ns | 1.41 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 104500 ns | 153208 ns | 0.68 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1088269 ns | 1062590 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 42023192 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5932833.5 ns | 5978292 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 595155 ns | 590197 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 77542 ns | 76250 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75812.5 ns | 74291.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 79875 ns | 77333 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 73208 ns | 76792 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 208775 ns | 209030.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7708981 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 536542 ns | 638458 ns | 0.84 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 126861 ns | 130572 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 289583 ns | 216500 ns | 1.34 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 323249.5 ns | 297395.5 ns | 1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 314416.5 ns | 212146 ns | 1.48 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 275875 ns | 306208 ns | 0.90 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1125851.5 ns | 1140320 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 39323813 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6511542 ns | 7480542 ns | 0.87 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 698637 ns | 697363 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 16750 ns | 15833 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 17750 ns | 17291.5 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 18187.5 ns | 17875 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 16667 ns | 16687.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 147960 ns | 150183 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 5605473.5 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 442666.5 ns | 779979 ns | 0.57 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 240302 ns | 237943 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26916.5 ns | 26458.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 28250 ns | 25708 ns | 1.10 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27541 ns | 27625 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26250 ns | 27750 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 991977 ns | 987976 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 40237198 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6023542 ns | 7131041.5 ns | 0.84 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 701302.5 ns | 701547 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 10792 ns | 10396 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10979 ns | 11563 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12958 ns | 12833 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10375 ns | 10875.5 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 124814.5 ns | 125970.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 3549474 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 954792 ns | 910812.5 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 238007.5 ns | 241512 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 22208 ns | 21083 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 22750 ns | 21604.5 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 22958.5 ns | 23041.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21334 ns | 21541.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 705181 ns | 709336 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 21586618 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5537979.5 ns | 5733333 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 669177 ns | 676248 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 63166.5 ns | 62667 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 64084 ns | 63771 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 65541 ns | 65667 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 64375 ns | 67667 ns | 0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 107012.5 ns | 107292 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3389638 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1279125 ns | 1352583.5 ns | 0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 237317.5 ns | 240373 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 449416.5 ns | 444083 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 480208.5 ns | 448875 ns | 1.07 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 447521 ns | 440458 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 446958 ns | 445833.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 516298 ns | 521267 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 21075444 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6049625 ns | 8808750 ns | 0.69 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 718537 ns | 728812.5 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7000 ns | 6958.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7479 ns | 7291 ns | 1.03 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8812.5 ns | 8771 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7021 ns | 7104 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 146225 ns | 147758.5 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 5722281 ns | | |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 446291 ns | 763583 ns | 0.58 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 61390 ns | 60941 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13750 ns | 15125 ns | 0.91 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14562 ns | 14417 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16750 ns | 15334 ns | 1.09 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14583 ns | 15958 ns | 0.91 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 960744.5 ns | 958359.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 38852284 ns | | |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5796291 ns | 6378396 ns | 0.91 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 405558.5 ns | 409474 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6142167 ns | 6155291 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 6377458 ns | 3225687.5 ns | 1.98 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6374646 ns | 6379541 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11908917 ns | 11906125 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 303195 ns | 351844 ns | 0.86 |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 286322 ns | 301554 ns | 0.95 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19075354.5 ns | 19041833.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 19951541.5 ns | 11118520.5 ns | 1.79 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19978750 ns | 19989395.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36445104 ns | 36469125 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1022515 ns | 1015731 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1160387 ns | 1151512 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 917 ns | 959 ns | 0.96 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 959 ns | 958 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1000 ns | 959 ns | 1.04 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 958 ns | 958 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23341 ns | 23791 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1978246 ns | | |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal | 326084 ns | 317417 ns | 1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 214712 ns | 215032 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3667 ns | 3667 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3791 ns | 3667 ns | 1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3750 ns | 3750 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3750 ns | 3708 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 281453 ns | 283833 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 11721582 ns | | |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal | 2113042 ns | 2116208 ns | 1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 643326.5 ns | 634877 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8417 ns | 7167 ns | 1.17 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 9062 ns | 7833.5 ns | 1.16 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9333.5 ns | 9291 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7812.5 ns | 7500 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 122008 ns | 122503 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 3404786 ns | | |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 790625 ns | 866646 ns | 0.91 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 67280 ns | 66931 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12125 ns | 11709 ns | 1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 12729.5 ns | 11834 ns | 1.08 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12500 ns | 13291 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11562.5 ns | 11875 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 642521 ns | 651319 ns | 0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22835654 ns | | |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5317417 ns | 5038083 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 362758.5 ns | 365314 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 334 ns | 292 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22606 ns | 22923 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI | 2068918 ns | | |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 331916 ns | 208979.5 ns | 1.59 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 51351 ns | 50651 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3000 ns | 3000 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2958 ns | 2959 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3292 ns | 3250 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 3208 ns | 2959 ns | 1.08 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 204531 ns | 206218 ns | 0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI | 9340318 ns | | |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1670229 ns | 1699541.5 ns | 0.98 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 160441.5 ns | 158851.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 12000 ns | 10375 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11666 ns | 11854.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13250 ns | 12417 ns | 1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10396 ns | 12333 ns | 0.84 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 122200.5 ns | 123182.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 3246604 ns | | |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 915479 ns | 877125 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 239182 ns | 241463 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 21104 ns | 22062 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21792 ns | 21625 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21500 ns | 21708 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20563 ns | 20084 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 600065 ns | 605852.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 20165731 ns | | |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 4787416 ns | 5025000 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 665137 ns | 667502 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4417 ns | 4417 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4416 ns | 4584 ns | 0.96 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4417 ns | 4417 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4375 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24894 ns | 24334 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI | 2248390 ns | | |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 223916 ns | 208417 ns | 1.07 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 54030 ns | 54130 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16458 ns | 16375 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16791 ns | 16375 ns | 1.03 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16625 ns | 16667 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16583 ns | 16875 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 330393 ns | 333246 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI | 13174147.5 ns | | |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1096687.5 ns | 1768771 ns | 0.62 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 216292 ns | 214042.5 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2084 ns | 2084 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2083 ns | 2000 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2167 ns | 2166 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2042 ns | 2041 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 36372 ns | 36196 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1201021 ns | | |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 446833 ns | 473000 ns | 0.94 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 206512 ns | 205752 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 16812 ns | 17667 ns | 0.95 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 19625 ns | 18937.5 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 18083.5 ns | 17625 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 16417 ns | 16896 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 296169 ns | 297235 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 21318598 ns | | |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5102291 ns | 5572167 ns | 0.92 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 695637 ns | 694748 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59542 ns | 55979.5 ns | 1.06 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 65334 ns | 60709 ns | 1.08 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 65875 ns | 65812.5 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51291 ns | 51583 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66549 ns | 66558 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 98811 ns | 120591.5 ns | 0.82 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 195875 ns | 185895.5 ns | 1.05 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 129792 ns | 146354 ns | 0.89 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 156812.5 ns | 136208 ns | 1.15 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 308959 ns | 297104 ns | 1.04 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 218139 ns | 218976.5 ns | 1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 591506 ns | 584106 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 84458.5 ns | 112833.5 ns | 0.75 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 83666 ns | 86417 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83708 ns | 89416 ns | 0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82771 ns | 81000 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192192 ns | 191966 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5581758 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1985375 ns | 1945000 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 198612 ns | 209467.5 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1917333 ns | 1912250 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1921042 ns | 1923916 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1913083 ns | 1917917 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1896792 ns | 1922250 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 537492 ns | 536309 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 25658976.5 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8852334 ns | 11093750 ns | 0.80 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1081525.5 ns | 935284.5 ns | 1.16 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 250 ns | 1.16 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21792 ns | 21820 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI | 2054894 ns | | |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 370250 ns | 327833.5 ns | 1.13 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 45091 ns | 46181 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1791 ns | 1.05 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1792 ns | 1833 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 252179 ns | 254627 ns | 0.99 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI | 9589577 ns | | |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1112041.5 ns | 1640833 ns | 0.68 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 182142 ns | 187212 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10708 ns | 8209 ns | 1.30 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 10521 ns | 9083 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10708.5 ns | 9896 ns | 1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8208 ns | 8417 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 119900.5 ns | 120586.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI | 3415652 ns | | |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 880333 ns | 873250 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 237573 ns | 236722 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9208 ns | 10292 ns | 0.89 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9895.5 ns | 8958 ns | 1.10 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9750 ns | 9917 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8792 ns | 8666 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 528737 ns | 532717.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 18354264 ns | | |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4675166 ns | 4452292 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 631771.5 ns | 646767 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57667 ns | 56750 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46500 ns | 39708 ns | 1.17 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46334 ns | 47166 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83541 ns | 83125 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 39618 ns | 40431 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1322744 ns | | |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1121333 ns | 1093666 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 73931 ns | 77971 ns | 0.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1938853.5 ns | 1903833 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1991958.5 ns | 1979312 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1945875 ns | 1983896 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1887000 ns | 1849208 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 221942 ns | 224788 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 33002765 ns | | |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11113083 ns | 14363791.5 ns | 0.77 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1027840 ns | 1042991 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 417250 ns | 415042 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 417187.5 ns | 418584 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 421250 ns | 420291 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 418333 ns | 420459 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 211347 ns | 212100.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7575667.5 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 539625 ns | 1065709 ns | 0.51 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 287053 ns | 286133 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 738291.5 ns | 742875 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 683771 ns | 758958 ns | 0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 757916 ns | 691062.5 ns | 1.10 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 681208.5 ns | 742624.5 ns | 0.92 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1065352.5 ns | 1063422.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43581622 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6560209 ns | 7312146 ns | 0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 924089 ns | 924920 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3444959 ns | 3442959 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3412209 ns | 3441833 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3411667 ns | 3417500 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3424875 ns | 3453000 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 174242.5 ns | 174858 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8338635 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1410542 ns | 1420583 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 433624 ns | 452865 ns | 0.96 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6186541.5 ns | 6180375 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6251146 ns | 6232875 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6196875 ns | 6229979 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6211542 ns | 6252666 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1007247 ns | 1007257 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50404661 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7388542 ns | 9641124.5 ns | 0.77 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1563280.5 ns | 1560736 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 470792 ns | 471375 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 340500 ns | 253334 ns | 1.34 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 340708.5 ns | 341708 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 902792 ns | 902583 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46331.5 ns | 46913 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI | 883706 ns | | |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 498958.5 ns | 338020.5 ns | 1.48 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 249362 ns | 250492 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2334396 ns | 2320416 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2027917 ns | 1761167 ns | 1.15 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2035000 ns | 2033167 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3282208.5 ns | 3279375 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 258830.5 ns | 260626 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI | 13167072 ns | | |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2215396 ns | 2319917 ns | 0.95 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 789218 ns | 785678 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57958 ns | 56166 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 45958 ns | 39417 ns | 1.17 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46291 ns | 46584 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82916.5 ns | 82917 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28557 ns | 28863 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1311996 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1137604.5 ns | 1130625 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 76966 ns | 79170.5 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2020625 ns | 2020083 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2097750 ns | 2062917 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2087000 ns | 2078437.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1971500 ns | 2004145.5 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 236092 ns | 238429 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 36800248.5 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11422291 ns | 15264270.5 ns | 0.75 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1060671 ns | 1057241 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57375 ns | 56292 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46584 ns | 39833 ns | 1.17 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46500 ns | 47416 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82542 ns | 82875 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 49183 ns | 50090 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 815504 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1092000 ns | 1054834 ns | 1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77081 ns | 74900 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1936771 ns | 1924167 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1983542 ns | 1968250 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1972750 ns | 1980792 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1881250 ns | 1891208 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 240760 ns | 243592 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 17572213 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9756750 ns | 12800042 ns | 0.76 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 929739 ns | 1070466 ns | 0.87 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 35201 ns | 35236 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 1199599 ns | | |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 461437.5 ns | 461750 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 48150 ns | 50011 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 6709 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 6520.5 ns | 1.15 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6959 ns | 7625 ns | 0.91 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6667 ns | 6541 ns | 1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 213488.5 ns | 216284 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 20286833.5 ns | | |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4892959 ns | 5088292 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 373504 ns | 373774 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 292 ns | 0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 250 ns | 1.16 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32390.5 ns | 32446 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI | 1231275 ns | | |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 255041 ns | 248500 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 41000 ns | 40510 ns | 1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2875 ns | 2917 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3125 ns | 3250 ns | 0.96 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3167 ns | 3083 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 3042 ns | 3458 ns | 0.88 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 189695.5 ns | 191592.5 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI | 7603161 ns | | |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 962208 ns | 1031291.5 ns | 0.93 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 155736.5 ns | 153502 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
429770.5 ns |
423917 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
423416 ns |
473500 ns |
0.89 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
423313 ns |
427833 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
427000 ns |
424125 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
138026.5 ns |
138519 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5855028 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2057062.5 ns |
2048875 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
351493 ns |
380684 ns |
0.92 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3807895.5 ns |
3799062.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3817542 ns |
3822458 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3807458.5 ns |
3802667 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3760479.5 ns |
3823563 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
709783 ns |
717031.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
31085025 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10433209 ns |
12950229 ns |
0.81 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1493135 ns |
1325953 ns |
1.13 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
49863500 ns |
49840813 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
35514500 ns |
25988833 ns |
1.37 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
35511042 ns |
35525750 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
96900416.5 ns |
96904729.5 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1600320.5 ns |
1593190 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU |
1005650 ns |
1014101 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
154538292 ns |
153775938 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
112336291.5 ns |
89008896 ns |
1.26 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
112413750 ns |
112384750 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
294933354 ns |
296752479 ns |
0.99 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6479054.5 ns |
6476290 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU |
5530406 ns |
5534451 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
19125 ns |
15062.5 ns |
1.27 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
19375 ns |
15625 ns |
1.24 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
17291 ns |
16875 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
16291.5 ns |
15333 ns |
1.06 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
21904 ns |
21010 ns |
1.04 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI |
1073077 ns |
||
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal |
227500 ns |
204959 ns |
1.11 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
26131 ns |
27230 ns |
0.96 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
10916.5 ns |
11083 ns |
0.98 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
8958.5 ns |
7583 ns |
1.18 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
9167 ns |
9209 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
17208 ns |
17188 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
262286.5 ns |
264057 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI |
10056101 ns |
||
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal |
1654812 ns |
1736125.5 ns |
0.95 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
153661 ns |
152581.5 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
8958 ns |
7417 ns |
1.21 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8729 ns |
8833 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
10313 ns |
10041.5 ns |
1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8125 ns |
8292 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
116383 ns |
117259.5 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI |
3328053 ns |
||
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
844500 ns |
887417 ns |
0.95 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
237893 ns |
236902.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9375 ns |
9708.5 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
9875 ns |
9292 ns |
1.06 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10583 ns |
10791.5 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9500 ns |
9584 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
625403 ns |
631614 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI |
22201158 ns |
||
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal |
5192708 ns |
5189583 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
651707 ns |
668942 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
10041.5 ns |
8812.5 ns |
1.14 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9708 ns |
9583 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
11458 ns |
11042 ns |
1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
10062 ns |
9250 ns |
1.09 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
121406 ns |
122641 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI |
3262923 ns |
||
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
935083 ns |
876791.5 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
69301 ns |
74481 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13334 ns |
13708 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13188 ns |
14979 ns |
0.88 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
16458 ns |
14416 ns |
1.14 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12666.5 ns |
13625.5 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
595121 ns |
601521.5 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI |
20037393 ns |
||
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
4712562 ns |
4885250 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
351104 ns |
353174 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
500 ns |
458 ns |
1.09 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
584 ns |
500 ns |
1.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
625 ns |
584 ns |
1.07 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
459 ns |
500 ns |
0.92 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
34870 ns |
35180 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI |
1208458 ns |
||
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
440375 ns |
441166 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
208592 ns |
206562 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7354 ns |
7042 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7916 ns |
10458 ns |
0.76 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9458 ns |
8042 ns |
1.18 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7146 ns |
7125 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
233423 ns |
233713.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
21500624 ns |
||
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
4832271 ns |
5300958.5 ns |
0.91 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
659156 ns |
658707 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
16500 ns |
12666 ns |
1.30 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
16708 ns |
13833 ns |
1.21 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
16208 ns |
15667 ns |
1.03 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
11854.5 ns |
10270.5 ns |
1.15 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
22539 ns |
22010 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI |
1105443 ns |
||
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal |
213166.5 ns |
186625 ns |
1.14 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
190492 ns |
191282 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
32292 ns |
32042 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
31708 ns |
32020.5 ns |
0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
32333 ns |
32458 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
31833 ns |
31854.5 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
278174 ns |
278049 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI |
10682517 ns |
||
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal |
1778749.5 ns |
1885500 ns |
0.94 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
603796 ns |
606396.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
466750 ns |
438291 ns |
1.06 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
471896 ns |
484125 ns |
0.97 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
445208 ns |
446062.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
442875 ns |
477208 ns |
0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
194389 ns |
194398.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI |
5707585.5 ns |
||
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1991000 ns |
1968250 ns |
1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
350553 ns |
375174 ns |
0.93 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
3826416 ns |
3825292 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
3820437.5 ns |
3837396 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
3833542 ns |
3828687.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
3824416.5 ns |
3836875 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
543837 ns |
549907 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI |
28449805 ns |
||
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10088312.5 ns |
12010500 ns |
0.84 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1218812 ns |
1226382.5 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
786744750 ns |
836787979.5 ns |
0.94 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
544322375 ns |
426008000 ns |
1.28 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
544701250 ns |
542930250 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
1560888250 ns |
1533058916 ns |
1.02 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22539066.5 ns |
22531506 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU |
14026519 ns |
14059203 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
3016294584 ns |
3617643875 ns |
0.83 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
1790874375 ns |
1519606625 ns |
1.18 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
1791257792 ns |
1791220042 ns |
1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
6320454875 ns |
4771769708 ns |
1.32 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
366543615 ns |
370760684 ns |
0.99 |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU |
88746342 ns |
89879564 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
80166.5 ns |
75354.5 ns |
1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
76083 ns |
77417 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
79000 ns |
80167 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
76187.5 ns |
76625 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
209486 ns |
210924.5 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
7636757.5 ns |
||
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
534750 ns |
1045583.5 ns |
0.51 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
109691 ns |
110131.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
208584 ns |
231500 ns |
0.90 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
262687.5 ns |
195167 ns |
1.35 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
225750 ns |
244583 ns |
0.92 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
273625 ns |
234875 ns |
1.16 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1063950 ns |
1060035 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
43715632.5 ns |
||
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
6250125 ns |
6603312.5 ns |
0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
640881.5 ns |
643791.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) |
199907104.5 ns |
199256958.5 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) |
139067125 ns |
103813958.5 ns |
1.34 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) |
138708750 ns |
139098125 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) |
388802584 ns |
388864875 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA |
5823959 ns |
5820038 ns |
1.00 |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU |
3423293 ns |
3424485 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) |
618968833 ns |
615907583.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) |
440907584 ns |
354224562 ns |
1.24 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) |
439090750 ns |
440166291.5 ns |
1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) |
1177639416 ns |
1188432875 ns |
0.99 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA |
26159681.5 ns |
26804213.5 ns |
0.98 |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU |
21888417 ns |
21815881 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7167 ns |
7333 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6167 ns |
5416 ns |
1.14 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6083 ns |
6291 ns |
0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10250 ns |
10458 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
27882 ns |
28403 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1246522 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
645625 ns |
361437.5 ns |
1.79 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
47780 ns |
48715.5 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213562.5 ns |
213333.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220687.5 ns |
221708 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
222520.5 ns |
220916 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
205917 ns |
205750 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
224122 ns |
226122 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
34490145 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9008375 ns |
11493583.5 ns |
0.78 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
530410 ns |
541195.5 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
9062.5 ns |
7291 ns |
1.24 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
7833 ns |
8417 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
10291 ns |
10770.5 ns |
0.96 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
9000 ns |
8583 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
118000.5 ns |
119656 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI |
3370282 ns |
||
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
894584 ns |
855542 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
72091 ns |
72200 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7500 ns |
7667 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7917 ns |
9395.5 ns |
0.84 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
10584 ns |
8375 ns |
1.26 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7312.5 ns |
7542 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
524296 ns |
526844.5 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI |
19429602 ns |
||
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
4708375 ns |
4384667 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
321993 ns |
322463 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
500 ns |
459 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
458 ns |
1.18 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
500 ns |
1.25 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
500 ns |
416 ns |
1.20 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
26603 ns |
27306 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI |
1230699 ns |
||
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
484062.5 ns |
483625 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
50630 ns |
48601 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9208 ns |
9917 ns |
0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9375 ns |
10167 ns |
0.92 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
11291 ns |
9542 ns |
1.18 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9375 ns |
8667 ns |
1.08 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
254202 ns |
256488 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI |
24288314.5 ns |
||
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5903834 ns |
5936416 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
394403 ns |
396784 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
106833 ns |
108542 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
100104.5 ns |
85333 ns |
1.17 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
100584 ns |
100208 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
146042 ns |
146625 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
25322 ns |
25074 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI |
1069883.5 ns |
||
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal |
268834 ns |
244333 ns |
1.10 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
188692 ns |
190632 ns |
0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
478458 ns |
479625 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
478250 ns |
518583.5 ns |
0.92 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
478792 ns |
481000 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
478125 ns |
478125 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
234643 ns |
235150 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI |
12128060.5 ns |
||
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal |
2169750 ns |
2164333 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
622201 ns |
622586 ns |
1.00 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
5270.5 ns |
5500 ns |
0.96 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
6125 ns |
5750 ns |
1.07 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
7417 ns |
6666.5 ns |
1.11 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
6458 ns |
4125 ns |
1.57 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
17119 ns |
16723 ns |
1.02 |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU |
78390 ns |
78130 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
11708 ns |
11812 ns |
0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
10812.5 ns |
11916 ns |
0.91 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
11875 ns |
11000 ns |
1.08 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
16542 ns |
16500 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
214283.5 ns |
216336 ns |
0.99 |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU |
390414 ns |
370958.5 ns |
1.05 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
38625 ns |
35917 ns |
1.08 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
52500 ns |
50500 ns |
1.04 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
52667 ns |
52709 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
13833 ns |
13541 ns |
1.02 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
22075 ns |
20359 ns |
1.08 |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU |
80161 ns |
79931 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
36249.5 ns |
36625 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
30937.5 ns |
29625 ns |
1.04 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
32625 ns |
31458 ns |
1.04 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
56958 ns |
57209 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
192677 ns |
195413 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU |
400584 ns |
409364 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
1729.5 ns |
1959 ns |
0.88 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
1916 ns |
1792 ns |
1.07 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
2208 ns |
2125 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
1875 ns |
1792 ns |
1.05 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
21247 ns |
21014.5 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI |
1157702 ns |
||
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal |
309833.5 ns |
324459 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU |
28990 ns |
33550 ns |
0.86 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
2167 ns |
2209 ns |
0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
2167 ns |
2125 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
2395.5 ns |
2417 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2291 ns |
0.93 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
205351 ns |
207244.5 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI |
9086951 ns |
||
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal |
1522708.5 ns |
1670895.5 ns |
0.91 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU |
136302 ns |
137121 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5187.5 ns |
4583 ns |
1.13 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4625 ns |
4750 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7083 ns |
6333 ns |
1.12 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
4292 ns |
4917 ns |
0.87 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
146996 ns |
147827 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI |
5676336 ns |
||
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
595250 ns |
771709 ns |
0.77 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
62401 ns |
71711 ns |
0.87 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8542 ns |
8270.5 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8667 ns |
8666 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10333 ns |
8792 ns |
1.18 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8083 ns |
8125 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
889542 ns |
888135.5 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI |
38346905.5 ns |
||
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
5710000 ns |
6483625 ns |
0.88 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
389144 ns |
391164 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
56750 ns |
56875 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
57791 ns |
56875 ns |
1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
57625 ns |
57750 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
58125 ns |
58292 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
37659 ns |
37890 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI |
1173482.5 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
641708 ns |
379312.5 ns |
1.69 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
206772 ns |
205582 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
482062.5 ns |
448479 ns |
1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
463250 ns |
465229 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
508917 ns |
464687.5 ns |
1.10 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
434333 ns |
433500 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
269678.5 ns |
270782 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI |
27863393 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8045333 ns |
10306000 ns |
0.78 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
805107.5 ns |
801818 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
3313249.5 ns |
3291000 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
2341229.5 ns |
1770084 ns |
1.32 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
2340333 ns |
2335292 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
6307458 ns |
6297083.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
206170.5 ns |
206316 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU |
210057 ns |
203322 ns |
1.03 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
11431687.5 ns |
11333854.5 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
8348708.5 ns |
6594562.5 ns |
1.27 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
8321437.5 ns |
8324937.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
21106375 ns |
21089229 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
734135 ns |
735605 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU |
1070635.5 ns |
1072271 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5667 ns |
5625 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5667 ns |
5667 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7292 ns |
7500 ns |
0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4458 ns |
6750 ns |
0.66 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
140309.5 ns |
139700 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI |
5380103.5 ns |
||
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
768292 ns |
867541.5 ns |
0.89 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
56180 ns |
56260 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7625 ns |
7500 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7625 ns |
14625 ns |
0.52 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7292 ns |
7375 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7083 ns |
7000 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
765019 ns |
766028 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI |
35848307 ns |
||
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5274333 ns |
5998084 ns |
0.88 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
381219 ns |
380414 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 141125 ns | 117604 ns | 1.20 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 124167 ns | 125375 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 99021 ns | 102396 ns | 0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 99959 ns | 98145.5 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 151096 ns | 152876 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5813160 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2024062.5 ns | 2030624.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 170012 ns | 185692 ns | 0.92 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2026542 ns | 2021875 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2030187.5 ns | 2037125 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2022917 ns | 2013542 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2001958 ns | 2033354 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 710929 ns | 716061.5 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31104834 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10821438 ns | 13591542 ns | 0.80 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1258188 ns | 1265732.5 ns | 0.99 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 34104 ns | 29833 ns | 1.14 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 36667 ns | 34167 ns | 1.07 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 35896 ns | 35542 ns | 1.01 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 500 ns | 625 ns | 0.80 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15836 ns | 15704 ns | 1.01 |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 71301 ns | 71560.5 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2625 ns | 2583 ns | 1.02 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3041 ns | 4583 ns | 0.66 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3000 ns | 3000 ns | 1 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2250 ns | 2209 ns | 1.02 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 140932.5 ns | 143464 ns | 0.98 |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 363533.5 ns | 351354 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7208 ns | 1 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6084 ns | 5334 ns | 1.14 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 6166 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9834 ns | 10000 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36632 ns | 37164 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1147030 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 478270.5 ns | 334396 ns | 1.43 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50781 ns | 49180 ns | 1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212875 ns | 212895.5 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 245250 ns | 222000 ns | 1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 243291 ns | 221041.5 ns | 1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205791 ns | 205979 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 248183.5 ns | 249374 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 27101683.5 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7862917 ns | 9656333 ns | 0.81 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 521905.5 ns | 581561 ns | 0.90 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3916 ns | 3959 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 4000 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22266 ns | 21939 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI | 2115567 ns | | |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 244666 ns | 227375 ns | 1.08 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 45861 ns | 45671 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14834 ns | 14916 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15083 ns | 14708 ns | 1.03 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14958 ns | 15000 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14875 ns | 14875 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 316276.5 ns | 314728.5 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI | 11563770.5 ns | | |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1010166 ns | 1635750 ns | 0.62 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 194972 ns | 192832 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 135292 ns | 109166 ns | 1.24 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 126542 ns | 132541 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 104833 ns | 109875 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 127083 ns | 102125 ns | 1.24 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 149262.5 ns | 138355.5 ns | 1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5493706 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2741375 ns | 2016354 ns | 1.36 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 183311 ns | 188667 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1875500 ns | 1918396 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1929062.5 ns | 1939229 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1921833.5 ns | 1913584 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1881833 ns | 1937625 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 701377 ns | 700104 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 29803203.5 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10752125 ns | 13264020.5 ns | 0.81 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1228342 ns | 1233652.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18708 ns | 17667 ns | 1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19333 ns | 18458 ns | 1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20916.5 ns | 22270.5 ns | 0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18625 ns | 18250 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 111660 ns | 110588.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3330550 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1335000 ns | 1374104.5 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 77091 ns | 81891 ns | 0.94 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216104 ns | 216417 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 248250.5 ns | 249771 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222000 ns | 216541.5 ns | 1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 240979 ns | 217312.5 ns | 1.11 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 529691 ns | 527304 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 19308055.5 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6096541.5 ns | 8411584 ns | 0.72 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 489315 ns | 488925 ns | 1.00 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 25541.5 ns | 24063 ns | 1.06 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 30791.5 ns | 28500 ns | 1.08 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 30625 ns | 29459 ns | 1.04 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1667 ns | 1334 ns | 1.25 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16863 ns | 16479 ns | 1.02 |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 82541 ns | 82590 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4562.5 ns | 4708.5 ns | 0.97 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4916.5 ns | 4708 ns | 1.04 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5271 ns | 5208 ns | 1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4292 ns | 4875 ns | 0.88 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 210758 ns | 210198 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 382683.5 ns | 398304 ns | 0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 304208 ns | 304792 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 308041 ns | 305542 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 308333 ns | 311083 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 304708 ns | 306375 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 234485 ns | 232191.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7433824.5 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1027333 ns | 1156396 ns | 0.89 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 275523 ns | 279563 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 541792 ns | 530625 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 532687.5 ns | 542459 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 544583.5 ns | 542000.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 548917 ns | 535875 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1108177 ns | 1096065 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 43330985.5 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6088292 ns | 6678000 ns | 0.91 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 862699 ns | 873778.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19375 ns | 20083 ns | 0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 21875 ns | 20187.5 ns | 1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22625 ns | 23187 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21291 ns | 20959 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 115815.5 ns | 115290.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3524302 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1434854.5 ns | 1265792 ns | 1.13 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 76091 ns | 80731 ns | 0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212500 ns | 212042 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 241791 ns | 224625 ns | 1.08 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220125 ns | 214333 ns | 1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213792 ns | 213708.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 761437 ns | 758025 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25996104 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7204145.5 ns | 10158583 ns | 0.71 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 543280.5 ns | 542975 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6542 ns | 6458 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7042 ns | 6917 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8666.5 ns | 8542 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6416 ns | 6417 ns | 1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 144083 ns | 143078 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 5494023 ns | | |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 783688 ns | 869500 ns | 0.90 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 71731 ns | 69771 ns | 1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10542 ns | 10709 ns | 0.98 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10583.5 ns | 9771 ns | 1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10291 ns | 10729.5 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9916 ns | 10291 ns | 0.96 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 846190.5 ns | 834187 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 37805204 ns | | |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5358917 ns | 6274750 ns | 0.85 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 387734 ns | 396084 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5375 ns | 5333 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6187.5 ns | 4958 ns | 1.25 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6979 ns | 7125 ns | 0.98 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4834 ns | 5958 ns | 0.81 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 147518 ns | 146313.5 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 5618573 ns | | |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 791750 ns | 875000 ns | 0.90 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 61531 ns | 67660 ns | 0.91 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7291 ns | 7667 ns | 0.95 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8104 ns | 7500 ns | 1.08 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7666 ns | 7625 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7708 ns | 7459 ns | 1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 801831 ns | 797995 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 38472235 ns | | |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5789958 ns | 6580999.5 ns | 0.88 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 397074 ns | 400804 ns | 0.99 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14458000 ns | 14350958 ns | 1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 10125125 ns | 7722625 ns | 1.31 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10112458 ns | 10132750 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27710417 ns | 27757125 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 532341 ns | 532327 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 384094 ns | 403538.5 ns | 0.95 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46171729 ns | 45806208 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 33474333.5 ns | 26766750.5 ns | 1.25 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33459750 ns | 33520000 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85189084 ns | 85306916 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2669656 ns | 2661047 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3280222.5 ns | 3296413 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 66333 ns | 66000 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 67229 ns | 67333 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 70667 ns | 69854 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 67500 ns | 67375 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 120718.5 ns | 120529 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3600796.5 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1457000.5 ns | 1329083.5 ns | 1.10 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 231107.5 ns | 228112 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 443625.5 ns | 444083 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 475854 ns | 444083 ns | 1.07 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 447853.5 ns | 441292 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 442792 ns | 442521.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 742407 ns | 736542.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25435932 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7578187.5 ns | 10732062.5 ns | 0.71 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 798118 ns | 809398 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 542 ns | 1 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 625 ns | 542 ns | 1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 667 ns | 0.88 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 542 ns | 0.92 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 33000 ns | 32886 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI | 1148407.5 ns | | |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 458917 ns | 466834 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 49420 ns | 49230 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8584 ns | 9375 ns | 0.92 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10042 ns | 9250 ns | 1.09 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9750 ns | 9500 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8250 ns | 8125 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 292463 ns | 290314.5 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 23428542 ns | | |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4860020.5 ns | 5519708 ns | 0.88 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 382894 ns | 387394 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9875 ns | 9875 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9875 ns | 9833 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9875 ns | 9833 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9833 ns | 9791 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23639 ns | 23928 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2138902.5 ns | | |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 222041 ns | 204979.5 ns | 1.08 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 215522 ns | 214872 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45958 ns | 46000 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 46708 ns | 45667 ns | 1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46167 ns | 46666 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 45834 ns | 46250 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 296481.5 ns | 293307 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 11930292.5 ns | | |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 926313 ns | 1595562.5 ns | 0.58 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 623576 ns | 621217 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56208 ns | 56333 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 57209 ns | 56792 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57125 ns | 57083 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57833 ns | 57834 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 29373 ns | 29516 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1150343 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 611209 ns | 704333.5 ns | 0.87 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 203382 ns | 205082 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 448833 ns | 455021 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 476708 ns | 465375 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 503562.5 ns | 473000 ns | 1.06 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 439500 ns | 434208.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 251321 ns | 252003 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 31963324.5 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9414500.5 ns | 12166125 ns | 0.77 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 863589 ns | 893508.5 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 646541.5 ns | 624416 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 641000 ns | 662083 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 641562.5 ns | 619083 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 651083 ns | 633895.5 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 210265 ns | 212333 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8024959 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1399792 ns | 1471333 ns | 0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 232562 ns | 236152 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2229208 ns | 2220834 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2220124.5 ns | 2250000 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2228208 ns | 2213792 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2239166 ns | 2240750 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1005286 ns | 990521.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50598584.5 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8457041 ns | 9717333 ns | 0.87 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1379658.5 ns | 1376089 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21125 ns | 19000 ns | 1.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20104 ns | 19979 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23416 ns | 22333.5 ns | 1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21083 ns | 22250 ns | 0.95 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 115191.5 ns | 114382.5 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 3482804 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1454313 ns | 1244584 ns | 1.17 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 75055.5 ns | 81450 ns | 0.92 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 225500 ns | 222479 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230791 ns | 224959 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 223750 ns | 221208 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219354 ns | 218917 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 740987 ns | 738666.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 25904808.5 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7416125 ns | 10456396 ns | 0.71 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 562065.5 ns | 562856 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 584 ns | 0.86 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 667 ns | 0.88 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 541 ns | 542 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23696 ns | 23746 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1202842 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 469708.5 ns | 488062.5 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 52120 ns | 49670 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8375 ns | 9541.5 ns | 0.88 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9875 ns | 9792 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 9833 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8813 ns | 9291.5 ns | 0.95 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 273167 ns | 272510 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24458892 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6139792 ns | 6224583.5 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 408644 ns | 407824 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8958 ns | 7708 ns | 1.16 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9937.5 ns | 8687.5 ns | 1.14 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9750 ns | 11166.5 ns | 0.87 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8417 ns | 9666 ns | 0.87 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 122177 ns | 121220 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI | 3286464 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 893667 ns | 860208 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 67831 ns | 72661 ns | 0.93 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7708.5 ns | 7708 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7958 ns | 7250 ns | 1.10 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7792 ns | 8125 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7354.5 ns | 7334 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 520137 ns | 516336 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI | 17380037 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4378271 ns | 4339813 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 327554 ns | 328244 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1250 ns | 1458 ns | 0.86 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1646 ns | 1375 ns | 1.20 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1792 ns | 2041.5 ns | 0.88 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1541 ns | 1583 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 22253.5 ns | 21646 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI | 1134787.5 ns | | |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 310250 ns | 305020.5 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 191402 ns | 191511.5 ns | 1.00 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3458 ns | 3334 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3416.5 ns | 3375 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3583 ns | 3459 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3250 ns | 3458 ns | 0.94 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 222894.5 ns | 224911 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 10304323.5 ns | | |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1806500 ns | 1768041 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 594426 ns | 595216 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 146687.5 ns | 145708.5 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 129292 ns | 106562.5 ns | 1.21 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 130125 ns | 129292 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225021 ns | 225125 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24810 ns | 24473.5 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI | 1164548 ns | | |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 291229 ns | 252375 ns | 1.15 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 36760 ns | 38390 ns | 0.96 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 143312.5 ns | 143771 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 110917 ns | 88167 ns | 1.26 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 111645.5 ns | 110771 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 250854.5 ns | 250875 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 222162 ns | 220914.5 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI | 10422924 ns | | |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 1979750 ns | 2045709 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 220922.5 ns | 237933 ns | 0.93 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7250 ns | 1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 5333 ns | 1.12 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6041 ns | 5916 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10125 ns | 10208 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33688 ns | 33448 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 1158395.5 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 349458 ns | 335833 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50791 ns | 50340 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220396 ns | 224250 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228250 ns | 228375 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228250 ns | 236083.5 ns | 0.97 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217187.5 ns | 212562.5 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 269449.5 ns | 267943.5 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 26368125 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8172042 ns | 9170083 ns | 0.89 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 531435 ns | 609306 ns | 0.87 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 15125 ns | 14458 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 15062.5 ns | 14812.5 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16125 ns | 16791.5 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 14709 ns | 15334 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 143874.5 ns | 141134 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI | 5485667 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 799375 ns | 873104 ns | 0.92 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 236992 ns | 238182 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 23958 ns | 24083.5 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23583 ns | 23875 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23791.5 ns | 24167 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23895.5 ns | 23625 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 888177 ns | 878285 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI | 39660301 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5523166.5 ns | 6385188 ns | 0.86 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 698717 ns | 692226 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9834 ns | 8916 ns | 1.10 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9542 ns | 9687.5 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11250 ns | 12125 ns | 0.93 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9041 ns | 10416 ns | 0.87 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 125866 ns | 124959.5 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI | 3385207 ns | | |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 918291.5 ns | 918334 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 73661 ns | 75531 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13792 ns | 14000 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14500 ns | 13729 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14250 ns | 14708 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13541 ns | 13834 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 680403 ns | 676549 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI | 22152366 ns | | |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5309250 ns | 5573041 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 376079 ns | 373189 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9792 ns | 8062 ns | 1.21 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 8896 ns | 9750 ns | 0.91 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11584 ns | 11916.5 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8500 ns | 10187.5 ns | 0.83 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 125117 ns | 124116 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI | 3437739 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 894125 ns | 883646 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 73521 ns | 69690 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12084 ns | 12625 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12854 ns | 12750 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13042 ns | 13542 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12250 ns | 12312 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 564204.5 ns | 561116 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI | 18654209 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4654396 ns | 4630937 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 347644 ns | 345083.5 ns | 1.01 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 28104 ns | 27208.5 ns | 1.03 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 34458.5 ns | 32333.5 ns | 1.07 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 32209 ns | 31958 ns | 1.01 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2250 ns | 2041 ns | 1.10 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16850 ns | 16556 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 73711 ns | 82091 ns | 0.90 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5104 ns | 5229 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 5229.5 ns | 4687.5 ns | 1.12 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5250 ns | 5334 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6479 ns | 6458 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 142969.5 ns | 142634 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 370174 ns | 367964 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 334 ns | 0.87 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 250 ns | 1.50 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 26455 ns | 26682 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 1200640 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 450083 ns | 482271 ns | 0.93 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48211 ns | 47990 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6500 ns | 6500 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6792 ns | 6562.5 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6645.5 ns | 6709 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6271 ns | 6188 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 191827 ns | 190767.5 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 24867781 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5869000 ns | 5874834 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 389779 ns | 394363.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2000 ns | 2042 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2083 ns | 1917 ns | 1.09 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2083 ns | 2125 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 1917 ns | 2000 ns | 0.96 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 27236 ns | 27167 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI | 1169588 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 475875 ns | 492292 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 209463 ns | 210002 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16771 ns | 16833.5 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16459 ns | 16417 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17125 ns | 17354.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16958 ns | 16458.5 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 280882.5 ns | 278278 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI | 24732202.5 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6172584 ns | 6125604 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 715917 ns | 714427 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 154104.5 ns | 146500 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 154208 ns | 171396 ns | 0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 153687.5 ns | 155584 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 191750 ns | 154167 ns | 1.24 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 210824 ns | 204804 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 7880615 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1458145.5 ns | 1553583 ns | 0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 193772 ns | 231362.5 ns | 0.84 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1316937.5 ns | 1324312.5 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1322666 ns | 1348021 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1323667 ns | 1319083.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1322334 ns | 1326542 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 938240 ns | 925557 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 45534804.5 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6626458 ns | 8602229.5 ns | 0.77 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1016615 ns | 1014380 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24916.5 ns | 23792 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25417 ns | 25354 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 30354 ns | 28250 ns | 1.07 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23708 ns | 24604.5 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 241805 ns | 238411 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI | 7385481 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1045479 ns | 1139000 ns | 0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 120432 ns | 120312 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 171875 ns | 117854 ns | 1.46 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 170938 ns | 124667 ns | 1.37 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 129000 ns | 174458.5 ns | 0.74 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 171020.5 ns | 118354 ns | 1.44 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1112146.5 ns | 1098934 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI | 46841407 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6287999.5 ns | 7919042 ns | 0.79 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 606330.5 ns | 614406 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 250 ns | 1.34 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 334 ns | 375 ns | 0.89 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23274 ns | 23522 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1232898 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 483708 ns | 491791.5 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48460 ns | 50790 ns | 0.95 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6291 ns | 6583 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6833 ns | 6375 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6958 ns | 6833 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6125 ns | 6167 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 209275.5 ns | 207746.5 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 25541336 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5865958 ns | 5956667 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 396823.5 ns | 395954 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6209 ns | 5958 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 6041.5 ns | 0.97 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7583 ns | 7604.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5375 ns | 6500 ns | 0.83 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 148685.5 ns | 147981.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 5635688 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 450542 ns | 774875 ns | 0.58 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 236762 ns | 239202 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10291 ns | 10000 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10625 ns | 10083 ns | 1.05 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10292 ns | 10667 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9708 ns | 9791.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 926094 ns | 916090 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 41298893 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5833458 ns | 7392292 ns | 0.79 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 686037 ns | 688747.5 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 708 ns | 0.88 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 625 ns | 666 ns | 0.94 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 666 ns | 666 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 666 ns | 625 ns | 1.07 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22945 ns | 23031 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI | 2003264 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 324291.5 ns | 209625 ns | 1.55 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 216602 ns | 215712 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4584 ns | 4833 ns | 0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4792 ns | 4584 ns | 1.05 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4833 ns | 4833 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4625 ns | 4625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 234125 ns | 230125.5 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI | 9765420 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1729083 ns | 1700146 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 600706 ns | 599396 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8541.5 ns | 8396 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8417 ns | 8000 ns | 1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9834 ns | 10125 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7770.5 ns | 9062.5 ns | 0.86 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 124166 ns | 123106.5 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI | 3943268 ns | | |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 819500 ns | 907333 ns | 0.90 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 69551 ns | 76081 ns | 0.91 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8666.5 ns | 8792 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9125 ns | 8459 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8709 ns | 9041 ns | 0.96 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8416 ns | 8270.5 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 603479 ns | 600302.5 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI | 22543432 ns | | |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4953249.5 ns | 4960583.5 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 351464 ns | 353604 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 125896 ns | 122750 ns | 1.03 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 129542 ns | 95625 ns | 1.35 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 130125 ns | 130334 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 180833 ns | 183125 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46726 ns | 46375 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 94181 ns | 98981 ns | 0.95 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 319041 ns | 303292 ns | 1.05 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 344875 ns | 182750 ns | 1.89 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 340500 ns | 345917 ns | 0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 570229.5 ns | 608729 ns | 0.94 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 195260.5 ns | 195364.5 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 501295.5 ns | 494734 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397583 ns | 396125 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288270.5 ns | 215375 ns | 1.34 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287750 ns | 287708 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756042 ns | 756000 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43912 ns | 43820 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI | 1469434 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 424771 ns | 358000 ns | 1.19 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 83611 ns | 83390 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1468708 ns | 1446958.5 ns | 1.02 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 1137167 ns | 863667 ns | 1.32 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1136562.5 ns | 1133375 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2444229 ns | 2443417 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 253458 ns | 252085 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI | 10856605 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1843937.5 ns | 1851958 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 354543 ns | 350863.5 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 651084 ns | 626459 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 660125 ns | 682479 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 652959 ns | 615000 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 626167 ns | 641167 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 207109 ns | 203045 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8156799 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1369084 ns | 1359542 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 255513 ns | 254223 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2443625 ns | 2435250 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2456792 ns | 2470979.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2441541 ns | 2445042 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2441833 ns | 2415792 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1028403 ns | 1014910 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 50468967.5 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10457750 ns | 11589916 ns | 0.90 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1469104 ns | 1478675 ns | 0.99 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 34083.5 ns | 29458.5 ns | 1.16 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 36312.5 ns | 33812.5 ns | 1.07 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 35500 ns | 34541 ns | 1.03 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 875 ns | 1042 ns | 0.84 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15652 ns | 15442 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 72891 ns | 85531 ns | 0.85 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3209 ns | 3250 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3375 ns | 3042 ns | 1.11 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3375 ns | 3416 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3125 ns | 3166 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 141997 ns | 142240.5 ns | 1.00 |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 343713 ns | 360413 ns | 0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 406000 ns | 404291 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 408791 ns | 403708 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 408167 ns | 409042 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 419542 ns | 421875 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 43520 ns | 44262 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1357102.5 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1155208.5 ns | 1119041 ns | 1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 241062 ns | 242882 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3864084 ns | 3855208 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3988291.5 ns | 3997771 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3990020.5 ns | 3998125 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3733958.5 ns | 3773938 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 249010 ns | 248524 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 36216473.5 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11511875 ns | 14976771 ns | 0.77 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1245217.5 ns | 1453704 ns | 0.86 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3916 ns | 3959 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34609 ns | 34278.5 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI | 1165142.5 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 182792 ns | 161167 ns | 1.13 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 40940 ns | 40280 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15708 ns | 15875 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 16083 ns | 15583 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15959 ns | 16041 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15750 ns | 15791 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 258758 ns | 257529.5 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI | 8503589.5 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 882021 ns | 864083.5 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 165641 ns | 168256.5 ns | 0.98 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404125 ns | 403417 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 295625 ns | 221375 ns | 1.34 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 295583 ns | 295666 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760417 ns | 760500 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113700 ns | 113952 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI | 1066318.5 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 463583.5 ns | 335792 ns | 1.38 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 88591 ns | 88615.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1492292 ns | 1471958 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1152625 ns | 887791.5 ns | 1.30 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1162562 ns | 1157167 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2463959 ns | 2467666 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 251541.5 ns | 255583.5 ns | 0.98 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10024873 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1881709 ns | 1946854 ns | 0.97 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 357224 ns | 360243.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 541 ns | 542 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 583 ns | 584 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 458 ns | 500 ns | 0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 26556 ns | 26902 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI | 1136156 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 465500 ns | 486187.5 ns | 0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 208083 ns | 208227.5 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7646 ns | 7667 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8084 ns | 7666 ns | 1.05 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7834 ns | 7916.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7417 ns | 7250 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 222497 ns | 219818 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI | 24199768 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5838500 ns | 6151042 ns | 0.95 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 689609 ns | 686716.5 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 832270.5 ns | 825562.5 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 618459 ns | 468833 ns | 1.32 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 621499.5 ns | 620188 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1558042 ns | 1547479 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 130601 ns | 131055 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 168282 ns | 231953 ns | 0.73 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2689437.5 ns | 2669042 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 2008583 ns | 1538125.5 ns | 1.31 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 2000250 ns | 2006270.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4937229.5 ns | 4938583 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 260328 ns | 242713 ns | 1.07 |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 870736 ns | 860168 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 291 ns | 1.29 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 334 ns | 375 ns | 0.89 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 333 ns | 0.87 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32813 ns | 32634 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI | 1166182 ns | | |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 464354.5 ns | 452000 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 51890 ns | 48761 ns | 1.06 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6083 ns | 6437.5 ns | 0.94 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6875 ns | 6541.5 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6812.5 ns | 6750 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6062.5 ns | 6000 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 229431 ns | 228896 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI | 22144460 ns | | |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5526646 ns | 5302916 ns | 1.04 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 366015 ns | 369843 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2415167 ns | 2391250 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2398042 ns | 2400000 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2379812.5 ns | 2405958 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2391667 ns | 2372125 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 207204.5 ns | 204395 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 8005080.5 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1453750 ns | 1597249.5 ns | 0.91 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 356885 ns | 377704 ns | 0.94 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4645458 ns | 4646708.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4640083 ns | 4648958 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4665375 ns | 4659021 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4561854.5 ns | 4685792 ns | 0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 930938 ns | 915367 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 48256012 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6726208 ns | 7426833 ns | 0.91 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1414028 ns | 1261857 ns | 1.12 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6854.5 ns | 7479 ns | 0.92 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 22875 ns | 7125 ns | 3.21 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7417 ns | 7959 ns | 0.93 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6792 ns | 7250 ns | 0.94 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 23968 ns | 23573 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI | 1176642 ns | | |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 283437.5 ns | 243500 ns | 1.16 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 34960 ns | 39571 ns | 0.88 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 66187.5 ns | 70291.5 ns | 0.94 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 52229 ns | 45542 ns | 1.15 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 50687.5 ns | 63500 ns | 0.80 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 45209 ns | 33104 ns | 1.37 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 221676 ns | 217821 ns | 1.02 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI | 10860397 ns | | |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2069917 ns | 2084458 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 239743 ns | 226612 ns | 1.06 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 22104 ns | 20396 ns | 1.08 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 26291.5 ns | 24479.5 ns | 1.07 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 25458 ns | 24854.5 ns | 1.02 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5958 ns | 5500 ns | 1.08 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18137 ns | 16892 ns | 1.07 |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 91101 ns | 85151 ns | 1.07 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 12042 ns | 11958 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 10584 ns | 9000 ns | 1.18 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10875 ns | 10958.5 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 17979 ns | 18167 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 231021 ns | 227664.5 ns | 1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 374574 ns | 389024 ns | 0.96 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 406209 ns | 404791 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 297167 ns | 223500 ns | 1.33 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 296334 ns | 296709 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762750 ns | 762750 ns | 1 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46909 ns | 46360 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI | 1358188 ns | | |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 481770.5 ns | 340000 ns | 1.42 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 88761 ns | 88940 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1490812.5 ns | 1485750.5 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 1170000 ns | 895812 ns | 1.31 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1166250 ns | 1165791.5 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2470395.5 ns | 2472333 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 289489 ns | 290272 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI | 12873963.5 ns | | |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2039750.5 ns | 2106583 ns | 0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 378364 ns | 377424 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 433958 ns | 432770.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 436958 ns | 430583 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 436542 ns | 436958 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 447333 ns | 448209 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 55343 ns | 54092 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 1019801 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1133312.5 ns | 1074083.5 ns | 1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 236118 ns | 235772 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3905708 ns | 3888958 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4027020.5 ns | 4016791.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4021333.5 ns | 4025938 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3767563 ns | 3793958.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 269874 ns | 263523 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 31176805 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10180479 ns | 11929333 ns | 0.85 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1242641 ns | 1247352 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 8750 ns | 8750 ns | 1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7667 ns | 6875 ns | 1.12 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7708 ns | 7667 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 12375 ns | 12417 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24263 ns | 24084 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI | 2162149 ns | | |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 226000 ns | 211583 ns | 1.07 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 215323 ns | 216562 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 45000 ns | 45125 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45333 ns | 44750 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 45500 ns | 45375 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 44875 ns | 45187.5 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 351763 ns | 347338.5 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI | 12458965 ns | | |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 1760771 ns | 1883625.5 ns | 0.93 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 670439 ns | 671931.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 86041.5 ns | 104146.5 ns | 0.83 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 123250 ns | 86437 ns | 1.43 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92208 ns | 92875 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 122937.5 ns | 126625 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190149 ns | 189767 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI | 5780085 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1986083 ns | 1966250 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 210857.5 ns | 183982 ns | 1.15 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2008541 ns | 2011000 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2030687.5 ns | 2025000 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2014687.5 ns | 2009458 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2014250 ns | 2016917 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 544290 ns | 535873.5 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI | 28037049 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9702729 ns | 11961958.5 ns | 0.81 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 972217 ns | 982380 ns | 0.99 |
This comment was automatically generated by workflow using github-action-benchmark.
avik-pal force-pushed the ap/in_stat_track branch from a2993bd to 91a1547 on September 5, 2024 02:00
avik-pal force-pushed the ap/in_stat_track branch from 91a1547 to 61af0b7 on September 5, 2024 02:24
Temporarily disabling other tests. They need to be re-enabled before merging.