This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
fix: task switching in AMDGPU complex batched_matmul #178
Merged
avik-pal force-pushed the ap/fix_downstream branch from 73e4211 to 583a6b4 on October 25, 2024 14:57
This reverts commit a8c0f3b.
LuxLib Benchmarks
Benchmark suite | Current: 0766885 | Previous: 98a2d7a | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6000 ns | 6417 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6541 ns | 6041 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7875 ns | 7167 ns | 1.10 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5333 ns | 5292 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 108617 ns | 103542 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 809916 ns | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 436641 ns | 637131 ns | 0.69 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9875 ns | 10166.5 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10292 ns | 9958 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9979.5 ns | 10291.5 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9771 ns | 9979.5 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 535818 ns | 494284 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 6627750 ns | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 664425 ns | 719725 ns | 0.92 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1458.5 ns | 1583 ns | 0.92 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1458 ns | 1542 ns | 0.95 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 2875 ns | 1666 ns | 1.73 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1416 ns | 1500 ns | 0.94 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 19736 ns | 20684 ns | 0.95 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal | 456250 ns | | |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU | 29621 ns | 33302 ns | 0.89 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 3541.5 ns | 3812.5 ns | 0.93 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4291 ns | 4125 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4292 ns | 4250 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 3687.5 ns | 4334 ns | 0.85 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 132304.5 ns | 134278.5 ns | 0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal | 2272937.5 ns | | |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 146734 ns | 143062.5 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57750 ns | 58000 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38916 ns | 46417 ns | 0.84 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46667 ns | 46875 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 79291 ns | 83750 ns | 0.95 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 36853 ns | 37449 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1095208 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 80626.5 ns | 70883 ns | 1.14 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2038334 ns | 2037500 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2088854.5 ns | 2083416.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2090437 ns | 2090916.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1972417 ns | 1996979.5 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 218514 ns | 220080 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 6593333 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1269408 ns | 1213928 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 147292 ns | 173708 ns | 0.85 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144833 ns | 146625 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 175521 ns | 165062.5 ns | 1.06 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 155375 ns | 172000 ns | 0.90 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 165892 ns | 167869.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1634083.5 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 170913.5 ns | 196051.5 ns | 0.87 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1107041.5 ns | 1113854.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1135563 ns | 1110541 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1114729 ns | 1118667 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1106583.5 ns | 1124479.5 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 619139 ns | 644177 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 7608750 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1017151.5 ns | 899376 ns | 1.13 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4375 ns | 5333 ns | 0.82 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5041 ns | 4875 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6645.5 ns | 6750 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4125 ns | 4416 ns | 0.93 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 79693 ns | 83066 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 1295145.5 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 61251 ns | 64020 ns | 0.96 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8875 ns | 8584 ns | 1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8542 ns | 8750 ns | 0.98 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8833 ns | 8875 ns | 1.00 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8459 ns | 8584 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 545788 ns | 552192.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 7756917 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 384358 ns | 372446 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16666.5 ns | 17229.5 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17708 ns | 17250 ns | 1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22041 ns | 21542 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17103.5 ns | 17208.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 62465 ns | 63166 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1325541.5 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 78722 ns | 79573.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224000 ns | 220583 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 214083 ns | 218875 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 217291 ns | 223125 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213125 ns | 219625 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 324107 ns | 329089 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5606125 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 466754 ns | 423777 ns | 1.10 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 583 ns | 1.07 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 708 ns | 625 ns | 1.13 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 875 ns | 833 ns | 1.05 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 834 ns | 0.75 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 18908 ns | 19066 ns | 0.99 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal | 417770.5 ns | | |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU | 30771 ns | 27311 ns | 1.13 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1416 ns | 1417 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1458 ns | 1417 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1583 ns | 1583 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1375 ns | 1375 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 115606.5 ns | 116071.5 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal | 2144521 ns | | |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 125132 ns | 118732 ns | 1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7417 ns | 7375 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5333 ns | 6000 ns | 0.89 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6042 ns | 6083 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10334 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23349 ns | 24482 ns | 0.95 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 859459 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 47121 ns | 52122 ns | 0.90 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 230667 ns | 229541.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 237792 ns | 268417 ns | 0.89 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 233312.5 ns | 241500 ns | 0.97 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 223000 ns | 251250 ns | 0.89 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 186325 ns | 189293 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9066875.5 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 613087.5 ns | 588480 ns | 1.04 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3917 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 3917 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 4042 ns | 0.97 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 22894 ns | 23660.5 ns | 0.97 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal | 445458 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU | 48301 ns | 43502 ns | 1.11 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16792 ns | 16833 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16709 ns | 16834 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17042 ns | 16959 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 17042 ns | 16666 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 184640 ns | 188039 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal | 2172250 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 174143 ns | 166010.5 ns | 1.05 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 922250 ns | 929291 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 763083 ns | 838708 ns | 0.91 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 831458.5 ns | 841584 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 1257625 ns | 1269208 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113637 ns | 113941 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal | 481167 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 244135 ns | 396441 ns | 0.62 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2604333.5 ns | 2610729.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 2062625 ns | 2330541.5 ns | 0.89 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2329458 ns | 2324458 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3564084 ns | 3478334 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 229247 ns | 232093 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal | 2180333 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 742369.5 ns | 630643.5 ns | 1.18 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5458 ns | 6000 ns | 0.91 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 7167 ns | 7042 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8416.5 ns | 7333.5 ns | 1.15 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5583 ns | 6584 ns | 0.85 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 83621 ns | 82915 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 1175958.5 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 59646.5 ns | 62131.5 ns | 0.96 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11062.5 ns | 11875 ns | 0.93 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11791 ns | 11417 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11729.5 ns | 12417 ns | 0.94 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11645.5 ns | 9813 ns | 1.19 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 589604 ns | 585345.5 ns | 1.01 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 7601854 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 410418 ns | 388046 ns | 1.06 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 500 ns | 542 ns | 0.92 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23217 ns | 23179.5 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal | 433917 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU | 48601 ns | 41949 ns | 1.16 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2125 ns | 2083 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2208 ns | 2250 ns | 0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2167 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2084 ns | 2083 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 230692.5 ns | 226220 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal | 2467084 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 181643 ns | 166171 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8667 ns | 8583 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9666 ns | 8542 ns | 1.13 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 10937.5 ns | 10709 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9000 ns | 8833 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 102306 ns | 100758 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 1206104 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 72821 ns | 72575 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17646 ns | 17228.5 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 18583.5 ns | 18583 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18542 ns | 18500 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 17417 ns | 17750 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 559012 ns | 582511 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 5618604 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 381427 ns | 371318.5 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 459 ns | 1.18 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 541 ns | 625 ns | 0.87 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 583 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 459 ns | 500 ns | 0.92 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 34179 ns | 34079 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 653750 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 49111 ns | 44423 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9604 ns | 9479 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9291 ns | 9750 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10020.5 ns | 10333 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9978.5 ns | 9562.5 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 249028 ns | 262881 ns | 0.95 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 5697458 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 365996.5 ns | 351422 ns | 1.04 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396833 ns | 396583 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 213375 ns | 288042 ns | 0.74 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288208 ns | 287666 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 755542 ns | 756167 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 111983 ns | 112987 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal | 513500 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU | 77051.5 ns | 77780.5 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1463417 ns | 1455709 ns | 1.01 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 854959 ns | 1130291 ns | 0.76 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1132083 ns | 1133250 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2481584 ns | 2358000 ns | 1.05 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 198784.5 ns | 202802 ns | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal | 1708563 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 324536 ns | 268682 ns | 1.21 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7312.5 ns | 7354.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7833.5 ns | 8000 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8437 ns | 8687.5 ns | 0.97 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6917 ns | 7750 ns | 0.89 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 129346 ns | 137305 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 1162583 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 58251 ns | 64461 ns | 0.90 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13500 ns | 12812.5 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15604 ns | 15041.5 ns | 1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14791.5 ns | 15353.5 ns | 0.96 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13979.5 ns | 12333.5 ns | 1.13 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 849836 ns | 906003 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 7891354 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 422317 ns | 413373 ns | 1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24833 ns | 26000 ns | 0.96 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 26916.5 ns | 27562.5 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28313 ns | 27042 ns | 1.05 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23958.5 ns | 26021 ns | 0.92 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 185469.5 ns | 186382.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1644917 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 100376.5 ns | 146484 ns | 0.69 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 143417 ns | 146500 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 154042 ns | 157750 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 149042 ns | 129416 ns | 1.15 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 151459 ns | 155812.5 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1011255 ns | 1016426 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8142042 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 522889 ns | 551090 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 76416 ns | 84667 ns | 0.90 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 85000 ns | 80167 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 77958 ns | 78063 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 85500 ns | 80521 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 190193.5 ns | 190829 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1487542 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 125082.5 ns | 124858.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 295958.5 ns | 219479 ns | 1.35 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 290084 ns | 281750 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 309208 ns | 278146 ns | 1.11 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 274062.5 ns | 320791.5 ns | 0.85 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1039232 ns | 1021778 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9001333 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 692376 ns | 643542 ns | 1.08 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12417 ns | 13125 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 14083 ns | 13666.5 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 15333.5 ns | 14041.5 ns | 1.09 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12542 ns | 13459 ns | 0.93 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 136592 ns | 136741.5 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 1137437 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 234694 ns | 226473 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 24292 ns | 27083.5 ns | 0.90 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26875 ns | 26125 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 28020.5 ns | 27833.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 24416.5 ns | 26604.5 ns | 0.92 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 907722.5 ns | 919419 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 7852375 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 692131.5 ns | 633979.5 ns | 1.09 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 14167 ns | 14000 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 15041.5 ns | 14708.5 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 17166 ns | 17583.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 13833 ns | 14792 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 118944.5 ns | 119245 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 1213062.5 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 238604 ns | 233827 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25458 ns | 26875 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 27208 ns | 25958.5 ns | 1.05 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26604.5 ns | 26583 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26417 ns | 26541 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 664219 ns | 676576 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 5824834 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 677391 ns | 589361.5 ns | 1.15 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 182000 ns | 182375 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 183667 ns | 183208 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 186583 ns | 185583 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 181542 ns | 183459 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 101699.5 ns | 102955 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1332208 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 234523 ns | 232900.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 585708 ns | 583500 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 591417 ns | 595083 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 597812.5 ns | 597520.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 592625 ns | 624167 ns | 0.95 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 490131.5 ns | 493717.5 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 5953104 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 713921 ns | 657463 ns | 1.09 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6625 ns | 6750 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8083.5 ns | 7645.5 ns | 1.06 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9166.5 ns | 8167 ns | 1.12 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6396 ns | 7542 ns | 0.85 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 137141.5 ns | 135360 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 1158916 ns | | |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 59311 ns | 62767 ns | 0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14375 ns | 15375 ns | 0.93 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14417 ns | 14917 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15937.5 ns | 16187.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13729 ns | 15292 ns | 0.90 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 875798.5 ns | 885601 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 7574750 ns | | |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 403086 ns | 392428 ns | 1.03 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 6147166.5 ns | 6153416.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 3224312.5 ns | 6381624.5 ns | 0.51 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 6368937.5 ns | 6371521 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 11912208 ns | 11926500 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 347269 ns | 346494 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/GPU/Metal | 1592791 ns | | |
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU | 303595 ns | 392843 ns | 0.77 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 19092083 ns | 19117208.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 11115167 ns | 19977084 ns | 0.56 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 19976125 ns | 19957021 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 36699999.5 ns | 36558729 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1027305 ns | 1005649 ns | 1.02 |
batchedmm(512, Bsize=4)/zygote/GPU/Metal | 7852208 ns | | |
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU | 1169973.5 ns | 1105996 ns | 1.06 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1791 ns |
1750 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1875 ns |
1834 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1833 ns |
1833 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1792 ns |
1834 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
23540 ns |
23503 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal |
455583.5 ns |
||
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
209943 ns |
197739 ns |
1.06 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4834 ns |
4834 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4917 ns |
4958 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4916 ns |
4917 ns |
1.00 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4833 ns |
4916 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
269515.5 ns |
276337.5 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal |
2631791 ns |
||
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
625449 ns |
502208 ns |
1.25 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
7916.5 ns |
8062.5 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
8646 ns |
8416 ns |
1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
9166 ns |
9459 ns |
0.97 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
7312.5 ns |
8145.5 ns |
0.90 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
116497.5 ns |
115989 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal |
1187542 ns |
||
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
68391 ns |
71584 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
11479 ns |
11562.5 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
13041 ns |
12438 ns |
1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
12520.5 ns |
12541 ns |
1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10958.5 ns |
12875 ns |
0.85 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
599205 ns |
604320 ns |
0.99 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal |
5699500 ns |
||
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
356370.5 ns |
353160 ns |
1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
250 ns |
250 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
333 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
250 ns |
292 ns |
0.86 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
22757.5 ns |
22648 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal |
433916 ns |
||
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU |
47730 ns |
43592 ns |
1.09 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2875 ns |
2917 ns |
0.99 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
3042 ns |
2917 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
3167 ns |
3041 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
3125 ns |
3000 ns |
1.04 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
194133 ns |
197848 ns |
0.98 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal |
2121916 ns |
||
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU |
163362 ns |
146363.5 ns |
1.12 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
14146 ns |
14604 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
16000 ns |
15458.5 ns |
1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
15812.5 ns |
15896 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
14479 ns |
15000.5 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
117303 ns |
117481 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
1151167 ns |
||
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
236073 ns |
236802 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
25624.5 ns |
26500 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
26313 ns |
25625 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
26104.5 ns |
26041.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
25125 ns |
25958 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
549722 ns |
561217 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
5157541.5 ns |
||
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
652604.5 ns |
566814 ns |
1.15 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4167 ns |
4291 ns |
0.97 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4209 ns |
4209 ns |
1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4208 ns |
4208 ns |
1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4167 ns |
4375 ns |
0.95 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
24489 ns |
24363 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal |
448083.5 ns |
||
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU |
47790.5 ns |
44754 ns |
1.07 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16167 ns |
16250 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16042 ns |
16125 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16167 ns |
16292 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16083 ns |
16416 ns |
0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
317717 ns |
321227 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal |
2428291.5 ns |
||
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU |
208453 ns |
190786 ns |
1.09 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5833 ns |
5916 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5792 ns |
5875 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5833 ns |
5792 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5834 ns |
5750 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
34765 ns |
34700.5 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
648041 ns |
||
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
206573 ns |
200434 ns |
1.03 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
22792 ns |
22292 ns |
1.02 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
20729.5 ns |
21292 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
22084 ns |
21792 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
21292 ns |
22208 ns |
0.96 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
280273 ns |
283315.5 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
6096104 ns |
||
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
688371 ns |
598489 ns |
1.15 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
60729 ns |
59729 ns |
1.02 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
60291 ns |
64229 ns |
0.94 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
67083 ns |
66833 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
50958 ns |
50958 ns |
1 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66493 ns |
66908 ns |
0.99 |
batchedmm(16, Bsize=512)/forward/GPU/Metal |
14948959 ns |
||
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU |
100052 ns |
115781 ns |
0.86 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
203416.5 ns |
198937.5 ns |
1.02 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
138583 ns |
144625 ns |
0.96 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
159875 ns |
167291.5 ns |
0.96 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
223083 ns |
303249.5 ns |
0.74 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
209048 ns |
208882.5 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/GPU/Metal |
46390583 ns |
||
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU |
588303.5 ns |
529218 ns |
1.11 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
84459 ns |
84291 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
80541.5 ns |
83875 ns |
0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
88708 ns |
88125 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
90875 ns |
81562.5 ns |
1.11 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
192723 ns |
193291 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2030916 ns |
||
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
194182.5 ns |
182771 ns |
1.06 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1931042 ns |
1875250 ns |
1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1931625 ns |
1914792 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1917958 ns |
1928375 ns |
0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1918958 ns |
1916625 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
506602.5 ns |
505449 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9124854.5 ns |
||
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1073165 ns |
857542 ns |
1.25 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
291 ns |
292 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
21451 ns |
21535 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal |
498250 ns |
||
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU |
43020 ns |
36788 ns |
1.17 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1792 ns |
1833 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1833 ns |
1875 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1834 ns |
1834 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1792 ns |
1834 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
246531.5 ns |
243998 ns |
1.01 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal |
2248666 ns |
||
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
184137.5 ns |
166221 ns |
1.11 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
9124.5 ns |
11229 ns |
0.81 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
10375 ns |
9791.5 ns |
1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
11562.5 ns |
11125 ns |
1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
8458 ns |
10479.5 ns |
0.81 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
114777.5 ns |
114440.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal |
1126333 ns |
||
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
235903 ns |
233386 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9520.5 ns |
10458 ns |
0.91 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
11104.5 ns |
10250 ns |
1.08 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
10167 ns |
9917 ns |
1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9479.5 ns |
10145.5 ns |
0.93 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
488308 ns |
491014 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal |
5077500 ns |
||
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
636239 ns |
561274 ns |
1.13 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58209 ns |
58375 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
38375 ns |
46917 ns |
0.82 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46417 ns |
46625 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
81875 ns |
83708 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
38284 ns |
38960 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1196146 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
78511 ns |
72876 ns |
1.08 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1889334 ns |
1897625 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1945875 ns |
1964750 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1975333 ns |
1985854 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1897937 ns |
1899833 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
210023.5 ns |
212091 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11022958 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1013399.5 ns |
994598 ns |
1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
266958.5 ns |
266354 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
269250 ns |
269729 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
278292 ns |
271041.5 ns |
1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
266729.5 ns |
268271 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
193472 ns |
193629.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1544167 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
282794 ns |
271156 ns |
1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
668791.5 ns |
693917 ns |
0.96 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
589292 ns |
692541 ns |
0.85 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
676917 ns |
687708 ns |
0.98 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
671958 ns |
593833 ns |
1.13 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
988709 ns |
991006 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9169229 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
902732.5 ns |
863163 ns |
1.05 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2174145.5 ns |
2180687.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2220541 ns |
2214917 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2196708.5 ns |
2212041 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2206021 ns |
2208479 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
160810.5 ns |
154859 ns |
1.04 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1440791 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
406240 ns |
451844.5 ns |
0.90 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5497458 ns |
5453666 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5589291 ns |
5518208 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5498062 ns |
5522375 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5497542 ns |
5522209 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
929759.5 ns |
930442 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9921167 ns |
||
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1548081.5 ns |
1495900 ns |
1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
995208 ns |
999875 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
838417 ns |
913333 ns |
0.92 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
904916 ns |
912895.5 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
1326042 ns |
1334562.5 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
46239 ns |
46425 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal |
578625 ns |
||
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU |
245133 ns |
399125 ns |
0.61 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
2611792 ns |
2620166 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
2048166 ns |
2328541 ns |
0.88 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
2326917 ns |
2329395.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
3610166 ns |
3468667 ns |
1.04 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
256032 ns |
247327 ns |
1.04 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal |
2447708 ns |
||
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU |
771420 ns |
658089 ns |
1.17 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57875 ns |
58083 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
38250 ns |
46625 ns |
0.82 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
45875 ns |
46542 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
77750 ns |
84000 ns |
0.93 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
27988 ns |
29007 ns |
0.96 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1149062.5 ns |
||
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
75421 ns |
73392 ns |
1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2022250 ns |
2036000 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2104417 ns |
2096916 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2087000 ns |
2092208 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2001250 ns |
1992542 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
223076 ns |
225482 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
11110333 ns |
||
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1038854 ns |
1028937.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58375 ns |
58417 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
38666 ns |
47208 ns |
0.82 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47042 ns |
47375 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
78292 ns |
83541 ns |
0.94 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
48142 ns |
48550 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1133500 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
61675.5 ns |
71593.5 ns |
0.86 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1925250.5 ns |
1926354.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1984875 ns |
1987291 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1968958 ns |
1972375 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1876250 ns |
1890375 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
230053 ns |
231977 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9828354.5 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
918152 ns |
931260 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
291 ns |
333 ns |
0.87 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
34160 ns |
33752 ns |
1.01 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
644270.5 ns |
||
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
48721 ns |
44343 ns |
1.10 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6875 ns |
6542 ns |
1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6958 ns |
7187.5 ns |
0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7500 ns |
7625 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7292 ns |
6209 ns |
1.17 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
200629 ns |
203191.5 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
5584375 ns |
||
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
364295 ns |
350064 ns |
1.04 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
250 ns |
250 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
291 ns |
292 ns |
1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
250 ns |
292 ns |
0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
32001.5 ns |
32755 ns |
0.98 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal |
377063 ns |
||
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU |
38271 ns |
36558 ns |
1.05 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
3042 ns |
3375 ns |
0.90 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
3250 ns |
3333 ns |
0.98 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
3291 ns |
3000 ns |
1.10 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2792 ns |
3208 ns |
0.87 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
181967 ns |
185298.5 ns |
0.98 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal |
1820916 ns |
||
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU |
159762 ns |
144480 ns |
1.11 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1446021 ns |
1465479.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1409541 ns |
1410667 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1415625 ns |
1427770.5 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1408250 ns |
1410417 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
134710 ns |
136084 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
2868875 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
322334 ns |
354201 ns |
0.91 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5013500 ns |
5012687.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5036542 ns |
5023959 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5026520.5 ns |
5034167 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5021667 ns |
5021667 ns |
1 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
671717 ns |
673868 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
10332500 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1469159.5 ns |
1145811 ns |
1.28 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
49838709 ns |
49876625 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
25973958 ns |
35509791 ns |
0.73 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
35497958 ns |
35514916 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
97460875 ns |
97103375 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1597509 ns |
1608361 ns |
0.99 |
batchedmm(512, Bsize=32)/forward/GPU/Metal |
10641729.5 ns |
||
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU |
1049398.5 ns |
1576726 ns |
0.67 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
154517833 ns |
154443875 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
89364146 ns |
112320833.5 ns |
0.80 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
112347166 ns |
112445042 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
299472874.5 ns |
296071750 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6480598 ns |
6483041.5 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/Metal |
77617584 ns |
||
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU |
5559482 ns |
6222525 ns |
0.89 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
47541 ns |
48042 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
47875 ns |
47667 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
48541.5 ns |
47916 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
48333 ns |
47583 ns |
1.02 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
19684.5 ns |
19626 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal |
496750.5 ns |
||
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU |
25931 ns |
28463 ns |
0.91 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
50750 ns |
50583.5 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
49958 ns |
50167 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
51229.5 ns |
51000 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
50125 ns |
50667 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
244616 ns |
245482 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal |
2284458 ns |
||
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU |
147992 ns |
140773 ns |
1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
9083 ns |
8667 ns |
1.05 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
10020.5 ns |
8750 ns |
1.15 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
10375 ns |
11167 ns |
0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
8125 ns |
9666.5 ns |
0.84 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
117828 ns |
118847 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal |
1194250 ns |
||
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU |
234703 ns |
237489 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9875 ns |
10791 ns |
0.92 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11166.5 ns | 10458 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10395.5 ns | 10333 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10709 ns | 0.95 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 579910 ns | 584310 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5757375 ns | | |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 652159 ns | 572469 ns | 1.14 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8750 ns | 9125 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9833.5 ns | 9896 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10583 ns | 10667 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 8334 ns | 9292 ns | 0.90 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 115390.5 ns | 115727.5 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 1164979.5 ns | | |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 69241 ns | 73908 ns | 0.94 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13625 ns | 13874.5 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 15291.5 ns | 13750 ns | 1.11 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 15208 ns | 14333 ns | 1.06 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 14937.5 ns | 14375.5 ns | 1.04 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 553583 ns | 559680.5 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5153750 ns | | |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 343225 ns | 337060 ns | 1.02 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 958 ns | 959 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 1042 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1083 ns | 1042 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 958 ns | 1083 ns | 0.88 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 33679 ns | 33675 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 640334 ns | | |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 205133 ns | 206546 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9583 ns | 8917 ns | 1.07 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8750 ns | 8437.5 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9166 ns | 8791 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7958.5 ns | 9250 ns | 0.86 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 222940.5 ns | 225862.5 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5834209 ns | | |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 658098 ns | 576667 ns | 1.14 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23292 ns | 23667 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23666 ns | 23292 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 24666.5 ns | 23813 ns | 1.04 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23166.5 ns | 23666 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 19737 ns | 20529 ns | 0.96 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 445812.5 ns | | |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 184932 ns | 187811 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 53250.5 ns | 53583.5 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 54333 ns | 52145.5 ns | 1.04 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 53459 ns | 53584 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52604.5 ns | 53667 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 258308 ns | 260507 ns | 0.99 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 2423792 ns | | |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 592777 ns | 549086 ns | 1.08 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1410542 ns | 1444541.5 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1438875 ns | 1445459 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1412000 ns | 1414666.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1400812.5 ns | 1401396 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194713 ns | 195236 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2079417 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 345564 ns | 321861 ns | 1.07 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5016416 ns | 5007208 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5028229 ns | 5006958 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5008146 ns | 5015812.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5011500.5 ns | 5020500 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 508710 ns | 510108 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9265145.5 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1202145 ns | 1117899 ns | 1.08 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 828840208 ns | 828285625 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 413910521 ns | 541921375 ns | 0.76 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 539860417 ns | 542359625 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1566139499.5 ns | 1558200021 ns | 1.01 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22553762 ns | 22535776.5 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/GPU/Metal | 108020292 ns | | |
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14557060 ns | 12173703 ns | 1.20 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 3600174083 ns | 3903695416 ns | 0.92 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1495447875 ns | 1771980416 ns | 0.84 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1779739000 ns | 1773568584 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 6017463208 ns | 5228367459 ns | 1.15 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118952088 ns | 119027931 ns | 1.00 |
batchedmm(512, Bsize=512)/zygote/GPU/Metal | 2572718125 ns | | |
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 88160332.5 ns | 68450588 ns | 1.29 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 76500 ns | 75916.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 76874.5 ns | 87437.5 ns | 0.88 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 80396 ns | 84417 ns | 0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76250 ns | 81083 ns | 0.94 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 192054.5 ns | 192111.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1498709 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 106821 ns | 126607 ns | 0.84 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 272396 ns | 282646 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 294979 ns | 283042 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 275666.5 ns | 236875 ns | 1.16 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 294020.5 ns | 276458 ns | 1.06 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 990094.5 ns | 995625 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8688959 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 628613 ns | 612404 ns | 1.03 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199338479 ns | 199947208.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 103671062 ns | 139420500 ns | 0.74 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 139137542 ns | 138954958 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 391597500 ns | 389188834 ns | 1.01 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5816800 ns | 5832800 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/Metal | 33632041.5 ns | | |
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3564305 ns | 2958637.5 ns | 1.20 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 618111458.5 ns | 618298396 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 352015458.5 ns | 439277916 ns | 0.80 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 439011437.5 ns | 439303895.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1195193792 ns | 1200068000 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26696499.5 ns | 26614249.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/Metal | 111449958 ns | | |
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 21986711 ns | 16011697.5 ns | 1.37 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7458 ns | 7417 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5334 ns | 6125 ns | 0.87 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6209 ns | 6125 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 9792 ns | 10125 ns | 0.97 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 26288 ns | 26885 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 828729 ns | | |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48740 ns | 54341 ns | 0.90 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 248833 ns | 214083 ns | 1.16 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 230291.5 ns | 232833 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 225333.5 ns | 230000 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214208 ns | 207709 ns | 1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 213874 ns | 215596 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9173729.5 ns | | |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 521576 ns | 546726.5 ns | 0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7791 ns | 7417 ns | 1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8250 ns | 8875.5 ns | 0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9875 ns | 10750 ns | 0.92 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 7750 ns | 10459 ns | 0.74 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 114289 ns | 111291 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 1112625 ns | | |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 70720 ns | 72956 ns | 0.97 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7917 ns | 7792 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9292 ns | 7833.5 ns | 1.19 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8937.5 ns | 8125 ns | 1.10 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8375 ns | 1 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 486362 ns | 492517.5 ns | 0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5044854.5 ns | | |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 315959 ns | 322723 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 459 ns | 417 ns | 1.10 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 500 ns | 1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 500 ns | 459 ns | 1.09 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 500 ns | 583 ns | 0.86 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 25338 ns | 25272 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 726000 ns | | |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 46771 ns | 45194 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10604.5 ns | 9646 ns | 1.10 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10771 ns | 9541 ns | 1.13 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10125 ns | 11104 ns | 0.91 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10500 ns | 10333 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 243207 ns | 247083 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 6344458 ns | | |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 387615 ns | 383457 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 351084 ns | 351000 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 360916.5 ns | 354459 ns | 1.02 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352187 ns | 352250 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 353667 ns | 351625 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 22345 ns | 23168 ns | 0.96 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 310937.5 ns | | |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 189077.5 ns | 198701 ns | 0.95 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 786583.5 ns | 826000 ns | 0.95 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 799959 ns | 820458 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 807250 ns | 822083.5 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 799146.5 ns | 827750 ns | 0.97 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 216599 ns | 214195.5 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2720084 ns | | |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 607873 ns | 578901 ns | 1.05 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5667 ns | 5229.5 ns | 1.08 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 6333 ns | 5875 ns | 1.08 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 7250 ns | 6958.5 ns | 1.04 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 3917 ns | 4667 ns | 0.84 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17357 ns | 17091 ns | 1.02 |
batchedmm(16, Bsize=32)/forward/GPU/Metal | 1903500 ns | | |
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 71671 ns | 74219 ns | 0.97 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 12583.5 ns | 13458.5 ns | 0.93 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 11229.5 ns | 10625 ns | 1.06 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 12292 ns | 13041 ns | 0.94 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 17291 ns | 18542 ns | 0.93 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 203999.5 ns | 202239.5 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/Metal | 5059625 ns | | |
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 368794 ns | 330217 ns | 1.12 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 39542 ns | 39833.5 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 50166.5 ns | 51209 ns | 0.98 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 52542 ns | 52458.5 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13917 ns | 13459 ns | 1.03 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 19944.5 ns | 19993 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/GPU/Metal | 4970958 ns | | |
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 86896 ns | 99666.5 ns | 0.87 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 36812.5 ns | 38229.5 ns | 0.96 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 29292 ns | 35125 ns | 0.83 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 32875 ns | 34187.5 ns | 0.96 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 78541 ns | 59417 ns | 1.32 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 180178 ns | 178995.5 ns | 1.01 |
batchedmm(16, Bsize=128)/zygote/GPU/Metal | 13303396 ns | | |
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 412350 ns | 362888 ns | 1.14 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3625 ns | 3500 ns | 1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3667 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3833 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3417 ns | 3709 ns | 0.92 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 19299 ns | 19015 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 489416 ns | | |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 28800 ns | 29645 ns | 0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4250 ns | 4291 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4417 ns | 4500 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4500 ns | 4458 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4167 ns | 4292 ns | 0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 194770.5 ns | 194611 ns | 1.00 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 2153291.5 ns | | |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 136382 ns | 126757 ns | 1.08 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4833 ns | 5916 ns | 0.82 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6375 ns | 5062.5 ns | 1.26 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6771 ns | 6375 ns | 1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4417 ns | 4625 ns | 0.96 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 140113.5 ns | 138395 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 1172542 ns | | |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 59621 ns | 65944 ns | 0.90 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 9625 ns | 0.92 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9000 ns | 8500 ns | 1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8833 ns | 9333 ns | 0.95 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8791 ns | 10666 ns | 0.82 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 809012 ns | 807046.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 7637459 ns | | |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 386675 ns | 378457 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204125 ns | 207583 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 211292 ns | 209042 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 211167 ns | 213208 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 200583 ns | 204125 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36190 ns | 35332 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 844791.5 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 205402 ns | 203930.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 612042 ns | 603500 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 633416.5 ns | 623479.5 ns | 1.02 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 625250 ns | 658604.5 ns | 0.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 592250 ns | 586375 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 255705 ns | 254148 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8231270.5 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 797760 ns | 767213 ns | 1.04 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3310375 ns | 3324167 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1778188 ns | 2328667 ns | 0.76 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2329291.5 ns | 2334417 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6304709 ns | 6324542 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 204430 ns | 206559 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/GPU/Metal | 6035916 ns | | |
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 217792.5 ns | 377105 ns | 0.58 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11442083.5 ns | 11496208.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 6658375 ns | 8303562.5 ns | 0.80 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8339708.5 ns | 8348416.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21081083 ns | 21193020.5 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 735864.5 ns | 736080.5 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/Metal | 20279917 ns | | |
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1067533 ns | 2044820.5 ns | 0.52 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5791 ns | 3917 ns | 1.48 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6292 ns | 5292 ns | 1.19 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7167 ns | 6292 ns | 1.14 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4500 ns | 7125 ns | 0.63 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 131372 ns | 129442 ns | 1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 1175458 ns | | |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 53861 ns | 57067 ns | 0.94 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8375 ns | 8500 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8333 ns | 7375 ns | 1.13 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9250 ns | 7833 ns | 1.18 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9000 ns | 8291.5 ns | 1.09 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 707753 ns | 711410 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 7292583 ns | | |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 367029.5 ns | 364581 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 121562.5 ns | 117312.5 ns | 1.04 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 124917 ns | 101437.5 ns | 1.23 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 101250 ns | 102687.5 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 125062.5 ns | 98458.5 ns | 1.27 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 148668.5 ns | 149616 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2918000 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 192532 ns | 210473 ns | 0.91 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2031542 ns | 2008250 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1950834 ns | 2022459 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2007750 ns | 2039937.5 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2030959 ns | 2036625 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 668590 ns | 661994.5 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10443458 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1253100 ns | 963831 ns | 1.30 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 34167 ns | 33416 ns | 1.02 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 33666 ns | 35459 ns | 0.95 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 34375 ns | 34709 ns | 0.99 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 542 ns | 750 ns | 0.72 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15591 ns | 15265 ns | 1.02 |
batchedmm(2, Bsize=4)/forward/GPU/Metal | 550104 ns | | |
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 70251 ns | 78737 ns | 0.89 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 3000 ns | 3959 ns | 0.76 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3604.5 ns | 2917 ns | 1.24 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 4375 ns | 4708 ns | 0.93 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2833 ns | 3666 ns | 0.77 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 136806 ns | 136137.5 ns | 1.00 |
batchedmm(2, Bsize=4)/zygote/GPU/Metal | 1196250 ns | | |
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 337614 ns | 321796.5 ns | 1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7250 ns | 1 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5334 ns | 6042 ns | 0.88 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 3667 ns | 6083 ns | 0.60 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10042 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35320 ns | 34970 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 846271 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48031 ns | 56516 ns | 0.85 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221208 ns | 221584 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 231083.5 ns | 220959 ns | 1.05 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 222042 ns | 234583 ns | 0.95 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 207104 ns | 207333 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 238515 ns | 237194 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7909791 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 509876 ns | 540189 ns | 0.94 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3750 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3750 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3833 ns | 0.98 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3958 ns | 0.94 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22010 ns | 21681 ns | 1.02 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 480708 ns | | |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 42240 ns | 39383 ns | 1.07 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14458 ns | 14458 ns | 1 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14250 ns | 14458 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14667 ns | 14541 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14459 ns | 14625 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 297003.5 ns | 297631.5 ns | 1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 2355083.5 ns | | |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 194062 ns | 190215 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 141896 ns | 129834 ns | 1.09 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 131583 ns | 118271 ns | 1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 106125 ns | 106750 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 101396 ns | 101666.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132246.5 ns | 150106 ns | 0.88 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2848042 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 195502 ns | 241781 ns | 0.81 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1921791.5 ns | 1921708.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1941958 ns | 1924583 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1922084 ns | 1932000 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1925000 ns | 1922750 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 660860 ns | 653385 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10632334 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1214399.5 ns | 928325 ns | 1.31 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18312.5 ns | 18875 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19083 ns | 17292 ns | 1.10 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21334 ns | 20937 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 17770.5 ns | 18459 ns | 0.96 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 104874 ns | 104073.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1354708 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 75806 ns | 91301 ns | 0.83 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 226750 ns | 239083.5 ns | 0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 217833 ns | 224791 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229000.5 ns | 224958.5 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257437.5 ns | 218500 ns | 1.18 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 497280 ns | 493640.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6075250 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 479116 ns | 439080 ns | 1.09 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 25042 ns | 26166 ns | 0.96 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 26291 ns | 29167 ns | 0.90 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 28417 ns | 28958 ns | 0.98 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1125 ns | 1416 ns | 0.79 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16284 ns | 15781 ns | 1.03 |
batchedmm(16, Bsize=4)/forward/GPU/Metal | 541312.5 ns | | |
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 87411 ns | 72756 ns | 1.20 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 5292 ns | 6208 ns | 0.85 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5459 ns | 5041 ns | 1.08 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 6417 ns | 6875 ns | 0.93 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 5500 ns | 6417 ns | 0.86 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 201005.5 ns | 199155.5 ns | 1.01 |
batchedmm(16, Bsize=4)/zygote/GPU/Metal | 2020334 ns | | |
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 390754 ns | 324216 ns | 1.21 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 221145.5 ns | 221875 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222875 ns | 223375 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 223666 ns | 225375 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 222729.5 ns | 223542 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 218348.5 ns | 216803 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1683750 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 269733 ns | 267771 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 509020.5 ns | 508542 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 565792 ns | 511042 ns | 1.11 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 512270.5 ns | 509500 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 500333.5 ns | 557354 ns | 0.90 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1028150 ns | 1017707.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8579625 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 850900 ns | 811461 ns | 1.05 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18875 ns | 19104 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20271 ns | 19584 ns | 1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 21417 ns | 22063 ns | 0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19625 ns | 19792 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 111806.5 ns | 111072 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1458875 ns | | |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 77311 ns | 90009 ns | 0.86 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
220979 ns |
221854 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
239834 ns |
220250 ns |
1.09 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
224916 ns |
218166.5 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
217249.5 ns |
220146 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
711348 ns |
700847.5 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7148708.5 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
538287 ns |
494855 ns |
1.09 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
6375 ns |
6292 ns |
1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
7208 ns |
7000 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
8271 ns |
7375 ns |
1.12 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5375 ns |
6834 ns |
0.79 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
133581 ns |
130925 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
1164750 ns |
||
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
66341 ns |
63498 ns |
1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
11354 ns |
11041.5 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
11375 ns |
9959 ns |
1.14 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12562.5 ns |
10895.5 ns |
1.15 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
12563 ns |
10459 ns |
1.20 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
770196 ns |
770540.5 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
7229125 ns |
||
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
386869.5 ns |
375452 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5250 ns |
4104 ns |
1.28 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
5375 ns |
7041 ns |
0.76 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7334 ns |
7166 ns |
1.02 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4792 ns |
6166 ns |
0.78 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
135271.5 ns |
131485.5 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal |
1193459 ns |
||
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU |
58991 ns |
62607 ns |
0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7667 ns |
7416.5 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7959 ns |
7750 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7854.5 ns |
8125 ns |
0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7520.5 ns |
8083 ns |
0.93 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
737431 ns |
737449 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal |
7609209 ns |
||
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
396605 ns |
380902 ns |
1.04 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
14453396 ns |
14481917 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
7701875 ns |
10107542 ns |
0.76 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
10103083 ns |
10094750 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
27738458 ns |
27859959 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
531399 ns |
533975 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/Metal |
22191895.5 ns |
||
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU |
392545 ns |
867906.5 ns |
0.45 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
46327270.5 ns |
46387667 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
26716104 ns |
33363354 ns |
0.80 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
33470417 ns |
33478875 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
85517417 ns |
85752792 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
2856976 ns |
2651799 ns |
1.08 |
batchedmm(128, Bsize=512)/zygote/GPU/Metal |
88528708.5 ns |
||
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU |
3296365 ns |
5191497.5 ns |
0.63 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
185583 ns |
185208.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
187042 ns |
185916 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
187291.5 ns |
188604 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
184792 ns |
187271 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
103848 ns |
117719.5 ns |
0.88 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1537500 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
231333 ns |
236051 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
598791.5 ns |
634875 ns |
0.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
599958 ns |
627937.5 ns |
0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
602250 ns |
601166 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
587896 ns |
587625 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
713701 ns |
694993 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7615520.5 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
788674 ns |
698169.5 ns |
1.13 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
667 ns |
541 ns |
1.23 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
584 ns |
625 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
584 ns |
1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
666 ns |
584 ns |
1.14 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
32643 ns |
31826 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal |
664895.5 ns |
||
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
47540 ns |
48104.5 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9041.5 ns |
9541 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
12083 ns |
9687.5 ns |
1.25 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13250 ns |
10542 ns |
1.26 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10792 ns |
10938 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
278611.5 ns |
276120 ns |
1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal |
6110667 ns |
||
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
372684 ns |
371078 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
26250 ns |
26250 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
26292 ns |
26333 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
26334 ns |
26583 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
26250 ns |
26458 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA |
23639 ns |
22942 ns |
1.03 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal |
423354.5 ns |
||
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU |
210507.5 ns |
206526 ns |
1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
67375 ns |
67125 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
67375 ns |
67333 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
68333 ns |
68792 ns |
0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66959 ns |
66875 ns |
1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA |
277123 ns |
273858 ns |
1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal |
2163167 ns |
||
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU |
607047 ns |
554115 ns |
1.10 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
204083 ns |
207166 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
210917 ns |
211667 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
209917 ns |
211167 ns |
0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
199709 ns |
202875 ns |
0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
27902 ns |
27563 ns |
1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
852708.5 ns |
||
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
205893 ns |
206546 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
610813 ns |
609937.5 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
632959 ns |
669750 ns |
0.95 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
635396 ns |
664812.5 ns |
0.96 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
588854.5 ns |
609042 ns |
0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
239352 ns |
233231.5 ns |
1.03 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
9235709 ns |
||
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
839150.5 ns |
798562 ns |
1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
649417 ns |
664875 ns |
0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
658250 ns |
636687.5 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
651458 ns |
648791.5 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
650583 ns |
629792 ns |
1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
189912.5 ns |
185894.5 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1398604 ns |
||
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
251273 ns |
349393 ns |
0.72 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2235625 ns |
2244229 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2311187.5 ns |
2225354 ns |
1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2238000 ns |
2256708 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2245375 ns |
2271792 ns |
0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
922866 ns |
900927 ns |
1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9537166.5 ns |
||
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1356111 ns |
1235829 ns |
1.10 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
20437.5 ns |
19333 ns |
1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
20708 ns |
21166.5 ns |
0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
22042 ns |
22375 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19396 ns |
19958 ns |
0.97 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
111717 ns |
106770.5 ns |
1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal |
1470978.5 ns |
||
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
75441 ns |
89387 ns |
0.84 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
233271 ns |
227250 ns |
1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
232958 ns |
262312.5 ns |
0.89 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
233167 ns |
231250 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
221208.5 ns |
222770.5 ns |
0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
709044 ns |
700957 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
7671770.5 ns |
||
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
555096.5 ns |
516550 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
667 ns |
500 ns |
1.33 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
583 ns |
584 ns |
1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
584 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
625 ns |
584 ns |
1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
23540 ns |
22928 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal |
727375 ns |
||
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
47941 ns |
44243 ns |
1.08 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
12041 ns |
9583 ns |
1.26 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
12125 ns |
9958.5 ns |
1.22 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
17229.5 ns |
13229.5 ns |
1.30 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
11166 ns |
10875 ns |
1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
260166 ns |
258192 ns |
1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal |
6474000 ns |
||
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
397565 ns |
395479 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
8500 ns |
8062.5 ns |
1.05 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
9208 ns |
9208 ns |
1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
9833 ns |
10459 ns |
0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
7167 ns |
8333 ns |
0.86 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
116262.5 ns |
112863.5 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal |
1132416 ns |
||
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU |
67351 ns |
72315 ns |
0.93 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7417 ns |
7500 ns |
0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
10542 ns |
7750 ns |
1.36 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9000 ns |
14875 ns |
0.61 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
14959 ns |
8917 ns |
1.68 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
480097.5 ns |
472419 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal |
4769916.5 ns |
||
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU |
318874 ns |
321811 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
2208.5 ns |
1979.5 ns |
1.12 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
2458 ns |
2500 ns |
0.98 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
2625 ns |
2542 ns |
1.03 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
2083 ns |
2416 ns |
0.86 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA |
19599 ns |
19845 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal |
420458.5 ns |
||
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU |
189912 ns |
191508 ns |
0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
6750 ns |
6666 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
8291 ns |
6459 ns |
1.28 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
7334 ns |
7292 ns |
1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
6791 ns |
7292 ns |
0.93 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA |
212249 ns |
208409 ns |
1.02 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal |
2347167 ns |
||
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU |
580124 ns |
543621 ns |
1.07 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) |
749667 ns |
754167 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) |
749000 ns |
751000 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) |
747625 ns |
749375 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) |
748645.5 ns |
747104 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA |
22873 ns |
22303 ns |
1.03 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal |
324209 ns |
||
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU |
33080 ns |
47829 ns |
0.69 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) |
792750 ns |
792250 ns |
1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) |
791625 ns |
811750 ns |
0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) |
799541.5 ns |
789500 ns |
1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) |
787417 ns |
794229.5 ns |
0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA |
210255.5 ns |
206590.5 ns |
1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal |
2648354.5 ns |
||
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU |
231762 ns |
233541 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7375 ns |
7250 ns |
1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5250 ns |
5917 ns |
0.89 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5958 ns |
6000 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10083 ns |
10209 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
33271 ns |
32976 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal |
852917 ns |
||
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU |
50031 ns |
57267 ns |
0.87 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
230708 ns |
228458.5 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
234646 ns |
269270.5 ns |
0.87 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
235812 ns |
235021 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
252042 ns |
213146 ns |
1.18 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
258276 ns |
254662 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal |
8291125 ns |
||
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU |
523734 ns |
552652 ns |
0.95 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
12541.5 ns |
12417 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
13312.5 ns |
13250 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
14916 ns |
14458 ns |
1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
10771 ns |
13000 ns |
0.83 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
134784 ns |
131273.5 ns |
1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal |
1166833 ns |
||
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU |
235912 ns |
231363 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
24562 ns |
24854.5 ns |
0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
24916.5 ns |
24916 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
24958 ns |
25542 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
25479.5 ns |
24458 ns |
1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
822351 ns |
813324 ns |
1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal |
7673667 ns |
||
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
685105 ns |
634495 ns |
1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
8958 ns |
8875 ns |
1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
9749.5 ns |
9958 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
11125 ns |
11167 ns |
1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
8854.5 ns |
9542 ns |
0.93 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
120470.5 ns |
116553 ns |
1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal |
1250708 ns |
||
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU |
73741 ns |
74930 ns |
0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
19854 ns |
13770.5 ns |
1.44 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15750 ns |
14917 ns |
1.06 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
14791 ns |
15916 ns |
0.93 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
15229.5 ns |
16437.5 ns |
0.93 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
633748 ns |
621843 ns |
1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal |
5614500 ns |
||
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
370283 ns |
356836 ns |
1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
9437.5 ns |
9145.5 ns |
1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
9937.5 ns |
9354 ns |
1.06 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
11667 ns |
10750 ns |
1.09 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
8416.5 ns |
10125 ns |
0.83 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
119362 ns |
116468 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal |
1160958 ns |
||
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU |
70370.5 ns |
74383.5 ns |
0.95 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13021 ns |
12916 ns |
1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
15896 ns |
12959 ns |
1.23 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13625 ns |
20541 ns |
0.66 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
20083 ns |
14500 ns |
1.39 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
524705 ns |
515709 ns |
1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal |
5078937.5 ns |
||
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU |
339993 ns |
328534 ns |
1.03 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) |
29959 ns |
31062 ns |
0.96 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) |
30833 ns |
33146 ns |
0.93 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) |
30770.5 ns |
30750 ns |
1.00 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) |
1896 ns |
1833 ns |
1.03 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA |
16550 ns |
16169 ns |
1.02 |
batchedmm(2, Bsize=128)/forward/GPU/Metal |
4756209 ns |
||
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU |
72631 ns |
77564 ns |
0.94 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) |
5959 ns |
5562.5 ns |
1.07 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) |
5584 ns |
5312.5 ns |
1.05 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) |
5541 ns |
7208 ns |
0.77 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) |
7229 ns |
7834 ns |
0.92 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA |
138201 ns |
134922 ns |
1.02 |
batchedmm(2, Bsize=128)/zygote/GPU/Metal |
13282667 ns |
||
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU |
370683 ns |
340125 ns |
1.09 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
333 ns |
292 ns |
1.14 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
375 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
291 ns |
375 ns |
0.78 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
25103.5 ns |
24307 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal |
700709 ns |
||
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU |
47470 ns |
45845 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
7541.5 ns |
6166.5 ns |
1.22 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
7583 ns |
6708 ns |
1.13 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
6917 ns |
8167 ns |
0.85 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
7062.5 ns |
7083 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
184083 ns |
179926.5 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal |
6386187.5 ns |
||
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU |
390413 ns |
372385.5 ns |
1.05 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6041 ns |
5834 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
5875 ns |
5833 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
5833 ns |
5875 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
5917 ns |
5958 ns |
0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
25719 ns |
25187 ns |
1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal |
731208.5 ns |
||
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU |
210291 ns |
201636 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
23708 ns |
21041 ns |
1.13 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
23270.5 ns |
21709 ns |
1.07 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
21750 ns |
23458 ns |
0.93 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
24250 ns |
26125 ns |
0.93 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
266239.5 ns |
262884 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal |
6639625 ns |
||
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU |
707865 ns |
615780.5 ns |
1.15 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
175833 ns |
192083.5 ns |
0.92 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
175125 ns |
158917 ns |
1.10 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
150792 ns |
154416.5 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
175959 ns |
146417 ns |
1.20 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
189040 ns |
184640 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal |
1564416.5 ns |
||
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU |
174111 ns |
215472.5 ns |
0.81 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1329062.5 ns |
1319792 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1311416.5 ns |
1328249.5 ns |
0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1318813 ns |
1347250 ns |
0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1346041 ns |
1337000 ns |
1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
862925 ns |
844907 ns |
1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal |
9193604 ns |
||
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU |
1117183.5 ns |
1041340 ns |
1.07 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 24375 ns | 24292 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25729 ns | 24916 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 27458 ns | 28000 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 23917 ns | 24833.5 ns | 0.96 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 226480 ns | 224694.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1700209 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 102501 ns | 130334 ns | 0.79 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 186854 ns | 117583 ns | 1.59 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 167167 ns | 131375 ns | 1.27 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 177291.5 ns | 160499.5 ns | 1.10 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 124562.5 ns | 164750 ns | 0.76 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 993547 ns | 967206 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8806833 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 608345 ns | 585053 ns | 1.04 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 333 ns | 250 ns | 1.33 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 334 ns | 1.12 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 291 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23084 ns | 22932 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 708709 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 49001 ns | 47870 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8250 ns | 6292 ns | 1.31 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9083 ns | 6833 ns | 1.33 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7000 ns | 9416 ns | 0.74 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7333.5 ns | 7500 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 200052 ns | 196587.5 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 6611083 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 394084 ns | 380031 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5895.5 ns | 5875 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6541 ns | 6292 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7792 ns | 7187.5 ns | 1.08 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4520.5 ns | 6562 ns | 0.69 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 138134 ns | 134586 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 1154209 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 236352 ns | 230170 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10000 ns | 9833 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10604.5 ns | 10000 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10167 ns | 11187.5 ns | 0.91 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9958 ns | 11083 ns | 0.90 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 852212.5 ns | 840176 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 8072333 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 678720 ns | 631290 ns | 1.08 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1584 ns | 1542 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1625 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22967.5 ns | 22272 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 458209 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 210012 ns | 204933 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5750 ns | 5750 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6166 ns | 6125 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5959 ns | 6417 ns | 0.93 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5750 ns | 5875 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 220496.5 ns | 216977 ns | 1.02 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 2224000 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 586500 ns | 491814.5 ns | 1.19 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8708 ns | 8250 ns | 1.06 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8167 ns | 8562.5 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9917 ns | 9895.5 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8417 ns | 9209 ns | 0.91 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 118370.5 ns | 115063 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 1213708 ns | | |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 68660 ns | 73999 ns | 0.93 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8416 ns | 8167 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10000 ns | 9250 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8958 ns | 9833.5 ns | 0.91 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9250 ns | 10333 ns | 0.90 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 563481.5 ns | 548589 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5616208 ns | | |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 344258 ns | 340367 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 128958.5 ns | 127271 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 96083.5 ns | 128750 ns | 0.75 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 130042 ns | 131062 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 180791.5 ns | 181979.5 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46592 ns | 46303.5 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/GPU/Metal | 369729.5 ns | | |
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 95170.5 ns | 102121 ns | 0.93 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 335729 ns | 338125 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 179021 ns | 339792 ns | 0.53 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 331750 ns | 346083 ns | 0.96 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 572000 ns | 595417 ns | 0.96 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 186585.5 ns | 181951 ns | 1.03 |
batchedmm(128, Bsize=4)/zygote/GPU/Metal | 1385875 ns | | |
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 501200 ns | 410627.5 ns | 1.22 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397375 ns | 397708 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 213645.5 ns | 288375 ns | 0.74 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 286292 ns | 287937.5 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 752167 ns | 756708 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44120 ns | 43092 ns | 1.02 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 432792 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 81571 ns | 85671 ns | 0.95 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1457084 ns | 1456291.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 857542 ns | 1133125 ns | 0.76 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1128083.5 ns | 1127937.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2481187.5 ns | 2360208 ns | 1.05 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 249861 ns | 248595.5 ns | 1.01 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1748791.5 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 350803 ns | 266317 ns | 1.32 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 656104.5 ns | 643479.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 672833 ns | 654166 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 649250 ns | 652750 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 667000 ns | 650625 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 188293.5 ns | 172424.5 ns | 1.09 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1390208 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 243237.5 ns | 315089 ns | 0.77 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2415666.5 ns | 2449417 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2426229 ns | 2455020.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2447042 ns | 2465625 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2475437.5 ns | 2469208.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 947745 ns | 922065 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10591021 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1455792 ns | 1363193.5 ns | 1.07 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 33208 ns | 32917 ns | 1.01 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 33167 ns | 35374.5 ns | 0.94 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34334 ns | 34417 ns | 1.00 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 750 ns | 1000 ns | 0.75 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16016 ns | 15534 ns | 1.03 |
batchedmm(2, Bsize=32)/forward/GPU/Metal | 1296458.5 ns | | |
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 77541 ns | 78366 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3145.5 ns | 2937.5 ns | 1.07 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3833 ns | 3375 ns | 1.14 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3375 ns | 5208 ns | 0.65 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3875 ns | 4625 ns | 0.84 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 136913 ns | 133935.5 ns | 1.02 |
batchedmm(2, Bsize=32)/zygote/GPU/Metal | 5040146 ns | | |
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 355953 ns | 318886 ns | 1.12 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1458625 ns | 1464209 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1495542 ns | 1500333 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1499708 ns | 1501333 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1437750 ns | 1442563 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42671 ns | 41738 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1411187 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 243652 ns | 318625 ns | 0.76 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5101625 ns | 5128625 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5320416.5 ns | 5291041 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5302834 ns | 5297084 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4990937 ns | 4998791.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 234235 ns | 230499.5 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11285104 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1238450 ns | 1198280 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3791 ns | 3709 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3750 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3750 ns | 3750 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3708 ns | 3916 ns | 0.95 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34870 ns | 33583 ns | 1.04 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal | 404916.5 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 39731 ns | 36778.5 ns | 1.08 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15292 ns | 15417 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15708 ns | 15500 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15542 ns | 15791 ns | 0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15500 ns | 16000 ns | 0.97 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 253976 ns | 252278 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal | 1603291.5 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 172581 ns | 161662 ns | 1.07 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 404667 ns | 404625 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 216333 ns | 296000 ns | 0.73 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 295666 ns | 295916 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 755125 ns | 760625 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113698 ns | 113161.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal | 512417 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 90091 ns | 95859 ns | 0.94 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1483750 ns | 1479249.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 880292 ns | 1158584 ns | 0.76 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1158916.5 ns | 1160500 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2472770.5 ns | 2383354 ns | 1.04 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 241711 ns | 228888 ns | 1.06 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal | 1816083.5 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 351843 ns | 265922 ns | 1.32 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1041 ns | 958 ns | 1.09 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1042 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1084 ns | 1042 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 959 ns | 1083 ns | 0.89 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 25157 ns | 24404 ns | 1.03 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 712625 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 209502 ns | 207859 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8458 ns | 7917 ns | 1.07 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10041.5 ns | 8542 ns | 1.18 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8750 ns | 9917 ns | 0.88 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11250 ns | 12895.5 ns | 0.87 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 207012 ns | 202191 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 6760062.5 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 692491 ns | 620871 ns | 1.12 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 831541 ns | 835834 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 464666.5 ns | 615542 ns | 0.75 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 618667 ns | 617791.5 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1547646 ns | 1549375 ns | 1.00 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 132130 ns | 130350.5 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/GPU/Metal | 1716791 ns | | |
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 166711 ns | 215532 ns | 0.77 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2686834 ns | 2690375 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1530458 ns | 2000479.5 ns | 0.77 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1999000 ns | 2007416.5 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4939562.5 ns | 4941104 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 233538 ns | 232712 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/Metal | 6467541.5 ns | | |
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 858788 ns | 872871.5 ns | 0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32364 ns | 31625 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 643333 ns | | |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 47840 ns | 47950 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6416 ns | 6084 ns | 1.05 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8562.5 ns | 6708 ns | 1.28 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6833.5 ns | 7666 ns | 0.89 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7354.5 ns | 8083 ns | 0.91 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 220283.5 ns | 221856.5 ns | 0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 5857000 ns | | |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 359924 ns | 352319 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1758708.5 ns | 1741791.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1753875.5 ns | 1752167 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1756209 ns | 1739042 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1761791 ns | 1719916 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 187281 ns | 183055.5 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1584292 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 354693 ns | 415606.5 ns | 0.85 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4354750 ns | 4361125 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4401000 ns | 4365916.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4390770.5 ns | 4399333 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4366166.5 ns | 4394333 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 845510 ns | 827645.5 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9176375 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1255141 ns | 1239667.5 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 9834 ns | 7083 ns | 1.39 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6687.5 ns | 7395.5 ns | 0.90 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7292 ns | 7041 ns | 1.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 8895.5 ns | 6854.5 ns | 1.30 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 22493 ns | 22223.5 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal | 285021.5 ns | | |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 33691 ns | 47178 ns | 0.71 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 49625 ns | 45292 ns | 1.10 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 33584 ns | 51167 ns | 0.66 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 50854.5 ns | 49250 ns | 1.03 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 32833.5 ns | 49437 ns | 0.66 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 210314.5 ns | 204846 ns | 1.03 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal | 2601666.5 ns | | |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 209212 ns | 235841 ns | 0.89 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 22416.5 ns | 22125 ns | 1.01 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 22875 ns | 25125 ns | 0.91 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 24416 ns | 24833 ns | 0.98 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5395.5 ns | 5458.5 ns | 0.99 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18228 ns | 17859 ns | 1.02 |
batchedmm(2, Bsize=512)/forward/GPU/Metal | 14808125 ns | | |
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 90271 ns | 82154 ns | 1.10 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 12041.5 ns | 11792 ns | 1.02 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 9917 ns | 10750 ns | 0.92 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 10792 ns | 12583 ns | 0.86 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 18833 ns | 19708.5 ns | 0.96 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 221873 ns | 216235 ns | 1.03 |
batchedmm(2, Bsize=512)/zygote/GPU/Metal | 46172959 ns | | |
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 380594 ns | 331099 ns | 1.15 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 405875 ns | 406250 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 221667 ns | 297333 ns | 0.75 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 297250 ns | 296833.5 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 758125 ns | 762833 ns | 0.99 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46873 ns | 46303.5 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal | 448750 ns | | |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 89581 ns | 97252 ns | 0.92 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1483542 ns | 1477458 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 886375 ns | 1164395.5 ns | 0.76 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 1163584 ns | 1164416 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2469250 ns | 2386333 ns | 1.03 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 284393.5 ns | 268961 ns | 1.06 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal | 2357604.5 ns | | |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 374963 ns | 282959 ns | 1.33 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1485583 ns | 1488416 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1520792 ns | 1526958 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1526916 ns | 1529250 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1462209 ns | 1466395.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 54219 ns | 52650 ns | 1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1149729.5 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 237227 ns | 326982 ns | 0.73 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5110708 ns | 5119459 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5306604 ns | 5285084 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5293958 ns | 5297709 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4983479.5 ns | 4955208 ns | 1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 256685 ns | 250192 ns | 1.03 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 10295792 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1227811 ns | 1186136 ns | 1.04 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28375 ns | 28292 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28375 ns | 28292 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28375 ns | 28333 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28250 ns | 28417 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24774 ns | 23514.5 ns | 1.05 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal | 458709 ns | | |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 213012 ns | 207227 ns | 1.03 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66625 ns | 66542 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66375 ns | 66750 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67375 ns | 66500 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66292 ns | 66208 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 338401 ns | 333506.5 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal | 2758583.5 ns | | |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 657866 ns | 576948.5 ns | 1.14 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 126583 ns | 124875 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 127167 ns | 81875 ns | 1.55 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 92292 ns | 89166 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 90041 ns | 86750 ns | 1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 193063 ns | 191648 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2102167 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 200221.5 ns | 233116 ns | 0.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2011833.5 ns | 2025145.5 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2035250 ns | 2021978.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2020542 ns | 2030542 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2015583 ns | 1995125 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 517163.5 ns | 506195 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9593458 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1086120 ns | 881973 ns | 1.23 |
This comment was automatically generated by workflow using github-action-benchmark.