This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
test: run tests with more activations
9d522c5
@JuliaRegistrator register
Registration pull request created: JuliaRegistries/General/114537
Tip: Release Notes
Did you know you can add release notes too? Just add markdown-formatted text underneath the comment after the text "Release notes:" and it will be added to the registry PR, and if TagBot is installed it will also be added to the release that TagBot creates.
To add them here just re-invoke and the PR will be updated.
Tagging
After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.
This will be done automatically if the Julia TagBot GitHub Action is installed, or can be done manually through the GitHub interface or from the command line with git.
LuxLib Benchmarks
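Each benchmark entry below lists two timings in nanoseconds followed by a ratio, and the ratio matches the first timing divided by the second. Assuming the first timing is the current run and the second the previous one, a ratio below 1 means the current run was faster. A minimal Python sketch that recomputes the ratio for three rows copied from this report; the `ratio` and `slowdowns` helpers and the `THRESHOLD` value are illustrative assumptions, not part of the benchmark tooling:

```python
# Rows copied from this report: (benchmark name, current ns, previous ns).
ROWS = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 5750.0, 7270.5),
    ("groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal", 1292667.0, 462541.0),
    ("batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)", 3225687.5, 6370584.0),
]

# Hypothetical cutoff for calling a row a slowdown; not taken from the report.
THRESHOLD = 1.5


def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous; below 1 means the current run was faster."""
    return current_ns / previous_ns


def slowdowns(rows, threshold=THRESHOLD):
    """Return the names of rows whose ratio exceeds the threshold."""
    return [name for name, cur, prev in rows if ratio(cur, prev) > threshold]


if __name__ == "__main__":
    for name, cur, prev in ROWS:
        print(f"{ratio(cur, prev):.2f}  {name}")
    print("slower than threshold:", slowdowns(ROWS))
```

Timings this small are noisy, so a single ratio far from 1 is a prompt to re-run the benchmark rather than proof of a regression.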
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5750
ns7270.5
ns0.79
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
6187.5
ns5542
ns1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
7979
ns7958.5
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
6958.5
ns7209
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
119461
ns117012
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal
723417
ns686834
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
417664
ns433304
ns0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
9834
ns10167
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
9792
ns9875
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9916
ns10042
ns0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10166
ns10062.5
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
551816
ns550305
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal
2364708
ns2412958
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
695047
ns10783943
ns0.06445202835363652
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
1458
ns2333
ns0.62
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1687.5
ns1584
ns1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
1917
ns1875
ns1.02
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
1250
ns1521
ns0.82
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
21782
ns21708
ns1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal
189208
ns184666
ns1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
30960
ns31240
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
3958.5
ns4270.5
ns0.93
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4167
ns4000
ns1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4000
ns4500
ns0.89
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4334
ns4375
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
148046.5
ns146276
ns1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal
1745084
ns1500000
ns1.16
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
148342
ns151831
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
56083
ns57416
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
39917
ns46458
ns0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
47000
ns46437.5
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
82750
ns83625
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
37366
ns37234
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1348187.5
ns1140250
ns1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
80291
ns84481
ns0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2017708
ns2040583
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2083959
ns2059271
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2090792
ns2085458
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1999604
ns2013708.5
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
232635
ns230879
ns1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
7104833
ns4993834
ns1.42
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1540007
ns1195591
ns1.29
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
143708
ns152542
ns0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
173750.5
ns145541
ns1.19
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
165562.5
ns151416
ns1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
165979
ns147395.5
ns1.13
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
166570
ns166882
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1701792
ns1468104
ns1.16
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
205502.5
ns188712
ns1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1100292
ns1114875
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1114709
ns1110000
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1122042
ns1116500
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1119916
ns1122458
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
713685
ns702607
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
7357125
ns5931562.5
ns1.24
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1039502
ns1045069
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
4458
ns4500
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
4291
ns4687.5
ns0.92
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6208
ns6562.5
ns0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4416
ns4167
ns1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
94296
ns93036
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal
782083.5
ns421646
ns1.85
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
69431
ns63695.5
ns1.09
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8542
ns8750
ns0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8834
ns8625
ns1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9083
ns9042
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8583
ns9145.5
ns0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
608245
ns610157.5
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal
5666604.5
ns5466959
ns1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
384864
ns388908.5
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17229
ns18479
ns0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
17250
ns17667
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
22250
ns20437.5
ns1.09
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
18312.5
ns18416.5
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
68096
ns66584
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1292667
ns462541
ns2.79
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
74070.5
ns73981
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
218583
ns218458
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
244459
ns211458
ns1.16
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
213333
ns213562.5
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
220875
ns214917
ns1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
359693
ns355208
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
7278917
ns5651375
ns1.29
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
475315
ns476459
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
708
ns666
ns1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
584
ns645.5
ns0.90
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
916.5
ns1083
ns0.85
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
583
ns584
ns1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
20807.5
ns20608
ns1.01
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal
297208
ns283708
ns1.05
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
33001
ns33020
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1375
ns1417
ns0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1458
ns1375
ns1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1583
ns1583
ns1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1417
ns1375
ns1.03
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
126203
ns125576.5
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal
1457625
ns1432958.5
ns1.02
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
138172
ns126626
ns1.09
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7333
ns7334
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5375
ns6000
ns0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
6083
ns6208
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10291
ns10542
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
24430
ns24024
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
351229
ns343292
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
47101
ns47670
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
219208
ns260125
ns0.84
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
261791
ns253083
ns1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
228625
ns266916.5
ns0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
223750
ns224042
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
194664
ns193024
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
11964250
ns9238208
ns1.30
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
617187
ns617455
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4125
ns4125
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4167
ns4125
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4125
ns4125
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4084
ns4083
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23689
ns22982
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal
203375
ns210709
ns0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
48541
ns49071
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16958
ns17000
ns1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16583
ns17042
ns0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
17250
ns16875
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16917
ns16417
ns1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
196884
ns194424
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal
1560667
ns1429834
ns1.09
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
174782
ns177822
ns0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
509333
ns510167
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
332250
ns405334
ns0.82
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
404250
ns404334
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
865708
ns865042
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
114284.5
ns113588.5
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal
392875
ns465958.5
ns0.84
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
248273
ns249082
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2318021
ns2331583
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1745083
ns2030250
ns0.86
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
2021000
ns2010958
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3274791.5
ns3195125
ns1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
244508
ns242243
ns1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal
2001875
ns1910124.5
ns1.05
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
763478
ns763951.5
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
5833
ns6396
ns0.91
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
7167
ns6875
ns1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7271
ns7584
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6124.5
ns6666
ns0.92
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
92855.5
ns92392.5
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal
861271
ns721250
ns1.19
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
60401
ns60491
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11375
ns11979.5
ns0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11750
ns11417
ns1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12229
ns12083
ns1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
11125
ns11583.5
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
638820
ns638302
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal
6435375
ns5394604
ns1.19
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
416514.5
ns408774
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
541
ns500
ns1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
541
ns542
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
541
ns542
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23671
ns23430
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal
318791
ns311771
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
53351
ns54340
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2167
ns2125
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2084
ns2208
ns0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2166
ns2167
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2125
ns2125
ns1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
222818.5
ns220610.5
ns1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal
1967167
ns1899667
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
180782
ns191382
ns0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8708
ns8750
ns1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
8833
ns9541.5
ns0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
9895.5
ns10167
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8709
ns8709
ns1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
100619
ns104875.5
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
898521
ns792083
ns1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
74410.5
ns77640
ns0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
17375
ns17937.5
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
17167
ns17396
ns0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
19375
ns18917
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18250
ns19146
ns0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
574738
ns592103
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
5654917
ns4981084
ns1.14
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
389229
ns390383
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
625
ns500
ns1.25
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
500
ns583
ns0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
667
ns625
ns1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
36237
ns35372
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal
463667
ns398500
ns1.16
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
48401
ns46040
ns1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
8437.5
ns8479.5
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9312
ns9625
ns0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9875
ns9833.5
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
9708
ns9520.5
ns1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
254845
ns267957
ns0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal
5087792
ns4295666
ns1.18
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
375784
ns376554
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
395833.5
ns397875
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
215750
ns288416
ns0.75
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
288166
ns288417
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
756000
ns756875
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
112957
ns112560
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal
299833
ns298416.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
76681
ns77565.5
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1455646
ns1449062.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
862000
ns1132208
ns0.76
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
1130021
ns1118604
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2442563
ns2357521
ns1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
210541
ns207975
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal
1636104.5
ns1580250.5
ns1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
325573.5
ns324872.5
ns1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7000
ns7270.5
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7084
ns7417
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8125
ns8354.5
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7041
ns7354
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
136948
ns143390
ns0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal
760125
ns700375
ns1.09
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
68820
ns60051
ns1.15
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
14625
ns12917
ns1.13
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
15042
ns14520.5
ns1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14958.5
ns16104
ns0.93
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
15625
ns16021
ns0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
931253.5
ns945168.5
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal
6306249.5
ns5468354.5
ns1.15
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
436305
ns428828.5
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
25542
ns25333
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
27334
ns25250
ns1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
28354
ns27583
ns1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
31542
ns24416.5
ns1.29
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
200462.5
ns199074
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1129500
ns576708
ns1.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
112942
ns116211
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
149250
ns105875
ns1.41
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
131583.5
ns105209
ns1.25
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
106479
ns112708.5
ns0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
153208
ns147084
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1062590
ns1079966
ns0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
5978292
ns5470437.5
ns1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
590197
ns601665
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
76250
ns75042
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
74291.5
ns75709
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
77333
ns78666
ns0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
76792
ns74042
ns1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
209030.5
ns208057.5
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal
638458
ns501667
ns1.27
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
130572
ns124861
ns1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
216500
ns223417
ns0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
297395.5
ns274958.5
ns1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
212146
ns306250
ns0.69
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
306208
ns303916.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1140320
ns1127846.5
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal
7480542
ns6260041.5
ns1.19
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
697363
ns702346
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
15833
ns16458.5
ns0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
17291.5
ns17125
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
17875
ns18895.5
ns0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
16687.5
ns16958
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
150183
ns146821.5
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
779979
ns620979.5
ns1.26
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
237943
ns240662
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
26458.5
ns27187.5
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
25708
ns28833
ns0.89
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27625
ns27937
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
27750
ns28708
ns0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
987976
ns980967.5
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
7131041.5
ns5502250
ns1.30
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
701547
ns706416
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
10396
ns11166.5
ns0.93
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
11563
ns11333
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
12833
ns13875
ns0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
10875.5
ns11000
ns0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
125970.5
ns124939.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal
910812.5
ns818854.5
ns1.11
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
241512
ns238602
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
21083
ns22084
ns0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
21604.5
ns21667
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
23041.5
ns23104.5
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
21541.5
ns21958
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
709336
ns707018
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal
5733333
ns5251771
ns1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
676248
ns693476
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
62667
ns67812.5
ns0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
63771
ns63166.5
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
65667
ns68375
ns0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
67667
ns65416
ns1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
107292
ns106235
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal
1352583.5
ns469833
ns2.88
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
240373
ns241932
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
444083
ns458875
ns0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
448875
ns438291.5
ns1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
440458
ns449459
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
445833.5
ns450083
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
521267
ns517395
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal
8808750
ns6169375
ns1.43
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
728812.5
ns734037
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
6958.5
ns7375
ns0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7291
ns8146
ns0.90
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
8771
ns9250
ns0.95
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7104
ns7125
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
147758.5
ns144978
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal
763583
ns639291
ns1.19
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
60941
ns59601
ns1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15125
ns16166
ns0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14417
ns14500
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15334
ns15291.5
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
15958
ns14541
ns1.10
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
958359.5
ns953156
ns1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal
6378396
ns5309583
ns1.20
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
409474
ns412483
ns0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
6155291
ns6153625
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
3225687.5
ns6370584
ns0.51
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
6379541
ns6373521
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
11906125
ns11918417
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA
351844
ns347126
ns1.01
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU
301554
ns299793
ns1.01
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
19041833.5
ns19118354
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
11118520.5
ns19949833
ns0.56
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
19989395.5
ns19921916
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
36469125
ns36514708.5
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1015731
ns1011727
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU
1151512
ns1159240
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
959
ns958
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
958
ns1000
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
959
ns1000
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
958
ns958
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23791
ns23026
ns1.03
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal
317417
ns309083
ns1.03
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU
215032
ns216516.5
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
3667
ns3625
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
3667
ns3750
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
3750
ns3750
ns1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
3708
ns3667
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
283833
ns281456.5
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal
2116208
ns2006334
ns1.05
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU
634877
ns641165.5
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7875 ns | 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7833.5 ns | 8875 ns | 0.88
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 9291 ns | 10042 ns | 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7500 ns | 8166 ns | 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 122503 ns | 120732.5 ns | 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 866646 ns | 777646 ns | 1.11
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 66931 ns | 72621 ns | 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11709 ns | 12500 ns | 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11834 ns | 12167 ns | 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 13291 ns | 13041.5 ns | 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11875 ns | 11791.5 ns | 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 651319 ns | 645322 ns | 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5038083 ns | 4225228.5 ns | 1.19
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 365314 ns | 372211 ns | 0.98
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22923 ns | 22405 ns | 1.02
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal | 208979.5 ns | 207416 ns | 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU | 50651 ns | 51781 ns | 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3000 ns | 3084 ns | 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2959 ns | 3125 ns | 0.95
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3250 ns | 3125 ns | 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2959 ns | 2834 ns | 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 206218 ns | 204197 ns | 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal | 1699541.5 ns | 1523666 ns | 1.12
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU | 158851.5 ns | 161603 ns | 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10375 ns | 11625 ns | 0.89
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11854.5 ns | 11104.5 ns | 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12417 ns | 13021 ns | 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 12333 ns | 11083.5 ns | 1.11
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 123182.5 ns | 121763 ns | 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 877125 ns | 786833 ns | 1.11
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 241463 ns | 245014 ns | 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 22062 ns | 21708 ns | 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21625 ns | 21625 ns | 1
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21708 ns | 23625 ns | 0.92
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20084 ns | 22250 ns | 0.90
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 605852.5 ns | 599417 ns | 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5025000 ns | 4065709 ns | 1.24
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 667502 ns | 671650 ns | 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4417 ns | 4541 ns | 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4584 ns | 4583 ns | 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4417 ns | 4666 ns | 0.95
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4375 ns | 4417 ns | 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24334 ns | 24192 ns | 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal | 208417 ns | 211333 ns | 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU | 54130 ns | 54791 ns | 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16375 ns | 16666 ns | 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16375 ns | 16666 ns | 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16667 ns | 16916 ns | 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16875 ns | 16292 ns | 1.04
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 333246 ns | 332323 ns | 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal | 1768771 ns | 1587333 ns | 1.11
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU | 214042.5 ns | 215493.5 ns | 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2084 ns | 1958 ns | 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2000 ns | 2167 ns | 0.92
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2166 ns | 2166 ns | 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2041 ns | 2042 ns | 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 36196 ns | 36245 ns | 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 473000 ns | 439708 ns | 1.08
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 205752 ns | 210833 ns | 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17667 ns | 16250 ns | 1.09
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 18937.5 ns | 17000 ns | 1.11
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 17625 ns | 17375 ns | 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 16896 ns | 16416.5 ns | 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 297235 ns | 296059 ns | 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 5572167 ns | 4512166.5 ns | 1.23
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 694748 ns | 695090 ns | 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 55979.5 ns | 59708.5 ns | 0.94
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 60709 ns | 65708 ns | 0.92
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 65812.5 ns | 65729.5 ns | 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 51583 ns | 51209 ns | 1.01
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66558 ns | 66461 ns | 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 120591.5 ns | 98652 ns | 1.22
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 185895.5 ns | 196292 ns | 0.95
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 146354 ns | 152646 ns | 0.96
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 136208 ns | 132791.5 ns | 1.03
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 297104 ns | 265000 ns | 1.12
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 218976.5 ns | 216858 ns | 1.01
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 584106 ns | 588779 ns | 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 112833.5 ns | 85833 ns | 1.31
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 86417 ns | 124125 ns | 0.70
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 89416 ns | 85250 ns | 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81000 ns | 83917 ns | 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191966 ns | 192676 ns | 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1945000 ns | 1754791.5 ns | 1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 209467.5 ns | 172083 ns | 1.22
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1912250 ns | 1889458 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1923916 ns | 1906375 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1917917 ns | 1639458.5 ns | 1.17
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1922250 ns | 1896208.5 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 536309 ns | 532536 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11093750 ns | 9060167 ns | 1.22
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 935284.5 ns | 1084751 ns | 0.86
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 291 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 292 ns | 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21820 ns | 21623 ns | 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal | 327833.5 ns | 318709 ns | 1.03
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 46181 ns | 45401 ns | 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1792 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1791 ns | 1834 ns | 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1834 ns | 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1792 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 254627 ns | 252797 ns | 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal | 1640833 ns | 1460542 ns | 1.12
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 187212 ns | 183733 ns | 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8209 ns | 8625 ns | 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 9083 ns | 8708 ns | 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 9896 ns | 11438 ns | 0.87
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8417 ns | 8125 ns | 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 120586.5 ns | 118574 ns | 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 873250 ns | 776791.5 ns | 1.12
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 236722 ns | 241823.5 ns | 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10292 ns | 9833 ns | 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8958 ns | 9458 ns | 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9917 ns | 11375 ns | 0.87
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8666 ns | 11041 ns | 0.78
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 532717.5 ns | 526956.5 ns | 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 4452292 ns | 3794729 ns | 1.17
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 646767 ns | 648949 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56750 ns | 58125 ns | 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39708 ns | 46375 ns | 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47166 ns | 45750 ns | 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83125 ns | 84250 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 40431 ns | 39640 ns | 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1093666 ns | 1077750 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 77971 ns | 79521 ns | 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1903833 ns | 1936334 ns | 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1979312 ns | 1979666 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1983896 ns | 1951604.5 ns | 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1849208 ns | 1881916 ns | 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 224788 ns | 222570.5 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 14363791.5 ns | 11388417 ns | 1.26
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1042991 ns | 1042195 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 415042 ns | 434833 ns | 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 418584 ns | 418000 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 420291 ns | 422395.5 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 420459 ns | 416583 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 212100.5 ns | 211826.5 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1065709 ns | 505604 ns | 2.11
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 286133 ns | 289239 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 742875 ns | 682520.5 ns | 1.09
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 758958 ns | 767333 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 691062.5 ns | 716271 ns | 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 742624.5 ns | 750812.5 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1063422.5 ns | 1054928.5 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7312146 ns | 6283521 ns | 1.16
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 924920 ns | 921133 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 3442959 ns | 3362083.5 ns | 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 3441833 ns | 3444042 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 3417500 ns | 3375083 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 3453000 ns | 3433667 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 174858 ns | 175818.5 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1420583 ns | 1393208 ns | 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 452865 ns | 432507 ns | 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 6180375 ns | 6161771 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 6232875 ns | 6172291.5 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 6229979 ns | 5672584 ns | 1.10
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 6252666 ns | 6241000 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1007257 ns | 997215 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9641124.5 ns | 7277000 ns | 1.32
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1560736 ns | 1740609 ns | 0.90
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 471375 ns | 474833 ns | 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 253334 ns | 341334 ns | 0.74
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 341708 ns | 339937.5 ns | 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 902583 ns | 901833 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46913 ns | 46636 ns | 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal | 338020.5 ns | 351584 ns | 0.96
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 250492 ns | 252343 ns | 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2320416 ns | 2323458 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1761167 ns | 2036541 ns | 0.86
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 2033167 ns | 2030999.5 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3279375 ns | 3199000 ns | 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 260626 ns | 257623 ns | 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal | 2319917 ns | 2193666 ns | 1.06
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 785678 ns | 793161 ns | 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56166 ns | 57229.5 ns | 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39417 ns | 45917 ns | 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46584 ns | 44687.5 ns | 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82917 ns | 84125 ns | 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28863 ns | 28263.5 ns | 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1130625 ns | 1073000 ns | 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 79170.5 ns | 82736.5 ns | 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2020083 ns | 1994187.5 ns | 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2062917 ns | 2084521.5 ns | 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2078437.5 ns | 2066104.5 ns | 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2004145.5 ns | 1987021.5 ns | 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 238429 ns | 236475 ns | 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 15264270.5 ns | 11587916.5 ns | 1.32
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1057241 ns | 1056434 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 56292 ns | 57500 ns | 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39833 ns | 46375 ns | 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47416 ns | 45959 ns | 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82875 ns | 83666 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 50090 ns | 49710 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1054834 ns | 1030562 ns | 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 74900 ns | 73311 ns | 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924167 ns | 1928437.5 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1968250 ns | 1982583 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1980792 ns | 1921333.5 ns | 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1891208 ns | 1896500 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 243592 ns | 243238 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 12800042 ns | 9867125 ns | 1.30
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1070466 ns | 931613 ns | 1.15
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 35236 ns | 34854 ns | 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 461750 ns | 268562.5 ns | 1.72
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 50011 ns | 48101 ns | 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6709 ns | 6333 ns | 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6520.5 ns | 7042 ns | 0.93
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7209 ns | 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6541 ns | 6709 ns | 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 216284 ns | 214556.5 ns | 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5088292 ns | 4310250.5 ns | 1.18
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 373774 ns | 378480.5 ns | 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 250 ns | 292 ns | 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 291 ns | 292 ns | 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 250 ns | 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32446 ns | 32581 ns | 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal | 248500 ns | 231417 ns | 1.07
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 40510 ns | 39650 ns | 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2917 ns | 2750 ns | 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3250 ns | 3167 ns | 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3083 ns | 3000 ns | 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 3458 ns | 2875 ns | 1.20
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 191592.5 ns | 190168.5 ns | 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal | 1031291.5 ns | 896854.5 ns | 1.15
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 153502 ns | 154712 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 423917 ns | 428625 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 473500 ns | 455000 ns | 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 427833 ns | 423875 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 424125 ns | 425374.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 138519 ns | 137437 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2048875 ns | 2017791 ns | 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 380684 ns | 354515 ns | 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3799062.5 ns | 3815083.5 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3822458 ns | 3802292 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3802667 ns | 3442625 ns | 1.10
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3823563 ns | 3811667 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 717031.5 ns | 711414 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 12950229 ns | 10864270.5 ns | 1.19
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1325953 ns | 1331908 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49840813 ns | 49850833.5 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 25988833 ns | 35504146 ns | 0.73
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 35525750 ns | 35546333 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 96904729.5 ns | 97031625 ns | 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1593190 ns | 1606173.5 ns | 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1014101 ns | 1005743 ns | 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 153775938 ns | 154464875 ns | 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 89008896 ns | 112292145.5 ns | 0.79
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 112384750 ns | 112275083 ns | 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 296752479 ns | 295087458 ns | 1.01
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6476290 ns | 6454148 ns | 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5534451 ns | 5525883 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 15062.5 ns | 18271 ns | 0.82
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 15625 ns | 18333 ns | 0.85
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 16875 ns | 16416 ns | 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15333 ns | 16042 ns | 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 21010 ns | 21028 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal | 204959 ns | 199083 ns | 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 27230 ns | 26291 ns | 1.04
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 11083 ns | 10812.5 ns | 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 7583 ns | 8812.5 ns | 0.86
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 9209 ns | 9250 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17188 ns | 17271 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 264057 ns | 263179 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal | 1736125.5 ns | 1476625 ns | 1.18
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 152581.5 ns | 155602 ns | 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7417 ns | 8479.5 ns | 0.87
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8833 ns | 9958 ns | 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10041.5 ns | 11041 ns | 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8292 ns | 8292 ns | 1
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 117259.5 ns | 126280 ns | 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 887417 ns | 770417 ns | 1.15
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 236902.5 ns | 239503 ns | 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9708.5 ns | 10333 ns | 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9292 ns | 10417 ns | 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10791.5 ns | 9833 ns | 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9584 ns | 9813 ns | 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 631614 ns | 625000 ns | 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 5189583 ns | 4214500 ns | 1.23
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 668942 ns | 660618.5 ns | 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8812.5 ns | 10500.5 ns | 0.84
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9583 ns | 9792 ns | 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11042 ns | 12438 ns | 0.89
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9250 ns | 9396 ns | 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 122641 ns | 120552 ns | 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 876791.5 ns | 821667 ns | 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 74481 ns | 69276 ns | 1.08
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13708 ns | 15166.5 ns | 0.90
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 14979 ns | 15396 ns | 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 14416 ns | 14124.5 ns | 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13625.5 ns | 14375 ns | 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 601521.5 ns | 596192 ns | 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 4885250 ns | 3931812.5 ns | 1.24
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 353174 ns | 355215 ns | 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 458 ns | 500 ns | 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 500 ns | 583 ns | 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 584 ns | 583 ns | 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 500 ns | 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 35180 ns | 35017 ns | 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal | 441166 ns | 259917 ns | 1.70
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 206562 ns | 208292 ns | 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7042 ns | 7666 ns | 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10458 ns | 8416 ns | 1.24
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8042 ns | 7917 ns | 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7125 ns | 7792 ns | 0.91
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 233713.5 ns | 232859 ns | 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal | 5300958.5 ns | 4590708 ns | 1.15
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 658707 ns | 670689 ns | 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 12666 ns | 15500 ns | 0.82
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 13833 ns | 15834 ns | 0.87
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 15667 ns | 13709 ns | 1.14
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10270.5 ns | 10375 ns | 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 22010 ns | 22187.5 ns | 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal | 186625 ns | 184292 ns | 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 191282 ns | 194442.5 ns | 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 32042 ns | 32042 ns | 1
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 32020.5 ns | 32250 ns | 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 32458 ns | 32250 ns | 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 31854.5 ns | 31917 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 278049 ns | 277348 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal | 1885500 ns | 1597167 ns | 1.18
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 606396.5 ns | 608217 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 438291 ns | 443583 ns | 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 484125 ns | 485750 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 446062.5 ns | 444958 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 477208 ns | 483792 ns | 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194398.5 ns | 194055 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1968250 ns | 1953500 ns | 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 375174 ns | 355719.5 ns | 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3825292 ns | 3835771 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3837396 ns | 3818792 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3828687.5 ns | 3453229 ns | 1.11
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3836875 ns | 3847625 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 549907 ns | 547078 ns | 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 12010500 ns | 9055458 ns | 1.33
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1226382.5 ns | 1390493 ns | 0.88
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 836787979.5 ns | 783907000 ns | 1.07
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 426008000 ns | 542588375 ns | 0.79
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 542930250 ns | 542038833 ns | 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1533058916 ns | 1515263812.5 ns | 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22531506 ns | 22757656.5 ns | 0.99
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14059203 ns | 14076767 ns | 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 3617643875 ns | 2559120625 ns | 1.41
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1519606625 ns | 1811234166 ns | 0.84
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1791220042 ns | 1823497333 ns | 0.98
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4771769708 ns | 4761215708 ns | 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 370760684 ns | 368318878 ns | 1.01
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 89879564 ns | 87507304 ns | 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75354.5 ns | 77500 ns | 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 77417 ns | 85708 ns | 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 80167 ns | 80125 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76625 ns | 77291.5 ns | 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 210924.5 ns | 210158.5 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1045583.5 ns | 508103.5 ns | 2.06
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 110131.5 ns | 110471 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 231500 ns | 235166 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 195167 ns | 290291.5 ns | 0.67
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 244583 ns | 194125 ns | 1.26
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 234875 ns | 196875 ns | 1.19
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1060035 ns | 1050416 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6603312.5 ns | 5885021 ns | 1.12
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 643791.5 ns | 645198 ns | 1.00
| batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199256958.5 ns | 199484937 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 103813958.5 ns | 139217416 ns | 0.75 |
| batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 139098125 ns | 139383750 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 388864875 ns | 388675625 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5820038 ns | 5836807.5 ns | 1.00 |
| batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3424485 ns | 3426102 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 615907583.5 ns | 618127896 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 354224562 ns | 439059167 ns | 0.81 |
| batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 440166291.5 ns | 438957292 ns | 1.00 |
| batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1188432875 ns | 1179308292 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26804213.5 ns | 26606894 ns | 1.01 |
| batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 21815881 ns | 21809492 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7375 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5416 ns | 6292 ns | 0.86 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6291 ns | 3542 ns | 1.78 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10458 ns | 10083 ns | 1.04 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28403 ns | 27930 ns | 1.02 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 361437.5 ns | 375166.5 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48715.5 ns | 48181 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213333.5 ns | 214541 ns | 0.99 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221708 ns | 230604 ns | 0.96 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220916 ns | 220625 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205750 ns | 206687.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 226122 ns | 224569.5 ns | 1.01 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 11493583.5 ns | 9326958 ns | 1.23 |
| batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 541195.5 ns | 537197 ns | 1.01 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 8021 ns | 0.91 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8417 ns | 8792 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10770.5 ns | 11229.5 ns | 0.96 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8583 ns | 7375 ns | 1.16 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 119656 ns | 116136 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 855542 ns | 797791.5 ns | 1.07 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 72200 ns | 72561 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7667 ns | 9145.5 ns | 0.84 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9395.5 ns | 10167 ns | 0.92 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8375 ns | 8104.5 ns | 1.03 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7542 ns | 8833.5 ns | 0.85 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 526844.5 ns | 524349 ns | 1.00 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 4384667 ns | 3783833 ns | 1.16 |
| groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 322463 ns | 323234 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 459 ns | 375 ns | 1.22 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 458 ns | 667 ns | 0.69 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 500 ns | 541 ns | 0.92 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 416 ns | 625 ns | 0.67 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 27306 ns | 26340 ns | 1.04 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 483625 ns | 443354.5 ns | 1.09 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 48601 ns | 49290 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9917 ns | 9625 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10167 ns | 13291 ns | 0.76 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9542 ns | 9750 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 8667 ns | 9542 ns | 0.91 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 256488 ns | 255271 ns | 1.00 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5936416 ns | 4751833 ns | 1.25 |
| batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 396784 ns | 397425 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 108542 ns | 106958.5 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 85333 ns | 99292 ns | 0.86 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 100208 ns | 99812.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 146625 ns | 146979 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 25074 ns | 25303 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal | 244333 ns | 240875 ns | 1.01 |
| bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 190632 ns | 191062 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 479625 ns | 498250 ns | 0.96 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 518583.5 ns | 524250 ns | 0.99 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 481000 ns | 479229.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 478125 ns | 489125 ns | 0.98 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 235150 ns | 234991.5 ns | 1.00 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal | 2164333 ns | 2102146 ns | 1.03 |
| bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 622586 ns | 624127.5 ns | 1.00 |
| batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5500 ns | 5333 ns | 1.03 |
| batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 5750 ns | 5625 ns | 1.02 |
| batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 6666.5 ns | 7167 ns | 0.93 |
| batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 4125 ns | 6396 ns | 0.64 |
| batchedmm(16, Bsize=32)/forward/GPU/CUDA | 16723 ns | 16311.5 ns | 1.03 |
| batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 78130 ns | 79691 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 11812 ns | 12542 ns | 0.94 |
| batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 11916 ns | 11000 ns | 1.08 |
| batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 11000 ns | 11250 ns | 0.98 |
| batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 16500 ns | 17666.5 ns | 0.93 |
| batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 216336 ns | 214390 ns | 1.01 |
| batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 370958.5 ns | 390785 ns | 0.95 |
| batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 35917 ns | 38958 ns | 0.92 |
| batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 50500 ns | 52791.5 ns | 0.96 |
| batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 52709 ns | 52333 ns | 1.01 |
| batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13541 ns | 13667 ns | 0.99 |
| batchedmm(16, Bsize=128)/forward/GPU/CUDA | 20359 ns | 20008 ns | 1.02 |
| batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 79931 ns | 82631 ns | 0.97 |
| batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 36625 ns | 37041 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 29625 ns | 35917 ns | 0.82 |
| batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 31458 ns | 31417 ns | 1.00 |
| batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57209 ns | 57770.5 ns | 0.99 |
| batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 195413 ns | 193115 ns | 1.01 |
| batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 409364 ns | 424520 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1959 ns | 1792 ns | 1.09 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 1792 ns | 1958 ns | 0.92 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2125 ns | 2125 ns | 1 |
| bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1792 ns | 1812.5 ns | 0.99 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 21014.5 ns | 21083 ns | 1.00 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal | 324459 ns | 292542 ns | 1.11 |
| bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 33550 ns | 30610 ns | 1.10 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2209 ns | 2166.5 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2333 ns | 0.91 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2417 ns | 2375 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2291 ns | 2375 ns | 0.96 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 207244.5 ns | 204141 ns | 1.02 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal | 1670895.5 ns | 1447208.5 ns | 1.15 |
| bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 137121 ns | 145632 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4583 ns | 6125 ns | 0.75 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4750 ns | 5396 ns | 0.88 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6333 ns | 6062.5 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4917 ns | 5417 ns | 0.91 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 147827 ns | 143737 ns | 1.03 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 771709 ns | 684625 ns | 1.13 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 71711 ns | 64221 ns | 1.12 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8270.5 ns | 9334 ns | 0.89 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8666 ns | 9292 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8792 ns | 8792 ns | 1 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8125 ns | 9229.5 ns | 0.88 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 888135.5 ns | 870456.5 ns | 1.02 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 6483625 ns | 5245354.5 ns | 1.24 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 391164 ns | 395275 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56875 ns | 56834 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56875 ns | 57542 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57750 ns | 57708 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 58292 ns | 58167 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37890 ns | 37529 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 379312.5 ns | 331041 ns | 1.15 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 205582 ns | 210192 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 448479 ns | 447958.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 465229 ns | 472042 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 464687.5 ns | 464624.5 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 433500 ns | 443958.5 ns | 0.98 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 270782 ns | 266188 ns | 1.02 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 10306000 ns | 8232229.5 ns | 1.25 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 801818 ns | 809509 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3291000 ns | 3321791 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1770084 ns | 2340645.5 ns | 0.76 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 2335292 ns | 2338500 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6297083.5 ns | 6319896 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 206316 ns | 207561 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 203322 ns | 202168 ns | 1.01 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11333854.5 ns | 11449521 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 6594562.5 ns | 8325854 ns | 0.79 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 8324937.5 ns | 8320229 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21089229 ns | 21173874.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 735605 ns | 743009 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1072271 ns | 1060547.5 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5625 ns | 6500 ns | 0.87 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5667 ns | 5000 ns | 1.13 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7500 ns | 7084 ns | 1.06 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6750 ns | 6333 ns | 1.07 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 139700 ns | 137112.5 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 867541.5 ns | 739604.5 ns | 1.17 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 56260 ns | 56461 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7500 ns | 7229.5 ns | 1.04 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 14625 ns | 7666.5 ns | 1.91 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7375 ns | 7666 ns | 0.96 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7000 ns | 11333 ns | 0.62 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 766028 ns | 753531.5 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 5998084 ns | 4958542 ns | 1.21 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 380414 ns | 379125 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 117604 ns | 95833 ns | 1.23 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 125375 ns | 122770.5 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 102396 ns | 99771 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 98145.5 ns | 97166 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 152876 ns | 151139 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2030624.5 ns | 2002250 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 185692 ns | 187022 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2021875 ns | 2023750 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2037125 ns | 2020250 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2013542 ns | 1746979 ns | 1.15 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2033354 ns | 2042417 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 716061.5 ns | 705196 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 13591542 ns | 10844979.5 ns | 1.25 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1265732.5 ns | 1123943 ns | 1.13 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 29833 ns | 33395.5 ns | 0.89 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 34167 ns | 37708 ns | 0.91 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 35542 ns | 34375 ns | 1.03 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 625 ns | 708 ns | 0.88 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15704 ns | 15220 ns | 1.03 |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 71560.5 ns | 81571 ns | 0.88 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2583 ns | 2583 ns | 1 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 4583 ns | 2917 ns | 1.57 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3000 ns | 3041 ns | 0.99 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2209 ns | 2750 ns | 0.80 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 143464 ns | 137129.5 ns | 1.05 |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 351354 ns | 351384.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7166 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5334 ns | 6125 ns | 0.87 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6166 ns | 6083 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 10083 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37164 ns | 36037 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 334396 ns | 326166 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49180 ns | 48790 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212895.5 ns | 212208.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222000 ns | 232791.5 ns | 0.95 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221041.5 ns | 220437.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 205979 ns | 207312.5 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 249374 ns | 243763 ns | 1.02 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9656333 ns | 8135312.5 ns | 1.19 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 581561 ns | 524616 ns | 1.11 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3959 ns | 3917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 4000 ns | 3958 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3958 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3917 ns | 4000 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 21939 ns | 21381 ns | 1.03 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal | 227375 ns | 224000 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 45671 ns | 47871 ns | 0.95 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14916 ns | 14917 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14708 ns | 15041 ns | 0.98 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15000 ns | 14917 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14875 ns | 14709 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 314728.5 ns | 307867 ns | 1.02 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal | 1635750 ns | 958666 ns | 1.71 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 192832 ns | 197982 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 109166 ns | 100500 ns | 1.09 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 132541 ns | 108625 ns | 1.22 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 109875 ns | 104083 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 102125 ns | 100833 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 138355.5 ns | 138303.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 2016354 ns | 1992791 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 188667 ns | 172222 ns | 1.10 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1918396 ns | 1923250 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1939229 ns | 1915250 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1913584 ns | 1651666 ns | 1.16 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1937625 ns | 1913500 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 700104 ns | 688019 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 13264020.5 ns | 10627167 ns | 1.25 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1233652.5 ns | 1227559 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17667 ns | 18167 ns | 0.97 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18458 ns | 18125 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22270.5 ns | 21000 ns | 1.06 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18250 ns | 18229.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 110588.5 ns | 108058 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1374104.5 ns | 463917 ns | 2.96 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81891 ns | 80040 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216417 ns | 215479 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 249771 ns | 253396 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216541.5 ns | 217958 ns | 0.99 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 217312.5 ns | 215854 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 527304 ns | 518117 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 8411584 ns | 6333958 ns | 1.33 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 488925 ns | 492755.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 24063 ns | 24541.5 ns | 0.98 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 28500 ns | 32959 ns | 0.86 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 29459 ns | 27875 ns | 1.06 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1334 ns | 1292 ns | 1.03 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16479 ns | 16059.5 ns | 1.03 |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 82590 ns | 83436 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4708.5 ns | 4666.5 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 4708 ns | 4729.5 ns | 1.00 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5208 ns | 5125 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4875 ns | 4458 ns | 1.09 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 210198 ns | 205889.5 ns | 1.02 |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 398304 ns | 408069.5 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 304792 ns | 305750 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 305542 ns | 304333 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 311083 ns | 306917 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 306375 ns | 308083 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 232191.5 ns | 228499.5 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1156396 ns | 1000084 ns | 1.16 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 279563 ns | 276983 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 530625 ns | 530500 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 542459 ns | 547208 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 542000.5 ns | 532583 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 535875 ns | 562084 ns | 0.95 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1096065 ns | 1071032 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 6678000 ns | 5815562.5 ns | 1.15 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 873778.5 ns | 872009 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20083 ns | 19208 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20187.5 ns | 20812.5 ns | 0.97 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23187 ns | 22125 ns | 1.05 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20959 ns | 20542 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 115290.5 ns | 113101 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1265792 ns | 501709 ns | 2.52 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 80731 ns | 79781 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212042 ns | 212500 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 224625 ns | 241875 ns | 0.93 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214333 ns | 215208 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213708.5 ns | 212541 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 758025 ns | 741622.5 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 10158583 ns | 7459812.5 ns | 1.36 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 542975 ns | 548036 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6458 ns | 6625 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6917 ns | 6792 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8542 ns | 8292 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6417 ns | 6917 ns | 0.93 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 143078 ns | 139468 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 869500 ns | 738042 ns | 1.18 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 69771 ns | 69491 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10709 ns | 10208 ns | 1.05 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9771 ns | 10000 ns | 0.98 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10729.5 ns | 11083 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10291 ns | 10167 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 834187 ns | 826026 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 6274750 ns | 5037583 ns | 1.25 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 396084 ns | 389164 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5333 ns | 6709 ns | 0.79 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4958 ns | 4583.5 ns | 1.08 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7125 ns | 7500 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5958 ns | 7000 ns | 0.85 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 146313.5 ns | 143052 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal | 875000 ns | 715979 ns | 1.22 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 67660 ns | 59900.5 ns | 1.13 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7667 ns | 7583 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 7895.5 ns | 0.95 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7625 ns | 7709 ns | 0.99 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7459 ns | 7416 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 797995 ns | 782023 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal | 6580999.5 ns | 5232416.5 ns | 1.26 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 400804 ns | 399255 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14350958 ns | 14504875 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 7722625 ns | 10144541 ns | 0.76 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 10132750 ns | 10123250 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27757125 ns | 27812708 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 532327 ns | 530146 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 403538.5 ns | 398444 ns | 1.01 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 45806208 ns | 46256833 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 26766750.5 ns | 33497916.5 ns | 0.80 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 33520000 ns | 33428625 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85306916 ns | 85699625 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2661047 ns | 2648857 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3296413 ns | 3285657 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 66000 ns | 66875 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 67333 ns | 65645.5 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 69854 ns | 68791.5 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 67375 ns | 66292 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 120529 ns | 118200 ns | 1.02 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1329083.5 ns | 509020.5 ns | 2.61 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 228112 ns | 239683 ns | 0.95 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 444083 ns | 439916.5 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 444083 ns | 488291.5 ns | 0.91 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 441292 ns | 442104.5 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 442521.5 ns | 441750 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 736542.5 ns | 727003 ns | 1.01 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 10732062.5 ns | 7746000 ns | 1.39 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 809398 ns | 807579 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 583 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32886 ns | 32311 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal | 466834 ns | 409084 ns | 1.14 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 49230 ns | 49420 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 8709 ns | 1.08 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9250 ns | 8375 ns | 1.10 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9500 ns | 10208 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8125 ns | 8667 ns | 0.94 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 290314.5 ns | 284738.5 ns | 1.02 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal | 5519708 ns | 4462646 ns | 1.24 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 387394 ns | 391304 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9875 ns | 9875 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9833 ns | 9834 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9833 ns | 9834 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9791 ns | 9833 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23928 ns | 22837 ns | 1.05 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal | 204979.5 ns | 212375 ns | 0.97 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 214872 ns | 217863 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 46000 ns | 46042 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 45667 ns | 46292 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 46666 ns | 46292 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 46250 ns | 45917 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 293307 ns | 289926.5 ns | 1.01 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal | 1595562.5 ns | 929833.5 ns | 1.72 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 621217 ns | 620477 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 56333 ns | 56250 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 56792 ns | 57083 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 57083 ns | 57166 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 57834 ns | 58000 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 29516 ns | 28779.5 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 704333.5 ns | 346145.5 ns | 2.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 205082 ns | 207252 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 455021 ns | 448416.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 465375 ns | 502000 ns | 0.93 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 473000 ns | 465458 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 434208.5 ns | 434875 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 252003 ns | 246357 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 12166125 ns | 9664479.5 ns | 1.26 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 893508.5 ns | 860564 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 624416 ns | 597687.5 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 662083 ns | 645396 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 619083 ns | 549917 ns | 1.13 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 633895.5 ns | 641125 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 212333 ns | 203653 ns | 1.04 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1471333 ns | 1436750 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 236152 ns | 233752 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2220834 ns | 2234583 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2250000 ns | 2231583 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2213792 ns | 1888416 ns | 1.17 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2240750 ns | 2260250 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 990521.5 ns | 966670 ns | 1.02 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 9717333 ns | 7505959 ns | 1.29 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1376089 ns | 1376955 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19000 ns | 19292 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19979 ns | 24625 ns | 0.81 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22333.5 ns | 22125 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22250 ns | 19208 ns | 1.16 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 114382.5 ns | 112203 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1244584 ns | 1449458 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81450 ns | 85180 ns | 0.96 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 222479 ns | 218917 ns | 1.02 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 224959 ns | 232375 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221208 ns | 221292 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218917 ns | 225562.5 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 738666.5 ns | 730535 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 10456396 ns | 7839645.5 ns | 1.33 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 562856 ns | 559956 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 500 ns | 1.17 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 500 ns | 583 ns | 0.86 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 583 ns | 1.14 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23746 ns | 22978 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal | 488062.5 ns | 450833 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 49670 ns | 50710 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9541.5 ns | 10312 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9792 ns | 10084 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9833 ns | 10792 ns | 0.91 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9291.5 ns | 10042 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 272510 ns | 267102 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal | 6224583.5 ns | 4916333 ns | 1.27 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 407824 ns | 427214 ns | 0.95 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7708 ns | 10500 ns | 0.73 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8687.5 ns | 8708 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11166.5 ns | 10958 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9666 ns | 8063 ns | 1.20 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 121220 ns | 118534 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal | 860208 ns | 768958 ns | 1.12 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 72661 ns | 68831 ns | 1.06 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7708 ns | 7333 ns | 1.05 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 8417 ns | 0.86 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8125 ns | 7875 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7334 ns | 7625 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 516336 ns | 506197 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal | 4339813 ns | 3602937.5 ns | 1.20 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 328244 ns | 339243 ns | 0.97 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1458 ns | 1375 ns | 1.06 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1375 ns | 1687.5 ns | 0.81 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2041.5 ns | 1959 ns | 1.04 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1583 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 21646 ns | 21680 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal | 305020.5 ns | 306333 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 191511.5 ns | 190212 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3334 ns | 3250 ns | 1.03 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3375 ns | 3395.5 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3459 ns | 3542 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3458 ns | 3458 ns | 1 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 224911 ns | 220280 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal | 1768041 ns | 1546708 ns | 1.14 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 595216 ns | 597436 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 145708.5 ns | 147145.5 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 106562.5 ns | 130833 ns | 0.81 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 129292 ns | 128937.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225125 ns | 226062.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 24473.5 ns | 24156 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal | 252375 ns | 250812.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 38390 ns | 37640 ns | 1.02 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 143771 ns | 156458.5 ns | 0.92 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 88167 ns | 136208 ns | 0.65 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 110771 ns | 110833 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 250875 ns | 264250 ns | 0.95 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 220914.5 ns | 217951.5 ns | 1.01 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal | 2045709 ns | 1080375 ns | 1.89 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 237933 ns | 226967 ns | 1.05 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7250 ns | 7292 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5333 ns | 6000 ns | 0.89 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5916 ns | 3708 ns | 1.60 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10208 ns | 10500 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33448 ns | 32643.5 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 335833 ns | 549084 ns | 0.61 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 50340 ns | 51091 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 224250 ns | 219624.5 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228375 ns | 236042 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 236083.5 ns | 228500 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 212562.5 ns | 217333.5 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 267943.5 ns | 261697 ns | 1.02 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 9170083 ns | 8432208 ns | 1.09 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 609306 ns | 537506 ns | 1.13 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 14458 ns | 16459 ns | 0.88 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 14812.5 ns | 14937.5 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 16791.5 ns | 16667 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 15334 ns | 15834 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 141134 ns | 139783 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal | 873104 ns | 745750 ns | 1.17 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 238182 ns | 242713 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24083.5 ns | 24083.5 ns | 1 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23875 ns | 23833 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24167 ns | 24188 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 23625 ns | 23625 ns | 1 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 878285 ns | 867529 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal | 6385188 ns | 5264396 ns | 1.21 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 692226 ns | 706748 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 8916 ns | 9208 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 9687.5 ns | 10083 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12125 ns | 11208 ns | 1.08 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 10416 ns | 9937 ns | 1.05 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 124959.5 ns | 122858 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal | 918334 ns | 796708 ns | 1.15 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 75531 ns | 75011 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14000 ns | 14458 ns | 0.97 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13729 ns | 14375 ns | 0.96 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14708 ns | 14833.5 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13834 ns | 14604 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 676549 ns | 664733 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal | 5573041 ns | 4970395.5 ns | 1.12 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 373189 ns | 380234 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 8062 ns | 9166 ns | 0.88 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9750 ns | 8770.5 ns | 1.11 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11916.5 ns | 11812.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10187.5 ns | 8875 ns | 1.15 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 124116 ns | 120199.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal | 883646 ns | 851833.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 69690 ns | 71031 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12625 ns | 12958 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12750 ns | 13250 ns | 0.96 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13542 ns | 13791.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12312 ns | 13062.5 ns | 0.94 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 561116 ns | 549928 ns | 1.02 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal | 4630937 ns | 3993041 ns | 1.16 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 345083.5 ns | 349744 ns | 0.99 |
| batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 27208.5 ns | 29917 ns | 0.91 |
| batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 32333.5 ns | 35500 ns | 0.91 |
| batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 31958 ns | 30666.5 ns | 1.04 |
| batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2041 ns | 2042 ns | 1.00 |
| batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16556 ns | 15829 ns | 1.05 |
| batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 82091 ns | 81070 ns | 1.01 |
| batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5229 ns | 5500 ns | 0.95 |
| batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 4687.5 ns | 5042 ns | 0.93 |
| batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5334 ns | 5437.5 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6458 ns | 6583.5 ns | 0.98 |
| batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 142634 ns | 138553.5 ns | 1.03 |
| batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 367964 ns | 375574 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 334 ns | 291 ns | 1.15 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 250 ns | 375 ns | 0.67 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 26682 ns | 25491 ns | 1.05 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 482271 ns | 274125 ns | 1.76 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 47990 ns | 48640 ns | 0.99 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6500 ns | 6334 ns | 1.03 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6562.5 ns | 6666 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6709 ns | 6875 ns | 0.98 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6188 ns | 6125 ns | 1.01 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 190767.5 ns | 186984.5 ns | 1.02 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 5874834 ns | 4846709 ns | 1.21 |
| batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 394363.5 ns | 399775 ns | 0.99 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 2042 ns | 1917 ns | 1.07 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 1917 ns | 2000 ns | 0.96 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2125 ns | 2042 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2000 ns | 2000 ns | 1 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 27167 ns | 26185.5 ns | 1.04 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal | 492292 ns | 456771 ns | 1.08 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 210002 ns | 209542 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 16833.5 ns | 16521 ns | 1.02 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 16417 ns | 16458 ns | 1.00 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17354.5 ns | 17166.5 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16458.5 ns | 16729.5 ns | 0.98 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 278278 ns | 274176 ns | 1.01 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal | 6125604 ns | 4934125 ns | 1.24 |
| batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 714427 ns | 693313 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 146500 ns | 175041 ns | 0.84 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 171396 ns | 176250 ns | 0.97 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 155584 ns | 151792 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 154167 ns | 153042 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 204804 ns | 199925 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1553583 ns | 1545458 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 231362.5 ns | 177492 ns | 1.30 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1324312.5 ns | 1316145.5 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1348021 ns | 1322833 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1319083.5 ns | 1306875 ns | 1.01 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1326542 ns | 1335833 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 925557 ns | 903541.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 8602229.5 ns | 6708709 ns | 1.28 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1014380 ns | 1125232 ns | 0.90 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23792 ns | 24875 ns | 0.96 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25354 ns | 25708.5 ns | 0.99 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28250 ns | 26667 ns | 1.06 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24604.5 ns | 25187 ns | 0.98 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 238411 ns | 234415.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal | 1139000 ns | 981687 ns | 1.16 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 120312 ns | 119891 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 117854 ns | 118146.5 ns | 1.00 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 124667 ns | 121937.5 ns | 1.02 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 174458.5 ns | 120062.5 ns | 1.45 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 118354 ns | 150541.5 ns | 0.79 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1098934 ns | 1068343 ns | 1.03 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal | 7919042 ns | 5874292 ns | 1.35 |
| layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 614406 ns | 611136 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 250 ns | 1.50 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 250 ns | 375 ns | 0.67 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 334 ns | 0.87 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23522 ns | 22848 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal | 491791.5 ns | 453916.5 ns | 1.08 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 50790 ns | 49700 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6583 ns | 6500 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6375 ns | 6542 ns | 0.97 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6833 ns | 6708 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6167 ns | 6458 ns | 0.95 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 207746.5 ns | 203322.5 ns | 1.02 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal | 5956667 ns | 4933125 ns | 1.21 |
| batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 395954 ns | 405594.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5958 ns | 6084 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6041.5 ns | 5583 ns | 1.08 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7604.5 ns | 7625 ns | 1.00 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6500 ns | 6375 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 147981.5 ns | 145118 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal | 774875 ns | 662166.5 ns | 1.17 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 239202 ns | 241563 ns | 0.99 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10000 ns | 9917 ns | 1.01 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10083 ns | 10292 ns | 0.98 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10667 ns | 10250 ns | 1.04 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9791.5 ns | 10417 ns | 0.94 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 916090 ns | 899874 ns | 1.02 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal | 7392292 ns | 5521125 ns | 1.34 |
| layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 688747.5 ns | 696188 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 708 ns | 666 ns | 1.06 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 666 ns | 667 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 666 ns | 667 ns | 1.00 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 625 ns | 625 ns | 1 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23031 ns | 22288 ns | 1.03 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal | 209625 ns | 206792 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 215712 ns | 218712.5 ns | 0.99 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4833 ns | 4542 ns | 1.06 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4584 ns | 4792 ns | 0.96 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 4833 ns | 4792 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4625 ns | 4584 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 230125.5 ns | 226923.5 ns | 1.01 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal | 1700146 ns | 1564208 ns | 1.09 |
| dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 599396 ns | 607636 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8396 ns | 8417 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8000 ns | 8834 ns | 0.91 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10125 ns | 10208 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 9062.5 ns | 8333.5 ns | 1.09 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 123106.5 ns | 121107.5 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal | 907333 ns | 787896 ns | 1.15 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 76081 ns | 69681 ns | 1.09 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8792 ns | 8542 ns | 1.03 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8459 ns | 8584 ns | 0.99 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9041 ns | 8834 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8270.5 ns | 8687.5 ns | 0.95 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 600302.5 ns | 588077 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal | 4960583.5 ns | 4126709 ns | 1.20 |
| groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 353604 ns | 353148.5 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 122750 ns | 126625 ns | 0.97 |
| batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 95625 ns | 131250 ns | 0.73 |
| batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 130334 ns | 129875 ns | 1.00 |
| batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 183125 ns | 181374.5 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46375 ns | 45747 ns | 1.01 |
| batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 98981 ns | 98706 ns | 1.00 |
| batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 303292 ns | 311584 ns | 0.97 |
| batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 182750 ns | 342188 ns | 0.53 |
| batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 345917 ns | 314062.5 ns | 1.10 |
| batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 608729 ns | 597708.5 ns | 1.02 |
| batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 195364.5 ns | 190310 ns | 1.03 |
| batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 494734 ns | 493310.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396125 ns | 397917 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215375 ns | 288229.5 ns | 0.75 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287708 ns | 288375 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756000 ns | 756667 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43820 ns | 42976 ns | 1.02 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal | 358000 ns | 361604 ns | 0.99 |
| dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 83390 ns | 85651 ns | 0.97 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1446958.5 ns | 1452354.5 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 863667 ns | 1135375 ns | 0.76 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 1133375 ns | 1136166 ns | 1.00 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2443417 ns | 2360834 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 252085 ns | 244938.5 ns | 1.03 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal | 1851958 ns | 1837437.5 ns | 1.01 |
| dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 350863.5 ns | 353764 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 626459 ns | 626917 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 682479 ns | 647958.5 ns | 1.05 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 615000 ns | 644166.5 ns | 0.95 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 641167 ns | 647292 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 203045 ns | 202423.5 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1359542 ns | 1353875 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 254223 ns | 253853 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2435250 ns | 2449396 ns | 0.99 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2470979.5 ns | 2442604.5 ns | 1.01 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2445042 ns | 2442291 ns | 1.00 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2415792 ns | 2488000 ns | 0.97 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1014910 ns | 984916.5 ns | 1.03 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 11589916 ns | 10289041.5 ns | 1.13 |
| layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1478675 ns | 1500640 ns | 0.99 |
| batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 29458.5 ns | 32979 ns | 0.89 |
| batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 33812.5 ns | 36916 ns | 0.92 |
| batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 34541 ns | 33875 ns | 1.02 |
| batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 1042 ns | 958 ns | 1.09 |
| batchedmm(2, Bsize=32)/forward/GPU/CUDA | 15442 ns | 15311 ns | 1.01 |
| batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 85531 ns | 73911 ns | 1.16 |
| batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3250 ns | 3208 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3042 ns | 3209 ns | 0.95 |
| batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3416 ns | 3354.5 ns | 1.02 |
| batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3166 ns | 3125 ns | 1.01 |
| batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 142240.5 ns | 136401 ns | 1.04 |
| batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 360413 ns | 360914 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 404291 ns | 405709 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 403708 ns | 408354.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 409042 ns | 407750 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 421875 ns | 421833 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 44262 ns | 43333 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal | 1119041 ns | 1083604.5 ns | 1.03 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 242882 ns | 245632 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 3855208 ns | 3884625 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 3997771 ns | 3994229 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 3998125 ns | 3999209 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 3773938 ns | 3792521 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 248524 ns | 243526 ns | 1.02 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal | 14976771 ns | 11754959 ns | 1.27 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1453704 ns | 1250757.5 ns | 1.16 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s)
3959
ns3916
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s)
3917
ns3958
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s)
3917
ns3917
ns1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s)
3875
ns4000
ns0.97
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA
34278.5
ns33572
ns1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal
161167
ns162667
ns0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU
40280
ns43020
ns0.94
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s)
15875
ns15667
ns1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s)
15583
ns15959
ns0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s)
16041
ns16042
ns1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s)
15791
ns15542
ns1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA
257529.5
ns253570
ns1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal
864083.5
ns834375
ns1.04
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU
168256.5
ns180881
ns0.93
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s)
403417
ns404000
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s)
221375
ns295708
ns0.75
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s)
295666
ns295709
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s)
760500
ns760917
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA
113952
ns112718
ns1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal
335792
ns342354.5
ns0.98
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU
88615.5
ns90981
ns0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1471958
ns1493312.5
ns0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
887791.5
ns1156854.5
ns0.77
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
1157167
ns1160000
ns1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2467666
ns2383125
ns1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA
255583.5
ns238647
ns1.07
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal
1946854
ns1884417
ns1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU
360243.5
ns359543.5
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
542
ns458
ns1.18
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
500
ns583
ns0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
584
ns583
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
500
ns583
ns0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA
26902
ns25565
ns1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal
486187.5
ns419791
ns1.16
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU
208227.5
ns212932
ns0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
7667
ns7417
ns1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
7666
ns7667
ns1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
7916.5
ns7791
ns1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
7250
ns7833
ns0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA
219818
ns215431.5
ns1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal
6151042
ns4957687
ns1.24
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
686716.5
ns709788
ns0.97
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s)
825562.5
ns828812.5
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s)
468833
ns617312
ns0.76
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s)
620188
ns619250
ns1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s)
1547479
ns1549375
ns1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA
131055
ns133852.5
ns0.98
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU
231953
ns169242
ns1.37
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s)
2669042
ns2694583.5
ns0.99
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s)
1538125.5
ns2012042
ns0.76
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s)
2006270.5
ns2001792
ns1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s)
4938583
ns4939479.5
ns1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA
242713
ns239151.5
ns1.01
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU
860168
ns886529
ns0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
375
ns250
ns1.50
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
291
ns334
ns0.87
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
375
ns333
ns1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
333
ns334
ns1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA
32634
ns31838
ns1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal
452000
ns259333
ns1.74
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU
48761
ns49400
ns0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
6437.5
ns6333
ns1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
6541.5
ns6500
ns1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
6750
ns6625
ns1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
6000
ns6520.5
ns0.92
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA
228896
ns223447
ns1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal
5302916
ns4629187.5
ns1.15
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
369843
ns375294
ns0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
2391250
ns2395187.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
2400000
ns2405291
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
2405958
ns2380959
ns1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
2372125
ns2436208
ns0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
204395
ns200278
ns1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1597249.5
ns1414792
ns1.13
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
377704
ns358993
ns1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
4646708.5
ns4648625
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4648958
ns4665041.5
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4659021
ns4656875
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
4685792
ns4669959
ns1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
915367
ns896958
ns1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
7426833
ns6803500
ns1.09
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1261857
ns1421880
ns0.89
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s)
7479
ns6750
ns1.11
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)
7125
ns7083
ns1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s)
7959
ns7292
ns1.09
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s)
7250
ns6584
ns1.10
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA
23573
ns23321
ns1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal
243500
ns239042
ns1.02
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU
39571
ns38450
ns1.03
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
70291.5
ns45667
ns1.54
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
45542
ns35834
ns1.27
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
63500
ns33937.5
ns1.87
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
33104
ns66729
ns0.50
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA
217821
ns216719
ns1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal
2084458
ns1971646
ns1.06
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU
226612
ns249873
ns0.91
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)
20396
ns21583.5
ns0.94
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s)
24479.5
ns26958
ns0.91
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s)
24854.5
ns22875
ns1.09
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s)
5500
ns5250
ns1.05
batchedmm(2, Bsize=512)/forward/GPU/CUDA
16892
ns16231
ns1.04
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU
85151
ns86831
ns0.98
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s)
11958
ns11791.5
ns1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s)
9000
ns10333
ns0.87
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s)
10958.5
ns10708
ns1.02
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s)
18167
ns17979
ns1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA
227664.5
ns225788.5
ns1.01
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU
389024
ns379954
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s)
404791
ns405812.5
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s)
223500
ns297333.5
ns0.75
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s)
296709
ns297167
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s)
762750
ns762666
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA
46360
ns46002
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal
340000
ns339937.5
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU
88940
ns89221
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s)
1485750.5
ns1490875.5
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s)
895812
ns1168979.5
ns0.77
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s)
1165791.5
ns1165791
ns1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s)
2472333
ns2389458
ns1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA
290272
ns288633.5
ns1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal
2106583
ns2056875
ns1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU
377424
ns383589
ns0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
432770.5
ns433750
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
430583
ns436958
ns0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
436958
ns436167
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
448209
ns448500
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
54092
ns55020
ns0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1074083.5
ns1006500
ns1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
235772
ns239793
ns0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
3888958
ns3901875
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
4016791.5
ns4017833
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
4025938
ns4034833
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
3793958.5
ns3796208.5
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
263523
ns262986
ns1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
11929333
ns10384542
ns1.15
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1247352
ns1253342.5
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s)
8750
ns8708
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s)
6875
ns7667
ns0.90
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s)
7667
ns7667
ns1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s)
12417
ns12458
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA
24084
ns23658
ns1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal
211583
ns213500
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU
216562
ns218512
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s)
45125
ns45542
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s)
44750
ns45333
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s)
45375
ns45625
ns0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s)
45187.5
ns45042
ns1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA
347338.5
ns345391.5
ns1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal
1883625.5
ns1709416
ns1.10
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU
671931.5
ns674767
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
104146.5
ns126458
ns0.82
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
86437
ns123125
ns0.70
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
92875
ns88875
ns1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
126625
ns83854.5
ns1.51
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
189767
ns190159
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal
1966250
ns1948625
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
183982
ns196902
ns0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2011000
ns2023167
ns0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2025000
ns1999791
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2009458
ns2015209
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2016917
ns2007042
ns1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
535873.5
ns533000
ns1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal
11961958.5
ns9148917
ns1.31
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
982380
ns1104121
ns0.89
This comment was automatically generated by workflow using github-action-benchmark.
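
The ratio reported for each benchmark appears to be the first timing divided by the second, rounded to two decimals, so a value above 1 means the first measurement took longer. The sketch below reproduces that arithmetic for a few rows copied from the results above. The helper names (`ratio`, `slower_than`) and the 1.5 cutoff are illustrative assumptions, not settings taken from this workflow.

```python
# Illustrative sketch only: the helper names and the 1.5 cutoff are
# assumptions for this example, not part of the benchmark workflow.
# Each row: (benchmark name, first timing in ns, second timing in ns).
ROWS = [
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s)", 63500.0, 33937.5),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)", 33104.0, 66729.0),
    ("batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal", 452000.0, 259333.0),
    ("batchedmm(2, Bsize=32)/forward/CPU/2 thread(s)", 29458.5, 32979.0),
]


def ratio(first_ns, second_ns):
    """Return the first timing divided by the second timing."""
    return first_ns / second_ns


def slower_than(rows, threshold):
    """Return (name, rounded ratio) for rows whose ratio exceeds threshold."""
    return [
        (name, round(ratio(first, second), 2))
        for name, first, second in rows
        if ratio(first, second) > threshold
    ]


if __name__ == "__main__":
    for name, r in slower_than(ROWS, 1.5):
        print(f"{r:.2f}  {name}")
```

With the assumed cutoff of 1.5, only the first and third rows are listed. Ratios this far from 1 on sub-millisecond CPU timings may also reflect run-to-run noise rather than a real change.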