This repository has been archived by the owner on Nov 4, 2024. It is now read-only.
chore: bump crate-ci/typos from 1.24.6 to 1.25.0
Bumps [crate-ci/typos](https://github.com/crate-ci/typos) from 1.24.6 to 1.25.0.
- [Release notes](https://github.com/crate-ci/typos/releases)
- [Changelog](https://github.com/crate-ci/typos/blob/master/CHANGELOG.md)
- [Commits](crate-ci/typos@v1.24.6...v1.25.0)

---
updated-dependencies:
- dependency-name: crate-ci/typos
  dependency-type: direct:production
  update-type: version-update:semver-minor
...

Signed-off-by: dependabot[bot] <[email protected]>
ba739d3
LuxLib Benchmarks
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)
5541
ns6104.5
ns0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s)
5208.5
ns6125
ns0.85
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)
6834
ns7166
ns0.95
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s)
4917
ns6042
ns0.81
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA
102997
ns105660
ns0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU
422395
ns401954
ns1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s)
10125
ns9979
ns1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s)
10167
ns10000
ns1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s)
9917
ns10125
ns0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s)
10020.5
ns10063
ns1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA
530333
ns495391
ns1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU
11174375
ns682487
ns16.37
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s)
2854
ns1812
ns1.58
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)
1375
ns1708
ns0.81
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3750
ns1667
ns2.25
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s)
2792
ns2104
ns1.33
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA
19948
ns20067
ns0.99
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU
33501
ns31000
ns1.08
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
3834
ns4041
ns0.95
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
4250
ns3625
ns1.17
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
4208
ns4542
ns0.93
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
4416
ns4250.5
ns1.04
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA
131207.5
ns133056
ns0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU
146692
ns146031
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
58167
ns58042
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
39792
ns39959
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
38209
ns39792
ns0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
83208
ns83333
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
36515
ns36918.5
ns0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
80481
ns76900
ns1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
2038875
ns2030417
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
2083750
ns2081666.5
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
2035541
ns2084437
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
2003250
ns2002333
ns1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
217066
ns220443
ns0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1203774
ns1433294
ns0.84
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s)
146333.5
ns146500
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)
147458
ns164208.5
ns0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s)
174542
ns150937.5
ns1.16
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)
150167
ns189709
ns0.79
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA
167907.5
ns166381.5
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU
171622
ns187972
ns0.91
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s)
1119853.5
ns1113437
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s)
1129187.5
ns1109375
ns1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s)
1072541
ns1117083.5
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s)
1117229.5
ns1112084
ns1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA
620063
ns646028
ns0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU
1023002
ns1026270
ns1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
5021.5
ns6250.5
ns0.80
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
5083
ns4917
ns1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
6417
ns5562.5
ns1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
4584
ns4708
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
79500
ns82687
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
59431
ns59005.5
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
8833
ns8958
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
8458
ns8833
ns0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
9083
ns9167
ns0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
8958
ns8875
ns1.01
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
540188.5
ns554954
ns0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
390145
ns384224
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
17750
ns18208
ns0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
17000
ns22250
ns0.76
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
22125
ns20500
ns1.08
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
18146
ns17833.5
ns1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
61981.5
ns62129
ns1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
78051
ns77001
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
212750
ns234334
ns0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
257833
ns229500
ns1.12
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
221375
ns224000
ns0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
221750
ns219041.5
ns1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
323096
ns329979.5
ns0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
463260
ns465894
ns0.99
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s)
666
ns584
ns1.14
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s)
625
ns708
ns0.88
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s)
875
ns750
ns1.17
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s)
625
ns645.5
ns0.97
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA
18860
ns19107
ns0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU
30120
ns32171
ns0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
1458
ns1458
ns1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
1375
ns1334
ns1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
1625
ns1542
ns1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
1375
ns1375
ns1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA
114822.5
ns114910.5
ns1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU
123847
ns124841
ns0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
7500
ns7417
ns1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
5333
ns5354.5
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
5333
ns5458
ns0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
10459
ns10042
ns1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
23715.5
ns23654
ns1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
46501
ns48941
ns0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
227792
ns256833
ns0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
241750
ns269917
ns0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
241584
ns269000
ns0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
227125
ns213417
ns1.06
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
188481.5
ns184585
ns1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
591832
ns588346
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4084
ns4084
ns1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4125
ns4084
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s)
3958
ns4125
ns0.96
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4125
ns4083
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA
23784
ns23536
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU
45550
ns47570
ns0.96
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16750
ns16500
ns1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16792
ns16667
ns1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16791
ns17042
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16500
ns16500
ns1
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA
184666.5
ns185621
ns0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU
171442
ns171902
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s)
493292
ns493500
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s)
312833
ns313000
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s)
310584
ns312583
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s)
847917
ns847333
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA
113490
ns113322
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU
243193
ns242543
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s)
2121291
ns2121250
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s)
1584833
ns1582666
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s)
1574875
ns1584000
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s)
3034896
ns3043250.5
ns1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA
228348
ns230454
ns0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU
739108
ns746137
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
7021
ns7000.5
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
6792
ns6479.5
ns1.05
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
7958
ns6708
ns1.19
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
6875
ns6458
ns1.06
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA
82934
ns83715.5
ns0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU
57300
ns59480
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
11520.5
ns12396
ns0.93
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
11708
ns11500
ns1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
12062.5
ns12104.5
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10896
ns11333.5
ns0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA
598177.5
ns600141.5
ns1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
401725
ns410324
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s)
541
ns542
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s)
542
ns541
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s)
542
ns541
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s)
500
ns500
ns1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA
23280.5
ns23331
ns1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU
48351
ns51010
ns0.95
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
2083
ns2125
ns0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
2166
ns2084
ns1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
2209
ns2167
ns1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2084
ns2166
ns0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA
217524
ns233774
ns0.93
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU
178702
ns182892
ns0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
8542
ns8417
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
9229.5
ns9563
ns0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
11042
ns10021
ns1.10
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
8042
ns8583
ns0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
92171
ns110268
ns0.84
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
76060.5
ns71861
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
19125
ns18042
ns1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
18895.5
ns18416.5
ns1.03
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
19375
ns19083.5
ns1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
18458
ns18187.5
ns1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
534402.5
ns612118
ns0.87
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
379154
ns379663
ns1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s)
583
ns542
ns1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s)
542
ns583
ns0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s)
500
ns500
ns1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA
33745.5
ns34018
ns0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU
45241
ns48210
ns0.94
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
9104
ns9000
ns1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
9583
ns9250
ns1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
9187.5
ns9541.5
ns0.96
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
10042
ns9187.5
ns1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA
242113
ns263691
ns0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU
367124
ns363818.5
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s)
398958
ns399291
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s)
215291
ns215375
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s)
213750
ns215291
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s)
756041
ns756375
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA
111898
ns111229
ns1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU
77281
ns74750
ns1.03
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s)
1396458
ns1397958
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s)
859875
ns860270.5
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s)
847958
ns859500
ns0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s)
2356833.5
ns2356875
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA
199002
ns199160
ns1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU
322423
ns325203
ns0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7250
ns7458.5
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7625.5
ns7583
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
9062.5
ns8250
ns1.10
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7229
ns7188
ns1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA
126183.5
ns138757.5
ns0.91
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU
57821
ns59831
ns0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
16959
ns12708.5
ns1.33
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
14354.5
ns16250
ns0.88
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
14792
ns16708
ns0.89
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
15042
ns12250
ns1.23
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA
851673
ns903568
ns0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
420849.5
ns426569.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
32959
ns25146
ns1.31
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
29083.5
ns29875
ns0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
30875
ns29563
ns1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
25770.5
ns28708
ns0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
184566
ns186563
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
110921
ns112512
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
160875
ns158917
ns1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
124458
ns155729
ns0.80
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
145396
ns147416.5
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
157729
ns143875
ns1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1005586
ns1016648
ns0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
576731
ns580615
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
75875
ns74583
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
75042
ns75291
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
80959
ns84145.5
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
74437.5
ns80750
ns0.92
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA
190691
ns192007
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
124242
ns121601
ns1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
300833
ns303292
ns0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
322542
ns318458
ns1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
298292
ns310583.5
ns0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
219396
ns286500
ns0.77
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
1023572
ns1028367
ns1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
692382
ns694997
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
13000
ns13208
ns0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
13500
ns13209
ns1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
14833
ns14416.5
ns1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
13208
ns12583
ns1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
136120
ns137690
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
234302
ns235293
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
27083.5
ns25916.5
ns1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
26395.5
ns26042
ns1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
27146
ns27125
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
27770.5
ns27750
ns1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
907766
ns917440.5
ns0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
693402
ns677137
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s)
11500
ns11021.5
ns1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s)
10875
ns12104
ns0.90
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s)
13249.5
ns12667
ns1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s)
11666
ns11084
ns1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA
119510.5
ns118805.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU
240667.5
ns238257.5
ns1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
23021
ns22625
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
23312.5
ns23354.5
ns1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
23917
ns23500
ns1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
22708
ns23125
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA
664160.5
ns678428
ns0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU
675107
ns679757
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s)
66750
ns66333
ns1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s)
63542
ns64583.5
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s)
68709
ns68500
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s)
65000
ns64792
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA
101310
ns101302
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU
234673
ns234893
ns1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s)
466062.5
ns486625
ns0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s)
478625
ns486083
ns0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s)
472875
ns478646
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s)
518125
ns464625
ns1.12
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA
484379
ns490708
ns0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU
712597
ns709767
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s)
7479
ns7562.5
ns0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s)
7687.5
ns7875
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s)
9958
ns8500
ns1.17
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s)
7667
ns7292
ns1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA
134386
ns136584.5
ns0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU
57600
ns57580
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s)
15750
ns14459
ns1.09
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s)
16333
ns14417
ns1.13
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s)
15250
ns14625
ns1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s)
15291
ns16625
ns0.92
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA
880162.5
ns882666
ns1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU
398914
ns396884
ns1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s)
6151875
ns6159458
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)
3226750
ns3225666
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s)
3223292
ns3225333
ns1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s)
11913583
ns11918958
ns1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA
350966
ns345241.5
ns1.02
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU
302008
ns301508
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s)
19126979
ns19144854.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s)
11161229.5
ns11111958.5
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s)
11077916
ns11126458
ns1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s)
36533646
ns36537562.5
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA
1006948.5
ns1009913
ns1.00
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU
1127082
ns1164436.5
ns0.97
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s)
1042
ns1083
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s)
1042
ns1125
ns0.93
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s)
1042
ns1042
ns1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s)
1000
ns1041
ns0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA
23502
ns23469
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU
209393
ns209702
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s)
3958
ns4000
ns0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s)
4083
ns4000
ns1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s)
4041
ns4000
ns1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s)
3917
ns4000
ns0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA
270232
ns270402
ns1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU
623846
ns624936
ns1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s)
7833
ns7896
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s)
8042
ns7624.5
ns1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s)
9750
ns9041
ns1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s)
7625
ns8792
ns0.87
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA
116542
ns116551
ns1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU
69700
ns67301
ns1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s)
12375
ns12375
ns1
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s)
12458
ns12354.5
ns1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s)
12917
ns13458
ns0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s)
12292
ns11521
ns1.07
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA
604932
ns608379
ns0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU
357073.5
ns355544
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s)
292
ns375
ns0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s)
333
ns333
ns1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s)
292
ns333
ns0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s)
292
ns333
ns0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA
22511.5
ns22683
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU
46531
ns48621
ns0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s)
3167
ns2917
ns1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s)
3166
ns3000
ns1.06
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s)
3333
ns3458
ns0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s)
2875
ns2917
ns0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA
194011
ns194883.5
ns1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU
158126.5
ns160881
ns0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s)
12125
ns11833
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s)
12333
ns11771
ns1.05
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s)
13708
ns12666
ns1.08
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s)
11937.5
ns11708
ns1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA
115429.5
ns114987
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU
237322
ns237082
ns1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s)
22000
ns22270.5
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s)
24459
ns23625
ns1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s)
23396
ns23145.5
ns1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s)
21792
ns22417
ns0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA
554065.5
ns559620
ns0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU
651546.5
ns657467.5
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s)
4375
ns4417
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s)
4416
ns4417
ns1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s)
4291
ns4375
ns0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s)
4375
ns4375
ns1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA
24232
ns23954
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU
48651
ns47821
ns1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s)
16208
ns16375
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s)
16500
ns16375
ns1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s)
16042
ns16500
ns0.97
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s)
16250
ns16250
ns1
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA
316149
ns319321
ns0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU
208227
ns205182
ns1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 2083 ns | 2209 ns | 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 2083 ns | 2208 ns | 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 2083 ns | 2209 ns | 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 2000 ns | 2084 ns | 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 34761 ns | 34739 ns | 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 205252 ns | 207283 ns | 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 17937.5 ns | 17729.5 ns | 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 19271 ns | 19291.5 ns | 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 18584 ns | 19125 ns | 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 18375 ns | 17500 ns | 1.05
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 283100 ns | 284503 ns | 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 682562.5 ns | 683047 ns | 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 59229.5 ns | 58771 ns | 1.01
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 60896 ns | 61500 ns | 0.99
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 60959 ns | 62167 ns | 0.98
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 53792 ns | 51041 ns | 1.05
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 66317 ns | 66683 ns | 0.99
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU | 100931 ns | 96771 ns | 1.04
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 195625 ns | 189875 ns | 1.03
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 149417 ns | 148499.5 ns | 1.01
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 138292 ns | 141104 ns | 0.98
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 219291 ns | 271312 ns | 0.81
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 208292.5 ns | 208001 ns | 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU | 554746 ns | 556366 ns | 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 85062 ns | 83188 ns | 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 127458 ns | 116270.5 ns | 1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 86104 ns | 87667 ns | 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 86812.5 ns | 88791 ns | 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192707 ns | 190555.5 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 169152 ns | 168726.5 ns | 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1926791.5 ns | 1885521 ns | 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1918312.5 ns | 1906833 ns | 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1895083 ns | 1922167 ns | 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1862750 ns | 1922208.5 ns | 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 503729 ns | 505315 ns | 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 915670 ns | 918625.5 ns | 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 291 ns | 292 ns | 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 333 ns | 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 21463.5 ns | 21748.5 ns | 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU | 41990 ns | 40920 ns | 1.03
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1792 ns | 1834 ns | 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1833 ns | 1834 ns | 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1833 ns | 1833 ns | 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1834 ns | 1792 ns | 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 244422 ns | 243459 ns | 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 183082 ns | 176522 ns | 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 11375 ns | 11042 ns | 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 10292 ns | 9834 ns | 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 12166 ns | 11166.5 ns | 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 9084 ns | 9417 ns | 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 113574.5 ns | 115799 ns | 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 237182 ns | 235862 ns | 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9583 ns | 9916 ns | 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 12396 ns | 11000 ns | 1.13
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10750 ns | 10437.5 ns | 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9458 ns | 9625 ns | 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 489512 ns | 492386 ns | 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 632057 ns | 634956.5 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57959 ns | 58666 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39208 ns | 39500 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38708 ns | 39333 ns | 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83375 ns | 83750 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 38522 ns | 38435 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 78311 ns | 79261 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1724708.5 ns | 1932333.5 ns | 0.89
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1941208 ns | 1949916 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1947834 ns | 1971250 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1891208.5 ns | 1900375 ns | 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 210148.5 ns | 211772 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 998640 ns | 1010796 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 269083 ns | 276583 ns | 0.97
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 268833 ns | 268541 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 275875 ns | 270583.5 ns | 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 269729.5 ns | 269542 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 193164 ns | 196349 ns | 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 282737.5 ns | 281833 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 587166.5 ns | 662208 ns | 0.89
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 614875 ns | 709250 ns | 0.87
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 651500 ns | 685042 ns | 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 652062 ns | 690770.5 ns | 0.94
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 993619.5 ns | 994716 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 899480 ns | 902690 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2202416 ns | 2181125 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2216125 ns | 2197167 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2192812.5 ns | 2214166 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2220500 ns | 2217666 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 179761.5 ns | 156988.5 ns | 1.15
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 415294 ns | 421825 ns | 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5520708 ns | 5477291.5 ns | 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5537000 ns | 5530250 ns | 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5449958.5 ns | 5519334 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5515167 ns | 5543313 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 930917 ns | 938151 ns | 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1711728 ns | 1722729 ns | 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 477542 ns | 478167 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 257375 ns | 257208 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 255375 ns | 257292 ns | 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 908666 ns | 908750 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 46830 ns | 46532.5 ns | 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 245313 ns | 246353 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 2116979 ns | 2133375 ns | 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1589770.5 ns | 1588083 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1579645.5 ns | 1587417 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 3037833.5 ns | 3041125 ns | 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 274670.5 ns | 256675 ns | 1.07
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 769148 ns | 775668 ns | 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57875 ns | 58000 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39000 ns | 39625 ns | 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38458 ns | 39375 ns | 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83333 ns | 83500 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 28067 ns | 27930.5 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 75041 ns | 73260 ns | 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2047334 ns | 2017271 ns | 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2049854.5 ns | 2083062.5 ns | 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2059333 ns | 2080584 ns | 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1987666.5 ns | 1994312.5 ns | 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 227893 ns | 224353 ns | 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1038901 ns | 1036751 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58000 ns | 58292 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 39333 ns | 39917 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 38333 ns | 39750 ns | 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 83125 ns | 83458 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 48807.5 ns | 48290 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 67171 ns | 69781 ns | 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1934875 ns | 1920208 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1962667 ns | 1966666.5 ns | 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1938167 ns | 1956354.5 ns | 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1827396 ns | 1892750 ns | 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 233324 ns | 231868 ns | 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 914834.5 ns | 917180 ns | 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 333 ns | 0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 292 ns | 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 34314.5 ns | 33423 ns | 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 45171 ns | 47961 ns | 0.94
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6542 ns | 6750 ns | 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7083 ns | 6625 ns | 1.07
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7000 ns | 6916 ns | 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6958 ns | 6542 ns | 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 202653 ns | 205663 ns | 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 366114 ns | 364303.5 ns | 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 250 ns | 333 ns | 0.75
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 250 ns | 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 32763 ns | 31975 ns | 1.02
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU | 38131 ns | 40370 ns | 0.94
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2792 ns | 3667 ns | 0.76
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 3000 ns | 3625 ns | 0.83
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 3459 ns | 3209 ns | 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2875 ns | 3250 ns | 0.88
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 184852 ns | 182875 ns | 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU | 151962 ns | 146242 ns | 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 494188 ns | 468625 ns | 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 500333.5 ns | 492396 ns | 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 470041.5 ns | 470250 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 489437 ns | 466354 ns | 1.05
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 134801.5 ns | 134348 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 322243 ns | 349229 ns | 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4053479 ns | 4091499.5 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4072375 ns | 4078417 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4033500 ns | 4081499.5 ns | 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4070625 ns | 4051646 ns | 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 680027 ns | 673570.5 ns | 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1463545 ns | 1482381 ns | 0.99
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 49933854 ns | 49972812 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 26023000 ns | 26026291 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 25982541.5 ns | 25991500 ns | 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 97045646 ns | 97072458 ns | 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1626445 ns | 1599973.5 ns | 1.02
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU | 1047410 ns | 1057326.5 ns | 0.99
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 155000104.5 ns | 154932104.5 ns | 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 89050542 ns | 89308062.5 ns | 1.00
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 88666916.5 ns | 88895875 ns | 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 295479666.5 ns | 295925812.5 ns | 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 6477658 ns | 6475879 ns | 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU | 5560101.5 ns | 5578679 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 20062.5 ns | 18917 ns | 1.06
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 15500 ns | 16000 ns | 0.97
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 13833.5 ns | 13708 ns | 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 15708.5 ns | 16437.5 ns | 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 20427 ns | 19926 ns | 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU | 25781 ns | 27550 ns | 0.94
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 11063 ns | 10937 ns | 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 7895.5 ns | 7770.5 ns | 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 7937.5 ns | 7708 ns | 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 17375 ns | 17291 ns | 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 248558 ns | 243495.5 ns | 1.02
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU | 143922 ns | 147112 ns | 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8417 ns | 8750 ns | 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 10229 ns | 9708.5 ns | 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10375 ns | 10667 ns | 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8646 ns | 8646 ns | 1
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 119635 ns | 119480.5 ns | 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 239173 ns | 237342 ns | 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10041.5 ns | 10312.5 ns | 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10667 ns | 11041 ns | 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10750 ns | 10667 ns | 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10145.5 ns | 10770.5 ns | 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 591757 ns | 585828 ns | 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 654107 ns | 655982 ns | 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10375 ns | 10020.5 ns | 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9770.5 ns | 9333 ns | 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11312.5 ns | 10396 ns | 1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 9500 ns | 9500 ns | 1
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 117527.5 ns | 115334.5 ns | 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 72401 ns | 70430.5 ns | 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 14292 ns | 15292 ns | 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 17708 ns | 17375 ns | 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 14834 ns | 15542 ns | 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 14750 ns | 16250 ns | 0.91
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 562161 ns | 558960.5 ns | 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 345113 ns | 346234 ns | 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 500 ns | 625 ns | 0.80
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 625 ns | 584 ns | 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 500 ns | 583 ns | 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 34287 ns | 33420.5 ns | 1.03
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 207072 ns | 208233 ns | 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8625 ns | 8875 ns | 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9667 ns | 8917 ns | 1.08
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8667 ns | 9375 ns | 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8687.5 ns | 8125 ns | 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 224465.5 ns | 223663.5 ns | 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 658996 ns | 660067.5 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 17292 ns | 15833 ns | 1.09
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 13771 ns | 14958 ns | 0.92
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 12458.5 ns | 13166.5 ns | 0.95
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 10770.5 ns | 12042 ns | 0.89
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 20290 ns | 20351 ns | 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 186982 ns | 188642 ns | 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 35625 ns | 35334 ns | 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 35625 ns | 35396 ns | 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 35834 ns | 35354.5 ns | 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 35666 ns | 35459 ns | 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 261247.5 ns | 258908.5 ns | 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 589266 ns | 593676 ns | 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 450208 ns | 453584 ns | 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 494583.5 ns | 448854.5 ns | 1.10
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 456791.5 ns | 458979 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 461833 ns | 463708 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 194699 ns | 194627 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 360324 ns | 361629 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4069833 ns | 4069291 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4063479 ns | 4057666 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4038041.5 ns | 4066166.5 ns | 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4038167 ns | 4041000 ns | 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 514235 ns | 509044 ns | 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1354948.5 ns | 1369935 ns | 0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 788948625 ns | 786136291 ns | 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 416422208.5 ns | 416023146 ns | 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 415183312.5 ns | 416822792 ns | 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 1509932250 ns | 1513689687.5 ns | 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22522291.5 ns | 22552578.5 ns | 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU | 14572928 ns | 14622705 ns | 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 2530024250 ns | 2527797917 ns | 1.00
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 1506878542 ns | 1507508250 ns | 1.00
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1519381125 ns | 1513719042 ns | 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 4752439166 ns | 4744640792 ns | 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 118941901 ns | 119636395 ns | 0.99
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU | 87857404.5 ns | 87882829 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 77417 ns | 78083.5 ns | 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 77625 ns | 79375 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 79500 ns | 79292 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 76875 ns | 77417 ns | 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 194658.5 ns | 195081 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 106561 ns | 106236.5 ns | 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 284458 ns | 291584 ns | 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 286188 ns | 232333.5 ns | 1.23
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 197750 ns | 275646 ns | 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 192708 ns | 268875 ns | 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1005733 ns | 999623 ns | 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 630306 ns | 637827 ns | 0.99
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 199829146 ns | 199983542 ns | 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 104009479.5 ns | 103920208 ns | 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 103995667 ns | 103978083 ns | 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 389216083 ns | 389299042 ns | 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5833781 ns | 5843844.5 ns | 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU | 3615787 ns | 3606828 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 620952291.5 ns | 620238542 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 354227354.5 ns | 353393416.5 ns | 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 354977104.5 ns | 352881646 ns | 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 1182226250 ns | 1193561791 ns | 0.99
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 26559529 ns | 26518526 ns | 1.00
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU | 21846736 ns | 22094133 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7250 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 5375 ns | 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5250 ns | 5375 ns | 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10292 ns | 9875 ns | 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 27179 ns | 26733.5 ns | 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 48210 ns | 46490 ns | 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212666.5 ns | 220979 ns | 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222542 ns | 224417 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 221917 ns | 223500 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206167 ns | 207583 ns | 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 217340.5 ns | 215495 ns | 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 523165 ns | 519876 ns | 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 8708 ns | 10312.5 ns | 0.84
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8958 ns | 9479 ns | 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 10667 ns | 9895.5 ns | 1.08
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8813 ns | 9937.5 ns | 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 115467 ns | 113347 ns | 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 73431 ns | 71090 ns | 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7584 ns | 9604 ns | 0.79
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 11521 ns | 11437.5 ns | 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8542 ns | 10042 ns | 0.85
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8062.5 ns | 10145.5 ns | 0.79
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 494404 ns | 491382 ns | 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 316873 ns | 314464 ns | 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 708 ns | 0.71
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 708 ns | 709 ns | 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 708 ns | 583 ns | 1.21
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 542 ns | 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 25358 ns | 24930.5 ns | 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47920 ns | 48911 ns | 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9250 ns | 12375 ns | 0.75
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 11396 ns | 14958 ns | 0.76
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10875 ns | 9000 ns | 1.21
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9750 ns | 9666 ns | 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 246651 ns | 246496 ns | 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 388584 ns | 386995 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 110834 ns | 110750 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 87791 ns | 90417 ns | 0.97
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 87792 ns | 88125 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 154959 ns | 155146 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 23405 ns | 23300 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU | 189432 ns | 190702 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 539625 ns | 534625 ns | 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 562458 ns | 562249.5 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 535812.5 ns | 542812.5 ns | 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 535000 ns | 535250 ns | 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 220513 ns | 217557.5 ns | 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU | 604586.5 ns | 610017 ns | 0.99
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 5354 ns | 5375 ns | 1.00
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 7042 ns | 6709 ns | 1.05
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 8229.5 ns | 7375 ns | 1.12
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 6541 ns | 6520.5 ns | 1.00
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 17715 ns | 17156 ns | 1.03
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU | 71815.5 ns | 71171 ns | 1.01
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 11750 ns | 12833 ns | 0.92
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 11459 ns | 11375 ns | 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 10792 ns | 10145.5 ns | 1.06
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 17125 ns | 16708.5 ns | 1.02
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 206057.5 ns | 204040 ns | 1.01
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU | 379023.5 ns | 364443 ns | 1.04
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 39250 ns | 38834 ns | 1.01
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 51250 ns | 50542 ns | 1.01
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 50583 ns | 51417 ns | 0.98
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 13750 ns | 13854.5 ns | 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 21128.5 ns | 21940 ns | 0.96
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU | 84216 ns | 84996 ns | 0.99
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 36208 ns | 36917 ns | 0.98
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 30584 ns | 31042 ns | 0.99
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 29250 ns | 28125 ns | 1.04
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 57375 ns | 77979.5 ns | 0.74
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 184668 ns | 180753 ns | 1.02
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU | 414734 ns | 397599 ns | 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 1583.5 ns | 1854.5 ns | 0.85
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 2000 ns | 1958 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 2187 ns | 2209 ns | 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 1833.5 ns | 1666.5 ns | 1.10
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 19835 ns | 19375 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU | 25650 ns | 27490 ns | 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 2292 ns | 2208 ns | 1.04
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 2459 ns | 2167 ns | 1.13
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 2458 ns | 2416 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 2187.5 ns | 2125 ns | 1.03
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 197459.5 ns | 194356 ns | 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU | 134722 ns | 136311 ns | 0.99
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5021 ns | 5166.5 ns | 0.97 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5167 ns | 5520.5 ns | 0.94 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5500 ns | 6396 ns | 0.86 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5959 ns | 5187.5 ns | 1.15 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 141255 ns | 140899.5 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 59291 ns | 57270 ns | 1.04 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8396 ns | 9020.5 ns | 0.93 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9208 ns | 9437.5 ns | 0.98 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9791 ns | 8583 ns | 1.14 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8417 ns | 1.00 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 823637 ns | 815402.5 ns | 1.01 |
| layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 383144 ns | 388544 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 54917 ns | 55083 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 54291 ns | 54292 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 54250 ns | 54375 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 56541 ns | 56417 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 37246 ns | 36794 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 204842 ns | 206892 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 477000 ns | 478792 ns | 1.00 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 496604 ns | 535375 ns | 0.93 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 494271 ns | 496937 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 467792 ns | 474395.5 ns | 0.99 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 259843 ns | 257604 ns | 1.01 |
| batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 794468 ns | 810628 ns | 0.98 |
| batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 3306791 ns | 3331771 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1761916 ns | 1763000 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 1756167 ns | 1769417 ns | 0.99 |
| batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 6310604.5 ns | 6317646 ns | 1.00 |
| batchedmm(128, Bsize=128)/forward/GPU/CUDA | 205873.5 ns | 204848.5 ns | 1.01 |
| batchedmm(128, Bsize=128)/forward/GPU/AMDGPU | 214142 ns | 209783 ns | 1.02 |
| batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 11469395.5 ns | 11521375.5 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 6567229 ns | 6550500 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 6474021 ns | 6561792 ns | 0.99 |
| batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 21232020.5 ns | 21242604 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 743103.5 ns | 741852 ns | 1.00 |
| batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU | 1064100 ns | 1060031 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 7125 ns | 6292 ns | 1.13 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4791 ns | 5666 ns | 0.85 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7042 ns | 7042 ns | 1 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5333 ns | 5209 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 130642.5 ns | 132073.5 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 55570 ns | 54021 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7333 ns | 10375 ns | 0.71 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8500 ns | 9584 ns | 0.89 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7500 ns | 7417 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7625 ns | 7667 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 721790 ns | 718413.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 371284 ns | 375894 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 124000 ns | 144542 ns | 0.86 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 105458 ns | 124479.5 ns | 0.85 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 100416.5 ns | 101625 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 93688 ns | 150583 ns | 0.62 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 149649.5 ns | 148583.5 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 203312 ns | 182281 ns | 1.12 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2020750 ns | 2030666.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2021041 ns | 2034833.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1993771 ns | 2034166.5 ns | 0.98 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2025000 ns | 2024125 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 676279 ns | 674148 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1107011 ns | 1114502 ns | 0.99 |
| batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 33958.5 ns | 32917 ns | 1.03 |
| batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 34334 ns | 35208 ns | 0.98 |
| batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 32584 ns | 33334 ns | 0.98 |
| batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 708 ns | 645.5 ns | 1.10 |
| batchedmm(2, Bsize=4)/forward/GPU/CUDA | 16105 ns | 15722 ns | 1.02 |
| batchedmm(2, Bsize=4)/forward/GPU/AMDGPU | 78881 ns | 79041 ns | 1.00 |
| batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2479.5 ns | 3208 ns | 0.77 |
| batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 4000 ns | 3958 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3125 ns | 3084 ns | 1.01 |
| batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 2292 ns | 2333 ns | 0.98 |
| batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 139246 ns | 136962.5 ns | 1.02 |
| batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU | 352743.5 ns | 340914 ns | 1.03 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7209 ns | 7292 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5417 ns | 5417 ns | 1 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5291 ns | 5333 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10208 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 36300 ns | 35974 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49595.5 ns | 50280 ns | 0.99 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 217854 ns | 215209 ns | 1.01 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 222916.5 ns | 228896 ns | 0.97 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220604.5 ns | 220729.5 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 206125 ns | 205917 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 241210 ns | 240303 ns | 1.00 |
| batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 515535 ns | 519340 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3917 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3958 ns | 3958 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3917 ns | 3958 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3958 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22201 ns | 21966 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU | 41991 ns | 42521 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14708 ns | 14709 ns | 1.00 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14708 ns | 14792 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14750 ns | 14834 ns | 0.99 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14708 ns | 14708 ns | 1 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 301554 ns | 299460 ns | 1.01 |
| dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 195902 ns | 188891.5 ns | 1.04 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 116166.5 ns | 128584 ns | 0.90 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 130416 ns | 128208 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 104479 ns | 106604 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 105250 ns | 119354 ns | 0.88 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 135232 ns | 132553 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 169232 ns | 183902 ns | 0.92 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1928583 ns | 1924833.5 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1925875 ns | 1932167 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1895041.5 ns | 1926479 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1745875 ns | 1925542 ns | 0.91 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 664669 ns | 662628 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1220022.5 ns | 1065881 ns | 1.14 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18583 ns | 17958 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 18792 ns | 18625 ns | 1.01 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 22250 ns | 20812 ns | 1.07 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18250 ns | 19584 ns | 0.93 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 107671 ns | 104706.5 ns | 1.03 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 77341 ns | 81176 ns | 0.95 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216667 ns | 217417 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216667 ns | 265209 ns | 0.82 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 217812.5 ns | 222291 ns | 0.98 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 227125 ns | 222917 ns | 1.02 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 497386 ns | 497576 ns | 1.00 |
| groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 470184 ns | 466715 ns | 1.01 |
| batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 26145.5 ns | 24687 ns | 1.06 |
| batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 28562 ns | 29083 ns | 0.98 |
| batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 26792 ns | 27250 ns | 0.98 |
| batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 1458 ns | 1417 ns | 1.03 |
| batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16337 ns | 16449.5 ns | 0.99 |
| batchedmm(16, Bsize=4)/forward/GPU/AMDGPU | 86810 ns | 80571 ns | 1.08 |
| batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 4875 ns | 4729.5 ns | 1.03 |
| batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 5104 ns | 5917 ns | 0.86 |
| batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 5333 ns | 5459 ns | 0.98 |
| batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 4833 ns | 4875 ns | 0.99 |
| batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 203656 ns | 201398 ns | 1.01 |
| batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU | 391324 ns | 373024 ns | 1.05 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 222125 ns | 223084 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 222583 ns | 223479.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 226333 ns | 225458.5 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 223333 ns | 222541 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 222346 ns | 220423 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 273793 ns | 274373 ns | 1.00 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 500833 ns | 497687.5 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 504334 ns | 497958 ns | 1.01 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 498167 ns | 501646 ns | 0.99 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 497542 ns | 507125 ns | 0.98 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1053089 ns | 1033721 ns | 1.02 |
| layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 851353.5 ns | 858214 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 20667 ns | 20625 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 20313 ns | 22500 ns | 0.90 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23083 ns | 21791 ns | 1.06 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20000 ns | 20042 ns | 1.00 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 113758.5 ns | 112240 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 79011 ns | 77390 ns | 1.02 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 213084 ns | 213084 ns | 1 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 213541 ns | 218104.5 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214291 ns | 219292 ns | 0.98 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215500 ns | 217125 ns | 0.99 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 724087 ns | 716111 ns | 1.01 |
| groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 538870.5 ns | 532795 ns | 1.01 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6666 ns | 6708 ns | 0.99 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6666.5 ns | 7416 ns | 0.90 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 9125 ns | 8166 ns | 1.12 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6584 ns | 6791 ns | 0.97 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 134050 ns | 133925.5 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 67330 ns | 65140 ns | 1.03 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10875 ns | 9709 ns | 1.12 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10603.5 ns | 12458 ns | 0.85 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10584 ns | 11125 ns | 0.95 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10750 ns | 10583 ns | 1.02 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 782883 ns | 779907 ns | 1.00 |
| layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 386274 ns | 379434 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5000 ns | 7250 ns | 0.69 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4625 ns | 5250 ns | 0.88 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6541 ns | 6834 ns | 0.96 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6375 ns | 4917 ns | 1.30 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 136660 ns | 135559.5 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 58460 ns | 56400 ns | 1.04 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7667 ns | 7542 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7916.5 ns | 7792 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7750 ns | 7875 ns | 0.98 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7750 ns | 7625 ns | 1.02 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 747431 ns | 742169 ns | 1.01 |
| layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 392653 ns | 389854 ns | 1.01 |
| batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 14573000 ns | 14503334 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 7702333.5 ns | 7723249.5 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 7661229.5 ns | 7705416.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 27919750 ns | 27810125 ns | 1.00 |
| batchedmm(128, Bsize=512)/forward/GPU/CUDA | 552572 ns | 535378 ns | 1.03 |
| batchedmm(128, Bsize=512)/forward/GPU/AMDGPU | 402049 ns | 390439 ns | 1.03 |
| batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 46551750 ns | 46519500 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 26549208 ns | 26614709 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 26263166.5 ns | 26530062.5 ns | 0.99 |
| batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 85671542 ns | 85657500 ns | 1.00 |
| batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 3391019 ns | 2847450.5 ns | 1.19 |
| batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU | 3300103 ns | 3284834 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 67042 ns | 68958 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 67375 ns | 69084 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 70583 ns | 68500 ns | 1.03 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 68291 ns | 68166 ns | 1.00 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 103426.5 ns | 104098 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 229352.5 ns | 232172 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 468625 ns | 480417 ns | 0.98 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 497666.5 ns | 475791 ns | 1.05 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 469292 ns | 474812.5 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 468500 ns | 481041.5 ns | 0.97 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 709808.5 ns | 714971 ns | 0.99 |
| groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 786728 ns | 793828 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 542 ns | 583 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 750 ns | 0.78 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 584 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 625 ns | 0.93 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 32664 ns | 32749 ns | 1.00 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 47181 ns | 49671 ns | 0.95 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8833 ns | 9875 ns | 0.89 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9750 ns | 9875 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9708 ns | 9375 ns | 1.04 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9792 ns | 9208 ns | 1.06 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 281049 ns | 282467 ns | 0.99 |
| batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 373464 ns | 373314 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9666 ns | 9708 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 9708 ns | 9708 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 9625 ns | 9625 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 9666 ns | 9666 ns | 1 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23531 ns | 23485 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 211602 ns | 211472 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 50250 ns | 50208 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 50250 ns | 50042 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 50125 ns | 50709 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 50167 ns | 50209 ns | 1.00 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 276186.5 ns | 277646 ns | 0.99 |
| dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 603776 ns | 614117 ns | 0.98 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 54916 ns | 55291 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 54333 ns | 54458 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 54292 ns | 54334 ns | 1.00 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 56125 ns | 56458 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 28315 ns | 28038.5 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 204202 ns | 206412 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 515312.5 ns | 479020.5 ns | 1.08 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 495208 ns | 525042 ns | 0.94 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 494875 ns | 499937 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 465271 ns | 462667 ns | 1.01 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 238356 ns | 240355 ns | 0.99 |
| batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 843049 ns | 838988 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 657146 ns | 609500 ns | 1.08 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 678750 ns | 661417 ns | 1.03 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 625021 ns | 659375 ns | 0.95 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 649917 ns | 653812.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 189901 ns | 192690.5 ns | 0.99 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 230582 ns | 262482 ns | 0.88 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2239292 ns | 2226104 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2249895.5 ns | 2247458 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2176354.5 ns | 2238104 ns | 0.97 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2265625 ns | 2244458.5 ns | 1.01 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 926422 ns | 927304 ns | 1.00 |
| layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1211101.5 ns | 1364114 ns | 0.89 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 21083 ns | 20208 ns | 1.04 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22187.5 ns | 22354.5 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 23666 ns | 22167 ns | 1.07 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 19959 ns | 19375 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 112183.5 ns | 109169 ns | 1.03 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 81261 ns | 77150.5 ns | 1.05 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 254333 ns | 222958 ns | 1.14 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220666 ns | 220604.5 ns | 1.00 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220750 ns | 227521 ns | 0.97 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 226708 ns | 225417 ns | 1.01 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 705957 ns | 712641 ns | 0.99 |
| groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 548680 ns | 558770.5 ns | 0.98 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 500 ns | 542 ns | 0.92 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 583 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 584 ns | 1.00 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23346 ns | 23081 ns | 1.01 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 47671 ns | 48321 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9500 ns | 9208.5 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9917 ns | 9250 ns | 1.07 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9959 ns | 10666 ns | 0.93 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 10083 ns | 9791.5 ns | 1.03 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 260912 ns | 263338 ns | 0.99 |
| batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 400874 ns | 399114 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 10500 ns | 10500 ns | 1 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 8895.5 ns | 8770.5 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 11625 ns | 10499.5 ns | 1.11 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 8750 ns | 10083 ns | 0.87 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 116855 ns | 115864 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU | 67861 ns | 68530 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7687.5 ns | 7917 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8000 ns | 7750 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7875 ns | 8125 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7812.5 ns | 7875 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 481589 ns | 487126 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 324483 ns | 322433 ns | 1.01 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1666 ns | 1708 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2042 ns | 1667 ns | 1.22 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2104.5 ns | 2125 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1459 ns | 1541 ns | 0.95 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 19805 ns | 19744 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 190981 ns | 191542 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 3520.5 ns | 3584 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 3792 ns | 3708.5 ns | 1.02 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 3854.5 ns | 3937.5 ns | 0.98 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 3583 ns | 3625 ns | 0.99 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 211153.5 ns | 212174.5 ns | 1.00 |
| bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 578046 ns | 580786 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 147645.5 ns | 147562.5 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 106542 ns | 106562 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 106708.5 ns | 107333 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 225875 ns | 225583 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 23334 ns | 23301 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU | 35995.5 ns | 34030 ns | 1.06 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 144708 ns | 160417 ns | 0.90 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 104000 ns | 87959 ns | 1.18 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 87625 ns | 100250 ns | 0.87 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 252562.5 ns | 252167 ns | 1.00 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 210178 ns | 211748 ns | 0.99 |
| bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU | 230212 ns | 214182 ns | 1.07 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7125 ns | 7291 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5375 ns | 5333 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5292 ns | 5250 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10250 ns | 10417 ns | 0.98 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33945.5 ns | 33560.5 ns | 1.01 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 49690 ns | 50310 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219375 ns | 253958.5 ns | 0.86 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 260458 ns | 253021.5 ns | 1.03 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 228500.5 ns | 235708 ns | 0.97 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 222499.5 ns | 212792 ns | 1.05 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 257172 ns | 260417 ns | 0.99 |
| batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 523825 ns | 524496 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 13625 ns | 12375 ns | 1.10 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 13479 ns | 12583 ns | 1.07 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 15125 ns | 13896 ns | 1.09 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 13333 ns | 12792 ns | 1.04 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 132277 ns | 134512.5 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU | 234872 ns | 235902 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24084 ns | 23959 ns | 1.01 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 23645.5 ns | 24479.5 ns | 0.97 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24708.5 ns | 25291 ns | 0.98 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24459 ns | 24583 ns | 0.99 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 830067.5 ns | 831522 ns | 1.00 |
| layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 681347 ns | 684542 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 9792 ns | 9708 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 10063 ns | 9917 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 11375 ns | 11625 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 9291.5 ns | 9209 ns | 1.01 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 120374.5 ns | 120339 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU | 73601 ns | 72241 ns | 1.02 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14541 ns | 13750 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14813 ns | 14187.5 ns | 1.04 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14812.5 ns | 15083.5 ns | 0.98 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14875 ns | 14084 ns | 1.06 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 637361.5 ns | 638601 ns | 1.00 |
| groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 368293 ns | 363914 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10333 ns | 9208.5 ns | 1.12 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9687.5 ns | 10000.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12041.5 ns | 11166 ns | 1.08 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10125.5 ns | 10167 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 119012 ns | 118694 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU | 73051 ns | 72320 ns | 1.01 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 12792 ns | 13208.5 ns | 0.97 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13395.5 ns | 13020.5 ns | 1.03 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13375 ns | 13396 ns | 1.00 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 13166 ns | 12292 ns | 1.07 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 525610 ns | 529419 ns | 0.99 |
| groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU | 342408 ns | 342414 ns | 1.00 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 31416.5 ns | 30416.5 ns | 1.03
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 32520.5 ns | 33666.5 ns | 0.97
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 28917 ns | 30542 ns | 0.95
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 2167 ns | 1917 ns | 1.13
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16642 ns | 16576 ns | 1.00
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU | 78711 ns | 77461 ns | 1.02
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 5583.5 ns | 5291.5 ns | 1.06
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 4958 ns | 4896 ns | 1.01
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 5250 ns | 5291.5 ns | 0.99
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 6584 ns | 6417 ns | 1.03
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 137549 ns | 137601 ns | 1.00
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU | 383954 ns | 379919 ns | 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 375 ns | 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 334 ns | 1.12
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 24843 ns | 24898 ns | 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 48221 ns | 49280 ns | 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 6750 ns | 0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6708.5 ns | 6500 ns | 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6916.5 ns | 6916.5 ns | 1
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6875 ns | 6667 ns | 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 183051 ns | 184245 ns | 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 391009 ns | 386844 ns | 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 1958 ns | 2125 ns | 0.92
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 2042 ns | 2167 ns | 0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 2084 ns | 2084 ns | 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 2041 ns | 2083 ns | 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 25908 ns | 25661 ns | 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU | 207502 ns | 208752 ns | 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17333.5 ns | 17250 ns | 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17333 ns | 17292 ns | 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 17625 ns | 18584 ns | 0.95
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 18000 ns | 18416.5 ns | 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 266084 ns | 269097.5 ns | 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU | 691847 ns | 693937 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 153459 ns | 150875 ns | 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 175583.5 ns | 177416.5 ns | 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 150250 ns | 153625 ns | 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 150417 ns | 157791 ns | 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 192072 ns | 191062 ns | 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 176432 ns | 174992 ns | 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1193541 ns | 1338521 ns | 0.89
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1327291.5 ns | 1328479 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1298166.5 ns | 1328250 ns | 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1330166.5 ns | 1330083.5 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 864717 ns | 866603 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1114311 ns | 1114201.5 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25604.5 ns | 26208.5 ns | 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25333 ns | 29479.5 ns | 0.86
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 28625 ns | 27062.5 ns | 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25541 ns | 24833 ns | 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 232128 ns | 228889.5 ns | 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU | 115071 ns | 116211 ns | 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 118791.5 ns | 117584 ns | 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 126708 ns | 140791 ns | 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 118625 ns | 126021 ns | 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 117979 ns | 119916.5 ns | 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 994805 ns | 992184 ns | 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU | 588415.5 ns | 594546 ns | 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 250 ns | 334 ns | 0.75
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 334 ns | 334 ns | 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 375 ns | 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23227 ns | 23038 ns | 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 46150 ns | 49341 ns | 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6417 ns | 6833 ns | 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6750 ns | 6604 ns | 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6958 ns | 7042 ns | 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6750 ns | 6791 ns | 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 199656 ns | 200303.5 ns | 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 393763.5 ns | 388994 ns | 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6250 ns | 6375 ns | 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6500 ns | 5875 ns | 1.11
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7291.5 ns | 7812.5 ns | 0.93
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5291 ns | 6458 ns | 0.82
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 137884.5 ns | 139406.5 ns | 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 233922 ns | 235513 ns | 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10104.5 ns | 10083.5 ns | 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10125 ns | 10167 ns | 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10562.5 ns | 10417 ns | 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10250 ns | 9959 ns | 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 853228 ns | 853447 ns | 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 672507 ns | 676147 ns | 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 708 ns | 750 ns | 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 708 ns | 750 ns | 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 750 ns | 667 ns | 1.12
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 708 ns | 750 ns | 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 22896 ns | 23007 ns | 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU | 209942 ns | 209722 ns | 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4834 ns | 4958 ns | 0.97
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5042 ns | 5000 ns | 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5125 ns | 5125 ns | 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4834 ns | 4917 ns | 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 220625.5 ns | 221201.5 ns | 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU | 580650 ns | 585401 ns | 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 8750 ns | 8708 ns | 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 8708 ns | 8833.5 ns | 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 10395.5 ns | 9812.5 ns | 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 8167 ns | 8625 ns | 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 118921.5 ns | 118248.5 ns | 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU | 71421 ns | 71271 ns | 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8292 ns | 8959 ns | 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8791 ns | 9041.5 ns | 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8958 ns | 9333.5 ns | 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8916 ns | 8687.5 ns | 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 567449 ns | 566922 ns | 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 346934 ns | 343484 ns | 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 125791.5 ns | 126584 ns | 0.99
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 96000 ns | 96271 ns | 1.00
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 96187.5 ns | 96479.5 ns | 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 181542 ns | 183375 ns | 0.99
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 46439 ns | 46672 ns | 1.00
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU | 93231 ns | 99821 ns | 0.93
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 302834 ns | 330333 ns | 0.92
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 166542 ns | 166292 ns | 1.00
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 166917 ns | 170250 ns | 0.98
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 567708 ns | 572041.5 ns | 0.99
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 186141 ns | 187343 ns | 0.99
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU | 466525 ns | 487975 ns | 0.96
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 398250 ns | 398958 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 215167 ns | 215334 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 214291 ns | 215041 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 756250 ns | 753500 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43722 ns | 43980 ns | 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU | 80301 ns | 81451 ns | 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 1402813 ns | 1401520.5 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 862208 ns | 862917 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 854333 ns | 861417 ns | 0.99
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 2359583.5 ns | 2361042 ns | 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 247149 ns | 253211 ns | 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU | 350254 ns | 349378.5 ns | 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 657333 ns | 651917 ns | 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 621958.5 ns | 658334 ns | 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 628854 ns | 662479 ns | 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 542146 ns | 579395.5 ns | 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 185394 ns | 189789 ns | 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 258293 ns | 261218 ns | 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2469895.5 ns | 2487416 ns | 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2491916.5 ns | 2468708 ns | 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2389875 ns | 2451333 ns | 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2478250 ns | 2415666 ns | 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 934339.5 ns | 951768.5 ns | 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1448647.5 ns | 1454255 ns | 1.00
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 34271 ns | 33000 ns | 1.04
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 34250.5 ns | 36083.5 ns | 0.95
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 32312.5 ns | 32167 ns | 1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 916.5 ns | 1041.5 ns | 0.88
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16189.5 ns | 16094 ns | 1.01
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU | 71551 ns | 77491 ns | 0.92
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 3166.5 ns | 3187 ns | 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 3437.5 ns | 3208 ns | 1.07
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 3541 ns | 3417 ns | 1.04
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 3125 ns | 3209 ns | 0.97
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 134833 ns | 136515 ns | 0.99
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU | 339494 ns | 349978 ns | 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 437000 ns | 437166.5 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 432458 ns | 433083 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 432833 ns | 434750 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 449416 ns | 449916 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 42351 ns | 42836 ns | 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 238133 ns | 238823 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4152625 ns | 4154959 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4271667 ns | 4268667 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4252417 ns | 4254625 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4062020.5 ns | 4048000 ns | 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 231247 ns | 236422 ns | 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1229715 ns | 1232498 ns | 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3875 ns | 3959 ns | 0.98
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3917 ns | 3958 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3875 ns | 3916 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3916 ns | 3917 ns | 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 34451.5 ns | 34298 ns | 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU | 38680 ns | 40891 ns | 0.95
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15458 ns | 15583 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15708 ns | 15666 ns | 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15625 ns | 15708 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15459 ns | 15459 ns | 1
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 252640 ns | 255323 ns | 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU | 169682 ns | 170142 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 403417 ns | 403708 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 221209 ns | 221167 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 220042 ns | 220959 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 760791 ns | 756709 ns | 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113133 ns | 113380 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU | 87381 ns | 89671 ns | 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1431749.5 ns | 1430083 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 886583 ns | 886645.5 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 881812.5 ns | 879208.5 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2383750 ns | 2383084 ns | 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 229435.5 ns | 238474 ns | 0.96
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 350874 ns | 354939 ns | 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 459 ns | 625 ns | 0.73
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 583 ns | 584 ns | 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 625 ns | 625 ns | 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 584 ns | 584 ns | 1
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 24713 ns | 24737 ns | 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU | 207622 ns | 210152 ns | 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 7458.5 ns | 8042 ns | 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8041.5 ns | 7750 ns | 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8292 ns | 8020.5 ns | 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7792 ns | 8084 ns | 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 202392.5 ns | 206918.5 ns | 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU | 689378 ns | 691747 ns | 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 833145.5 ns | 829437 ns | 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 466667 ns | 466125 ns | 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 467771 ns | 467854 ns | 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 1542833 ns | 1548750 ns | 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 130433 ns | 130261 ns | 1.00
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU | 166542 ns | 166677 ns | 1.00
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 2696000 ns | 2692000 ns | 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1539437.5 ns | 1529979 ns | 1.01
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1533500 ns | 1534291.5 ns | 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 4930000 ns | 4940020.5 ns | 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 233723 ns | 232798.5 ns | 1.00
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU | 771469 ns | 770132.5 ns | 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 292 ns | 375 ns | 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 375 ns | 375 ns | 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 31721 ns | 32356 ns | 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU | 48111 ns | 48991 ns | 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6312.5 ns | 6583 ns | 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6812.5 ns | 6625 ns | 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6875 ns | 6708 ns | 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6500 ns | 6625 ns | 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 217171.5 ns | 227984 ns | 0.95
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU | 362335 ns | 356278.5 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1777250 ns | 1758084 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1758812.5 ns | 1756792 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1730917 ns | 1737458 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1776250 ns | 1733750 ns | 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 184219 ns | 188495 ns | 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 354280 ns | 357369 ns | 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4352917 ns | 4372937 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4382542 ns | 4370667 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4351834 ns | 4369375 ns | 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4391416 ns | 4362583.5 ns | 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 837734 ns | 853700 ns | 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1247440 ns | 1252878 ns | 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6771 ns | 6792 ns | 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 7937.5 ns | 7209 ns | 1.10
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7333 ns | 7333 ns | 1
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6687.5 ns | 7312.5 ns | 0.91
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 22420 ns | 22968 ns | 0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU | 36840.5 ns | 37681 ns | 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 45312.5 ns | 48354 ns | 0.94
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 48146 ns | 69083 ns | 0.70
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 33917 ns | 33542 ns | 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 52729.5 ns | 44979 ns | 1.17
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 206304 ns | 210612 ns | 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 232673 ns | 235022 ns | 0.99
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 22146 ns | 21334 ns | 1.04
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 23896 ns | 24750 ns | 0.97
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 22417 ns | 22583.5 ns | 0.99
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 5334 ns | 5417 ns | 0.98
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18024 ns | 18352 ns | 0.98
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU | 83860.5 ns | 90001 ns | 0.93
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 12000 ns | 12187 ns | 0.98
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 9437.5 ns | 9250 ns | 1.02
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 9583 ns | 9625 ns | 1.00
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 18250 ns | 18375 ns | 0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 218264 ns | 219960 ns | 0.99
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU | 367444 ns | 383514 ns | 0.96
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 406417 ns | 407000 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 223333 ns | 223500 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 222292 ns | 223250 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 762750 ns | 762333 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 46291 ns | 47174.5 ns | 0.98
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU | 88691 ns | 90560 ns | 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 1428625 ns | 1429042 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 892375 ns | 893625 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 886833 ns | 893041 ns | 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 2386333 ns | 2387667 ns | 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 279641 ns | 278164 ns | 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU | 379995 ns | 378859 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 436833 ns | 435708 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 432708 ns | 431625 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 429500 ns | 432333 ns | 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 449500 ns | 450291 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 52933 ns | 54012 ns | 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 235598 ns | 238112 ns | 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4147167 ns | 4144125 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4260354 ns | 4245667 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4227333 ns | 4258583 ns | 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4030354.5 ns | 4033625 ns | 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 252356.5 ns | 257888 ns | 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1204784 ns | 1222232 ns | 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 9583 ns | 9459 ns | 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 7292 ns | 7250 ns | 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 7250 ns | 7250 ns | 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 13500 ns | 13458 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 23984 ns | 24527 ns | 0.98
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU | 212683 ns | 211892 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 49416 ns | 49500 ns | 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 49459 ns | 49708 ns | 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 49167 ns | 49417 ns | 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 49625 ns | 49208.5 ns | 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 333606 ns | 339671 ns | 0.98
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU | 652008 ns | 654987 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 106875 ns | 125000 ns | 0.85
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 113729 ns | 89417 ns | 1.27
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 88666 ns | 86583 ns | 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 89666.5 ns | 120666.5 ns | 0.74
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191172 ns | 191941.5 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU | 200642 ns | 200372 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2027750.5 ns | 2022250 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2023896 ns | 2017666.5 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1986666 ns | 2024042 ns | 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2015667 ns | 2020812.5 ns | 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 507573.5 ns | 516999 ns | 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU | 1086742.5 ns | 1090611 ns | 1.00
This comment was automatically generated by workflow using github-action-benchmark.
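
Each benchmark entry in this comment lists two timings in nanoseconds followed by a ratio, and the listed ratios are consistent with the first timing divided by the second (for example, 113729 / 89417 is about 1.27). The following is a minimal Python sketch of that arithmetic, assuming the first timing is the current run and the second is the previous one. `BenchRow`, `slower_than`, and the 1.15 cutoff are hypothetical names and values chosen for illustration; they are not part of github-action-benchmark or LuxLib.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchRow:
    """One benchmark entry: a name plus two timings in nanoseconds."""
    name: str
    current_ns: float   # assumed: first timing listed for the entry
    previous_ns: float  # assumed: second timing listed for the entry

    @property
    def ratio(self) -> float:
        # Values above 1 mean the current run took longer than the previous one.
        return self.current_ns / self.previous_ns


def slower_than(rows, threshold):
    """Return the rows whose current/previous ratio exceeds `threshold`."""
    return [row for row in rows if row.ratio > threshold]


# Three entries copied from the results above.
ROWS = [
    BenchRow("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s)",
             113729, 89417),
    BenchRow("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)",
             48146, 69083),
    BenchRow("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)",
             52729.5, 44979),
]

if __name__ == "__main__":
    # 1.15 is an arbitrary illustrative cutoff, not a setting taken from the workflow.
    for row in slower_than(ROWS, 1.15):
        print(f"{row.ratio:.2f}  {row.name}")
```

Many of the timings here are sub-microsecond or a few microseconds, so ratios such as 0.73 or 1.27 on small CPU benchmarks may reflect run-to-run noise rather than a real change; any cutoff used this way is a judgment call.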