This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

fix: task switching in AMDGPU complex batched_matmul #178

Merged on Oct 25, 2024 (4 commits)


avik-pal
Member

No description provided.

Contributor

github-actions bot left a comment


LuxLib Benchmarks
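
A note on reading the table that follows: the Ratio column appears to be the Current time divided by the Previous time, rounded to two decimals, so a value below 1 means the current commit (0766885) ran faster than the previous one (98a2d7a). This is inferred from the rows themselves (for example 6000 ns / 6417 ns ≈ 0.94), not from the benchmark tool's documentation. A minimal Python sketch of that arithmetic, where the helper names `parse_ns` and `ratio` are hypothetical and not part of LuxLib or the benchmark action:

```python
def parse_ns(cell: str) -> float:
    """Convert a table cell such as '6000 ns' to a number of nanoseconds."""
    value, unit = cell.split()
    if unit != "ns":
        raise ValueError(f"unexpected unit: {unit!r}")
    return float(value)


def ratio(current: str, previous: str) -> float:
    """Current / Previous, rounded to two decimals (assumed convention)."""
    return round(parse_ns(current) / parse_ns(previous), 2)


# Rows taken from the table: (current, previous, reported ratio)
rows = [
    ("6000 ns", "6417 ns", 0.94),      # layernorm forward, CPU, 2 threads
    ("436641 ns", "637131 ns", 0.69),  # layernorm forward, AMDGPU
    ("2875 ns", "1666 ns", 1.73),      # bias_activation forward, CPU, 8 threads
]

for current, previous, reported in rows:
    print(current, previous, ratio(current, previous), reported)
```

Rows with no Previous value (the Metal entries) have no ratio to compute, which is consistent with their empty Ratio column.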

Benchmark suite Current: 0766885 Previous: 98a2d7a Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6000 ns 6417 ns 0.94
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6541 ns 6041 ns 1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7875 ns 7167 ns 1.10
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5333 ns 5292 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 108617 ns 103542 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 809916 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 436641 ns 637131 ns 0.69
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9875 ns 10166.5 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10292 ns 9958 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9979.5 ns 10291.5 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9771 ns 9979.5 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 535818 ns 494284 ns 1.08
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 6627750 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 664425 ns 719725 ns 0.92
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1458.5 ns 1583 ns 0.92
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1458 ns 1542 ns 0.95
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 2875 ns 1666 ns 1.73
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1416 ns 1500 ns 0.94
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 19736 ns 20684 ns 0.95
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 456250 ns
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 29621 ns 33302 ns 0.89
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3541.5 ns 3812.5 ns 0.93
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4291 ns 4125 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4292 ns 4250 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3687.5 ns 4334 ns 0.85
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 132304.5 ns 134278.5 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 2272937.5 ns
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 146734 ns 143062.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57750 ns 58000 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38916 ns 46417 ns 0.84
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46667 ns 46875 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 79291 ns 83750 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36853 ns 37449 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1095208 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 80626.5 ns 70883 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2038334 ns 2037500 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2088854.5 ns 2083416.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2090437 ns 2090916.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1972417 ns 1996979.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 218514 ns 220080 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6593333 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1269408 ns 1213928 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 147292 ns 173708 ns 0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144833 ns 146625 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 175521 ns 165062.5 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 155375 ns 172000 ns 0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 165892 ns 167869.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1634083.5 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 170913.5 ns 196051.5 ns 0.87
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1107041.5 ns 1113854.5 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1135563 ns 1110541 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1114729 ns 1118667 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1106583.5 ns 1124479.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 619139 ns 644177 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7608750 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1017151.5 ns 899376 ns 1.13
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4375 ns 5333 ns 0.82
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5041 ns 4875 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6645.5 ns 6750 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4125 ns 4416 ns 0.93
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 79693 ns 83066 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 1295145.5 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 61251 ns 64020 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8875 ns 8584 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8542 ns 8750 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8833 ns 8875 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8459 ns 8584 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 545788 ns 552192.5 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 7756917 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 384358 ns 372446 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 16666.5 ns 17229.5 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17708 ns 17250 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22041 ns 21542 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17103.5 ns 17208.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 62465 ns 63166 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1325541.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78722 ns 79573.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224000 ns 220583 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 214083 ns 218875 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 217291 ns 223125 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213125 ns 219625 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 324107 ns 329089 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5606125 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 466754 ns 423777 ns 1.10
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 583 ns 1.07
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 708 ns 625 ns 1.13
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 875 ns 833 ns 1.05
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 834 ns 0.75
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 18908 ns 19066 ns 0.99
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 417770.5 ns
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 30771 ns 27311 ns 1.13
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1416 ns 1417 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458 ns 1417 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1583 ns 1583 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1375 ns 1375 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 115606.5 ns 116071.5 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 2144521 ns
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 125132 ns 118732 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7417 ns 7375 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5333 ns 6000 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 6083 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10334 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23349 ns 24482 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 859459 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47121 ns 52122 ns 0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 230667 ns 229541.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 237792 ns 268417 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 233312.5 ns 241500 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 223000 ns 251250 ns 0.89
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 186325 ns 189293 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9066875.5 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 613087.5 ns 588480 ns 1.04
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3958 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 4042 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 22894 ns 23660.5 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 445458 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48301 ns 43502 ns 1.11
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16792 ns 16833 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16709 ns 16834 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17042 ns 16959 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 17042 ns 16666 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 184640 ns 188039 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 2172250 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 174143 ns 166010.5 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 922250 ns 929291 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 763083 ns 838708 ns 0.91
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 831458.5 ns 841584 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 1257625 ns 1269208 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113637 ns 113941 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 481167 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 244135 ns 396441 ns 0.62
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2604333.5 ns 2610729.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2062625 ns 2330541.5 ns 0.89
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2329458 ns 2324458 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3564084 ns 3478334 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 229247 ns 232093 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 2180333 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 742369.5 ns 630643.5 ns 1.18
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5458 ns 6000 ns 0.91
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7167 ns 7042 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8416.5 ns 7333.5 ns 1.15
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5583 ns 6584 ns 0.85
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 83621 ns 82915 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 1175958.5 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 59646.5 ns 62131.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11062.5 ns 11875 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11791 ns 11417 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11729.5 ns 12417 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11645.5 ns 9813 ns 1.19
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 589604 ns 585345.5 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 7601854 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 410418 ns 388046 ns 1.06
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 500 ns 542 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23217 ns 23179.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 433917 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 48601 ns 41949 ns 1.16
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2083 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2208 ns 2250 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2167 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2083 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 230692.5 ns 226220 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 2467084 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 181643 ns 166171 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8667 ns 8583 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9666 ns 8542 ns 1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10937.5 ns 10709 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9000 ns 8833 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 102306 ns 100758 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 1206104 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 72821 ns 72575 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17646 ns 17228.5 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18583.5 ns 18583 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18542 ns 18500 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17417 ns 17750 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 559012 ns 582511 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5618604 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 381427 ns 371318.5 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 459 ns 1.18
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 541 ns 625 ns 0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 34179 ns 34079 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 653750 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 49111 ns 44423 ns 1.11
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9604 ns 9479 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9291 ns 9750 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10020.5 ns 10333 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9978.5 ns 9562.5 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 249028 ns 262881 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5697458 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 365996.5 ns 351422 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396833 ns 396583 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 213375 ns 288042 ns 0.74
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288208 ns 287666 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 755542 ns 756167 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111983 ns 112987 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 513500 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 77051.5 ns 77780.5 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1463417 ns 1455709 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 854959 ns 1130291 ns 0.76
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1132083 ns 1133250 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2481584 ns 2358000 ns 1.05
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 198784.5 ns 202802 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1708563 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 324536 ns 268682 ns 1.21
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7312.5 ns 7354.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7833.5 ns 8000 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8437 ns 8687.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6917 ns 7750 ns 0.89
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 129346 ns 137305 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 1162583 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 58251 ns 64461 ns 0.90
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13500 ns 12812.5 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15604 ns 15041.5 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14791.5 ns 15353.5 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13979.5 ns 12333.5 ns 1.13
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 849836 ns 906003 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 7891354 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 422317 ns 413373 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24833 ns 26000 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 26916.5 ns 27562.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28313 ns 27042 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23958.5 ns 26021 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 185469.5 ns 186382.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1644917 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 100376.5 ns 146484 ns 0.69
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 143417 ns 146500 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 154042 ns 157750 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 149042 ns 129416 ns 1.15
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 151459 ns 155812.5 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1011255 ns 1016426 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8142042 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 522889 ns 551090 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76416 ns 84667 ns 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 85000 ns 80167 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 77958 ns 78063 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 85500 ns 80521 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 190193.5 ns 190829 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1487542 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 125082.5 ns 124858.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 295958.5 ns 219479 ns 1.35
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 290084 ns 281750 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 309208 ns 278146 ns 1.11
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 274062.5 ns 320791.5 ns 0.85
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1039232 ns 1021778 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9001333 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 692376 ns 643542 ns 1.08
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12417 ns 13125 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 14083 ns 13666.5 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 15333.5 ns 14041.5 ns 1.09
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 12542 ns 13459 ns 0.93
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 136592 ns 136741.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 1137437 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 234694 ns 226473 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 24292 ns 27083.5 ns 0.90
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26875 ns 26125 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28020.5 ns 27833.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 24416.5 ns 26604.5 ns 0.92
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 907722.5 ns 919419 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 7852375 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 692131.5 ns 633979.5 ns 1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 14167 ns 14000 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 15041.5 ns 14708.5 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 17166 ns 17583.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 13833 ns 14792 ns 0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 118944.5 ns 119245 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 1213062.5 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 238604 ns 233827 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25458 ns 26875 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 27208 ns 25958.5 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26604.5 ns 26583 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26417 ns 26541 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 664219 ns 676576 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5824834 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 677391 ns 589361.5 ns 1.15
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 182000 ns 182375 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 183667 ns 183208 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 186583 ns 185583 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 181542 ns 183459 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 101699.5 ns 102955 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1332208 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 234523 ns 232900.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 585708 ns 583500 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 591417 ns 595083 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 597812.5 ns 597520.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 592625 ns 624167 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 490131.5 ns 493717.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5953104 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 713921 ns 657463 ns 1.09
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6625 ns 6750 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8083.5 ns 7645.5 ns 1.06
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9166.5 ns 8167 ns 1.12
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6396 ns 7542 ns 0.85
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 137141.5 ns 135360 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 1158916 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 59311 ns 62767 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14375 ns 15375 ns 0.93
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14417 ns 14917 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15937.5 ns 16187.5 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13729 ns 15292 ns 0.90
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 875798.5 ns 885601 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 7574750 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 403086 ns 392428 ns 1.03
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6147166.5 ns 6153416.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 3224312.5 ns 6381624.5 ns 0.51
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6368937.5 ns 6371521 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11912208 ns 11926500 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 347269 ns 346494 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/Metal 1592791 ns
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 303595 ns 392843 ns 0.77
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19092083 ns 19117208.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 11115167 ns 19977084 ns 0.56
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19976125 ns 19957021 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36699999.5 ns 36558729 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1027305 ns 1005649 ns 1.02
batchedmm(512, Bsize=4)/zygote/GPU/Metal 7852208 ns
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1169973.5 ns 1105996 ns 1.06
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1791 ns 1750 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1834 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1833 ns 1833 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23540 ns 23503 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 455583.5 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 209943 ns 197739 ns 1.06
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4834 ns 4834 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4917 ns 4958 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4916 ns 4917 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4833 ns 4916 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 269515.5 ns 276337.5 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2631791 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 625449 ns 502208 ns 1.25
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7916.5 ns 8062.5 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8646 ns 8416 ns 1.03
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9166 ns 9459 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7312.5 ns 8145.5 ns 0.90
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 116497.5 ns 115989 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 1187542 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 68391 ns 71584 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11479 ns 11562.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 13041 ns 12438 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12520.5 ns 12541 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10958.5 ns 12875 ns 0.85
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 599205 ns 604320 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5699500 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 356370.5 ns 353160 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22757.5 ns 22648 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 433916 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 47730 ns 43592 ns 1.09
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2875 ns 2917 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3042 ns 2917 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3167 ns 3041 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3125 ns 3000 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 194133 ns 197848 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 2121916 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 163362 ns 146363.5 ns 1.12
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14146 ns 14604 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 16000 ns 15458.5 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15812.5 ns 15896 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14479 ns 15000.5 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 117303 ns 117481 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 1151167 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 236073 ns 236802 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25624.5 ns 26500 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 26313 ns 25625 ns 1.03
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 26104.5 ns 26041.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25125 ns 25958 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 549722 ns 561217 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5157541.5 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 652604.5 ns 566814 ns 1.15
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4167 ns 4291 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4209 ns 4209 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4208 ns 4208 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4167 ns 4375 ns 0.95
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24489 ns 24363 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 448083.5 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 47790.5 ns 44754 ns 1.07
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16167 ns 16250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16042 ns 16125 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16167 ns 16292 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16083 ns 16416 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 317717 ns 321227 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 2428291.5 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 208453 ns 190786 ns 1.09
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 5916 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5792 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5833 ns 5792 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5834 ns 5750 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 34765 ns 34700.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 648041 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 206573 ns 200434 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22792 ns 22292 ns 1.02
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 20729.5 ns 21292 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 22084 ns 21792 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21292 ns 22208 ns 0.96
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 280273 ns 283315.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 6096104 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 688371 ns 598489 ns 1.15
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 60729 ns 59729 ns 1.02
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 60291 ns 64229 ns 0.94
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 67083 ns 66833 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 50958 ns 50958 ns 1
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66493 ns 66908 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/Metal 14948959 ns
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 100052 ns 115781 ns 0.86
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 203416.5 ns 198937.5 ns 1.02
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 138583 ns 144625 ns 0.96
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 159875 ns 167291.5 ns 0.96
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 223083 ns 303249.5 ns 0.74
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 209048 ns 208882.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/Metal 46390583 ns
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 588303.5 ns 529218 ns 1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 84459 ns 84291 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 80541.5 ns 83875 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 88708 ns 88125 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 90875 ns 81562.5 ns 1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192723 ns 193291 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2030916 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 194182.5 ns 182771 ns 1.06
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1931042 ns 1875250 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1931625 ns 1914792 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1917958 ns 1928375 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1918958 ns 1916625 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 506602.5 ns 505449 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9124854.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1073165 ns 857542 ns 1.25
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21451 ns 21535 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 498250 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 43020 ns 36788 ns 1.17
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1833 ns 1875 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1834 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 246531.5 ns 243998 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 2248666 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 184137.5 ns 166221 ns 1.11
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9124.5 ns 11229 ns 0.81
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 10375 ns 9791.5 ns 1.06
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11562.5 ns 11125 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8458 ns 10479.5 ns 0.81
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 114777.5 ns 114440.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 1126333 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 235903 ns 233386 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9520.5 ns 10458 ns 0.91
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 11104.5 ns 10250 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10167 ns 9917 ns 1.03
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9479.5 ns 10145.5 ns 0.93
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 488308 ns 491014 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5077500 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 636239 ns 561274 ns 1.13
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58209 ns 58375 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38375 ns 46917 ns 0.82
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46417 ns 46625 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81875 ns 83708 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38284 ns 38960 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1196146 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 78511 ns 72876 ns 1.08
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1889334 ns 1897625 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1945875 ns 1964750 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1975333 ns 1985854 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1897937 ns 1899833 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 210023.5 ns 212091 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11022958 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1013399.5 ns 994598 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 266958.5 ns 266354 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 269250 ns 269729 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 278292 ns 271041.5 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 266729.5 ns 268271 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 193472 ns 193629.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1544167 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 282794 ns 271156 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 668791.5 ns 693917 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 589292 ns 692541 ns 0.85
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 676917 ns 687708 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 671958 ns 593833 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 988709 ns 991006 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9169229 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 902732.5 ns 863163 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2174145.5 ns 2180687.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2220541 ns 2214917 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2196708.5 ns 2212041 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2206021 ns 2208479 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 160810.5 ns 154859 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1440791 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 406240 ns 451844.5 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5497458 ns 5453666 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5589291 ns 5518208 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5498062 ns 5522375 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5497542 ns 5522209 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 929759.5 ns 930442 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9921167 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1548081.5 ns 1495900 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 995208 ns 999875 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 838417 ns 913333 ns 0.92
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 904916 ns 912895.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 1326042 ns 1334562.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46239 ns 46425 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 578625 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 245133 ns 399125 ns 0.61
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2611792 ns 2620166 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2048166 ns 2328541 ns 0.88
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2326917 ns 2329395.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3610166 ns 3468667 ns 1.04
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 256032 ns 247327 ns 1.04
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2447708 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 771420 ns 658089 ns 1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57875 ns 58083 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38250 ns 46625 ns 0.82
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45875 ns 46542 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 77750 ns 84000 ns 0.93
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 27988 ns 29007 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1149062.5 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75421 ns 73392 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2022250 ns 2036000 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2104417 ns 2096916 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087000 ns 2092208 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2001250 ns 1992542 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 223076 ns 225482 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11110333 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1038854 ns 1028937.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58375 ns 58417 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 38666 ns 47208 ns 0.82
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47042 ns 47375 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 78292 ns 83541 ns 0.94
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48142 ns 48550 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1133500 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 61675.5 ns 71593.5 ns 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1925250.5 ns 1926354.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1984875 ns 1987291 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1968958 ns 1972375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1876250 ns 1890375 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 230053 ns 231977 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9828354.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 918152 ns 931260 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 291 ns 333 ns 0.87
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 34160 ns 33752 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 644270.5 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 48721 ns 44343 ns 1.10
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6875 ns 6542 ns 1.05
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6958 ns 7187.5 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7500 ns 7625 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7292 ns 6209 ns 1.17
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 200629 ns 203191.5 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5584375 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 364295 ns 350064 ns 1.04
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32001.5 ns 32755 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 377063 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 38271 ns 36558 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 3042 ns 3375 ns 0.90
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3250 ns 3333 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3291 ns 3000 ns 1.10
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2792 ns 3208 ns 0.87
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 181967 ns 185298.5 ns 0.98
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 1820916 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 159762 ns 144480 ns 1.11
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1446021 ns 1465479.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1409541 ns 1410667 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1415625 ns 1427770.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1408250 ns 1410417 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 134710 ns 136084 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2868875 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 322334 ns 354201 ns 0.91
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5013500 ns 5012687.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5036542 ns 5023959 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5026520.5 ns 5034167 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5021667 ns 5021667 ns 1
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 671717 ns 673868 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10332500 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1469159.5 ns 1145811 ns 1.28
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49838709 ns 49876625 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 25973958 ns 35509791 ns 0.73
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35497958 ns 35514916 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 97460875 ns 97103375 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1597509 ns 1608361 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/Metal 10641729.5 ns
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1049398.5 ns 1576726 ns 0.67
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154517833 ns 154443875 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 89364146 ns 112320833.5 ns 0.80
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112347166 ns 112445042 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 299472874.5 ns 296071750 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6480598 ns 6483041.5 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/Metal 77617584 ns
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5559482 ns 6222525 ns 0.89
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47541 ns 48042 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47875 ns 47667 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 48541.5 ns 47916 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 48333 ns 47583 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 19684.5 ns 19626 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 496750.5 ns
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 25931 ns 28463 ns 0.91
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50750 ns 50583.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 49958 ns 50167 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 51229.5 ns 51000 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50125 ns 50667 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 244616 ns 245482 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 2284458 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 147992 ns 140773 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 9083 ns 8667 ns 1.05
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 10020.5 ns 8750 ns 1.15
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10375 ns 11167 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8125 ns 9666.5 ns 0.84
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 117828 ns 118847 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 1194250 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 234703 ns 237489 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9875 ns 10791 ns 0.92
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11166.5 ns 10458 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10395.5 ns 10333 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10208 ns 10709 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 579910 ns 584310 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5757375 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 652159 ns 572469 ns 1.14
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8750 ns 9125 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9833.5 ns 9896 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 10583 ns 10667 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8334 ns 9292 ns 0.90
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 115390.5 ns 115727.5 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 1164979.5 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 69241 ns 73908 ns 0.94
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13625 ns 13874.5 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 15291.5 ns 13750 ns 1.11
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 15208 ns 14333 ns 1.06
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14937.5 ns 14375.5 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 553583 ns 559680.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5153750 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 343225 ns 337060 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 958 ns 959 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1042 ns 1042 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 958 ns 1083 ns 0.88
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 33679 ns 33675 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 640334 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 205133 ns 206546 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9583 ns 8917 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8750 ns 8437.5 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9166 ns 8791 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7958.5 ns 9250 ns 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 222940.5 ns 225862.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5834209 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 658098 ns 576667 ns 1.14
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23292 ns 23667 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23666 ns 23292 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 24666.5 ns 23813 ns 1.04
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23166.5 ns 23666 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 19737 ns 20529 ns 0.96
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 445812.5 ns
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 184932 ns 187811 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 53250.5 ns 53583.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 54333 ns 52145.5 ns 1.04
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 53459 ns 53584 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52604.5 ns 53667 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 258308 ns 260507 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 2423792 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 592777 ns 549086 ns 1.08
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1410542 ns 1444541.5 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1438875 ns 1445459 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1412000 ns 1414666.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1400812.5 ns 1401396 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194713 ns 195236 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2079417 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 345564 ns 321861 ns 1.07
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5016416 ns 5007208 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5028229 ns 5006958 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5008146 ns 5015812.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5011500.5 ns 5020500 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 508710 ns 510108 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9265145.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1202145 ns 1117899 ns 1.08
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 828840208 ns 828285625 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 413910521 ns 541921375 ns 0.76
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 539860417 ns 542359625 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1566139499.5 ns 1558200021 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22553762 ns 22535776.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/Metal 108020292 ns
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14557060 ns 12173703 ns 1.20
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3600174083 ns 3903695416 ns 0.92
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1495447875 ns 1771980416 ns 0.84
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1779739000 ns 1773568584 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 6017463208 ns 5228367459 ns 1.15
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118952088 ns 119027931 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/Metal 2572718125 ns
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88160332.5 ns 68450588 ns 1.29
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76500 ns 75916.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76874.5 ns 87437.5 ns 0.88
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 80396 ns 84417 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76250 ns 81083 ns 0.94
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 192054.5 ns 192111.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1498709 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 106821 ns 126607 ns 0.84
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 272396 ns 282646 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 294979 ns 283042 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 275666.5 ns 236875 ns 1.16
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 294020.5 ns 276458 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 990094.5 ns 995625 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8688959 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 628613 ns 612404 ns 1.03
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199338479 ns 199947208.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 103671062 ns 139420500 ns 0.74
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139137542 ns 138954958 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 391597500 ns 389188834 ns 1.01
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5816800 ns 5832800 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/Metal 33632041.5 ns
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3564305 ns 2958637.5 ns 1.20
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 618111458.5 ns 618298396 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 352015458.5 ns 439277916 ns 0.80
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 439011437.5 ns 439303895.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1195193792 ns 1200068000 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26696499.5 ns 26614249.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/Metal 111449958 ns
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21986711 ns 16011697.5 ns 1.37
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7458 ns 7417 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5334 ns 6125 ns 0.87
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6209 ns 6125 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9792 ns 10125 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 26288 ns 26885 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 828729 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48740 ns 54341 ns 0.90
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 248833 ns 214083 ns 1.16
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230291.5 ns 232833 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 225333.5 ns 230000 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214208 ns 207709 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 213874 ns 215596 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9173729.5 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 521576 ns 546726.5 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7791 ns 7417 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8250 ns 8875.5 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9875 ns 10750 ns 0.92
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7750 ns 10459 ns 0.74
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 114289 ns 111291 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 1112625 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 70720 ns 72956 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7917 ns 7792 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9292 ns 7833.5 ns 1.19
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8937.5 ns 8125 ns 1.10
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8375 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 486362 ns 492517.5 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5044854.5 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 315959 ns 322723 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 459 ns 417 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 25338 ns 25272 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 726000 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 46771 ns 45194 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 10604.5 ns 9646 ns 1.10
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10771 ns 9541 ns 1.13
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10125 ns 11104 ns 0.91
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 10500 ns 10333 ns 1.02
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 243207 ns 247083 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 6344458 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 387615 ns 383457 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 351084 ns 351000 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 360916.5 ns 354459 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352187 ns 352250 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 353667 ns 351625 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 22345 ns 23168 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 310937.5 ns
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 189077.5 ns 198701 ns 0.95
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 786583.5 ns 826000 ns 0.95
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 799959 ns 820458 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 807250 ns 822083.5 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 799146.5 ns 827750 ns 0.97
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 216599 ns 214195.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2720084 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 607873 ns 578901 ns 1.05
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5667 ns 5229.5 ns 1.08
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6333 ns 5875 ns 1.08
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7250 ns 6958.5 ns 1.04
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 3917 ns 4667 ns 0.84
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17357 ns 17091 ns 1.02
batchedmm(16, Bsize=32)/forward/GPU/Metal 1903500 ns
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 71671 ns 74219 ns 0.97
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 12583.5 ns 13458.5 ns 0.93
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 11229.5 ns 10625 ns 1.06
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 12292 ns 13041 ns 0.94
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 17291 ns 18542 ns 0.93
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 203999.5 ns 202239.5 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/Metal 5059625 ns
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 368794 ns 330217 ns 1.12
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39542 ns 39833.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 50166.5 ns 51209 ns 0.98
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52542 ns 52458.5 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13917 ns 13459 ns 1.03
batchedmm(16, Bsize=128)/forward/GPU/CUDA 19944.5 ns 19993 ns 1.00
batchedmm(16, Bsize=128)/forward/GPU/Metal 4970958 ns
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 86896 ns 99666.5 ns 0.87
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36812.5 ns 38229.5 ns 0.96
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 29292 ns 35125 ns 0.83
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 32875 ns 34187.5 ns 0.96
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 78541 ns 59417 ns 1.32
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 180178 ns 178995.5 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/Metal 13303396 ns
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 412350 ns 362888 ns 1.14
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3625 ns 3500 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3709 ns 3667 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3750 ns 3833 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3417 ns 3709 ns 0.92
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 19299 ns 19015 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 489416 ns
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 28800 ns 29645 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4250 ns 4291 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4417 ns 4500 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4500 ns 4458 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4167 ns 4292 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 194770.5 ns 194611 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 2153291.5 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 136382 ns 126757 ns 1.08
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4833 ns 5916 ns 0.82
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6375 ns 5062.5 ns 1.26
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6771 ns 6375 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4417 ns 4625 ns 0.96
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 140113.5 ns 138395 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 1172542 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 59621 ns 65944 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8833 ns 9625 ns 0.92
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9000 ns 8500 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8833 ns 9333 ns 0.95
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8791 ns 10666 ns 0.82
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 809012 ns 807046.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 7637459 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 386675 ns 378457 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204125 ns 207583 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 211292 ns 209042 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 211167 ns 213208 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 200583 ns 204125 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36190 ns 35332 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 844791.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 205402 ns 203930.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 612042 ns 603500 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 633416.5 ns 623479.5 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 625250 ns 658604.5 ns 0.95
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 592250 ns 586375 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 255705 ns 254148 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8231270.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 797760 ns 767213 ns 1.04
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3310375 ns 3324167 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1778188 ns 2328667 ns 0.76
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2329291.5 ns 2334417 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6304709 ns 6324542 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 204430 ns 206559 ns 0.99
batchedmm(128, Bsize=128)/forward/GPU/Metal 6035916 ns
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 217792.5 ns 377105 ns 0.58
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11442083.5 ns 11496208.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 6658375 ns 8303562.5 ns 0.80
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8339708.5 ns 8348416.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21081083 ns 21193020.5 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 735864.5 ns 736080.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/Metal 20279917 ns
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1067533 ns 2044820.5 ns 0.52
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5791 ns 3917 ns 1.48
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6292 ns 5292 ns 1.19
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7167 ns 6292 ns 1.14
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4500 ns 7125 ns 0.63
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 131372 ns 129442 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 1175458 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 53861 ns 57067 ns 0.94
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8375 ns 8500 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8333 ns 7375 ns 1.13
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9250 ns 7833 ns 1.18
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9000 ns 8291.5 ns 1.09
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 707753 ns 711410 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 7292583 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 367029.5 ns 364581 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 121562.5 ns 117312.5 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 124917 ns 101437.5 ns 1.23
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 101250 ns 102687.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 125062.5 ns 98458.5 ns 1.27
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 148668.5 ns 149616 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2918000 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 192532 ns 210473 ns 0.91
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2031542 ns 2008250 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1950834 ns 2022459 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2007750 ns 2039937.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2030959 ns 2036625 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 668590 ns 661994.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10443458 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1253100 ns 963831 ns 1.30
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 34167 ns 33416 ns 1.02
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 33666 ns 35459 ns 0.95
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 34375 ns 34709 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 542 ns 750 ns 0.72
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15591 ns 15265 ns 1.02
batchedmm(2, Bsize=4)/forward/GPU/Metal 550104 ns
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 70251 ns 78737 ns 0.89
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 3000 ns 3959 ns 0.76
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3604.5 ns 2917 ns 1.24
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 4375 ns 4708 ns 0.93
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2833 ns 3666 ns 0.77
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 136806 ns 136137.5 ns 1.00
batchedmm(2, Bsize=4)/zygote/GPU/Metal 1196250 ns
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 337614 ns 321796.5 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7250 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5334 ns 6042 ns 0.88
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 3667 ns 6083 ns 0.60
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10042 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 35320 ns 34970 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 846271 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48031 ns 56516 ns 0.85
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221208 ns 221584 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 231083.5 ns 220959 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222042 ns 234583 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 207104 ns 207333 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 238515 ns 237194 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7909791 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 509876 ns 540189 ns 0.94
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3750 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3833 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3709 ns 3958 ns 0.94
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22010 ns 21681 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 480708 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 42240 ns 39383 ns 1.07
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14458 ns 14458 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14250 ns 14458 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14667 ns 14541 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14459 ns 14625 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 297003.5 ns 297631.5 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 2355083.5 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 194062 ns 190215 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 141896 ns 129834 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 131583 ns 118271 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 106125 ns 106750 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 101396 ns 101666.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132246.5 ns 150106 ns 0.88
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2848042 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 195502 ns 241781 ns 0.81
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1921791.5 ns 1921708.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1941958 ns 1924583 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1922084 ns 1932000 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1925000 ns 1922750 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 660860 ns 653385 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10632334 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1214399.5 ns 928325 ns 1.31
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18312.5 ns 18875 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19083 ns 17292 ns 1.10
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21334 ns 20937 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17770.5 ns 18459 ns 0.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 104874 ns 104073.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1354708 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75806 ns 91301 ns 0.83
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 226750 ns 239083.5 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 217833 ns 224791 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229000.5 ns 224958.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 257437.5 ns 218500 ns 1.18
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 497280 ns 493640.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6075250 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 479116 ns 439080 ns 1.09
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 25042 ns 26166 ns 0.96
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 26291 ns 29167 ns 0.90
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 28417 ns 28958 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1125 ns 1416 ns 0.79
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16284 ns 15781 ns 1.03
batchedmm(16, Bsize=4)/forward/GPU/Metal 541312.5 ns
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 87411 ns 72756 ns 1.20
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 5292 ns 6208 ns 0.85
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 5459 ns 5041 ns 1.08
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 6417 ns 6875 ns 0.93
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 5500 ns 6417 ns 0.86
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 201005.5 ns 199155.5 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/Metal 2020334 ns
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 390754 ns 324216 ns 1.21
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 221145.5 ns 221875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 222875 ns 223375 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 223666 ns 225375 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 222729.5 ns 223542 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 218348.5 ns 216803 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1683750 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 269733 ns 267771 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 509020.5 ns 508542 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 565792 ns 511042 ns 1.11
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 512270.5 ns 509500 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 500333.5 ns 557354 ns 0.90
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1028150 ns 1017707.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8579625 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 850900 ns 811461 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18875 ns 19104 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20271 ns 19584 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21417 ns 22063 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19625 ns 19792 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111806.5 ns 111072 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1458875 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77311 ns 90009 ns 0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220979 ns 221854 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 239834 ns 220250 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 224916 ns 218166.5 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217249.5 ns 220146 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 711348 ns 700847.5 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7148708.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 538287 ns 494855 ns 1.09
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6375 ns 6292 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7208 ns 7000 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8271 ns 7375 ns 1.12
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5375 ns 6834 ns 0.79
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 133581 ns 130925 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 1164750 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 66341 ns 63498 ns 1.04
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11354 ns 11041.5 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11375 ns 9959 ns 1.14
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12562.5 ns 10895.5 ns 1.15
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12563 ns 10459 ns 1.20
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 770196 ns 770540.5 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 7229125 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 386869.5 ns 375452 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5250 ns 4104 ns 1.28
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5375 ns 7041 ns 0.76
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7334 ns 7166 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4792 ns 6166 ns 0.78
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 135271.5 ns 131485.5 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 1193459 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 58991 ns 62607 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7667 ns 7416.5 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7959 ns 7750 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7854.5 ns 8125 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7520.5 ns 8083 ns 0.93
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 737431 ns 737449 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 7609209 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 396605 ns 380902 ns 1.04
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14453396 ns 14481917 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 7701875 ns 10107542 ns 0.76
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10103083 ns 10094750 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27738458 ns 27859959 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 531399 ns 533975 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/Metal 22191895.5 ns
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 392545 ns 867906.5 ns 0.45
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46327270.5 ns 46387667 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 26716104 ns 33363354 ns 0.80
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33470417 ns 33478875 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85517417 ns 85752792 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2856976 ns 2651799 ns 1.08
batchedmm(128, Bsize=512)/zygote/GPU/Metal 88528708.5 ns
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3296365 ns 5191497.5 ns 0.63
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 185583 ns 185208.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 187042 ns 185916 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 187291.5 ns 188604 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 184792 ns 187271 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 103848 ns 117719.5 ns 0.88
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1537500 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 231333 ns 236051 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 598791.5 ns 634875 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 599958 ns 627937.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 602250 ns 601166 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 587896 ns 587625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 713701 ns 694993 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7615520.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 788674 ns 698169.5 ns 1.13
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 667 ns 541 ns 1.23
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 666 ns 584 ns 1.14
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32643 ns 31826 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 664895.5 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47540 ns 48104.5 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9041.5 ns 9541 ns 0.95
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12083 ns 9687.5 ns 1.25
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13250 ns 10542 ns 1.26
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10792 ns 10938 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 278611.5 ns 276120 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 6110667 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 372684 ns 371078 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26250 ns 26250 ns 1
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26292 ns 26333 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26334 ns 26583 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26458 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23639 ns 22942 ns 1.03
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 423354.5 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 210507.5 ns 206526 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67375 ns 67125 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67375 ns 67333 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 68333 ns 68792 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66959 ns 66875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 277123 ns 273858 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 2163167 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 607047 ns 554115 ns 1.10
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204083 ns 207166 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210917 ns 211667 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209917 ns 211167 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199709 ns 202875 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27902 ns 27563 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 852708.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 205893 ns 206546 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 610813 ns 609937.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 632959 ns 669750 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 635396 ns 664812.5 ns 0.96
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 588854.5 ns 609042 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 239352 ns 233231.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9235709 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 839150.5 ns 798562 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 649417 ns 664875 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 658250 ns 636687.5 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 651458 ns 648791.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 650583 ns 629792 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189912.5 ns 185894.5 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1398604 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 251273 ns 349393 ns 0.72
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2235625 ns 2244229 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2311187.5 ns 2225354 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2238000 ns 2256708 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2245375 ns 2271792 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 922866 ns 900927 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9537166.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1356111 ns 1235829 ns 1.10
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20437.5 ns 19333 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20708 ns 21166.5 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22042 ns 22375 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19396 ns 19958 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111717 ns 106770.5 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1470978.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75441 ns 89387 ns 0.84
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 233271 ns 227250 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 232958 ns 262312.5 ns 0.89
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 233167 ns 231250 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 221208.5 ns 222770.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 709044 ns 700957 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7671770.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 555096.5 ns 516550 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 667 ns 500 ns 1.33
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23540 ns 22928 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 727375 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 47941 ns 44243 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 12041 ns 9583 ns 1.26
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 12125 ns 9958.5 ns 1.22
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17229.5 ns 13229.5 ns 1.30
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 11166 ns 10875 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 260166 ns 258192 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6474000 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 397565 ns 395479 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8500 ns 8062.5 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9208 ns 9208 ns 1
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9833 ns 10459 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7167 ns 8333 ns 0.86
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 116262.5 ns 112863.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 1132416 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 67351 ns 72315 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 7500 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10542 ns 7750 ns 1.36
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9000 ns 14875 ns 0.61
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 14959 ns 8917 ns 1.68
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 480097.5 ns 472419 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4769916.5 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 318874 ns 321811 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2208.5 ns 1979.5 ns 1.12
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2458 ns 2500 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2625 ns 2542 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2083 ns 2416 ns 0.86
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 19599 ns 19845 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 420458.5 ns
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 189912 ns 191508 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6750 ns 6666 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 8291 ns 6459 ns 1.28
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 7334 ns 7292 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6791 ns 7292 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 212249 ns 208409 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 2347167 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 580124 ns 543621 ns 1.07
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 749667 ns 754167 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 749000 ns 751000 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 747625 ns 749375 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 748645.5 ns 747104 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 22873 ns 22303 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 324209 ns
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 33080 ns 47829 ns 0.69
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 792750 ns 792250 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 791625 ns 811750 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 799541.5 ns 789500 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 787417 ns 794229.5 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 210255.5 ns 206590.5 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2648354.5 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 231762 ns 233541 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7250 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5250 ns 5917 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 6000 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10209 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33271 ns 32976 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 852917 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50031 ns 57267 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 230708 ns 228458.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 234646 ns 269270.5 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 235812 ns 235021 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 252042 ns 213146 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 258276 ns 254662 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8291125 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 523734 ns 552652 ns 0.95
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12541.5 ns 12417 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 13312.5 ns 13250 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 14916 ns 14458 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10771 ns 13000 ns 0.83
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 134784 ns 131273.5 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 1166833 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 235912 ns 231363 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24562 ns 24854.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24916.5 ns 24916 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24958 ns 25542 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25479.5 ns 24458 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 822351 ns 813324 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 7673667 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 685105 ns 634495 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8958 ns 8875 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9749.5 ns 9958 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11125 ns 11167 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8854.5 ns 9542 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 120470.5 ns 116553 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 1250708 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 73741 ns 74930 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 19854 ns 13770.5 ns 1.44
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15750 ns 14917 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14791 ns 15916 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15229.5 ns 16437.5 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 633748 ns 621843 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5614500 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 370283 ns 356836 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9437.5 ns 9145.5 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9937.5 ns 9354 ns 1.06
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11667 ns 10750 ns 1.09
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8416.5 ns 10125 ns 0.83
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 119362 ns 116468 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 1160958 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 70370.5 ns 74383.5 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13021 ns 12916 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 15896 ns 12959 ns 1.23
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13625 ns 20541 ns 0.66
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20083 ns 14500 ns 1.39
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 524705 ns 515709 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5078937.5 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 339993 ns 328534 ns 1.03
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 29959 ns 31062 ns 0.96
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 30833 ns 33146 ns 0.93
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 30770.5 ns 30750 ns 1.00
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 1896 ns 1833 ns 1.03
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16550 ns 16169 ns 1.02
batchedmm(2, Bsize=128)/forward/GPU/Metal 4756209 ns
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 72631 ns 77564 ns 0.94
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5959 ns 5562.5 ns 1.07
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5584 ns 5312.5 ns 1.05
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5541 ns 7208 ns 0.77
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 7229 ns 7834 ns 0.92
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 138201 ns 134922 ns 1.02
batchedmm(2, Bsize=128)/zygote/GPU/Metal 13282667 ns
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 370683 ns 340125 ns 1.09
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 25103.5 ns 24307 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 700709 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 47470 ns 45845 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7541.5 ns 6166.5 ns 1.22
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7583 ns 6708 ns 1.13
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6917 ns 8167 ns 0.85
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7062.5 ns 7083 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 184083 ns 179926.5 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 6386187.5 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 390413 ns 372385.5 ns 1.05
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6041 ns 5834 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5833 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5833 ns 5875 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5917 ns 5958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 25719 ns 25187 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 731208.5 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 210291 ns 201636 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 23708 ns 21041 ns 1.13
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 23270.5 ns 21709 ns 1.07
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21750 ns 23458 ns 0.93
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 24250 ns 26125 ns 0.93
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 266239.5 ns 262884 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6639625 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 707865 ns 615780.5 ns 1.15
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 175833 ns 192083.5 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 175125 ns 158917 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 150792 ns 154416.5 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 175959 ns 146417 ns 1.20
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189040 ns 184640 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1564416.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 174111 ns 215472.5 ns 0.81
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1329062.5 ns 1319792 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1311416.5 ns 1328249.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1318813 ns 1347250 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1346041 ns 1337000 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 862925 ns 844907 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9193604 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1117183.5 ns 1041340 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24375 ns 24292 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25729 ns 24916 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 27458 ns 28000 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23917 ns 24833.5 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 226480 ns 224694.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1700209 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 102501 ns 130334 ns 0.79
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 186854 ns 117583 ns 1.59
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 167167 ns 131375 ns 1.27
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 177291.5 ns 160499.5 ns 1.10
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 124562.5 ns 164750 ns 0.76
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 993547 ns 967206 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8806833 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 608345 ns 585053 ns 1.04
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 333 ns 250 ns 1.33
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 334 ns 1.12
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23084 ns 22932 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 708709 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 49001 ns 47870 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8250 ns 6292 ns 1.31
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9083 ns 6833 ns 1.33
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7000 ns 9416 ns 0.74
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7333.5 ns 7500 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 200052 ns 196587.5 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6611083 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 394084 ns 380031 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5895.5 ns 5875 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6541 ns 6292 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7792 ns 7187.5 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4520.5 ns 6562 ns 0.69
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 138134 ns 134586 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 1154209 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 236352 ns 230170 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10000 ns 9833 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10604.5 ns 10000 ns 1.06
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10167 ns 11187.5 ns 0.91
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9958 ns 11083 ns 0.90
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 852212.5 ns 840176 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 8072333 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 678720 ns 631290 ns 1.08
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1584 ns 1542 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1583 ns 1625 ns 0.97
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22967.5 ns 22272 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 458209 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 210012 ns 204933 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5750 ns 5750 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6166 ns 6125 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5959 ns 6417 ns 0.93
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5750 ns 5875 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 220496.5 ns 216977 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 2224000 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 586500 ns 491814.5 ns 1.19
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8708 ns 8250 ns 1.06
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8167 ns 8562.5 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9917 ns 9895.5 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8417 ns 9209 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 118370.5 ns 115063 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 1213708 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 68660 ns 73999 ns 0.93
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8416 ns 8167 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10000 ns 9250 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8958 ns 9833.5 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9250 ns 10333 ns 0.90
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 563481.5 ns 548589 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5616208 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 344258 ns 340367 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 128958.5 ns 127271 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 96083.5 ns 128750 ns 0.75
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 130042 ns 131062 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 180791.5 ns 181979.5 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46592 ns 46303.5 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/Metal 369729.5 ns
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 95170.5 ns 102121 ns 0.93
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 335729 ns 338125 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 179021 ns 339792 ns 0.53
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 331750 ns 346083 ns 0.96
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 572000 ns 595417 ns 0.96
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 186585.5 ns 181951 ns 1.03
batchedmm(128, Bsize=4)/zygote/GPU/Metal 1385875 ns
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 501200 ns 410627.5 ns 1.22
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397375 ns 397708 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 213645.5 ns 288375 ns 0.74
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 286292 ns 287937.5 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 752167 ns 756708 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44120 ns 43092 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 432792 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 81571 ns 85671 ns 0.95
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1457084 ns 1456291.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 857542 ns 1133125 ns 0.76
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1128083.5 ns 1127937.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2481187.5 ns 2360208 ns 1.05
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 249861 ns 248595.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1748791.5 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 350803 ns 266317 ns 1.32
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 656104.5 ns 643479.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 672833 ns 654166 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 649250 ns 652750 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 667000 ns 650625 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 188293.5 ns 172424.5 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1390208 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 243237.5 ns 315089 ns 0.77
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2415666.5 ns 2449417 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2426229 ns 2455020.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2447042 ns 2465625 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2475437.5 ns 2469208.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 947745 ns 922065 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10591021 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1455792 ns 1363193.5 ns 1.07
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 33208 ns 32917 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 33167 ns 35374.5 ns 0.94
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34334 ns 34417 ns 1.00
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 750 ns 1000 ns 0.75
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16016 ns 15534 ns 1.03
batchedmm(2, Bsize=32)/forward/GPU/Metal 1296458.5 ns
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 77541 ns 78366 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3145.5 ns 2937.5 ns 1.07
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3833 ns 3375 ns 1.14
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3375 ns 5208 ns 0.65
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3875 ns 4625 ns 0.84
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 136913 ns 133935.5 ns 1.02
batchedmm(2, Bsize=32)/zygote/GPU/Metal 5040146 ns
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 355953 ns 318886 ns 1.12
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1458625 ns 1464209 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1495542 ns 1500333 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1499708 ns 1501333 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1437750 ns 1442563 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 42671 ns 41738 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1411187 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 243652 ns 318625 ns 0.76
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5101625 ns 5128625 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5320416.5 ns 5291041 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5302834 ns 5297084 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4990937 ns 4998791.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 234235 ns 230499.5 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11285104 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1238450 ns 1198280 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3791 ns 3709 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3750 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3916 ns 0.95
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34870 ns 33583 ns 1.04
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 404916.5 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 39731 ns 36778.5 ns 1.08
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15292 ns 15417 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15708 ns 15500 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15542 ns 15791 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15500 ns 16000 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 253976 ns 252278 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 1603291.5 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 172581 ns 161662 ns 1.07
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404667 ns 404625 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 216333 ns 296000 ns 0.73
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295666 ns 295916 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 755125 ns 760625 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113698 ns 113161.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 512417 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 90091 ns 95859 ns 0.94
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1483750 ns 1479249.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 880292 ns 1158584 ns 0.76
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1158916.5 ns 1160500 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2472770.5 ns 2383354 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 241711 ns 228888 ns 1.06
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1816083.5 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 351843 ns 265922 ns 1.32
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1041 ns 958 ns 1.09
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1084 ns 1042 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 959 ns 1083 ns 0.89
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 25157 ns 24404 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 712625 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 209502 ns 207859 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 7917 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10041.5 ns 8542 ns 1.18
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8750 ns 9917 ns 0.88
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11250 ns 12895.5 ns 0.87
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 207012 ns 202191 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6760062.5 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 692491 ns 620871 ns 1.12
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 831541 ns 835834 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 464666.5 ns 615542 ns 0.75
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 618667 ns 617791.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1547646 ns 1549375 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 132130 ns 130350.5 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/Metal 1716791 ns
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 166711 ns 215532 ns 0.77
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2686834 ns 2690375 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1530458 ns 2000479.5 ns 0.77
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1999000 ns 2007416.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4939562.5 ns 4941104 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 233538 ns 232712 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/Metal 6467541.5 ns
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 858788 ns 872871.5 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32364 ns 31625 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 643333 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 47840 ns 47950 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6416 ns 6084 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8562.5 ns 6708 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6833.5 ns 7666 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7354.5 ns 8083 ns 0.91
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 220283.5 ns 221856.5 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5857000 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 359924 ns 352319 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1758708.5 ns 1741791.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1753875.5 ns 1752167 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1756209 ns 1739042 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1761791 ns 1719916 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 187281 ns 183055.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1584292 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 354693 ns 415606.5 ns 0.85
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4354750 ns 4361125 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4401000 ns 4365916.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4390770.5 ns 4399333 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4366166.5 ns 4394333 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 845510 ns 827645.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9176375 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1255141 ns 1239667.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 9834 ns 7083 ns 1.39
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6687.5 ns 7395.5 ns 0.90
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7292 ns 7041 ns 1.04
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 8895.5 ns 6854.5 ns 1.30
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 22493 ns 22223.5 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 285021.5 ns
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 33691 ns 47178 ns 0.71
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 49625 ns 45292 ns 1.10
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 33584 ns 51167 ns 0.66
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 50854.5 ns 49250 ns 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 32833.5 ns 49437 ns 0.66
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 210314.5 ns 204846 ns 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2601666.5 ns
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 209212 ns 235841 ns 0.89
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 22416.5 ns 22125 ns 1.01
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 22875 ns 25125 ns 0.91
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 24416 ns 24833 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5395.5 ns 5458.5 ns 0.99
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18228 ns 17859 ns 1.02
batchedmm(2, Bsize=512)/forward/GPU/Metal 14808125 ns
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 90271 ns 82154 ns 1.10
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12041.5 ns 11792 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 9917 ns 10750 ns 0.92
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10792 ns 12583 ns 0.86
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18833 ns 19708.5 ns 0.96
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 221873 ns 216235 ns 1.03
batchedmm(2, Bsize=512)/zygote/GPU/Metal 46172959 ns
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 380594 ns 331099 ns 1.15
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 405875 ns 406250 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 221667 ns 297333 ns 0.75
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 297250 ns 296833.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 758125 ns 762833 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46873 ns 46303.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 448750 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 89581 ns 97252 ns 0.92
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1483542 ns 1477458 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 886375 ns 1164395.5 ns 0.76
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1163584 ns 1164416 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2469250 ns 2386333 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 284393.5 ns 268961 ns 1.06
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2357604.5 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 374963 ns 282959 ns 1.33
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1485583 ns 1488416 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1520792 ns 1526958 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1526916 ns 1529250 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1462209 ns 1466395.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54219 ns 52650 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1149729.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 237227 ns 326982 ns 0.73
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5110708 ns 5119459 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5306604 ns 5285084 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5293958 ns 5297709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4983479.5 ns 4955208 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 256685 ns 250192 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10295792 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1227811 ns 1186136 ns 1.04
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28375 ns 28292 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28375 ns 28292 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28375 ns 28333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28250 ns 28417 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24774 ns 23514.5 ns 1.05
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 458709 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 213012 ns 207227 ns 1.03
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66625 ns 66542 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 66375 ns 66750 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67375 ns 66500 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66292 ns 66208 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 338401 ns 333506.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 2758583.5 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 657866 ns 576948.5 ns 1.14
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 126583 ns 124875 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 127167 ns 81875 ns 1.55
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 92292 ns 89166 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 90041 ns 86750 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193063 ns 191648 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2102167 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 200221.5 ns 233116 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2011833.5 ns 2025145.5 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2035250 ns 2021978.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2020542 ns 2030542 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2015583 ns 1995125 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 517163.5 ns 506195 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9593458 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1086120 ns 881973 ns 1.23

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal merged commit 877ef96 into main Oct 25, 2024
51 of 63 checks passed
@avik-pal avik-pal deleted the ap/fix_downstream branch October 25, 2024 18:59