This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

feat: instancenorm with running statistics #152

Merged 2 commits on Sep 5, 2024

avik-pal (Member) commented on Sep 4, 2024:

Temporarily disabling the other tests. They need to be re-enabled before merging.

avik-pal force-pushed the ap/in_stat_track branch 4 times, most recently from 386c753 to a2993bd, on September 4, 2024 at 23:11

github-actions bot (Contributor) left a comment:

LuxLib Benchmarks

Columns: Benchmark suite, Current (a2993bd), Previous (9d522c5), Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5917 ns 5750 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6375 ns 6187.5 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6875 ns 7979 ns 0.86
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6354.5 ns 6958.5 ns 0.91
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 119595 ns 119461 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2638983 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 841208 ns 723417 ns 1.16
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 416294 ns 417664 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9771 ns 9834 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10084 ns 9792 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9958 ns 9916 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9709 ns 10166 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 556032 ns 551816 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 17113852 ns
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2407250 ns 2364708 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 679437 ns 695047 ns 0.98
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1500 ns 1458 ns 1.03
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 2959 ns 1687.5 ns 1.75
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1834 ns 1917 ns 0.96
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3000 ns 1250 ns 2.40
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 22035 ns 21782 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1290376 ns
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 206917 ns 189208 ns 1.09
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 31001 ns 30960 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4208.5 ns 3958.5 ns 1.06
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4270.5 ns 4167 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4229.5 ns 4000 ns 1.06
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4146 ns 4334 ns 0.96
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 147211 ns 148046.5 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 9003436.5 ns
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1634875 ns 1745084 ns 0.94
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 152001 ns 148342 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57750 ns 56083 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46458 ns 39917 ns 1.16
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46833 ns 47000 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82000 ns 82750 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 36979 ns 37366 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 545608 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1037042 ns 1348187.5 ns 0.77
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 82860 ns 80291 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2035334 ns 2017708 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2076166 ns 2083959 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2083083 ns 2090792 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2002146 ns 1999604 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 230796 ns 232635 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 7641898 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7003625 ns 7104833 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1090251 ns 1540007 ns 0.71
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 150000 ns 143708 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 149791 ns 173750.5 ns 0.86
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 176250 ns 165562.5 ns 1.06
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 173250 ns 165979 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166820.5 ns 166570 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7577001 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1533291.5 ns 1701792 ns 0.90
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 173682 ns 205502.5 ns 0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1087229 ns 1100292 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1124500 ns 1114709 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1111062.5 ns 1122042 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1108417 ns 1119916 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 716617 ns 713685 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 35528650.5 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6150958 ns 7357125 ns 0.84
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1025161 ns 1039502 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4750 ns 4458 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5104 ns 4291 ns 1.19
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6041 ns 6208 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4250 ns 4416 ns 0.96
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 94368.5 ns 94296 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5654517 ns
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 444916 ns 782083.5 ns 0.57
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 62585.5 ns 69431 ns 0.90
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8833 ns 8542 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8792 ns 8834 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 9083 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8583 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 611231.5 ns 608245 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 38660115 ns
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 6129729.5 ns 5666604.5 ns 1.08
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 388614 ns 384864 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17729 ns 17229 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17625 ns 17250 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 21417 ns 22250 ns 0.96
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 19291.5 ns 18312.5 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 67023 ns 68096 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 2950314 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1293375 ns 1292667 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77260.5 ns 74070.5 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223709 ns 218583 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212459 ns 244459 ns 0.87
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 227500 ns 213333 ns 1.07
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219917 ns 220875 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 360786 ns 359693 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 14019876 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5619749.5 ns 7278917 ns 0.77
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 476405 ns 475315 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 708 ns 0.77
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 667 ns 584 ns 1.14
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 792 ns 916.5 ns 0.86
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 666 ns 583 ns 1.14
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20833 ns 20807.5 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1137434 ns
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 300334 ns 297208 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 34260 ns 33001 ns 1.04
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1417 ns 1375 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458 ns 1458 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1666 ns 1583 ns 1.05
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1417 ns 1417 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 126906.5 ns 126203 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8506726 ns
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1628625 ns 1457625 ns 1.12
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 127611 ns 138172 ns 0.92
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7333 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 5375 ns 1.14
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 6083 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10084 ns 10291 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23821 ns 24430 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1260059 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 671167 ns 351229 ns 1.91
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49060 ns 47101 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225000 ns 219208 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 227833 ns 261791 ns 0.87
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 240062.5 ns 228625 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 222479 ns 223750 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 192842.5 ns 194664 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31473361 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8703333 ns 11964250 ns 0.73
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 620596 ns 617187 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4083 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4084 ns 4167 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4125 ns 4125 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4083 ns 4084 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23794 ns 23689 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 1974229 ns
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 222875 ns 203375 ns 1.10
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48831 ns 48541 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 17041 ns 16958 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16875 ns 16583 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17167 ns 17250 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 17167 ns 16917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 197595 ns 196884 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 9863825 ns
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 969833 ns 1560667 ns 0.62
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 177642 ns 174782 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 511000 ns 509333 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 405812.5 ns 332250 ns 1.22
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 406125 ns 404250 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 866000 ns 865708 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113024 ns 114284.5 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 393046 ns
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 464375 ns 392875 ns 1.18
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 247752 ns 248273 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2306958 ns 2318021 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2028375 ns 1745083 ns 1.16
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2028666 ns 2021000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3274208.5 ns 3274791.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 243951 ns 244508 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 9852070 ns
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1941875 ns 2001875 ns 0.97
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 760828 ns 763478 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6542 ns 5833 ns 1.12
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7124.5 ns 7167 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8416.5 ns 7271 ns 1.16
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6542 ns 6124.5 ns 1.07
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 92767.5 ns 92855.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5331169 ns
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 798833 ns 861271 ns 0.93
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 60351 ns 60401 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11479 ns 11375 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12062.5 ns 11750 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12042 ns 12229 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11666.5 ns 11125 ns 1.05
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 654548 ns 638820 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38795434 ns
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5639791.5 ns 6435375 ns 0.88
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 415434 ns 416514.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 541 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 541 ns 541 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 541 ns 541 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23597 ns 23671 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2266003 ns
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 325292 ns 318791 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 51190 ns 53351 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2083 ns 2167 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2209 ns 2084 ns 1.06
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2167 ns 2166 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 223010 ns 222818.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 11010123 ns
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 2006416 ns 1967167 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 177622 ns 180782 ns 0.98
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9166 ns 8708 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9791 ns 8833 ns 1.11
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11687.5 ns 9895.5 ns 1.18
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8708 ns 8709 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 106417 ns 100619 ns 1.06
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3372568.5 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 844499.5 ns 898521 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 74990 ns 74410.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17750 ns 17375 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 18645.5 ns 17167 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18479 ns 19375 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 17250 ns 18250 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 595296.5 ns 574738 ns 1.04
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 17174592 ns
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5296333 ns 5654917 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 386994 ns 389229 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 583 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 667 ns 0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 35962 ns 36237 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1201763 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 463458 ns 463667 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 47860 ns 48401 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8416 ns 8437.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10042 ns 9312 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9750 ns 9875 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9062.5 ns 9708 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 256831 ns 254845 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18555060 ns
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5261083 ns 5087792 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 373643.5 ns 375784 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396667 ns 395833.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287875 ns 215750 ns 1.33
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287834 ns 288166 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 755792 ns 756000 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 111567.5 ns 112957 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 322878.5 ns
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 450958.5 ns 299833 ns 1.50
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 76691 ns 76681 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1443229 ns 1455646 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1135146 ns 862000 ns 1.32
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1132416.5 ns 1130021 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2438083 ns 2442563 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 208450 ns 210541 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 9796183 ns
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1573708 ns 1636104.5 ns 0.96
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 327283 ns 325573.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7437.5 ns 7000 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8062.5 ns 7084 ns 1.14
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8479 ns 8125 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7145.5 ns 7041 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 145243 ns 136948 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5512391 ns
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 446312.5 ns 760125 ns 0.59
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 62560 ns 68820 ns 0.91
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15041.5 ns 14625 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14771 ns 15042 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15291 ns 14958.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13750 ns 15625 ns 0.88
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 965397 ns 931253.5 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 44376533 ns
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6134875 ns 6306249.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 438064 ns 436305 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25958.5 ns 25542 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 27125 ns 27334 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 29750 ns 28354 ns 1.05
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23979.5 ns 31542 ns 0.76
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 202823 ns 200462.5 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 8058826 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 996250 ns 1129500 ns 0.88
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 116721 ns 112942 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 110875 ns 149250 ns 0.74
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 151458 ns 131583.5 ns 1.15
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 150625 ns 106479 ns 1.41
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 104500 ns 153208 ns 0.68
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1088269 ns 1062590 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42023192 ns
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5932833.5 ns 5978292 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 595155 ns 590197 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 77542 ns 76250 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75812.5 ns 74291.5 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79875 ns 77333 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 73208 ns 76792 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 208775 ns 209030.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7708981 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 536542 ns 638458 ns 0.84
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 126861 ns 130572 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 289583 ns 216500 ns 1.34
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 323249.5 ns 297395.5 ns 1.09
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 314416.5 ns 212146 ns 1.48
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 275875 ns 306208 ns 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1125851.5 ns 1140320 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 39323813 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6511542 ns 7480542 ns 0.87
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 698637 ns 697363 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16750 ns 15833 ns 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17750 ns 17291.5 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 18187.5 ns 17875 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16667 ns 16687.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 147960 ns 150183 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5605473.5 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 442666.5 ns 779979 ns 0.57
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 240302 ns 237943 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26916.5 ns 26458.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 28250 ns 25708 ns 1.10
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27541 ns 27625 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26250 ns 27750 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 991977 ns 987976 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 40237198 ns
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6023542 ns 7131041.5 ns 0.84
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 701302.5 ns 701547 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 10792 ns 10396 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 10979 ns 11563 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12958 ns 12833 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10375 ns 10875.5 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 124814.5 ns 125970.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3549474 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 954792 ns 910812.5 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 238007.5 ns 241512 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 22208 ns 21083 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 22750 ns 21604.5 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 22958.5 ns 23041.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21334 ns 21541.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 705181 ns 709336 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 21586618 ns
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5537979.5 ns 5733333 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 669177 ns 676248 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 63166.5 ns 62667 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 64084 ns 63771 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 65541 ns 65667 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 64375 ns 67667 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107012.5 ns 107292 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3389638 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1279125 ns 1352583.5 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 237317.5 ns 240373 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 449416.5 ns 444083 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 480208.5 ns 448875 ns 1.07
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 447521 ns 440458 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 446958 ns 445833.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 516298 ns 521267 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 21075444 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6049625 ns 8808750 ns 0.69
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 718537 ns 728812.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7000 ns 6958.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7479 ns 7291 ns 1.03
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8812.5 ns 8771 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7021 ns 7104 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 146225 ns 147758.5 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5722281 ns
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 446291 ns 763583 ns 0.58
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 61390 ns 60941 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13750 ns 15125 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14562 ns 14417 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16750 ns 15334 ns 1.09
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14583 ns 15958 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 960744.5 ns 958359.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 38852284 ns
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5796291 ns 6378396 ns 0.91
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 405558.5 ns 409474 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6142167 ns 6155291 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6377458 ns 3225687.5 ns 1.98
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6374646 ns 6379541 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11908917 ns 11906125 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 303195 ns 351844 ns 0.86
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 286322 ns 301554 ns 0.95
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19075354.5 ns 19041833.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19951541.5 ns 11118520.5 ns 1.79
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19978750 ns 19989395.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36445104 ns 36469125 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1022515 ns 1015731 ns 1.01
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1160387 ns 1151512 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 917 ns 959 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 959 ns 958 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 959 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 958 ns 958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23341 ns 23791 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 1978246 ns
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 326084 ns 317417 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 214712 ns 215032 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3667 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3791 ns 3667 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3750 ns 3708 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 281453 ns 283833 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11721582 ns
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2113042 ns 2116208 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 643326.5 ns 634877 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8417 ns 7167 ns 1.17
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 9062 ns 7833.5 ns 1.16
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9333.5 ns 9291 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7812.5 ns 7500 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 122008 ns 122503 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3404786 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 790625 ns 866646 ns 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 67280 ns 66931 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12125 ns 11709 ns 1.04
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 12729.5 ns 11834 ns 1.08
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 12500 ns 13291 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11562.5 ns 11875 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 642521 ns 651319 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22835654 ns
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5317417 ns 5038083 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 362758.5 ns 365314 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 334 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22606 ns 22923 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2068918 ns
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 331916 ns 208979.5 ns 1.59
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 51351 ns 50651 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3000 ns 3000 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2958 ns 2959 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3292 ns 3250 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 3208 ns 2959 ns 1.08
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 204531 ns 206218 ns 0.99
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9340318 ns
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1670229 ns 1699541.5 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 160441.5 ns 158851.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 12000 ns 10375 ns 1.16
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11666 ns 11854.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 13250 ns 12417 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10396 ns 12333 ns 0.84
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 122200.5 ns 123182.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3246604 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 915479 ns 877125 ns 1.04
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 239182 ns 241463 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 21104 ns 22062 ns 0.96
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21792 ns 21625 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21500 ns 21708 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20563 ns 20084 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 600065 ns 605852.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20165731 ns
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4787416 ns 5025000 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 665137 ns 667502 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4417 ns 4417 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4416 ns 4584 ns 0.96
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4417 ns 4417 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4375 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24894 ns 24334 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2248390 ns
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 223916 ns 208417 ns 1.07
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 54030 ns 54130 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16458 ns 16375 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16791 ns 16375 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16625 ns 16667 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16583 ns 16875 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 330393 ns 333246 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 13174147.5 ns
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1096687.5 ns 1768771 ns 0.62
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 216292 ns 214042.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2084 ns 2084 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2083 ns 2000 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2167 ns 2166 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2042 ns 2041 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 36372 ns 36196 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1201021 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 446833 ns 473000 ns 0.94
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 206512 ns 205752 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 16812 ns 17667 ns 0.95
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 19625 ns 18937.5 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 18083.5 ns 17625 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16417 ns 16896 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 296169 ns 297235 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21318598 ns
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5102291 ns 5572167 ns 0.92
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 695637 ns 694748 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59542 ns 55979.5 ns 1.06
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 65334 ns 60709 ns 1.08
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 65875 ns 65812.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51291 ns 51583 ns 0.99
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66549 ns 66558 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 98811 ns 120591.5 ns 0.82
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 195875 ns 185895.5 ns 1.05
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 129792 ns 146354 ns 0.89
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 156812.5 ns 136208 ns 1.15
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 308959 ns 297104 ns 1.04
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 218139 ns 218976.5 ns 1.00
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 591506 ns 584106 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 84458.5 ns 112833.5 ns 0.75
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 83666 ns 86417 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 83708 ns 89416 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82771 ns 81000 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 192192 ns 191966 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5581758 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1985375 ns 1945000 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 198612 ns 209467.5 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1917333 ns 1912250 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1921042 ns 1923916 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1913083 ns 1917917 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1896792 ns 1922250 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 537492 ns 536309 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 25658976.5 ns
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8852334 ns 11093750 ns 0.80
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1081525.5 ns 935284.5 ns 1.16
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 291 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 291 ns 250 ns 1.16
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21792 ns 21820 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2054894 ns
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 370250 ns 327833.5 ns 1.13
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 45091 ns 46181 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1791 ns 1.05
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1792 ns 1833 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 252179 ns 254627 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9589577 ns
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1112041.5 ns 1640833 ns 0.68
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 182142 ns 187212 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 10708 ns 8209 ns 1.30
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 10521 ns 9083 ns 1.16
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10708.5 ns 9896 ns 1.08
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8208 ns 8417 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 119900.5 ns 120586.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3415652 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 880333 ns 873250 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 237573 ns 236722 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9208 ns 10292 ns 0.89
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9895.5 ns 8958 ns 1.10
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9750 ns 9917 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8792 ns 8666 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 528737 ns 532717.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18354264 ns
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4675166 ns 4452292 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 631771.5 ns 646767 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57667 ns 56750 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46500 ns 39708 ns 1.17
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46334 ns 47166 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83541 ns 83125 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39618 ns 40431 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1322744 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1121333 ns 1093666 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 73931 ns 77971 ns 0.95
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1938853.5 ns 1903833 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1991958.5 ns 1979312 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1945875 ns 1983896 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1887000 ns 1849208 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 221942 ns 224788 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33002765 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11113083 ns 14363791.5 ns 0.77
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1027840 ns 1042991 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 417250 ns 415042 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 417187.5 ns 418584 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 421250 ns 420291 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 418333 ns 420459 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 211347 ns 212100.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7575667.5 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 539625 ns 1065709 ns 0.51
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 287053 ns 286133 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 738291.5 ns 742875 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 683771 ns 758958 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 757916 ns 691062.5 ns 1.10
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 681208.5 ns 742624.5 ns 0.92
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1065352.5 ns 1063422.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43581622 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6560209 ns 7312146 ns 0.90
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 924089 ns 924920 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3444959 ns 3442959 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3412209 ns 3441833 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3411667 ns 3417500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3424875 ns 3453000 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 174242.5 ns 174858 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8338635 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1410542 ns 1420583 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 433624 ns 452865 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6186541.5 ns 6180375 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6251146 ns 6232875 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6196875 ns 6229979 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6211542 ns 6252666 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1007247 ns 1007257 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50404661 ns
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7388542 ns 9641124.5 ns 0.77
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1563280.5 ns 1560736 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 470792 ns 471375 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 340500 ns 253334 ns 1.34
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 340708.5 ns 341708 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 902792 ns 902583 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46331.5 ns 46913 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 883706 ns
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 498958.5 ns 338020.5 ns 1.48
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 249362 ns 250492 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2334396 ns 2320416 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2027917 ns 1761167 ns 1.15
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2035000 ns 2033167 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3282208.5 ns 3279375 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 258830.5 ns 260626 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 13167072 ns
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2215396 ns 2319917 ns 0.95
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 789218 ns 785678 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57958 ns 56166 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 45958 ns 39417 ns 1.17
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46291 ns 46584 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82916.5 ns 82917 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28557 ns 28863 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1311996 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1137604.5 ns 1130625 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 76966 ns 79170.5 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020625 ns 2020083 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2097750 ns 2062917 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2087000 ns 2078437.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1971500 ns 2004145.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 236092 ns 238429 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36800248.5 ns
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11422291 ns 15264270.5 ns 0.75
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1060671 ns 1057241 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57375 ns 56292 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46584 ns 39833 ns 1.17
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46500 ns 47416 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82542 ns 82875 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 49183 ns 50090 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 815504 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1092000 ns 1054834 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77081 ns 74900 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1936771 ns 1924167 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1983542 ns 1968250 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1972750 ns 1980792 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1881250 ns 1891208 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 240760 ns 243592 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 17572213 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9756750 ns 12800042 ns 0.76
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 929739 ns 1070466 ns 0.87
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 416 ns 375 ns 1.11
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 35201 ns 35236 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1199599 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 461437.5 ns 461750 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 48150 ns 50011 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6709 ns 0.95
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 6520.5 ns 1.15
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6959 ns 7625 ns 0.91
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6667 ns 6541 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 213488.5 ns 216284 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20286833.5 ns
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4892959 ns 5088292 ns 0.96
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 373504 ns 373774 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 250 ns 1.16
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32390.5 ns 32446 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1231275 ns
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 255041 ns 248500 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 41000 ns 40510 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2875 ns 2917 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3125 ns 3250 ns 0.96
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3167 ns 3083 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 3042 ns 3458 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 189695.5 ns 191592.5 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 7603161 ns
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 962208 ns 1031291.5 ns 0.93
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 155736.5 ns 153502 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 429770.5 ns 423917 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 423416 ns 473500 ns 0.89
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 423313 ns 427833 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 427000 ns 424125 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 138026.5 ns 138519 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5855028 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2057062.5 ns 2048875 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 351493 ns 380684 ns 0.92
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3807895.5 ns 3799062.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3817542 ns 3822458 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3807458.5 ns 3802667 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3760479.5 ns 3823563 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 709783 ns 717031.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31085025 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10433209 ns 12950229 ns 0.81
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1493135 ns 1325953 ns 1.13
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49863500 ns 49840813 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35514500 ns 25988833 ns 1.37
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35511042 ns 35525750 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 96900416.5 ns 96904729.5 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1600320.5 ns 1593190 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1005650 ns 1014101 ns 0.99
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154538292 ns 153775938 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112336291.5 ns 89008896 ns 1.26
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112413750 ns 112384750 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 294933354 ns 296752479 ns 0.99
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6479054.5 ns 6476290 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5530406 ns 5534451 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 19125 ns 15062.5 ns 1.27
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 19375 ns 15625 ns 1.24
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 17291 ns 16875 ns 1.02
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 16291.5 ns 15333 ns 1.06
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 21904 ns 21010 ns 1.04
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1073077 ns
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 227500 ns 204959 ns 1.11
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 26131 ns 27230 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 10916.5 ns 11083 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 8958.5 ns 7583 ns 1.18
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9167 ns 9209 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17208 ns 17188 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 262286.5 ns 264057 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 10056101 ns
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1654812 ns 1736125.5 ns 0.95
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 153661 ns 152581.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8958 ns 7417 ns 1.21
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8729 ns 8833 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10313 ns 10041.5 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8125 ns 8292 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 116383 ns 117259.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3328053 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 844500 ns 887417 ns 0.95
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 237893 ns 236902.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9375 ns 9708.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9875 ns 9292 ns 1.06
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10583 ns 10791.5 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9500 ns 9584 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 625403 ns 631614 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22201158 ns
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5192708 ns 5189583 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 651707 ns 668942 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10041.5 ns 8812.5 ns 1.14
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9708 ns 9583 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11458 ns 11042 ns 1.04
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10062 ns 9250 ns 1.09
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 121406 ns 122641 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3262923 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 935083 ns 876791.5 ns 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 69301 ns 74481 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13334 ns 13708 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13188 ns 14979 ns 0.88
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 16458 ns 14416 ns 1.14
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12666.5 ns 13625.5 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 595121 ns 601521.5 ns 0.99
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20037393 ns
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4712562 ns 4885250 ns 0.96
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 351104 ns 353174 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 500 ns 458 ns 1.09
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 459 ns 500 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 34870 ns 35180 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1208458 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 440375 ns 441166 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 208592 ns 206562 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7354 ns 7042 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7916 ns 10458 ns 0.76
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9458 ns 8042 ns 1.18
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7146 ns 7125 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 233423 ns 233713.5 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21500624 ns
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4832271 ns 5300958.5 ns 0.91
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 659156 ns 658707 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 16500 ns 12666 ns 1.30
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 16708 ns 13833 ns 1.21
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 16208 ns 15667 ns 1.03
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 11854.5 ns 10270.5 ns 1.15
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 22539 ns 22010 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1105443 ns
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 213166.5 ns 186625 ns 1.14
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 190492 ns 191282 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32292 ns 32042 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 31708 ns 32020.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32333 ns 32458 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 31833 ns 31854.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 278174 ns 278049 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 10682517 ns
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1778749.5 ns 1885500 ns 0.94
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 603796 ns 606396.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 466750 ns 438291 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 471896 ns 484125 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 445208 ns 446062.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 442875 ns 477208 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194389 ns 194398.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5707585.5 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1991000 ns 1968250 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 350553 ns 375174 ns 0.93
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3826416 ns 3825292 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3820437.5 ns 3837396 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3833542 ns 3828687.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3824416.5 ns 3836875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 543837 ns 549907 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28449805 ns
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10088312.5 ns 12010500 ns 0.84
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1218812 ns 1226382.5 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 786744750 ns 836787979.5 ns 0.94
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 544322375 ns 426008000 ns 1.28
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 544701250 ns 542930250 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1560888250 ns 1533058916 ns 1.02
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22539066.5 ns 22531506 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14026519 ns 14059203 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3016294584 ns 3617643875 ns 0.83
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1790874375 ns 1519606625 ns 1.18
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1791257792 ns 1791220042 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 6320454875 ns 4771769708 ns 1.32
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 366543615 ns 370760684 ns 0.99
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88746342 ns 89879564 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 80166.5 ns 75354.5 ns 1.06
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 76083 ns 77417 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79000 ns 80167 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76187.5 ns 76625 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 209486 ns 210924.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7636757.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 534750 ns 1045583.5 ns 0.51
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 109691 ns 110131.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 208584 ns 231500 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 262687.5 ns 195167 ns 1.35
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 225750 ns 244583 ns 0.92
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 273625 ns 234875 ns 1.16
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1063950 ns 1060035 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43715632.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6250125 ns 6603312.5 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 640881.5 ns 643791.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199907104.5 ns 199256958.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 139067125 ns 103813958.5 ns 1.34
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 138708750 ns 139098125 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 388802584 ns 388864875 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5823959 ns 5820038 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3423293 ns 3424485 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 618968833 ns 615907583.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 440907584 ns 354224562 ns 1.24
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 439090750 ns 440166291.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1177639416 ns 1188432875 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26159681.5 ns 26804213.5 ns 0.98
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21888417 ns 21815881 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7167 ns 7333 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6167 ns 5416 ns 1.14
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6083 ns 6291 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10250 ns 10458 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 27882 ns 28403 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1246522 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 645625 ns 361437.5 ns 1.79
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47780 ns 48715.5 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213562.5 ns 213333.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220687.5 ns 221708 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222520.5 ns 220916 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205917 ns 205750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 224122 ns 226122 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 34490145 ns
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9008375 ns 11493583.5 ns 0.78
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 530410 ns 541195.5 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9062.5 ns 7291 ns 1.24
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 7833 ns 8417 ns 0.93
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10291 ns 10770.5 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9000 ns 8583 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 118000.5 ns 119656 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3370282 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 894584 ns 855542 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 72091 ns 72200 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7667 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7917 ns 9395.5 ns 0.84
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 10584 ns 8375 ns 1.26
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7312.5 ns 7542 ns 0.97
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 524296 ns 526844.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19429602 ns
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4708375 ns 4384667 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 321993 ns 322463 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 542 ns 458 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 500 ns 416 ns 1.20
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 26603 ns 27306 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1230699 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 484062.5 ns 483625 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 50630 ns 48601 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9208 ns 9917 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9375 ns 10167 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 11291 ns 9542 ns 1.18
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9375 ns 8667 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 254202 ns 256488 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 24288314.5 ns
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5903834 ns 5936416 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 394403 ns 396784 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 106833 ns 108542 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 100104.5 ns 85333 ns 1.17
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 100584 ns 100208 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146042 ns 146625 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 25322 ns 25074 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1069883.5 ns
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 268834 ns 244333 ns 1.10
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 188692 ns 190632 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 478458 ns 479625 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 478250 ns 518583.5 ns 0.92
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 478792 ns 481000 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 478125 ns 478125 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 234643 ns 235150 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 12128060.5 ns
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2169750 ns 2164333 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 622201 ns 622586 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5270.5 ns 5500 ns 0.96
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6125 ns 5750 ns 1.07
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 7417 ns 6666.5 ns 1.11
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 6458 ns 4125 ns 1.57
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17119 ns 16723 ns 1.02
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 78390 ns 78130 ns 1.00
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11708 ns 11812 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10812.5 ns 11916 ns 0.91
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11875 ns 11000 ns 1.08
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16542 ns 16500 ns 1.00
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 214283.5 ns 216336 ns 0.99
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 390414 ns 370958.5 ns 1.05
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 38625 ns 35917 ns 1.08
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 52500 ns 50500 ns 1.04
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52667 ns 52709 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13833 ns 13541 ns 1.02
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22075 ns 20359 ns 1.08
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 80161 ns 79931 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36249.5 ns 36625 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 30937.5 ns 29625 ns 1.04
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 32625 ns 31458 ns 1.04
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 56958 ns 57209 ns 1.00
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 192677 ns 195413 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 400584 ns 409364 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1729.5 ns 1959 ns 0.88
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1916 ns 1792 ns 1.07
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2208 ns 2125 ns 1.04
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1792 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 21247 ns 21014.5 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1157702 ns
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 309833.5 ns 324459 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 28990 ns 33550 ns 0.86
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2167 ns 2209 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2125 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2395.5 ns 2417 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2291 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 205351 ns 207244.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 9086951 ns
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1522708.5 ns 1670895.5 ns 0.91
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 136302 ns 137121 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5187.5 ns 4583 ns 1.13
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4625 ns 4750 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7083 ns 6333 ns 1.12
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4292 ns 4917 ns 0.87
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 146996 ns 147827 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5676336 ns
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 595250 ns 771709 ns 0.77
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 62401 ns 71711 ns 0.87
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8542 ns 8270.5 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8667 ns 8666 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10333 ns 8792 ns 1.18
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8083 ns 8125 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 889542 ns 888135.5 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 38346905.5 ns
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5710000 ns 6483625 ns 0.88
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 389144 ns 391164 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56750 ns 56875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57791 ns 56875 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57625 ns 57750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58125 ns 58292 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37659 ns 37890 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1173482.5 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 641708 ns 379312.5 ns 1.69
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 206772 ns 205582 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 482062.5 ns 448479 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 463250 ns 465229 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 508917 ns 464687.5 ns 1.10
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 434333 ns 433500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 269678.5 ns 270782 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27863393 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8045333 ns 10306000 ns 0.78
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 805107.5 ns 801818 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3313249.5 ns 3291000 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2341229.5 ns 1770084 ns 1.32
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2340333 ns 2335292 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6307458 ns 6297083.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 206170.5 ns 206316 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 210057 ns 203322 ns 1.03
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11431687.5 ns 11333854.5 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8348708.5 ns 6594562.5 ns 1.27
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8321437.5 ns 8324937.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21106375 ns 21089229 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 734135 ns 735605 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1070635.5 ns 1072271 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5667 ns 5625 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5667 ns 5667 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7292 ns 7500 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4458 ns 6750 ns 0.66
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 140309.5 ns 139700 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5380103.5 ns
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 768292 ns 867541.5 ns 0.89
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 56180 ns 56260 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 7500 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7625 ns 14625 ns 0.52
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7292 ns 7375 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7083 ns 7000 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 765019 ns 766028 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 35848307 ns
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5274333 ns 5998084 ns 0.88
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 381219 ns 380414 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 141125 ns 117604 ns 1.20
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 124167 ns 125375 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 99021 ns 102396 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 99959 ns 98145.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 151096 ns 152876 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5813160 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2024062.5 ns 2030624.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 170012 ns 185692 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2026542 ns 2021875 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2030187.5 ns 2037125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2022917 ns 2013542 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2001958 ns 2033354 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 710929 ns 716061.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31104834 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10821438 ns 13591542 ns 0.80
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1258188 ns 1265732.5 ns 0.99
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 34104 ns 29833 ns 1.14
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 36667 ns 34167 ns 1.07
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 35896 ns 35542 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 500 ns 625 ns 0.80
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15836 ns 15704 ns 1.01
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 71301 ns 71560.5 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2625 ns 2583 ns 1.02
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3041 ns 4583 ns 0.66
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3000 ns 3000 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2250 ns 2209 ns 1.02
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 140932.5 ns 143464 ns 0.98
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 363533.5 ns 351354 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7208 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6084 ns 5334 ns 1.14
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6042 ns 6166 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9834 ns 10000 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 36632 ns 37164 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1147030 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 478270.5 ns 334396 ns 1.43
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50781 ns 49180 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212875 ns 212895.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 245250 ns 222000 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 243291 ns 221041.5 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205791 ns 205979 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 248183.5 ns 249374 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27101683.5 ns
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7862917 ns 9656333 ns 0.81
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 521905.5 ns 581561 ns 0.90
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3916 ns 3959 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 4000 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22266 ns 21939 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2115567 ns
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 244666 ns 227375 ns 1.08
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 45861 ns 45671 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14834 ns 14916 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15083 ns 14708 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14958 ns 15000 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14875 ns 14875 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 316276.5 ns 314728.5 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 11563770.5 ns
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1010166 ns 1635750 ns 0.62
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 194972 ns 192832 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 135292 ns 109166 ns 1.24
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 126542 ns 132541 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 104833 ns 109875 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 127083 ns 102125 ns 1.24
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 149262.5 ns 138355.5 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5493706 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2741375 ns 2016354 ns 1.36
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 183311 ns 188667 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1875500 ns 1918396 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1929062.5 ns 1939229 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1921833.5 ns 1913584 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1881833 ns 1937625 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 701377 ns 700104 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 29803203.5 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10752125 ns 13264020.5 ns 0.81
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1228342 ns 1233652.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18708 ns 17667 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19333 ns 18458 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20916.5 ns 22270.5 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18625 ns 18250 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 111660 ns 110588.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3330550 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1335000 ns 1374104.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 77091 ns 81891 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216104 ns 216417 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 248250.5 ns 249771 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 222000 ns 216541.5 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 240979 ns 217312.5 ns 1.11
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 529691 ns 527304 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 19308055.5 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6096541.5 ns 8411584 ns 0.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 489315 ns 488925 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 25541.5 ns 24063 ns 1.06
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 30791.5 ns 28500 ns 1.08
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 30625 ns 29459 ns 1.04
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1667 ns 1334 ns 1.25
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16863 ns 16479 ns 1.02
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 82541 ns 82590 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4562.5 ns 4708.5 ns 0.97
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 4916.5 ns 4708 ns 1.04
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5271 ns 5208 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4292 ns 4875 ns 0.88
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 210758 ns 210198 ns 1.00
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 382683.5 ns 398304 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 304208 ns 304792 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 308041 ns 305542 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 308333 ns 311083 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 304708 ns 306375 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 234485 ns 232191.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7433824.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1027333 ns 1156396 ns 0.89
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 275523 ns 279563 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 541792 ns 530625 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 532687.5 ns 542459 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 544583.5 ns 542000.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 548917 ns 535875 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1108177 ns 1096065 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43330985.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6088292 ns 6678000 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 862699 ns 873778.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19375 ns 20083 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 21875 ns 20187.5 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22625 ns 23187 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21291 ns 20959 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 115815.5 ns 115290.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3524302 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1434854.5 ns 1265792 ns 1.13
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 76091 ns 80731 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212500 ns 212042 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 241791 ns 224625 ns 1.08
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220125 ns 214333 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213792 ns 213708.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 761437 ns 758025 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25996104 ns
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7204145.5 ns 10158583 ns 0.71
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 543280.5 ns 542975 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6542 ns 6458 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7042 ns 6917 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8666.5 ns 8542 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6416 ns 6417 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 144083 ns 143078 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5494023 ns
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 783688 ns 869500 ns 0.90
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 71731 ns 69771 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10542 ns 10709 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10583.5 ns 9771 ns 1.08
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10291 ns 10729.5 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9916 ns 10291 ns 0.96
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 846190.5 ns 834187 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 37805204 ns
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5358917 ns 6274750 ns 0.85
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 387734 ns 396084 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5375 ns 5333 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6187.5 ns 4958 ns 1.25
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6979 ns 7125 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4834 ns 5958 ns 0.81
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 147518 ns 146313.5 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5618573 ns
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 791750 ns 875000 ns 0.90
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 61531 ns 67660 ns 0.91
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7291 ns 7667 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8104 ns 7500 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7666 ns 7625 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7708 ns 7459 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 801831 ns 797995 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 38472235 ns
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5789958 ns 6580999.5 ns 0.88
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 397074 ns 400804 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14458000 ns 14350958 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10125125 ns 7722625 ns 1.31
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10112458 ns 10132750 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27710417 ns 27757125 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 532341 ns 532327 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 384094 ns 403538.5 ns 0.95
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46171729 ns 45806208 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33474333.5 ns 26766750.5 ns 1.25
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33459750 ns 33520000 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85189084 ns 85306916 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2669656 ns 2661047 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3280222.5 ns 3296413 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66333 ns 66000 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 67229 ns 67333 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 70667 ns 69854 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 67500 ns 67375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 120718.5 ns 120529 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3600796.5 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1457000.5 ns 1329083.5 ns 1.10
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 231107.5 ns 228112 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 443625.5 ns 444083 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 475854 ns 444083 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 447853.5 ns 441292 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 442792 ns 442521.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 742407 ns 736542.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25435932 ns
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7578187.5 ns 10732062.5 ns 0.71
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 798118 ns 809398 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 542 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 584 ns 667 ns 0.88
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 33000 ns 32886 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1148407.5 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 458917 ns 466834 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 49420 ns 49230 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8584 ns 9375 ns 0.92
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10042 ns 9250 ns 1.09
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9750 ns 9500 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8250 ns 8125 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 292463 ns 290314.5 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 23428542 ns
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4860020.5 ns 5519708 ns 0.88
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 382894 ns 387394 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9875 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9875 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9875 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9833 ns 9791 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23639 ns 23928 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2138902.5 ns
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 222041 ns 204979.5 ns 1.08
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 215522 ns 214872 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45958 ns 46000 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 46708 ns 45667 ns 1.02
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46167 ns 46666 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45834 ns 46250 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 296481.5 ns 293307 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11930292.5 ns
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 926313 ns 1595562.5 ns 0.58
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 623576 ns 621217 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56208 ns 56333 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57209 ns 56792 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57125 ns 57083 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57833 ns 57834 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 29373 ns 29516 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1150343 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 611209 ns 704333.5 ns 0.87
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 203382 ns 205082 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 448833 ns 455021 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 476708 ns 465375 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 503562.5 ns 473000 ns 1.06
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 439500 ns 434208.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 251321 ns 252003 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31963324.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9414500.5 ns 12166125 ns 0.77
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 863589 ns 893508.5 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 646541.5 ns 624416 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 641000 ns 662083 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 641562.5 ns 619083 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 651083 ns 633895.5 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 210265 ns 212333 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8024959 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1399792 ns 1471333 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 232562 ns 236152 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2229208 ns 2220834 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2220124.5 ns 2250000 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2228208 ns 2213792 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2239166 ns 2240750 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1005286 ns 990521.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50598584.5 ns
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8457041 ns 9717333 ns 0.87
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1379658.5 ns 1376089 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 21125 ns 19000 ns 1.11
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20104 ns 19979 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23416 ns 22333.5 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21083 ns 22250 ns 0.95
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 115191.5 ns 114382.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3482804 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1454313 ns 1244584 ns 1.17
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 75055.5 ns 81450 ns 0.92
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 225500 ns 222479 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 230791 ns 224959 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223750 ns 221208 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 219354 ns 218917 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 740987 ns 738666.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 25904808.5 ns
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7416125 ns 10456396 ns 0.71
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 562065.5 ns 562856 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 500 ns 584 ns 0.86
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 667 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 541 ns 542 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23696 ns 23746 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1202842 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 469708.5 ns 488062.5 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 52120 ns 49670 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8375 ns 9541.5 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9875 ns 9792 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10125 ns 9833 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8813 ns 9291.5 ns 0.95
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 273167 ns 272510 ns 1.00
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24458892 ns
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6139792 ns 6224583.5 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 408644 ns 407824 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8958 ns 7708 ns 1.16
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9937.5 ns 8687.5 ns 1.14
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9750 ns 11166.5 ns 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8417 ns 9666 ns 0.87
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 122177 ns 121220 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3286464 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 893667 ns 860208 ns 1.04
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 67831 ns 72661 ns 0.93
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7708.5 ns 7708 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7958 ns 7250 ns 1.10
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7792 ns 8125 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7354.5 ns 7334 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 520137 ns 516336 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17380037 ns
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4378271 ns 4339813 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 327554 ns 328244 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1250 ns 1458 ns 0.86
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1646 ns 1375 ns 1.20
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1792 ns 2041.5 ns 0.88
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1541 ns 1583 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 22253.5 ns 21646 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1134787.5 ns
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 310250 ns 305020.5 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 191402 ns 191511.5 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3458 ns 3334 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3416.5 ns 3375 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3583 ns 3459 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3250 ns 3458 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 222894.5 ns 224911 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 10304323.5 ns
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1806500 ns 1768041 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 594426 ns 595216 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 146687.5 ns 145708.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 129292 ns 106562.5 ns 1.21
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 130125 ns 129292 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225021 ns 225125 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 24810 ns 24473.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1164548 ns
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 291229 ns 252375 ns 1.15
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 36760 ns 38390 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 143312.5 ns 143771 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 110917 ns 88167 ns 1.26
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 111645.5 ns 110771 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 250854.5 ns 250875 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 222162 ns 220914.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10422924 ns
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 1979750 ns 2045709 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 220922.5 ns 237933 ns 0.93
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7250 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5958 ns 5333 ns 1.12
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6041 ns 5916 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10208 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33688 ns 33448 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1158395.5 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 349458 ns 335833 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50791 ns 50340 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220396 ns 224250 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228250 ns 228375 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228250 ns 236083.5 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217187.5 ns 212562.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 269449.5 ns 267943.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26368125 ns
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8172042 ns 9170083 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 531435 ns 609306 ns 0.87
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 15125 ns 14458 ns 1.05
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15062.5 ns 14812.5 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 16125 ns 16791.5 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 14709 ns 15334 ns 0.96
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 143874.5 ns 141134 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5485667 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 799375 ns 873104 ns 0.92
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 236992 ns 238182 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23958 ns 24083.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23583 ns 23875 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23791.5 ns 24167 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23895.5 ns 23625 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 888177 ns 878285 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 39660301 ns
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5523166.5 ns 6385188 ns 0.86
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 698717 ns 692226 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9834 ns 8916 ns 1.10
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9542 ns 9687.5 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 11250 ns 12125 ns 0.93
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9041 ns 10416 ns 0.87
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 125866 ns 124959.5 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3385207 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 918291.5 ns 918334 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 73661 ns 75531 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13792 ns 14000 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14500 ns 13729 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14250 ns 14708 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13541 ns 13834 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 680403 ns 676549 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 22152366 ns
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5309250 ns 5573041 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 376079 ns 373189 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 9792 ns 8062 ns 1.21
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 8896 ns 9750 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11584 ns 11916.5 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 8500 ns 10187.5 ns 0.83
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 125117 ns 124116 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3437739 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 894125 ns 883646 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 73521 ns 69690 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12084 ns 12625 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12854 ns 12750 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13042 ns 13542 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12250 ns 12312 ns 0.99
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 564204.5 ns 561116 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18654209 ns
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4654396 ns 4630937 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 347644 ns 345083.5 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 28104 ns 27208.5 ns 1.03
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 34458.5 ns 32333.5 ns 1.07
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 32209 ns 31958 ns 1.01
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 2250 ns 2041 ns 1.10
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16850 ns 16556 ns 1.02
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 73711 ns 82091 ns 0.90
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5104 ns 5229 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 5229.5 ns 4687.5 ns 1.12
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5250 ns 5334 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6479 ns 6458 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 142969.5 ns 142634 ns 1.00
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 370174 ns 367964 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 334 ns 0.87
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 250 ns 250 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 26455 ns 26682 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1200640 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 450083 ns 482271 ns 0.93
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48211 ns 47990 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6500 ns 6500 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6792 ns 6562.5 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6645.5 ns 6709 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6271 ns 6188 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 191827 ns 190767.5 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 24867781 ns
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5869000 ns 5874834 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 389779 ns 394363.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 2042 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 2083 ns 1917 ns 1.09
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2083 ns 2125 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 1917 ns 2000 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 27236 ns 27167 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1169588 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 475875 ns 492292 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 209463 ns 210002 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16771 ns 16833.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16459 ns 16417 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17125 ns 17354.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16958 ns 16458.5 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 280882.5 ns 278278 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24732202.5 ns
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6172584 ns 6125604 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 715917 ns 714427 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 154104.5 ns 146500 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 154208 ns 171396 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 153687.5 ns 155584 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 191750 ns 154167 ns 1.24
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 210824 ns 204804 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7880615 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1458145.5 ns 1553583 ns 0.94
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 193772 ns 231362.5 ns 0.84
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1316937.5 ns 1324312.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1322666 ns 1348021 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1323667 ns 1319083.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1322334 ns 1326542 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 938240 ns 925557 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 45534804.5 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6626458 ns 8602229.5 ns 0.77
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1016615 ns 1014380 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24916.5 ns 23792 ns 1.05
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25417 ns 25354 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 30354 ns 28250 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 23708 ns 24604.5 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 241805 ns 238411 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7385481 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1045479 ns 1139000 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 120432 ns 120312 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 171875 ns 117854 ns 1.46
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 170938 ns 124667 ns 1.37
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 129000 ns 174458.5 ns 0.74
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 171020.5 ns 118354 ns 1.44
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1112146.5 ns 1098934 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 46841407 ns
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6287999.5 ns 7919042 ns 0.79
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 606330.5 ns 614406 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 334 ns 250 ns 1.34
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 291 ns 292 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23274 ns 23522 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1232898 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 483708 ns 491791.5 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 48460 ns 50790 ns 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6291 ns 6583 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6833 ns 6375 ns 1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6958 ns 6833 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6125 ns 6167 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 209275.5 ns 207746.5 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25541336 ns
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5865958 ns 5956667 ns 0.98
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 396823.5 ns 395954 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6209 ns 5958 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5875 ns 6041.5 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7583 ns 7604.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5375 ns 6500 ns 0.83
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 148685.5 ns 147981.5 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5635688 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 450542 ns 774875 ns 0.58
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 236762 ns 239202 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10291 ns 10000 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10625 ns 10083 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10292 ns 10667 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9708 ns 9791.5 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 926094 ns 916090 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 41298893 ns
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5833458 ns 7392292 ns 0.79
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 686037 ns 688747.5 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 708 ns 0.88
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 666 ns 0.94
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 666 ns 666 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 666 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22945 ns 23031 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2003264 ns
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 324291.5 ns 209625 ns 1.55
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 216602 ns 215712 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4584 ns 4833 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4792 ns 4584 ns 1.05
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4833 ns 4833 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4625 ns 4625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 234125 ns 230125.5 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9765420 ns
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1729083 ns 1700146 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 600706 ns 599396 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8541.5 ns 8396 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8417 ns 8000 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9834 ns 10125 ns 0.97
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7770.5 ns 9062.5 ns 0.86
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 124166 ns 123106.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3943268 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 819500 ns 907333 ns 0.90
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 69551 ns 76081 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8666.5 ns 8792 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9125 ns 8459 ns 1.08
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8709 ns 9041 ns 0.96
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8416 ns 8270.5 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 603479 ns 600302.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 22543432 ns
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4953249.5 ns 4960583.5 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 351464 ns 353604 ns 0.99
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 125896 ns 122750 ns 1.03
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129542 ns 95625 ns 1.35
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 130125 ns 130334 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 180833 ns 183125 ns 0.99
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46726 ns 46375 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 94181 ns 98981 ns 0.95
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 319041 ns 303292 ns 1.05
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 344875 ns 182750 ns 1.89
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 340500 ns 345917 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 570229.5 ns 608729 ns 0.94
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 195260.5 ns 195364.5 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 501295.5 ns 494734 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397583 ns 396125 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288270.5 ns 215375 ns 1.34
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287750 ns 287708 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756042 ns 756000 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43912 ns 43820 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1469434 ns
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 424771 ns 358000 ns 1.19
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 83611 ns 83390 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1468708 ns 1446958.5 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1137167 ns 863667 ns 1.32
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1136562.5 ns 1133375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2444229 ns 2443417 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 253458 ns 252085 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 10856605 ns
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1843937.5 ns 1851958 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 354543 ns 350863.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 651084 ns 626459 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 660125 ns 682479 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 652959 ns 615000 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 626167 ns 641167 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 207109 ns 203045 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8156799 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1369084 ns 1359542 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 255513 ns 254223 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2443625 ns 2435250 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2456792 ns 2470979.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2441541 ns 2445042 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2441833 ns 2415792 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1028403 ns 1014910 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 50468967.5 ns
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10457750 ns 11589916 ns 0.90
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1469104 ns 1478675 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 34083.5 ns 29458.5 ns 1.16
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 36312.5 ns 33812.5 ns 1.07
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 35500 ns 34541 ns 1.03
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 875 ns 1042 ns 0.84
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15652 ns 15442 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 72891 ns 85531 ns 0.85
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3209 ns 3250 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3375 ns 3042 ns 1.11
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3375 ns 3416 ns 0.99
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3125 ns 3166 ns 0.99
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 141997 ns 142240.5 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 343713 ns 360413 ns 0.95
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 406000 ns 404291 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 408791 ns 403708 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 408167 ns 409042 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 419542 ns 421875 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 43520 ns 44262 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1357102.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1155208.5 ns 1119041 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 241062 ns 242882 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3864084 ns 3855208 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3988291.5 ns 3997771 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3990020.5 ns 3998125 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3733958.5 ns 3773938 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 249010 ns 248524 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 36216473.5 ns
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11511875 ns 14976771 ns 0.77
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1245217.5 ns 1453704 ns 0.86
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3916 ns 3959 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3917 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 3875 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34609 ns 34278.5 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1165142.5 ns
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 182792 ns 161167 ns 1.13
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 40940 ns 40280 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15708 ns 15875 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 16083 ns 15583 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15959 ns 16041 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15750 ns 15791 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 258758 ns 257529.5 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 8503589.5 ns
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 882021 ns 864083.5 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 165641 ns 168256.5 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404125 ns 403417 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295625 ns 221375 ns 1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295583 ns 295666 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760417 ns 760500 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113700 ns 113952 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1066318.5 ns
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 463583.5 ns 335792 ns 1.38
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 88591 ns 88615.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1492292 ns 1471958 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1152625 ns 887791.5 ns 1.30
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1162562 ns 1157167 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2463959 ns 2467666 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 251541.5 ns 255583.5 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 10024873 ns
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1881709 ns 1946854 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 357224 ns 360243.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 541 ns 542 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 583 ns 584 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 458 ns 500 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 26556 ns 26902 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1136156 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 465500 ns 486187.5 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 208083 ns 208227.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7646 ns 7667 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8084 ns 7666 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7834 ns 7916.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7417 ns 7250 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 222497 ns 219818 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 24199768 ns
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5838500 ns 6151042 ns 0.95
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 689609 ns 686716.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 832270.5 ns 825562.5 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 618459 ns 468833 ns 1.32
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 621499.5 ns 620188 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1558042 ns 1547479 ns 1.01
batchedmm(128, Bsize=32)/forward/GPU/CUDA 130601 ns 131055 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 168282 ns 231953 ns 0.73
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2689437.5 ns 2669042 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 2008583 ns 1538125.5 ns 1.31
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2000250 ns 2006270.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4937229.5 ns 4938583 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 260328 ns 242713 ns 1.07
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 870736 ns 860168 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 334 ns 375 ns 0.89
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 291 ns 333 ns 0.87
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32813 ns 32634 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1166182 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 464354.5 ns 452000 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 51890 ns 48761 ns 1.06
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6083 ns 6437.5 ns 0.94
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6875 ns 6541.5 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6812.5 ns 6750 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6062.5 ns 6000 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 229431 ns 228896 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 22144460 ns
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5526646 ns 5302916 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 366015 ns 369843 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2415167 ns 2391250 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2398042 ns 2400000 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2379812.5 ns 2405958 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2391667 ns 2372125 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 207204.5 ns 204395 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8005080.5 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1453750 ns 1597249.5 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 356885 ns 377704 ns 0.94
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4645458 ns 4646708.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4640083 ns 4648958 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4665375 ns 4659021 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4561854.5 ns 4685792 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 930938 ns 915367 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48256012 ns
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6726208 ns 7426833 ns 0.91
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1414028 ns 1261857 ns 1.12
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6854.5 ns 7479 ns 0.92
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 22875 ns 7125 ns 3.21
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7417 ns 7959 ns 0.93
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6792 ns 7250 ns 0.94
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23968 ns 23573 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1176642 ns
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 283437.5 ns 243500 ns 1.16
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 34960 ns 39571 ns 0.88
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 66187.5 ns 70291.5 ns 0.94
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 52229 ns 45542 ns 1.15
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 50687.5 ns 63500 ns 0.80
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 45209 ns 33104 ns 1.37
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 221676 ns 217821 ns 1.02
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10860397 ns
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2069917 ns 2084458 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 239743 ns 226612 ns 1.06
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 22104 ns 20396 ns 1.08
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 26291.5 ns 24479.5 ns 1.07
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 25458 ns 24854.5 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5958 ns 5500 ns 1.08
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18137 ns 16892 ns 1.07
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 91101 ns 85151 ns 1.07
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12042 ns 11958 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10584 ns 9000 ns 1.18
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10875 ns 10958.5 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 17979 ns 18167 ns 0.99
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 231021 ns 227664.5 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 374574 ns 389024 ns 0.96
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 406209 ns 404791 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 297167 ns 223500 ns 1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296334 ns 296709 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762750 ns 762750 ns 1
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46909 ns 46360 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1358188 ns
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 481770.5 ns 340000 ns 1.42
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 88761 ns 88940 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1490812.5 ns 1485750.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1170000 ns 895812 ns 1.31
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1166250 ns 1165791.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2470395.5 ns 2472333 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 289489 ns 290272 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 12873963.5 ns
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2039750.5 ns 2106583 ns 0.97
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 378364 ns 377424 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 433958 ns 432770.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 436958 ns 430583 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 436542 ns 436958 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 447333 ns 448209 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 55343 ns 54092 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1019801 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1133312.5 ns 1074083.5 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 236118 ns 235772 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3905708 ns 3888958 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4027020.5 ns 4016791.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4021333.5 ns 4025938 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3767563 ns 3793958.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 269874 ns 263523 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31176805 ns
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10180479 ns 11929333 ns 0.85
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1242641 ns 1247352 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8750 ns 8750 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7667 ns 6875 ns 1.12
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7708 ns 7667 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12375 ns 12417 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24263 ns 24084 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2162149 ns
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 226000 ns 211583 ns 1.07
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 215323 ns 216562 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45000 ns 45125 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45333 ns 44750 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45500 ns 45375 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 44875 ns 45187.5 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 351763 ns 347338.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 12458965 ns
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1760771 ns 1883625.5 ns 0.93
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 670439 ns 671931.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 86041.5 ns 104146.5 ns 0.83
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 123250 ns 86437 ns 1.43
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 92208 ns 92875 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 122937.5 ns 126625 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190149 ns 189767 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5780085 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1986083 ns 1966250 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 210857.5 ns 183982 ns 1.15
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2008541 ns 2011000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2030687.5 ns 2025000 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2014687.5 ns 2009458 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2014250 ns 2016917 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 544290 ns 535873.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28037049 ns
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9702729 ns 11961958.5 ns 0.81
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 972217 ns 982380 ns 0.99

This comment was automatically generated by workflow using github-action-benchmark.
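
For readers checking individual rows: the Ratio column appears to be the Current time divided by the Previous time, rounded to two decimals, so values above 1 mean the current commit was slower. This is inferred from the rows above, not from the github-action-benchmark source. The sketch below is a minimal illustration under that assumption; `parse_row` and `ratio` are hypothetical helper names, and rows with no Previous value (the oneAPI entries) are returned with `None` in the missing fields.

```python
import re

# One flattened row looks like: "<name> <current> ns <previous> ns <ratio>".
# The "<previous> ns <ratio>" part is optional (missing for oneAPI rows).
ROW_RE = re.compile(
    r"^(?P<name>.+?) (?P<cur>[\d.]+) ns"
    r"(?: (?P<prev>[\d.]+) ns (?P<ratio>[\d.]+))?$"
)

def ratio(current_ns, previous_ns):
    """Assumed definition of the Ratio column: current / previous, 2 decimals."""
    return round(current_ns / previous_ns, 2)

def parse_row(row):
    """Return (name, current_ns, previous_ns, reported_ratio) for one row.

    previous_ns and reported_ratio are None when the row has no baseline.
    """
    m = ROW_RE.match(row.strip())
    if m is None:
        raise ValueError(f"unrecognized row: {row!r}")
    prev = float(m["prev"]) if m["prev"] is not None else None
    rep = float(m["ratio"]) if m["ratio"] is not None else None
    return m["name"], float(m["cur"]), prev, rep

if __name__ == "__main__":
    row = "batchedmm(2, Bsize=128)/forward/GPU/CUDA 16850 ns 16556 ns 1.02"
    name, cur, prev, reported = parse_row(row)
    # Compare the recomputed ratio with the one reported in the table.
    print(name, ratio(cur, prev), reported)
```

Under this reading, outliers such as the 3.21 ratio for `bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s)` are large relative changes on a single sample, which may reflect runner noise rather than a real regression.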

@avik-pal avik-pal merged commit 1afc1c7 into main Sep 5, 2024
62 of 69 checks passed
@avik-pal avik-pal deleted the ap/in_stat_track branch September 5, 2024 03:23