This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

feat: oneDNN wrapper based on oneDNN_jll #156

Draft
wants to merge 2 commits into
base: main

@avik-pal (Member) commented Sep 9, 2024

@avik-pal force-pushed the ap/onednn branch 2 times, most recently from 475ac00 to 7812347, on September 9, 2024 21:04
@github-actions (bot, Contributor) left a comment

LuxLib Benchmarks
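
Reading aid for the results that follow: the Ratio column appears to be the Current timing divided by the Previous timing, rounded to two decimals, so values above 1 mean the current commit was slower on that benchmark. The sketch below reproduces that arithmetic for three rows copied from the results; the helper names `ratio` and `direction` are hypothetical and are not part of the benchmark bot.

```python
# Minimal sketch, assuming Ratio = Current / Previous (timings in ns, lower is better).
# `ratio` and `direction` are illustrative helpers, not the bot's actual code.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current/previous rounded to two decimals, as in the Ratio column."""
    return round(current_ns / previous_ns, 2)


def direction(r: float) -> str:
    """Label a ratio: above 1 the current commit took longer, below 1 it took less time."""
    if r > 1:
        return "slower"
    if r < 1:
        return "faster"
    return "unchanged"


# (benchmark name, current ns, previous ns) copied from rows of the results
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 6167, 4667),
    ("bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s)", 1500, 2875),
    ("batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)", 6371792, 3224875),
]

for name, current_ns, previous_ns in rows:
    r = ratio(current_ns, previous_ns)
    print(f"{name}: {r:.2f} ({direction(r)})")
```

Many of the CPU rows are in the nanosecond-to-microsecond range, where run-to-run noise can plausibly move the ratio by tens of percent, so a single ratio far from 1 is better treated as a prompt to re-run than as a confirmed regression.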

Benchmark suite  Current (844beaf)  Previous (7ba127a)  Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6167 ns 4667 ns 1.32
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6229.5 ns 6666.5 ns 0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6042 ns 7500 ns 0.81
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7416 ns 5750 ns 1.29
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 119269 ns 117321 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 2830199 ns 2723919 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 835125 ns 3008750 ns 0.28
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 455544 ns 404195 ns 1.13
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9791 ns 9896 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9895.5 ns 9833 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9875 ns 9979 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10000 ns 9958.5 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 595639 ns 533872 ns 1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 18414070 ns 18512917 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2633292 ns 2324292 ns 1.13
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 656066 ns 674968 ns 0.97
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1458 ns 1437.5 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1500 ns 2875 ns 0.52
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1500 ns 2083 ns 0.72
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 2208 ns 1437.5 ns 1.54
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 22524 ns 21479 ns 1.05
bias_activation(32, act=relu)(32 x 128)/forward/GPU/oneAPI 1309757 ns 1282166 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 207541 ns 190209 ns 1.09
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 29130 ns 29540 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3958 ns 4250 ns 0.93
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4417 ns 4167 ns 1.06
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4083 ns 4145.5 ns 0.98
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 3834 ns 4375 ns 0.88
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 146914 ns 144438.5 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/oneAPI 8793357 ns 9108147.5 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1531042 ns 1604875 ns 0.95
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 147022 ns 145092 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57708 ns 55875 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46916 ns 39209 ns 1.20
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45167 ns 46625 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82145.5 ns 84167 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 38357 ns 36824 ns 1.04
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 642389 ns 542002 ns 1.19
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1024291 ns 1333104 ns 0.77
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 83071 ns 81391 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2041646 ns 2024917 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2084416 ns 2079125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2084937 ns 2081625 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1993500 ns 1993125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 232297.5 ns 226688 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 8026795 ns 7623752 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 4457750 ns 7427958 ns 0.60
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1566555 ns 1252074 ns 1.25
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 149083.5 ns 174750 ns 0.85
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 151250 ns 164541.5 ns 0.92
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 146062.5 ns 148812.5 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 145166 ns 144375 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166368.5 ns 165480 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 7633809.5 ns 7680925 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1607645.5 ns 1457521 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 199292 ns 204852 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1115833 ns 1117250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1109042 ns 1109375.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1116042 ns 1113334 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1112584 ns 1112187.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 685909.5 ns 694582 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33611727.5 ns 33705507.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 5937958.5 ns 6238375 ns 0.95
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1023739.5 ns 1026961 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5062 ns 4417 ns 1.15
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4917 ns 5041 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4708.5 ns 5208 ns 0.90
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4083 ns 4583 ns 0.89
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 92394 ns 93299.5 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 5364370 ns 5368327 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 449521 ns 634041.5 ns 0.71
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 67281 ns 69460 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8625 ns 8375 ns 1.03
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8687.5 ns 8542 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8542 ns 8833 ns 0.97
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8667 ns 8833 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 598599 ns 604485 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 36192580 ns 36365543 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5961458.5 ns 5669937.5 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 392053.5 ns 388374 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17250.5 ns 17000 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17708.5 ns 17709 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17937.5 ns 18021 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 17729 ns 16895.5 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 68975.5 ns 66654.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3001183 ns 2923981.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1319417 ns 477833 ns 2.76
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79131 ns 78451 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216000 ns 216834 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 223000 ns 219896 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212792 ns 225583.5 ns 0.94
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220083 ns 217625 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 357846.5 ns 356473 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 14192257.5 ns 14201022 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5791666 ns 5644395.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 468694 ns 465005 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 667 ns 0.94
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 750 ns 750 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 833 ns 812.5 ns 1.03
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 625 ns 1
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 21039 ns 20462 ns 1.03
bias_activation(2, act=relu)(2 x 128)/forward/GPU/oneAPI 1158458 ns 1162134.5 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 287041.5 ns 302625 ns 0.95
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 30980 ns 32870 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1375 ns 1417 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458.5 ns 1458 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1458 ns 1417 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1333 ns 1416 ns 0.94
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 125621 ns 125127 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/oneAPI 8779470 ns 8831211 ns 0.99
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1564250 ns 1526500 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 135731 ns 136521 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7208 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 5416 ns 1.12
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 6125 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 9959 ns 10666 ns 0.93
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24706 ns 23625 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1229054.5 ns 1207481 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 432792 ns 356458 ns 1.21
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47191 ns 48881 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 228937.5 ns 226166 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 245125 ns 265333 ns 0.92
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228416.5 ns 234854 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215562 ns 219500 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 199703 ns 192027 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 29760693 ns 31211143.5 ns 0.95
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9054125 ns 9046313 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 643376 ns 649247 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4083 ns 4125 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4084 ns 4083 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4125 ns 4084 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4166 ns 4083 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23957 ns 23477 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/oneAPI 2055936 ns 2001417 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 224166.5 ns 214833 ns 1.04
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 46275.5 ns 47261 ns 0.98
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16917 ns 17083 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16916 ns 17000 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17083 ns 16833 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16792 ns 17334 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 196624 ns 195303 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/oneAPI 10472777 ns 14536946 ns 0.72
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 966209 ns 918208 ns 1.05
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 174212 ns 174652 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 510000 ns 508750 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 405500 ns 330583 ns 1.23
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 404500 ns 404666 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865000 ns 864791 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113431 ns 113620 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/oneAPI 401239 ns 401393 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 429875 ns 490979 ns 0.88
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 241452 ns 242133 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2320583 ns 2313834 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2030250 ns 1747479 ns 1.16
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2017459 ns 2035208 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3273458 ns 3272708.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 244185 ns 241207 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11598243 ns 10021457.5 ns 1.16
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 1916000 ns 2011770.5 ns 0.95
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 738512 ns 743443 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6604 ns 4708.5 ns 1.40
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7042 ns 7625 ns 0.92
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6083.5 ns 7708 ns 0.79
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 7250 ns 5479.5 ns 1.32
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 92991 ns 92351.5 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 5513071 ns 5442998 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 770854.5 ns 783479 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 65480 ns 65411 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12083.5 ns 10333.5 ns 1.17
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11750 ns 11875 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12042 ns 11750 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11375 ns 12062.5 ns 0.94
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 631759 ns 634956 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 38968123.5 ns 40400531.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5644020.5 ns 5457291.5 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 408519 ns 409979.5 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 500 ns 541 ns 0.92
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23457 ns 23181 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/oneAPI 2129985 ns 2216579 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 235625 ns 332584 ns 0.71
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 46801 ns 47221 ns 0.99
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2083 ns 2166 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2167 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2166 ns 2084 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2084 ns 2084 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 223097 ns 215755 ns 1.03
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/oneAPI 11589618 ns 11357397.5 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 2049375 ns 1978417 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 174291.5 ns 172626.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8458 ns 8937.5 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9125 ns 9729.5 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9646 ns 9459 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8875 ns 8958 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 103644.5 ns 96639 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 3192803 ns 3207607 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 805042 ns 876000 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 71721 ns 71941 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17834 ns 18521 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17896 ns 19104.5 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18479 ns 17625 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18291 ns 18812.5 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 583162.5 ns 554001 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 17664128 ns 16517942.5 ns 1.07
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5192125 ns 5180916.5 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 380613.5 ns 378539 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 458 ns 1.18
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 459 ns 625 ns 0.73
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 709 ns 666 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 541 ns 500 ns 1.08
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 36527.5 ns 35213 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 1184419 ns 1186873 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 295334 ns 466396 ns 0.63
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 45821 ns 46270 ns 0.99
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10812.5 ns 9312.5 ns 1.16
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8708 ns 9916.5 ns 0.88
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10000 ns 9167 ns 1.09
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8417 ns 9458.5 ns 0.89
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 258735 ns 267136 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 18465714.5 ns 18948901 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5142083.5 ns 4572250 ns 1.12
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 369734 ns 367694 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397625 ns 395333 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287687.5 ns 214416 ns 1.34
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287542 ns 288292 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 755625 ns 756291 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112363 ns 111882 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/oneAPI 334860 ns 329474.5 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 364667 ns 300208.5 ns 1.21
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 76050 ns 77331 ns 0.98
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1459479 ns 1453791.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1133958.5 ns 852583 ns 1.33
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1132729 ns 1132645.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2439375 ns 2440625 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 210126 ns 207032 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/oneAPI 8834558 ns 10204120 ns 0.87
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1568229 ns 1668041.5 ns 0.94
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 320624 ns 324428.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7187 ns 7041.5 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7375 ns 7750 ns 0.95
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7625 ns 9396 ns 0.81
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6812.5 ns 7791.5 ns 0.87
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 141360.5 ns 144806.5 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 5654882.5 ns 5813106.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 450292 ns 437250 ns 1.03
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 70081 ns 66071 ns 1.06
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16312.5 ns 13083 ns 1.25
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15042 ns 14479 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15375 ns 15709 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14458.5 ns 15354.5 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 968205 ns 956377 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 42860911.5 ns 42729213 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5831875 ns 5700250 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 430104 ns 428955 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 26375 ns 24000 ns 1.10
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 26958 ns 24875 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28083 ns 29292 ns 0.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 28750 ns 27667 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 204088 ns 199144 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7654854 ns 7744284 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 634979 ns 999584 ns 0.64
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 114891.5 ns 116931 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 156083 ns 103583 ns 1.51
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104375 ns 152687 ns 0.68
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 150750 ns 153583 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 118312.5 ns 151000 ns 0.78
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1088120 ns 1075746 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43090708 ns 43042130 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5963375 ns 5733792 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 593256 ns 590946.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 80167 ns 75000 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 85167 ns 77084 ns 1.10
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 82750 ns 86333.5 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 75167 ns 74875 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 211316.5 ns 205585 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7489316 ns 8027595.5 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 539791.5 ns 519187.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 127081 ns 127562 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 276562.5 ns 293542 ns 0.94
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 208458 ns 308750 ns 0.68
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 305791.5 ns 315187.5 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 268333.5 ns 304208 ns 0.88
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1151232 ns 1108118 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 41375439 ns 40422383 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6693729.5 ns 6276458 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 691937 ns 695017 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 16875 ns 15875 ns 1.06
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17417 ns 17521 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 17604 ns 18500 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 17084 ns 16958 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 153754.5 ns 146489 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 5655250 ns 5586208 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 450709 ns 723083.5 ns 0.62
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 232832 ns 232683 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 27958 ns 26667 ns 1.05
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26958 ns 26687.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 28271 ns 28208.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 28104 ns 27708.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 979026 ns 982068.5 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 41077814 ns 40344043 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5908521 ns 5743229 ns 1.03
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 690436 ns 686807.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11250 ns 11083 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11292 ns 12042 ns 0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12041 ns 12334 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10625 ns 10791 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 126931.5 ns 124134 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 3475371 ns 3473152 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 808854 ns 880000 ns 0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 238583 ns 234213 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 23000 ns 21958 ns 1.05
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 22500 ns 22729.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 22667 ns 21895.5 ns 1.04
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 22000 ns 22000 ns 1
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 698360 ns 701831.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 23080681 ns 21157140 ns 1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5570042 ns 5204750 ns 1.07
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 673757 ns 674667 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 62875.5 ns 63437.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 63000 ns 65521 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 64042 ns 66750 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 63500 ns 63042 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 110364 ns 106345.5 ns 1.04
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3393751.5 ns 3373870 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1301084 ns 480667 ns 2.71
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 233322 ns 233433 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 445041.5 ns 437896 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 437667 ns 456000 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 479167 ns 450542 ns 1.06
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 450375 ns 444000 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 524383 ns 515188 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20565993 ns 21597008 ns 0.95
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6117292 ns 6095791.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 704751.5 ns 717017.5 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7396 ns 6792 ns 1.09
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8187.5 ns 8000 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7417 ns 8583.5 ns 0.86
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7229 ns 6917 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 147886.5 ns 146052.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 5538159 ns 5510181.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 449250 ns 726500 ns 0.62
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 64911 ns 65301 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16084 ns 14292 ns 1.13
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15354 ns 15292 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16354.5 ns 14084 ns 1.16
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14166.5 ns 16209 ns 0.87
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 954692 ns 947670 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 38171523 ns 39845105 ns 0.96
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5788167 ns 5499875 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 402704 ns 399764 ns 1.01
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6155208 ns 6131500 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 6371792 ns 3224875 ns 1.98
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6370958 ns 6379229.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11912687 ns 11911084 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 302156 ns 349856 ns 0.86
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 303693 ns 303248 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19141583 ns 19059708.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 19944479.5 ns 11090437.5 ns 1.80
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19924250 ns 20005646 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36530333.5 ns 36446770.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1017640 ns 1081781.5 ns 0.94
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1172952 ns 1153782 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1000 ns 958 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 917 ns 1000 ns 0.92
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1000 ns 958 ns 1.04
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 917 ns 917 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23463 ns 23071 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/oneAPI 2180811.5 ns 2085318 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 235959 ns 332541.5 ns 0.71
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 208162 ns 207622 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3666 ns 3667 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3709 ns 3750 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3708 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3708 ns 3667 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 284285.5 ns 281551.5 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11253813 ns 12095727 ns 0.93
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2167729 ns 2129583 ns 1.02
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 625786 ns 626307 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7541.5 ns 8042 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 9125 ns 8145.5 ns 1.12
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8729 ns 9042 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8708 ns 7937.5 ns 1.10
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 123165 ns 121104 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 3667231 ns 3679976 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 744666.5 ns 802541.5 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 65270 ns 65471 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 12062 ns 13125 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11667 ns 12875 ns 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 11959 ns 11417 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 12562.5 ns 12708 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 651060 ns 638151 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22798911 ns 22685670 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5299417 ns 4390333 ns 1.21
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 358783 ns 355644 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 250 ns 333 ns 0.75
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 333 ns 291 ns 1.14
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22961 ns 22337 ns 1.03
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/oneAPI 2071449 ns 2195388.5 ns 0.94
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 228708 ns 207833 ns 1.10
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 46370 ns 47401 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 3042 ns 0.96
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3000 ns 3375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3042 ns 2916 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2875 ns 3333 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 207109 ns 204047 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/oneAPI 9152954 ns 14763707.5 ns 0.62
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1607270.5 ns 1611395.5 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 157106.5 ns 157641.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11917 ns 10250 ns 1.16
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 12208 ns 12167 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11584 ns 12187.5 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 11666.5 ns 10604 ns 1.10
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 124551.5 ns 121713.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 3309022 ns 3281210 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 918375 ns 904791.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 233552 ns 233512.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 20979.5 ns 21104.5 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 22334 ns 22583 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21542 ns 21083 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 21604 ns 21708 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 599594 ns 595173 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20120761.5 ns 20531194.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 4710917 ns 4095583 ns 1.15
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 648136.5 ns 638246.5 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4334 ns 4417 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4375 ns 4375 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4500 ns 4375 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4417 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24357 ns 24193.5 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/oneAPI 2211901.5 ns 2211530 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 228354 ns 215041 ns 1.06
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 47410 ns 47690 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16583 ns 16292 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16708 ns 16291 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16416 ns 16667 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16625 ns 16416 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 332885.5 ns 330020.5 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/oneAPI 12676739 ns 12280627 ns 1.03
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1102416.5 ns 1639709 ns 0.67
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 208862 ns 206457.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 1917 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2000 ns 2167 ns 0.92
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2208 ns 2084 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2083 ns 2084 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 37203 ns 35891 ns 1.04
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 1177427 ns 1213015 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 427479.5 ns 474917 ns 0.90
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 203212 ns 204052 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 19583 ns 19687.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 19125 ns 17187.5 ns 1.11
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21125 ns 17750 ns 1.19
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16500 ns 16667 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 296769 ns 293976.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 20540351 ns 21212198 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5401041 ns 4767354.5 ns 1.13
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 684347 ns 686777 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 59416.5 ns 55771 ns 1.07
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 65209 ns 62792 ns 1.04
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 62542 ns 65604.5 ns 0.95
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51291 ns 51333 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66719 ns 66418 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 116606.5 ns 114241 ns 1.02
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 184521 ns 202896 ns 0.91
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 159771 ns 135104 ns 1.18
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 155166.5 ns 130083 ns 1.19
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 297562.5 ns 245666 ns 1.21
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 220006.5 ns 215296 ns 1.02
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 606336 ns 607861 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 83042 ns 79709 ns 1.04
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 83875 ns 107104 ns 0.78
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 84958 ns 85167 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81500 ns 124166.5 ns 0.66
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 190562.5 ns 192861 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5798141 ns 5531381 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1997417 ns 1816084 ns 1.10
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204577 ns 203512 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1920709 ns 1869895.5 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1913792 ns 1901084 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1920000 ns 1917666.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1920125 ns 1889333 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 539927 ns 531825 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 27537419.5 ns 32650285 ns 0.84
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8864020.5 ns 8859584 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 920879 ns 925670 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 291 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 22072.5 ns 21389 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/oneAPI 2090569 ns 2065883 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 338958 ns 336229.5 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 40850 ns 42770.5 ns 0.96
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1792 ns 1834 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 257371 ns 253832 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/oneAPI 9709585 ns 10417238 ns 0.93
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1020646 ns 1009479 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 179951 ns 184376.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9042 ns 8000 ns 1.13
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9667 ns 10042 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9125 ns 10375 ns 0.88
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9750 ns 8167 ns 1.19
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 121818.5 ns 119090.5 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 3357352 ns 3309191 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 891542 ns 876708 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 232702 ns 232622 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9708 ns 9083 ns 1.07
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10208 ns 10625 ns 0.96
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9541 ns 9542 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8292 ns 10125 ns 0.82
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 534389 ns 527209 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19775589 ns 22247571 ns 0.89
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4303291 ns 3949187.5 ns 1.09
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 622296 ns 624237 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57875 ns 56166 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46458 ns 38916 ns 1.19
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45375 ns 46125 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83000 ns 83958 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 41762 ns 40233 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1338270 ns 1343252 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1165604 ns 1123667 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 76810.5 ns 76266 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1930875 ns 1923750 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1971625 ns 1952750.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1976958.5 ns 1982854 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1877438 ns 1850708.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 226570.5 ns 221906.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33427846 ns 33376877 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11278417 ns 11408021 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1183436 ns 1191052 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 419562.5 ns 416333 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 419375.5 ns 421645.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 419041 ns 421208.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 416209 ns 417667 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 215261 ns 208798 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7681114 ns 7659621 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 541416.5 ns 518208 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 281483 ns 282883 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 751479.5 ns 747916.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 738396 ns 671583 ns 1.10
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 740208 ns 673562.5 ns 1.10
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 781437.5 ns 748021 ns 1.04
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1064114.5 ns 1048327.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 44853248 ns 45569778.5 ns 0.98
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6467854 ns 6335208.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 909209 ns 914290 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3438520.5 ns 3428937.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3375083 ns 3384709 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3406958 ns 3435000 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3397249.5 ns 3417875 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189879.5 ns 175238.5 ns 1.08
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8059161.5 ns 8069034 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1406375 ns 1424083 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 424534.5 ns 426124 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6192958.5 ns 6191270.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6200125 ns 6170041 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6188979 ns 6167416.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6196396.5 ns 6190792 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1020322 ns 994959 ns 1.03
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 65545133.5 ns 50094330 ns 1.31
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7436125 ns 7413750 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1702616.5 ns 1549811 ns 1.10
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 472500 ns 470666 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 341709 ns 252458 ns 1.35
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 341208 ns 342417 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 897875 ns 901125 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 47313.5 ns 46139 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/oneAPI 393377 ns 884569 ns 0.44
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 461042 ns 368208 ns 1.25
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 243782 ns 243602 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2322666 ns 2334750 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 2037395.5 ns 1752562 ns 1.16
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2032542 ns 2041187.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3278750 ns 3280124.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 275549 ns 255952 ns 1.08
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/oneAPI 13477426 ns 12850913 ns 1.05
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2214708.5 ns 2244770.5 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 765227 ns 770018 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57291 ns 55708 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46084 ns 39041 ns 1.18
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 44791 ns 46020.5 ns 0.97
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83042 ns 84125 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 29341 ns 28321 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1428289 ns 1407008 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1147291 ns 1106875 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 75001 ns 76505.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2052187 ns 2029708 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2068187.5 ns 2082292 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2084042 ns 2090958 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2003021 ns 1949604 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 239105 ns 232547 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 39375724 ns 35887652 ns 1.10
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11371687 ns 11649979 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1036740 ns 1052311 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57750 ns 55833 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46792 ns 39083.5 ns 1.20
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 45334 ns 46375 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82917 ns 84042 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51653.5 ns 49287 ns 1.05
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 797393 ns 790006.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1122209 ns 1049084 ns 1.07
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 70150.5 ns 69820 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1927354 ns 1919458 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1981874.5 ns 1955416.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1968271 ns 1946334 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1887083 ns 1890750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 245788 ns 239685 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 17099740 ns 17609091 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9616750 ns 9788042 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1031735 ns 918859 ns 1.12
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 333 ns 292 ns 1.14
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 417 ns 0.70
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 35895 ns 34717 ns 1.03
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 1231936 ns 1181143 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 295541 ns 263500 ns 1.12
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 45920 ns 46211 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6125 ns 6333 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6417 ns 7500 ns 0.86
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6666.5 ns 6583 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6145.5 ns 7000 ns 0.88
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 217032.5 ns 208392.5 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 20544840 ns 20162243 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4868375 ns 4479667 ns 1.09
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 372728.5 ns 365124 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 250 ns 291 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32714 ns 32562 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/oneAPI 1251803 ns 1251080 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 258291.5 ns 258000 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 36890 ns 37000 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2666 ns 2750 ns 0.97
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3041 ns 3625 ns 0.84
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2834 ns 2709 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2666 ns 2917 ns 0.91
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 192919.5 ns 189309.5 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/oneAPI 7137848 ns 7798739 ns 0.92
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 975833.5 ns 905666.5 ns 1.08
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 150721 ns 151136.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 420792 ns 467667 ns 0.90
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 421958 ns 444750 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 424625 ns 425999.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 455854 ns 421833.5 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 140632 ns 137895 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5978562 ns 5774821 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2094041 ns 2386500 ns 0.88
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 378363 ns 367024 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3795708 ns 3802521 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3744542 ns 3765917 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3805146 ns 3811417 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3800458 ns 3799541.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 718592 ns 709425 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 33036728 ns 33554230 ns 0.98
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10823333 ns 10457896 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1465698.5 ns 1471404 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49868000.5 ns 49735229.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 35522791.5 ns 25984959 ns 1.37
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35434437.5 ns 35560875 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 96915395.5 ns 96902041.5 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1603373 ns 1616773 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1046780 ns 1045271 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 154569979 ns 153907333 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 112355625.5 ns 89247291.5 ns 1.26
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 111830209 ns 112379750 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 294823520.5 ns 294166500 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6476750.5 ns 6515848 ns 0.99
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5572630 ns 5562255.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 18791.5 ns 14521 ns 1.29
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 17584 ns 14958 ns 1.18
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 15458 ns 16833 ns 0.92
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15000 ns 14854.5 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 21075 ns 20539.5 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/oneAPI 1101717 ns 1114507 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 220479 ns 206959 ns 1.07
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 25790 ns 26060 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 10937.5 ns 10625 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 9167 ns 7771 ns 1.18
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9250 ns 9208 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17042 ns 17437.5 ns 0.98
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 264109 ns 260548 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/oneAPI 10241493.5 ns 9528073.5 ns 1.07
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1536583 ns 1587125 ns 0.97
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 147571 ns 149326.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8500 ns 7958 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7958 ns 9292 ns 0.86
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9833 ns 9500 ns 1.04
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 9979 ns 7958.5 ns 1.25
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 126526 ns 116273.5 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 3507407 ns 3476228 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 778521 ns 810375 ns 0.96
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 233617 ns 233683 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9458.5 ns 9208.5 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9417 ns 10645.5 ns 0.88
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9875 ns 10208 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9229.5 ns 10375 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 628037.5 ns 619508.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 22754941 ns 22906068.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 4815958 ns 4432792 ns 1.09
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 653036 ns 654786 ns 1.00
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10188 ns 8291.5 ns 1.23
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9750 ns 10459 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9812.5 ns 10042 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10396 ns 9250 ns 1.12
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 123440.5 ns 120531 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 3376982 ns 3436472 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 912146 ns 901792 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 72011 ns 71071 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13875 ns 13250 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13146 ns 16042 ns 0.82
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13583 ns 17208 ns 0.79
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 14041 ns 15167 ns 0.93
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 597349 ns 592138 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 19218920 ns 18951458.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4784916 ns 4027062.5 ns 1.19
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 344493 ns 345753 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 458 ns 459 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 500 ns 541 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 36382 ns 34521 ns 1.05
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/oneAPI 1239135 ns 1191899 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 420458 ns 371562.5 ns 1.13
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 204071 ns 206352 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7166 ns 7062.5 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7125 ns 8333.5 ns 0.85
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7645.5 ns 8583 ns 0.89
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7292 ns 8000 ns 0.91
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 236575 ns 233771 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/oneAPI 21174965 ns 23357164 ns 0.91
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5710375 ns 4885833 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 655981 ns 662116 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 15625 ns 12292 ns 1.27
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 15958 ns 13229 ns 1.21
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 13750 ns 15125 ns 0.91
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10166.5 ns 10167 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 22442 ns 22042 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/oneAPI 1141056 ns 1119591.5 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 212166.5 ns 189125 ns 1.12
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 184201 ns 189132 ns 0.97
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32250 ns 31875 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32083.5 ns 32333.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32208 ns 32291.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 32250 ns 32000 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 277672.5 ns 276327 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/oneAPI 10900465 ns 12201192 ns 0.89
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1658291.5 ns 1697542 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 589205 ns 595015.5 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 440750 ns 480875 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 440624.5 ns 441083 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 448208 ns 450250 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 444042 ns 490979 ns 0.90
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193833.5 ns 194024 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5864901 ns 5766516 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2017041.5 ns 2629708 ns 0.77
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 367053 ns 368063.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3831813 ns 3822958 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3823292 ns 3807354 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3830041 ns 3827834 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3829437.5 ns 3826167 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 552859 ns 544349 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 28143374 ns 29050298 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9281334 ns 9196542 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1354052 ns 1359983 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 785276417 ns 838219667 ns 0.94
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 540295917 ns 415052604.5 ns 1.30
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 554667500 ns 543102500 ns 1.02
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1560118395.5 ns 1525021500 ns 1.02
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22754234.5 ns 22764607.5 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14753214 ns 14772276 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 2530250667 ns 3570164958 ns 0.71
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1789641792 ns 1502049709 ns 1.19
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 2778601041 ns 2269221042 ns 1.22
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 5294222958 ns 4773617583 ns 1.11
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 346639040 ns 369302709 ns 0.94
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 88499181.5 ns 87924411 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 77083.5 ns 79646 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 78333 ns 78895.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 79208.5 ns 78667 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 78979.5 ns 77583 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 213193 ns 207237 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7637698 ns 7871351 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 545083 ns 520375 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 106751 ns 107601 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 197666.5 ns 250834 ns 0.79
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 191708 ns 294583.5 ns 0.65
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 244167 ns 285708.5 ns 0.85
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 266625 ns 222333.5 ns 1.20
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1057411 ns 1049109.5 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 42829342 ns 43337417.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6225479 ns 6122958 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 633805 ns 640576 ns 0.99
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199626249.5 ns 199656458.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 138818666 ns 103769666.5 ns 1.34
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 138760500 ns 139342042 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 388835292 ns 388182208 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5838846 ns 5838796 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3565003 ns 3577840.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 619631416.5 ns 616451521 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 439117667 ns 351188291.5 ns 1.25
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 438492541.5 ns 439680896 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1178157416 ns 1178137125 ns 1.00
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26506796.5 ns 26651952 ns 0.99
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 22062982 ns 22092888 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7333 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6167 ns 5292 ns 1.17
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6125 ns 6084 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10417 ns 10167 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 29047 ns 27714.5 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1220374 ns 1202781 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 427542 ns 351458 ns 1.22
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 46640 ns 48481 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212542 ns 218291.5 ns 0.97
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220375 ns 222250 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221417 ns 221209 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213354 ns 213708.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 227983 ns 222292 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 32340750 ns 31765824 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9106020.5 ns 9125125 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 526590 ns 529665 ns 0.99
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9020.5 ns 7271 ns 1.24
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8541 ns 9541.5 ns 0.90
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9000 ns 9791 ns 0.92
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9396 ns 8187.5 ns 1.15
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 121446.5 ns 117715.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 3283620.5 ns 3188633 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 904979.5 ns 885458 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 69901 ns 69700 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7833.5 ns 7479 ns 1.05
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 10479.5 ns 0.72
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7854.5 ns 10875 ns 0.72
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7895.5 ns 8875 ns 0.89
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 528756.5 ns 519786.5 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 19746247 ns 18597573.5 ns 1.06
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4591708.5 ns 3961208 ns 1.16
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 319453 ns 316073 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 583 ns 416 ns 1.40
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 417 ns 750 ns 0.56
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 500 ns 459 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 27227 ns 26338 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 1166419.5 ns 1200694 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 456541.5 ns 488604.5 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 46370 ns 46820 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9416.5 ns 9291 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9020.5 ns 10416 ns 0.87
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9667 ns 9208.5 ns 1.05
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9562.5 ns 11583 ns 0.83
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 255444 ns 253612 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 23081728 ns 25803867.5 ns 0.89
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5235833 ns 5171833.5 ns 1.01
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 391054 ns 388624 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 107375 ns 104834 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 98667 ns 84834 ns 1.16
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 99833 ns 99500 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146812 ns 146333 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 25167.5 ns 24613 ns 1.02
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/oneAPI 1173177.5 ns 1194962 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 263000 ns 246062.5 ns 1.07
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 189882 ns 192062 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 477917 ns 526854 ns 0.91
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 478541 ns 478875 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 515396 ns 500416.5 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 528917 ns 478958.5 ns 1.10
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 235264 ns 232619 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/oneAPI 11541234.5 ns 11733131 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2156229 ns 1709625 ns 1.26
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 606146 ns 610896 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5375 ns 5125 ns 1.05
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 6000 ns 7167 ns 0.84
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 5292 ns 6791 ns 0.78
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4749.5 ns 4042 ns 1.18
batchedmm(16, Bsize=32)/forward/GPU/CUDA 17066 ns 16580 ns 1.03
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 79131 ns 79701 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 12041 ns 11708 ns 1.03
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 10375 ns 11584 ns 0.90
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11375 ns 10792 ns 1.05
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16646 ns 17687.5 ns 0.94
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 217111.5 ns 214143.5 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 366413 ns 366964 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 39167 ns 35792 ns 1.09
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 51875 ns 50791 ns 1.02
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 49541 ns 51833.5 ns 0.96
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13625 ns 13542 ns 1.01
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20639 ns 21568 ns 0.96
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 86411 ns 87241 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36958 ns 38979.5 ns 0.95
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 30916 ns 30708 ns 1.01
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 31749.5 ns 30416 ns 1.04
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57312.5 ns 58458 ns 0.98
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 196718 ns 192010 ns 1.02
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 415084 ns 395119 ns 1.05
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1979.5 ns 1729.5 ns 1.14
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1792 ns 1875 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2167 ns 2146 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1812.5 ns 1709 ns 1.06
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 21203.5 ns 20594 ns 1.03
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/oneAPI 1106310 ns 1163029.5 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 307459 ns 326833 ns 0.94
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 33890 ns 33120 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2125 ns 2125 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2291 ns 2333 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2458 ns 2250 ns 1.09
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2250 ns 2042 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 207088 ns 204587 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/oneAPI 8807143 ns 9292587 ns 0.95
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1522270.5 ns 1518500 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 143331 ns 136826.5 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5270.5 ns 4417 ns 1.19
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5125 ns 5250 ns 0.98
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5750 ns 6375.5 ns 0.90
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5166.5 ns 4041.5 ns 1.28
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 146943.5 ns 145077 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 5705076 ns 5424296 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 520708 ns 725208 ns 0.72
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 68485.5 ns 69471 ns 0.99
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8333 ns 8041 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8250 ns 8958 ns 0.92
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8416 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8083 ns 9208 ns 0.88
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 888729.5 ns 875812.5 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 39615920.5 ns 40742928.5 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5705937.5 ns 5580917 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 401149 ns 389804 ns 1.03
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56875 ns 56792 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57583 ns 56875 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57792 ns 57584 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58208 ns 58375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 38777 ns 37054 ns 1.05
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 2031645.5 ns 1234596.5 ns 1.65
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 355042 ns 336000 ns 1.06
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 207012 ns 203242 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 448625 ns 485813 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 464333 ns 499958.5 ns 0.93
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 499625 ns 468208 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 434291 ns 438854.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 273231 ns 268055 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 29758946.5 ns 27322975 ns 1.09
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8044833 ns 8122166.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 825417 ns 832729 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3330916.5 ns 3291250 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 2338208 ns 1764708 ns 1.32
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2311375 ns 2339021 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6316875 ns 6260292 ns 1.01
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207709 ns 204625 ns 1.02
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 212792 ns 209992 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11447500 ns 11332208 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 8336208 ns 6550833 ns 1.27
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8225083 ns 8325250 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21090292 ns 20937125 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 739705 ns 734916 ns 1.01
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1050190 ns 1048155.5 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6375 ns 4291 ns 1.49
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4750 ns 5875 ns 0.81
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 5833 ns 6583 ns 0.89
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 7083 ns 4896 ns 1.45
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 140910 ns 137991.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 5627243 ns 5581467 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 787812.5 ns 785625 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 56271 ns 56390 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7208 ns 7042 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7375 ns 10562.5 ns 0.70
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7334 ns 7104.5 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7417 ns 7833 ns 0.95
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 762144 ns 754679 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 35087672 ns 34960226 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5326625 ns 5245042 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 371874 ns 371414 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 122750 ns 127625 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 122020.5 ns 95624.5 ns 1.28
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 98459 ns 100000 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 140792 ns 95708 ns 1.47
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 153890 ns 152137 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5926081 ns 5871279.5 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2123000 ns 2635166.5 ns 0.81
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 204552 ns 203242 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1988542 ns 2017959 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2003812.5 ns 2027771 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2024875 ns 2021167 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2026833 ns 1987167 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 716956 ns 703925.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32083849 ns 31965494 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10716792 ns 11055292 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1107075 ns 1255893 ns 0.88
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 32625 ns 29375 ns 1.11
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 35833 ns 34500 ns 1.04
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 33812.5 ns 35250 ns 0.96
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 520.5 ns 583 ns 0.89
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15839 ns 15622 ns 1.01
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 78371 ns 80130 ns 0.98
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2583 ns 2542 ns 1.02
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 2792 ns 3125 ns 0.89
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 2958 ns 2834 ns 1.04
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2208 ns 3000 ns 0.74
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 140578 ns 141408 ns 0.99
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 339833 ns 343344 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7292 ns 7125 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5937.5 ns 5375 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6041 ns 6000 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10209 ns 10209 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 38269.5 ns 36671 ns 1.04
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1192024.5 ns 1208337 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 367166 ns 331459 ns 1.11
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47825.5 ns 48221 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215792 ns 217479 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 220500 ns 229625 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 247125 ns 225000 ns 1.10
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 206042 ns 212875 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 249845.5 ns 244929 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26197878.5 ns 26091309.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7952645.5 ns 7984187.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 574095 ns 574266 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3958 ns 3959 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 4000 ns 3917 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3959 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22218 ns 21419 ns 1.04
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/oneAPI 2104977 ns 2118188.5 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 247146 ns 234583 ns 1.05
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 41980 ns 42620 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14958 ns 14791 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14917 ns 14750 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14917 ns 14875 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15000 ns 14833 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 314700 ns 311492 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/oneAPI 10856377 ns 10906139 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 997166 ns 982000 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 196832 ns 192231.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 102708 ns 140834 ns 0.73
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 141708 ns 127417 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 104749.5 ns 105167 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 99291 ns 141000 ns 0.70
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 140349 ns 152595 ns 0.92
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5866105 ns 6050834 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2092834 ns 2057334 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 205677 ns 213297 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1930458 ns 1917833 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1878145.5 ns 1898875 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1920854 ns 1922083 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1922917 ns 1898854 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 704122 ns 692137 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 32247238 ns 31139112 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10354916.5 ns 10436541 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1062040 ns 1217872 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 18416 ns 18250 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17666.5 ns 18625 ns 0.95
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19500 ns 20750 ns 0.94
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20792 ns 17749.5 ns 1.17
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 113947.5 ns 110137 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3467976 ns 3282416 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1373875 ns 480541.5 ns 2.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 80120.5 ns 79421 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222604.5 ns 252041.5 ns 0.88
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221583 ns 217541.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 223084 ns 219687.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217250 ns 222729.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 526883 ns 519298 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 20789992 ns 20051825.5 ns 1.04
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6207729 ns 6194812.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 476799.5 ns 478425 ns 1.00
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 23417 ns 23291.5 ns 1.01
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 31750 ns 28583 ns 1.11
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 27479.5 ns 28792 ns 0.95
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1584 ns 1229.5 ns 1.29
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16598 ns 16210 ns 1.02
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 81101 ns 82241 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4917 ns 4292 ns 1.15
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 4833 ns 4729 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5292 ns 5042 ns 1.05
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 5167 ns 5771 ns 0.90
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 210207 ns 207444.5 ns 1.01
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 389554 ns 378084 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 306875 ns 305417 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 306291 ns 306250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 306645.5 ns 308084 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 305542 ns 305750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 242939.5 ns 228609 ns 1.06
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7807299 ns 7545946 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 929959 ns 604584 ns 1.54
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 272783 ns 273963 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 542084 ns 532917 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 573291 ns 538167 ns 1.07
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 541250 ns 539125 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 591834 ns 572709 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1118094 ns 1074383 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43107357.5 ns 44755027.5 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6266417 ns 6115208.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 856818 ns 858603.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20083 ns 19291 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19354.5 ns 20708 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20958 ns 22375.5 ns 0.94
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20041 ns 19875 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 119252.5 ns 114907 ns 1.04
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3858542 ns 3614583 ns 1.07
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1468541.5 ns 593916 ns 2.47
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 79501 ns 79421 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213708 ns 215708 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 212250 ns 220584 ns 0.96
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219792 ns 213625 ns 1.03
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215917 ns 215875 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 773895 ns 762395 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 23689258.5 ns 25444001 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7400708 ns 7232562.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 536315 ns 542290.5 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6708 ns 6125 ns 1.10
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6541.5 ns 7083 ns 0.92
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6916.5 ns 7917 ns 0.87
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6417 ns 6208 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 143905 ns 140165.5 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 5650984 ns 5168559 ns 1.09
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 802042 ns 799291 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 65691 ns 65270 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10875 ns 9542 ns 1.14
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10437.5 ns 10333.5 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10854.5 ns 10375 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10917 ns 11145.5 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 834374 ns 826456 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 37471811 ns 37337383 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5454042 ns 5311708 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 388739 ns 387474 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5167 ns 4875 ns 1.06
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5562.5 ns 6917 ns 0.80
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6271 ns 7250 ns 0.86
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6542 ns 4812.5 ns 1.36
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 148014 ns 144262 ns 1.03
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 5536748 ns 5426091.5 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 810000 ns 808375 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 69510 ns 66621 ns 1.04
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7334 ns 7458 ns 0.98
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7708 ns 8083 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7667 ns 7541.5 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7625 ns 7833 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 793246.5 ns 783702 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 38329662 ns 37497088 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5698792 ns 5566229 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 391653 ns 395004 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14496146 ns 14350584 ns 1.01
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 10152125 ns 7693688 ns 1.32
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10003874.5 ns 10127042 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27734000.5 ns 27615959 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 530595 ns 548306 ns 0.97
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 398084 ns 393134 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 46337667 ns 45943208 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 33437833.5 ns 26437417 ns 1.26
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33081375 ns 33454833 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85226375 ns 84782667 ns 1.01
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2816535 ns 2657066 ns 1.06
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3305850.5 ns 3290613 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 65667 ns 66375 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 66083 ns 68584 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 68875 ns 69333.5 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 66042 ns 65979 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 110394 ns 121920.5 ns 0.91
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3591406.5 ns 3593431.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1454916 ns 508166 ns 2.86
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 238347.5 ns 229397.5 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 441166 ns 446833 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 441125 ns 452437.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 448167 ns 446375 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 453333 ns 445834 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 746940 ns 728139 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 26902056 ns 26912797 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7664166 ns 7552104 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 783197 ns 790108 ns 0.99
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 666 ns 0.75
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 666 ns 500 ns 1.33
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 583 ns 667 ns 0.87
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 33544 ns 32311 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/oneAPI 1171151 ns 1198752.5 ns 0.98
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 290625 ns 473500 ns 0.61
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 47120.5 ns 47340 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9334 ns 8666 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 8729.5 ns 9208 ns 0.95
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9708.5 ns 8458 ns 1.15
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9500 ns 17104 ns 0.56
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 291398 ns 286358 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/oneAPI 21969530 ns 20778583 ns 1.06
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5359292 ns 4681395.5 ns 1.14
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 378433 ns 375004 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9833 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9792 ns 9875 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9875 ns 9792 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9875 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23300.5 ns 23012 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/oneAPI 2128073 ns 2014844 ns 1.06
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 226208 ns 215645.5 ns 1.05
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 205132 ns 205762 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 46417 ns 45958 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 46042 ns 46042 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46084 ns 46041 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 46083 ns 46250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 292954 ns 290878 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/oneAPI 11189770.5 ns 9152947 ns 1.22
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 969625 ns 942542 ns 1.03
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 599185 ns 607695 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56375 ns 56250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 57083 ns 56458 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57166 ns 57083 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57917 ns 57709 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 30029 ns 28552 ns 1.05
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1238260 ns 1253508.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 532417 ns 663666.5 ns 0.80
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 202612 ns 203541.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 449646 ns 448583 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 471541.5 ns 465562 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 471750 ns 465458.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 483645.5 ns 454041.5 ns 1.07
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 252167 ns 245887 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 31302385.5 ns 33424426 ns 0.94
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9573625 ns 9545520.5 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 886119 ns 887779 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 635979.5 ns 645812.5 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 642375 ns 575959 ns 1.12
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 637500 ns 640542 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 647729 ns 646271 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 212654.5 ns 208584 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8043371 ns 8406939 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1391104 ns 1406395.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 304122 ns 315503 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2229500 ns 2214979 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2220500 ns 2211999.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2222167 ns 2220812.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2244750 ns 2227958 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 994128.5 ns 978439 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48896347 ns 47363900 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7263875 ns 10481646 ns 0.69
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1355913 ns 1213952 ns 1.12
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20083 ns 18625 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 21709 ns 20729 ns 1.05
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 20792 ns 21583 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20375 ns 18875 ns 1.08
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 117053.5 ns 113850.5 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 3544768.5 ns 3565557.5 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1358000 ns 497958 ns 2.73
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 78741 ns 79731 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221459 ns 227375 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 263854 ns 259417 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 227875 ns 225541 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 224917 ns 227084 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 737761.5 ns 729838 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27674702 ns 26163617 ns 1.06
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7887646 ns 7560500 ns 1.04
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 545460 ns 554315 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 584 ns 0.86
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 584 ns 541 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23883 ns 23274 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/oneAPI 1185582 ns 1191789 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 441250 ns 484250 ns 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 47751 ns 48040 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9500 ns 9083 ns 1.05
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9500 ns 10437.5 ns 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10271 ns 9541 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9292 ns 9500 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 270173 ns 268183 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24317049 ns 24685731.5 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6185229 ns 5000875 ns 1.24
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 400384 ns 398234 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 9167 ns 7250 ns 1.26
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8709 ns 9187.5 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9083 ns 9645.5 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9500 ns 8041 ns 1.18
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 122465 ns 118921.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/oneAPI 3361391 ns 3382327 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 904145.5 ns 886791.5 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 69921 ns 71801 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 7604 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7459 ns 8125 ns 0.92
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7937.5 ns 7500 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7749.5 ns 7562.5 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 511679 ns 507494 ns 1.01
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/oneAPI 17778250 ns 17189656.5 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4293062.5 ns 3782375 ns 1.14
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 321303 ns 320313 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1500 ns 1500 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1584 ns 1708.5 ns 0.93
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2041 ns 1791 ns 1.14
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1458 ns 1375 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 21786 ns 21598 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/oneAPI 1144522 ns 1189888 ns 0.96
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 304958 ns 313375 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 188582 ns 190932 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3458 ns 3541 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3375 ns 3583 ns 0.94
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3542 ns 3458 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3416 ns 3292 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 224607.5 ns 218452 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/oneAPI 11476067.5 ns 9603283 ns 1.20
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1662542 ns 1797375 ns 0.92
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 578026 ns 583116 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 148020.5 ns 148104.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 127750 ns 106833 ns 1.20
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 128333 ns 128562.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 226084 ns 225000 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 24758 ns 23975 ns 1.03
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/oneAPI 1058093 ns 1165725 ns 0.91
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 275646 ns 254292 ns 1.08
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 39911 ns 41470 ns 0.96
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 143937 ns 157645.5 ns 0.91
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 111000 ns 87625 ns 1.27
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 125875 ns 112000 ns 1.12
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 250750 ns 250708.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 222474 ns 218220.5 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/oneAPI 10717171.5 ns 10460438 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2035208.5 ns 1096666 ns 1.86
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 265987 ns 269773 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7167 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6083 ns 5333 ns 1.14
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6000 ns 6000 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10125 ns 10458 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34367 ns 32755 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 1202902 ns 1178842 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 589750 ns 330458 ns 1.78
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50671 ns 50720 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 220833 ns 253104 ns 0.87
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 233708.5 ns 229041.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 263729.5 ns 234187.5 ns 1.13
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 228334 ns 227938 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 271007 ns 263186.5 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 27775567 ns 27448206 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8334917 ns 8237750 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 597416 ns 594190.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 15375 ns 13792 ns 1.11
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 15708 ns 15166 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 15437.5 ns 16499.5 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 15000 ns 14667 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 142352 ns 139540 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/oneAPI 5444705.5 ns 5436668.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 812500 ns 786729 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 230862 ns 232963 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 23375 ns 23000 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 24187.5 ns 23937.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 23917 ns 23875 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23625 ns 23979.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 876776.5 ns 870094.5 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/oneAPI 39659188.5 ns 40010466.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5781625 ns 5595708 ns 1.03
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 678837 ns 679366 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 9958 ns 8750 ns 1.14
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9208 ns 10312.5 ns 0.89
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 10041.5 ns 11271 ns 0.89
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 9791 ns 9584 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 126368.5 ns 123388.5 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/oneAPI 3435538 ns 3563169 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 522645.5 ns 858292 ns 0.61
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 73501 ns 74460 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14167 ns 13375 ns 1.06
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13312.5 ns 14458.5 ns 0.92
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14667 ns 13958 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14687.5 ns 13625 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 675513 ns 667308 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/oneAPI 20841620 ns 21257602 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5175667 ns 4997708 ns 1.04
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 367658 ns 365743 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10250 ns 8583 ns 1.19
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9417 ns 10333 ns 0.91
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 9916 ns 10312.5 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9167 ns 9166 ns 1.00
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 125489.5 ns 121770.5 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/oneAPI 3383910 ns 3365145.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 931334 ns 906625 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 73070 ns 75170 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12708 ns 12292 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12395.5 ns 13437.5 ns 0.92
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13395.5 ns 12916 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13000 ns 12458 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 562655.5 ns 553718.5 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/oneAPI 19374906 ns 18868109 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4456125 ns 3865125.5 ns 1.15
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 348694 ns 341293 ns 1.02
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 29271 ns 26354.5 ns 1.11
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 34646 ns 30645.5 ns 1.13
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 30334 ns 31541 ns 0.96
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 2000 ns 1833 ns 1.09
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16548 ns 16183 ns 1.02
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 86501 ns 81001 ns 1.07
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5333 ns 5209 ns 1.02
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 4916 ns 5021 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5291.5 ns 5417 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6375 ns 6604 ns 0.97
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 142100 ns 140577.5 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 386404 ns 370423.5 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 26630 ns 25697 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 1201023 ns 1197018 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 440458 ns 465667 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 48650.5 ns 47180 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6125 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6166 ns 6729 ns 0.92
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6958 ns 6333 ns 1.10
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6584 ns 6312.5 ns 1.04
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 191484 ns 187721.5 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 22194887 ns 23736279.5 ns 0.94
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5924646 ns 4952833.5 ns 1.20
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 394354 ns 386429 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2000 ns 1959 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 1917 ns 2042 ns 0.94
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2084 ns 2000 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2042 ns 1959 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 27533 ns 26463 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/oneAPI 1279119 ns 1170027.5 ns 1.09
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 316792 ns 479625 ns 0.66
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 205082 ns 206252 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16917 ns 16250 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16625 ns 16666 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16750 ns 16208.5 ns 1.03
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16167 ns 16417 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 277550 ns 276067 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/oneAPI 24740731.5 ns 24921263 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5558667 ns 5326083 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 701702 ns 700836 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 152271 ns 173875 ns 0.88
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 152312.5 ns 148750 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 156229.5 ns 155708 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 147978.5 ns 147458 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 217398 ns 203847 ns 1.07
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8109553 ns 8347024.5 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1510125 ns 1561917 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 222002 ns 232482 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1325333 ns 1328917 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1318416 ns 1311771 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1327063 ns 1320791 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1336000 ns 1322500 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 922007.5 ns 909940.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 48499912 ns 44667022 ns 1.09
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6407833.5 ns 7124333 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1114440 ns 995559.5 ns 1.12
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 24792 ns 22958 ns 1.08
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 27167 ns 26833 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25479 ns 27625 ns 0.92
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 25375 ns 24667 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 247798.5 ns 234608.5 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/oneAPI 7567395.5 ns 7924652 ns 0.95
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 917625 ns 576541 ns 1.59
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 113721 ns 116011 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 119208.5 ns 118166.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 152104 ns 122375 ns 1.24
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 128875 ns 158041.5 ns 0.82
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 180624.5 ns 123833.5 ns 1.46
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1031705 ns 1073695 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/oneAPI 43993019 ns 44153968 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6305625 ns 6127166 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 613041 ns 612925 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 250 ns 375 ns 0.67
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 291 ns 1.29
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23487 ns 23160 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/oneAPI 1213583.5 ns 1212472 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 448229 ns 478542 ns 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 47080 ns 47471 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6750 ns 6291 ns 1.07
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6833.5 ns 0.93
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6833 ns 6458 ns 1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6792 ns 6584 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 207982 ns 204382.5 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/oneAPI 26359574 ns 24496787 ns 1.08
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5739500 ns 5334937.5 ns 1.08
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 396654 ns 388703 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6500 ns 5208 ns 1.25
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 5834 ns 7021 ns 0.83
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6833 ns 7458 ns 0.92
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5958 ns 5667 ns 1.05
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 148693 ns 145933.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 5749490 ns 5745568 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 456771 ns 753959 ns 0.61
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 231332 ns 234802 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9937.5 ns 9583 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10083 ns 10375 ns 0.97
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10250 ns 10125 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9875 ns 10042 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 915053 ns 903827 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 40523619 ns 42297357 ns 0.96
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5971833 ns 5826479 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 670296 ns 668457 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 667 ns 667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 625 ns 709 ns 0.88
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 708 ns 625 ns 1.13
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 22978 ns 22371 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/oneAPI 2038038 ns 2015786 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 224875 ns 208416 ns 1.08
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 205752 ns 207552 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4625 ns 4584 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4584 ns 4833 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4667 ns 4666 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4667 ns 4584 ns 1.02
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 231780.5 ns 228749 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/oneAPI 9922230 ns 10461831 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1617896 ns 1654416.5 ns 0.98
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 577656 ns 580735 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8750 ns 7750 ns 1.13
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8208 ns 9166.5 ns 0.90
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 8666.5 ns 8834 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 9042 ns 8291 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 126004.5 ns 121959 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/oneAPI 3566791 ns 3411255 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 780521 ns 827916 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 73741 ns 74011 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8833 ns 8625 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8520.5 ns 9041.5 ns 0.94
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8916.5 ns 8583.5 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8833 ns 8375 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 600610 ns 591884.5 ns 1.01
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/oneAPI 21487620 ns 20708574.5 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4955625 ns 4264875 ns 1.16
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 344374 ns 342784 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 127354 ns 122750 ns 1.04
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 129958 ns 96459 ns 1.35
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 126833.5 ns 130187.5 ns 0.97
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 183417 ns 180875 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46282.5 ns 45830 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 100990 ns 101721 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 329833 ns 328000 ns 1.01
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 313667 ns 166666 ns 1.88
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 341042 ns 347541.5 ns 0.98
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 610771 ns 608646 ns 1.00
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 195542 ns 192063 ns 1.02
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 504960 ns 505519.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397500 ns 395916 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 287958 ns 214250 ns 1.34
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287459 ns 288167 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756291 ns 756500 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 44714 ns 43676.5 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/oneAPI 1410517 ns 1411321 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 410687.5 ns 429792 ns 0.96
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 80101 ns 82131 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1460042 ns 1458834 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 1135937.5 ns 857583 ns 1.32
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1133124.5 ns 1134333 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2442875 ns 2441958.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 263533.5 ns 249859 ns 1.05
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/oneAPI 11351221 ns 10370982 ns 1.09
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1766000 ns 1909646 ns 0.92
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 352028 ns 352903 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 647458 ns 616500 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 630625 ns 598250 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 641459 ns 648916.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 661042 ns 642667 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 208592 ns 200586.5 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8373757 ns 7794534 ns 1.07
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1383542 ns 1363291 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 310873 ns 313733 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2458709 ns 2445375 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2452000 ns 2426917 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2439917 ns 2441500 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2458188 ns 2440750 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1014801 ns 994961 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 51359759 ns 50766350 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7516625 ns 9661291 ns 0.78
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1314993 ns 1307388 ns 1.01
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 32375.5 ns 28521 ns 1.14
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 35687.5 ns 34625 ns 1.03
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 32625 ns 33916.5 ns 0.96
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 792 ns 875 ns 0.91
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15915 ns 15425.5 ns 1.03
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 79021 ns 79381 ns 1.00
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3125 ns 3062.5 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3167 ns 3416 ns 0.93
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3500 ns 3208 ns 1.09
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3208 ns 3209 ns 1.00
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 141537 ns 139741 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 337773 ns 338953 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 406750 ns 404500 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 408208 ns 402125 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 407000 ns 408334 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 420041 ns 422458 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 44689 ns 43145 ns 1.04
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 1425430 ns 1417291 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1163333 ns 1128750.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 239212 ns 239562 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3871979.5 ns 3863292 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3986250 ns 3971625 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3984708 ns 3996791 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3777084 ns 3757979.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 249522 ns 242826 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 37376599 ns 38623864 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11951583 ns 11673750 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1430284 ns 1433229 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3917 ns 3959 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3916 ns 3917 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 4000 ns 3916 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34785 ns 33968 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/oneAPI 1242586 ns 1232483 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 232208 ns 167334 ns 1.39
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 38191 ns 38620 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15792 ns 15666 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15750 ns 15750 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 16042 ns 15625 ns 1.03
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15833 ns 15625 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 258998 ns 255128 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/oneAPI 9462058 ns 8717525 ns 1.09
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 879875 ns 843520.5 ns 1.04
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 171032 ns 169816.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404875 ns 402625 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 295750 ns 220209 ns 1.34
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295208 ns 295959 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760417 ns 760791.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113901 ns 113239 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/oneAPI 1031250.5 ns 1047524 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 400833 ns 348895.5 ns 1.15
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 87691 ns 89300.5 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1487250 ns 1474958.5 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1160000 ns 881146 ns 1.32
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1154146 ns 1159083.5 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2466542 ns 2461917 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 257263 ns 241292 ns 1.07
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/oneAPI 12586074 ns 9318727.5 ns 1.35
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1873583 ns 1946459 ns 0.96
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 350633 ns 354883 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 500 ns 542 ns 0.92
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 583 ns 500 ns 1.17
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 26361 ns 25844 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/oneAPI 1271336 ns 1200537.5 ns 1.06
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 440542 ns 496709 ns 0.89
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 205822 ns 209382 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7584 ns 7375 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7292 ns 8104.5 ns 0.90
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7833 ns 7500 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7917 ns 7375 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 214765.5 ns 217033.5 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/oneAPI 25503667 ns 25754399 ns 0.99
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5647042 ns 5254333.5 ns 1.07
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 690301.5 ns 685977 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 836041 ns 825125.5 ns 1.01
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 614834 ns 468584 ns 1.31
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 614083 ns 621500 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1539125 ns 1536542 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 129881 ns 130845.5 ns 0.99
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 180786.5 ns 229862 ns 0.79
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2696417 ns 2661979 ns 1.01
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1999750 ns 1535250.5 ns 1.30
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1981917 ns 2000792 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4944584 ns 4906416 ns 1.01
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 240669 ns 242304 ns 0.99
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 764397 ns 841449 ns 0.91
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 292 ns 1.28
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 334 ns 291 ns 1.15
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31975 ns 32216 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/oneAPI 1271692 ns 1218492 ns 1.04
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 280292 ns 464375 ns 0.60
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 47100 ns 47630 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6417 ns 6125 ns 1.05
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6083.5 ns 6708 ns 0.91
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6500 ns 6500 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6542 ns 6375 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 223021.5 ns 224154.5 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/oneAPI 20811368 ns 21407773 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5080916 ns 4615291 ns 1.10
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 362464 ns 357793.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2380833 ns 2392708 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2381375 ns 2371959 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2382833 ns 2404416 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2427417 ns 2370084 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 202015 ns 200035.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 8009212 ns 7868335 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1480334 ns 1597041.5 ns 0.93
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 376468.5 ns 373933 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4661770.5 ns 4648292 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4647875 ns 4644250 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4654042 ns 4636708 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4641666.5 ns 4642750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 896525 ns 891890 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 47660997 ns 46027858 ns 1.04
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 6349937 ns 6938541.5 ns 0.92
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1389723 ns 1391633 ns 1.00
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7042 ns 7187.5 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7417 ns 7542 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7334 ns 7125 ns 1.03
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7000 ns 6875 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23533 ns 23289 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/oneAPI 1183451.5 ns 1167669 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 267979.5 ns 243458.5 ns 1.10
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 39501 ns 39800 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 49271 ns 46396.5 ns 1.06
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 70500 ns 32917 ns 2.14
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 33667 ns 45875.5 ns 0.73
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 46563 ns 67312 ns 0.69
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 216537 ns 214725 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/oneAPI 10983818 ns 10485830 ns 1.05
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2080125 ns 1121562 ns 1.85
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 266342 ns 269102.5 ns 0.99
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 20458 ns 19604.5 ns 1.04
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 24875 ns 24021 ns 1.04
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 23334 ns 23750 ns 0.98
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5417 ns 5084 ns 1.07
batchedmm(2, Bsize=512)/forward/GPU/CUDA 17721 ns 17227 ns 1.03
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 83171 ns 83741 ns 0.99
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 12229.5 ns 11916 ns 1.03
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 10167 ns 9354.5 ns 1.09
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10709 ns 10417 ns 1.03
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18084 ns 17958 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 227773 ns 225890 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 370583 ns 371753 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 405959 ns 404000 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 296833 ns 222584 ns 1.33
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296250 ns 296875 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762833 ns 762667 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46165 ns 46288 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/oneAPI 1416604 ns 1401617.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 481917 ns 358375 ns 1.34
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 88501 ns 89491 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1477375 ns 1480896 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 1167458.5 ns 888250 ns 1.31
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1163208 ns 1164959 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2469709 ns 2465417 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 290042 ns 288016 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/oneAPI 11583417 ns 12678894 ns 0.91
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2071041 ns 2117375 ns 0.98
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 375873 ns 381744 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 433750 ns 432125 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 436959 ns 430333 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 434875 ns 436917 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 447417 ns 448604.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54489 ns 54122.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 999697 ns 1002212 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1138729 ns 1059021 ns 1.08
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 232887.5 ns 234952 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3882375.5 ns 3895042 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4013333 ns 4004458 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4016667 ns 4030291.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3803708 ns 3789979 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 264090.5 ns 260055 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 31304151.5 ns 30675954 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 10435291.5 ns 10349458.5 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1354298 ns 1223712 ns 1.11
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8750 ns 8750 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 7625 ns 6917 ns 1.10
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7667 ns 7583 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12458 ns 12416 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 23665 ns 23553.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/oneAPI 2164697 ns 2134096 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 229042 ns 214667 ns 1.07
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 208572 ns 211142 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45458 ns 44958 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45750 ns 45083 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45250 ns 45000 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45250 ns 44958 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 339376.5 ns 344550 ns 0.98
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/oneAPI 12744869 ns 14001329.5 ns 0.91
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1713646 ns 1862458 ns 0.92
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 655776.5 ns 659011.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 88250.5 ns 122729 ns 0.72
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 85167 ns 83521 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 125250 ns 87354.5 ns 1.43
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 105979 ns 105375 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189286.5 ns 190055 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/oneAPI 5807215 ns 5969481 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 2011458 ns 1972791.5 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 224843 ns 214447 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2027291.5 ns 2012458.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2020042 ns 1980000 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2018750.5 ns 2023917 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2024770.5 ns 2011645.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 534082.5 ns 529776 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/oneAPI 30062921 ns 29142428 ns 1.03
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9367958 ns 9305500.5 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 946039 ns 1088680 ns 0.87

This comment was automatically generated by workflow using github-action-benchmark.

@avik-pal avik-pal marked this pull request as draft September 9, 2024 23:56
@avik-pal avik-pal force-pushed the ap/onednn branch 2 times, most recently from d883497 to b5511e7 Compare September 15, 2024 23:50
@@ -0,0 +1,19 @@
@wrap_type MemoryPtr dnnl_memory_t dnnl_memory_destroy

function MemoryPtrNoFinalizer(A::AbstractArray, desc = memory_descriptor(A))

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
- function MemoryPtrNoFinalizer(A::AbstractArray, desc = memory_descriptor(A))
+ function MemoryPtrNoFinalizer(A::AbstractArray, desc=memory_descriptor(A))


@wrap_type Engine dnnl_engine_t dnnl_engine_destroy

function EngineNoFinalizer(kind = Lib.dnnl_cpu, index = 0)

[JuliaFormatter] reported by reviewdog 🐶

Suggested change
- function EngineNoFinalizer(kind = Lib.dnnl_cpu, index = 0)
+ function EngineNoFinalizer(kind=Lib.dnnl_cpu, index=0)
