This repository has been archived by the owner on Nov 4, 2024. It is now read-only.

test: run tests with more activations
avik-pal committed Sep 4, 2024
1 parent a9c6bd7 commit 9d522c5
Showing 1 changed file with 4 additions and 2 deletions.
6 changes: 4 additions & 2 deletions test/common_ops/dense_tests.jl
@@ -172,14 +172,16 @@ end

     rng = StableRNG(1234)

+    ALL_ACTS = [identity, tanh, tanh_fast, sigmoid, sigmoid_fast,
+        relu, gelu, x -> x^3, x -> gelu(x)]
+
     @testset "$mode" for (mode, aType, ongpu) in MODES
         mode ∈ ("cpu", "cuda") || continue

         y = zeros(Float32, 2, 2) |> aType
         weight = randn(rng, Float32, 2, 2) |> aType
         x = randn(rng, Float32, 2, 2) |> aType
-        @testset for (act, hasbias) in Iterators.product(
-            [relu, gelu, x -> x^3], (true, false))
+        @testset for (act, hasbias) in Iterators.product(ALL_ACTS, (true, false))
             b = hasbias ? aType(randn(rng, Float32, 2)) : nothing

             dy = randn(rng, Float32, 2, 2) |> aType
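
The change above widens the test grid by iterating over every pairing of an activation function with a bias setting. A minimal sketch of that pattern follows, assuming only Julia's standard library. The `relu` and `sigmoid` definitions are hypothetical stand-ins so the snippet runs without the package under test, the list `acts` is a shortened illustration rather than the commit's `ALL_ACTS`, and the assertions are placeholders, not the checks in `dense_tests.jl`.

```julia
using Test

# Hypothetical stand-ins so the sketch runs without NNlib; the real test file
# uses the activation functions provided by the package under test.
relu(x) = max(x, zero(x))
sigmoid(x) = one(x) / (one(x) + exp(-x))

# Shortened activation list for illustration (not the commit's ALL_ACTS).
acts = [identity, tanh, sigmoid, relu, x -> x^3]

# Iterators.product yields every (act, hasbias) pair:
# 5 activations x 2 bias settings.
combos = collect(Iterators.product(acts, (true, false)))

@testset "activation x bias grid" begin
    @test size(combos) == (5, 2)
    for (act, hasbias) in combos
        x = Float32[-1.0, 0.0, 2.0]
        b = hasbias ? Float32[0.5, 0.5, 0.5] : nothing
        # Apply the optional bias, then the activation, elementwise.
        y = b === nothing ? act.(x) : act.(x .+ b)
        @test length(y) == length(x)
        @test eltype(y) == Float32
    end
end
```

Building the grid with `Iterators.product` means adding an activation to the list automatically covers it with and without a bias, which is the effect of swapping the inline three-element list for `ALL_ACTS`.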

3 comments on commit 9d522c5

@avik-pal
Member Author

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/114537

Tip: Release Notes

Did you know you can add release notes too? Add Markdown-formatted text below the comment, after the text
"Release notes:", and it will be added to the registry PR; if TagBot is installed, it will also be added to the
release that TagBot creates. For example:

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here just re-invoke and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed, or it can be done manually through the GitHub interface, or via:

git tag -a v1.1.0 -m "<description of version>" 9d522c5e98f473da077dcb8d002fe77b5f3696b9
git push origin v1.1.0

@github-actions
Contributor

LuxLib Benchmarks
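
In the table that follows, the Ratio column is the current timing divided by the previous one, so values below 1 mean the current commit ran faster on that benchmark. As a minimal sketch, the snippet below recomputes the ratio for three rows copied from the table. The `ratio` helper and the `threshold` value are hypothetical names chosen for illustration and are not taken from the benchmark action.

```julia
# Ratio as reported in the table: current time divided by previous time.
# Values below 1 mean the current commit was faster on that benchmark.
ratio(current_ns, previous_ns) = current_ns / previous_ns

# (name, current ns, previous ns) copied from three rows of the table.
rows = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s)", 5750.0, 7270.5),
    ("groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal", 1292667.0, 462541.0),
    ("batchedmm(512, Bsize=4)/forward/CPU/4 thread(s)", 3225687.5, 6370584.0),
]

# Hypothetical threshold for illustration only; not the action's setting.
threshold = 1.5

for (name, cur, prev) in rows
    r = round(ratio(cur, prev); digits=2)
    flag = r > threshold ? "  <- slower than threshold" : ""
    println(name, "  ratio = ", r, flag)
end
```

Single runs on shared CI hardware can be noisy, so a ratio far from 1 on one row is a prompt to re-run the benchmark rather than proof of a regression.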

Benchmark suite Current: 9d522c5 Previous: 121a2fe Ratio (Current / Previous)
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5750 ns 7270.5 ns 0.79
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6187.5 ns 5542 ns 1.12
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7979 ns 7958.5 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6958.5 ns 7209 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 119461 ns 117012 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 723417 ns 686834 ns 1.05
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 417664 ns 433304 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9834 ns 10167 ns 0.97
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9792 ns 9875 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9916 ns 10042 ns 0.99
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 10166 ns 10062.5 ns 1.01
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 551816 ns 550305 ns 1.00
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 2364708 ns 2412958 ns 0.98
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 695047 ns 10783943 ns 0.06445202835363652
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1458 ns 2333 ns 0.62
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1687.5 ns 1584 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1917 ns 1875 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1250 ns 1521 ns 0.82
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 21782 ns 21708 ns 1.00
bias_activation(32, act=relu)(32 x 128)/forward/GPU/Metal 189208 ns 184666 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/GPU/AMDGPU 30960 ns 31240 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 3958.5 ns 4270.5 ns 0.93
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4167 ns 4000 ns 1.04
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4000 ns 4500 ns 0.89
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4334 ns 4375 ns 0.99
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 148046.5 ns 146276 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/Metal 1745084 ns 1500000 ns 1.16
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/AMDGPU 148342 ns 151831 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56083 ns 57416 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39917 ns 46458 ns 0.86
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47000 ns 46437.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82750 ns 83625 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37366 ns 37234 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1348187.5 ns 1140250 ns 1.18
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 80291 ns 84481 ns 0.95
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2017708 ns 2040583 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2083959 ns 2059271 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2090792 ns 2085458 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1999604 ns 2013708.5 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 232635 ns 230879 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7104833 ns 4993834 ns 1.42
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1540007 ns 1195591 ns 1.29
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 143708 ns 152542 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 173750.5 ns 145541 ns 1.19
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 165562.5 ns 151416 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 165979 ns 147395.5 ns 1.13
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166570 ns 166882 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1701792 ns 1468104 ns 1.16
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 205502.5 ns 188712 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1100292 ns 1114875 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1114709 ns 1110000 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1122042 ns 1116500 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1119916 ns 1122458 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 713685 ns 702607 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7357125 ns 5931562.5 ns 1.24
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1039502 ns 1045069 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4458 ns 4500 ns 0.99
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4291 ns 4687.5 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6208 ns 6562.5 ns 0.95
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4416 ns 4167 ns 1.06
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 94296 ns 93036 ns 1.01
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 782083.5 ns 421646 ns 1.85
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 69431 ns 63695.5 ns 1.09
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8542 ns 8750 ns 0.98
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8834 ns 8625 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9083 ns 9042 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8583 ns 9145.5 ns 0.94
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 608245 ns 610157.5 ns 1.00
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5666604.5 ns 5466959 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 384864 ns 388908.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17229 ns 18479 ns 0.93
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17250 ns 17667 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22250 ns 20437.5 ns 1.09
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18312.5 ns 18416.5 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 68096 ns 66584 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1292667 ns 462541 ns 2.79
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 74070.5 ns 73981 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 218583 ns 218458 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 244459 ns 211458 ns 1.16
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 213333 ns 213562.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220875 ns 214917 ns 1.03
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 359693 ns 355208 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7278917 ns 5651375 ns 1.29
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 475315 ns 476459 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 708 ns 666 ns 1.06
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 584 ns 645.5 ns 0.90
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 916.5 ns 1083 ns 0.85
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 583 ns 584 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 20807.5 ns 20608 ns 1.01
bias_activation(2, act=relu)(2 x 128)/forward/GPU/Metal 297208 ns 283708 ns 1.05
bias_activation(2, act=relu)(2 x 128)/forward/GPU/AMDGPU 33001 ns 33020 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1375 ns 1417 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1458 ns 1375 ns 1.06
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1583 ns 1583 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1417 ns 1375 ns 1.03
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 126203 ns 125576.5 ns 1.00
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/Metal 1457625 ns 1432958.5 ns 1.02
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/AMDGPU 138172 ns 126626 ns 1.09
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7334 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5375 ns 6000 ns 0.90
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6083 ns 6208 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10291 ns 10542 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 24430 ns 24024 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 351229 ns 343292 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 47101 ns 47670 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219208 ns 260125 ns 0.84
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 261791 ns 253083 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 228625 ns 266916.5 ns 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 223750 ns 224042 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 194664 ns 193024 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 11964250 ns 9238208 ns 1.30
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 617187 ns 617455 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4125 ns 4125 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4167 ns 4125 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4125 ns 4125 ns 1
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4084 ns 4083 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23689 ns 22982 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/Metal 203375 ns 210709 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/AMDGPU 48541 ns 49071 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16958 ns 17000 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16583 ns 17042 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 17250 ns 16875 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16917 ns 16417 ns 1.03
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 196884 ns 194424 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/Metal 1560667 ns 1429834 ns 1.09
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/AMDGPU 174782 ns 177822 ns 0.98
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 509333 ns 510167 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 332250 ns 405334 ns 0.82
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 404250 ns 404334 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 865708 ns 865042 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 114284.5 ns 113588.5 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/Metal 392875 ns 465958.5 ns 0.84
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/AMDGPU 248273 ns 249082 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2318021 ns 2331583 ns 0.99
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1745083 ns 2030250 ns 0.86
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2021000 ns 2010958 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3274791.5 ns 3195125 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 244508 ns 242243 ns 1.01
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/Metal 2001875 ns 1910124.5 ns 1.05
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 763478 ns 763951.5 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5833 ns 6396 ns 0.91
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 7167 ns 6875 ns 1.04
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7271 ns 7584 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6124.5 ns 6666 ns 0.92
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 92855.5 ns 92392.5 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 861271 ns 721250 ns 1.19
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 60401 ns 60491 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11375 ns 11979.5 ns 0.95
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11750 ns 11417 ns 1.03
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 12229 ns 12083 ns 1.01
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11125 ns 11583.5 ns 0.96
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 638820 ns 638302 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 6435375 ns 5394604 ns 1.19
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 416514.5 ns 408774 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 541 ns 500 ns 1.08
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 541 ns 542 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 500 ns 500 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23671 ns 23430 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/Metal 318791 ns 311771 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/AMDGPU 53351 ns 54340 ns 0.98
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2167 ns 2125 ns 1.02
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2084 ns 2208 ns 0.94
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2166 ns 2167 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2125 ns 2125 ns 1
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 222818.5 ns 220610.5 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/Metal 1967167 ns 1899667 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/AMDGPU 180782 ns 191382 ns 0.94
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8708 ns 8750 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 8833 ns 9541.5 ns 0.93
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 9895.5 ns 10167 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 8709 ns 8709 ns 1
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 100619 ns 104875.5 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 898521 ns 792083 ns 1.13
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 74410.5 ns 77640 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17375 ns 17937.5 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 17167 ns 17396 ns 0.99
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 19375 ns 18917 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18250 ns 19146 ns 0.95
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 574738 ns 592103 ns 0.97
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 5654917 ns 4981084 ns 1.14
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 389229 ns 390383 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 625 ns 500 ns 1.25
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 667 ns 625 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 36237 ns 35372 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 463667 ns 398500 ns 1.16
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 48401 ns 46040 ns 1.05
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 8437.5 ns 8479.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9312 ns 9625 ns 0.97
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9875 ns 9833.5 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9708 ns 9520.5 ns 1.02
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 254845 ns 267957 ns 0.95
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 5087792 ns 4295666 ns 1.18
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 375784 ns 376554 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 395833.5 ns 397875 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215750 ns 288416 ns 0.75
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288166 ns 288417 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756000 ns 756875 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112957 ns 112560 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/Metal 299833 ns 298416.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/AMDGPU 76681 ns 77565.5 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1455646 ns 1449062.5 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 862000 ns 1132208 ns 0.76
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1130021 ns 1118604 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2442563 ns 2357521 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 210541 ns 207975 ns 1.01
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/Metal 1636104.5 ns 1580250.5 ns 1.04
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/AMDGPU 325573.5 ns 324872.5 ns 1.00
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 7000 ns 7270.5 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7084 ns 7417 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8125 ns 8354.5 ns 0.97
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7041 ns 7354 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 136948 ns 143390 ns 0.96
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 760125 ns 700375 ns 1.09
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 68820 ns 60051 ns 1.15
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14625 ns 12917 ns 1.13
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15042 ns 14520.5 ns 1.04
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14958.5 ns 16104 ns 0.93
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15625 ns 16021 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 931253.5 ns 945168.5 ns 0.99
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6306249.5 ns 5468354.5 ns 1.15
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 436305 ns 428828.5 ns 1.02
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 25542 ns 25333 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 27334 ns 25250 ns 1.08
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28354 ns 27583 ns 1.03
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 31542 ns 24416.5 ns 1.29
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 200462.5 ns 199074 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1129500 ns 576708 ns 1.96
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 112942 ns 116211 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 149250 ns 105875 ns 1.41
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 131583.5 ns 105209 ns 1.25
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 106479 ns 112708.5 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 153208 ns 147084 ns 1.04
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1062590 ns 1079966 ns 0.98
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 5978292 ns 5470437.5 ns 1.09
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 590197 ns 601665 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 76250 ns 75042 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74291.5 ns 75709 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 77333 ns 78666 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76792 ns 74042 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 209030.5 ns 208057.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 638458 ns 501667 ns 1.27
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 130572 ns 124861 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216500 ns 223417 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 297395.5 ns 274958.5 ns 1.08
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 212146 ns 306250 ns 0.69
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 306208 ns 303916.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1140320 ns 1127846.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7480542 ns 6260041.5 ns 1.19
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 697363 ns 702346 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 15833 ns 16458.5 ns 0.96
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 17291.5 ns 17125 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 17875 ns 18895.5 ns 0.95
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 16687.5 ns 16958 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 150183 ns 146821.5 ns 1.02
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 779979 ns 620979.5 ns 1.26
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 237943 ns 240662 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26458.5 ns 27187.5 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25708 ns 28833 ns 0.89
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27625 ns 27937 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 27750 ns 28708 ns 0.97
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 987976 ns 980967.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 7131041.5 ns 5502250 ns 1.30
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 701547 ns 706416 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 10396 ns 11166.5 ns 0.93
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11563 ns 11333 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12833 ns 13875 ns 0.92
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10875.5 ns 11000 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 125970.5 ns 124939.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 910812.5 ns 818854.5 ns 1.11
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 241512 ns 238602 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21083 ns 22084 ns 0.95
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 21604.5 ns 21667 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 23041.5 ns 23104.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21541.5 ns 21958 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 709336 ns 707018 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 5733333 ns 5251771 ns 1.09
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 676248 ns 693476 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 62667 ns 67812.5 ns 0.92
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 63771 ns 63166.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 65667 ns 68375 ns 0.96
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 67667 ns 65416 ns 1.03
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 107292 ns 106235 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1352583.5 ns 469833 ns 2.88
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 240373 ns 241932 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 444083 ns 458875 ns 0.97
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 448875 ns 438291.5 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 440458 ns 449459 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 445833.5 ns 450083 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 521267 ns 517395 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8808750 ns 6169375 ns 1.43
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 728812.5 ns 734037 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6958.5 ns 7375 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7291 ns 8146 ns 0.90
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 8771 ns 9250 ns 0.95
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 7104 ns 7125 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 147758.5 ns 144978 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 763583 ns 639291 ns 1.19
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 60941 ns 59601 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 15125 ns 16166 ns 0.94
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14417 ns 14500 ns 0.99
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15334 ns 15291.5 ns 1.00
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 15958 ns 14541 ns 1.10
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 958359.5 ns 953156 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 6378396 ns 5309583 ns 1.20
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 409474 ns 412483 ns 0.99
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 6155291 ns 6153625 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 3225687.5 ns 6370584 ns 0.51
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 6379541 ns 6373521 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 11906125 ns 11918417 ns 1.00
batchedmm(512, Bsize=4)/forward/GPU/CUDA 351844 ns 347126 ns 1.01
batchedmm(512, Bsize=4)/forward/GPU/AMDGPU 301554 ns 299793 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 19041833.5 ns 19118354 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 11118520.5 ns 19949833 ns 0.56
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 19989395.5 ns 19921916 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 36469125 ns 36514708.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1015731 ns 1011727 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/AMDGPU 1151512 ns 1159240 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 959 ns 958 ns 1.00
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 958 ns 1000 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 959 ns 1000 ns 0.96
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 958 ns 958 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23791 ns 23026 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/Metal 317417 ns 309083 ns 1.03
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/AMDGPU 215032 ns 216516.5 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3667 ns 3625 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3667 ns 3750 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3708 ns 3667 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 283833 ns 281456.5 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/Metal 2116208 ns 2006334 ns 1.05
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 634877 ns 641165.5 ns 0.99
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7167 ns 7875 ns 0.91
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 7833.5 ns 8875 ns 0.88
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 9291 ns 10042 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 7500 ns 8166 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 122503 ns 120732.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 866646 ns 777646 ns 1.11
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 66931 ns 72621 ns 0.92
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11709 ns 12500 ns 0.94
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11834 ns 12167 ns 0.97
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 13291 ns 13041.5 ns 1.02
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11875 ns 11791.5 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 651319 ns 645322 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5038083 ns 4225228.5 ns 1.19
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 365314 ns 372211 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22923 ns 22405 ns 1.02
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/Metal 208979.5 ns 207416 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/AMDGPU 50651 ns 51781 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 3000 ns 3084 ns 0.97
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2959 ns 3125 ns 0.95
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 3250 ns 3125 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2959 ns 2834 ns 1.04
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 206218 ns 204197 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/Metal 1699541.5 ns 1523666 ns 1.12
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/AMDGPU 158851.5 ns 161603 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10375 ns 11625 ns 0.89
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11854.5 ns 11104.5 ns 1.07
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12417 ns 13021 ns 0.95
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 12333 ns 11083.5 ns 1.11
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 123182.5 ns 121763 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 877125 ns 786833 ns 1.11
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 241463 ns 245014 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 22062 ns 21708 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21625 ns 21625 ns 1
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21708 ns 23625 ns 0.92
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20084 ns 22250 ns 0.90
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 605852.5 ns 599417 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5025000 ns 4065709 ns 1.24
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 667502 ns 671650 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4417 ns 4541 ns 0.97
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4584 ns 4583 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4417 ns 4666 ns 0.95
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4375 ns 4417 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 24334 ns 24192 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/Metal 208417 ns 211333 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/AMDGPU 54130 ns 54791 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16375 ns 16666 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16375 ns 16666 ns 0.98
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16667 ns 16916 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16875 ns 16292 ns 1.04
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 333246 ns 332323 ns 1.00
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/Metal 1768771 ns 1587333 ns 1.11
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/AMDGPU 214042.5 ns 215493.5 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 2084 ns 1958 ns 1.06
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 2000 ns 2167 ns 0.92
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 2166 ns 2166 ns 1
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 2041 ns 2042 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 36196 ns 36245 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 473000 ns 439708 ns 1.08
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 205752 ns 210833 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 17667 ns 16250 ns 1.09
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 18937.5 ns 17000 ns 1.11
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 17625 ns 17375 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 16896 ns 16416.5 ns 1.03
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 297235 ns 296059 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 5572167 ns 4512166.5 ns 1.23
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 694748 ns 695090 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 55979.5 ns 59708.5 ns 0.94
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 60709 ns 65708 ns 0.92
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 65812.5 ns 65729.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 51583 ns 51209 ns 1.01
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66558 ns 66461 ns 1.00
batchedmm(16, Bsize=512)/forward/GPU/AMDGPU 120591.5 ns 98652 ns 1.22
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 185895.5 ns 196292 ns 0.95
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 146354 ns 152646 ns 0.96
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 136208 ns 132791.5 ns 1.03
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 297104 ns 265000 ns 1.12
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 218976.5 ns 216858 ns 1.01
batchedmm(16, Bsize=512)/zygote/GPU/AMDGPU 584106 ns 588779 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 112833.5 ns 85833 ns 1.31
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 86417 ns 124125 ns 0.70
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 89416 ns 85250 ns 1.05
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81000 ns 83917 ns 0.97
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 191966 ns 192676 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1945000 ns 1754791.5 ns 1.11
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 209467.5 ns 172083 ns 1.22
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1912250 ns 1889458 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1923916 ns 1906375 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1917917 ns 1639458.5 ns 1.17
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1922250 ns 1896208.5 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 536309 ns 532536 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11093750 ns 9060167 ns 1.22
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 935284.5 ns 1084751 ns 0.86
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 291 ns 291 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 291 ns 292 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 292 ns 333 ns 0.88
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21820 ns 21623 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/Metal 327833.5 ns 318709 ns 1.03
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/AMDGPU 46181 ns 45401 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1791 ns 1834 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1833 ns 1834 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1792 ns 1792 ns 1
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 254627 ns 252797 ns 1.01
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/Metal 1640833 ns 1460542 ns 1.12
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/AMDGPU 187212 ns 183733 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 8209 ns 8625 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 9083 ns 8708 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 9896 ns 11438 ns 0.87
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8417 ns 8125 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 120586.5 ns 118574 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 873250 ns 776791.5 ns 1.12
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 236722 ns 241823.5 ns 0.98
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10292 ns 9833 ns 1.05
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8958 ns 9458 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9917 ns 11375 ns 0.87
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 8666 ns 11041 ns 0.78
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 532717.5 ns 526956.5 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 4452292 ns 3794729 ns 1.17
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 646767 ns 648949 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56750 ns 58125 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39708 ns 46375 ns 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47166 ns 45750 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 83125 ns 84250 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 40431 ns 39640 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1093666 ns 1077750 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 77971 ns 79521 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1903833 ns 1936334 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1979312 ns 1979666 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1983896 ns 1951604.5 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1849208 ns 1881916 ns 0.98
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 224788 ns 222570.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 14363791.5 ns 11388417 ns 1.26
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1042991 ns 1042195 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 415042 ns 434833 ns 0.95
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 418584 ns 418000 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 420291 ns 422395.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 420459 ns 416583 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 212100.5 ns 211826.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1065709 ns 505604 ns 2.11
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 286133 ns 289239 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 742875 ns 682520.5 ns 1.09
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 758958 ns 767333 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 691062.5 ns 716271 ns 0.96
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 742624.5 ns 750812.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1063422.5 ns 1054928.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7312146 ns 6283521 ns 1.16
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 924920 ns 921133 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 3442959 ns 3362083.5 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 3441833 ns 3444042 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 3417500 ns 3375083 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 3453000 ns 3433667 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 174858 ns 175818.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1420583 ns 1393208 ns 1.02
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 452865 ns 432507 ns 1.05
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 6180375 ns 6161771 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 6232875 ns 6172291.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 6229979 ns 5672584 ns 1.10
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 6252666 ns 6241000 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1007257 ns 997215 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9641124.5 ns 7277000 ns 1.32
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1560736 ns 1740609 ns 0.90
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 471375 ns 474833 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 253334 ns 341334 ns 0.74
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 341708 ns 339937.5 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 902583 ns 901833 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46913 ns 46636 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/Metal 338020.5 ns 351584 ns 0.96
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/AMDGPU 250492 ns 252343 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 2320416 ns 2323458 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1761167 ns 2036541 ns 0.86
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 2033167 ns 2030999.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 3279375 ns 3199000 ns 1.03
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 260626 ns 257623 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/Metal 2319917 ns 2193666 ns 1.06
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 785678 ns 793161 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56166 ns 57229.5 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39417 ns 45917 ns 0.86
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 46584 ns 44687.5 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82917 ns 84125 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28863 ns 28263.5 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1130625 ns 1073000 ns 1.05
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 79170.5 ns 82736.5 ns 0.96
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2020083 ns 1994187.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2062917 ns 2084521.5 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2078437.5 ns 2066104.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2004145.5 ns 1987021.5 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 238429 ns 236475 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 15264270.5 ns 11587916.5 ns 1.32
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1057241 ns 1056434 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 56292 ns 57500 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 39833 ns 46375 ns 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47416 ns 45959 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82875 ns 83666 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 50090 ns 49710 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1054834 ns 1030562 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 74900 ns 73311 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1924167 ns 1928437.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1968250 ns 1982583 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1980792 ns 1921333.5 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1891208 ns 1896500 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 243592 ns 243238 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 12800042 ns 9867125 ns 1.30
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1070466 ns 931613 ns 1.15
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 292 ns 291 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 292 ns 292 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 35236 ns 34854 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 461750 ns 268562.5 ns 1.72
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 50011 ns 48101 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6709 ns 6333 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6520.5 ns 7042 ns 0.93
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7209 ns 1.06
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6541 ns 6709 ns 0.97
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 216284 ns 214556.5 ns 1.01
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5088292 ns 4310250.5 ns 1.18
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 373774 ns 378480.5 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 250 ns 1.17
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 250 ns 292 ns 0.86
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 291 ns 292 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 250 ns 250 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 32446 ns 32581 ns 1.00
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/Metal 248500 ns 231417 ns 1.07
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/AMDGPU 40510 ns 39650 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2917 ns 2750 ns 1.06
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 3250 ns 3167 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 3083 ns 3000 ns 1.03
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 3458 ns 2875 ns 1.20
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 191592.5 ns 190168.5 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/Metal 1031291.5 ns 896854.5 ns 1.15
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/AMDGPU 153502 ns 154712 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 423917 ns 428625 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 473500 ns 455000 ns 1.04
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 427833 ns 423875 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 424125 ns 425374.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 138519 ns 137437 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2048875 ns 2017791 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 380684 ns 354515 ns 1.07
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3799062.5 ns 3815083.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3822458 ns 3802292 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3802667 ns 3442625 ns 1.10
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3823563 ns 3811667 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 717031.5 ns 711414 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 12950229 ns 10864270.5 ns 1.19
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1325953 ns 1331908 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 49840813 ns 49850833.5 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 25988833 ns 35504146 ns 0.73
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 35525750 ns 35546333 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 96904729.5 ns 97031625 ns 1.00
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1593190 ns 1606173.5 ns 0.99
batchedmm(512, Bsize=32)/forward/GPU/AMDGPU 1014101 ns 1005743 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 153775938 ns 154464875 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 89008896 ns 112292145.5 ns 0.79
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 112384750 ns 112275083 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 296752479 ns 295087458 ns 1.01
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6476290 ns 6454148 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/AMDGPU 5534451 ns 5525883 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 15062.5 ns 18271 ns 0.82
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 15625 ns 18333 ns 0.85
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 16875 ns 16416 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 15333 ns 16042 ns 0.96
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 21010 ns 21028 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/Metal 204959 ns 199083 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/AMDGPU 27230 ns 26291 ns 1.04
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 11083 ns 10812.5 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 7583 ns 8812.5 ns 0.86
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 9209 ns 9250 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 17188 ns 17271 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 264057 ns 263179 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/Metal 1736125.5 ns 1476625 ns 1.18
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/AMDGPU 152581.5 ns 155602 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 7417 ns 8479.5 ns 0.87
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8833 ns 9958 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10041.5 ns 11041 ns 0.91
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 8292 ns 8292 ns 1
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 117259.5 ns 126280 ns 0.93
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 887417 ns 770417 ns 1.15
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 236902.5 ns 239503 ns 0.99
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9708.5 ns 10333 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9292 ns 10417 ns 0.89
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10791.5 ns 9833 ns 1.10
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9584 ns 9813 ns 0.98
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 631614 ns 625000 ns 1.01
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 5189583 ns 4214500 ns 1.23
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 668942 ns 660618.5 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8812.5 ns 10500.5 ns 0.84
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9583 ns 9792 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11042 ns 12438 ns 0.89
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 9250 ns 9396 ns 0.98
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 122641 ns 120552 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 876791.5 ns 821667 ns 1.07
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 74481 ns 69276 ns 1.08
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13708 ns 15166.5 ns 0.90
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 14979 ns 15396 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 14416 ns 14124.5 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13625.5 ns 14375 ns 0.95
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 601521.5 ns 596192 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 4885250 ns 3931812.5 ns 1.24
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 353174 ns 355215 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 458 ns 500 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 500 ns 500 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 35180 ns 35017 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/Metal 441166 ns 259917 ns 1.70
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/AMDGPU 206562 ns 208292 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7042 ns 7666 ns 0.92
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 10458 ns 8416 ns 1.24
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8042 ns 7917 ns 1.02
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7125 ns 7792 ns 0.91
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 233713.5 ns 232859 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/Metal 5300958.5 ns 4590708 ns 1.15
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 658707 ns 670689 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 12666 ns 15500 ns 0.82
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 13833 ns 15834 ns 0.87
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 15667 ns 13709 ns 1.14
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 10270.5 ns 10375 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 22010 ns 22187.5 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/Metal 186625 ns 184292 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/AMDGPU 191282 ns 194442.5 ns 0.98
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 32042 ns 32042 ns 1
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 32020.5 ns 32250 ns 0.99
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 32458 ns 32250 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 31854.5 ns 31917 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 278049 ns 277348 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/Metal 1885500 ns 1597167 ns 1.18
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 606396.5 ns 608217 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 438291 ns 443583 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 484125 ns 485750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 446062.5 ns 444958 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 477208 ns 483792 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 194398.5 ns 194055 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1968250 ns 1953500 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 375174 ns 355719.5 ns 1.05
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3825292 ns 3835771 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3837396 ns 3818792 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3828687.5 ns 3453229 ns 1.11
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3836875 ns 3847625 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 549907 ns 547078 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 12010500 ns 9055458 ns 1.33
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1226382.5 ns 1390493 ns 0.88
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 836787979.5 ns 783907000 ns 1.07
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 426008000 ns 542588375 ns 0.79
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 542930250 ns 542038833 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 1533058916 ns 1515263812.5 ns 1.01
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22531506 ns 22757656.5 ns 0.99
batchedmm(512, Bsize=512)/forward/GPU/AMDGPU 14059203 ns 14076767 ns 1.00
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 3617643875 ns 2559120625 ns 1.41
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 1519606625 ns 1811234166 ns 0.84
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1791220042 ns 1823497333 ns 0.98
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 4771769708 ns 4761215708 ns 1.00
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 370760684 ns 368318878 ns 1.01
batchedmm(512, Bsize=512)/zygote/GPU/AMDGPU 89879564 ns 87507304 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75354.5 ns 77500 ns 0.97
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 77417 ns 85708 ns 0.90
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 80167 ns 80125 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 76625 ns 77291.5 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 210924.5 ns 210158.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1045583.5 ns 508103.5 ns 2.06
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 110131.5 ns 110471 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 231500 ns 235166 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 195167 ns 290291.5 ns 0.67
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 244583 ns 194125 ns 1.26
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 234875 ns 196875 ns 1.19
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1060035 ns 1050416 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6603312.5 ns 5885021 ns 1.12
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 643791.5 ns 645198 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 199256958.5 ns 199484937 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 103813958.5 ns 139217416 ns 0.75
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 139098125 ns 139383750 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 388864875 ns 388675625 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5820038 ns 5836807.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/AMDGPU 3424485 ns 3426102 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 615907583.5 ns 618127896 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 354224562 ns 439059167 ns 0.81
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 440166291.5 ns 438957292 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 1188432875 ns 1179308292 ns 1.01
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 26804213.5 ns 26606894 ns 1.01
batchedmm(512, Bsize=128)/zygote/GPU/AMDGPU 21815881 ns 21809492 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7333 ns 7375 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5416 ns 6292 ns 0.86
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6291 ns 3542 ns 1.78
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10458 ns 10083 ns 1.04
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 28403 ns 27930 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 361437.5 ns 375166.5 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 48715.5 ns 48181 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 213333.5 ns 214541 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221708 ns 230604 ns 0.96
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220916 ns 220625 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205750 ns 206687.5 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 226122 ns 224569.5 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 11493583.5 ns 9326958 ns 1.23
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 541195.5 ns 537197 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7291 ns 8021 ns 0.91
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8417 ns 8792 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 10770.5 ns 11229.5 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 8583 ns 7375 ns 1.16
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 119656 ns 116136 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 855542 ns 797791.5 ns 1.07
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 72200 ns 72561 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7667 ns 9145.5 ns 0.84
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9395.5 ns 10167 ns 0.92
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8104.5 ns 1.03
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7542 ns 8833.5 ns 0.85
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 526844.5 ns 524349 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 4384667 ns 3783833 ns 1.16
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 322463 ns 323234 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 459 ns 375 ns 1.22
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 458 ns 667 ns 0.69
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 500 ns 541 ns 0.92
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 416 ns 625 ns 0.67
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 27306 ns 26340 ns 1.04
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 483625 ns 443354.5 ns 1.09
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 48601 ns 49290 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9917 ns 9625 ns 1.03
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10167 ns 13291 ns 0.76
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9542 ns 9750 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 8667 ns 9542 ns 0.91
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 256488 ns 255271 ns 1.00
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5936416 ns 4751833 ns 1.25
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 396784 ns 397425 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 108542 ns 106958.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 85333 ns 99292 ns 0.86
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 100208 ns 99812.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 146625 ns 146979 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 25074 ns 25303 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/Metal 244333 ns 240875 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/AMDGPU 190632 ns 191062 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 479625 ns 498250 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 518583.5 ns 524250 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 481000 ns 479229.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 478125 ns 489125 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 235150 ns 234991.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/Metal 2164333 ns 2102146 ns 1.03
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/AMDGPU 622586 ns 624127.5 ns 1.00
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 5500 ns 5333 ns 1.03
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 5750 ns 5625 ns 1.02
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 6666.5 ns 7167 ns 0.93
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 4125 ns 6396 ns 0.64
batchedmm(16, Bsize=32)/forward/GPU/CUDA 16723 ns 16311.5 ns 1.03
batchedmm(16, Bsize=32)/forward/GPU/AMDGPU 78130 ns 79691 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 11812 ns 12542 ns 0.94
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 11916 ns 11000 ns 1.08
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 11000 ns 11250 ns 0.98
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 16500 ns 17666.5 ns 0.93
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 216336 ns 214390 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/AMDGPU 370958.5 ns 390785 ns 0.95
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 35917 ns 38958 ns 0.92
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 50500 ns 52791.5 ns 0.96
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 52709 ns 52333 ns 1.01
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 13541 ns 13667 ns 0.99
batchedmm(16, Bsize=128)/forward/GPU/CUDA 20359 ns 20008 ns 1.02
batchedmm(16, Bsize=128)/forward/GPU/AMDGPU 79931 ns 82631 ns 0.97
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 36625 ns 37041 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 29625 ns 35917 ns 0.82
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 31458 ns 31417 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 57209 ns 57770.5 ns 0.99
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 195413 ns 193115 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/AMDGPU 409364 ns 424520 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 1959 ns 1792 ns 1.09
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 1792 ns 1958 ns 0.92
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 2125 ns 2125 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 1792 ns 1812.5 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 21014.5 ns 21083 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/Metal 324459 ns 292542 ns 1.11
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/AMDGPU 33550 ns 30610 ns 1.10
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 2209 ns 2166.5 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 2125 ns 2333 ns 0.91
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 2417 ns 2375 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 2291 ns 2375 ns 0.96
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 207244.5 ns 204141 ns 1.02
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/Metal 1670895.5 ns 1447208.5 ns 1.15
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/AMDGPU 137121 ns 145632 ns 0.94
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4583 ns 6125 ns 0.75
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4750 ns 5396 ns 0.88
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6333 ns 6062.5 ns 1.04
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4917 ns 5417 ns 0.91
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 147827 ns 143737 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 771709 ns 684625 ns 1.13
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 71711 ns 64221 ns 1.12
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8270.5 ns 9334 ns 0.89
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8666 ns 9292 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8792 ns 8792 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8125 ns 9229.5 ns 0.88
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 888135.5 ns 870456.5 ns 1.02
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 6483625 ns 5245354.5 ns 1.24
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 391164 ns 395275 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56875 ns 56834 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56875 ns 57542 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57750 ns 57708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 58292 ns 58167 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37890 ns 37529 ns 1.01
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 379312.5 ns 331041 ns 1.15
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 205582 ns 210192 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 448479 ns 447958.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 465229 ns 472042 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 464687.5 ns 464624.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 433500 ns 443958.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 270782 ns 266188 ns 1.02
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 10306000 ns 8232229.5 ns 1.25
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 801818 ns 809509 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 3291000 ns 3321791 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1770084 ns 2340645.5 ns 0.76
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 2335292 ns 2338500 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 6297083.5 ns 6319896 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 206316 ns 207561 ns 0.99
batchedmm(128, Bsize=128)/forward/GPU/AMDGPU 203322 ns 202168 ns 1.01
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 11333854.5 ns 11449521 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 6594562.5 ns 8325854 ns 0.79
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 8324937.5 ns 8320229 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 21089229 ns 21173874.5 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 735605 ns 743009 ns 0.99
batchedmm(128, Bsize=128)/zygote/GPU/AMDGPU 1072271 ns 1060547.5 ns 1.01
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5625 ns 6500 ns 0.87
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5667 ns 5000 ns 1.13
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7500 ns 7084 ns 1.06
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 6750 ns 6333 ns 1.07
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 139700 ns 137112.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 867541.5 ns 739604.5 ns 1.17
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 56260 ns 56461 ns 1.00
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7500 ns 7229.5 ns 1.04
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 14625 ns 7666.5 ns 1.91
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7375 ns 7666 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7000 ns 11333 ns 0.62
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 766028 ns 753531.5 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 5998084 ns 4958542 ns 1.21
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 380414 ns 379125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 117604 ns 95833 ns 1.23
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 125375 ns 122770.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 102396 ns 99771 ns 1.03
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 98145.5 ns 97166 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 152876 ns 151139 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2030624.5 ns 2002250 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 185692 ns 187022 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2021875 ns 2023750 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2037125 ns 2020250 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2013542 ns 1746979 ns 1.15
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2033354 ns 2042417 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 716061.5 ns 705196 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 13591542 ns 10844979.5 ns 1.25
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1265732.5 ns 1123943 ns 1.13
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 29833 ns 33395.5 ns 0.89
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 34167 ns 37708 ns 0.91
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 35542 ns 34375 ns 1.03
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 625 ns 708 ns 0.88
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15704 ns 15220 ns 1.03
batchedmm(2, Bsize=4)/forward/GPU/AMDGPU 71560.5 ns 81571 ns 0.88
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2583 ns 2583 ns 1.00
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 4583 ns 2917 ns 1.57
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3000 ns 3041 ns 0.99
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 2209 ns 2750 ns 0.80
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 143464 ns 137129.5 ns 1.05
batchedmm(2, Bsize=4)/zygote/GPU/AMDGPU 351354 ns 351384.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7166 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5334 ns 6125 ns 0.87
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6166 ns 6083 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 10083 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 37164 ns 36037 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 334396 ns 326166 ns 1.03
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 49180 ns 48790 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212895.5 ns 212208.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 222000 ns 232791.5 ns 0.95
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221041.5 ns 220437.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 205979 ns 207312.5 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 249374 ns 243763 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9656333 ns 8135312.5 ns 1.19
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 581561 ns 524616 ns 1.11
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3959 ns 3917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 4000 ns 3958 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3958 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3917 ns 4000 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 21939 ns 21381 ns 1.03
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/Metal 227375 ns 224000 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/AMDGPU 45671 ns 47871 ns 0.95
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14916 ns 14917 ns 1.00
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14708 ns 15041 ns 0.98
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15000 ns 14917 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14875 ns 14709 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 314728.5 ns 307867 ns 1.02
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/Metal 1635750 ns 958666 ns 1.71
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/AMDGPU 192832 ns 197982 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 109166 ns 100500 ns 1.09
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 132541 ns 108625 ns 1.22
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 109875 ns 104083 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 102125 ns 100833 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 138355.5 ns 138303.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 2016354 ns 1992791 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 188667 ns 172222 ns 1.10
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1918396 ns 1923250 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1939229 ns 1915250 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1913584 ns 1651666 ns 1.16
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1937625 ns 1913500 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 700104 ns 688019 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 13264020.5 ns 10627167 ns 1.25
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1233652.5 ns 1227559 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17667 ns 18167 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 18458 ns 18125 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22270.5 ns 21000 ns 1.06
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18250 ns 18229.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 110588.5 ns 108058 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1374104.5 ns 463917 ns 2.96
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 81891 ns 80040 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 216417 ns 215479 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 249771 ns 253396 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 216541.5 ns 217958 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 217312.5 ns 215854 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 527304 ns 518117 ns 1.02
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 8411584 ns 6333958 ns 1.33
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 488925 ns 492755.5 ns 0.99
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 24063 ns 24541.5 ns 0.98
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 28500 ns 32959 ns 0.86
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 29459 ns 27875 ns 1.06
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 1334 ns 1292 ns 1.03
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16479 ns 16059.5 ns 1.03
batchedmm(16, Bsize=4)/forward/GPU/AMDGPU 82590 ns 83436 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 4708.5 ns 4666.5 ns 1.01
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 4708 ns 4729.5 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 5208 ns 5125 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 4875 ns 4458 ns 1.09
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 210198 ns 205889.5 ns 1.02
batchedmm(16, Bsize=4)/zygote/GPU/AMDGPU 398304 ns 408069.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 304792 ns 305750 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 305542 ns 304333 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 311083 ns 306917 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 306375 ns 308083 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 232191.5 ns 228499.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1156396 ns 1000084 ns 1.16
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 279563 ns 276983 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 530625 ns 530500 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 542459 ns 547208 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 542000.5 ns 532583 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 535875 ns 562084 ns 0.95
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1096065 ns 1071032 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 6678000 ns 5815562.5 ns 1.15
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 873778.5 ns 872009 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 20083 ns 19208 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 20187.5 ns 20812.5 ns 0.97
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 23187 ns 22125 ns 1.05
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20959 ns 20542 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 115290.5 ns 113101 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1265792 ns 501709 ns 2.52
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 80731 ns 79781 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212042 ns 212500 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 224625 ns 241875 ns 0.93
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214333 ns 215208 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 213708.5 ns 212541 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 758025 ns 741622.5 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 10158583 ns 7459812.5 ns 1.36
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 542975 ns 548036 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6458 ns 6625 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6917 ns 6792 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 8542 ns 8292 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 6417 ns 6917 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 143078 ns 139468 ns 1.03
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 869500 ns 738042 ns 1.18
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 69771 ns 69491 ns 1.00
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10709 ns 10208 ns 1.05
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9771 ns 10000 ns 0.98
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10729.5 ns 11083 ns 0.97
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10291 ns 10167 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 834187 ns 826026 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 6274750 ns 5037583 ns 1.25
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 396084 ns 389164 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5333 ns 6709 ns 0.79
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4958 ns 4583.5 ns 1.08
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7125 ns 7500 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5958 ns 7000 ns 0.85
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 146313.5 ns 143052 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 875000 ns 715979 ns 1.22
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 67660 ns 59900.5 ns 1.13
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7667 ns 7583 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7895.5 ns 0.95
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7625 ns 7709 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7459 ns 7416 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 797995 ns 782023 ns 1.02
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 6580999.5 ns 5232416.5 ns 1.26
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 400804 ns 399255 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 14350958 ns 14504875 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 7722625 ns 10144541 ns 0.76
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 10132750 ns 10123250 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 27757125 ns 27812708 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 532327 ns 530146 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/AMDGPU 403538.5 ns 398444 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 45806208 ns 46256833 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 26766750.5 ns 33497916.5 ns 0.80
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 33520000 ns 33428625 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 85306916 ns 85699625 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2661047 ns 2648857 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/AMDGPU 3296413 ns 3285657 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 66000 ns 66875 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 67333 ns 65645.5 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 69854 ns 68791.5 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 67375 ns 66292 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 120529 ns 118200 ns 1.02
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1329083.5 ns 509020.5 ns 2.61
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 228112 ns 239683 ns 0.95
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 444083 ns 439916.5 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 444083 ns 488291.5 ns 0.91
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 441292 ns 442104.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 442521.5 ns 441750 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 736542.5 ns 727003 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 10732062.5 ns 7746000 ns 1.39
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 809398 ns 807579 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 542 ns 500 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 667 ns 583 ns 1.14
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 32886 ns 32311 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/Metal 466834 ns 409084 ns 1.14
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/AMDGPU 49230 ns 49420 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9375 ns 8709 ns 1.08
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9250 ns 8375 ns 1.10
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9500 ns 10208 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8125 ns 8667 ns 0.94
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 290314.5 ns 284738.5 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/Metal 5519708 ns 4462646 ns 1.24
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 387394 ns 391304 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 9875 ns 9875 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 9833 ns 9834 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 9833 ns 9834 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 9791 ns 9833 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23928 ns 22837 ns 1.05
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/Metal 204979.5 ns 212375 ns 0.97
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/AMDGPU 214872 ns 217863 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 46000 ns 46042 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 45667 ns 46292 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 46666 ns 46292 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 46250 ns 45917 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 293307 ns 289926.5 ns 1.01
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/Metal 1595562.5 ns 929833.5 ns 1.72
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 621217 ns 620477 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 56333 ns 56250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 56792 ns 57083 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 57083 ns 57166 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 57834 ns 58000 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 29516 ns 28779.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 704333.5 ns 346145.5 ns 2.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 205082 ns 207252 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 455021 ns 448416.5 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 465375 ns 502000 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 473000 ns 465458 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 434208.5 ns 434875 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 252003 ns 246357 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 12166125 ns 9664479.5 ns 1.26
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 893508.5 ns 860564 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 624416 ns 597687.5 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 662083 ns 645396 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 619083 ns 549917 ns 1.13
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 633895.5 ns 641125 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 212333 ns 203653 ns 1.04
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1471333 ns 1436750 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 236152 ns 233752 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2220834 ns 2234583 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2250000 ns 2231583 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2213792 ns 1888416 ns 1.17
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2240750 ns 2260250 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 990521.5 ns 966670 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 9717333 ns 7505959 ns 1.29
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1376089 ns 1376955 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19000 ns 19292 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 19979 ns 24625 ns 0.81
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 22333.5 ns 22125 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22250 ns 19208 ns 1.16
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 114382.5 ns 112203 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/Metal 1244584 ns 1449458 ns 0.86
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 81450 ns 85180 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 222479 ns 218917 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 224959 ns 232375 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 221208 ns 221292 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 218917 ns 225562.5 ns 0.97
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 738666.5 ns 730535 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/Metal 10456396 ns 7839645.5 ns 1.33
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 562856 ns 559956 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 667 ns 583 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 542 ns 583 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23746 ns 22978 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/Metal 488062.5 ns 450833 ns 1.08
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/AMDGPU 49670 ns 50710 ns 0.98
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9541.5 ns 10312 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9792 ns 10084 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9833 ns 10792 ns 0.91
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9291.5 ns 10042 ns 0.93
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 272510 ns 267102 ns 1.02
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/Metal 6224583.5 ns 4916333 ns 1.27
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 407824 ns 427214 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 7708 ns 10500 ns 0.73
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 8687.5 ns 8708 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 11166.5 ns 10958 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 9666 ns 8063 ns 1.20
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 121220 ns 118534 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/Metal 860208 ns 768958 ns 1.12
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/AMDGPU 72661 ns 68831 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7708 ns 7333 ns 1.05
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7250 ns 8417 ns 0.86
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8125 ns 7875 ns 1.03
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7334 ns 7625 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 516336 ns 506197 ns 1.02
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/Metal 4339813 ns 3602937.5 ns 1.20
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/AMDGPU 328244 ns 339243 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1458 ns 1375 ns 1.06
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1375 ns 1687.5 ns 0.81
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2041.5 ns 1959 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1583 ns 1583 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 21646 ns 21680 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/Metal 305020.5 ns 306333 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/AMDGPU 191511.5 ns 190212 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 3334 ns 3250 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 3375 ns 3395.5 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 3459 ns 3542 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 3458 ns 3458 ns 1.00
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 224911 ns 220280 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/Metal 1768041 ns 1546708 ns 1.14
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 595216 ns 597436 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 145708.5 ns 147145.5 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 106562.5 ns 130833 ns 0.81
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 129292 ns 128937.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 225125 ns 226062.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 24473.5 ns 24156 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/Metal 252375 ns 250812.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/AMDGPU 38390 ns 37640 ns 1.02
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 143771 ns 156458.5 ns 0.92
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 88167 ns 136208 ns 0.65
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 110771 ns 110833 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 250875 ns 264250 ns 0.95
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 220914.5 ns 217951.5 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/Metal 2045709 ns 1080375 ns 1.89
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/AMDGPU 237933 ns 226967 ns 1.05
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7250 ns 7292 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5333 ns 6000 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5916 ns 3708 ns 1.60
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10208 ns 10500 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33448 ns 32643.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 335833 ns 549084 ns 0.61
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 50340 ns 51091 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 224250 ns 219624.5 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228375 ns 236042 ns 0.97
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 236083.5 ns 228500 ns 1.03
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 212562.5 ns 217333.5 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 267943.5 ns 261697 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 9170083 ns 8432208 ns 1.09
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 609306 ns 537506 ns 1.13
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 14458 ns 16459 ns 0.88
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 14812.5 ns 14937.5 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 16791.5 ns 16667 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 15334 ns 15834 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 141134 ns 139783 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/Metal 873104 ns 745750 ns 1.17
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/AMDGPU 238182 ns 242713 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24083.5 ns 24083.5 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 23875 ns 23833 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24167 ns 24188 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 23625 ns 23625 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 878285 ns 867529 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/Metal 6385188 ns 5264396 ns 1.21
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/AMDGPU 692226 ns 706748 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 8916 ns 9208 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 9687.5 ns 10083 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12125 ns 11208 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 10416 ns 9937 ns 1.05
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 124959.5 ns 122858 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/Metal 918334 ns 796708 ns 1.15
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/AMDGPU 75531 ns 75011 ns 1.01
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14000 ns 14458 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 13729 ns 14375 ns 0.96
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14708 ns 14833.5 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 13834 ns 14604 ns 0.95
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 676549 ns 664733 ns 1.02
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/Metal 5573041 ns 4970395.5 ns 1.12
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/AMDGPU 373189 ns 380234 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 8062 ns 9166 ns 0.88
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 9750 ns 8770.5 ns 1.11
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11916.5 ns 11812.5 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10187.5 ns 8875 ns 1.15
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 124116 ns 120199.5 ns 1.03
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/Metal 883646 ns 851833.5 ns 1.04
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/AMDGPU 69690 ns 71031 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 12625 ns 12958 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 12750 ns 13250 ns 0.96
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13542 ns 13791.5 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12312 ns 13062.5 ns 0.94
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 561116 ns 549928 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/Metal 4630937 ns 3993041 ns 1.16
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/AMDGPU 345083.5 ns 349744 ns 0.99
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 27208.5 ns 29917 ns 0.91
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 32333.5 ns 35500 ns 0.91
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 31958 ns 30666.5 ns 1.04
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 2041 ns 2042 ns 1.00
batchedmm(2, Bsize=128)/forward/GPU/CUDA 16556 ns 15829 ns 1.05
batchedmm(2, Bsize=128)/forward/GPU/AMDGPU 82091 ns 81070 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 5229 ns 5500 ns 0.95
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 4687.5 ns 5042 ns 0.93
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 5334 ns 5437.5 ns 0.98
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 6458 ns 6583.5 ns 0.98
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 142634 ns 138553.5 ns 1.03
batchedmm(2, Bsize=128)/zygote/GPU/AMDGPU 367964 ns 375574 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 334 ns 291 ns 1.15
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 250 ns 375 ns 0.67
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 250 ns 292 ns 0.86
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 26682 ns 25491 ns 1.05
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 482271 ns 274125 ns 1.76
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 47990 ns 48640 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6500 ns 6334 ns 1.03
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6562.5 ns 6666 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6709 ns 6875 ns 0.98
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6188 ns 6125 ns 1.01
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 190767.5 ns 186984.5 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 5874834 ns 4846709 ns 1.21
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 394363.5 ns 399775 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 2042 ns 1917 ns 1.07
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 1917 ns 2000 ns 0.96
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 2125 ns 2042 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 2000 ns 2000 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 27167 ns 26185.5 ns 1.04
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/Metal 492292 ns 456771 ns 1.08
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/AMDGPU 210002 ns 209542 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 16833.5 ns 16521 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 16417 ns 16458 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 17354.5 ns 17166.5 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 16458.5 ns 16729.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 278278 ns 274176 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/Metal 6125604 ns 4934125 ns 1.24
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/AMDGPU 714427 ns 693313 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 146500 ns 175041 ns 0.84
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 171396 ns 176250 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 155584 ns 151792 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 154167 ns 153042 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 204804 ns 199925 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1553583 ns 1545458 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 231362.5 ns 177492 ns 1.30
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1324312.5 ns 1316145.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1348021 ns 1322833 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1319083.5 ns 1306875 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1326542 ns 1335833 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 925557 ns 903541.5 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 8602229.5 ns 6708709 ns 1.28
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1014380 ns 1125232 ns 0.90
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23792 ns 24875 ns 0.96
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 25354 ns 25708.5 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 28250 ns 26667 ns 1.06
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 24604.5 ns 25187 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 238411 ns 234415.5 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/Metal 1139000 ns 981687 ns 1.16
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/AMDGPU 120312 ns 119891 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 117854 ns 118146.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 124667 ns 121937.5 ns 1.02
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 174458.5 ns 120062.5 ns 1.45
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 118354 ns 150541.5 ns 0.79
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1098934 ns 1068343 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/Metal 7919042 ns 5874292 ns 1.35
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/AMDGPU 614406 ns 611136 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 250 ns 375 ns 0.67
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 375 ns 375 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 292 ns 334 ns 0.87
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23522 ns 22848 ns 1.03
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/Metal 491791.5 ns 453916.5 ns 1.08
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/AMDGPU 50790 ns 49700 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6583 ns 6500 ns 1.01
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6375 ns 6542 ns 0.97
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6833 ns 6708 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6167 ns 6458 ns 0.95
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 207746.5 ns 203322.5 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/Metal 5956667 ns 4933125 ns 1.21
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 395954 ns 405594.5 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5958 ns 6084 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6041.5 ns 5583 ns 1.08
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7604.5 ns 7625 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6500 ns 6375 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 147981.5 ns 145118 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 774875 ns 662166.5 ns 1.17
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 239202 ns 241563 ns 0.99
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10000 ns 9917 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10083 ns 10292 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10667 ns 10250 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9791.5 ns 10417 ns 0.94
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 916090 ns 899874 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 7392292 ns 5521125 ns 1.34
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 688747.5 ns 696188 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 708 ns 666 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 666 ns 667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 666 ns 667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 625 ns 625 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23031 ns 22288 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/Metal 209625 ns 206792 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/AMDGPU 215712 ns 218712.5 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4833 ns 4542 ns 1.06
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4584 ns 4792 ns 0.96
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 4833 ns 4792 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4625 ns 4584 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 230125.5 ns 226923.5 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/Metal 1700146 ns 1564208 ns 1.09
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/AMDGPU 599396 ns 607636 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 8396 ns 8417 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 8000 ns 8834 ns 0.91
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 10125 ns 10208 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 9062.5 ns 8333.5 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 123106.5 ns 121107.5 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/Metal 907333 ns 787896 ns 1.15
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/AMDGPU 76081 ns 69681 ns 1.09
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8792 ns 8542 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8459 ns 8584 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9041 ns 8834 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8270.5 ns 8687.5 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 600302.5 ns 588077 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/Metal 4960583.5 ns 4126709 ns 1.20
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/AMDGPU 353604 ns 353148.5 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 122750 ns 126625 ns 0.97
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 95625 ns 131250 ns 0.73
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 130334 ns 129875 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 183125 ns 181374.5 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/CUDA 46375 ns 45747 ns 1.01
batchedmm(128, Bsize=4)/forward/GPU/AMDGPU 98981 ns 98706 ns 1.00
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 303292 ns 311584 ns 0.97
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 182750 ns 342188 ns 0.53
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 345917 ns 314062.5 ns 1.10
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 608729 ns 597708.5 ns 1.02
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 195364.5 ns 190310 ns 1.03
batchedmm(128, Bsize=4)/zygote/GPU/AMDGPU 494734 ns 493310.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 396125 ns 397917 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 215375 ns 288229.5 ns 0.75
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 287708 ns 288375 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 756000 ns 756667 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43820 ns 42976 ns 1.02
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/Metal 358000 ns 361604 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/AMDGPU 83390 ns 85651 ns 0.97
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 1446958.5 ns 1452354.5 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 863667 ns 1135375 ns 0.76
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 1133375 ns 1136166 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 2443417 ns 2360834 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 252085 ns 244938.5 ns 1.03
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/Metal 1851958 ns 1837437.5 ns 1.01
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/AMDGPU 350863.5 ns 353764 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 626459 ns 626917 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 682479 ns 647958.5 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 615000 ns 644166.5 ns 0.95
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 641167 ns 647292 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 203045 ns 202423.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1359542 ns 1353875 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 254223 ns 253853 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2435250 ns 2449396 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2470979.5 ns 2442604.5 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2445042 ns 2442291 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2415792 ns 2488000 ns 0.97
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1014910 ns 984916.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11589916 ns 10289041.5 ns 1.13
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1478675 ns 1500640 ns 0.99
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 29458.5 ns 32979 ns 0.89
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 33812.5 ns 36916 ns 0.92
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 34541 ns 33875 ns 1.02
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 1042 ns 958 ns 1.09
batchedmm(2, Bsize=32)/forward/GPU/CUDA 15442 ns 15311 ns 1.01
batchedmm(2, Bsize=32)/forward/GPU/AMDGPU 85531 ns 73911 ns 1.16
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 3250 ns 3208 ns 1.01
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 3042 ns 3209 ns 0.95
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 3416 ns 3354.5 ns 1.02
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 3166 ns 3125 ns 1.01
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 142240.5 ns 136401 ns 1.04
batchedmm(2, Bsize=32)/zygote/GPU/AMDGPU 360413 ns 360914 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 404291 ns 405709 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 403708 ns 408354.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 409042 ns 407750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 421875 ns 421833 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 44262 ns 43333 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/Metal 1119041 ns 1083604.5 ns 1.03
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 242882 ns 245632 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3855208 ns 3884625 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 3997771 ns 3994229 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 3998125 ns 3999209 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3773938 ns 3792521 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 248524 ns 243526 ns 1.02
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/Metal 14976771 ns 11754959 ns 1.27
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1453704 ns 1250757.5 ns 1.16
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3959 ns 3916 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3917 ns 3958 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3917 ns 3917 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3875 ns 4000 ns 0.97
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 34278.5 ns 33572 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/Metal 161167 ns 162667 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/AMDGPU 40280 ns 43020 ns 0.94
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15875 ns 15667 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15583 ns 15959 ns 0.98
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 16041 ns 16042 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15791 ns 15542 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 257529.5 ns 253570 ns 1.02
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/Metal 864083.5 ns 834375 ns 1.04
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/AMDGPU 168256.5 ns 180881 ns 0.93
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 403417 ns 404000 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 221375 ns 295708 ns 0.75
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 295666 ns 295709 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 760500 ns 760917 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 113952 ns 112718 ns 1.01
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/Metal 335792 ns 342354.5 ns 0.98
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/AMDGPU 88615.5 ns 90981 ns 0.97
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1471958 ns 1493312.5 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 887791.5 ns 1156854.5 ns 0.77
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1157167 ns 1160000 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2467666 ns 2383125 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 255583.5 ns 238647 ns 1.07
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/Metal 1946854 ns 1884417 ns 1.03
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/AMDGPU 360243.5 ns 359543.5 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 542 ns 458 ns 1.18
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 500 ns 583 ns 0.86
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 26902 ns 25565 ns 1.05
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/Metal 486187.5 ns 419791 ns 1.16
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/AMDGPU 208227.5 ns 212932 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 7667 ns 7417 ns 1.03
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 7666 ns 7667 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 7916.5 ns 7791 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7833 ns 0.93
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 219818 ns 215431.5 ns 1.02
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/Metal 6151042 ns 4957687 ns 1.24
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/AMDGPU 686716.5 ns 709788 ns 0.97
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 825562.5 ns 828812.5 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 468833 ns 617312 ns 0.76
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 620188 ns 619250 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 1547479 ns 1549375 ns 1.00
batchedmm(128, Bsize=32)/forward/GPU/CUDA 131055 ns 133852.5 ns 0.98
batchedmm(128, Bsize=32)/forward/GPU/AMDGPU 231953 ns 169242 ns 1.37
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 2669042 ns 2694583.5 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1538125.5 ns 2012042 ns 0.76
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 2006270.5 ns 2001792 ns 1.00
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 4938583 ns 4939479.5 ns 1.00
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 242713 ns 239151.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/GPU/AMDGPU 860168 ns 886529 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 250 ns 1.50
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 291 ns 334 ns 0.87
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 333 ns 334 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 32634 ns 31838 ns 1.03
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/Metal 452000 ns 259333 ns 1.74
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/AMDGPU 48761 ns 49400 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6437.5 ns 6333 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6541.5 ns 6500 ns 1.01
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6625 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6000 ns 6520.5 ns 0.92
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 228896 ns 223447 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/Metal 5302916 ns 4629187.5 ns 1.15
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/AMDGPU 369843 ns 375294 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2391250 ns 2395187.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2400000 ns 2405291 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2405958 ns 2380959 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2372125 ns 2436208 ns 0.97
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 204395 ns 200278 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1597249.5 ns 1414792 ns 1.13
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 377704 ns 358993 ns 1.05
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4646708.5 ns 4648625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4648958 ns 4665041.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4659021 ns 4656875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4685792 ns 4669959 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 915367 ns 896958 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 7426833 ns 6803500 ns 1.09
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1261857 ns 1421880 ns 0.89
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 7479 ns 6750 ns 1.11
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 7125 ns 7083 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 7959 ns 7292 ns 1.09
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 7250 ns 6584 ns 1.10
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 23573 ns 23321 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/GPU/Metal 243500 ns 239042 ns 1.02
bias_activation(512, act=relu)(512 x 128)/forward/GPU/AMDGPU 39571 ns 38450 ns 1.03
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 70291.5 ns 45667 ns 1.54
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 45542 ns 35834 ns 1.27
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 63500 ns 33937.5 ns 1.87
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 33104 ns 66729 ns 0.50
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 217821 ns 216719 ns 1.01
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/Metal 2084458 ns 1971646 ns 1.06
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/AMDGPU 226612 ns 249873 ns 0.91
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 20396 ns 21583.5 ns 0.94
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 24479.5 ns 26958 ns 0.91
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 24854.5 ns 22875 ns 1.09
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 5500 ns 5250 ns 1.05
batchedmm(2, Bsize=512)/forward/GPU/CUDA 16892 ns 16231 ns 1.04
batchedmm(2, Bsize=512)/forward/GPU/AMDGPU 85151 ns 86831 ns 0.98
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 11958 ns 11791.5 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 9000 ns 10333 ns 0.87
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 10958.5 ns 10708 ns 1.02
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 18167 ns 17979 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 227664.5 ns 225788.5 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/AMDGPU 389024 ns 379954 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 404791 ns 405812.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 223500 ns 297333.5 ns 0.75
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 296709 ns 297167 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 762750 ns 762666 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46360 ns 46002 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/Metal 340000 ns 339937.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/AMDGPU 88940 ns 89221 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 1485750.5 ns 1490875.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 895812 ns 1168979.5 ns 0.77
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 1165791.5 ns 1165791 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 2472333 ns 2389458 ns 1.03
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 290272 ns 288633.5 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/Metal 2106583 ns 2056875 ns 1.02
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/AMDGPU 377424 ns 383589 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 432770.5 ns 433750 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 430583 ns 436958 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 436958 ns 436167 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 448209 ns 448500 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 54092 ns 55020 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1074083.5 ns 1006500 ns 1.07
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 235772 ns 239793 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 3888958 ns 3901875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4016791.5 ns 4017833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4025938 ns 4034833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 3793958.5 ns 3796208.5 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 263523 ns 262986 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11929333 ns 10384542 ns 1.15
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 1247352 ns 1253342.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 8750 ns 8708 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 6875 ns 7667 ns 0.90
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 7667 ns 7667 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 12417 ns 12458 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24084 ns 23658 ns 1.02
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/Metal 211583 ns 213500 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/AMDGPU 216562 ns 218512 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 45125 ns 45542 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 44750 ns 45333 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 45375 ns 45625 ns 0.99
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 45187.5 ns 45042 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 347338.5 ns 345391.5 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/Metal 1883625.5 ns 1709416 ns 1.10
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/AMDGPU 671931.5 ns 674767 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 104146.5 ns 126458 ns 0.82
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 86437 ns 123125 ns 0.70
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 92875 ns 88875 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 126625 ns 83854.5 ns 1.51
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 189767 ns 190159 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/Metal 1966250 ns 1948625 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/AMDGPU 183982 ns 196902 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2011000 ns 2023167 ns 0.99
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2025000 ns 1999791 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2009458 ns 2015209 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2016917 ns 2007042 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 535873.5 ns 533000 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/Metal 11961958.5 ns 9148917 ns 1.31
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/AMDGPU 982380 ns 1104121 ns 0.89

This comment was automatically generated by workflow using github-action-benchmark.
