docs: migrate most examples to Reactant #1180

Open: wants to merge 15 commits into base: main

Conversation

avik-pal (Member) commented Jan 5, 2025

CUDA CI is giving some BT with downloading artifacts

  • Basics
  • PolynomialFitting
  • SimpleRNN
  • OptimizationIntegration -- use CPU
  • NeuralODE -- tricky here. maybe use CPU
  • HyperNet
  • PINN2DPDE
  • ConditionalVAE throughput calc
  • SimpleChains -- Run CPU using Reactant
  • Main Documentation Examples
  • downgrade testing

other upstream changes:

github-actions bot (Contributor) commented Jan 5, 2025

Benchmark Results (ASV)

| | main | 981a191... | main/981a191910301b... |
|---|---|---|---|
| basics/overhead | 0.127 ± 0.0014 μs | 0.122 ± 0.0017 μs | 1.04 |
| time_to_load | 0.897 ± 0.005 s | 0.899 ± 0.0061 s | 0.998 |
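
For readers checking the last column: the ratios above look like the `main` timing divided by the PR timing, so a value above 1 would mean the PR is faster on that benchmark. The sketch below recomputes them under that assumption. The `ratio` helper and the `asv_rows` list are hypothetical names used only for illustration and are not part of the benchmark tooling. Note that the "Lux Benchmarks" table further down labels its columns Current and Previous, and its ratios are consistent with Current / Previous, so there a value above 1 would mean the PR is slower.

```python
def ratio(numerator: float, denominator: float) -> float:
    """Relative timing between two runs (hypothetical helper, not from the CI tooling)."""
    if denominator <= 0:
        raise ValueError("denominator timing must be positive")
    return numerator / denominator


# (benchmark, main timing, PR timing) copied from the ASV table above.
# Assumption: the reported ratio is main / PR, using the central values only.
asv_rows = [
    ("basics/overhead", 0.127, 0.122),  # microseconds
    ("time_to_load", 0.897, 0.899),     # seconds
]

ratios = {name: ratio(main, pr) for name, main, pr in asv_rows}

for name, value in ratios.items():
    print(f"{name}: {value:.3f}")
```

Both differences are of the same order as the reported ± spreads, so neither row on its own suggests a meaningful change.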

Benchmark Plots

A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Go to "Actions"->"Benchmark a pull request"->[the most recent run]->"Artifacts" (at the bottom).

@avik-pal marked this pull request as ready for review January 5, 2025 03:50
github-actions bot (Contributor) left a comment

Lux Benchmarks

Benchmark suite Current: 706a05a Previous: 46a012d Ratio
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 4000 ns 3791 ns 1.06
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 4334 ns 4500 ns 0.96
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4959 ns 4875 ns 1.02
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3916.5 ns 3666 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 64908 ns 59711.5 ns 1.09
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 10604.5 ns 10167 ns 1.04
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 11209 ns 10458 ns 1.07
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 11125 ns 10750 ns 1.03
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9916 ns 10625 ns 0.93
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 425347 ns 419469 ns 1.01
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) 1083 ns 1062.5 ns 1.02
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) 1250 ns 1167 ns 1.07
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) 1292 ns 1500 ns 0.86
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) 1042 ns 1125 ns 0.93
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA 18063 ns 18540 ns 0.97
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 4083 ns 4083 ns 1
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 4083 ns 4042 ns 1.01
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 4209 ns 4208 ns 1.00
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 4020.5 ns 3958 ns 1.02
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA 110564 ns 109802.5 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 57333 ns 57542 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 46500 ns 46416 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47209 ns 47125 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82333 ns 80875 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37199 ns 37744 ns 0.99
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2033020.5 ns 2035395.5 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2085895.5 ns 2078396 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2101958 ns 2078708 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1997062.5 ns 1998584 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195555 ns 195463 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144041 ns 144250 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144584 ns 144166.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 145166.5 ns 145125 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 143750 ns 153104.5 ns 0.94
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 166186.5 ns 165592.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1117854 ns 1120291.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1116167 ns 1113167 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1136000 ns 832708.5 ns 1.36
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1119083 ns 1117084 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 529755 ns 520015.5 ns 1.02
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3667 ns 3375 ns 1.09
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3708 ns 3542 ns 1.05
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4333 ns 4166 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3417 ns 3125 ns 1.09
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 71440.5 ns 66073.5 ns 1.08
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 9917 ns 9042 ns 1.10
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 9375 ns 8750 ns 1.07
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 9416 ns 10208 ns 0.92
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 9208 ns 8833 ns 1.04
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 475729 ns 469701 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15250 ns 17041 ns 0.89
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15959 ns 15834 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 16333 ns 16604.5 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 15209 ns 16791 ns 0.91
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 55464 ns 54530 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 215833 ns 213750 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 215042 ns 214875 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 214083 ns 215667 ns 0.99
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 215541 ns 226125 ns 0.95
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 277083 ns 269469 ns 1.03
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) 625 ns 542 ns 1.15
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) 875 ns 708 ns 1.24
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) 584 ns 709 ns 0.82
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 541 ns 1.00
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA 17559 ns 17336 ns 1.01
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 1541 ns 1375 ns 1.12
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 1500 ns 1375 ns 1.09
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 1500 ns 1500 ns 1
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 1417 ns 1458 ns 0.97
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA 102642 ns 100554 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7208 ns 7000 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 5875 ns 5750 ns 1.02
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5917 ns 6042 ns 0.98
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10000 ns 9750 ns 1.03
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 23623 ns 23286 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221166 ns 222021 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 229083 ns 228542 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 229083 ns 229292 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 214416.5 ns 213937.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 169773 ns 166141.5 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) 3916 ns 3875 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) 3916 ns 3917 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) 3958 ns 3959 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) 3958 ns 3917 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA 23615 ns 23204 ns 1.02
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16708 ns 16917 ns 0.99
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16875 ns 16792 ns 1.00
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16750 ns 17250 ns 0.97
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16875 ns 16750 ns 1.01
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA 164235.5 ns 164061.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 578041 ns 568792 ns 1.02
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 578458 ns 578645.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 580333 ns 578083 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 574667 ns 575625 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA 113670 ns 113438.5 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1381458.5 ns 1422625 ns 0.97
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1417875 ns 1420000 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1422125 ns 1422375 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 1420792 ns 1426708 ns 1.00
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA 213871 ns 213572 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1082646 ns 1077687.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 975542 ns 960917 ns 1.02
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 1348500 ns 1353229.5 ns 1.00
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1302083 ns 1315312 ns 0.99
lenet(28, 28, 1, 64)/forward/GPU/CUDA 277486.5 ns 274529.5 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 5993833 ns 5961958 ns 1.01
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 4614270.5 ns 4633250 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 4932083 ns 4975188 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5723020.5 ns 5557125 ns 1.03
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1095707 ns 1081948 ns 1.01
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) 542 ns 583 ns 0.93
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) 542 ns 542 ns 1
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA 23887 ns 23910 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2209 ns 2208 ns 1.00
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 2167 ns 2250 ns 0.96
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2250 ns 2167 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2208 ns 2125 ns 1.04
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA 172803.5 ns 176064.5 ns 0.98
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4208 ns 4125 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 3917 ns 4375 ns 0.90
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5166 ns 5167 ns 1.00
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4125 ns 4250 ns 0.97
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 66602.5 ns 65504 ns 1.02
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 11708 ns 11875 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 11625 ns 11000 ns 1.06
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 11791 ns 11917 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 11417 ns 11500 ns 0.99
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 453473 ns 448080.5 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6209 ns 7000 ns 0.89
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7083.5 ns 6958 ns 1.02
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7958 ns 8250 ns 0.96
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6209 ns 6125 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 53188.5 ns 52534 ns 1.01
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 17250 ns 18708.5 ns 0.92
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 19521 ns 18625 ns 1.05
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 18291 ns 18375 ns 1.00
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 18291.5 ns 16708 ns 1.09
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 299625.5 ns 296471 ns 1.01
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 708 ns 0.82
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 583 ns 667 ns 0.87
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 625 ns 584 ns 1.07
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 32863 ns 33481 ns 0.98
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9333 ns 8834 ns 1.06
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9125 ns 8875 ns 1.03
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9292 ns 9334 ns 1.00
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 8708 ns 8354.5 ns 1.04
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 161130 ns 158505 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) 64541 ns 64459 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) 64541 ns 64750 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) 64959 ns 64916 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) 64625 ns 64625 ns 1
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA 112602 ns 112347 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 275875 ns 279250 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 279833 ns 282167 ns 0.99
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 283458 ns 284125 ns 1.00
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 283583 ns 278708 ns 1.02
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA 189085 ns 187244.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 3299167 ns 3278417 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 3026270.5 ns 3081000 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 3019187.5 ns 3021792 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 4024583 ns 4040979.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 584255.5 ns 573775.5 ns 1.02
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 7515459 ns 7620208 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 7456208 ns 7449187.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 7490895.5 ns 7493708.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 8188916 ns 8208791 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1347259 ns 1340015.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 17415104.5 ns 18366417 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 17525145.5 ns 17522312.5 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 17569875 ns 17580834 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 14180208.5 ns 14093354.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23591729 ns 23631333 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 33628750 ns 33504604 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37137000 ns 37034667 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34864416.5 ns 34967583.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1862603 ns 1860248 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 187602250 ns 189693000 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 163289375 ns 165014875 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 152560000 ns 152416688 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 434843584 ns 434850958 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13906335 ns 13871408 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 287885958 ns 289105312.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 251098583 ns 250867083 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 294830208 ns 296775875 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 474116666.5 ns 473537562.5 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 22333 ns 22083 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22209 ns 22459 ns 0.99
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 24667 ns 25375 ns 0.97
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 21250 ns 24083 ns 0.88
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 96476 ns 95417 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 104083.5 ns 103083 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 104791 ns 103250 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 104125 ns 104542 ns 1.00
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 103875 ns 103041 ns 1.01
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 508784.5 ns 502007.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6000 ns 5917 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 6000 ns 5958 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7042 ns 6708 ns 1.05
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5459 ns 5791.5 ns 0.94
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 68961 ns 68401.5 ns 1.01
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 13770.5 ns 14792 ns 0.93
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 15375 ns 15000 ns 1.02
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 16187.5 ns 16542 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14604.5 ns 14875 ns 0.98
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 480016 ns 475091.5 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3034541 ns 3002625 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2081333 ns 2079375 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2270063 ns 2272333 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4522416.5 ns 4882708 ns 0.93
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 587464 ns 586443 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 23649250.5 ns 23536000 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18052375 ns 18038562.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 16973500 ns 16972167 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 34879666.5 ns 34545146 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2921251.5 ns 2768189 ns 1.06
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33967979 ns 33221458 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 27495520.5 ns 27561792 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 27395334 ns 27327000 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41094750 ns 42034750 ns 0.98
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 74791 ns 71417 ns 1.05
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 74000 ns 71854.5 ns 1.03
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 75208 ns 75708 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74520.5 ns 74708 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 103185 ns 101188 ns 1.02
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 252229 ns 205250.5 ns 1.23
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 218416.5 ns 206750 ns 1.06
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 318417 ns 208958 ns 1.52
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 281145.5 ns 217416 ns 1.29
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 546577.5 ns 541638 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 11792 ns 11875 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 11834 ns 11416 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 12687.5 ns 12958 ns 0.98
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11792 ns 11708 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 71493 ns 70557.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 26792 ns 25667 ns 1.04
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 26875 ns 26541.5 ns 1.01
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 27583 ns 27729.5 ns 0.99
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26625 ns 26667 ns 1.00
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 472979.5 ns 468068.5 ns 1.01
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 12042 ns 12812.5 ns 0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 14666 ns 12209 ns 1.20
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 13792 ns 14208 ns 0.97
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 11584 ns 12291.5 ns 0.94
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 53417.5 ns 52262 ns 1.02
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 25375 ns 25625 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 25750 ns 25916.5 ns 0.99
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 26209 ns 26250 ns 1.00
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 26125 ns 26604 ns 0.98
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 301308.5 ns 297345.5 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 179709 ns 178792 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 180250 ns 180750 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 180833 ns 181917 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 179770.5 ns 179166 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 56746 ns 56939 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 601167 ns 593333 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 583459 ns 582708 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 593854.5 ns 583667 ns 1.02
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 587750 ns 584542 ns 1.01
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 287220 ns 282717 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6479.5 ns 6167 ns 1.05
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5687.5 ns 5875 ns 0.97
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 6875 ns 6875 ns 1
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 5834 ns 5708.5 ns 1.02
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 70626.5 ns 69908.5 ns 1.01
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14292 ns 13791 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14459 ns 13917 ns 1.04
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 15292 ns 15667 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14208 ns 14458 ns 0.98
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 464130 ns 454508 ns 1.02
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) 1175917 ns 1225312.5 ns 0.96
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) 1245125 ns 1241959 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) 1285584 ns 1289958.5 ns 1.00
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) 999417 ns 1011625 ns 0.99
batchedmm(512, Bsize=4)/forward/GPU/CUDA 301338 ns 300319.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) 4109000 ns 4103042 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) 4433333 ns 4403333 ns 1.01
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) 4513562.5 ns 4523854.5 ns 1.00
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) 3727604 ns 3709771 ns 1.00
batchedmm(512, Bsize=4)/zygote/GPU/CUDA 1048333 ns 1034770 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1875 ns 1916 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1875 ns 1875 ns 1
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA 23450 ns 23619 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 4917 ns 4958 ns 0.99
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 4917 ns 5000 ns 0.98
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5000 ns 4958 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 4917 ns 4875 ns 1.01
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA 188032 ns 186116 ns 1.01
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5625 ns 5833 ns 0.96
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6292 ns 5917 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6667 ns 6667 ns 1
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5458 ns 5209 ns 1.05
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 54508 ns 54405.5 ns 1.00
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 11125 ns 11125 ns 1
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 11250 ns 11500 ns 0.98
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10625 ns 11458 ns 0.93
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 11104 ns 10500 ns 1.06
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 322621.5 ns 320192 ns 1.01
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) 375 ns 375 ns 1
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) 292 ns 375 ns 0.78
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) 375 ns 333 ns 1.13
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) 334 ns 375 ns 0.89
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA 22578 ns 22488.5 ns 1.00
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) 2750 ns 2792 ns 0.98
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) 3000 ns 2833 ns 1.06
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) 2875 ns 3083 ns 0.93
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) 2750 ns 2750 ns 1
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA 157805 ns 157059.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 11417 ns 11459 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 11833 ns 11625 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 12666.5 ns 12875 ns 0.98
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10667 ns 10958 ns 0.97
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 56499 ns 55353 ns 1.02
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 25125 ns 25020.5 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25084 ns 25292 ns 0.99
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 25208 ns 25125 ns 1.00
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 25000 ns 24875 ns 1.01
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 287174 ns 284593.5 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) 4250 ns 4250 ns 1
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) 4209 ns 4250 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) 4292 ns 4250 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) 4250 ns 4208 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA 25020 ns 24743 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) 16417 ns 16333 ns 1.01
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) 16250 ns 16375 ns 0.99
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) 16791 ns 16520.5 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) 16583 ns 16208 ns 1.02
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA 192990 ns 192574 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5792 ns 5833 ns 0.99
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 5833 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5833 ns 6042 ns 0.97
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5834 ns 5833 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 32887 ns 33721.5 ns 0.98
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 19917 ns 21000 ns 0.95
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 21104.5 ns 21000 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 21458 ns 21417 ns 1.00
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 20833 ns 20709 ns 1.01
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 172782.5 ns 172002 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) 421479.5 ns 422124.5 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) 387458 ns 387791 ns 1.00
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) 483666.5 ns 477333 ns 1.01
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) 104125 ns 103125 ns 1.01
batchedmm(16, Bsize=512)/forward/GPU/CUDA 66577 ns 66716 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) 858542 ns 921333 ns 0.93
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) 977646 ns 974250 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) 1191354 ns 1186458 ns 1.00
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) 467937.5 ns 457479.5 ns 1.02
batchedmm(16, Bsize=512)/zygote/GPU/CUDA 192502.5 ns 189036 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 112208 ns 80542 ns 1.39
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81563 ns 80709 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 83208 ns 84896 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80417 ns 79833 ns 1.01
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193461 ns 193358.5 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1637291 ns 1919250 ns 0.85
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1921875 ns 1876583 ns 1.02
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1912999.5 ns 1946041 ns 0.98
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1916000 ns 1921396 ns 1.00
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 393766 ns 391971 ns 1.00
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 333 ns 1
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) 333 ns 292 ns 1.14
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA 21781 ns 21948.5 ns 0.99
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 1875 ns 1917 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 1875 ns 1917 ns 0.98
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 1917 ns 1875 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 1834 ns 1792 ns 1.02
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA 166492 ns 166123 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 6500 ns 6417 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6792 ns 6666 ns 1.02
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 7667 ns 7771 ns 0.99
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5792 ns 6145.5 ns 0.94
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 57391 ns 56772 ns 1.01
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 9166 ns 9604.5 ns 0.95
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 9208 ns 9459 ns 0.97
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 9458 ns 9500 ns 1.00
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 9375 ns 9041 ns 1.04
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 297020.5 ns 294981.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 120279542 ns 120459792 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 173916083 ns 173682208 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148017417 ns 147804000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 105434958 ns 105720875 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5470683 ns 5472285 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 611677521 ns 610206729.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 555203375 ns 555562500 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 449267291.5 ns 452099291.5 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 626329062.5 ns 626409896 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34976369 ns 34955764 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 653577709 ns 657253583 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 664059604 ns 665008062.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 584797604.5 ns 581676208.5 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 856360417 ns 857648458 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 59375 ns 57875 ns 1.03
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47917 ns 47791 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47333 ns 47500 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 82208 ns 83395.5 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 37811 ns 37072 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1907167 ns 1915500 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1979584 ns 1932792 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1982292 ns 1995084 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1887520.5 ns 1890500 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 173554.5 ns 171922.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 267916.5 ns 267854.5 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 268187.5 ns 267708 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 289375 ns 269750 ns 1.07
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 267604.5 ns 268166 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 125126 ns 123763 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 672417 ns 594417 ns 1.13
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 682229.5 ns 681291 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 697833 ns 604895.5 ns 1.15
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 691667 ns 689917 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 681144 ns 674236.5 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 2155667 ns 2176375 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 2209917 ns 2222812.5 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 2230833 ns 2205042 ns 1.01
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 2213250 ns 2093562.5 ns 1.06
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 133068 ns 133331 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5474583 ns 5514416 ns 0.99
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5487146 ns 5508500 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5508750 ns 5535958 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5505584 ns 5491750 ns 1.00
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 730756 ns 730299 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 644000 ns 638167 ns 1.01
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 646333 ns 647708 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 645875 ns 659416 ns 0.98
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 636625 ns 643750 ns 0.99
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA 46573 ns 46729.5 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 1813145.5 ns 1822167 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 1720875 ns 1723042 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 1730500 ns 1727833 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 2102458 ns 2106333 ns 1.00
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA 221564 ns 219682 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58375 ns 58458 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47583 ns 46917 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47958 ns 47292 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 84792 ns 84125 ns 1.01
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 28293.5 ns 28215 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2023750.5 ns 2030041 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2086896 ns 2004250 ns 1.04
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2092792 ns 2122125 ns 0.99
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1989458 ns 1985979.5 ns 1.00
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 189678 ns 186715 ns 1.02
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 13137500 ns 13357770.5 ns 0.98
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 12420833 ns 12440000 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 12485416 ns 12492250 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 15160708 ns 15108458 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 515676 ns 510701.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 46983250 ns 47178791.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 41725771 ns 41760334 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 40959021.5 ns 40950875 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 58550667 ns 58205437.5 ns 1.01
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 3034668 ns 2894239.5 ns 1.05
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 96386417 ns 97014458.5 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 90955791.5 ns 91152834 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 90582166.5 ns 90701604.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 98700687.5 ns 98541521.5 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 58833 ns 58959 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 47208 ns 47375 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 47791 ns 47750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 80375 ns 79958 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 48423 ns 47779.5 ns 1.01
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1652167 ns 1918645.5 ns 0.86
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1967125 ns 1971000 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1975771 ns 1997667 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1889978.5 ns 1889750 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 196325 ns 192960 ns 1.02
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 333 ns 375 ns 0.89
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 32529 ns 33172 ns 0.98
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6312.5 ns 6292 ns 1.00
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6500 ns 6542 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6750 ns 6834 ns 0.99
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6375 ns 6125 ns 1.04
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 173262.5 ns 171303 ns 1.01
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) 292 ns 292 ns 1
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) 333 ns 292 ns 1.14
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) 292 ns 333 ns 0.88
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA 33122 ns 32323 ns 1.02
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) 2833 ns 2833 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) 2875 ns 2917 ns 0.99
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) 2917 ns 2917 ns 1
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) 2834 ns 2708 ns 1.05
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA 163502.5 ns 162112.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 283807333.5 ns 289426812.5 ns 0.98
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 339209291 ns 339624334 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 313321979.5 ns 315284104.5 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 271605333 ns 274668667 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7106197 ns 7120353.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1010796708 ns 1014634416 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 952570250 ns 953687125 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 857678666.5 ns 857733312.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1262433542 ns 1265357333 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 33982981 ns 33985258 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1664881084 ns 1675373667 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1682055500 ns 1668941291 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1607676375 ns 1606744000 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1777771875 ns 1787636084 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1413687.5 ns 1409499.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1461375 ns 1413833 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1420208 ns 1419895.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1414875 ns 1458541.5 ns 0.97
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 128286 ns 127493 ns 1.01
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5003833 ns 5016749.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5018125 ns 4651917 ns 1.08
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5022854 ns 5058791 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 5026291 ns 5012792 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 606256 ns 551564 ns 1.10
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 169934000 ns 171852250 ns 0.99
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 136588687.5 ns 129831062.5 ns 1.05
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 113187625 ns 115995771 ns 0.98
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 168763583 ns 168839667 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4844336 ns 4879222 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 622027459 ns 629070333 ns 0.99
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 495127250 ns 493488792 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 455703333 ns 456364583 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 647652500 ns 675660292 ns 0.96
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 16164061 ns 16223916 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) 8932687 ns 8950646 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) 8960208 ns 8924625 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) 7842146 ns 7865125 ns 1.00
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) 9760916.5 ns 9701750 ns 1.01
batchedmm(512, Bsize=32)/forward/GPU/CUDA 1593338 ns 1588053 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) 35890667 ns 36024125 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) 37271458 ns 37000208.5 ns 1.01
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) 33270271 ns 33425875 ns 1.00
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) 37815604 ns 37661542 ns 1.00
batchedmm(512, Bsize=32)/zygote/GPU/CUDA 6467020.5 ns 6463767 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) 47458 ns 47562.5 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) 47667 ns 47416 ns 1.01
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) 47583 ns 47666 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) 47333.5 ns 47375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA 18409 ns 17907 ns 1.03
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) 50417 ns 50542 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) 50500 ns 50375 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) 50625 ns 50584 ns 1.00
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) 50292 ns 50583 ns 0.99
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA 221942.5 ns 184398 ns 1.20
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 6750 ns 6958.5 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6958 ns 6500 ns 1.07
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 7834 ns 8042 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 6167 ns 6542 ns 0.94
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 117277.5 ns 89066 ns 1.32
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10020.5 ns 10042 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10459 ns 10437.5 ns 1.00
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10209 ns 10500 ns 0.97
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10709 ns 10375 ns 1.03
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 636678 ns 510214.5 ns 1.25
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 5750 ns 5666 ns 1.01
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6250 ns 5958 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 7562.5 ns 7417 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5292 ns 5458 ns 0.97
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 131618 ns 109271 ns 1.20
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13375 ns 13125 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13625 ns 13250 ns 1.03
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13583 ns 13375 ns 1.02
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 13875 ns 13208 ns 1.05
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 534735 ns 457940.5 ns 1.17
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 1083 ns 1083 ns 1
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 1125 ns 1084 ns 1.04
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 1083 ns 1084 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA 32184 ns 32174 ns 1.00
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 8041 ns 8000 ns 1.01
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 8125 ns 8292 ns 0.98
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8500 ns 0.99
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7916 ns 8125 ns 0.97
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA 217587 ns 199053.5 ns 1.09
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 23416.5 ns 23354.5 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 23625 ns 23250 ns 1.02
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 23584 ns 23542 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 23333 ns 23125 ns 1.01
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA 18406 ns 18347 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 52666 ns 52667 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 52708 ns 52584 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 52750 ns 52750 ns 1
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 52625 ns 52417 ns 1.00
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA 324876.5 ns 291115 ns 1.12
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1399458 ns 1398084 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1397791 ns 1402791 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1399625 ns 1401792 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1396645.5 ns 1402875 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 196317 ns 195544.5 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5003666.5 ns 5010813 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5011562.5 ns 5016584 ns 1.00
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5009354.5 ns 5062708 ns 0.99
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4929166 ns 5013500 ns 0.98
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 642705 ns 617335 ns 1.04
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3060229 ns 3040417 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2062833 ns 2105083 ns 0.98
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2271959 ns 2280208 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 4546292 ns 4865521 ns 0.93
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 582192 ns 579665 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24422521 ns 24414604.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18893959 ns 18876208.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17756020.5 ns 17652979 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 35829479.5 ns 35825688 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2983107 ns 2847809 ns 1.05
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 33974187.5 ns 34006188 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 28302938 ns 28283750 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 28035291.5 ns 27926083.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 41470145.5 ns 41742416.5 ns 0.99
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) 144105041 ns 144750166 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) 147427375 ns 146949375 ns 1.00
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) 126884687.5 ns 126208208.5 ns 1.01
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) 173054375 ns 173205292 ns 1.00
batchedmm(512, Bsize=512)/forward/GPU/CUDA 22557393 ns 22782449 ns 0.99
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) 1956977417 ns 1847080125 ns 1.06
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) 885326041 ns 809911709 ns 1.09
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) 1037813833 ns 755677291 ns 1.37
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) 671200250 ns 667449084 ns 1.01
batchedmm(512, Bsize=512)/zygote/GPU/CUDA 118989039 ns 118406338 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 75000 ns 76791 ns 0.98
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 75959 ns 76042 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 83500 ns 76417 ns 1.09
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 74959 ns 72541 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 257706 ns 250232.5 ns 1.03
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 279979 ns 277229 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 238458 ns 193583 ns 1.23
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 296208 ns 205417 ns 1.44
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 291041.5 ns 303083.5 ns 0.96
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1346296 ns 1279646 ns 1.05
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) 35559750 ns 35472875 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) 36337666.5 ns 36379896 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) 32414125 ns 32315333.5 ns 1.00
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) 40612313 ns 40618416.5 ns 1.00
batchedmm(512, Bsize=128)/forward/GPU/CUDA 5842862 ns 5840653.5 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) 147914959 ns 146765250 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) 152919521 ns 153200125 ns 1.00
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) 139208021 ns 137307792 ns 1.01
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) 286973916 ns 285301125 ns 1.01
batchedmm(512, Bsize=128)/zygote/GPU/CUDA 34878379 ns 34880703 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 118953291.5 ns 120518062.5 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 174106167 ns 174031666 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 148140750 ns 148283312.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 106397000 ns 106552271 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5469783.5 ns 5465282.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 467953125 ns 469918416 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 466269250 ns 466837917 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 438982208 ns 437920916.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 743118750 ns 739774042 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 32258831 ns 32269604.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 714490041.5 ns 711087896 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 639936396 ns 640897313 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 628040062.5 ns 630411896 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 852625250 ns 849787625 ns 1.00
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1324000 ns 1302125 ns 1.02
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 968333.5 ns 905958 ns 1.07
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 989666 ns 938334 ns 1.05
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 2099083 ns 1987437 ns 1.06
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 581941.5 ns 573939.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 2954937.5 ns 2951687.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 2616209 ns 2611020.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 2650208.5 ns 2639896 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 3693125 ns 3702396 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1845744 ns 1765767 ns 1.05
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 5805500 ns 5801417 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 5795917 ns 5727666.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 5825979.5 ns 5818916 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 2900791.5 ns 2913834 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7375 ns 7417 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6250 ns 6166 ns 1.01
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6334 ns 6209 ns 1.02
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10083 ns 1
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25366 ns 25586 ns 0.99
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212375 ns 212792 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 232645.5 ns 220834 ns 1.05
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 220375 ns 221166 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 216416.5 ns 215459 ns 1.00
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 263885 ns 272866 ns 0.97
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 299505333 ns 300445333 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 219250584 ns 214002042 ns 1.02
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 195972000 ns 196386541 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 302200812.5 ns 307720792 ns 0.98
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7673524.5 ns 7675041.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 1227976646 ns 1232629833 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 894926666.5 ns 899311645.5 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 811451708 ns 825300584 ns 0.98
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 1157110729 ns 1150330250 ns 1.01
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 26480631 ns 26367421.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5542 ns 5458 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 6042 ns 5416 ns 1.12
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6479.5 ns 6750.5 ns 0.96
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 4833 ns 5084 ns 0.95
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 185138 ns 184497.5 ns 1.00
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7541 ns 7667 ns 0.98
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7500 ns 7333 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7583 ns 7500 ns 1.01
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7375 ns 7250 ns 1.02
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 640968 ns 655045 ns 0.98
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 583 ns 583 ns 1
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 542 ns 625 ns 0.87
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 625 ns 542 ns 1.15
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 23916 ns 24222 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 8854.5 ns 9542 ns 0.93
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 9583 ns 9833 ns 0.97
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 9542 ns 9667 ns 0.99
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9541 ns 9041 ns 1.06
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 222930.5 ns 221511.5 ns 1.01
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) 352187.5 ns 352562.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) 351375 ns 351833 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) 352187.5 ns 353416.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) 352854 ns 366166 ns 0.96
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA 20981 ns 21264 ns 0.99
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) 809000 ns 826208 ns 0.98
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) 774417 ns 775333.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) 808000 ns 808520.5 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) 828145.5 ns 828833 ns 1.00
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA 283278 ns 278649 ns 1.02
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) 334209 ns 340917 ns 0.98
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) 348667 ns 342729.5 ns 1.02
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) 449000 ns 453708 ns 0.99
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) 10041 ns 10687.5 ns 0.94
batchedmm(16, Bsize=32)/forward/GPU/CUDA 18205 ns 18338 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) 702917 ns 709875 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) 736959 ns 728042 ns 1.01
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) 998917 ns 1005792 ns 0.99
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) 26833 ns 26667 ns 1.01
batchedmm(16, Bsize=32)/zygote/GPU/CUDA 246883.5 ns 257132 ns 0.96
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) 376083.5 ns 380187.5 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) 353541.5 ns 355542 ns 0.99
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) 444042 ns 442146 ns 1.00
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) 29708 ns 30959 ns 0.96
batchedmm(16, Bsize=128)/forward/GPU/CUDA 22710 ns 22801.5 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) 717541 ns 726667 ns 0.99
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) 794458 ns 778791.5 ns 1.02
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) 1030979 ns 1034042 ns 1.00
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) 106333 ns 105042 ns 1.01
batchedmm(16, Bsize=128)/zygote/GPU/CUDA 215644.5 ns 214595.5 ns 1.00
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) 3500 ns 3583 ns 0.98
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) 3500 ns 3542 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) 3708 ns 3708 ns 1
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) 3583 ns 3542 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA 17630 ns 17801 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) 4250 ns 4583 ns 0.93
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) 4209 ns 4333 ns 0.97
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) 4333 ns 4375 ns 0.99
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) 4208 ns 4167 ns 1.01
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA 262870 ns 276455 ns 0.95
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 3958 ns 3833 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 3750 ns 3542 ns 1.06
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 4417 ns 4292 ns 1.03
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 3666 ns 3500 ns 1.05
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 203649.5 ns 219668 ns 0.93
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8459 ns 8334 ns 1.01
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8375 ns 8334 ns 1.00
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8708 ns 8708 ns 1
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8375 ns 8625 ns 0.97
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 1134731.5 ns 1228564 ns 0.92
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 204084 ns 203709 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210250 ns 209833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209959 ns 213750 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 198625 ns 200750 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 34477 ns 34897 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 601958 ns 611979.5 ns 0.98
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 620583.5 ns 623084 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 632250 ns 633542 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 629125 ns 630833 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 312652 ns 337730.5 ns 0.93
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) 1002541 ns 991250 ns 1.01
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) 1013375 ns 1017458.5 ns 1.00
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) 949958 ns 954833 ns 0.99
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) 863959 ns 864916.5 ns 1.00
batchedmm(128, Bsize=128)/forward/GPU/CUDA 207370 ns 208131 ns 1.00
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) 4494042 ns 4517208 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) 4651833.5 ns 4768041 ns 0.98
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) 4432083 ns 4459667 ns 0.99
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) 4266750 ns 4281312 ns 1.00
batchedmm(128, Bsize=128)/zygote/GPU/CUDA 924761 ns 937605 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3604.5 ns 3625 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3375 ns 3291 ns 1.03
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4083 ns 4250 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3583 ns 3166 ns 1.13
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 197591.5 ns 221703 ns 0.89
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7417 ns 7500 ns 0.99
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7250 ns 7458 ns 0.97
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7417 ns 7687.5 ns 0.96
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7250 ns 7084 ns 1.02
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 942275 ns 1025587 ns 0.92
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1630417 ns 1644333 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1198354 ns 1183209 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1354833 ns 1370292 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2350625 ns 2475167 ns 0.95
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 214827 ns 213710.5 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12313938 ns 12346958.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9588354.5 ns 9593646 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9260750 ns 9292209 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 17986208 ns 17963583.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1951608 ns 1947963.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17308583.5 ns 17361375 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14356333.5 ns 14393542 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14335437.5 ns 14339750 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21013646 ns 21095083 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 88958 ns 88167 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 87125 ns 88875 ns 0.98
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 93271 ns 91875 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 136500 ns 134020.5 ns 1.02
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 126007 ns 126192 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2010042 ns 2027813 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2026417 ns 2027000.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2041125 ns 2054000 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2030084 ns 2028125 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 921500 ns 1026969 ns 0.90
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) 2833 ns 2792 ns 1.01
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) 2792 ns 2583 ns 1.08
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) 3458.5 ns 3458 ns 1.00
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) 2812.5 ns 1917 ns 1.47
batchedmm(2, Bsize=4)/forward/GPU/CUDA 15285 ns 16376 ns 0.93
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) 2959 ns 2709 ns 1.09
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) 3084 ns 2792 ns 1.10
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) 3209 ns 2792 ns 1.15
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) 3083 ns 2833.5 ns 1.09
batchedmm(2, Bsize=4)/zygote/GPU/CUDA 178942 ns 186134.5 ns 0.96
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 7291 ns 7375 ns 0.99
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6042 ns 6041 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 5958 ns 6167 ns 0.97
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10083 ns 10125 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33546 ns 34252.5 ns 0.98
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 219458 ns 242958 ns 0.90
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 221333 ns 220917 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219917 ns 220417 ns 1.00
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 244250 ns 240375 ns 1.02
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 283938.5 ns 328052.5 ns 0.87
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3791 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3709 ns 3750 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA 22364 ns 22539 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 14666 ns 14584 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 14625 ns 14542 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 14500 ns 14584 ns 0.99
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 14542 ns 14417 ns 1.01
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA 439358.5 ns 484358 ns 0.91
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 92270.5 ns 92125 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 93521 ns 92458 ns 1.01
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 96917 ns 98562.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 140667 ns 118229 ns 1.19
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 125399 ns 125261.5 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1651458 ns 1913333 ns 0.86
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1725083.5 ns 1909771 ns 0.90
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1909333 ns 1956333 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1925145.5 ns 1924333 ns 1.00
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 875999 ns 935173 ns 0.94
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 870208 ns 879000 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 828041.5 ns 818395.5 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 1217708 ns 1219520.5 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 980708 ns 966459 ns 1.01
lenet(28, 28, 1, 32)/forward/GPU/CUDA 270193.5 ns 267198 ns 1.01
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2798667 ns 2822917 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 2451875 ns 2496917 ns 0.98
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 3325958 ns 3359000 ns 0.99
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3420104 ns 3411333 ns 1.00
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1517252.5 ns 1570113.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 15833.5 ns 17000 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 15042 ns 15458.5 ns 0.97
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 17667 ns 19041 ns 0.93
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 14500 ns 16875 ns 0.86
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 130046 ns 133146.5 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 223917 ns 258834 ns 0.87
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 216083 ns 215125 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 227333 ns 215792 ns 1.05
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 257875 ns 227875 ns 1.13
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 588866 ns 602653.5 ns 0.98
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 220187.5 ns 219062.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 219687.5 ns 221375 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 222292 ns 222875 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 223709 ns 220791 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 244511.5 ns 247312 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 497521 ns 497625 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 499708 ns 535916 ns 0.93
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 509167 ns 499208 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 561583.5 ns 511125 ns 1.10
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1232042.5 ns 1333241 ns 0.92
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) 4625 ns 3833.5 ns 1.21
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) 4458 ns 4250 ns 1.05
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) 5667 ns 5166.5 ns 1.10
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) 3437.5 ns 3792 ns 0.91
batchedmm(16, Bsize=4)/forward/GPU/CUDA 16793 ns 16912 ns 0.99
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) 7542 ns 7542 ns 1
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) 7334 ns 7167 ns 1.02
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) 7541 ns 7542 ns 1.00
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) 7791 ns 7667 ns 1.02
batchedmm(16, Bsize=4)/zygote/GPU/CUDA 182635 ns 186762.5 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 19083 ns 18667 ns 1.02
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17667 ns 16708 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19458 ns 20584 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 20041 ns 18084 ns 1.11
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 132734.5 ns 136037 ns 0.98
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 212083 ns 224209 ns 0.95
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 238375 ns 212687 ns 1.12
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 224958 ns 213167 ns 1.06
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 220333 ns 222979.5 ns 0.99
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 855682.5 ns 896805 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 4291.5 ns 4250 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 4000 ns 4333.5 ns 0.92
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 5187.5 ns 5125 ns 1.01
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 4145.5 ns 3875 ns 1.07
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 207043 ns 222577.5 ns 0.93
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 10750 ns 10542 ns 1.02
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 10708 ns 10791 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 10375 ns 10959 ns 0.95
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 10208 ns 10333 ns 0.99
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 997637 ns 1034707.5 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 3250 ns 3375 ns 0.96
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 3334 ns 3333 ns 1.00
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 4250 ns 4042 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 3334 ns 2958 ns 1.13
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 211528 ns 225445.5 ns 0.94
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7583.5 ns 7500 ns 1.01
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7520.5 ns 7750 ns 0.97
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7542 ns 7625 ns 0.99
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7542 ns 7208 ns 1.05
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 1003949 ns 1042046 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 23433333 ns 23498333.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 35079771 ns 34789375 ns 1.01
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 37588709 ns 37689958 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 34958437.5 ns 34909542 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1845047 ns 1849921 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 183360708 ns 184647292 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 162555958 ns 163834583 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 146640208.5 ns 146363541.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 273550667 ns 274565083 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 16535211 ns 16510014 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 274702250 ns 278243563 ns 0.99
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 250347021 ns 245760791.5 ns 1.02
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 232513104 ns 231789354 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 323149875 ns 324000854.5 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 183458 ns 182625 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 182417 ns 184458 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 184125 ns 186250 ns 0.99
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 188209 ns 181875 ns 1.03
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 191890 ns 206355.5 ns 0.93
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 590916.5 ns 628291.5 ns 0.94
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 585792 ns 608229.5 ns 0.96
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 597958 ns 598250 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 636312.5 ns 637791 ns 1.00
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 954926 ns 999947 ns 0.95
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) 3848458 ns 3874375 ns 0.99
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) 3925250 ns 3917042 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) 3537667 ns 3534687.5 ns 1.00
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) 4572958 ns 4554291 ns 1.00
batchedmm(128, Bsize=512)/forward/GPU/CUDA 535917.5 ns 531266.5 ns 1.01
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) 17305750 ns 17461354.5 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) 17738666 ns 17833459 ns 0.99
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) 16523396 ns 16559937.5 ns 1.00
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) 19975812.5 ns 19938750 ns 1.00
batchedmm(128, Bsize=512)/zygote/GPU/CUDA 2621357 ns 2619194 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 584 ns 625 ns 0.93
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 584 ns 500 ns 1.17
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 625 ns 666 ns 0.94
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 584 ns 583 ns 1.00
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA 31387 ns 33463 ns 0.94
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 9375 ns 9292 ns 1.01
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 9625 ns 9458 ns 1.02
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 9666.5 ns 9375 ns 1.03
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 9584 ns 9187.5 ns 1.04
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA 248606 ns 252733 ns 0.98
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 651447167 ns 651812167 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 390769354.5 ns 390086667 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 339760625 ns 327502625 ns 1.04
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 749321584 ns 747314333 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12465839.5 ns 12474949 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 1892321750.5 ns 1879705041.5 ns 1.01
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 1639787791 ns 1650371917 ns 0.99
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 1515586291.5 ns 1514378771 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 2197131604 ns 2204966313 ns 1.00
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49313815.5 ns 49428315 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1618083 ns 1651458 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1193083.5 ns 1196083 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1358750 ns 1387103.5 ns 0.98
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2366750.5 ns 2353958 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215176 ns 217144 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12683125 ns 12704667 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9968083.5 ns 9935187.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9611958 ns 9671333.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18417271 ns 18432334 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2006331 ns 2021545.5 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17616792 ns 17670625 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14686375 ns 14743791.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14535312.5 ns 14593292 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21460041 ns 21437146 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 26292 ns 26250 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 26333 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 26292 ns 26333 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 26250 ns 26292 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA 23829 ns 24013 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 67083 ns 67166 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67500 ns 67208 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 67333 ns 67917 ns 0.99
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66875 ns 66958 ns 1.00
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA 371762 ns 380547.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 203959 ns 202875 ns 1.01
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 210417 ns 210375 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 209708 ns 209916 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 199625 ns 198750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 25968.5 ns 25898 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 644291 ns 645354 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 624500 ns 637500.5 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 621688 ns 634542 ns 0.98
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 633208.5 ns 634250 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 326634.5 ns 326606.5 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 640396 ns 672209 ns 0.95
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 671834 ns 637917 ns 1.05
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 659042 ns 665042 ns 0.99
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 678959 ns 664917 ns 1.02
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 132210 ns 131949 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2235354 ns 2224563 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2239791.5 ns 2248771 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2263812 ns 2241125 ns 1.01
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2239458 ns 2237000 ns 1.00
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1097894 ns 1095016 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 17250 ns 17417 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 17167 ns 17333 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 19667 ns 19500 ns 1.01
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 18333 ns 16875 ns 1.09
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA 132452.5 ns 133320 ns 0.99
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 221770.5 ns 260770.5 ns 0.85
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 219750 ns 219458.5 ns 1.00
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 219208 ns 229000 ns 0.96
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 232250 ns 263334 ns 0.88
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 908726 ns 947049 ns 0.96
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 625 ns 625 ns 1
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 584 ns 666 ns 0.88
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 625 ns 667 ns 0.94
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 666 ns 584 ns 1.14
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA 23608 ns 23873 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 9667 ns 10000 ns 0.97
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 10083 ns 9750 ns 1.03
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 10209 ns 10125 ns 1.01
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 9625 ns 9750 ns 0.99
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA 244776.5 ns 245331.5 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) 5292 ns 5375 ns 0.98
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) 5959 ns 5625 ns 1.06
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) 6209 ns 6604.5 ns 0.94
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) 5020.5 ns 5000 ns 1.00
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA 202256 ns 209896.5 ns 0.96
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 7625 ns 7875 ns 0.97
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 7833 ns 7292 ns 1.07
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 7333 ns 7687.5 ns 0.95
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 7292 ns 7334 ns 0.99
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA 716742.5 ns 739872 ns 0.97
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 2458 ns 2041 ns 1.20
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 2229.5 ns 2250 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 2334 ns 2458 ns 0.95
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 2125 ns 2084 ns 1.02
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA 17972.5 ns 18207 ns 0.99
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 6583 ns 6542 ns 1.01
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 6708 ns 6458 ns 1.04
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 6542 ns 6708 ns 0.98
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 6750 ns 6541 ns 1.03
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA 296406 ns 306864 ns 0.97
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) 748875 ns 747125 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) 746750 ns 749958.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) 749167 ns 747167 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) 752250 ns 771333.5 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA 20898 ns 21305 ns 0.98
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) 796000 ns 791000 ns 1.01
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) 772750 ns 780041.5 ns 0.99
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) 772750 ns 775416 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) 792625 ns 794812.5 ns 1.00
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA 273656.5 ns 271390 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 6958 ns 6959 ns 1.00
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 6125 ns 6000 ns 1.02
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 6208 ns 6125 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 10250 ns 10167 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 33214 ns 33759 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 232416.5 ns 259750 ns 0.89
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 228208 ns 238854 ns 0.96
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 227229.5 ns 231104 ns 0.98
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 253145.5 ns 250208 ns 1.01
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 335313 ns 336384 ns 1.00
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) 10333 ns 10125 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) 10083 ns 10312.5 ns 0.98
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) 11020.5 ns 10875 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) 10291 ns 10167 ns 1.01
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA 211237 ns 223921.5 ns 0.94
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 24750 ns 24167 ns 1.02
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 25584 ns 24583 ns 1.04
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 24667 ns 25333 ns 0.97
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 24375 ns 24584 ns 0.99
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA 1030501.5 ns 1062400 ns 0.97
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 106087854.5 ns 106104729.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 116949583 ns 117502187.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 120635291 ns 120758625 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 117517042 ns 117423500 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2650195 ns 2624434 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 393002917 ns 392280708 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 366126708 ns 358697709 ns 1.02
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 356418042 ns 357440917 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 540646313 ns 540821208.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 15180136 ns 15254730 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 784599250 ns 781416292 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 758649208 ns 760831458 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 752039895.5 ns 750885583.5 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 787142437.5 ns 784554021 ns 1.00
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) 6417 ns 7583 ns 0.85
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) 7459 ns 6875 ns 1.08
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) 7958 ns 8208 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6125 ns 7917 ns 0.77
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA 207777 ns 214784 ns 0.97
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 14208 ns 14542 ns 0.98
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 14104.5 ns 13667 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 14542 ns 14125 ns 1.03
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 14292 ns 14375 ns 0.99
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA 980639 ns 1015761 ns 0.97
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) 6125 ns 5750 ns 1.07
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) 6416 ns 6125 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) 6667 ns 7500 ns 0.89
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) 5209 ns 5500 ns 0.95
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA 206305.5 ns 211436.5 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) 13000 ns 12875 ns 1.01
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) 13000 ns 12417 ns 1.05
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) 13000 ns 12687.5 ns 1.02
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) 12833 ns 13042 ns 0.98
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA 703910.5 ns 728295 ns 0.97
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) 6042 ns 5250 ns 1.15
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) 6042 ns 5709 ns 1.06
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) 6542 ns 6542 ns 1
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) 5166 ns 5375 ns 0.96
batchedmm(2, Bsize=128)/forward/GPU/CUDA 17119 ns 17219 ns 0.99
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) 15958 ns 15750 ns 1.01
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) 15833 ns 15375 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) 16125 ns 15584 ns 1.03
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) 16083 ns 15916 ns 1.01
batchedmm(2, Bsize=128)/zygote/GPU/CUDA 188536 ns 188803.5 ns 1.00
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 417 ns 0.90
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 333 ns 417 ns 0.80
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 417 ns 334 ns 1.25
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 23430 ns 23653 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6250 ns 6583 ns 0.95
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6625 ns 6625 ns 1
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6542 ns 6625 ns 0.99
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6500 ns 6375 ns 1.02
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 228627.5 ns 227179 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) 5917 ns 5958 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) 5875 ns 6041 ns 0.97
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) 5959 ns 5959 ns 1
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) 6000 ns 5875 ns 1.02
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA 24479 ns 24470 ns 1.00
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) 21000 ns 21520.5 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) 20917 ns 21209 ns 0.99
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) 21167 ns 21667 ns 0.98
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) 21458 ns 21334 ns 1.01
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA 251996 ns 249183.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 144895.5 ns 144062.5 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 144542 ns 143042 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 148208 ns 146334 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 184458 ns 188146 ns 0.98
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 167363.5 ns 167467 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 1308021 ns 1317583 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 1304542 ns 1321709 ns 0.99
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 1190917 ns 1365791.5 ns 0.87
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 1327812.5 ns 1318666 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1248522 ns 1237894 ns 1.01
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) 23917 ns 24708 ns 0.97
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) 22167 ns 24375 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) 25000 ns 24375 ns 1.03
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) 22291 ns 22374.5 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA 317685.5 ns 318636 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) 177479.5 ns 134750 ns 1.32
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) 118500 ns 181250 ns 0.65
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) 117895.5 ns 130000 ns 0.91
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) 130313 ns 130958 ns 1.00
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA 1369049 ns 1345187.5 ns 1.02
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 416 ns 417 ns 1.00
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 416 ns 333 ns 1.25
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA 23211 ns 23482 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 6375 ns 6625 ns 0.96
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 6875 ns 6500 ns 1.06
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 6625 ns 6708 ns 0.99
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 6417 ns 6792 ns 0.94
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA 244448 ns 243071 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 4625 ns 4625 ns 1
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 4625 ns 4541.5 ns 1.02
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 5250 ns 5333 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 4500 ns 4583 ns 0.98
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 232495 ns 231105.5 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 10250 ns 9875 ns 1.04
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 10208 ns 9916.5 ns 1.03
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 10500 ns 10417 ns 1.01
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 10417 ns 10375 ns 1.00
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 1282590 ns 1276883 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) 1625 ns 1625 ns 1
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) 1667 ns 1625 ns 1.03
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) 1666 ns 1667 ns 1.00
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA 23093 ns 23221 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) 5667 ns 5750 ns 0.99
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) 5959 ns 5750 ns 1.04
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) 5750 ns 6083 ns 0.95
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) 5750 ns 5709 ns 1.01
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA 263887.5 ns 262260 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 6859958 ns 6814041 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 6349229 ns 6367459 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 6486250 ns 6578812.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 7533145.5 ns 7695958 ns 0.98
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 215187 ns 214554 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 24039416.5 ns 24052709 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 21294937.5 ns 21310875 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 20970625 ns 21123834 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 29781625 ns 29855166.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 2104879.5 ns 2121783 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 48555729 ns 48838979.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 45322250 ns 45549667 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 45685542 ns 45706771 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 49355500 ns 49408500 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) 5750 ns 5875 ns 0.98
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) 6416 ns 5709 ns 1.12
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) 6709 ns 6708 ns 1.00
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) 5666 ns 5541 ns 1.02
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA 210939 ns 212106.5 ns 0.99
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8458 ns 8875 ns 0.95
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8541 ns 8167 ns 1.05
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8833 ns 8542 ns 1.03
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8500 ns 8208 ns 1.04
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA 993327.5 ns 1001631 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 1512250 ns 1556417 ns 0.97
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 1278521.5 ns 1270792 ns 1.01
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 1621229 ns 1624187.5 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2168270.5 ns 2180520.5 ns 0.99
lenet(28, 28, 1, 128)/forward/GPU/CUDA 271357.5 ns 274298 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 7839709 ns 7888792 ns 0.99
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 6591250 ns 6591250 ns 1
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 7072479.5 ns 7197854 ns 0.98
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 10485333.5 ns 10478229.5 ns 1.00
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1768229 ns 1773709 ns 1.00
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) 373584 ns 366500 ns 1.02
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) 384291 ns 371020.5 ns 1.04
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) 460208.5 ns 457708 ns 1.01
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) 21708 ns 33208.5 ns 0.65
batchedmm(128, Bsize=4)/forward/GPU/CUDA 43027 ns 47286 ns 0.91
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) 718458 ns 723916.5 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) 794250 ns 801750 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) 1056291.5 ns 1064875 ns 0.99
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) 118375 ns 115334 ns 1.03
batchedmm(128, Bsize=4)/zygote/GPU/CUDA 284783.5 ns 287209.5 ns 0.99
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) 397459 ns 397291 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) 288292 ns 287834 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) 288042 ns 288166 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) 750750 ns 750833 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA 43337 ns 44324 ns 0.98
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) 662208 ns 661875 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) 531708 ns 532416 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) 534375 ns 535458 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) 974250 ns 973250 ns 1.00
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA 189452.5 ns 191330.5 ns 0.99
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 643959 ns 670958 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 648000 ns 644229 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 654729 ns 680667 ns 0.96
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 674875 ns 648125 ns 1.04
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 131586 ns 132061.5 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2467333.5 ns 2459333 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2285979 ns 2456084 ns 0.93
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2491500 ns 2464542 ns 1.01
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2449104.5 ns 2456083 ns 1.00
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1236661 ns 1216753 ns 1.02
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) 3708 ns 3708 ns 1
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) 3542 ns 3334 ns 1.06
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) 3875 ns 4334 ns 0.89
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) 3250 ns 2667 ns 1.22
batchedmm(2, Bsize=32)/forward/GPU/CUDA 16170 ns 16517 ns 0.98
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) 5750 ns 5500 ns 1.05
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) 5916 ns 5458 ns 1.08
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) 5958 ns 5625 ns 1.06
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) 5792 ns 5542 ns 1.05
batchedmm(2, Bsize=32)/zygote/GPU/CUDA 184224.5 ns 186819.5 ns 0.99
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1461209 ns 1458167 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1497833 ns 1500500 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1499334 ns 1499333 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1435583 ns 1437750 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA 39820 ns 39930 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4764375 ns 5130750 ns 0.93
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5289146 ns 5285584 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5166458 ns 5315979 ns 0.97
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4994750 ns 4998959 ns 1.00
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 195352 ns 195663 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) 3750 ns 3708 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) 3750 ns 3709 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) 3750 ns 3750 ns 1
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) 3708 ns 3750 ns 0.99
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA 33527 ns 33499 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) 15333 ns 15375 ns 1.00
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) 15500 ns 15417 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) 15500 ns 15500 ns 1
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) 15333 ns 15167 ns 1.01
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA 352561.5 ns 351211 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) 70875 ns 70667 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) 71208 ns 71208 ns 1
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) 71458 ns 71959 ns 0.99
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) 71041 ns 71333 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA 112791 ns 113147 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 318375 ns 318500 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 319500 ns 318000 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 335167 ns 323666 ns 1.04
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 318520.5 ns 317125 ns 1.00
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA 192194 ns 195331 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) 1083 ns 1084 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) 1084 ns 1125 ns 0.96
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) 1125 ns 1084 ns 1.04
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) 1125 ns 1000 ns 1.13
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA 23034 ns 23576 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) 8417 ns 8458 ns 1.00
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) 8125 ns 8334 ns 0.97
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) 8375 ns 8292 ns 1.01
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) 8208 ns 8375 ns 0.98
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA 246121.5 ns 249171.5 ns 0.99
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) 506083 ns 506709 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) 492209 ns 492375 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) 560459 ns 562708 ns 1.00
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) 218042 ns 222187.5 ns 0.98
batchedmm(128, Bsize=32)/forward/GPU/CUDA 128098 ns 129166 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) 1379666.5 ns 1387250 ns 0.99
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) 1410750 ns 1449208 ns 0.97
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) 1735958.5 ns 1788375 ns 0.97
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) 870542 ns 865812.5 ns 1.01
batchedmm(128, Bsize=32)/zygote/GPU/CUDA 273083 ns 273491 ns 1.00
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) 375 ns 375 ns 1
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) 417 ns 292 ns 1.43
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) 375 ns 416 ns 0.90
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) 375 ns 333 ns 1.13
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA 31549 ns 32843 ns 0.96
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) 6500 ns 6667 ns 0.97
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) 6583 ns 6458 ns 1.02
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) 6542 ns 6625 ns 0.99
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) 6334 ns 6458 ns 0.98
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA 248517.5 ns 250973.5 ns 0.99
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1723916.5 ns 1722042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1733083 ns 1723208.5 ns 1.01
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1726229 ns 1721083 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1770167 ns 1723750 ns 1.03
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 168350 ns 168847 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 4370791.5 ns 4362042 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 4361667 ns 4261187.5 ns 1.02
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 4259562.5 ns 4415583.5 ns 0.96
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4371459 ns 4366958.5 ns 1.00
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 1150258 ns 1143038 ns 1.01
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) 6583 ns 6750 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) 6667 ns 6959 ns 0.96
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) 6875 ns 6959 ns 0.99
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) 6583 ns 6708.5 ns 0.98
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA 19334 ns 20756 ns 0.93
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 51125 ns 51417 ns 0.99
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 50895.5 ns 32917 ns 1.55
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 32791 ns 33333 ns 0.98
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 71104.5 ns 51208.5 ns 1.39
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA 196844.5 ns 197240.5 ns 1.00
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) 18083 ns 17542 ns 1.03
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) 18208 ns 17875 ns 1.02
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) 18250 ns 18916 ns 0.96
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) 17417 ns 17750 ns 0.98
batchedmm(2, Bsize=512)/forward/GPU/CUDA 18164 ns 18861 ns 0.96
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) 53416 ns 53458 ns 1.00
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) 53666 ns 53334 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) 53625 ns 53250 ns 1.01
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) 53875 ns 53500 ns 1.01
batchedmm(2, Bsize=512)/zygote/GPU/CUDA 319761 ns 319618.5 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) 75375 ns 75292 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) 75292 ns 75375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) 75584 ns 75792 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) 75250 ns 75208 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA 46528 ns 47162 ns 0.99
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) 324750 ns 324375 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) 331916 ns 327625 ns 1.01
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) 342000 ns 329583 ns 1.04
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) 325375 ns 324208 ns 1.00
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA 209260 ns 211676.5 ns 0.99
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 1486417 ns 1484375 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 1526791 ns 1527958 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 1526875 ns 1527583 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 1463375 ns 1462209 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 51992.5 ns 51967 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 5107000 ns 5124708 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 5260791.5 ns 5280333 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 5161042 ns 5332500 ns 0.97
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 4983187.5 ns 4985875 ns 1.00
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 201588 ns 202369.5 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) 28250 ns 28250 ns 1
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) 28250 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) 28250 ns 28333 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) 28250 ns 28291 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA 24143 ns 24821 ns 0.97
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) 66417 ns 66459 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) 67000 ns 66458 ns 1.01
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) 66834 ns 66833 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) 66375 ns 66416 ns 1.00
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA 492191 ns 482606 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1495250.5 ns 1501229 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 1147437.5 ns 1127563 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1073479 ns 1119291.5 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2230020.5 ns 2246375 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 584155.5 ns 570915 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 3049917 ns 3082875 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 2725583 ns 2738375 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 2748541.5 ns 2760354 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 3815417 ns 3780667 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 2003962 ns 1961915 ns 1.02
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 7917521 ns 7895333 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 7897062 ns 7893459 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 7875417 ns 7944812.5 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 4816000 ns 4834521 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) 79375 ns 80959 ns 0.98
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) 81312.5 ns 80333 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) 82625 ns 82166 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) 81292 ns 134375.5 ns 0.60
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA 193985 ns 193995.5 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) 2006291 ns 2014625 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) 2022000 ns 2006229 ns 1.01
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) 2045021 ns 2047021 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) 2017334 ns 2022958 ns 1.00
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA 750509 ns 740969 ns 1.01

This comment was automatically generated by workflow using github-action-benchmark.
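
The Ratio column in the table above appears to be the current timing divided by the previous one, so values above 1 mean the current commit is slower. A minimal Python sketch that recomputes it for three rows copied from the table; the 10% band used to flag rows is a hypothetical choice for illustration, not a threshold configured by this workflow:

```python
# Rows copied from the benchmark table: (name, current ns, previous ns, reported ratio).
ROWS = [
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)", 71104.5, 51208.5, 1.39),
    ("batchedmm(2, Bsize=512)/forward/CPU/2 thread(s)", 18083, 17542, 1.03),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)", 81292, 134375.5, 0.60),
]


def ratio(current_ns, previous_ns):
    """Current / previous, rounded to two decimals like the table."""
    return round(current_ns / previous_ns, 2)


def outside_band(rows, band=0.10):
    """Names of rows whose ratio differs from 1.0 by more than `band`.

    `band` is a hypothetical tolerance chosen for this sketch.
    """
    return [name for name, cur, prev, _ in rows if abs(cur / prev - 1.0) > band]


for name, cur, prev, reported in ROWS:
    print(f"{ratio(cur, prev):.2f} (reported {reported:.2f})  {name}")

print("outside the 10% band:", outside_band(ROWS))
```

With these three rows, the bias_activation Zygote row (slower) and the single-threaded groupnorm forward row (faster) fall outside the band, while the batchedmm row is within noise.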

@wsmoses
Contributor

wsmoses commented Jan 5, 2025

@avik-pal sorry, you'll probably need to restart the tests. I just did a version bump for Enzyme's interpreter and then one for Reactant, which will let us improve precompilation.

@avik-pal
Member Author

avik-pal commented Jan 5, 2025

One good thing is that I managed to get an example (the SimpleRNN one) which reproduces the error (https://discourse.julialang.org/t/trying-to-implement-vae-using-lux-and-reactant/124353?u=avikpal). The bad part is that it involves a 20K-line IR 😓. I suspect it comes from a scalar indexing issue, but I don't know what is causing it or why the getindex doesn't error in the first place.

@avik-pal
Member Author

avik-pal commented Jan 5, 2025

This PR also adds a LUX_DUMP_REACTANT_HLO_OPTIMIZE env var to dump the HLO compiled inside the training functions

@avik-pal
Member Author

avik-pal commented Jan 6, 2025

envs/incorrect_ir.mlir:24176:34: error: use of undeclared SSA value name
    %312 = "stablehlo.transpose"(%430) <{permutation = array<i64: 0>}> : (tensor<2xui64>) -> tensor<2xui64>
                                 ^

We are definitely generating something incorrect here. %430 is defined in a different block, so it is not visible at the point where the transpose uses it.
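
To illustrate the class of error above, here is a minimal Python sketch of a scope check over a toy, MLIR-like text. It is not how MLIR's verifier works: the line-based parsing, the `toy.*` op names, and the function `find_undeclared_uses` are all assumptions made for illustration. It only models the rule that a value defined inside a nested region is not visible once that region has closed.

```python
import re

SSA_NAME = re.compile(r"%[A-Za-z0-9_]+")


def find_undeclared_uses(ir_text):
    """Return (line_number, name) for each SSA use with no visible definition.

    Toy model: a line ending in "{" opens a region, a line starting
    with "}" closes one, and names defined in a region disappear when
    it closes. Enclosing regions stay visible from inside.
    """
    scopes = [set()]  # stack of name sets, innermost region last
    problems = []
    for lineno, raw in enumerate(ir_text.splitlines(), start=1):
        line = raw.strip()
        if not line:
            continue
        if line.startswith("}") and len(scopes) > 1:
            scopes.pop()  # leaving a region: its names go out of scope
        lhs, sep, rhs = line.partition(" = ")
        if line.startswith("^"):
            # Block header: block arguments count as definitions.
            defined, used = SSA_NAME.findall(line), []
        elif sep and lhs.startswith("%"):
            defined, used = SSA_NAME.findall(lhs), SSA_NAME.findall(rhs)
        else:
            defined, used = [], SSA_NAME.findall(line)
        for name in used:
            if not any(name in scope for scope in scopes):
                problems.append((lineno, name))
        scopes[-1].update(defined)
        if line.endswith("{"):
            scopes.append(set())
    return problems


# Hypothetical reduction: %430 is defined inside a region, then used outside it.
TOY_IR = "\n".join([
    '%0 = "toy.constant"() : () -> tensor<2xui64>',
    '"toy.region_op"() ({',
    '  %430 = "toy.add"(%0, %0) : (tensor<2xui64>, tensor<2xui64>) -> tensor<2xui64>',
    '  "toy.yield"(%430) : (tensor<2xui64>) -> ()',
    '}) : () -> ()',
    '%312 = "stablehlo.transpose"(%430) <{permutation = array<i64: 0>}> : (tensor<2xui64>) -> tensor<2xui64>',
])

print(find_undeclared_uses(TOY_IR))
```

On the toy input the only reported use is `%430` on the last line, mirroring the shape of the error in the log. A real 20K-line module would need a proper MLIR parser rather than this line-based heuristic.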

@avik-pal avik-pal linked an issue Jan 7, 2025 that may be closed by this pull request
@avik-pal avik-pal linked an issue Jan 7, 2025 that may be closed by this pull request
Development

Successfully merging this pull request may close these issues.

WeightInitializers.DeviceAgnostic doesn't respect Reactant
Incorrect IR generated for some neural networks
2 participants