docs: migrate most examples to Reactant #1180
base: main
Benchmark Results (ASV)
Benchmark Plots: A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Lux Benchmarks
Benchmark suite | Current: 706a05a | Previous: 46a012d | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 4000 ns | 3791 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4334 ns | 4500 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4959 ns | 4875 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3916.5 ns | 3666 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 64908 ns | 59711.5 ns | 1.09 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10604.5 ns | 10167 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 11209 ns | 10458 ns | 1.07 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 11125 ns | 10750 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9916 ns | 10625 ns | 0.93 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 425347 ns | 419469 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1083 ns | 1062.5 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1250 ns | 1167 ns | 1.07 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1292 ns | 1500 ns | 0.86 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1042 ns | 1125 ns | 0.93 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 18063 ns | 18540 ns | 0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4083 ns | 4083 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 4083 ns | 4042 ns | 1.01 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4209 ns | 4208 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4020.5 ns | 3958 ns | 1.02 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 110564 ns | 109802.5 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 57333 ns | 57542 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46500 ns | 46416 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47209 ns | 47125 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82333 ns | 80875 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37199 ns | 37744 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2033020.5 ns | 2035395.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2085895.5 ns | 2078396 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2101958 ns | 2078708 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1997062.5 ns | 1998584 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 195555 ns | 195463 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144041 ns | 144250 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144584 ns | 144166.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 145166.5 ns | 145125 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 143750 ns | 153104.5 ns | 0.94 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 166186.5 ns | 165592.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1117854 ns | 1120291.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1116167 ns | 1113167 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1136000 ns | 832708.5 ns | 1.36 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1119083 ns | 1117084 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 529755 ns | 520015.5 ns | 1.02 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3667 ns | 3375 ns | 1.09 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3708 ns | 3542 ns | 1.05 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4333 ns | 4166 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3417 ns | 3125 ns | 1.09 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 71440.5 ns | 66073.5 ns | 1.08 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 9917 ns | 9042 ns | 1.10 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9375 ns | 8750 ns | 1.07 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9416 ns | 10208 ns | 0.92 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 9208 ns | 8833 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 475729 ns | 469701 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15250 ns | 17041 ns | 0.89 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15959 ns | 15834 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 16333 ns | 16604.5 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15209 ns | 16791 ns | 0.91 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 55464 ns | 54530 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 215833 ns | 213750 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 215042 ns | 214875 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 214083 ns | 215667 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 215541 ns | 226125 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 277083 ns | 269469 ns | 1.03 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 542 ns | 1.15 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 875 ns | 708 ns | 1.24 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 584 ns | 709 ns | 0.82 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 541 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 17559 ns | 17336 ns | 1.01 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1541 ns | 1375 ns | 1.12 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1500 ns | 1375 ns | 1.09 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1500 ns | 1500 ns | 1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1417 ns | 1458 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 102642 ns | 100554 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7208 ns | 7000 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5750 ns | 1.02 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5917 ns | 6042 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10000 ns | 9750 ns | 1.03 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 23623 ns | 23286 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221166 ns | 222021 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 229083 ns | 228542 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229083 ns | 229292 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 214416.5 ns | 213937.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 169773 ns | 166141.5 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3916 ns | 3875 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3916 ns | 3917 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3958 ns | 3959 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3917 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23615 ns | 23204 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16708 ns | 16917 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16875 ns | 16792 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16750 ns | 17250 ns | 0.97 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16875 ns | 16750 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 164235.5 ns | 164061.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 578041 ns | 568792 ns | 1.02 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 578458 ns | 578645.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 580333 ns | 578083 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 574667 ns | 575625 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113670 ns | 113438.5 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1381458.5 ns | 1422625 ns | 0.97 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1417875 ns | 1420000 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1422125 ns | 1422375 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1420792 ns | 1426708 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 213871 ns | 213572 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1082646 ns | 1077687.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 975542 ns | 960917 ns | 1.02 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1348500 ns | 1353229.5 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1302083 ns | 1315312 ns | 0.99 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 277486.5 ns | 274529.5 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5993833 ns | 5961958 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4614270.5 ns | 4633250 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4932083 ns | 4975188 ns | 0.99 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5723020.5 ns | 5557125 ns | 1.03 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1095707 ns | 1081948 ns | 1.01 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 542 ns | 583 ns | 0.93 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23887 ns | 23910 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2209 ns | 2208 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2167 ns | 2250 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2250 ns | 2167 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2208 ns | 2125 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 172803.5 ns | 176064.5 ns | 0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4208 ns | 4125 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 3917 ns | 4375 ns | 0.90 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5166 ns | 5167 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4125 ns | 4250 ns | 0.97 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 66602.5 ns | 65504 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11708 ns | 11875 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11625 ns | 11000 ns | 1.06 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11791 ns | 11917 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11417 ns | 11500 ns | 0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 453473 ns | 448080.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6209 ns | 7000 ns | 0.89 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7083.5 ns | 6958 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7958 ns | 8250 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6209 ns | 6125 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 53188.5 ns | 52534 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 17250 ns | 18708.5 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 19521 ns | 18625 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18291 ns | 18375 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 18291.5 ns | 16708 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 299625.5 ns | 296471 ns | 1.01 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 625 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 708 ns | 0.82 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 583 ns | 667 ns | 0.87 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 584 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 32863 ns | 33481 ns | 0.98 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9333 ns | 8834 ns | 1.06 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9125 ns | 8875 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9292 ns | 9334 ns | 1.00 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8708 ns | 8354.5 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 161130 ns | 158505 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64541 ns | 64459 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64541 ns | 64750 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64959 ns | 64916 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64625 ns | 64625 ns | 1 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 112602 ns | 112347 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 275875 ns | 279250 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 279833 ns | 282167 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 283458 ns | 284125 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 283583 ns | 278708 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 189085 ns | 187244.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3299167 ns | 3278417 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3026270.5 ns | 3081000 ns | 0.98 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3019187.5 ns | 3021792 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4024583 ns | 4040979.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 584255.5 ns | 573775.5 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7515459 ns | 7620208 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7456208 ns | 7449187.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7490895.5 ns | 7493708.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8188916 ns | 8208791 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1347259 ns | 1340015.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 17415104.5 ns | 18366417 ns | 0.95 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 17525145.5 ns | 17522312.5 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 17569875 ns | 17580834 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 14180208.5 ns | 14093354.5 ns | 1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23591729 ns | 23631333 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33628750 ns | 33504604 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37137000 ns | 37034667 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34864416.5 ns | 34967583.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1862603 ns | 1860248 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 187602250 ns | 189693000 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 163289375 ns | 165014875 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 152560000 ns | 152416688 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 434843584 ns | 434850958 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 13906335 ns | 13871408 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 287885958 ns | 289105312.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 251098583 ns | 250867083 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 294830208 ns | 296775875 ns | 0.99 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 474116666.5 ns | 473537562.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22333 ns | 22083 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22209 ns | 22459 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 24667 ns | 25375 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 21250 ns | 24083 ns | 0.88 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 96476 ns | 95417 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 104083.5 ns | 103083 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 104791 ns | 103250 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 104125 ns | 104542 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103875 ns | 103041 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 508784.5 ns | 502007.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6000 ns | 5917 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6000 ns | 5958 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7042 ns | 6708 ns | 1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5459 ns | 5791.5 ns | 0.94 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 68961 ns | 68401.5 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13770.5 ns | 14792 ns | 0.93 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 15375 ns | 15000 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16187.5 ns | 16542 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14604.5 ns | 14875 ns | 0.98 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 480016 ns | 475091.5 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3034541 ns | 3002625 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2081333 ns | 2079375 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2270063 ns | 2272333 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4522416.5 ns | 4882708 ns | 0.93 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 587464 ns | 586443 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23649250.5 ns | 23536000 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18052375 ns | 18038562.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 16973500 ns | 16972167 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 34879666.5 ns | 34545146 ns | 1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 2921251.5 ns | 2768189 ns | 1.06 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33967979 ns | 33221458 ns | 1.02 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27495520.5 ns | 27561792 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27395334 ns | 27327000 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41094750 ns | 42034750 ns | 0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 74791 ns | 71417 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 74000 ns | 71854.5 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75208 ns | 75708 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74520.5 ns | 74708 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 103185 ns | 101188 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 252229 ns | 205250.5 ns | 1.23 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 218416.5 ns | 206750 ns | 1.06 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 318417 ns | 208958 ns | 1.52 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 281145.5 ns | 217416 ns | 1.29 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 546577.5 ns | 541638 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11792 ns | 11875 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11834 ns | 11416 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12687.5 ns | 12958 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11792 ns | 11708 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 71493 ns | 70557.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26792 ns | 25667 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26875 ns | 26541.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 27583 ns | 27729.5 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26625 ns | 26667 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 472979.5 ns | 468068.5 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12042 ns | 12812.5 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 14666 ns | 12209 ns | 1.20 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 13792 ns | 14208 ns | 0.97 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11584 ns | 12291.5 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 53417.5 ns | 52262 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 25375 ns | 25625 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25750 ns | 25916.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26209 ns | 26250 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 26125 ns | 26604 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 301308.5 ns | 297345.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 179709 ns | 178792 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 180250 ns | 180750 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 180833 ns | 181917 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 179770.5 ns | 179166 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 56746 ns | 56939 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 601167 ns | 593333 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 583459 ns | 582708 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 593854.5 ns | 583667 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 587750 ns | 584542 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 287220 ns | 282717 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6479.5 ns | 6167 ns | 1.05 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5687.5 ns | 5875 ns | 0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6875 ns | 6875 ns | 1 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5834 ns | 5708.5 ns | 1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 70626.5 ns | 69908.5 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14292 ns | 13791 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14459 ns | 13917 ns | 1.04 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15292 ns | 15667 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14208 ns | 14458 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 464130 ns | 454508 ns | 1.02 |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1175917 ns | 1225312.5 ns | 0.96 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1245125 ns | 1241959 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1285584 ns | 1289958.5 ns | 1.00 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 999417 ns | 1011625 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301338 ns | 300319.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4109000 ns | 4103042 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4433333 ns | 4403333 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4513562.5 ns | 4523854.5 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 3727604 ns | 3709771 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1048333 ns | 1034770 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1875 ns | 1875 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1875 ns | 1875 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1875 ns | 1916 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1875 ns | 1875 ns | 1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 23450 ns | 23619 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4917 ns | 4958 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4917 ns | 5000 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5000 ns | 4958 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4917 ns | 4875 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 188032 ns | 186116 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5625 ns | 5833 ns | 0.96 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6292 ns | 5917 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6667 ns | 6667 ns | 1 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5458 ns | 5209 ns | 1.05 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 54508 ns | 54405.5 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11125 ns | 11125 ns | 1 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11250 ns | 11500 ns | 0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10625 ns | 11458 ns | 0.93 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11104 ns | 10500 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 322621.5 ns | 320192 ns | 1.01 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 375 ns | 0.78 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 375 ns | 333 ns | 1.13 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 334 ns | 375 ns | 0.89 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 22578 ns | 22488.5 ns | 1.00 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2750 ns | 2792 ns | 0.98 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 3000 ns | 2833 ns | 1.06 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2875 ns | 3083 ns | 0.93 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2750 ns | 2750 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 157805 ns | 157059.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
11417 ns |
11459 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
11833 ns |
11625 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
12666.5 ns |
12875 ns |
0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
10667 ns |
10958 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
56499 ns |
55353 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
25125 ns |
25020.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
25084 ns |
25292 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
25208 ns |
25125 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
25000 ns |
24875 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
287174 ns |
284593.5 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
4250 ns |
4250 ns |
1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
4209 ns |
4250 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
4292 ns |
4250 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
4250 ns |
4208 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA |
25020 ns |
24743 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16417 ns |
16333 ns |
1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16250 ns |
16375 ns |
0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16791 ns |
16520.5 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16583 ns |
16208 ns |
1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA |
192990 ns |
192574 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5792 ns |
5833 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
5875 ns |
5833 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5833 ns |
6042 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5834 ns |
5833 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA |
32887 ns |
33721.5 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
19917 ns |
21000 ns |
0.95 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
21104.5 ns |
21000 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
21458 ns |
21417 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
20833 ns |
20709 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA |
172782.5 ns |
172002 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) |
421479.5 ns |
422124.5 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) |
387458 ns |
387791 ns |
1.00 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) |
483666.5 ns |
477333 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) |
104125 ns |
103125 ns |
1.01 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA |
66577 ns |
66716 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) |
858542 ns |
921333 ns |
0.93 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) |
977646 ns |
974250 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) |
1191354 ns |
1186458 ns |
1.00 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) |
467937.5 ns |
457479.5 ns |
1.02 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA |
192502.5 ns |
189036 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
112208 ns |
80542 ns |
1.39 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
81563 ns |
80709 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
83208 ns |
84896 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
80417 ns |
79833 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
193461 ns |
193358.5 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1637291 ns |
1919250 ns |
0.85 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1921875 ns |
1876583 ns |
1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1912999.5 ns |
1946041 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1916000 ns |
1921396 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
393766 ns |
391971 ns |
1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
333 ns |
1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA |
21781 ns |
21948.5 ns |
0.99 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
1875 ns |
1917 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
1875 ns |
1917 ns |
0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
1917 ns |
1875 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
1834 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA |
166492 ns |
166123 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
6500 ns |
6417 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
6792 ns |
6666 ns |
1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
7667 ns |
7771 ns |
0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5792 ns |
6145.5 ns |
0.94 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
57391 ns |
56772 ns |
1.01 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
9166 ns |
9604.5 ns |
0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
9208 ns |
9459 ns |
0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
9458 ns |
9500 ns |
1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9375 ns |
9041 ns |
1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
297020.5 ns |
294981.5 ns |
1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
120279542 ns |
120459792 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
173916083 ns |
173682208 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
148017417 ns |
147804000 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
105434958 ns |
105720875 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
5470683 ns |
5472285 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
611677521 ns |
610206729.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
555203375 ns |
555562500 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
449267291.5 ns |
452099291.5 ns |
0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
626329062.5 ns |
626409896 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
34976369 ns |
34955764 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
653577709 ns |
657253583 ns |
0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
664059604 ns |
665008062.5 ns |
1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
584797604.5 ns |
581676208.5 ns |
1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
856360417 ns |
857648458 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
59375 ns |
57875 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47917 ns |
47791 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47333 ns |
47500 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82208 ns |
83395.5 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
37811 ns |
37072 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1907167 ns |
1915500 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1979584 ns |
1932792 ns |
1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1982292 ns |
1995084 ns |
0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1887520.5 ns |
1890500 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
173554.5 ns |
171922.5 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
267916.5 ns |
267854.5 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
268187.5 ns |
267708 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
289375 ns |
269750 ns |
1.07 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
267604.5 ns |
268166 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
125126 ns |
123763 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
672417 ns |
594417 ns |
1.13 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
682229.5 ns |
681291 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
697833 ns |
604895.5 ns |
1.15 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
691667 ns |
689917 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
681144 ns |
674236.5 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
2155667 ns |
2176375 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
2209917 ns |
2222812.5 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
2230833 ns |
2205042 ns |
1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
2213250 ns |
2093562.5 ns |
1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
133068 ns |
133331 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5474583 ns |
5514416 ns |
0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5487146 ns |
5508500 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5508750 ns |
5535958 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5505584 ns |
5491750 ns |
1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
730756 ns |
730299 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
644000 ns |
638167 ns |
1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
646333 ns |
647708 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
645875 ns |
659416 ns |
0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
636625 ns |
643750 ns |
0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA |
46573 ns |
46729.5 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
1813145.5 ns |
1822167 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1720875 ns |
1723042 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1730500 ns |
1727833 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
2102458 ns |
2106333 ns |
1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA |
221564 ns |
219682 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58375 ns |
58458 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47583 ns |
46917 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47958 ns |
47292 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
84792 ns |
84125 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
28293.5 ns |
28215 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2023750.5 ns |
2030041 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2086896 ns |
2004250 ns |
1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2092792 ns |
2122125 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1989458 ns |
1985979.5 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
189678 ns |
186715 ns |
1.02 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
13137500 ns |
13357770.5 ns |
0.98 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
12420833 ns |
12440000 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
12485416 ns |
12492250 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
15160708 ns |
15108458 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
515676 ns |
510701.5 ns |
1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
46983250 ns |
47178791.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
41725771 ns |
41760334 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
40959021.5 ns |
40950875 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
58550667 ns |
58205437.5 ns |
1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
3034668 ns |
2894239.5 ns |
1.05 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
96386417 ns |
97014458.5 ns |
0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
90955791.5 ns |
91152834 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
90582166.5 ns |
90701604.5 ns |
1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
98700687.5 ns |
98541521.5 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
58833 ns |
58959 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
47208 ns |
47375 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
47791 ns |
47750 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
80375 ns |
79958 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
48423 ns |
47779.5 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1652167 ns |
1918645.5 ns |
0.86 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1967125 ns |
1971000 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1975771 ns |
1997667 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1889978.5 ns |
1889750 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
196325 ns |
192960 ns |
1.02 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
416 ns |
0.90 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
333 ns |
375 ns |
0.89 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
333 ns |
1.13 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
32529 ns |
33172 ns |
0.98 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6312.5 ns |
6292 ns |
1.00 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6500 ns |
6542 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6750 ns |
6834 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6375 ns |
6125 ns |
1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
173262.5 ns |
171303 ns |
1.01 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) |
292 ns |
292 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA |
33122 ns |
32323 ns |
1.02 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) |
2833 ns |
2833 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) |
2875 ns |
2917 ns |
0.99 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) |
2917 ns |
2917 ns |
1 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) |
2834 ns |
2708 ns |
1.05 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA |
163502.5 ns |
162112.5 ns |
1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) |
283807333.5 ns |
289426812.5 ns |
0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) |
339209291 ns |
339624334 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) |
313321979.5 ns |
315284104.5 ns |
0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) |
271605333 ns |
274668667 ns |
0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA |
7106197 ns |
7120353.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) |
1010796708 ns |
1014634416 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) |
952570250 ns |
953687125 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) |
857678666.5 ns |
857733312.5 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) |
1262433542 ns |
1265357333 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA |
33982981 ns |
33985258 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) |
1664881084 ns |
1675373667 ns |
0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) |
1682055500 ns |
1668941291 ns |
1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) |
1607676375 ns |
1606744000 ns |
1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) |
1777771875 ns |
1787636084 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1413687.5 ns |
1409499.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1461375 ns |
1413833 ns |
1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1420208 ns |
1419895.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1414875 ns |
1458541.5 ns |
0.97 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
128286 ns |
127493 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5003833 ns |
5016749.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5018125 ns |
4651917 ns |
1.08 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5022854 ns |
5058791 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
5026291 ns |
5012792 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
606256 ns |
551564 ns |
1.10 |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) |
169934000 ns |
171852250 ns |
0.99 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) |
136588687.5 ns |
129831062.5 ns |
1.05 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) |
113187625 ns |
115995771 ns |
0.98 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) |
168763583 ns |
168839667 ns |
1.00 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA |
4844336 ns |
4879222 ns |
0.99 |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) |
622027459 ns |
629070333 ns |
0.99 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) |
495127250 ns |
493488792 ns |
1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) |
455703333 ns |
456364583 ns |
1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) |
647652500 ns |
675660292 ns |
0.96 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA |
16164061 ns |
16223916 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) |
8932687 ns |
8950646 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) |
8960208 ns |
8924625 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) |
7842146 ns |
7865125 ns |
1.00 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) |
9760916.5 ns |
9701750 ns |
1.01 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA |
1593338 ns |
1588053 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) |
35890667 ns |
36024125 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) |
37271458 ns |
37000208.5 ns |
1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) |
33270271 ns |
33425875 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) |
37815604 ns |
37661542 ns |
1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA |
6467020.5 ns |
6463767 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) |
47458 ns |
47562.5 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) |
47667 ns |
47416 ns |
1.01 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) |
47583 ns |
47666 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) |
47333.5 ns |
47375 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA |
18409 ns |
17907 ns |
1.03 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) |
50417 ns |
50542 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) |
50500 ns |
50375 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) |
50625 ns |
50584 ns |
1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) |
50292 ns |
50583 ns |
0.99 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA |
221942.5 ns |
184398 ns |
1.20 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
6750 ns |
6958.5 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6958 ns |
6500 ns |
1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7834 ns |
8042 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
6167 ns |
6542 ns |
0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
117277.5 ns |
89066 ns |
1.32 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
10020.5 ns |
10042 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10459 ns |
10437.5 ns |
1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10209 ns |
10500 ns |
0.97 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10709 ns |
10375 ns |
1.03 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
636678 ns |
510214.5 ns |
1.25 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
5750 ns |
5666 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
6250 ns |
5958 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
7562.5 ns |
7417 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
5292 ns |
5458 ns |
0.97 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
131618 ns |
109271 ns |
1.20 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
13375 ns |
13125 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
13625 ns |
13250 ns |
1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
13583 ns |
13375 ns |
1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
13875 ns |
13208 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
534735 ns |
457940.5 ns |
1.17 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
1083 ns |
1083 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
1083 ns |
1083 ns |
1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
1125 ns |
1084 ns |
1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
1083 ns |
1084 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
32184 ns |
32174 ns |
1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
8041 ns |
8000 ns |
1.01 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
8125 ns |
8292 ns |
0.98 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
8375 ns |
8500 ns |
0.99 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7916 ns |
8125 ns |
0.97 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
217587 ns |
199053.5 ns |
1.09 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
23416.5 ns |
23354.5 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
23625 ns |
23250 ns |
1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
23584 ns |
23542 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
23333 ns |
23125 ns |
1.01 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA |
18406 ns |
18347 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
52666 ns |
52667 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
52708 ns |
52584 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
52750 ns |
52750 ns |
1 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
52625 ns |
52417 ns |
1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA |
324876.5 ns |
291115 ns |
1.12 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1399458 ns |
1398084 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1397791 ns |
1402791 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1399625 ns |
1401792 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1396645.5 ns |
1402875 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
196317 ns |
195544.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5003666.5 ns |
5010813 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5011562.5 ns |
5016584 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5009354.5 ns |
5062708 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4929166 ns |
5013500 ns |
0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
642705 ns |
617335 ns |
1.04 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
3060229 ns |
3040417 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
2062833 ns |
2105083 ns |
0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
2271959 ns |
2280208 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
4546292 ns |
4865521 ns |
0.93 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
582192 ns |
579665 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
24422521 ns |
24414604.5 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
18893959 ns |
18876208.5 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
17756020.5 ns |
17652979 ns |
1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
35829479.5 ns |
35825688 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
2983107 ns |
2847809 ns |
1.05 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
33974187.5 ns |
34006188 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
28302938 ns |
28283750 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
28035291.5 ns |
27926083.5 ns |
1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
41470145.5 ns |
41742416.5 ns |
0.99 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) |
144105041 ns |
144750166 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) |
147427375 ns |
146949375 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) |
126884687.5 ns |
126208208.5 ns |
1.01 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) |
173054375 ns |
173205292 ns |
1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA |
22557393 ns |
22782449 ns |
0.99 |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) |
1956977417 ns |
1847080125 ns |
1.06 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) |
885326041 ns |
809911709 ns |
1.09 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) |
1037813833 ns |
755677291 ns |
1.37 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) |
671200250 ns |
667449084 ns |
1.01 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA |
118989039 ns |
118406338 ns |
1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 75000 ns | 76791 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 75959 ns | 76042 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 83500 ns | 76417 ns | 1.09 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 74959 ns | 72541 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 257706 ns | 250232.5 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 279979 ns | 277229 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 238458 ns | 193583 ns | 1.23 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 296208 ns | 205417 ns | 1.44 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 291041.5 ns | 303083.5 ns | 0.96 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1346296 ns | 1279646 ns | 1.05 |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35559750 ns | 35472875 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36337666.5 ns | 36379896 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32414125 ns | 32315333.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40612313 ns | 40618416.5 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5842862 ns | 5840653.5 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 147914959 ns | 146765250 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 152919521 ns | 153200125 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 139208021 ns | 137307792 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 286973916 ns | 285301125 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 34878379 ns | 34880703 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 118953291.5 ns | 120518062.5 ns | 0.99 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174106167 ns | 174031666 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148140750 ns | 148283312.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106397000 ns | 106552271 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5469783.5 ns | 5465282.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 467953125 ns | 469918416 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 466269250 ns | 466837917 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 438982208 ns | 437920916.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 743118750 ns | 739774042 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 32258831 ns | 32269604.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 714490041.5 ns | 711087896 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 639936396 ns | 640897313 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 628040062.5 ns | 630411896 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 852625250 ns | 849787625 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) | 1324000 ns | 1302125 ns | 1.02 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) | 968333.5 ns | 905958 ns | 1.07 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) | 989666 ns | 938334 ns | 1.05 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) | 2099083 ns | 1987437 ns | 1.06 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA | 581941.5 ns | 573939.5 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) | 2954937.5 ns | 2951687.5 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) | 2616209 ns | 2611020.5 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) | 2650208.5 ns | 2639896 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) | 3693125 ns | 3702396 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA | 1845744 ns | 1765767 ns | 1.05 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) | 5805500 ns | 5801417 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) | 5795917 ns | 5727666.5 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) | 5825979.5 ns | 5818916 ns | 1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) | 2900791.5 ns | 2913834 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7375 ns | 7417 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6250 ns | 6166 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6334 ns | 6209 ns | 1.02 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10083 ns | 1 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25366 ns | 25586 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212375 ns | 212792 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 232645.5 ns | 220834 ns | 1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 220375 ns | 221166 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 216416.5 ns | 215459 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 263885 ns | 272866 ns | 0.97 |
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) | 299505333 ns | 300445333 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) | 219250584 ns | 214002042 ns | 1.02 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) | 195972000 ns | 196386541 ns | 1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) | 302200812.5 ns | 307720792 ns | 0.98 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA | 7673524.5 ns | 7675041.5 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) | 1227976646 ns | 1232629833 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) | 894926666.5 ns | 899311645.5 ns | 1.00 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) | 811451708 ns | 825300584 ns | 0.98 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) | 1157110729 ns | 1150330250 ns | 1.01 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA | 26480631 ns | 26367421.5 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5542 ns | 5458 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5416 ns | 1.12 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6479.5 ns | 6750.5 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 4833 ns | 5084 ns | 0.95 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 185138 ns | 184497.5 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7541 ns | 7667 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7500 ns | 7333 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7583 ns | 7500 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7375 ns | 7250 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 640968 ns | 655045 ns | 0.98 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 583 ns | 583 ns | 1 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 625 ns | 0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 542 ns | 625 ns | 0.87 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 625 ns | 542 ns | 1.15 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 23916 ns | 24222 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 8854.5 ns | 9542 ns | 0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9583 ns | 9833 ns | 0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9542 ns | 9667 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9541 ns | 9041 ns | 1.06 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 222930.5 ns | 221511.5 ns | 1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 352187.5 ns | 352562.5 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 351375 ns | 351833 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 352187.5 ns | 353416.5 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 352854 ns | 366166 ns | 0.96 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA | 20981 ns | 21264 ns | 0.99 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 809000 ns | 826208 ns | 0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 774417 ns | 775333.5 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 808000 ns | 808520.5 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 828145.5 ns | 828833 ns | 1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA | 283278 ns | 278649 ns | 1.02 |
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) | 334209 ns | 340917 ns | 0.98 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) | 348667 ns | 342729.5 ns | 1.02 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) | 449000 ns | 453708 ns | 0.99 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) | 10041 ns | 10687.5 ns | 0.94 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA | 18205 ns | 18338 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) | 702917 ns | 709875 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) | 736959 ns | 728042 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) | 998917 ns | 1005792 ns | 0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) | 26833 ns | 26667 ns | 1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA | 246883.5 ns | 257132 ns | 0.96 |
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) | 376083.5 ns | 380187.5 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) | 353541.5 ns | 355542 ns | 0.99 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) | 444042 ns | 442146 ns | 1.00 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) | 29708 ns | 30959 ns | 0.96 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA | 22710 ns | 22801.5 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) | 717541 ns | 726667 ns | 0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) | 794458 ns | 778791.5 ns | 1.02 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) | 1030979 ns | 1034042 ns | 1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) | 106333 ns | 105042 ns | 1.01 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA | 215644.5 ns | 214595.5 ns | 1.00 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) | 3500 ns | 3583 ns | 0.98 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) | 3500 ns | 3542 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) | 3708 ns | 3708 ns | 1 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) | 3583 ns | 3542 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA | 17630 ns | 17801 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) | 4250 ns | 4583 ns | 0.93 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) | 4209 ns | 4333 ns | 0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) | 4333 ns | 4375 ns | 0.99 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) | 4208 ns | 4167 ns | 1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA | 262870 ns | 276455 ns | 0.95 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3958 ns | 3833 ns | 1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3750 ns | 3542 ns | 1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4417 ns | 4292 ns | 1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3666 ns | 3500 ns | 1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 203649.5 ns | 219668 ns | 0.93 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8459 ns | 8334 ns | 1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8375 ns | 8334 ns | 1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8708 ns | 8708 ns | 1 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8375 ns | 8625 ns | 0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 1134731.5 ns | 1228564 ns | 0.92 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 204084 ns | 203709 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210250 ns | 209833 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209959 ns | 213750 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 198625 ns | 200750 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34477 ns | 34897 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 601958 ns | 611979.5 ns | 0.98 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 620583.5 ns | 623084 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 632250 ns | 633542 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 629125 ns | 630833 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 312652 ns | 337730.5 ns | 0.93 |
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) | 1002541 ns | 991250 ns | 1.01 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) | 1013375 ns | 1017458.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) | 949958 ns | 954833 ns | 0.99 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) | 863959 ns | 864916.5 ns | 1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA | 207370 ns | 208131 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) | 4494042 ns | 4517208 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) | 4651833.5 ns | 4768041 ns | 0.98 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) | 4432083 ns | 4459667 ns | 0.99 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) | 4266750 ns | 4281312 ns | 1.00 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA | 924761 ns | 937605 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3604.5 ns | 3625 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3375 ns | 3291 ns | 1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4083 ns | 4250 ns | 0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3583 ns | 3166 ns | 1.13 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 197591.5 ns | 221703 ns | 0.89 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7417 ns | 7500 ns | 0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7250 ns | 7458 ns | 0.97 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7417 ns | 7687.5 ns | 0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7250 ns | 7084 ns | 1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 942275 ns | 1025587 ns | 0.92 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1630417 ns | 1644333 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1198354 ns | 1183209 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1354833 ns | 1370292 ns | 0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2350625 ns | 2475167 ns | 0.95 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214827 ns | 213710.5 ns | 1.01 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12313938 ns | 12346958.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9588354.5 ns | 9593646 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9260750 ns | 9292209 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 17986208 ns | 17963583.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 1951608 ns | 1947963.5 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17308583.5 ns | 17361375 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14356333.5 ns | 14393542 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14335437.5 ns | 14339750 ns | 1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21013646 ns | 21095083 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 88958 ns | 88167 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 87125 ns | 88875 ns | 0.98 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 93271 ns | 91875 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 136500 ns | 134020.5 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 126007 ns | 126192 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2010042 ns | 2027813 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2026417 ns | 2027000.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2041125 ns | 2054000 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2030084 ns | 2028125 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 921500 ns | 1026969 ns | 0.90 |
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) | 2833 ns | 2792 ns | 1.01 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) | 2792 ns | 2583 ns | 1.08 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) | 3458.5 ns | 3458 ns | 1.00 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) | 2812.5 ns | 1917 ns | 1.47 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA | 15285 ns | 16376 ns | 0.93 |
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) | 2959 ns | 2709 ns | 1.09 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) | 3084 ns | 2792 ns | 1.10 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) | 3209 ns | 2792 ns | 1.15 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) | 3083 ns | 2833.5 ns | 1.09 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA | 178942 ns | 186134.5 ns | 0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7291 ns | 7375 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6042 ns | 6041 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5958 ns | 6167 ns | 0.97 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10083 ns | 10125 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33546 ns | 34252.5 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 219458 ns | 242958 ns | 0.90 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 221333 ns | 220917 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219917 ns | 220417 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 244250 ns | 240375 ns | 1.02 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 283938.5 ns | 328052.5 ns | 0.87 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3709 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3750 ns | 3791 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3750 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3750 ns | 3708 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA | 22364 ns | 22539 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 14666 ns | 14584 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 14625 ns | 14542 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 14500 ns | 14584 ns | 0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 14542 ns | 14417 ns | 1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA | 439358.5 ns | 484358 ns | 0.91 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 92270.5 ns | 92125 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 93521 ns | 92458 ns | 1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 96917 ns | 98562.5 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 140667 ns | 118229 ns | 1.19 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 125399 ns | 125261.5 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1651458 ns | 1913333 ns | 0.86 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1725083.5 ns | 1909771 ns | 0.90 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1909333 ns | 1956333 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1925145.5 ns | 1924333 ns | 1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 875999 ns | 935173 ns | 0.94 |
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) | 870208 ns | 879000 ns | 0.99 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) | 828041.5 ns | 818395.5 ns | 1.01 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) | 1217708 ns | 1219520.5 ns | 1.00 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) | 980708 ns | 966459 ns | 1.01 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA | 270193.5 ns | 267198 ns | 1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) | 2798667 ns | 2822917 ns | 0.99 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) | 2451875 ns | 2496917 ns | 0.98 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) | 3325958 ns | 3359000 ns | 0.99 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) | 3420104 ns | 3411333 ns | 1.00 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA | 1517252.5 ns | 1570113.5 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 15833.5 ns | 17000 ns | 0.93 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 15042 ns | 15458.5 ns | 0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17667 ns | 19041 ns | 0.93 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 14500 ns | 16875 ns | 0.86 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 130046 ns | 133146.5 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 223917 ns | 258834 ns | 0.87 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 216083 ns | 215125 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227333 ns | 215792 ns | 1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 257875 ns | 227875 ns | 1.13 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 588866 ns | 602653.5 ns | 0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 220187.5 ns | 219062.5 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 219687.5 ns | 221375 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 222292 ns | 222875 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 223709 ns | 220791 ns | 1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 244511.5 ns | 247312 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 497521 ns | 497625 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 499708 ns | 535916 ns | 0.93 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 509167 ns | 499208 ns | 1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 561583.5 ns | 511125 ns | 1.10 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1232042.5 ns | 1333241 ns | 0.92 |
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) | 4625 ns | 3833.5 ns | 1.21 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) | 4458 ns | 4250 ns | 1.05 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) | 5667 ns | 5166.5 ns | 1.10 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) | 3437.5 ns | 3792 ns | 0.91 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA | 16793 ns | 16912 ns | 0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) | 7542 ns | 7542 ns | 1 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) | 7334 ns | 7167 ns | 1.02 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) | 7541 ns | 7542 ns | 1.00 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) | 7791 ns | 7667 ns | 1.02 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA | 182635 ns | 186762.5 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 19083 ns | 18667 ns | 1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17667 ns | 16708 ns | 1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19458 ns | 20584 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 20041 ns | 18084 ns | 1.11 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 132734.5 ns | 136037 ns | 0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 212083 ns | 224209 ns | 0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 238375 ns | 212687 ns | 1.12 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 224958 ns | 213167 ns | 1.06 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 220333 ns | 222979.5 ns | 0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 855682.5 ns | 896805 ns | 0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 4291.5 ns | 4250 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 4000 ns | 4333.5 ns | 0.92 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5187.5 ns | 5125 ns | 1.01 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4145.5 ns | 3875 ns | 1.07 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 207043 ns | 222577.5 ns | 0.93 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 10750 ns | 10542 ns | 1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 10708 ns | 10791 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 10375 ns | 10959 ns | 0.95 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 10208 ns | 10333 ns | 0.99 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 997637 ns | 1034707.5 ns | 0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3250 ns | 3375 ns | 0.96 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 3334 ns | 3333 ns | 1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 4250 ns | 4042 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3334 ns | 2958 ns | 1.13 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 211528 ns | 225445.5 ns | 0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7583.5 ns | 7500 ns | 1.01 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7520.5 ns | 7750 ns | 0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7542 ns | 7625 ns | 0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7542 ns | 7208 ns | 1.05 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 1003949 ns | 1042046 ns | 0.96 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23433333 ns | 23498333.5 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 35079771 ns | 34789375 ns | 1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37588709 ns | 37689958 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34958437.5 ns | 34909542 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1845047 ns | 1849921 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 183360708 ns | 184647292 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 162555958 ns | 163834583 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 146640208.5 ns | 146363541.5 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 273550667 ns | 274565083 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 16535211 ns | 16510014 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 274702250 ns | 278243563 ns | 0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 250347021 ns | 245760791.5 ns | 1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 232513104 ns | 231789354 ns | 1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 323149875 ns | 324000854.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 183458 ns | 182625 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182417 ns | 184458 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 184125 ns | 186250 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 188209 ns | 181875 ns | 1.03 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 191890 ns | 206355.5 ns | 0.93 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 590916.5 ns | 628291.5 ns | 0.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 585792 ns | 608229.5 ns | 0.96 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 597958 ns | 598250 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 636312.5 ns | 637791 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 954926 ns | 999947 ns | 0.95 |
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) | 3848458 ns | 3874375 ns | 0.99 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) | 3925250 ns | 3917042 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) | 3537667 ns | 3534687.5 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) | 4572958 ns | 4554291 ns | 1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA | 535917.5 ns | 531266.5 ns | 1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) | 17305750 ns | 17461354.5 ns | 0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) | 17738666 ns | 17833459 ns | 0.99 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) | 16523396 ns | 16559937.5 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) | 19975812.5 ns | 19938750 ns | 1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA | 2621357 ns | 2619194 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 584 ns | 625 ns | 0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 500 ns | 1.17 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 666 ns | 0.94 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 583 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 31387 ns | 33463 ns | 0.94 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 9375 ns | 9292 ns | 1.01 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9625 ns | 9458 ns | 1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9666.5 ns | 9375 ns | 1.03 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 9584 ns | 9187.5 ns | 1.04 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 248606 ns | 252733 ns | 0.98 |
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) | 651447167 ns | 651812167 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) | 390769354.5 ns | 390086667 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) | 339760625 ns | 327502625 ns | 1.04 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) | 749321584 ns | 747314333 ns | 1.00 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA | 12465839.5 ns | 12474949 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) | 1892321750.5 ns | 1879705041.5 ns | 1.01 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) | 1639787791 ns | 1650371917 ns | 0.99 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) | 1515586291.5 ns | 1514378771 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) | 2197131604 ns | 2204966313 ns | 1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA | 49313815.5 ns | 49428315 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1618083 ns | 1651458 ns | 0.98 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1193083.5 ns | 1196083 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1358750 ns | 1387103.5 ns | 0.98 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2366750.5 ns | 2353958 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215176 ns | 217144 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12683125 ns | 12704667 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9968083.5 ns | 9935187.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9611958 ns | 9671333.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18417271 ns | 18432334 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2006331 ns | 2021545.5 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17616792 ns | 17670625 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14686375 ns | 14743791.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14535312.5 ns | 14593292 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21460041 ns | 21437146 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26292 ns | 26250 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26333 ns | 26292 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26292 ns | 26333 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26292 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23829 ns | 24013 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67083 ns | 67166 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67500 ns | 67208 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 67333 ns | 67917 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66875 ns | 66958 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 371762 ns | 380547.5 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203959 ns | 202875 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210417 ns | 210375 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209708 ns | 209916 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 199625 ns | 198750 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 25968.5 ns | 25898 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 644291 ns | 645354 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 624500 ns | 637500.5 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 621688 ns | 634542 ns | 0.98 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 633208.5 ns | 634250 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 326634.5 ns | 326606.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 640396 ns | 672209 ns | 0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 671834 ns | 637917 ns | 1.05 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 659042 ns | 665042 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 678959 ns | 664917 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 132210 ns | 131949 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2235354 ns | 2224563 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2239791.5 ns | 2248771 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2263812 ns | 2241125 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2239458 ns | 2237000 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1097894 ns | 1095016 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17250 ns | 17417 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17167 ns | 17333 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19667 ns | 19500 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18333 ns | 16875 ns | 1.09 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 132452.5 ns | 133320 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 221770.5 ns | 260770.5 ns | 0.85 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 219750 ns | 219458.5 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 219208 ns | 229000 ns | 0.96 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 232250 ns | 263334 ns | 0.88 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 908726 ns | 947049 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 625 ns | 1 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 666 ns | 0.88 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 625 ns | 667 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 666 ns | 584 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23608 ns | 23873 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9667 ns | 10000 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 10083 ns | 9750 ns | 1.03 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10209 ns | 10125 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9625 ns | 9750 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 244776.5 ns | 245331.5 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5292 ns | 5375 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5959 ns | 5625 ns | 1.06 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6209 ns | 6604.5 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5020.5 ns | 5000 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 202256 ns | 209896.5 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7625 ns | 7875 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7833 ns | 7292 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7333 ns | 7687.5 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7292 ns | 7334 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 716742.5 ns | 739872 ns | 0.97 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2458 ns | 2041 ns | 1.20 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2229.5 ns | 2250 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2334 ns | 2458 ns | 0.95 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2125 ns | 2084 ns | 1.02 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 17972.5 ns | 18207 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6583 ns | 6542 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6708 ns | 6458 ns | 1.04 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6542 ns | 6708 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6750 ns | 6541 ns | 1.03 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 296406 ns | 306864 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 748875 ns | 747125 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 746750 ns | 749958.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 749167 ns | 747167 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 752250 ns | 771333.5 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 20898 ns | 21305 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 796000 ns | 791000 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 772750 ns | 780041.5 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 772750 ns | 775416 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 792625 ns | 794812.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 273656.5 ns | 271390 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 6958 ns | 6959 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 6125 ns | 6000 ns | 1.02 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6208 ns | 6125 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10250 ns | 10167 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33214 ns | 33759 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 232416.5 ns | 259750 ns | 0.89 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 228208 ns | 238854 ns | 0.96 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227229.5 ns | 231104 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 253145.5 ns | 250208 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 335313 ns | 336384 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 10333 ns | 10125 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10083 ns | 10312.5 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 11020.5 ns | 10875 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10291 ns | 10167 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 211237 ns | 223921.5 ns | 0.94 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24750 ns | 24167 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25584 ns | 24583 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24667 ns | 25333 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24375 ns | 24584 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 1030501.5 ns | 1062400 ns | 0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106087854.5 ns | 106104729.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 116949583 ns | 117502187.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120635291 ns | 120758625 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117517042 ns | 117423500 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 2650195 ns | 2624434 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 393002917 ns | 392280708 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 366126708 ns | 358697709 ns | 1.02 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 356418042 ns | 357440917 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 540646313 ns | 540821208.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 15180136 ns | 15254730 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 784599250 ns | 781416292 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 758649208 ns | 760831458 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 752039895.5 ns | 750885583.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 787142437.5 ns | 784554021 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6417 ns | 7583 ns | 0.85 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7459 ns | 6875 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 7958 ns | 8208 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6125 ns | 7917 ns | 0.77 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 207777 ns | 214784 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14208 ns | 14542 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14104.5 ns | 13667 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14542 ns | 14125 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14292 ns | 14375 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 980639 ns | 1015761 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6125 ns | 5750 ns | 1.07 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6416 ns | 6125 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6667 ns | 7500 ns | 0.89 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5209 ns | 5500 ns | 0.95 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 206305.5 ns | 211436.5 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13000 ns | 12875 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 13000 ns | 12417 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 13000 ns | 12687.5 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12833 ns | 13042 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 703910.5 ns | 728295 ns | 0.97 |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 6042 ns | 5250 ns | 1.15 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 6042 ns | 5709 ns | 1.06 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 6542 ns | 6542 ns | 1 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 5166 ns | 5375 ns | 0.96 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17119 ns | 17219 ns | 0.99 |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 15958 ns | 15750 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 15833 ns | 15375 ns | 1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 16125 ns | 15584 ns | 1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 16083 ns | 15916 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 188536 ns | 188803.5 ns | 1.00 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 417 ns | 0.90 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 333 ns | 417 ns | 0.80 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 417 ns | 334 ns | 1.25 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 23430 ns | 23653 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6250 ns | 6583 ns | 0.95 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6625 ns | 6625 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6542 ns | 6625 ns | 0.99 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6500 ns | 6375 ns | 1.02 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 228627.5 ns | 227179 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5917 ns | 5958 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5875 ns | 6041 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 5959 ns | 5959 ns | 1 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6000 ns | 5875 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 24479 ns | 24470 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21000 ns | 21520.5 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 20917 ns | 21209 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21167 ns | 21667 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21458 ns | 21334 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 251996 ns | 249183.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 144895.5 ns | 144062.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 144542 ns | 143042 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 148208 ns | 146334 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 184458 ns | 188146 ns | 0.98 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 167363.5 ns | 167467 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1308021 ns | 1317583 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1304542 ns | 1321709 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1190917 ns | 1365791.5 ns | 0.87 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1327812.5 ns | 1318666 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1248522 ns | 1237894 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 23917 ns | 24708 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22167 ns | 24375 ns | 0.91 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25000 ns | 24375 ns | 1.03 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22291 ns | 22374.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 317685.5 ns | 318636 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 177479.5 ns | 134750 ns | 1.32 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 118500 ns | 181250 ns | 0.65 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 117895.5 ns | 130000 ns | 0.91 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 130313 ns | 130958 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1369049 ns | 1345187.5 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 417 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 416 ns | 333 ns | 1.25 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23211 ns | 23482 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6375 ns | 6625 ns | 0.96 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6875 ns | 6500 ns | 1.06 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6708 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6417 ns | 6792 ns | 0.94 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 244448 ns | 243071 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4625 ns | 4625 ns | 1 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4625 ns | 4541.5 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5250 ns | 5333 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4500 ns | 4583 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 232495 ns | 231105.5 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10250 ns | 9875 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 10208 ns | 9916.5 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 10417 ns | 1.01 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10417 ns | 10375 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1282590 ns | 1276883 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1625 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1667 ns | 1625 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1666 ns | 1667 ns | 1.00 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23093 ns | 23221 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5667 ns | 5750 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5959 ns | 5750 ns | 1.04 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5750 ns | 6083 ns | 0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5750 ns | 5709 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 263887.5 ns | 262260 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6859958 ns | 6814041 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6349229 ns | 6367459 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6486250 ns | 6578812.5 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7533145.5 ns | 7695958 ns | 0.98 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 215187 ns | 214554 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24039416.5 ns | 24052709 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21294937.5 ns | 21310875 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 20970625 ns | 21123834 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29781625 ns | 29855166.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2104879.5 ns | 2121783 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48555729 ns | 48838979.5 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45322250 ns | 45549667 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45685542 ns | 45706771 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 49355500 ns | 49408500 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5750 ns | 5875 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6416 ns | 5709 ns | 1.12 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 6709 ns | 6708 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5666 ns | 5541 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 210939 ns | 212106.5 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8458 ns | 8875 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8541 ns | 8167 ns | 1.05 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8833 ns | 8542 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8500 ns | 8208 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 993327.5 ns | 1001631 ns | 0.99 |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1512250 ns | 1556417 ns | 0.97 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1278521.5 ns | 1270792 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1621229 ns | 1624187.5 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2168270.5 ns | 2180520.5 ns | 0.99 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 271357.5 ns | 274298 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7839709 ns | 7888792 ns | 0.99 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6591250 ns | 6591250 ns | 1 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7072479.5 ns | 7197854 ns | 0.98 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10485333.5 ns | 10478229.5 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1768229 ns | 1773709 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 373584 ns | 366500 ns | 1.02 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 384291 ns | 371020.5 ns | 1.04 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 460208.5 ns | 457708 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 21708 ns | 33208.5 ns | 0.65 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 43027 ns | 47286 ns | 0.91 |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 718458 ns | 723916.5 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 794250 ns | 801750 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1056291.5 ns | 1064875 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 118375 ns | 115334 ns | 1.03 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 284783.5 ns | 287209.5 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397459 ns | 397291 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 288292 ns | 287834 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 288042 ns | 288166 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 750750 ns | 750833 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43337 ns | 44324 ns | 0.98 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 662208 ns | 661875 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 531708 ns | 532416 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 534375 ns | 535458 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 974250 ns | 973250 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 189452.5 ns | 191330.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 643959 ns | 670958 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 648000 ns | 644229 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 654729 ns | 680667 ns | 0.96 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 674875 ns | 648125 ns | 1.04 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 131586 ns | 132061.5 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2467333.5 ns | 2459333 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2285979 ns | 2456084 ns | 0.93 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2491500 ns | 2464542 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2449104.5 ns | 2456083 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1236661 ns | 1216753 ns | 1.02 |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 3708 ns | 3708 ns | 1 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 3542 ns | 3334 ns | 1.06 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 3875 ns | 4334 ns | 0.89 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 3250 ns | 2667 ns | 1.22 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16170 ns | 16517 ns | 0.98 |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5750 ns | 5500 ns | 1.05 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5916 ns | 5458 ns | 1.08 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5958 ns | 5625 ns | 1.06 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5792 ns | 5542 ns | 1.05 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 184224.5 ns | 186819.5 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1461209 ns |
1458167 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1497833 ns |
1500500 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1499334 ns |
1499333 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1435583 ns |
1437750 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
39820 ns |
39930 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4764375 ns |
5130750 ns |
0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5289146 ns |
5285584 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5166458 ns |
5315979 ns |
0.97 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4994750 ns |
4998959 ns |
1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
195352 ns |
195663 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3750 ns |
3708 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3709 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3750 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3708 ns |
3750 ns |
0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA |
33527 ns |
33499 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
15333 ns |
15375 ns |
1.00 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
15500 ns |
15417 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
15500 ns |
15500 ns |
1 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
15333 ns |
15167 ns |
1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA |
352561.5 ns |
351211 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
70875 ns |
70667 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
71208 ns |
71208 ns |
1 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
71458 ns |
71959 ns |
0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
71041 ns |
71333 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA |
112791 ns |
113147 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
318375 ns |
318500 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
319500 ns |
318000 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
335167 ns |
323666 ns |
1.04 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
318520.5 ns |
317125 ns |
1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA |
192194 ns |
195331 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
1083 ns |
1084 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
1084 ns |
1125 ns |
0.96 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
1125 ns |
1084 ns |
1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
1125 ns |
1000 ns |
1.13 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA |
23034 ns |
23576 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8417 ns |
8458 ns |
1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8125 ns |
8334 ns |
0.97 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8375 ns |
8292 ns |
1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8208 ns |
8375 ns |
0.98 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA |
246121.5 ns |
249171.5 ns |
0.99 |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
506083 ns |
506709 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
492209 ns |
492375 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
560459 ns |
562708 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
218042 ns |
222187.5 ns |
0.98 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
128098 ns |
129166 ns |
0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
1379666.5 ns |
1387250 ns |
0.99 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1410750 ns |
1449208 ns |
0.97 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1735958.5 ns |
1788375 ns |
0.97 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
870542 ns |
865812.5 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
273083 ns |
273491 ns |
1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
375 ns |
1 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
417 ns |
292 ns |
1.43 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
375 ns |
416 ns |
0.90 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
375 ns |
333 ns |
1.13 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
31549 ns |
32843 ns |
0.96 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6500 ns |
6667 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6583 ns |
6458 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6542 ns |
6625 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6334 ns |
6458 ns |
0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
248517.5 ns |
250973.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1723916.5 ns |
1722042 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1733083 ns |
1723208.5 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1726229 ns |
1721083 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1770167 ns |
1723750 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
168350 ns |
168847 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4370791.5 ns |
4362042 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4361667 ns |
4261187.5 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4259562.5 ns |
4415583.5 ns |
0.96 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4371459 ns |
4366958.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1150258 ns |
1143038 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
6583 ns |
6750 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
6667 ns |
6959 ns |
0.96 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
6875 ns |
6959 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
6583 ns |
6708.5 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
19334 ns |
20756 ns |
0.93 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
51125 ns |
51417 ns |
0.99 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
50895.5 ns |
32917 ns |
1.55 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
32791 ns |
33333 ns |
0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
71104.5 ns |
51208.5 ns |
1.39 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
196844.5 ns |
197240.5 ns |
1.00 |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
18083 ns |
17542 ns |
1.03 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
18208 ns |
17875 ns |
1.02 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
18250 ns |
18916 ns |
0.96 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
17417 ns |
17750 ns |
0.98 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
18164 ns |
18861 ns |
0.96 |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
53416 ns |
53458 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
53666 ns |
53334 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
53625 ns |
53250 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
53875 ns |
53500 ns |
1.01 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
319761 ns |
319618.5 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
75375 ns |
75292 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
75292 ns |
75375 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
75584 ns |
75792 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
75250 ns |
75208 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46528 ns |
47162 ns |
0.99 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
324750 ns |
324375 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
331916 ns |
327625 ns |
1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
342000 ns |
329583 ns |
1.04 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
325375 ns |
324208 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
209260 ns |
211676.5 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1486417 ns |
1484375 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1526791 ns |
1527958 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1526875 ns |
1527583 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1463375 ns |
1462209 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
51992.5 ns |
51967 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5107000 ns |
5124708 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5260791.5 ns |
5280333 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5161042 ns |
5332500 ns |
0.97 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4983187.5 ns |
4985875 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
201588 ns |
202369.5 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
28250 ns |
28250 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
28250 ns |
28291 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
28250 ns |
28333 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
28250 ns |
28291 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
24143 ns |
24821 ns |
0.97 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66417 ns |
66459 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
67000 ns |
66458 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
66834 ns |
66833 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66375 ns |
66416 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
492191 ns |
482606 ns |
1.02 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) |
1495250.5 ns |
1501229 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) |
1147437.5 ns |
1127563 ns |
1.02 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) |
1073479 ns |
1119291.5 ns |
0.96 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) |
2230020.5 ns |
2246375 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA |
584155.5 ns |
570915 ns |
1.02 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) |
3049917 ns |
3082875 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) |
2725583 ns |
2738375 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) |
2748541.5 ns |
2760354 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) |
3815417 ns |
3780667 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA |
2003962 ns |
1961915 ns |
1.02 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) |
7917521 ns |
7895333 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) |
7897062 ns |
7893459 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) |
7875417 ns |
7944812.5 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) |
4816000 ns |
4834521 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
79375 ns |
80959 ns |
0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
81312.5 ns |
80333 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
82625 ns |
82166 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
81292 ns |
134375.5 ns |
0.60 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
193985 ns |
193995.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2006291 ns |
2014625 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2022000 ns |
2006229 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2045021 ns |
2047021 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2017334 ns |
2022958 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
750509 ns |
740969 ns |
1.01 |
This comment was automatically generated by workflow using github-action-benchmark.
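For readers scanning the table, the Ratio column appears to be the current timing divided by the previous one, so values above 1 mean the current commit is slower. The sketch below recomputes that ratio for a few rows copied from the table and flags the ones past a cutoff. The helper names and the 1.3 cutoff are assumptions made for illustration; they are not settings of github-action-benchmark or of this repository.

```python
# Minimal sketch: recompute the Ratio column and flag large slowdowns.
# `ratio`, `flag_slowdowns`, and the 1.3 cutoff are hypothetical, chosen for illustration.

def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / previous timing; > 1 means the current commit is slower."""
    return current_ns / previous_ns


def flag_slowdowns(rows, cutoff=1.3):
    """Return the names of benchmarks whose ratio is at or above the cutoff."""
    return [name for name, cur, prev in rows if ratio(cur, prev) >= cutoff]


# (name, current ns, previous ns) copied from rows of the table above.
rows = [
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)", 50895.5, 32917),
    ("bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s)", 71104.5, 51208.5),
    ("groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s)", 81292, 134375.5),
    ("dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s)", 28250, 28250),
]

flagged = flag_slowdowns(rows)
for name in flagged:
    print(name)
```

With these four rows, only the two `bias_activation` zygote entries cross the cutoff. The `groupnorm` row has a ratio well below 1, which reads as a speedup rather than a regression.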
@avik-pal sorry, you will probably need to restart the tests. I just bumped Enzyme's interpreter and then Reactant, which will let us improve precompilation.
One good thing is I managed to get an example ( |
This PR also adds a |
envs/incorrect_ir.mlir:24176:34: error: use of undeclared SSA value name
%312 = "stablehlo.transpose"(%430) <{permutation = array<i64: 0>}> : (tensor<2xui64>) -> tensor<2xui64>
^
We are definitely generating something incorrect here.
CUDA CI is giving some BT with downloading artifacts
NeuralODE -- tricky here. Maybe use CPU.
Other upstream changes:
dynamic_slice adjoint -- feat: DynamicSliceOp adjoint EnzymeAD/Enzyme-JAX#220