chore: bump Zygote version #1182
base: main
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Quite a lot of things are broken in LuxLib.
f0f7fdf to 34f9cf2
33e4959 to fbb55bb
Lux Benchmarks
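A note for reading the table that follows: judging from its values, the Ratio column appears to be Current divided by Previous, rounded to two decimals (for example 3958 ns / 3875 ns ≈ 1.02), so a ratio above 1 means the current commit is slower. The sketch below is only an illustration of that reading, not the benchmark bot's actual code; the function names and the 1.10 threshold are hypothetical, and the sample rows are copied from the table.

```python
from typing import List, Optional, Tuple

Row = Tuple[str, float, Optional[float]]  # (name, current ns, previous ns or None)


def ratio(current_ns: float, previous_ns: float) -> float:
    """Current / Previous, rounded to two decimals as in the Ratio column."""
    return round(current_ns / previous_ns, 2)


def slower_than(rows: List[Row], threshold: float = 1.10) -> List[Tuple[str, float]]:
    """Return (name, ratio) for rows whose ratio exceeds the threshold.

    The threshold is a hypothetical choice for illustration. Rows without a
    previous measurement (the GPU/CUDA rows here) are skipped.
    """
    flagged = []
    for name, current_ns, previous_ns in rows:
        if previous_ns is None:
            continue
        r = ratio(current_ns, previous_ns)
        if r > threshold:
            flagged.append((name, r))
    return flagged


# Sample rows copied from the table.
ROWS: List[Row] = [
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s)", 5708, 4958),
    ("layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA", 84164.5, None),
    ("Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s)", 314799771, 189578375),
    ("Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s)", 280110896, 434187396),
]

for name, r in slower_than(ROWS):
    print(f"{r:.2f}  {name}")
```

Small kernels in the nanosecond range swing by 10% or more in both directions between runs, so a single ratio above the threshold is a prompt to re-run rather than proof of a regression.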
Benchmark suite | Current: fbb55bb | Previous: 1053879 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3958 ns |
3875 ns |
1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
4625 ns |
4292 ns |
1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
5708 ns |
4958 ns |
1.15 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
4125 ns |
3708 ns |
1.11 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA |
84164.5 ns |
||
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
11083 ns |
10750 ns |
1.03 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
10375 ns |
10416 ns |
1.00 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
10250 ns |
10833 ns |
0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
10333.5 ns |
10500 ns |
0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA |
601287.5 ns |
||
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
1084 ns |
1250 ns |
0.87 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
1166 ns |
1042 ns |
1.12 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
1292 ns |
1417 ns |
0.91 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
1083.5 ns |
1208 ns |
0.90 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA |
23470 ns |
||
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
4083 ns |
4125 ns |
0.99 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
4000 ns |
3792 ns |
1.05 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
4458 ns |
4208 ns |
1.06 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
4042 ns |
4166 ns |
0.97 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA |
147686 ns |
||
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
57958 ns |
57458 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
38208 ns |
46709 ns |
0.82 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
46417 ns |
38291.5 ns |
1.21 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
82375 ns |
82166 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
38287 ns |
||
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2020000 ns |
2036084 ns |
0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2099042 ns |
2088000 ns |
1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2093917 ns |
2101833.5 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2003229 ns |
1996395.5 ns |
1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
264389 ns |
||
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
143667 ns |
171187 ns |
0.84 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
183750 ns |
141166 ns |
1.30 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
146542 ns |
145416.5 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
147125 ns |
143604 ns |
1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
178502.5 ns |
||
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1128250 ns |
1123959 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1155334 ns |
1117541.5 ns |
1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1118645.5 ns |
1153479.5 ns |
0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1122270.5 ns |
1120542 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
749393.5 ns |
||
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
3334 ns |
3250 ns |
1.03 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
3750 ns |
3542 ns |
1.06 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4750 ns |
4083 ns |
1.16 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3417 ns |
3042 ns |
1.12 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
91881.5 ns |
||
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
9042 ns |
9145.5 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10875 ns |
8833 ns |
1.23 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
9041 ns |
10333 ns |
0.87 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
9166 ns |
9292 ns |
0.99 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
663001.5 ns |
||
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
15375 ns |
15250 ns |
1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
15042 ns |
17354.5 ns |
0.87 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
17542 ns |
16208 ns |
1.08 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
16375 ns |
15187.5 ns |
1.08 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
62376 ns |
||
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
212291.5 ns |
216750 ns |
0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
216729.5 ns |
211208 ns |
1.03 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
212459 ns |
212166.5 ns |
1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
214500 ns |
227042 ns |
0.94 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
431910.5 ns |
||
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
583 ns |
667 ns |
0.87 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
667 ns |
583 ns |
1.14 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
750 ns |
770.5 ns |
0.97 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
667 ns |
500 ns |
1.33 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA |
22147 ns |
||
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
1375 ns |
1459 ns |
0.94 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
1417 ns |
1417 ns |
1 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
1458 ns |
1417 ns |
1.03 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
1541.5 ns |
1458 ns |
1.06 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA |
192323.5 ns |
||
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7250 ns |
7166 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5334 ns |
5875 ns |
0.91 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
5833 ns |
5250 ns |
1.11 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
9917 ns |
10041 ns |
0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
23959 ns |
||
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
220250 ns |
221000 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
228833 ns |
227229.5 ns |
1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
228458 ns |
228708 ns |
1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
225396 ns |
213792 ns |
1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
241534 ns |
||
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) |
3875 ns |
3834 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) |
3917 ns |
3875 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) |
3875 ns |
3917 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) |
3875 ns |
3875 ns |
1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA |
24268 ns |
||
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) |
16875 ns |
16750 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) |
16917 ns |
16708 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) |
16708 ns |
16542 ns |
1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) |
16792 ns |
17042 ns |
0.99 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA |
259242.5 ns |
||
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
596834 ns |
580104.5 ns |
1.03 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
574417 ns |
575958 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
576334 ns |
579375 ns |
0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
570750 ns |
580708 ns |
0.98 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA |
113219.5 ns |
||
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
1453000 ns |
1416791 ns |
1.03 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
1430270.5 ns |
1424167 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
1427667 ns |
1423042 ns |
1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
1417875 ns |
1425000 ns |
0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA |
248518 ns |
||
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) |
1070062.5 ns |
1079063 ns |
0.99 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) |
910750 ns |
963917 ns |
0.94 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) |
1352833 ns |
1334458 ns |
1.01 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) |
1315959 ns |
1297667 ns |
1.01 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA |
277063.5 ns |
||
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) |
6014562.5 ns |
5943395.5 ns |
1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) |
4582416 ns |
4600125 ns |
1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) |
4950667 ns |
4951395.5 ns |
1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) |
5511708.5 ns |
5560500 ns |
0.99 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA |
1307208 ns |
||
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
541 ns |
500 ns |
1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
542 ns |
500 ns |
1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
542 ns |
500 ns |
1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
542 ns |
542 ns |
1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA |
24457.5 ns |
||
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2083 ns |
2166 ns |
0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2167 ns |
2042 ns |
1.06 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
2084 ns |
2125 ns |
0.98 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2125 ns |
2208 ns |
0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA |
281559.5 ns |
||
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
3625 ns |
3687.5 ns |
0.98 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
4167 ns |
3791 ns |
1.10 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5292 ns |
4792 ns |
1.10 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
3834 ns |
3667 ns |
1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
145732 ns |
||
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
11187.5 ns |
10875 ns |
1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
11000 ns |
11084 ns |
0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
12042 ns |
11500 ns |
1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
11125 ns |
11250 ns |
0.99 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
846723 ns |
||
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
6583 ns |
6125 ns |
1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
7292 ns |
6834 ns |
1.07 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
8395.5 ns |
7542 ns |
1.11 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
6375 ns |
6250 ns |
1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
116722 ns |
||
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
19041.5 ns |
17625 ns |
1.08 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
16625 ns |
17542 ns |
0.95 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
17334 ns |
18834 ns |
0.92 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
16666 ns |
17416 ns |
0.96 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
602008 ns |
||
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
666 ns |
0.81 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
625 ns |
1 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
625 ns |
0.87 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
38275 ns |
||
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9208 ns |
8500 ns |
1.08 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
8666 ns |
8750 ns |
0.99 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9292 ns |
9125 ns |
1.02 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
8604.5 ns |
9208 ns |
0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
215517.5 ns |
||
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) |
64708 ns |
64375 ns |
1.01 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) |
64667 ns |
64542 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) |
64708 ns |
64667 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) |
64666 ns |
64500 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA |
110361 ns |
||
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) |
283250 ns |
277667 ns |
1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) |
283792 ns |
287083 ns |
0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) |
284916 ns |
291375 ns |
0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) |
284208 ns |
284145.5 ns |
1.00 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA |
206006.5 ns |
||
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) |
3370729.5 ns |
3306333 ns |
1.02 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) |
2767958 ns |
3031917 ns |
0.91 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) |
3015083 ns |
2796833 ns |
1.08 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) |
4054688 ns |
3935125 ns |
1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA |
552901 ns |
||
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) |
7644208.5 ns |
7260770.5 ns |
1.05 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) |
7415166 ns |
7411416 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) |
7500291 ns |
7367271 ns |
1.02 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) |
8071625 ns |
8191583.5 ns |
0.99 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA |
1620814.5 ns |
||
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) |
17529458 ns |
17581104 ns |
1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) |
17658291 ns |
17521584 ns |
1.01 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) |
17531437 ns |
17682146 ns |
0.99 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) |
14113437.5 ns |
14123875 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23781895.5 ns |
23725208 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
44495166.5 ns |
34375583 ns |
1.29 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
37037708 ns |
40913375 ns |
0.91 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
34961791.5 ns |
34801458 ns |
1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1854251.5 ns |
||
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
314799771 ns |
189578375 ns |
1.66 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
179087583 ns |
164456312.5 ns |
1.09 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
150044958 ns |
155623541 ns |
0.96 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
280110896 ns |
434187396 ns |
0.65 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
18206383 ns |
||
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
291528459 ns |
289496083 ns |
1.01 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
276677333 ns |
262462166 ns |
1.05 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
295323083 ns |
305828042 ns |
0.97 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
476127166.5 ns |
474493916.5 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
21542 ns |
23604 ns |
0.91 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
21916 ns |
24250 ns |
0.90 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
24542 ns |
23979 ns |
1.02 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
21209 ns |
21291 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
206275.5 ns |
||
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
103708.5 ns |
104687.5 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
104250 ns |
104875 ns |
0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
104750 ns |
104125 ns |
1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
103312.5 ns |
103292 ns |
1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
986759.5 ns |
||
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
5833 ns |
6749.5 ns |
0.86 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
5666 ns |
5416 ns |
1.05 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7687 ns |
7000 ns |
1.10 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
5833.5 ns |
5333 ns |
1.09 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA |
153803.5 ns |
||
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
15042 ns |
14833 ns |
1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
15354 ns |
14709 ns |
1.04 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
15417 ns |
16166 ns |
0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14750 ns |
14770.5 ns |
1.00 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA |
999769 ns |
||
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) |
3020958.5 ns |
3018000 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) |
2068333 ns |
2066604.5 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) |
2257333 ns |
2280541.5 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) |
4823749.5 ns |
4577917 ns |
1.05 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA |
583221 ns |
||
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) |
23607917 ns |
23533375 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) |
18238542 ns |
18022709 ns |
1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) |
16982416 ns |
17334750 ns |
0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) |
34958125 ns |
34837750 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA |
3119291 ns |
||
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) |
33420375 ns |
33300333 ns |
1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) |
27946250 ns |
27629000 ns |
1.01 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) |
27416375 ns |
27822584 ns |
0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) |
40738584 ns |
41187708 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
73459 ns |
74520.5 ns |
0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
73270.5 ns |
74875 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
75250 ns |
82167 ns |
0.92 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
73125 ns |
74583 ns |
0.98 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
214994.5 ns |
||
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
309187.5 ns |
308437.5 ns |
1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
277958 ns |
225749.5 ns |
1.23 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
324625 ns |
320208.5 ns |
1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
314708.5 ns |
218542 ns |
1.44 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1111407 ns |
||
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
11625 ns |
11583 ns |
1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
12250 ns |
11583 ns |
1.06 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
12895.5 ns |
13208 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
11542 ns |
11458 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
153066 ns |
||
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
28625 ns |
28167 ns |
1.02 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
28667 ns |
28375 ns |
1.01 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
29250 ns |
29709 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
28375 ns |
28917 ns |
0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
1012161.5 ns |
||
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
12666 ns |
12000 ns |
1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
13020.5 ns |
12292 ns |
1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
14812.5 ns |
13958 ns |
1.06 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
12083 ns |
12333 ns |
0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA |
118404 ns |
||
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
26667 ns |
25666 ns |
1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
25791 ns |
25959 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
26291 ns |
26500 ns |
0.99 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
26708 ns |
26459 ns |
1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA |
658520 ns |
||
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
179292 ns |
180521 ns |
0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
179562.5 ns |
179354.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
183125 ns |
183458 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
179728.5 ns |
180375 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
97820 ns |
||
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
589083 ns |
590375 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
592333 ns |
594250 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
594104 ns |
594916 ns |
1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
644063 ns |
583541 ns |
1.10 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
520811 ns |
||
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
5959 ns |
6084 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
6000 ns |
5854.5 ns |
1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
7895.5 ns |
7104.5 ns |
1.11 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
5750 ns |
5917 ns |
0.97 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
156484.5 ns |
||
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
14083 ns |
14208 ns |
0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
14458 ns |
13500 ns |
1.07 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
15334 ns |
15625 ns |
0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
14125 ns |
13834 ns |
1.02 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
969271.5 ns |
||
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) |
1222812.5 ns |
1217312.5 ns |
1.00 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) |
1228666.5 ns |
1268500 ns |
0.97 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) |
1273459 ns |
1281209 ns |
0.99 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) |
992667 ns |
998541.5 ns |
0.99 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA |
350542 ns |
||
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) |
4224729.5 ns |
4105042 ns |
1.03 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) |
4563542 ns |
4410083.5 ns |
1.03 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) |
4574542 ns |
4905208.5 ns |
0.93 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) |
3716500 ns |
3703875 ns |
1.00 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA |
1103744.5 ns |
||
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) |
1792 ns |
1792 ns |
1 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) |
1833 ns |
1792 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) |
1833 ns |
1791 ns |
1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) |
1833 ns |
1875 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA |
24330 ns |
||
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) |
4792 ns |
4833 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) |
4792 ns |
4833 ns |
0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) |
4875 ns |
4833 ns |
1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) |
4792 ns |
4875 ns |
0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA |
296478.5 ns |
||
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
5958 ns |
5375 ns |
1.11 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
6584 ns |
5958 ns |
1.11 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
7000 ns |
7166.5 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
5520.5 ns |
5333.5 ns |
1.04 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA |
113977 ns |
||
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
11812.5 ns |
10500 ns |
1.13 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
10833 ns |
11042 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
10875 ns |
11125 ns |
0.98 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
10854.5 ns |
11542 ns |
0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA |
619957.5 ns |
||
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) |
333 ns |
292 ns |
1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) |
333 ns |
333 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) |
292 ns |
291 ns |
1.00 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) |
292 ns |
333 ns |
0.88 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA |
23636 ns |
||
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) |
2750 ns |
2750 ns |
1 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) |
2792 ns |
2708 ns |
1.03 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) |
2875 ns |
2750 ns |
1.05 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) |
2791 ns |
3083 ns |
0.91 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA |
223388.5 ns |
||
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11416.5 ns | 10875 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 11833 ns | 11125 ns | 1.06 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 12667 ns | 12958.5 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11125 ns | 11229.5 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 114708.5 ns | | |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24667 ns | 24604.5 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 25041.5 ns | 24834 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25166 ns | 25333 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24917 ns | 25333 ns | 0.98 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 538198 ns | | |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4166 ns | 4166 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4250 ns | 4167 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4291 ns | 4208 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4208 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 25312 ns | | |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16042 ns | 16375 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16208 ns | 16500 ns | 0.98 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16458 ns | 16167 ns | 1.02 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16292 ns | 16291 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 335052 ns | | |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5834 ns | 5834 ns | 1 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5792 ns | 5834 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5875 ns | 5792 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5834 ns | 5875 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 38763 ns | | |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 21083 ns | 20792 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 21208 ns | 21000 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21271 ns | 21166 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20708 ns | 21167 ns | 0.98 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 241196.5 ns | | |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 424458 ns | 423895.5 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 363791.5 ns | 380479 ns | 0.96 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 480041 ns | 485125 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 103042 ns | 106958 ns | 0.96 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67027 ns | | |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 931708 ns | 937833 ns | 0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 958250 ns | 963250 ns | 0.99 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1177167 ns | 1216083 ns | 0.97 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 465208 ns | 428542 ns | 1.09 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 223255.5 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 80146 ns | 80291.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80542 ns | 79458 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83791 ns | 87042 ns | 0.96 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 81499.5 ns | 80375 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 190291 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1923125 ns | 1917916.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1933250 ns | 1918437.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1918834 ns | 1950812.5 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1879479 ns | 1915188 ns | 0.98 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 592745 ns | | |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 291 ns | 1.00 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 333 ns | 0.88 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22350 ns | | |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1875 ns | 1792 ns | 1.05 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1792 ns | 1834 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1834 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1792 ns | 1875 ns | 0.96 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 277190.5 ns | | |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6375 ns | 6000 ns | 1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 7167 ns | 6167 ns | 1.16 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 8333.5 ns | 7834 ns | 1.06 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6250 ns | 6125 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 112768 ns | | |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9334 ns | 9041 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9334 ns | 9125 ns | 1.02 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9041 ns | 9333 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9375 ns | 9625 ns | 0.97 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 556145.5 ns | | |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 118287959 ns | 120446062.5 ns | 0.98 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 181584917 ns | 174298416.5 ns | 1.04 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148196917 ns | 155622396 ns | 0.95 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 103852625 ns | 104910437 ns | 0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5469523 ns | | |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 615821917 ns | 613470583 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 578633624.5 ns | 555889999.5 ns | 1.04 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 450083166 ns | 467916666 ns | 0.96 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 632957959 ns | 629979541 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 40466551 ns | | |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 722005917 ns | 717129562 ns | 1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 685124708 ns | 665448791 ns | 1.03 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 591788833 ns | 597201792 ns | 0.99 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 858797021 ns | 855951979.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59042 ns | 58542 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38750 ns | 48208 ns | 0.80 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47334 ns | 39083 ns | 1.21 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84125 ns | 80167 ns | 1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 61529.5 ns | | |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1927125 ns | 1918312.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1981125 ns | 1976771 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1977750 ns | 1793729 ns | 1.10 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1894291.5 ns | 1888625 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 251827 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 268458 ns | 268666.5 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 268041.5 ns | 268458 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 287334 ns | 269271 ns | 1.07 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 267479.5 ns | 265875 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 220250.5 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 698042 ns | 676000 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 601791.5 ns | 587417 ns | 1.02 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 700000 ns | 601499.5 ns | 1.16 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 695000 ns | 700333 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1124743 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2228750 ns | 2212542 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2208542 ns | 2211416 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2225084 ns | 2103833 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2211459 ns | 2216500 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 188835 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5494104 ns | 5504541 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5558562 ns | 5488625 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5503645.5 ns | 5582375 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5495875 ns | 5490917 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1129259.5 ns | | |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 638833 ns | 647417 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 663042 ns | 641916.5 ns | 1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 648833 ns | 650125 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 637584 ns | 642917 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47883 ns | | |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1849208 ns | 1821291 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1677167 ns | 1717958 ns | 0.98 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1721708 ns | 1666375 ns | 1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2101125 ns | 2103666.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 272858 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58708 ns | 58292 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38625 ns | 47209 ns | 0.82 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47020.5 ns | 37250 ns | 1.26 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84709 ns | 80791 ns | 1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 29330 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2026750 ns | 2017916.5 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2100292 ns | 2086583 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2089125 ns | 1901083 ns | 1.10 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1994854 ns | 1990750 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 258958.5 ns | | |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13379958.5 ns | 13371875 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12474083 ns | 12426458 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12536417 ns | 12666062 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15161354 ns | 15204979 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 630867 ns | | |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47297021 ns | 47257417 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 42067562.5 ns | 41744209 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41158062.5 ns | 41179062.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58063750 ns | 58639833 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3558583 ns | | |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 97382229 ns | 73940917 ns | 1.32 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 68461583 ns | 90904041 ns | 0.75 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 91128625 ns | 91001000 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 75579000 ns | 98448625 ns | 0.77 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59062.5 ns | 58833 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 38708 ns | 47958 ns | 0.81 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47667 ns | 38542 ns | 1.24 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84291 ns | 84292 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 71350 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1910459 ns | 1904750 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1977000 ns | 1969542 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1966667 ns | 1800875 ns | 1.09 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1889500 ns | 1895917 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 253428.5 ns | | |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 416 ns | 0.80 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 38593 ns | | |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6333 ns | 6145.5 ns | 1.03 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6417 ns | 6458 ns | 0.99 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6375 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6042 ns | 6625 ns | 0.91 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 224418.5 ns | | |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 250 ns | 292 ns | 0.86 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 29806 ns | | |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2792 ns | 2666 ns | 1.05 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2667 ns | 2875 ns | 0.93 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2875 ns | 2833 ns | 1.01 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2667 ns | 2875 ns | 0.93 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 204955.5 ns | | |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 288215500 ns | 284556437.5 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 348637166 ns | 340224270.5 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 316354750 ns | 320916166 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 272734271 ns | 270718833 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 8893905 ns | | |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1059837958 ns | 998965333.5 ns | 1.06 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 945263354 ns | 956359521 ns | 0.99 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 824575917 ns | 868085334 ns | 0.95 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1284825562.5 ns | 1210263479.5 ns | 1.06 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 44588307 ns | | |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1792463895.5 ns | 1439494000 ns | 1.25 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1313057750 ns | 1675455020.5 ns | 0.78 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1663479791.5 ns | 1623450375 ns | 1.02 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1413227333 ns | 1781275542 ns | 0.79 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1413729 ns | 1402500 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1416917 ns | 1406416 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1415854 ns | 1410125 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1459895.5 ns | 1406875 ns | 1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 123251.5 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4716896 ns | 5015125 ns | 0.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5054291 ns | 5021375 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5022292 ns | 5065333 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5019854.5 ns | 5030104.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 771745 ns | | |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 181117209 ns | 178918125 ns | 1.01 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 186561459 ns | 137633791 ns | 1.36 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 117944875 ns | 137284041 ns | 0.86 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 168607583 ns | 169122750 ns | 1.00 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 5385280 ns | | |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 827974687.5 ns | 824093375 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 557055812.5 ns | 493391208 ns | 1.13 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 529594354 ns | 544904625 ns | 0.97 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 647731312.5 ns | 646424584 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16971831 ns | | |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 9017417 ns | 8944417 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 9037250 ns | 8930333 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7909687.5 ns | 8002583 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9741229.5 ns | 9740458 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1611948 ns | | |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 36245833 ns | 37148750 ns | 0.98 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 38516479.5 ns | 36964208 ns | 1.04 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33599000 ns | 34465958 ns | 0.97 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 37772792 ns | 38308250 ns | 0.99 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 7260561 ns | | |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47500 ns | 47458 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47416 ns | 47334 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47500 ns | 47542 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47250 ns | 47584 ns | 0.99 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 24124 ns | | |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50500 ns | 50542 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50292 ns | 50542 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 50542 ns | 50625 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50875 ns | 50500 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 300314 ns | | |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 7000 ns | 6292 ns | 1.11 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 7667 ns | 6625 ns | 1.16 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 8791 ns | 8479 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6500 ns | 6792 ns | 0.96 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 116485.5 ns | | |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10791 ns | 9584 ns | 1.13 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 10625 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10292 ns | 10375 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10416 ns | 10458 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 703047.5 ns | | |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6125 ns | 5250 ns | 1.17 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6292 ns | 5917 ns | 1.06 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7417 ns | 7917 ns | 0.94 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5520.5 ns | 5750 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 116263.5 ns | | |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 16917 ns | 18291.5 ns | 0.92 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 16000 ns | 15958 ns | 1.00 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 16916 ns | 16500 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 16333 ns | 16583 ns | 0.98 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 543524 ns | | |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1042 ns | 1083 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1084 ns | 1083 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1084 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 37684 ns | | |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8541 ns | 8104.5 ns | 1.05 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8375 ns | 8084 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8375 ns | 8125 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7959 ns | 8458 ns | 0.94 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 254817.5 ns | | |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23583 ns | 23125 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23250 ns | 23167 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23542 ns | 23167 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23167 ns | 23541 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 25502 ns | | |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52792 ns | 52500 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52625 ns | 52417 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 52708 ns | 52645.5 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52834 ns | 52458 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 311234 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1399083 ns | 1405062.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1411104.5 ns | 1402583.5 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1402479.5 ns | 1406875 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1451833 ns | 1403729.5 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191772.5 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5010000 ns | 5007708 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5019541.5 ns | 5013292 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5008312.5 ns | 5046271 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5012500 ns | 5005125 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 625895 ns | | |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3055354.5 ns | 3074708 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2058584 ns | 2091499.5 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2268604 ns | 2290083.5 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4546708 ns | 4915708.5 ns | 0.92 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 580567.5 ns | | |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24328291 ns | 24422083 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18995291 ns | 18926750 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 17752375 ns | 18059792 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 35747458 ns | 35835500.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3358502 ns | | |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34024958 ns | 34039292 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28742750 ns | 28325625 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28009896 ns | 28468583 ns | 0.98 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41545375 ns | 41461250 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144079020.5 ns | 144570938 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 145637708.5 ns | 147768250 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 126278208.5 ns | 127812375 ns | 0.99 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 172830166 ns | 173201708 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22565752 ns | | |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 1238375084 ns | 952803959 ns | 1.30 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 867940959 ns | 1880403417 ns | 0.46 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 847256041.5 ns | 721103250 ns | 1.17 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 684237750 ns | 665759084 ns | 1.03 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 126557534 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 83792 ns | 77270.5 ns | 1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 73250 ns | 72541 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 85395.5 ns | 76166 ns | 1.12 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 83917 ns | 72646 ns | 1.16 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 216311.5 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 297458 ns | 291833 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 272833.5 ns | 193625 ns | 1.41 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 298833 ns | 275146 ns | 1.09 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 291000 ns | 289604.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1284143 ns | | |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35788208 ns | 35435979 ns | 1.01 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36447167 ns | 36430959 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32523292 ns | 32728396 ns | 0.99 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40358042 ns | 40524416 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5845219.5 ns | | |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 149270938 ns | 148443209 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 158553959 ns | 153839875 ns | 1.03 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 137948291 ns | 142207500 ns | 0.97 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287361521 ns | 286559208 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 38850415 ns | | |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 122460166.5 ns | 121670542 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 182061416.5 ns | 174360666.5 ns | 1.04 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148298395.5 ns | 155087062.5 ns | 0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 105302125 ns | 106968083 ns | 0.98 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5472066 ns | | |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 472275270.5 ns | 468237229 ns | 1.01 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 488869208 ns | 467305229 ns | 1.05 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 440999063 ns | 457270500 ns | 0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 743285687 ns | 742197000 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 37316517 ns | | |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 653320959 ns | 775778042 ns | 0.84 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 660283125 ns | 639059458 ns | 1.03 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 620928958 ns | 642570667 ns | 0.97 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 857369625 ns | 849532312.5 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) |
1297896.5 ns |
1345916 ns |
0.96 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) |
787229 ns |
984292 ns |
0.80 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) |
989292 ns |
764770.5 ns |
1.29 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) |
2088209 ns |
2095229.5 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA |
569909 ns |
||
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) |
3033625 ns |
2954875 ns |
1.03 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) |
2541229.5 ns |
2619000 ns |
0.97 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) |
2644625 ns |
2499292 ns |
1.06 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) |
3729667 ns |
3688708.5 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA |
1565412 ns |
||
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) |
5803354.5 ns |
5790208 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) |
5885583.5 ns |
5791792 ns |
1.02 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) |
5788458 ns |
5888041 ns |
0.98 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) |
2898437.5 ns |
2887459 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7542 ns |
7208 ns |
1.05 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5125 ns |
5833 ns |
0.88 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6167 ns |
5250 ns |
1.17 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10083 ns |
10125 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
33878 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213708 ns |
223354 ns |
0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220437.5 ns |
232209 ns |
0.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
230062.5 ns |
220729.5 ns |
1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
215646 ns |
219292 ns |
0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
244280.5 ns |
||
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) |
304048125 ns |
303148916.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) |
273835417 ns |
220759541.5 ns |
1.24 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) |
188943750 ns |
221905479 ns |
0.85 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) |
309097375 ns |
309164583 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA |
8618172 ns |
||
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) |
1241452292 ns |
1233285583 ns |
1.01 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) |
967082687.5 ns |
899326000 ns |
1.08 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) |
814369729 ns |
858911520.5 ns |
0.95 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) |
1154828166 ns |
1144926250 ns |
1.01 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA |
29030550 ns |
||
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5396 ns |
4959 ns |
1.09 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
6125 ns |
5209 ns |
1.18 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6375 ns |
6875 ns |
0.93 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5083 ns |
5125 ns |
0.99 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
115013 ns |
||
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
11333 ns |
10333 ns |
1.10 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
10208 ns |
10209 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
10500 ns |
10375 ns |
1.01 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
10334 ns |
10583 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
582351.5 ns |
||
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
625 ns |
500 ns |
1.25 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
625 ns |
0.87 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
584 ns |
625 ns |
0.93 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
625 ns |
0.87 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
30478 ns |
||
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9792 ns |
9125 ns |
1.07 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9104 ns |
9208 ns |
0.99 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
10292 ns |
9209 ns |
1.12 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9125 ns |
9417 ns |
0.97 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
222363.5 ns |
||
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
352292 ns |
352041 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
351812.5 ns |
352167 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
352354.5 ns |
352833 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
352333 ns |
352250 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
30120 ns |
||
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
774188 ns |
810042 ns |
0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
788584 ns |
832334 ns |
0.95 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
774937.5 ns |
777896 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
815958 ns |
833959 ns |
0.98 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
248583 ns |
||
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
343333.5 ns |
339375 ns |
1.01 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
321708 ns |
345208.5 ns |
0.93 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
453645.5 ns |
443583 ns |
1.02 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
10333 ns |
10500 ns |
0.98 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
18868 ns |
||
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
728542 ns |
720437.5 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
729229.5 ns |
730000 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
1002000 ns |
1036000 ns |
0.97 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
26917 ns |
26584 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
216969.5 ns |
||
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
382458.5 ns |
378750 ns |
1.01 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
328917 ns |
347042 ns |
0.95 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
443167 ns |
446167 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
30250 ns |
30208 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
23253 ns |
||
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
748354 ns |
736541 ns |
1.02 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
775125 ns |
781270.5 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
1025125 ns |
1066792 ns |
0.96 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
89271 ns |
104812.5 ns |
0.85 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
203377.5 ns |
||
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
3500 ns |
3375 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
3500 ns |
3458 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
3542 ns |
3709 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
3459 ns |
3625 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
23660.5 ns |
||
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
4291 ns |
4167 ns |
1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
4458 ns |
4208 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
4500 ns |
4250 ns |
1.06 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
4333 ns |
4291 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
251068.5 ns |
||
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
3583 ns |
3625 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
4000 ns |
3375 ns |
1.19 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4709 ns |
4437.5 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3708 ns |
3708 ns |
1.00 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
163977.5 ns |
||
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8292 ns |
8375 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8625 ns |
8208 ns |
1.05 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8500 ns |
8583 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8458 ns |
8542 ns |
0.99 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
986161 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
205042 ns |
205167 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
209584 ns |
209208 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
209167 ns |
208833 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
200083 ns |
199083 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
44368 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
600209 ns |
606958 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
623041 ns |
671708 ns |
0.93 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
621875 ns |
624000 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
592416 ns |
633208 ns |
0.94 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
291829.5 ns |
||
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
1015167 ns |
996958.5 ns |
1.02 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
1030833.5 ns |
1038063 ns |
0.99 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
950208 ns |
970916.5 ns |
0.98 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
873916 ns |
870270.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
208016.5 ns |
||
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
4553437.5 ns |
4514312 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
4805250 ns |
4740687.5 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
4433625 ns |
4626625 ns |
0.96 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
4317000 ns |
4278333 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
955859 ns |
||
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3333 ns |
3083 ns |
1.08 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3834 ns |
3209 ns |
1.19 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4333 ns |
4417 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3333 ns |
3458 ns |
0.96 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
153102 ns |
||
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7167 ns |
7250 ns |
0.99 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7292 ns |
7167 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7458 ns |
7333 ns |
1.02 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7104.5 ns |
7541 ns |
0.94 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
866316.5 ns |
||
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1646334 ns |
1650062.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1146229.5 ns |
1162479.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1367625 ns |
1343562.5 ns |
1.02 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2339708 ns |
2474584 ns |
0.95 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
215061 ns |
||
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12343895.5 ns |
12306500 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9498791 ns |
9576334 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9269479 ns |
9347167 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18048312.5 ns |
18004520.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2033069 ns |
||
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17389750 ns |
17357042 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14235958.5 ns |
14404458 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14303124.5 ns |
14505083.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21020542 ns |
21117625 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
90208.5 ns |
88584 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
89792 ns |
89416.5 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
93104 ns |
91000 ns |
1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
89000 ns |
116312.5 ns |
0.77 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
121503 ns |
||
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2043875 ns |
2027750 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1715271 ns |
2156354 ns |
0.80 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2026375 ns |
1755083 ns |
1.15 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2027125 ns |
2022583 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
765965 ns |
||
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
1875.5 ns |
3416 ns |
0.55 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
1750 ns |
2792 ns |
0.63 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
3292 ns |
2021 ns |
1.63 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
3187.5 ns |
3459 ns |
0.92 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
16456 ns |
||
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
3000 ns |
2750 ns |
1.09 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
2917 ns |
3042 ns |
0.96 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
3083 ns |
3083 ns |
1.00 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
3042 ns |
3084 ns |
0.99 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
166217.5 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7292 ns |
7209 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
5209 ns |
6041 ns |
0.86 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6000 ns |
5333 ns |
1.13 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10083 ns |
10083 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
43156 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
221875 ns |
214125 ns |
1.04 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220084 ns |
229084 ns |
0.96 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
222375 ns |
223791.5 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
217208 ns |
221708 ns |
0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
255319 ns |
||
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3709 ns |
3708 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3792 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3750 ns |
3791 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3708 ns |
3708 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22634 ns |
||
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14542 ns |
14584 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14292 ns |
14458 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14625 ns |
14292 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14375 ns |
14583 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
323742.5 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
136583 ns |
96000 ns |
1.42 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
92708 ns |
91334 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
97395.5 ns |
94166.5 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
93812 ns |
137583 ns |
0.68 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
121513 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1939833.5 ns |
1927479 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1704792 ns |
1933333 ns |
0.88 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1929584 ns |
1671542 ns |
1.15 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1929583.5 ns |
1929000 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
763774 ns |
||
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) |
888500 ns |
880583 ns |
1.01 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) |
789979 ns |
820750 ns |
0.96 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) |
1214208 ns |
1161125 ns |
1.05 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) |
959625 ns |
964042 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA |
277456 ns |
||
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) |
2821500 ns |
2817062.5 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) |
2452521.5 ns |
2505978.5 ns |
0.98 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) |
3364958 ns |
3333708 ns |
1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) |
3405500 ns |
3424937.5 ns |
0.99 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA |
1354698.5 ns |
||
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17584 ns |
17166 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
14875 ns |
15292 ns |
0.97 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
17750 ns |
16937.5 ns |
1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
15000 ns |
16792 ns |
0.89 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
100300 ns |
||
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
256896 ns |
227729.5 ns |
1.13 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
220875 ns |
260125 ns |
0.85 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
227417 ns |
216458 ns |
1.05 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
227417 ns |
259708 ns |
0.88 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
512375.5 ns |
||
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
222625 ns |
221208.5 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
220333.5 ns |
221937 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
222250 ns |
221042 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
218687 ns |
221958.5 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
207105.5 ns |
||
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
532458 ns |
495666 ns |
1.07 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
551291 ns |
561062.5 ns |
0.98 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
510000 ns |
501250 ns |
1.02 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
505292 ns |
572917 ns |
0.88 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1027689 ns |
||
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
3917 ns |
4167 ns |
0.94 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
3583 ns |
3625 ns |
0.99 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
4750 ns |
5417 ns |
0.88 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
3584 ns |
3750 ns |
0.96 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
17203 ns |
||
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
7875 ns |
7500 ns |
1.05 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
7625 ns |
7458 ns |
1.02 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
7541 ns |
7458 ns |
1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
7500 ns |
7917 ns |
0.95 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
169752.5 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
19208 ns |
18625 ns |
1.03 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
16458 ns |
17500 ns |
0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
18479 ns |
19375 ns |
0.95 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17979 ns |
18292 ns |
0.98 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
100825 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
215833 ns |
223917 ns |
0.96 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
215084 ns |
229208.5 ns |
0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
227875 ns |
218333 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
215895.5 ns |
228667 ns |
0.94 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
670325.5 ns |
||
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
4500 ns |
4166 ns |
1.08 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
4291 ns |
4166 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5042 ns |
5375 ns |
0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
3916.5 ns |
4416 ns |
0.89 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
153169.5 ns |
||
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10292 ns |
10042 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10084 ns |
9750 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9791 ns |
10417 ns |
0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10291 ns |
10334 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
865220.5 ns |
||
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
2750 ns |
3375 ns |
0.81 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3167 ns |
2833 ns |
1.12 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4542 ns |
4375 ns |
1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
2916 ns |
2792 ns |
1.04 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
150735.5 ns |
||
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7750 ns |
7083 ns |
1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7292 ns |
7333 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7375 ns |
7417 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7375 ns |
7375 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
897552.5 ns |
||
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23417958 ns |
23307041.5 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
42970625 ns |
33839458 ns |
1.27 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
37718062.5 ns |
40745646 ns |
0.93 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
34894354 ns |
34862708 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1845123.5 ns |
||
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
184815791.5 ns |
184254354 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
190477000 ns |
169428437.5 ns |
1.12 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
146579041 ns |
150235166.5 ns |
0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
273351749.5 ns |
273092750 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
20803838 ns |
||
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
280073541.5 ns |
284314042 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
266185750 ns |
259222834 ns |
1.03 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
232319750 ns |
233454625 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
323259541 ns |
323194834 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
182709 ns |
183354.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
182833 ns |
182083 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
184667 ns |
185375 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
182645.5 ns |
183166.5 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
103013 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
586479 ns |
598042 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
633375 ns |
638604 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
598917 ns |
590042 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
599354.5 ns |
639625 ns |
0.94 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
708021 ns |
||
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
3905000.5 ns |
3814396 ns |
1.02 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
4140542 ns |
3917959 ns |
1.06 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
3516750 ns |
3558667 ns |
0.99 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
4568000 ns |
4558792 ns |
1.00 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
554894 ns |
||
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
17566375 ns |
17242875 ns |
1.02 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
18253708 ns |
17847895.5 ns |
1.02 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
16570958.5 ns |
16851208 ns |
0.98 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
20030167 ns |
19971167 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
3452900 ns |
||
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
584 ns |
500 ns |
1.17 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
542 ns |
625 ns |
0.87 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
667 ns |
0.81 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
32484 ns |
||
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9750 ns |
9333 ns |
1.04 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9500 ns |
8917 ns |
1.07 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9020.5 ns |
9792 ns |
0.92 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9167 ns |
9750 ns |
0.94 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
241650.5 ns |
||
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) |
594961166 ns |
652733938 ns |
0.91 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) |
467396666.5 ns |
393383500 ns |
1.19 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) |
366854000 ns |
395122417 ns |
0.93 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) |
606015292 ns |
624702084 ns |
0.97 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA |
14349377 ns |
||
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) |
1891372250 ns |
1882307625 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) |
1675552271 ns |
1638716333.5 ns |
1.02 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) |
1505550791 ns |
1551357292 ns |
0.97 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) |
2283985791 ns |
2292499417 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA |
53428360 ns |
||
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1646750 ns | 1649417 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1155416 ns | 1198625 ns | 0.96 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1389562.5 ns | 1369208 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2444458 ns | 2494208 ns | 0.98 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 214702 ns | | |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12770833 ns | 12699979.5 ns | 1.01 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9933041.5 ns | 9947354 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9684750 ns | 9680125.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18434542 ns | 18361875 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2095669.5 ns | | |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17728375 ns | 17714687.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14620271 ns | 14723938 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14581833.5 ns | 14690791 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21438020.5 ns | 21421188 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26417 ns | 26250 ns | 1.01 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26292 ns | 26209 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26250 ns | 26209 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26250 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 23919 ns | | |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67375 ns | 67292 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67042 ns | 67625 ns | 0.99 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 68209 ns | 67000 ns | 1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67084 ns | 67167 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 302304 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203875 ns | 204208 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 208458 ns | 209583 ns | 0.99 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 210209 ns | 209542 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 201125 ns | 199166 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 34142 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 603542 ns | 602458 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 629250 ns | 626542 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 634500 ns | 624687.5 ns | 1.02 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 593187.5 ns | 632958 ns | 0.94 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 264230.5 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 669084 ns | 656125 ns | 1.02 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 610875 ns | 646104 ns | 0.95 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 652812.5 ns | 546958 ns | 1.19 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 633104 ns | 679042 ns | 0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 183657 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2246084 ns | 2259375 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2267666 ns | 2247416.5 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2233166 ns | 2013146 ns | 1.11 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2245167 ns | 2262166.5 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1072516 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 16875 ns | 18354.5 ns | 0.92 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17812.5 ns | 17375 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 19562 ns | 19625 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 16812.5 ns | 18542 ns | 0.91 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 100058.5 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 262833 ns | 259959 ns | 1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 225875 ns | 263500 ns | 0.86 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 231396 ns | 221375 ns | 1.05 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 218917 ns | 261334 ns | 0.84 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 692348.5 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 667 ns | 584 ns | 1.14 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 583 ns | 625 ns | 0.93 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 667 ns | 625 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 542 ns | 708 ns | 0.77 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 23622 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 9917 ns | 10125 ns | 0.98 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9708 ns | 9709 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 9584 ns | 10458 ns | 0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9458 ns | 10250 ns | 0.92 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 220965 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5333 ns | 5500 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 6208.5 ns | 5375 ns | 1.16 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6542 ns | 7041.5 ns | 0.93 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5166.5 ns | 5167 ns | 1.00 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 113392 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7792 ns | 7875 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7625 ns | 7750 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7375 ns | 7542 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7167 ns | 7791 ns | 0.92 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 556279 ns | | |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2375 ns | 2041 ns | 1.16 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2209 ns | 1958 ns | 1.13 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2458 ns | 2209 ns | 1.11 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2083 ns | 2167 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 24185 ns | | |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6708 ns | 6333 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6500 ns | 6542 ns | 0.99 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6709 ns | 6416 ns | 1.05 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6542 ns | 6666 ns | 0.98 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 259383 ns | | |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 753791.5 ns | 749417 ns | 1.01 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 748958 ns | 746625 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 747042 ns | 749166.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 747000 ns | 772625 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 29250.5 ns | | |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 794709 ns | 792667 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 776292 ns | 792625 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 791916.5 ns | 775750 ns | 1.02 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 792916.5 ns | 808562.5 ns | 0.98 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 233961.5 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7334 ns | 7334 ns | 1 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 4750 ns | 5959 ns | 0.80 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5333 ns | 1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10208 ns | 10125 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33809 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 263708 ns | 220166 ns | 1.20 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 234791 ns | 239292 ns | 0.98 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 237375 ns | 229167 ns | 1.04 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 213083 ns | 254959 ns | 0.84 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 273907 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9750 ns | 9792 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10459 ns | 10000 ns | 1.05 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10917 ns | 11166 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10083 ns | 9750 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 147727.5 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 24709 ns | 24541 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24167 ns | 24291 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 23875 ns | 24917 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24583 ns | 24625 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 912295 ns | | |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106420229.5 ns | 105924583 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 125882041.5 ns | 116546459 ns | 1.08 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 120619062.5 ns | 124211854 ns | 0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117755479 ns | 117471395.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 3559519 ns | | |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 397139166 ns | 393647209 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 363607041 ns | 356631062.5 ns | 1.02 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 404676250 ns | 357758708 ns | 1.13 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 602079583.5 ns | 619205000 ns | 0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 20813952.5 ns | | |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 801353333.5 ns | 612150166 ns | 1.31 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 603610416 ns | 766180166.5 ns | 0.79 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 749029833 ns | 749713459 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 607015562.5 ns | 785793916 ns | 0.77 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 7167 ns | 7000 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 8166.5 ns | 6875 ns | 1.19 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 9875 ns | 8625 ns | 1.14 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6667 ns | 6542 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 115987.5 ns | | |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14458 ns | 13500 ns | 1.07 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13917 ns | 13625 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 13875 ns | 14375 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13792 ns | 14584 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 657670.5 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6084 ns | 5917 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6875 ns | 5770.5 ns | 1.19 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 7083 ns | 7875 ns | 0.90 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5458 ns | 5583 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 114549 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13000 ns | 13000 ns | 1 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12708.5 ns | 12625 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12375 ns | 12834 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12416 ns | 12895.5 ns | 0.96 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 547807 ns | | |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 5500 ns | 5895.5 ns | 0.93 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 5833 ns | 5292 ns | 1.10 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 5750 ns | 5916 ns | 0.97 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 5708 ns | 5417 ns | 1.05 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 17235 ns | | |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 15666 ns | 15667 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 16125 ns | 15895.5 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 16250 ns | 15916 ns | 1.02 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 15666.5 ns | 16041 ns | 0.98 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 167633 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 292 ns | 1.28 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 417 ns | 0.80 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 29285 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6833 ns | 6292 ns | 1.09 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6729.5 ns | 6667 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 6834 ns | 6667 ns | 1.03 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6125 ns | 6666 ns | 0.92 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 216060.5 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5958 ns | 5916 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5834 ns | 5875 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6042 ns | 5917 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5875 ns | 6041 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 29354 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21875 ns | 21667 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21250 ns | 21208 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21334 ns | 21750 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21208 ns | 21875 ns | 0.97 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 231464 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 147062.5 ns | 144583 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 145521 ns | 162416 ns | 0.90 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149354 ns | 146625 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 144291 ns | 187542 ns | 0.77 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 179480 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1325916 ns | 1319875 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1334041 ns | 1320770.5 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1326625 ns | 957604 ns | 1.39 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1324417 ns | 1324833 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1043728 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25417 ns | 23125 ns | 1.10 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 22500 ns | 22437.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25125 ns | 23854.5 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 24875 ns | 24396 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 206667 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 131041.5 ns | 129875 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 141083 ns | 138125 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 151687 ns | 118937.5 ns | 1.28 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 130417 ns | 176083 ns | 0.74 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1029927 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 23245 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6792 ns | 6833.5 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6708 ns | 6708 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7000 ns | 6667 ns | 1.05 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6333 ns | 6917 ns | 0.92 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 218284 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4166 ns | 4333.5 ns | 0.96 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4584 ns | 4292 ns | 1.07 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 5625 ns | 5292 ns | 1.06 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4042 ns | 4042 ns | 1 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 157788.5 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11750 ns | 11542 ns | 1.02 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11792 ns | 11958 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12166 ns | 11708 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11875 ns | 12625 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1028295 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1584 ns | 1584 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1625 ns | 1583 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1583 ns | 1583 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1583 ns | 1667 ns | 0.95 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23573 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 5667 ns | 5667 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5667 ns | 5625 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5708 ns | 5791 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5667 ns | 5791 ns | 0.98 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 247830.5 ns | | |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6841083 ns | 6893499.5 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6380312.5 ns | 6374750 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6541375 ns | 6500541.5 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7550208 ns | 7628458 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 268531 ns | | |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 24100333.5 ns | 24057854 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21277333.5 ns | 21255853.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21000417 ns | 21045937.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29664625 ns | 29752958 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2233805 ns | | |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 49206750 ns | 37194104 ns | 1.32 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 34126750 ns | 45565937.5 ns | 0.75 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 45951417 ns | 45856833 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 38000291 ns | 49410209 ns | 0.77 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5917 ns | 5729.5 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6917 ns | 6041 ns | 1.15 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7167 ns | 7542 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 5791.5 ns | 5583 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 116232.5 ns | | |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8917 ns | 7812.5 ns | 1.14 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8667 ns | 8333 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8584 ns | 8667 ns | 0.99 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8083.5 ns | 8750 ns | 0.92 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 643141.5 ns | | |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1571084 ns | 1558521 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1234000 ns | 1261333 ns | 0.98 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1629167 ns | 1624791.5 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2116229.5 ns | 2151979 ns | 0.98 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 278928 ns | | |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7937750 ns | 7911312.5 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6341959 ns | 6595562.5 ns | 0.96 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7183167 ns | 7113500.5 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10478041.5 ns | 10486458 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1487071 ns | | |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 374395.5 ns | 370375.5 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 350938 ns | 370334 ns | 0.95 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 456333 ns | 457042 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 22354.5 ns | 24083.5 ns | 0.93 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 47147.5 ns | | |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 746625 ns | 740416 ns | 1.01 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 795750 ns | 810542 ns | 0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1058750 ns | 1091458.5 ns | 0.97 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 94292 ns | 119250 ns | 0.79 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 199913 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 397375 ns | 397375 ns | 1 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 211166 ns | 288000 ns | 0.73 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287917 ns | 211583 ns | 1.36 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 751958 ns | 750270.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 43831 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 670000 ns | 673041 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 459708 ns | 532334 ns | 0.86 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 531333 ns | 474084 ns | 1.12 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 974625 ns | 973792 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 213381 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 681375 ns | 662833.5 ns | 1.03 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 543708 ns | 641958 ns | 0.85 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 648417 ns | 544334 ns | 1.19 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 640520.5 ns | 670813 ns | 0.95 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 183734.5 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2318125 ns | 2467229 ns | 0.94 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2469938 ns | 2462313 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2457937.5 ns | 2482583.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2465041 ns | 2448459 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1103811.5 ns | | |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 2750 ns | 3583.5 ns | 0.77 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 2584 ns | 2687.5 ns | 0.96 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 4187.5 ns | 2959 ns | 1.42 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 3437.5 ns | 3833 ns | 0.90 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16704 ns | | |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5750 ns | 5542 ns | 1.04 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5792 ns | 5792 ns | 1 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5833 ns | 5833 ns | 1 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5750 ns | 5833.5 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 166615 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1463166 ns | 1460979.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1497000 ns | 1498958 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1504291 ns | 1492334 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1442417 ns | 1436709 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 63672 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5129125 ns | 5110375 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5274312.5 ns | 5286896 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5290666 ns | 4965208 ns | 1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5015042 ns | 4987187.5 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 278519 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3708 ns | 3709 ns | 1.00 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3750 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3709 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3709 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 31642 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15333 ns | 15250 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15084 ns | 15375 ns | 0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15542 ns | 15208 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15250 ns | 15542 ns | 0.98 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 278272 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71417 ns | 71167 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 70250 ns | 71208 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71333 ns | 71125 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 70416 ns | 70145.5 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113876 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 320458 ns | 318209 ns | 1.01 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 320625 ns | 321166 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 319333 ns | 331000 ns | 0.96 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 318959 ns | 318208 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 226322.5 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1000 ns | 1084 ns | 0.92 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1125 ns | 1083 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1000 ns | 1125 ns | 0.89 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 28049 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8500 ns | 8208 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8250 ns | 8333 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8500 ns | 8542 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 7958 ns | 8458 ns | 0.94 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 229450 ns | | |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) | 518146 ns | 513416.5 ns | 1.01 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) | 470041 ns | 491000 ns | 0.96 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) | 558896.5 ns | 564167 ns | 0.99 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) | 215375 ns | 219125 ns | 0.98 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA | 129321 ns | | |
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) | 1415813 ns | 1389604.5 ns | 1.02 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) | 1365521 ns | 1470916.5 ns | 0.93 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) | 1749208 ns | 1739750 ns | 1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) | 869291 ns | 867042 ns | 1.00 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA | 314195 ns | | |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 292 ns | 1.43 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 417 ns | 0.80 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA | 32861 ns | | |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6791 ns | 6792 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6270.5 ns | 6667 ns | 0.94 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6667 ns | 6667 ns | 1.00 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6208 ns | 6583 ns | 0.94 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA | 240520 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1719583 ns | 1744875 ns | 0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1722375 ns | 1720437.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1729625 ns | 1725229 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1724625 ns | 1774833.5 ns | 0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 181797 ns | | |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4370084 ns | 4362875 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 4363333 ns | 4366833.5 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 4368438 ns | 4017625 ns | 1.09 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4345708 ns | 4360042 ns | 1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1045982.5 ns | | |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 6750 ns | 6709 ns | 1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 6791 ns | 6541 ns | 1.04 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 7041 ns | 7125 ns | 0.99 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 6750 ns | 6896 ns | 0.98 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA | 27491 ns | | |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 52812 ns | 32667 ns | 1.62 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 33062.5 ns | 51125 ns | 0.65 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 51395.5 ns | 33125 ns | 1.55 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 51375 ns | 52271 ns | 0.98 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA | 205638.5 ns | | |
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) | 17583 ns | 18166.5 ns | 0.97 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) | 17958.5 ns | 17500 ns | 1.03 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) | 18375 ns | 18875 ns | 0.97 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) | 17833 ns | 17666.5 ns | 1.01 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA | 18754 ns | | |
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) | 53875 ns | 53667 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) | 53666 ns | 53584 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) | 53458 ns | 53417 ns | 1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) | 53583 ns | 54000 ns | 0.99 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA | 255832 ns | | |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 75500 ns | 75334 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 73875 ns | 75375 ns | 0.98 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 75416 ns | 75209 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 75250 ns | 74916 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA | 47402.5 ns | | |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 329520.5 ns | 324959 ns | 1.01 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 327625 ns | 340167 ns | 0.96 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 328666 ns | 336875 ns | 0.98 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 325375 ns | 324833 ns | 1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA | 243231 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1485708 ns | 1486958 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1518750 ns | 1526792 ns | 0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1527125 ns | 1521459 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1463375 ns | 1463834 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 74346.5 ns | | |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5120104.5 ns | 5117062 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5004791.5 ns | 5294604 ns | 0.95 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5247958 ns | 4960833 ns | 1.06 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 4989625 ns | 4987709 ns | 1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 285963.5 ns | | |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 28208 ns | 28167 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 28166 ns | 28167 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 28125 ns | 28292 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 28250 ns | 28292 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA | 24611 ns | | |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 66584 ns | 66333 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 66333 ns | 66833 ns | 0.99 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 66833 ns | 66500 ns | 1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 66708 ns | 66459 ns | 1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA | 359964 ns | | |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) | 1492854 ns | 1395354 ns | 1.07 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) | 939312.5 ns | 1059146 ns | 0.89 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) | 1145167 ns | 814208 ns | 1.41 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) | 2249458 ns | 2269396 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA | 551648 ns | | |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) | 3081083 ns | 3090979 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) | 1924375 ns | 2740854.5 ns | 0.70 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) | 2755104.5 ns | 2544104.5 ns | 1.08 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) | 3852000.5 ns | 3812666 ns | 1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA | 1605594.5 ns | | |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) | 7915791.5 ns | 7882104 ns | 1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) | 7577250 ns | 7902666.5 ns | 0.96 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) | 7897333 ns | 8008791.5 ns | 0.99 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) | 4836729.5 ns | 4806271 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 81291 ns | 81167 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 82208 ns | 83208.5 ns | 0.99 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 83375 ns | 81979.5 ns | 1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 79000.5 ns | 80417 ns | 0.98 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 188718.5 ns | | |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2024895.5 ns | 2017166.5 ns | 1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1770313 ns | 2013729 ns | 0.88 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2017667 ns | 1774125 ns | 1.14 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2030229 ns | 2014354.5 ns | 1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 614926 ns | | |

This comment was automatically generated by workflow using github-action-benchmark.
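
For reading the Ratio column: the values in the table are consistent with Current divided by Previous, rounded to two decimals, so a ratio above 1 means the current commit measured slower. A minimal sketch in Python that reproduces a few rows; the `ratio` helper and the `rows` dictionary are hypothetical names used only for illustration and are not part of github-action-benchmark.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Return current / previous, rounded to two decimals (assumed convention)."""
    return round(current_ns / previous_ns, 2)


# (current ns, previous ns) pairs copied from the table above.
rows = {
    "bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s)": (52812, 32667),
    "bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s)": (33062.5, 51125),
    "batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s)": (417, 292),
    "mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s)": (1145167, 814208),
}

for name, (current_ns, previous_ns) in rows.items():
    print(f"{name}: {ratio(current_ns, previous_ns):.2f}")
```

The `bias_activation` zygote rows swing between roughly 33 µs and 52 µs in both directions across thread counts, so a single large ratio there may reflect run-to-run variation rather than a regression.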