chore: bump Zygote version #1182
Benchmark Results (ASV)
Benchmark Plots
A plot of the benchmark results has been uploaded as an artifact to the workflow run for this PR.
Quite a lot of things are broken in LuxLib.
f0f7fdf to 34f9cf2
33e4959 to fbb55bb
Lux Benchmarks
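For reading the table below: the Ratio column appears to be Current divided by Previous, rounded to two decimals, so values above 1 mean the current commit is slower. A minimal sketch of that arithmetic follows. The function names and the 1.2 threshold are illustrative assumptions, not part of the benchmark tooling.

```python
def ratio(current_ns: float, previous_ns: float) -> float:
    """Return Current / Previous rounded to two decimals, as the table seems to report it."""
    return round(current_ns / previous_ns, 2)


def is_regression(current_ns: float, previous_ns: float, threshold: float = 1.2) -> bool:
    """Flag a row as a possible slowdown.

    The threshold of 1.2 is a hypothetical cutoff chosen for illustration only.
    """
    return ratio(current_ns, previous_ns) >= threshold


# Rows taken from the table: (current ns, previous ns)
print(ratio(3792, 3875))                      # layernorm forward, 2 threads
print(ratio(315169333.5, 189578375))          # Conv 32 => 32 zygote, 2 threads
print(is_regression(315169333.5, 189578375))  # True under the assumed threshold
```

Many of the sub-microsecond rows move by 10% or more in either direction between runs, so a single ratio near the threshold is probably noise rather than a real change.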
Benchmark suite | Current: ec156a5 | Previous: 1053879 | Ratio |
---|---|---|---|
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 3792 ns | 3875 ns | 0.98 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 4125 ns | 4292 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 5333 ns | 4958 ns | 1.08 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 3917 ns | 3708 ns | 1.06 |
layernorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 81338 ns | | |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 10209 ns | 10750 ns | 0.95 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 10792 ns | 10416 ns | 1.04 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 10459 ns | 10833 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 10708 ns | 10500 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 589598 ns | | |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 1125 ns | 1250 ns | 0.90 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 1291 ns | 1042 ns | 1.24 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 1416 ns | 1417 ns | 1.00 |
bias_activation(32, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 1208 ns | 1208 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/forward/GPU/CUDA | 22735 ns | | |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 4042 ns | 4125 ns | 0.98 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 3958 ns | 3792 ns | 1.04 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 4208 ns | 4208 ns | 1 |
bias_activation(32, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 4000 ns | 4166 ns | 0.96 |
bias_activation(32, act=relu)(32 x 128)/zygote/GPU/CUDA | 143372 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58083 ns | 57458 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 46167 ns | 46709 ns | 0.99 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 46375 ns | 38291.5 ns | 1.21 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82625 ns | 82166 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 37541 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2036333.5 ns | 2036084 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2083916 ns | 2088000 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2096041.5 ns | 2101833.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2002875 ns | 1996395.5 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 261968.5 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 149875.5 ns | 171187 ns | 0.88 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 145041.5 ns | 141166 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 146125 ns | 145416.5 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 145563 ns | 143604 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 182105.5 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1120542 ns | 1123959 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1109125 ns | 1117541.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1126041.5 ns | 1153479.5 ns | 0.98 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1117750 ns | 1120542 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 732481 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 3604.5 ns | 3250 ns | 1.11 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 3542 ns | 3542 ns | 1 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4417 ns | 4083 ns | 1.08 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 3500 ns | 3042 ns | 1.15 |
layernorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 91689.5 ns | | |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10417 ns | 9145.5 ns | 1.14 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9667 ns | 8833 ns | 1.09 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10708 ns | 10333 ns | 1.04 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10166 ns | 9292 ns | 1.09 |
layernorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 684242.5 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 17708.5 ns | 15250 ns | 1.16 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 17375 ns | 17354.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 17750 ns | 16208 ns | 1.10 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 15417 ns | 15187.5 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 65373.5 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 216791.5 ns | 216750 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 220583 ns | 211208 ns | 1.04 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 216125 ns | 212166.5 ns | 1.02 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 211583 ns | 227042 ns | 0.93 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 443698.5 ns | | |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 625 ns | 667 ns | 0.94 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 584 ns | 583 ns | 1.00 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 708 ns | 770.5 ns | 0.92 |
bias_activation(2, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 666 ns | 500 ns | 1.33 |
bias_activation(2, act=relu)(2 x 128)/forward/GPU/CUDA | 22099 ns | | |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 1625 ns | 1459 ns | 1.11 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 1625 ns | 1417 ns | 1.15 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 1500 ns | 1417 ns | 1.06 |
bias_activation(2, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 1417 ns | 1458 ns | 0.97 |
bias_activation(2, act=relu)(2 x 128)/zygote/GPU/CUDA | 208022 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7000 ns | 7166 ns | 0.98 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5833 ns | 5875 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 5917 ns | 5250 ns | 1.13 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10042 ns | 10041 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 24008 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 230645.5 ns | 221000 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 238250 ns | 227229.5 ns | 1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 229500 ns | 228708 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 225000 ns | 213792 ns | 1.05 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 237945.5 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 3917 ns | 3834 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 3875 ns | 3875 ns | 1 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 3875 ns | 3917 ns | 0.99 |
dense(32, bias=false, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 3958 ns | 3875 ns | 1.02 |
dense(32, bias=false, act=relu)(32 x 128)/forward/GPU/CUDA | 23657 ns | | |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16958 ns | 16750 ns | 1.01 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16791 ns | 16708 ns | 1.00 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 17125 ns | 16542 ns | 1.04 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16709 ns | 17042 ns | 0.98 |
dense(32, bias=false, act=relu)(32 x 128)/zygote/GPU/CUDA | 257669 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 576250 ns | 580104.5 ns | 0.99 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 575792 ns | 575958 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 579250 ns | 579375 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 578459 ns | 580708 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/forward/GPU/CUDA | 113043.5 ns | | |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1435521 ns | 1416791 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1422417 ns | 1424167 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1424334 ns | 1423042 ns | 1.00 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 1433042 ns | 1425000 ns | 1.01 |
dense(512, bias=false, act=gelu)(512 x 128)/zygote/GPU/CUDA | 245338.5 ns | | |
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) | 1076084 ns | 1079063 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) | 959917 ns | 963917 ns | 1.00 |
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) | 1353395.5 ns | 1334458 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) | 1307459 ns | 1297667 ns | 1.01 |
lenet(28, 28, 1, 64)/forward/GPU/CUDA | 277123 ns | | |
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) | 5983500 ns | 5943395.5 ns | 1.01 |
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) | 4589875 ns | 4600125 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) | 4956312.5 ns | 4951395.5 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) | 5581854.5 ns | 5560500 ns | 1.00 |
lenet(28, 28, 1, 64)/zygote/GPU/CUDA | 1322976 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 500 ns | 500 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 542 ns | 500 ns | 1.08 |
dense(2, bias=true, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 542 ns | 542 ns | 1 |
dense(2, bias=true, act=relu)(2 x 128)/forward/GPU/CUDA | 23795 ns | | |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 2167 ns | 2166 ns | 1.00 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2125 ns | 2042 ns | 1.04 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 2167 ns | 2125 ns | 1.02 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2125 ns | 2208 ns | 0.96 |
dense(2, bias=true, act=relu)(2 x 128)/zygote/GPU/CUDA | 287236 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 3833.5 ns | 3687.5 ns | 1.04 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 3895.5 ns | 3791 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 4958 ns | 4792 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 4000 ns | 3667 ns | 1.09 |
layernorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 145829 ns | | |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 11417 ns | 10875 ns | 1.05 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 11292 ns | 11084 ns | 1.02 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 11812.5 ns | 11500 ns | 1.03 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 11291 ns | 11250 ns | 1.00 |
layernorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 843169.5 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6250 ns | 6125 ns | 1.02 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 6208 ns | 6834 ns | 0.91 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8667 ns | 7542 ns | 1.15 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 6792 ns | 6250 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 116006.5 ns | | |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 18667 ns | 17625 ns | 1.06 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 17083 ns | 17542 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 18209 ns | 18834 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 16875 ns | 17416 ns | 0.97 |
groupnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 597830 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 542 ns | 1.15 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 542 ns | 666 ns | 0.81 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 625 ns | 1.07 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 583 ns | 625 ns | 0.93 |
batchnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 38116 ns | | |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 8917 ns | 8500 ns | 1.05 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 9084 ns | 8750 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 9542 ns | 9125 ns | 1.05 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 8833 ns | 9208 ns | 0.96 |
batchnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 211578.5 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 64584 ns | 64375 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 64541 ns | 64542 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 64542 ns | 64667 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 64792 ns | 64500 ns | 1.00 |
dense(512, bias=false, act=identity)(512 x 128)/forward/GPU/CUDA | 110661 ns | | |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 284125 ns | 277667 ns | 1.02 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 282334 ns | 287083 ns | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 284209 ns | 291375 ns | 0.98 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 282375 ns | 284145.5 ns | 0.99 |
dense(512, bias=false, act=identity)(512 x 128)/zygote/GPU/CUDA | 204945 ns | | |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) | 3312312.5 ns | 3306333 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) | 3024792 ns | 3031917 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) | 3015708 ns | 2796833 ns | 1.08 |
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) | 4068666.5 ns | 3935125 ns | 1.03 |
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA | 549444 ns | | |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) | 7655625.5 ns | 7260770.5 ns | 1.05 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) | 7477937.5 ns | 7411416 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) | 7488062 ns | 7367271 ns | 1.02 |
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) | 8275396 ns | 8191583.5 ns | 1.01 |
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA | 1578004 ns | | |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) | 17537667 ns | 17581104 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) | 17536167 ns | 17521584 ns | 1.00 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) | 17544416.5 ns | 17682146 ns | 0.99 |
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) | 14138896 ns | 14123875 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 23286395.5 ns | 23725208 ns | 0.98 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 33822000.5 ns | 34375583 ns | 0.98 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 37254395.5 ns | 40913375 ns | 0.91 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 34962750 ns | 34801458 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 1853536 ns | | |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 315169333.5 ns | 189578375 ns | 1.66 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 164173312.5 ns | 164456312.5 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 151307250 ns | 155623541 ns | 0.97 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 280314896 ns | 434187396 ns | 0.65 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 18190556 ns | | |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 289334209 ns | 289496083 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 262239208 ns | 262462166 ns | 1.00 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 297151583 ns | 305828042 ns | 0.97 |
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 478006146 ns | 474493916.5 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 22959 ns | 23604 ns | 0.97 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 24416 ns | 24250 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 24709 ns | 23979 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 22000 ns | 21291 ns | 1.03 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 200917 ns | | |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 103687.5 ns | 104687.5 ns | 0.99 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 104729.5 ns | 104875 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 105500 ns | 104125 ns | 1.01 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 103792 ns | 103292 ns | 1.00 |
layernorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 967522 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5708 ns | 6749.5 ns | 0.85 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5542 ns | 5416 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6625 ns | 7000 ns | 0.95 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5458 ns | 5333 ns | 1.02 |
layernorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 152037.5 ns | | |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 15000 ns | 14833 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 14917 ns | 14709 ns | 1.01 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 16042 ns | 16166 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14583 ns | 14770.5 ns | 0.99 |
layernorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 991796 ns | | |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 2874292 ns | 3018000 ns | 0.95 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2042958 ns | 2066604.5 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2269333 ns | 2280541.5 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4839521 ns | 4577917 ns | 1.06 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 586423 ns | | |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 23593083 ns | 23533375 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 17967458 ns | 18022709 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 16920542 ns | 17334750 ns | 0.98 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 34844500 ns | 34837750 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3227368 ns | | |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 33399209 ns | 33300333 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 27567250 ns | 27629000 ns | 1.00 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 27441041 ns | 27822584 ns | 0.99 |
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41052937.5 ns | 41187708 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 83979.5 ns | 74520.5 ns | 1.13 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 72625 ns | 74875 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 80104 ns | 82167 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 75375 ns | 74583 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 214747 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 313750 ns | 308437.5 ns | 1.02 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 281916 ns | 225749.5 ns | 1.25 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 292500 ns | 320208.5 ns | 0.91 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 219395.5 ns | 218542 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1125841.5 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 11541.5 ns | 11583 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11625 ns | 11583 ns | 1.00 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 12500 ns | 13208 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 11833.5 ns | 11458 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 155078.5 ns | | |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 28937.5 ns | 28167 ns | 1.03 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 26916.5 ns | 28375 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 29292 ns | 29709 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 28354 ns | 28917 ns | 0.98 |
layernorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 1027471 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 12125 ns | 12000 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 11584 ns | 12292 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 14375 ns | 13958 ns | 1.03 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 12792 ns | 12333 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 123504 ns | | |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 26583 ns | 25666 ns | 1.04 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 25500 ns | 25959 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 26709 ns | 26500 ns | 1.01 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 25917 ns | 26459 ns | 0.98 |
groupnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 668377 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 182167 ns | 180521 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 182125 ns | 179354.5 ns | 1.02 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 182583 ns | 183458 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 182104.5 ns | 180375 ns | 1.01 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 98989.5 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 584708.5 ns | 590375 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 582833 ns | 594250 ns | 0.98 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 595458.5 ns | 594916 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 584792 ns | 583541 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 525338.5 ns | | |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5708 ns | 6084 ns | 0.94 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5854.5 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6958.5 ns | 7104.5 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5833 ns | 5917 ns | 0.99 |
layernorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 156166.5 ns | | |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 13458 ns | 14208 ns | 0.95 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13667 ns | 13500 ns | 1.01 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 15583 ns | 15625 ns | 1.00 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 13542 ns | 13834 ns | 0.98 |
layernorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 983586 ns | | |
batchedmm(512, Bsize=4)/forward/CPU/2 thread(s) | 1199250 ns | 1217312.5 ns | 0.99 |
batchedmm(512, Bsize=4)/forward/CPU/4 thread(s) | 1238417 ns | 1268500 ns | 0.98 |
batchedmm(512, Bsize=4)/forward/CPU/8 thread(s) | 1338709 ns | 1281209 ns | 1.04 |
batchedmm(512, Bsize=4)/forward/CPU/1 thread(s) | 1038187.5 ns | 998541.5 ns | 1.04 |
batchedmm(512, Bsize=4)/forward/GPU/CUDA | 301214 ns | | |
batchedmm(512, Bsize=4)/zygote/CPU/2 thread(s) | 4107563 ns | 4105042 ns | 1.00 |
batchedmm(512, Bsize=4)/zygote/CPU/4 thread(s) | 4459000 ns | 4410083.5 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/CPU/8 thread(s) | 4581042 ns | 4905208.5 ns | 0.93 |
batchedmm(512, Bsize=4)/zygote/CPU/1 thread(s) | 3738291.5 ns | 3703875 ns | 1.01 |
batchedmm(512, Bsize=4)/zygote/GPU/CUDA | 1094582 ns | | |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1834 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1834 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1834 ns | 1791 ns | 1.02 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1833 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/forward/GPU/CUDA | 24056 ns | | |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 4875 ns | 4833 ns | 1.01 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 4791 ns | 4833 ns | 0.99 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 5000 ns | 4833 ns | 1.03 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 4792 ns | 4875 ns | 0.98 |
dense(2, bias=true, act=gelu)(2 x 128)/zygote/GPU/CUDA | 296208.5 ns | | |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5875 ns | 5375 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5458 ns | 5958 ns | 0.92 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7208 ns | 7166.5 ns | 1.01 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6125 ns | 5333.5 ns | 1.15 |
groupnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 114786.5 ns | | |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 11458 ns | 10500 ns | 1.09 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11084 ns | 11042 ns | 1.00 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 11417 ns | 11125 ns | 1.03 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10792 ns | 11542 ns | 0.94 |
groupnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 625509.5 ns | | |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/4 thread(s) | 291 ns | 333 ns | 0.87 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 291 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/forward/CPU/1 thread(s) | 333 ns | 333 ns | 1 |
dense(2, bias=false, act=relu)(2 x 128)/forward/GPU/CUDA | 23447 ns | | |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/2 thread(s) | 3042 ns | 2750 ns | 1.11 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/4 thread(s) | 2750 ns | 2708 ns | 1.02 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/8 thread(s) | 3125 ns | 2750 ns | 1.14 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/CPU/1 thread(s) | 2834 ns | 3083 ns | 0.92 |
dense(2, bias=false, act=relu)(2 x 128)/zygote/GPU/CUDA | 222903.5 ns | | |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 11375 ns | 10875 ns | 1.05 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 10708 ns | 11125 ns | 0.96 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 13333 ns | 12958.5 ns | 1.03 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 11979.5 ns | 11229.5 ns | 1.07 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 117415.5 ns | | |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25541.5 ns | 24604.5 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24583 ns | 24834 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 25667 ns | 25333 ns | 1.01 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24958 ns | 25333 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 541281.5 ns | | |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/2 thread(s) | 4167 ns | 4166 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/4 thread(s) | 4167 ns | 4167 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/8 thread(s) | 4208 ns | 4208 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/CPU/1 thread(s) | 4208 ns | 4208 ns | 1 |
dense(32, bias=true, act=relu)(32 x 128)/forward/GPU/CUDA | 24871 ns | | |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/2 thread(s) | 16417 ns | 16375 ns | 1.00 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/4 thread(s) | 16375 ns | 16500 ns | 0.99 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/8 thread(s) | 16333 ns | 16167 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/CPU/1 thread(s) | 16375 ns | 16291 ns | 1.01 |
dense(32, bias=true, act=relu)(32 x 128)/zygote/GPU/CUDA | 335381.5 ns | | |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5792 ns | 5834 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5875 ns | 5834 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 5833 ns | 5792 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5834 ns | 5875 ns | 0.99 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 39849.5 ns | | |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 20854.5 ns | 20792 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 20916 ns | 21000 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 21375 ns | 21166 ns | 1.01 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 20458 ns | 21167 ns | 0.97 |
batchnorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 242848.5 ns | | |
batchedmm(16, Bsize=512)/forward/CPU/2 thread(s) | 422125 ns | 423895.5 ns | 1.00 |
batchedmm(16, Bsize=512)/forward/CPU/4 thread(s) | 383167 ns | 380479 ns | 1.01 |
batchedmm(16, Bsize=512)/forward/CPU/8 thread(s) | 479645.5 ns | 485125 ns | 0.99 |
batchedmm(16, Bsize=512)/forward/CPU/1 thread(s) | 104708 ns | 106958 ns | 0.98 |
batchedmm(16, Bsize=512)/forward/GPU/CUDA | 67777 ns | | |
batchedmm(16, Bsize=512)/zygote/CPU/2 thread(s) | 892583.5 ns | 937833 ns | 0.95 |
batchedmm(16, Bsize=512)/zygote/CPU/4 thread(s) | 929209 ns | 963250 ns | 0.96 |
batchedmm(16, Bsize=512)/zygote/CPU/8 thread(s) | 1177208 ns | 1216083 ns | 0.97 |
batchedmm(16, Bsize=512)/zygote/CPU/1 thread(s) | 380333.5 ns | 428542 ns | 0.89 |
batchedmm(16, Bsize=512)/zygote/GPU/CUDA | 228581 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 81292 ns | 80291.5 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 80167 ns | 79458 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 82333 ns | 87042 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 77937.5 ns | 80375 ns | 0.97 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 189798 ns | | |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1924667 ns | 1917916.5 ns | 1.00 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1830084 ns | 1918437.5 ns | 0.95 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1928375 ns | 1950812.5 ns | 0.99 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1925208 ns | 1915188 ns | 1.01 |
groupnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 537097.5 ns | | |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 333 ns | 291 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 333 ns | 292 ns | 1.14 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 333 ns | 333 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/forward/GPU/CUDA | 22097 ns | | |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 1834 ns | 1792 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 1875 ns | 1834 ns | 1.02 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 1875 ns | 1875 ns | 1 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 1834 ns | 1875 ns | 0.98 |
dense(2, bias=true, act=identity)(2 x 128)/zygote/GPU/CUDA | 272458 ns | | |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 6417 ns | 6000 ns | 1.07 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5792 ns | 6167 ns | 0.94 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 7416 ns | 7834 ns | 0.95 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 6959 ns | 6125 ns | 1.14 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 113394 ns | | |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 9062.5 ns | 9041 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 9000 ns | 9125 ns | 0.99 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 9667 ns | 9333 ns | 1.04 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 9583 ns | 9625 ns | 1.00 |
groupnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 554729 ns | | |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 119920625 ns | 120446062.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 174078042 ns | 174298416.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 148550875 ns | 155622396 ns | 0.95 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 106645833.5 ns | 104910437 ns | 1.02 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5474727 ns | | |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 616704000 ns | 613470583 ns | 1.01 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 556150562.5 ns | 555889999.5 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 451899542 ns | 467916666 ns | 0.97 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 627747584 ns | 629979541 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 40525731 ns | | |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 718477458.5 ns | 717129562 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 664231708 ns | 665448791 ns | 1.00 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 583167334 ns | 597201792 ns | 0.98 |
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 861400250 ns | 855951979.5 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 59125 ns | 58542 ns | 1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47041 ns | 48208 ns | 0.98 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 48084 ns | 39083 ns | 1.23 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 85125 ns | 80167 ns | 1.06 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 61128 ns | | |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1916791 ns | 1918312.5 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1965750 ns | 1976771 ns | 0.99 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1982417 ns | 1793729 ns | 1.11 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1887333 ns | 1888625 ns | 1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 249435 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 266625 ns | 268666.5 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 266500 ns | 268458 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 271333 ns | 269271 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 268000 ns | 265875 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 212742.5 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 699042 ns | 676000 ns | 1.03 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 689500 ns | 587417 ns | 1.17 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 699875 ns | 601499.5 ns | 1.16 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 598792 ns | 700333 ns | 0.86 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1127714.5 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 2219145.5 ns | 2212542 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 2210937.5 ns | 2211416 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 2224625 ns | 2103833 ns | 1.06 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 2213667 ns | 2216500 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 184976.5 ns | | |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5484500 ns | 5504541 ns | 1.00 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5446916 ns | 5488625 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5519875 ns | 5582375 ns | 0.99 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5520416 ns | 5490917 ns | 1.01 |
layernorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1131736 ns | | |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/2 thread(s) | 660917 ns | 647417 ns | 1.02 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/4 thread(s) | 648208 ns | 641916.5 ns | 1.01 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/8 thread(s) | 640667 ns | 650125 ns | 0.99 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/CPU/1 thread(s) | 645833 ns | 642917 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/forward/GPU/CUDA | 47172 ns | | |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) | 1823917 ns | 1821291 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) | 1720916 ns | 1717958 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) | 1721500 ns | 1666375 ns | 1.03 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) | 2109667 ns | 2103666.5 ns | 1.00 |
dense(512, bias=true, act=gelu)(512 x 128)/zygote/GPU/CUDA | 262960 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58708 ns | 58292 ns | 1.01 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47000 ns | 47209 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 45959 ns | 37250 ns | 1.23 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 84250 ns | 80791 ns | 1.04 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 29716 ns | | |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1991208 ns | 2017916.5 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2077375.5 ns | 2086583 ns | 1.00 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2090500 ns | 1901083 ns | 1.10 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1979542 ns | 1990750 ns | 0.99 |
batchnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 255693.5 ns | | |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 13384041.5 ns | 13371875 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 12375145.5 ns | 12426458 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 12579000 ns | 12666062 ns | 0.99 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 15171021 ns | 15204979 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 629268 ns | | |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 47240959 ns | 47257417 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 41827084 ns | 41744209 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 41324042 ns | 41179062.5 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 58883625 ns | 58639833 ns | 1.00 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3664993 ns | | |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 97032916 ns | 73940917 ns | 1.31 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 91851292 ns | 90904041 ns | 1.01 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 68140083 ns | 91001000 ns | 0.75 |
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 76516750 ns | 98448625 ns | 0.78 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 58875 ns | 58833 ns | 1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 47209 ns | 47958 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 47416 ns | 38542 ns | 1.23 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 82250 ns | 84292 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 72599.5 ns | | |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1926667 ns | 1904750 ns | 1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1924167 ns | 1969542 ns | 0.98 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1978021 ns | 1800875 ns | 1.10 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1879000 ns | 1895917 ns | 0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 253626 ns | | |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 416 ns | 292 ns | 1.42 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 333 ns | 416 ns | 0.80 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 333 ns | 375 ns | 0.89 |
batchnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 39189 ns | | |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 6458 ns | 6145.5 ns | 1.05 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 6708 ns | 6458 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 6625 ns | 6375 ns | 1.04 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 6417 ns | 6625 ns | 0.97 |
batchnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 222739.5 ns | | |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/2 thread(s) | 292 ns | 250 ns | 1.17 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/4 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/8 thread(s) | 292 ns | 292 ns | 1 |
dense(2, bias=false, act=identity)(2 x 128)/forward/CPU/1 thread(s) | 291 ns | 292 ns | 1.00 |
dense(2, bias=false, act=identity)(2 x 128)/forward/GPU/CUDA | 31052 ns | | |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/2 thread(s) | 2834 ns | 2666 ns | 1.06 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/4 thread(s) | 2833 ns | 2875 ns | 0.99 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/8 thread(s) | 2917 ns | 2833 ns | 1.03 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/CPU/1 thread(s) | 2667 ns | 2875 ns | 0.93 |
dense(2, bias=false, act=identity)(2 x 128)/zygote/GPU/CUDA | 206648 ns | | |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 283869875 ns | 284556437.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 339651896 ns | 340224270.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 314422125 ns | 320916166 ns | 0.98 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 274518562 ns | 270718833 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 8817259 ns | | |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 1007456249.5 ns | 998965333.5 ns | 1.01 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 922993062.5 ns | 956359521 ns | 0.97 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 820887500 ns | 868085334 ns | 0.95 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 1290788041.5 ns | 1210263479.5 ns | 1.07 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 44595138 ns | | |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 1744020125 ns | 1439494000 ns | 1.21 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 1682554667 ns | 1675455020.5 ns | 1.00 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 1238035604.5 ns | 1623450375 ns | 0.76 |
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 1425416667 ns | 1781275542 ns | 0.80 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1415292 ns | 1402500 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1405374.5 ns | 1406416 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1426750 ns | 1410125 ns | 1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1456500 ns | 1406875 ns | 1.04 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 124133.5 ns | | |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 5032146 ns | 5015125 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5016417 ns | 5021375 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5036520.5 ns | 5065333 ns | 0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5023375.5 ns | 5030104.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 775369 ns | | |
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) | 178978250 ns | 178918125 ns | 1.00 |
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) | 137135917 ns | 137633791 ns | 1.00 |
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) | 112436937.5 ns | 137284041 ns | 0.82 |
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) | 168593666 ns | 169122750 ns | 1.00 |
vgg16(32, 32, 3, 32)/forward/GPU/CUDA | 5315150 ns | | |
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) | 820614479.5 ns | 824093375 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) | 492889042 ns | 493391208 ns | 1.00 |
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) | 537938833.5 ns | 544904625 ns | 0.99 |
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) | 650240833.5 ns | 646424584 ns | 1.01 |
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA | 16841266 ns | | |
batchedmm(512, Bsize=32)/forward/CPU/2 thread(s) | 8954458 ns | 8944417 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/4 thread(s) | 8913583 ns | 8930333 ns | 1.00 |
batchedmm(512, Bsize=32)/forward/CPU/8 thread(s) | 7890083.5 ns | 8002583 ns | 0.99 |
batchedmm(512, Bsize=32)/forward/CPU/1 thread(s) | 9817687.5 ns | 9740458 ns | 1.01 |
batchedmm(512, Bsize=32)/forward/GPU/CUDA | 1594567 ns | | |
batchedmm(512, Bsize=32)/zygote/CPU/2 thread(s) | 35907125 ns | 37148750 ns | 0.97 |
batchedmm(512, Bsize=32)/zygote/CPU/4 thread(s) | 37172208 ns | 36964208 ns | 1.01 |
batchedmm(512, Bsize=32)/zygote/CPU/8 thread(s) | 33643959 ns | 34465958 ns | 0.98 |
batchedmm(512, Bsize=32)/zygote/CPU/1 thread(s) | 38424416 ns | 38308250 ns | 1.00 |
batchedmm(512, Bsize=32)/zygote/GPU/CUDA | 7238610 ns | | |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/2 thread(s) | 47500 ns | 47458 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/4 thread(s) | 47520.5 ns | 47334 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/8 thread(s) | 47584 ns | 47542 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/CPU/1 thread(s) | 47375 ns | 47584 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/forward/GPU/CUDA | 26021 ns | | |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/2 thread(s) | 50416 ns | 50542 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/4 thread(s) | 50542 ns | 50542 ns | 1 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/8 thread(s) | 51041 ns | 50625 ns | 1.01 |
bias_activation(32, act=tanh)(32 x 128)/zygote/CPU/1 thread(s) | 50250 ns | 50500 ns | 1.00 |
bias_activation(32, act=tanh)(32 x 128)/zygote/GPU/CUDA | 302627 ns | | |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 6750 ns | 6292 ns | 1.07 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 6583 ns | 6625 ns | 0.99 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7958 ns | 8479 ns | 0.94 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 7187.5 ns | 6792 ns | 1.06 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 119253 ns | | |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 10708 ns | 9584 ns | 1.12 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 10625 ns | 0.93 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 10542 ns | 10375 ns | 1.02 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 10459 ns | 10458 ns | 1.00 |
groupnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 712007 ns | | |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 5833 ns | 5250 ns | 1.11 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 5625 ns | 5917 ns | 0.95 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 6875 ns | 7917 ns | 0.87 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 5917 ns | 5750 ns | 1.03 |
groupnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA | 117137 ns | | |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 16417 ns | 18291.5 ns | 0.90 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 16292 ns | 15958 ns | 1.02 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 16709 ns | 16500 ns | 1.01 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 15916 ns | 16583 ns | 0.96 |
groupnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA | 557035.5 ns | | |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1083 ns | 1 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 1042 ns | 1083 ns | 0.96 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 1125 ns | 1083 ns | 1.04 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 1083 ns | 1084 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/forward/GPU/CUDA | 38601 ns | | |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 8708 ns | 8104.5 ns | 1.07 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 8292 ns | 8084 ns | 1.03 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 8792 ns | 8125 ns | 1.08 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 8416 ns | 8458 ns | 1.00 |
batchnorm(2, act=gelu, affine=false)(4 x 32)/zygote/GPU/CUDA | 258890.5 ns | | |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 23166 ns | 23125 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 23208 ns | 23167 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 23584 ns | 23167 ns | 1.02 |
bias_activation(32, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 23145.5 ns | 23541 ns | 0.98 |
bias_activation(32, act=gelu)(32 x 128)/forward/GPU/CUDA | 25352 ns | | |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 52875 ns | 52500 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 52708 ns | 52417 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 53042 ns | 52645.5 ns | 1.01 |
bias_activation(32, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 52375 ns | 52458 ns | 1.00 |
bias_activation(32, act=gelu)(32 x 128)/zygote/GPU/CUDA | 313345.5 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1452354 ns | 1405062.5 ns | 1.03 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1404208 ns | 1402583.5 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1397917 ns | 1406875 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1454834 ns | 1403729.5 ns | 1.04 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 191387.5 ns | | |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4996375 ns | 5007708 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5013791.5 ns | 5013292 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5020375 ns | 5046271 ns | 0.99 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5026479 ns | 5005125 ns | 1.00 |
groupnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 628549.5 ns | | |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) | 3055250 ns | 3074708 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) | 2100041.5 ns | 2091499.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) | 2293333 ns | 2290083.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) | 4930334 ns | 4915708.5 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA | 585303.5 ns | | |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) | 24438916.5 ns | 24422083 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) | 18861291.5 ns | 18926750 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) | 17824792 ns | 18059792 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) | 36085333 ns | 35835500.5 ns | 1.01 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA | 3449539 ns | | |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) | 34014625 ns | 34039292 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) | 28467208 ns | 28325625 ns | 1.00 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) | 28041750 ns | 28468583 ns | 0.99 |
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) | 41730458.5 ns | 41461250 ns | 1.01 |
batchedmm(512, Bsize=512)/forward/CPU/2 thread(s) | 144611729.5 ns | 144570938 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/4 thread(s) | 148240375 ns | 147768250 ns | 1.00 |
batchedmm(512, Bsize=512)/forward/CPU/8 thread(s) | 125818584 ns | 127812375 ns | 0.98 |
batchedmm(512, Bsize=512)/forward/CPU/1 thread(s) | 174654208 ns | 173201708 ns | 1.01 |
batchedmm(512, Bsize=512)/forward/GPU/CUDA | 22555549.5 ns | | |
batchedmm(512, Bsize=512)/zygote/CPU/2 thread(s) | 919226875 ns | 952803959 ns | 0.96 |
batchedmm(512, Bsize=512)/zygote/CPU/4 thread(s) | 809074708.5 ns | 1880403417 ns | 0.43 |
batchedmm(512, Bsize=512)/zygote/CPU/8 thread(s) | 1254080292 ns | 721103250 ns | 1.74 |
batchedmm(512, Bsize=512)/zygote/CPU/1 thread(s) | 685615979 ns | 665759084 ns | 1.03 |
batchedmm(512, Bsize=512)/zygote/GPU/CUDA | 126044369 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 72083 ns | 77270.5 ns | 0.93 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 74604 ns | 72541 ns | 1.03 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 75437.5 ns | 76166 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 75374.5 ns | 72646 ns | 1.04 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 215767 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 295375 ns | 291833 ns | 1.01 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 282709 ns | 193625 ns | 1.46 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 298208 ns | 275146 ns | 1.08 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 309209 ns | 289604.5 ns | 1.07 |
layernorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1275180 ns | | |
batchedmm(512, Bsize=128)/forward/CPU/2 thread(s) | 35547541.5 ns | 35435979 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/4 thread(s) | 36315375 ns | 36430959 ns | 1.00 |
batchedmm(512, Bsize=128)/forward/CPU/8 thread(s) | 32440208 ns | 32728396 ns | 0.99 |
batchedmm(512, Bsize=128)/forward/CPU/1 thread(s) | 40768000 ns | 40524416 ns | 1.01 |
batchedmm(512, Bsize=128)/forward/GPU/CUDA | 5842752 ns | | |
batchedmm(512, Bsize=128)/zygote/CPU/2 thread(s) | 147568416 ns | 148443209 ns | 0.99 |
batchedmm(512, Bsize=128)/zygote/CPU/4 thread(s) | 155520042 ns | 153839875 ns | 1.01 |
batchedmm(512, Bsize=128)/zygote/CPU/8 thread(s) | 138729500 ns | 142207500 ns | 0.98 |
batchedmm(512, Bsize=128)/zygote/CPU/1 thread(s) | 287943603.5 ns | 286559208 ns | 1.00 |
batchedmm(512, Bsize=128)/zygote/GPU/CUDA | 38833581 ns | | |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) | 119159459 ns | 121670542 ns | 0.98 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) | 173817812.5 ns | 174360666.5 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) | 147923417 ns | 155087062.5 ns | 0.95 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) | 105357583 ns | 106968083 ns | 0.98 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA | 5465142.5 ns | | |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) | 468369208 ns | 468237229 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) | 467813208 ns | 467305229 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) | 438906333 ns | 457270500 ns | 0.96 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) | 744496896 ns | 742197000 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA | 37308411 ns | | |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) | 651954375 ns | 775778042 ns | 0.84 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) | 641168458 ns | 639059458 ns | 1.00 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) | 627744458 ns | 642570667 ns | 0.98 |
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) | 853922062.5 ns | 849532312.5 ns | 1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) |
1333375 ns |
1345916 ns |
0.99 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) |
992167 ns |
984292 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) |
984416.5 ns |
764770.5 ns |
1.29 |
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) |
2110042 ns |
2095229.5 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA |
573453 ns |
||
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) |
2969000 ns |
2954875 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) |
2635166 ns |
2619000 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) |
2641166 ns |
2499292 ns |
1.06 |
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) |
3739625 ns |
3688708.5 ns |
1.01 |
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA |
1548896.5 ns |
||
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) |
5795209 ns |
5790208 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) |
5779875 ns |
5791792 ns |
1.00 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) |
5789583 ns |
5888041 ns |
0.98 |
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) |
2913583 ns |
2887459 ns |
1.01 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7416 ns |
7208 ns |
1.03 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6041 ns |
5833 ns |
1.04 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6250 ns |
5250 ns |
1.19 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10167 ns |
10125 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
34401 ns |
||
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213917 ns |
223354 ns |
0.96 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
232666 ns |
232209 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
220896 ns |
220729.5 ns |
1.00 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
207499.5 ns |
219292 ns |
0.95 |
batchnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
242853 ns |
||
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) |
303103666.5 ns |
303148916.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) |
220145770.5 ns |
220759541.5 ns |
1.00 |
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) |
196298750 ns |
221905479 ns |
0.88 |
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) |
311409875 ns |
309164583 ns |
1.01 |
vgg16(32, 32, 3, 64)/forward/GPU/CUDA |
8609621 ns |
||
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) |
1240680500 ns |
1233285583 ns |
1.01 |
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) |
907039125 ns |
899326000 ns |
1.01 |
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) |
810799666.5 ns |
858911520.5 ns |
0.94 |
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) |
1153385750 ns |
1144926250 ns |
1.01 |
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA |
28911378.5 ns |
||
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
5208.5 ns |
4959 ns |
1.05 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
4917 ns |
5209 ns |
0.94 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
6708.5 ns |
6875 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
5459 ns |
5125 ns |
1.07 |
groupnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
113562 ns |
||
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
10375 ns |
10333 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
10000 ns |
10209 ns |
0.98 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
10375 ns |
10375 ns |
1.00 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
9708 ns |
10583 ns |
0.92 |
groupnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
583379 ns |
||
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) |
584 ns |
500 ns |
1.17 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) |
625 ns |
625 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
625 ns |
1.00 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
625 ns |
0.87 |
batchnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA |
30377 ns |
||
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) |
9458 ns |
9125 ns |
1.04 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) |
9416.5 ns |
9208 ns |
1.02 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) |
10145.5 ns |
9209 ns |
1.10 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) |
9708 ns |
9417 ns |
1.03 |
batchnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA |
217017.5 ns |
||
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/2 thread(s) |
351958 ns |
352041 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/4 thread(s) |
351625 ns |
352167 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/8 thread(s) |
354667 ns |
352833 ns |
1.01 |
bias_activation(512, act=gelu)(512 x 128)/forward/CPU/1 thread(s) |
351854.5 ns |
352250 ns |
1.00 |
bias_activation(512, act=gelu)(512 x 128)/forward/GPU/CUDA |
29337 ns |
||
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/2 thread(s) |
774041.5 ns |
810042 ns |
0.96 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/4 thread(s) |
810791.5 ns |
832334 ns |
0.97 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/8 thread(s) |
823250 ns |
777896 ns |
1.06 |
bias_activation(512, act=gelu)(512 x 128)/zygote/CPU/1 thread(s) |
812292 ns |
833959 ns |
0.97 |
bias_activation(512, act=gelu)(512 x 128)/zygote/GPU/CUDA |
243513.5 ns |
||
batchedmm(16, Bsize=32)/forward/CPU/2 thread(s) |
334834 ns |
339375 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/CPU/4 thread(s) |
342417 ns |
345208.5 ns |
0.99 |
batchedmm(16, Bsize=32)/forward/CPU/8 thread(s) |
453792 ns |
443583 ns |
1.02 |
batchedmm(16, Bsize=32)/forward/CPU/1 thread(s) |
10833 ns |
10500 ns |
1.03 |
batchedmm(16, Bsize=32)/forward/GPU/CUDA |
18172 ns |
||
batchedmm(16, Bsize=32)/zygote/CPU/2 thread(s) |
710333.5 ns |
720437.5 ns |
0.99 |
batchedmm(16, Bsize=32)/zygote/CPU/4 thread(s) |
730500 ns |
730000 ns |
1.00 |
batchedmm(16, Bsize=32)/zygote/CPU/8 thread(s) |
1003188 ns |
1036000 ns |
0.97 |
batchedmm(16, Bsize=32)/zygote/CPU/1 thread(s) |
26750 ns |
26584 ns |
1.01 |
batchedmm(16, Bsize=32)/zygote/GPU/CUDA |
214803.5 ns |
||
batchedmm(16, Bsize=128)/forward/CPU/2 thread(s) |
378583 ns |
378750 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/CPU/4 thread(s) |
349792 ns |
347042 ns |
1.01 |
batchedmm(16, Bsize=128)/forward/CPU/8 thread(s) |
441333 ns |
446167 ns |
0.99 |
batchedmm(16, Bsize=128)/forward/CPU/1 thread(s) |
30229.5 ns |
30208 ns |
1.00 |
batchedmm(16, Bsize=128)/forward/GPU/CUDA |
23132 ns |
||
batchedmm(16, Bsize=128)/zygote/CPU/2 thread(s) |
727499.5 ns |
736541 ns |
0.99 |
batchedmm(16, Bsize=128)/zygote/CPU/4 thread(s) |
777500 ns |
781270.5 ns |
1.00 |
batchedmm(16, Bsize=128)/zygote/CPU/8 thread(s) |
1037916.5 ns |
1066792 ns |
0.97 |
batchedmm(16, Bsize=128)/zygote/CPU/1 thread(s) |
102104 ns |
104812.5 ns |
0.97 |
batchedmm(16, Bsize=128)/zygote/GPU/CUDA |
197538 ns |
||
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/2 thread(s) |
3417 ns |
3375 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/4 thread(s) |
3584 ns |
3458 ns |
1.04 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/8 thread(s) |
3666.5 ns |
3709 ns |
0.99 |
bias_activation(2, act=tanh)(2 x 128)/forward/CPU/1 thread(s) |
3458 ns |
3625 ns |
0.95 |
bias_activation(2, act=tanh)(2 x 128)/forward/GPU/CUDA |
23861 ns |
||
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/2 thread(s) |
4209 ns |
4167 ns |
1.01 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/4 thread(s) |
4333 ns |
4208 ns |
1.03 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/8 thread(s) |
4334 ns |
4250 ns |
1.02 |
bias_activation(2, act=tanh)(2 x 128)/zygote/CPU/1 thread(s) |
4166 ns |
4291 ns |
0.97 |
bias_activation(2, act=tanh)(2 x 128)/zygote/GPU/CUDA |
243845.5 ns |
||
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) |
3542 ns |
3625 ns |
0.98 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) |
3791.5 ns |
3375 ns |
1.12 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) |
4291 ns |
4437.5 ns |
0.97 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) |
3500 ns |
3708 ns |
0.94 |
layernorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA |
158016 ns |
||
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) |
8709 ns |
8375 ns |
1.04 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) |
8667 ns |
8208 ns |
1.06 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) |
8708 ns |
8583 ns |
1.01 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) |
8833 ns |
8542 ns |
1.03 |
layernorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA |
966403.5 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
203083 ns |
205167 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
211334 ns |
209208 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
209125 ns |
208833 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
200750 ns |
199083 ns |
1.01 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
43369 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
602875 ns |
606958 ns |
0.99 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
644895.5 ns |
671708 ns |
0.96 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
668625.5 ns |
624000 ns |
1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
586333 ns |
633208 ns |
0.93 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
280503 ns |
||
batchedmm(128, Bsize=128)/forward/CPU/2 thread(s) |
992500 ns |
996958.5 ns |
1.00 |
batchedmm(128, Bsize=128)/forward/CPU/4 thread(s) |
1015062 ns |
1038063 ns |
0.98 |
batchedmm(128, Bsize=128)/forward/CPU/8 thread(s) |
956084 ns |
970916.5 ns |
0.98 |
batchedmm(128, Bsize=128)/forward/CPU/1 thread(s) |
878750 ns |
870270.5 ns |
1.01 |
batchedmm(128, Bsize=128)/forward/GPU/CUDA |
208314.5 ns |
||
batchedmm(128, Bsize=128)/zygote/CPU/2 thread(s) |
4501417 ns |
4514312 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/4 thread(s) |
4719833 ns |
4740687.5 ns |
1.00 |
batchedmm(128, Bsize=128)/zygote/CPU/8 thread(s) |
4455458.5 ns |
4626625 ns |
0.96 |
batchedmm(128, Bsize=128)/zygote/CPU/1 thread(s) |
4330333 ns |
4278333 ns |
1.01 |
batchedmm(128, Bsize=128)/zygote/GPU/CUDA |
968639 ns |
||
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3084 ns |
3083 ns |
1.00 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
3250 ns |
3209 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
4000 ns |
4417 ns |
0.91 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
3500 ns |
3458 ns |
1.01 |
layernorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA |
148838 ns |
||
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7500 ns |
7250 ns |
1.03 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7584 ns |
7167 ns |
1.06 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7708 ns |
7333 ns |
1.05 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7375 ns |
7541 ns |
0.98 |
layernorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA |
871483 ns |
||
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) |
1626208 ns |
1650062.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) |
1188833 ns |
1162479.5 ns |
1.02 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) |
1372646 ns |
1343562.5 ns |
1.02 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) |
2483103.5 ns |
2474584 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA |
215572.5 ns |
||
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) |
12362437.5 ns |
12306500 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) |
9588125 ns |
9576334 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) |
9312000 ns |
9347167 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) |
18078500 ns |
18004520.5 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA |
2029403 ns |
||
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) |
17307833 ns |
17357042 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) |
14405084 ns |
14404458 ns |
1.00 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) |
14373125 ns |
14505083.5 ns |
0.99 |
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) |
21126541.5 ns |
21117625 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
88625 ns |
88584 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
139270.5 ns |
89416.5 ns |
1.56 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
96167 ns |
91000 ns |
1.06 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
132542 ns |
116312.5 ns |
1.14 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
121462 ns |
||
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2017959 ns |
2027750 ns |
1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2029292 ns |
2156354 ns |
0.94 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2020083 ns |
1755083 ns |
1.15 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2035875.5 ns |
2022583 ns |
1.01 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
757899 ns |
||
batchedmm(2, Bsize=4)/forward/CPU/2 thread(s) |
2459 ns |
3416 ns |
0.72 |
batchedmm(2, Bsize=4)/forward/CPU/4 thread(s) |
2833 ns |
2792 ns |
1.01 |
batchedmm(2, Bsize=4)/forward/CPU/8 thread(s) |
3146 ns |
2021 ns |
1.56 |
batchedmm(2, Bsize=4)/forward/CPU/1 thread(s) |
2083 ns |
3459 ns |
0.60 |
batchedmm(2, Bsize=4)/forward/GPU/CUDA |
16243 ns |
||
batchedmm(2, Bsize=4)/zygote/CPU/2 thread(s) |
2958 ns |
2750 ns |
1.08 |
batchedmm(2, Bsize=4)/zygote/CPU/4 thread(s) |
3083 ns |
3042 ns |
1.01 |
batchedmm(2, Bsize=4)/zygote/CPU/8 thread(s) |
3167 ns |
3083 ns |
1.03 |
batchedmm(2, Bsize=4)/zygote/CPU/1 thread(s) |
3000 ns |
3084 ns |
0.97 |
batchedmm(2, Bsize=4)/zygote/GPU/CUDA |
165020.5 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
7250 ns |
7209 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
6041 ns |
6041 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
6125 ns |
5333 ns |
1.15 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
10125 ns |
10083 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
42625 ns |
||
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
213791.5 ns |
214125 ns |
1.00 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
232500 ns |
229084 ns |
1.01 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
230854 ns |
223791.5 ns |
1.03 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
219667 ns |
221708 ns |
0.99 |
batchnorm(4, act=identity, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
249546 ns |
||
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/2 thread(s) |
3750 ns |
3708 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/4 thread(s) |
3750 ns |
3792 ns |
0.99 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/8 thread(s) |
3792 ns |
3791 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/forward/CPU/1 thread(s) |
3750 ns |
3708 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/forward/GPU/CUDA |
22106 ns |
||
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/2 thread(s) |
14562.5 ns |
14584 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/4 thread(s) |
14625 ns |
14458 ns |
1.01 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/8 thread(s) |
14542 ns |
14292 ns |
1.02 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/CPU/1 thread(s) |
14625 ns |
14583 ns |
1.00 |
dense(32, bias=true, act=identity)(32 x 128)/zygote/GPU/CUDA |
316306.5 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
92854 ns |
96000 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
122563 ns |
91334 ns |
1.34 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
100666.5 ns |
94166.5 ns |
1.07 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
92333 ns |
137583 ns |
0.67 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
121334 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
1653291 ns |
1927479 ns |
0.86 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
1927479 ns |
1933333 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
1927083 ns |
1671542 ns |
1.15 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
1942978.5 ns |
1929000 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
758061 ns |
||
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) |
865396 ns |
880583 ns |
0.98 |
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) |
817916.5 ns |
820750 ns |
1.00 |
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) |
1234083 ns |
1161125 ns |
1.06 |
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) |
977709 ns |
964042 ns |
1.01 |
lenet(28, 28, 1, 32)/forward/GPU/CUDA |
284889 ns |
||
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) |
2728500 ns |
2817062.5 ns |
0.97 |
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) |
2535458 ns |
2505978.5 ns |
1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) |
3362209 ns |
3333708 ns |
1.01 |
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) |
3419083 ns |
3424937.5 ns |
1.00 |
lenet(28, 28, 1, 32)/zygote/GPU/CUDA |
1411392.5 ns |
||
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
17916 ns |
17166 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
18083.5 ns |
15292 ns |
1.18 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
17875 ns |
16937.5 ns |
1.06 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
17979 ns |
16792 ns |
1.07 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
100694 ns |
||
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
215458.5 ns |
227729.5 ns |
0.95 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
260188 ns |
260125 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
216292 ns |
216458 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
228917 ns |
259708 ns |
0.88 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
516979.5 ns |
||
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
221708 ns |
221208.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
222167 ns |
221937 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
222292 ns |
221042 ns |
1.01 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
221687.5 ns |
221958.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
201006.5 ns |
||
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
557083 ns |
495666 ns |
1.12 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
496917 ns |
561062.5 ns |
0.89 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
497792 ns |
501250 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
510145.5 ns |
572917 ns |
0.89 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
1045931.5 ns |
||
batchedmm(16, Bsize=4)/forward/CPU/2 thread(s) |
3750 ns |
4167 ns |
0.90 |
batchedmm(16, Bsize=4)/forward/CPU/4 thread(s) |
3416 ns |
3625 ns |
0.94 |
batchedmm(16, Bsize=4)/forward/CPU/8 thread(s) |
5083 ns |
5417 ns |
0.94 |
batchedmm(16, Bsize=4)/forward/CPU/1 thread(s) |
4375 ns |
3750 ns |
1.17 |
batchedmm(16, Bsize=4)/forward/GPU/CUDA |
17558 ns |
||
batchedmm(16, Bsize=4)/zygote/CPU/2 thread(s) |
7625 ns |
7500 ns |
1.02 |
batchedmm(16, Bsize=4)/zygote/CPU/4 thread(s) |
7417 ns |
7458 ns |
0.99 |
batchedmm(16, Bsize=4)/zygote/CPU/8 thread(s) |
7542 ns |
7458 ns |
1.01 |
batchedmm(16, Bsize=4)/zygote/CPU/1 thread(s) |
7500 ns |
7917 ns |
0.95 |
batchedmm(16, Bsize=4)/zygote/GPU/CUDA |
172767 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
18917 ns |
18625 ns |
1.02 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
19166.5 ns |
17500 ns |
1.10 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
20062 ns |
19375 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
19021 ns |
18292 ns |
1.04 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
101523.5 ns |
||
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
218145.5 ns |
223917 ns |
0.97 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
228291 ns |
229208.5 ns |
1.00 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
217000 ns |
218333 ns |
0.99 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
229958 ns |
228667 ns |
1.01 |
groupnorm(4, act=identity, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
670460 ns |
||
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
4333 ns |
4166 ns |
1.04 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
4250 ns |
4166 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
5042 ns |
5375 ns |
0.94 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
4520.5 ns |
4416 ns |
1.02 |
layernorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA |
150984 ns |
||
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
10667 ns |
10042 ns |
1.06 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
10625 ns |
9750 ns |
1.09 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
10416 ns |
10417 ns |
1.00 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
10667 ns |
10334 ns |
1.03 |
layernorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA |
882648.5 ns |
||
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
3166 ns |
3375 ns |
0.94 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
2833 ns |
2833 ns |
1.00 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
3875 ns |
4375 ns |
0.89 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
2709 ns |
2792 ns |
0.97 |
layernorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
152872 ns |
||
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
7709 ns |
7083 ns |
1.09 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
7291 ns |
7333 ns |
0.99 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
7667 ns |
7417 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
7625 ns |
7375 ns |
1.03 |
layernorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
914495.5 ns |
||
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) |
23680125.5 ns |
23307041.5 ns |
1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) |
34568458 ns |
33839458 ns |
1.02 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) |
37926167 ns |
40745646 ns |
0.93 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) |
35101750 ns |
34862708 ns |
1.01 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA |
1860968.5 ns |
||
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) |
183757875 ns |
184254354 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) |
169713437 ns |
169428437.5 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) |
146792083 ns |
150235166.5 ns |
0.98 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) |
273709458 ns |
273092750 ns |
1.00 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA |
20977876 ns |
||
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) |
282161916 ns |
284314042 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) |
257729875 ns |
259222834 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) |
230823500 ns |
233454625 ns |
0.99 |
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) |
324459500 ns |
323194834 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) |
184875 ns |
183354.5 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) |
183896.5 ns |
182083 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) |
185458 ns |
185375 ns |
1.00 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) |
186041 ns |
183166.5 ns |
1.02 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA |
105716 ns |
||
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) |
588188 ns |
598042 ns |
0.98 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) |
632709 ns |
638604 ns |
0.99 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) |
631208 ns |
590042 ns |
1.07 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) |
646333.5 ns |
639625 ns |
1.01 |
groupnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA |
712382.5 ns |
||
batchedmm(128, Bsize=512)/forward/CPU/2 thread(s) |
3842625 ns |
3814396 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/4 thread(s) |
3954125 ns |
3917959 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/CPU/8 thread(s) |
3499167 ns |
3558667 ns |
0.98 |
batchedmm(128, Bsize=512)/forward/CPU/1 thread(s) |
4605875 ns |
4558792 ns |
1.01 |
batchedmm(128, Bsize=512)/forward/GPU/CUDA |
531878 ns |
||
batchedmm(128, Bsize=512)/zygote/CPU/2 thread(s) |
17425999.5 ns |
17242875 ns |
1.01 |
batchedmm(128, Bsize=512)/zygote/CPU/4 thread(s) |
17768729 ns |
17847895.5 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/CPU/8 thread(s) |
16557250 ns |
16851208 ns |
0.98 |
batchedmm(128, Bsize=512)/zygote/CPU/1 thread(s) |
20045083 ns |
19971167 ns |
1.00 |
batchedmm(128, Bsize=512)/zygote/GPU/CUDA |
3453483.5 ns |
||
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/2 thread(s) |
584 ns |
500 ns |
1.17 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/4 thread(s) |
583 ns |
625 ns |
0.93 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/8 thread(s) |
625 ns |
542 ns |
1.15 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/CPU/1 thread(s) |
542 ns |
667 ns |
0.81 |
batchnorm(2, act=relu, affine=false)(32 x 32)/forward/GPU/CUDA |
33258 ns |
||
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) |
9500 ns |
9333 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) |
9750 ns |
8917 ns |
1.09 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) |
9500 ns |
9792 ns |
0.97 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) |
9292 ns |
9750 ns |
0.95 |
batchnorm(2, act=relu, affine=false)(32 x 32)/zygote/GPU/CUDA |
241834.5 ns |
||
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) |
591184646 ns |
652733938 ns |
0.91 |
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) |
395387916 ns |
393383500 ns |
1.01 |
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) |
376635104 ns |
395122417 ns |
0.95 |
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) |
607672417 ns |
624702084 ns |
0.97 |
vgg16(32, 32, 3, 128)/forward/GPU/CUDA |
14365243.5 ns |
||
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) |
1891201917 ns |
1882307625 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) |
1640746145.5 ns |
1638716333.5 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) |
1511389417 ns |
1551357292 ns |
0.97 |
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) |
2292886042 ns |
2292499417 ns |
1.00 |
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA |
53675528.5 ns |
||
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 1640541.5 ns | 1649417 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 1190791 ns | 1198625 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 1393709 ns | 1369208 ns | 1.02 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 2499812 ns | 2494208 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 216895 ns | | |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 12724166 ns | 12699979.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 9944959 ns | 9947354 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 9695479.5 ns | 9680125.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 18369375 ns | 18361875 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2072635.5 ns | | |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 17716333 ns | 17714687.5 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 14731896.5 ns | 14723938 ns | 1.00 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 14584209 ns | 14690791 ns | 0.99 |
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 21451354.5 ns | 21421188 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/2 thread(s) | 26291 ns | 26250 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/4 thread(s) | 26292 ns | 26209 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/8 thread(s) | 26250 ns | 26209 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/CPU/1 thread(s) | 26250 ns | 26250 ns | 1 |
dense(32, bias=false, act=gelu)(32 x 128)/forward/GPU/CUDA | 24451 ns | | |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) | 67500 ns | 67292 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) | 67333 ns | 67625 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) | 68417 ns | 67000 ns | 1.02 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) | 67292 ns | 67167 ns | 1.00 |
dense(32, bias=false, act=gelu)(32 x 128)/zygote/GPU/CUDA | 302833.5 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 203667 ns | 204208 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 210000 ns | 209583 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 209750 ns | 209542 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 201583 ns | 199166 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 35493 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 602458.5 ns | 602458 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 673625 ns | 626542 ns | 1.08 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 667791.5 ns | 624687.5 ns | 1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 635666.5 ns | 632958 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 265416 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 634250 ns | 656125 ns | 0.97 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 648708 ns | 646104 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 643875 ns | 546958 ns | 1.18 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 667792 ns | 679042 ns | 0.98 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 183437 ns | | |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2237520.5 ns | 2259375 ns | 0.99 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2253354 ns | 2247416.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2252208.5 ns | 2013146 ns | 1.12 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2267375 ns | 2262166.5 ns | 1.00 |
layernorm(4, act=identity, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1064214 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 18979 ns | 18354.5 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 19250 ns | 17375 ns | 1.11 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 20270.5 ns | 19625 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 18542 ns | 18542 ns | 1 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 98452.5 ns | | |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 256687.5 ns | 259959 ns | 0.99 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 263770.5 ns | 263500 ns | 1.00 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 227812.5 ns | 221375 ns | 1.03 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 267416.5 ns | 261334 ns | 1.02 |
groupnorm(4, act=relu, affine=true)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 681426.5 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 625 ns | 584 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 584 ns | 625 ns | 0.93 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 666 ns | 625 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 584 ns | 708 ns | 0.82 |
batchnorm(2, act=relu, affine=true)(32 x 32)/forward/GPU/CUDA | 24215 ns | | |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 10042 ns | 10125 ns | 0.99 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 9833 ns | 9709 ns | 1.01 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 10500 ns | 10458 ns | 1.00 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 9979.5 ns | 10250 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(32 x 32)/zygote/GPU/CUDA | 216001 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/2 thread(s) | 5583 ns | 5500 ns | 1.02 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5375 ns | 1.10 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/8 thread(s) | 6875 ns | 7041.5 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/CPU/1 thread(s) | 5333 ns | 5167 ns | 1.03 |
groupnorm(2, act=identity, affine=false)(4 x 32)/forward/GPU/CUDA | 112951.5 ns | | |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/2 thread(s) | 7833 ns | 7875 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/4 thread(s) | 7292 ns | 7750 ns | 0.94 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/8 thread(s) | 7375 ns | 7542 ns | 0.98 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/CPU/1 thread(s) | 7250 ns | 7791 ns | 0.93 |
groupnorm(2, act=identity, affine=false)(4 x 32)/zygote/GPU/CUDA | 550067.5 ns | | |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 2208 ns | 2041 ns | 1.08 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 2083 ns | 1958 ns | 1.06 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 2500 ns | 2209 ns | 1.13 |
bias_activation(2, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 2042 ns | 2167 ns | 0.94 |
bias_activation(2, act=gelu)(2 x 128)/forward/GPU/CUDA | 23210 ns | | |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6750 ns | 6333 ns | 1.07 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 6583 ns | 6542 ns | 1.01 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6875 ns | 6416 ns | 1.07 |
bias_activation(2, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 6375 ns | 6666 ns | 0.96 |
bias_activation(2, act=gelu)(2 x 128)/zygote/GPU/CUDA | 255514.5 ns | | |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/2 thread(s) | 750895.5 ns | 749417 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/4 thread(s) | 747375.5 ns | 746625 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/8 thread(s) | 750500 ns | 749166.5 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/forward/CPU/1 thread(s) | 748833 ns | 772625 ns | 0.97 |
bias_activation(512, act=tanh)(512 x 128)/forward/GPU/CUDA | 29391 ns | | |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/2 thread(s) | 794041 ns | 792667 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/4 thread(s) | 791520.5 ns | 792625 ns | 1.00 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/8 thread(s) | 811791 ns | 775750 ns | 1.05 |
bias_activation(512, act=tanh)(512 x 128)/zygote/CPU/1 thread(s) | 797770.5 ns | 808562.5 ns | 0.99 |
bias_activation(512, act=tanh)(512 x 128)/zygote/GPU/CUDA | 228730 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 7333 ns | 7334 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 5958 ns | 5959 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5333 ns | 1.13 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 10209 ns | 10125 ns | 1.01 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 33871 ns | | |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 220562.5 ns | 220166 ns | 1.00 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 266103.5 ns | 239292 ns | 1.11 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 239604 ns | 229167 ns | 1.05 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 261875 ns | 254959 ns | 1.03 |
batchnorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 267782 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 9833 ns | 9792 ns | 1.00 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 9750 ns | 10000 ns | 0.97 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 10667 ns | 11166 ns | 0.96 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 10000 ns | 9750 ns | 1.03 |
layernorm(2, act=gelu, affine=false)(32 x 32)/forward/GPU/CUDA | 145818 ns | | |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 25000 ns | 24541 ns | 1.02 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 24084 ns | 24291 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 24604.5 ns | 24917 ns | 0.99 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 24979 ns | 24625 ns | 1.01 |
layernorm(2, act=gelu, affine=false)(32 x 32)/zygote/GPU/CUDA | 908475 ns | | |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) | 106369479 ns | 105924583 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) | 116893083 ns | 116546459 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) | 121051729 ns | 124211854 ns | 0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) | 117688104.5 ns | 117471395.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA | 3526289 ns | | |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) | 394374250 ns | 393647209 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) | 358666458 ns | 356631062.5 ns | 1.01 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) | 405386333 ns | 357758708 ns | 1.13 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) | 603318791.5 ns | 619205000 ns | 0.97 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA | 20857877 ns | | |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) | 789086042 ns | 612150166 ns | 1.29 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) | 769903666.5 ns | 766180166.5 ns | 1.00 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) | 567505292 ns | 749713459 ns | 0.76 |
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) | 610508375.5 ns | 785793916 ns | 0.78 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 6791 ns | 7000 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 7042 ns | 6875 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 8917 ns | 8625 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 7000 ns | 6542 ns | 1.07 |
groupnorm(2, act=identity, affine=true)(32 x 32)/forward/GPU/CUDA | 115139 ns | | |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 14583.5 ns | 13500 ns | 1.08 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 13875 ns | 13625 ns | 1.02 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 14542 ns | 14375 ns | 1.01 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 14208 ns | 14584 ns | 0.97 |
groupnorm(2, act=identity, affine=true)(32 x 32)/zygote/GPU/CUDA | 649364 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/2 thread(s) | 6125 ns | 5917 ns | 1.04 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/4 thread(s) | 6042 ns | 5770.5 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/8 thread(s) | 8250 ns | 7875 ns | 1.05 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/CPU/1 thread(s) | 6041 ns | 5583 ns | 1.08 |
groupnorm(2, act=identity, affine=false)(32 x 32)/forward/GPU/CUDA | 114980.5 ns | | |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/2 thread(s) | 13187.5 ns | 13000 ns | 1.01 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/4 thread(s) | 12250 ns | 12625 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/8 thread(s) | 12687.5 ns | 12834 ns | 0.99 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/CPU/1 thread(s) | 12542 ns | 12895.5 ns | 0.97 |
groupnorm(2, act=identity, affine=false)(32 x 32)/zygote/GPU/CUDA | 542878.5 ns | | |
batchedmm(2, Bsize=128)/forward/CPU/2 thread(s) | 5395.5 ns | 5895.5 ns | 0.92 |
batchedmm(2, Bsize=128)/forward/CPU/4 thread(s) | 5792 ns | 5292 ns | 1.09 |
batchedmm(2, Bsize=128)/forward/CPU/8 thread(s) | 6333 ns | 5916 ns | 1.07 |
batchedmm(2, Bsize=128)/forward/CPU/1 thread(s) | 5520.5 ns | 5417 ns | 1.02 |
batchedmm(2, Bsize=128)/forward/GPU/CUDA | 16943 ns | | |
batchedmm(2, Bsize=128)/zygote/CPU/2 thread(s) | 16145.5 ns | 15667 ns | 1.03 |
batchedmm(2, Bsize=128)/zygote/CPU/4 thread(s) | 15916 ns | 15895.5 ns | 1.00 |
batchedmm(2, Bsize=128)/zygote/CPU/8 thread(s) | 16041 ns | 15916 ns | 1.01 |
batchedmm(2, Bsize=128)/zygote/CPU/1 thread(s) | 16667 ns | 16041 ns | 1.04 |
batchedmm(2, Bsize=128)/zygote/GPU/CUDA | 166399.5 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 417 ns | 292 ns | 1.43 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 416 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 334 ns | 417 ns | 0.80 |
batchnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 29134 ns | | |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6583.5 ns | 6292 ns | 1.05 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6667 ns | 6667 ns | 1 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7167 ns | 6667 ns | 1.07 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6708 ns | 6666 ns | 1.01 |
batchnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 211491.5 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/2 thread(s) | 5959 ns | 5916 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/4 thread(s) | 5917 ns | 5875 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/8 thread(s) | 6000 ns | 5917 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/CPU/1 thread(s) | 5958 ns | 6041 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/forward/GPU/CUDA | 29820 ns | | |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/2 thread(s) | 21375 ns | 21667 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/4 thread(s) | 21292 ns | 21208 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/8 thread(s) | 21458 ns | 21750 ns | 0.99 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/CPU/1 thread(s) | 21333 ns | 21875 ns | 0.98 |
batchnorm(2, act=gelu, affine=true)(32 x 32)/zygote/GPU/CUDA | 229238.5 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 143333 ns | 144583 ns | 0.99 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 152583 ns | 162416 ns | 0.94 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 149459 ns | 146625 ns | 1.02 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 188917 ns | 187542 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 178315.5 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 1315000 ns | 1319875 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 1319375 ns | 1320770.5 ns | 1.00 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 1328000 ns | 957604 ns | 1.39 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 1336250 ns | 1324833 ns | 1.01 |
layernorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1036259 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/2 thread(s) | 25084 ns | 23125 ns | 1.08 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/4 thread(s) | 25458 ns | 22437.5 ns | 1.13 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/8 thread(s) | 25000 ns | 23854.5 ns | 1.05 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/CPU/1 thread(s) | 25792 ns | 24396 ns | 1.06 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/forward/GPU/CUDA | 203055 ns | | |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/2 thread(s) | 118708 ns | 129875 ns | 0.91 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/4 thread(s) | 131395.5 ns | 138125 ns | 0.95 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/8 thread(s) | 180791 ns | 118937.5 ns | 1.52 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/CPU/1 thread(s) | 171334 ns | 176083 ns | 0.97 |
layernorm(4, act=relu, affine=false)(16 x 16 x 4 x 32)/zygote/GPU/CUDA | 1025032 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 375 ns | 333 ns | 1.13 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 375 ns | 375 ns | 1 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 417 ns | 375 ns | 1.11 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 292 ns | 375 ns | 0.78 |
batchnorm(2, act=relu, affine=true)(4 x 32)/forward/GPU/CUDA | 24086.5 ns | | |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 6625 ns | 6833.5 ns | 0.97 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 6875 ns | 6708 ns | 1.02 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 7145.5 ns | 6667 ns | 1.07 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 6584 ns | 6917 ns | 0.95 |
batchnorm(2, act=relu, affine=true)(4 x 32)/zygote/GPU/CUDA | 214485.5 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 4125 ns | 4333.5 ns | 0.95 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 4292 ns | 4292 ns | 1 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 4916 ns | 5292 ns | 0.93 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 4208 ns | 4042 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 157123 ns | | |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 12000 ns | 11542 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 11792 ns | 11958 ns | 0.99 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 12167 ns | 11708 ns | 1.04 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 11917 ns | 12625 ns | 0.94 |
layernorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 1026248.5 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/2 thread(s) | 1625 ns | 1584 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/4 thread(s) | 1583 ns | 1583 ns | 1 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/8 thread(s) | 1625 ns | 1583 ns | 1.03 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/CPU/1 thread(s) | 1625 ns | 1667 ns | 0.97 |
dense(2, bias=false, act=gelu)(2 x 128)/forward/GPU/CUDA | 23064 ns | | |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/2 thread(s) | 6000 ns | 5667 ns | 1.06 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/4 thread(s) | 5708 ns | 5625 ns | 1.01 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/8 thread(s) | 6000 ns | 5791 ns | 1.04 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/CPU/1 thread(s) | 5709 ns | 5791 ns | 0.99 |
dense(2, bias=false, act=gelu)(2 x 128)/zygote/GPU/CUDA | 244348.5 ns | | |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) | 6812708 ns | 6893499.5 ns | 0.99 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) | 6425333 ns | 6374750 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) | 6540958 ns | 6500541.5 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) | 7677937.5 ns | 7628458 ns | 1.01 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA | 270587.5 ns | | |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) | 23996666.5 ns | 24057854 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) | 21302750 ns | 21255853.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) | 21086833 ns | 21045937.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) | 29891875 ns | 29752958 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA | 2226319.5 ns | | |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) | 48712521 ns | 37194104 ns | 1.31 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) | 45691542 ns | 45565937.5 ns | 1.00 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) | 34490729.5 ns | 45856833 ns | 0.75 |
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) | 38181542 ns | 49410209 ns | 0.77 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 5916.5 ns | 5729.5 ns | 1.03 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 5916 ns | 6041 ns | 0.98 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 7166 ns | 7542 ns | 0.95 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 6000 ns | 5583 ns | 1.07 |
groupnorm(2, act=identity, affine=true)(4 x 32)/forward/GPU/CUDA | 116557 ns | | |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8750 ns | 7812.5 ns | 1.12 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8333 ns | 8333 ns | 1 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 9042 ns | 8667 ns | 1.04 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8709 ns | 8750 ns | 1.00 |
groupnorm(2, act=identity, affine=true)(4 x 32)/zygote/GPU/CUDA | 641435.5 ns | | |
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) | 1535000 ns | 1558521 ns | 0.98 |
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) | 1261458 ns | 1261333 ns | 1.00 |
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) | 1633937 ns | 1624791.5 ns | 1.01 |
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) | 2192625 ns | 2151979 ns | 1.02 |
lenet(28, 28, 1, 128)/forward/GPU/CUDA | 282489 ns | | |
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) | 7916812.5 ns | 7911312.5 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) | 6605083 ns | 6595562.5 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) | 7204562.5 ns | 7113500.5 ns | 1.01 |
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) | 10483416 ns | 10486458 ns | 1.00 |
lenet(28, 28, 1, 128)/zygote/GPU/CUDA | 1517949 ns | | |
batchedmm(128, Bsize=4)/forward/CPU/2 thread(s) | 366750 ns | 370375.5 ns | 0.99 |
batchedmm(128, Bsize=4)/forward/CPU/4 thread(s) | 372875 ns | 370334 ns | 1.01 |
batchedmm(128, Bsize=4)/forward/CPU/8 thread(s) | 456062.5 ns | 457042 ns | 1.00 |
batchedmm(128, Bsize=4)/forward/CPU/1 thread(s) | 22292 ns | 24083.5 ns | 0.93 |
batchedmm(128, Bsize=4)/forward/GPU/CUDA | 47069 ns | | |
batchedmm(128, Bsize=4)/zygote/CPU/2 thread(s) | 730541 ns | 740416 ns | 0.99 |
batchedmm(128, Bsize=4)/zygote/CPU/4 thread(s) | 807625 ns | 810542 ns | 1.00 |
batchedmm(128, Bsize=4)/zygote/CPU/8 thread(s) | 1066291 ns | 1091458.5 ns | 0.98 |
batchedmm(128, Bsize=4)/zygote/CPU/1 thread(s) | 122375 ns | 119250 ns | 1.03 |
batchedmm(128, Bsize=4)/zygote/GPU/CUDA | 200846.5 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/2 thread(s) | 396834 ns | 397375 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/4 thread(s) | 287958 ns | 288000 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/8 thread(s) | 287959 ns | 211583 ns | 1.36 |
dense(512, bias=true, act=identity)(512 x 128)/forward/CPU/1 thread(s) | 752542 ns | 750270.5 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/forward/GPU/CUDA | 44068 ns | | |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/2 thread(s) | 669291 ns | 673041 ns | 0.99 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/4 thread(s) | 531792 ns | 532334 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/8 thread(s) | 532750 ns | 474084 ns | 1.12 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/CPU/1 thread(s) | 974833 ns | 973792 ns | 1.00 |
dense(512, bias=true, act=identity)(512 x 128)/zygote/GPU/CUDA | 212758 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 642833 ns | 662833.5 ns | 0.97 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 673334 ns | 641958 ns | 1.05 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 651312.5 ns | 544334 ns | 1.20 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 670375 ns | 670813 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 182401 ns | | |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 2454458 ns | 2467229 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 2451041.5 ns | 2462313 ns | 1.00 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 2461083 ns | 2482583.5 ns | 0.99 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 2475584 ns | 2448459 ns | 1.01 |
layernorm(4, act=relu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 1098035.5 ns | | |
batchedmm(2, Bsize=32)/forward/CPU/2 thread(s) | 2917 ns | 3583.5 ns | 0.81 |
batchedmm(2, Bsize=32)/forward/CPU/4 thread(s) | 2979.5 ns | 2687.5 ns | 1.11 |
batchedmm(2, Bsize=32)/forward/CPU/8 thread(s) | 3583 ns | 2959 ns | 1.21 |
batchedmm(2, Bsize=32)/forward/CPU/1 thread(s) | 2250 ns | 3833 ns | 0.59 |
batchedmm(2, Bsize=32)/forward/GPU/CUDA | 16277 ns | | |
batchedmm(2, Bsize=32)/zygote/CPU/2 thread(s) | 5750 ns | 5542 ns | 1.04 |
batchedmm(2, Bsize=32)/zygote/CPU/4 thread(s) | 5833 ns | 5792 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/CPU/8 thread(s) | 5916 ns | 5833 ns | 1.01 |
batchedmm(2, Bsize=32)/zygote/CPU/1 thread(s) | 5791.5 ns | 5833.5 ns | 0.99 |
batchedmm(2, Bsize=32)/zygote/GPU/CUDA | 165419 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) | 1464291 ns | 1460979.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) | 1500667 ns | 1498958 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) | 1501583 ns | 1492334 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) | 1443958 ns | 1436709 ns | 1.01 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/forward/GPU/CUDA | 63337.5 ns | | |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) | 4766999.5 ns | 5110375 ns | 0.93 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) | 5286875 ns | 5286896 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) | 5294979 ns | 4965208 ns | 1.07 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) | 5006687.5 ns | 4987187.5 ns | 1.00 |
batchnorm(4, act=gelu, affine=true)(16 x 16 x 32 x 32)/zygote/GPU/CUDA | 277046.5 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/2 thread(s) | 3750 ns | 3709 ns | 1.01 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/4 thread(s) | 3709 ns | 3750 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/8 thread(s) | 3709 ns | 3709 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/CPU/1 thread(s) | 3709 ns | 3709 ns | 1 |
dense(32, bias=false, act=identity)(32 x 128)/forward/GPU/CUDA | 31531.5 ns | | |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/2 thread(s) | 15541 ns | 15250 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/4 thread(s) | 15625 ns | 15375 ns | 1.02 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/8 thread(s) | 15667 ns | 15208 ns | 1.03 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/CPU/1 thread(s) | 15375 ns | 15542 ns | 0.99 |
dense(32, bias=false, act=identity)(32 x 128)/zygote/GPU/CUDA | 270855 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/2 thread(s) | 71250 ns | 71167 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/4 thread(s) | 71166 ns | 71208 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/8 thread(s) | 71375 ns | 71125 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/forward/CPU/1 thread(s) | 71333 ns | 70145.5 ns | 1.02 |
dense(512, bias=false, act=relu)(512 x 128)/forward/GPU/CUDA | 113271 ns | | |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/2 thread(s) | 319792 ns | 318209 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/4 thread(s) | 318750 ns | 321166 ns | 0.99 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/8 thread(s) | 318250 ns | 331000 ns | 0.96 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/CPU/1 thread(s) | 319333 ns | 318208 ns | 1.00 |
dense(512, bias=false, act=relu)(512 x 128)/zygote/GPU/CUDA | 219649.5 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/2 thread(s) | 1083 ns | 1000 ns | 1.08 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/4 thread(s) | 1083 ns | 1084 ns | 1.00 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/8 thread(s) | 1125 ns | 1083 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/CPU/1 thread(s) | 1042 ns | 1125 ns | 0.93 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/forward/GPU/CUDA | 28583 ns | | |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/2 thread(s) | 8271 ns | 8208 ns | 1.01 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/4 thread(s) | 8625 ns | 8333 ns | 1.04 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/8 thread(s) | 8709 ns | 8542 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/CPU/1 thread(s) | 8604.5 ns | 8458 ns | 1.02 |
batchnorm(2, act=gelu, affine=true)(4 x 32)/zygote/GPU/CUDA | 226098 ns | | |
batchedmm(128, Bsize=32)/forward/CPU/2 thread(s) |
502500 ns |
513416.5 ns |
0.98 |
batchedmm(128, Bsize=32)/forward/CPU/4 thread(s) |
494791 ns |
491000 ns |
1.01 |
batchedmm(128, Bsize=32)/forward/CPU/8 thread(s) |
562895.5 ns |
564167 ns |
1.00 |
batchedmm(128, Bsize=32)/forward/CPU/1 thread(s) |
226833 ns |
219125 ns |
1.04 |
batchedmm(128, Bsize=32)/forward/GPU/CUDA |
129226 ns |
||
batchedmm(128, Bsize=32)/zygote/CPU/2 thread(s) |
1394062.5 ns |
1389604.5 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/4 thread(s) |
1466041 ns |
1470916.5 ns |
1.00 |
batchedmm(128, Bsize=32)/zygote/CPU/8 thread(s) |
1760708.5 ns |
1739750 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/CPU/1 thread(s) |
879666 ns |
867042 ns |
1.01 |
batchedmm(128, Bsize=32)/zygote/GPU/CUDA |
316892.5 ns |
||
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/2 thread(s) |
375 ns |
333 ns |
1.13 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/4 thread(s) |
334 ns |
375 ns |
0.89 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/8 thread(s) |
417 ns |
292 ns |
1.43 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/CPU/1 thread(s) |
333 ns |
417 ns |
0.80 |
batchnorm(2, act=relu, affine=false)(4 x 32)/forward/GPU/CUDA |
32353 ns |
||
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/2 thread(s) |
6667 ns |
6792 ns |
0.98 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/4 thread(s) |
6625 ns |
6667 ns |
0.99 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/8 thread(s) |
6875 ns |
6667 ns |
1.03 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/CPU/1 thread(s) |
6729.5 ns |
6583 ns |
1.02 |
batchnorm(2, act=relu, affine=false)(4 x 32)/zygote/GPU/CUDA |
235464.5 ns |
||
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1721563 ns |
1744875 ns |
0.99 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1767583 ns |
1720437.5 ns |
1.03 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1729750 ns |
1725229 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1726750 ns |
1774833.5 ns |
0.97 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
181593.5 ns |
||
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
4366645.5 ns |
4362875 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
4370104.5 ns |
4366833.5 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
4360895.5 ns |
4017625 ns |
1.09 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4377083 ns |
4360042 ns |
1.00 |
layernorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
1035758 ns |
||
bias_activation(512, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
6791 ns |
6709 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
6708 ns |
6541 ns |
1.03 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
7125 ns |
7125 ns |
1 |
bias_activation(512, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
7000 ns |
6896 ns |
1.02 |
bias_activation(512, act=relu)(512 x 128)/forward/GPU/CUDA |
27514 ns |
||
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
32917 ns |
32667 ns |
1.01 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
51083 ns |
51125 ns |
1.00 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
53291.5 ns |
33125 ns |
1.61 |
bias_activation(512, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
68542 ns |
52271 ns |
1.31 |
bias_activation(512, act=relu)(512 x 128)/zygote/GPU/CUDA |
200817.5 ns |
||
batchedmm(2, Bsize=512)/forward/CPU/2 thread(s) |
17583 ns |
18166.5 ns |
0.97 |
batchedmm(2, Bsize=512)/forward/CPU/4 thread(s) |
18145.5 ns |
17500 ns |
1.04 |
batchedmm(2, Bsize=512)/forward/CPU/8 thread(s) |
18583 ns |
18875 ns |
0.98 |
batchedmm(2, Bsize=512)/forward/CPU/1 thread(s) |
17417 ns |
17666.5 ns |
0.99 |
batchedmm(2, Bsize=512)/forward/GPU/CUDA |
18647.5 ns |
||
batchedmm(2, Bsize=512)/zygote/CPU/2 thread(s) |
53708.5 ns |
53667 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/4 thread(s) |
53708 ns |
53584 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/8 thread(s) |
53625 ns |
53417 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/CPU/1 thread(s) |
53834 ns |
54000 ns |
1.00 |
batchedmm(2, Bsize=512)/zygote/GPU/CUDA |
252803.5 ns |
||
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/2 thread(s) |
75084 ns |
75334 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/4 thread(s) |
75292 ns |
75375 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/8 thread(s) |
75375 ns |
75209 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/CPU/1 thread(s) |
75166 ns |
74916 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/forward/GPU/CUDA |
46809 ns |
||
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/2 thread(s) |
326375 ns |
324959 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/4 thread(s) |
331417 ns |
340167 ns |
0.97 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/8 thread(s) |
325042 ns |
336875 ns |
0.96 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/CPU/1 thread(s) |
325458 ns |
324833 ns |
1.00 |
dense(512, bias=true, act=relu)(512 x 128)/zygote/GPU/CUDA |
236766 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
1486292 ns |
1486958 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
1530000 ns |
1526792 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
1525292 ns |
1521459 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
1469208 ns |
1463834 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
74016 ns |
||
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
5126625 ns |
5117062 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
5277354.5 ns |
5294604 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
5292437 ns |
4960833 ns |
1.07 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
4988541.5 ns |
4987709 ns |
1.00 |
batchnorm(4, act=gelu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
290808 ns |
||
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/2 thread(s) |
28167 ns |
28167 ns |
1 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/4 thread(s) |
28250 ns |
28167 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/8 thread(s) |
28250 ns |
28292 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/CPU/1 thread(s) |
28208 ns |
28292 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/forward/GPU/CUDA |
24124 ns |
||
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/2 thread(s) |
66750 ns |
66333 ns |
1.01 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/4 thread(s) |
66750 ns |
66833 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/8 thread(s) |
66708 ns |
66500 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/CPU/1 thread(s) |
66583 ns |
66459 ns |
1.00 |
dense(32, bias=true, act=gelu)(32 x 128)/zygote/GPU/CUDA |
355733 ns |
||
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) |
1497500 ns |
1395354 ns |
1.07 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) |
1152646 ns |
1059146 ns |
1.09 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) |
1147917 ns |
814208 ns |
1.41 |
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) |
2245000 ns |
2269396 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA |
552030 ns |
||
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) |
2907209 ns |
3090979 ns |
0.94 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) |
2737417 ns |
2740854.5 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) |
2767834 ns |
2544104.5 ns |
1.09 |
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) |
3849250 ns |
3812666 ns |
1.01 |
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA |
1602529 ns |
||
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) |
7898104 ns |
7882104 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) |
7917292 ns |
7902666.5 ns |
1.00 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) |
7917979.5 ns |
8008791.5 ns |
0.99 |
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) |
4851208.5 ns |
4806271 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/2 thread(s) |
82791 ns |
81167 ns |
1.02 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/4 thread(s) |
132062.5 ns |
83208.5 ns |
1.59 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/8 thread(s) |
84229 ns |
81979.5 ns |
1.03 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/CPU/1 thread(s) |
83791 ns |
80417 ns |
1.04 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/forward/GPU/CUDA |
189865.5 ns |
||
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/2 thread(s) |
2017917 ns |
2017166.5 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/4 thread(s) |
2013750 ns |
2013729 ns |
1.00 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/8 thread(s) |
2029000 ns |
1774125 ns |
1.14 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/CPU/1 thread(s) |
2024812.5 ns |
2014354.5 ns |
1.01 |
groupnorm(4, act=relu, affine=false)(16 x 16 x 32 x 32)/zygote/GPU/CUDA |
609581 ns |
This comment was automatically generated by workflow using github-action-benchmark.