ci(buildkite): debugging CUDA segfaults on CI (#937)
* ci(buildkite): add coreupload plugin

* ci(buildkite): try using the latest cuda_driver_jll

* chore: try running tests with compat=false

* chore: cleanup the PR
avik-pal authored Sep 18, 2024
1 parent 919d27f commit 1b7c9a9
Showing 13 changed files with 83 additions and 10 deletions.
7 changes: 7 additions & 0 deletions docs/Project.toml
@@ -29,6 +29,7 @@ Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
[compat]
ADTypes = "1.3"
Adapt = "4"
CUDA_Driver_jll = "0.9, 0.10"
ChainRulesCore = "1.24"
ComponentArrays = "0.15"
Documenter = "1.4"
@@ -54,3 +55,9 @@ StaticArrays = "1"
WeightInitializers = "1"
Zygote = "0.6.70"
julia = "1.10"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
7 changes: 7 additions & 0 deletions examples/Basics/Project.toml
@@ -11,10 +11,17 @@ Random = "9a3f8284-a2c9-5f02-9a11-845980a1fd5c"
Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"

[compat]
CUDA_Driver_jll = "0.9, 0.10"
ComponentArrays = "0.15"
ForwardDiff = "0.10"
Literate = "2"
Lux = "1"
LuxCUDA = "0.3"
Optimisers = "0.3"
Zygote = "0.6"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
4 changes: 4 additions & 0 deletions examples/ConvMixer/Project.toml
@@ -41,4 +41,8 @@ Statistics = "1.10"
Zygote = "0.6.70"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"
CUDA_Runtime_jll = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"

[preferences.CUDA_Driver_jll]
compat = false
4 changes: 4 additions & 0 deletions examples/DDIM/Project.toml
@@ -48,3 +48,7 @@ Zygote = "0.6"

[extras]
CUDA_Runtime_jll = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
6 changes: 6 additions & 0 deletions examples/HyperNet/Project.toml
@@ -28,3 +28,9 @@ Optimisers = "0.3"
Setfield = "1"
Statistics = "1"
Zygote = "0.6"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
4 changes: 4 additions & 0 deletions examples/ImageNet/Project.toml
@@ -46,3 +46,7 @@ Zygote = "0.6.70"

[extras]
CUDA_Runtime_jll = "76a88914-d11a-5bdc-97e0-2f5a05c973a2"
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
6 changes: 6 additions & 0 deletions examples/NeuralODE/Project.toml
@@ -28,3 +28,9 @@ OrdinaryDiffEq = "6"
SciMLSensitivity = "7.63"
Statistics = "1"
Zygote = "0.6"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
6 changes: 6 additions & 0 deletions examples/OptimizationIntegration/Project.toml
@@ -31,3 +31,9 @@ OrdinaryDiffEqTsit5 = "1.1.0"
Printf = "1.10"
Random = "1.10"
SciMLSensitivity = "7.67.0"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
6 changes: 6 additions & 0 deletions examples/PINN2DPDE/Project.toml
@@ -27,3 +27,9 @@ Printf = "1.10"
Random = "1.10"
Statistics = "1.10"
Zygote = "0.6.70"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
6 changes: 6 additions & 0 deletions examples/PolynomialFitting/Project.toml
@@ -20,3 +20,9 @@ LuxCUDA = "0.3"
Optimisers = "0.3"
Statistics = "1"
Zygote = "0.6"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
6 changes: 6 additions & 0 deletions examples/SimpleRNN/Project.toml
@@ -22,3 +22,9 @@ MLUtils = "0.4"
Optimisers = "0.3"
Statistics = "1"
Zygote = "0.6"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
7 changes: 7 additions & 0 deletions test/Project.toml
@@ -42,6 +42,7 @@ Zygote = "e88e6eb3-aa80-5325-afca-941959d7151f"
ADTypes = "1.5"
Adapt = "4"
Aqua = "0.8.4"
CUDA_Driver_jll = "0.9, 0.10"
ChainRulesCore = "1.24"
ComponentArrays = "0.15.16"
DispatchDoctor = "0.4.12"
@@ -77,3 +78,9 @@ Statistics = "1.11.1"
Test = "1.10"
Tracker = "0.2.34"
Zygote = "0.6.70"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
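
Each Project.toml touched above gains the same kind of addition: an `[extras]` entry for `CUDA_Driver_jll` and a `[preferences.CUDA_Driver_jll]` table with `compat = false` (some files also gain a `[compat]` bound). As a minimal sketch of how that fragment is structured, the snippet below parses it with Julia's standard-library TOML parser. The variable names are illustrative, and the snippet only inspects the TOML text; it does not load CUDA or apply the preference.

```julia
using TOML  # Julia standard library

# Fragment mirroring the lines added to the Project.toml files in this commit.
project_fragment = """
[compat]
CUDA_Driver_jll = "0.9, 0.10"

[extras]
CUDA_Driver_jll = "4ee394cb-3365-5eb0-8335-949819d2adfc"

[preferences.CUDA_Driver_jll]
compat = false
"""

project = TOML.parse(project_fragment)

# `[preferences.CUDA_Driver_jll]` is a nested table keyed by the package name.
driver_prefs = project["preferences"]["CUDA_Driver_jll"]
println("compat preference: ", driver_prefs["compat"])

# The same package name also appears under [extras] with its UUID.
println("extras UUID: ", project["extras"]["CUDA_Driver_jll"])
```

Keeping the preference in Project.toml, rather than in a separate local file, means it travels with each example and test environment in the repository.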
24 changes: 14 additions & 10 deletions test/runtests.jl
@@ -8,24 +8,28 @@ const ALL_LUX_TEST_GROUPS = [
     "core_layers", "contrib", "helpers", "distributed", "normalize_layers",
     "others", "autodiff", "recurrent_layers", "fluxcompat"]

-__INPUT_TEST_GROUP = lowercase(get(ENV, "LUX_TEST_GROUP", "all"))
-const LUX_TEST_GROUP = if startswith("!", __INPUT_TEST_GROUP[1])
-    exclude_group = lowercase.(split(__INPUT_TEST_GROUP[2:end], ","))
+INPUT_TEST_GROUP = lowercase(get(ENV, "LUX_TEST_GROUP", "all"))
+const LUX_TEST_GROUP = if startswith("!", INPUT_TEST_GROUP[1])
+    exclude_group = lowercase.(split(INPUT_TEST_GROUP[2:end], ","))
     filter(x -> x ∉ exclude_group, ALL_LUX_TEST_GROUPS)
 else
-    [__INPUT_TEST_GROUP]
+    [INPUT_TEST_GROUP]
 end
 @info "Running tests for group: $LUX_TEST_GROUP"

-const EXTRA_PKGS = String[]
+const EXTRA_PKGS = Pkg.PackageSpec[]

 if ("all" in LUX_TEST_GROUP || "distributed" in LUX_TEST_GROUP)
-    push!(EXTRA_PKGS, "MPI")
-    (BACKEND_GROUP == "all" || BACKEND_GROUP == "cuda") && push!(EXTRA_PKGS, "NCCL")
+    push!(EXTRA_PKGS, Pkg.PackageSpec("MPI"))
+    (BACKEND_GROUP == "all" || BACKEND_GROUP == "cuda") &&
+        push!(EXTRA_PKGS, Pkg.PackageSpec("NCCL"))
 end
-("all" in LUX_TEST_GROUP || "fluxcompat" in LUX_TEST_GROUP) && push!(EXTRA_PKGS, "Flux")
-(BACKEND_GROUP == "all" || BACKEND_GROUP == "cuda") && push!(EXTRA_PKGS, "LuxCUDA")
-(BACKEND_GROUP == "all" || BACKEND_GROUP == "amdgpu") && push!(EXTRA_PKGS, "AMDGPU")
+("all" in LUX_TEST_GROUP || "fluxcompat" in LUX_TEST_GROUP) &&
+    push!(EXTRA_PKGS, Pkg.PackageSpec("Flux"))
+(BACKEND_GROUP == "all" || BACKEND_GROUP == "cuda") &&
+    push!(EXTRA_PKGS, Pkg.PackageSpec("LuxCUDA"))
+(BACKEND_GROUP == "all" || BACKEND_GROUP == "amdgpu") &&
+    push!(EXTRA_PKGS, Pkg.PackageSpec("AMDGPU"))

 if !isempty(EXTRA_PKGS)
     @info "Installing Extra Packages for testing" EXTRA_PKGS=EXTRA_PKGS
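
The group selection and extra-package logic in this hunk can be sketched as two small functions. This is a simplified illustration, not the repository's code: `DEMO_TEST_GROUPS`, `resolve_test_groups`, and `extra_package_names` are hypothetical names, the group list is shortened, and plain package names stand in for the `Pkg.PackageSpec` values so the sketch runs without installing anything.

```julia
# Hypothetical, shortened group list; the real ALL_LUX_TEST_GROUPS is longer.
const DEMO_TEST_GROUPS = ["core_layers", "distributed", "autodiff", "fluxcompat"]

# Resolve a LUX_TEST_GROUP-style string into the list of groups to run.
# A leading "!" means "every group except the comma-separated ones that follow".
function resolve_test_groups(raw::AbstractString, all_groups=DEMO_TEST_GROUPS)
    input = lowercase(raw)
    if startswith(input, "!")
        excluded = split(input[2:end], ",")
        return filter(g -> g ∉ excluded, all_groups)
    end
    return [input]
end

# Mirror of the EXTRA_PKGS conditions, returning names instead of PackageSpec values.
function extra_package_names(groups, backend::AbstractString)
    names = String[]
    if "all" in groups || "distributed" in groups
        push!(names, "MPI")
        (backend == "all" || backend == "cuda") && push!(names, "NCCL")
    end
    ("all" in groups || "fluxcompat" in groups) && push!(names, "Flux")
    (backend == "all" || backend == "cuda") && push!(names, "LuxCUDA")
    (backend == "all" || backend == "amdgpu") && push!(names, "AMDGPU")
    return names
end

groups = resolve_test_groups("!autodiff,fluxcompat")
println(groups)
println(extra_package_names(groups, "cuda"))
```

The sketch calls `startswith(input, "!")`, whereas the diff passes the first character of the input as the second argument to `startswith("!", ...)`. For non-empty input both checks should agree; the form used here also tolerates an empty string.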

3 comments on commit 1b7c9a9

@github-actions

Lux Benchmarks
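
The Ratio column in the table that follows appears to be the current timing divided by the previous one, rounded to two decimals, so values above 1 mean the current commit took longer. That definition is inferred from the numbers rather than stated by the bot. A quick check against three rows of the table (the helper name `ratio` is illustrative):

```julia
# Ratio as current / previous, rounded to two decimals (an inferred definition).
ratio(current_ns, previous_ns) = round(current_ns / previous_ns; digits=2)

# (current ns, previous ns) pairs copied from rows of the table.
vgg16_zygote_cuda    = ratio(17794579, 20192694)  # vgg16(32, 32, 3, 32)/zygote/GPU/CUDA
dense_relu_enzyme_8  = ratio(740771, 447104.5)    # Dense(128 => 128, relu)/enzyme/CPU/8 thread(s)
dense_relu_forward_4 = ratio(515458, 621542)      # Dense(512 => 512, relu)/forward/CPU/4 thread(s)

println((vgg16_zygote_cuda, dense_relu_enzyme_8, dense_relu_forward_4))
```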

Benchmark suite Current: 1b7c9a9 Previous: 03eaa56 Ratio
Dense(512 => 512, identity)(512 x 128)/forward/CPU/2 thread(s) 414167 ns 414333 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/4 thread(s) 243812.5 ns 243166 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/8 thread(s) 243375 ns 244458 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/CPU/1 thread(s) 739750 ns 740167 ns 1.00
Dense(512 => 512, identity)(512 x 128)/forward/GPU/CUDA 43608.5 ns 43790 ns 1.00
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/2 thread(s) 1274750 ns 1313500 ns 0.97
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/4 thread(s) 1257604 ns 1240208 ns 1.01
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/8 thread(s) 16232709 ns 16477375 ns 0.99
Dense(512 => 512, identity)(512 x 128)/zygote/CPU/1 thread(s) 2193229 ns 2255000 ns 0.97
Dense(512 => 512, identity)(512 x 128)/zygote/GPU/CUDA 205508.5 ns 208556.5 ns 0.99
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/2 thread(s) 1311791 ns 1362770.5 ns 0.96
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/4 thread(s) 1296000 ns 1287854.5 ns 1.01
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/8 thread(s) 16564750 ns 16632958 ns 1.00
Dense(512 => 512, identity)(512 x 128)/enzyme/CPU/1 thread(s) 2236917 ns 2226042 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1656771 ns 1717603.5 ns 0.96
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1101167 ns 1031500 ns 1.07
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1519083 ns 1531250 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 2996500 ns 3017292 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/forward/GPU/CUDA 206771 ns 208614 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12074917 ns 12148333 ns 0.99
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 8846125 ns 8837084 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9185812.5 ns 9175417 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18620646 ns 18614083.5 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1506641 ns 1495290 ns 1.01
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17279459 ns 17311000 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14009229.5 ns 13985416 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14468291.5 ns 14512667 ns 1.00
Conv((3, 3), 2 => 2, identity)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 21873146 ns 21852792 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 252162083.5 ns 252076020.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148884583 ns 148360833 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116232875 ns 116492833.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447534666 ns 447245667 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5465296 ns 5467078.5 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1230946875 ns 1233021917 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 931953750 ns 930623166 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 826867750.5 ns 833831875 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1631748667 ns 1634325958 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 31362804 ns 31610363 ns 0.99
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1146184875 ns 1149699000 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 997853916.5 ns 996946771 ns 1.00
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1329065916.5 ns 1315883646 ns 1.01
Conv((3, 3), 64 => 64, relu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1736617187.5 ns 1740085708 ns 1.00
lenet(28, 28, 1, 32)/forward/CPU/2 thread(s) 1111541.5 ns 1118375 ns 0.99
lenet(28, 28, 1, 32)/forward/CPU/4 thread(s) 1663917 ns 1638125 ns 1.02
lenet(28, 28, 1, 32)/forward/CPU/8 thread(s) 3634917 ns 3616250 ns 1.01
lenet(28, 28, 1, 32)/forward/CPU/1 thread(s) 788500 ns 782250 ns 1.01
lenet(28, 28, 1, 32)/forward/GPU/CUDA 262430.5 ns 270251 ns 0.97
lenet(28, 28, 1, 32)/zygote/CPU/2 thread(s) 2981646 ns 2996292 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/4 thread(s) 4151854.5 ns 4132667 ns 1.00
lenet(28, 28, 1, 32)/zygote/CPU/8 thread(s) 10487312.5 ns 10221167 ns 1.03
lenet(28, 28, 1, 32)/zygote/CPU/1 thread(s) 3265083 ns 3156062 ns 1.03
lenet(28, 28, 1, 32)/zygote/GPU/CUDA 1131749 ns 1122893 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 2342791 ns 2273271 ns 1.03
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1260000 ns 1316749.5 ns 0.96
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1539542 ns 1552667 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 4176916 ns 4209687.5 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 208157.5 ns 209258 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 19392625 ns 19422291.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 16105895.5 ns 16107334 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 17329250 ns 17351937.5 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 25905125 ns 25921166 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1607984 ns 1598707 ns 1.01
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 34168604 ns 34193958 ns 1.00
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 30734292 ns 30938000 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 30891041.5 ns 31197500 ns 0.99
Conv((3, 3), 2 => 2, gelu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 36714750 ns 36682042 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 4532000 ns 4538583 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2546584 ns 2543291 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2675583.5 ns 2674083 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 8386333 ns 8379958 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 419971 ns 424210 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 38621250 ns 38932625 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 32144146 ns 32157666 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 32234313 ns 32273542 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 51925709 ns 51985167 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2628667 ns 2626449 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 89245375 ns 89065979.5 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 115663979 ns 115218833 ns 1.00
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 223717000 ns 226131166 ns 0.99
Conv((3, 3), 4 => 4, gelu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 74519062.5 ns 74254146 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 270237667 ns 269870292 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 156197542 ns 156750750 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 123423271 ns 123574479.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 485408250 ns 485653292 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/forward/GPU/CUDA 7027939 ns 6941099.5 ns 1.01
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1473080062.5 ns 1470327334 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 1168760792 ns 1179882625 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 1063953145.5 ns 1073715271 ns 0.99
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 2006090104 ns 2001077396 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 34772934.5 ns 34677615.5 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1719270959 ns 1720676625 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 1530344979 ns 1537383438 ns 1.00
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1879104875 ns 1839399083 ns 1.02
Conv((3, 3), 64 => 64, gelu)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 2217620458 ns 2211950584 ns 1.00
lenet(28, 28, 1, 128)/forward/CPU/2 thread(s) 2066124.5 ns 2097166 ns 0.99
lenet(28, 28, 1, 128)/forward/CPU/4 thread(s) 3080917 ns 3024209 ns 1.02
lenet(28, 28, 1, 128)/forward/CPU/8 thread(s) 7964834 ns 7203500 ns 1.11
lenet(28, 28, 1, 128)/forward/CPU/1 thread(s) 2511771 ns 2463583 ns 1.02
lenet(28, 28, 1, 128)/forward/GPU/CUDA 272286 ns 264273 ns 1.03
lenet(28, 28, 1, 128)/zygote/CPU/2 thread(s) 9629792 ns 9398666 ns 1.02
lenet(28, 28, 1, 128)/zygote/CPU/4 thread(s) 12051208 ns 11990396 ns 1.01
lenet(28, 28, 1, 128)/zygote/CPU/8 thread(s) 23782666.5 ns 25173666.5 ns 0.94
lenet(28, 28, 1, 128)/zygote/CPU/1 thread(s) 11321791 ns 11771708 ns 0.96
lenet(28, 28, 1, 128)/zygote/GPU/CUDA 1192316.5 ns 1166267.5 ns 1.02
vgg16(32, 32, 3, 32)/forward/CPU/2 thread(s) 379182875 ns 380060375 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/4 thread(s) 311332270.5 ns 311100041.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/CPU/8 thread(s) 260260313 ns 267361708.5 ns 0.97
vgg16(32, 32, 3, 32)/forward/CPU/1 thread(s) 450681833 ns 451932312.5 ns 1.00
vgg16(32, 32, 3, 32)/forward/GPU/CUDA 4857816 ns 4972294 ns 0.98
vgg16(32, 32, 3, 32)/zygote/CPU/2 thread(s) 1151703750 ns 1154776917 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/4 thread(s) 938427709 ns 938936416 ns 1.00
vgg16(32, 32, 3, 32)/zygote/CPU/8 thread(s) 943142791 ns 971050584 ns 0.97
vgg16(32, 32, 3, 32)/zygote/CPU/1 thread(s) 1396853084 ns 1397053958 ns 1.00
vgg16(32, 32, 3, 32)/zygote/GPU/CUDA 17794579 ns 20192694 ns 0.88
lenet(28, 28, 1, 64)/forward/CPU/2 thread(s) 1048833 ns 1061792 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/4 thread(s) 1655208.5 ns 1666562 ns 0.99
lenet(28, 28, 1, 64)/forward/CPU/8 thread(s) 4851812 ns 4995250 ns 0.97
lenet(28, 28, 1, 64)/forward/CPU/1 thread(s) 1291167 ns 1386958.5 ns 0.93
lenet(28, 28, 1, 64)/forward/GPU/CUDA 278270.5 ns 265363 ns 1.05
lenet(28, 28, 1, 64)/zygote/CPU/2 thread(s) 6497104 ns 6518459 ns 1.00
lenet(28, 28, 1, 64)/zygote/CPU/4 thread(s) 13086396 ns 13167979.5 ns 0.99
lenet(28, 28, 1, 64)/zygote/CPU/8 thread(s) 18753875 ns 20031250 ns 0.94
lenet(28, 28, 1, 64)/zygote/CPU/1 thread(s) 5891208.5 ns 6075042 ns 0.97
lenet(28, 28, 1, 64)/zygote/GPU/CUDA 1253158.5 ns 1210268 ns 1.04
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70556458 ns 70474416.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 44452167 ns 44309437.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 39837500 ns 39939667 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132581125 ns 132597500 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1865473 ns 1928662.5 ns 0.97
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 356767520.5 ns 356622333.5 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 272336833 ns 272976834 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 255661771 ns 255218042 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 534829208.5 ns 534735459 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 12304649 ns 12363809 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 395040042 ns 396348584 ns 1.00
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 370401500 ns 384172750 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 693812291 ns 721077333.5 ns 0.96
Conv((3, 3), 32 => 32, identity)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 711246750 ns 711103834 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/2 thread(s) 1188023709 ns 1188662708 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/4 thread(s) 835256562.5 ns 832992083.5 ns 1.00
vgg16(32, 32, 3, 128)/forward/CPU/8 thread(s) 638885750 ns 642434458.5 ns 0.99
vgg16(32, 32, 3, 128)/forward/CPU/1 thread(s) 1768729250 ns 1776768042 ns 1.00
vgg16(32, 32, 3, 128)/forward/GPU/CUDA 12316863.5 ns 12309666 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/2 thread(s) 3627838020.5 ns 3641771083.5 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/4 thread(s) 2824735750 ns 2830188042 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/8 thread(s) 2694929167 ns 2706151041 ns 1.00
vgg16(32, 32, 3, 128)/zygote/CPU/1 thread(s) 5002434750 ns 5031071750 ns 0.99
vgg16(32, 32, 3, 128)/zygote/GPU/CUDA 49730192 ns 49668395 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3432375.5 ns 3440708 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2078583 ns 2055416 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2530500 ns 2500125 ns 1.01
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6020833 ns 6019375 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/forward/GPU/CUDA 339043.5 ns 315406 ns 1.07
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 25844354 ns 25875291 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 18918770.5 ns 19146229.5 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 19719959 ns 19365708.5 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 39362209 ns 38324833.5 ns 1.03
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2460010 ns 2475375 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 54493625 ns 54514062.5 ns 1.00
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 84184417 ns 82633291.5 ns 1.02
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 173059688 ns 174671625 ns 0.99
Conv((3, 3), 4 => 4, relu)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 45573959 ns 45302208 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/2 thread(s) 1783437.5 ns 1792062.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/4 thread(s) 1098584 ns 1099604.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/8 thread(s) 1563624.5 ns 1542042 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/CPU/1 thread(s) 3028979 ns 3033959 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/forward/GPU/CUDA 212147.5 ns 211799 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/2 thread(s) 12574667 ns 12562750 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/4 thread(s) 9223854 ns 9226791 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/8 thread(s) 9681958 ns 9602354 ns 1.01
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/CPU/1 thread(s) 18996416 ns 18997104.5 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/zygote/GPU/CUDA 1525057 ns 1540683 ns 0.99
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/2 thread(s) 17650833 ns 17671792 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/4 thread(s) 14332292 ns 14309083 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/8 thread(s) 14552750 ns 14547458 ns 1.00
Conv((3, 3), 2 => 2, relu)(64 x 64 x 2 x 128)/enzyme/CPU/1 thread(s) 22194208 ns 22161375 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 70637271 ns 70504375 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 44500249.5 ns 44105145.5 ns 1.01
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 40038333 ns 39912979 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 132595500 ns 132559791.5 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 1878861 ns 1936174 ns 0.97
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 361106062 ns 359409291 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 349644938 ns 348618104 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 304116708.5 ns 305716167 ns 0.99
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 723634000 ns 724420791 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 13382866.5 ns 13389186 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 419845083.5 ns 420157625 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 427670459 ns 429218959 ns 1.00
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 765524104 ns 700058604.5 ns 1.09
Conv((3, 3), 32 => 32, relu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 715822875 ns 716102291 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/2 thread(s) 1591792 ns 1595000 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/4 thread(s) 1165292 ns 1157583.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/8 thread(s) 1150479.5 ns 1159667 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/CPU/1 thread(s) 2435375 ns 2459125 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/forward/GPU/CUDA 580934.5 ns 547379.5 ns 1.06
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/2 thread(s) 8855583 ns 8848209 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/4 thread(s) 13566583 ns 13600625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/8 thread(s) 33371313 ns 33819291.5 ns 0.99
mlp7layer_bn(gelu)(32 x 256)/zygote/CPU/1 thread(s) 9856250 ns 9846625 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/zygote/GPU/CUDA 1447660.5 ns 1473567 ns 0.98
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/2 thread(s) 16614333.5 ns 16653917 ns 1.00
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/4 thread(s) 22957687.5 ns 22770333.5 ns 1.01
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/8 thread(s) 45530875 ns 47753750 ns 0.95
mlp7layer_bn(gelu)(32 x 256)/enzyme/CPU/1 thread(s) 13137979 ns 13143791.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/2 thread(s) 830833 ns 827917 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/CPU/4 thread(s) 515458 ns 621542 ns 0.83
Dense(512 => 512, relu)(512 x 128)/forward/CPU/8 thread(s) 1061583 ns 1073833 ns 0.99
Dense(512 => 512, relu)(512 x 128)/forward/CPU/1 thread(s) 723895.5 ns 725042 ns 1.00
Dense(512 => 512, relu)(512 x 128)/forward/GPU/CUDA 48058.5 ns 47938 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/2 thread(s) 1549792 ns 1553521 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/4 thread(s) 1043458 ns 1054000 ns 0.99
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/8 thread(s) 1717459 ns 1432167 ns 1.20
Dense(512 => 512, relu)(512 x 128)/zygote/CPU/1 thread(s) 2249729 ns 2258729 ns 1.00
Dense(512 => 512, relu)(512 x 128)/zygote/GPU/CUDA 235968.5 ns 240587.5 ns 0.98
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/2 thread(s) 1556416 ns 1558291.5 ns 1.00
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/4 thread(s) 1068292 ns 1087375 ns 0.98
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/8 thread(s) 1707875 ns 1840000 ns 0.93
Dense(512 => 512, relu)(512 x 128)/enzyme/CPU/1 thread(s) 2224354 ns 2188584 ns 1.02
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/2 thread(s) 3404875 ns 3428687.5 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/4 thread(s) 2061708 ns 2064583 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/8 thread(s) 2526583 ns 2512875 ns 1.01
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/CPU/1 thread(s) 6005458 ns 6002146 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/forward/GPU/CUDA 284654 ns 287607 ns 0.99
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/2 thread(s) 24057375 ns 24070583 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/4 thread(s) 17188917 ns 17222146 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/8 thread(s) 17108854 ns 17095666.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/CPU/1 thread(s) 37589750 ns 37568312.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/zygote/GPU/CUDA 2418683.5 ns 2419447 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/2 thread(s) 52962291.5 ns 52951458.5 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/4 thread(s) 85344416 ns 82721062.5 ns 1.03
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/8 thread(s) 171244354 ns 171722291 ns 1.00
Conv((3, 3), 4 => 4, identity)(64 x 64 x 4 x 128)/enzyme/CPU/1 thread(s) 44652208.5 ns 44599666.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/2 thread(s) 251293750 ns 251795458 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/4 thread(s) 148493709 ns 148535375 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/8 thread(s) 116314333.5 ns 116156833 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/CPU/1 thread(s) 447949229.5 ns 447970041.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/forward/GPU/CUDA 5446386 ns 5450674 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/2 thread(s) 1103974709 ns 1105735084 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/4 thread(s) 855630395.5 ns 859662041.5 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/8 thread(s) 831750854.5 ns 829884208 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/CPU/1 thread(s) 1754110584 ns 1759233333 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/zygote/GPU/CUDA 28887646 ns 29448635.5 ns 0.98
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/2 thread(s) 1030795771 ns 1031472604 ns 1.00
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/4 thread(s) 973527459 ns 979007542 ns 0.99
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/8 thread(s) 1276835833 ns 1377035458 ns 0.93
Conv((3, 3), 64 => 64, identity)(64 x 64 x 64 x 128)/enzyme/CPU/1 thread(s) 1741435895.5 ns 1724417500 ns 1.01
mlp7layer_bn(relu)(32 x 256)/forward/CPU/2 thread(s) 1102104.5 ns 1210666.5 ns 0.91
mlp7layer_bn(relu)(32 x 256)/forward/CPU/4 thread(s) 764333 ns 663000 ns 1.15
mlp7layer_bn(relu)(32 x 256)/forward/CPU/8 thread(s) 784979 ns 688083 ns 1.14
mlp7layer_bn(relu)(32 x 256)/forward/CPU/1 thread(s) 1957854 ns 2060083 ns 0.95
mlp7layer_bn(relu)(32 x 256)/forward/GPU/CUDA 563252 ns 581074 ns 0.97
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/2 thread(s) 5885125 ns 5887979 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/4 thread(s) 9085895.5 ns 8608041 ns 1.06
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/8 thread(s) 26897042 ns 26857833 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/CPU/1 thread(s) 7099083 ns 7102729 ns 1.00
mlp7layer_bn(relu)(32 x 256)/zygote/GPU/CUDA 1415829 ns 1395877.5 ns 1.01
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/2 thread(s) 9699771 ns 9702166.5 ns 1.00
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/4 thread(s) 15967729 ns 16070292 ns 0.99
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/8 thread(s) 32771687.5 ns 34207145.5 ns 0.96
mlp7layer_bn(relu)(32 x 256)/enzyme/CPU/1 thread(s) 7633666 ns 7622708 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/2 thread(s) 514458 ns 521416 ns 0.99
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/4 thread(s) 384604.5 ns 467583.5 ns 0.82
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/8 thread(s) 3059459 ns 2678000 ns 1.14
Dense(128 => 128, gelu)(128 x 128)/forward/CPU/1 thread(s) 87833 ns 89458 ns 0.98
Dense(128 => 128, gelu)(128 x 128)/forward/GPU/CUDA 28219 ns 28156 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/2 thread(s) 381812.5 ns 382000 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/4 thread(s) 447750 ns 441542 ns 1.01
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/8 thread(s) 4678459 ns 4583458 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/zygote/CPU/1 thread(s) 258375 ns 258542 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/zygote/GPU/CUDA 228924.5 ns 225153.5 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/2 thread(s) 410916.5 ns 411500 ns 1.00
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/4 thread(s) 479208 ns 471916 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/8 thread(s) 4649000 ns 4557813 ns 1.02
Dense(128 => 128, gelu)(128 x 128)/enzyme/CPU/1 thread(s) 270833 ns 271709 ns 1.00
Dense(128 => 128, relu)(128 x 128)/forward/CPU/2 thread(s) 461250.5 ns 464666 ns 0.99
Dense(128 => 128, relu)(128 x 128)/forward/CPU/4 thread(s) 322625 ns 415562.5 ns 0.78
Dense(128 => 128, relu)(128 x 128)/forward/CPU/8 thread(s) 768834 ns 787354 ns 0.98
Dense(128 => 128, relu)(128 x 128)/forward/CPU/1 thread(s) 52875 ns 54458 ns 0.97
Dense(128 => 128, relu)(128 x 128)/forward/GPU/CUDA 28278 ns 28105 ns 1.01
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/2 thread(s) 342333 ns 339125 ns 1.01
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/4 thread(s) 347625 ns 337209 ns 1.03
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/8 thread(s) 396687 ns 425750 ns 0.93
Dense(128 => 128, relu)(128 x 128)/zygote/CPU/1 thread(s) 151250 ns 151834 ns 1.00
Dense(128 => 128, relu)(128 x 128)/zygote/GPU/CUDA 212495 ns 209442 ns 1.01
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/2 thread(s) 356000 ns 354958 ns 1.00
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/4 thread(s) 362937.5 ns 351542 ns 1.03
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 740771 ns 447104.5 ns 1.66
Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/1 thread(s) 150875 ns 151375 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/2 thread(s) 601061209 ns 601684667 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/4 thread(s) 430671250 ns 429562875 ns 1.00
vgg16(32, 32, 3, 64)/forward/CPU/8 thread(s) 383040583 ns 376810833 ns 1.02
vgg16(32, 32, 3, 64)/forward/CPU/1 thread(s) 870727020.5 ns 869971062 ns 1.00
vgg16(32, 32, 3, 64)/forward/GPU/CUDA 7032100 ns 7028132 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/2 thread(s) 2000504228.5 ns 1996714667 ns 1.00
vgg16(32, 32, 3, 64)/zygote/CPU/4 thread(s) 1604685125 ns 1615544895.5 ns 0.99
vgg16(32, 32, 3, 64)/zygote/CPU/8 thread(s) 1652458646 ns 1563076479 ns 1.06
vgg16(32, 32, 3, 64)/zygote/CPU/1 thread(s) 2626165250 ns 2624412875 ns 1.00
vgg16(32, 32, 3, 64)/zygote/GPU/CUDA 25934443 ns 26150198.5 ns 0.99
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/2 thread(s) 526333 ns 520854 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/4 thread(s) 400458.5 ns 393375 ns 1.02
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/8 thread(s) 3022187.5 ns 2582708 ns 1.17
Dense(512 => 512, gelu)(512 x 128)/forward/CPU/1 thread(s) 868667 ns 866187.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/forward/GPU/CUDA 47967.5 ns 47544 ns 1.01
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/2 thread(s) 1757062.5 ns 1879500 ns 0.93
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/4 thread(s) 1694333 ns 1747271 ns 0.97
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/8 thread(s) 16312334 ns 16566729 ns 0.98
Dense(512 => 512, gelu)(512 x 128)/zygote/CPU/1 thread(s) 2651375 ns 2650937.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/zygote/GPU/CUDA 257253 ns 248835.5 ns 1.03
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/2 thread(s) 1894750.5 ns 1949917 ns 0.97
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/4 thread(s) 1834625 ns 1830604.5 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/8 thread(s) 16537333 ns 16534688 ns 1.00
Dense(512 => 512, gelu)(512 x 128)/enzyme/CPU/1 thread(s) 2736604.5 ns 2714312.5 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/2 thread(s) 1496021 ns 1368166.5 ns 1.09
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/4 thread(s) 931750 ns 967041 ns 0.96
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/8 thread(s) 1059667 ns 933875 ns 1.13
mlp7layer_bn(tanh)(32 x 256)/forward/CPU/1 thread(s) 2319292 ns 2334542 ns 0.99
mlp7layer_bn(tanh)(32 x 256)/forward/GPU/CUDA 585808.5 ns 587807 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/2 thread(s) 5882458 ns 5905375 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/4 thread(s) 8563167 ns 8596208 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/8 thread(s) 26031937 ns 25859917 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/CPU/1 thread(s) 7331479 ns 7262750 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/zygote/GPU/CUDA 1393892 ns 1348515 ns 1.03
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/2 thread(s) 11701667 ns 11679812.5 ns 1.00
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/4 thread(s) 18292896 ns 18127854 ns 1.01
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/8 thread(s) 39864875 ns 37908583 ns 1.05
mlp7layer_bn(tanh)(32 x 256)/enzyme/CPU/1 thread(s) 9527500 ns 9569791 ns 1.00
Dense(16 => 16, relu)(16 x 128)/forward/CPU/2 thread(s) 2750 ns 2583 ns 1.06
Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2334 ns 4583 ns 0.51
Dense(16 => 16, relu)(16 x 128)/forward/CPU/8 thread(s) 3292 ns 3459 ns 0.95
Dense(16 => 16, relu)(16 x 128)/forward/CPU/1 thread(s) 2583 ns 2458.5 ns 1.05
Dense(16 => 16, relu)(16 x 128)/forward/GPU/CUDA 24864 ns 24305 ns 1.02
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/2 thread(s) 7041 ns 7000 ns 1.01
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/4 thread(s) 7166 ns 6958 ns 1.03
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/8 thread(s) 7250 ns 7250 ns 1
Dense(16 => 16, relu)(16 x 128)/zygote/CPU/1 thread(s) 7083 ns 6959 ns 1.02
Dense(16 => 16, relu)(16 x 128)/zygote/GPU/CUDA 216254.5 ns 209367.5 ns 1.03
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/2 thread(s) 8250 ns 8250 ns 1
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/4 thread(s) 8459 ns 8208 ns 1.03
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/8 thread(s) 8542 ns 8542 ns 1
Dense(16 => 16, relu)(16 x 128)/enzyme/CPU/1 thread(s) 5834 ns 6000 ns 0.97
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/2 thread(s) 10479.5 ns 10521 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/4 thread(s) 13062.5 ns 13833 ns 0.94
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/8 thread(s) 10500 ns 10437.5 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/forward/CPU/1 thread(s) 7500 ns 7520.5 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/forward/GPU/CUDA 25125 ns 24374 ns 1.03
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/2 thread(s) 19916 ns 20000 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/4 thread(s) 19917 ns 19542 ns 1.02
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/8 thread(s) 20270.5 ns 20291 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/zygote/CPU/1 thread(s) 20000 ns 19750 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/zygote/GPU/CUDA 238014.5 ns 229284 ns 1.04
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/2 thread(s) 23541 ns 23459 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/4 thread(s) 23584 ns 23417 ns 1.01
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/8 thread(s) 23917 ns 23916 ns 1.00
Dense(16 => 16, gelu)(16 x 128)/enzyme/CPU/1 thread(s) 21333 ns 21292 ns 1.00
Dense(128 => 128, identity)(128 x 128)/forward/CPU/2 thread(s) 28687.5 ns 28917 ns 0.99
Dense(128 => 128, identity)(128 x 128)/forward/CPU/4 thread(s) 28458 ns 29000 ns 0.98
Dense(128 => 128, identity)(128 x 128)/forward/CPU/8 thread(s) 28750 ns 28417 ns 1.01
Dense(128 => 128, identity)(128 x 128)/forward/CPU/1 thread(s) 46041 ns 46479.5 ns 0.99
Dense(128 => 128, identity)(128 x 128)/forward/GPU/CUDA 26166 ns 25572 ns 1.02
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/2 thread(s) 224416 ns 223416 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/4 thread(s) 277458 ns 272291 ns 1.02
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/8 thread(s) 3940416 ns 4265917 ns 0.92
Dense(128 => 128, identity)(128 x 128)/zygote/CPU/1 thread(s) 145375 ns 145583 ns 1.00
Dense(128 => 128, identity)(128 x 128)/zygote/GPU/CUDA 215900.5 ns 205830.5 ns 1.05
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/2 thread(s) 241916.5 ns 241125 ns 1.00
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/4 thread(s) 294834 ns 290042 ns 1.02
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/8 thread(s) 4072750 ns 4002209 ns 1.02
Dense(128 => 128, identity)(128 x 128)/enzyme/CPU/1 thread(s) 145500 ns 145667 ns 1.00
Dense(16 => 16, identity)(16 x 128)/forward/CPU/2 thread(s) 1750 ns 2000 ns 0.88
Dense(16 => 16, identity)(16 x 128)/forward/CPU/4 thread(s) 1709 ns 1959 ns 0.87
Dense(16 => 16, identity)(16 x 128)/forward/CPU/8 thread(s) 2833 ns 2417 ns 1.17
Dense(16 => 16, identity)(16 x 128)/forward/CPU/1 thread(s) 1792 ns 2625 ns 0.68
Dense(16 => 16, identity)(16 x 128)/forward/GPU/CUDA 23320 ns 22856 ns 1.02
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/2 thread(s) 5250 ns 5209 ns 1.01
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/4 thread(s) 5084 ns 5000 ns 1.02
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/8 thread(s) 5375 ns 5292 ns 1.02
Dense(16 => 16, identity)(16 x 128)/zygote/CPU/1 thread(s) 5250 ns 5000 ns 1.05
Dense(16 => 16, identity)(16 x 128)/zygote/GPU/CUDA 273997 ns 247823 ns 1.11
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/2 thread(s) 7500 ns 7416 ns 1.01
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/4 thread(s) 7458 ns 7458 ns 1
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/8 thread(s) 7625 ns 7708 ns 0.99
Dense(16 => 16, identity)(16 x 128)/enzyme/CPU/1 thread(s) 5125 ns 5084 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/2 thread(s) 79922000 ns 79982958 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/4 thread(s) 48869292 ns 48448125 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/8 thread(s) 43653750 ns 43546250 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/CPU/1 thread(s) 151454541 ns 151447125 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/forward/GPU/CUDA 2718779 ns 2714961 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/2 thread(s) 663985416 ns 664508792 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/4 thread(s) 413249125 ns 414562750 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/8 thread(s) 397260000 ns 399573833 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/CPU/1 thread(s) 684524000 ns 682810167 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/zygote/GPU/CUDA 14579213 ns 14687395.5 ns 0.99
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/2 thread(s) 713434583.5 ns 714844958 ns 1.00
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/4 thread(s) 675522709 ns 686047458 ns 0.98
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/8 thread(s) 997663125 ns 991292583 ns 1.01
Conv((3, 3), 32 => 32, gelu)(64 x 64 x 32 x 128)/enzyme/CPU/1 thread(s) 999548041 ns 999363417 ns 1.00

This comment was automatically generated by workflow using github-action-benchmark.
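
The ratio column in the rows above appears to be the first timing divided by the second, rounded to two decimals (for example, 740771 / 447104.5 ≈ 1.66). The sketch below shows one way to recompute it and pick out rows that slowed down. It assumes the first timing is the current commit and the second the previous one, which is not stated in this part of the page; `parse_row`, `flag_changes`, and the 1.5 cutoff are hypothetical names and values chosen for illustration, not part of github-action-benchmark.

```python
import re

# Matches "<name> <first> ns <second> ns <ratio>", the layout of the rows above.
ROW = re.compile(
    r"^(?P<name>.+?)\s+(?P<first>[\d.]+) ns\s+(?P<second>[\d.]+) ns\s+(?P<ratio>[\d.]+)$"
)


def parse_row(line):
    """Split one benchmark row into (name, first_ns, second_ns, recomputed_ratio)."""
    m = ROW.match(line.strip())
    if m is None:
        raise ValueError(f"unrecognised row: {line!r}")
    first = float(m["first"])
    second = float(m["second"])
    # Assumption: first = current commit, second = previous commit.
    return m["name"], first, second, first / second


def flag_changes(lines, threshold=1.5):
    """Return names of rows whose recomputed ratio exceeds a hypothetical cutoff."""
    return [name for name, _, _, ratio in map(parse_row, lines) if ratio > threshold]


rows = [
    "Dense(128 => 128, relu)(128 x 128)/enzyme/CPU/8 thread(s) 740771 ns 447104.5 ns 1.66",
    "Dense(16 => 16, relu)(16 x 128)/forward/CPU/4 thread(s) 2334 ns 4583 ns 0.51",
]
for name, first, second, ratio in map(parse_row, rows):
    print(f"{name}: {ratio:.2f}")
print(flag_changes(rows))
```

With timings in the low-microsecond range, as in the Dense(16 => 16) rows, a ratio far from 1 may reflect measurement noise rather than a real change, so any cutoff is best read alongside the absolute numbers.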

@avik-pal

@JuliaRegistrator

Registration pull request created: JuliaRegistries/General/115463

Tip: Release Notes

Did you know you can add release notes too? Add Markdown-formatted text to your comment, underneath the text
"Release notes:", and it will be added to the registry PR. If TagBot is installed, it will also be added to the
release that TagBot creates. For example:

@JuliaRegistrator register

Release notes:

## Breaking changes

- blah

To add them here, just re-invoke the command and the PR will be updated.

Tagging

After the above pull request is merged, it is recommended that a tag is created on this repository for the registered package version.

This will be done automatically if the Julia TagBot GitHub Action is installed. It can also be done manually through the GitHub interface, or via:

git tag -a v1.0.5 -m "<description of version>" 1b7c9a9f34400579d142a5e1d2b93af53dccc473
git push origin v1.0.5
