Reactant grabbing wrong CUDA version #1225
Looks like Reactant is not grabbing the right CUDA version. The script is linked below.
cc @giordano
@wsmoses we build CUDA 12.1 and 12.6; the only way I see to solve this is to also build 12.3.
I think the right thing to do is to drop 12.1 and build 12.2, since that is the last version before the linker dependency.
And, weirdly enough, I think we should map versions 12.3+ to the 12.6 build (since then the linker issue is always resolved).
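To make the proposed mapping concrete, here's a hypothetical sketch; the function name and structure are illustrative, not Reactant's actual build-selection code:

```julia
# Hypothetical version-mapping logic for selecting the bundled CUDA build.
function select_bundled_cuda(local_version::VersionNumber)
    if local_version < v"12.3"
        return v"12.2"  # last version before the linker dependency
    else
        return v"12.6"  # 12.3+ -> 12.6 build, so the linker issue is always resolved
    end
end

select_bundled_cuda(v"12.3")  # => v"12.6"
```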
This is on my lab's on-prem workstation. Should I consider upgrading to CUDA 12.6?
Thanks @giordano @wsmoses, that error is fixed. The timings I am getting are below. It looks like Reactant is much slower than Zygote/Enzyme on both CPU and GPU. And @avik-pal, AutoEnzyme() on GPU takes over ~5 minutes to run. The script is here.

```julia
####
# CPU
####
@time train(cpu_device(), AutoZygote())
@time train(cpu_device(), AutoEnzyme())
Reactant.set_default_backend("cpu")
@time train(reactant_device(), AutoEnzyme())

####
# GPU
####
@time train(gpu_device(), AutoZygote())
# @time train(gpu_device(), AutoEnzyme()) # takes over 5 mins so I don't run it
Reactant.set_default_backend("gpu")
@time train(reactant_device(), AutoEnzyme())
```

```
0.153180 seconds (213.75 k allocations: 32.291 MiB, 16.31% gc time, 37.01% compilation time)
0.118047 seconds (217.79 k allocations: 37.426 MiB, 13.91% gc time, 45.17% compilation time)
2025-01-22 11:31:51.966997: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 13985633240296206436
0.810326 seconds (368.79 k allocations: 17.013 MiB, 64.08% compilation time)
0.076810 seconds (327.37 k allocations: 9.804 MiB)
2025-01-22 11:31:53.233898: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 16348477711662704876
1.097237 seconds (368.81 k allocations: 17.037 MiB, 46.19% compilation time)
```
Just for the sake of removing precompilation time, what happens if you time the second train call? E.g.:
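A minimal sketch of that, assuming the `train` function from the script above:

```julia
# First call pays the one-time compilation cost.
train(reactant_device(), AutoEnzyme())
# Timing the second call measures steady-state performance only.
@time train(reactant_device(), AutoEnzyme())
```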
Also I'd be very curious to see the results of the op profiler https://enzymead.github.io/Reactant.jl/dev/tutorials/profiling (which incidentally should work with both Reactant code [and give tons more helpful info] and generic code).
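A hedged sketch after the linked tutorial (check the docs for the exact API):

```julia
# Collect a trace of the wrapped region into the given directory, which the
# op profiler can then display.
Reactant.with_profiler("./traces/") do
    train(reactant_device(), AutoEnzyme())
end
```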
can you start julia with
However @avik-pal I'm confused why we see more allocations on the Julia side (at a high level I would've expected the opposite).
That might be from Lux.jl/ext/LuxReactantExt/training.jl, lines 1 to 24 at commit 1053879.
Also it might be worthwhile passing
Thanks both, here is the updated script and the MLIR.

```
[ Info: HLO dumped to /tmp/jl_7OxadMZgvQ.mlir
2025-01-22 13:02:04.847488: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 2656968583157646233
0.075871 seconds (17.07 k allocations: 22.296 MiB, 19.48% gc time)
0.063287 seconds (21.30 k allocations: 27.448 MiB, 26.12% gc time)
[ Info: HLO dumped to /tmp/jl_BSfmN2ZmlM.mlir
1.112902 seconds (443.00 k allocations: 19.483 MiB, 61.43% compilation time)
[ Info: HLO dumped to /tmp/jl_Fu5UxgvEB0.mlir
2025-01-22 13:02:07.763693: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 1376608044839264878
0.066194 seconds (327.13 k allocations: 9.773 MiB)
[ Info: HLO dumped to /tmp/jl_z2jKr6H4bO.mlir
1.340633 seconds (442.96 k allocations: 19.424 MiB, 50.82% compilation time)
```
Notice that Reactant has ~50% compilation time even on second runs.
Hm, that's weird and bad. @avik-pal, is there a way for Lux to use the precompiled version? (And separately, it really shouldn't be forcing that much compilation the second time around anyway.)
Yes, keep the returned train_state around.
@mofeing (I think?) mentioned in today's meeting that recompile times have regressed |
Looks like a big chunk of time is taken up by compiling the XLA models. I 100x'd the number of epochs and got the following timings:

```
Zygote + CPU  : 4.439158 seconds (1.67 M allocations: 2.162 GiB, 5.76% gc time)
Enzyme + CPU  : 4.687645 seconds (2.09 M allocations: 2.658 GiB, 6.91% gc time)
Reactant + CPU: 9.184567 seconds (1.86 M allocations: 71.208 MiB, 0.43% gc time, 5.46% compilation time)

# Base.@time
Zygote + GPU  : 7.112343 seconds (32.36 M allocations: 957.447 MiB, 3.21% gc time)
Reactant + GPU: 3.046458 seconds (1.86 M allocations: 71.210 MiB, 17.63% compilation time)
```

Maybe it would be useful to separate out the calls to

@avik-pal @wsmoses is there a good way to track GPU memory usage by Reactant, similar to
https://enzymead.github.io/Reactant.jl/dev/api/xla#Reactant.XLA.allocatorstats may be useful for the latter question [on GPU].
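For instance, a hedged sketch; the field names are assumed to mirror XLA's AllocatorStats and should be checked against the linked docs:

```julia
# Query the XLA allocator on the current client for live/peak byte counts.
stats = Reactant.XLA.allocatorstats()
@show stats.bytes_in_use stats.peak_bytes_in_use  # assumed field names
```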
If you call that function once outside the loop, it will cache it inside the returned train_state object.
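A minimal sketch of that pattern with Lux's training API; `model`, `ps`, `st`, `loss_fn`, and `dataloader` are assumed to come from the linked script:

```julia
using ADTypes, Lux, Optimisers

function train_cached(model, ps, st, dataloader, loss_fn)
    train_state = Training.TrainState(model, ps, st, Adam(0.001f0))
    # Warm-up step outside the loop: the compiled step gets cached inside the
    # returned train_state, so later iterations skip recompilation.
    _, _, _, train_state = Training.single_train_step!(
        AutoEnzyme(), loss_fn, first(dataloader), train_state)
    for batch in dataloader
        _, _, _, train_state = Training.single_train_step!(
            AutoEnzyme(), loss_fn, batch, train_state)
    end
    return train_state
end
```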
Any chance LoopVectorization is loaded in your env (even via an indirect dep)?
Here's what I got:

```
##################
CPUDevice() + AutoZygote()
##################
Warm up:
  0.002192 seconds (183 allocations: 227.219 KiB)
Training:
  4.521769 seconds (1.70 M allocations: 2.165 GiB, 5.66% gc time)
##################
CPUDevice() + AutoEnzyme()
##################
Warm up:
  0.000784 seconds (251 allocations: 347.500 KiB)
Training:
  4.615168 seconds (2.11 M allocations: 2.661 GiB, 5.76% gc time)
##################
ReactantDevice{Missing, Missing}(missing, missing) + AutoEnzyme()
##################
Warm up:
  0.725995 seconds (318.00 k allocations: 14.596 MiB, 67.76% compilation time)
Training:
  8.284950 seconds (1.56 M allocations: 57.048 MiB, 0.52% gc time, 0.20% compilation time)
##################
CUDADevice{Nothing}(nothing) + AutoZygote()
##################
Warm up:
  0.001040 seconds (3.28 k allocations: 99.211 KiB)
Training:
  6.724378 seconds (32.68 M allocations: 967.014 MiB, 3.36% gc time)
##################
ReactantDevice{Missing, Missing}(missing, missing) + AutoEnzyme()
##################
Warm up:
  1.090517 seconds (318.02 k allocations: 14.602 MiB, 47.71% compilation time)
Training:
  2.203423 seconds (1.56 M allocations: 57.048 MiB, 0.82% compilation time)
```
lemme double check after lecture. I do have MKL.jl installed. |
@avik-pal, yes. I have LV installed.

```julia
julia> "LoopVectorization" in [x.name for x in values(Pkg.dependencies())]
true
```
So Lux is taking the LV path, which would be faster for a small network like this. (Same reason SimpleChains is much faster than JAX at that size.)
@avik-pal I ran the same test case in a new environment without LV and got the same results:

```
julia> include("cpu.jl")
"LoopVectorization" in [x.name for x = values(Pkg.dependencies())] = false
##################
CPUDevice() + AutoZygote()
##################
Warm up:
  0.001829 seconds (183 allocations: 227.219 KiB)
Training:
  4.385837 seconds (1.70 M allocations: 2.165 GiB, 5.39% gc time)
##################
CPUDevice() + AutoEnzyme()
##################
Warm up:
  0.000712 seconds (251 allocations: 347.500 KiB)
Training:
  4.633129 seconds (2.11 M allocations: 2.661 GiB, 6.19% gc time)
##################
ReactantDevice{Missing, Missing}(missing, missing) + AutoEnzyme()
##################
Warm up:
  0.728178 seconds (316.32 k allocations: 14.549 MiB, 66.67% compilation time)
Training:
  8.254477 seconds (1.56 M allocations: 57.043 MiB, 0.46% gc time, 0.20% compilation time)
```
Some other notes:
You can use the options exposed in EnzymeAD/Reactant.jl#589 (not yet released).
No, but you can disable preallocation and use GC.gc (https://enzymead.github.io/Reactant.jl/stable/introduction/#Empty-Cache).
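A hedged sketch of the empty-cache pattern from the linked docs (the variable name is illustrative):

```julia
# Drop references to device arrays you no longer need, then force a full GC so
# Reactant can release the backing XLA buffers.
x_dev = nothing
GC.gc(true)
```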