Reactant grabbing wrong CUDA version #1225

Closed
vpuri3 opened this issue Jan 21, 2025 · 27 comments


vpuri3 commented Jan 21, 2025

Looks like Reactant is not grabbing the right CUDA version.

julia> include("misc/react_demo.jl")                                                               
2025-01-21 18:16:53.960554: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 6019370082488903858
E0121 18:16:54.305053   24773 pjrt_stream_executor_client.cc:3045] Execution of replica 0 failed: UNIMPLEMENTED: StreamBeginCaptureToGraph is not implemented for CUDA below version 12.3. Therefore tracing is not supported.
ERROR: LoadError: UNIMPLEMENTED: StreamBeginCaptureToGraph is not implemented for CUDA below version 12.3. Therefore tracing is not supported.
[vedantpu@eagle NeuralROMs.jl]:nvidia-smi
Tue Jan 21 18:26:24 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+

The script is included below.

using Lux, MLDataDevices
using CUDA, LuxCUDA, KernelAbstractions
using Random, Printf, Optimisers, MLUtils

using Zygote
using Enzyme
using Reactant

function train(device, adtype)

    N = 10000
    W = 64
    E = 500

    # model
    NN = Chain(Dense(1 => W, gelu), Dense(W => W, gelu), Dense(W => 1))
    ps, st = Lux.setup(Random.default_rng(), NN)

    # data
    x = LinRange(0f0, 1f0, N) |> Array
    x = reshape(x, 1, N)
    y = @. sinpi(2x)

    # data loader
    DL = DataLoader((x, y); batchsize = div(N, 100))

    # device transfer
    ps = ps |> device
    st = st |> device
    DL = DeviceIterator(device, DL)

    # training
    train_state = Training.TrainState(NN, ps, st, Adam(0.001f0))

    for epoch in 1:E
        for (i, (xᵢ, yᵢ)) in enumerate(DL)
            _, loss, _, train_state = Training.single_train_step!(
                adtype, MSELoss(), (xᵢ, yᵢ), train_state)
            if (epoch % E == 0 || epoch == 1) && i == 1
                println("Epoch $(epoch)/$(E)\tLoss: $(loss)")
            end
        end
    end

    return train_state
end

# @time train(cpu_device(), AutoZygote())
# @time train(gpu_device(), AutoZygote())

# @time train(cpu_device(), AutoEnzyme())
# @time train(gpu_device(), AutoEnzyme())

# Reactant.set_default_backend("cpu")
# @time train(reactant_device(), AutoEnzyme())
Reactant.set_default_backend("gpu")
@time train(reactant_device(), AutoEnzyme())

#====================================================#
nothing
vpuri3 changed the title from "Reactant GPU failing" to "Reactant grabbing wrong CUDA version" on Jan 21, 2025.

vpuri3 commented Jan 21, 2025

@wsmoses @avik-pal


wsmoses commented Jan 21, 2025

cc @giordano

@giordano

@wsmoses we build CUDA 12.1 and 12.6, the only way to solve this I see is to build 12.3 too.


wsmoses commented Jan 22, 2025

I think the right thing to do is to drop 12.1 and build 12.2, since that is the last version before the linker dependency.


wsmoses commented Jan 22, 2025

And, somewhat counterintuitively, I think we should map the versions as:

12.3+ -> the 12.6 build (since then we always have the linker issue resolved)
<12.3 -> a 12.2 build (or maybe 12.1)


vpuri3 commented Jan 22, 2025

This is on my lab's on-prem workstation. Should I consider upgrading to CUDA 12.6?


vpuri3 commented Jan 22, 2025

Thanks @giordano @wsmoses, that error is fixed.

The timings I am getting are below. Looks like Reactant is much slower on both CPU and GPU than Zygote/Enzyme. And @avik-pal, AutoEnzyme() on GPU is taking over 5 mins to run. The script is here.

####
# CPU
####

@time train(cpu_device(), AutoZygote())
@time train(cpu_device(), AutoEnzyme())
Reactant.set_default_backend("cpu")
@time train(reactant_device(), AutoEnzyme())

####
# GPU
####

@time train(gpu_device(), AutoZygote())
# @time train(gpu_device(), AutoEnzyme()) # takes over 5 mins so I don't run it
Reactant.set_default_backend("gpu")
@time train(reactant_device(), AutoEnzyme())
  0.153180 seconds (213.75 k allocations: 32.291 MiB, 16.31% gc time, 37.01% compilation time)
  0.118047 seconds (217.79 k allocations: 37.426 MiB, 13.91% gc time, 45.17% compilation time)
2025-01-22 11:31:51.966997: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 13985633240296206436
  0.810326 seconds (368.79 k allocations: 17.013 MiB, 64.08% compilation time)

  0.076810 seconds (327.37 k allocations: 9.804 MiB)
2025-01-22 11:31:53.233898: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 16348477711662704876
  1.097237 seconds (368.81 k allocations: 17.037 MiB, 46.19% compilation time)


wsmoses commented Jan 22, 2025

Just for the sake of removing precompilation time, what happens if you time a second call to train?

e.g.

train(cpu_device(), AutoZygote())
@time train(cpu_device(), AutoZygote())
train(cpu_device(), AutoEnzyme())
@time train(cpu_device(), AutoEnzyme())
Reactant.set_default_backend("cpu")
train(reactant_device(), AutoEnzyme())
@time train(reactant_device(), AutoEnzyme())

####
# GPU
####
train(gpu_device(), AutoZygote())
@time train(gpu_device(), AutoZygote())
# @time train(gpu_device(), AutoEnzyme()) # takes over 5 mins so I don't run it
Reactant.set_default_backend("gpu")
train(reactant_device(), AutoEnzyme())
@time train(reactant_device(), AutoEnzyme())

Also, I'd be very curious to see results from the op profiler (https://enzymead.github.io/Reactant.jl/dev/tutorials/profiling), which incidentally should work with both Reactant code (where it gives tons more helpful info) and generic code.
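For concreteness, a minimal sketch of wrapping the training call in that profiler, assuming the with_profiler helper described in the linked tutorial (name, signature, and output directory are assumptions here, not verified against a specific Reactant release):

using Reactant

# Hypothetical usage: wrap the region of interest and inspect the trace files it writes
# (the linked tutorial describes how to view them).
Reactant.with_profiler("./profile_traces/") do
    train(reactant_device(), AutoEnzyme())
end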

@avik-pal

Can you start Julia with LUX_DUMP_REACTANT_HLO_OPTIMIZE=true or, if you already have Lux loaded, set Lux.DUMP_REACTANT_HLO_OPT_MODE[] = true? It will dump the generated HLO to an .mlir file.
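A minimal sketch of the two options above (flag names are taken from this comment; behavior may differ across Lux versions):

# Option 1: set the environment variable before launching Julia, e.g.
#   LUX_DUMP_REACTANT_HLO_OPTIMIZE=true julia --project misc/react_demo.jl

# Option 2: flip the Ref after loading Lux, before compiling/training
using Lux
Lux.DUMP_REACTANT_HLO_OPT_MODE[] = true   # generated HLO is written to a .mlir file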


wsmoses commented Jan 22, 2025

However, @avik-pal, I'm confused why we see more allocations on the Julia side (at a high level I would have expected the opposite).

@avik-pal

That might be from this code:

mutable struct StatsAndNewStateWrapper
    stats::Any
    st::Any
end

function wrapped_objective_function(
        fn::F, model, ps, data, cache::StatsAndNewStateWrapper
) where {F}
    loss, stₙ, stats = fn(model, ps, cache.st, data)
    cache.stats = stats
    cache.st = stₙ
    return loss
end

function compute_gradients_internal(objective_function::F, model, data, ps, st) where {F}
    st_stats_wrapper = StatsAndNewStateWrapper(nothing, st)
    res = Enzyme.gradient(
        Enzyme.set_abi(Enzyme.ReverseWithPrimal, Reactant.ReactantABI),
        Const(wrapped_objective_function), Const(objective_function),
        Const(model), ps, Const(data), Const(st_stats_wrapper)
    )
    loss, dps = res.val, res.derivs[3]
    return dps, loss, st_stats_wrapper.stats, st_stats_wrapper.st
end

because I need multiple returns from Enzyme.autodiff.

@avik-pal

Also, it might be worthwhile passing return_gradients = Val(true) to single_train_step!; that should eliminate some intermediates.
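A sketch of what that would look like in the inner loop of the script above, assuming return_gradients is accepted as a keyword argument of Training.single_train_step! in the installed Lux version:

# Same call as in the original train loop, with the suggested keyword added
_, loss, _, train_state = Training.single_train_step!(
    adtype, MSELoss(), (xᵢ, yᵢ), train_state; return_gradients = Val(true))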


vpuri3 commented Jan 22, 2025

Thanks both, here is the updated script and the MLIR

[ Info: HLO dumped to /tmp/jl_7OxadMZgvQ.mlir                                                                                                                                                                                                                
2025-01-22 13:02:04.847488: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 2656968583157646233
  0.075871 seconds (17.07 k allocations: 22.296 MiB, 19.48% gc time)
  0.063287 seconds (21.30 k allocations: 27.448 MiB, 26.12% gc time)
[ Info: HLO dumped to /tmp/jl_BSfmN2ZmlM.mlir
  1.112902 seconds (443.00 k allocations: 19.483 MiB, 61.43% compilation time)

[ Info: HLO dumped to /tmp/jl_Fu5UxgvEB0.mlir
2025-01-22 13:02:07.763693: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 1376608044839264878
  0.066194 seconds (327.13 k allocations: 9.773 MiB)
[ Info: HLO dumped to /tmp/jl_z2jKr6H4bO.mlir
  1.340633 seconds (442.96 k allocations: 19.424 MiB, 50.82% compilation time)


vpuri3 commented Jan 22, 2025

Notice that Reactant reports ~50% compilation time even on second runs.


wsmoses commented Jan 22, 2025

Hm, that's weird and bad. @avik-pal, is there a way for Lux to use the precompiled version? (And, separately, it really shouldn't be forcing that much compilation the second time around anyway.)

@avik-pal

Hm that's weird and bad, @avik-pal is there a way for Lux to use the precompiled version

Yes, keep the train_state object around. It contains the compiled functions.

and also separately it really shouldn't be forcing that much compile the second go anyways

@mofeing (I think?) mentioned in today's meeting that recompile times have regressed


vpuri3 commented Jan 22, 2025

Looks like a big chunk of time is taken up by compiling the XLA models. I 100x'd the number of epochs and got the following timings:

Zygote + CPU  :  4.439158 seconds (1.67 M allocations: 2.162 GiB, 5.76% gc time)
Enzyme + CPU  :  4.687645 seconds (2.09 M allocations: 2.658 GiB, 6.91% gc time)
Reactant + CPU:  9.184567 seconds (1.86 M allocations: 71.208 MiB, 0.43% gc time, 5.46% compilation time)

# Base.@time
Zygote + GPU  :  7.112343 seconds (32.36 M allocations: 957.447 MiB, 3.21% gc time)
Reactant + GPU:  3.046458 seconds (1.86 M allocations: 71.210 MiB, 17.63% compilation time)

Maybe it would be useful to separate out the calls to @compile as a step before applying single_train_step?

@avik-pal @wsmoses is there a good way to track GPU memory usage by Reactant, similar to CUDA.@time / CUDA.@allocated? Base.@time / CUDA.@time says that only 71 MiB is used, but nvidia-smi says 8700 MiB is occupied.


wsmoses commented Jan 22, 2025

https://enzymead.github.io/Reactant.jl/dev/api/xla#Reactant.XLA.allocatorstats may be useful for the latter question [on GPU].
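A rough sketch of querying it (allocatorstats is from the linked docs; the exact field names, e.g. bytes_in_use and peak_bytes_in_use, are assumptions based on XLA's AllocatorStats and may differ):

using Reactant
Reactant.set_default_backend("gpu")

stats = Reactant.XLA.allocatorstats()   # stats for the default device's allocator
# Assumed byte-count fields, converted to MiB for a rough comparison with nvidia-smi
println("in use: ", stats.bytes_in_use / 2^20, " MiB")
println("peak:   ", stats.peak_bytes_in_use / 2^20, " MiB")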

@avik-pal

Maybe it would be useful to separate out the calls to @compile as a step before applying single_train_step?

If you call that function once outside the loop, it will cache the compiled version inside the returned train_state object.
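In terms of the train function from the original script, that amounts to something like the following sketch (reusing the names defined there):

# One warm-up step outside the timed loop: the first call triggers compilation and the
# compiled function is cached inside the returned train_state, so later steps reuse it.
x₁, y₁ = first(DL)
_, _, _, train_state = Training.single_train_step!(
    adtype, MSELoss(), (x₁, y₁), train_state)

@time for epoch in 1:E, (xᵢ, yᵢ) in DL
    _, loss, _, train_state = Training.single_train_step!(
        adtype, MSELoss(), (xᵢ, yᵢ), train_state)
end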


avik-pal commented Jan 22, 2025

Zygote + CPU : 4.439158 seconds (1.67 M allocations: 2.162 GiB, 5.76% gc time)

Any chance LoopVectorization is loaded in your env (even via an indirect dep)?


vpuri3 commented Jan 22, 2025

If you call that function once outside the loop, it will cache it inside the returned train_state object

Here's what I got:

##################                                                                                                                                                                                    
CPUDevice() + AutoZygote()                                                                                                                                                                            
##################                                                                                                                                                                                    
Warm up:                                                                                                                                                                                              
  0.002192 seconds (183 allocations: 227.219 KiB)                                                                                                                                                     
Training:                                                                                                                                                                                             
  4.521769 seconds (1.70 M allocations: 2.165 GiB, 5.66% gc time)
                                                 
##################                                                                                 
CPUDevice() + AutoEnzyme()
##################                                                                                 
Warm up:
  0.000784 seconds (251 allocations: 347.500 KiB) 
Training:                                                                                          
  4.615168 seconds (2.11 M allocations: 2.661 GiB, 5.76% gc time)
                                                 
##################                                                                                 
ReactantDevice{Missing, Missing}(missing, missing) + AutoEnzyme()
##################                                                                                 
Warm up:
  0.725995 seconds (318.00 k allocations: 14.596 MiB, 67.76% compilation time)
Training:                                                                                                                                                                                             
  8.284950 seconds (1.56 M allocations: 57.048 MiB, 0.52% gc time, 0.20% compilation time)  
                                                                                                                                                                                                      
                                                                                                   
##################                                                                                 
CUDADevice{Nothing}(nothing) + AutoZygote()
##################                                                                                 
Warm up:
  0.001040 seconds (3.28 k allocations: 99.211 KiB)
Training:                                                                                          
  6.724378 seconds (32.68 M allocations: 967.014 MiB, 3.36% gc time)
                                                 
##################                                                                                 
ReactantDevice{Missing, Missing}(missing, missing) + AutoEnzyme()
##################                                                                                 
Warm up:
  1.090517 seconds (318.02 k allocations: 14.602 MiB, 47.71% compilation time)
Training:
  2.203423 seconds (1.56 M allocations: 57.048 MiB, 0.82% compilation time)


vpuri3 commented Jan 22, 2025

Zygote + CPU : 4.439158 seconds (1.67 M allocations: 2.162 GiB, 5.76% gc time)

Any chance loop vectorization is loaded in your env (even via indirect dep)?

Let me double-check after lecture. I do have MKL.jl installed.


vpuri3 commented Jan 22, 2025

@avik-pal, yes. I have LV installed.

julia> "LoopVectorization" in [x.name for x in values(Pkg.dependencies())]
true

@avik-pal

So Lux is taking the LV path, which would be faster for a small network like this. (Same reason why SimpleChains works much faster than Jax at that size.)


vpuri3 commented Jan 22, 2025

@avik-pal I ran the same test case in a new environment without LV and got the same results:

julia> include("cpu.jl")
"LoopVectorization" in [x.name for x = values(Pkg.dependencies())] = false

##################
CPUDevice() + AutoZygote()
##################
Warm up:
  0.001829 seconds (183 allocations: 227.219 KiB)
Training:
  4.385837 seconds (1.70 M allocations: 2.165 GiB, 5.39% gc time)

##################
CPUDevice() + AutoEnzyme()
##################
Warm up:
  0.000712 seconds (251 allocations: 347.500 KiB)
Training:
  4.633129 seconds (2.11 M allocations: 2.661 GiB, 6.19% gc time)

##################
ReactantDevice{Missing, Missing}(missing, missing) + AutoEnzyme()
##################
Warm up:
  0.728178 seconds (316.32 k allocations: 14.549 MiB, 66.67% compilation time)
Training:
  8.254477 seconds (1.56 M allocations: 57.043 MiB, 0.46% gc time, 0.20% compilation time)


vpuri3 commented Jan 22, 2025

Some other notes:

  1. For large models, the first call to train() takes >2 mins to complete: when I bump the layer width to 128 (16k params), Reactant takes 131s to compile; when I add more layers (80k params), the compile time is 132s.
  2. The XLA pool size is always 75% of the GPU capacity. It would be good to be able to modify that.
  3. Is there a way to reclaim memory from the pool? Do compiled models get GC'd, or can I unsafe_free! them?

@avik-pal

You can use the options exposed in EnzymeAD/Reactant.jl#589. Not yet released.

Is there a way to reclaim memory from the pool?

No. But you can disable preallocation and call GC.gc (https://enzymead.github.io/Reactant.jl/stable/introduction/#Empty-Cache).
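A rough sketch of that combination (XLA_REACTANT_GPU_PREALLOCATE is the environment variable described in the Reactant docs for disabling the preallocated pool; treating it as the right knob here is an assumption, and it must be set before Reactant initializes the GPU client):

# Disable the up-front GPU pool (assumed env var name), then rely on Julia's GC
# to release device buffers that are no longer referenced.
ENV["XLA_REACTANT_GPU_PREALLOCATE"] = "false"

using Reactant
Reactant.set_default_backend("gpu")

# ... compile and run models ...

GC.gc(true)   # per the linked "Empty Cache" section, a full GC frees unreferenced buffers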
