Reactant grabbing wrong CUDA version #1225

Closed
vpuri3 opened this issue Jan 21, 2025 · 27 comments


vpuri3 commented Jan 21, 2025

Looks like Reactant is not grabbing the right CUDA version.

julia> include("misc/react_demo.jl")                                                               
2025-01-21 18:16:53.960554: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 6019370082488903858
E0121 18:16:54.305053   24773 pjrt_stream_executor_client.cc:3045] Execution of replica 0 failed: UNIMPLEMENTED: StreamBeginCaptureToGraph is not implemented for CUDA below version 12.3. Therefore tracing is not supported.
ERROR: LoadError: UNIMPLEMENTED: StreamBeginCaptureToGraph is not implemented for CUDA below version 12.3. Therefore tracing is not supported.
[vedantpu@eagle NeuralROMs.jl]:nvidia-smi
Tue Jan 21 18:26:24 2025       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 550.127.08             Driver Version: 550.127.08     CUDA Version: 12.4     |
|-----------------------------------------+------------------------+----------------------+

The script is included below.

using Lux, MLDataDevices
using CUDA, LuxCUDA, KernelAbstractions
using Random, Printf, Optimisers, MLUtils

using Zygote
using Enzyme
using Reactant

function train(device, adtype)

    N = 10000
    W = 64
    E = 500

    # model
    NN = Chain(Dense(1 => W, gelu), Dense(W => W, gelu), Dense(W => 1))
    ps, st = Lux.setup(Random.default_rng(), NN)

    # data
    x = LinRange(0f0, 1f0, N) |> Array
    x = reshape(x, 1, N)
    y = @. sinpi(2x)

    # data loader
    DL = DataLoader((x, y); batchsize = div(N, 100))

    # device transfer
    ps = ps |> device
    st = st |> device
    DL = DeviceIterator(device, DL)

    # training
    train_state = Training.TrainState(NN, ps, st, Adam(0.001f0))

    for epoch in 1:E
        for (i, (xᵢ, yᵢ)) in enumerate(DL)
            _, loss, _, train_state = Training.single_train_step!(
                adtype, MSELoss(), (xᵢ, yᵢ), train_state)
            if (epoch % E == 0 || epoch == 1) && i == 1
                println("Epoch $(epoch)/$(E)\tLoss: $(loss)")
            end
        end
    end

    return train_state
end

# @time train(cpu_device(), AutoZygote())
# @time train(gpu_device(), AutoZygote())

# @time train(cpu_device(), AutoEnzyme())
# @time train(gpu_device(), AutoEnzyme())

# Reactant.set_default_backend("cpu")
# @time train(reactant_device(), AutoEnzyme())
Reactant.set_default_backend("gpu")
@time train(reactant_device(), AutoEnzyme())

#====================================================#
nothing
vpuri3 changed the title from "Reactant GPU failing" to "Reactant grabbing wrong CUDA version" on Jan 21, 2025.

vpuri3 commented Jan 21, 2025

@wsmoses @avik-pal


wsmoses commented Jan 21, 2025

cc @giordano

@giordano

@wsmoses we build CUDA 12.1 and 12.6, the only way to solve this I see is to build 12.3 too.


wsmoses commented Jan 22, 2025

I think the right thing to do is to drop 12.1 and build 12.2, since that is the last version before the linker dependency.


wsmoses commented Jan 22, 2025

And, somewhat counterintuitively, I think we should map the versions as:

12.3+ -> the 12.6 build (since then we always have the linker issue resolved)
<12.3 -> a 12.2 build (or maybe 12.1)


vpuri3 commented Jan 22, 2025

This is on my lab's on-prem workstation. Should I consider upgrading to CUDA 12.6?


vpuri3 commented Jan 22, 2025

Thanks @giordano @wsmoses, that error is fixed.

The timings I am getting are below. Looks like Reactant is much slower on both CPU and GPU than Zygote/Enzyme. And @avik-pal, AutoEnzyme() on GPU is taking over 5 mins to run. The script is here.

####
# CPU
####

@time train(cpu_device(), AutoZygote())
@time train(cpu_device(), AutoEnzyme())
Reactant.set_default_backend("cpu")
@time train(reactant_device(), AutoEnzyme())

####
# GPU
####

@time train(gpu_device(), AutoZygote())
# @time train(gpu_device(), AutoEnzyme()) # takes over 5 mins so I don't run it
Reactant.set_default_backend("gpu")
@time train(reactant_device(), AutoEnzyme())
  0.153180 seconds (213.75 k allocations: 32.291 MiB, 16.31% gc time, 37.01% compilation time)
  0.118047 seconds (217.79 k allocations: 37.426 MiB, 13.91% gc time, 45.17% compilation time)
2025-01-22 11:31:51.966997: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 13985633240296206436
  0.810326 seconds (368.79 k allocations: 17.013 MiB, 64.08% compilation time)

  0.076810 seconds (327.37 k allocations: 9.804 MiB)
2025-01-22 11:31:53.233898: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 16348477711662704876
  1.097237 seconds (368.81 k allocations: 17.037 MiB, 46.19% compilation time)


wsmoses commented Jan 22, 2025

Just for the sake of removing precompilation time, what happens if you time a second call to train?

e.g.

train(cpu_device(), AutoZygote())
@time train(cpu_device(), AutoZygote())
train(cpu_device(), AutoEnzyme())
@time train(cpu_device(), AutoEnzyme())
Reactant.set_default_backend("cpu")
train(reactant_device(), AutoEnzyme())
@time train(reactant_device(), AutoEnzyme())

####
# GPU
####
train(gpu_device(), AutoZygote())
@time train(gpu_device(), AutoZygote())
# @time train(gpu_device(), AutoEnzyme()) # takes over 5 mins so I don't run it
Reactant.set_default_backend("gpu")
train(reactant_device(), AutoEnzyme())
@time train(reactant_device(), AutoEnzyme())

Also, I'd be very curious to see results from the op profiler (https://enzymead.github.io/Reactant.jl/dev/tutorials/profiling), which incidentally should work with both Reactant code (where it gives tons more helpful info) and generic code.
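For concreteness, a minimal sketch of wrapping the training call in that profiler, assuming the with_profiler helper described in the linked tutorial (name, signature, and output directory are assumptions here, not verified against a specific Reactant release):

using Reactant

# Hypothetical usage: wrap the region of interest and inspect the trace files it writes
# (the linked tutorial describes how to view them).
Reactant.with_profiler("./profile_traces/") do
    train(reactant_device(), AutoEnzyme())
end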

@avik-pal

Can you start Julia with LUX_DUMP_REACTANT_HLO_OPTIMIZE=true or, if you already have Lux loaded, set Lux.DUMP_REACTANT_HLO_OPT_MODE[] = true? It will dump the generated HLO to an .mlir file.
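A minimal sketch of the two options above (flag names are taken from this comment; behavior may differ across Lux versions):

# Option 1: set the environment variable before launching Julia, e.g.
#   LUX_DUMP_REACTANT_HLO_OPTIMIZE=true julia --project misc/react_demo.jl

# Option 2: flip the Ref after loading Lux, before compiling/training
using Lux
Lux.DUMP_REACTANT_HLO_OPT_MODE[] = true   # generated HLO is written to a .mlir file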


wsmoses commented Jan 22, 2025

However, @avik-pal, I'm confused why we see more allocations on the Julia side (at a high level I would have expected the opposite).

@avik-pal

That might be from this code:

mutable struct StatsAndNewStateWrapper
    stats::Any
    st::Any
end

function wrapped_objective_function(
        fn::F, model, ps, data, cache::StatsAndNewStateWrapper
) where {F}
    loss, stₙ, stats = fn(model, ps, cache.st, data)
    cache.stats = stats
    cache.st = stₙ
    return loss
end

function compute_gradients_internal(objective_function::F, model, data, ps, st) where {F}
    st_stats_wrapper = StatsAndNewStateWrapper(nothing, st)
    res = Enzyme.gradient(
        Enzyme.set_abi(Enzyme.ReverseWithPrimal, Reactant.ReactantABI),
        Const(wrapped_objective_function), Const(objective_function),
        Const(model), ps, Const(data), Const(st_stats_wrapper)
    )
    loss, dps = res.val, res.derivs[3]
    return dps, loss, st_stats_wrapper.stats, st_stats_wrapper.st
end

because I need multiple returns from Enzyme.autodiff.

@avik-pal

Also, it might be worthwhile passing return_gradients = Val(true) to single_train_step!; that should eliminate some intermediates.
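A sketch of what that would look like in the inner loop of the script above, assuming return_gradients is accepted as a keyword argument of Training.single_train_step! in the installed Lux version:

# Same call as in the original train loop, with the suggested keyword added
_, loss, _, train_state = Training.single_train_step!(
    adtype, MSELoss(), (xᵢ, yᵢ), train_state; return_gradients = Val(true))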


vpuri3 commented Jan 22, 2025

Thanks both, here is the updated script and the MLIR

[ Info: HLO dumped to /tmp/jl_7OxadMZgvQ.mlir                                                                                                                                                                                                                
2025-01-22 13:02:04.847488: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 2656968583157646233
  0.075871 seconds (17.07 k allocations: 22.296 MiB, 19.48% gc time)
  0.063287 seconds (21.30 k allocations: 27.448 MiB, 26.12% gc time)
[ Info: HLO dumped to /tmp/jl_BSfmN2ZmlM.mlir
  1.112902 seconds (443.00 k allocations: 19.483 MiB, 61.43% compilation time)

[ Info: HLO dumped to /tmp/jl_Fu5UxgvEB0.mlir
2025-01-22 13:02:07.763693: I external/xla/xla/service/llvm_ir/llvm_command_line_options.cc:50] XLA (re)initializing LLVM with options fingerprint: 1376608044839264878
  0.066194 seconds (327.13 k allocations: 9.773 MiB)
[ Info: HLO dumped to /tmp/jl_z2jKr6H4bO.mlir
  1.340633 seconds (442.96 k allocations: 19.424 MiB, 50.82% compilation time)


vpuri3 commented Jan 22, 2025

Notice that Reactant reports ~50% compilation time even on second runs.


wsmoses commented Jan 22, 2025

Hm, that's weird and bad. @avik-pal, is there a way for Lux to use the precompiled version? (And, separately, it really shouldn't be forcing that much compilation the second time around anyway.)

@avik-pal

Hm that's weird and bad, @avik-pal is there a way for Lux to use the precompiled version

Yes, keep the train_state object around. It contains the compiled functions.

and also separately it really shouldn't be forcing that much compile the second go anyways

@mofeing (I think?) mentioned in today's meeting that recompile times have regressed


vpuri3 commented Jan 22, 2025

Looks like a big chunk of time is taken up by compiling the XLA models. I 100x'd the number of epochs and got the following timings:

Zygote + CPU  :  4.439158 seconds (1.67 M allocations: 2.162 GiB, 5.76% gc time)
Enzyme + CPU  :  4.687645 seconds (2.09 M allocations: 2.658 GiB, 6.91% gc time)
Reactant + CPU:  9.184567 seconds (1.86 M allocations: 71.208 MiB, 0.43% gc time, 5.46% compilation time)

# Base.@time
Zygote + GPU  :  7.112343 seconds (32.36 M allocations: 957.447 MiB, 3.21% gc time)
Reactant + GPU:  3.046458 seconds (1.86 M allocations: 71.210 MiB, 17.63% compilation time)

Maybe it would be useful to separate out the calls to @compile as a step before applying single_train_step?

@avik-pal @wsmoses is there a good way to track GPU memory usage by Reactant, similar to CUDA.@time / CUDA.@allocated? Base.@time / CUDA.@time says that only 71 MiB is used, but nvidia-smi says 8700 MiB is occupied.


wsmoses commented Jan 22, 2025

https://enzymead.github.io/Reactant.jl/dev/api/xla#Reactant.XLA.allocatorstats may be useful for the latter question [on GPU].
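A rough sketch of querying it (allocatorstats is from the linked docs; the exact field names, e.g. bytes_in_use and peak_bytes_in_use, are assumptions based on XLA's AllocatorStats and may differ):

using Reactant
Reactant.set_default_backend("gpu")

stats = Reactant.XLA.allocatorstats()   # stats for the default device's allocator
# Assumed byte-count fields, converted to MiB for a rough comparison with nvidia-smi
println("in use: ", stats.bytes_in_use / 2^20, " MiB")
println("peak:   ", stats.peak_bytes_in_use / 2^20, " MiB")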

@avik-pal

Maybe it would be useful to separate out the calls to @compile as a step before applying single_train_step?

If you call that function once outside the loop, it will cache the compiled version inside the returned train_state object.
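In terms of the train function from the original script, that amounts to something like the following sketch (reusing the names defined there):

# One warm-up step outside the timed loop: the first call triggers compilation and the
# compiled function is cached inside the returned train_state, so later steps reuse it.
x₁, y₁ = first(DL)
_, _, _, train_state = Training.single_train_step!(
    adtype, MSELoss(), (x₁, y₁), train_state)

@time for epoch in 1:E, (xᵢ, yᵢ) in DL
    _, loss, _, train_state = Training.single_train_step!(
        adtype, MSELoss(), (xᵢ, yᵢ), train_state)
end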


avik-pal commented Jan 22, 2025

Zygote + CPU : 4.439158 seconds (1.67 M allocations: 2.162 GiB, 5.76% gc time)

Any chance LoopVectorization is loaded in your env (even via an indirect dep)?


vpuri3 commented Jan 22, 2025

If you call that function once outside the loop, it will cache it inside the returned train_state object

Here's what I got:

##################                                                                                                                                                                                    
CPUDevice() + AutoZygote()                                                                                                                                                                            
##################                                                                                                                                                                                    
Warm up:                                                                                                                                                                                              
  0.002192 seconds (183 allocations: 227.219 KiB)                                                                                                                                                     
Training:                                                                                                                                                                                             
  4.521769 seconds (1.70 M allocations: 2.165 GiB, 5.66% gc time)
                                                 
##################                                                                                 
CPUDevice() + AutoEnzyme()
##################                                                                                 
Warm up:
  0.000784 seconds (251 allocations: 347.500 KiB) 
Training:                                                                                          
  4.615168 seconds (2.11 M allocations: 2.661 GiB, 5.76% gc time)
                                                 
##################                                                                                 
ReactantDevice{Missing, Missing}(missing, missing) + AutoEnzyme()
##################                                                                                 
Warm up:
  0.725995 seconds (318.00 k allocations: 14.596 MiB, 67.76% compilation time)
Training:                                                                                                                                                                                             
  8.284950 seconds (1.56 M allocations: 57.048 MiB, 0.52% gc time, 0.20% compilation time)  
                                                                                                                                                                                                      
                                                                                                   
##################                                                                                 
CUDADevice{Nothing}(nothing) + AutoZygote()
##################                                                                                 
Warm up:
  0.001040 seconds (3.28 k allocations: 99.211 KiB)
Training:                                                                                          
  6.724378 seconds (32.68 M allocations: 967.014 MiB, 3.36% gc time)
                                                 
##################                                                                                 
ReactantDevice{Missing, Missing}(missing, missing) + AutoEnzyme()
##################                                                                                 
Warm up:
  1.090517 seconds (318.02 k allocations: 14.602 MiB, 47.71% compilation time)
Training:
  2.203423 seconds (1.56 M allocations: 57.048 MiB, 0.82% compilation time)


vpuri3 commented Jan 22, 2025

Zygote + CPU : 4.439158 seconds (1.67 M allocations: 2.162 GiB, 5.76% gc time)

Any chance loop vectorization is loaded in your env (even via indirect dep)?

Let me double-check after lecture. I do have MKL.jl installed.


vpuri3 commented Jan 22, 2025

@avik-pal, yes. I have LV installed.

julia> "LoopVectorization" in [x.name for x in values(Pkg.dependencies())]
true

@avik-pal

So Lux is taking the LV path, which would be faster for a small network like this. (Same reason why SimpleChains works much faster than Jax at that size.)


vpuri3 commented Jan 22, 2025

@avik-pal I ran the same test case in a new environment without LV and got the same results:

julia> include("cpu.jl")
"LoopVectorization" in [x.name for x = values(Pkg.dependencies())] = false

##################
CPUDevice() + AutoZygote()
##################
Warm up:
  0.001829 seconds (183 allocations: 227.219 KiB)
Training:
  4.385837 seconds (1.70 M allocations: 2.165 GiB, 5.39% gc time)

##################
CPUDevice() + AutoEnzyme()
##################
Warm up:
  0.000712 seconds (251 allocations: 347.500 KiB)
Training:
  4.633129 seconds (2.11 M allocations: 2.661 GiB, 6.19% gc time)

##################
ReactantDevice{Missing, Missing}(missing, missing) + AutoEnzyme()
##################
Warm up:
  0.728178 seconds (316.32 k allocations: 14.549 MiB, 66.67% compilation time)
Training:
  8.254477 seconds (1.56 M allocations: 57.043 MiB, 0.46% gc time, 0.20% compilation time)


vpuri3 commented Jan 22, 2025

Some other notes:

  1. For large models, the first call to train() takes >2 mins to complete: when I bump the layer width to 128 (16k params), Reactant takes 131s to compile; when I add more layers (80k params), the compile time is 132s.
  2. The XLA pool size is always 75% of the GPU capacity. It would be good to be able to modify that.
  3. Is there a way to reclaim memory from the pool? Do compiled models get GC'd, or can I unsafe_free! them?

@avik-pal

You can use the options exposed in EnzymeAD/Reactant.jl#589. Not yet released.

Is there a way to reclaim memory from the pool?

No. But you can disable preallocation and call GC.gc (https://enzymead.github.io/Reactant.jl/stable/introduction/#Empty-Cache).
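A rough sketch of that combination (XLA_REACTANT_GPU_PREALLOCATE is the environment variable described in the Reactant docs for disabling the preallocated pool; treating it as the right knob here is an assumption, and it must be set before Reactant initializes the GPU client):

# Disable the up-front GPU pool (assumed env var name), then rely on Julia's GC
# to release device buffers that are no longer referenced.
ENV["XLA_REACTANT_GPU_PREALLOCATE"] = "false"

using Reactant
Reactant.set_default_backend("gpu")

# ... compile and run models ...

GC.gc(true)   # per the linked "Empty Cache" section, a full GC frees unreferenced buffers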
