
GPU memory size #261

Open

waywardpidgeon opened this issue Nov 18, 2024 · 4 comments

Comments

@waywardpidgeon

While running near_global_ocean_simulation.jl a few days ago, I got a GPU out-of-memory error quite soon after starting. Now I have tried again (with a fresh version of "main") and it has been running for about one hour without error, completing the initial time step in 476.6 ms. The current state of the GPU is in the attached screenshot.

[Screenshot: GPUrunningNearGlobal]

I have an Nvidia RTX A2000 with 6 GB of GPU memory. The MIT simulations appear to be run on an H100 with over 70 GB.

Query: to complete this example, should I reduce the vertical resolution from Nz=40 to, say, Nz=10, making the grid size similar to the ClimaOcean documentation example (which did complete)? Or has the ClimaOcean or Oceananigans code been adapted to run within this rather small 6 GB of GPU memory?

@waywardpidgeon (Author)

The simulation referred to above stopped after 200 iterations upon finding a NaN in u_ocean. The screen messages are given in

near_global_stopped.txt

@glwagner (Member)

@waywardpidgeon changing Nz will reduce the number of grid points, but not the depth. However, you could consider a shallower simulation. Another possibility is to try a one-degree simulation instead:

https://github.com/CliMA/ClimaOcean.jl/blob/main/experiments/one_degree_simulation/one_degree_simulation.jl

I'm not 100% sure about the status of this simulation, but PR #260 is open to continue working on it and improving it. You might follow along and report issues there, which will help push that effort forward.

You can also simply reduce the resolution of the simulation you are working with by changing Nx and Ny where the grid is constructed in the script.
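A minimal sketch of what that change looks like, assuming grid parameters similar to (but not necessarily the same as) those in near_global_ocean_simulation.jl; halving Nx and Ny cuts the horizontal grid-point count, and hence memory use, by roughly a factor of four:

using Oceananigans

# Hypothetical numbers for illustration; the script's actual values differ.
Nx, Ny, Nz = 720, 300, 40   # e.g. half the horizontal points of a 1/4-degree grid

grid = LatitudeLongitudeGrid(GPU();
                             size = (Nx, Ny, Nz),
                             longitude = (0, 360),
                             latitude = (-75, 75),
                             z = (-4000, 0))   # a shallower z extent also saves memory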

To diagnose the cause of a model blow-up, I suggest starting by printing output more frequently. For example, you could change this line:

simulation.callbacks[:progress] = Callback(progress, TimeInterval(5days))

to

simulation.callbacks[:progress] = Callback(progress, IterationInterval(1))

This prints output every iteration, which will show you more precisely at which iteration the model blows up. I would also try systematically reducing the time step to see if you can stabilize the simulation. The time step is set here:

simulation = Simulation(coupled_model; Δt=90, stop_time=10days)
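For instance, you could halve the time step and rerun (a sketch, assuming the same constructor; coupled_model and days come from earlier in the script):

simulation = Simulation(coupled_model; Δt=45, stop_time=10days)  # half of the original 90 s
# If a NaN still appears, try Δt=20, then Δt=10, and so on.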

@waywardpidgeon (Author)

Thanks for the advice. I have started with a shallower simulation (6000 m -> 4000 m). The initial run is taking nearly 6 s of wall time per step at iteration 100. I will try some of the other suggestions later, as well as the documentation example again.
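For reference, the change amounts to something like the following, assuming the vertical extent is set through the grid's z keyword (the script's actual construction may differ):

z = (-4000, 0)   # previously (-6000, 0)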

@glwagner (Member)

That's great! Should we convert this issue to a discussion? Generally, issues are good when we need to change the source code, whereas discussions are better for questions and other topics related to the code that may not require changing the source.

The reason I ask is because issues form a kind of "TO DO list" for development.
