
GPU memory size #261

Open

waywardpidgeon opened this issue Nov 18, 2024 · 4 comments

Comments

@waywardpidgeon

While running near_global_ocean_simulation.jl a few days ago, I got a GPU out-of-memory error quite soon after starting. Now I have tried again (with a fresh version of "main") and it has been running for about one hour without error, completing the initial time step in 476.6 ms. The current state of the GPU is in the attached screenshot.

[Screenshot: GPUrunningNearGlobal]

I have an Nvidia RTX A2000 with 6 GB of GPU memory. The MIT simulations appear to be run on an H100 with over 70 GB.

Query: to complete this example, should I reduce the vertical resolution from Nz=40 to, say, Nz=10, making the grid size similar to the ClimaOcean documentation example (which did complete)? Or has the ClimaOcean or Oceananigans code been adapted to run within this rather small 6 GB of GPU memory?

@waywardpidgeon (Author)

The simulation referred to above stopped after 200 iterations upon finding a NaN in u_ocean. The screen messages are given in

near_global_stopped.txt

@glwagner (Member)

@waywardpidgeon changing Nz will reduce the number of grid points, but not the depth. However, you could consider a shallower simulation. Another possibility is to try a one-degree simulation instead:

https://github.com/CliMA/ClimaOcean.jl/blob/main/experiments/one_degree_simulation/one_degree_simulation.jl

I'm not 100% sure about the status of this simulation, but PR #260 is open to continue working on it and improving it. You might follow along and report issues there, which will help push that effort forward.

You can also simply reduce the resolution of the simulation you are working with by changing Nx and Ny where the grid is constructed in the script.
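A minimal sketch of what that change looks like, assuming grid parameters similar to (but not necessarily the same as) those in near_global_ocean_simulation.jl; halving Nx and Ny cuts the horizontal grid-point count, and hence memory use, by roughly a factor of four:

using Oceananigans

# Hypothetical numbers for illustration; the script's actual values differ.
Nx, Ny, Nz = 720, 300, 40   # e.g. half the horizontal points of a 1/4-degree grid

grid = LatitudeLongitudeGrid(GPU();
                             size = (Nx, Ny, Nz),
                             longitude = (0, 360),
                             latitude = (-75, 75),
                             z = (-4000, 0))   # a shallower z extent also saves memory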

To diagnose the cause of a model blow-up, I suggest starting by printing output more frequently. For example, you could change this line:

simulation.callbacks[:progress] = Callback(progress, TimeInterval(5days))

to

simulation.callbacks[:progress] = Callback(progress, IterationInterval(1))

This prints output every iteration, which will show you more precisely at which iteration the model blows up. I would also try systematically reducing the time step to see if you can stabilize the simulation. The time step is set here:

simulation = Simulation(coupled_model; Δt=90, stop_time=10days)
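For instance, you could halve the time step and rerun (a sketch, assuming the same constructor; coupled_model and days come from earlier in the script):

simulation = Simulation(coupled_model; Δt=45, stop_time=10days)  # half of the original 90 s
# If a NaN still appears, try Δt=20, then Δt=10, and so on.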

@waywardpidgeon (Author)

Thanks for the advice. I have started with a shallower simulation (6000 m -> 4000 m). The initial run is taking nearly 6 s of wall time per step at iteration 100. I will try some of the other suggestions later, as well as the documentation example again.
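For reference, the change amounts to something like the following, assuming the vertical extent is set through the grid's z keyword (the script's actual construction may differ):

z = (-4000, 0)   # previously (-6000, 0)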

@glwagner (Member)

That's great! Should we convert this issue to a discussion? Generally, issues are good when we need to change the source code, whereas discussions are better for questions and other topics related to the code that may not require changing the source.

The reason I ask is because issues form a kind of "TO DO list" for development.
