GPU memory size #261
The simulation referred to above failed after 200 iterations when a NaN was found in `u_ocean`. The screen messages are given in the attached output.
@waywardpidgeon I'm not 100% sure about the status of this simulation, but PR #260 is open to continue working on it and improving it. You might follow along and report issues there, which will help push that effort forward. You can also simply reduce the resolution of the simulation you are working with by changing ClimaOcean.jl/examples/near_global_ocean_simulation.jl, lines 35 to 36 (at commit 5a32b35).
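For illustration, reducing the horizontal resolution might look like the sketch below. The variable names and original values are assumptions here, not the actual contents of the example script; check lines 35 to 36 of the example for the real configuration.

```julia
# Hypothetical sketch: halve the horizontal resolution of the grid.
# The actual names and values in near_global_ocean_simulation.jl may differ.
Nx = 720   # e.g. instead of 1440 longitude points
Ny = 300   # e.g. instead of 600 latitude points
```

Halving both horizontal dimensions reduces the memory footprint of every 3D field by roughly a factor of four.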
To diagnose the cause of a model blow-up, I suggest first printing output more frequently. For example, you could change this line to simulation.callbacks[:progress] = Callback(progress, IterationInterval(1)) to print output every iteration. This will show you more precisely at which iteration the model blows up. I would also try reducing the time-step systematically to see if you can stabilize the simulation. The time-step is set here:
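A minimal sketch of the two changes suggested above, assuming the standard Oceananigans.jl `Simulation` API; `progress` stands for the progress-printing function already defined in the example:

```julia
using Oceananigans: Callback, IterationInterval

# Print the progress message every iteration instead of every N iterations,
# so the exact iteration at which the model blows up is visible:
simulation.callbacks[:progress] = Callback(progress, IterationInterval(1))

# Systematically reduce the time-step to try to stabilize the simulation,
# e.g. by halving whatever value the example currently uses:
simulation.Δt = simulation.Δt / 2
```

If halving once is not enough, keep halving until the NaN disappears, then note the largest stable Δt.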
Thanks for the advice. I have started with a shallower simulation (6000 m -> 4000 m). The initial run is taking about 6 s of wall time per step at iteration 100. I will try some of the other suggestions later, as well as the documentation example again.
That's great! Should we convert this issue to a discussion? Generally issues are good when we need to change the source code, whereas discussions are better for simply discussing issues / things related to the code, which may not require changing the source code. The reason I ask is because issues form a kind of "TO DO list" for development.
While running near_global_ocean_simulation.jl a few days ago I had a GPU memory overflow error quite soon after starting. Now I have tried again (with a fresh version of "main") and it has been running for about one hour without error, completing the initial time step in 476.6 ms. The current state of the GPU is shown in the attached screenshot.
I have an Nvidia RTX A2000 with 6 Gbytes of GPU memory. The MIT simulations appear to be done on an H100 with over 70 Gbytes.
Query: if needed to complete this example, should I reduce the vertical resolution from Nz = 40 to, say, Nz = 10, making the example similar in grid size to the ClimaOcean documentation example (which did complete)? Or has the ClimaOcean or Oceananigans code been adapted to this rather small 6 Gbyte GPU memory size?
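A back-of-the-envelope estimate can show why a 6 Gbyte card is tight. All the numbers below are assumptions for illustration (a hypothetical 1/4-degree grid and a guessed field count), not the actual ClimaOcean configuration:

```julia
# Rough, illustrative estimate of GPU memory for 3D fields on a global grid.
# Every number here is an assumption, not the real example configuration.
Nx, Ny, Nz = 1440, 600, 40    # hypothetical 1/4-degree horizontal grid, 40 levels
nfields = 10                  # guess: velocities, tracers, tendencies, halos, ...
bytes = Nx * Ny * Nz * nfields * 8   # Float64 = 8 bytes per element
gib = round(bytes / 2^30, digits=2)
println("≈ $gib GiB")         # ≈ 2.57 GiB for the 3D fields alone
```

Under these assumptions the 3D fields alone take roughly 2.6 GiB, before counting the grid, workspace arrays, and CUDA overhead, so reducing Nz (or the horizontal resolution) is a reasonable way to fit in 6 Gbytes.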