Issue reading c180 GEOS-IT mass flux in c180 GCHP simulation #2862

Open
yuanjianz opened this issue Jun 11, 2024 · 17 comments

@yuanjianz

yuanjianz commented Jun 11, 2024

I am using MAPL 2.26.0 to run GCHP at C180 with GEOS-IT meteorology. Since the input files and the simulation are at the same resolution, I assumed there would be no regridding issue. Surprisingly, I got the error message below:

pe=00000 FAIL at line=00147    NewRegridderManager.F90                  <no such property>
pe=00000 FAIL at line=00092    NewRegridderManager.F90                  <status=1>
pe=00000 FAIL at line=01006    GriddedIO.F90                            <status=1>
pe=00000 FAIL at line=04705    ExtDataGridCompMod.F90                   <status=1>
pe=00000 FAIL at line=01490    ExtDataGridCompMod.F90                   <status=1>

I looked through the issues and found #2118 and #1124. From my understanding, HFLUX essentially requires the resolution to be divisible by Nx and Ny after decomposition.

In my case, however:

  1. Simulation resolution is the same as the raw file, so regridding itself is unnecessary.
  2. Even if we imagine a case where C180 is regridded to C180, I tested with both 49*24 cores (Nx=14, Ny=14*6, 180/14 is not an integer) and 36*24 cores (Nx=12, Ny=12*6, 180/12=15). They both failed with the same error mentioned above.
  3. Interestingly, if I follow the rules and actually do the regridding (read C180 but run at C90 with 25*24 cores: Nx=10, Ny=10*6, 90/10=9), it works!
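
The core-count arithmetic in the list above can be checked with a short Python sketch. It assumes the convention implied by the numbers in this post, namely total cores = Nx * Ny with Ny = 6 * Nx (one square Nx-by-Nx layout per cube face). The function names are made up for illustration and are not part of GCHP or MAPL.

```python
import math

def square_layout(total_cores, faces=6):
    """Return (nx, ny) assuming ny = faces * nx, i.e. total = faces * nx**2."""
    per_face = total_cores // faces
    nx = math.isqrt(per_face)
    if faces * nx * nx != total_cores:
        raise ValueError(f"{total_cores} cores is not {faces} * nx**2")
    return nx, faces * nx

def cells_per_domain(resolution, nx):
    """Cells per subdomain edge, or None if nx does not divide the resolution."""
    return resolution // nx if resolution % nx == 0 else None

# The three tests from the list: (total cores, cubed-sphere resolution).
for cores, res in [(49 * 24, 180), (36 * 24, 180), (25 * 24, 90)]:
    nx, ny = square_layout(cores)
    print(cores, nx, ny, cells_per_domain(res, nx))
```

Under these assumptions, only the 1176-core case has an Nx that does not divide the resolution, so divisibility alone does not separate the two failing C180 runs.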

Related entries in my ExtData.rc are:

MFXC;MFYC Pa_m+2_s-1 N H F0;003000 none  0.6666666 MFXC;MFYC ./MetDir/%y4/%m2/%d2/GEOS.it.asm.ctm_tavg_1hr_glo_C180x180x6_v72.GEOS5294.%y4-%m2-%d2T%h2%n2.V01.nc4 1995-01-01T00:30:00P01:00
CXC;CYC   1          N H F0;003000 none  none      CX;CY     ./MetDir/%y4/%m2/%d2/GEOS.it.asm.ctm_tavg_1hr_glo_C180x180x6_v72.GEOS5294.%y4-%m2-%d2T%h2%n2.V01.nc4 1995-01-01T00:30:00P01:00

As a workaround, I replaced H with N to manually bypass the flux regridding. That works as well.
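
The workaround amounts to changing one field in each ExtData.rc entry. Below is a minimal Python sketch of that edit, assuming entries are whitespace-separated and the regridding flag is the fourth column, as in the entries above. `set_regrid_flag` is a hypothetical helper, not part of MAPL or GCHP.

```python
import re

# Three leading whitespace-separated fields, then the fourth (regridding flag).
_FOURTH_FIELD = re.compile(r"^(\s*(?:\S+\s+){3})(\S+)")

def set_regrid_flag(entry, new_flag, expected_old="H"):
    """Replace the fourth column of one ExtData.rc entry, keeping spacing intact."""
    match = _FOURTH_FIELD.match(entry)
    if match is None or match.group(2) != expected_old:
        return entry  # leave short lines and entries with other flags untouched
    return match.group(1) + new_flag + entry[match.end():]

entry = ("CXC;CYC   1          N H F0;003000 none  none      CX;CY     "
         "./MetDir/%y4/%m2/%d2/GEOS.it.asm.ctm_tavg_1hr_glo_C180x180x6_v72."
         "GEOS5294.%y4-%m2-%d2T%h2%n2.V01.nc4 1995-01-01T00:30:00P01:00")
print(set_regrid_flag(entry, "N"))
```

Only entries whose fourth column is currently H are changed, so other entries in the file pass through unmodified.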

The compiler is Intel with MPT MPI, if that matters.

@mathomp4
Member

Well, since this is ExtData related I'm adding @bena-nasa.

But I'll also mention @tclune and @atrayano as, well, they know this stuff in a way too.

Note that MAPL 2.26 is from a while ago, so perhaps this has been fixed? Though if so, the fix might have been in ExtData2, not ExtData1.

@bena-nasa
Collaborator

@yuanjianz You tried what I was going to suggest: changing the "H" in the fourth column to "N", which gets you around the issue. As you say, if they are on the same grid there is no need to regrid. My first guess is that something in @tclune's implementation of the flux regridder is getting confused, since it should not be doing anything other than a trivial identity regridding.

@yuanjianz
Author

It suddenly came to me that @lizziel did several Transport-Tracer simulations with raw c180 GEOS-IT at C180 resolution on Discover. Did you encounter this problem?

@tclune
Collaborator

tclune commented Jun 12, 2024

Given that things seemed to work for @lizziel my primary suspicion is that the code is failing to detect that the two grids are in fact identical. But even then, I would have expected the code to work, just wasting resources computing what amounts to an identity transform.

@tclune
Collaborator

tclune commented Jun 12, 2024

@yuanjianz Probably not important, but I am confused by the fact that the linked version of MAPL above does not actually produce the error message given in the traceback. ("no such property" does not occur in NewRegridderManager.F90)

It will be important for us to confirm the precise version of MAPL to track this down.

@lizziel
Contributor

lizziel commented Jun 12, 2024

I am looking into this now, first trying to figure out why a subroutine in GriddedIO.F90 is returning false when checking if file grid and run grid are the same. That should be true for the GEOS-IT input file if running at C180.

Regarding NX and NY, I did C180 runs with NX=6, NY=30. I did not play with other combinations. Could it be that file resolution divided by NX must be even?

@tclune
Collaborator

tclune commented Jun 12, 2024

@lizziel the "even" requirement should only be on the model grid. But lots of subtle things about how MAPL invents decompositions may come into play here.

@yuanjianz
Author

yuanjianz commented Jun 12, 2024

@tclune I double-checked and confirmed that the MAPL version and the link were correct.

I did some extra tests, and the problem seems to be related to the total number of cores I request. My previous failed tests used 1176 and 864 cores. However, my recent test with 600 cores at C180 succeeded.

@lizziel's suggestion might be correct. In both the 1176-core (14x14x6) and 864-core (12x12x6) tests, 180 divided by Nx is not an even number, while the 600-core test (10x10x6, 180/10=18) was successful.

@lizziel
Contributor

lizziel commented Jun 12, 2024

Regarding the unequal grids, here is the line that results in false:

if (filegrid/=output_grid) then

I printed out the two grids and get this:
ewl: filegrid not equal to output_grid: ./MetDir/2019/01/01/GEOS.it.asm.asm_tavg_3hr_glo_C180x180x6_v72.GEOS5294.2019-01-01T0130.V01.nc4

filegrid:

 Object has been CREATED
 Base name    = UNKNOWN
 Status: Base = Ready,  object = Ready
 Proxy        = no
 Base ID = 2530, vmID:
--- ESMCI::VMId::print() start ---
  vmKeyWidth = 12
  vmKey=0xFFFFFFFFFFFFFFFFFFFFFFC0  localID = 1
--- ESMCI::VMId::print() end ---
 Root Info (Attributes):
{
  "ESMF": {
    "General": {
      "GRID_LM": 72,
      "GridType": "Cubed-Sphere",
      "MAPL_grid_factory_id": 2,
      "NEW_CUBE": 1
    }
  }

output_grid:

} Object has been CREATED
 Base name    = PE180x1080-CF
 Status: Base = Ready,  object = Ready
 Proxy        = no
 Base ID = 20, vmID:
--- ESMCI::VMId::print() start ---
  vmKeyWidth = 12
  vmKey=0xFFFFFFFFFFFFFFFFFFFFFFC0  localID = 1
--- ESMCI::VMId::print() end ---
 Root Info (Attributes):
{
  "ESMF": {
    "General": {
      "GRID_LM": 72,
      "GridType": "Cubed-Sphere",
      "MAPL_grid_factory_id": 1,
      "NEW_CUBE": 1
    }
  }

This is MAPL 2.26. "no such property" is in MAPL_ErrorHandling.F90.
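
To see which of the printed attributes actually differ, the two dumps can be compared key by key. The Python sketch below copies the attribute values from the output above. `diff_attributes` is a hypothetical helper, and it only compares these printed attributes, which is not necessarily what the `/=` operator in GriddedIO.F90 checks.

```python
# Attribute values copied from the two grid dumps in this comment.
filegrid = {"GRID_LM": 72, "GridType": "Cubed-Sphere",
            "MAPL_grid_factory_id": 2, "NEW_CUBE": 1}
output_grid = {"GRID_LM": 72, "GridType": "Cubed-Sphere",
               "MAPL_grid_factory_id": 1, "NEW_CUBE": 1}

def diff_attributes(a, b):
    """Map each key whose value differs (or is missing) to its (a, b) pair."""
    return {key: (a.get(key), b.get(key))
            for key in sorted(a.keys() | b.keys())
            if a.get(key) != b.get(key)}

print(diff_attributes(filegrid, output_grid))
```

Among the printed attributes, only MAPL_grid_factory_id differs; the base names (UNKNOWN versus PE180x1080-CF) differ as well.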

@lizziel
Contributor

lizziel commented Jun 12, 2024

I was able to run GCHP at C180 using C180 mass fluxes with the following combos:
96 cores, NX=4 (180/4=45 which disproves my even number theory)
150 cores, NX=5
216 cores, NX=6

Like @yuanjianz, 864 cores with NX=12 failed for me.
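
The two candidate rules discussed so far can be checked against the runs reported in this thread. The Python sketch below is only a bookkeeping aid: the (NX, outcome) pairs are taken from the comments above, and the rule functions are illustrative names, not anything in MAPL.

```python
RESOLUTION = 180
# (NX, succeeded) as reported in this thread for C180 runs with C180 mass fluxes.
reports = [(4, True), (5, True), (6, True), (10, True), (12, False), (14, False)]

def divides_evenly(nx):
    """Rule 1: the run works when NX divides the resolution."""
    return RESOLUTION % nx == 0

def quotient_is_even(nx):
    """Rule 2: the run works when resolution / NX is an even integer."""
    return divides_evenly(nx) and (RESOLUTION // nx) % 2 == 0

def consistent(rule):
    """True if the rule predicts success exactly for the runs that succeeded."""
    return all(rule(nx) == ok for nx, ok in reports)

for rule in (divides_evenly, quotient_is_even):
    print(rule.__name__, consistent(rule))
```

Neither rule matches all six reports: NX=12 divides 180 yet fails, and NX=4 gives an odd quotient yet succeeds.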

@tclune
Collaborator

tclune commented Jun 12, 2024

Interesting - will have to investigate further. Hopefully we can reproduce on our end.

@bena-nasa
Collaborator

@lizziel so it sounds like you ran this on Discover? If so, where are these GEOS-IT files on Discover, or could you share your input file with the path?

@lizziel
Contributor

lizziel commented Jun 12, 2024

My runs with 180 cores are on Discover at ~ewlundgr/ccmdev/ewlundgr/GEOS-Chem/1yr_runs/gchp_TransportTracers_geosit_raw_cs_using_mass_flux_with_FRSNO_fix. The GEOS-IT data path is /gpfsm/dnb06/projects/p171/dao_ops/archive/d5294_geosit_jan18/diag.

My test runs with other numbers of cores are at Harvard.

@yuanjianz
Author

yuanjianz commented Jun 14, 2024

@bena-nasa since GriddedIO.F90 fails to detect that the two grids are identical, do you see any potential errors that could be introduced by setting the regridding method to N to bypass the error? (N seems to be bilinear regridding.)

@bena-nasa
Collaborator

bena-nasa commented Aug 13, 2024

@tclune @lizziel

I think I see what the problem is. Your application creates a grid; you said that at 864 cores you used nx=12, ny=72, so each face is decomposed on a 12x12 layout.
Now when the file is read, it tries to make a grid from that file, to be used later in possible regridding and to distribute the data when reading. It has to just choose an nx and ny, and it turns out it chooses a 9x16 layout on each face. We may have a different heuristic we could use that tries to make a more square decomposition (which would be a perfect 12x12), but it would still rely on coincidence, since the user can set whatever decomposition they want.

So the two grids are not the same grid, and it tries to find a regridder, which in this case is the flux regridder. There is no flux regridding that can do what is essentially a redistribution of the data, so it fails.

If you set the regridding to any other regrid method that goes through ESMF, it will just work, since ESMF can always regrid between these two grids, which are really just the same grid with different layouts. So yes, you would get past this error, and the regridder it spawned would be the identity for all practical purposes.

The only solution to this is that we essentially do not allow the user to choose the decomposition. The code would ALWAYS choose a grid decomposition based on the core count, we would make sure the same algorithm is always used, and this would be done for the application grid rather than by looking at a file.

If the user has freedom, there's no solution since they can always choose something other than what an algorithm would choose and when we make the grid from the file, it has to just pick a layout.
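
The layout mismatch described above can be illustrated with a short Python sketch. It is not MAPL's decomposition algorithm: `squarest_layout` is a hypothetical heuristic that picks the factor pair closest to square, and the 12x12 and 9x16 layouts are the ones quoted in this comment.

```python
import math

def factor_pairs(n):
    """All (nx, ny) with nx <= ny and nx * ny == n."""
    return [(i, n // i) for i in range(1, math.isqrt(n) + 1) if n % i == 0]

def squarest_layout(n):
    """Hypothetical heuristic: the factor pair closest to square."""
    return factor_pairs(n)[-1]

cores_per_face = 864 // 6   # 144 cores on each of the six cube faces
user_layout = (12, 12)      # from nx=12, ny=72 in the application
file_layout = (9, 16)       # layout reported for the grid made from the file

print(factor_pairs(cores_per_face))     # both layouts are valid factorizations
print(squarest_layout(cores_per_face))  # what a "more square" heuristic would pick
print(user_layout == file_layout)       # the two decompositions do not match
```

Both layouts are valid factorizations of 144, so any fixed heuristic agrees with a user-chosen layout only by coincidence, which is the point made above.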

@lizziel
Contributor

lizziel commented Aug 14, 2024

Thanks @bena-nasa. For nearly all GCHP users the domain decomposition is chosen automatically anyway as a pre-run step I added to the run script. So making this built-in would not be a problem. However, it would be nice to have an override option for testing (e.g. to run at two different domain decomps and check bit-for-bit reproducibility) as well as for domain decomposition research.

@lizziel
Contributor

lizziel commented Aug 14, 2024

Tagging @sdeastham
