Latest MAPL develop breaks GEOS in certain layouts #3307

Closed
mathomp4 opened this issue Jan 13, 2025 · 2 comments · Fixed by #3310
Labels
🪲 Bug Something isn't working ❗ High Priority This is a high priority PR

Comments

@mathomp4
Member

mathomp4 commented Jan 13, 2025

It looks like #3303 has broken the model...sort of.

To reproduce, the SCM seems to trigger it reliably.


In my nightly tests, all the regression tests failed. It presented as a layout-test failure, but that was only sort of true. The real issue is that something in #3303 broke 1x6 runs...kind of.

As seen in:

/gpfsm/dnb34/mathomp4/SystemTests/runs/AGCM_MAPLDEV/c48_O1_GOCART/CURRENT/run/1day/regress/slurm-42759818.out
  • 1-day 4x24 run: Success
  • 6-hour 4x24 run: Success
  • 18-hour 4x24 run (restarted from 6-hour): Success
  • 6-hour 1x6 run: Fail

The error seems to be at restart read time:

 Character Resource Parameter: ROMIO_CB_WRITE:enable, (default value)
 Character Resource Parameter: CB_BUFFER_SIZE:16777216, (default value)
 Using parallel NetCDF to read file: fvcore_internal_rst
Abort(134825477) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7fffec5c6ebc) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(805914117) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffcc02d9fbc) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(403260933) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffe64d8fc3c) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(470369797) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffccbb3713c) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(604587525) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7fff35d1c2bc) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(269043205) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffd45792c3c) failed
PMPI_Comm_free(98).: Invalid communicator
>> Error << /usr/local/intel/oneapi/2024/mpi/2021.13/bin/mpirun  -np 6 ./GEOSgcm.x --logging_config logging.yaml: status = 5; at /gpfsm/dnb34/mathomp4/SystemTests/builds/AGCM_MAPLDEV/CURRENT/GEOSgcm/install-Release/bin/esma_mpirun line 377.
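When triaging a log like the one above, it can help to tally which ranks aborted and in which MPI call. The following is a minimal sketch, not part of MAPL or GEOS: the function name `summarize_aborts` is hypothetical, and the regular expression assumes the abort-line format exactly as it appears in this log.

```python
import re
from collections import Counter

# Assumed format, taken from the abort lines in the log above, e.g.
# "Abort(134825477) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_free: ..."
ABORT_RE = re.compile(
    r"Abort\((?P<code>\d+)\) on node \d+ \(rank (?P<rank>\d+) in comm \d+\): "
    r"Fatal error in (?P<call>\w+):"
)


def summarize_aborts(log_text):
    """Return (sorted list of aborting ranks, Counter of failing MPI calls).

    Hypothetical helper for illustration only.
    """
    ranks = set()
    calls = Counter()
    for match in ABORT_RE.finditer(log_text):
        ranks.add(int(match.group("rank")))
        calls[match.group("call")] += 1
    return sorted(ranks), calls
```

Passing the contents of the slurm output file to `summarize_aborts` returns the list of ranks that aborted and a count per failing call, which makes it quick to check whether every rank of a layout failed in the same routine.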

I did some tests, though, and I can't seem to duplicate this easily. That is, a standalone 4x24 run of C24 or C48 works, as does a standalone 1x6 run of C24 or C48 (both 0-day and 6-hour).

But if I run gcm_regress.j, I can reproduce the crash at both C24 and C48.

But but! If you run the SCM, the failure seems quite reproducible. It always dies:

 Character Resource Parameter: CB_BUFFER_SIZE:16777216, (default value)
 Using parallel NetCDF to read file: gwd_internal_rst
   Bootstrapping Variable: EFFRDG in gwd_internal_rst
   Bootstrapping Variable: KWVRDG in gwd_internal_rst
Abort(403260933) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffeda3800bc) failed
PMPI_Comm_free(98).: Invalid communicator
>> Error << /usr/local/intel/oneapi/2024/mpi/2021.13/bin/mpirun  -np 1 ./GEOSgcm.x --logging_config logging.yaml: status = 5; at /gpfsm/dnb34/mathomp4/SystemTests/builds/AGCM_MAPLDEV/CURRENT/GEOSgcm/inst
@mathomp4 mathomp4 added 🪲 Bug Something isn't working ❗ High Priority This is a high priority PR labels Jan 13, 2025
@atrayano
Contributor

@mathomp4 I'll take a look once my eyes recover from the dilation (and the injection). This seems to be related to my PR (#3303) that got merged recently.

@mathomp4
Member Author

@atrayano Yes, that is my thought. So weird though that it's both reproducible and irreproducible!
