It looks like #3303 has broken the model...sort of.
To reproduce it, the SCM seems to trigger the failure reliably.
In my nightly tests, all the regression tests failed. It presented as a layout-test failure, but that was only partly true: the real issue is that something in #3303 broke 1x6 runs...kind of.
The error seems to be at restart read time:
Character Resource Parameter: ROMIO_CB_WRITE:enable, (default value)
Character Resource Parameter: CB_BUFFER_SIZE:16777216, (default value)
Using parallel NetCDF to read file: fvcore_internal_rst
Abort(134825477) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7fffec5c6ebc) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(805914117) on node 1 (rank 1 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffcc02d9fbc) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(403260933) on node 5 (rank 5 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffe64d8fc3c) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(470369797) on node 2 (rank 2 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffccbb3713c) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(604587525) on node 3 (rank 3 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7fff35d1c2bc) failed
PMPI_Comm_free(98).: Invalid communicator
Abort(269043205) on node 4 (rank 4 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffd45792c3c) failed
PMPI_Comm_free(98).: Invalid communicator
>> Error << /usr/local/intel/oneapi/2024/mpi/2021.13/bin/mpirun -np 6 ./GEOSgcm.x --logging_config logging.yaml: status = 5; at /gpfsm/dnb34/mathomp4/SystemTests/builds/AGCM_MAPLDEV/CURRENT/GEOSgcm/install-Release/bin/esma_mpirun line 377.
I did some tests, though, and I can't seem to duplicate this easily. That is, a 4x24 run of C24 or C48 works, as does a 1x6 run of C24 or C48 (both 0-day and 6-hour).
But if you run gcm_regress.j, I can reproduce the crash at both C24 and C48.
But but! If you run the SCM, the failure seems quite reproducible. It always dies:
Character Resource Parameter: CB_BUFFER_SIZE:16777216, (default value)
Using parallel NetCDF to read file: gwd_internal_rst
Bootstrapping Variable: EFFRDG in gwd_internal_rst
Bootstrapping Variable: KWVRDG in gwd_internal_rst
Abort(403260933) on node 0 (rank 0 in comm 0): Fatal error in PMPI_Comm_free: Unknown error class, error stack:
PMPI_Comm_free(135): MPI_Comm_free(comm=0x7ffeda3800bc) failed
PMPI_Comm_free(98).: Invalid communicator
>> Error << /usr/local/intel/oneapi/2024/mpi/2021.13/bin/mpirun -np 1 ./GEOSgcm.x --logging_config logging.yaml: status = 5; at /gpfsm/dnb34/mathomp4/SystemTests/builds/AGCM_MAPLDEV/CURRENT/GEOSgcm/inst
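Since every abort comes from PMPI_Comm_free complaining about an invalid communicator, the failure mode is worth spelling out: MPICH raises this error class when the handle passed to MPI_Comm_free is not a live communicator, e.g. one that was already freed. As a rough pure-Python analogy (not the actual MAPL or MPICH code; CommTable and its methods are invented for illustration):

```python
# Pure-Python analogy of an MPI library's communicator handle table.
# CommTable is a made-up illustration, NOT real MAPL/MPICH code.

class CommTable:
    """Tracks live communicator handles the way an MPI library might."""

    def __init__(self):
        self._live = set()
        self._next_handle = 1

    def comm_create(self):
        """Hand out a fresh, live communicator handle."""
        handle = self._next_handle
        self._next_handle += 1
        self._live.add(handle)
        return handle

    def comm_free(self, handle):
        # MPICH aborts with "Invalid communicator" when the handle is not
        # a live communicator -- e.g. it was already freed, or it was
        # never a valid communicator in the first place.
        if handle not in self._live:
            raise ValueError("Invalid communicator")
        self._live.remove(handle)


table = CommTable()
comm = table.comm_create()
table.comm_free(comm)          # first free: fine
try:
    table.comm_free(comm)      # second free: the failure mode in the logs
except ValueError as err:
    print(err)                 # -> Invalid communicator
```

A double free of the same communicator, or a free on a handle that was never properly created, could produce the "Invalid communicator" stack seen in the logs above.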
@mathomp4 I'll take a look once my eyes recover from the dilation (and the injection). This seems to be related to my PR (#3303) that got merged recently.