Describe the bug
When writing large files collectively from more than 3 processes, the program ends with a segmentation fault while writing. This happens for gcc builds on various systems. Large means about >= 8 GB per process. We call the HDF5 Fortran API.
I include a minimal failing example. Each process allocates a rank-one int32 array of size 2147483647 (about 8.59 GB), initialized with the process number, and writes it to a rank-one dataset of size n_processes * 2147483647 of type H5T_NATIVE_INTEGER. The program ends with a segmentation fault when called from more than three processes.
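To make the sizes concrete, here is a small arithmetic sketch in Python (not part of the Fortran reproducer). The function names are hypothetical and only illustrate the numbers quoted above. It assumes 4 bytes per int32 element. It also shows that the per-process element count equals the largest signed 32-bit integer, while the per-process byte count does not fit in one. Whether that is related to the crash is not established in this report.

```python
# Size arithmetic for the reproducer described above.
# All names here are illustrative; they do not appear in the Fortran example.

ELEMENTS_PER_PROCESS = 2_147_483_647  # array length per process, from the report
BYTES_PER_ELEMENT = 4                 # assumption: int32 occupies 4 bytes
INT32_MAX = 2**31 - 1                 # largest signed 32-bit integer


def bytes_per_process():
    """Bytes each process holds in memory and writes."""
    return ELEMENTS_PER_PROCESS * BYTES_PER_ELEMENT


def dataset_elements(n_processes):
    """Length of the shared rank-one dataset."""
    return n_processes * ELEMENTS_PER_PROCESS


def dataset_bytes(n_processes):
    """Total bytes written to the dataset by all processes."""
    return dataset_elements(n_processes) * BYTES_PER_ELEMENT


if __name__ == "__main__":
    print(f"per process: {bytes_per_process() / 1e9:.2f} GB")
    print(f"element count equals INT32_MAX: {ELEMENTS_PER_PROCESS == INT32_MAX}")
    print(f"byte count exceeds INT32_MAX:   {bytes_per_process() > INT32_MAX}")
    for n in (2, 3, 4):
        print(f"{n} processes: {dataset_elements(n)} elements, "
              f"{dataset_bytes(n) / 1e9:.2f} GB total")
```

Running the script prints the per-process size and the total dataset size for 2, 3, and 4 processes, which can be compared against the available RAM on each test system.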
Expected behavior
The program writes the expected file collectively and ends without an error.
Platform
The example code fails as described on two systems:
Laptop:
- Dell Latitude 5430
- 12th Gen Intel(R) Core(TM) i7-1265U
- 32 GB RAM
HPC:
- Dell R7525 Server, 2x 7713 AMD EPYC with 1TB RAM with 2x 100 GBit/s Ethernet
- Rocky 9.3 or 9.4 as hypervisor
- Runs a KVM in a docker container
- Connected to a Lustre with 100GB/s Ethernet, 2 meta and 4 storage servers
HDF5 version
- 1.12.3 and 1.14.0 (on both systems)
OS and version
- Ubuntu 24.04.1 LTS (Laptop)
- OpenSUSE 15.5 (HPC)
Compiler and version
- gcc 13.3.0 (Laptop)
- gcc 14.2.0 (HPC)
Build system (e.g. CMake, Autotools) and version
Any configure options you specified
- export CC=mpicc F9X=mpif90 ./configure --enable-build-mode=production --enable-parallel --enable-fortran (Laptop and HPC for HDF5-1.12.3)
- export CC=mpicc CXX=mpicxx FC=mpifort F77=mpif77 ; ./configure --prefix=/software/hdf5/1.14.5-libaec-1.1.3-openmpi-5.0.6-gcc14.2.0 --with-pic --with-pthread --enable-shared --enable-threadsafe --enable-unsupported --enable-parallel --enable-fortran --with-fortran2003 --enable-build-mode=production (HPC for HDF5-1.14.0)
MPI library and version (parallel HDF5)
OpenMPI 4.1.6 (Laptop)
OpenMPI 5.0.6 (HPC)
Additional context
The program ends without an error if the h5dwrite_f call is commented out or if, in the h5pset_dxpl_mpio_f call, H5FD_MPIO_COLLECTIVE_F is replaced with H5FD_MPIO_INDEPENDENT_F.
The program runs as expected on the LUMI HPC in Finland with cray-HDF5-12.2.11 built with gcc 12.2.0, and on Intel nodes (Xeon(R) Gold 6126 CPU @ 2.60GHz) built with intel-oneapi 2021.4.0. On the Intel machine, HDF5-1.12.0 is self-built with export CC=mpiicc F9X=mpiifort ./configure --enable-build-mode=production --enable-parallel --enable-fortran --enable-fortran2003
Additionally, on the Laptop I tested HDF5 installed with apt-get install libhdf5-openmpi-dev, which is version 1.10.10. In this case the program also fails as described above.
All software on the HPC is self built. Here are the specs:
GCC 14.2.0 ./contrib/download_prerequisites followed by ./configure --prefix=/software/gcc/14.2.0 --disable-multilib
libaec 1.1.3 cmake -S ../ -DCMAKE_INSTALL_PREFIX=/software/libaec/1.1.3 ; make ; make install
OpenMPI 5.0.6 gcc/14.2.0 ./configure --prefix=/software/openmpi/5.0.6-gcc14.2.0 ; make -j ; make install
I could also add that make check passes in the HDF5 compilation step as described and the same problems can be reproduced bare-metal on a Linux laptop (also AMD-based) running Tumbleweed.
Using two ranks succeeds, but the moment 3 or more are used it fails...
This looks like an OpenMPI issue. For me, the program works fine with MPICH but fails with OpenMPI 4.0.2. Can you try with MPICH and see if you have the same problem?
Thanks @brtnfld - and my apologies for taking so long to respond.
I just built MPICH and HDF5 locally, and with that build I could run the test successfully.
In another context, mpich was a nightmare for a different tool and I'm not sure we ever resolved that to a satisfactory level while openmpi just works - so I'd much prefer to stick to openmpi if possible. But if the only way to enable users to run their software is to build everything with mpich, so be it.
Do you think you could work out the source of the problem with the OpenMPI people? It seems to be very much the de facto standard for MPI, especially in a research context.
HDF5 1.14.5 libaec/1.1.3 openmpi/5.0.6 gcc/14.2.0
export CC=mpicc CXX=mpicxx FC=mpifort F77=mpif77 ; ./configure --prefix=/software/hdf5/1.14.5-libaec-1.1.3-openmpi-5.0.6-gcc14.2.0 --with-pic --with-pthread --enable-shared --enable-threadsafe --enable-unsupported --enable-parallel --enable-fortran --with-fortran2003 --enable-build-mode=production
Failing Example
parallel_write_int32.f90
Built with