Hi, I am having trouble running some applications with Grid (develop) on Marconi100 at CINECA (2x IBM Power AC922 with 4 NVIDIA Volta V100 GPUs, NVLink 2.0).
I am using
nvcc: NVIDIA (R) Cuda compiler driver
Copyright (c) 2005-2019 NVIDIA Corporation
Built on Thu_Oct_24_17:58:26_PDT_2019
Cuda compilation tools, release 10.2, V10.2.89
gcc (GCC) 8.4.0
mpirun (IBM Spectrum MPI) 10.3.1.02rtm0
The error appears in different tests involving Wilson fermions. The one we ultimately need to run is Test_hmc_WCMixedRepFG_Production with Nc=4 but the error can be reproduced by simply running Test_wilson_force with Nc=3, see below. The error message is
Let me list a few cases in which the error appears. In all the following examples I am using 4^4 local lattices.
Test_wilson_force Nc=3, 1 node and 2 GPUs works (e.g. mpirun -np 2 Test_wilson_force --grid 4.4.4.8 --mpi 1.1.1.2)
Test_wilson_force Nc=3, 1 node and 4 GPUs fails (e.g. mpirun -np 4 Test_wilson_force --grid 4.4.8.8 --mpi 1.1.2.2)
Test_wilson_force Nc=4, 1 node and 2 GPUs fails
Test_wilson_force Nc=4, 1 node and 4 GPUs fails
Test_hmc_WCMixedRepFG_Production always fails when running on GPUs.
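For anyone reproducing the cases above, the local lattice per rank is the global --grid divided dimension by dimension by the --mpi decomposition. The following is only a small sanity-check sketch in Python, not part of Grid; the helper names local_lattice and rank_count are made up for illustration.

```python
from math import prod


def parse_dims(dotted):
    """Turn a dotted string such as '4.4.8.8' into a list of ints."""
    return [int(x) for x in dotted.split(".")]


def local_lattice(grid, mpi):
    """Return the per-rank lattice for given --grid and --mpi strings.

    Hypothetical helper: it only does the division and checks that
    every global dimension is divisible by the rank count in that dimension.
    """
    g, m = parse_dims(grid), parse_dims(mpi)
    if len(g) != len(m):
        raise ValueError("--grid and --mpi must have the same number of dimensions")
    if any(gd % md != 0 for gd, md in zip(g, m)):
        raise ValueError("each --grid dimension must be divisible by the matching --mpi entry")
    return [gd // md for gd, md in zip(g, m)]


def rank_count(mpi):
    """Number of MPI ranks implied by the --mpi decomposition."""
    return prod(parse_dims(mpi))


if __name__ == "__main__":
    # The two example command lines from the list above.
    for grid, mpi in [("4.4.4.8", "1.1.1.2"), ("4.4.8.8", "1.1.2.2")]:
        print(grid, mpi, "->", local_lattice(grid, mpi), "on", rank_count(mpi), "ranks")
```

Both example command lines give a 4.4.4.4 local lattice, on 2 and 4 ranks respectively, so the working and failing Nc=3 runs differ in the number of ranks and GPUs but not in the local volume.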
Other information:
-Benchmark_dwf, Benchmark_ITT, and Benchmark_comms_host_device work fine.
-The error does not appear on Jureca.
-I ran Test_wilson_force with the --debug-mem option, and the allocated-memory profile is the same on Jureca and Marconi100 (until the latter dies).
-Reducing --enable-gen-simd-width allows Test_wilson_force to work, but Test_hmc_WCMixedRepFG_Production still eventually fails, especially with Nc=4.
The configure line I am using is inspired by the instructions for Summit on the Grid wiki.
The cxxflag “-Xcompiler -mno-float128” seems necessary when using CUDA 10 with gcc. The most recent CUDA version on Marconi100 is 11.0, which I also tried by disabling the macro error in CompilerCompatible.h, but the error persists.
I have tried adding and removing several configure options with no luck, e.g. the ones suggested in #346.
I am attaching config.log, grid.configure.summary and the output of make V=1.
Thanks for the help,
Alessandro
config.log
grid.configure.summary.log
makeV1.log