
add arm support #35

Merged: 10 commits into main from wangyinz/add_arm_support, May 22, 2024
Conversation

@wangyinz (Collaborator) commented May 6, 2023

This branch is a work in progress adding an arm64-native build of the container. This version simply uses the MPICH library, so unlike #33 it will not run on HPC systems. We should eventually build two parallel versions to support the different architectures.

@wangyinz (Collaborator, Author) commented May 6, 2023

The last build failed after 4 hours... The error comes from SeisSol:

#54 3337.4 [ 34%] Building CXX object CMakeFiles/SeisSol-lib.dir/src/generated_code/subroutine.cpp.o
#54 3340.8 g++: error: unrecognized command-line option ‘-mno-red-zone’
#54 3340.8 make[2]: *** [CMakeFiles/SeisSol-lib.dir/build.make:1382: CMakeFiles/SeisSol-lib.dir/src/generated_code/subroutine.cpp.o] Error 1

More details can be found in the log: https://github.com/SeisSol/Training/actions/runs/4899728175/jobs/8749902315.

The problem seems to be from this line in SeisSol:
https://github.com/SeisSol/SeisSol/blob/9b1b0ec970af4ad79a155c63035234b660838476/generated_code/SConscript#LL82C66-L82C77

The -mno-red-zone option is added deliberately, but it is not recognized by gcc on the arm architecture, because the red zone is an x86 concept.
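A quick shell probe (nothing SeisSol-specific) shows whether the local compiler accepts the flag:

# Compile an empty program with the flag; aarch64 gcc rejects it outright.
if echo 'int main(){return 0;}' | gcc -mno-red-zone -x c - -o /dev/null 2>/dev/null; then
    echo "-mno-red-zone supported"
else
    echo "-mno-red-zone not supported (expected on arm64)"
fi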

Any thoughts on how to get over this? @sebwolf-de @Thomas-Ulrich

@wangyinz (Collaborator, Author) commented May 6, 2023

The build took more than 6 hours, so it was cancelled by GitHub...

@wangyinz (Collaborator, Author) commented May 6, 2023

It took me a few hours to build on my laptop, and I have pushed the container to Docker Hub here.

Note that because this build is compiled with noarch, the binary names are different. You might want to double-check the content of the notebooks and change the binary name to SeisSol_Release_dnoarch_4_elastic or SeisSol_Release_dnoarch_4_viscoelastic2.
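Something along these lines could do the bulk rename in the notebooks (a hypothetical one-liner; check what it matches before running it):

# Hypothetical: swap the x86 (dhsw) binary names for the noarch ones in all notebooks.
grep -rl 'SeisSol_Release_dhsw' . | xargs sed -i 's/SeisSol_Release_dhsw/SeisSol_Release_dnoarch/g'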

I tested it in the emulator on my laptop using the tpv13 notebook. The gmsh, pumgen, and vtk steps went through, but it failed when running SeisSol, with the following error:

!OMP_NUM_THREADS=4 mpirun -n 1 SeisSol_Release_dnoarch_4_elastic parameters.par

Sat May 06 22:24:23, Info:  Welcome to SeisSol 
Sat May 06 22:24:23, Info:  Copyright (c) 2012-2021, SeisSol Group 
Sat May 06 22:24:23, Info:  Built on: May  6 2023 18:10:48 
Sat May 06 22:24:23, Info:  Version: 9b1b0ec (modified) 
Sat May 06 22:24:23, Info:  Running on: "bd90741a4f80" 
Sat May 06 22:24:23, Info:  Using MPI with #ranks: 1 
Sat May 06 22:24:23, Info:  Using OMP with #threads/rank: 4 
Sat May 06 22:24:23, Info:  OpenMP worker affinity (this process): "01--45--89|--23--" 
Sat May 06 22:24:23, Info:  OpenMP worker affinity (this node)   : "01--45--89|--23--" 
Sat May 06 22:24:23, Info:  The stack size ulimit is  8192 [kb]. 
Sat May 06 22:24:23, Warn:  Stack size of 8192 [kb] is lower than recommended minimum of 2097152 [kb]. You can increase the stack size by running the command: ulimit -Ss unlimited. 
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <                SeisSol MPI initialization               >
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    |  Double precision used for real.
Rank:        0 | Info    | <--------------------------------------------------------->
 INFORMATION: The assumed unit number is           6 for stdout and           0 for stderr.
              If no information follows, please change the value.
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <     Start ini_SeisSol ...                               >
Rank:        0 | Info    | <--------------------------------------------------------->
Rank:        0 | Info    | <  Parameters read from file: parameters.par              >
Rank:        0 | Info    | <                                                         >
Rank:        0 | Info    | (Drucker-Prager) plasticity assumed .
Rank:        0 | Info    | Plastic relaxation Tv is set to:    2.9999999999999999E-002
Rank:        0 | Info    | Use averaging to sample material values, when implemented.
Rank:        0 | Info    | No attenuation assumed. 
Rank:        0 | Info    | No adjoint wavefield generated. 
Rank:        0 | Info    | Isotropic material is assumed. 
Rank:        0 | Info    | Read a PUML mesh file
Rank:        0 | Warning | Ignoring space order from parameter file, using           4
Rank:        0 | Info    | Volume output is in XDMF format (new implementation)
Rank:        0 | Info    | Output data are generated at delta T=    5.0000000000000000     
Rank:        0 | Info    | Use HDF5 XdmfWriter backend
Rank:        0 | Info    | Refinement strategy for volume output is Face Extraction :  4 subcells per cell
Sat May 06 22:24:23, Info:  Reading PUML mesh tpv13_training.puml.h5 
Sat May 06 22:24:23, Info:  Found 37074 cells 
Sat May 06 22:24:23, Info:  Found 6977 vertices 
Sat May 06 22:24:24, Info:  Computing LTS weights. 
Sat May 06 22:24:25, Info:  Limiting number of clusters to 2147483646 
Sat May 06 22:24:25, Info:  Computing LTS weights. Done.  (688 reductions.)
Sat May 06 22:24:26, Info:  Reading mesh. Done. 
Sat May 06 22:24:26, Info:  Extracting fault information 
Sat May 06 22:24:26, Info:  Mesh initialized in: 2.82334 (min: 2.82334, max: 2.82334)
Sat May 06 22:24:26, Warn:  Material Averaging is not implemented for plastic materials. Falling back to material properties sampled from the element barycenters instead. 
qemu: uncaught target signal 11 (Segmentation fault) - core dumped

It seems to me that this could be an issue related to the qemu emulator and might go away when running on arm64 natively. Could you please confirm? Thank you!

@wangyinz (Collaborator, Author) commented May 7, 2023

I grabbed an arm instance on AWS to test the container. However, SeisSol fails at the same step with a seg fault. Maybe it has something to do with the compile flags? Note that I simply removed -mno-red-zone in this build; I am not sure how that could lead to a seg fault, though.

@wangyinz (Collaborator, Author) commented May 7, 2023

I tried another build natively on the arm64 node on AWS, but the run still fails with the same error. At this point, I believe there is something wrong with SeisSol itself. I am not sure how to proceed... Below is the error message:

Sun May 07 02:49:24, Info:  Reading mesh. Done. 
Sun May 07 02:49:24, Info:  Extracting fault information 
Sun May 07 02:49:24, Info:  Mesh initialized in: 1.33762 (min: 1.33762, max: 1.33762)
Sun May 07 02:49:24, Warn:  Material Averaging is not implemented for plastic materials. Falling back to material properties sampled from the element barycenters instead. 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 90 RUNNING AT c040cefdbefd
=   EXIT CODE: 139
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

Note that the proxy code runs fine:

root@c040cefdbefd:/home/tools/bin# SeisSol_proxy_Release_dnoarch_4_elastic 10000 100 all
Allocating fake data...
...done

=================================================
===            PERFORMANCE SUMMARY            ===
=================================================
seissol proxy mode                  : all
time for seissol proxy              : 10.687574
cycles                              : 0.000000

GFLOP (libxsmm)                     : 0.000000
GFLOP (pspamm)                      : 0.000000
GFLOP (libxsmm + pspamm)            : 0.000000
GFLOP (non-zero) for seissol proxy  : 50.672430
GFLOP (hardware) for seissol proxy  : 112.950000
GiB (estimate) for seissol proxy    : 21.860003

FLOPS/cycle (non-zero)              : inf
FLOPS/cycle (hardware)              : inf
Bytes/cycle (estimate)              : inf

GFLOPS (non-zero) for seissol proxy : 4.741247
GFLOPS (hardware) for seissol proxy : 10.568348
GiB/s (estimate) for seissol proxy  : 2.045366
=================================================

root@c040cefdbefd:/home/tools/bin# SeisSol_proxy_Release_dnoarch_4_viscoelastic2 10000 100 all
Allocating fake data...
...done

=================================================
===            PERFORMANCE SUMMARY            ===
=================================================
seissol proxy mode                  : all
time for seissol proxy              : 18.992046
cycles                              : 0.000000

GFLOP (libxsmm)                     : 0.000000
GFLOP (pspamm)                      : 0.000000
GFLOP (libxsmm + pspamm)            : 0.000000
GFLOP (non-zero) for seissol proxy  : 115.187430
GFLOP (hardware) for seissol proxy  : 222.960000
GiB (estimate) for seissol proxy    : 46.111643

FLOPS/cycle (non-zero)              : inf
FLOPS/cycle (hardware)              : inf
Bytes/cycle (estimate)              : inf

GFLOPS (non-zero) for seissol proxy : 6.065035
GFLOPS (hardware) for seissol proxy : 11.739651
GiB/s (estimate) for seissol proxy  : 2.427945
=================================================

By the way, the build from #33 fails even with the proxy code, because of invalid AVX2 instructions under emulation:

root@36278a11a601:/home/training/tpv13# SeisSol_proxy_Release_dhsw_4_elastic 1000 100 all
qemu: uncaught target signal 4 (Illegal instruction) - core dumped
Illegal instruction (core dumped)

So, the image built here does run properly on arm64. The error is likely due to SeisSol itself.

@wangyinz (Collaborator, Author) commented May 7, 2023

I should probably take the above back. I ran the three other cases in the container and found that they fail at different steps (all at the very beginning, though). This suggests the issue might be memory related: the arm64 instance I got only has 4 GB of memory, which may not be enough for these runs. Is that plausible? Do you have an estimate of the memory requirement for these runs? Monitoring closely with the top command does reveal a memory spike right before the failure, so that is probably the cause. Could someone with access to an arm-based Mac please verify the container?
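If anyone wants to reproduce the memory behaviour, standard tooling is enough (the commands below assume the Docker CLI on the host and that watch/free exist in the image):

# On the host, in another terminal: per-container memory usage snapshot.
docker stats --no-stream
# Inside the container, watch for the spike while the case runs:
watch -n 1 free -h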

@krenzland (Contributor):

-DHOST_ARCH=noarch is really going to hurt the performance of the code. Maybe with the LIBXSMM_JIT backend this could be mitigated a bit. (Not a priority.)

Note that the arch "thunderx2t99" may also work on M1/M2 chips. It is definitely not optimal, but it should at least activate vectorization. We could also add similar settings for M1/M2, but this is definitely not a priority for us.

I'm also not surprised that building a container with QEMU is taking a long time...

@wangyinz (Collaborator, Author) commented May 8, 2023

I did not use libxsmm in this build because I thought libxsmm does not support arm. I then found that this is not quite accurate: there is no support in any of the released versions, but the development version does seem to have it. Still, I wanted to play it safe, so I used Eigen instead.

I am not sure noarch will actually hurt performance that much on Apple Silicon, because the chip does not have SVE anyway. I think the compiler should enable SIMD optimizations by default. I don't have time to test it out, but since this version of the container is meant to make the training material accessible to the majority, I don't think performance is the priority anyway.
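For reference, the configure step was along these lines (a sketch only: the flags mirror the build command quoted later in this thread, and the exact GEMM_TOOLS_LIST value for Eigen is my assumption):

CC=mpicc CXX=mpicxx cmake .. \
    -DCMAKE_PREFIX_PATH=/home/tools \
    -DGEMM_TOOLS_LIST=Eigen \
    -DHOST_ARCH=noarch \
    -DASAGI=on -DNETCDF=on -DORDER=4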

I am back in my office, so I was able to test it on an M1 MacBook. It turns out that the tpv13 run still seg faults at the same place, but I was able to get the Kaikoura case running. I am not sure what the expected performance should be, but below is a run with 4 OMP threads:

Mon May 08 14:13:51, Info:  Writing energy output at time 0.6 
Mon May 08 14:13:52, Info:  Writing energy output at time 0.6 Done. 
Mon May 08 14:13:52, Info:  Performance since the start: 0.00769377 TFLOP/s (rank 0: 7.69377 GFLOP/s, average over ranks: 7.69377 GFLOP/s) 
Mon May 08 14:13:52, Info:  Performance since last sync point: 0.00801944 TFLOP/s (rank 0: 8.01944 GFLOP/s, average over ranks: 8.01944 GFLOP/s) 

I also tested the three other cases and found that the Sulawesi case failed at:

Mon May 08 14:17:26, Info:  Reading mesh. Done. 
Mon May 08 14:17:26, Info:  Extracting fault information 
Mon May 08 14:17:26, Info:  Mesh initialized in: 3.62567 (min: 3.62567, max: 3.62567)
Mon May 08 14:17:26, Warn:  Material Averaging is not implemented for plastic materials. Falling back to material properties sampled from the element barycenters instead. 
Mon May 08 14:17:26, Warn:  ASAGI: NUMA communication could not be enabled because the ASAGI is not compiled with NUMA support. 
Mon May 08 14:17:26, Warn:  ASAGI: NUMA communication could not be enabled because the ASAGI is not compiled with NUMA support. 
Mon May 08 14:17:26, Warn:  ASAGI: NUMA communication could not be enabled because the ASAGI is not compiled with NUMA support. 
Mon May 08 14:17:26, Warn:  ASAGI: NUMA communication could not be enabled because the ASAGI is not compiled with NUMA support. 

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 213 RUNNING AT 2bfa7cb22d3b
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

The Northridge case started the calculation but failed with an Inf/NaN error:

Mon May 08 14:21:50, Info:  Writing free surface at time 0.
Mon May 08 14:21:50, Info:  Writing free surface at time 0. Done.
Mon May 08 14:21:50, Info:  Writing energy output at time 0 
Mon May 08 14:21:50, Info:  Writing energy output at time 0 Done. 
Mon May 08 14:22:07, Info:  Writing energy output at time 0.5 
Mon May 08 14:22:08, Info:  Elastic energy (total, % kinematic, % potential):  nan  , nan  , nan 
Mon May 08 14:22:08, Error: Detected Inf/NaN in energies. Aborting. 
Backtrace:
SeisSol_Release_dnoarch_4_elastic(+0x6acc4) [0xaaaaca43acc4]
SeisSol_Release_dnoarch_4_elastic(+0x118008) [0xaaaaca4e8008]
SeisSol_Release_dnoarch_4_elastic(+0x1b17c4) [0xaaaaca5817c4]
SeisSol_Release_dnoarch_4_elastic(+0x608bc) [0xaaaaca4308bc]
SeisSol_Release_dnoarch_4_elastic(+0x5e658) [0xaaaaca42e658]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0x4000177f73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0x4000177f74cc]
SeisSol_Release_dnoarch_4_elastic(+0x65e30) [0xaaaaca435e30]
Abort(134) on node 0 (rank 0 in comm 0): application called MPI_Abort(MPI_COMM_WORLD, 134) - process 0
Assertion failed in file src/binding/c/coll/barrier.c at line 36: 0
/lib/aarch64-linux-gnu/libmpich.so.12(+0x2202e4) [0x4000152d02e4]
/lib/aarch64-linux-gnu/libmpich.so.12(MPI_Barrier+0x24c) [0x4000150f289c]
SeisSol_Release_dnoarch_4_elastic(+0x89cad4) [0xaaaacac6cad4]
SeisSol_Release_dnoarch_4_elastic(+0x89dccc) [0xaaaacac6dccc]
SeisSol_Release_dnoarch_4_elastic(+0x5dcae0) [0xaaaaca9acae0]
SeisSol_Release_dnoarch_4_elastic(+0x6623a0) [0xaaaacaa323a0]
SeisSol_Release_dnoarch_4_elastic(+0x665794) [0xaaaacaa35794]
SeisSol_Release_dnoarch_4_elastic(+0x665c40) [0xaaaacaa35c40]
SeisSol_Release_dnoarch_4_elastic(+0x872b8c) [0xaaaacac42b8c]
SeisSol_Release_dnoarch_4_elastic(+0x858fc8) [0xaaaacac28fc8]
SeisSol_Release_dnoarch_4_elastic(+0x661c38) [0xaaaacaa31c38]
SeisSol_Release_dnoarch_4_elastic(+0x6ce600) [0xaaaacaa9e600]
SeisSol_Release_dnoarch_4_elastic(+0x662f50) [0xaaaacaa32f50]
SeisSol_Release_dnoarch_4_elastic(+0x5cb898) [0xaaaaca99b898]
/lib/aarch64-linux-gnu/libc.so.6(+0x3cde8) [0x40001780cde8]
/lib/aarch64-linux-gnu/libc.so.6(+0x3cf0c) [0x40001780cf0c]
/lib/aarch64-linux-gnu/libmpich.so.12(+0x21fe60) [0x4000152cfe60]
/lib/aarch64-linux-gnu/libmpich.so.12(+0x2053b0) [0x4000152b53b0]
/lib/aarch64-linux-gnu/libmpich.so.12(MPI_Abort+0x1c8) [0x400015182878]
SeisSol_Release_dnoarch_4_elastic(+0x6ac48) [0xaaaaca43ac48]
SeisSol_Release_dnoarch_4_elastic(+0x118008) [0xaaaaca4e8008]
SeisSol_Release_dnoarch_4_elastic(+0x1b17c4) [0xaaaaca5817c4]
SeisSol_Release_dnoarch_4_elastic(+0x608bc) [0xaaaaca4308bc]
SeisSol_Release_dnoarch_4_elastic(+0x5e658) [0xaaaaca42e658]
/lib/aarch64-linux-gnu/libc.so.6(+0x273fc) [0x4000177f73fc]
/lib/aarch64-linux-gnu/libc.so.6(__libc_start_main+0x98) [0x4000177f74cc]
SeisSol_Release_dnoarch_4_elastic(+0x65e30) [0xaaaaca435e30]
Abort(1) on node 0: Internal error

So, there are still issues with the SeisSol build, but the container is built properly for the arm architecture.

@wangyinz (Collaborator, Author) commented May 8, 2023

Just another update: I ran the Kaikoura case again, and the performance is significantly improved. It seems to make more sense now.

Mon May 08 19:33:09, Info:  Writing energy output at time 0.4 
Mon May 08 19:33:09, Info:  Writing energy output at time 0.4 Done. 
Mon May 08 19:33:09, Info:  Performance since the start: 0.042381 TFLOP/s (rank 0: 42.381 GFLOP/s, average over ranks: 42.381 GFLOP/s) 
Mon May 08 19:33:09, Info:  Performance since last sync point: 0.0430451 TFLOP/s (rank 0: 43.0451 GFLOP/s, average over ranks: 43.0451 GFLOP/s) 
Mon May 08 19:33:42, Info:  Writing energy output at time 0.6 
Mon May 08 19:33:43, Info:  Writing energy output at time 0.6 Done. 
Mon May 08 19:33:43, Info:  Performance since the start: 0.0426642 TFLOP/s (rank 0: 42.6642 GFLOP/s, average over ranks: 42.6642 GFLOP/s) 
Mon May 08 19:33:43, Info:  Performance since last sync point: 0.0432421 TFLOP/s (rank 0: 43.2421 GFLOP/s, average over ranks: 43.2421 GFLOP/s) 

@krenzland (Contributor):

I did not use libxsmm in this build because I thought libxsmm does not support arm. I then found that this is not quite accurate: there is no support in any of the released versions, but the development version does seem to have it. Still, I wanted to play it safe, so I used Eigen instead.

The latest release has (undocumented) support for Arm but only for selected CPUs. It may not work for Apple silicon.

I am not sure noarch will actually hurt performance that much on Apple Silicon, because the chip does not have SVE anyway. I think the compiler should enable SIMD optimizations by default. I don't have time to test it out, but since this version of the container is meant to make the training material accessible to the majority, I don't think performance is the priority anyway.

They should have NEON support at least. I don't know what code the compiler will emit for Arm architectures without a specified tuning target; it is likely going to be suboptimal.

@sebwolf-de (Collaborator):

@wangyinz could you please rebase this onto the current main branch? IMHO this makes the review easier :D

@wangyinz force-pushed the wangyinz/add_arm_support branch from 6f5cf51 to 69c4df8 on May 9, 2023 at 22:46
@krenzland (Contributor):

Try setting:
https://github.com/SeisSol/SeisSol/blob/70232f83f1e57d79da2b2cdea1afff7713c7568d/cmake/cpu_arch_flags.cmake#LL41C8-L41C42

set(HAS_REDZONE OFF PARENT_SCOPE)

(Might break some configurations on Intel hardware but might help on Arm)

@wangyinz (Collaborator, Author):

I had already set that to get to a successful build:

&& if [ "$TARGETARCH" == "arm64" ]; \
then sed -i 's/ + \[-mno-red-zone\]//g' generated_code/SConscript; fi\
&& if [ "$TARGETARCH" == "arm64" ]; \
then sed -i '/if (HAS_REDZONE)/i set(HAS_REDZONE OFF)' CMakeLists.txt; fi\

The one in the SConscript needs to be removed as well.

@krenzland (Contributor):

The only other thing that arch does is to specify the alignment:
https://github.com/SeisSol/SeisSol/blob/1a7fcd18c4eb30fd3b2f7d026fa3d001030c33db/cmake/process_users_input.cmake#L41
(We should actually align to 64 bytes on most systems anyway, to match the cache line.)

The value in the SConscript doesn't matter, it isn't used anymore.

@wangyinz (Collaborator, Author):

So, thunderx2t99 and noarch are both aligned to 16? Do you think I should add a sed line to manually set that to 64?

Somehow I thought the one in the SConscript gave me an error, but maybe I misremembered.
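If we do change it, a sed line in the same style as the red-zone workaround above could work; the pattern here is hypothetical and would need to be checked against the actual contents of cmake/process_users_input.cmake first:

# Hypothetical: bump the noarch alignment from 16 to 64 bytes before building.
&& if [ "$TARGETARCH" == "arm64" ]; \
   then sed -i 's/set(ALIGNMENT 16)/set(ALIGNMENT 64)/' cmake/process_users_input.cmake; fi \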

@AliceGabriel (Collaborator):

Here is an overview of my private M1 testing of the current PR:

  • the Inf/NaN error occurs in Northridge without attenuation (should we turn off the energy output?)
  • Northridge with attenuation runs fine at 14.4 GFLOP/s (we can ask our M1/M2s tomorrow to run this notebook with attenuation)
  • Kaikoura --> Inf/NaN after a few minutes; before that, 49.25 GFLOP/s
  • TPV13 --> segfaults
  • Palu, Sulawesi --> ASAGI/NUMA error

@sebwolf-de (Collaborator):

A few remarks:

  • We should try to get noarch running in the container first. I actually believe that we are doing something wrong in SeisSol and it is not a generic M1 problem. With a more advanced code generator, debugging is harder though.
  • It is interesting that the viscoelastic version of Northridge runs. Do the other scenarios also work with attenuation, or do they not work at all?

@krenzland (Contributor) commented May 16, 2023

Isn't Northridge the only scenario that doesn't use dynamic rupture?
The small alignment of noarch might break the DR code. Maybe a larger alignment hides slightly incorrect memory accesses?

@sebwolf-de (Collaborator):

Indeed, it's the only scenario without DR, but Kaikoura works for a few minutes, so DR is not completely broken.

@krenzland (Contributor):

I'm also not sure why Northridge runs only with attenuation. In the current implementation, viscoelasticity uses the same wave propagation kernels as the elastic code.

@krenzland (Contributor):

I can reproduce the segfaults, even when using a specific Apple M2 arch setting. I have no idea why; it seems to run well without Docker. I'll investigate.

@davschneller (Collaborator) commented Sep 28, 2023

A small side comment: the "no red zone" fixes should not be necessary anymore; noarch no longer adds that parameter by default (at least when using the latest master; v1.1.0 does not have that change yet. EDIT: v1.1.1 contains the patch).

@davschneller (Collaborator) commented Oct 17, 2023

The segfaults with tpv13 could be due to ASAGI, or the SeisSol ASAGI reader, even though ASAGI is not actually used there. But it is compiled into the binary. Thus, ASAGI is called in https://github.com/SeisSol/SeisSol/blob/master/src/Reader/AsagiReader.h, which in turn is called by https://github.com/SeisSol/SeisSol/blob/313c4e4c459b1ea67302b8887650f51d1ebbf9e7/src/Initializer/ParameterDB.cpp#L626 when initializing an easi model. And the last message you would see before ending up there is exactly a warning like "falling back to materials sampled from cell barycenters".

@krenzland (Contributor):

I can reproduce the crashes but somehow have a very hard time debugging them due to an unrelated issue :(
I'm working on it!

@wangyinz (Collaborator, Author):

I tried a few different combinations, and we learned the following:

  1. PSpaMM + neon: fails with Inf/NaN
  2. PSpaMM + noarch: works
  3. LIBXSMM_JIT + neon or noarch: seg fault (log below)
Tue May 21 18:57:21, Info:  Initialize Memory layout. 
Tue May 21 18:57:21, Info:  Initialize cell-local matrices. 

LIBXSMM_VERSION: feature_mxfp4_bf16_avx2_gemms-1.17-3727 (25693839)
AARCH64/DP    TRY    JIT    STA    COL
   0..13     45      0      0      0 
  14..23      5      0      0      0 
  24..64      2      0      0      0 
Registry and code: 13 MB
Command: SeisSol_Release_dhsw_4_elastic parameters.par
Uptime: 2.325351 s

===================================================================================
=   BAD TERMINATION OF ONE OF YOUR APPLICATION PROCESSES
=   PID 137 RUNNING AT 0494e8febeb4
=   EXIT CODE: 11
=   CLEANING UP REMAINING PROCESSES
=   YOU CAN IGNORE THE BELOW CLEANUP MESSAGES
===================================================================================
YOUR APPLICATION TERMINATED WITH THE EXIT STRING: Segmentation fault (signal 11)
This typically refers to a problem with your application.
Please see the FAQ page for debugging suggestions

The latest PSpaMM also has issues, and we have to use the @davschneller/compile-fixes branch for now. The LIBXSMM issue might relate to their latest update, but we don't know for sure. There is no official stable release of libxsmm that supports arm.

In any case, arm users can use the latest build pushed to Docker Hub: https://hub.docker.com/r/wangyinz/seissoltraining/tags with the command

docker pull wangyinz/seissoltraining:test_arm
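and then start it with something like the following (the Jupyter port mapping is an assumption for illustration, not something confirmed in this thread):

docker run --rm -it -p 8888:8888 wangyinz/seissoltraining:test_arm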

The tpv13 case ran successfully at almost 40 GFLOP/s on an M1 MacBook:

Tue May 21 17:35:25, Info:  Total time spent in compute kernels: 146.643 s ( = 2 min 26.6432 s ) 
Tue May 21 17:35:25, Info:  Total calculated HW-FLOP:  5.8350 TFLOP 
Tue May 21 17:35:25, Info:  Total calculated NZ-FLOP:  2.9176 TFLOP 
Tue May 21 17:35:25, Info:  Total calculated HW-FLOP/s:  39.1719 GFLOP/s 
Tue May 21 17:35:25, Info:  Total calculated NZ-FLOP/s:  19.5867 GFLOP/s 

@davschneller (Collaborator):

It should be noted that using PSpaMM together with noarch as architecture will cause Yateto to generate pure C++ loops for the matrix multiplications and avoid the explicit code generation (i.e. inline assembly) entirely.

@wangyinz (Collaborator, Author):

Quite surprisingly, the multi-arch Docker build finished!

Note that this branch is set up to build both the amd64 and arm64 architectures into the same image (which is already here). Previously the arm64 build ran too slowly to finish within the 6-hour limit of the runner. I guess GitHub has upgraded the runners, and now the workflow finishes in less than 4 hours. Still slow, but we can now ask the attendees to pull the same image regardless of the arch they need.
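For the record, the multi-arch setup boils down to a buildx invocation along these lines (a sketch; the CI workflow drives this through GitHub Actions, and the tag is illustrative):

# One-time: create a builder that can emulate the non-native platform via QEMU.
docker buildx create --use
# Build and push a single manifest that covers both architectures.
docker buildx build --platform linux/amd64,linux/arm64 \
    -t wangyinz/seissoltraining:latest --push .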

@davschneller (Collaborator) commented May 21, 2024

Great to see that! ... That reminds me... There was a doubling of the cores for the runners recently: https://github.blog/2024-01-17-github-hosted-runners-double-the-power-for-open-source/

@AliceGabriel (Collaborator) commented May 21, 2024 via email

@wangyinz (Collaborator, Author):

I have tested on my M1 MacBook and can confirm that this build does not have the NaN error (at least not in the tpv13 case).

&& cd SeisSol \
&& mkdir build_hsw && cd build_hsw \
&& export PATH=$PATH:/home/tools/bin \
&& CC=mpicc CXX=mpicxx cmake .. -DCMAKE_PREFIX_PATH=/home/tools -DGEMM_TOOLS_LIST=PSpaMM -DHOST_ARCH=noarch -DASAGI=on -DNETCDF=on -DORDER=4 \
Contributor (inline review comment on the cmake line above):

Suggestion: add -DDR_QUAD_RULE_OPTIONS=dunavant if we want to harmonize the dockerfiles and decrease run-times.

@Thomas-Ulrich merged commit 6dcb7c6 into main on May 22, 2024 (1 check passed)
@Thomas-Ulrich deleted the wangyinz/add_arm_support branch on May 22, 2024 at 09:07
@davschneller (Collaborator):

As a note, I can reproduce PSpaMM+neon failing with an Inf/NaN while emulating the system with QEMU on an x86-64 machine.

Maybe it is indeed possible for us non-Mac users to debug the ARM container (albeit slowly, and with crashing Python).
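Concretely, something like this should get the arm64 image running under emulation on an x86-64 host (binfmt registration via Docker's helper image; the tag is the one from earlier in this thread):

# Register QEMU handlers for foreign architectures (one-time, privileged).
docker run --privileged --rm tonistiigi/binfmt --install arm64
# Run the arm64 image explicitly on the x86-64 host:
docker run --rm -it --platform linux/arm64 wangyinz/seissoltraining:test_arm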
