Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added build and instructions for Pixi environment manager #414

Merged
merged 5 commits into from
Nov 19, 2024
Merged

Conversation

aaronkho
Copy link
Contributor

Pixi environment / package manager allows seamless combination of conda and pypi packages, while requiring no system admin privileges. All recommended regression tests have passed on test build.

@aaronkho aaronkho requested a review from smithsp October 25, 2024 18:02
# git clone [email protected]:gafusion/gacode.git
# pixi add "python~=3.12,<3.13"
# pixi add git
# pixi add mpich
Copy link
Member

@smithsp smithsp Oct 28, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had built a conda based environment on the PPPL portal cluster at one point, and we had a problem with using more than 32 cores. See #96 . Are you in a position to try building this on a system that can handle more than 32 cores and testing it?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I have access to a partition of the PSFC cluster with 64 cores per node. Let me build the code there and see if I cannot get it to use more than 32 cores on this cluster.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have compiled it on the PSFC cluster using the PIXI instructions. It seems that the CGYRO regression tests return strange results with varying the number of cores. There is something wrong with the OpenBLAS package I used so OpenMP was not working for this test, I will debug that.

In the meantime, here are the regression results with 64 cores:

(py312_gacode) [aaronkho@eofe10 py312_gacode]$ cgyro -r -n 64 -nomp 1
REGRESSION TESTING: cgyro
reg01: ERROR - Regression data was not generated by simulation.
reg02: ERROR - Regression data was not generated by simulation.
reg03: ERROR - Regression data was not generated by simulation.
reg04: ERROR - Regression data was not generated by simulation.
reg05: PASS
reg06: ERROR - Regression data was not generated by simulation.
reg07: PASS
reg08: ERROR - Regression data was not generated by simulation.
reg09: ERROR - Regression data was not generated by simulation.
reg10: ERROR - Regression data was not generated by simulation.
reg11: ERROR - Regression data was not generated by simulation.
reg12: ERROR - Regression data was not generated by simulation.
reg13: ERROR - Regression data was not generated by simulation.
reg14: ERROR - Regression data was not generated by simulation.
reg15: ERROR - Regression data was not generated by simulation.
reg16: ERROR - Regression data was not generated by simulation.
reg17: PASS
reg18: ERROR - Regression data was not generated by simulation.
reg19: ERROR - Regression data was not generated by simulation.
reg20: ERROR - Regression data was not generated by simulation.
reg21: ERROR - Regression data was not generated by simulation.
reg22: ERROR - Regression data was not generated by simulation.

At 32 cores: reg04, reg12, reg13, reg14, reg18 fail.
At 16 cores: reg12 fails
At 8 cores: All regression tests pass

Please let me know if this is intended behaviour.

Copy link
Contributor Author

@aaronkho aaronkho Oct 29, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed the OpenBLAS issue by adding a missing library (8d4cc83). However, CGYRO regression tests still fail in the same ways.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's important to ensure OpenBLAS is single-threaded:

BINARY=64
USE_THREAD=0
USE_LOCKING=1
NO_SHARED=1
NO_CBLAS=1
NO_LAPACKE=1

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcandy The regression tests do not seem to change, regardless of using OpenBLAS library with single-thread or multi-thread. The only impact is that the computation time is longer with multi-thread, which makes sense with the quoted overhead problems with multi-thread OpenBLAS.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would pick one regression case that isn't working (say, reg12) and then run cgyro directly in the reg12 directory. We should get an idea where the code is failing.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@jcandy I ran everything with -n 32 and looked into reg12 and saw that it crashed with this error:
ERROR: (CGYRO) nc ( 24) not a multiple of coll atoa procs ( 32)

Inside the file out.cgyro.mpi, there is this output:

 Parallelization and distribution diagnostics

         nv:   256
         nc:    24
 GCD(nv,nc):     8
 n_toroidal:     1
     nt_loc:     1

           [coll]     [str]      [NL]      [NL]      [NL]    [coll]     [str]
  n_MPI    nc_loc    nv_loc   n_split  atoa[MB] atoa proc atoa proc ared proc
 ------    ------    ------   -------  -------- --------- --------- ---------
      1        24       256      6144      0.10         1         1         1
      2        12       128      3072      0.05         1         2         2
      4         6        64      1536      0.02         1         4         4
      8         3        32       768      0.01         1         8         8

My guess is that the regression test itself is not suitable to be parallelized across more than 24 cores?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Similarly,
reg04 has ERROR: (CGYRO) nc ( 144) not a multiple of coll atoa procs ( 32)
reg13 has ERROR: (CGYRO) nc ( 144) not a multiple of coll atoa procs ( 32)
reg14 has ERROR: (CGYRO) nc ( 144) not a multiple of coll atoa procs ( 32)
reg18 has ERROR: (CGYRO) nc ( 144) not a multiple of coll atoa procs ( 32)

Copy link
Contributor Author

@aaronkho aaronkho Nov 14, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Switching over to -n 24 on reg12 yields other problems, giving the following error:
ERROR: (CGYRO) nv ( 256) not a multiple of coll atoa procs ( 24)

So it seems that -n 8 passes all regression checks purely by being the largest common denominator of all the CGYRO grids within these tests (which naturally extends to -n 4 and -n 2).

@aaronkho
Copy link
Contributor Author

@smithsp @jcandy Given the nature of the regression errors I found, plus the fact that all the tests pass still works with the recommended -n 4, can we go ahead with approving this request and merge?

@aaronkho aaronkho requested a review from smithsp November 19, 2024 16:47
@jcandy jcandy merged commit d38a87b into master Nov 19, 2024
@jcandy
Copy link
Member

jcandy commented Nov 19, 2024

Done

@aaronkho aaronkho deleted the pixi_build branch November 19, 2024 22:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants