Docker Image is usable with Sarus on the CSCS cluster #512
Currently, I'm working on the branch. At the moment, I'm able to create and run Sarus or Singularity containers on CSCS with a more or less functional Karabo environment.

A Sarus image can easily be created by just running:

```
module load daint-gpu
module load sarus
sarus pull ghcr.io/i4ds/karabo-pipeline:latest
```

For testing purposes, I start an interactive SLURM job (as recommended by CSCS):

```
srun -A sk05 -C gpu --pty bash
```

**MPI**

```
sarus run --tty --mpi --mount=type=bind,source=/users/lgehrig/Karabo-Pipeline,destination=/workspace/Karabo-Pipeline ghcr.io/i4ds/karabo-pipeline:0.19.6 bash
```

This, however, fails because no MPI is found inside the container (it is actually there, just inside the conda environment rather than in a standard location). A Sarus container starts successfully if the mpi-hook (`--mpi`) is simply left out of the command above.

**pytest**

I made sure to load pytest and ran the tests in `/opt/conda/lib/python3.9/site-packages/karabo/test` (a sketch of the invocation follows after the list of reasons below). 4 tests are currently failing. The reasons are:
- GPU-related issues don't occur the second time the tests are called.
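For completeness, a minimal sketch of how the test suite can be invoked inside the container without the mpi-hook (mount, image tag and test path taken from the commands above; the exact pytest arguments used are not recorded in this issue, so treat this as an approximation):

```bash
# Sketch (assumed invocation): start the container without --mpi and run the
# bundled Karabo tests; paths and image tag match the commands above.
srun -A sk05 -C gpu --pty \
  sarus run --tty \
    --mount=type=bind,source=/users/lgehrig/Karabo-Pipeline,destination=/workspace/Karabo-Pipeline \
    ghcr.io/i4ds/karabo-pipeline:0.19.6 \
    bash -c "pytest /opt/conda/lib/python3.9/site-packages/karabo/test"
```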
There are some updates on this matter: I had a discussion with the CSCS support about the mpi-hook. Unfortunately it didn't result in a state where I was able to solve all issues. However, the support stated that we need mpich instead of openmpi, and that it needs to be in a standard location for the native mpi-hook of Sarus containers to work.

This task seems to be very difficult because our dependencies install openmpi. As far as I've seen, this is because we ourselves set openmpi (instead of mpich) as a dependency in the following feedstock builds:

From a quick walk-through of the packages I didn't see anything that speaks for openmpi or against mpich. However, @fschramka claimed in build_base.yml that pinocchio needs that exact openmpi build, which makes me unsure whether I've looked into the above-mentioned repositories (links are in the feedstock meta.yaml files) properly. Thus, as long as we have openmpi in our builds, conda will always install openmpi instead of mpich.

As soon as we have mpich as our dependency instead of openmpi, the task still remains that it has to reside in a standard location and not in a conda environment. The only way I see this working is to install mpich in a standard location, force-remove mpich from the conda environment, and reinstall the mpich dummy package in the environment (according to the conda-forge docs); a sketch of this idea follows below. Maybe a pre-installation of the dummies is possible, to avoid having to force-remove the existing mpich (to test that, I need a test script which makes use of MPI and check whether it works). An example of how to install mpich in a standard location in a docker-container can be seen here.
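To make the idea above concrete, here is a minimal, untested sketch of what the image build could run. The MPICH version, install prefix and the conda-forge `external_*` build string are assumptions on my part; the version would have to match whatever the native mpi-hook / host MPI is ABI-compatible with.

```bash
# Sketch, not a tested recipe: build MPICH into a standard prefix so the
# Sarus native mpi-hook can find and replace it at runtime.
MPICH_VERSION=3.1.4   # assumption; must be ABI-compatible with the host MPI
wget -q https://www.mpich.org/static/downloads/${MPICH_VERSION}/mpich-${MPICH_VERSION}.tar.gz
tar xf mpich-${MPICH_VERSION}.tar.gz
cd mpich-${MPICH_VERSION}
./configure --prefix=/usr/local --disable-fortran
make -j"$(nproc)" && make install
cd .. && rm -rf mpich-${MPICH_VERSION} mpich-${MPICH_VERSION}.tar.gz

# Replace the conda-packaged mpich with the conda-forge "external" dummy build
# so the conda dependencies link against the system MPICH instead
# (build-string pattern as documented by conda-forge; exact pin untested).
conda remove --force -y mpich
conda install -y "mpich=*=external_*"
```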
Something which is also worth mentioning: to me it seems like the karabo wheel doesn't care about installing MPI-compatible dependencies. An example can be seen here: casacore, h5py and hdf5 are all nompi wheels, despite e.g. h5py having both openmpi and mpich builds on anaconda.org. Currently I'm not sure what exactly this means and how to solve this issue. To me it seems like they have MPI processes which can't be used at all because we have nompi wheels; explicitly selecting the MPI variants could look like the sketch below.
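A hedged sketch of how the MPI-enabled variants could be requested via the conda-forge build-string convention (whether Karabo's current pins would allow this is untested):

```bash
# Sketch: request mpich-enabled builds of hdf5/h5py from conda-forge by build
# string instead of the default nompi variants.
conda install -c conda-forge "hdf5=*=mpi_mpich_*" "h5py=*=mpi_mpich_*"
```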
@Lukas113 during the time of compiling, it was the only option; more should be available now - take whatever MPI package you like and recompile the whole pinocchio branch with it :) Just check that you've got MPI binaries bundled - everything marked with "extern" does not hold them.
Small update on my comment above about the mpich integration: sadly, casacore (an oskar dependency) doesn't have any mpich builds, just nompi or openmpi ones (see the check sketched below). Therefore we can't integrate an mpi-enabled casacore into Karabo.
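For reference, one way to check which MPI variants a conda-forge package ships is to list its build strings, for example:

```bash
# Lists available casacore builds; the build string shows the MPI flavour
# (nompi_*, mpi_openmpi_*, ...). The same check works for h5py, hdf5, etc.
conda search -c conda-forge casacore | grep -i mpi
```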
So, a lot of this issue is done with PR #526. However, I can't really check any of the checkpoints mentioned at the beginning of the issue, for several reasons. The reasons, in order of the checkpoints, are as follows:
The mpi-hook is now enabled with PR #526. @kenfus, I therefore suggest we close this issue and open a new one for the second point. Do you agree? If yes, I suggest that you write the issue for the second checkpoint, because you're the person who has already done some work with dask and parallelization on CSCS, and can therefore write a proper issue.
https://user.cscs.ch/tools/containers/sarus/
https://sarus.readthedocs.io/en/latest/quickstart/quickstart.html
https://sarus.readthedocs.io/en/latest/user/custom-cuda-images.html
https://sarus.readthedocs.io/en/latest/user/abi_compatibility.html