Docker Image is usable with Sarus on the CSCS cluster #512

Closed
4 tasks done
kenfus opened this issue Aug 17, 2023 · 8 comments
Assignees
Labels
cscs Things concerning CSCS Swiss National Supercomputing Centre enhancement New feature or request prio-high High priority

Comments

@kenfus
Member

kenfus commented Aug 17, 2023

  • Karabo Docker Image works with Sarus on CSCS (sbatch script exists with example script running with karabo).
  • Karabo Docker Image is parallelized when running the script on multiple nodes (checked via the dask-dashboard).
  • A workflow exists to deploy your Karabo script with the Docker image on CSCS.
  • OpenMPI is still installed as a dependency of pinocchio / fftw3 but with the help of sarus can speak to "cray-mpich" installed on CSCS.

https://user.cscs.ch/tools/containers/sarus/
https://sarus.readthedocs.io/en/latest/quickstart/quickstart.html
https://sarus.readthedocs.io/en/latest/user/custom-cuda-images.html
https://sarus.readthedocs.io/en/latest/user/abi_compatibility.html

@kenfus kenfus added enhancement New feature or request prio-high High priority labels Aug 17, 2023
@Lukas113 Lukas113 added this to the 0.21 (September 2023) milestone Aug 18, 2023
@lmachadopolettivalle lmachadopolettivalle added the cscs Things concerning CSCS Swiss National Supercomputing Centre label Aug 18, 2023
@Lukas113
Collaborator

Lukas113 commented Sep 26, 2023

Currently, I'm working on the branch 512_sarus to address this issue.

At the moment, I'm able to create and run Sarus or Singularity containers on CSCS with a more or less functional Karabo environment.

A Sarus image can be created by running:

module load daint-gpu
module load sarus
sarus pull ghcr.io/i4ds/karabo-pipeline:latest

For testing, I start an interactive SLURM job, as recommended by CSCS:

srun -A sk05 -C gpu --pty bash
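For non-interactive runs, the same steps could be wrapped in a batch script. The following is only a sketch: the node count, time limit and `my_script.py` are placeholder assumptions, and `make_sarus_sbatch` is a hypothetical helper that prints the script rather than submitting it.

```shell
#!/bin/sh
# Print a Slurm batch script that runs a command inside a Sarus container.
# Arguments: account, image, then the command to run in the container.
# Node count and time limit below are placeholder values (assumptions).
make_sarus_sbatch() {
  account=$1
  image=$2
  shift 2
  cat <<EOF
#!/bin/bash -l
#SBATCH --account=${account}
#SBATCH --constraint=gpu
#SBATCH --nodes=1
#SBATCH --time=00:30:00

module load daint-gpu
module load sarus
srun sarus run ${image} $*
EOF
}

# Example: write the script to a file, then submit it with `sbatch karabo.sbatch`.
# make_sarus_sbatch sk05 ghcr.io/i4ds/karabo-pipeline:latest python3 my_script.py > karabo.sbatch
```

Printing the script first makes it easy to inspect the generated `#SBATCH` lines before anything is submitted.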

MPI
In the interactive SLURM job, I try to run Karabo in a Sarus container. To replace the MPI of the container environment, I tried to enable the native MPI hook as follows:

sarus run --tty --mpi --mount=type=bind,source=/users/lgehrig/Karabo-Pipeline,destination=/workspace/Karabo-Pipeline ghcr.io/i4ds/karabo-pipeline:0.19.6 bash

This fails, however, because no MPI is found inside the container. MPI is in fact installed, but under /opt/conda. I'm currently not sure how to solve this issue, so I asked Victor Holanda for support (still pending).
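One way to confirm where the MPI libraries actually live inside the container is to search a few library directories. This is a minimal sketch: `find_mpi_libs` is a hypothetical helper, and the list of directories is an assumption about which locations are worth checking, not a description of what the Sarus hook itself inspects.

```shell
#!/bin/sh
# List MPI shared libraries found below a root directory.
# The directory list is an assumption: common system locations plus /opt/conda.
find_mpi_libs() {
  root=${1:-/}
  for dir in usr/lib usr/lib64 usr/local/lib opt/conda/lib; do
    # Skip locations that do not exist in this image.
    [ -d "${root%/}/${dir}" ] || continue
    find "${root%/}/${dir}" -maxdepth 2 -name 'libmpi*.so*' 2>/dev/null
  done
}

# Example (inside the container): find_mpi_libs /
```

If every hit is under /opt/conda/lib, MPI is present only in the conda environment, which matches the failure described above.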

image

pytest

A Sarus container starts successfully when the MPI hook (--mpi) is left out of the command above. I made sure to load daint-gpu beforehand. Inside the running Sarus container, the environment seems almost fine. However, when I run:

pytest /opt/conda/lib/python3.9/site-packages/karabo/test

image

Four tests are currently failing. The reasons are:

  • RuntimeError: oskar_interferometer_check_init() failed with code 46 (CUDA-capable device(s) is/are busy or unavailable). This seems odd to me, because a GPU is definitely available (nvidia-smi works inside the Sarus container).
  • FileNotFoundError: [Errno 2] No such file or directory: '/users/lgehrig/miniconda3/etc/pinocchio_params.conf'. For some reason, $CONDA_PREFIX is the same as it was outside the container and doesn't point to /opt/conda.
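As a possible workaround for the second failure, `$CONDA_PREFIX` could be reset inside the container before running the tests. This is a sketch under the assumption that the container's conda root is /opt/conda (as stated above); `use_container_conda_prefix` is a hypothetical helper, and it does not explain why the host value leaks into the container.

```shell
#!/bin/sh
# Point CONDA_PREFIX at the container's conda installation when the inherited
# value does not contain the expected pinocchio parameter file.
# Assumption: the container's conda root is /opt/conda unless passed as $1.
use_container_conda_prefix() {
  container_prefix=${1:-/opt/conda}
  marker=etc/pinocchio_params.conf
  if [ ! -f "${CONDA_PREFIX:-}/${marker}" ] && [ -f "${container_prefix}/${marker}" ]; then
    CONDA_PREFIX=${container_prefix}
    export CONDA_PREFIX
  fi
}

# Example (inside the container, before pytest): use_container_conda_prefix
```

The function leaves `$CONDA_PREFIX` untouched if it already points at a prefix containing the file, so it is harmless outside the container.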

@Lukas113
Collaborator

The GPU-related issues don't occur the second time the tests are called using pytest --lf. Maybe the GPU isn't released fast enough between tests? I'm not sure about that.
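Until the cause is understood, the rerun could be scripted. A minimal sketch: `run_with_rerun` and the `RERUN_DELAY` variable are hypothetical names, and the pause only reflects the unconfirmed guess that the GPU needs time to be released; it masks the problem rather than fixing it.

```shell
#!/bin/sh
# Run a test command; if it fails, wait briefly and rerun only the failures.
# RERUN_DELAY (seconds) is a hypothetical knob of this sketch, not a pytest option.
run_with_rerun() {
  "$@" && return 0
  sleep "${RERUN_DELAY:-5}"
  # --lf (--last-failed) makes pytest rerun only the tests that failed last time.
  "$@" --lf
}

# Example:
# run_with_rerun pytest /opt/conda/lib/python3.9/site-packages/karabo/test
```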

@Lukas113
Collaborator

Lukas113 commented Oct 11, 2023

There are some updates on this matter:

I had a discussion with CSCS support about the mpi-hook. Unfortunately, it didn't get me to a state where I was able to solve all issues. However, support stated that we need mpich instead of openmpi, and that it needs to be in a standard location for the native mpi-hook for Sarus containers.

This task seems to be very difficult because our dependencies install openmpi. As far as I've seen, this is because we ourselves set openmpi instead of mpich as a dependency in the following feedstock builds:

In a quick walk-through of the packages, I didn't see anything that speaks for openmpi or against mpich. However, @fschramka claimed in build_base.yml that pinocchio needs the exact openmpi build, which makes me unsure whether I've looked into the above-mentioned repositories (links are in the feedstock meta.yaml files) properly. In any case, as long as we have openmpi in our builds, conda will always install openmpi instead of mpich.

Once we have mpich as our dependency instead of openmpi, the task still remains that it has to reside in a standard location and not in a conda environment. The only way I see this working is to install mpich in a standard location, force-remove mpich from the conda environment, and reinstall mpich dummies in the environment (according to the conda-forge doc). Maybe pre-installing the dummies is possible, so that the existing mpich doesn't have to be force-removed (to test that, I need a test script which makes use of MPI to see if it works). An example of how to install mpich in a standard location in a Docker container can be seen here.
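The "dummy" packages mentioned in the conda-forge documentation are selected through a build string starting with `external_`. As a minimal sketch, the helper below only builds that match spec; `external_mpich_spec` is a hypothetical name, and the version passed to it is assumed to match the MPICH already installed system-wide in the image.

```shell
#!/bin/sh
# Build the conda match spec for conda-forge's "external" (dummy) MPICH package.
# $1: MPICH version already installed outside conda (defaults to any version).
external_mpich_spec() {
  version=${1:-*}
  printf 'mpich=%s=external_*\n' "$version"
}

# Example (in the image build, after installing MPICH in a standard location;
# <version> is a placeholder for that MPICH version):
# conda install -y -c conda-forge "$(external_mpich_spec <version>)"
```

Whether installing the dummy first avoids the force-remove step is still the open question raised above; this sketch does not answer it.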

@Lukas113
Collaborator

Lukas113 commented Oct 11, 2023

Something which is also worth mentioning:

To me it seems like the karabo-wheel doesn't care about installing mpi-compatible dependencies. An example can be seen here:

image

casacore, h5py and hdf5 are all nompi wheels, even though e.g. h5py on anaconda.org has openmpi and mpich wheels. I'm currently not sure what exactly this means or how to solve it. To me it seems like they have MPI processes which can't be used at all because we have nompi wheels.

@fschramka
Collaborator

@Lukas113 at the time of compiling, it was the only option; more should be available now. Take whatever MPI package you like and recompile the whole pinocchio branch with it :) Just check that you've got MPI binaries bundled: everything marked with "extern" does not hold them.

@Lukas113
Collaborator

Small update on my comment above.

Integrating mpich-enabled h5py and hdf5 wheels seems to be easy: just replace h5py with h5py=*=mpi*. Once we have mpich dependencies, the corresponding mpich wheel will be chosen.

Sadly, casacore (an oskar dependency) doesn't have any mpich builds, just nompi or openmpi. Therefore we can't integrate an MPI-enabled casacore into karabo.
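To check which MPI variant a resolved environment actually contains, the build string of each package can be inspected. The sketch below assumes the build strings begin with `nompi`, `mpi_mpich` or `mpi_openmpi`, which is the naming implied by the `h5py=*=mpi*` spec above; `mpi_variant` is a hypothetical helper.

```shell
#!/bin/sh
# Classify a conda build string by its MPI variant prefix.
# Assumption: build strings start with nompi, mpi_mpich or mpi_openmpi.
mpi_variant() {
  case $1 in
    nompi*)       echo nompi ;;
    mpi_mpich*)   echo mpich ;;
    mpi_openmpi*) echo openmpi ;;
    *)            echo unknown ;;
  esac
}

# Example: report the variant of every installed package
# (columns of `conda list` are name, version, build, channel).
# conda list | awk '!/^#/ {print $1, $3}' | while read -r name build; do
#   echo "$name: $(mpi_variant "$build")"
# done
```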

@Lukas113
Collaborator

So, a lot of this issue is done with PR #526. However, I can't really check off any of the checkpoints mentioned at the beginning of the issue, for several reasons. The reasons, in order of the checkpoints, are as follows:

  • No sbatch script example exists. I also think it shouldn't. At most, there should be an example in the doc section on how to use Sarus, and not with sbatch. And that is already there in the containers doc.
  • I don't know how to do the second point, which is probably out of scope for this issue.
  • The third checkpoint sounds the same as the first checkpoint to me.
  • The fourth checkpoint requires the mpi-hook, which is not enabled with PR #526 at this point in time.

@Lukas113
Collaborator

The mpi-hook is now enabled with PR #526.

@kenfus Therefore, I suggest we close this issue and open a new one for the second point. Do you agree?

If yes, I suggest that you write the issue for the second checkpoint, because you're the person who has already done some work with dask and parallelization on CSCS, and can therefore write a proper issue.


5 participants