[Bug]: Check for CUDA-aware MPI might fail #1787

Open
mrfh92 opened this issue Feb 10, 2025 · 1 comment · May be fixed by #1793

@mrfh92
Collaborator

mrfh92 commented Feb 10, 2025

What happened?

We check availability of CUDA-aware MPI as follows:

import os
import subprocess

CUDA_AWARE_MPI = False
# check whether OpenMPI supports CUDA-aware MPI
if "openmpi" in os.environ.get("MPI_SUFFIX", "").lower():
    buffer = subprocess.check_output(["ompi_info", "--parsable", "--all"])
    CUDA_AWARE_MPI = b"mpi_built_with_cuda_support:value:true" in buffer
# MVAPICH
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("MV2_USE_CUDA") == "1"
# MPICH
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("MPIR_CVAR_ENABLE_HCOLL") == "1"
# ParaStationMPI
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("PSP_CUDA") == "1"

On some systems I use, MPI_SUFFIX is empty even though OpenMPI is installed (and used by Heat). In those cases the OpenMPI branch of the check is never entered, so the automatic detection does not work and one has to set heat.CUDA_AWARE_MPI = True manually.
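To illustrate the failure mode, here is a minimal sketch of the condition that guards the OpenMPI branch, evaluated against a few environments. The helper name `openmpi_branch_taken` and the sample value `-OpenMPI-4.1` are hypothetical and only serve the illustration; the condition itself is copied from the check quoted in this issue.

```python
def openmpi_branch_taken(env):
    """Return True if the quoted check would run ompi_info for this environment."""
    # Same condition as in the check quoted in this issue.
    return "openmpi" in env.get("MPI_SUFFIX", "").lower()


# MPI_SUFFIX unset or empty: the branch is skipped, ompi_info is never consulted.
print(openmpi_branch_taken({}))                    # False
print(openmpi_branch_taken({"MPI_SUFFIX": ""}))    # False
# Hypothetical suffix value that mentions OpenMPI: the branch is taken.
print(openmpi_branch_taken({"MPI_SUFFIX": "-OpenMPI-4.1"}))  # True
```

With an empty MPI_SUFFIX and none of the MVAPICH, MPICH, or ParaStationMPI variables set to "1", CUDA_AWARE_MPI therefore stays False regardless of how OpenMPI was built.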

Questions

  • Is this a bug in our code, or a misconfiguration of the systems that leave MPI_SUFFIX empty?
  • If it is a bug in our code, how can we turn the check into a catch-all version that does not depend on MPI_SUFFIX?
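One possible direction for the second question, as a sketch and not as the fix adopted in the linked pull request: identify OpenMPI from the MPI library's own version string instead of relying on MPI_SUFFIX alone. This assumes the string comes from mpi4py's `MPI.Get_library_version()` and that OpenMPI reports itself there as "Open MPI ..."; the function names and the injectable `run_ompi_info` parameter are hypothetical and only there to keep the sketch testable without an MPI installation.

```python
import os
import shutil
import subprocess


def is_openmpi(library_version, env):
    """Guess whether the MPI library is OpenMPI.

    library_version is assumed to be the result of MPI.Get_library_version();
    MPI_SUFFIX is kept as a secondary hint only.
    """
    return (
        "open mpi" in library_version.lower()
        or "openmpi" in env.get("MPI_SUFFIX", "").lower()
    )


def _run_ompi_info():
    """Return the output of ompi_info, or b"" if the executable is not on PATH."""
    if shutil.which("ompi_info") is None:
        return b""
    return subprocess.check_output(["ompi_info", "--parsable", "--all"])


def detect_cuda_aware_mpi(library_version, env=None, run_ompi_info=_run_ompi_info):
    """Sketch of a CUDA-aware MPI check that tolerates an empty MPI_SUFFIX."""
    env = os.environ if env is None else env
    cuda_aware = False
    if is_openmpi(library_version, env):
        try:
            buffer = run_ompi_info()
        except (OSError, subprocess.CalledProcessError):
            buffer = b""  # treat a failing ompi_info as "unknown", i.e. False
        cuda_aware = b"mpi_built_with_cuda_support:value:true" in buffer
    # Remaining checks are unchanged from the code quoted in this issue.
    cuda_aware = cuda_aware or env.get("MV2_USE_CUDA") == "1"
    cuda_aware = cuda_aware or env.get("MPIR_CVAR_ENABLE_HCOLL") == "1"
    cuda_aware = cuda_aware or env.get("PSP_CUDA") == "1"
    return cuda_aware
```

In Heat this could be called as `detect_cuda_aware_mpi(MPI.Get_library_version())` after `from mpi4py import MPI`. It still depends on `ompi_info` on PATH belonging to the same OpenMPI installation that mpi4py is linked against, so it narrows the problem but does not fully settle the first question.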

Code snippet triggering the error

Error message or erroneous outcome

Version

main (development branch)

Python version

None

PyTorch version

None

MPI version

@mrfh92 mrfh92 added bug Something isn't working MPI Anything related to MPI communication communication labels Feb 10, 2025
@mrfh92 mrfh92 added this to the 1.5.2 milestone Feb 12, 2025
@mrfh92 mrfh92 self-assigned this Feb 14, 2025

@mrfh92 mrfh92 linked a pull request Feb 14, 2025 that will close this issue
4 tasks