
More generic check for CUDA-aware MPI #1793

Open
wants to merge 3 commits into base: main

Conversation

mrfh92
Collaborator

@mrfh92 mrfh92 commented Feb 14, 2025

fixes #1787

  • the check for CUDA-awareness of OpenMPI is now wrapped in a try/except construction to ensure that it completes in every case
  • a warning is issued if PyTorch supports CUDA but no CUDA-aware MPI is detected
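
The try/except pattern described above might be sketched roughly as follows. This is illustrative only: `detect_cuda_aware_mpi`, `query`, and `failing_query` are made-up names, not Heat's actual API; the real logic lives in heat/core/communication.py and may use different detection paths.

```python
import os


def detect_cuda_aware_mpi(query, env=os.environ):
    """Run a vendor-specific CUDA-support query inside try/except so that
    detection never raises, whatever the MPI implementation in use; fall
    back to environment variables that some MPI stacks set (e.g.
    ParaStation MPI sets PSP_CUDA=1)."""
    try:
        cuda_aware = bool(query())
    except Exception:
        # the query itself failed -- treat MPI as not CUDA-aware
        # instead of letting the import of the module blow up
        cuda_aware = False
    return cuda_aware or env.get("PSP_CUDA") == "1"


def failing_query():
    # stands in for an MPI build that exposes no CUDA-support information
    raise RuntimeError("no CUDA-support information available")


print(detect_cuda_aware_mpi(failing_query, env={}))                 # False
print(detect_cuda_aware_mpi(failing_query, env={"PSP_CUDA": "1"}))  # True
```

The point of the wrapper is that a failing vendor query degrades to "not CUDA-aware" plus a warning, rather than an exception at import time.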

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • [ ] benchmarks: created for new functionality
    • [ ] benchmarks: performance improved or maintained
    • documentation updated where needed

Does this change modify the behaviour of other functions? If so, which?

no

…; warning is issued if PyTorch supports GPUs but no cuda-aware MPI is found.
@mrfh92 mrfh92 self-assigned this Feb 14, 2025
@github-actions github-actions bot added backport stable bug Something isn't working core labels Feb 14, 2025
@mrfh92 mrfh92 added MPI Anything related to MPI communication communication labels Feb 14, 2025
Contributor

Thank you for the PR!


codecov bot commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.25%. Comparing base (9c8eaf5) to head (c96a4e4).

Files with missing lines Patch % Lines
heat/core/communication.py 66.66% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1793   +/-   ##
=======================================
  Coverage   92.24%   92.25%           
=======================================
  Files          84       84           
  Lines       12460    12465    +5     
=======================================
+ Hits        11494    11499    +5     
  Misses        966      966           
Flag Coverage Δ
unit 92.25% <66.66%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@ClaudiaComito ClaudiaComito added this to the 1.5.1 milestone Feb 17, 2025
@ClaudiaComito ClaudiaComito changed the title Fix check for CUDA-aware MPI More generic check for CUDA-aware MPI Feb 17, 2025
ClaudiaComito
ClaudiaComito previously approved these changes Feb 17, 2025
Contributor

@ClaudiaComito ClaudiaComito left a comment


As far as I'm concerned this can be merged, thanks for looking into this!


@ClaudiaComito ClaudiaComito modified the milestones: 1.5.1, 1.6 Feb 17, 2025
@JuanPedroGHM
Member

Benchmarks results - Sponsored by perun

function mpi_ranks device metric value ref_value std % change type alert lower_quantile upper_quantile
heat_benchmarks 4 CPU RUNTIME 273.389 52.39 11.6914 421.834 jump-detection True nan nan
matmul_split_0 4 CPU RUNTIME 0.333236 0.0963688 0.0216452 245.792 jump-detection True nan nan
matmul_split_1 4 CPU RUNTIME 0.376656 0.0990343 0.0433766 280.329 jump-detection True nan nan
qr_split_0 4 CPU RUNTIME 0.499975 0.218702 0.0113442 128.61 jump-detection True nan nan
qr_split_1 4 CPU RUNTIME 0.209519 0.163714 0.00352157 27.9784 jump-detection True nan nan
hierachical_svd_rank 4 CPU RUNTIME 0.0905598 0.062739 0.00491588 44.3438 jump-detection True nan nan
hierachical_svd_tol 4 CPU RUNTIME 0.0813055 0.0734396 0.00857699 10.7107 jump-detection True nan nan
reshape 4 CPU RUNTIME 0.832802 0.0762245 0.0134849 992.564 jump-detection True nan nan
concatenate 4 CPU RUNTIME 1.75758 0.129364 0.0695605 1258.64 jump-detection True nan nan
resplit 4 CPU RUNTIME 15.7163 1.16468 0.0552245 1249.4 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 CPU RUNTIME 0.209927 0.0106696 0.001661 1867.52 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00149274 0.0013495 9.32465e-05 10.6145 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.00115566 0.000851977 8.32206e-05 35.6448 jump-detection True nan nan
apply_inplace_normalizer 4 CPU RUNTIME 0.00397964 0.00153844 0.0060146 158.679 jump-detection True nan nan
incremental_pca_split0 4 CPU RUNTIME 242.97 42.6731 11.7832 469.373 jump-detection True nan nan
heat_benchmarks 4 GPU RUNTIME 75.6497 21.333 0.203889 254.614 jump-detection True nan nan
matmul_split_0 4 GPU RUNTIME 0.169956 0.0164776 0.00366775 931.437 jump-detection True nan nan
matmul_split_1 4 GPU RUNTIME 0.194378 0.0162983 0.00569976 1092.63 jump-detection True nan nan
qr_split_0 4 GPU RUNTIME 0.223258 0.0410244 0.00626876 444.207 jump-detection True nan nan
qr_split_1 4 GPU RUNTIME 0.0833278 0.0304706 0.0166195 173.469 jump-detection True nan nan
lanczos 4 GPU RUNTIME 0.87182 0.724528 0.00682621 20.3293 jump-detection True nan nan
hierachical_svd_rank 4 GPU RUNTIME 0.108615 0.095702 0.00048534 13.4932 jump-detection True nan nan
reshape 4 GPU RUNTIME 1.45305 0.154857 0.0426811 838.323 jump-detection True nan nan
concatenate 4 GPU RUNTIME 0.557138 0.0870384 0.0483283 540.106 jump-detection True nan nan
resplit 4 GPU RUNTIME 28.3703 2.19025 0.0827825 1195.3 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 GPU RUNTIME 1.0243 0.0181149 0.0214778 5554.43 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 GPU RUNTIME 0.00197 0.00175316 3.03594e-05 12.3686 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 GPU RUNTIME 0.00118265 0.000979006 7.49305e-05 20.8012 jump-detection True nan nan
apply_inplace_normalizer 4 GPU RUNTIME 0.00342388 0.00239644 0.000280244 42.8736 jump-detection True nan nan
incremental_pca_split0 4 GPU RUNTIME 29.0433 4.9077 0.125791 491.791 jump-detection True nan nan
heat_benchmarks 4 CPU RUNTIME 273.389 49.1195 11.6914 456.58 trend-deviation True 44.8761 54.2423
matmul_split_0 4 CPU RUNTIME 0.333236 0.159913 0.0216452 108.385 trend-deviation True 0.121471 0.207108
matmul_split_1 4 CPU RUNTIME 0.376656 0.138353 0.0433766 172.243 trend-deviation True 0.108312 0.179826
qr_split_0 4 CPU RUNTIME 0.499975 0.277881 0.0113442 79.924 trend-deviation True 0.227327 0.346153
qr_split_1 4 CPU RUNTIME 0.209519 0.169796 0.00352157 23.3946 trend-deviation True 0.163472 0.174831
hierachical_svd_rank 4 CPU RUNTIME 0.0905598 0.056287 0.00491588 60.8895 trend-deviation True 0.0467363 0.0682467
hierachical_svd_tol 4 CPU RUNTIME 0.0813055 0.0627339 0.00857699 29.6039 trend-deviation True 0.0515528 0.0755125
reshape 4 CPU RUNTIME 0.832802 0.19281 0.0134849 331.93 trend-deviation True 0.149999 0.213561
concatenate 4 CPU RUNTIME 1.75758 0.194566 0.0695605 803.336 trend-deviation True 0.147756 0.250637
resplit 4 CPU RUNTIME 15.7163 1.11733 0.0552245 1306.58 trend-deviation True 1.0712 1.17281
apply_inplace_standard_scaler_and_inverse 4 CPU RUNTIME 0.209927 0.00886971 0.001661 2266.79 trend-deviation True 0.00692102 0.01092
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00149274 0.00120594 9.32465e-05 23.7821 trend-deviation True 0.00102689 0.00144081
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.00115566 0.000763574 8.32206e-05 51.3491 trend-deviation True 0.000512126 0.00108052
incremental_pca_split0 4 CPU RUNTIME 242.97 39.9462 11.7832 508.242 trend-deviation True 37.1267 43.2585
heat_benchmarks 4 GPU RUNTIME 75.6497 21.3689 0.203889 254.018 trend-deviation True 20.8217 22.577
matmul_split_0 4 GPU RUNTIME 0.169956 0.0443321 0.00366775 283.371 trend-deviation True 0.0230956 0.0587207
matmul_split_1 4 GPU RUNTIME 0.194378 0.0272088 0.00569976 614.393 trend-deviation True 0.0195909 0.0319684
qr_split_0 4 GPU RUNTIME 0.223258 0.0537206 0.00626876 315.591 trend-deviation True 0.0509402 0.058744
qr_split_1 4 GPU RUNTIME 0.0833278 0.044845 0.0166195 85.8128 trend-deviation True 0.0347649 0.0532777
lanczos 4 GPU RUNTIME 0.87182 0.713858 0.00682621 22.1279 trend-deviation True 0.606565 0.866203
hierachical_svd_rank 4 GPU RUNTIME 0.108615 0.0974697 0.00048534 11.4349 trend-deviation True 0.0953985 0.099392
hierachical_svd_tol 4 GPU RUNTIME 0.12905 0.124553 0.00332751 3.61059 trend-deviation True 0.120811 0.128828
reshape 4 GPU RUNTIME 1.45305 0.235989 0.0426811 515.729 trend-deviation True 0.174847 0.280703
concatenate 4 GPU RUNTIME 0.557138 0.0895676 0.0483283 522.031 trend-deviation True 0.0755032 0.102385
resplit 4 GPU RUNTIME 28.3703 2.12104 0.0827825 1237.57 trend-deviation True 2.04964 2.19401
apply_inplace_standard_scaler_and_inverse 4 GPU RUNTIME 1.0243 0.0143389 0.0214778 7043.49 trend-deviation True 0.010719 0.0199456
apply_inplace_min_max_scaler_and_inverse 4 GPU RUNTIME 0.00197 0.00161336 3.03594e-05 22.1057 trend-deviation True 0.0014344 0.00185432
apply_inplace_max_abs_scaler_and_inverse 4 GPU RUNTIME 0.00118265 0.000800488 7.49305e-05 47.7413 trend-deviation True 0.000601578 0.00107888
incremental_pca_split0 4 GPU RUNTIME 29.0433 6.41271 0.125791 352.903 trend-deviation True 4.84684 7.58545

Grafana Dashboard
Last updated: 2025-02-17T10:28:03Z

@ClaudiaComito ClaudiaComito modified the milestones: 1.6, 1.5.2 Feb 17, 2025
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("PSP_CUDA") == "1"

# warn the user if CUDA-aware MPI is not available, but PyTorch can use GPUs
if torch.cuda.is_available() and not CUDA_AWARE_MPI:
Contributor

Something occurred to us just before we merged this: torch.cuda.is_available() will return True with ROCm as well. We need to constrain the check a bit more; that's why we decided not to merge yet.

Collaborator Author

That's true, so we actually need to check for both ROCm/HIP and CUDA in PyTorch and compare against the corresponding MPI capabilities, to avoid the (hopefully) unlikely cases of having ROCm/HIP-PyTorch with CUDA-MPI, or CUDA-PyTorch with ROCm/HIP-MPI, etc.
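
One way to tell the two PyTorch builds apart, sketched below with a hypothetical helper (`gpu_backend` is not Heat's API): on ROCm builds torch.version.hip is a version string and torch.version.cuda is None, while on CUDA builds it is the other way around, even though torch.cuda.is_available() returns True on both.

```python
def gpu_backend(cuda_version, hip_version):
    """Classify a PyTorch build from the values of torch.version.cuda and
    torch.version.hip. A plain torch.cuda.is_available() check cannot
    distinguish the two, since it reports True on ROCm builds as well."""
    if hip_version is not None:
        return "rocm"
    if cuda_version is not None:
        return "cuda"
    return "cpu"


# intended usage: gpu_backend(torch.version.cuda, torch.version.hip)
print(gpu_backend("12.1", None))  # cuda
print(gpu_backend(None, "6.0"))   # rocm
print(gpu_backend(None, None))    # cpu
```

The classification could then be compared against the detected MPI capability (CUDA-aware vs. ROCm/HIP-aware) before issuing the warning, covering the mismatched cases mentioned above.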

Labels
backport stable benchmark PR bug Something isn't working communication core MPI Anything related to MPI communication PR talk
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[Bug]: Check for CUDA-aware MPI might fail
3 participants