
More generic check for CUDA-aware MPI #1793

Open
wants to merge 3 commits into base: main

Conversation

mrfh92
Collaborator

@mrfh92 mrfh92 commented Feb 14, 2025

fixes #1787

  • the check for CUDA-awareness of OpenMPI is now wrapped in a try/except construction to ensure that it completes in every case
  • a warning is issued if PyTorch supports CUDA but no CUDA-aware MPI is detected
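
The try/except pattern described above might be sketched roughly as follows. This is illustrative only: `detect_cuda_aware_mpi`, `query`, and `failing_query` are made-up names, not Heat's actual API; the real logic lives in heat/core/communication.py and may use different detection paths.

```python
import os


def detect_cuda_aware_mpi(query, env=os.environ):
    """Run a vendor-specific CUDA-support query inside try/except so that
    detection never raises, whatever the MPI implementation in use; fall
    back to environment variables that some MPI stacks set (e.g.
    ParaStation MPI sets PSP_CUDA=1)."""
    try:
        cuda_aware = bool(query())
    except Exception:
        # the query itself failed -- treat MPI as not CUDA-aware
        # instead of letting the import of the module blow up
        cuda_aware = False
    return cuda_aware or env.get("PSP_CUDA") == "1"


def failing_query():
    # stands in for an MPI build that exposes no CUDA-support information
    raise RuntimeError("no CUDA-support information available")


print(detect_cuda_aware_mpi(failing_query, env={}))                 # False
print(detect_cuda_aware_mpi(failing_query, env={"PSP_CUDA": "1"}))  # True
```

The point of the wrapper is that a failing vendor query degrades to "not CUDA-aware" plus a warning, rather than an exception at import time.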

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • [ ] benchmarks: created for new functionality
    • [ ] benchmarks: performance improved or maintained
    • documentation updated where needed

Does this change modify the behaviour of other functions? If so, which?

no

…; warning is issued if PyTorch supports GPUs but no cuda-aware MPI is found.
@mrfh92 mrfh92 self-assigned this Feb 14, 2025
@github-actions github-actions bot added backport stable bug Something isn't working core labels Feb 14, 2025
@mrfh92 mrfh92 added MPI Anything related to MPI communication communication labels Feb 14, 2025
Contributor

Thank you for the PR!


codecov bot commented Feb 14, 2025

Codecov Report

Attention: Patch coverage is 66.66667% with 2 lines in your changes missing coverage. Please review.

Project coverage is 92.25%. Comparing base (9c8eaf5) to head (c96a4e4).

Files with missing lines Patch % Lines
heat/core/communication.py 66.66% 2 Missing ⚠️
Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1793   +/-   ##
=======================================
  Coverage   92.24%   92.25%           
=======================================
  Files          84       84           
  Lines       12460    12465    +5     
=======================================
+ Hits        11494    11499    +5     
  Misses        966      966           
Flag Coverage Δ
unit 92.25% <66.66%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@ClaudiaComito ClaudiaComito added this to the 1.5.1 milestone Feb 17, 2025
@ClaudiaComito ClaudiaComito changed the title Fix check for CUDA-aware MPI More generic check for CUDA-aware MPI Feb 17, 2025
ClaudiaComito
ClaudiaComito previously approved these changes Feb 17, 2025
Contributor

@ClaudiaComito ClaudiaComito left a comment


As far as I'm concerned this can be merged, thanks for looking into this!


@ClaudiaComito ClaudiaComito modified the milestones: 1.5.1, 1.6 Feb 17, 2025
@JuanPedroGHM
Member

Benchmarks results - Sponsored by perun

function mpi_ranks device metric value ref_value std % change type alert lower_quantile upper_quantile
heat_benchmarks 4 CPU RUNTIME 273.389 52.39 11.6914 421.834 jump-detection True nan nan
matmul_split_0 4 CPU RUNTIME 0.333236 0.0963688 0.0216452 245.792 jump-detection True nan nan
matmul_split_1 4 CPU RUNTIME 0.376656 0.0990343 0.0433766 280.329 jump-detection True nan nan
qr_split_0 4 CPU RUNTIME 0.499975 0.218702 0.0113442 128.61 jump-detection True nan nan
qr_split_1 4 CPU RUNTIME 0.209519 0.163714 0.00352157 27.9784 jump-detection True nan nan
hierachical_svd_rank 4 CPU RUNTIME 0.0905598 0.062739 0.00491588 44.3438 jump-detection True nan nan
hierachical_svd_tol 4 CPU RUNTIME 0.0813055 0.0734396 0.00857699 10.7107 jump-detection True nan nan
reshape 4 CPU RUNTIME 0.832802 0.0762245 0.0134849 992.564 jump-detection True nan nan
concatenate 4 CPU RUNTIME 1.75758 0.129364 0.0695605 1258.64 jump-detection True nan nan
resplit 4 CPU RUNTIME 15.7163 1.16468 0.0552245 1249.4 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 CPU RUNTIME 0.209927 0.0106696 0.001661 1867.52 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00149274 0.0013495 9.32465e-05 10.6145 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.00115566 0.000851977 8.32206e-05 35.6448 jump-detection True nan nan
apply_inplace_normalizer 4 CPU RUNTIME 0.00397964 0.00153844 0.0060146 158.679 jump-detection True nan nan
incremental_pca_split0 4 CPU RUNTIME 242.97 42.6731 11.7832 469.373 jump-detection True nan nan
heat_benchmarks 4 GPU RUNTIME 75.6497 21.333 0.203889 254.614 jump-detection True nan nan
matmul_split_0 4 GPU RUNTIME 0.169956 0.0164776 0.00366775 931.437 jump-detection True nan nan
matmul_split_1 4 GPU RUNTIME 0.194378 0.0162983 0.00569976 1092.63 jump-detection True nan nan
qr_split_0 4 GPU RUNTIME 0.223258 0.0410244 0.00626876 444.207 jump-detection True nan nan
qr_split_1 4 GPU RUNTIME 0.0833278 0.0304706 0.0166195 173.469 jump-detection True nan nan
lanczos 4 GPU RUNTIME 0.87182 0.724528 0.00682621 20.3293 jump-detection True nan nan
hierachical_svd_rank 4 GPU RUNTIME 0.108615 0.095702 0.00048534 13.4932 jump-detection True nan nan
reshape 4 GPU RUNTIME 1.45305 0.154857 0.0426811 838.323 jump-detection True nan nan
concatenate 4 GPU RUNTIME 0.557138 0.0870384 0.0483283 540.106 jump-detection True nan nan
resplit 4 GPU RUNTIME 28.3703 2.19025 0.0827825 1195.3 jump-detection True nan nan
apply_inplace_standard_scaler_and_inverse 4 GPU RUNTIME 1.0243 0.0181149 0.0214778 5554.43 jump-detection True nan nan
apply_inplace_min_max_scaler_and_inverse 4 GPU RUNTIME 0.00197 0.00175316 3.03594e-05 12.3686 jump-detection True nan nan
apply_inplace_max_abs_scaler_and_inverse 4 GPU RUNTIME 0.00118265 0.000979006 7.49305e-05 20.8012 jump-detection True nan nan
apply_inplace_normalizer 4 GPU RUNTIME 0.00342388 0.00239644 0.000280244 42.8736 jump-detection True nan nan
incremental_pca_split0 4 GPU RUNTIME 29.0433 4.9077 0.125791 491.791 jump-detection True nan nan
heat_benchmarks 4 CPU RUNTIME 273.389 49.1195 11.6914 456.58 trend-deviation True 44.8761 54.2423
matmul_split_0 4 CPU RUNTIME 0.333236 0.159913 0.0216452 108.385 trend-deviation True 0.121471 0.207108
matmul_split_1 4 CPU RUNTIME 0.376656 0.138353 0.0433766 172.243 trend-deviation True 0.108312 0.179826
qr_split_0 4 CPU RUNTIME 0.499975 0.277881 0.0113442 79.924 trend-deviation True 0.227327 0.346153
qr_split_1 4 CPU RUNTIME 0.209519 0.169796 0.00352157 23.3946 trend-deviation True 0.163472 0.174831
hierachical_svd_rank 4 CPU RUNTIME 0.0905598 0.056287 0.00491588 60.8895 trend-deviation True 0.0467363 0.0682467
hierachical_svd_tol 4 CPU RUNTIME 0.0813055 0.0627339 0.00857699 29.6039 trend-deviation True 0.0515528 0.0755125
reshape 4 CPU RUNTIME 0.832802 0.19281 0.0134849 331.93 trend-deviation True 0.149999 0.213561
concatenate 4 CPU RUNTIME 1.75758 0.194566 0.0695605 803.336 trend-deviation True 0.147756 0.250637
resplit 4 CPU RUNTIME 15.7163 1.11733 0.0552245 1306.58 trend-deviation True 1.0712 1.17281
apply_inplace_standard_scaler_and_inverse 4 CPU RUNTIME 0.209927 0.00886971 0.001661 2266.79 trend-deviation True 0.00692102 0.01092
apply_inplace_min_max_scaler_and_inverse 4 CPU RUNTIME 0.00149274 0.00120594 9.32465e-05 23.7821 trend-deviation True 0.00102689 0.00144081
apply_inplace_max_abs_scaler_and_inverse 4 CPU RUNTIME 0.00115566 0.000763574 8.32206e-05 51.3491 trend-deviation True 0.000512126 0.00108052
incremental_pca_split0 4 CPU RUNTIME 242.97 39.9462 11.7832 508.242 trend-deviation True 37.1267 43.2585
heat_benchmarks 4 GPU RUNTIME 75.6497 21.3689 0.203889 254.018 trend-deviation True 20.8217 22.577
matmul_split_0 4 GPU RUNTIME 0.169956 0.0443321 0.00366775 283.371 trend-deviation True 0.0230956 0.0587207
matmul_split_1 4 GPU RUNTIME 0.194378 0.0272088 0.00569976 614.393 trend-deviation True 0.0195909 0.0319684
qr_split_0 4 GPU RUNTIME 0.223258 0.0537206 0.00626876 315.591 trend-deviation True 0.0509402 0.058744
qr_split_1 4 GPU RUNTIME 0.0833278 0.044845 0.0166195 85.8128 trend-deviation True 0.0347649 0.0532777
lanczos 4 GPU RUNTIME 0.87182 0.713858 0.00682621 22.1279 trend-deviation True 0.606565 0.866203
hierachical_svd_rank 4 GPU RUNTIME 0.108615 0.0974697 0.00048534 11.4349 trend-deviation True 0.0953985 0.099392
hierachical_svd_tol 4 GPU RUNTIME 0.12905 0.124553 0.00332751 3.61059 trend-deviation True 0.120811 0.128828
reshape 4 GPU RUNTIME 1.45305 0.235989 0.0426811 515.729 trend-deviation True 0.174847 0.280703
concatenate 4 GPU RUNTIME 0.557138 0.0895676 0.0483283 522.031 trend-deviation True 0.0755032 0.102385
resplit 4 GPU RUNTIME 28.3703 2.12104 0.0827825 1237.57 trend-deviation True 2.04964 2.19401
apply_inplace_standard_scaler_and_inverse 4 GPU RUNTIME 1.0243 0.0143389 0.0214778 7043.49 trend-deviation True 0.010719 0.0199456
apply_inplace_min_max_scaler_and_inverse 4 GPU RUNTIME 0.00197 0.00161336 3.03594e-05 22.1057 trend-deviation True 0.0014344 0.00185432
apply_inplace_max_abs_scaler_and_inverse 4 GPU RUNTIME 0.00118265 0.000800488 7.49305e-05 47.7413 trend-deviation True 0.000601578 0.00107888
incremental_pca_split0 4 GPU RUNTIME 29.0433 6.41271 0.125791 352.903 trend-deviation True 4.84684 7.58545

Grafana Dashboard
Last updated: 2025-02-17T10:28:03Z

@ClaudiaComito ClaudiaComito modified the milestones: 1.6, 1.5.2 Feb 17, 2025
CUDA_AWARE_MPI = CUDA_AWARE_MPI or os.environ.get("PSP_CUDA") == "1"

# warn the user if CUDA-aware MPI is not available, but PyTorch can use GPUs
if torch.cuda.is_available() and not CUDA_AWARE_MPI:
Contributor

Something occurred to us just before we merged this: torch.cuda.is_available() will return True with ROCm as well. We need to constrain the check a bit more; that's why we decided not to merge yet.

Collaborator Author

That's true, so we actually need to check for both ROCm/HIP and CUDA in PyTorch and compare against the corresponding MPI capabilities, to avoid the (hopefully) unlikely cases of having ROCm/HIP-PyTorch with CUDA-MPI, or CUDA-PyTorch with ROCm/HIP-MPI, etc.
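
One way to tell the two PyTorch builds apart, sketched below with a hypothetical helper (`gpu_backend` is not Heat's API): on ROCm builds torch.version.hip is a version string and torch.version.cuda is None, while on CUDA builds it is the other way around, even though torch.cuda.is_available() returns True on both.

```python
def gpu_backend(cuda_version, hip_version):
    """Classify a PyTorch build from the values of torch.version.cuda and
    torch.version.hip. A plain torch.cuda.is_available() check cannot
    distinguish the two, since it reports True on ROCm builds as well."""
    if hip_version is not None:
        return "rocm"
    if cuda_version is not None:
        return "cuda"
    return "cpu"


# intended usage: gpu_backend(torch.version.cuda, torch.version.hip)
print(gpu_backend("12.1", None))  # cuda
print(gpu_backend(None, "6.0"))   # rocm
print(gpu_backend(None, None))    # cpu
```

The classification could then be compared against the detected MPI capability (CUDA-aware vs. ROCm/HIP-aware) before issuing the warning, covering the mismatched cases mentioned above.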

Labels
backport stable benchmark PR bug Something isn't working communication core MPI Anything related to MPI communication PR talk
Projects
Status: In Progress
Development

Successfully merging this pull request may close these issues.

[Bug]: Check for CUDA-aware MPI might fail
3 participants