
Refactor test_random to minimize collective calls #1677

Merged
merged 26 commits into main from bug/amd-runner-test-random on Oct 17, 2024

Conversation

@ClaudiaComito (Contributor) commented Oct 15, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

test_random has repeatedly given us problems in connection with .numpy() calls (i.e., an Allgather/Allgatherv followed by a copy to CPU).

As far as I can tell, no particular instance of "allgathering" is at fault. Since this Monday, test_random has been failing consistently on the AMD runner (2-process GPU tests), around the 10th .numpy() call in the module.

I have refactored test_random to gather and copy only when absolutely necessary. It now gathers/copies to CPU only 8 times, as opposed to 47 times in the legacy implementation.
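
To illustrate the pattern, here is a minimal sketch of the kind of rewrite applied throughout the module (the shapes and bounds are made up for illustration; this is not the literal diff):

```python
import heat as ht

x = ht.random.rand(100, 100, split=0)

# Legacy-style check: .numpy() resolves the split array, i.e. an
# Allgatherv across all processes plus a device-to-host copy, before
# NumPy ever sees the data.
# assert ((x.numpy() >= 0) & (x.numpy() < 1)).all()

# Refactored check: the reductions run distributed and on-device;
# only a single scalar per reduction is communicated and copied back.
assert x.min().item() >= 0.0
assert x.max().item() < 1.0
```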

Issue/s resolved: #1682

Changes proposed:

  • remove unnecessary numpy() calls

Type of change

Bug fix (non-breaking change which fixes an issue)

Memory requirements

NA

Performance

NA

Does this change modify the behaviour of other functions? If so, which?

No.

Contributor

Thank you for the PR!

(13 similar comments)

@ClaudiaComito ClaudiaComito changed the title Debugging test_random on AMD runner Refactor test_random to minimize collective calls Oct 16, 2024
@ClaudiaComito ClaudiaComito added this to the 1.5.0 milestone Oct 16, 2024
@ClaudiaComito ClaudiaComito added the bug, MPI, testing, HW:ROCm and backport release/1.5.x labels Oct 16, 2024
@ClaudiaComito ClaudiaComito requested a review from mtar October 16, 2024 12:42

codecov bot commented Oct 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.13%. Comparing base (b40646f) to head (b3e2b31).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1677   +/-   ##
=======================================
  Coverage   92.13%   92.13%           
=======================================
  Files          83       83           
  Lines       12165    12173    +8     
=======================================
+ Hits        11208    11216    +8     
  Misses        957      957           
Flag Coverage Δ
unit 92.13% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@mtar (Collaborator) left a comment

Thank you. You skipped the median tests. Is this intentional?


@ClaudiaComito (Contributor Author) replied:

> Thank you. You skipped the median tests. Is this intentional?

Yes, I skipped the ht.median tests because they are very communication-intensive. The np.median tests are all still in.
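
For context, a minimal sketch of what skipping the communication-heavy checks can look like (the class and test names are illustrative, not the actual test code from this PR):

```python
import unittest
import heat as ht


class TestRandomSketch(unittest.TestCase):
    def test_rand_mean(self):
        x = ht.random.rand(1000, split=0)
        # Cheap distributed reduction: only a scalar is communicated.
        self.assertTrue(0.4 < x.mean().item() < 0.6)

    @unittest.skip("ht.median on split data is very communication-intensive")
    def test_rand_median(self):
        x = ht.random.rand(1000, split=0)
        self.assertTrue(0.4 < ht.median(x).item() < 0.6)
```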


@ClaudiaComito ClaudiaComito merged commit 4b3e570 into main Oct 17, 2024
43 checks passed
@ClaudiaComito ClaudiaComito deleted the bug/amd-runner-test-random branch October 17, 2024 15:17
github-actions bot pushed a commit that referenced this pull request Oct 17, 2024
* debugging

* fix misinterpretation of dtype

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* replace numpy() calls with alternative checks

* debugging

* debugging

* debugging randint

* debugging

* cast ints to float in statistical ops

* bypass numpy call l. 197

* bypass more numpy calls, skip median checks

* bypass more numpy calls, skip median checks

* bypass numpy calls wherever possible

* reinstate median checks

* skip ht.median if split>0

* skip all ht.median

* Revert "skip all ht.median"

This reverts commit 1241454.

* Revert "skip ht.median if split>0"

This reverts commit 4da8c93.

* Revert "reinstate median checks"

This reverts commit bf50914.

(cherry picked from commit 4b3e570)
Contributor

Successfully created backport PR for release/1.5.x:

ClaudiaComito added a commit that referenced this pull request Oct 18, 2024
(same commit message as the github-actions backport above; cherry picked from commit 4b3e570)

Co-authored-by: Claudia Comito <[email protected]>
ClaudiaComito added a commit that referenced this pull request Feb 19, 2025
* Maintenance/version change (#1644)

* Change  dev -> rc1

* Update CHANGELOG.md

* Update CITATION.cff

---------

Co-authored-by: Claudia Comito <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix: missing backported pr

* release-drafter autolabeling config

* tmp change to pull_request from pull_request_target

* correct commitish

* not filter by-commitish

* test autolable

* autolabeler, second try

* check that changes are being reflected on the draft release

* testing if main is the answer

* trying to make changes visible

* still trying to see some changes

* last state, need to test on a fork

* complete autolabeler configuration

* removed the second flame

* Refactor `test_random` to minimize collective calls  (#1677) (#1683)

(same commit message as the backport above; cherry picked from commit 4b3e570)

Co-authored-by: Claudia Comito <[email protected]>

* initialised ipcluster with mpi (#1679) (#1684)

Co-authored-by: jindra1 <[email protected]>
Co-authored-by: Claudia Comito <[email protected]>
(cherry picked from commit 68319be)

Co-authored-by: Marc-Jindra <[email protected]>

* authors list and version update for 1.5

* Update CHANGELOG.md

* Support PyTorch 2.4.1 (#1655) (#1687)

* Support latest PyTorch release

* Update bug_report.yml

* Update ci.yaml

* Update setup.py

* Update basic_test.py

* skip failing test hip/rocm

---------

Co-authored-by: ClaudiaComito <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: Fabian Hoppe <[email protected]>
(cherry picked from commit 78d480a)

* add Dalcin et al reference (#1695)

(cherry picked from commit 99f6f4b)

* Support PyTorch 2.5.1 (#1701) (#1706)

* Support latest PyTorch release

* Update dependencies

* Update bug_report.yml

* Update ci.yaml

* Update setup.py

---------

Co-authored-by: mtar <[email protected]>
Co-authored-by: Fabian Hoppe <[email protected]>
(cherry picked from commit b912846)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Documentation updates after new release (#1704) (#1708)

* loose ends after releasing

* remove version update

---------

Co-authored-by: Fabian Hoppe <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
(cherry picked from commit e68db45)

Co-authored-by: Claudia Comito <[email protected]>

* ci: added updated version of claudias release-prep workflow

* post-review commit

* Modernise setup.py configuration (#1731) (#1743)

* set build-system

* fix deprecation warning

* make it easier to get to GitHub from the docs

(cherry picked from commit b4b5540)

* no black formatting on tutorials (#1748)

(cherry picked from commit 8e8c37d)

* Bug fix: printing non-distributed data  (#1756) (#1764)

* make 1-proc print great again

* fix tabs size

* skip formatter on non-distr data

* remove time import

(cherry picked from commit 3082dd9)

Co-authored-by: Claudia Comito <[email protected]>

* Fixed precision loss in several functions when dtype is float64 (#993) (#1790)

* Fix `array`

* Fix `arange`

* Fix `linspace`

* Fix `abs`

`fabs` and `matrix_norm` were also modified to explicitly cast to
float, in accordance with pre-established behaviour.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Comment out dtype for testing

* Changed 2 tests that were asking for float where it now returns int

---------

Co-authored-by: Claudia Comito <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: Marc-Jindra <[email protected]>
(cherry picked from commit ab677a6)

Co-authored-by: neosunhan <[email protected]>

* `heat.eq`, `heat.ne` now allow non-array operands (#1773) (#1791)

* changed eq and ne so that the input of wrong Types does not cause an error

* Changed eq and ne to include try except for wrong Types

* Changed tests to assert True/False instead of Errors

* fixed spelling of erroneous_type

---------

Co-authored-by: Claudia Comito <[email protected]>
(cherry picked from commit c282cb1)

Co-authored-by: Marc-Jindra <[email protected]>

* Bump version to 1.5.1

* add backport label

* updated changelog

* Update CHANGELOG.md

* updated pre-commit

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update CITATION.cff

* Support pytorch 2.6

* Match torchvision to pytorch 2.6

* updated release note

* Updated RELEASE.md

* Merge pull request #1775 from helmholtz-analytics/support/new-pytorch-main

Support PyTorch 2.6.0 / Add zarr as optional dependency

* Removing some cherrypicked stuff

* Update RELEASE.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Heat 1.5.1 - Release (#1796)

(same items as listed above, squashed into the release PR)

---------

Co-authored-by: Heat Release Bot <>
Co-authored-by: Claudia Comito <[email protected]>
Co-authored-by: Gutiérrez Hermosillo Muriedas, Juan Pedro <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Fabian Hoppe <[email protected]>

* Fix to release-prep action

* correct version for main

---------

Co-authored-by: Berkant <[email protected]>
Co-authored-by: Claudia Comito <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Marc-Jindra <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: Fabian Hoppe <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: Jörn Hees <[email protected]>
Co-authored-by: neosunhan <[email protected]>
Co-authored-by: Heat Release Bot <>
Labels
bug (Something isn't working), HW:ROCm, MPI (Anything related to MPI communication), testing (Implementation of tests, or test-related issues)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: test_random fails on AMD GPU
2 participants