
Refactor test_random to minimize collective calls #1677

Merged
merged 26 commits into main from bug/amd-runner-test-random on Oct 17, 2024

Conversation

@ClaudiaComito (Contributor) commented Oct 15, 2024

Due Diligence

  • General:
  • Implementation:
    • unit tests: all split configurations tested
    • unit tests: multiple dtypes tested
    • benchmarks: created for new functionality
    • benchmarks: performance improved or maintained
    • documentation updated where needed

Description

test_random has repeatedly given us problems in connection with .numpy() calls (i.e., an Allgather/Allgatherv followed by a copy to CPU).

As far as I can tell, no particular instance of "allgathering" is at fault. Since this Monday, test_random has been failing consistently on the AMD runner (2-process GPU tests), around the 10th .numpy() call in the module.

I have refactored test_random to gather and copy only when absolutely necessary. It now gathers/copies to CPU only 8 times, as opposed to 47 times in the legacy implementation.
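
To illustrate the pattern, here is a minimal sketch of the kind of rewrite applied throughout the module (the shapes and bounds are made up for illustration; this is not the literal diff):

```python
import heat as ht

x = ht.random.rand(100, 100, split=0)

# Legacy-style check: .numpy() resolves the split array, i.e. an
# Allgatherv across all processes plus a device-to-host copy, before
# NumPy ever sees the data.
# assert ((x.numpy() >= 0) & (x.numpy() < 1)).all()

# Refactored check: the reductions run distributed and on-device;
# only a single scalar per reduction is communicated and copied back.
assert x.min().item() >= 0.0
assert x.max().item() < 1.0
```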

Issue/s resolved: #1682

Changes proposed:

  • remove unnecessary numpy() calls

Type of change

Bug fix (non-breaking change which fixes an issue)

Memory requirements

NA

Performance

NA

Does this change modify the behaviour of other functions? If so, which?

No.

Contributor

Thank you for the PR!

(13 similar comments)

@ClaudiaComito ClaudiaComito changed the title Debugging test_random on AMD runner Refactor test_random to minimize collective calls Oct 16, 2024
@ClaudiaComito ClaudiaComito added this to the 1.5.0 milestone Oct 16, 2024
@ClaudiaComito ClaudiaComito added the bug, MPI, testing, HW:ROCm and backport release/1.5.x labels Oct 16, 2024
@ClaudiaComito ClaudiaComito requested a review from mtar October 16, 2024 12:42

codecov bot commented Oct 16, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 92.13%. Comparing base (b40646f) to head (b3e2b31).
Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main    #1677   +/-   ##
=======================================
  Coverage   92.13%   92.13%           
=======================================
  Files          83       83           
  Lines       12165    12173    +8     
=======================================
+ Hits        11208    11216    +8     
  Misses        957      957           
Flag Coverage Δ
unit 92.13% <100.00%> (+<0.01%) ⬆️

Flags with carried forward coverage won't be shown.

@mtar (Collaborator) left a comment

Thank you. You skipped the median tests. Is this intentional?


@ClaudiaComito (Contributor Author) replied:

> Thank you. You skipped the median tests. Is this intentional?

Yes, I skipped the ht.median tests because they are very communication-intensive. The np.median tests are all still in.
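
For context, a minimal sketch of what skipping the communication-heavy checks can look like (the class and test names are illustrative, not the actual test code from this PR):

```python
import unittest
import heat as ht


class TestRandomSketch(unittest.TestCase):
    def test_rand_mean(self):
        x = ht.random.rand(1000, split=0)
        # Cheap distributed reduction: only a scalar is communicated.
        self.assertTrue(0.4 < x.mean().item() < 0.6)

    @unittest.skip("ht.median on split data is very communication-intensive")
    def test_rand_median(self):
        x = ht.random.rand(1000, split=0)
        self.assertTrue(0.4 < ht.median(x).item() < 0.6)
```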


@ClaudiaComito ClaudiaComito merged commit 4b3e570 into main Oct 17, 2024
43 checks passed
@ClaudiaComito ClaudiaComito deleted the bug/amd-runner-test-random branch October 17, 2024 15:17
github-actions bot pushed a commit that referenced this pull request Oct 17, 2024
* debugging

* fix misinterpretation of dtype

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* debugging

* replace numpy() calls with alternative checks

* debugging

* debugging

* debugging randint

* debugging

* cast ints to float in statistical ops

* bypass numpy call l. 197

* bypass more numpy calls, skip median checks

* bypass more numpy calls, skip median checks

* bypass numpy calls wherever possible

* reinstate median checks

* skip ht.median if split>0

* skip all ht.median

* Revert "skip all ht.median"

This reverts commit 1241454.

* Revert "skip ht.median if split>0"

This reverts commit 4da8c93.

* Revert "reinstate median checks"

This reverts commit bf50914.

(cherry picked from commit 4b3e570)
Contributor

Successfully created backport PR for release/1.5.x:

ClaudiaComito added a commit that referenced this pull request Oct 18, 2024
(same commit message as the github-actions backport above; cherry picked from commit 4b3e570)

Co-authored-by: Claudia Comito <[email protected]>
ClaudiaComito added a commit that referenced this pull request Feb 19, 2025
* Maintenance/version change (#1644)

* Change  dev -> rc1

* Update CHANGELOG.md

* Update CITATION.cff

---------

Co-authored-by: Claudia Comito <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>

* fix: missing backported pr

* release-drafter autolabeling config

* tmp change to pull_request from pull_request_target

* correct commitish

* not filter by-commitish

* test autolable

* autolabeler, second try

* check that changes are being reflected on the draft release

* testing if main is the answer

* trying to make changes visible

* still trying to see some changes

* last state, need to test on a fork

* complete autolabeler configuration

* removed the second flame

* Refactor `test_random` to minimize collective calls  (#1677) (#1683)

(same commit message as the backport above; cherry picked from commit 4b3e570)

Co-authored-by: Claudia Comito <[email protected]>

* initialised ipcluster with mpi (#1679) (#1684)

Co-authored-by: jindra1 <[email protected]>
Co-authored-by: Claudia Comito <[email protected]>
(cherry picked from commit 68319be)

Co-authored-by: Marc-Jindra <[email protected]>

* authors list and version update for 1.5

* Update CHANGELOG.md

* Support PyTorch 2.4.1 (#1655) (#1687)

* Support latest PyTorch release

* Update bug_report.yml

* Update ci.yaml

* Update setup.py

* Update basic_test.py

* skip failing test hip/rocm

---------

Co-authored-by: ClaudiaComito <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: Fabian Hoppe <[email protected]>
(cherry picked from commit 78d480a)

* add Dalcin et al reference (#1695)

(cherry picked from commit 99f6f4b)

* Support PyTorch 2.5.1 (#1701) (#1706)

* Support latest PyTorch release

* Update dependencies

* Update bug_report.yml

* Update ci.yaml

* Update setup.py

---------

Co-authored-by: mtar <[email protected]>
Co-authored-by: Fabian Hoppe <[email protected]>
(cherry picked from commit b912846)

Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>

* Documentation updates after new release (#1704) (#1708)

* loose ends after releasing

* remove version update

---------

Co-authored-by: Fabian Hoppe <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
(cherry picked from commit e68db45)

Co-authored-by: Claudia Comito <[email protected]>

* ci: added updated version of claudias release-prep workflow

* post-review commit

* Modernise setup.py configuration (#1731) (#1743)

* set build-system

* fix deprecation warning

* make it easier to get to GitHub from the docs

(cherry picked from commit b4b5540)

* no black formatting on tutorials (#1748)

(cherry picked from commit 8e8c37d)

* Bug fix: printing non-distributed data  (#1756) (#1764)

* make 1-proc print great again

* fix tabs size

* skip formatter on non-distr data

* remove time import

(cherry picked from commit 3082dd9)

Co-authored-by: Claudia Comito <[email protected]>

* Fixed precision loss in several functions when dtype is float64 (#993) (#1790)

* Fix `array`

* Fix `arange`

* Fix `linspace`

* Fix `abs`

`fabs` and `matrix_norm` were also modified to explicitly cast to
float, in accordance with pre-established behaviour.

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Comment out dtype for testing

* Changed 2 tests that were asking for float where it now returns int

---------

Co-authored-by: Claudia Comito <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: Marc-Jindra <[email protected]>
(cherry picked from commit ab677a6)

Co-authored-by: neosunhan <[email protected]>

* `heat.eq`, `heat.ne` now allow non-array operands (#1773) (#1791)

* changed eq and ne so that the input of wrong Types does not cause an error

* Changed eq and ne to include try except for wrong Types

* Changed tests to assert True/False instead of Errors

* fixed spelling of erroneous_type

---------

Co-authored-by: Claudia Comito <[email protected]>
(cherry picked from commit c282cb1)

Co-authored-by: Marc-Jindra <[email protected]>

* Bump version to 1.5.1

* add backport label

* updated changelog

* Update CHANGELOG.md

* updated pre-commit

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Update CITATION.cff

* Support pytorch 2.6

* Match torchvision to pytorch 2.6

* updated release note

* Updated RELEASE.md

* Merge pull request #1775 from helmholtz-analytics/support/new-pytorch-main

Support PyTorch 2.6.0 / Add zarr as optional dependency

* Removing some cherrypicked stuff

* Update RELEASE.md

* [pre-commit.ci] auto fixes from pre-commit.com hooks

for more information, see https://pre-commit.ci

* Heat 1.5.1 - Release (#1796)

(same items as listed above, squashed into the release PR)

---------

Co-authored-by: Heat Release Bot <>
Co-authored-by: Claudia Comito <[email protected]>
Co-authored-by: Gutiérrez Hermosillo Muriedas, Juan Pedro <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: Fabian Hoppe <[email protected]>

* Fix to release-prep action

* correct version for main

---------

Co-authored-by: Berkant <[email protected]>
Co-authored-by: Claudia Comito <[email protected]>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <41898282+github-actions[bot]@users.noreply.github.com>
Co-authored-by: Marc-Jindra <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: Fabian Hoppe <[email protected]>
Co-authored-by: Michael Tarnawa <[email protected]>
Co-authored-by: Jörn Hees <[email protected]>
Co-authored-by: neosunhan <[email protected]>
Co-authored-by: Heat Release Bot <>
Labels
bug (Something isn't working), HW:ROCm, MPI (Anything related to MPI communication), testing (Implementation of tests, or test-related issues)
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[Bug]: test_random fails on AMD GPU
2 participants