Fix out of bounds memory accesses in RAJAPerf suite #89

Closed
johnbowen42 opened this issue Dec 23, 2024 · 16 comments

@johnbowen42
Collaborator

johnbowen42 commented Dec 23, 2024

Multiple benchmarks (including DEL_DOT_VEC_2D, EDGE3D, VOL3D, and PRESSURE) are segfaulting with a variation of:

Callback: Queue 0x15553ca00000 aborting with error : HSA_STATUS_ERROR_MEMORY_APERTURE_VIOLATION: The agent attempted to access memory beyond the largest legal address. code: 0x29

I'm still identifying all the failing benchmarks, triaging them, and working on a fix.

@johnbowen42 johnbowen42 self-assigned this Dec 23, 2024
@ggeorgakoudis
Collaborator

This could be related to the BlockDim/GridDim specialization in the runtime and to caching. Try the latest main, which merges #87.

@johnbowen42
Collaborator Author

This issue is not fixed by #87.

@koparasy
Collaborator

koparasy commented Dec 26, 2024

I suggest the following steps:

  1. Ensure the stored cache is deleted.
  2. Run the benchmark with these environment variables set:
ENV_PROTEUS_USE_STORED_CACHE=0 ENV_PROTEUS_SET_LAUNCH_BOUNDS=0 ENV_PROTEUS_SPECIALIZE_ARGS=0 ENV_PROTEUS_SPECIALIZE_DIMS=0 <cli>

If it doesn't fail, start adding the optimizations back one by one. We are somehow corrupting either the dynamic information or the module itself.
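
The procedure above could be automated with a small driver along these lines. This is only a sketch: the four variable names come from the command line above, but treating `1` as the "enabled" value is an assumption, the helper names are hypothetical, and clearing the stored cache is left to the caller because its location is not given here.

```python
import os
import subprocess

# Flag names are taken from the comment above. Treating "1" as "enabled"
# is an assumption; only the "=0" form appears in this thread.
FLAGS = [
    "ENV_PROTEUS_USE_STORED_CACHE",
    "ENV_PROTEUS_SET_LAUNCH_BOUNDS",
    "ENV_PROTEUS_SPECIALIZE_ARGS",
    "ENV_PROTEUS_SPECIALIZE_DIMS",
]


def make_env(enabled, base=None):
    """Return an environment with every flag set to 0 except those in `enabled`."""
    env = dict(os.environ if base is None else base)
    for flag in FLAGS:
        env[flag] = "1" if flag in enabled else "0"
    return env


def run_ok(cmd, env):
    """Run the benchmark command; a non-zero exit status counts as a failure."""
    return subprocess.run(cmd, env=env).returncode == 0


def find_culprits(cmd, run=run_ok):
    """Return the flags that make `cmd` fail when enabled on their own.

    Returns None if the command already fails with all four flags disabled,
    in which case none of these optimizations is the sole cause.
    """
    if not run(cmd, make_env(set())):
        return None
    return [flag for flag in FLAGS if not run(cmd, make_env({flag}))]
```

Calling `find_culprits` with the benchmark command line (the `<cli>` placeholder above) as a list of arguments enables each flag in isolation rather than cumulatively, so a failure that needs two optimizations at once would not show up.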

@johnbowen42
Collaborator Author

johnbowen42 commented Jan 2, 2025

I was able to verify that enabling ENV_PROTEUS_SPECIALIZE_DIMS is what causes this.

@johnbowen42
Collaborator Author

johnbowen42 commented Jan 2, 2025

I think that once this is fixed it would be nice to add an integration-test CI pipeline that does the following:

  1. Clone our fork of RAJAPerf and init submodules
  2. Build in release
  3. Run with --features forall, which runs all benchmarks of RAJA::forall
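
As a rough sketch, the three steps could be driven by a script like the one below. Only `--features forall` comes from this thread; the fork URL, directory names, and benchmark binary path are placeholders, the function names are hypothetical, and any extra CMake options needed to enable Proteus are not shown.

```python
import subprocess


def ci_steps(fork_url, src_dir="RAJAPerf", build_dir="build", binary="bin/raja-perf.exe"):
    """Return the argv list for each pipeline step, in order.

    `fork_url`, the directory names, and the `binary` path are placeholders.
    """
    return [
        # 1. Clone the fork and initialize its submodules in one step.
        ["git", "clone", "--recurse-submodules", fork_url, src_dir],
        # 2. Configure and build in release mode.
        ["cmake", "-S", src_dir, "-B", build_dir, "-DCMAKE_BUILD_TYPE=Release"],
        ["cmake", "--build", build_dir, "--parallel"],
        # 3. Run every benchmark that uses RAJA::forall.
        [f"{build_dir}/{binary}", "--features", "forall"],
    ]


def run_pipeline(steps):
    """Run each step; check=True stops at the first non-zero exit status."""
    for argv in steps:
        subprocess.run(argv, check=True)
```

Because `check=True` raises on a failing step, a crash in any benchmark would fail the CI job, which is the behavior the pipeline is meant to catch.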

@koparasy
Collaborator

koparasy commented Jan 2, 2025

Can you please verify once more that you are using and linking against the correct Proteus version? Can you also double-check that the cache hash includes the block and grid dimensions: https://github.com/Olympus-HPC/proteus/blob/main/lib/JitEngineDevice.hpp#L450 ?
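
To illustrate why the hash matters here, the toy example below (not the code at the linked line; the function and parameter names are hypothetical) shows a cache key that folds in the launch dimensions. If a kernel is specialized on its dimensions but the key ignores them, a later launch with a different configuration would reuse code built for the wrong sizes.

```python
import hashlib


def cache_key(kernel_name, grid_dim, block_dim, specialize_dims=True):
    """Hash a kernel name together with its launch dimensions.

    When dimensions are folded into the generated code, two launches of the
    same kernel with different dimensions need different cache entries.
    """
    h = hashlib.sha256()
    h.update(kernel_name.encode())
    if specialize_dims:
        # Fixed-width encoding of each of the six dimension values.
        for dim in (*grid_dim, *block_dim):
            h.update(int(dim).to_bytes(4, "little"))
    return h.hexdigest()
```

With `specialize_dims=True`, launching the same kernel with block sizes `(256, 1, 1)` and `(128, 1, 1)` yields two distinct keys; with it disabled, both launches share one entry.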

@johnbowen42
Collaborator Author

I'm using the latest commit (54dbb1f) as a submodule. I will check the cache hash, but this bug exists both with and without the cache enabled.

@johnbowen42
Collaborator Author

Disabling https://github.com/Olympus-HPC/proteus/blob/54dbb1fbc55d99a7bf9c6f6ce96e059b1dac5ed6/lib/JitEngineDevice.hpp#L202 fixes the issue for the ENERGY benchmark, but not for all of the benchmarks.

@koparasy
Collaborator

koparasy commented Jan 2, 2025

I can replicate the issue(s). I think I have a fix in bugfix/specialize-dims.

I need a clear head, though, to check how we name things. I will do that tomorrow, and I will add some tests as well.

@ggeorgakoudis ggeorgakoudis added the bug Something isn't working label Jan 7, 2025
@ggeorgakoudis
Collaborator

I went through bugfix/specialize-dims. It looks like it's a simple misnaming bug. @johnbowen42 Does this branch fix your issues?

@johnbowen42
Collaborator Author

johnbowen42 commented Jan 7, 2025

This fixes a subset of these issues but I still am seeing many benchmarks fail

@ggeorgakoudis
Collaborator

> This fixes a subset of these issues but I still am seeing many benchmarks fail

Hmm, that means there's another underlying issue. My advice is to run with ENV_PROTEUS_SPECIALIZE_DIMS=0 for now until we find out what's wrong. @johnbowen42 and @koparasy, I'm tagging you both.

@koparasy
Collaborator

koparasy commented Jan 7, 2025

Setting ENV_PROTEUS_SPECIALIZE_DIMS=0 does not fix all the errors (@johnbowen42 and I discussed this privately).

@ggeorgakoudis
Collaborator

Can you fill me in? Here or in Slack?

@johnbowen42
Collaborator Author

johnbowen42 commented Jan 7, 2025

Update:

  • All reduction benchmarks are failing
  • All forall benchmarks except INT_PREDICT run without error
  • Some atomic benchmarks are hanging

@johnbowen42
Collaborator Author

See #96, #97, and #98.
