Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Potential problem with ExeSpaceUtils view_reduction and parallel_reduce #254

Closed
jgfouca opened this issue Sep 27, 2022 · 6 comments · Fixed by #256
Closed

Potential problem with ExeSpaceUtils view_reduction and parallel_reduce #254

jgfouca opened this issue Sep 27, 2022 · 6 comments · Fixed by #256
Assignees
Labels
bug Something isn't working

Comments

@jgfouca
Copy link
Member

jgfouca commented Sep 27, 2022

Describe the bug
This was discovered when porting shoc_energy_integrals to small kernels. I was getting large differences in the outputs of the view_reductions when num_threads>1. I suspect the problem is in the handling of the garbage of the last pack because the problem went away when I used nlev % pack_size = 0.

To Reproduce
Steps to reproduce the behavior:

  1. Switch shoc_energy_integrals to the implementation it had before the small kernel PR. The one that uses view_reductions.
  2. Build SCREAM with -DSCREAM_SMALL_KERNELS=On -DCMAKE_BUILD_TYPE=Debug
  3. run OMP_NUM_THREADS=16 ./shoc_tests shoc_main_bfb
  4. This should fail due to being non_bfb with fortran. You can add print statements to confirm that the se_int, ke_int, wv_int, and wl_int values do not match fortran, which causes different results later in shoc for the output views.

Expected behavior
view_reduction should have produced bfb results with fortran.

@jgfouca jgfouca added the bug Something isn't working label Sep 27, 2022
@jgfouca
Copy link
Member Author

jgfouca commented Oct 13, 2022

Upon a second look, nlev%pack_size !=0 is not necessary to demonstrate the error but it does make the errors more frequent. When I switched to ExeSpaceUtils::parallel_reduce on scalarized views, I had similar problems until I made sure each thread had a local variable passed to the reducer. The local variable approach also fixed view_reduction.

@jgfouca jgfouca changed the title Potential problem with view_reduction Potential problem with ExeSpaceUtils view_reduction and parallel_reduce Oct 13, 2022
@jgfouca
Copy link
Member Author

jgfouca commented Oct 13, 2022

I should also note that the error only occurs when team_size > 1, which is what you get when MIMIC_GPU is On (team size 7) which is On by default for Debug builds.

@jgfouca
Copy link
Member Author

jgfouca commented Oct 17, 2022

I believe the problems with ExeSpaceUtils::parallel_reduce were not fixed. @bartgol , correct me if I'm wrong.

@jgfouca jgfouca reopened this Oct 17, 2022
@bartgol
Copy link
Contributor

bartgol commented Oct 17, 2022

Yes, you're right. I was working on completing last Friday, but did not finish by week's end. I should be done today.

@bartgol
Copy link
Contributor

bartgol commented Nov 15, 2023

I think this was completed in #258. Closing.

@bartgol
Copy link
Contributor

bartgol commented Jan 14, 2025

I guess I never really closed this? Closing (for real).

@bartgol bartgol closed this as completed Jan 14, 2025
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
3 participants