Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integration PR followups: make_workdiv, uniform_elements, concrete kernel dimensions #141

Merged
merged 6 commits into from
Jan 29, 2025

Conversation

ariostas
Copy link
Member

@ariostas ariostas commented Dec 17, 2024

I started to address some follow-up tasks in #75. In particular:

  • Switch to cms::alpakatools::makeworkdir instead of our custom createWorkDiv.
  • Switch to cms::alpakatools::uniform_elements for kernel loops.
  • Don't set a custom work division for CPU.
  • Started moving towards the removal of kVerticalModuleSlope.
  • Use concrete kernel dimensions instead of templated ones.

@ariostas
Copy link
Member Author

ariostas commented Dec 17, 2024

I think the plots might look a bit different now that the work division is different, but hopefully they are still run-to-run reproducible (I'll check).

For now, I'm just making sure I didn't break something.

/run standalone
/run checks

Copy link

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     46.8    400.0    187.9    153.9    166.6    515.0    123.0    232.9    146.5      3.9    1976.4    1414.6+/- 393.7     525.2   explicit[s=4] (target branch)
   avg     46.5    386.1    190.4    157.7    165.2    694.5      8.8     11.8    160.6      3.4    1824.9    1084.0+/- 241.8     482.7   explicit[s=4] (this PR)

@ariostas
Copy link
Member Author

Welp, I did break something. I'll look into it

@ariostas ariostas force-pushed the ariostas/integration_pr_followups branch from 17d4223 to aa82696 Compare December 17, 2024 20:39
@ariostas
Copy link
Member Author

Hopefully it's fine now.

/run standalone

Copy link

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     45.8    394.5    194.2    150.0    164.7    514.4    121.9    236.0    145.2      3.7    1970.5    1410.2+/- 392.1     519.0   explicit[s=4] (target branch)
   avg     44.4    385.7    188.4    155.9    149.7    702.5    127.3    256.8    177.9      3.7    2192.4    1445.4+/- 400.5     577.8   explicit[s=4] (this PR)

@ariostas
Copy link
Member Author

The plots match perfectly, which is nice. There was a significant increase in timing, especially for pLS, but it seems like this is only for CPU. This is how the timing compares in cgpu-1.

This PR (aa82696)
Total Timing Summary
Average time for map loading = 571.866 ms
Average time for input loading = 7550.76 ms
Average time for lst::Event creation = 0.00314486 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     11.7      0.5      0.3      1.4      1.2      0.4      0.8      0.6      1.1      0.0      18.1       6.0+/-  1.2      20.1   explicit[s=1]
   avg      2.5      0.6      0.5      1.8      1.4      0.4      1.2      0.7      1.6      0.0      10.8       7.8+/-  1.6       6.5   explicit[s=2]
   avg      4.6      1.0      0.8      2.8      2.1      0.6      2.4      1.5      2.7      0.0      18.5      13.3+/-  2.8       5.3   explicit[s=4]
   avg      6.5      1.6      1.1      3.8      3.2      0.7      3.2      1.8      3.9      0.0      25.9      18.6+/-  4.1       4.8   explicit[s=6]
   avg      8.1      2.2      1.5      4.9      4.2      0.9      3.8      2.1      5.2      0.0      32.9      23.9+/-  4.6       4.6   explicit[s=8]

master (9628e8f)
Total Timing Summary
Average time for map loading = 566.258 ms
Average time for input loading = 7584.02 ms
Average time for lst::Event creation = 0.0034968 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     13.1      0.5      0.3      1.5      1.2      0.4      1.3      1.0      1.3      0.0      20.7       7.2+/-  1.3      22.8   explicit[s=1]
   avg      2.5      0.6      0.5      1.8      1.4      0.5      1.1      0.7      1.8      0.0      10.8       7.9+/-  1.6       6.5   explicit[s=2]
   avg      3.4      1.0      0.8      2.8      2.2      0.6      2.0      1.1      3.0      0.0      16.9      13.0+/-  2.8       4.9   explicit[s=4]
   avg      5.7      1.6      1.2      4.0      3.3      0.7      3.0      1.6      4.4      0.0      25.4      18.9+/-  3.4       4.7   explicit[s=6]
   avg     10.0      1.9      1.3      4.8      4.7      0.9      4.3      2.4      5.5      0.0      35.8      24.9+/-  5.1       5.0   explicit[s=8]

@ariostas ariostas force-pushed the ariostas/integration_pr_followups branch 3 times, most recently from 7278857 to 9444e18 Compare December 18, 2024 21:47
@ariostas
Copy link
Member Author

Sorry for all the force-pushing.

I made the code compatible with both kVerticalModuleSlope and infinity, so that we can change the data files without breaking anything. Once that is done we can fully remove it.

I think the last low hanging fruit that I'll include here is to set concrete dimensions instead of templated types for kernels and alpaka functions.

@ariostas ariostas force-pushed the ariostas/integration_pr_followups branch from 2a034e8 to 0da8474 Compare December 20, 2024 16:23
@ariostas ariostas changed the title Integration PR followups: make_workdiv, uniform_elements, ... Integration PR followups: make_workdiv, uniform_elements, concrete kernel dimensions Dec 20, 2024
@ariostas ariostas marked this pull request as ready for review December 20, 2024 16:27
@slava77
Copy link

slava77 commented Dec 20, 2024

/run all

Copy link

There was a problem while building and running in standalone mode. The logs can be found here.

Copy link

There was a problem while building and running with CMSSW. The logs can be found here.

@slava77
Copy link

slava77 commented Dec 20, 2024

There was a problem while building and running in standalone mode. The logs can be found here.

I couldn't parse the error to the point of understanding where a fix is needed

RecoTracker/LSTCore/src/alpaka/Hit.h:87:63:   required from here
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-aba90e6b0efd37975ff68417e953fa78/include/alpaka/workdiv/Traits.hpp:36:81: 
error: incomplete type 
'alpaka::trait::GetWorkDiv<alpaka::WorkDivUniformCudaHipBuiltIn<std::integral_constant<long unsigned int, 1>, unsigned int>, alpaka::origin::Thread, alpaka::unit::Elems, void>'
 used in nested name specifier
   36 |         return trait::GetWorkDiv<ImplementationBase, TOrigin, TUnit>::getWorkDiv(workDiv);
      |                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~

@ariostas
Copy link
Member Author

ariostas commented Dec 23, 2024

We had some issues with our #includes since GPU code was showing up in CPU-only parts (without all the Alpaka flags being set appropriately). I didn't notice it because it compiles fine for the CPU backend, which is what I was using for testing. Should be all good now.

/run all

Copy link

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     46.5    393.0    188.5    157.4    164.4    508.9    122.1    230.1    146.0      3.7    1960.7    1405.3+/- 388.9     519.0   explicit[s=4] (target branch)
   avg     45.1    388.7    189.8    156.2    153.5    701.3    126.9    248.2    176.3      3.4    2189.2    1442.9+/- 393.2     575.6   explicit[s=4] (this PR)

Copy link

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@slava77
Copy link

slava77 commented Dec 23, 2024

I updated the master branch after the merge of #140 .
There is a conflict now in RecoTracker/LSTCore/interface/alpaka/Common.h

@ariostas ariostas force-pushed the ariostas/integration_pr_followups branch from f6a499c to 4f9d7f9 Compare December 30, 2024 14:44
@ariostas
Copy link
Member Author

There is a conflict now in RecoTracker/LSTCore/interface/alpaka/Common.h

Thank you, Slava. It's fixed now.

/run all

Copy link

There was a problem while building and running in standalone mode. The logs can be found here.

@ariostas ariostas force-pushed the ariostas/integration_pr_followups branch from 4f9d7f9 to c8b78fd Compare December 30, 2024 15:07
Copy link

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@ariostas ariostas force-pushed the ariostas/integration_pr_followups branch from c8b78fd to 1a27b2a Compare January 10, 2025 14:54
@ariostas
Copy link
Member Author

/run all

Copy link

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     46.9    398.7    190.5    154.1    148.7    547.5    124.9    235.9    151.0      3.9    2002.2    1407.8+/- 387.0     530.9   explicit[s=4] (target branch)
   avg     49.5    389.6    194.2    162.0    155.0    703.5    131.0    259.7    178.9      3.8    2227.3    1474.3+/- 403.6     588.1   explicit[s=4] (this PR)

@slava77
Copy link

slava77 commented Jan 10, 2025

before the break we talked about batching, assuming other contributions overlapping this PR can accumulate.
It looks like there is nothing ready yet. So, perhaps it's more practical now to just submit this upstream to cms-sw/cmssw already.
Commits are linear and few, I don't see a need to squash or rebase.

Copy link

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@slava77
Copy link

slava77 commented Jan 10, 2025

this crashed in the bot tests with GPU: did it run locally on GPU or is it some more recent feature in the IBs?

@ariostas
Copy link
Member Author

Seems like the bug is outside of the changes I made for this PR. The issue was triggered by the switch from int8_t to int16_t. My guess is that something was left unset and for 0xff it was fine, but 0xffff is large enough to cause a segfault. I'm still looking into where this is coming from.

@ariostas
Copy link
Member Author

Nevermind, seems like it is a problem in this PR. Maybe I'm misunderstanding how some other functionality works on GPUs.

@ariostas
Copy link
Member Author

Turns out that cms-sw#46967 broke running on GPUs. The binary search is not working for some reason.

@VourMa
Copy link
Collaborator

VourMa commented Jan 13, 2025

Turns out that cms-sw#46967 broke running on GPUs. The binary search is not working for some reason.

The GPU tests passed back then though:
cms-sw#46967 (comment)
What changed?

@slava77
Copy link

slava77 commented Jan 13, 2025

The GPU tests passed back then though:
cms-sw#46967 (comment)
What changed?

the baseline in the PR tests ran OK as well. So, the crash appears from just the incremental addition of this PR.
Well, unless there is a non-reproducible component and it may or may not crash unpredictably.

@ariostas
Copy link
Member Author

the baseline in the PR tests ran OK as well.

The baseline doesn't crash, but it produces garbage results. With the switch from int8_t to int16_t in this PR the garbage is bad enough to cause a segfault.

@ariostas ariostas force-pushed the ariostas/integration_pr_followups branch from 1a27b2a to baa91b3 Compare January 17, 2025 16:27
@ariostas
Copy link
Member Author

/run all

Copy link

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

Efficiency vs pT comparison Efficiency vs eta comparison
Fake rate vs pT comparison Fake rate vs eta comparison
Duplicate rate vs pT comparison Duplicate rate vs eta comparison

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     47.8    397.1    189.4    153.1    147.5    555.7    123.7    235.3    152.8      3.6    2006.0    1402.4+/- 388.4     532.6   explicit[s=4] (target branch)
   avg     50.3    391.1    193.3    158.6    171.2    707.9    132.9    254.8    179.2      3.1    2242.5    1484.3+/- 416.2     593.2   explicit[s=4] (this PR)

@@ -694,7 +658,8 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::lst {
float y2 = mds.anchorY()[secondMDIndex];
float y3 = mds.anchorY()[thirdMDIndex];

circleRadius = computeRadiusFromThreeAnchorHits(acc, x1, y1, x2, y2, x3, y3, circleCenterX, circleCenterY);
std::tie(circleRadius, circleCenterX, circleCenterY) =
computeRadiusFromThreeAnchorHits(acc, x1, y1, x2, y2, x3, y3);
Copy link

@slava77 slava77 Jan 17, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it shorter to write circleRadius = computeRadiusFromThreeAnchorHits(...).get<0>()

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's true. I'll wait to see if there are any other comments and I'll fix it

Copy link

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks
Efficiency and fake rate vs pT, eta, and phi

The full set of validation and comparison plots can be found here.

@ariostas
Copy link
Member Author

Here's a timing comparison on cgpu-1:

[baa91b3]
Total Timing Summary
Average time for map loading = 599.562 ms
Average time for input loading = 7649.27 ms
Average time for lst::Event creation = 0.00361034 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     13.6      0.5      0.3      1.5      1.2      0.4      0.9      0.6      1.1      0.0      20.0       6.1+/-  1.3      22.2   explicit[s=1]
   avg      2.5      0.6      0.5      1.8      1.4      0.5      1.2      0.8      1.6      0.0      10.8       7.8+/-  1.5       6.6   explicit[s=2]
   avg      4.2      1.0      0.8      2.8      2.1      0.5      2.3      1.3      2.7      0.0      17.7      13.0+/-  2.4       5.1   explicit[s=4]
   avg      6.5      1.5      1.1      3.9      3.1      0.7      3.2      1.8      3.9      0.0      25.8      18.6+/-  3.6       4.8   explicit[s=6]
   avg      9.0      2.2      1.5      4.7      4.2      0.8      4.0      2.1      5.1      0.0      33.7      23.8+/-  4.4       4.7   explicit[s=8]

master before it was broken [0ae08d2]
Total Timing Summary
Average time for map loading = 583.027 ms
Average time for input loading = 7625.62 ms
Average time for lst::Event creation = 0.00333786 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     13.0      0.5      0.3      1.5      1.2      0.4      1.3      1.0      1.4      0.0      20.6       7.2+/-  1.3      22.7   explicit[s=1]
   avg      3.9      0.7      0.5      1.8      1.4      0.5      1.7      1.2      1.8      0.0      13.4       9.0+/-  1.6       7.9   explicit[s=2]
   avg      3.6      1.0      0.9      2.8      2.2      0.6      2.0      1.1      3.0      0.0      17.1      12.9+/-  2.5       5.0   explicit[s=4]
   avg      6.6      1.5      1.1      3.9      3.2      0.7      3.2      1.7      4.3      0.0      26.3      19.0+/-  4.2       4.9   explicit[s=6]
   avg      8.5      2.1      1.5      5.0      4.7      0.9      4.0      2.2      5.6      0.0      34.4      25.0+/-  5.1       4.7   explicit[s=8]

@ariostas ariostas merged commit 878e0b4 into master Jan 29, 2025
3 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants