Integration PR followups: make_workdiv, uniform_elements, concrete kernel dimensions #141

ariostas · 2024-12-17T19:46:51Z

I started to address some follow-up tasks in #75. In particular:

Switch to cms::alpakatools::make_workdiv instead of our custom createWorkDiv.
Switch to cms::alpakatools::uniform_elements for kernel loops.
Don't set a custom work division for CPU.
Start moving towards the removal of kVerticalModuleSlope.
Use concrete kernel dimensions instead of templated ones.
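
For readers unfamiliar with the first two items, here is a minimal host-only sketch of the pattern they serve: pick enough blocks to cover all elements, then let each thread walk the elements in a grid-strided loop. `divideUp`, `forEachUniformElement` and `countVisits` are hypothetical stand-ins written for this illustration; they are not the cms::alpakatools API, and one element per thread is an assumption.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical stand-in: number of blocks needed to cover n elements.
constexpr std::size_t divideUp(std::size_t n, std::size_t blockSize) {
  return (n + blockSize - 1) / blockSize;
}

// Hypothetical stand-in for a uniform element loop: the thread with global
// index `globalThread` visits i = globalThread, globalThread + gridSize, ...
// for as long as i < n.
template <typename F>
void forEachUniformElement(std::size_t globalThread, std::size_t gridSize, std::size_t n, F&& f) {
  for (std::size_t i = globalThread; i < n; i += gridSize)
    f(i);
}

// Emulate blocks * threadsPerBlock threads sequentially on the host and
// count how many times each element is visited.
std::vector<int> countVisits(std::size_t n, std::size_t blocks, std::size_t threadsPerBlock) {
  std::vector<int> visits(n, 0);
  const std::size_t gridSize = blocks * threadsPerBlock;
  for (std::size_t t = 0; t < gridSize; ++t)
    forEachUniformElement(t, gridSize, n, [&](std::size_t i) { ++visits[i]; });
  return visits;
}
```

In this emulation every element is visited exactly once whether the grid is larger or smaller than the problem size, which is the property that lets the same loop body run under different work divisions.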

ariostas · 2024-12-17T19:49:14Z

I think the plots might look a bit different now that the work division is different, but hopefully they are still run-to-run reproducible (I'll check).

For now, I'm just making sure I didn't break something.

/run standalone
/run checks

github-actions · 2024-12-17T20:04:03Z

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     46.8    400.0    187.9    153.9    166.6    515.0    123.0    232.9    146.5      3.9    1976.4    1414.6+/- 393.7     525.2   explicit[s=4] (target branch)
   avg     46.5    386.1    190.4    157.7    165.2    694.5      8.8     11.8    160.6      3.4    1824.9    1084.0+/- 241.8     482.7   explicit[s=4] (this PR)

ariostas · 2024-12-17T20:05:30Z

Welp, I did break something. I'll look into it

ariostas · 2024-12-17T20:40:29Z

Hopefully it's fine now.

/run standalone

github-actions · 2024-12-17T20:54:59Z

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     45.8    394.5    194.2    150.0    164.7    514.4    121.9    236.0    145.2      3.7    1970.5    1410.2+/- 392.1     519.0   explicit[s=4] (target branch)
   avg     44.4    385.7    188.4    155.9    149.7    702.5    127.3    256.8    177.9      3.7    2192.4    1445.4+/- 400.5     577.8   explicit[s=4] (this PR)

ariostas · 2024-12-17T21:04:38Z

The plots match perfectly, which is nice. There was a significant increase in timing, especially for pLS, but it seems like this is only for CPU. This is how the timing compares on cgpu-1.

This PR (aa82696)
Total Timing Summary
Average time for map loading = 571.866 ms
Average time for input loading = 7550.76 ms
Average time for lst::Event creation = 0.00314486 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     11.7      0.5      0.3      1.4      1.2      0.4      0.8      0.6      1.1      0.0      18.1       6.0+/-  1.2      20.1   explicit[s=1]
   avg      2.5      0.6      0.5      1.8      1.4      0.4      1.2      0.7      1.6      0.0      10.8       7.8+/-  1.6       6.5   explicit[s=2]
   avg      4.6      1.0      0.8      2.8      2.1      0.6      2.4      1.5      2.7      0.0      18.5      13.3+/-  2.8       5.3   explicit[s=4]
   avg      6.5      1.6      1.1      3.8      3.2      0.7      3.2      1.8      3.9      0.0      25.9      18.6+/-  4.1       4.8   explicit[s=6]
   avg      8.1      2.2      1.5      4.9      4.2      0.9      3.8      2.1      5.2      0.0      32.9      23.9+/-  4.6       4.6   explicit[s=8]

master (9628e8f)
Total Timing Summary
Average time for map loading = 566.258 ms
Average time for input loading = 7584.02 ms
Average time for lst::Event creation = 0.0034968 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     13.1      0.5      0.3      1.5      1.2      0.4      1.3      1.0      1.3      0.0      20.7       7.2+/-  1.3      22.8   explicit[s=1]
   avg      2.5      0.6      0.5      1.8      1.4      0.5      1.1      0.7      1.8      0.0      10.8       7.9+/-  1.6       6.5   explicit[s=2]
   avg      3.4      1.0      0.8      2.8      2.2      0.6      2.0      1.1      3.0      0.0      16.9      13.0+/-  2.8       4.9   explicit[s=4]
   avg      5.7      1.6      1.2      4.0      3.3      0.7      3.0      1.6      4.4      0.0      25.4      18.9+/-  3.4       4.7   explicit[s=6]
   avg     10.0      1.9      1.3      4.8      4.7      0.9      4.3      2.4      5.5      0.0      35.8      24.9+/-  5.1       5.0   explicit[s=8]
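
To put numbers on the "significant increase", the relative change can be computed directly from the explicit[s=4] rows quoted so far. This is just arithmetic on values copied from the tables above (CPU rows from the second bot comment, GPU rows from the cgpu-1 summaries); the helper names are hypothetical.

```cpp
#include <cstdio>

// Relative change of `pr` with respect to `base`, in percent.
constexpr double percentChange(double base, double pr) {
  return 100.0 * (pr - base) / base;
}

// Values copied from the explicit[s=4] rows in this thread.
constexpr double cpuPlsBase = 514.4, cpuPlsPr = 702.5;        // pLS column, CPU
constexpr double cpuEventBase = 1970.5, cpuEventPr = 2192.4;  // Event column, CPU
constexpr double gpuPlsBase = 0.6, gpuPlsPr = 0.6;            // pLS column, cgpu-1

// Print the three relative changes with one decimal place.
void printTimingChanges() {
  std::printf("CPU pLS:   %+.1f%%\n", percentChange(cpuPlsBase, cpuPlsPr));
  std::printf("CPU Event: %+.1f%%\n", percentChange(cpuEventBase, cpuEventPr));
  std::printf("GPU pLS:   %+.1f%%\n", percentChange(gpuPlsBase, gpuPlsPr));
}
```

With these inputs the CPU pLS step comes out roughly 37% slower and the CPU event total roughly 11% slower, while the GPU pLS entry is unchanged at the one-decimal precision of the table.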

RecoTracker/LSTCore/src/alpaka/LSTEvent.dev.cc

ariostas · 2024-12-18T21:51:41Z

Sorry for all the force-pushing.

I made the code compatible with both kVerticalModuleSlope and infinity, so that we can change the data files without breaking anything. Once that is done we can fully remove it.
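
A minimal sketch of what "compatible with both" could look like, assuming the check is a simple predicate on the slope. `isVerticalSlope` and `kLegacyVerticalSlope` are hypothetical names, the sentinel value is a placeholder chosen for illustration, and this is not necessarily how the PR implements it.

```cpp
#include <cmath>
#include <limits>

// Hypothetical placeholder for the legacy sentinel; the actual value of
// kVerticalModuleSlope is not given in this thread.
constexpr float kLegacyVerticalSlope = 123456789.0f;

// Treat a module as vertical if its slope is infinite (new data files)
// or equal to the legacy sentinel (old data files).
inline bool isVerticalSlope(float slope) {
  return std::isinf(slope) || slope == kLegacyVerticalSlope;
}
```

Once all data files store infinity, the second half of the condition and the constant could be dropped without touching the callers.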

I think the last low hanging fruit that I'll include here is to set concrete dimensions instead of templated types for kernels and alpaka functions.
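
The difference between templated and concrete kernel dimensions can be illustrated without alpaka. In the sketch below `Acc1D` and `Acc3D` are empty stand-in tags (in the real code they would be accelerator types), and both kernel structs are hypothetical.

```cpp
#include <type_traits>

// Stand-in accelerator tags, used only for this illustration.
struct Acc1D {};
struct Acc3D {};

// Templated style: the kernel accepts any accelerator type, so a call with
// the wrong dimensionality is not rejected by the signature.
struct TemplatedKernel {
  template <typename TAcc>
  void operator()(TAcc const&, int* out) const { *out = 1; }
};

// Concrete style: the signature states the dimensionality the kernel expects.
struct ConcreteKernel {
  void operator()(Acc1D const&, int* out) const { *out = 1; }
};

// The templated kernel is callable with a 3D accelerator; the concrete one is not.
static_assert(std::is_invocable_v<TemplatedKernel, Acc3D const&, int*>);
static_assert(std::is_invocable_v<ConcreteKernel, Acc1D const&, int*>);
static_assert(!std::is_invocable_v<ConcreteKernel, Acc3D const&, int*>);
```

The concrete form also tends to give shorter error messages, since a mismatch is reported at the call rather than somewhere inside the kernel body.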

RecoTracker/LSTCore/src/alpaka/LSTEvent.dev.cc

RecoTracker/LSTCore/src/alpaka/Segment.h

slava77 · 2024-12-20T23:06:36Z

/run all

github-actions · 2024-12-20T23:13:50Z

There was a problem while building and running in standalone mode. The logs can be found here.

github-actions · 2024-12-20T23:16:37Z

There was a problem while building and running with CMSSW. The logs can be found here.

slava77 · 2024-12-20T23:32:57Z

> There was a problem while building and running in standalone mode. The logs can be found here.

I couldn't parse the error to the point of understanding where a fix is needed

RecoTracker/LSTCore/src/alpaka/Hit.h:87:63:   required from here
/cvmfs/cms.cern.ch/el8_amd64_gcc12/external/alpaka/1.1.0-aba90e6b0efd37975ff68417e953fa78/include/alpaka/workdiv/Traits.hpp:36:81: error: incomplete type 'alpaka::trait::GetWorkDiv<alpaka::WorkDivUniformCudaHipBuiltIn<std::integral_constant<long unsigned int, 1>, unsigned int>, alpaka::origin::Thread, alpaka::unit::Elems, void>' used in nested name specifier
   36 |         return trait::GetWorkDiv<ImplementationBase, TOrigin, TUnit>::getWorkDiv(workDiv);
      |                ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~^~~~~~~~~

ariostas · 2024-12-23T14:58:52Z

We had some issues with our #includes since GPU code was showing up in CPU-only parts (without all the Alpaka flags being set appropriately). I didn't notice it because it compiles fine for the CPU backend, which is what I was using for testing. Should be all good now.

/run all
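
One way to read the "incomplete type ... used in nested name specifier" error quoted earlier in this thread: a trait whose primary template is only declared, with the GPU specialization visible only when a backend flag is defined. The sketch below reproduces that mechanism in isolation. `GetWorkDivDemo`, `CpuWorkDiv`, `GpuWorkDiv` and `DEMO_GPU_ENABLED` are all hypothetical, and the assumption that this is what happened in the real headers is consistent with the explanation above but not confirmed here.

```cpp
#include <type_traits>

// Primary template is declared but never defined, as is common for traits.
template <typename TWorkDiv>
struct GetWorkDivDemo;

struct CpuWorkDiv {};
struct GpuWorkDiv {};

// Specialization that is always available.
template <>
struct GetWorkDivDemo<CpuWorkDiv> {
  static int get(CpuWorkDiv const&) { return 1; }
};

// Specialization that only exists when the (hypothetical) GPU flag is set.
#ifdef DEMO_GPU_ENABLED
template <>
struct GetWorkDivDemo<GpuWorkDiv> {
  static int get(GpuWorkDiv const&) { return 1; }
};
#endif

// Detect completeness: sizeof(T) is only valid for complete types.
template <typename T, typename = void>
struct IsComplete : std::false_type {};
template <typename T>
struct IsComplete<T, std::void_t<decltype(sizeof(T))>> : std::true_type {};

static_assert(IsComplete<GetWorkDivDemo<CpuWorkDiv>>::value);
#ifndef DEMO_GPU_ENABLED
// Without the flag, writing GetWorkDivDemo<GpuWorkDiv>::get(...) would fail
// with "incomplete type ... used in nested name specifier".
static_assert(!IsComplete<GetWorkDivDemo<GpuWorkDiv>>::value);
#endif
```

A CPU-only build never names the GPU specialization, which would explain why the problem stayed hidden when testing only with the CPU backend.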

github-actions · 2024-12-23T15:14:39Z

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     46.5    393.0    188.5    157.4    164.4    508.9    122.1    230.1    146.0      3.7    1960.7    1405.3+/- 388.9     519.0   explicit[s=4] (target branch)
   avg     45.1    388.7    189.8    156.2    153.5    701.3    126.9    248.2    176.3      3.4    2189.2    1442.9+/- 393.2     575.6   explicit[s=4] (this PR)

github-actions · 2024-12-23T16:29:58Z

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

slava77 · 2024-12-23T18:38:12Z

I updated the master branch after the merge of #140 .
There is a conflict now in RecoTracker/LSTCore/interface/alpaka/Common.h

ariostas · 2024-12-30T14:53:08Z

> There is a conflict now in RecoTracker/LSTCore/interface/alpaka/Common.h

Thank you, Slava. It's fixed now.

/run all

github-actions · 2024-12-30T14:59:34Z

There was a problem while building and running in standalone mode. The logs can be found here.

github-actions · 2024-12-30T17:12:54Z

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

RecoTracker/LSTCore/standalone/code/core/write_lst_ntuple.cc

ariostas · 2025-01-10T14:54:56Z

/run all

github-actions · 2025-01-10T15:16:04Z

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     46.9    398.7    190.5    154.1    148.7    547.5    124.9    235.9    151.0      3.9    2002.2    1407.8+/- 387.0     530.9   explicit[s=4] (target branch)
   avg     49.5    389.6    194.2    162.0    155.0    703.5    131.0    259.7    178.9      3.8    2227.3    1474.3+/- 403.6     588.1   explicit[s=4] (this PR)

slava77 · 2025-01-10T15:28:56Z

before the break we talked about batching, assuming other contributions overlapping this PR can accumulate.
It looks like there is nothing ready yet. So, perhaps it's more practical now to just submit this upstream to cms-sw/cmssw already.
Commits are linear and few, I don't see a need to squash or rebase.

github-actions · 2025-01-10T16:21:31Z

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

slava77 · 2025-01-10T22:28:19Z

this crashed in the bot tests with GPU: did it run locally on GPU or is it some more recent feature in the IBs?

ariostas · 2025-01-13T16:47:32Z

Seems like the bug is outside of the changes I made for this PR. The issue was triggered by the switch from int8_t to int16_t. My guess is that something was left unset and for 0xff it was fine, but 0xffff is large enough to cause a segfault. I'm still looking into where this is coming from.

ariostas · 2025-01-13T17:21:59Z

Never mind, it seems to be a problem in this PR after all. Maybe I'm misunderstanding how some other functionality works on GPUs.

ariostas · 2025-01-13T19:43:52Z

Turns out that cms-sw#46967 broke running on GPUs. The binary search is not working for some reason.
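
The commit list for this PR mentions a lower_bound function that works in device code. A hand-written binary search over raw pointers, with no dependency on standard-library algorithms, is the kind of function that can be compiled for both host and device. The version below is a hypothetical sketch, not the function added in the PR, and it is shown as plain host C++ without any device annotations.

```cpp
#include <cstddef>

// Hypothetical sketch. Returns the first index in the sorted range [0, n)
// whose value is not less than `value`, or n if there is no such element.
template <typename T>
constexpr std::size_t lowerBoundIndex(const T* data, std::size_t n, const T& value) {
  std::size_t first = 0;
  std::size_t count = n;
  while (count > 0) {
    const std::size_t step = count / 2;
    const std::size_t mid = first + step;
    if (data[mid] < value) {
      // Everything up to and including mid is too small; search the right half.
      first = mid + 1;
      count -= step + 1;
    } else {
      // data[mid] is a candidate; keep searching the left half.
      count = step;
    }
  }
  return first;
}
```

Returning an index rather than an iterator keeps the interface usable with plain device buffers; the caller compares the result against n to detect "not found".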

VourMa · 2025-01-13T19:47:56Z

> Turns out that cms-sw#46967 broke running on GPUs. The binary search is not working for some reason.

The GPU tests passed back then though:
cms-sw#46967 (comment)
What changed?

slava77 · 2025-01-13T23:12:26Z

> The GPU tests passed back then though:
> cms-sw#46967 (comment)
> What changed?

the baseline in the PR tests ran OK as well. So, the crash appears from just the incremental addition of this PR.
Well, unless there is a non-reproducible component and it may or may not crash unpredictably.

ariostas · 2025-01-14T14:13:47Z

> the baseline in the PR tests ran OK as well.

The baseline doesn't crash, but it produces garbage results. With the switch from int8_t to int16_t in this PR the garbage is bad enough to cause a segfault.
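
To illustrate the 0xff versus 0xffff remark made earlier in this thread: the same "unset" bit pattern yields a much larger offset once the counter is widened. The sketch assumes the garbage value is -1 reinterpreted as unsigned and uses a hypothetical buffer size; neither is established in this thread.

```cpp
#include <cstddef>
#include <cstdint>

// Reinterpret an "unset" counter (assumed here to hold -1) as an unsigned offset.
constexpr std::size_t offsetFromUnset8() {
  return static_cast<std::uint8_t>(std::int8_t{-1});
}
constexpr std::size_t offsetFromUnset16() {
  return static_cast<std::uint16_t>(std::int16_t{-1});
}

// Hypothetical buffer size, chosen only for illustration.
constexpr std::size_t kDemoBufferSize = 1000;

// True if `offset` is a valid index into a buffer of `size` elements.
constexpr bool inBounds(std::size_t offset, std::size_t size) {
  return offset < size;
}

static_assert(offsetFromUnset8() == 0xff);
static_assert(offsetFromUnset16() == 0xffff);
```

With a buffer of 1000 entries the 8-bit garbage value still lands inside the allocation and silently reads wrong data, while the 16-bit one points far outside it, which is the kind of difference that can turn wrong results into a segfault.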

RecoTracker/LSTCore/interface/Circle.h

Co-authored-by: Andrea Bocci

ariostas · 2025-01-17T16:30:58Z

/run all

github-actions · 2025-01-17T16:47:20Z

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     47.8    397.1    189.4    153.1    147.5    555.7    123.7    235.3    152.8      3.6    2006.0    1402.4+/- 388.4     532.6   explicit[s=4] (target branch)
   avg     50.3    391.1    193.3    158.6    171.2    707.9    132.9    254.8    179.2      3.1    2242.5    1484.3+/- 416.2     593.2   explicit[s=4] (this PR)

slava77 · 2025-01-17T17:02:18Z

RecoTracker/LSTCore/src/alpaka/Triplet.h

@@ -694,7 +658,8 @@ namespace ALPAKA_ACCELERATOR_NAMESPACE::lst {
    float y2 = mds.anchorY()[secondMDIndex];
    float y3 = mds.anchorY()[thirdMDIndex];

-    circleRadius = computeRadiusFromThreeAnchorHits(acc, x1, y1, x2, y2, x3, y3, circleCenterX, circleCenterY);
+    std::tie(circleRadius, circleCenterX, circleCenterY) =
+        computeRadiusFromThreeAnchorHits(acc, x1, y1, x2, y2, x3, y3);


is it shorter to write circleRadius = computeRadiusFromThreeAnchorHits(...).get<0>()?

ariostas · 2025-01-17

That's true. I'll wait to see if there are any other comments and then fix it.
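
The two spellings discussed in this exchange can be compared on a small stand-alone example. `circleFromThreePoints` is a hypothetical stand-in that returns (radius, centerX, centerY) as a std::tuple; that the PR's function returns a std::tuple is inferred from the std::tie in the diff. Note that std::tuple has no member `get`, so the standard spelling for picking one element is the free function `std::get<0>(...)`.

```cpp
#include <cmath>
#include <tuple>

// Hypothetical stand-in: (radius, centerX, centerY) of the circle through
// three points. Collinear points make d zero and are not handled here.
inline std::tuple<float, float, float> circleFromThreePoints(
    float x1, float y1, float x2, float y2, float x3, float y3) {
  const float s1 = x1 * x1 + y1 * y1;
  const float s2 = x2 * x2 + y2 * y2;
  const float s3 = x3 * x3 + y3 * y3;
  const float d = 2.f * (x1 * (y2 - y3) + x2 * (y3 - y1) + x3 * (y1 - y2));
  const float cx = (s1 * (y2 - y3) + s2 * (y3 - y1) + s3 * (y1 - y2)) / d;
  const float cy = (s1 * (x3 - x2) + s2 * (x1 - x3) + s3 * (x2 - x1)) / d;
  return {std::hypot(x1 - cx, y1 - cy), cx, cy};
}

// Option 1: unpack all three values into existing variables with std::tie.
inline float radiusViaTie(float x1, float y1, float x2, float y2, float x3, float y3) {
  float radius, centerX, centerY;
  std::tie(radius, centerX, centerY) = circleFromThreePoints(x1, y1, x2, y2, x3, y3);
  return radius;
}

// Option 2: take only the first element when the center is not needed.
inline float radiusViaGet(float x1, float y1, float x2, float y2, float x3, float y3) {
  return std::get<0>(circleFromThreePoints(x1, y1, x2, y2, x3, y3));
}
```

Option 2 is shorter when the center is unused afterwards; option 1 is the natural choice when the caller also needs the center values.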

github-actions · 2025-01-17T18:02:03Z

The PR was built and ran successfully with CMSSW. Here are some plots.

OOTB All Tracks

The full set of validation and comparison plots can be found here.

ariostas · 2025-01-23T21:05:21Z

Here's a timing comparison on cgpu-1:

[baa91b3]
Total Timing Summary
Average time for map loading = 599.562 ms
Average time for input loading = 7649.27 ms
Average time for lst::Event creation = 0.00361034 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     13.6      0.5      0.3      1.5      1.2      0.4      0.9      0.6      1.1      0.0      20.0       6.1+/-  1.3      22.2   explicit[s=1]
   avg      2.5      0.6      0.5      1.8      1.4      0.5      1.2      0.8      1.6      0.0      10.8       7.8+/-  1.5       6.6   explicit[s=2]
   avg      4.2      1.0      0.8      2.8      2.1      0.5      2.3      1.3      2.7      0.0      17.7      13.0+/-  2.4       5.1   explicit[s=4]
   avg      6.5      1.5      1.1      3.9      3.1      0.7      3.2      1.8      3.9      0.0      25.8      18.6+/-  3.6       4.8   explicit[s=6]
   avg      9.0      2.2      1.5      4.7      4.2      0.8      4.0      2.1      5.1      0.0      33.7      23.8+/-  4.4       4.7   explicit[s=8]

master before it was broken [0ae08d2]
Total Timing Summary
Average time for map loading = 583.027 ms
Average time for input loading = 7625.62 ms
Average time for lst::Event creation = 0.00333786 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     13.0      0.5      0.3      1.5      1.2      0.4      1.3      1.0      1.4      0.0      20.6       7.2+/-  1.3      22.7   explicit[s=1]
   avg      3.9      0.7      0.5      1.8      1.4      0.5      1.7      1.2      1.8      0.0      13.4       9.0+/-  1.6       7.9   explicit[s=2]
   avg      3.6      1.0      0.9      2.8      2.2      0.6      2.0      1.1      3.0      0.0      17.1      12.9+/-  2.5       5.0   explicit[s=4]
   avg      6.6      1.5      1.1      3.9      3.2      0.7      3.2      1.7      4.3      0.0      26.3      19.0+/-  4.2       4.9   explicit[s=6]
   avg      8.5      2.1      1.5      5.0      4.7      0.9      4.0      2.2      5.6      0.0      34.4      25.0+/-  5.1       4.7   explicit[s=8]

ariostas force-pushed the ariostas/integration_pr_followups branch from 17d4223 to aa82696 on December 17, 2024 20:39

slava77 reviewed Dec 17, 2024

RecoTracker/LSTCore/src/alpaka/LSTEvent.dev.cc (outdated, resolved)

slava77 reviewed Dec 17, 2024

RecoTracker/LSTCore/src/alpaka/LSTEvent.dev.cc (outdated, resolved)

ariostas force-pushed the ariostas/integration_pr_followups branch 3 times, most recently from 7278857 to 9444e18 on December 18, 2024 21:47

GNiendorf reviewed Dec 19, 2024

RecoTracker/LSTCore/src/alpaka/LSTEvent.dev.cc (resolved)

slava77 reviewed Dec 20, 2024

RecoTracker/LSTCore/src/alpaka/Segment.h (resolved)

ariostas force-pushed the ariostas/integration_pr_followups branch from 2a034e8 to 0da8474 on December 20, 2024 16:23

ariostas changed the title from "Integration PR followups: make_workdiv, uniform_elements, ..." to "Integration PR followups: make_workdiv, uniform_elements, concrete kernel dimensions" Dec 20, 2024

ariostas marked this pull request as ready for review December 20, 2024 16:27

ariostas force-pushed the ariostas/integration_pr_followups branch from f6a499c to 4f9d7f9 on December 30, 2024 14:44

ariostas force-pushed the ariostas/integration_pr_followups branch from 4f9d7f9 to c8b78fd on December 30, 2024 15:07

slava77 reviewed Jan 3, 2025

RecoTracker/LSTCore/standalone/code/core/write_lst_ntuple.cc (outdated, resolved)

ariostas force-pushed the ariostas/integration_pr_followups branch from c8b78fd to 1a27b2a on January 10, 2025 14:54

slava77 approved these changes Jan 10, 2025

VourMa reviewed Jan 15, 2025

RecoTracker/LSTCore/interface/Circle.h (outdated, resolved)

ariostas and others added 6 commits January 17, 2025 06:28

7e5703d  Use make_workdiv and uniform_elements
f1a4cc6  Use int16_t for hitRanges counters
8e46424  Started removal of kVerticalModuleSlope
7929700  Added concrete dimensions to kernels
1416747  Fixed include issues
baa91b3  Added lower_bound function that works in device code (Co-authored-by: Andrea Bocci)

ariostas force-pushed the ariostas/integration_pr_followups branch from 1a27b2a to baa91b3 on January 17, 2025 16:27

slava77 reviewed Jan 17, 2025

VourMa mentioned this pull request Jan 20, 2025

Use central phi functions instead LST ones #146

Merged

ariostas merged commit 878e0b4 into master Jan 29, 2025
3 checks passed
