Dynamic Memory Limits for LST Objects #148

Open · wants to merge 1 commit into base: master

Conversation

@GNiendorf (Member) commented Jan 27, 2025

Dynamic Memory Limits

For each module I compute a dynamic maximum: the largest number of objects that can possibly be created given the counts of lower-level objects (for example, the maximum number of T5's in a given module, given the total number of T3's in the modules whose T3's can be connected to it). This upper bound yields a large reduction in the occupancies, and it introduces no new truncations, since the dynamic max is by construction the maximum number of objects that can be created.
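
As a rough illustration (not the actual kernel code), the T5 case could be sketched on the host side as below; the container and map names (nTripletsPerModule, moduleConnections, maxQuintupletsPerModule) are placeholders I am using for illustration only.

```cpp
// Hedged sketch: a per-module upper bound on the number of T5's, derived from
// the per-module T3 counts and a module connection map. All names here are
// illustrative placeholders, not the real LST containers.
#include <cstddef>
#include <cstdint>
#include <vector>

std::vector<unsigned int> dynamicT5Limits(const std::vector<unsigned int>& nTripletsPerModule,
                                          const std::vector<std::vector<std::uint16_t>>& moduleConnections) {
  const std::size_t nModules = nTripletsPerModule.size();
  std::vector<unsigned int> maxQuintupletsPerModule(nModules, 0u);
  for (std::size_t mod = 0; mod < nModules; ++mod) {
    // A T5 in this module pairs one of its T3's with a T3 from a module it
    // can be connected to, so the product of the two counts bounds the total.
    unsigned int connectableT3s = 0u;
    for (std::uint16_t other : moduleConnections[mod])
      connectableT3s += nTripletsPerModule[other];
    maxQuintupletsPerModule[mod] = nTripletsPerModule[mod] * connectableT3s;
  }
  return maxQuintupletsPerModule;
}
```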

For the segments, I had to rearrange the addPixelSegmentToEvent() function so that the segments container is instantiated after the MDs container is filled. Currently the segments container is instantiated during the pLS loading, which happens before the MDs container is filled.
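
Schematically, the new ordering is sketched below; EventSketch and the numbered steps are placeholders of mine, not the actual addPixelSegmentToEvent() implementation.

```cpp
// Hedged sketch of the reordering: the segments container is sized from the
// per-module MD counts, so it can only be instantiated after the MDs
// container is filled. All names are illustrative stand-ins.
#include <vector>

struct EventSketch {
  std::vector<unsigned int> nMDsPerModule;    // filled during MD creation
  std::vector<unsigned int> segmentCapacity;  // per-module segment limits
};

void buildEventSketch(EventSketch& evt, const std::vector<unsigned int>& mdCounts) {
  evt.nMDsPerModule = mdCounts;  // 1. fill the MDs container first
  // 2. instantiate the segments container, sized from the MD multiplicities
  //    (the real sizing would also fold in the module connection map);
  //    previously this step ran during the pLS loading, before step 1.
  evt.segmentCapacity = evt.nMDsPerModule;
  // 3. the pixel (pLS) segments are then added to the now-existing container.
}
```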

Object Size Reduction

Reduction in object container sizes (essentially the total occupancy) at the current max occupancies (averaged over 10 events, 0.8 GeV):

  • 682,404 -> 303,891 (55% decrease) for MD's.
  • 1,930,994 -> 857,880 (56% decrease) for Segments.
  • 1,980,235 -> 737,502 (63% decrease) for T3's.
  • 502,053 -> 212,953 (58% decrease) for T5's.

Total Memory Decrease

For single stream, 100 events at current occupancies:

1695 MiB -> 1275 MiB (25% decrease, 0.8 GeV)

@GNiendorf mentioned this pull request Jan 27, 2025
@GNiendorf marked this pull request as ready for review January 28, 2025 19:47
@GNiendorf requested a review from slava77 January 28, 2025 19:47
@GNiendorf (Member Author) commented Jan 28, 2025

Performance plots look unchanged, and the timing changes are negligible.

@slava77 commented Jan 28, 2025

/run all

There was a problem while building and running with CMSSW. The logs can be found here.

There was a problem while building and running in standalone mode. The logs can be found here.

@GNiendorf (Member Author)

Should I rebase to master?

@slava77 commented Jan 28, 2025

Should I rebase to master?

I don't think it will help by itself; the CI is not pulling in the right version of alpaka.
... I guess if the target branch is not master but instead something older than Jan 21, the tests may work

@GNiendorf changed the base branch from master to some_branch January 29, 2025 01:51
@GNiendorf (Member Author)

/run all

There was a problem while building and running with CMSSW. The logs can be found here.

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

[Comparison plots: efficiency, fake rate, and duplicate rate vs pT and eta]

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     46.5    403.0    189.1    150.9    147.1    552.7    123.4    236.1    152.7      3.8    2005.4    1406.1+/- 384.0     528.4   explicit[s=4] (target branch)
   avg     31.3    390.5    183.2    142.2    138.4    551.8    121.3    232.4    151.2      2.2    1944.5    1361.4+/- 387.2     518.3   explicit[s=4] (this PR)

@slava77 commented Jan 29, 2025

/run all

this almost worked with switching to some_branch; the CMSSW part failed due to a missing HeterogeneousCore/AlpakaMath; I thought that addpkg was not throwing.
A better guess in this context is to use CMSSW_14_2_0_pre4_workflowsAndGeneralFunctions_squashed to run the tests ... although that's a Dec 4 master and dynamic_occ is from Dec 16 and can pull in some extra stuff.
Still, it's a hack and it's better to get the CI fixed.

There was a problem while building and running with CMSSW. The logs can be found here.

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

[Comparison plots: efficiency, fake rate, and duplicate rate vs pT and eta]

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     46.3    396.6    188.5    154.4    148.1    547.7    124.5    238.0    152.0      3.4    1999.6    1405.5+/- 392.0     531.5   explicit[s=4] (target branch)
   avg     32.0    389.3    184.2    143.7    137.9    555.4    123.3    230.8    150.7      2.0    1949.4    1361.9+/- 387.5     520.0   explicit[s=4] (this PR)

@ariostas changed the base branch from some_branch to master January 29, 2025 14:09
@ariostas (Member)

I changed the target branch because I thought that the CI should work now that it uses the latest IB release, but I just noticed that it failed for another PR. I'll look into it.

@ariostas (Member)

/run cmssw

There was a problem while building and running with CMSSW. The logs can be found here.

@ariostas (Member)

I'll change the CI so that it does the merge before checking out the packages

@ariostas (Member)

There's going to have to be some (hopefully minor) rebasing now that my PR was merged into CMSSW. I'll update our master to see if there are conflicts.

@slava77 commented Jan 29, 2025

There's going to have to be some (hopefully minor) rebasing now that my PR was merged into CMSSW. I'll update our master to see if there are conflicts.

5 files with conflicts now

There was a problem while building and running with CMSSW. The logs can be found here.

@GNiendorf (Member Author) commented Jan 29, 2025

Should I just rebase to master and fix the conflicts? Seems like the CMSSW run failed for reasons unrelated to this PR.

@GNiendorf (Member Author)

Just squashed the commits, will rebase now to fix the conflicts if that is fine.

@slava77 commented Jan 29, 2025

Should I just rebase to master and fix the conflicts? Seems like the CMSSW run failed for reasons unrelated to this PR.

I'd rebase (to master, or rebasing to SegmentLinking:ariostas/integration_pr_followups should also work)

There was a problem while building and running with CMSSW. The logs can be found here.

@GNiendorf (Member Author)

/run cmssw

The PR was built and ran successfully with CMSSW. Here are some plots.

[Plots: OOTB all tracks; efficiency and fake rate vs pT, eta, and phi]

The full set of validation and comparison plots can be found here.

@GNiendorf (Member Author)

@slava77 Do you have any comments for this PR, or should I submit it to CMSSW?

@slava77 commented Jan 30, 2025

@slava77 Do you have any comments for this PR, or should I submit it to CMSSW?

  • I missed the part when PixelSegmentSoA was added and will need some time to check the code.
  • I'm not sure I follow the comment in the PR description, "It looks like the process of moving the segments over to this method resulted in a net increase in memory usage", while above that the segments are stated to decrease.
  • Please review the description to cover the above (and be self-consistent)

what's the plan with #147? I thought that you wanted that more or less together with this.
Does this PR interfere with changes in #146?

@ariostas (Member)

I went through the changes and they look good to me

@GNiendorf (Member Author)

  • I missed the part when PixelSegmentSoA was added and will need some time to check the code.
  • I'm not sure I follow the comment in the PR description, "It looks like the process of moving the segments over to this method resulted in a net increase in memory usage", while above that the segments are stated to decrease.

I moved some arrays (the pLS hit indices and their delta phi) onto the GPU in the pixelSegments SoA but corrected this with a later update. I removed the edit from the PR description and just list the current total memory savings for 0.8 GeV at 100 events to be self-consistent.

what's the plan with #147 ? I thought that you wanted that more or less together with this. Does this PR interfere with changes in #146 ?

I think it makes sense to merge this PR separately from #147. I think it's still unclear what the occupancies should be set to, and that will involve its own efficiency and memory changes. Therefore I think it's better to merge this in first before addressing the occupancy tuning in that PR.

@GNiendorf changed the title from Dynamic Memory Allocation to Dynamic Memory Limits for LST Objects Jan 30, 2025
@slava77 left a comment

this will also likely need a rebase very soon, once the upstream variant of #146 is merged

RecoTracker/LSTCore/src/alpaka/Quintuplet.h (outdated, resolved)
RecoTracker/LSTCore/src/alpaka/Segment.h (outdated, resolved)
RecoTracker/LSTCore/src/alpaka/Segment.h (outdated, resolved)
RecoTracker/LSTCore/src/alpaka/Triplet.h (outdated, resolved)
RecoTracker/LSTCore/src/alpaka/Triplet.h (resolved)
@slava77 commented Jan 30, 2025

For the segments, I had to rearrange the createPixelSegmentToEvent() function so that the segments object is created after the mini-doublets are created. Right now in the code the segments are created during the pLS loading, which is done before the MDs are created.

Mostly for the purpose of the CMSSW PR, for more clarity: perhaps instead of "MDs are created", say "the MD container is instantiated". Our create* methods in most cases also run the algorithms and populate the relevant containers.

@GNiendorf force-pushed the dynamic_occ branch 3 times, most recently from d9ef1f9 to c7b126c on January 31, 2025 15:14
@GNiendorf (Member Author)

/run all

@SegmentLinking deleted a comment from github-actions bot Jan 31, 2025

The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

[Comparison plots: efficiency, fake rate, and duplicate rate vs pT and eta]

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     49.1    382.4    193.8    160.3    158.9    699.0    132.8    253.4    177.7      3.8    2211.0    1463.0+/- 402.1     584.0   explicit[s=4] (target branch)
   avg     33.4    382.9    189.2    154.4    165.2    703.8    130.9    245.8    175.7      1.9    2183.2    1446.0+/- 403.3     576.1   explicit[s=4] (this PR)

@slava77 commented Jan 31, 2025

@GNiendorf
please run timing tests on GPU. The host-side per-object loops may show some significant slowdown

The PR was built and ran successfully with CMSSW. Here are some plots.

[Plots: OOTB all tracks; efficiency and fake rate vs pT, eta, and phi]

The full set of validation and comparison plots can be found here.

@ariostas (Member)

Working with IB releases definitely makes building the standalone part a bit more confusing. If we keep having to use IB builds we might need to tweak the setup.sh script.

Here's a timing comparison on cgpu-1.

This PR (2967c14)
Total Timing Summary
Average time for map loading = 761.337 ms
Average time for input loading = 8295.82 ms
Average time for lst::Event creation = 0.00356492 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg      1.4      0.5      0.4      1.6      1.3      0.5      0.9      0.6      1.1      0.0       8.2       6.2+/-  1.4       9.3   explicit[s=1]
   avg      1.3      0.7      0.6      1.9      1.5      0.6      1.2      0.7      1.4      0.0       9.9       8.0+/-  1.9       6.1   explicit[s=2]
   avg      3.1      1.1      1.0      2.4      2.0      1.1      2.2      1.3      2.2      0.0      16.5      12.3+/-  3.8       5.8   explicit[s=4]
   avg      3.6      1.9      1.7      3.6      3.3      1.5      3.3      1.5      3.7      0.0      24.2      19.1+/-  5.8       4.6   explicit[s=6]
   avg      5.5      2.7      2.5      4.8      4.3      2.0      4.2      2.0      4.8      0.0      32.8      25.3+/-  6.7       4.5   explicit[s=8]

master (7905921)
Total Timing Summary
Average time for map loading = 574.156 ms
Average time for input loading = 7868.15 ms
Average time for lst::Event creation = 0.00286102 ms
   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg      2.7      0.4      0.3      1.5      1.2      0.4      1.3      1.0      1.1      0.0      10.1       7.1+/-  1.4      11.1   explicit[s=1]
   avg      1.6      0.6      0.6      1.8      1.4      0.4      1.2      0.7      1.5      0.0       9.9       7.8+/-  1.6       6.1   explicit[s=2]
   avg      3.1      1.0      0.8      2.7      2.2      0.6      2.2      1.1      2.7      0.0      16.4      12.7+/-  2.4       4.7   explicit[s=4]
   avg      6.0      1.5      1.1      3.9      3.2      0.7      3.4      1.8      3.9      0.0      25.5      18.8+/-  4.7       4.7   explicit[s=6]
   avg      9.1      2.1      1.4      4.8      4.2      0.8      4.2      2.2      4.9      0.0      33.7      23.8+/-  4.9       4.6   explicit[s=8]

@GNiendorf (Member Author)

Thank you @ariostas!

@slava77 commented Jan 31, 2025

Here's a timing comparison on cgpu-1.

Thank you; for the single-stream case perhaps I should ignore the Hits column.
