
Replace Second Delta Beta Cut with DNN #126

Conversation

@GNiendorf (Member) commented Nov 19, 2024

On the 1000-event RelVal plots (not shown here yet), this PR gives higher efficiency at a lower fake rate. However, considering that issue #123 was fixed, I think it only makes sense to merge this PR if I can replace both delta beta cuts with a performance improvement. Leaving this PR as a draft for now.

This is a continuation of PR #122, except that here I heavily downsample 80%-matched tracks during training to get better overall performance and a DNN cut that behaves more like the existing delta beta cuts.
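For reference, a minimal sketch of what the downsampling step could look like (this is not the actual training notebook; the array names, the 0.8 match-fraction convention, and the keep fraction are all assumptions):

```python
# Hypothetical sketch: downsample the 80%-matched class before training so the
# loss is not dominated by it. Names and thresholds are illustrative only.
import numpy as np

rng = np.random.default_rng(42)

def downsample_matched(features, match_frac, keep_frac=0.2):
    """Keep every track except a random `keep_frac` subset of those whose
    truth-match fraction is 0.8 (the '80% matched' class)."""
    is_80 = np.isclose(match_frac, 0.8)
    keep = ~is_80 | (rng.random(len(match_frac)) < keep_frac)
    return features[keep], match_frac[keep]

# Toy example with random data, just to show the shapes.
X = rng.normal(size=(10_000, 10))
frac = rng.choice([0.0, 0.8, 1.0], size=10_000, p=[0.3, 0.5, 0.2])
X_ds, frac_ds = downsample_matched(X, frac)
print(len(X), "->", len(X_ds))
```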

Edit: It looks like this method of downsampling 80%-matched tracks works for the second delta beta cut, but not for the first. The more fundamental issue seems to be network size when training on the larger sample: increasing the network size substantially fixes the issue without any downsampling, but I think it would take a substantial amount of effort to make a larger model whose timing is competitive with the current hybrid approach. Closing this PR for now. I'm going to shift my focus to loading the weights properly and hopefully getting a timing improvement with lower-precision weights.

@GNiendorf (Member Author)

/run all


The PR was built and ran successfully in standalone mode. Here are some of the comparison plots.

[Comparison plots: efficiency, fake rate, and duplicate rate vs pT and eta]

The full set of validation and comparison plots can be found here.

Here is a timing comparison:

   Evt    Hits       MD       LS      T3       T5       pLS       pT5      pT3      TC       Reset    Event     Short             Rate
   avg     43.4    319.4    115.6     73.3    118.6    508.7    126.4    138.2    140.2      2.4    1586.1    1034.1+/- 275.2     434.9   explicit_cache[s=4] (target branch)
   avg     46.7    324.9    116.8     73.9    114.1    498.2    125.3    137.9    144.9      3.1    1585.7    1040.8+/- 272.3     434.5   explicit_cache[s=4] (this PR)


The PR was built and ran successfully with CMSSW. Here are some plots.

[OOTB All Tracks: efficiency and fake rate vs pT, eta, and phi]

The full set of validation and comparison plots can be found here.

@GNiendorf (Member Author)

@slava77 It looks like this method of downsampling 80%-matched tracks works for the second delta beta cut, but not for the first. The more fundamental issue seems to be network size when training on the larger sample: increasing the network size substantially fixes the issue without any downsampling, but I think it would take a substantial amount of effort to make a larger model whose timing is competitive with the current hybrid approach. Closing this PR for now. I'm going to shift my focus to loading the weights properly and hopefully getting a timing improvement with lower-precision weights.
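For the lower-precision idea, a rough sketch of what this could look like on the training side (the architecture below is a placeholder, not the actual T5 DNN from this PR):

```python
# Hedged sketch: cast a trained PyTorch MLP to float16 before exporting its
# weights, e.g. for embedding into the C++/CUDA side. Layer sizes are made up.
import torch
import torch.nn as nn

model = nn.Sequential(
    nn.Linear(10, 32), nn.ReLU(),
    nn.Linear(32, 32), nn.ReLU(),
    nn.Linear(32, 1), nn.Sigmoid(),
)

# After training, convert all parameters to half precision.
model_fp16 = model.half()

# Export the weights as numpy arrays for later serialization.
weights = {name: p.detach().cpu().numpy() for name, p in model_fp16.named_parameters()}
for name, w in weights.items():
    print(name, w.dtype, w.shape)
```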

@GNiendorf closed this Nov 21, 2024
@GNiendorf (Member Author)

404795f

Here is the big DNN code with the associated training notebook, for future reference.

@slava77 commented Nov 21, 2024

> 404795f
>
> Here is the big DNN code with the associated training notebook, for future reference.

Two more layers and double the hidden features, right?
How much slower is it?

@GNiendorf (Member Author) commented Nov 21, 2024

> Two more layers and double the hidden features, right? How much slower is it?

The number of parameters increases by a factor of ~7.6x, and the T5 timing increases by a factor of 12x (1 ms -> 12 ms). I didn't check whether a smaller network would also fix the issue, though.
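For context, a quick way to sanity-check parameter-count ratios like this is to count the weights and biases of the dense stack directly. The layer widths below are purely hypothetical, since the exact sizes of the small and big networks are not spelled out in this thread:

```python
# Hypothetical sketch: parameter count of a fully connected network given its
# list of layer widths (input, hidden..., output).
def mlp_params(widths):
    """Weights + biases of a dense stack, e.g. widths = [10, 32, 32, 1]."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:]))

small = mlp_params([10, 32, 32, 1])          # hypothetical baseline
big   = mlp_params([10, 64, 64, 64, 64, 1])  # doubled width, two extra layers
print(small, big, f"ratio ~{big / small:.1f}x")
```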

@slava77 commented Nov 22, 2024

> Two more layers and double the hidden features, right? How much slower is it?
>
> The number of parameters increases by a factor of ~7.6x, and the T5 timing increases by a factor of 12x (1 ms -> 12 ms). I didn't check whether a smaller network would also fix the issue, though.

Running a profiler may help.
2x more hidden features would give roughly 2^3 in the matrix computation, and with 2x the layers that's ~16x. Still, I thought the DNN computation was not the leading term within the original 1 ms.
I suspect memory constraints, i.e. that the new weights no longer fit in the SM.
Perhaps only 50% more hidden features would fit (at least for one per-layer matrix).
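As a rough illustration of the memory question (not from the PR itself; the layer widths and the memory budget below are assumptions), one can estimate the raw weight storage and compare it to a typical per-block shared memory limit, which is on the order of a few tens of KiB on current NVIDIA GPUs:

```python
# Hypothetical sketch: raw weight storage for assumed layer widths, in fp32 vs fp16.
def weight_bytes(widths, bytes_per_param):
    params = sum(n_in * n_out + n_out for n_in, n_out in zip(widths[:-1], widths[1:]))
    return params * bytes_per_param

for label, widths in [("baseline (hypothetical)", [10, 32, 32, 1]),
                      ("bigger (hypothetical)",   [10, 64, 64, 64, 64, 1])]:
    print(label,
          f"fp32: {weight_bytes(widths, 4) / 1024:.1f} KiB,",
          f"fp16: {weight_bytes(widths, 2) / 1024:.1f} KiB")
# Whether a given configuration fits depends on the real widths and on how much
# shared memory the rest of the kernel already uses.
```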
