
Move to Alpaka Memory + Alpaka 0.9 + CMSSW Alpaka Caching Allocator #292

Merged: 44 commits merged into SegmentLinking:alpaka_move on Jul 17, 2023

Conversation

@GNiendorf (Member) commented May 31, 2023

This PR moves the codebase's memory management over to Alpaka. It also removes unused memory functions in a few files and gets rid of the unused PrintUtil files in SDL (along with a couple of unused print functions). Lastly, since Alpaka is now included in the main makefile, some of the required Alpaka statements move from Event.cuh to Constants.cuh. See the timing and performance plots below. Note that the timing is slower because the Alpaka caching allocator is not yet in place.
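To illustrate the kind of change this migration implies, here is a hedged, self-contained sketch (not code from this PR; the buffer name, size, and the CUDA back-end choice are assumptions): raw cudaMalloc/cudaMemcpy/cudaFree calls become Alpaka buffers and queue operations.

```cpp
// Hedged sketch (Alpaka 0.9 API, CUDA back-end assumed; names are illustrative).
// Device memory is owned by Alpaka buffers instead of raw pointers from cudaMalloc,
// so it is released automatically, and copies are issued on an Alpaka queue.
#include <alpaka/alpaka.hpp>

int main() {
    using Dim = alpaka::DimInt<1u>;
    using Idx = unsigned int;
    using Acc = alpaka::AccGpuCudaRt<Dim, Idx>;  // assumption: CUDA back-end, as in this PR

    auto const devHost = alpaka::getDevByIdx<alpaka::PltfCpu>(0u);
    auto const devAcc = alpaka::getDevByIdx<alpaka::Pltf<alpaka::Dev<Acc>>>(0u);
    auto queue = alpaka::Queue<Acc, alpaka::NonBlocking>{devAcc};

    Idx const nHits = 1024u;  // illustrative size
    auto const extent = alpaka::Vec<Dim, Idx>{nHits};

    // Host and device buffers; both free their memory when they go out of scope,
    // so there is no cudaFree to forget.
    auto hitsHost = alpaka::allocBuf<float, Idx>(devHost, extent);
    auto hitsDev = alpaka::allocBuf<float, Idx>(devAcc, extent);

    for (Idx i = 0; i < nHits; ++i)
        alpaka::getPtrNative(hitsHost)[i] = 0.5f * static_cast<float>(i);

    // Asynchronous host-to-device copy on the queue, then wait for it to complete.
    alpaka::memcpy(queue, hitsDev, hitsHost, extent);
    alpaka::wait(queue);
}
```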

@GNiendorf (Member, Author) commented Jun 1, 2023

Timing (Ignore it being slower - this is without the alpaka caching allocator!)
[Timing screenshot, 2023-06-01]

Validation Plots - Here
Comparison to Master (pre-DNN) - Here

@GNiendorf GNiendorf marked this pull request as ready for review June 2, 2023 15:44
@GNiendorf GNiendorf requested a review from VourMa June 2, 2023 15:44
@GNiendorf GNiendorf linked an issue Jun 2, 2023 that may be closed by this pull request
@GNiendorf GNiendorf mentioned this pull request Jun 2, 2023
@GNiendorf (Member, Author) commented

I fixed a small bug in the last commit: segmentsInGPU was not being deleted when the Event destructor was called at the end of a run, which led to a mismatch between the number of cudaMalloc and cudaFree calls.
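For context, this is exactly the kind of bookkeeping the Alpaka buffers remove. A hedged sketch with hypothetical names (not the actual Event/segments code): holding the device memory as a buffer member means it is released when the event object is destroyed, so allocations and frees stay matched automatically.

```cpp
// Hedged sketch with hypothetical names, not the PR's actual Event class.
// An Alpaka buffer held by value (here inside std::optional, so it can be created
// lazily) releases its device memory on destruction, so the destructor cannot
// forget a cudaFree the way a raw-pointer member could.
#include <alpaka/alpaka.hpp>
#include <optional>

template <typename TDev, typename TIdx>
struct EventSketch {
    using Dim1 = alpaka::DimInt<1u>;
    using SegmentsBuf = alpaka::Buf<TDev, float, Dim1, TIdx>;

    std::optional<SegmentsBuf> segmentsBuf;  // replaces a raw segmentsInGPU-style pointer

    void createSegments(TDev const& dev, TIdx nSegments) {
        segmentsBuf.emplace(
            alpaka::allocBuf<float, TIdx>(dev, alpaka::Vec<Dim1, TIdx>{nSegments}));
    }
    // No user-written destructor needed: the buffer frees itself when the event is destroyed.
};
```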

@GNiendorf changed the title from "Segments.cu Alpaka Memory" to "Segments.cu + Hit.cu Alpaka Memory" on Jun 7, 2023
@GNiendorf GNiendorf marked this pull request as draft June 7, 2023 23:09
@GNiendorf changed the title from "Segments.cu + Hit.cu Alpaka Memory" to "Move to Alpaka Memory" on Jun 7, 2023
@GNiendorf (Member, Author) commented

Putting this back as a Draft PR since moving over all of the files to Alpaka memory shouldn't take long.

Here are the current timing and validation plots:
[Timing screenshot, 2023-06-07]

Validation Plots - Here
Comparison to Master - Here

@GNiendorf GNiendorf linked an issue Jun 8, 2023 that may be closed by this pull request
@GNiendorf changed the title from "Move to Alpaka Memory" to "Move to Alpaka Memory + Higher Version of Alpaka" on Jun 27, 2023
@GNiendorf changed the title from "Move to Alpaka Memory + Higher Version of Alpaka" to "Move to Alpaka Memory + Alpaka 0.9" on Jun 27, 2023
@GNiendorf (Member, Author) commented

Timing (now on CGPU-1, so not directly comparable to the previous timings yet)
[Timing screenshot, 2023-06-27]

@GNiendorf (Member, Author) commented Jul 5, 2023

I got things to work on lnx7188 again. Here are the current timing results. Keep in mind that the caching allocator is still buggy, so take these timings with a grain of salt for now: the TC timing blows up with the caching allocator because a very long CUDA stream sync is being triggered for no apparent reason, and the caching-allocator version also throws a runtime error during ntuple writing.

Relevant Master Timing (CUDA only, with the CUDA caching allocator)
[Timing screenshot, 2023-07-05]

This PR, all Alpaka, no caching allocator
[Timing screenshot, 2023-07-05]

This PR, all Alpaka, with the Alpaka caching allocator
[Timing screenshot, 2023-07-05]

Current Alpaka branch (with Alpaka kernels but CUDA memory and the CUDA caching allocator)
[Timing screenshot, 2023-05-17]

@GNiendorf changed the title from "Move to Alpaka Memory + Alpaka 0.9" to "Move to Alpaka Memory + Alpaka 0.9 + CMSSW Alpaka Caching Allocator" on Jul 5, 2023
@slava77 (Contributor) commented Jul 5, 2023

> Notice how the TC timing goes crazy with the caching allocator because there is a super long cuda stream sync being called for no apparent reason

Do you happen to know when it's called? Before or after the TC kernel, or around the related alloc/read/writes?

@GNiendorf (Member, Author) commented

> Notice how the TC timing goes crazy with the caching allocator because there is a super long cuda stream sync being called for no apparent reason
>
> Do you happen to know when it's called? Before or after the TC kernel, or around the related alloc/read/writes?

It's hanging on the removeDupQuintupletsInGPUBeforeTC kernel. I have to look into it more.
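(For reference, and not from this PR: one way to narrow down where a hang like this sits is to bracket the kernel launch and the queue wait with host-side timers, which shows whether the block happens at launch time or only when the queue is drained. A hedged, self-contained sketch with a dummy kernel; all names are illustrative and the CUDA back-end is assumed.)

```cpp
// Hedged sketch: time the kernel launch and the queue wait separately to see
// whether a suspected stream sync blocks at launch or only at synchronization.
#include <alpaka/alpaka.hpp>
#include <chrono>
#include <cstdio>

struct DummyKernel {
    template <typename TAcc>
    ALPAKA_FN_ACC void operator()(TAcc const& acc, float* out, unsigned int n) const {
        auto const i = alpaka::getIdx<alpaka::Grid, alpaka::Threads>(acc)[0u];
        if (i < n) out[i] = 2.0f * static_cast<float>(i);
    }
};

int main() {
    using Dim = alpaka::DimInt<1u>;
    using Idx = unsigned int;
    using Acc = alpaka::AccGpuCudaRt<Dim, Idx>;  // assumption: CUDA back-end
    using Clock = std::chrono::steady_clock;

    auto const devAcc = alpaka::getDevByIdx<alpaka::Pltf<alpaka::Dev<Acc>>>(0u);
    auto queue = alpaka::Queue<Acc, alpaka::NonBlocking>{devAcc};

    Idx const n = 1u << 20;
    auto buf = alpaka::allocBuf<float, Idx>(devAcc, alpaka::Vec<Dim, Idx>{n});

    auto const workDiv = alpaka::WorkDivMembers<Dim, Idx>{
        alpaka::Vec<Dim, Idx>{(n + 255u) / 256u},  // blocks
        alpaka::Vec<Dim, Idx>{256u},               // threads per block
        alpaka::Vec<Dim, Idx>{1u}};                // elements per thread

    alpaka::wait(queue);  // start from an idle queue
    auto const t0 = Clock::now();
    alpaka::exec<Acc>(queue, workDiv, DummyKernel{}, alpaka::getPtrNative(buf), n);
    auto const t1 = Clock::now();  // launch returned (should be quick on a non-blocking queue)
    alpaka::wait(queue);           // blocks until the kernel has actually finished
    auto const t2 = Clock::now();

    auto us = [](auto d) {
        return static_cast<long>(
            std::chrono::duration_cast<std::chrono::microseconds>(d).count());
    };
    std::printf("launch: %ld us, wait: %ld us\n", us(t1 - t0), us(t2 - t1));
}
```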

@GNiendorf (Member, Author) commented Jul 5, 2023

Fixed it. New timing looks great:
[Timing screenshot, 2023-07-05]

Another timing run. This is the first time we can see the full effect of the caching allocator.
[Timing screenshot, 2023-07-05]

@GNiendorf (Member, Author) commented

Unfortunately, the ntuple-writing runtime error was not fixed by moving to a newer version of the caching allocator. I'm looking into it more, but I may open the PR for review even if I can't find a solution, since it only affects ntuple writing with the caching allocator turned on.

SDL/MiniDoublet.cuh: two review threads (outdated, resolved)
@GNiendorf GNiendorf marked this pull request as ready for review July 14, 2023 17:02
@GNiendorf (Member, Author) commented Jul 14, 2023

Good to go now @VourMa.

Timings below are averages over 10 runs and are shown next to their corresponding standard uncertainties.
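(For reference, assuming the quoted standard uncertainty is the standard error of the mean over the N = 10 runs, it would be computed as:)

```latex
\bar{t} = \frac{1}{N}\sum_{i=1}^{N} t_i, \qquad
s = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\bigl(t_i - \bar{t}\bigr)^2}, \qquad
\sigma_{\bar{t}} = \frac{s}{\sqrt{N}}, \quad N = 10 .
```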

Timing - lnx7188
[Timing plot: without_half]

Timing - cgpu1
[Timing plot: no_half_cgpu]

Validation Plots - Here
Comparison to Relevant Master - Here

@GNiendorf (Member, Author) commented

Turning off the half precision code allowed the compiler to find two unused variables in one of the smaller kernels.

@GNiendorf (Member, Author) commented

After speaking with @VourMa over Skype, we've agreed to merge this PR into the Alpaka branch first, rebase onto the current master, and then do a final review of the Alpaka branch before merging it into master.

@GNiendorf GNiendorf merged commit e73ab28 into SegmentLinking:alpaka_move Jul 17, 2023
@VourMa (Contributor) left a comment

First set of comments is there - to be continued...

I have a general question: What are the files under code/alpaka_interface supposed to do? Are they helper functions for the Alpaka memory management? If so, could you add a short description of their purpose?

Review comment (Contributor):

Please note that this file is obsolete, replaced by setup_ucsd.sh. I'm not sure whether this was meant to be fixed in the rebase or was missed.

Also, has the whole setup been tested on any of the UCSD machines?

Also, I see that setup_lnx7188.sh has switched to el8. Could the two files be reconciled once again?

Review comment (Contributor):

Since part of the Alpaka migration is the cleanup of the CPU code, is this file (and its corresponding implementation) being used anywhere, or can they be removed (instead of moved)?

@VourMa (Contributor) commented Jul 18, 2023

Sorry, wrong PR, copying to the proper one...

Development

Successfully merging this pull request may close these issues:
Incorrect and Ambiguous Uses of Cuda Memset
Considerations on memory management
3 participants