Map disk cache #132

Merged

jaycedowell merged 63 commits from jaycedowell:map-disk-cache into ledatelescope:master on Mar 28, 2022

Conversation

jaycedowell
Collaborator

Adds a disk-based caching system for bifrost.map() calls to help reduce start-up time in pipelines that are run in the same configuration multiple times. The feature can be turned off by setting the "NOMAPCACHE" variable to 1 in the user.mk file.

…peed things up between calls of the same code. There needs to be some more work on this to make sure the cache is invalidated when it needs to be.
…er-kernel storage format. The per-kernel storage format consists of a '.inf' file that holds information about the file and a '.ptx' file that stores the actual kernel.
… the Python to make it easy to clear the map cache.
@coveralls

Coverage Status

Coverage decreased (-0.02%) to 61.899% when pulling 90291f0 on jaycedowell:map-disk-cache into 72e0315 on ledatelescope:master.

@coveralls

coveralls commented Oct 2, 2019

Coverage Status

Coverage decreased (-0.02%) to 61.343% when pulling ab4ece0 on jaycedowell:map-disk-cache into 1681fde on ledatelescope:master.

@codecov-io

codecov-io commented Oct 2, 2019

Codecov Report

Merging #132 into master will decrease coverage by <.01%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
- Coverage   58.78%   58.77%   -0.01%     
==========================================
  Files          63       63              
  Lines        5095     5097       +2     
==========================================
+ Hits         2995     2996       +1     
- Misses       2100     2101       +1
| Impacted Files | Coverage Δ |
|---|---|
| python/bifrost/__init__.py | 80% <100%> (ø) ⬆️ |
| python/bifrost/map.py | 21.05% <50%> (+1.6%) ⬆️ |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 72e0315...90291f0. Read the comment docs.

@jaycedowell
Collaborator Author

Oh, the "coverage" thing again.

jaycedowell added a commit to jaycedowell/bifrost that referenced this pull request Mar 18, 2020
@telegraphic
Collaborator

+1 for this functionality -- as this is C++-level I would defer to others for review, though.

@codecov-commenter

codecov-commenter commented Jul 23, 2020

Codecov Report

Merging #132 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #132   +/-   ##
=======================================
  Coverage   61.62%   61.62%           
=======================================
  Files          64       64           
  Lines        5300     5300           
=======================================
  Hits         3266     3266           
  Misses       2034     2034           

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a53b35b...2874bd7. Read the comment docs.

@codecov-io

codecov-io commented Feb 13, 2021

Codecov Report

Merging #132 (f6b075e) into master (e1cea0b) will decrease coverage by 2.87%.
The diff coverage is 66.66%.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
- Coverage   61.62%   58.75%   -2.88%     
==========================================
  Files          64       64              
  Lines        5300     5302       +2     
==========================================
- Hits         3266     3115     -151     
- Misses       2034     2187     +153     
| Impacted Files | Coverage Δ |
|---|---|
| python/bifrost/map.py | 22.00% <50.00%> (-55.09%) ⬇️ |
| python/bifrost/__init__.py | 80.76% <100.00%> (ø) |
| python/bifrost/fir.py | 0.00% <0.00%> (-100.00%) ⬇️ |
| python/bifrost/romein.py | 0.00% <0.00%> (-100.00%) ⬇️ |
| python/bifrost/blocks/scrunch.py | 37.03% <0.00%> (-59.26%) ⬇️ |
| python/bifrost/linalg.py | 36.84% <0.00%> (-52.64%) ⬇️ |
| python/bifrost/fft.py | 45.00% <0.00%> (-50.00%) ⬇️ |
| python/bifrost/fdmt.py | 50.00% <0.00%> (-50.00%) ⬇️ |
| python/bifrost/transpose.py | 41.66% <0.00%> (-50.00%) ⬇️ |

... and 4 more

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e1cea0b...f6b075e. Read the comment docs.

@codecov-commenter

codecov-commenter commented Apr 20, 2021

Codecov Report

Merging #132 (d75cf87) into master (a57a83f) will decrease coverage by 0.13%.
The diff coverage is 23.07%.

@@            Coverage Diff             @@
##           master     #132      +/-   ##
==========================================
- Coverage   58.50%   58.36%   -0.14%     
==========================================
  Files          67       67              
  Lines        5731     5753      +22     
==========================================
+ Hits         3353     3358       +5     
- Misses       2378     2395      +17     
| Impacted Files | Coverage Δ |
|---|---|
| python/bifrost/version/__main__.py | 0.00% <ø> (ø) |
| python/bifrost/map.py | 24.00% <20.00%> (-2.00%) ⬇️ |
| python/bifrost/__init__.py | 80.00% <100.00%> (ø) |
| python/bifrost/ndarray.py | 73.15% <0.00%> (ø) |
| python/bifrost/telemetry/__main__.py | 0.00% <0.00%> (ø) |

Continue to review full report at Codecov.

Legend:
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update a57a83f...d75cf87. Read the comment docs.

@league
Collaborator

league commented Mar 25, 2022

I should be able to try this PR again today, on CUDA 11. I'll also start to look at one of the others, whatever is most relevant or ready: is that mmap ring space #135?

@jaycedowell
Collaborator Author

Cool, thanks. I just finished going through all of my tests (again) under 10.1 and things still seem ok.

If you did want to look at another one, either #135 or #167 would work. #167 is short, so it has that going for it.

@league
Collaborator

league commented Mar 26, 2022

Tests worked great for me on CUDA 11; I tried with and without the map cache enabled. It makes a huge difference in speed. Just out of curiosity, I messed with overwriting one of the files with random bytes and, sure enough, it core dumped. Not surprising, and not worth detecting… just trying to push the limits, find the "robustness boundary." Then I tried editing one of the numbers in cache.version, and it seemed to hang instead of dump. (But that's with the corrupt file still there. Not sure whether that's significant. If the version doesn't match expectations, we shouldn't be trying to read one of the other files?)

If I delete the whole directory and start over, it works again of course. Then if I change cache.version without also corrupting one of the other files, it seems to replace it cleanly and keep moving. So I think that's okay.

I started to experiment with adding a fileutils.cpp in addition to the hpp but it's not ready yet, and maybe not necessary. I'm also slightly uncomfortable with dumping together a command line and calling system() in these routines, but I get it… it's definitely easier to use mkdir -p and rm -rf and *.inf than to code those things in C++ (or maybe even to find the libraries to do it).

@league
Collaborator

league commented Mar 26, 2022

I remember seeing an issue about machines with >1 GPU, and I'm wondering if the map cache makes that any worse… is it embedding GPU-0 architecture info? Or what if $HOME is shared across machines with different GPUs? Maybe not worth worrying about, since if something goes wrong it's a simple matter to wipe the directory and start fresh. Just thinking out loud about potential complications.

@jaycedowell
Collaborator Author

> Then I tried editing one of the numbers in cache.version, and it seemed to hang instead of dump. (But that's with the corrupt file still there. Not sure whether that's significant. If the version doesn't match expectations, we shouldn't be trying to read one of the other files?)

The hang is surprising. The cache.version should have been verified first, and then the entire cache should have been dropped when the version information was wrong.

> I started to experiment with adding a fileutils.cpp in addition to the hpp but it's not ready yet, and maybe not necessary. I'm also slightly uncomfortable with dumping together a command line and calling system() in these routines, but I get it… it's definitely easier to use mkdir -p and rm -rf and *.inf than to code those things in C++ (or maybe even to find the libraries to do it).

I had thought about this too. The options are:

  • move to C++17 and use the new filesystem library (sketched below),
  • stay with C++11 and use boost for their boost::filesystem library (it's what the C++17 one is based on),
  • use C system calls to implement mkdir/rm and globbing, or
  • stick with system().
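
For reference, a rough sketch of what the first option could look like with C++17's <filesystem>; the function names and the '.inf' suffix here are illustrative, not what this PR implements:

```
#include <filesystem>
#include <string>
#include <vector>

namespace fs = std::filesystem;

// `mkdir -p` equivalent: creates intermediate directories, no error if it already exists.
void make_cache_dir(const std::string& path) {
    fs::create_directories(path);
}

// `rm -rf` equivalent: removes the directory and everything under it.
void clear_cache_dir(const std::string& path) {
    fs::remove_all(path);
}

// `*.inf` glob equivalent: walk the directory and filter on extension.
std::vector<fs::path> list_inf_files(const std::string& path) {
    std::vector<fs::path> matches;
    for( const auto& entry : fs::directory_iterator(path) ) {
        if( entry.is_regular_file() && entry.path().extension() == ".inf" ) {
            matches.push_back(entry.path());
        }
    }
    return matches;
}
```

boost::filesystem offers near-identical calls, so switching between options 1 and 2 later would likely be a small change.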

My g++ 10.3 was throwing this warning, and I'm reading that it could mean that subclass exceptions would not match. There are other places in the code that use `const&` in a `catch`, so I assume the fix is compatible with other compilers we encounter.

```
map.cpp:425:21: warning: catching polymorphic type ‘class std::exception’ by value [-Wcatch-value=]
  425 |       } catch( std::exception ) {}
      |                     ^~~~~~~~~
```
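
For context, the fix the warning asks for is just catching by const reference, matching the `const&` catches used elsewhere; a minimal sketch (the wrapper function and placeholder body are hypothetical, the real code lives in map.cpp):

```
#include <stdexcept>

void lookup_with_fallback() {  // hypothetical wrapper around the code map.cpp:425 guards
    try {
        throw std::runtime_error("placeholder for the guarded cache-lookup code");
    } catch( const std::exception& ) {
        // Same empty handler as before, but caught by const reference so derived
        // exception types still match and nothing gets sliced.
    }
}
```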
@jaycedowell
Collaborator Author

> I remember seeing an issue about machines with >1 GPU, and I'm wondering if the map cache makes that any worse… is it embedding GPU-0 architecture info? Or what if $HOME is shared across machines with different GPUs?

It shouldn't make things worse on multi-GPU systems, but it might be problematic if those GPUs are heterogeneous. The cache is stored as PTX, which should be forward compatible. The hitch is that when bifrost.map generates the PTX it targets the architecture of the GPU it was requested on, and that is what gets saved to the disk cache. On a mixed-GPU system you could end up caching a function for a compute capability 75 device, saving it as 75, and then trying to load and use it from a 52 device. I don't think that works.

The alternative would be to do something like always generate the PTX for the lowest compute capability that Bifrost was compiled for if the disk cache is enabled. That would solve my above problems, maybe with a little slowdown, but be largely unchanged in the homogeneous GPU case.
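
A minimal sketch of that idea, with illustrative names (Bifrost's real build variables and cache flag may differ):

```
#include <algorithm>
#include <vector>

// compiled_archs: compute capabilities the build was configured for, e.g. {52, 61, 75}
// device_arch:    compute capability of the GPU bifrost.map was asked to run on
int select_ptx_target(const std::vector<int>& compiled_archs,
                      int device_arch,
                      bool disk_cache_enabled) {
    if( disk_cache_enabled && !compiled_archs.empty() ) {
        // Targeting the lowest arch keeps the cached PTX forward compatible with
        // every GPU in a mixed system, at the cost of not using newer-arch features.
        return *std::min_element(compiled_archs.begin(), compiled_archs.end());
    }
    return device_arch;  // current behavior: target the GPU the call runs on
}
```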

@league
Collaborator

league commented Mar 26, 2022

  • Regarding cache.version, yeah I'm not sure. It was a hard enough hang that ^C didn't do the trick; I had to kill it by PID. Could explore/trace more carefully if it's worrisome.
  • Regarding C++ filesystem: there seems to still be a lot of Ubuntu 18 out there. If there's not a C++17-supporting gcc in its regular apt sources, it might be too early to make that switch. Boost may be plausible, especially if it wouldn't take major changes to reroute it to the C++ standard library later on. “Hard pass” on implementing mkdir -p, rm -rf and glob on our own in C++.
  • Regarding GPU heterogeneity, it's likely an obscure situation. But would it be relatively easy to, rather than encode the architecture in cache.version, use a separate subdir per architecture/CUDA version/whatever, like $HOME/.bifrost/map_cache/ARCH-VERSION-TAG/*.ptx (sketched after this comment)? Then if you end up on a different architecture, the appropriate cache would just appear to be absent, rather than existing but with the wrong arch. Does that make sense?

I don't think any of this necessarily has to hold up this branch. Could leave a TODO here or there as a reminder for things to pursue in the future or revisit if there's a relevant bug report. Or write it as a low-priority github issue with appropriate label (wish list, improvement, for future consideration, etc.)?
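
A minimal sketch of the per-architecture layout suggested in the last bullet, assuming the $HOME/.bifrost/map_cache base path from that comment; the tag format is illustrative:

```
#include <cstdlib>
#include <string>

// Builds e.g. "$HOME/.bifrost/map_cache/sm75-cuda11.4", so a different
// architecture or CUDA version simply looks like an empty cache rather
// than an existing cache with the wrong arch.
std::string map_cache_dir(int cc_major, int cc_minor, const std::string& cuda_version) {
    const char* home = std::getenv("HOME");
    std::string base = (home != nullptr) ? std::string(home) : std::string(".");
    return base + "/.bifrost/map_cache/sm" + std::to_string(cc_major)
                + std::to_string(cc_minor) + "-cuda" + cuda_version;
}
```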

@jaycedowell
Collaborator Author

Your point on splitting up the cache by arch does make sense, but I'm not sure how necessary that is if we stick with PTX for the cache. As far as I can tell, PTX gets interpreted by the CUDA driver and JIT compiled to a binary when it is run. So as long as we aren't using features in the PTX that the card cannot support, it should be ok to have a flat cache. That should be true if we always target the lowest arch requested during the install. Getting fancier is an option, but I think how we do that is going to need some more thought about the details of compatibility and what CUDA really does with PTX files/binaries. Maybe a "TODO" is the way to go.

jaycedowell and others added 2 commits March 26, 2022 08:49
I couldn't get access to it by importing `bifrost.map`, because of the
collision between the module `map` and the function `map` within
`bifrost.map`.
@league
Collaborator

league commented Mar 26, 2022

Diagnosed the hang that I experienced before. After it core-dumped because I corrupted a ptx file, it left behind the .lock file. So next time when trying to open the cache, it was waiting on the lock. For whatever that's worth. I was messing around and probably got what I deserved. :)

@jaycedowell
Collaborator Author

I ran through my tests again this morning and I think I'm happy with this. @league if you are happy with it too then I will merge it.

@league
Collaborator

league commented Mar 28, 2022

LGTM, ship it. 🚢

@jaycedowell jaycedowell merged commit 4dd48d6 into ledatelescope:master Mar 28, 2022
@jaycedowell jaycedowell deleted the map-disk-cache branch March 28, 2022 17:03