We need to upgrade our testing infrastructure soon-ish. #139

Closed · ali-ramadhan opened this issue Mar 19, 2019 · 14 comments · Fixed by #872
Labels: help wanted 🦮 plz halp (guide dog provided) · testing 🧪 Tests get priority in case of emergency evacuation

Comments

@ali-ramadhan (Member) commented Mar 19, 2019

Right now all our tests are lumped into a single suite (unit, integration, and model verification tests together), and we run that suite on both the CPU and the GPU (most tests are shared).

This is not a high-priority item right now, but it's already annoying that I have to wait several minutes for the GPU tests to run while I'm debugging, so I'm just starting a discussion around this topic.

I can see us hitting some limitations soon:

  1. A comprehensive test suite will take long enough to run that we cannot keep rerunning it during development and debugging.
  2. Comprehensive model verification tests (or system tests?) will take even longer to run and are absolutely crucial (see Model verification tests #81 and Validation and performance benchmarks #136), so this problem will get worse in the future.
  3. GPU tests take a while to run because of long compile time (How to reduce compile time for GPU code? #66) and they run on top of all the CPU tests. In general, setting up GPU models takes more time, so it's not ideal that we're setting up tons of tiny models for testing. Testing GPU stuff may also involve some expensive scalar CUDA operations (see Disable slow fallback methods for CUDA #82); a guard against this is sketched below.
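
The guard is cheap: disallow scalar indexing for the whole test run so any slow fallback errors out immediately instead of silently crawling. A minimal sketch, assuming the `allowscalar` switch from today's CUDA.jl (at the time of this issue the equivalent lived in CuArrays.jl):

```julia
using CUDA, Test

# Make scalar indexing of GPU arrays an error for the whole test run,
# so element-by-element fallbacks error out instead of silently running slowly.
CUDA.allowscalar(false)

@testset "no accidental scalar GPU operations" begin
    a = CUDA.zeros(16)

    a .= 1f0                              # broadcasted kernels are fine
    @test sum(a) == 16f0                  # GPU reductions are fine

    @test CUDA.@allowscalar(a[1]) == 1f0  # scalar access must be opted into explicitly
    @test_throws ErrorException a[1]      # anything else now throws instead of crawling
end
```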

We will also need to run the test suite on the following architectures in the future:

  1. single-core CPU (Travis CI and Appveyor are fine here)
  2. single GPU (JuliaGPU's GitLab CI pipeline works great here)
  3. multi-core single CPU (MPI) (paid CI plans will probably work here)
  4. multiple distributed CPU nodes (MPI) (no idea where to run this)
  5. multiple GPUs (MPI) (no idea where to run this)

Some ideas for things to do that will help:

  1. Explicitly split the tests up into 2-3 suites (see the sketch after this list):
    1.1. Unit tests: should run in a few minutes so we can run them during development and on every commit/PR/etc.
    1.2. Integration tests: can take a while to run, so we don't want to run them locally all the time, but they should probably run on every PR. Shouldn't take much more than 1 hour to run so we don't have to wait forever to merge PRs.
    1.3. Model verification tests (also called end-to-end tests): will probably take a long time to run. Maybe run these once a day? Or manually if there's a PR that changes core functionality.
  2. Run the tests in parallel. I think the main Julia repo does this. We might have to roll our own parallel solution (see this thread). This would also require expensive paid CI plans (but very much worth it in my opinion).
  3. Thinking long-term, if we had a multi-CPU multi-GPU machine available we could probably roll our own CI solution for these distributed architectures. Ideally we'd want to see if we can get this through a service, although it would probably cost $$$$$.
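
A minimal sketch of how the suite split could be driven from `test/runtests.jl` with an environment variable; the variable name `OCEANANIGANS_TEST_GROUP` and the test file names are hypothetical, just to show the shape:

```julia
# test/runtests.jl (sketch) -- select a suite via an environment variable set by CI,
# e.g. OCEANANIGANS_TEST_GROUP=unit julia --project -e 'using Pkg; Pkg.test()'
using Test

group = get(ENV, "OCEANANIGANS_TEST_GROUP", "all")

@testset "Oceananigans" begin
    if group in ("unit", "all")
        include("test_fields.jl")               # fast, run on every commit
        include("test_operators.jl")
    end
    if group in ("integration", "all")
        include("test_time_stepping.jl")        # slower, run on every PR
    end
    if group in ("verification", "all")
        include("test_model_verification.jl")   # slow end-to-end runs, nightly
    end
end
```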

cc @christophernhill @jm-c @glwagner: I know we all care about testing.

cc @charleskawczynski @simonbyrne: Just wondering if this is a problem you guys are anticipating for CliMA.jl? We might be able to share some common solutions?

Resources:

ali-ramadhan added the testing 🧪 and help wanted 🦮 labels on Mar 19, 2019
@simonbyrne (Member) commented:

I think that's a good summary of the issues facing the Clima repo as well.

@vchuravy might be able to expand more here, but from what I understand the JuliaGPU org is using their own box at UGent hooked up as a runner via GitLab CI. We are currently considering getting a similar setup here at Caltech; it might also be usable for multi-CPU/multi-GPU jobs.

@simonbyrne (Member) commented:

> We are currently considering getting a similar setup here at Caltech; it might also be usable for multi-CPU/multi-GPU jobs.

I should add that if/when we get it set up, you of course would be welcome to make use of it!

@ali-ramadhan (Member, Author) commented:

That would be awesome! We'll definitely keep a lookout for what you guys end up using.

I wonder if it would be more cost-effective (and a better use of developer time) to just pay a CI service for this, but for such an expensive setup it might not be worth it.

@simonbyrne (Member) commented Mar 20, 2019 via email

@ali-ramadhan (Member, Author) commented:

Yeah, I don't know... I was going to email around for quotes to see if any of the CI services have premium/custom setups with GPUs and multiple CPUs.

If they're just spinning up VMs on the cloud, then maybe it's as simple as requesting a multi-core CPU instance with 2-4 GPUs (which I know is available on Google Cloud). But after factoring in support costs it might be pretty steep.

@christophernhill (Member) commented Mar 20, 2019 via email

@ali-ramadhan (Member, Author) commented Mar 20, 2019

Thanks, I forgot about Circle CI. I just emailed around for quotes from Travis CI, Drone, GitLab CI, and Circle CI. It sounds like enterprises tend to roll their own CI using Jenkins or TeamCity, but we're just a small team that needs a custom solution, so something simple like Travis CI might be fine.

It sounds like we can spin up our own cloud instances (e.g. on Google Cloud with those sweet credits) according to the specs we need, then pay the CI service to basically set them up, maintain them, and provide support.

@ali-ramadhan (Member, Author) commented Apr 19, 2019

I wonder how hard it would be to spin up a Google Cloud instance with a V100 GPU (or something cheaper, doesn't matter too much since we have enough credits) and set up a GitLab CI pipeline with it just like the one JuliaGPU has. We could share it with the JuliaGPU organization as well.

And if we need 2+ GPUs to really test MPI, that would be easy to change (just spin up a new instance and load the "GitLab CI" image, maybe).

It wouldn't run the tests on Windows or Mac, but we can pay a little bit more for dedicated Travis (Mac?) and Appveyor (Windows) resources if we want those to run fast as well.

cc @vchuravy: is this easy-ish to set up? I think you were involved in setting up the current GitLab CI pipeline?
cc @jkozdon since your Slack post reminded me about this issue.

See: https://github.com/JuliaGPU/gitlab-ci

@jkozdon commented Apr 19, 2019

I like the idea of running on Google. Setting up runners should be easy, I think.

cc @lcw

@lcw commented Apr 19, 2019

Yeah, I like the points @vchuravy brought up in the weekly atmosphere meeting. He suggested that using Google Cloud would allow us to test the codes at various scales, from one GPU to hundreds.

@ali-ramadhan (Member, Author) commented Dec 18, 2019

This issue is cropping up now that we regularly time out on Travis (max runtime is 50 minutes) and almost always time out on GitLab GPU CI (max runtime is 60 minutes; @maleadt might be able to increase that, but it's a shared resource and we probably shouldn't be hogging it). Surprisingly, Appveyor is always fast now. I think free CI servers are just generally underpowered.

We definitely want to keep our tests and make them even more comprehensive, so here are some ideas we can discuss (probably in January):

  1. See if we can move the Travis CI pipelines onto Azure DevOps. They seem to give out more runtime (up to 360 minutes, I think), although they might reduce that in the future if they get more users. CliMA and @simonbyrne seem to be having a good experience with Azure.
  2. Split the tests into a smaller, fast test set (regression only?) and the full comprehensive test set. But we still need a place to run the comprehensive test set (maybe Azure runs the comprehensive tests?). We'll probably have to do this at some point.
  3. Split up the tests into jobs that run in <50 minutes each (see the sketch below). You can have unlimited jobs on Travis, but this feels like a lot of work to set up, and the tests would still take a long time since you can't have that many parallel builds.
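
One low-effort version of that splitting, sketched here under the assumption that each CI job exports hypothetical TEST_JOB_INDEX / TEST_JOB_COUNT environment variables (not something we have today), is to deal the test files round-robin across the jobs:

```julia
# test/runtests.jl (sketch): partition test files across parallel CI jobs.
using Test

# Each CI job would export something like TEST_JOB_INDEX=2 TEST_JOB_COUNT=3.
job_index = parse(Int, get(ENV, "TEST_JOB_INDEX", "1"))
job_count = parse(Int, get(ENV, "TEST_JOB_COUNT", "1"))

# Hypothetical test files, ordered roughly slowest-first so the
# round-robin deal balances job runtimes.
test_files = ["test_turbulence_closures.jl", "test_time_stepping.jl",
              "test_fields.jl", "test_operators.jl", "test_utils.jl"]

# Job i runs files i, i + job_count, i + 2*job_count, ...
my_files = test_files[job_index:job_count:end]

@testset "Oceananigans (job $job_index of $job_count)" begin
    for file in my_files
        include(file)
    end
end
```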

We'll have to test Oceananigans + MPI pretty soon, but we can worry about that later. Slurm CI or setting something up with our 4x Titan V server might be a good option here.

@johncmarshall54 commented Dec 18, 2019 via email

@ali-ramadhan (Member, Author) commented:

> Why do model tests have to be long runs? Surely a few timesteps is enough to see if anything is broken.

That is true, but we already do this. Most tests run very small models for a single time step. Some run for a bit longer to test e.g. incompressibility or tracer conservation over time, but even then it's like 10 time steps, and those tests don't take very long.

The problem is just the sheer number of tests, as we try to test each feature on CPU and GPU, with Float32 and Float64, with every closure, etc. We've been adding tests over time, so we currently have ~2000 tests in total (counting GPU tests too). Julia's compiler takes a while to compile everything, which doesn't help. The tests run in 15-20 minutes on my laptop, but the free CI servers aren't as powerful, so I'm not surprised the tests take over 50-60 minutes.
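
To make the combinatorics concrete, here is a sketch of the kind of parameterized test loop involved (the parameter axes below are illustrative, not our actual test code); even three small axes already multiply into a dozen model setups:

```julia
using Test

# Illustrative parameter axes; every extra axis multiplies the number of
# tiny models that have to be built, compiled, and time stepped.
archs       = (:CPU, :GPU)
float_types = (Float32, Float64)
closures    = (:ConstantIsotropicDiffusivity, :SmagorinskyLilly, :AnisotropicMinimumDissipation)

@testset "Time stepping [$arch, $FT, $closure]" for arch in archs, FT in float_types, closure in closures
    # Build a tiny model for this combination and take a few time steps...
    # (details elided; 2 × 2 × 3 = 12 setups from just these three axes).
    @test true
end
```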

We're only going to be adding more tests in the future.

@ali-ramadhan (Member, Author) commented:

Gonna close this issue with PR #872 as there's not much to do and no actionable items.

Unless we get paid-tier CI, we'll probably stick with Travis CI (Linux + Mac CPU + doc builds), GitLab CI (Linux CPU + GPU), Appveyor CI (Windows CPU), and Docker CI. For MPI (#590) we'll probably have to look into https://github.com/CliMA/slurmci.

> it's already annoying that I have to wait several minutes for the GPU tests to run

Haha those were good times.
