We need to upgrade our testing infrastructure soon-ish. #139
Comments
I think that's a good summary of the issues facing the Clima repo as well. @vchuravy might be able to expand more here, but from what I understand the JuliaGPU org is using their own box at UGent hooked up as a runner via GitLab CI. We are currently considering getting a similar setup here at Caltech: it might also be usable for multi-CPU/multi-GPU jobs. |
I should add that if/when we get it set up, you of course would be welcome to make use of it! |
That would be awesome! We'll definitely keep an eye out for what you guys end up using. I wonder if it would be more cost-effective (and a better use of developer time) to just pay a CI service for this, but for such an expensive setup it might not be worth it. |
Are there any GPU-capable CI services?
|
Yeah I don't know... I was going to email around for quotes to see if they have any premium/custom setups with GPUs and multiple CPUs. If they're just spinning up VMs on the cloud, then maybe it's as simple as requesting a multi-core CPU with 2-4 GPUs (which I know is available on Google Cloud). But after factoring in support costs it might be pretty steep. |
https://circleci.com/docs/2.0/gpu/
may be useful?
|
Thanks, I forgot about Circle CI. I just emailed around for quotes from Travis CI, Drone, GitLab CI, and Circle CI. Sounds like enterprises tend to roll their own CI using Jenkins or TeamCity, but we're just a small team that needs a custom setup, so something simpler like Travis CI might be fine. Sounds like we can spin up our own cloud instances (e.g. on Google Cloud with those sweet credits) with the specs we need, then pay the CI service to basically set it up, maintain it, and provide support. |
I wonder how hard it would be to spin up a Google Cloud instance with a V100 GPU (or something cheaper, doesn't matter too much since we have enough credits) and set up a GitLab CI pipeline with it just like the one JuliaGPU has. We could share it with the JuliaGPU organization as well. And if we need 2+ GPUs to really test MPI that would be easy to change (just spin up a new instance and load the "GitLab CI" image maybe). It wouldn't run the tests on Windows or Mac, but we can pay a little bit more for dedicated Travis (Mac?) and Appveyor (Windows) resources if we want those to run fast as well. cc @vchuravy is this easy-ish to set up? I think you were involved in setting up the current GitLab CI pipeline? |
I like the idea of running on Google. Setting up runners should be easy, I think. cc @lcw |
Yeah, I like the points @vchuravy brought up in the weekly atmosphere meeting. He suggested that using Google Cloud would allow us to test the code at various scales, from one GPU to hundreds. |
This issue is cropping up now that we regularly time out on Travis (max runtime is 50 minutes) and we almost always time out on GitLab GPU CI (max runtime is 60 minutes; @maleadt might be able to increase that, but it's a shared resource and we probably shouldn't be hogging it). Surprisingly, Appveyor is always fast now. I think free CI servers are just generally underpowered. Unfortunately it seems like paying for CI will never happen, but we definitely want to keep our tests and make them even more comprehensive, so here are some ideas we can discuss (probably in January):
1. See if we can move the Travis CI pipelines onto Azure DevOps. They seem to give out more runtime (up to 360 minutes, I think), although they might reduce that in the future if they get more users. CliMA and @simonbyrne seem to be having a good experience with Azure.
2. Split the tests into a fast, smaller test set (regression only?) and the full comprehensive test set. But we still need a place to run the comprehensive test set (maybe Azure runs the comprehensive tests?). We'll probably have to do this at some point.
3. Split up the tests into jobs that run in <50 minutes each. You can have unlimited jobs on Travis, but this feels like a lot of work to set up, and the tests would still take long since you can't have that many parallel builds.
We'll have to test Oceananigans + MPI pretty soon, but we can worry about that later. Slurm CI or setting something up with our 4xTitan V server might be a good option here (a rough sketch of launching MPI tests follows below). |
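To make the MPI point a bit more concrete, here is a minimal, hypothetical sketch of how a distributed test file could be launched from the Julia test suite. The `mpiexec` binary on PATH, the rank count, and the `test_distributed_models.jl` file name are assumptions for illustration, not the project's actual setup.

```julia
# Hypothetical sketch: running a distributed test file under MPI as part of
# the test suite. Assumes `mpiexec` is on PATH and that a placeholder script
# test_distributed_models.jl exists next to this file.
using Test

nranks = 2
testfile = joinpath(@__DIR__, "test_distributed_models.jl")
cmd = `mpiexec -n $nranks $(Base.julia_cmd()) --project $testfile`

@testset "Distributed (MPI) tests" begin
    # A non-zero exit code from any rank fails the surrounding testset.
    @test success(run(ignorestatus(cmd)))
end
```

A Slurm-based CI (or the 4xTitan V box) would then only need to allocate the ranks and run this entry point.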
Why do model tests have to be long runs? Surely a few time steps are enough to see if anything is broken.
|
That is true, but we already do this. Most tests run very small models for a single time step. Some run a bit longer to test e.g. incompressibility or tracer conservation over time, but even then it's about 10 time steps, and those tests don't take very long. The problem is the sheer number of tests, as we try to test each feature on CPU and GPU, with Float32 and Float64, with every closure, etc. We've been adding tests over time, so we currently have ~2000 tests in total (counting GPU tests too); a rough sketch of how the combinations multiply is below. Julia's compiler takes a while to compile everything, which doesn't help. The tests run in 15-20 minutes on my laptop, but the free CI servers aren't as powerful, so I'm not surprised the tests take over 50-60 minutes. We're only going to be adding more tests in the future. |
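For illustration, here is a minimal sketch of why the test count multiplies across the CPU/GPU × Float32/Float64 × closure matrix. The architecture and closure names below are plain string placeholders, not the exact Oceananigans API.

```julia
# Hypothetical sketch of the combinatorial test matrix. Each combination would
# build a tiny model and take a handful of time steps, so individual tests are
# fast but the total count grows multiplicatively.
using Test

archs       = ("CPU", "GPU")
float_types = (Float32, Float64)
closures    = ("ConstantIsotropicDiffusivity", "AnisotropicMinimumDissipation")

@testset "Time stepping [$arch, $FT, $closure]" for arch in archs, FT in float_types, closure in closures
    # 2 architectures × 2 float types × N closures × ... quickly reaches
    # thousands of tests even if each one is cheap.
    @test true  # placeholder for something like time_stepping_works(arch, FT, closure)
end
```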
Gonna close this issue with PR #872 as there's not much to do and no actionable items. Unless we get paid-tier CI we'll probably stick with Travis CI (Linux+Mac CPU + doc builds), GitLab CI (Linux CPU+GPU), Appveyor CI (Windows CPU), and Docker CI. With MPI (#590) we'll probably have to look into https://github.com/CliMA/slurmci.
Haha those were good times. |
Right now all our tests are lumped into one suite (unit, integration, and model verification tests), and we run the tests on both the CPU and the GPU (most tests are shared).
This is not a high-priority item right now, but it's already annoying that I have to wait several minutes for the GPU tests to run while I'm debugging, so I'm just starting a discussion around this topic.
I can see us hitting some limitations soon:
We will also need to run the test suite on the following architectures in the future:
Some ideas for things to do that will help:
1. Split the test suite into different kinds of tests (a rough sketch of how the groups could be selected follows this list):
1.1. Unit tests: should run in a few minutes so we can run them during development and on every commit/PR/etc.
1.2. Integration tests: can take a while to run, so we don't want to run these locally all the time, but probably on every PR. They shouldn't take much more than 1 hour to run so we don't have to wait forever to merge PRs.
1.3. Model verification tests (also called end-to-end tests): will probably take a long time to run. Maybe run these once a day? Or manually if there's a PR that changes core functionality.
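As a rough illustration of how the three tiers could be wired up, here is a hypothetical sketch of a `test/runtests.jl` that selects a group via an environment variable set by each CI job. The `TEST_GROUP` variable name and the file names are made up for this sketch.

```julia
# Hypothetical sketch for test/runtests.jl: each CI job sets TEST_GROUP so
# unit, integration, and verification tests can run on different schedules.
# The group names and file names below are placeholders.
using Test

group = get(ENV, "TEST_GROUP", "all")

@testset "Oceananigans" begin
    if group in ("unit", "all")
        include("test_unit.jl")          # minutes: run on every commit/PR
    end
    if group in ("integration", "all")
        include("test_integration.jl")   # ~1 hour: run on every PR
    end
    if group in ("verification", "all")
        include("test_verification.jl")  # long: nightly or on demand
    end
end
```

Each CI configuration would then set, e.g., `TEST_GROUP=unit` for per-commit jobs and `TEST_GROUP=verification` for a nightly job.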
cc @christophernhill @jm-c @glwagner: I know we all care about testing.
cc @charleskawczynski @simonbyrne: Just wondering if this is a problem you guys are anticipating for CliMA.jl? We might be able to share some common solutions.
Resources: