Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

test_benchmarking_parallel_simulation fails on CI for one configuration #1111

Closed
Baschdl opened this issue Mar 25, 2024 · 8 comments · Fixed by #1117
Closed

test_benchmarking_parallel_simulation fails on CI for one configuration #1111

Baschdl opened this issue Mar 25, 2024 · 8 comments · Fixed by #1117
Assignees
Labels
bug Something isn't working

Comments

@Baschdl
Copy link
Contributor

Baschdl commented Mar 25, 2024

The test test_benchmarking_parallel_simulation fails for sim_batch_size=1, num_workers=10 but works for all other configurations on CI:
https://github.com/sbi-dev/sbi/actions/runs/8418222476/job/23048287384?pr=1063#step:8:4939

FAILED tests/multiprocessing_test.py::test_benchmarking_parallel_simulation[1-10] - assert 12.29891848564148 <= (5.03387713432312 * 1.1)

The most likely reasons for this failure are either the overhead of creating 10 workers (on a machine with only 2 cores) is not worth it if we only have 1 sample per batch or problems with running this test with -n auto. I think the first is more likely because all other variants work.

This was noticed in PR #1063 when running the slow tests on CI.

@Baschdl Baschdl added the bug Something isn't working label Mar 25, 2024
@Baschdl Baschdl self-assigned this Mar 25, 2024
@Baschdl
Copy link
Contributor Author

Baschdl commented Mar 25, 2024

I can reproduce this issue locally on a machine with 10 cores, so the number of cores in the CI environment is not the issue. Indeed @famura, running this test with -n auto causes the problem (or not, see two comments down: #1111 (comment)).
I will quickly start a CI run on our slow tests to see if it's worth to run them with -n auto or if we should just get rid of it. They take 33min with parallel execution (see failed run).

@Baschdl
Copy link
Contributor Author

Baschdl commented Mar 25, 2024

@Baschdl
Copy link
Contributor Author

Baschdl commented Mar 25, 2024

This run took 43min. So running the slow tests in parallel saves 10min.

But the above mentioned tests also fails when sequentially executed: https://github.com/sbi-dev/sbi/actions/runs/8420009653/job/23053830420#step:7:3555

@janfb
Copy link
Contributor

janfb commented Mar 25, 2024

Maybe a quicker solution: Do we really need this test?

It seems I thought so back then, but maybe we only have to test the API here and not the fact that parallel simulation should be faster than sequential simulation? (this should be obvious, and of course there is a trade off, but we don't have to do this benchmarking in our tests).

what do the others think, e.g., @michaeldeistler @manuelgloeckler?

@Baschdl
Copy link
Contributor Author

Baschdl commented Mar 28, 2024

I talked with @michaeldeistler today and he was in favor of changing the test to make it work instead of completely kicking it out. He also mentioned that the test failed on his old machine.

We currently run it with num_workers=10 and num_workers=-2 (all cpus but one). num_workers=10 doesn't really make sense on a machine with only two cores (Github Actions), so I would propose to change it to num_workers=-1. On a machine with a bunch of cores this would still be similar to num_workers=10. The downside is that the results could differ from machine to machine but that's also currently the case with num_workers=-2.

This change works on Github Actions: https://github.com/sbi-dev/sbi/actions/runs/8471258924/job/23210817060.

We may have to introduce the additional pytest mark sequential discussed elsewhere to make it work on all machines when people use -n auto.

@Baschdl
Copy link
Contributor Author

Baschdl commented Mar 28, 2024

num_workers=-1 together with -n auto works on Github Actions if we allow joblib to be a little bit slower than the 10% we currently have in the test. I would open a PR for that to get #1063 merged.

Tested here: https://github.com/sbi-dev/sbi/actions/runs/8471478141/job/23211524576

@famura
Copy link
Contributor

famura commented Mar 28, 2024

I got a really naive comment that I still want to make. What speaks against using exactly 2 (or 3) workers? I suppose any action runner has at least 2 cores and it would technically still be a parallel simulation. This way we get a more deterministic behavior I suppose.

@Baschdl
Copy link
Contributor Author

Baschdl commented Mar 28, 2024

To be honest I just tried to replicate the current behavior as close as possible. Setting it to 2 should work equally well and I would be in favor of reproducible behavior across machines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants