Remove timeouts from db connection #17424
Hi @vient, I think this is quite risky. If an operation on the DB that normally takes milliseconds cannot complete in 10 seconds, then waiting forever will probably not solve the problem and will instead aggravate the whole situation, causing all concurrent jobs acting on the DB to stall or block each other. The fact that one of the concurrent processes fails because of this timeout is a safeguard that lets the other processes succeed. In our experience this happens more because of excessive concurrency over the DB than because of high CPU usage on the machine. Are you using separate caches for separate jobs and still seeing this issue? Or are you sharing the same cache with high concurrency (see the notes about concurrent caches in https://docs.conan.io/2/knowledge/guidelines.html)?
Our case looks like this: Jenkins starts pods which use local Conan folders. To speed up Conan startup we pass a shared download cache and set
Also, I wonder whether thread starvation may be occurring somehow. We use around 100 Conan packages; if one thread is waiting for the lock while another thread continuously succeeds in acquiring it and then performs a long DB operation because of disk latency, the first thread may see a 10s timeout even if all DB operations take <100ms.
Yes, but this principle works only if you use retries. Conan does not retry DB operations though; it exits with an error immediately instead. This approach aggravates the situation even more, at least in my case, where this Conan process is the only one using this particular Conan folder anyway. My current understanding is that these timeouts only help point out that something is wrong with DB operation latency. There are cases like mine, though, where nothing can be done about it and it is expected. If you think the timeouts are indeed helpful, may we discuss increasing them or adding an option for the timeout value instead?
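To make the retry idea above concrete, here is a minimal sketch using Python's standard `sqlite3` module. It is not Conan's code: the function name `run_with_retries` and the parameters `attempts` and `backoff` are hypothetical, chosen only for illustration. It assumes that lock contention surfaces as `sqlite3.OperationalError` with "locked" in the message, which is how the standard module reports a busy database.

```python
import sqlite3
import time


def run_with_retries(db_path, sql, params=(), attempts=3, timeout=10.0, backoff=0.5):
    """Run one write statement, retrying when SQLite reports a locked database.

    Hypothetical helper for illustration; returns the attempt number that succeeded.
    """
    for attempt in range(1, attempts + 1):
        # `timeout` is how long SQLite waits for a lock before giving up.
        conn = sqlite3.connect(db_path, timeout=timeout)
        try:
            with conn:  # commits on success, rolls back on error
                conn.execute(sql, params)
            return attempt
        except sqlite3.OperationalError as exc:
            # Retry only on lock contention; re-raise anything else,
            # and re-raise once the last attempt has failed.
            if "locked" not in str(exc) or attempt == attempts:
                raise
            time.sleep(backoff * attempt)  # simple linear backoff
        finally:
            conn.close()
```

With a wrapper like this, a timeout stops being fatal on its first occurrence: the process fails only after several consecutive waits have expired.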
Thanks for the detailed feedback.
I am a bit surprised that you are not seeing other kinds of race conditions because, as commented above, the cache is not yet designed for concurrency. It is true that if you are very careful to install/build different packages in different pods, with the right orchestration and order of builds, the race conditions might be avoided. If there is resource exhaustion at the machine level, the DB may sometimes not be the actual blocker or bottleneck, but if the DB is hitting timeouts even after 10 seconds, then maybe those timeouts are not the bigger problem anyway? If it helps, I think we might be able to increase those timeouts a bit more, to 15 or even 20 seconds, but I'd still like to keep some limit; waiting forever without any limit seems to me to have more risks than benefits.
Hi @vient, any further feedback about my last comment?
Hi, thanks for the reminder.
Just in case, I'm speaking about download cache (
In a way, yes. Unfortunately, it may not be easy (or even possible?) to prevent latency spikes when you have a big server with lots of noisy neighbors on it - if some of them coincidentally start doing something disk-intensive at the same time, everyone is affected. Disk I/O QoS seems to reduce performance, unlike CPU and RAM QoS - in CI we care more about throughput than latency, so decreasing disk performance when it's already strained is not optimal. We are pretty happy with our CI performance overall; latency spikes rarely cause any problems, one of these problems being this timeout. We've never seen this timeout trigger in any situation other than high server load, in which case we would prefer it not to trigger because nothing bad actually happens.
Other than creating a config option for the timeout duration, this is the only other way. Can we increase it to, say, 30 seconds?
The download cache doesn't use the DB or sqlite at all; it is purely filesystem (and file lock) based. So if the pods are fully local and not shared between concurrent jobs, there will be no concurrency over the package storage or over the same Conan cache, and it would be safe. It would then seem to be mostly a machine resource depletion issue.
Yes. Just in case, make sure to update to the latest Conan versions; we have done several optimizations that reduce the number of sqlite reads, in some cases removing a lot of calls to the DB. I think that, together with the latest optimizations, doubling the time should be quite a large improvement; I'd suggest trying timeout=20 for the next release and seeing how it goes. I don't think it deserves a dedicated configuration option, as the goal is not to have to configure this at all; it seems more like a workaround than something that users really want to set to different values.
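The effect of a longer timeout can be illustrated with Python's standard `sqlite3` module, where the `timeout` argument of `sqlite3.connect` sets how long a connection waits for a lock. The sketch below is not Conan's implementation: `hold_write_lock`, `try_write`, and the table `t` are hypothetical names, and the sleeping writer merely stands in for a DB operation slowed by disk latency.

```python
import sqlite3
import threading
import time


def hold_write_lock(db_path, seconds, acquired):
    """Hold an exclusive lock for `seconds`, simulating a slow writer."""
    conn = sqlite3.connect(db_path, isolation_level=None)
    conn.execute("BEGIN EXCLUSIVE")
    acquired.set()  # signal that the lock is now held
    time.sleep(seconds)
    conn.execute("COMMIT")
    conn.close()


def try_write(db_path, timeout):
    """Attempt one insert; return True if the lock was obtained within `timeout`."""
    conn = sqlite3.connect(db_path, timeout=timeout)
    try:
        with conn:  # commits on success, rolls back on error
            conn.execute("INSERT INTO t VALUES (1)")
        return True
    except sqlite3.OperationalError:
        # Raised when the busy timeout expires ("database is locked").
        return False
    finally:
        conn.close()
```

A writer whose timeout is shorter than the time the lock is held fails, while one with a longer timeout simply waits and then succeeds, which is the trade-off being discussed for 10 versus 20 seconds.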
We use 2.9 at the moment, will try 2.11 then.
Superseded by #17616
Changelog: Fix: Remove DB connection timeouts
As observed in #13924 (comment) and also by me, "Conan failed to acquire database lock" may be triggered on machines with lots of cores and high load - in CI, in other words. All cases I observed were caused by high load. I don't think it is meaningful to fail because of this: the failure is treated as a flap and the build gets restarted, which does not help with CPU load. I don't know of any other possible reason for this timeout to trigger, so in my opinion it should not exist at all.