Liveness probe for tokman fails often, resulting in container restarts and poor user experience #548
The liveness probe fails with:

    Get "http://${IPADDR}:8000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This happens on both prod and stage, but it is far more frequent on prod: the counter currently shows 649 failures in the last 12 days, while on stage it has failed only 5 times in the last 3 days.
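The "context deadline exceeded (Client.Timeout exceeded while awaiting headers)" text is the kubelet's HTTP client giving up before tokman answers, so one mitigation is to relax the probe timing. A minimal sketch of what that could look like in the container spec, assuming only the path and port taken from the error above; all the numeric values are illustrative guesses, not the project's actual configuration:

```yaml
# Hypothetical livenessProbe stanza for the tokman container.
# Only path and port come from the error message; the timings are guesses.
livenessProbe:
  httpGet:
    path: /api/health
    port: 8000
  timeoutSeconds: 5      # kubelet default is 1s, which is what "context deadline exceeded" hits
  periodSeconds: 30      # probe less often
  failureThreshold: 3    # restart only after several consecutive failures
```

With the kubelet's default timeoutSeconds of 1, any health response slower than a second counts as a failure, which matches the behaviour described in the comments below.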
Comments

I tried to debug this as much as I could, but I didn't find out much. Every now and then, when doing the liveness check, the kubelet sends an HTTP request that is accepted by the tokman container, but the connection is then terminated by the client before the server has a chance to respond, and that results in the error message above. I believe that rules out an issue on the tokman side, but that's all I can tell. There is also an error appearing from time to time on a short-running worker pod.
Opened RITM1766219.
I don't see any of these events in the last 5 days, so I will keep an eye on it and close the issue for now. OTOH, the same issue has popped up with a short-running worker, so we might want to adjust the requests/limits soon™; a sketch of what that could look like follows below. There are also some outcomes from the ticket above that can be included in our own docs.
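Adjusting the requests/limits would be a change along these lines, assuming the short-running worker is an ordinary container in the deployment. The numbers are made up; the real values would need to come from observed usage:

```yaml
# Hypothetical resources stanza for the short-running worker container;
# all values are illustrative placeholders, not taken from the real manifests.
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```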