Liveness probe for tokman fails often, resulting in container restarts and poor user experience #548

Closed
nforro opened this issue Jan 11, 2024 · 3 comments
Assignees
Labels
complexity/single-task Regular task, should be done within days. kind/bug Something isn't working.

Comments

@nforro
Member

nforro commented Jan 11, 2024

The liveness probe fails with:
Get "http://${IPADDR}:8000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This happens on both prod and stage, but on prod it is very frequent: the counter currently shows 649 failures in the last 12 days, while on stage it has happened only 5 times in the last 3 days.

@lbarcziova lbarcziova moved this from new to ready-to-refine in Packit Kanban Board Jan 11, 2024
@lachmanfrantisek lachmanfrantisek added kind/bug Something isn't working. complexity/single-task Regular task, should be done within days. labels Jan 11, 2024
@lachmanfrantisek lachmanfrantisek moved this from ready-to-refine to refined in Packit Kanban Board Jan 11, 2024
@nforro nforro self-assigned this Jan 18, 2024
@nforro nforro moved this from refined to in-progress in Packit Kanban Board Jan 18, 2024
@nforro
Member Author

nforro commented Mar 19, 2024

I tried to debug this as much as I could, but I didn't find out much. Every now and then, when doing the liveness check, a kubelet sends an HTTP request that is accepted by a tokman container, but then the connection is terminated by the client (before the server has a chance to respond), and that results in the error message. I believe that rules out an issue on the tokman side, but that's all I can tell.
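
For illustration, the failure mode described above (a request that is accepted, followed by the client giving up before any response headers arrive) can be reproduced locally. The following is only a minimal sketch, not tokman's code and not the kubelet's prober: `HealthHandler`, `probe`, `run_probe`, and the `delay` and `timeout` values are all hypothetical names and numbers chosen for the example; only the `/api/health` path is taken from the error message.

```python
import threading
import time
import urllib.request
from http.server import BaseHTTPRequestHandler, ThreadingHTTPServer


class HealthHandler(BaseHTTPRequestHandler):
    """Stand-in health endpoint that answers after an artificial delay."""

    delay = 0.0  # hypothetical knob: seconds to wait before responding

    def do_GET(self):
        time.sleep(self.delay)
        try:
            self.send_response(200)
            self.end_headers()
            self.wfile.write(b"ok")
        except (BrokenPipeError, ConnectionResetError):
            # The client has already closed the connection.
            pass

    def log_message(self, *args):
        # Keep the example quiet.
        pass


def probe(url, timeout):
    """Return True if the endpoint answers 200 within `timeout` seconds."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        # Covers URLError and socket timeouts while awaiting headers.
        return False


def run_probe(delay, timeout):
    """Start a throwaway local server and probe it once."""
    handler = type("Handler", (HealthHandler,), {"delay": delay})
    server = ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    try:
        port = server.server_address[1]
        return probe(f"http://127.0.0.1:{port}/api/health", timeout)
    finally:
        server.shutdown()
        server.server_close()


if __name__ == "__main__":
    print("fast server:", run_probe(delay=0.0, timeout=2.0))
    print("slow server:", run_probe(delay=1.0, timeout=0.2))
```

In the slow case the server does accept the request, but the client times out and closes the connection first, which is the same shape as the observed probe failure. The sketch says nothing about why the real response was late.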

There is also an error appearing from time to time on a short-running worker pod:
Liveness probe failed: Ignored keyword arguments: {'type': 'pagure'}
I had a look into that as well. The error message is actually just an unrelated warning (coming from ogr); the actual error is that celery status (the command the liveness probe runs) on a short-running worker sometimes doesn't produce the expected output: the short-running worker hostname is missing from the list and only long-running workers are there. This is reproducible from time to time even when running the command manually in a terminal.
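
A rough sketch of the kind of check involved is shown below. It is not the actual probe script: the output format (lines such as `->  celery@<hostname>: OK`), the hostnames, and the helper names `listed_nodes`, `worker_is_listed`, and `probe_with_retries` are assumptions made for the example. Retrying is only one conceivable way to tolerate an occasionally incomplete list and is not something settled in this thread.

```python
import time


def listed_nodes(status_output):
    """Extract node names from lines that look like '->  name@host: OK'."""
    nodes = []
    for line in status_output.splitlines():
        line = line.strip()
        if line.startswith("->"):
            line = line[2:].strip()
        if line.endswith(": OK"):
            nodes.append(line[: -len(": OK")])
    return nodes


def worker_is_listed(status_output, hostname):
    """Return True if a node running on `hostname` reported OK."""
    return any(node.endswith("@" + hostname) for node in listed_nodes(status_output))


def probe_with_retries(get_status, hostname, attempts=3, delay=0.0):
    """Call `get_status()` up to `attempts` times until `hostname` is listed.

    `get_status` is any callable returning the status text, so the check
    can be exercised without a running Celery cluster.
    """
    for attempt in range(attempts):
        if worker_is_listed(get_status(), hostname):
            return True
        if attempt < attempts - 1:
            time.sleep(delay)
    return False


if __name__ == "__main__":
    # Hypothetical outputs: the second one lacks the short-running worker.
    full = "->  celery@short-running-0: OK\n->  celery@long-running-0: OK\n"
    partial = "->  celery@long-running-0: OK\n"
    print(worker_is_listed(full, "short-running-0"))
    print(worker_is_listed(partial, "short-running-0"))
```

A single-shot check fails whenever the worker's own hostname happens to be missing from one reply, which matches the intermittent failures described above.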

@mfocko
Member

mfocko commented Apr 22, 2024

Opened RITM1766219

@mfocko mfocko added the blocked We are blocked! label Apr 22, 2024
@mfocko
Member

mfocko commented Jun 12, 2024

I don't see any of these events in the last 5 days. I will keep an eye on it and close the issue for now.


OTOH the same issue has popped up with a short-running worker, so we might want to adjust the requests/limits soon™

There are also some outcomes from the ticket above that can be included in our own docs.

Projects
Archived in project
Development

No branches or pull requests

3 participants