Liveness probe for tokman fails often, resulting in container restarts and poor user experience #548
The liveness probe fails with:

    Get "http://${IPADDR}:8000/api/health": context deadline exceeded (Client.Timeout exceeded while awaiting headers)

This happens on both prod and stage, but it is far more frequent on prod: the counter currently shows 649 failures in the last 12 days, while on stage it has failed only 5 times in the last 3 days.
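The "context deadline exceeded (Client.Timeout exceeded while awaiting headers)" text is the kubelet's HTTP client giving up before tokman answers, so one mitigation is to relax the probe timing. A minimal sketch of what that could look like in the container spec, assuming only the path and port taken from the error above; all the numeric values are illustrative guesses, not the project's actual configuration:

```yaml
# Hypothetical livenessProbe stanza for the tokman container.
# Only path and port come from the error message; the timings are guesses.
livenessProbe:
  httpGet:
    path: /api/health
    port: 8000
  timeoutSeconds: 5      # kubelet default is 1s, which is what "context deadline exceeded" hits
  periodSeconds: 30      # probe less often
  failureThreshold: 3    # restart only after several consecutive failures
```

With the kubelet's default timeoutSeconds of 1, any health response slower than a second counts as a failure, which matches the behaviour described in the comments below.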
Comments

I tried to debug this as much as I could, but I didn't find out much. Every now and then, when doing the liveness check, the kubelet sends an HTTP request that is accepted by the tokman container, but the connection is then terminated by the client before the server has a chance to respond, and that results in the error message above. I believe that rules out an issue on the tokman side, but that's all I can tell. There is also an error appearing from time to time on a short-running worker pod.
Opened RITM1766219.
I don't see any of these events in the last 5 days, so I will keep an eye on it and close the issue for now. OTOH, the same issue has popped up with a short-running worker, so we might want to adjust the requests/limits soon™; a sketch of what that could look like follows below. There are also some outcomes from the ticket above that can be included in our own docs.
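Adjusting the requests/limits would be a change along these lines, assuming the short-running worker is an ordinary container in the deployment. The numbers are made up; the real values would need to come from observed usage:

```yaml
# Hypothetical resources stanza for the short-running worker container;
# all values are illustrative placeholders, not taken from the real manifests.
resources:
  requests:
    cpu: 100m
    memory: 256Mi
  limits:
    cpu: 500m
    memory: 512Mi
```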