Sometimes an alloc will be pending for a long time. Since no alloc has been started yet, the scheduler should be able to re-schedule the alloc to a different client.
This is especially worrisome for a periodic batch job that prevents overlap, because a stuck pending alloc can then hold up the following runs as well.
Reproduction steps
1. Have a client with a lot of GC'able allocs (start a lot of them and set the GC interval to something long, such as 24h).
2. Wait for a new job to be scheduled onto this client.
3. Observe that the alloc status is pending, and that it sometimes stays pending for quite a while.
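To make the last step easier to observe, here is a minimal sketch that filters a list of allocations for ones that have been pending longer than a threshold. It assumes the input is the already-decoded JSON from an allocation listing, with `ID`, `ClientStatus`, and `CreateTime` fields where `CreateTime` is a Unix timestamp in nanoseconds. The function name `long_pending` and the threshold are hypothetical and only for illustration.

```python
from datetime import datetime, timedelta, timezone


def long_pending(allocs, now, threshold):
    """Return (alloc ID, time spent pending) for allocs pending longer than threshold.

    Assumes each alloc is a dict with "ID", "ClientStatus" and "CreateTime",
    where "CreateTime" is Unix time in nanoseconds (an assumption of this sketch).
    """
    stuck = []
    for alloc in allocs:
        if alloc.get("ClientStatus") != "pending":
            continue
        # Integer division avoids float rounding on nanosecond timestamps.
        created_s = alloc["CreateTime"] // 1_000_000_000
        created = datetime.fromtimestamp(created_s, tz=timezone.utc)
        waited = now - created
        if waited > threshold:
            stuck.append((alloc["ID"], waited))
    return stuck
```

Calling it with `threshold=timedelta(minutes=10)` on a periodic snapshot of the allocation list would show which allocs are stuck and for how long.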
Expected Result
If an alloc is pending for too long, the scheduler should reschedule that alloc onto a different client.
What is too long? Not sure, but when the next periodic batch job should have started, it has definitely been too long.
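The "too long" rule suggested above could be sketched roughly as follows. This is only an illustration of the proposed cutoff, not how the Nomad scheduler works. It assumes the periodic job runs at a fixed interval (a cron expression would need a proper next-launch calculation), and the name `pending_too_long` is hypothetical.

```python
from datetime import datetime, timedelta, timezone


def pending_too_long(pending_since, period, now):
    """Return True once the next periodic run should already have started.

    Assumes a fixed launch interval `period`; `pending_since` is when the
    alloc for the current run was created.
    """
    next_launch = pending_since + period
    return now >= next_launch


start = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
hourly = timedelta(hours=1)
print(pending_too_long(start, hourly, start + timedelta(minutes=30)))  # False
print(pending_too_long(start, hourly, start + timedelta(minutes=60)))  # True
```

Under this rule the cutoff scales with the job's own schedule, so no separate timeout would need to be configured.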
Actual Result
The alloc stays on the overworked client and is 'stuck' there until the client finally decides to start it.
Job file (if appropriate)
Not applicable.
Nomad Server logs (if appropriate)
Nomad Client logs (if appropriate)
Hi @EtienneBruines, thanks for raising the issue. For deployments (i.e., service jobs), allocations that stay stuck in the pending state for too long are marked as unhealthy and the deployment fails. This can be controlled by healthy_deadline. Sadly, this doesn't help in the case of periodic jobs.
I'll add this to our board and we'll have a think about what to do about it.
Nomad version
Nomad v1.9.4
BuildDate 2024-12-18T15:16:22Z
Revision 5e49fcd+CHANGES
Operating system and Environment details
Ubuntu 22.04.5 LTS on amd64