Reschedule long-pending allocs #24780

Open
EtienneBruines opened this issue Jan 6, 2025 · 1 comment

Comments

@EtienneBruines
Contributor

Nomad version

Nomad v1.9.4
BuildDate 2024-12-18T15:16:22Z
Revision 5e49fcd+CHANGES

Operating system and Environment details

Ubuntu 22.04.5 LTS on amd64

Issue

Sometimes an alloc stays pending for a long time. Since the alloc has not been started yet, the scheduler should be able to reschedule it to a different client.

This is especially worrisome for a periodic batch job that prevents overlap: while the alloc is stuck in pending, the next launches cannot start either, which makes the stuck alloc more severe.

Reproduction steps

  • Have a client with a lot of GC'able allocs (start a lot of them and set the GC interval to 24h or something)
  • Wait for a new job to be scheduled to this client
  • Observe that the alloc status is pending, and that it sometimes stays pending for quite a while
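
To make the last step easier to observe, here is a minimal sketch that filters a list of allocations for ones that have been pending longer than a chosen threshold. It assumes the input is the parsed JSON from Nomad's allocation list endpoint (`GET /v1/allocations`), with `ID`, `NodeID`, `ClientStatus`, and a `CreateTime` in nanoseconds since the Unix epoch; please verify those field names against your Nomad version. The function name `long_pending_allocs` and the threshold are hypothetical, and fetching the JSON is left out on purpose.

```python
from datetime import datetime, timedelta, timezone


def long_pending_allocs(allocs, now, threshold):
    """Return (alloc ID, node ID, age) for allocs pending longer than threshold.

    allocs: list of dicts, assumed to be shaped like the allocation list
            returned by GET /v1/allocations (an assumption, not verified here).
    now: timezone-aware datetime used as the reference point.
    threshold: timedelta after which a pending alloc counts as stuck.
    """
    stuck = []
    for alloc in allocs:
        if alloc.get("ClientStatus") != "pending":
            continue
        # CreateTime is assumed to be nanoseconds since the Unix epoch.
        created = datetime.fromtimestamp(alloc["CreateTime"] / 1e9, tz=timezone.utc)
        age = now - created
        if age > threshold:
            stuck.append((alloc["ID"], alloc["NodeID"], age))
    return stuck
```

Grouping the result by node ID should show whether the stuck allocs cluster on the client that holds the many GC'able allocs.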

Expected Result

If an alloc is pending for too long, the scheduler should reschedule that alloc to a different client.

What is too long? Not sure, but once the next periodic launch of the batch job should have started, it has definitely been too long.
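
As a rough illustration of that rule of thumb, here is a sketch that treats an alloc as pending "too long" once the next periodic launch would have been due. It assumes a fixed launch interval rather than a cron expression, and uses the time the stuck alloc was launched as the starting point; `pending_too_long` and its parameters are hypothetical names for illustration, not anything that exists in Nomad.

```python
from datetime import datetime, timedelta, timezone


def pending_too_long(launched_at, interval, now):
    """Return True once the next periodic launch would have been due.

    launched_at: when the still-pending periodic instance was launched.
    interval: assumed fixed time between periodic launches.
    now: current time (timezone-aware, like launched_at).
    """
    next_launch = launched_at + interval
    return now >= next_launch


if __name__ == "__main__":
    launched = datetime(2025, 1, 6, 10, 0, tzinfo=timezone.utc)
    hourly = timedelta(hours=1)
    # 30 minutes in: the next hourly launch is not due yet.
    print(pending_too_long(launched, hourly, launched + timedelta(minutes=30)))
    # 90 minutes in: the next launch should already have started.
    print(pending_too_long(launched, hourly, launched + timedelta(minutes=90)))
```

A real implementation would need to derive the next launch time from the job's actual periodic schedule; the fixed interval here only keeps the sketch short.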

Actual Result

The alloc stays on the overworked client and is 'stuck' there until the client finally decides to start it.

Job file (if appropriate)

Not applicable.

Nomad Server logs (if appropriate)

Nomad Client logs (if appropriate)

@pkazmierczak
Contributor

Hi @EtienneBruines, thanks for raising the issue. For deployments (i.e., service jobs), the deployment will fail if allocations are stuck in the pending state for too long, and those allocations will be marked as unhealthy. This can be controlled by healthy_deadline. Sadly, this doesn't help for periodic jobs.

I'll add this to our board and we'll have a think about what to do about it.

Projects
Status: Needs Roadmapping
Development

No branches or pull requests

2 participants