Shrinking autoscale group kills in-progress builds #1399

Open
DanielHeath opened this issue Nov 13, 2024 · 2 comments

DanielHeath commented Nov 13, 2024

Describe the bug

Stack was running nicely, had scaled up to four instances.

There's only enough work for three instances, so the ASG gets told to reduce capacity.

As a result, jobs which are in progress get interrupted partway through (in this case, the job was midway through pushing out a production hotfix, which was a great addition to a morning of incident response :))

Expected behavior

An agent which isn't currently performing work gets selected for termination

Actual behaviour

An instance performing useful work is often killed

Stack parameters:

  • AWS Region: us-east-2
  • Version: 6.27.0

Context

Changing the size of an ASG is a very blunt instrument.

Consider instead the detach-instances call (the DetachInstances API), which removes an instance from the ASG and, when asked to, decrements the DesiredCapacity in the same request.

Once the detach completes, you could then terminate the instance from the lambda; this lets you pick which instance gets killed.
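
The detach-then-terminate flow suggested here could be sketched roughly as follows. This is a minimal Python illustration under stated assumptions, not the scaler's actual code: `autoscaling` and `ec2` are assumed to be boto3 clients, while `pick_idle_instance`, `scale_in_one`, and the `busy_ids` set are hypothetical names (the set would have to come from something that knows which agents are currently running jobs). As far as I know the detach request is asynchronous, so a real implementation would probably wait for the instance to leave the group before terminating it.

```python
from typing import Iterable, Optional, Set


def pick_idle_instance(instance_ids: Iterable[str], busy_ids: Set[str]) -> Optional[str]:
    """Return the first instance that is not running a job, or None if all are busy."""
    for instance_id in instance_ids:
        if instance_id not in busy_ids:
            return instance_id
    return None


def scale_in_one(autoscaling, ec2, asg_name: str,
                 instance_ids: Iterable[str], busy_ids: Set[str]) -> Optional[str]:
    """Detach one idle instance from the ASG, then terminate it.

    `autoscaling` and `ec2` are assumed to be boto3 clients, e.g.
    boto3.client("autoscaling") and boto3.client("ec2").
    """
    victim = pick_idle_instance(instance_ids, busy_ids)
    if victim is None:
        # Every instance is busy: leave capacity alone and retry later.
        return None

    # Remove the chosen instance from the group and lower DesiredCapacity
    # in the same request, so the ASG does not launch a replacement.
    autoscaling.detach_instances(
        InstanceIds=[victim],
        AutoScalingGroupName=asg_name,
        ShouldDecrementDesiredCapacity=True,
    )

    # Assumption: a real implementation would wait here until the detach
    # has finished before terminating.
    ec2.terminate_instances(InstanceIds=[victim])
    return victim
```

With instances `i-1` (busy) and `i-2` (idle), `scale_in_one(autoscaling, ec2, "my-asg", ["i-1", "i-2"], {"i-1"})` would detach and terminate `i-2` and leave the running job alone.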

DrJosh9000 (Contributor) commented

Hi @DanielHeath, thanks for raising this. I don't know when we will get to this, but I'll raise it internally for planning.

MichaelFoleyFZ commented Nov 20, 2024

We've noticed a similar thing occasionally.

Does the lambda scaler control the scale-in protection of the instances? I.e., does it enable or disable protection based on whether a job is active on the instance? If it did that, I think it would remove the need to detach the instance, and autoscaling would keep trying to reduce the instance count until the scale-in protection was lifted.
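
For illustration, toggling scale-in protection around a job might look something like the sketch below. It does not answer whether the scaler does this today; it only shows the idea. The `autoscaling` argument is assumed to be a boto3 Auto Scaling client, and `protected_from_scale_in` is a hypothetical helper name, as is the idea of wrapping job execution in it.

```python
from contextlib import contextmanager


@contextmanager
def protected_from_scale_in(autoscaling, asg_name: str, instance_id: str):
    """Hold scale-in protection on one instance for the duration of a job.

    `autoscaling` is assumed to be boto3.client("autoscaling").
    """
    # Mark the instance as protected before the job starts, so a scale-in
    # event picks a different (unprotected) instance.
    autoscaling.set_instance_protection(
        InstanceIds=[instance_id],
        AutoScalingGroupName=asg_name,
        ProtectedFromScaleIn=True,
    )
    try:
        yield
    finally:
        # Lift protection even if the job fails, so the instance can be
        # reclaimed once it is idle.
        autoscaling.set_instance_protection(
            InstanceIds=[instance_id],
            AutoScalingGroupName=asg_name,
            ProtectedFromScaleIn=False,
        )
```

An agent-side hook could then run each job inside `with protected_from_scale_in(autoscaling, "my-asg", instance_id): ...`, which keeps the ASG's own capacity logic in charge while steering it away from busy instances.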
