Is your feature request related to a problem? Please describe.
I'm always frustrated when I need to make stack updates, because doing so incurs downtime for the agent pool. When I deploy updates, all agents in that group are terminated and replaced. This leads to jobs failing with Exited with status -1 (agent lost). I then have to manually restart all of those jobs or rely on users to do so.
Describe the solution you'd like
I would like agents to drain their workload before being terminated and replaced during a stack update.
Describe alternatives you've considered
Performing the stack update during non-peak hours.
Manually creating an adjacent stack, migrating to the new stack, and then turning off the original stack.
Additional context
Perhaps using AWS lifecycle hooks to put instances in a Terminating:Wait state to allow draining would be helpful.
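As a rough illustration of that idea, here is a minimal boto3 sketch. The ASG name, hook name, and timeout are hypothetical placeholders (not part of the Elastic CI Stack), and it assumes something on or on behalf of the instance can tell when the local buildkite-agent has finished its jobs:

```python
# Sketch only: register a termination lifecycle hook on the agents' ASG so that
# instances enter Terminating:Wait instead of being terminated immediately.
# "my-buildkite-agents-asg" and the hook name are hypothetical placeholders.
import boto3

autoscaling = boto3.client("autoscaling")

autoscaling.put_lifecycle_hook(
    LifecycleHookName="drain-buildkite-agent",
    AutoScalingGroupName="my-buildkite-agents-asg",
    LifecycleTransition="autoscaling:EC2_INSTANCE_TERMINATING",
    HeartbeatTimeout=3600,       # give the agent up to an hour to finish its jobs
    DefaultResult="CONTINUE",    # terminate anyway if nothing completes the hook
)


def release_instance(instance_id: str) -> None:
    """Once the local agent has drained, tell the ASG the instance may be terminated."""
    autoscaling.complete_lifecycle_action(
        LifecycleHookName="drain-buildkite-agent",
        AutoScalingGroupName="my-buildkite-agents-asg",
        LifecycleActionResult="CONTINUE",
        InstanceId=instance_id,
    )
```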
Alternatively, I could detach all instances from the ASG before the stack update, but then I have the problem of determining when those agents have drained and can be terminated. Maybe if the buildkite-agent service could detect the detached state and then drain the workload, that would be helpful.
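A sketch of that detach-then-drain alternative, again with boto3 and a hypothetical ASG name. The drain check is only a stand-in, since the whole problem is that there is currently no built-in way to detect when an agent has finished its work:

```python
# Sketch only: detach the running instances from the ASG before the stack update so
# replacements are launched, then terminate the old instances once their agents are
# idle. agent_is_drained() is a hypothetical placeholder for whatever drain signal
# ends up being available (e.g. polling the Buildkite API or an on-instance check).
import time
import boto3

ASG_NAME = "my-buildkite-agents-asg"   # hypothetical placeholder

autoscaling = boto3.client("autoscaling")
ec2 = boto3.client("ec2")


def agent_is_drained(instance_id: str) -> bool:
    """Placeholder drain check; not provided by the stack or the agent today."""
    raise NotImplementedError


def detach_and_drain(instance_ids: list[str]) -> None:
    # Detach without shrinking desired capacity so the ASG launches replacements.
    autoscaling.detach_instances(
        InstanceIds=instance_ids,
        AutoScalingGroupName=ASG_NAME,
        ShouldDecrementDesiredCapacity=False,
    )
    pending = set(instance_ids)
    while pending:
        for instance_id in list(pending):
            if agent_is_drained(instance_id):
                ec2.terminate_instances(InstanceIds=[instance_id])
                pending.remove(instance_id)
        time.sleep(60)
```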
It was configured like this 6 years ago, so there was most probably a reason for doing it that way, but I'm not sure if it's still relevant. I haven't found any confirmation in this repo or the agent's repo.