Set leave_on_terminate=true for servers and hardcode maxUnavailable=1 #3000
Conversation
@lkysow since you're adding a new acceptance test package, in order to get it to run in the pipeline you need to add it to the yaml matrices: https://github.com/hashicorp/consul-k8s/tree/lkysow/server-restart/acceptance/ci-inputs |
Looks good to me, just a couple nits (you can ignore if they don't make sense), and you need to add the server acceptance package to the acceptance test matrices (see above).
Thanks for all the great comments too, it was very easy to see your reasoning behind things.
When leave_on_terminate=false (the default), rolling the statefulset is disruptive because the new servers come up with the same node IDs but different IP addresses. They can't join the server cluster until the old server's node ID is marked as failed by serf. During this time, they continually start leader elections because they don't know there's a leader. When they eventually join the cluster, their election term is higher, so they trigger a leadership swap. The leadership swap happens at the same time as the next node to be rolled is being stopped, and so the cluster can end up without a leader.

With leave_on_terminate=true, the stopping server cleanly leaves the cluster, so the new server can join smoothly even though it has the same node ID as the old server. This speeds up the rollout and, in my testing, eliminates the period without a leader.

The downside of this change is that when a server leaves gracefully, it also reduces the number of raft peers. The number of peers is used to calculate the quorum size, so this can unexpectedly change the fault tolerance of the cluster. When running with an odd number of servers, one server leaving the cluster does not affect quorum size, e.g. 5 servers => quorum 3, 4 servers => quorum still 3. During a rollout, Kubernetes only stops one server at a time, so the quorum won't change. During a voluntary disruption event, e.g. a node being drained, Kubernetes uses the pod disruption budget to determine how many pods in a statefulset can be made unavailable at a time; that's why this change hardcodes that number to 1.

Also set autopilot min_quorum to the minimum quorum size and disable autopilot upgrade migration, since that feature is intended for blue/green deploys.
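The quorum arithmetic behind the fault-tolerance argument above can be sketched as follows (a minimal illustration, not code from this PR):

```python
# Raft quorum is floor(n/2) + 1, so fault tolerance is n - quorum.
# This shows why one server gracefully leaving an odd-sized cluster
# does not shrink the quorum, while leaving an even-sized one does.

def quorum(servers: int) -> int:
    """Minimum number of voters needed to elect a leader and commit entries."""
    return servers // 2 + 1

for n in (5, 4, 3):
    print(f"{n} servers -> quorum {quorum(n)}, tolerates {n - quorum(n)} failure(s)")
# 5 servers -> quorum 3, tolerates 2 failure(s)
# 4 servers -> quorum 3, tolerates 1 failure(s)
# 3 servers -> quorum 2, tolerates 1 failure(s)
```

With 5 servers, one graceful leave (5 -> 4) keeps the quorum at 3, which is why the pod disruption budget only needs to cap voluntary disruptions at 1 pod at a time.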
With hashicorp/consul-k8s#3000 merged, users can upgrade their k8s installs using a regular helm upgrade since the upgrade is now stable.
* docs: update k8s upgrade instructions With hashicorp/consul-k8s#3000 merged, users can upgrade their k8s installs using a regular helm upgrade since the upgrade is now stable. Co-authored-by: trujillo-adam <[email protected]>
* Datadog Integration (#3407)
* datadog-integration: updated consul-server agent telemetry-config.json with dd specific items as well as additional missing VM based options, unit tests, dd unix socket integration, dd agent acl token generation, deployment override failsafes
* datadog-integration: updated consul-server agent telemetry-config.json with dd specific items as well as additional missing VM based options, unit tests, dd unix socket integration, dd agent acl token generation | final initial-push
* changelog entry update
* datadog-integration: updated consul-server agent server.config (enable_debug) and telemetry.config update | enable_debug to server.config
* curt pr review changes (minus extraConfig templating verification changes)
* global.metrics.AgentMetrics -> global.metrics.enableAgentMetrics
* dogstatsd and otlp mutually exclusive verification checks
* breaking changes now incorporated into consul.validateExtraConfig helper template function as precheck
* extraConfig hash updates post merge conflict update
* fix helpers.tpl consul.extraConfig from merge --> /consul/tmp/extra-config/extra-from-values.json | add labels to rolebinding for datadog secrets
* update changelog .txt to match new PR number
* updated server-statefulset.yaml to correct ad.datadoghq.com/consul.logs annotation to valid single quote string
* fix helpers.tpl consul.extraConfig from merge --> /consul/tmp/extra-config/extra-from-values.json | add labels to rolebinding for datadog secrets
* fix helpers.tpl consul.extraConfig from merge --> /consul/tmp/extra-config/extra-from-values.json | add labels to rolebinding for datadog secrets
* update UDP dogstatsdPort behavior to exclude including a port value if using a kube service address (as determined by user overrides)
* update _helpers.tpl consul.ValidateDatadogConfiguration func to account for using 'https' as protocol => should fail
* update server-statefulset.yaml to exclude prometheus.io annotations if enabling datadog openmetrics method for consul server metrics scrape. conflict present with http vs https that breaks openmetrics scrape on consul
* update server-statefulset.yaml to exclude prometheus.io annotations if enabling datadog openmetrics method for consul server metrics scrape. conflict present with http vs https that breaks openmetrics scrape on consul
* correct otlp protocol helpers.tpl check to lower-case the protocol to match the open-telemetry-deployment.yaml behavior
* fix server-acl-init command_test.go for datadog token policy - datacenter should have been dc1
* add in server-statefulset bats test for extraConfig validation testing
* manual cherry-pick failed checks fix
* revert leave_on_terminate and autopilot updates from commit #3000 and re-apply datadog-integration branch changes
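One of the commits above mentions a mutual-exclusion check between the DogStatsD and OTLP forwarders. The real check lives in the chart's Helm templates (consul.ValidateDatadogConfiguration); as a rough sketch of the rule, with the function and parameter names being assumptions for illustration:

```python
def validate_metrics_forwarders(dogstatsd_enabled: bool, otlp_enabled: bool) -> None:
    """Hypothetical re-statement of the chart's validation rule:
    DogStatsD and OTLP metrics forwarding must not both be enabled."""
    if dogstatsd_enabled and otlp_enabled:
        raise ValueError("dogstatsd and otlp forwarding are mutually exclusive")
```

In the chart itself this would fail the `helm install`/`helm upgrade` at template-render time rather than at runtime.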
How I've tested: