diff --git a/recover_affected_workloads.md b/recover_affected_workloads.md
index 0ded790..c432bc3 100644
--- a/recover_affected_workloads.md
+++ b/recover_affected_workloads.md
@@ -8,25 +8,31 @@ nav_order: 8
 In kubernetes world there is no concept of rescheduling. When pods are bounded
 to a node they run till termination. In a case of hardware failure, or some other
-resource problem that hang the OS, there is no built-in solution to migrate or
+resource problem that freezes the OS, there is no built-in solution to migrate or
 restart existing pods.
 
 If the node misbehaves, and processes can not run anymore the control plane
 will start counting down. It will either be the kubelet process which doesn't
-report status to the control plane [(see node_status_update_frequency)][1] or
+report status to the control plane (see [node_status_update_frequency][]) or
 the containers themselves will fail/crash and be marked as `phase: failed`.
 
 Nodes status conditions, set by the node-controller in the control plane, will reflect
 the current status and how long the node has been in that state. For example, in
 the case where the node fails to update its status to the masters the node condition
 will be set to `type:Ready status:Unknown`.
 
-In order to gain the workload back, reschedule it and start it all over, the
-control plane needs to know that the running node is gone and the only way to achieve
-that is by deleting the node object, which will trigger [scheduling the pods for
-deletion][2]
+In order to get the workload back, reschedule it, and start it over, the control plane
+needs to know that the workload's pod is no longer running. There are two
+strategies to accomplish that:
 
-The job of [self-node-remediation](/remediation/self-node-remediation/self-node-remediation/) is to reboot a failing host, delete its node object,
-,and recreate it to restore the node to a working state.
+- [`OutOfServiceTaint`][] - uses the core Kubernetes feature named
+  [Non-Graceful Node Shutdown]. It triggers pods on the node to be forcefully deleted
+  if there are no matching tolerations on the pods. Persistent volumes attached to the
+  shut-down node will be detached, and new pods will be created on a different, running
+  node. This is the recommended strategy; the taint it relies on is shown below.
+
+- `ResourceDeletion` - deletes the pods on the node. This strategy is still supported for
+  backward compatibility with clusters running Kubernetes versions that do not support the
+  Non-Graceful Node Shutdown feature, but it is discouraged.
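+
+As an illustration of the mechanism, the taint behind the `OutOfServiceTaint` strategy is
+the core `node.kubernetes.io/out-of-service` taint. Outside of self-node-remediation, the
+Non-Graceful Node Shutdown feature can also be triggered by applying that taint by hand to
+a node that is known to be shut down. A minimal sketch, where `<node-name>` is a
+placeholder for the failed node:
+
+```sh
+# Mark a node that is confirmed to be shut down as out of service.
+# Pods on it without a matching toleration are then force-deleted and their
+# persistent volumes detached, so replacement pods can start on a healthy node.
+kubectl taint nodes <node-name> node.kubernetes.io/out-of-service=nodeshutdown:NoExecute
+```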
 
 # Pod recovery flow
 
@@ -48,7 +54,8 @@ After a successful unpublishing the pod and the volume can start again, on a dif
 node.
 
 ## Virtualized worker node
-Nodes which are VMs should be configured with a watchdog device ([see libvirt's watchdog support][3])
+
+Nodes that are VMs should be configured with a watchdog device (see [libvirt's watchdog support][]).
 In case it's not configured then self-node-remediation fallbacks to rebooting the machine using 'systemctl reboot'.
 The benefits of a virtualized watchdog is that its running on the hypervisor and is not vulnerable to resource starvation problem in the VM level
 resulting a quicker reboot action.
@@ -58,6 +65,7 @@
 
 TBD
 
-[1]: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
-[2]: https://kubernetes.io/docs/concepts/workloads/pods/pod-lifecycle/#pod-garbage-collection
-[3]: https://libvirt.org/formatdomain.html#watchdog-device
+[node_status_update_frequency]: https://kubernetes.io/docs/reference/command-line-tools-reference/kubelet/
+[`OutOfServiceTaint`]: https://kubernetes.io/docs/reference/labels-annotations-taints/#node-kubernetes-io-out-of-service
+[Non-Graceful Node Shutdown]: https://kubernetes.io/blog/2023/08/16/kubernetes-1-28-non-graceful-node-shutdown-ga/
+[libvirt's watchdog support]: https://libvirt.org/formatdomain.html#watchdog-device