bug: StatefulSets are not reevaluated after rescheduling #67

miguelbernadi · 2024-01-04T14:07:11Z

Context

Noe caches the Pods it has already processed to avoid reprocessing
them unless there are changes. We do this by storing the
NamespacedName of the Pod, which includes the namespace and Pod
name.

For Deployments, rescheduling the Pod actually changes its name as the
termination is generated. For StatefulSets, the name is kept the same
no matter how many reschedules happen.

Our caching approach prevents us form reevaluating the location of
StatefulSet pods if they were previously scheduled, and we've seen
this allocate StatefulSet pods to nodes whose architectures where not
supported by the Pod.

Bugfix approach

The intention of the cache is to reduce load over Registries and Noe
itself. In our case so far StatefulSets are much less common than
Deployments, so we choose to avoid the cache for StatefulSets.

This is a stopgap measure to prevent misscheduling of current
workloads until a better approach to prevent reprocessing of Pods can
be devised that supports StatefulSets.

What we change

We add a skipCache(*v1.Pod) function that checks if the Pod is owned
by a StatefulSet. If so, the caching mechanism is skipped for this
workload.

miguelbernadi · 2024-01-04T14:07:52Z

First commit is #66 and will be removed once that PR is merged.

pkg/controllers/pod.go

# Context Noe caches the Pods it has already processed to avoid reprocessing them unless there are changes. We do this by storing the `NamespacedName` of the Pod, which includes the namespace and Pod name. For Deployments, rescheduling the Pod actually changes its name as the termination is generated. For StatefulSets, the name is kept the same no matter how many reschedules happen. Our caching approach prevents us form reevaluating the location of StatefulSet pods if they were previously scheduled, and we've seen this allocate StatefulSet pods to nodes whose architectures where not supported by the Pod. # Bugfix approach The intention of the cache is to reduce load over Registries and Noe itself. In our case so far StatefulSets are much less common than Deployments, so we choose to avoid the cache for StatefulSets. This is a stopgap measure to prevent misscheduling of current workloads until a better approach to prevent reprocessing of Pods can be devised that supports StatefulSets. # What we change We add a `skipCache(*v1.Pod)` function that checks if the Pod is owned by a StatefulSet. If so, the caching mechanism is skipped for this workload. Change-Id: Ie7bdf2d863f433e7998b1b92cea129c6bda4dd8f

Co-authored-by: João Alves <[email protected]>

carlosrejano approved these changes Jan 4, 2024

View reviewed changes

joaoqalves requested changes Jan 8, 2024

View reviewed changes

pkg/controllers/pod.go Outdated Show resolved Hide resolved

pkg/controllers/pod.go Outdated Show resolved Hide resolved

miguelbernadi force-pushed the fix-statefulset-reschedule-bug branch from c5e2458 to 33ae035 Compare January 8, 2024 11:40

fixup! bug: StatefulSets are not reevaluated after rescheduling

511302d

Co-authored-by: João Alves <[email protected]>

joaoqalves self-requested a review January 8, 2024 11:45

joaoqalves approved these changes Jan 8, 2024

View reviewed changes

miguelbernadi merged commit 8957c00 into adevinta:main Jan 8, 2024
1 check passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

bug: StatefulSets are not reevaluated after rescheduling #67

bug: StatefulSets are not reevaluated after rescheduling #67

miguelbernadi commented Jan 4, 2024

miguelbernadi commented Jan 4, 2024

bug: StatefulSets are not reevaluated after rescheduling #67

bug: StatefulSets are not reevaluated after rescheduling #67

Conversation

miguelbernadi commented Jan 4, 2024

Context

Bugfix approach

What we change

miguelbernadi commented Jan 4, 2024