
Introduce RequestEviction feature for evicting pods in background (KEP-1397) #1466

Merged

Conversation

ingvagabund
Contributor

@ingvagabund ingvagabund commented Jul 14, 2024

Implementation of #1354.

When the feature is enabled, each pod with the descheduler.alpha.kubernetes.io/request-evict-only annotation will have its eviction API error examined for a specific error code/reason and message. If matched, the eviction of such a pod will be interpreted as the initiation of an eviction in background.

The first version of the feature does not provide any way to configure a timeout for pending evictions in background, nor an interval for how often pending evictions in background are checked for timeouts. Both are set to 10 minutes by default. The default values can be adjusted based on provided feedback.
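For illustration, a minimal sketch of the flow described above. This is not the PR's actual code: the helper name is hypothetical, and the assumption that the background-initiation signal surfaces as a 429 TooManyRequests error is mine; the concrete error code/reason/message the descheduler matches may differ.

```go
package sketch

import (
	"context"
	"fmt"

	policyv1 "k8s.io/api/policy/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// evictOrRequestEviction issues a regular eviction. For pods annotated with
// descheduler.alpha.kubernetes.io/request-evict-only, an error matching the
// expected code/reason is not treated as a failure but as confirmation that
// an eviction in background has been initiated.
func evictOrRequestEviction(ctx context.Context, client kubernetes.Interface, namespace, name string) (inBackground bool, err error) {
	err = client.PolicyV1().Evictions(namespace).Evict(ctx, &policyv1.Eviction{
		ObjectMeta: metav1.ObjectMeta{Namespace: namespace, Name: name},
	})
	if err == nil {
		return false, nil // ordinary eviction succeeded
	}
	// Assumed matching rule: a 429 returned by an intercepting webhook is
	// interpreted as "eviction started in background" rather than a failure.
	if apierrors.IsTooManyRequests(err) {
		return true, nil
	}
	return false, fmt.Errorf("evicting %s/%s: %w", namespace, name, err)
}
```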

TODO:

  • evictions in background are respected
  • unit tests
  • feature gating implemented (EvictionsInBackground as a new feature gate)
  • e2e integration with kubevirt
  • feature configuration (currently none)

Refactorings needed before this PR can be merged:

@k8s-ci-robot k8s-ci-robot added do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. labels Jul 14, 2024
@k8s-ci-robot k8s-ci-robot requested review from a7i and JaneLiuL July 14, 2024 17:56
@ingvagabund ingvagabund force-pushed the eviction-in-background-code branch 17 times, most recently from 1c05dac to d022bf4 Compare July 17, 2024 14:54
@ingvagabund
Contributor Author

Compared to the virt-handler logs produced by my local kind, these are the logs produced by the upstream CI:

{"component":"virt-handler","level":"info","msg":"re-enqueuing VirtualMachineInstance default/vmi-masquerade-3","pos":"vm.go:1712","reason":"server error. command SyncVMI failed: \"LibvirtError(Code=38, Domain=0, Message='Unable to create tap device tap0: Permission denied')\"","timestamp":"2024-07-17T15:01:56.211181Z"}

It could be that kubevirt does not like the combination of running kind in a pod with dind enabled.

@ingvagabund ingvagabund force-pushed the eviction-in-background-code branch 6 times, most recently from 3210921 to 90a6b55 Compare July 19, 2024 07:18
@ingvagabund
Contributor Author

/retest-required

@@ -45,6 +50,10 @@ type DeschedulerServer struct {
SecureServing *apiserveroptions.SecureServingOptionsWithLoopback
DisableMetrics bool
EnableHTTP2 bool
// FeatureGates enabled by the user


Btw, the approach used by upstream is via ComponentGlobalsRegistry. Might be good to consider if it makes sense to introduce that.

Contributor Author


Not applicable here as one may create multiple descheduler instances for simulation purposes.

eventRecorder,
podInformer,
rs.DefaultFeatureGates,


Does it make sense to pass the feature gates as an argument everywhere, compared to using the global variable?

Contributor Author


I was keeping in mind the case where the descheduling framework is run as a simulator, e.g. when two or more framework instances are created in the same run/binary.
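As a sketch of what this argument-passing looks like (the struct and constructor shown here are simplified, not the PR's exact signatures): each instance holds its own gates instead of consulting a package-level global, so two framework instances in one binary can disagree on enabled features.

```go
package sketch

import "k8s.io/component-base/featuregate"

// EvictionsInBackground is the new feature gate introduced by this PR.
const EvictionsInBackground featuregate.Feature = "EvictionsInBackground"

// PodEvictor stores the gates it was constructed with instead of reading a
// global, so a simulator can run several instances side by side.
type PodEvictor struct {
	featureGates featuregate.FeatureGate
}

func NewPodEvictor(featureGates featuregate.FeatureGate) *PodEvictor {
	return &PodEvictor{featureGates: featureGates}
}

func (pe *PodEvictor) backgroundEvictionsEnabled() bool {
	return pe.featureGates.Enabled(EvictionsInBackground)
}
```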

return
}
// Ignore any pod that does not have eviction in progress
if _, exists := newPod.Annotations[EvictionInProgressAnnotationKey]; !exists {


how can we get to this line and delete pods that lost their annotation, when we do

if newPod.Annotations == nil {
    return
}
// Ignore pods that are not subject to an eviction in background
if _, exists := newPod.Annotations[EvictionRequestAnnotationKey]; !exists {
    return
}

just above?

Contributor Author


If a pod loses its annotation there's no way of knowing whether the annotation was removed by accident or intentionally. E.g. a live migration is no longer needed since some scoring mechanism decided it's more costly to perform a live migration than to drop the VM's state and re-create it.


So, shouldn't we remove the check above, or move this check there instead?

Contributor Author


Oh, I misunderstood. So it can happen that a pod has an eviction in background that finished correctly, yet if EvictionRequestAnnotationKey is removed (e.g. by accident or on purpose) the pod is no longer considered for further processing. EvictionRequestAnnotationKey is the referential one. We might still check for the EvictionInProgressAnnotationKey annotation as well, just in case there's one. Yet, some implementations might decide to just remove the EvictionRequestAnnotationKey annotation to signal that an eviction in background was aborted/canceled.


> Yet, if EvictionRequestAnnotationKey is removed (e.g. by accident or on purpose) the pod is no longer considered for further processing.

I am not sure if I see all the loose ends here :D

If the pod was in the cache when the Request annotation was removed, then we skip evicting it (thus we process the InProgress annotation)


@atiratree atiratree Nov 18, 2024


What's important is to properly remove the entry from the cache, which we would not do when evictionAssumed == false and EvictionRequestAnnotationKey is removed.

Contributor Author


Good catch. Updating the delete event handler to delete every pod that's in the cache: bb084d6


That is much better. I still think it is unfortunate that we do not deschedule pods that had both EvictionRequestAnnotationKey and EvictionInProgressAnnotationKey removed at the same time. If the owner cancels the progress in this way and the pod is not terminating, then the pod will be immune to descheduling.

Contributor Author


Removing the pod from the cache during update event just in case the pod is in the cache: 4425072


looks good, thx!
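A sketch of the update-handler behavior agreed on in this thread (the handler signature and the erCache.deletePod helper are hypothetical names, assumed alongside the PR's surrounding types): once the referential EvictionRequestAnnotationKey disappears from a pod, the cache entry goes too, whatever the reason for the removal.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// updatePodHandler drops any pod that lost EvictionRequestAnnotationKey from
// the eviction-requests cache, whether the eviction finished, was canceled,
// or the annotation was removed by accident.
func (pe *PodEvictor) updatePodHandler(oldObj, newObj interface{}) {
	newPod, ok := newObj.(*v1.Pod)
	if !ok {
		return
	}
	// Indexing a possibly-nil Annotations map is safe in Go.
	if _, exists := newPod.Annotations[EvictionRequestAnnotationKey]; exists {
		return // eviction in background still requested; keep the entry
	}
	// Request annotation gone: remove the pod from the cache just in case
	// it is there.
	pe.erCache.deletePod(newPod)
}
```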

@@ -108,6 +423,46 @@ func (pe *PodEvictor) SetClient(client clientset.Interface) {
pe.client = client
}

func (pe *PodEvictor) evictionRequestsTotal() uint {


I forgot what the consensus was last time, but do we want to introduce new metrics for this feature? https://github.com/kubernetes-sigs/descheduler/blob/master/keps/1397-evictions-in-background/README.md#metrics

Contributor Author


We do, yet not as part of this PR. I will have a follow-up for all the other changes.

ObjectMeta: metav1.ObjectMeta{
Name: fmt.Sprintf("kubevirtvmi-%v", idx),
Annotations: map[string]string{
"descheduler.alpha.kubernetes.io/request-evict-only": "",


Just to confirm, the feature depends only on the key, and not on the value? Would be good to add a little bit of documentation above EvictionRequestAnnotationKey and EvictionInProgressAnnotationKey.

Contributor Author


That is correct. +1 for documentation. I still have a blog post to write.


Ok, sounds good. Let's not forget about the constants and update them after as well.
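A possible shape for the documentation asked for here. The wording is mine, and the value of the in-progress key is an assumption; only the request-evict-only value appears in this PR.

```go
// EvictionRequestAnnotationKey marks a pod whose eviction should only be
// requested: a matching eviction API error is treated as the start of an
// eviction in background. Only the key's presence matters; the value is ignored.
const EvictionRequestAnnotationKey = "descheduler.alpha.kubernetes.io/request-evict-only"

// EvictionInProgressAnnotationKey is set by the component carrying out the
// background eviction (e.g. a live migration controller) while it is underway.
// The value is likewise ignored. (Key value assumed here.)
const EvictionInProgressAnnotationKey = "descheduler.alpha.kubernetes.io/eviction-in-progress"
```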

@ingvagabund ingvagabund force-pushed the eviction-in-background-code branch from a2b5211 to 7d50a3b Compare November 12, 2024 15:07
@k8s-ci-robot k8s-ci-robot added the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 13, 2024
@ingvagabund ingvagabund force-pushed the eviction-in-background-code branch from 7d50a3b to 1ee0351 Compare November 14, 2024 12:31
@k8s-ci-robot k8s-ci-robot removed the needs-rebase Indicates a PR cannot be merged because it has merge conflicts with HEAD. label Nov 14, 2024

Comment on lines 319 to 322
if oldPod.Annotations == nil {
return
}
if _, exists := oldPod.Annotations[EvictionInProgressAnnotationKey]; exists {


we can check the presence of a key on nil annotations

Contributor Author


The annotation value is currently empty, so oldPod.Annotations[EvictionInProgressAnnotationKey] will always be empty. Or did you mean a different check?

Comment on lines 487 to 489
if pod.Annotations != nil {
// eviction in background requested
if _, exists := pod.Annotations[EvictionRequestAnnotationKey]; exists {


we could simplify this and other checks as well
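The simplification relies on a Go guarantee: reading from a nil map is safe and yields the zero value, so the explicit nil guard adds nothing. A quick demonstration:

```go
package main

import "fmt"

func main() {
	var annotations map[string]string // nil, as on a pod without annotations
	// A lookup on a nil map does not panic; it returns the zero value and
	// ok == false, so the nil check before indexing can be dropped.
	v, ok := annotations["descheduler.alpha.kubernetes.io/request-evict-only"]
	fmt.Printf("value=%q present=%v\n", v, ok) // value="" present=false
}
```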


assumedRequestTimeoutSeconds uint
}

func newEvictionRequestsCache(assumedRequestTimeoutSeconds uint) *evictionRequestsCache {


It would be nice to introduce unit tests for the cache.

Contributor Author


There's one in TestEvictionRequestsCacheCleanup


Ok, that one looks sufficient. Nevertheless, from the long-term perspective it would be good to test all the cache operations separately from the PodEvictor. We would not have to be concerned with which operations the evictor uses and which it does not - useful, for example, when we make changes to the cache in the future.
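A hypothetical test along those lines, exercising the cache directly. It is meant to sit next to the PR's cache implementation: newEvictionRequestsCache appears in the diff above, while the addPod/hasPod/deletePod method names are assumptions of mine.

```go
package evictions

import (
	"testing"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
)

func TestEvictionRequestsCacheOperations(t *testing.T) {
	cache := newEvictionRequestsCache(600) // 10-minute assumed-request timeout
	pod := &v1.Pod{ObjectMeta: metav1.ObjectMeta{
		Name: "vmi-1", Namespace: "default", UID: types.UID("uid-1"),
	}}
	// Exercising the cache in isolation from the PodEvictor keeps these
	// operations covered even if the evictor stops using some of them.
	cache.addPod(pod)
	if !cache.hasPod(pod) {
		t.Fatalf("expected pod to be present after addPod")
	}
	cache.deletePod(pod)
	if cache.hasPod(pod) {
		t.Fatalf("expected pod to be absent after deletePod")
	}
}
```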

@ingvagabund ingvagabund force-pushed the eviction-in-background-code branch 2 times, most recently from 005f8ea to 17d7518 Compare November 15, 2024 14:36

func getPodKey(pod *v1.Pod) (string, error) {
uid := string(pod.UID)
if len(uid) == 0 {
return "", errors.New("cannot get cache key for pod with empty UID")


I think this error is quite troublesome. It bubbles up to so many places. Can we remove it?

The UID should be guaranteed on already created pods. Do we ever operate on pods that have not been created yet?

Contributor Author


Assuming all pods have the UID set, this error will never get returned.

Contributor Author


+1 for exploring this avenue and filtering out from eviction all pods that do not have their UID set, in a separate PR/improvement.

Contributor Author


Replacing the error for a missing UID with a panic.
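A sketch of that change, adapted from the snippet quoted above: since the API server guarantees a UID on every created pod, an empty UID indicates a programmer error, and panicking avoids threading an error return through every caller.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// getPodKey returns the pod's UID as the cache key. An empty UID can only
// occur for objects that never went through the API server, so it is treated
// as a programmer error rather than a recoverable condition.
func getPodKey(pod *v1.Pod) string {
	if len(pod.UID) == 0 {
		panic("cannot get cache key for pod with empty UID")
	}
	return string(pod.UID)
}
```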

@ingvagabund ingvagabund force-pushed the eviction-in-background-code branch from 9a3544d to 4425072 Compare November 19, 2024 12:55
@ingvagabund ingvagabund force-pushed the eviction-in-background-code branch from 18c833f to cfbb740 Compare November 19, 2024 14:04
@atiratree

LGTM, thanks!

@ingvagabund
Contributor Author

Squashing all the review commits now

When the feature is enabled, each pod with the descheduler.alpha.kubernetes.io/request-evict-only
annotation will have its eviction API error examined for a specific
error code/reason and message. If matched, the eviction of such a pod will be interpreted
as the initiation of an eviction in background.
@ingvagabund ingvagabund force-pushed the eviction-in-background-code branch from c45c8cb to 7d4ec60 Compare November 19, 2024 14:29
@ingvagabund ingvagabund added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 19, 2024
@ingvagabund
Contributor Author

@atiratree thank you for all your reviews and patience. This really helped.

@k8s-ci-robot k8s-ci-robot merged commit a4c09bf into kubernetes-sigs:master Nov 19, 2024
8 of 9 checks passed
@ingvagabund ingvagabund deleted the eviction-in-background-code branch November 19, 2024 14:58