design-proposal: VirtualMachineInstanceMigration - Live migration to a named node #320

Open: wants to merge 4 commits into base: main

@tiraboschi tiraboschi commented Sep 3, 2024

What this PR does / why we need it:
Adds a design proposal to extend the VirtualMachineInstanceMigration
object with an additional API that lets a cluster admin
try to trigger a live migration of a VM while injecting,
on the fly, an additional NodeSelector constraint.
The additional NodeSelector can only restrict the set
of Nodes that are valid targets for the migration
(possibly down to a single host).
All the affinity rules defined on the VM spec are still
going to be satisfied.

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes https://issues.redhat.com/browse/CNV-7075

Special notes for your reviewer:
Something like this was already proposed and implemented in kubevirt/kubevirt#10712, where it has already been discussed.

Release note:

design-proposal: VirtualMachineInstanceMigration - Live migration to a named node

@kubevirt-bot kubevirt-bot added dco-signoff: yes and size/M labels Sep 3, 2024
tiraboschi pushed a commit to tiraboschi/kubevirt that referenced this pull request Sep 3, 2024
Follow-up and derived from:
kubevirt#10712
Implements:
kubevirt/community#320

TODO: add functional tests

Signed-off-by: zhonglin6666 <[email protected]>
Signed-off-by: Simone Tiraboschi <[email protected]>
Member

@dankenigsberg dankenigsberg left a comment

Lovely to see this clear design proposal (even if I don't like anything that assumes a specific node is long-living). I have two questions, though.


## Goals
- A user allowed to trigger a live-migration of a VM and list the nodes in the cluster is able to rely on a simple and direct API to try to live migrate a VM to a specific node.
- The explicit migration target overrules a nodeSelector or affinity and anti-affinity rules defined by the VM owner.
Member

I find this odd, as the VM and the application in it may not function well (or at all) if affinity is ignored. Can you share more about the origins of this goal? I'd expect the target node to be ANDed with existing anti/affinity rules.

Member Author

I tend to think that, for a cluster admin who is trying to force a VM to migrate to a named node, this is the natural and expected behaviour:
if I explicitly select a named node, I'm expecting that my VM will be eventually migrated there and nowhere else (such as on a different node selected by the scheduler according to a weighted combination of affinity criteria and resource availability and so on); then I can tolerate that the live migration will fail since I chose a wrong node, but the controller should only try to live-migrate it according to what I'm explicitly asking for.
And by the way this is absolutely consistent with the native k8s behaviour for pods.
spec.nodeName for pods is under spec for historical reasons but it's basically controlled by the scheduler:
when a pod is going to be executed, the scheduler is going to check it and, according to available cluster resources, nodeselectors, weighted affinity and anti-affinity rules and so on, it's going to select a node and write it on spec.nodeName on the pod objects. At this point the kubelet on the named node will try to execute the Pod on that node.
If the user explicitly sets spec.nodeName on a pod (or in the template in a deployment and so on), the scheduler is not going to be involved in the process since the pod is basically already scheduled for that node and nothing else, so the kubelet on that node will directly try to execute it there, possibly failing.
https://kubernetes.io/docs/concepts/scheduling-eviction/assign-pod-node/#nodename explicitly states:

If the nodeName field is not empty, the scheduler ignores the Pod and the kubelet on the named node tries to place the Pod on that node.
Using nodeName overrules using nodeSelector or affinity and anti-affinity rules.

And this in my opinion is exactly how we should treat a Live migration attempt to a named node.
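
The difference between the two paths described above can be sketched as follows. This is a minimal illustration and not scheduler code: `Pod`, `Node`, `matches` and `place` are hypothetical, heavily reduced stand-ins, and only a label-based nodeSelector is modelled.

```go
package main

import "fmt"

// Pod is a hypothetical, heavily reduced stand-in for a Kubernetes Pod.
type Pod struct {
	Name         string
	NodeName     string            // mirrors spec.nodeName
	NodeSelector map[string]string // mirrors spec.nodeSelector
}

// Node is a hypothetical stand-in for a cluster node.
type Node struct {
	Name   string
	Labels map[string]string
}

// matches reports whether the node carries every label in the selector.
func matches(n Node, selector map[string]string) bool {
	for k, v := range selector {
		if n.Labels[k] != v {
			return false
		}
	}
	return true
}

// place returns the node the pod ends up bound to, whether the scheduler was
// consulted, and whether a node was found at all. When NodeName is preset the
// scheduler is skipped, so the selector is never evaluated.
func place(p Pod, nodes []Node) (node string, viaScheduler bool, found bool) {
	if p.NodeName != "" {
		return p.NodeName, false, true
	}
	for _, n := range nodes {
		if matches(n, p.NodeSelector) {
			return n.Name, true, true
		}
	}
	return "", true, false
}

func main() {
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{"zone": "east"}},
		{Name: "node-b", Labels: map[string]string{"zone": "west"}},
	}
	selector := map[string]string{"zone": "west"}

	node, viaScheduler, found := place(Pod{Name: "p1", NodeSelector: selector}, nodes)
	fmt.Println("selector only:", node, viaScheduler, found)

	// The selector asks for zone=west, yet the preset name wins.
	node, viaScheduler, found = place(Pod{Name: "p2", NodeName: "node-a", NodeSelector: selector}, nodes)
	fmt.Println("nodeName preset:", node, viaScheduler, found)
}
```

In the second call the pod lands on `node-a` even though its selector would only match `node-b`, which is the "overrules" behaviour quoted from the Kubernetes documentation.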

Member

Let's take the following example (this is a real world use-case):

  1. An admin is adding a new node to a cluster to take it into prod. This node has a taint to prevent workloads from immediately landing there.
  2. The admin wants to migrate a VM to this node to validate that it is working properly.

If we AND a new selector for this node, then the migration will not take place, because there is the taint. We'd also need to add a toleration to get the vm scheduled to that node.

With spec.nodeName it would be no issue - initially - it could become one if Require*atRuntime effects are used.
However, with spec.nodeName all other validations - CPU caps, extended, storage, and local resources etc will be ignored. We are asking a VM to not start.
Worse: It would be really hard now to understand WHY the vm is not launching.

Thus I think we have to AND to the node selector, but need code to understand taints specifically (because taints keep workloads away).
Then we still need to think about a generic mechanism to deal with reasons of why a pod can not be placed on the selected node.

Member

I do not like taking examples from the historically-understandable Pod.spec.nodeName. Node identity is not something that should have typically been exposed to workload owners.

Can you summarize your reasoning into the proposal? I think I understand it now, but I am not at ease with it. For example, a cluster admin may easily violate anti/affinity rules that are important for app availability.

Member Author

@fabiand with taints it is a bit more complex: the valid effects for a taint are NoExecute, NoSchedule and PreferNoSchedule.
Bypassing the scheduler by directly setting spec.nodeName will allow us to bypass taints with the NoSchedule and PreferNoSchedule effects but, AFAIK, it will still be blocked by a NoExecute taint, which is also enforced by the kubelet with eviction.
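
A small sketch of that distinction, under simplifying assumptions: `Taint` and `keepsPodAway` are hypothetical names, tolerations are reduced to a set of tolerated keys, and the function only models which untolerated taints still matter once spec.nodeName is preset.

```go
package main

import "fmt"

// TaintEffect covers the three effects named in the comment above.
type TaintEffect string

const (
	NoSchedule       TaintEffect = "NoSchedule"
	PreferNoSchedule TaintEffect = "PreferNoSchedule"
	NoExecute        TaintEffect = "NoExecute"
)

// Taint is a hypothetical, reduced stand-in for a node taint.
type Taint struct {
	Key    string
	Effect TaintEffect
}

// keepsPodAway reports whether the taint prevents the pod from running on the
// node. toleratedKeys is a simplification of pod tolerations (key match only).
// nodeNamePreset models a pod created with spec.nodeName already set.
func keepsPodAway(t Taint, toleratedKeys map[string]bool, nodeNamePreset bool) bool {
	if toleratedKeys[t.Key] {
		return false
	}
	switch t.Effect {
	case NoExecute:
		// Still enforced after binding, so a preset nodeName does not help.
		return true
	case NoSchedule:
		// A hard constraint, but one that only the scheduler evaluates.
		return !nodeNamePreset
	default:
		// PreferNoSchedule is a soft preference and never blocks by itself.
		return false
	}
}

func main() {
	for _, effect := range []TaintEffect{NoSchedule, PreferNoSchedule, NoExecute} {
		t := Taint{Key: "new-node", Effect: effect}
		fmt.Printf("%-16s blocked via scheduler=%v, blocked with nodeName=%v\n",
			effect, keepsPodAway(t, nil, false), keepsPodAway(t, nil, true))
	}
}
```

With a selector that is ANDed to the existing rules, the scheduler path applies, so the new-node example above would additionally need a toleration for the NoSchedule taint.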

Member Author

@dankenigsberg yes, this is a critical aspect of this design proposal, so we should carefully explore and weigh the different alternatives, tracking them in the design proposal itself as a future reference.

In my opinion the choice strictly depends on the use case and the power we want to offer to the cluster admin when creating a live migration request to a named node.

Directly setting spec.nodeName on the target pod will completely bypass all the scheduling hints (spec.affinity.nodeAffinity.preferredDuringSchedulingIgnoredDuringExecution) and constraints (spec.affinity.nodeAffinity.requiredDuringSchedulingIgnoredDuringExecution), meaning that the target pod will be started on the named node regardless of how the VM is actually configured.

Another option is trying to append/merge (this sub-topic deserves by itself another discussion) something like

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchFields:
            - key: metadata.name
              operator: In
              values:
              - <nodeName>

to the affinity rules already defined on the VM.
My concern with this choice is that the affinity/anti-affinity grammar is pretty complex so, if the VM owner already defined some affinity/anti-affinity rules, we can easily end up with a set of conflicting rules such that the target pod cannot be scheduled on the named node or on any other node.
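
One detail of the append/merge option is worth spelling out: in Kubernetes, the entries of nodeSelectorTerms are ORed, while the requirements inside a single term are ANDed. Appending the snippet above as a new term would therefore widen the set of valid nodes instead of restricting it. A possible merge, sketched here with hypothetical reduced types (`Requirement`, `Term`, `pinToNode` are not KubeVirt or Kubernetes identifiers), adds the metadata.name requirement to every existing term:

```go
package main

import "fmt"

// Requirement is a reduced stand-in for a node selector requirement.
type Requirement struct {
	Key      string
	Operator string
	Values   []string
}

// Term is a reduced stand-in for one entry of nodeSelectorTerms.
// Requirements inside one term are ANDed; separate terms are ORed.
type Term struct {
	MatchExpressions []Requirement
	MatchFields      []Requirement
}

// pinToNode returns a copy of terms in which every term additionally requires
// metadata.name to be nodeName. With no existing terms it returns a single
// term carrying only that requirement. The input slice is left untouched.
func pinToNode(terms []Term, nodeName string) []Term {
	pin := Requirement{Key: "metadata.name", Operator: "In", Values: []string{nodeName}}
	if len(terms) == 0 {
		return []Term{{MatchFields: []Requirement{pin}}}
	}
	out := make([]Term, 0, len(terms))
	for _, t := range terms {
		out = append(out, Term{
			MatchExpressions: append([]Requirement(nil), t.MatchExpressions...),
			MatchFields:      append(append([]Requirement(nil), t.MatchFields...), pin),
		})
	}
	return out
}

func main() {
	// Rules set by the VM owner: zone=east OR any node exposing a "gpu" label.
	owner := []Term{
		{MatchExpressions: []Requirement{{Key: "zone", Operator: "In", Values: []string{"east"}}}},
		{MatchExpressions: []Requirement{{Key: "gpu", Operator: "Exists"}}},
	}
	for _, t := range pinToNode(owner, "node-a") {
		fmt.Printf("%+v\n", t)
	}
}
```

Each resulting term still carries the owner's requirement, so the named node is only accepted if it also satisfies at least one of the original terms. This sketch covers node affinity only; pod affinity and anti-affinity rules are left as they are.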

If the use case that we want to address is giving the cluster admin the right to try migrating a generic VM to a named node (for instance for maintenance/emergency reasons), this approach does not fully address it: there are many possible cases where the only viable option is still manually overriding the affinity/anti-affinity rules set by the VM owner.

I still tend to think that always bypassing the scheduler with spec.nodeName is the K.I.S.S. approach here, if trying to force a live migration to a named node is exactly what the cluster admin is trying to do.

Member Author

I summarized these considerations in the proposal itself, let's continue from there.

# Implementation Phases
A really close attempt was already tried in the past with https://github.com/kubevirt/kubevirt/pull/10712 but the PR got some pushback.
A similar PR should be reopened and refined, and we should implement functional tests.
Member

Would you outline the nature of the pushback? Do we currently have good answers to the issues raised back then?

Member Author

Trying to summarize (@EdDev please keep me honest on this), it was somehow considered a semi-imperative approach and it was pointed out that a similar behavior could already indirectly be achieved modifying on the fly and then reverting affinity rules on the VM object.
see: kubevirt/kubevirt#10712 (comment)
and: kubevirt/kubevirt#10712 (comment)

How imperative this really is, is debatable: in the end we already have a VirtualMachineInstanceMigration object that you can use to declare that you want to trigger a live migration; this is only about letting you also declare that you want a live migration to a named host.

The alternative approach based on amending the affinity rules on the VM object and waiting for the LiveUpdate rollout strategy to propagate it to the VMI before trying a live migration is described, pointing out its main drawback, in the Alternative design section in this proposal.

Member

Can you inline this succinctly? E.g., that PR got some pushback because it was not clear why a new API for one-off migration is needed. We give here a better explanation why this one-off migration destination request is necessary.

Member Author

done

Member

  • The "one-time" operation convinced me.
  • The reasoning for the real need is hard for me, but I did give feedback on this proposal about what is convincing me.

@iholder101
Contributor

/cc

- Cluster-admin: the administrator of the cluster

## User Stories
- As a cluster admin I want to be able to try to live-migrate a VM to a specific node for maintenance reasons, possibly overriding what the VM owner set
Member

I'd like to see more fleshed out user stories. It's unclear to me based on these user stories why the existing methods wouldn't suffice.

As a cluster admin I want to be able to try to live-migrate a VM to specific node for maintenance reasons eventually overriding what the VM owner set

For example, why wouldn't the cluster admin taint the source node and live migrate the vms away using the existing methods? Why would the admin need direct control over the exact node the VM goes to? I'd like to see a solid answer for why this is necessary over existing methods.

That's where this discussion usually falls apart and why it hasn't seen progress through the years. I'm not opposed to this feature, but I do think we need to articulate clearly why the feature is necessary

Member Author

I expanded this section

@kubevirt-bot kubevirt-bot added size/L and removed size/M labels Sep 4, 2024
@tiraboschi tiraboschi force-pushed the migration_target branch 3 times, most recently from 311710d to 339f9f5 Compare September 5, 2024 15:45
Comment on lines 38 to 67
## User Stories
- As a cluster admin I want to be able to try to live-migrate a VM to specific node for various possible reasons such as:
  - I just added to the cluster a new powerful node and I want to migrate a selected VM there without trying more than once according to scheduler decisions
  - I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations
  - Foreseeing a peak in application load (e.g. new product announcement), I'd like to balance in advance my cluster according to my expectation and not to current observations
  - During a planned maintenance window, I'm planning to drain more than one node in a sequence, so I want to be sure that the VM is going to land on a node that is not going to be drained in a near future (needing then a second migration) and being not interested in cordoning it also for other pods
  - I just added a new node and I want to validate it trying to live migrate a specific VM there
Member

nice! these are good reasons that hadn't been explored during previous discussions, thanks

When a pod is going to be executed, the scheduler is going to check it and, according to available cluster resources, nodeselectors, weighted affinity and anti-affinity rules and so on,
the scheduler is going to select a node and write its name on `spec.nodeName` on the pod object. At this point the kubelet on the named node will try to execute the Pod on that node.

If `spec.nodeName` is already set on a pod object as in this approach, the scheduler is not going to be involved in the process since the pod is basically already scheduled for that node and only for that named node, so the kubelet on that node will directly try to execute it there, possibly failing.
Member

I think using pod.spec.nodeName is likely the most straightforward approach. This does introduce some new failure modes that might not be obvious to admins.

For example, today if a target pod is unschedulable due to lack of resources, the migration object will time out due to the pod being stuck in "pending". This information is fed back to the admin as a k8s event associated with the migration object.

However, by setting the pod.spec.NodeName directly, we'd be bypassing the checks that ensure the required resources are available on the node (like the node having the "kvm" device available for instance), and the pod would likely get scheduled and immediately fail. I don't think we are currently bubbling up these types of errors to the migration object, so this could leave admins wondering why their migration failed.

I guess what I'm trying to get at here is, I like this approach, let's make sure the new failure modes get reported back on the migration object so the Admin has some sort of clue as to why a migration has failed.
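
To make the request above concrete, here is a rough sketch of how the two failure modes (stuck pending versus started and immediately failed) could both end up as a readable reason on the migration object. Everything in it is hypothetical: `TargetPod`, `MigrationStatus`, `statusFor`, the phase strings and the example reasons are illustrative and are not the KubeVirt controller's actual types or messages.

```go
package main

import "fmt"

// TargetPod is a hypothetical summary of the migration target pod.
type TargetPod struct {
	Phase          string // "Pending", "Running" or "Failed"
	Reason         string // human readable reason taken from the pod status
	PendingSeconds int    // how long the pod has been pending
}

// MigrationStatus is a hypothetical, reduced migration status.
type MigrationStatus struct {
	Phase         string
	FailureReason string
}

// statusFor maps the target pod state to a migration status so that both
// failure modes surface a reason instead of a bare failure.
func statusFor(p TargetPod, pendingTimeoutSeconds int) MigrationStatus {
	switch {
	case p.Phase == "Failed":
		// Typical when the scheduler was bypassed and the node rejected the pod.
		return MigrationStatus{Phase: "Failed", FailureReason: "target pod failed: " + p.Reason}
	case p.Phase == "Pending" && p.PendingSeconds > pendingTimeoutSeconds:
		// Typical when the scheduler could not find a node that fits.
		return MigrationStatus{
			Phase:         "Failed",
			FailureReason: fmt.Sprintf("target pod pending for %ds: %s", p.PendingSeconds, p.Reason),
		}
	case p.Phase == "Running":
		return MigrationStatus{Phase: "TargetReady"}
	default:
		return MigrationStatus{Phase: "Scheduling"}
	}
}

func main() {
	fmt.Printf("%+v\n", statusFor(TargetPod{Phase: "Failed", Reason: "kvm device not available"}, 300))
	fmt.Printf("%+v\n", statusFor(TargetPod{Phase: "Pending", Reason: "insufficient memory", PendingSeconds: 301}, 300))
	fmt.Printf("%+v\n", statusFor(TargetPod{Phase: "Pending", PendingSeconds: 10}, 300))
}
```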

Member

@davidvossel We already report the failure reason on the VMIM. This is part of the VMIM status.

pod.spec.nodeName entirely bypasses the scheduler, making AAQ unusable as it relies on "pod scheduling readiness".

From my pov, bypassing the scheduler is a no go.

Member Author

From my pov, bypassing the scheduler is a no go.

Luckily we also have another option, as described in:
### B. appending/merging an additional nodeAffinity rule on the target virt-launcher pod (merging it with VM owner set affinity/anti-affinity rules)

This will add an additional constraint for the scheduler summing it up with existing constraints/hints.
In case of mismatching/opposing rules, the destination pod will not be scheduled and the migration will fail.
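
As a rough illustration of option B, the extra constraint can only shrink the set of candidate nodes, and a conflict with the owner's rules leaves it empty. The sketch below is a simplification that models only a label-based nodeSelector; `Node` and `candidates` are hypothetical names.

```go
package main

import "fmt"

// Node is a hypothetical stand-in for a cluster node.
type Node struct {
	Name   string
	Labels map[string]string
}

// candidates returns the nodes that satisfy the VM owner's selector AND, when
// targetNode is non-empty, are the node requested for this migration.
func candidates(nodes []Node, ownerSelector map[string]string, targetNode string) []string {
	var out []string
	for _, n := range nodes {
		if targetNode != "" && n.Name != targetNode {
			continue
		}
		ok := true
		for k, v := range ownerSelector {
			if n.Labels[k] != v {
				ok = false
				break
			}
		}
		if ok {
			out = append(out, n.Name)
		}
	}
	return out
}

func main() {
	nodes := []Node{
		{Name: "node-a", Labels: map[string]string{"zone": "east"}},
		{Name: "node-b", Labels: map[string]string{"zone": "west"}},
		{Name: "node-c", Labels: map[string]string{"zone": "west"}},
	}
	owner := map[string]string{"zone": "west"}

	fmt.Println("no target:      ", candidates(nodes, owner, ""))
	fmt.Println("target node-c:  ", candidates(nodes, owner, "node-c"))
	// node-a violates the owner's selector: nothing is left to schedule on.
	fmt.Println("target node-a:  ", candidates(nodes, owner, "node-a"))
}
```

In the last case the target pod would stay unschedulable and the migration would fail, rather than landing on a node that the VM owner's rules exclude.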

Contributor

@vladikr @davidvossel +1.

spec.nodeName is a horrible field that is not being removed from Kubernetes only due to backward compatibility and causes a lot of trouble. I agree that it should be considered as a no-go.

Member

@vladikr vladikr left a comment

I understand the intention behind introducing the nodeName field, but I fail to see how something like this may work at scale. It seems to me that most, if not all, of the user stories listed in the proposal can already be achieved through existing methods. Adding this field could potentially cause confusion for admins and lead to unnecessary friction with the Kubernetes scheduler and descheduler flows. I'd prefer to see solutions to the user stories aligned closely with established patterns (descheduler policies or scheduler plugins).


## User Stories
- As a cluster admin I want to be able to try to live-migrate a VM to specific node for various possible reasons such as:
- I just added to the cluster a new powerful node and I want to migrate a selected VM there without trying more than once according to scheduler decisions
Member

I wonder what would be so special about these VMs that cannot be handled by a descheduler?
Also, how would the admin know that the said descheduler did not remove these VMs at a later time?

Member Author

The descheduler is going to decide according to its internal policy.
In the more general use case it will be a cluster admin who can decide to live migrate a VM just because he thinks it's the right thing to do.

- I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations
- Foreseeing a peak in application load (e.g. new product announcement), I'd like to balance in advance my cluster according to my expectation and not to current observations
- During a planned maintenance window, I'm planning to drain more than one node in a sequence, so I want to be sure that the VM is going to land on a node that is not going to be drained in a near future (needing then a second migration) and being not interested in cordoning it also for other pods
- I just added a new node and I want to validate it trying to live migrate a specific VM there
Member

This could be achieved today by modifying the VM's node selector or creating a new VM. New nodes will be the schedulers' very likely target for new pods already.

Member Author

Right,
from a pure technical perspective this feature can be already simply achieved directly manipulating the node affinity rules on the VM object. Now we have LiveUpdate rollout strategy and so the new affinity rules will be quickly propagated to the VMI and so consumed on the target pod of the live-migration.
No doubt, on the technical side it will work.

But the central idea of this proposal is about allowing a cluster admin doing that without touching the VM object.
This is for two main reasons:

  • separation of personas: the VM owner can set rules on his VM, while a cluster admin could still be interested in migrating a VM without messing up or altering the configuration set by the owner on the VM object.
  • separating what is a one-off configuration for the single migration attempt (so set on the VirtualMachineInstanceMigration object), which is relevant only for that attempt and should not produce any side effect in the future, from what is a long-term configuration that is going to stay there and be applied also later on (future live migrations, restarts).

This comment applies to all the user stories here.

## User Stories
- As a cluster admin I want to be able to try to live-migrate a VM to specific node for various possible reasons such as:
- I just added to the cluster a new powerful node and I want to migrate a selected VM there without trying more than once according to scheduler decisions
- I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations
Member

This is also doable today as the default scheduler will try to choose the least busy node to schedule the target pod.

- As a cluster admin I want to be able to try to live-migrate a VM to specific node for various possible reasons such as:
- I just added to the cluster a new powerful node and I want to migrate a selected VM there without trying more than once according to scheduler decisions
- I'm not using any automatic workload rebalancing mechanism and I periodically want to manually rebalance my cluster according to my observations
- Foreseeing a peak in application load (e.g. new product announcement), I'd like to balance in advance my cluster according to my expectation and not to current observations
Member

Could you please elaborate on this?
How would the cluster look like to the admins' expectations?
Couldn't a taint be placed on some nodes to resolve capacity before the new product announcement?

Member

Same, I do not want to argue with an admin on how the cluster should be managed, but this is surely not a recommended way we want to encourage/support.

@tiraboschi
Member Author

It seems to me that most, if not all, of the user stories listed in the proposal can already be achieved through existing methods.

Right, I also added this note:

Note

technically all of this can be already achieved manipulating the node affinity rules on the VM object, but as a cluster admin I want to keep a clear boundary between what is a long-lasting setting for a VM, defined by the VM owner, and what is single shot requirement for a one-off migration

@tiraboschi tiraboschi force-pushed the migration_target branch 3 times, most recently from 63818ed to a937ba2 Compare September 6, 2024 16:53
@vladikr
Member

vladikr commented Sep 7, 2024

I spoke with @fabiand offline.
Perhaps we can simply copy any provided Affinity and/or Tolerations set by the admin on the VMIM to the target pod -
instead of offering a dedicated API field.

My main concern with this proposal is that it may promote a wrong assumption that manual cluster balancing is preferred instead of relying on the scheduler/descheduler - while this is just a local minimum.

@tiraboschi
Member Author

I spoke with @fabiand offline. Perhaps we can simply copy any provided Affinity and/or Tolerations set but the admin on the VMIM to the target pod - instead of offering a dedicated API field.

I think that exposing the whole node affinity/anti-affinity (+ tolerations + ...) grammar on the VirtualMachineInstanceMigration object is by far too much.
In the end, as a cluster admin I only want to try to migrate that VM to a named node. All the other use cases are out of scope and should be addressed by correctly setting/amending the node affinity on the VM.
I still think that exposing an optional nodeName string on the VirtualMachineInstanceMigration spec is all of what we need to accomplish all the use cases here.

My main concern with this proposal is that it may promote a wrong assumption that manual cluster balancing is preferred instead of relying on the scheduler/descheduler - while this is just a local minimum.

I think it's up to us to emphasize this assumption in the API documentation making absolutely clear that the nodeName field is optional and we recommend to keep it empty to let the scheduler find the best node (if trying to migrate to a specific named node is not strictly needed).

I'm proposing something like:

// NodeName is a request to try to migrate this VMI to a specific node.
// If it is non-empty, the migration controller simply tries to configure the target VMI pod to be started on that node,
// assuming that it fits resources, limits and other node placement constraints; it will override nodeSelector and affinity
// and anti-affinity rules set on the VM.
// If it is empty (recommended), the scheduler becomes responsible for finding the best Node to migrate the VMI to.
// +optional
NodeName string `json:"nodeName,omitempty"`

I'm adding it to this proposal.
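
For illustration, this is how such an optional field would behave when serialized. `MigrationSpec` is a reduced, hypothetical stand-in for the VirtualMachineInstanceMigration spec that carries only the two fields discussed in this thread, and `encode` is a helper made up for this sketch.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// MigrationSpec is a hypothetical, reduced stand-in for the migration spec.
type MigrationSpec struct {
	// VMIName is the VMI to migrate.
	VMIName string `json:"vmiName,omitempty"`
	// NodeName is the optional request to target a specific node.
	// +optional
	NodeName string `json:"nodeName,omitempty"`
}

// encode serializes the spec to JSON.
func encode(s MigrationSpec) (string, error) {
	b, err := json.Marshal(s)
	if err != nil {
		return "", err
	}
	return string(b), nil
}

func main() {
	specs := []MigrationSpec{
		{VMIName: "vmi-fedora"},
		{VMIName: "vmi-fedora", NodeName: "my-new-target-node"},
	}
	for _, s := range specs {
		out, err := encode(s)
		if err != nil {
			fmt.Println("encode failed:", err)
			return
		}
		fmt.Println(out)
	}
}
```

Because of `omitempty`, a migration created without a target node serializes exactly as it does today, so existing manifests and clients would be unaffected by the new field.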

@vladikr
Member

vladikr commented Sep 9, 2024

I spoke with @fabiand offline. Perhaps we can simply copy any provided Affinity and/or Tolerations set but the admin on the VMIM to the target pod - instead of offering a dedicated API field.

I think that exposing the whole node affinity/anti-affinity (+ tolerations + ...) grammar on the VirtualMachineInstanceMigration object is by far too much. At the end, as a cluster admin I want to only to try to migrate that VM to a named node.

Setting affinity and toleration is exactly what any other user would need to do to allow scheduling a workload on tainted node, not sure why we need to facilitate this in the migration case.
Also, taking this route would not require us to add any new logic to the migration controller.

Generally speaking, Affinity and nodeSelector are the most acceptable ways to influence scheduling decisions.

All the other uses cases are out of scope and should be addressed correctly setting/amending the node affinity on the VM. I still think that exposing an optional nodeName string on the VirtualMachineInstanceMigration spec is all of what we need to accomplish all the use cases here.

My main concern with this proposal is that it may promote a wrong assumption that manual cluster balancing is preferred instead of relying on the scheduler/descheduler - while this is just a local minimum.

I think it's up to us to emphasize this assumption in the API documentation making absolutely clear that the nodeName field is optional and we recommend to keep it empty to let the scheduler find the best node (if trying to migrate to a specific named node is not strictly needed).

From my pov, we could get away without any API changes and without advertising this option at all - making it available for special cases and not a mainstream.

I'm proposing something like:

// NodeName is a request to try to migrate this VMI to a specific node.
// If it is non-empty, the migration controller simply try to configure the target VMI pod to be started onto that node,
// assuming that it fits resource, limits and other node placement constraints; it will override nodeSelector and affinity
// and anti-affinity rules set on the VM.
// If it is empty, recommended, the scheduler becomes responsible for finding the best Node to migrate the VMI to.
// +optional
NodeName string `json:"nodeName,omitempty"`

I'm adding it to this proposal.

@tiraboschi
Member Author

tiraboschi commented Sep 9, 2024

Setting affinity and toleration is exactly what any other user would need to do to allow scheduling a workload on tainted node, not sure why we need to facilitate this in the migration case. Also, taking this route would not require us to add any new logic to the migration controller.
...
From my pov, we could get away without any API changes and without advertising this option at all - making it available for special cases and not a mainstream.

I'm sorry but now I'm a bit confused.
As per the KubeVirt documentation,
in order to initiate a live migration I'm supposed to create a VirtualMachineInstanceMigration (VMIM) object like:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job
spec:
  vmiName: vmi-fedora

or, more imperatively, execute something like:

$ virtctl migrate vmi-fedora

that under the hood is going to create a VirtualMachineInstanceMigration for me.

This proposal is now about extending it with the optional capability to try to live migrate to a named node.
So this is proposing to allow the creation of something like:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job
spec:
  vmiName: vmi-fedora
  nodeName: my-new-target-node

or executing something like:

$ virtctl migrate vmi-fedora --nodeName=my-new-target-node

and this is because one of the key points here is that the cluster admin should not be required to amend the spec of VMs owned by other users in order to try to migrate them to named nodes.

The migration controller will simply notice that nodeName on the VirtualMachineInstanceMigration is not empty and it will inject/replace (still under discussion, we have two alternatives here) something like:

spec:
  nodeName: <nodeName>

or

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - <nodeName>

on the target virt-launcher pod.

Can you please summarize what exactly you mean by

Perhaps we can simply copy any provided Affinity and/or Tolerations set but the admin on the VMIM to the target pod - instead of offering a dedicated API field.

?

@vladikr
Member

vladikr commented Sep 9, 2024

Setting affinity and toleration is exactly what any other user would need to do to allow scheduling a workload on tainted node, not sure why we need to facilitate this in the migration case. Also, taking this route would not require us to add any new logic to the migration controller.
...
From my pov, we could get away without any API changes and without advertising this option at all - making it available for special cases and not a mainstream.

I'm sorry but now I'm a bit confused.

Yes, apologies. I meant to say a dedicated API.
What I mean is that if we must support this behavior (which I'm not 100% convinced we should), then we can simply expose the already existing fields on the VMIM object, such as .spec.affinity and .spec.tolerations. The user will express their intent as it would be done on any other pod. Our migration controller will simply copy it to the target pod and merge it with the existing ones from the VMI.

According to the [KubeVirt documentation](https://kubevirt.io/user-guide/compute/live_migration/), in order to initiate a live migration I'm supposed to create a VirtualMachineInstanceMigration (VMIM) object like:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job
spec:
  vmiName: vmi-fedora

or, more imperatively, execute something like:

$ virtctl migrate vmi-fedora

that under the hood is going to create a VirtualMachineInstanceMigration for me.

This proposal is now about extending it with the optional capability to try to live migrate to a named node. So this is proposing to allow the creation of something like:

apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job
spec:
  vmiName: vmi-fedora
  nodeName: my-new-target-node

or executing something like:

$ virtctl migrate vmi-fedora --nodeName=my-new-target-node

and this is because one of the key points here is that the cluster admin should not be required to amend the spec of VMs owned by other users in order to try to migrate them to named nodes.

The migration controller will simply notice that nodeName on the VirtualMachineInstanceMigration is not empty and it will inject/replace (still under discussion, we have two alternatives here) something like:

I think that by using .spec.affinity and .spec.tolerations the controller doesn't need to make any assumptions.
Also, nodeName will not try to migrate the workload to a tainted node.

spec:
  nodeName: <nodeName>

or

spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - <nodeName>

on the target virt-launcher pod.

Can you please summarize what exactly you mean by

Perhaps we can simply copy any provided Affinity and/or Tolerations set by the admin on the VMIM to the target pod - instead of offering a dedicated API field.

?

Yes. As I mentioned above, I would prefer to let the admin add .spec.affinity and/or .spec.tolerations to the VMIM object, and the migration controller would merge these on the target pod.
This way, there wouldn't be a need for new logic in the controller and the admin won't need to make assumptions about the .spec.nodeName field.
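
To make this alternative concrete, here is a minimal sketch of what such a VMIM could look like. The `affinity` and `tolerations` fields under `spec` are hypothetical: they are the fields being suggested in this comment, not part of the VirtualMachineInstanceMigration API shown earlier in the thread. The taint key `example.com/maintenance` is an illustrative assumption, and `my-new-target-node` is reused from the examples above.

```yaml
# Hypothetical sketch: .spec.affinity and .spec.tolerations on a VMIM are
# the fields proposed in this comment, not an existing API.
apiVersion: kubevirt.io/v1
kind: VirtualMachineInstanceMigration
metadata:
  name: migration-job
spec:
  vmiName: vmi-fedora
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchFields:
              - key: metadata.name
                operator: In
                values:
                  - my-new-target-node
  tolerations:
    # Assumed taint key, for illustration only.
    - key: example.com/maintenance
      operator: Exists
      effect: NoSchedule
```

One detail this sketch leaves open is the merge itself: Kubernetes ORs multiple `nodeSelectorTerms` together, so simply appending the admin's term to the terms already on the VMI would widen the set of eligible nodes rather than restrict it.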

@jean-edouard
Contributor

  1. Clarifying the use case: The primary use case is the desire of users who are transitioning from other platforms to keep their existing workflow.

It isn't just because "it's how we've always done it", other VM management platforms implemented and continue to maintain similar features because they are useful for a variety of reasons. As an admin, being able to easily and deterministically force a one time migration to a specified node without altering any other system parameters or VM definition is immensely useful.

Kubernetes uses many variables to decide where to place workloads and when to move them around.
Pod specs can of course influence the type of node they want to be placed on, but ultimately it's Kubernetes' main job to dispatch pods to available nodes. (KubeVirt VMs are just pods)

If VMIMigration objects had an option to decide where to move VMIs, nothing would guarantee that the VMI would stay there for any amount of time!
While "Move this thing to that computer" is a valid request in a bare-metal virtualization environment, in a cloud environment it isn't. (even just the concept "moving" is a stretch, traditional pods are terminated and recreated elsewhere)

While we put a lot of effort into making KubeVirt look and feel like "traditional" virtualization platforms, we also have to play by the cloud rules, which means some old habits need to be adjusted. The user/admin workflow that led to this design proposal is one of them.

@EdDev
Member

EdDev commented Dec 9, 2024

  1. Could you please highlight this in this PR? It would help the reviewers to understand the scope of the problem.

I think it is expressed in the user stories section, which is the important part.

  1. Clarifying the use case: The primary use case is the desire of users who are transitioning from other platforms to keep their existing workflow.

It isn't just because "it's how we've always done it", other VM management platforms implemented and continue to maintain similar features because they are useful for a variety of reasons. As an admin, being able to easily and deterministically force a one time migration to a specified node without altering any other system parameters or VM definition is immensely useful.

The maintainers/approvers seem to look for the "why" behind the need to even do this action.
I personally find the expectation and request reasonable with the existing information, even though this collides with the regular and proper operation of Kubevirt. I believe the request comes from experienced VMM operators and we can trust that the need is real even if not all the details are understood.

To mitigate this we need to make live migrations a cluster admin only action.

Limiting this to an admin makes sense to me.
If this request triggers this, sounds like a good path to proceed.

Using priority queues sounds much bigger and more complicated, so it sounds less attractive to me.

@vladikr , assuming we will have a new CRD version for VMIM, strictly accessible by cluster admins, will that make this doable from your side?

While we put a lot of effort into making KubeVirt look and feel like "traditional" virtualization platforms, we also have to play by the cloud rules, which means some old habits need to be adjusted. The user/admin workflow that led to this design proposal is one of them.

KubeVirt aims to run on both cloud and bare metal (BM), therefore the need, as I see it, is valid. The solution should resolve potential problems if they exist.

@tiraboschi
Member Author

If VMIMigration objects had an option to decide where to move VMIs, nothing would guarantee that the VMI would stay there for any amount of time! While "Move this thing to that computer" is a valid request in a bare-metal virtualization environment, in a cloud environment it isn't. (even just the concept "moving" is a stretch, traditional pods are terminated and recreated elsewhere)

Correct and desired: it's just for the one-off migration attempt. It's not a long-term requirement; that can already be set on the VM object (we already have node selectors and node affinity there, as for regular pods).
By the way, a similar concept is also available (if you have dedicated nodes) on popular cloud providers like Amazon AWS EC2 or Google Cloud Platform Compute Engine (and probably others as well), so we cannot really call it an invalid idea.

While we put a lot of effort into making KubeVirt look and feel like "traditional" virtualization platforms, we also have to play by the cloud rules, which means some old habits need to be adjusted. The user/admin workflow that led to this design proposal is one of them.

Why?
Technically we already have node selectors and node affinity for VMs as for regular pods; we are not bypassing any rule here.
K8s documentation states

Often, you do not need to set any such constraints; the scheduler will automatically do a reasonable placement (for example, spreading your Pods across nodes so as not place Pods on a node with insufficient free resources). However, there are some circumstances where you may want to control which node the Pod deploys to...

And we already have an API to do that declaratively on VMs; this proposal is only about extending the VirtualMachineInstanceMigration object, in a fully declarative way, with a similar concept for the one-off migration, which is something that is not available for pods.

@tiraboschi
Member Author

  • Alternatively, consider creating a non-namespaced EmergencyVMIM CRD that only a cluster admin could create.

Something like a cluster-scoped resource?

apiVersion: kubevirt.io/v1
kind: EmergencyVirtualMachineInstanceMigration
metadata:
  name: migration-job
spec:
  vmiName: vmi-fedora
  vmiNamespace: usernamespace
  addedNodeSelector:
    accelerator: gpuenabled123
    kubernetes.io/hostname: "ip-172-20-114-199.example"

@greg-bock

The maintainers/approvers seem to look for the "why" behind the need to even do this action.
I personally find the expectation and request reasonable with the existing information, even though this collides with the regular and proper operation of Kubevirt. I believe the request comes from experienced VMM operators and we can trust that the need is real even if not all the details are understood.

💯

To mitigate this we need to make live migrations a cluster admin only action.

Limiting this to an admin makes sense to me. If this request triggers this, sounds like a good path to proceed.

Put this behind a feature flag and add an rbac check just for this call. Document all the warnings about consequences around the RBAC role and the feature flag (default disabled). Give the barrier to entry a bit of friction along with warnings and let admins shoot themselves in the foot if they so desire.
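
As a rough illustration of the "feature flag, default disabled" part of this suggestion, a cluster admin would opt in through the KubeVirt custom resource. The gate name `MigrationToNamedNode` below is purely hypothetical (the thread does not name one), and the resource name and namespace are assumptions that depend on how KubeVirt was installed.

```yaml
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt        # assumed name of the KubeVirt CR
  namespace: kubevirt   # assumed install namespace
spec:
  configuration:
    developerConfiguration:
      featureGates:
        # Hypothetical gate name, for illustration only.
        - MigrationToNamedNode
```

With the gate absent from the list, the targeted-migration field would presumably be rejected or ignored; which of the two is not specified in this discussion.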

@vladikr
Member

vladikr commented Dec 9, 2024

  1. Clarifying the use case: The primary use case is the desire of users who are transitioning from other platforms to keep their existing workflow.

It isn't just because "it's how we've always done it", other VM management platforms implemented and continue to maintain similar features because they are useful for a variety of reasons. As an admin, being able to easily and deterministically force a one time migration to a specified node without altering any other system parameters or VM definition is immensely useful.

@greg-bock Thanks. I'd love to hear more. Could you please explain how you determine (out of potentially thousands of nodes) to which node you need VM X to move? What is so special about this node? (Perhaps we could propose a different API and programmatically determine that via a different API??) When choosing this node, how do you take into account all the variables the scheduler takes into account when scheduling a node?
Some members find it hard to understand the usefulness of this feature, could you please elaborate?

- Workload balancing solution doesn't always work as expected
> I have configured my cluster with the descheduler and a load-aware scheduler (trimaran); thus, by default, my VMs will be regularly descheduled if utilization is not balanced, and trimaran will ensure that my VMs are scheduled to underutilized nodes. This often works; however, in exceptional cases, e.g. if the load changes too quickly or only one VM is suffering, and I want to avoid moving all VMs on the cluster, I need, as an exception, a tool to move one VM once to deal with this exceptional situation.
- Troubleshooting a node
- Validating a new node by migrating a specific VM there
Contributor

@tiraboschi @fabiand Sorry but I don't understand why live-migration is better than creating a new VM with a node selector of kubernetes.io/hostname?

- Experienced admins are used to controlling where their critical workloads are moved to
> I as an admin, notice that a VM with guaranteed resources is having issues (I watched the cpu iowait metric). In order to resolve the performance issue and keep my user happy, I as admin want to move the VM, without interruption, to a node which is currently underutilized - and will make the user's vm perform better.
- Workload balancing solution doesn't always work as expected
> I have configured my cluster with the descheduler and a load-aware scheduler (trimaran); thus, by default, my VMs will be regularly descheduled if utilization is not balanced, and trimaran will ensure that my VMs are scheduled to underutilized nodes. This often works; however, in exceptional cases, e.g. if the load changes too quickly or only one VM is suffering, and I want to avoid moving all VMs on the cluster, I need, as an exception, a tool to move one VM once to deal with this exceptional situation.
Contributor

Since this is a distributed system, observing at T1 that a node is underutilized doesn't mean that the target node will still be underutilized when you trigger the live migration. How would you guarantee that no new workloads will be scheduled on that same node?

@jean-edouard
Contributor

While we put a lot of effort into making KubeVirt look and feel like "traditional" virtualization platforms, we also have to play by the cloud rules, which means some old habits need to be adjusted. The user/admin workflow that led to this design proposal is one of them.

Why? Technically we already have node selectors and node affinity for VMs as for regular pods; we are not bypassing any rule here. K8s documentation states

Exactly my point. If objects themselves want to run on a node that has specific characteristics, they can specify them via node selector/affinities.

@greg-bock

@greg-bock Thanks. I'd love to hear more. Could you please explain how you determine (out of potentially thousands of nodes) to which node you need VM X to move? What is so special about this node? (Perhaps we could propose a different API and programmatically determine that via a different API??) When choosing this node, how do you take into account all the variables the scheduler takes into account when scheduling a node? Some members find it hard to understand the usefulness of this feature, could you please elaborate?

Advanced troubleshooting where a node might need to be prepped before the migration to collect information during the migration, or perhaps I need to move multiple production workloads (not necessarily all KubeVirt VMs) to the same node to cause behavior that a non-production reproducer hasn't been found for yet.

I might not be using KubeVirt scheduling logic at all, or I have a separate orchestration layer influencing those scheduler decisions for non technical reasons (like bin packing host machines for accounting/tax purposes).

Perhaps I've run into an odd issue with a maintenance where the scheduler is making things difficult and I just need to move one workload one time.

I've run into so many similar situations where these types of features are useful. No matter how many examples or use cases I might give I'll get the same response for each one. Go skin the cat these other 9 ways that are cumbersome, complicated, error prone, or might have worse unintended side effects.

When choosing this node, how do you take into account all the variables the scheduler takes into account when scheduling a node?

We aren't, we only want to influence one of those variables one time, and if it fails to start for other reasons like not enough resources or some other issue then that's fine. This would be no different than altering the VMI definition to target the node.

If I brought 10, 25, 100 more people into this conversation just stating they want this feature when would you all acquiesce if at all?

@tiraboschi
Member Author

The proposed API makes it easier for users to unintentionally (or intentionally) block system flows that depend on live migration. Even if it's not done maliciously, the new flexibility could lead to issues with upgrades, node drains, etc.
To mitigate this we need to make live migrations a cluster admin only action.

  • There is a suggestion to make VMIMs more like NetworkAttachmentDefinitions, restricted by RBAC (though I’m not sure how feasible this is).

kubevirt/kubevirt#13497 is removing "write rights" from the kubevirt.io:admin ClusterRole, which is aggregated to the default user-facing admin ClusterRole that is intended to be granted to namespace admins with a RoleBinding defined within their namespace.
Users with the cluster-admin role are not going to be impacted.
A new kubevirt.io:migrate ClusterRole is introduced for convenience.
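
For readers less familiar with the RBAC mechanics, the following sketch shows the general shape such a restriction could take. It is an assumption-laden illustration, not the content of kubevirt/kubevirt#13497: the ClusterRole name `example-vmim-writer`, its verb list, the user `alice`, and the binding name are all hypothetical, and the actual rules of `kubevirt.io:migrate` may differ. The namespace `usernamespace` is reused from the example earlier in the thread.

```yaml
# Sketch of a role that allows managing migration objects (names and verbs assumed).
apiVersion: rbac.authorization.k8s.io/v1
kind: ClusterRole
metadata:
  name: example-vmim-writer   # hypothetical name
rules:
  - apiGroups: ["kubevirt.io"]
    resources: ["virtualmachineinstancemigrations"]
    verbs: ["get", "list", "watch", "create", "delete"]
---
# Granting the convenience role to one user inside a single namespace.
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: allow-migrations      # hypothetical name
  namespace: usernamespace
subjects:
  - kind: User
    name: alice               # hypothetical user
    apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: ClusterRole
  name: kubevirt.io:migrate
  apiGroup: rbac.authorization.k8s.io
```

Binding the ClusterRole through a namespaced RoleBinding keeps the grant limited to that namespace, which matches the "namespace admin" scope described above.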

@iholder101
Contributor

Advanced troubleshooting where a node might need to be prepped before the migration to collect information during the migration, or perhaps I need to move multiple production workloads (not necessarily all KubeVirt VMs) to the same node to cause behavior that a non-production reproducer hasn't been found for yet.

You mentioned non-kubevirt VMs.
How are you "moving" (=recreating + scheduling) pods to a specific desired node? Do you edit the workload itself (e.g. deployment/replicaSet) in order to achieve this?

I might not be using KubeVirt scheduling logic at all, or I have a separate orchestration layer influencing those scheduler decisions for non technical reasons (like bin packing host machines for accounting/tax purposes).

I'm curious whether a descheduler was considered for this use-case.
Did you try using it? Did it help? If not, what issues were encountered?

@iholder101
Contributor

If I brought 10, 25, 100 more people into this conversation just stating they want this feature when would you all acquiesce if at all?

This is an interesting question.

When a company develops a product with the intention of making a profit, customer or potential customer requests for a feature are arguably the strongest reason to consider implementing it (though it's not the only factor, it is certainly extremely significant).

However, Kubevirt is not a product, but an open-source project. This means it doesn't belong to a specific company but rather to a community composed of various stakeholders. These stakeholders work for different companies, have different customers, and different interests. IMHO to get features into healthy open-source projects, consensus among the stakeholders is required. This usually involves convincing them that the feature is not only useful (and explaining why it is useful) but also that it won't negatively impact the different interests of the various stakeholders.

In conclusion, while many people requesting the same feature is a strong indicator of its potential usefulness, it is not enough to only show that people want the feature IMO. It is crucial to also demonstrate how the feature aligns with the interests of the various stakeholders and ensure it does not negatively impact their different priorities in the shorter and longer term.

I believe the request comes from experienced VMM operators and we can trust that the need is real even if not all the details are understood.

I trust VMM operators in the sense that I'm sure they're speaking from their experience and reflecting on real pain that they have. However, I think it is a reasonable (and even an expected) request to understand if they have considered certain alternatives and understand why it didn't help them.

@fabiand
Member

fabiand commented Dec 13, 2024

How are you "moving" (=recreating + scheduling) pods to a specific desired node? Do you edit the workload itself (e.g. deployment/replicaSet) in order to achieve this?

Container and VM workloads are different, this is why we have KubeVirt.
Thus to me this question is not helpful.
Do we live migrate containers? Why do we have it in KubeVirt then?

I'm curious whether a descheduler was considered for this use-case.
Did you try using it? Did it help? If not, what issues were encountered?

I am not sure what the question is here.
The descheduler is evicting only. It does not influence the scheduler at all.

@iholder101
Contributor

iholder101 commented Dec 13, 2024

How are you "moving" (=recreating + scheduling) pods to a specific desired node? Do you edit the workload itself (e.g. deployment/replicaSet) in order to achieve this?

Container and VM workloads are different, this is why we have KubeVirt.
Thus to me this question is not helpful.
Do we live migrate containers? Why do we have it in KubeVirt then?

I know containers and VMs are different and that containers don't live-migrate.

The use-case is to move both VMs and regular pods to a node, so I'm asking how it's being done.
If the intention is not to touch the workloads themselves, why is it okay to edit affinities for deployments but not for VMs?
How is a "one-shot" move for regular pods being done?
Was a similar feature requested for Kubernetes?

@enp0s3
Contributor

enp0s3 commented Dec 15, 2024

I trust VMM operators in the sense that I'm sure they're speaking from their experience and reflecting on real pain that they have. However, I think it is a reasonable (and even an expected) request to understand if they have considered certain alternatives and understand why it didn't help them.

@iholder101 I think that it will cost time and money to train people to get used to new habits. I also spoke to a technical solution architect, he said that the IT dept. in many large companies tend to just cordon nodes and move workloads manually, they do it before upgrades. The reason is that they don't trust the scheduler. I am talking about the old virtualization orchestration systems. Perhaps the scheduler there is really hard to configure, and easy to disable.

Bottom line, no technical reason as I see it. I have a question, is this feature going to harm the project stability?

@morete

morete commented Jan 22, 2025

Exceptional scenarios are the use case here, troubleshooting, debugging, exceptional mitigation actions due to noisy neighbours or things like this.
I am definitely not able to understand the potential risks of "disrupting system flows".
I would say putting this feature behind a feature gate and making it available to cluster admins via dedicated/protected CRD (like the proposed non-namespaced EmergencyVMIM CRD) and stating the corresponding warnings should be more than enough to go ahead and give it a go, real life feature use will tell if this should be generally available or just under protection.
In the meantime, how could I live migrate a running VM to a specific host without affecting the running VM (no restart)?

What is required to get the required missing labels to push this forward?

@enp0s3
Contributor

enp0s3 commented Jan 22, 2025

@morete Hi, you can do the following:

  1. Set temporary label on the destination node.
  2. Set node affinity in the VM spec.
  3. Trigger live migration using virtctl migrate myvm
  4. Remove the label from the node and the VM.
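
The steps above can be sketched roughly as follows. This is only an illustration under stated assumptions: the label key `example.com/migration-target` is made up, `my-new-target-node` is reused from earlier examples, and the manifest is a fragment (the rest of the VM definition is omitted). Whether an affinity change on a running VM is picked up by the migration without a restart depends on the KubeVirt version and configuration, which is not covered in this thread.

```yaml
# Step 1 (assumed label key and value):
#   kubectl label node my-new-target-node example.com/migration-target=true
#
# Step 2: fragment of the VirtualMachine spec with a temporary node affinity.
apiVersion: kubevirt.io/v1
kind: VirtualMachine
metadata:
  name: myvm
spec:
  template:
    spec:
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: example.com/migration-target
                    operator: In
                    values:
                      - "true"
#
# Step 3: virtctl migrate myvm
#
# Step 4: remove the label (note the trailing dash) and the affinity stanza:
#   kubectl label node my-new-target-node example.com/migration-target-
```

Note that steps 2 and 4 both modify the VM object itself, which is exactly the part of the workaround questioned in the replies that follow.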

@fabiand
Member

fabiand commented Jan 22, 2025

@enp0s3

Two questions

  1. As a cluster admin (as @morete describes), why should I manipulate the VM spec of a different user ("Set node affinity in the VM spec.")?
  2. As a cluster administrator, why would I want to choose such a serialized flow (with mandatory dependencies; I need to remove the label from the node and likely the node affinity from the VM) over a one-shot migration API?

@mw-0

mw-0 commented Jan 22, 2025

@enp0s3

Two questions

  1. As a cluster admin (as @morete describes), why should I manipulate the VM spec of a different user ("Set node affinity in the VM spec.")?
  2. As a cluster administrator, why would I want to choose such a serialized flow (with mandatory dependencies; I need to remove the label from the node and likely the node affinity from the VM) over a one-shot migration API?

As deployments are done with CD, any deviation on the VM spec will be overwritten. It shouldn't be necessary to turn off e.g. Argo or add a new spec to git for a label.

Member

@EdDev EdDev left a comment

This proposal has been baking and changing for too long now, attracting attention from time to time with very similar feedback (compared to its history).

I usually prefer to separate the reasoning/need from the suggested implementation to be able to evaluate a proposal like this.

I find the need strong enough to merit a solution from the KubeVirt side. It is in the project's best interest to seek solutions for a real need from field operators, especially when they are asking for it repeatedly and with reference to other management systems.

After the proposed solution/implementation passed many cycles of changes and adjustments, it seems good enough at this stage.
The concerns are valid; however, I think we took several steps to manage the risks, from limiting the migration operation to cluster admins (or other dedicated members) up to the suggested feature gate.
Therefore, I vote to take it.

/lgtm

While the raised concerns are valid, I find it unreasonable to leave this in a state of "sorry, we cannot do what you ask" or "sorry, we have not decided yet".
We should provide a solution per the needs or deny the request with a proper reason.
IMO the needs are valid and the solution is reasonable, but if someone has some other user-friendly way to solve this, please suggest another alternative solution (and not workarounds).

We should continue the discussions offline as well, to get this over with. It is dragging too long and makes many community members frustrated already.

@kubevirt-bot kubevirt-bot added the lgtm Indicates that a PR is ready to be merged. label Jan 22, 2025
tiraboschi pushed a commit to tiraboschi/kubevirt that referenced this pull request Jan 22, 2025
Follow-up and derived from:
kubevirt#10712
Implements:
kubevirt/community#320

TODO: add functional tests

Signed-off-by: zhonglin6666 <[email protected]>
Signed-off-by: Simone Tiraboschi <[email protected]>
tiraboschi pushed a commit to tiraboschi/kubevirt that referenced this pull request Jan 23, 2025
Follow-up and derived from:
kubevirt#10712
Implements:
kubevirt/community#320

Signed-off-by: zhonglin6666 <[email protected]>
Signed-off-by: Simone Tiraboschi <[email protected]>
tiraboschi pushed a commit to tiraboschi/kubevirt that referenced this pull request Jan 24, 2025
Follow-up and derived from:
kubevirt#10712
Implements:
kubevirt/community#320

Signed-off-by: zhonglin6666 <[email protected]>
Signed-off-by: Simone Tiraboschi <[email protected]>
Labels
dco-signoff: yes Indicates the PR's author has DCO signed all their commits. lgtm Indicates that a PR is ready to be merged. needs-approver-review Indicates that a PR requires a review from an approver. sig/compute size/L
Projects
None yet
Development

Successfully merging this pull request may close these issues.