
Update KubevirtVmHighMemoryUsage.md #236

Closed
wants to merge 1 commit

Conversation

@sradco (Collaborator) commented May 1, 2024

What this PR does / why we need it:
The PR fixes KubevirtVmHighMemoryUsage runbook.
It updates the runbook to state that the alert fires when a container hosting a virtual machine (VM) has less than 50 MB of free memory, instead of 20 MB.
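
For context, the condition described above could be expressed as a Prometheus rule roughly like the following (a minimal sketch only; the metric name is an assumption for illustration and this is not necessarily the exact rule KubeVirt ships):

```yaml
# Minimal sketch of the documented condition: alert when a container hosting a
# VM has less than 50 MB (52428800 bytes) of free memory.
# The metric name below is an assumption for illustration purposes.
- alert: KubevirtVmHighMemoryUsage
  expr: kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes < 52428800
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} has less than 50 MB of free memory."
```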

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
https://issues.redhat.com/browse/OCPBUGS-24377

Special notes for your reviewer:

Checklist

This checklist is not enforced, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

NONE

@kubevirt-bot added the dco-signoff: yes label (indicates the PR's author has DCO signed all their commits) on May 1, 2024
@kubevirt-bot requested a review from apinnick on May 1, 2024 at 12:59
@sradco
Copy link
Collaborator Author

sradco commented May 1, 2024

@machadovilaca please review

@sradco requested a review from machadovilaca and removed the request for apinnick on May 1, 2024 at 13:00
@fabiand (Member) commented May 2, 2024

/hold

Didn't we say we have to remove this alert?

@xpivarc @acardace @iholder101 ?

@kubevirt-bot added the do-not-merge/hold label (indicates that the PR should not merge because someone has issued a /hold command) on May 2, 2024
@acardace (Member) commented May 2, 2024

> /hold
>
> Didn't we say we have to remove this alert?
>
> @xpivarc @acardace @iholder101 ?

No, we never did; we planned to remove the component-related alerts, not the workload-related ones.

@fabiand (Member) commented May 2, 2024

Ah! Thanks, yes.

So, for this workload-related alert, I think we have to remove it as well.
It's simply not correct, because it takes reclaimable memory into account.

IOW, it fires even if we are not really exceeding the limits.

Do I recall this correctly?

@sradco (Collaborator, Author) commented May 2, 2024

@fabiand I suggested updating the alert expression to use the pod metrics instead of the KubeVirt VMI metrics.
I believe the current expression is incorrect.

@fabiand (Member) commented May 2, 2024

I'm not seeing how a switch to pod metrics helps to address the problem.

@sradco (Collaborator, Author) commented May 2, 2024

@fabiand Because Kubernetes uses the container_memory_working_set_bytes metric when deciding to OOM-kill a pod.
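
For context, a pod-metrics version of the check, along the lines sradco describes, might look roughly like this (a minimal sketch; the label selectors and the 50 MB threshold are assumptions carried over from the PR description, while kube_pod_container_resource_limits and container_memory_working_set_bytes are the standard kube-state-metrics and cAdvisor series):

```yaml
# Sketch only: free memory computed from pod metrics, i.e. the memory limit of
# the compute container minus its working set, which is the figure the kubelet
# compares against before OOM-killing the container.
- alert: KubevirtVmHighMemoryUsage
  expr: |
    kube_pod_container_resource_limits{pod=~"virt-launcher-.*", container="compute", resource="memory"}
      - on(pod, container, namespace)
    container_memory_working_set_bytes{pod=~"virt-launcher-.*", container="compute"}
      < 52428800
  for: 5m
  labels:
    severity: warning
```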

@iholder101 (Contributor)

> /hold
>
> Didn't we say we have to remove this alert?
>
> @xpivarc @acardace @iholder101 ?

TBH I was confused by this as well :)

I wonder how these alerts could benefit the user.
Except for some advanced exceptions*, when a user creates a VM, he defines the guest memory. The fact that we choose to set certain limits on the pod in certain situations is completely hidden from the VM owner (even the fact that a VM runs inside a pod is somewhat of an implementation detail). In the vast majority of cases this alert was firing because we (kubevirt) did not set the limits correctly, or did not anticipate a spike caused by one of our dependencies (e.g. libvirt).

Now, as a user, let's say I create a VM with dedicated CPUs and 2Gi of guest memory. Let's say that after a while I see a KubevirtVmHighMemoryUsage alert; how should I react?

The docs say the mitigation is to "Increase the memory limit in the VirtualMachine specification", but as a user I did not set any limits! The limits are automatically set and calculated by virt-controller...

So I tend to vote +1 for removing it, unless I'm missing something.

(*) An advanced user CAN set the limits, or use advanced features like overcommitmentRatio, but these are special cases that are far from the "average VM".

@sradco (Collaborator, Author) commented May 2, 2024

@fabiand @iholder101 As a user, I asked for a VM with 4 GB, and now I'm approaching a memory limit that puts my VM at risk of eviction. I need to know about it so I can resize my VM memory. No?

Why should it matter who set the limit?

@iholder101 (Contributor)

> @fabiand @iholder101 As a user, I asked for a VM with 4 GB, and now I'm approaching a memory limit that puts my VM at risk of eviction. I need to know about it so I can resize my VM memory. No?
>
> Why should it matter who set the limit?

As a user, you've requested a VM with 4 GB. Now you log into your guest and can use up to those 4 GB. Let's say that you even use all or most of the 4 GB of memory.

Now you get an alert telling you that you are about to reach the memory limit.

But you're not aware of any limits. You've requested 4 GB and you're using 4 GB. You, as a user, did not cross any boundary; you're just using what you've asked for.

How is such a user able to mitigate this, or even understand what's going on?

@sradco (Collaborator, Author) commented May 11, 2024

> As a user, you've requested a VM with 4 GB. Now you log into your guest and can use up to those 4 GB. Let's say that you even use all or most of the 4 GB of memory.
>
> Now you get an alert telling you that you are about to reach the memory limit.
>
> But you're not aware of any limits. You've requested 4 GB and you're using 4 GB. You, as a user, did not cross any boundary; you're just using what you've asked for.
>
> How is such a user able to mitigate this, or even understand what's going on?

@iholder101 When we check this based on the pod metrics, not the kubevirt metrics:

  • Will the Admin be able to do something with the alert?
  • Is the VM at risk?

If both answers to the above questions are "no", then I think we should deprecate the alert.
If one or both is "yes", I think we should consider whether the alert needs to be updated or replaced with something else, but it probably still requires an alert.

@iholder101 (Contributor)

> As a user, you've requested a VM with 4 GB. Now you log into your guest and can use up to those 4 GB. Let's say that you even use all or most of the 4 GB of memory.
> Now you get an alert telling you that you are about to reach the memory limit.
> But you're not aware of any limits. You've requested 4 GB and you're using 4 GB. You, as a user, did not cross any boundary; you're just using what you've asked for.
> How is such a user able to mitigate this, or even understand what's going on?

These are important questions, let's discuss them.

> @iholder101 When we check this based on the pod metrics, not the kubevirt metrics:

>   • Will the Admin be able to do something with the alert?

Basically, the admin has the following options:

  • Report a bug, wait for it to be fixed, update Kubevirt.
  • (if the bug is crucial enough:) Report a bug, then use a workaround until it is fixed. The most suitable workaround would probably be using kv.spec.configuration.additionalGuestMemoryOverheadRatio.
    • There are other workarounds, like manually setting a higher memory request than the one calculated by Kubevirt, but that's a more advanced option.

note: keep in mind that a configuration such as kv.spec.configuration.additionalGuestMemoryOverheadRatio:

  • affects the whole cluster.
  • needs to be removed eventually.
  • is not trivial to configure (a minimal sketch follows below).
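
For reference, a minimal sketch of that cluster-wide setting on the KubeVirt CR (the ratio value is only an example):

```yaml
# Sketch of the cluster-wide workaround mentioned in the note above.
# The ratio value "1.2" is an example only; this setting affects every VM in the
# cluster and should be removed once the underlying overhead calculation is fixed.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    additionalGuestMemoryOverheadRatio: "1.2"
```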

>   • Is the VM at risk?

Yes, it definitely is.

> If both answers to the above questions are "no", then I think we should deprecate the alert. If one or both is "yes", I think we should consider whether the alert needs to be updated or replaced with something else, but it probably still requires an alert.

Please note that in the "what the admin can do" section above, both options include a step of reporting a bug. This is perhaps the main point I want to emphasize here: the fact that a virt-launcher pod's memory reaches its limits is a bug that Kubevirt is responsible for, not the user/admin. That said, it's important to understand that Kubevirt currently suffers from an inaccurate calculation of virt-launcher's memory requests/limits.

So I believe the discussion we're having here can be generalized: how should Kubevirt cope with scenarios in which we have challenges in certain aspects, or IOW, in cases where Kubevirt doesn't perform a specific task well enough?

The current flow looks something like the following:

  1. The admin sees that this alert is firing.
  2. The admin doesn't understand how to mitigate the problem, and probably what the problem even is.
  3. If the problem is not severe, that is, either VMs aren't being killed in practice or VMs are rarely killed in a way that's acceptable to the admin, the admin might dismiss the alert, treating it as spam or as a mysterious alert that he's not sure about.
  4. If the problem is severe, the admin would probably open a bug or contact Kubevirt developers, who will provide a workaround until the case is studied and fixed.

The problem with this flow is that Kubevirt does not take responsibility for its weaker parts. We practically tell our users: "we know there are problems; if you face one, we've added an alert to act as debug information for us, so you can contact us and we'll consider fixing your problem in the next release or help you with a workaround". A better approach would be for us developers to receive this debug information and fix the problem for the user proactively.

I think that a better flow would look something like:

  1. The admin creates VMs for which the memory requests/limits are miscalculated.
  2. Telemetry reports that fact.
  3. Kubevirt developers proactively work on fixing the problem and improving our calculation logic.
  4. The next release of Kubevirt would hopefully be better in that respect, and this would be measurable with more telemetry.
  5. In severe cases we would provide a workaround when needed.

That is, Kubevirt should be the one proactively chasing bugs and fixing them while taking responsibility for its weaker parts.


As for the alert itself: I'm not entirely sure if we should keep this alert at all if we go for the suggested approach above, but the least I would expect from such an alert is:

  1. Make sure it has low visibility (so we don't spam our users, which in the long term leads to them not taking our alerts seriously).
  2. Make it very clear that this is a bug, not a misconfiguration, and that the admin should report this bug immediately.
  3. Suggest a workaround like kv.spec.configuration.additionalGuestMemoryOverheadRatio, noting that it has serious implications for the cluster as a whole, should be used with care, and should be removed after the problem is fixed (a rough sketch follows below).
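
A rough sketch of how these three points could translate into alert metadata (the severity value, wording, and metric name are all assumptions, not an agreed design):

```yaml
# Sketch only: low-visibility alert whose text tells the admin this is a
# KubeVirt bug to report, and that points at the cluster-wide workaround
# together with its caveats.
- alert: KubevirtVmHighMemoryUsage
  expr: kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes < 52428800  # assumed metric name
  labels:
    severity: info   # low visibility, per point 1
  annotations:
    summary: "virt-launcher pod memory is close to its limit."
    description: >-
      This usually indicates a miscalculation of virt-launcher memory overhead
      in KubeVirt itself, not a user misconfiguration; please report it as a bug.
      As a temporary, cluster-wide workaround you may set
      kv.spec.configuration.additionalGuestMemoryOverheadRatio; use it with care
      and remove it once the underlying issue is fixed.
```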

@sradco WDYT?
@vladikr @xpivarc thoughts?

@sradco (Collaborator, Author) commented May 12, 2024

@iholder101

I agree with you that we should not have an alert for the virt-launcher overhead issue; it should be replaced with a metric we can perhaps collect via Telemetry, like we did for virt-handler, virt-controller, virt-api and virt-operator.

What I meant is that I think we should alert on the VM guest when its available memory, from the guest's POV, is low, so that the VM owner can consider requesting more memory for the VMI.

  • What would be the impact of having very low available memory for the VMI (for updating the runbook)?
  • When should we alert? when below 50MB?

@sradco (Collaborator, Author) commented May 12, 2024

Closing this PR, since we will deprecate this alert and introduce another alert instead.

@sradco sradco closed this May 12, 2024
@iholder101 (Contributor)

> What I meant is that I think we should alert on the VM guest when its available memory, from the guest's POV, is low, so that the VM owner can consider requesting more memory for the VMI.

Sure! This is different and sounds like something reasonable to consider!

>   • What would be the impact of having very low available memory for the VMI (for updating the runbook)?

I guess the implication of the guest memory being all used is that the guest OS might start OOM-killing applications. I think this alert can be useful in certain cases.

>   • When should we alert? when below 50MB?

Hard to say. Can this be configurable?
50/100 MB sounds reasonable, although we may also consider going with a percentage (e.g. 5/10% left).
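
For illustration, a percentage-based guest-level alert along these lines might be sketched as follows (the alert name, both metric names, and the 5% threshold are assumptions, not an agreed design):

```yaml
# Sketch only: fire when the guest reports less than 5% of its memory as
# available. Metric names are assumptions; KubeVirt's actual guest-memory
# metrics and the final threshold would need to be confirmed.
- alert: KubevirtVmiGuestLowAvailableMemory
  expr: |
    kubevirt_vmi_memory_available_bytes
      / on(namespace, name)
    kubevirt_vmi_memory_domain_bytes
      < 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Guest OS of VMI {{ $labels.name }} is low on available memory and may start OOM-killing applications."
```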
