
Update KubevirtVmHighMemoryUsage.md #236

Closed
wants to merge 1 commit

Conversation

@sradco (Collaborator) commented May 1, 2024

What this PR does / why we need it:
The PR fixes KubevirtVmHighMemoryUsage runbook.
It updates the runbook to state that the alert fires when a container hosting a virtual machine (VM) has less than 50 MB of free memory, instead of 20 MB.
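
For context, the condition described above could be expressed as a Prometheus rule roughly like the following (a minimal sketch only; the metric name is an assumption for illustration and this is not necessarily the exact rule KubeVirt ships):

```yaml
# Minimal sketch of the documented condition: alert when a container hosting a
# VM has less than 50 MB (52428800 bytes) of free memory.
# The metric name below is an assumption for illustration purposes.
- alert: KubevirtVmHighMemoryUsage
  expr: kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes < 52428800
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Container {{ $labels.container }} in pod {{ $labels.pod }} has less than 50 MB of free memory."
```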

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #
https://issues.redhat.com/browse/OCPBUGS-24377

Special notes for your reviewer:

Checklist

This checklist is not enforced, but it's a reminder of items that could be relevant to every PR.
Approvers are expected to review this list.

Release note:

NONE

@kubevirt-bot added the dco-signoff: yes label (indicates the PR's author has DCO signed all their commits) on May 1, 2024
@kubevirt-bot requested a review from apinnick on May 1, 2024 at 12:59
@sradco
Copy link
Collaborator Author

sradco commented May 1, 2024

@machadovilaca please review

@sradco requested a review from machadovilaca and removed the request for apinnick on May 1, 2024 at 13:00
@fabiand (Member) commented May 2, 2024

/hold

Didn't we say we have to remove this alert?

@xpivarc @acardace @iholder101 ?

@kubevirt-bot added the do-not-merge/hold label (indicates that the PR should not merge because someone has issued a /hold command) on May 2, 2024
@acardace (Member) commented May 2, 2024

> /hold
>
> Didn't we say we have to remove this alert?
>
> @xpivarc @acardace @iholder101 ?

No, we never did; we planned to remove the component-related alerts, not the workload-related ones.

@fabiand (Member) commented May 2, 2024

Ah! Thanks, yes.

So, for this workload-related alert, I think we have to remove it as well.
It's simply not correct, because it takes reclaimable memory into account.

IOW, it fires even if we are not really exceeding the limits.

Do I recall this correctly?

@sradco (Collaborator, Author) commented May 2, 2024

@fabiand I suggested updating the alert expression to use the pod metrics instead of the KubeVirt VMI metrics.
I believe the current expression is incorrect.

@fabiand (Member) commented May 2, 2024

I'm not seeing how a switch to pod metrics helps to address the problem.

@sradco (Collaborator, Author) commented May 2, 2024

@fabiand Because Kubernetes uses the container_memory_working_set_bytes metric when deciding to OOM-kill a pod.
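
For context, a pod-metrics version of the check, along the lines sradco describes, might look roughly like this (a minimal sketch; the label selectors and the 50 MB threshold are assumptions carried over from the PR description, while kube_pod_container_resource_limits and container_memory_working_set_bytes are the standard kube-state-metrics and cAdvisor series):

```yaml
# Sketch only: free memory computed from pod metrics, i.e. the memory limit of
# the compute container minus its working set, which is the figure the kubelet
# compares against before OOM-killing the container.
- alert: KubevirtVmHighMemoryUsage
  expr: |
    kube_pod_container_resource_limits{pod=~"virt-launcher-.*", container="compute", resource="memory"}
      - on(pod, container, namespace)
    container_memory_working_set_bytes{pod=~"virt-launcher-.*", container="compute"}
      < 52428800
  for: 5m
  labels:
    severity: warning
```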

@iholder101 (Contributor)

> /hold
>
> Didn't we say we have to remove this alert?
>
> @xpivarc @acardace @iholder101 ?

TBH I was confused by this as well :)

I wonder how these alerts could benefit the user.
Except for some advanced exceptions*, when a user creates a VM, he defines the guest memory. The fact that we choose to set certain limits on the pod in certain situations is completely hidden from the VM owner (even the fact that a VM runs inside a pod is somewhat of an implementation detail). In the vast majority of cases this alert was firing because we (kubevirt) did not set the limits correctly, or did not anticipate a spike caused by one of our dependencies (e.g. libvirt).

Now, as a user, let's say I create a VM with dedicated CPUs and 2Gi of guest memory. Let's say that after a while I see a KubevirtVmHighMemoryUsage alert; how should I react?

The docs say the mitigation is to "Increase the memory limit in the VirtualMachine specification", but as a user I did not set any limits! The limits are automatically set and calculated by virt-controller...

So I tend to vote +1 for removing it, unless I'm missing something.

(*) An advanced user CAN set the limits, or use advanced features like overcommitmentRatio, but these are special cases that are far from the "average VM".

@sradco (Collaborator, Author) commented May 2, 2024

@fabiand @iholder101 As a user, I asked for a VM with 4 GB, and now I'm approaching a memory limit that puts my VM at risk of eviction. I need to know about it so I can resize my VM memory. No?

Why should it matter who set the limit?

@iholder101 (Contributor)

> @fabiand @iholder101 As a user, I asked for a VM with 4 GB, and now I'm approaching a memory limit that puts my VM at risk of eviction. I need to know about it so I can resize my VM memory. No?
>
> Why should it matter who set the limit?

As a user, you've requested a VM with 4 GB. Now you log into your guest and can use up to those 4 GB. Let's say that you even use all or most of the 4 GB of memory.

Now you get an alert telling you that you are about to reach the memory limit.

But you're not aware of any limits. You've requested 4 GB and you're using 4 GB. You, as a user, did not cross any boundary; you're just using what you've asked for.

How is such a user able to mitigate this, or even understand what's going on?

@sradco (Collaborator, Author) commented May 11, 2024

> As a user, you've requested a VM with 4 GB. Now you log into your guest and can use up to those 4 GB. Let's say that you even use all or most of the 4 GB of memory.
>
> Now you get an alert telling you that you are about to reach the memory limit.
>
> But you're not aware of any limits. You've requested 4 GB and you're using 4 GB. You, as a user, did not cross any boundary; you're just using what you've asked for.
>
> How is such a user able to mitigate this, or even understand what's going on?

@iholder101 When we check this based on the pod metrics, not the kubevirt metrics:

  • Will the Admin be able to do something with the alert?
  • Is the VM at risk?

If both answers to the above questions are "no", then I think we should deprecate the alert.
If one or both is "yes", I think we should consider whether the alert needs to be updated or replaced with something else, but it probably still requires an alert.

@iholder101 (Contributor)

> As a user, you've requested a VM with 4 GB. Now you log into your guest and can use up to those 4 GB. Let's say that you even use all or most of the 4 GB of memory.
> Now you get an alert telling you that you are about to reach the memory limit.
> But you're not aware of any limits. You've requested 4 GB and you're using 4 GB. You, as a user, did not cross any boundary; you're just using what you've asked for.
> How is such a user able to mitigate this, or even understand what's going on?

These are important questions, let's discuss them.

> @iholder101 When we check this based on the pod metrics, not the kubevirt metrics:

>   • Will the Admin be able to do something with the alert?

Basically, the admin has the following options:

  • Report a bug, wait for it to be fixed, update Kubevirt.
  • (if the bug is crucial enough:) Report a bug, then use a workaround until it is fixed. The most suitable workaround would probably be using kv.spec.configuration.additionalGuestMemoryOverheadRatio.
    • There are other workarounds, like manually setting a higher memory request than the one calculated by Kubevirt, but that's a more advanced option.

note: keep in mind that a configuration such as kv.spec.configuration.additionalGuestMemoryOverheadRatio:

  • affects the whole cluster.
  • needs to be removed eventually.
  • is not trivial to configure (a minimal sketch follows below).
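
For reference, a minimal sketch of that cluster-wide setting on the KubeVirt CR (the ratio value is only an example):

```yaml
# Sketch of the cluster-wide workaround mentioned in the note above.
# The ratio value "1.2" is an example only; this setting affects every VM in the
# cluster and should be removed once the underlying overhead calculation is fixed.
apiVersion: kubevirt.io/v1
kind: KubeVirt
metadata:
  name: kubevirt
  namespace: kubevirt
spec:
  configuration:
    additionalGuestMemoryOverheadRatio: "1.2"
```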

>   • Is the VM at risk?

Yes, it definitely is.

> If both answers to the above questions are "no", then I think we should deprecate the alert. If one or both is "yes", I think we should consider whether the alert needs to be updated or replaced with something else, but it probably still requires an alert.

Please note that in the "what the admin can do" section above, both options include a step of reporting a bug. This is perhaps the main point I want to emphasize here: the fact that a virt-launcher pod's memory reaches its limits is a bug that Kubevirt is responsible for, not the user/admin. That said, it's important to understand that Kubevirt currently suffers from an inaccurate calculation of virt-launcher's memory requests/limits.

So I believe the discussion we're having here can be generalized: how should Kubevirt cope with scenarios in which we have challenges in certain aspects, or IOW, in cases where Kubevirt doesn't perform a specific task well enough?

The current flow looks something like the following:

  1. The admin sees that this alert is firing.
  2. The admin doesn't understand how to mitigate the problem, and probably what the problem even is.
  3. If the problem is not severe, that is, either VMs aren't being killed in practice or VMs are rarely killed in a way that's acceptable to the admin, the admin might dismiss the alert, treating it as spam or as a mysterious alert that he's not sure about.
  4. If the problem is severe, the admin would probably open a bug or contact Kubevirt developers, who will provide a workaround until the case is studied and fixed.

The problem with this flow is that Kubevirt does not take responsibility for its weaker parts. We practically tell our users: "we know there are problems; if you face one, we've added an alert to act as debug information for us, so you can contact us and we'll consider fixing your problem in the next release or help you with a workaround". A better approach would be for us developers to receive this debug information and fix the problem for the user proactively.

I think that a better flow would look something like:

  1. The admin creates VMs for which the memory requests/limits are miscalculated.
  2. Telemetry reports that fact.
  3. Kubevirt developers proactively work on fixing the problem and improving our calculation logic.
  4. The next release of Kubevirt would hopefully be better in that respect, and this would be measurable with more telemetry.
  5. In severe cases we would provide a workaround when needed.

That is, Kubevirt should be the one proactively chasing bugs and fixing them while taking responsibility for its weaker parts.


As for the alert itself: I'm not entirely sure if we should keep this alert at all if we go for the suggested approach above, but the least I would expect from such an alert is:

  1. Make sure it has low visibility (so we don't spam our users, which in the long term leads to them not taking our alerts seriously).
  2. Make it very clear that this is a bug, not a misconfiguration, and that the admin should report this bug immediately.
  3. Suggest a workaround like kv.spec.configuration.additionalGuestMemoryOverheadRatio, noting that it has serious implications for the cluster as a whole, should be used with care, and should be removed after the problem is fixed (a rough sketch follows below).
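
A rough sketch of how these three points could translate into alert metadata (the severity value, wording, and metric name are all assumptions, not an agreed design):

```yaml
# Sketch only: low-visibility alert whose text tells the admin this is a
# KubeVirt bug to report, and that points at the cluster-wide workaround
# together with its caveats.
- alert: KubevirtVmHighMemoryUsage
  expr: kubevirt_vm_container_free_memory_bytes_based_on_working_set_bytes < 52428800  # assumed metric name
  labels:
    severity: info   # low visibility, per point 1
  annotations:
    summary: "virt-launcher pod memory is close to its limit."
    description: >-
      This usually indicates a miscalculation of virt-launcher memory overhead
      in KubeVirt itself, not a user misconfiguration; please report it as a bug.
      As a temporary, cluster-wide workaround you may set
      kv.spec.configuration.additionalGuestMemoryOverheadRatio; use it with care
      and remove it once the underlying issue is fixed.
```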

@sradco WDYT?
@vladikr @xpivarc thoughts?

@sradco (Collaborator, Author) commented May 12, 2024

@iholder101

I agree with you that we should not have an alert for the virt-launcher overhead issue; it should be replaced with a metric we can perhaps collect via Telemetry, like we did for virt-handler, virt-controller, virt-api and virt-operator.

What I meant is that I think we should alert on the VM guest when its available memory, from the guest's POV, is low, so that the VM owner can consider requesting more memory for the VMI.

  • What would be the impact of having very low available memory for the VMI (for updating the runbook)?
  • When should we alert? when below 50MB?

@sradco (Collaborator, Author) commented May 12, 2024

Closing this PR, since we will deprecate this alert and introduce another alert instead.

@sradco sradco closed this May 12, 2024
@iholder101 (Contributor)

> What I meant is that I think we should alert on the VM guest when its available memory, from the guest's POV, is low, so that the VM owner can consider requesting more memory for the VMI.

Sure! This is different and sounds like something reasonable to consider!

>   • What would be the impact of having very low available memory for the VMI (for updating the runbook)?

I guess the implication of the guest memory being all used is that the guest OS might start OOM-killing applications. I think this alert can be useful in certain cases.

>   • When should we alert? when below 50MB?

Hard to say. Can this be configurable?
50/100 MB sounds reasonable, although we may also consider going with a percentage (e.g. 5/10% left).
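
For illustration, a percentage-based guest-level alert along these lines might be sketched as follows (the alert name, both metric names, and the 5% threshold are assumptions, not an agreed design):

```yaml
# Sketch only: fire when the guest reports less than 5% of its memory as
# available. Metric names are assumptions; KubeVirt's actual guest-memory
# metrics and the final threshold would need to be confirmed.
- alert: KubevirtVmiGuestLowAvailableMemory
  expr: |
    kubevirt_vmi_memory_available_bytes
      / on(namespace, name)
    kubevirt_vmi_memory_domain_bytes
      < 0.05
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: "Guest OS of VMI {{ $labels.name }} is low on available memory and may start OOM-killing applications."
```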
