From 06c7d6f8e0a7dc3eaeaddf787b84577ceafe726c Mon Sep 17 00:00:00 2001
From: machadovilaca
Date: Thu, 18 Apr 2024 12:40:31 +0100
Subject: [PATCH 1/4] Add markdown linter

Signed-off-by: machadovilaca
---
 .markdownlint-cli2.yaml | 47 +++++++++++++++++++++++++++++++++++++++++
 Makefile                |  5 +++++
 2 files changed, 52 insertions(+)
 create mode 100644 .markdownlint-cli2.yaml

diff --git a/.markdownlint-cli2.yaml b/.markdownlint-cli2.yaml
new file mode 100644
index 00000000..2878738f
--- /dev/null
+++ b/.markdownlint-cli2.yaml
@@ -0,0 +1,47 @@
+# See https://github.com/DavidAnson/markdownlint#optionsconfig
+# and https://github.com/DavidAnson/markdownlint-cli2
+
+config:
+  # frontmatter metadata seems to trigger this
+  single-title: false
+
+  # hard tabs are used when pasting go example code into files
+  no-hard-tabs: false
+
+  # we commonly paste bare urls in the middle of paragraphs
+  no-bare-urls: false
+
+  # really, this is a rule?
+  commands-show-output: false
+
+  # Limit prose lines to 80 characters, but allow long lines in code blocks
+  line-length:
+    line_length: 80
+    code_blocks: false
+
+  # Sometimes we repeat headings, and it's easier to just turn those
+  # into emphasis
+  no-emphasis-as-heading: false
+
+  # We only publish HTML, so allow all HTML inline.
+  no-inline-html: false
+
+  ## Rules we may want to turn on later, but that will require editing
+  ## existing files:
+
+  # We tend to use `*` instead of `-` for list bullets but we aren't
+  # consistent, even within a single file. Ideally we would want
+  # `style: consistent`
+  ul-style: false
+
+  # We have at least one document that breaks up a numbered list
+  # with headings. Ideally we would set `style: one_or_ordered`.
+  ol-prefix: false
+
+  # Vertical whitespace helps the reader, so we should turn these on
+  # again when someone has time to fix our existing files.
+ blanks-around-fences: false + blanks-around-headings: false + blanks-around-lists: false + single-trailing-newline: false + no-multiple-blanks: false diff --git a/Makefile b/Makefile index aa7448c7..f4e5b5e7 100644 --- a/Makefile +++ b/Makefile @@ -31,3 +31,8 @@ monitoringlinter-build: .PHONY: monitoringlinter-test monitoringlinter-test: monitoringlinter-build cd monitoringlinter && ./tests/e2e.sh + +.PHONY: lint-markdown +lint-markdown: + echo "Linting markdown files" + podman run -v ${PWD}:/workdir:Z docker.io/davidanson/markdownlint-cli2:v0.13.0 "/workdir/docs/*runbooks/*.md" From 2b39539fc372eea00ba4dd59caeb24dcbda95a7e Mon Sep 17 00:00:00 2001 From: machadovilaca Date: Thu, 18 Apr 2024 12:40:41 +0100 Subject: [PATCH 2/4] Fix docs styling issues Signed-off-by: machadovilaca --- docs/deprecated_runbooks/KubeMacPoolDown.md | 5 ++--- .../deprecated_runbooks/KubeVirtVMStuckInErrorState.md | 4 ++-- ...bevirtHyperconvergedClusterOperatorNMOInUseAlert.md | 6 +++--- docs/runbooks/CDIDataVolumeUnusualRestartCount.md | 1 - docs/runbooks/CDIMultipleDefaultVirtStorageClasses.md | 2 +- docs/runbooks/CDIOperatorDown.md | 2 +- docs/runbooks/CDIStorageProfilesIncomplete.md | 2 +- docs/runbooks/CnaoDown.md | 2 +- docs/runbooks/CnaoNmstateMigration.md | 4 ++-- docs/runbooks/HCOInstallationIncomplete.md | 2 +- docs/runbooks/HPPNotReady.md | 4 ++-- docs/runbooks/HPPOperatorDown.md | 2 +- docs/runbooks/HPPSharingPoolPathWithOS.md | 6 +++--- docs/runbooks/KubeMacPoolDuplicateMacsFound.md | 2 +- docs/runbooks/KubeVirtCRModified.md | 4 ++-- docs/runbooks/KubeVirtDeprecatedAPIRequested.md | 2 +- docs/runbooks/KubeVirtVMIExcessiveMigrations.md | 10 +++++----- docs/runbooks/KubemacpoolDown.md | 5 ++--- docs/runbooks/LowKVMNodesCount.md | 2 +- docs/runbooks/LowReadyVirtControllersCount.md | 2 +- docs/runbooks/LowReadyVirtOperatorsCount.md | 6 +++--- docs/runbooks/LowVirtAPICount.md | 2 +- docs/runbooks/LowVirtControllersCount.md | 2 +- docs/runbooks/LowVirtOperatorCount.md | 6 +++--- docs/runbooks/NetworkAddonsConfigNotReady.md | 2 +- docs/runbooks/NoLeadingVirtOperator.md | 10 +++++----- docs/runbooks/NoReadyVirtOperator.md | 8 ++++---- docs/runbooks/OrphanedVirtualMachineInstances.md | 2 +- .../OutdatedVirtualMachineInstanceWorkloads.md | 2 +- docs/runbooks/SSPFailingToReconcile.md | 2 +- docs/runbooks/SSPHighRateRejectedVms.md | 2 +- docs/runbooks/SingleStackIPv6Unsupported.md | 8 ++++---- docs/runbooks/UnsupportedHCOModification.md | 2 +- docs/runbooks/VMStorageClassWarning.md | 2 +- docs/runbooks/VirtAPIDown.md | 1 - docs/runbooks/VirtApiRESTErrorsBurst.md | 2 +- docs/runbooks/VirtApiRESTErrorsHigh.md | 4 ++-- docs/runbooks/VirtControllerRESTErrorsBurst.md | 2 +- docs/runbooks/VirtControllerRESTErrorsHigh.md | 2 +- docs/runbooks/VirtHandlerRESTErrorsHigh.md | 6 +++--- docs/runbooks/VirtOperatorDown.md | 4 ++-- docs/runbooks/VirtOperatorRESTErrorsBurst.md | 2 +- docs/runbooks/VirtOperatorRESTErrorsHigh.md | 4 ++-- 43 files changed, 74 insertions(+), 78 deletions(-) diff --git a/docs/deprecated_runbooks/KubeMacPoolDown.md b/docs/deprecated_runbooks/KubeMacPoolDown.md index 17bf81ec..55f4d31c 100644 --- a/docs/deprecated_runbooks/KubeMacPoolDown.md +++ b/docs/deprecated_runbooks/KubeMacPoolDown.md @@ -9,7 +9,7 @@ ## Impact -If `KubeMacPool` is down, `VirtualMachine` objects cannot be created. +If `KubeMacPool` is down, `VirtualMachine` objects cannot be created. ## Diagnosis @@ -19,7 +19,7 @@ If `KubeMacPool` is down, `VirtualMachine` objects cannot be created. 
$ export KMP_NAMESPACE="$(kubectl get pod -A --no-headers -l \ control-plane=mac-controller-manager | awk '{print $1}')" ``` - + 2. Set the `KMP_NAME` environment variable: ```bash @@ -48,4 +48,3 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - diff --git a/docs/deprecated_runbooks/KubeVirtVMStuckInErrorState.md b/docs/deprecated_runbooks/KubeVirtVMStuckInErrorState.md index 304f5870..d72f07b4 100644 --- a/docs/deprecated_runbooks/KubeVirtVMStuckInErrorState.md +++ b/docs/deprecated_runbooks/KubeVirtVMStuckInErrorState.md @@ -145,11 +145,11 @@ $ kubectl get nodes -l node-role.kubernetes.io/worker= -o json | jq '.items | .[ ## Mitigation -First, ensure that the VirtualMachine configuration is correct and all necessary +First, ensure that the VirtualMachine configuration is correct and all necessary resources exist. For example, if a PVC is missing, it should be created. Also, verify that the cluster's infrastructure is healthy and there are enough resources to run the VirtualMachine. This problem can be caused by several reasons. Therefore, we advise you to try to identify and fix the root cause. If you cannot resolve this issue, please -open an issue and attach the artifacts gathered in the Diagnosis section. \ No newline at end of file +open an issue and attach the artifacts gathered in the Diagnosis section. diff --git a/docs/deprecated_runbooks/KubevirtHyperconvergedClusterOperatorNMOInUseAlert.md b/docs/deprecated_runbooks/KubevirtHyperconvergedClusterOperatorNMOInUseAlert.md index 3137cad8..b87d39c9 100644 --- a/docs/deprecated_runbooks/KubevirtHyperconvergedClusterOperatorNMOInUseAlert.md +++ b/docs/deprecated_runbooks/KubevirtHyperconvergedClusterOperatorNMOInUseAlert.md @@ -32,7 +32,7 @@ You cannot upgrade to OKD 1.7. Example output: - ``` + ```json { "lastTransitionTime": "2022-05-26T09:23:21Z", "message": "NMO custom resources have been found", @@ -44,7 +44,7 @@ You cannot upgrade to OKD 1.7. 2. Check for a ClusterServiceVersion (CSV) warning event such as the following: - ``` + ```text Warning NotUpgradeable 2m12s (x5 over 2m50s) kubevirt-hyperconvergedNode Maintenance Operator custom resources nodemaintenances.nodemaintenance.kubevirt.io have been found. @@ -60,7 +60,7 @@ You cannot upgrade to OKD 1.7. 
Example output: - ``` + ```text NAME AGE nodemaintenance-test 5m33s ``` diff --git a/docs/runbooks/CDIDataVolumeUnusualRestartCount.md b/docs/runbooks/CDIDataVolumeUnusualRestartCount.md index 791f48cf..f6368655 100644 --- a/docs/runbooks/CDIDataVolumeUnusualRestartCount.md +++ b/docs/runbooks/CDIDataVolumeUnusualRestartCount.md @@ -40,4 +40,3 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - diff --git a/docs/runbooks/CDIMultipleDefaultVirtStorageClasses.md b/docs/runbooks/CDIMultipleDefaultVirtStorageClasses.md index 7f570424..a6baede0 100644 --- a/docs/runbooks/CDIMultipleDefaultVirtStorageClasses.md +++ b/docs/runbooks/CDIMultipleDefaultVirtStorageClasses.md @@ -29,4 +29,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/CDIOperatorDown.md b/docs/runbooks/CDIOperatorDown.md index 1ff2c0f0..2e38b284 100644 --- a/docs/runbooks/CDIOperatorDown.md +++ b/docs/runbooks/CDIOperatorDown.md @@ -23,7 +23,7 @@ The CDI components might fail to deploy or to stay in a required state. The CDI ```bash $ kubectl -n $CDI_NAMESPACE get pods -l name=cdi-operator ``` - + 3. Obtain the details of the `cdi-operator` pod: ```bash diff --git a/docs/runbooks/CDIStorageProfilesIncomplete.md b/docs/runbooks/CDIStorageProfilesIncomplete.md index 23ac3a51..7754762a 100644 --- a/docs/runbooks/CDIStorageProfilesIncomplete.md +++ b/docs/runbooks/CDIStorageProfilesIncomplete.md @@ -38,4 +38,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/CnaoDown.md b/docs/runbooks/CnaoDown.md index 7c597567..9fe93d7f 100644 --- a/docs/runbooks/CnaoDown.md +++ b/docs/runbooks/CnaoDown.md @@ -23,7 +23,7 @@ If the CNAO is not running, the cluster cannot reconcile changes to virtual mach ```bash $ kubectl -n $NAMESPACE get pods -l name=cluster-network-addons-operator ``` - + 3. Check the `cluster-network-addons-operator` logs for error messages: ```bash diff --git a/docs/runbooks/CnaoNmstateMigration.md b/docs/runbooks/CnaoNmstateMigration.md index 36f186bf..765057d0 100644 --- a/docs/runbooks/CnaoNmstateMigration.md +++ b/docs/runbooks/CnaoNmstateMigration.md @@ -13,6 +13,6 @@ You cannot upgrade your cluster to OpenShift Virtualization 4.11. ## Mitigation -Install the Kubernetes NMState Operator from the OperatorHub. CNAO automatically transfers the `kubernetes-nmstate` deployment to the Operator. +Install the Kubernetes NMState Operator from the OperatorHub. CNAO automatically transfers the `kubernetes-nmstate` deployment to the Operator. -Afterwards, you can upgrade to OpenShift Virtualization 4.11. \ No newline at end of file +Afterwards, you can upgrade to OpenShift Virtualization 4.11. 
diff --git a/docs/runbooks/HCOInstallationIncomplete.md b/docs/runbooks/HCOInstallationIncomplete.md index b5fd1963..aa01bf5d 100644 --- a/docs/runbooks/HCOInstallationIncomplete.md +++ b/docs/runbooks/HCOInstallationIncomplete.md @@ -1,5 +1,5 @@ # HCOInstallationIncomplete - + ## Meaning This alert fires when the HyperConverged Cluster Operator (HCO) runs for more than an hour without a `HyperConverged` custom resource (CR). diff --git a/docs/runbooks/HPPNotReady.md b/docs/runbooks/HPPNotReady.md index c0aec658..086b8952 100644 --- a/docs/runbooks/HPPNotReady.md +++ b/docs/runbooks/HPPNotReady.md @@ -4,7 +4,7 @@ ## Meaning -This alert fires when a hostpath provisioner (HPP) installation is in a degraded state. +This alert fires when a hostpath provisioner (HPP) installation is in a degraded state. The HPP dynamically provisions hostpath volumes to provide storage for persistent volume claims (PVCs). @@ -17,7 +17,7 @@ HPP is not usable. Its components are not ready and they are not progressing tow 1. Set the `HPP_NAMESPACE` environment variable: ```bash - export HPP_NAMESPACE="$(kubectl get deployment -A | grep hostpath-provisioner-operator | awk '{print $1}')" + $ export HPP_NAMESPACE="$(kubectl get deployment -A | grep hostpath-provisioner-operator | awk '{print $1}')" ``` 2. Check for HPP components that are currently not ready: diff --git a/docs/runbooks/HPPOperatorDown.md b/docs/runbooks/HPPOperatorDown.md index 9b601c5f..0b6f970d 100644 --- a/docs/runbooks/HPPOperatorDown.md +++ b/docs/runbooks/HPPOperatorDown.md @@ -24,7 +24,7 @@ The HPP components might fail to deploy or to remain in the required state. As a ```bash $ kubectl -n $HPP_NAMESPACE get pods -l name=hostpath-provisioner-operator ``` - + 3. Obtain the details of the `hostpath-provisioner-operator` pod: ```bash diff --git a/docs/runbooks/HPPSharingPoolPathWithOS.md b/docs/runbooks/HPPSharingPoolPathWithOS.md index 0d2ec4dd..9747b65f 100644 --- a/docs/runbooks/HPPSharingPoolPathWithOS.md +++ b/docs/runbooks/HPPSharingPoolPathWithOS.md @@ -24,7 +24,7 @@ A shared hostpath pool puts pressure on the node's disks. The node might have de ```bash $ kubectl -n $HPP_NAMESPACE get pods | grep hostpath-provisioner-csi ``` - + 3. Check the `hostpath-provisioner-csi` logs to identify the shared pool and path: ```bash @@ -33,7 +33,7 @@ A shared hostpath pool puts pressure on the node's disks. The node might have de Example output: - ``` + ```text I0208 15:21:03.769731 1 utils.go:221] pool (/csi), shares path with OS which can lead to node disk pressure ``` @@ -47,4 +47,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/KubeMacPoolDuplicateMacsFound.md b/docs/runbooks/KubeMacPoolDuplicateMacsFound.md index 3d1b121f..cbe829ab 100644 --- a/docs/runbooks/KubeMacPoolDuplicateMacsFound.md +++ b/docs/runbooks/KubeMacPoolDuplicateMacsFound.md @@ -27,7 +27,7 @@ Duplicate MAC addresses on the same LAN might cause network issues. 
Example output: - ``` + ```text mac address 02:00:ff:ff:ff:ff already allocated to vm/kubemacpool-test/testvm, br1, conflict with: vm/kubemacpool-test/testvm2, br1 ``` diff --git a/docs/runbooks/KubeVirtCRModified.md b/docs/runbooks/KubeVirtCRModified.md index afe25bab..eedbeb93 100644 --- a/docs/runbooks/KubeVirtCRModified.md +++ b/docs/runbooks/KubeVirtCRModified.md @@ -17,7 +17,7 @@ Check the `component_name` in the alert details to determine the operand that is In the following example, the operand kind is `kubevirt` and the operand name is `kubevirt-kubevirt-hyperconverged`: -``` +```text Labels alertname=KubeVirtCRModified component_name=kubevirt/kubevirt-kubevirt-hyperconverged @@ -26,6 +26,6 @@ Labels ## Mitigation -Do not change the HCO operands directly. Use `HyperConverged` objects to configure the cluster. +Do not change the HCO operands directly. Use `HyperConverged` objects to configure the cluster. The alert resolves itself after 10 minutes if the operands are not changed manually. diff --git a/docs/runbooks/KubeVirtDeprecatedAPIRequested.md b/docs/runbooks/KubeVirtDeprecatedAPIRequested.md index 958c6c7d..7a743b2f 100644 --- a/docs/runbooks/KubeVirtDeprecatedAPIRequested.md +++ b/docs/runbooks/KubeVirtDeprecatedAPIRequested.md @@ -12,7 +12,7 @@ Usage of deprecated APIs is not recommended because they will be removed in a fu ## Diagnosis Check the `description` and `summary` alert annotations for more details on which API is being accessed, for example: -``` +```text description: "Detected requests to the deprecated virtualmachines.kubevirt.io/v1alpha3 API." summary: "2 requests were detected in the last 10 minutes." ``` diff --git a/docs/runbooks/KubeVirtVMIExcessiveMigrations.md b/docs/runbooks/KubeVirtVMIExcessiveMigrations.md index 23c895a1..5424b66a 100644 --- a/docs/runbooks/KubeVirtVMIExcessiveMigrations.md +++ b/docs/runbooks/KubeVirtVMIExcessiveMigrations.md @@ -21,7 +21,7 @@ A virtual machine (VM) that migrates too frequently might experience degraded pe Example output: - ``` + ```json { "cpu": "3500m", "devices.kubevirt.io/kvm": "1k", @@ -44,7 +44,7 @@ A virtual machine (VM) that migrates too frequently might experience degraded pe Example output: - ``` + ```text { "lastHeartbeatTime": "2022-05-26T07:36:01Z", "lastTransitionTime": "2022-05-23T08:12:02Z", @@ -93,9 +93,9 @@ A virtual machine (VM) that migrates too frequently might experience degraded pe ## Mitigation -Ensure that the worker nodes have sufficient resources (CPU, memory, disk) to run VM workloads without interruption. - -If the problem persists, try to identify the root cause and resolve the issue. +Ensure that the worker nodes have sufficient resources (CPU, memory, disk) to run VM workloads without interruption. + +If the problem persists, try to identify the root cause and resolve the issue. diff --git a/docs/runbooks/KubemacpoolDown.md b/docs/runbooks/KubemacpoolDown.md index ca5c6c1c..a9bac79c 100644 --- a/docs/runbooks/KubemacpoolDown.md +++ b/docs/runbooks/KubemacpoolDown.md @@ -7,7 +7,7 @@ ## Impact -If `KubeMacPool` is down, `VirtualMachine` objects cannot be created. +If `KubeMacPool` is down, `VirtualMachine` objects cannot be created. ## Diagnosis @@ -17,7 +17,7 @@ If `KubeMacPool` is down, `VirtualMachine` objects cannot be created. $ export KMP_NAMESPACE="$(kubectl get pod -A --no-headers -l \ control-plane=mac-controller-manager | awk '{print $1}')" ``` - + 2. 
Set the `KMP_NAME` environment variable: ```bash @@ -46,4 +46,3 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - diff --git a/docs/runbooks/LowKVMNodesCount.md b/docs/runbooks/LowKVMNodesCount.md index b8864adf..2c61c90c 100644 --- a/docs/runbooks/LowKVMNodesCount.md +++ b/docs/runbooks/LowKVMNodesCount.md @@ -26,4 +26,4 @@ $ kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | grep devices. Validate the [hardware virtualization support](https://kubevirt.io/user-guide/operations/installation/#validate-hardware-virtualization-support). If hardware virtualization is not available, [software emulation](https://github.com/kubevirt/kubevirt/blob/master/docs/software-emulation.md) can be enabled. - \ No newline at end of file + diff --git a/docs/runbooks/LowReadyVirtControllersCount.md b/docs/runbooks/LowReadyVirtControllersCount.md index a3df58b1..a2cb1943 100644 --- a/docs/runbooks/LowReadyVirtControllersCount.md +++ b/docs/runbooks/LowReadyVirtControllersCount.md @@ -4,7 +4,7 @@ ## Meaning -This alert fires when one or more `virt-controller` pods are running, but none of these pods has been in the `Ready` state for the last 5 minutes. +This alert fires when one or more `virt-controller` pods are running, but none of these pods has been in the `Ready` state for the last 5 minutes. A `virt-controller` device monitors the custom resource definitions (CRDs) of a virtual machine instance (VMI) and manages the associated pods. The device create pods for VMIs and manages the lifecycle of the pods. The device is critical for cluster-wide virtualization functionality. diff --git a/docs/runbooks/LowReadyVirtOperatorsCount.md b/docs/runbooks/LowReadyVirtOperatorsCount.md index 5dde17f1..7fcfcac9 100644 --- a/docs/runbooks/LowReadyVirtOperatorsCount.md +++ b/docs/runbooks/LowReadyVirtOperatorsCount.md @@ -1,4 +1,4 @@ -# LowReadyVirtOperatorsCount +# LowReadyVirtOperatorsCount ## Meaning @@ -7,7 +7,7 @@ This alert fires when one or more `virt-operator` pods are running, but none of The `virt-operator` is the first Operator to start in a cluster. The `virt-operator` deployment has a default replica of two `virt-operator` pods. 
-Its primary responsibilities include the following: +Its primary responsibilities include the following: - Installing, live-updating, and live-upgrading a cluster - Monitoring the lifecycle of top-level controllers, such as `virt-controller`, `virt-handler`, `virt-launcher`, and managing their reconciliation @@ -55,4 +55,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/LowVirtAPICount.md b/docs/runbooks/LowVirtAPICount.md index fbb204ce..f3a5ef59 100644 --- a/docs/runbooks/LowVirtAPICount.md +++ b/docs/runbooks/LowVirtAPICount.md @@ -45,4 +45,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/LowVirtControllersCount.md b/docs/runbooks/LowVirtControllersCount.md index e085fa51..3ddf1727 100644 --- a/docs/runbooks/LowVirtControllersCount.md +++ b/docs/runbooks/LowVirtControllersCount.md @@ -57,4 +57,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/LowVirtOperatorCount.md b/docs/runbooks/LowVirtOperatorCount.md index 087bd6d9..c5e550e4 100644 --- a/docs/runbooks/LowVirtOperatorCount.md +++ b/docs/runbooks/LowVirtOperatorCount.md @@ -3,9 +3,9 @@ ## Meaning -This alert fires when only one `virt-operator` pod in a `Ready` state has been running for the last 60 minutes. +This alert fires when only one `virt-operator` pod in a `Ready` state has been running for the last 60 minutes. -The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: +The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: - Installing, live-updating, and live-upgrading a cluster - Monitoring the lifecycle of top-level controllers, such as `virt-controller`, `virt-handler`, `virt-launcher`, and managing their reconciliation @@ -53,4 +53,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/NetworkAddonsConfigNotReady.md b/docs/runbooks/NetworkAddonsConfigNotReady.md index 252059a8..58ca3354 100644 --- a/docs/runbooks/NetworkAddonsConfigNotReady.md +++ b/docs/runbooks/NetworkAddonsConfigNotReady.md @@ -21,7 +21,7 @@ Network functionality is affected. Example output: - ``` + ```text DaemonSet "cluster-network-addons/macvtap-cni" update is being processed... ``` diff --git a/docs/runbooks/NoLeadingVirtOperator.md b/docs/runbooks/NoLeadingVirtOperator.md index fe3061eb..da452854 100644 --- a/docs/runbooks/NoLeadingVirtOperator.md +++ b/docs/runbooks/NoLeadingVirtOperator.md @@ -1,11 +1,11 @@ -# NoLeadingVirtOperator +# NoLeadingVirtOperator ## Meaning This alert fires when no `virt-operator` pod with a leader lease has been detected for 10 minutes, although the `virt-operator` pods are in a `Ready` state. The alert indicates that no leader pod is available. 
-The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: +The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: - Installing, live updating, and live upgrading a cluster @@ -41,7 +41,7 @@ This alert indicates a failure at the level of the cluster. As a result, critica Leader pod example: - ``` + ```text {"component":"virt-operator","level":"info","msg":"Attempting to acquire leader status","pos":"application.go:400","timestamp":"2021-11-30T12:15:18.635387Z"} I1130 12:15:18.635452 1 leaderelection.go:243] attempting to acquire leader lease /virt-operator... I1130 12:15:19.216582 1 leaderelection.go:253] successfully acquired lease /virt-operator @@ -50,7 +50,7 @@ This alert indicates a failure at the level of the cluster. As a result, critica Non-leader pod example: - ``` + ```text {"component":"virt-operator","level":"info","msg":"Attempting to acquire leader status","pos":"application.go:400","timestamp":"2021-11-30T12:15:20.533696Z"} I1130 12:15:20.533792 1 leaderelection.go:243] attempting to acquire leader lease /virt-operator... ``` @@ -71,4 +71,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/NoReadyVirtOperator.md b/docs/runbooks/NoReadyVirtOperator.md index 83b34660..b3438dab 100644 --- a/docs/runbooks/NoReadyVirtOperator.md +++ b/docs/runbooks/NoReadyVirtOperator.md @@ -1,11 +1,11 @@ -# NoReadyVirtOperator +# NoReadyVirtOperator ## Meaning This alert fires when no `virt-operator` pod in a `Ready` state has been detected for 10 minutes. -The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: +The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: - Installing, live-updating, and live-upgrading a cluster - Monitoring the life cycle of top-level controllers, such as `virt-controller`, `virt-handler`, `virt-launcher`, and managing their reconciliation @@ -13,7 +13,7 @@ The `virt-operator` is the first Operator to start in a cluster. Its primary res The default deployment is two `virt-operator` pods. -## Impact +## Impact This alert indicates a cluster-level failure. Critical cluster management functionalities, such as certification rotation, upgrade, and reconciliation of controllers, might not be not available. @@ -55,4 +55,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/OrphanedVirtualMachineInstances.md b/docs/runbooks/OrphanedVirtualMachineInstances.md index ad497bd4..d371d306 100644 --- a/docs/runbooks/OrphanedVirtualMachineInstances.md +++ b/docs/runbooks/OrphanedVirtualMachineInstances.md @@ -31,7 +31,7 @@ Orphaned VMIs cannot be managed. 
Example output: - ``` + ```text NAME DESIRED CURRENT READY UP-TO-DATE AVAILABLE NODE SELECTOR AGE virt-handler 2 2 2 2 2 kubernetes.io/os=linux 4h ``` diff --git a/docs/runbooks/OutdatedVirtualMachineInstanceWorkloads.md b/docs/runbooks/OutdatedVirtualMachineInstanceWorkloads.md index 7a935407..30e65e26 100644 --- a/docs/runbooks/OutdatedVirtualMachineInstanceWorkloads.md +++ b/docs/runbooks/OutdatedVirtualMachineInstanceWorkloads.md @@ -71,7 +71,7 @@ A new VMI spins up immediately in an updated `virt-launcher` pod to replace the Note: Manually stopping a _live-migratable_ VM is destructive and not recommended because it interrupts the workload. ### Migrating a live-migratable VMI - + If a VMI is live-migratable, you can update it by creating a `VirtualMachineInstanceMigration` object that targets a specific running VMI. The VMI is migrated into an updated `virt-launcher` pod. 1. Create a `VirtualMachineInstanceMigration` manifest and save it as `migration.yaml`: diff --git a/docs/runbooks/SSPFailingToReconcile.md b/docs/runbooks/SSPFailingToReconcile.md index 55183be3..fc2ce308 100644 --- a/docs/runbooks/SSPFailingToReconcile.md +++ b/docs/runbooks/SSPFailingToReconcile.md @@ -48,7 +48,7 @@ Dependent components might not be deployed. Changes in the components might not ```bash $ kubectl -n $NAMESPACE logs --tail=-1 -l name=virt-template-validator ``` - + ## Mitigation Try to identify the root cause and resolve the issue. diff --git a/docs/runbooks/SSPHighRateRejectedVms.md b/docs/runbooks/SSPHighRateRejectedVms.md index 3072aa93..2af9dee2 100644 --- a/docs/runbooks/SSPHighRateRejectedVms.md +++ b/docs/runbooks/SSPHighRateRejectedVms.md @@ -25,7 +25,7 @@ The VMs are not created or modified. As a result, the environment might not beha Example output: - ``` + ```text {"component":"kubevirt-template-validator","level":"info","msg":"evalution summary for ubuntu-3166wmdbbfkroku0:\nminimal-required-memory applied: FAIL, value 1073741824 is lower than minimum [2147483648]\n\nsucceeded=false", diff --git a/docs/runbooks/SingleStackIPv6Unsupported.md b/docs/runbooks/SingleStackIPv6Unsupported.md index 64f4d004..5bc2cca3 100644 --- a/docs/runbooks/SingleStackIPv6Unsupported.md +++ b/docs/runbooks/SingleStackIPv6Unsupported.md @@ -5,12 +5,12 @@ This alert fires when user tries to install KubeVirt Hyperconverged on a single stack IPv6 cluster. -KubeVirt Hyperconverged is not yet supported on an OpenShift cluster configured with single stack IPv6. It's +KubeVirt Hyperconverged is not yet supported on an OpenShift cluster configured with single stack IPv6. It's progress is being tracked on [this issue](https://issues.redhat.com/browse/CNV-28924). ## Impact -KubeVirt Hyperconverged Operator can't be installed on a single stack IPv6 cluster, and hence creation virtual +KubeVirt Hyperconverged Operator can't be installed on a single stack IPv6 cluster, and hence creation virtual machines on top of such a cluster is not possible. ## Diagnosis @@ -24,5 +24,5 @@ machines on top of such a cluster is not possible. ## Mitigation -It is recommended to use single stack IPv4 or a dual stack IPv4/IPv6 networking to use KubeVirt Hyperconverged. -Refer the [documentation](https://docs.openshift.com/container-platform/latest/networking/ovn_kubernetes_network_provider/converting-to-dual-stack.html). \ No newline at end of file +It is recommended to use single stack IPv4 or a dual stack IPv4/IPv6 networking to use KubeVirt Hyperconverged. 
+Refer the [documentation](https://docs.openshift.com/container-platform/latest/networking/ovn_kubernetes_network_provider/converting-to-dual-stack.html). diff --git a/docs/runbooks/UnsupportedHCOModification.md b/docs/runbooks/UnsupportedHCOModification.md index 41eee55e..7983bf96 100644 --- a/docs/runbooks/UnsupportedHCOModification.md +++ b/docs/runbooks/UnsupportedHCOModification.md @@ -23,7 +23,7 @@ Upgrading a system with JSON Patch annotations is dangerous because the structur Check the `annotation_name` in the alert details to identify the JSON Patch annotation: -``` +```text Labels alertname=UnsupportedHCOModification annotation_name=kubevirt.kubevirt.io/jsonpatch diff --git a/docs/runbooks/VMStorageClassWarning.md b/docs/runbooks/VMStorageClassWarning.md index c3aa2745..0caef756 100644 --- a/docs/runbooks/VMStorageClassWarning.md +++ b/docs/runbooks/VMStorageClassWarning.md @@ -1,7 +1,7 @@ # VMStorageClassWarning -## Meaning +## Meaning When running VMs using ODF storage with 'rbd' mounter or 'rbd.csi.ceph.com' provisioner, Windows VMs may cause reports of bad crc/signature errors due to diff --git a/docs/runbooks/VirtAPIDown.md b/docs/runbooks/VirtAPIDown.md index 84ca51f7..cd7adf9f 100644 --- a/docs/runbooks/VirtAPIDown.md +++ b/docs/runbooks/VirtAPIDown.md @@ -51,4 +51,3 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - diff --git a/docs/runbooks/VirtApiRESTErrorsBurst.md b/docs/runbooks/VirtApiRESTErrorsBurst.md index fc1a3864..792e8c8e 100644 --- a/docs/runbooks/VirtApiRESTErrorsBurst.md +++ b/docs/runbooks/VirtApiRESTErrorsBurst.md @@ -8,7 +8,7 @@ For the last 10 minutes or longer, over 80% of the REST calls made to `virt-api` A very high rate of failed REST calls to `virt-api` might lead to slow response and execution of API calls, and potentially to API calls being completely dismissed. -However, currently running virtual machine workloads are not likely to be affected. +However, currently running virtual machine workloads are not likely to be affected. ## Diagnosis diff --git a/docs/runbooks/VirtApiRESTErrorsHigh.md b/docs/runbooks/VirtApiRESTErrorsHigh.md index 6ea4f2d1..0939fb03 100644 --- a/docs/runbooks/VirtApiRESTErrorsHigh.md +++ b/docs/runbooks/VirtApiRESTErrorsHigh.md @@ -9,7 +9,7 @@ More than 5% of REST calls have failed in the `virt-api` pods in the last 60 min A high rate of failed REST calls to `virt-api` might lead to slow response and execution of API calls. -However, currently running virtual machine workloads are not likely to be affected. +However, currently running virtual machine workloads are not likely to be affected. ## Diagnosis @@ -65,4 +65,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/VirtControllerRESTErrorsBurst.md b/docs/runbooks/VirtControllerRESTErrorsBurst.md index 6f0a7bdf..acbd1f10 100644 --- a/docs/runbooks/VirtControllerRESTErrorsBurst.md +++ b/docs/runbooks/VirtControllerRESTErrorsBurst.md @@ -14,7 +14,7 @@ This error is frequently caused by one of the following problems: ## Impact -Status updates are not propagated and actions like migrations cannot take place. However, running workloads are not impacted. +Status updates are not propagated and actions like migrations cannot take place. 
However, running workloads are not impacted. ## Diagnosis diff --git a/docs/runbooks/VirtControllerRESTErrorsHigh.md b/docs/runbooks/VirtControllerRESTErrorsHigh.md index 2baaaa9d..b17b6c97 100644 --- a/docs/runbooks/VirtControllerRESTErrorsHigh.md +++ b/docs/runbooks/VirtControllerRESTErrorsHigh.md @@ -3,7 +3,7 @@ ## Meaning -More than 5% of REST calls failed in `virt-controller` in the last 60 minutes. +More than 5% of REST calls failed in `virt-controller` in the last 60 minutes. This is most likely because `virt-controller` has partially lost connection to the API server. diff --git a/docs/runbooks/VirtHandlerRESTErrorsHigh.md b/docs/runbooks/VirtHandlerRESTErrorsHigh.md index 9c6937fd..25075e9f 100644 --- a/docs/runbooks/VirtHandlerRESTErrorsHigh.md +++ b/docs/runbooks/VirtHandlerRESTErrorsHigh.md @@ -36,8 +36,8 @@ Node-related actions, such as starting and migrating workloads, are delayed on t ``` Example error message: - - ``` + + ```json {"component":"virt-handler","level":"error","msg":"Can't patch node my-node","pos":"heartbeat.go:96","reason":"the server has received too many API requests and has asked us to try again later","timestamp":"2023-11-06T11:11:41.099883Z","uid":"132c50c2-8d82-4e49-8857-dc737adcd6cc"} ``` @@ -55,4 +55,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/VirtOperatorDown.md b/docs/runbooks/VirtOperatorDown.md index d0dd56c3..8a4a53c7 100644 --- a/docs/runbooks/VirtOperatorDown.md +++ b/docs/runbooks/VirtOperatorDown.md @@ -3,9 +3,9 @@ ## Meaning -This alert fires when no `virt-operator` pod in the `Running` state has been detected for 10 minutes. +This alert fires when no `virt-operator` pod in the `Running` state has been detected for 10 minutes. -The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: +The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: - Installing, live-updating, and live-upgrading a cluster - Monitoring the life cycle of top-level controllers, such as `virt-controller`, `virt-handler`, `virt-launcher`, and managing their reconciliation diff --git a/docs/runbooks/VirtOperatorRESTErrorsBurst.md b/docs/runbooks/VirtOperatorRESTErrorsBurst.md index 6a63c50a..c02f56a2 100644 --- a/docs/runbooks/VirtOperatorRESTErrorsBurst.md +++ b/docs/runbooks/VirtOperatorRESTErrorsBurst.md @@ -14,7 +14,7 @@ This error is frequently caused by one of the following problems: ## Impact -Cluster-level actions, such as upgrading and controller reconciliation, might not be available. +Cluster-level actions, such as upgrading and controller reconciliation, might not be available. However, customer workloads, such as virtual machines (VMs) and VM instances (VMIs), are not likely to be affected. diff --git a/docs/runbooks/VirtOperatorRESTErrorsHigh.md b/docs/runbooks/VirtOperatorRESTErrorsHigh.md index 811560ba..899c4dbb 100644 --- a/docs/runbooks/VirtOperatorRESTErrorsHigh.md +++ b/docs/runbooks/VirtOperatorRESTErrorsHigh.md @@ -13,7 +13,7 @@ This error is frequently caused by one of the following problems: ## Impact -Cluster-level actions, such as upgrading and controller reconciliation, might be delayed. +Cluster-level actions, such as upgrading and controller reconciliation, might be delayed. 
However, customer workloads, such as virtual machines (VMs) and VM instances (VMIs), are not likely to be affected. @@ -57,4 +57,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + From df1ab5124f9d39c57493a65527f7e0a6cd203c28 Mon Sep 17 00:00:00 2001 From: machadovilaca Date: Thu, 18 Apr 2024 13:42:35 +0100 Subject: [PATCH 3/4] Fix docs line length Signed-off-by: machadovilaca --- docs/deprecated_runbooks/KubeMacPoolDown.md | 7 ++- .../KubeVirtComponentExceedsRequestedCPU.md | 3 +- ...KubeVirtComponentExceedsRequestedMemory.md | 3 +- ...erconvergedClusterOperatorNMOInUseAlert.md | 39 +++++++++++---- docs/runbooks/CDIDataImportCronOutdated.md | 50 ++++++++++++++----- .../CDIDataVolumeUnusualRestartCount.md | 9 +++- .../CDIDefaultStorageClassDegraded.md | 26 ++++++---- .../CDIMultipleDefaultVirtStorageClasses.md | 17 +++++-- docs/runbooks/CDINoDefaultStorageClass.md | 15 ++++-- docs/runbooks/CDINotReady.md | 12 +++-- docs/runbooks/CDIOperatorDown.md | 11 ++-- docs/runbooks/CDIStorageProfilesIncomplete.md | 16 ++++-- docs/runbooks/CnaoDown.md | 7 ++- docs/runbooks/CnaoNmstateMigration.md | 10 ++-- docs/runbooks/HCOInstallationIncomplete.md | 15 ++++-- docs/runbooks/HPPNotReady.md | 16 ++++-- docs/runbooks/HPPOperatorDown.md | 13 +++-- docs/runbooks/HPPSharingPoolPathWithOS.md | 17 +++++-- .../runbooks/KubeMacPoolDuplicateMacsFound.md | 4 +- docs/runbooks/KubeVirtCRModified.md | 23 ++++++--- .../KubeVirtDeprecatedAPIRequested.md | 14 ++++-- .../KubeVirtVMIExcessiveMigrations.md | 17 +++++-- docs/runbooks/KubemacpoolDown.md | 7 ++- docs/runbooks/KubevirtVmHighMemoryUsage.md | 12 +++-- docs/runbooks/LowKVMNodesCount.md | 3 +- docs/runbooks/LowReadyVirtControllersCount.md | 25 +++++++--- docs/runbooks/LowReadyVirtOperatorsCount.md | 28 ++++++++--- docs/runbooks/LowVirtAPICount.md | 10 ++-- docs/runbooks/LowVirtControllersCount.md | 25 +++++++--- docs/runbooks/LowVirtOperatorCount.md | 27 +++++++--- docs/runbooks/NetworkAddonsConfigNotReady.md | 13 +++-- docs/runbooks/NoLeadingVirtOperator.md | 27 +++++++--- docs/runbooks/NoReadyVirtController.md | 24 ++++++--- docs/runbooks/NoReadyVirtOperator.md | 27 +++++++--- .../OrphanedVirtualMachineInstances.md | 32 ++++++++---- ...OutdatedVirtualMachineInstanceWorkloads.md | 34 +++++++++---- .../SSPCommonTemplatesModificationReverted.md | 13 +++-- docs/runbooks/SSPDown.md | 15 ++++-- docs/runbooks/SSPFailingToReconcile.md | 14 ++++-- docs/runbooks/SSPHighRateRejectedVms.md | 13 +++-- docs/runbooks/SSPOperatorDown.md | 18 ++++--- docs/runbooks/SSPTemplateValidatorDown.md | 10 ++-- docs/runbooks/SingleStackIPv6Unsupported.md | 16 +++--- docs/runbooks/UnsupportedHCOModification.md | 29 +++++++---- docs/runbooks/VMCannotBeEvicted.md | 15 ++++-- docs/runbooks/VirtAPIDown.md | 7 ++- docs/runbooks/VirtApiRESTErrorsBurst.md | 20 +++++--- docs/runbooks/VirtApiRESTErrorsHigh.md | 19 ++++--- docs/runbooks/VirtControllerDown.md | 13 +++-- .../runbooks/VirtControllerRESTErrorsBurst.md | 23 ++++++--- docs/runbooks/VirtControllerRESTErrorsHigh.md | 24 ++++++--- .../VirtHandlerDaemonSetRolloutFailing.md | 13 +++-- docs/runbooks/VirtHandlerRESTErrorsBurst.md | 27 ++++++---- docs/runbooks/VirtHandlerRESTErrorsHigh.md | 28 ++++++++--- docs/runbooks/VirtOperatorDown.md | 27 +++++++--- docs/runbooks/VirtOperatorRESTErrorsBurst.md | 26 +++++++--- docs/runbooks/VirtOperatorRESTErrorsHigh.md | 24 ++++++--- 
57 files changed, 720 insertions(+), 312 deletions(-) diff --git a/docs/deprecated_runbooks/KubeMacPoolDown.md b/docs/deprecated_runbooks/KubeMacPoolDown.md index 55f4d31c..63ff4a68 100644 --- a/docs/deprecated_runbooks/KubeMacPoolDown.md +++ b/docs/deprecated_runbooks/KubeMacPoolDown.md @@ -5,7 +5,8 @@ ## Meaning -`KubeMacPool` is down. `KubeMacPool` is responsible for allocating MAC addresses and preventing MAC address conflicts. +`KubeMacPool` is down. `KubeMacPool` is responsible for allocating MAC addresses +and preventing MAC address conflicts. ## Impact @@ -41,7 +42,9 @@ If `KubeMacPool` is down, `VirtualMachine` objects cannot be created. ## Mitigation - + If you cannot resolve the issue, see the following resources: diff --git a/docs/deprecated_runbooks/KubeVirtComponentExceedsRequestedCPU.md b/docs/deprecated_runbooks/KubeVirtComponentExceedsRequestedCPU.md index 212e988b..e5cc900f 100644 --- a/docs/deprecated_runbooks/KubeVirtComponentExceedsRequestedCPU.md +++ b/docs/deprecated_runbooks/KubeVirtComponentExceedsRequestedCPU.md @@ -1,4 +1,5 @@ # KubeVirtComponentExceedsRequestedCPU [Deprecated] -This alert has been deprecated; it does not indicate a genuine issue. If triggered, it may be safely ignored and silenced. +This alert has been deprecated; it does not indicate a genuine issue. If +triggered, it may be safely ignored and silenced. diff --git a/docs/deprecated_runbooks/KubeVirtComponentExceedsRequestedMemory.md b/docs/deprecated_runbooks/KubeVirtComponentExceedsRequestedMemory.md index 88711f6d..93fb9327 100644 --- a/docs/deprecated_runbooks/KubeVirtComponentExceedsRequestedMemory.md +++ b/docs/deprecated_runbooks/KubeVirtComponentExceedsRequestedMemory.md @@ -1,4 +1,5 @@ # KubeVirtComponentExceedsRequestedMemory [Deprecated] -This alert has been deprecated; it does not indicate a genuine issue. If triggered, it may be safely ignored and silenced. +This alert has been deprecated; it does not indicate a genuine issue. If +triggered, it may be safely ignored and silenced. diff --git a/docs/deprecated_runbooks/KubevirtHyperconvergedClusterOperatorNMOInUseAlert.md b/docs/deprecated_runbooks/KubevirtHyperconvergedClusterOperatorNMOInUseAlert.md index b87d39c9..72ef80d0 100644 --- a/docs/deprecated_runbooks/KubevirtHyperconvergedClusterOperatorNMOInUseAlert.md +++ b/docs/deprecated_runbooks/KubevirtHyperconvergedClusterOperatorNMOInUseAlert.md @@ -3,16 +3,27 @@ ## Meaning - + - + - + -This alert fires when _integrated_ Node Maintenance Operator (NMO) custom resources (CRs) are detected. This alert only affects OKD 1.6. - -The presence of `NodeMaintenance` CRs belonging to the `nodemaintenance.kubevirt.io` API group indicates that the node specified in `spec.nodeName` was put into maintenance mode. The target node has been [cordoned off](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cordon) and [drained](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service). +This alert fires when _integrated_ Node Maintenance Operator (NMO) custom +resources (CRs) are detected. This alert only affects OKD 1.6. + +The presence of `NodeMaintenance` CRs belonging to the +`nodemaintenance.kubevirt.io` API group indicates that the node specified in +`spec.nodeName` was put into maintenance mode. 
The target node has been +[cordoned off](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cordon) +and [drained](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service). ## Impact @@ -67,13 +78,21 @@ You cannot upgrade to OKD 1.7. ## Mitigation -Remove all NMO CRs belonging to the `nodemaintenance.nodemaintenance.kubevirt.io/` API group. After the integrated NMO resources are removed, the alert is cleared and you can upgrade. +Remove all NMO CRs belonging to the +`nodemaintenance.nodemaintenance.kubevirt.io/` API group. After the integrated +NMO resources are removed, the alert is cleared and you can upgrade. -If a node must remain in maintenance mode during upgrade, install the Node Maintenance Operator from OperatorHub. Then, create an NMO CR belonging to the `nodemaintenance.nodemaintenance.medik8s.io/v1beta1` API group and version for the node. +If a node must remain in maintenance mode during upgrade, install the Node +Maintenance Operator from OperatorHub. Then, create an NMO CR belonging to the +`nodemaintenance.nodemaintenance.medik8s.io/v1beta1` API group and version for +the node. - + -See the [HCO cluster configuration documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#enablecommonbootimageimport-feature-gate) for more information. +See the [HCO cluster configuration documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#enablecommonbootimageimport-feature-gate) +for more information. If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/CDIDataImportCronOutdated.md b/docs/runbooks/CDIDataImportCronOutdated.md index 0a29d95a..67856203 100644 --- a/docs/runbooks/CDIDataImportCronOutdated.md +++ b/docs/runbooks/CDIDataImportCronOutdated.md @@ -3,11 +3,18 @@ ## Meaning -This alert fires when `DataImportCron` cannot poll or import the latest disk image versions. +This alert fires when `DataImportCron` cannot poll or import the latest disk +image versions. -`DataImportCron` polls disk images, checking for the latest versions, and imports the images into persistent volume claims (PVCs) or VolumeSnapshots. This process ensures that these sources are updated to the latest version so that they can be used as reliable clone sources or golden images for virtual machines (VMs). +`DataImportCron` polls disk images, checking for the latest versions, and +imports the images into persistent volume claims (PVCs) or VolumeSnapshots. This +process ensures that these sources are updated to the latest version so that +they can be used as reliable clone sources or golden images for virtual machines +(VMs). -For golden images, _latest_ refers to the latest operating system of the distribution. For other disk images, _latest_ refers to the latest hash of the image that is available. +For golden images, _latest_ refers to the latest operating system of the +distribution. For other disk images, _latest_ refers to the latest hash of the +image that is available. ## Impact @@ -23,7 +30,13 @@ VMs might fail to start because no boot source is available for cloning. $ kubectl get sc ``` - The output displays the storage classes with `(default)` beside the name of the default storage class. You must set a default storage class, either on the cluster or in the `DataImportCron` specification, in order for the `DataImportCron` to poll and import golden images. 
If no storage class is defined, the DataVolume controller fails to create PVCs and the following event is displayed: `DataVolume.storage spec is missing accessMode and no storageClass to choose profile`. + The output displays the storage classes with `(default)` beside the name of + the default storage class. You must set a default storage class, either on + the cluster or in the `DataImportCron` specification, in order for the + `DataImportCron` to poll and import golden images. If no storage class is + defined, the DataVolume controller fails to create PVCs and the following + event is displayed: `DataVolume.storage spec is missing accessMode and no + storageClass to choose profile`. 2. Obtain the `DataImportCron` namespace and name: @@ -31,7 +44,8 @@ VMs might fail to start because no boot source is available for cloning. $ kubectl get dataimportcron -A -o json | jq -r '.items[] | select(.status.conditions[] | select(.type == "UpToDate" and .status == "False")) | .metadata.namespace + "/" + .metadata.name' ``` -3. If a default storage class is not defined on the cluster, check the `DataImportCron` specification for a default storage class: +3. If a default storage class is not defined on the cluster, check the +`DataImportCron` specification for a default storage class: ```bash $ kubectl get dataimportcron -o yaml | grep -B 5 storageClassName @@ -48,7 +62,8 @@ VMs might fail to start because no boot source is available for cloning. storageClassName: rook-ceph-block ``` -4. Obtain the name of the `DataVolume` associated with the `DataImportCron` object: +4. Obtain the name of the `DataVolume` associated with the `DataImportCron` +object: ```bash $ kubectl -n get dataimportcron -o json | jq .status.lastImportedPVC.name @@ -74,20 +89,31 @@ VMs might fail to start because no boot source is available for cloning. ## Mitigation -1. Set a default storage class, either on the cluster or in the `DataImportCron` specification, to poll and import golden images. The updated Containerized Data Importer (CDI) should resolve the issue within a few seconds. +1. Set a default storage class, either on the cluster or in the `DataImportCron` +specification, to poll and import golden images. The updated Containerized Data +Importer (CDI) should resolve the issue within a few seconds. -2. If the issue does not resolve itself, or, if you have changed the default storage class in the cluster, -you must delete the existing boot sources (datavolumes or volumesnapshots) in the cluster namespace that are configured with the previous default storage class. The CDI will recreate the data volumes with the newly configured default storage class. +2. If the issue does not resolve itself, or, if you have changed the default +storage class in the cluster, +you must delete the existing boot sources (datavolumes or volumesnapshots) in +the cluster namespace that are configured with the previous default storage +class. The CDI will recreate the data volumes with the newly configured default +storage class. -3. If your cluster is installed in a restricted network environment, disable the `enableCommonBootImageImport` feature gate in order to opt out of automatic updates: +3. 
If your cluster is installed in a restricted network environment, disable the +`enableCommonBootImageImport` feature gate in order to opt out of automatic +updates: ```bash $ kubectl patch hco kubevirt-hyperconverged -n $CDI_NAMESPACE --type json -p '[{"op": "replace", "path": "/spec/featureGates/enableCommonBootImageImport", "value": false}]' ``` - + -See the [HCO cluster configuration documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#enablecommonbootimageimport-feature-gate) for more information. +See the [HCO cluster configuration documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#enablecommonbootimageimport-feature-gate) +for more information. If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/CDIDataVolumeUnusualRestartCount.md b/docs/runbooks/CDIDataVolumeUnusualRestartCount.md index f6368655..d4b632c3 100644 --- a/docs/runbooks/CDIDataVolumeUnusualRestartCount.md +++ b/docs/runbooks/CDIDataVolumeUnusualRestartCount.md @@ -7,7 +7,10 @@ This alert fires when a `DataVolume` object restarts more than three times. ## Impact -Data volumes are responsible for importing and creating a virtual machine disk on a persistent volume claim. If a data volume restarts more than three times, these operations are unlikely to succeed. You must diagnose and resolve the issue. +Data volumes are responsible for importing and creating a virtual machine disk +on a persistent volume claim. If a data volume restarts more than three times, +these operations are unlikely to succeed. You must diagnose and resolve the +issue. ## Diagnosis @@ -33,7 +36,9 @@ Data volumes are responsible for importing and creating a virtual machine disk o Delete the data volume, resolve the issue, and create a new data volume. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/CDIDefaultStorageClassDegraded.md b/docs/runbooks/CDIDefaultStorageClassDegraded.md index 7c8d33ee..f8b511dc 100644 --- a/docs/runbooks/CDIDefaultStorageClassDegraded.md +++ b/docs/runbooks/CDIDefaultStorageClassDegraded.md @@ -3,15 +3,19 @@ ## Meaning -This alert fires when the default (Kubernetes or virtualization) storage class supports smart clone (either CSI or snapshot based) and ReadWriteMany. +This alert fires when the default (Kubernetes or virtualization) storage class +supports smart clone (either CSI or snapshot based) and ReadWriteMany. -A default virtualization storage class has precedence over a default Kubernetes storage class for creating a VirtualMachine disk image. +A default virtualization storage class has precedence over a default Kubernetes +storage class for creating a VirtualMachine disk image. ## Impact -If the default storage class does not support smart clone, we fallback to host-assisted cloning, which is the least efficient method of cloning. +If the default storage class does not support smart clone, we fallback to +host-assisted cloning, which is the least efficient method of cloning. -If the default storage class does not suppprt ReadWriteMany, a virtual machine using it is not live-migratable. +If the default storage class does not suppprt ReadWriteMany, a virtual machine +using it is not live-migratable. 
## Diagnosis @@ -21,23 +25,27 @@ $ export CDI_DEFAULT_VIRT_SC="$(kubectl get sc -o json | jq -r '.items[].metadat $ echo default_virt_sc=$CDI_DEFAULT_VIRT_SC ``` -If the default virtualization storage class is set, check if it supports ReadWriteMany +If the default virtualization storage class is set, check if it supports +ReadWriteMany ```bash $ kubectl get storageprofile $CDI_DEFAULT_VIRT_SC -o json | jq '.status.claimPropertySets'| grep ReadWriteMany ``` -Otherwise, if the default virtualization storage class is not set, get the default Kubernetes storage class: +Otherwise, if the default virtualization storage class is not set, get the +default Kubernetes storage class: ```bash $ export CDI_DEFAULT_K8S_SC="$(kubectl get sc -o json | jq -r '.items[].metadata|select(.annotations."storageclass.kubernetes.io/is-default-class"=="true")|.name')" $ echo default_k8s_sc=$CDI_DEFAULT_K8S_SC ``` -If the default Kubernetes storage class is set, check if it supports ReadWriteMany: +If the default Kubernetes storage class is set, check if it supports +ReadWriteMany: ```bash $ kubectl get storageprofile $CDI_DEFAULT_K8S_SC -o json | jq '.status.claimPropertySets'| grep ReadWriteMany ``` -See [doc](https://github.com/kubevirt/containerized-data-importer/blob/main/doc/efficient-cloning.md) for details about smart clone prerequisites. +See [doc](https://github.com/kubevirt/containerized-data-importer/blob/main/doc/efficient-cloning.md) +for details about smart clone prerequisites. ## Mitigation @@ -48,4 +56,4 @@ If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/CDIMultipleDefaultVirtStorageClasses.md b/docs/runbooks/CDIMultipleDefaultVirtStorageClasses.md index a6baede0..d4f19781 100644 --- a/docs/runbooks/CDIMultipleDefaultVirtStorageClasses.md +++ b/docs/runbooks/CDIMultipleDefaultVirtStorageClasses.md @@ -5,15 +5,19 @@ This alert fires when more than one default virtualization storage class exists. -A default virtualization storage class has precedence over a default Kubernetes storage class for creating a VirtualMachine disk image. +A default virtualization storage class has precedence over a default Kubernetes +storage class for creating a VirtualMachine disk image. ## Impact -If more than one default virtualization storage class exists, a data volume that requests a default storage class (storage class not explicitly specified), receives the most recently created one. +If more than one default virtualization storage class exists, a data volume that +requests a default storage class (storage class not explicitly specified), +receives the most recently created one. ## Diagnosis -Obtain a list of default virtualization storage classes by running the following command: +Obtain a list of default virtualization storage classes by running the following +command: ```bash $ kubectl get sc -o json | jq '.items[].metadata|select(.annotations."storageclass.kubevirt.io/is-default-virt-class"=="true")|.name' @@ -21,9 +25,12 @@ $ kubectl get sc -o json | jq '.items[].metadata|select(.annotations."storagecla ## Mitigation -Ensure that only one storage class has the default virtualization storage class annotation. +Ensure that only one storage class has the default virtualization storage class +annotation. 
- + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/CDINoDefaultStorageClass.md b/docs/runbooks/CDINoDefaultStorageClass.md index ac9267db..53edac09 100644 --- a/docs/runbooks/CDINoDefaultStorageClass.md +++ b/docs/runbooks/CDINoDefaultStorageClass.md @@ -3,13 +3,17 @@ ## Meaning -This alert fires when there is no default (Kubernetes or virtualization) storage class, and a data volume is pending for one. +This alert fires when there is no default (Kubernetes or virtualization) storage +class, and a data volume is pending for one. -A default virtualization storage class has precedence over a default Kubernetes storage class for creating a VirtualMachine disk image. +A default virtualization storage class has precedence over a default Kubernetes +storage class for creating a VirtualMachine disk image. ## Impact -If there is no default (k8s or virt) storage class, a data volume that requests a default storage class (storage class not explicitly specified) will be pending for one. +If there is no default (k8s or virt) storage class, a data volume that requests +a default storage class (storage class not explicitly specified) will be pending +for one. ## Diagnosis @@ -35,11 +39,12 @@ $ kubectl patch storageclass -p '{"metadata": {"annotations ## Mitigation -Ensure that there is one storage class that has the default (k8s or virt) storage class annotation. +Ensure that there is one storage class that has the default (k8s or virt) +storage class annotation. If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/CDINotReady.md b/docs/runbooks/CDINotReady.md index 3d68f394..6f8fde6d 100644 --- a/docs/runbooks/CDINotReady.md +++ b/docs/runbooks/CDINotReady.md @@ -3,15 +3,17 @@ ## Meaning -This alert fires when the containerized data importer (CDI) is in a degraded state: +This alert fires when the containerized data importer (CDI) is in a degraded +state: - Not progressing - Not available to use ## Impact -CDI is not usable, so users cannot build virtual machine disks on persistent volume claims (PVCs) using CDI's data volumes. -CDI components are not ready and they stopped progressing towards a ready state. +CDI is not usable, so users cannot build virtual machine disks on persistent +volume claims (PVCs) using CDI's data volumes. CDI components are not ready, and +they stopped progressing towards a ready state. ## Diagnosis @@ -43,7 +45,9 @@ CDI components are not ready and they stopped progressing towards a ready state. Try to identify the root cause and resolve the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/CDIOperatorDown.md b/docs/runbooks/CDIOperatorDown.md index 2e38b284..1a1c212b 100644 --- a/docs/runbooks/CDIOperatorDown.md +++ b/docs/runbooks/CDIOperatorDown.md @@ -4,11 +4,14 @@ ## Meaning This alert fires when the Containerized Data Importer (CDI) Operator is down. -The CDI Operator deploys and manages the CDI infrastructure components, such as data volume and persistent volume claim (PVC) controllers. These controllers help users build virtual machine disks on PVCs. +The CDI Operator deploys and manages the CDI infrastructure components, such as +data volume and persistent volume claim (PVC) controllers. These controllers +help users build virtual machine disks on PVCs. 
## Impact -The CDI components might fail to deploy or to stay in a required state. The CDI installation might not function correctly. +The CDI components might fail to deploy or to stay in a required state. The CDI +installation might not function correctly. ## Diagnosis @@ -38,7 +41,9 @@ The CDI components might fail to deploy or to stay in a required state. The CDI ## Mitigation - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/CDIStorageProfilesIncomplete.md b/docs/runbooks/CDIStorageProfilesIncomplete.md index 7754762a..3516bb45 100644 --- a/docs/runbooks/CDIStorageProfilesIncomplete.md +++ b/docs/runbooks/CDIStorageProfilesIncomplete.md @@ -3,9 +3,12 @@ ## Meaning -This alert fires when a Containerized Data Importer (CDI) storage profile is incomplete. +This alert fires when a Containerized Data Importer (CDI) storage profile is +incomplete. -If a storage profile is incomplete, the CDI cannot infer persistent volume claim (PVC) fields, such as `volumeMode` and `accessModes`, which are required to create a virtual machine (VM) disk. +If a storage profile is incomplete, the CDI cannot infer persistent volume claim +(PVC) fields, such as `volumeMode` and `accessModes`, which are required to +create a virtual machine (VM) disk. ## Impact @@ -28,11 +31,14 @@ $ kubectl patch storageprofile local --type=merge -p '{"spec": {"claimPropertySe ``` -See [Empty profiles](https://github.com/kubevirt/containerized-data-importer/blob/main/doc/storageprofile.md#empty-storage-profile) and -[User defined profiles](https://github.com/kubevirt/containerized-data-importer/blob/main/doc/storageprofile.md#user-defined-storage-profile) for more details about storage profiles. +See [Empty profiles](https://github.com/kubevirt/containerized-data-importer/blob/main/doc/storageprofile.md#empty-storage-profile) +and [User defined profiles](https://github.com/kubevirt/containerized-data-importer/blob/main/doc/storageprofile.md#user-defined-storage-profile) +for more details about storage profiles. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/CnaoDown.md b/docs/runbooks/CnaoDown.md index 9fe93d7f..89b235cf 100644 --- a/docs/runbooks/CnaoDown.md +++ b/docs/runbooks/CnaoDown.md @@ -8,7 +8,8 @@ The CNAO deploys additional networking components on top of the cluster. ## Impact -If the CNAO is not running, the cluster cannot reconcile changes to virtual machine components. As a result, the changes might fail to take effect. +If the CNAO is not running, the cluster cannot reconcile changes to virtual +machine components. As a result, the changes might fail to take effect. ## Diagnosis @@ -38,7 +39,9 @@ If the CNAO is not running, the cluster cannot reconcile changes to virtual mach ## Mitigation - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/CnaoNmstateMigration.md b/docs/runbooks/CnaoNmstateMigration.md index 765057d0..02216530 100644 --- a/docs/runbooks/CnaoNmstateMigration.md +++ b/docs/runbooks/CnaoNmstateMigration.md @@ -3,9 +3,12 @@ ## Meaning -This alert fires when a `kubernetes-nmstate` deployment is detected and the Kubernetes NMState Operator is not installed. This alert only affects OpenShift Virtualization 4.10. +This alert fires when a `kubernetes-nmstate` deployment is detected and the +Kubernetes NMState Operator is not installed. This alert only affects OpenShift +Virtualization 4.10. 
-The Cluster Network Addons Operator (CNAO) does not support `kubernetes-nmstate` deployments in OpenShift Virtualization 4.11 and later. +The Cluster Network Addons Operator (CNAO) does not support `kubernetes-nmstate` +deployments in OpenShift Virtualization 4.11 and later. ## Impact @@ -13,6 +16,7 @@ You cannot upgrade your cluster to OpenShift Virtualization 4.11. ## Mitigation -Install the Kubernetes NMState Operator from the OperatorHub. CNAO automatically transfers the `kubernetes-nmstate` deployment to the Operator. +Install the Kubernetes NMState Operator from the OperatorHub. CNAO automatically +transfers the `kubernetes-nmstate` deployment to the Operator. Afterwards, you can upgrade to OpenShift Virtualization 4.11. diff --git a/docs/runbooks/HCOInstallationIncomplete.md b/docs/runbooks/HCOInstallationIncomplete.md index aa01bf5d..d5f88e5c 100644 --- a/docs/runbooks/HCOInstallationIncomplete.md +++ b/docs/runbooks/HCOInstallationIncomplete.md @@ -2,16 +2,20 @@ ## Meaning -This alert fires when the HyperConverged Cluster Operator (HCO) runs for more than an hour without a `HyperConverged` custom resource (CR). +This alert fires when the HyperConverged Cluster Operator (HCO) runs for more +than an hour without a `HyperConverged` custom resource (CR). This alert has the following causes: -- During the installation process, you installed the HCO but you did not create the `HyperConverged` CR. -- During the uninstall process, you removed the `HyperConverged` CR before uninstalling the HCO and the HCO is still running. +- During the installation process, you installed the HCO but you did not create +the `HyperConverged` CR. +- During the uninstall process, you removed the `HyperConverged` CR before +uninstalling the HCO and the HCO is still running. ## Mitigation -Installation: Complete the installation by creating a `HyperConverged` CR with its default values: +Installation: Complete the installation by creating a `HyperConverged` CR with +its default values: ```bash $ cat < + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/HPPOperatorDown.md b/docs/runbooks/HPPOperatorDown.md index 0b6f970d..24da04d7 100644 --- a/docs/runbooks/HPPOperatorDown.md +++ b/docs/runbooks/HPPOperatorDown.md @@ -5,11 +5,13 @@ This alert fires when the hostpath provisioner (HPP) Operator is down. -The HPP Operator deploys and manages the HPP infrastructure components, such as the daemon set that provisions hostpath volumes. +The HPP Operator deploys and manages the HPP infrastructure components, such as +the daemon set that provisions hostpath volumes. ## Impact -The HPP components might fail to deploy or to remain in the required state. As a result, the HPP installation might not work correctly in the cluster. +The HPP components might fail to deploy or to remain in the required state. As a +result, the HPP installation might not work correctly in the cluster. ## Diagnosis @@ -39,9 +41,12 @@ The HPP components might fail to deploy or to remain in the required state. As a ## Mitigation -Based on the information obtained during Diagnosis, try to find and resolve the cause of the issue. +Based on the information obtained during Diagnosis, try to find and resolve the +cause of the issue. 
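As a starting point, the sketch below locates the HPP Operator deployment and checks its pods and recent logs; the deployment name `hostpath-provisioner-operator` is assumed from the default installation and may differ in your cluster:

```bash
# Find the namespace of the HPP Operator deployment
$ export NAMESPACE="$(kubectl get deployment -A | grep hostpath-provisioner-operator | awk '{print $1}')"

# Check the operator pods and inspect recent logs for errors
$ kubectl -n $NAMESPACE get pods | grep hostpath-provisioner-operator
$ kubectl -n $NAMESPACE logs deployment/hostpath-provisioner-operator --tail=100
```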
- + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/HPPSharingPoolPathWithOS.md b/docs/runbooks/HPPSharingPoolPathWithOS.md index 9747b65f..cf5c12ef 100644 --- a/docs/runbooks/HPPSharingPoolPathWithOS.md +++ b/docs/runbooks/HPPSharingPoolPathWithOS.md @@ -3,13 +3,16 @@ ## Meaning -This alert fires when the hostpath provisioner (HPP) shares a file system with other critical components, such as `kubelet` or the operating system (OS). +This alert fires when the hostpath provisioner (HPP) shares a file system with +other critical components, such as `kubelet` or the operating system (OS). -HPP dynamically provisions hostpath volumes to provide storage for persistent volume claims (PVCs). +HPP dynamically provisions hostpath volumes to provide storage for persistent +volume claims (PVCs). ## Impact -A shared hostpath pool puts pressure on the node's disks. The node might have degraded performance and stability. +A shared hostpath pool puts pressure on the node's disks. The node might have +degraded performance and stability. ## Diagnosis @@ -39,9 +42,13 @@ A shared hostpath pool puts pressure on the node's disks. The node might have de ## Mitigation -Using the data obtained in the Diagnosis section, try to prevent the pool path from being shared with the OS. The specific steps vary based on the node and other circumstances. +Using the data obtained in the Diagnosis section, try to prevent the pool path +from being shared with the OS. The specific steps vary based on the node and +other circumstances. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/KubeMacPoolDuplicateMacsFound.md b/docs/runbooks/KubeMacPoolDuplicateMacsFound.md index cbe829ab..a3c41e5f 100644 --- a/docs/runbooks/KubeMacPoolDuplicateMacsFound.md +++ b/docs/runbooks/KubeMacPoolDuplicateMacsFound.md @@ -5,7 +5,9 @@ This alert fires when `KubeMacPool` detects duplicate MAC addresses. -`KubeMacPool` is responsible for allocating MAC addresses and preventing MAC address conflicts. When `KubeMacPool` starts, it scans the cluster for the MAC addresses of virtual machines (VMs) in managed namespaces. +`KubeMacPool` is responsible for allocating MAC addresses and preventing MAC +address conflicts. When `KubeMacPool` starts, it scans the cluster for the MAC +addresses of virtual machines (VMs) in managed namespaces. ## Impact diff --git a/docs/runbooks/KubeVirtCRModified.md b/docs/runbooks/KubeVirtCRModified.md index eedbeb93..43b19854 100644 --- a/docs/runbooks/KubeVirtCRModified.md +++ b/docs/runbooks/KubeVirtCRModified.md @@ -3,19 +3,26 @@ ## Meaning -This alert fires when an operand of the HyperConverged Cluster Operator (HCO) is changed by someone or something other than HCO. +This alert fires when an operand of the HyperConverged Cluster Operator (HCO) is +changed by someone or something other than HCO. -HCO configures KubeVirt and its supporting operators in an opinionated way and overwrites its operands when there is an unexpected change to them. Users must not modify the operands directly. The `HyperConverged` custom resource is the source of truth for the configuration. +HCO configures KubeVirt and its supporting operators in an opinionated way and +overwrites its operands when there is an unexpected change to them. Users must +not modify the operands directly. The `HyperConverged` custom resource is the +source of truth for the configuration. 
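For example, a configuration change such as a feature gate is made on the `HyperConverged` CR, and HCO then propagates it to the operand. This is a sketch only; it assumes the CR is named `kubevirt-hyperconverged`, that `$NAMESPACE` points to the installation namespace, and it reuses the feature gate path shown earlier in this patch:

```bash
# Change configuration on the HyperConverged CR instead of editing the operand
$ kubectl patch hco kubevirt-hyperconverged -n $NAMESPACE --type json \
    -p '[{"op": "replace", "path": "/spec/featureGates/enableCommonBootImageImport", "value": true}]'
```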
## Impact -Changing the operands manually causes the cluster configuration to fluctuate and might lead to instability. +Changing the operands manually causes the cluster configuration to fluctuate and +might lead to instability. ## Diagnosis -Check the `component_name` in the alert details to determine the operand that is being changed. +Check the `component_name` in the alert details to determine the operand that is +being changed. -In the following example, the operand kind is `kubevirt` and the operand name is `kubevirt-kubevirt-hyperconverged`: +In the following example, the operand kind is `kubevirt` and the operand name is +`kubevirt-kubevirt-hyperconverged`: ```text Labels @@ -26,6 +33,8 @@ Labels ## Mitigation -Do not change the HCO operands directly. Use `HyperConverged` objects to configure the cluster. +Do not change the HCO operands directly. Use `HyperConverged` objects to +configure the cluster. -The alert resolves itself after 10 minutes if the operands are not changed manually. +The alert resolves itself after 10 minutes if the operands are not changed +manually. diff --git a/docs/runbooks/KubeVirtDeprecatedAPIRequested.md b/docs/runbooks/KubeVirtDeprecatedAPIRequested.md index 7a743b2f..23cd89df 100644 --- a/docs/runbooks/KubeVirtDeprecatedAPIRequested.md +++ b/docs/runbooks/KubeVirtDeprecatedAPIRequested.md @@ -7,11 +7,13 @@ This alert fires when a deprecated KubeVirt API is requested. ## Impact -Usage of deprecated APIs is not recommended because they will be removed in a future release. +Usage of deprecated APIs is not recommended because they will be removed in a +future release. ## Diagnosis -Check the `description` and `summary` alert annotations for more details on which API is being accessed, for example: +Check the `description` and `summary` alert annotations for more details on +which API is being accessed, for example: ```text description: "Detected requests to the deprecated virtualmachines.kubevirt.io/v1alpha3 API." summary: "2 requests were detected in the last 10 minutes." @@ -21,9 +23,11 @@ Check the `description` and `summary` alert annotations for more details on whic Make sure to only use a supported version when making requests to the API. -Some requests to deprecated APIs are made by KubeVirt components themselves (e.g VirtualMachineInstancePresets). -These alerts cannot be mitigated because the requests are still necessary to serve the deprecated API. -They are harmless and will be resolved when the deprecated API is removed in a future release of KubeVirt. +Some requests to deprecated APIs are made by KubeVirt components themselves +(e.g VirtualMachineInstancePresets). These alerts cannot be mitigated because +the requests are still necessary to serve the deprecated API. They are harmless +and will be resolved when the deprecated API is removed in a future release of +KubeVirt. Alerts will resolve after 10 minutes if the deprecated API is not used again. diff --git a/docs/runbooks/KubeVirtVMIExcessiveMigrations.md b/docs/runbooks/KubeVirtVMIExcessiveMigrations.md index 5424b66a..78021eb1 100644 --- a/docs/runbooks/KubeVirtVMIExcessiveMigrations.md +++ b/docs/runbooks/KubeVirtVMIExcessiveMigrations.md @@ -3,13 +3,17 @@ ## Meaning -This alert fires when a virtual machine instance (VMI) live migrates more than 12 times over a period of 24 hours. +This alert fires when a virtual machine instance (VMI) live migrates more than +12 times over a period of 24 hours. -This migration rate is abnormally high, even during an upgrade. 
This alert might indicate a problem in the cluster infrastructure, such as network disruptions or insufficient resources. +This migration rate is abnormally high, even during an upgrade. This alert might +indicate a problem in the cluster infrastructure, such as network disruptions or +insufficient resources. ## Impact -A virtual machine (VM) that migrates too frequently might experience degraded performance because memory page faults occur during the transition. +A virtual machine (VM) that migrates too frequently might experience degraded +performance because memory page faults occur during the transition. ## Diagnosis @@ -93,11 +97,14 @@ A virtual machine (VM) that migrates too frequently might experience degraded pe ## Mitigation -Ensure that the worker nodes have sufficient resources (CPU, memory, disk) to run VM workloads without interruption. +Ensure that the worker nodes have sufficient resources (CPU, memory, disk) to +run VM workloads without interruption. If the problem persists, try to identify the root cause and resolve the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/KubemacpoolDown.md b/docs/runbooks/KubemacpoolDown.md index a9bac79c..0d4abfe7 100644 --- a/docs/runbooks/KubemacpoolDown.md +++ b/docs/runbooks/KubemacpoolDown.md @@ -3,7 +3,8 @@ ## Meaning -`KubeMacPool` is down. `KubeMacPool` is responsible for allocating MAC addresses and preventing MAC address conflicts. +`KubeMacPool` is down. `KubeMacPool` is responsible for allocating MAC addresses +and preventing MAC address conflicts. ## Impact @@ -39,7 +40,9 @@ If `KubeMacPool` is down, `VirtualMachine` objects cannot be created. ## Mitigation - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/KubevirtVmHighMemoryUsage.md b/docs/runbooks/KubevirtVmHighMemoryUsage.md index 9d76cc9e..3fc5e6a7 100644 --- a/docs/runbooks/KubevirtVmHighMemoryUsage.md +++ b/docs/runbooks/KubevirtVmHighMemoryUsage.md @@ -3,11 +3,13 @@ ## Meaning -This alert fires when a container hosting a virtual machine (VM) has less than 20 MB free memory. +This alert fires when a container hosting a virtual machine (VM) has less than +20 MB free memory. ## Impact -The virtual machine running inside the container is terminated by the runtime if the container's memory limit is exceeded. +The virtual machine running inside the container is terminated by the runtime +if the container's memory limit is exceeded. ## Diagnosis @@ -17,7 +19,8 @@ The virtual machine running inside the container is terminated by the runtime if $ kubectl get pod -o yaml ``` -2. Identify `compute` container processes with high memory usage in the `virt-launcher` pod: +2. Identify `compute` container processes with high memory usage in the +`virt-launcher` pod: ```bash $ kubectl exec -it -c compute -- top @@ -25,7 +28,8 @@ The virtual machine running inside the container is terminated by the runtime if ## Mitigation -Increase the memory limit in the `VirtualMachine` specification as in the following example: +Increase the memory limit in the `VirtualMachine` specification as in the +following example: ```yaml spec: diff --git a/docs/runbooks/LowKVMNodesCount.md b/docs/runbooks/LowKVMNodesCount.md index 2c61c90c..f46b96ee 100644 --- a/docs/runbooks/LowKVMNodesCount.md +++ b/docs/runbooks/LowKVMNodesCount.md @@ -25,5 +25,6 @@ $ kubectl get nodes -o jsonpath='{.items[*].status.allocatable}' | grep devices. 
Validate the [hardware virtualization support](https://kubevirt.io/user-guide/operations/installation/#validate-hardware-virtualization-support). -If hardware virtualization is not available, [software emulation](https://github.com/kubevirt/kubevirt/blob/master/docs/software-emulation.md) can be enabled. +If hardware virtualization is not available, [software emulation](https://github.com/kubevirt/kubevirt/blob/master/docs/software-emulation.md) +can be enabled. diff --git a/docs/runbooks/LowReadyVirtControllersCount.md b/docs/runbooks/LowReadyVirtControllersCount.md index a2cb1943..90285943 100644 --- a/docs/runbooks/LowReadyVirtControllersCount.md +++ b/docs/runbooks/LowReadyVirtControllersCount.md @@ -4,13 +4,19 @@ ## Meaning -This alert fires when one or more `virt-controller` pods are running, but none of these pods has been in the `Ready` state for the last 5 minutes. +This alert fires when one or more `virt-controller` pods are running, but none +of these pods has been in the `Ready` state for the last 5 minutes. -A `virt-controller` device monitors the custom resource definitions (CRDs) of a virtual machine instance (VMI) and manages the associated pods. The device create pods for VMIs and manages the lifecycle of the pods. The device is critical for cluster-wide virtualization functionality. +A `virt-controller` device monitors the custom resource definitions (CRDs) of a +virtual machine instance (VMI) and manages the associated pods. The device +creates pods for VMIs and manages the lifecycle of the pods. The device is +critical for cluster-wide virtualization functionality. ## Impact -This alert indicates that a cluster-level failure might occur, which would cause actions related to VM lifecycle management to fail. This notably includes launching a new VMI or shutting down an existing VMI. +This alert indicates that a cluster-level failure might occur, which would cause +actions related to VM lifecycle management to fail. This notably includes +launching a new VMI or shutting down an existing VMI. ## Diagnosis @@ -32,13 +38,15 @@ This alert indicates that a cluster-level failure might occur, which would cause $ kubectl -n $NAMESPACE get deploy virt-controller -o yaml ``` -4. Obtain the details of the `virt-controller` deployment to check for status conditions, such as crashing pods or failures to pull images: +4. Obtain the details of the `virt-controller` deployment to check for status +conditions, such as crashing pods or failures to pull images: ```bash $ kubectl -n $NAMESPACE describe deploy virt-controller ``` -5. Check if any problems occurred with the nodes. For example, they might be in a `NotReady` state: +5. Check if any problems occurred with the nodes. For example, they might be in +a `NotReady` state: ```bash $ kubectl get nodes ``` ## Mitigation This alert can have multiple causes, including the following: - Not enough memory on the cluster - Nodes are down -- The API server is overloaded. For example, the scheduler might be under a heavy load and therefore not completely available. +- The API server is overloaded. For example, the scheduler might be under a +heavy load and therefore not completely available. - Networking issues Try to identify the root cause and resolve the issue. 
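The sketch below gives a quick first pass over the causes listed above; `kubectl top` assumes that a metrics server is installed in the cluster:

```bash
# Check for NotReady nodes and node conditions such as memory pressure
$ kubectl get nodes
$ kubectl describe nodes | grep -A 6 'Conditions:'

# Check node resource consumption (requires metrics-server)
$ kubectl top nodes

# Verify that the API server itself reports healthy
$ kubectl get --raw='/readyz?verbose'
```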
- + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/LowReadyVirtOperatorsCount.md b/docs/runbooks/LowReadyVirtOperatorsCount.md index 7fcfcac9..17486b3b 100644 --- a/docs/runbooks/LowReadyVirtOperatorsCount.md +++ b/docs/runbooks/LowReadyVirtOperatorsCount.md @@ -3,21 +3,30 @@ ## Meaning -This alert fires when one or more `virt-operator` pods are running, but none of these pods has been in a `Ready` state for the last 10 minutes. +This alert fires when one or more `virt-operator` pods are running, but none of +these pods has been in a `Ready` state for the last 10 minutes. -The `virt-operator` is the first Operator to start in a cluster. The `virt-operator` deployment has a default replica of two `virt-operator` pods. +The `virt-operator` is the first Operator to start in a cluster. The +`virt-operator` deployment has a default replica of two `virt-operator` pods. Its primary responsibilities include the following: - Installing, live-updating, and live-upgrading a cluster -- Monitoring the lifecycle of top-level controllers, such as `virt-controller`, `virt-handler`, `virt-launcher`, and managing their reconciliation -- Certain cluster-wide tasks, such as certificate rotation and infrastructure management +- Monitoring the lifecycle of top-level controllers, such as `virt-controller`, +`virt-handler`, `virt-launcher`, and managing their reconciliation +- Certain cluster-wide tasks, such as certificate rotation and infrastructure +management ## Impact -A cluster-level failure might occur. Critical cluster-wide management functionalities, such as certification rotation, upgrade, and reconciliation of controllers, might become unavailable. Such a state also triggers the `NoReadyVirtOperator` alert. +A cluster-level failure might occur. Critical cluster-wide management +functionalities, such as certification rotation, upgrade, and reconciliation of +controllers, might become unavailable. Such a state also triggers the +`NoReadyVirtOperator` alert. -The `virt-operator` is not directly responsible for virtual machines (VMs) in the cluster. Therefore, its temporary unavailability does not significantly affect VM workloads. +The `virt-operator` is not directly responsible for virtual machines (VMs) in +the cluster. Therefore, its temporary unavailability does not significantly +affect VM workloads. ## Diagnosis @@ -47,9 +56,12 @@ The `virt-operator` is not directly responsible for virtual machines (VMs) in th ## Mitigation -Based on the information obtained during Diagnosis, try to find the cause of the issue and resolve it. +Based on the information obtained during Diagnosis, try to find the cause of the +issue and resolve it. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/LowVirtAPICount.md b/docs/runbooks/LowVirtAPICount.md index f3a5ef59..057b56c2 100644 --- a/docs/runbooks/LowVirtAPICount.md +++ b/docs/runbooks/LowVirtAPICount.md @@ -3,11 +3,13 @@ ## Meaning -This alert fires when only one available `virt-api` pod is detected during a 60-minute period, although at least two nodes are available for scheduling. +This alert fires when only one available `virt-api` pod is detected during a +60-minute period, although at least two nodes are available for scheduling. ## Impact -An API call outage might occur during node eviction because the `virt-api` pod becomes a single point of failure. +An API call outage might occur during node eviction because the `virt-api` pod +becomes a single point of failure. 
## Diagnosis @@ -39,7 +41,9 @@ An API call outage might occur during node eviction because the `virt-api` pod b Try to identify the root cause and to resolve the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/LowVirtControllersCount.md b/docs/runbooks/LowVirtControllersCount.md index 3ddf1727..a8fb0dbb 100644 --- a/docs/runbooks/LowVirtControllersCount.md +++ b/docs/runbooks/LowVirtControllersCount.md @@ -4,15 +4,22 @@ ## Meaning -This alert fires when a low number of `virt-controller` pods is detected. At least one `virt-controller` pod must be available in order to ensure high availability. The default number of replicas is 2. +This alert fires when a low number of `virt-controller` pods is detected. At +least one `virt-controller` pod must be available in order to ensure high +availability. The default number of replicas is 2. -A `virt-controller` device monitors the custom resource definitions (CRDs) of a virtual machine instance (VMI) and manages the associated pods. The device create pods for VMIs and manages the lifecycle of the pods. The device is critical for cluster-wide virtualization functionality. +A `virt-controller` device monitors the custom resource definitions (CRDs) of a +virtual machine instance (VMI) and manages the associated pods. The device +create pods for VMIs and manages the lifecycle of the pods. The device is +critical for cluster-wide virtualization functionality. ## Impact -The responsiveness of KubeVirt might become negatively affected. For example, certain requests might be missed. +The responsiveness of KubeVirt might become negatively affected. For example, +certain requests might be missed. -In addition, if another `virt-launcher` instance terminates unexpectedly, KubeVirt might become completely unresponsive. +In addition, if another `virt-launcher` instance terminates unexpectedly, +KubeVirt might become completely unresponsive. ## Diagnosis @@ -34,7 +41,8 @@ In addition, if another `virt-launcher` instance terminates unexpectedly, KubeVi $ kubectl -n $NAMESPACE logs ``` -4. Obtain the details of the `virt-launcher` pod to check for status conditions such as unexpected termination or a `NotReady` state. +4. Obtain the details of the `virt-launcher` pod to check for status conditions +such as unexpected termination or a `NotReady` state. ```bash $ kubectl -n $NAMESPACE describe pod/ @@ -46,12 +54,15 @@ This alert can have a variety of causes, including: - Not enough memory on the cluster - Nodes are down -- The API server is overloaded. For example, the scheduler might be under a heavy load and therefore not completely available. +- The API server is overloaded. For example, the scheduler might be under a +heavy load and therefore not completely available. - Networking issues Identify the root cause and fix it, if possible. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/LowVirtOperatorCount.md b/docs/runbooks/LowVirtOperatorCount.md index c5e550e4..1c7c207b 100644 --- a/docs/runbooks/LowVirtOperatorCount.md +++ b/docs/runbooks/LowVirtOperatorCount.md @@ -3,19 +3,27 @@ ## Meaning -This alert fires when only one `virt-operator` pod in a `Ready` state has been running for the last 60 minutes. +This alert fires when only one `virt-operator` pod in a `Ready` state has been +running for the last 60 minutes. -The `virt-operator` is the first Operator to start in a cluster. 
Its primary responsibilities include the following: +The `virt-operator` is the first Operator to start in a cluster. Its primary +responsibilities include the following: - Installing, live-updating, and live-upgrading a cluster -- Monitoring the lifecycle of top-level controllers, such as `virt-controller`, `virt-handler`, `virt-launcher`, and managing their reconciliation -- Certain cluster-wide tasks, such as certificate rotation and infrastructure management +- Monitoring the lifecycle of top-level controllers, such as `virt-controller`, +`virt-handler`, `virt-launcher`, and managing their reconciliation +- Certain cluster-wide tasks, such as certificate rotation and infrastructure +management ## Impact -The `virt-operator` cannot provide high availability (HA) for the deployment. HA requires two or more `virt-operator` pods in a `Ready` state. The default deployment is two pods. +The `virt-operator` cannot provide high availability (HA) for the deployment. HA +requires two or more `virt-operator` pods in a `Ready` state. The default +deployment is two pods. -The `virt-operator` is not directly responsible for virtual machines (VMs) in the cluster. Therefore, its decreased availability does not significantly affect VM workloads. +The `virt-operator` is not directly responsible for virtual machines (VMs) in +the cluster. Therefore, its decreased availability does not significantly affect +VM workloads. ## Diagnosis @@ -45,9 +53,12 @@ The `virt-operator` is not directly responsible for virtual machines (VMs) in th ## Mitigation -Based on the information obtained during Diagnosis, try to find the cause of the issue and resolve it. +Based on the information obtained during Diagnosis, try to find the cause of the +issue and resolve it. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/NetworkAddonsConfigNotReady.md b/docs/runbooks/NetworkAddonsConfigNotReady.md index 58ca3354..c8e71c67 100644 --- a/docs/runbooks/NetworkAddonsConfigNotReady.md +++ b/docs/runbooks/NetworkAddonsConfigNotReady.md @@ -3,9 +3,11 @@ ## Meaning -This alert fires when the `NetworkAddonsConfig` custom resource (CR) of the Cluster Network Addons Operator (CNAO) is not ready. +This alert fires when the `NetworkAddonsConfig` custom resource (CR) of the +Cluster Network Addons Operator (CNAO) is not ready. -CNAO deploys additional networking components on the cluster. This alert indicates that one of the deployed components is not ready. +CNAO deploys additional networking components on the cluster. This alert +indicates that one of the deployed components is not ready. ## Impact @@ -13,7 +15,8 @@ Network functionality is affected. ## Diagnosis -1. Check the status conditions of the `NetworkAddonsConfig` CR to identify the deployment or daemon set that is not ready: +1. Check the status conditions of the `NetworkAddonsConfig` CR to identify the +deployment or daemon set that is not ready: ```bash $ kubectl get networkaddonsconfig -o custom-columns="":.status.conditions[*].message @@ -47,7 +50,9 @@ Network functionality is affected. Try to identify the root cause and resolve the issue. 
- + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/NoLeadingVirtOperator.md b/docs/runbooks/NoLeadingVirtOperator.md index da452854..83d6d3b0 100644 --- a/docs/runbooks/NoLeadingVirtOperator.md +++ b/docs/runbooks/NoLeadingVirtOperator.md @@ -3,21 +3,29 @@ ## Meaning -This alert fires when no `virt-operator` pod with a leader lease has been detected for 10 minutes, although the `virt-operator` pods are in a `Ready` state. The alert indicates that no leader pod is available. +This alert fires when no `virt-operator` pod with a leader lease has been +detected for 10 minutes, although the `virt-operator` pods are in a `Ready` +state. The alert indicates that no leader pod is available. -The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: +The `virt-operator` is the first Operator to start in a cluster. Its primary +responsibilities include the following: - Installing, live updating, and live upgrading a cluster -- Monitoring the lifecycle of top-level controllers, such as `virt-controller`, `virt-handler`, `virt-launcher`, and managing their reconciliation +- Monitoring the lifecycle of top-level controllers, such as `virt-controller`, +`virt-handler`, `virt-launcher`, and managing their reconciliation -- Certain cluster-wide tasks, such as certificate rotation and infrastructure management +- Certain cluster-wide tasks, such as certificate rotation and infrastructure +management -The `virt-operator` deployment has a default replica of 2 pods, with one pod holding a leader lease. +The `virt-operator` deployment has a default replica of 2 pods, with one pod +holding a leader lease. ## Impact -This alert indicates a failure at the level of the cluster. As a result, critical cluster-wide management functionalities, such as certification rotation, upgrade, and reconciliation of controllers, might not be available. +This alert indicates a failure at the level of the cluster. As a result, +critical cluster-wide management functionalities, such as certification +rotation, upgrade, and reconciliation of controllers, might not be available. ## Diagnosis @@ -63,9 +71,12 @@ This alert indicates a failure at the level of the cluster. As a result, critica ## Mitigation -Based on the information obtained during Diagnosis, try to find and resolve the cause of the issue. +Based on the information obtained during Diagnosis, try to find and resolve the +cause of the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/NoReadyVirtController.md b/docs/runbooks/NoReadyVirtController.md index fd85aec7..bcb65482 100644 --- a/docs/runbooks/NoReadyVirtController.md +++ b/docs/runbooks/NoReadyVirtController.md @@ -3,14 +3,19 @@ ## Meaning -This alert fires when no available `virt-controller` devices have been detected for 5 minutes. +This alert fires when no available `virt-controller` devices have been detected +for 5 minutes. -The `virt-controller` devices monitor the custom resource definitions of virtual machine instances (VMIs) and manage the associated pods. The devices create pods for VMIs and manage the lifecycle of the pods. +The `virt-controller` devices monitor the custom resource definitions of virtual +machine instances (VMIs) and manage the associated pods. The devices create pods +for VMIs and manage the lifecycle of the pods. -Therefore, `virt-controller` devices are critical for all cluster-wide virtualization functionality. 
+Therefore, `virt-controller` devices are critical for all cluster-wide +virtualization functionality. ## Impact -Any actions related to VM lifecycle management fail. This notably includes launching a new VMI or shutting down an existing VMI. +Any actions related to VM lifecycle management fail. This notably includes +launching a new VMI or shutting down an existing VMI. ## Diagnosis @@ -32,7 +37,8 @@ Any actions related to VM lifecycle management fail. This notably includes launc $ kubectl -n $NAMESPACE get deploy virt-controller -o yaml ``` -4. Obtain the details of the `virt-controller` deployment to check for status conditions such as crashing pods or failure to pull images: +4. Obtain the details of the `virt-controller` deployment to check for status +conditions such as crashing pods or failure to pull images: ```bash $ kubectl -n $NAMESPACE describe deploy virt-controller @@ -58,13 +64,15 @@ Any actions related to VM lifecycle management fail. This notably includes launc ## Mitigation -Based on the information obtained during Diagnosis, try to find and resolve the cause of the issue. +Based on the information obtained during Diagnosis, try to find and resolve the +cause of the issue. - + If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - diff --git a/docs/runbooks/NoReadyVirtOperator.md b/docs/runbooks/NoReadyVirtOperator.md index b3438dab..ad4bdc50 100644 --- a/docs/runbooks/NoReadyVirtOperator.md +++ b/docs/runbooks/NoReadyVirtOperator.md @@ -3,21 +3,29 @@ ## Meaning -This alert fires when no `virt-operator` pod in a `Ready` state has been detected for 10 minutes. +This alert fires when no `virt-operator` pod in a `Ready` state has been +detected for 10 minutes. -The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: +The `virt-operator` is the first Operator to start in a cluster. Its primary +responsibilities include the following: - Installing, live-updating, and live-upgrading a cluster -- Monitoring the life cycle of top-level controllers, such as `virt-controller`, `virt-handler`, `virt-launcher`, and managing their reconciliation -- Certain cluster-wide tasks, such as certificate rotation and infrastructure management +- Monitoring the life cycle of top-level controllers, such as `virt-controller`, +`virt-handler`, `virt-launcher`, and managing their reconciliation +- Certain cluster-wide tasks, such as certificate rotation and infrastructure +management The default deployment is two `virt-operator` pods. ## Impact -This alert indicates a cluster-level failure. Critical cluster management functionalities, such as certification rotation, upgrade, and reconciliation of controllers, might not be not available. +This alert indicates a cluster-level failure. Critical cluster management +functionalities, such as certificate rotation, upgrade, and reconciliation of +controllers, might not be available. -The `virt-operator` is not directly responsible for virtual machines in the cluster. Therefore, its temporary unavailability does not significantly affect custom workloads. +The `virt-operator` is not directly responsible for virtual machines in the +cluster. Therefore, its temporary unavailability does not significantly affect +custom workloads. 
## Diagnosis @@ -47,9 +55,12 @@ The `virt-operator` is not directly responsible for virtual machines in the clus ## Mitigation -Based on the information obtained during Diagnosis, try to find and resolve the cause of the issue. +Based on the information obtained during Diagnosis, try to find and resolve the +cause of the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/OrphanedVirtualMachineInstances.md b/docs/runbooks/OrphanedVirtualMachineInstances.md index d371d306..3deba68d 100644 --- a/docs/runbooks/OrphanedVirtualMachineInstances.md +++ b/docs/runbooks/OrphanedVirtualMachineInstances.md @@ -3,7 +3,9 @@ ## Meaning -This alert fires when a virtual machine instance (VMI), or `virt-launcher` pod, runs on a node that does not have a running `virt-handler` pod. Such a VMI is called _orphaned_. +This alert fires when a virtual machine instance (VMI), or `virt-launcher` pod, +runs on a node that does not have a running `virt-handler` pod. Such a VMI is +called _orphaned_. ## Impact @@ -11,13 +13,15 @@ Orphaned VMIs cannot be managed. ## Diagnosis -1. Check the status of the `virt-handler` pods to view the nodes on which they are running: +1. Check the status of the `virt-handler` pods to view the nodes on which they +are running: ```bash $ kubectl get pods --all-namespaces -o wide -l kubevirt.io=virt-handler ``` -2. Check the status of the VMIs to identify VMIs running on nodes that do not have a running `virt-handler` pod: +2. Check the status of the VMIs to identify VMIs running on nodes that do not +have a running `virt-handler` pod: ```bash $ kubectl get vmis --all-namespaces @@ -36,9 +40,11 @@ Orphaned VMIs cannot be managed. virt-handler 2 2 2 2 2 kubernetes.io/os=linux 4h ``` - The daemon set is considered healthy if the `Desired`, `Ready`, and `Available` columns contain the same value. + The daemon set is considered healthy if the `Desired`, `Ready`, and + `Available` columns contain the same value. -4. If the `virt-handler` daemon set is not healthy, check the `virt-handler` daemon set for pod deployment issues: +4. If the `virt-handler` daemon set is not healthy, check the `virt-handler` +daemon set for pod deployment issues: ```bash $ kubectl get daemonset virt-handler --all-namespaces -o yaml | jq .status @@ -50,7 +56,8 @@ Orphaned VMIs cannot be managed. $ kubectl get nodes ``` -6. Check the `spec.workloads` stanza of the `KubeVirt` custom resource (CR) for a workloads placement policy: +6. Check the `spec.workloads` stanza of the `KubeVirt` custom resource (CR) for +a workloads placement policy: ```bash $ kubectl get kubevirt kubevirt --all-namespaces -o yaml @@ -58,15 +65,20 @@ Orphaned VMIs cannot be managed. ## Mitigation -If a workloads placement policy is configured, add the node with the VMI to the policy. +If a workloads placement policy is configured, add the node with the VMI to the +policy. -Possible causes for the removal of a `virt-handler` pod from a node include changes to the node's taints and tolerations or to a pod's scheduling rules. +Possible causes for the removal of a `virt-handler` pod from a node include +changes to the node's taints and tolerations or to a pod's scheduling rules. Try to identify the root cause and resolve the issue. - + -See [How Daemon Pods are scheduled](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#how-daemon-pods-are-scheduled) for more information. 
+See [How Daemon Pods are scheduled](https://kubernetes.io/docs/concepts/workloads/controllers/daemonset/#how-daemon-pods-are-scheduled) +for more information. If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/OutdatedVirtualMachineInstanceWorkloads.md b/docs/runbooks/OutdatedVirtualMachineInstanceWorkloads.md index 30e65e26..267a8069 100644 --- a/docs/runbooks/OutdatedVirtualMachineInstanceWorkloads.md +++ b/docs/runbooks/OutdatedVirtualMachineInstanceWorkloads.md @@ -3,13 +3,16 @@ ## Meaning -This alert fires when running virtual machine instances (VMIs) in outdated `virt-launcher` pods are detected 24 hours after the KubeVirt control plane has been updated. +This alert fires when running virtual machine instances (VMIs) in outdated +`virt-launcher` pods are detected 24 hours after the KubeVirt control plane has +been updated. ## Impact Outdated VMIs might not have access to new KubeVirt features. -Outdated VMIs will not receive the security fixes associated with the `virt-launcher` pod update. +Outdated VMIs will not receive the security fixes associated with the +`virt-launcher` pod update. ## Diagnosis @@ -19,7 +22,8 @@ Outdated VMIs will not receive the security fixes associated with the `virt-laun $ kubectl get vmi -l kubevirt.io/outdatedLauncherImage --all-namespaces ``` -2. Check the `KubeVirt` custom resource (CR) to determine whether `workloadUpdateMethods` is configured in the `workloadUpdateStrategy` stanza: +2. Check the `KubeVirt` custom resource (CR) to determine whether +`workloadUpdateMethods` is configured in the `workloadUpdateStrategy` stanza: ```bash $ kubectl get kubevirt --all-namespaces -o yaml @@ -55,26 +59,34 @@ Outdated VMIs will not receive the security fixes associated with the `virt-laun Update the `KubeVirt` CR to enable automatic workload updates. -See [Updating KubeVirt Workloads](https://kubevirt.io/user-guide/operations/updating_and_deletion/#updating-kubevirt-workloads) for more information. +See [Updating KubeVirt Workloads](https://kubevirt.io/user-guide/operations/updating_and_deletion/#updating-kubevirt-workloads) +for more information. ### Stopping a VM associated with a non-live-migratable VMI -If a VMI is not live-migratable and if `runStrategy: always` is set in the corresponding `VirtualMachine` object, you can update the VMI by manually stopping the virtual machine (VM): +If a VMI is not live-migratable and if `runStrategy: always` is set in the +corresponding `VirtualMachine` object, you can update the VMI by manually +stopping the virtual machine (VM): ```bash $ virctl stop --namespace ``` -A new VMI spins up immediately in an updated `virt-launcher` pod to replace the stopped VMI. This is the equivalent of a restart action. +A new VMI spins up immediately in an updated `virt-launcher` pod to replace the +stopped VMI. This is the equivalent of a restart action. -Note: Manually stopping a _live-migratable_ VM is destructive and not recommended because it interrupts the workload. +Note: Manually stopping a _live-migratable_ VM is destructive and not +recommended because it interrupts the workload. ### Migrating a live-migratable VMI -If a VMI is live-migratable, you can update it by creating a `VirtualMachineInstanceMigration` object that targets a specific running VMI. The VMI is migrated into an updated `virt-launcher` pod. +If a VMI is live-migratable, you can update it by creating a +`VirtualMachineInstanceMigration` object that targets a specific running VMI. 
+The VMI is migrated into an updated `virt-launcher` pod. -1. Create a `VirtualMachineInstanceMigration` manifest and save it as `migration.yaml`: +1. Create a `VirtualMachineInstanceMigration` manifest and save it as +`migration.yaml`: ```yaml apiVersion: kubevirt.io/v1 @@ -92,7 +104,9 @@ If a VMI is live-migratable, you can update it by creating a `VirtualMachineInst $ kubectl create -f migration.yaml ``` - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/SSPCommonTemplatesModificationReverted.md b/docs/runbooks/SSPCommonTemplatesModificationReverted.md index b0135403..68b347c7 100644 --- a/docs/runbooks/SSPCommonTemplatesModificationReverted.md +++ b/docs/runbooks/SSPCommonTemplatesModificationReverted.md @@ -3,9 +3,12 @@ ## Meaning -This alert fires when the Scheduling, Scale, and Performance (SSP) Operator reverts changes to common templates as part of its reconciliation procedure. +This alert fires when the Scheduling, Scale, and Performance (SSP) Operator +reverts changes to common templates as part of its reconciliation procedure. -The SSP Operator deploys and reconciles the common templates and the Template Validator. If a user or script changes a common template, the changes are reverted by the SSP Operator. +The SSP Operator deploys and reconciles the common templates and the Template +Validator. If a user or script changes a common template, the changes are +reverted by the SSP Operator. ## Impact @@ -29,9 +32,11 @@ Changes to common templates are overwritten. Try to identify and resolve the cause of the changes. -Ensure that changes are made only to copies of templates, and not to the templates themselves. +Ensure that changes are made only to copies of templates, and not to the +templates themselves. -See the [documentation](https://kubevirt.io/user-guide/virtual_machines/templates) for details. +See the [documentation](https://kubevirt.io/user-guide/virtual_machines/templates) +for details. diff --git a/docs/runbooks/SSPDown.md b/docs/runbooks/SSPDown.md index a7bc5657..f5039454 100644 --- a/docs/runbooks/SSPDown.md +++ b/docs/runbooks/SSPDown.md @@ -2,13 +2,17 @@ ## Meaning -This alert fires when all the Scheduling, Scale and Performance (SSP) Operator pods are down. +This alert fires when all the Scheduling, Scale and Performance (SSP) Operator +pods are down. -The SSP Operator is responsible for deploying and reconciling the common templates and the Template Validator. +The SSP Operator is responsible for deploying and reconciling the common +templates and the Template Validator. ## Impact -Dependent components might not be deployed. Changes in the components might not be reconciled. As a result, the common templates and/or the Template Validator might not be updated or reset if they fail. +Dependent components might not be deployed. Changes in the components might not +be reconciled. As a result, the common templates and/or the Template Validator +might not be updated or reset if they fail. ## Diagnosis @@ -39,11 +43,12 @@ Dependent components might not be deployed. Changes in the components might not ## Mitigation Try to identify the root cause and resolve the issue. 
- + If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - diff --git a/docs/runbooks/SSPFailingToReconcile.md b/docs/runbooks/SSPFailingToReconcile.md index fc2ce308..acd18f91 100644 --- a/docs/runbooks/SSPFailingToReconcile.md +++ b/docs/runbooks/SSPFailingToReconcile.md @@ -3,13 +3,17 @@ ## Meaning -This alert fires when the reconcile cycle of the Scheduling, Scale and Performance (SSP) Operator fails repeatedly, although the SSP Operator is running. +This alert fires when the reconcile cycle of the Scheduling, Scale and +Performance (SSP) Operator fails repeatedly, although the SSP Operator is running. -The SSP Operator is responsible for deploying and reconciling the common templates and the Template Validator. +The SSP Operator is responsible for deploying and reconciling the common +templates and the Template Validator. ## Impact -Dependent components might not be deployed. Changes in the components might not be reconciled. As a result, the common templates and/or the Template Validator might not be updated or reset if they fail. +Dependent components might not be deployed. Changes in the components might not +be reconciled. As a result, the common templates and/or the Template Validator +might not be updated or reset if they fail. ## Diagnosis @@ -52,7 +56,9 @@ Dependent components might not be deployed. Changes in the components might not ## Mitigation Try to identify the root cause and resolve the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/SSPHighRateRejectedVms.md b/docs/runbooks/SSPHighRateRejectedVms.md index 2af9dee2..594f7215 100644 --- a/docs/runbooks/SSPHighRateRejectedVms.md +++ b/docs/runbooks/SSPHighRateRejectedVms.md @@ -3,11 +3,13 @@ ## Meaning -This alert fires when a user or script attempts to create or modify a large number of virtual machines (VMs), using an invalid configuration. +This alert fires when a user or script attempts to create or modify a large +number of virtual machines (VMs), using an invalid configuration. ## Impact -The VMs are not created or modified. As a result, the environment might not behave as expected. +The VMs are not created or modified. As a result, the environment might not +behave as expected. ## Diagnosis @@ -17,7 +19,8 @@ The VMs are not created or modified. As a result, the environment might not beha $ export NAMESPACE="$(kubectl get deployment -A | grep ssp-operator | awk '{print $1}')" ``` -2. Check the `virt-template-validator` logs for errors that might indicate the cause: +2. Check the `virt-template-validator` logs for errors that might indicate the +cause: ```bash $ kubectl -n $NAMESPACE logs --tail=-1 -l name=virt-template-validator @@ -35,7 +38,9 @@ The VMs are not created or modified. As a result, the environment might not beha ## Mitigation Try to identify the root cause and resolve the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/SSPOperatorDown.md b/docs/runbooks/SSPOperatorDown.md index bb0488eb..15ba3cb9 100644 --- a/docs/runbooks/SSPOperatorDown.md +++ b/docs/runbooks/SSPOperatorDown.md @@ -3,13 +3,17 @@ ## Meaning -This alert fires when all the Scheduling, Scale and Performance (SSP) Operator pods are down. +This alert fires when all the Scheduling, Scale and Performance (SSP) Operator +pods are down. 
-The SSP Operator is responsible for deploying and reconciling the common templates and the Template Validator. +The SSP Operator is responsible for deploying and reconciling the common +templates and the Template Validator. ## Impact -Dependent components might not be deployed. Changes in the components might not be reconciled. As a result, the common templates and/or the Template Validator might not be updated or reset if they fail. +Dependent components might not be deployed. Changes in the components might not +be reconciled. As a result, the common templates and/or the Template Validator +might not be updated or reset if they fail. ## Diagnosis @@ -40,7 +44,9 @@ Dependent components might not be deployed. Changes in the components might not ## Mitigation Try to identify the root cause and resolve the issue. - + If you cannot resolve the issue, see the following resources: @@ -48,5 +54,5 @@ If you cannot resolve the issue, see the following resources: - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) -**Note:** Starting from 4.14, this runbook will no longer be supported. For a supported runbook, please see [SSPDown Runbook](http://kubevirt.io/monitoring/runbooks/SSPDown.html). - +**Note:** Starting from 4.14, this runbook will no longer be supported. For a +supported runbook, please see [SSPDown Runbook](http://kubevirt.io/monitoring/runbooks/SSPDown.html). diff --git a/docs/runbooks/SSPTemplateValidatorDown.md b/docs/runbooks/SSPTemplateValidatorDown.md index 013a0eea..7aa83bd2 100644 --- a/docs/runbooks/SSPTemplateValidatorDown.md +++ b/docs/runbooks/SSPTemplateValidatorDown.md @@ -5,11 +5,13 @@ This alert fires when all the Template Validator pods are down. -The Template Validator checks virtual machines (VMs) to ensure that they do not violate their templates. +The Template Validator checks virtual machines (VMs) to ensure that they do not +violate their templates. ## Impact -VMs are not validated against their templates. As a result, VMs might be created with specifications that do not match their respective workloads. +VMs are not validated against their templates. As a result, VMs might be created +with specifications that do not match their respective workloads. ## Diagnosis @@ -40,7 +42,9 @@ VMs are not validated against their templates. As a result, VMs might be created ## Mitigation Try to identify the root cause and resolve the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/SingleStackIPv6Unsupported.md b/docs/runbooks/SingleStackIPv6Unsupported.md index 5bc2cca3..f36ce17f 100644 --- a/docs/runbooks/SingleStackIPv6Unsupported.md +++ b/docs/runbooks/SingleStackIPv6Unsupported.md @@ -3,15 +3,17 @@ ## Meaning -This alert fires when user tries to install KubeVirt Hyperconverged on a single stack IPv6 cluster. +This alert fires when a user tries to install KubeVirt Hyperconverged on a single +stack IPv6 cluster. -KubeVirt Hyperconverged is not yet supported on an OpenShift cluster configured with single stack IPv6. It's -progress is being tracked on [this issue](https://issues.redhat.com/browse/CNV-28924). +KubeVirt Hyperconverged is not yet supported on an OpenShift cluster configured +with single stack IPv6. Its progress is being tracked in [this issue](https://issues.redhat.com/browse/CNV-28924). ## Impact -KubeVirt Hyperconverged Operator can't be installed on a single stack IPv6 cluster, and hence creation virtual -machines on top of such a cluster is not possible. 
+KubeVirt Hyperconverged Operator cannot be installed on a single stack IPv6 +cluster, and hence creating virtual machines on top of such a cluster is not +possible. ## Diagnosis @@ -24,5 +26,5 @@ machines on top of such a cluster is not possible. ## Mitigation -It is recommended to use single stack IPv4 or a dual stack IPv4/IPv6 networking to use KubeVirt Hyperconverged. -Refer the [documentation](https://docs.openshift.com/container-platform/latest/networking/ovn_kubernetes_network_provider/converting-to-dual-stack.html). +It is recommended to use single stack IPv4 or dual stack IPv4/IPv6 networking +with KubeVirt Hyperconverged. Refer to the [documentation](https://docs.openshift.com/container-platform/latest/networking/ovn_kubernetes_network_provider/converting-to-dual-stack.html). diff --git a/docs/runbooks/UnsupportedHCOModification.md b/docs/runbooks/UnsupportedHCOModification.md index 7983bf96..4ab29278 100644 --- a/docs/runbooks/UnsupportedHCOModification.md +++ b/docs/runbooks/UnsupportedHCOModification.md @@ -3,25 +3,34 @@ ## Meaning -This alert fires when a JSON Patch annotation is used to change an operand of the HyperConverged Cluster Operator (HCO). +This alert fires when a JSON Patch annotation is used to change an operand of +the HyperConverged Cluster Operator (HCO). -HCO configures KubeVirt and its supporting operators in an opinionated way and overwrites its operands when there is an unexpected change to them. Users must not modify the operands directly. +HCO configures KubeVirt and its supporting operators in an opinionated way and +overwrites its operands when there is an unexpected change to them. Users must +not modify the operands directly. -However, if a change is required and it is not supported by the HCO API, you can force HCO to set a change in an operator by using JSON Patch annotations. These changes are not reverted by HCO during its reconciliation process. +However, if a change is required and it is not supported by the HCO API, you can +force HCO to set a change in an operator by using JSON Patch annotations. These +changes are not reverted by HCO during its reconciliation process. -See the [KubeVirt documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#jsonpatch-annotations) for details. +See the [KubeVirt documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#jsonpatch-annotations) +for details. ## Impact -Incorrect use of JSON Patch annotations might lead to unexpected results or an unstable environment. +Incorrect use of JSON Patch annotations might lead to unexpected results or an +unstable environment. -Upgrading a system with JSON Patch annotations is dangerous because the structure of the component custom resources might change. +Upgrading a system with JSON Patch annotations is dangerous because the +structure of the component custom resources might change. ## Diagnosis -Check the `annotation_name` in the alert details to identify the JSON Patch annotation: +Check the `annotation_name` in the alert details to identify the JSON Patch +annotation: ```text Labels @@ -32,10 +41,12 @@ Labels ## Mitigation -It is best to use the HCO API to change an operand. However, if the change can only be done with a JSON Patch annotation, proceed with caution. +It is best to use the HCO API to change an operand. However, if the change can +only be done with a JSON Patch annotation, proceed with caution.
Remove JSON Patch annotations before upgrade to avoid potential issues. -If the JSON Patch annotation is generic and useful, you can submit an RFE to add the modification to the API by filing a [bug](https://bugzilla.redhat.com/). +If the JSON Patch annotation is generic and useful, you can submit an RFE to add +the modification to the API by filing a [bug](https://bugzilla.redhat.com/). diff --git a/docs/runbooks/VMCannotBeEvicted.md b/docs/runbooks/VMCannotBeEvicted.md index 3c3902db..58a8a229 100644 --- a/docs/runbooks/VMCannotBeEvicted.md +++ b/docs/runbooks/VMCannotBeEvicted.md @@ -3,21 +3,25 @@ ## Meaning -This alert fires when the eviction strategy of a virtual machine (VM) is set to `LiveMigration` but the VM is not migratable. +This alert fires when the eviction strategy of a virtual machine (VM) is set to +`LiveMigration` but the VM is not migratable. ## Impact -Non-migratable VMs prevent node eviction. This condition affects operations such as node drain and updates. +Non-migratable VMs prevent node eviction. This condition affects operations such +as node drain and updates. ## Diagnosis -1. Check the VMI configuration to determine whether the value of `evictionStrategy` is `LiveMigrate` of the VMI: +1. Check the VMI configuration to determine whether the value of +`evictionStrategy` is `LiveMigrate` of the VMI: ```bash $ kubectl get vmis -o yaml ``` -2. Check for a `False` status in the `LIVE-MIGRATABLE` column to identify VMIs that are not migratable: +2. Check for a `False` status in the `LIVE-MIGRATABLE` column to identify VMIs +that are not migratable: ```bash $ kubectl get vmis -o wide @@ -44,4 +48,5 @@ Non-migratable VMs prevent node eviction. This condition affects operations such ## Mitigation -Set the `evictionStrategy` of the VMI to `shutdown` or resolve the issue that prevents the VMI from migrating. +Set the `evictionStrategy` of the VMI to `shutdown` or resolve the issue that +prevents the VMI from migrating. diff --git a/docs/runbooks/VirtAPIDown.md b/docs/runbooks/VirtAPIDown.md index cd7adf9f..7c1dc455 100644 --- a/docs/runbooks/VirtAPIDown.md +++ b/docs/runbooks/VirtAPIDown.md @@ -29,7 +29,8 @@ KubeVirt objects cannot send API calls. $ kubectl -n $NAMESPACE get deploy virt-api -o yaml ``` -4. Check the `virt-api` deployment details for issues such as crashing pods or image pull failures: +4. Check the `virt-api` deployment details for issues such as crashing pods or +image pull failures: ```bash $ kubectl -n $NAMESPACE describe deploy virt-api @@ -44,7 +45,9 @@ KubeVirt objects cannot send API calls. ## Mitigation Try to identify the root cause and resolve the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/VirtApiRESTErrorsBurst.md b/docs/runbooks/VirtApiRESTErrorsBurst.md index 792e8c8e..568741e5 100644 --- a/docs/runbooks/VirtApiRESTErrorsBurst.md +++ b/docs/runbooks/VirtApiRESTErrorsBurst.md @@ -2,13 +2,17 @@ ## Meaning -For the last 10 minutes or longer, over 80% of the REST calls made to `virt-api` pods have failed. +For the last 10 minutes or longer, over 80% of the REST calls made to `virt-api` +pods have failed. ## Impact -A very high rate of failed REST calls to `virt-api` might lead to slow response and execution of API calls, and potentially to API calls being completely dismissed. +A very high rate of failed REST calls to `virt-api` might lead to slow response +and execution of API calls, and potentially to API calls being completely +dismissed. 
-However, currently running virtual machine workloads are not likely to be affected. +However, currently running virtual machine workloads are not likely to be +affected. ## Diagnosis @@ -36,7 +40,8 @@ However, currently running virtual machine workloads are not likely to be affect $ kubectl describe -n $NAMESPACE ``` -5. Check if any problems occurred with the nodes. For example, they might be in a `NotReady` state: +5. Check if any problems occurred with the nodes. For example, they might be in +a `NotReady` state: ```bash $ kubectl get nodes @@ -56,9 +61,12 @@ However, currently running virtual machine workloads are not likely to be affect ## Mitigation -Based on the information obtained during Diagnosis, try to identify the root cause and resolve the issue. +Based on the information obtained during Diagnosis, try to identify the root +cause and resolve the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/VirtApiRESTErrorsHigh.md b/docs/runbooks/VirtApiRESTErrorsHigh.md index 0939fb03..27061482 100644 --- a/docs/runbooks/VirtApiRESTErrorsHigh.md +++ b/docs/runbooks/VirtApiRESTErrorsHigh.md @@ -3,13 +3,16 @@ ## Meaning -More than 5% of REST calls have failed in the `virt-api` pods in the last 60 minutes. +More than 5% of REST calls have failed in the `virt-api` pods in the last 60 +minutes. ## Impact -A high rate of failed REST calls to `virt-api` might lead to slow response and execution of API calls. +A high rate of failed REST calls to `virt-api` might lead to slow response and +execution of API calls. -However, currently running virtual machine workloads are not likely to be affected. +However, currently running virtual machine workloads are not likely to be +affected. ## Diagnosis @@ -37,7 +40,8 @@ However, currently running virtual machine workloads are not likely to be affect $ kubectl describe -n $NAMESPACE ``` -5. Check if any problems occurred with the nodes. For example, they might be in a `NotReady` state: +5. Check if any problems occurred with the nodes. For example, they might be in +a `NotReady` state: ```bash $ kubectl get nodes @@ -57,9 +61,12 @@ However, currently running virtual machine workloads are not likely to be affect ## Mitigation -Based on the information obtained during Diagnosis, try to identify the root cause and resolve the issue. +Based on the information obtained during Diagnosis, try to identify the root +cause and resolve the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/VirtControllerDown.md b/docs/runbooks/VirtControllerDown.md index 8dc298cd..30c6f965 100644 --- a/docs/runbooks/VirtControllerDown.md +++ b/docs/runbooks/VirtControllerDown.md @@ -6,7 +6,9 @@ No running `virt-controller` pod has been detected for 5 minutes. ## Impact -Any actions related to virtual machine (VM) lifecycle management fail. This notably includes launching a new virtual machine instance (VMI) or shutting down an existing VMI. +Any actions related to virtual machine (VM) lifecycle management fail. This +notably includes launching a new virtual machine instance (VMI) or shutting down +an existing VMI. ## Diagnosis @@ -35,15 +37,18 @@ This alert can have a variety of causes, including the following: - Node resource exhaustion - Not enough memory on the cluster - Nodes are down -- The API server is overloaded. For example, the scheduler might be under a heavy load and therefore not completely available. +- The API server is overloaded. 
For example, the scheduler might be under a +heavy load and therefore not completely available. - Networking issues Identify the root cause and fix it, if possible. - + If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - \ No newline at end of file + diff --git a/docs/runbooks/VirtControllerRESTErrorsBurst.md b/docs/runbooks/VirtControllerRESTErrorsBurst.md index acbd1f10..32f2d65e 100644 --- a/docs/runbooks/VirtControllerRESTErrorsBurst.md +++ b/docs/runbooks/VirtControllerRESTErrorsBurst.md @@ -2,19 +2,24 @@ ## Meaning -For the last 10 minutes or longer, over 80% of the REST calls made to `virt-controller` pods have failed. +For the last 10 minutes or longer, over 80% of the REST calls made to +`virt-controller` pods have failed. The `virt-controller` has likely fully lost the connection to the API server. This error is frequently caused by one of the following problems: -- The API server is overloaded, which causes timeouts. To verify if this is the case, check the metrics of the API server, and view its response times and overall calls. +- The API server is overloaded, which causes timeouts. To verify if this is the +case, check the metrics of the API server, and view its response times and +overall calls. -- The `virt-controller` pod cannot reach the API server. This is commonly caused by DNS issues on the node and networking connectivity issues. +- The `virt-controller` pod cannot reach the API server. This is commonly caused +by DNS issues on the node and networking connectivity issues. ## Impact -Status updates are not propagated and actions like migrations cannot take place. However, running workloads are not impacted. +Status updates are not propagated and actions like migrations cannot take place. +However, running workloads are not impacted. ## Diagnosis @@ -30,7 +35,8 @@ Status updates are not propagated and actions like migrations cannot take place. $ kubectl get pods -n $NAMESPACE -l=kubevirt.io=virt-controller ``` -3. Check the `virt-controller` logs for error messages when connecting to the API server: +3. Check the `virt-controller` logs for error messages when connecting to the +API server: ```bash $ kubectl logs -n $NAMESPACE @@ -38,13 +44,16 @@ Status updates are not propagated and actions like migrations cannot take place. ## Mitigation -If the `virt-controller` pod cannot connect to the API server, delete the pod to force a restart: +If the `virt-controller` pod cannot connect to the API server, delete the pod to +force a restart: ```bash $ kubectl delete -n $NAMESPACE ``` - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/VirtControllerRESTErrorsHigh.md b/docs/runbooks/VirtControllerRESTErrorsHigh.md index b17b6c97..68aa8c09 100644 --- a/docs/runbooks/VirtControllerRESTErrorsHigh.md +++ b/docs/runbooks/VirtControllerRESTErrorsHigh.md @@ -5,17 +5,23 @@ More than 5% of REST calls failed in `virt-controller` in the last 60 minutes. -This is most likely because `virt-controller` has partially lost connection to the API server. +This is most likely because `virt-controller` has partially lost connection to +the API server. This error is frequently caused by one of the following problems: -- The API server is overloaded, which causes timeouts. To verify if this is the case, check the metrics of the API server, and view its response times and overall calls. 
+- The API server is overloaded, which causes timeouts. To verify if this is the +case, check the metrics of the API server, and view its response times and +overall calls. -- The `virt-controller` pod cannot reach the API server. This is commonly caused by DNS issues on the node and networking connectivity issues. +- The `virt-controller` pod cannot reach the API server. This is commonly caused +by DNS issues on the node and networking connectivity issues. ## Impact -Node-related actions, such as starting and migrating, and scheduling virtual machines, are delayed. Running workloads are not affected, but reporting their current status might be delayed. +Node-related actions, such as starting and migrating, and scheduling virtual +machines, are delayed. Running workloads are not affected, but reporting their +current status might be delayed. ## Diagnosis @@ -31,7 +37,8 @@ Node-related actions, such as starting and migrating, and scheduling virtual mac $ kubectl get pods -n $NAMESPACE -l=kubevirt.io=virt-controller ``` -3. Check the `virt-controller` logs for error messages when connecting to the API server: +3. Check the `virt-controller` logs for error messages when connecting to the +API server: ```bash $ kubectl logs -n $NAMESPACE @@ -39,13 +46,16 @@ Node-related actions, such as starting and migrating, and scheduling virtual mac ## Mitigation -If the `virt-controller` pod cannot connect to the API server, delete the pod to force a restart: +If the `virt-controller` pod cannot connect to the API server, delete the pod to +force a restart: ```bash $ kubectl delete -n $NAMESPACE ``` - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/VirtHandlerDaemonSetRolloutFailing.md b/docs/runbooks/VirtHandlerDaemonSetRolloutFailing.md index dacc034e..cfbc2d82 100644 --- a/docs/runbooks/VirtHandlerDaemonSetRolloutFailing.md +++ b/docs/runbooks/VirtHandlerDaemonSetRolloutFailing.md @@ -3,11 +3,14 @@ ## Meaning -The `virt-handler` daemon set has failed to deploy on one or more worker nodes after 15 minutes. +The `virt-handler` daemon set has failed to deploy on one or more worker nodes +after 15 minutes. ## Impact -This alert is a warning. It does not indicate that all `virt-handler` daemon sets have failed to deploy. Therefore, the normal lifecycle of virtual machines is not affected unless the cluster is overloaded. +This alert is a warning. It does not indicate that all `virt-handler` daemon +sets have failed to deploy. Therefore, the normal lifecycle of virtual machines +is not affected unless the cluster is overloaded. ## Diagnosis @@ -19,7 +22,8 @@ Identify worker nodes that do not have a running `virt-handler` pod: $ export NAMESPACE="$(kubectl get kubevirt -A -o custom-columns="":.metadata.namespace)" ``` -2. Check the status of the `virt-handler` pods to identify pods that have not deployed: +2. Check the status of the `virt-handler` pods to identify pods that have not +deployed: ```bash $ kubectl get pods -n $NAMESPACE -l=kubevirt.io=virt-handler @@ -33,4 +37,5 @@ Identify worker nodes that do not have a running `virt-handler` pod: ## Mitigation -If the `virt-handler` pods failed to deploy because of insufficient resources, you can delete other pods on the affected worker node. +If the `virt-handler` pods failed to deploy because of insufficient resources, +you can delete other pods on the affected worker node. 
diff --git a/docs/runbooks/VirtHandlerRESTErrorsBurst.md b/docs/runbooks/VirtHandlerRESTErrorsBurst.md index 81fa2e37..dfc2c6bc 100644 --- a/docs/runbooks/VirtHandlerRESTErrorsBurst.md +++ b/docs/runbooks/VirtHandlerRESTErrorsBurst.md @@ -2,19 +2,25 @@ ## Meaning -For the last 10 minutes or longer, over 80% of the REST calls made to `virt-handler` pods have failed. +For the last 10 minutes or longer, over 80% of the REST calls made to +`virt-handler` pods have failed. -This alert usually indicates that the `virt-handler` pods cannot connect to the API server. +This alert usually indicates that the `virt-handler` pods cannot connect to the +API server. This error is frequently caused by one of the following problems: -- The API server is overloaded, which causes timeouts. To verify if this is the case, check the metrics of the API server, and view its response times and overall calls. +- The API server is overloaded, which causes timeouts. To verify if this is the +case, check the metrics of the API server, and view its response times and +overall calls. -- The `virt-handler` pod cannot reach the API server. This is commonly caused by DNS issues on the node and networking connectivity issues. +- The `virt-handler` pod cannot reach the API server. This is commonly caused by +DNS issues on the node and networking connectivity issues. ## Impact -Status updates are not propagated and node-related actions, such as migrations, fail. However, running workloads on the affected node are not impacted. +Status updates are not propagated and node-related actions, such as migrations, +fail. However, running workloads on the affected node are not impacted. ## Diagnosis @@ -30,7 +36,8 @@ Status updates are not propagated and node-related actions, such as migrations, $ kubectl get pods -n $NAMESPACE -l=kubevirt.io=virt-handler ``` -3. Check the `virt-handler` logs for error messages when connecting to the API server: +3. Check the `virt-handler` logs for error messages when connecting to the API +server: ```bash $ kubectl logs -n $NAMESPACE @@ -38,17 +45,19 @@ Status updates are not propagated and node-related actions, such as migrations, ## Mitigation -If the `virt-handler` cannot connect to the API server, delete the pod to force a restart: +If the `virt-handler` cannot connect to the API server, delete the pod to force +a restart: ```bash $ kubectl delete -n $NAMESPACE ``` - + If you cannot resolve the issue, see the following resources: - [OKD Help](https://www.okd.io/help/) - [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization) - diff --git a/docs/runbooks/VirtHandlerRESTErrorsHigh.md b/docs/runbooks/VirtHandlerRESTErrorsHigh.md index 25075e9f..4becf202 100644 --- a/docs/runbooks/VirtHandlerRESTErrorsHigh.md +++ b/docs/runbooks/VirtHandlerRESTErrorsHigh.md @@ -3,17 +3,24 @@ ## Meaning -More than 5% of REST calls failed in `virt-handler` in the last 60 minutes. This alert usually indicates that the `virt-handler` pods have partially lost connection to the API server. +More than 5% of REST calls failed in `virt-handler` in the last 60 minutes. This +alert usually indicates that the `virt-handler` pods have partially lost +connection to the API server. This error is frequently caused by one of the following problems: -- The API server is overloaded, which causes timeouts. To verify if this is the case, check the metrics of the API server, and view its response times and overall calls. +- The API server is overloaded, which causes timeouts. 
To verify if this is the +case, check the metrics of the API server, and view its response times and +overall calls. -- The `virt-handler` pod cannot reach the API server. This is commonly caused by DNS issues on the node and networking connectivity issues. +- The `virt-handler` pod cannot reach the API server. This is commonly caused by +DNS issues on the node and networking connectivity issues. ## Impact -Node-related actions, such as starting and migrating workloads, are delayed on the node that `virt-handler` is running on. Running workloads are not affected, but reporting their current status might be delayed. +Node-related actions, such as starting and migrating workloads, are delayed on +the node that `virt-handler` is running on. Running workloads are not affected, +but reporting their current status might be delayed. ## Diagnosis @@ -23,13 +30,15 @@ Node-related actions, such as starting and migrating workloads, are delayed on t $ export NAMESPACE="$(kubectl get kubevirt -A -o custom-columns="":.metadata.namespace)" ``` -2. List the available `virt-handler` pods to identify the failing `virt-handler` pod: +2. List the available `virt-handler` pods to identify the failing `virt-handler` +pod: ```bash $ kubectl get pods -n $NAMESPACE -l=kubevirt.io=virt-handler ``` -3. Check the failing `virt-handler` pod log for error messages when connecting to the API server: +3. Check the failing `virt-handler` pod log for error messages when connecting +to the API server: ```bash $ kubectl logs -n $NAMESPACE @@ -43,13 +52,16 @@ Node-related actions, such as starting and migrating workloads, are delayed on t ## Mitigation -If the `virt-handler` cannot connect to the API server, delete the pod to force a restart: +If the `virt-handler` cannot connect to the API server, delete the pod to force +a restart: ```bash $ kubectl delete -n $NAMESPACE ``` - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/VirtOperatorDown.md b/docs/runbooks/VirtOperatorDown.md index 8a4a53c7..4d9d888d 100644 --- a/docs/runbooks/VirtOperatorDown.md +++ b/docs/runbooks/VirtOperatorDown.md @@ -3,21 +3,29 @@ ## Meaning -This alert fires when no `virt-operator` pod in the `Running` state has been detected for 10 minutes. +This alert fires when no `virt-operator` pod in the `Running` state has been +detected for 10 minutes. -The `virt-operator` is the first Operator to start in a cluster. Its primary responsibilities include the following: +The `virt-operator` is the first Operator to start in a cluster. Its primary +responsibilities include the following: - Installing, live-updating, and live-upgrading a cluster -- Monitoring the life cycle of top-level controllers, such as `virt-controller`, `virt-handler`, `virt-launcher`, and managing their reconciliation -- Certain cluster-wide tasks, such as certificate rotation and infrastructure management +- Monitoring the life cycle of top-level controllers, such as `virt-controller`, +`virt-handler`, `virt-launcher`, and managing their reconciliation +- Certain cluster-wide tasks, such as certificate rotation and infrastructure +management The `virt-operator` deployment has a default replica of 2 pods. ## Impact -This alert indicates a failure at the level of the cluster. Critical cluster-wide management functionalities, such as certification rotation, upgrade, and reconciliation of controllers, might not be available. +This alert indicates a failure at the level of the cluster. 
Critical +cluster-wide management functionalities, such as certificate rotation, +upgrade, and reconciliation of controllers, might not be available. -The `virt-operator` is not directly responsible for virtual machines (VMs) in the cluster. Therefore, its temporary unavailability does not significantly affect VM workloads. +The `virt-operator` is not directly responsible for virtual machines (VMs) in +the cluster. Therefore, its temporary unavailability does not significantly +affect VM workloads. ## Diagnosis @@ -53,9 +61,12 @@ The `virt-operator` is not directly responsible for virtual machines (VMs) in th ## Mitigation -Based on the information obtained during Diagnosis, try to find and resolve the cause of the issue. +Based on the information obtained during Diagnosis, try to find and resolve the +cause of the issue. - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/VirtOperatorRESTErrorsBurst.md b/docs/runbooks/VirtOperatorRESTErrorsBurst.md index c02f56a2..6311cb55 100644 --- a/docs/runbooks/VirtOperatorRESTErrorsBurst.md +++ b/docs/runbooks/VirtOperatorRESTErrorsBurst.md @@ -2,21 +2,28 @@ ## Meaning -For the last 10 minutes or longer, over 80% of the REST calls made to `virt-operator` pods have failed. +For the last 10 minutes or longer, over 80% of the REST calls made to +`virt-operator` pods have failed. -This usually indicates that the `virt-operator` pods cannot connect to the API server. +This usually indicates that the `virt-operator` pods cannot connect to the API +server. This error is frequently caused by one of the following problems: -- The API server is overloaded, which causes timeouts. To verify if this is the case, check the metrics of the API server, and view its response times and overall calls. +- The API server is overloaded, which causes timeouts. To verify if this is the +case, check the metrics of the API server, and view its response times and +overall calls. -- The `virt-operator` pod cannot reach the API server. This is commonly caused by DNS issues on the node and networking connectivity issues. +- The `virt-operator` pod cannot reach the API server. This is commonly caused +by DNS issues on the node and networking connectivity issues. ## Impact -Cluster-level actions, such as upgrading and controller reconciliation, might not be available. +Cluster-level actions, such as upgrading and controller reconciliation, might +not be available. -However, customer workloads, such as virtual machines (VMs) and VM instances (VMIs), are not likely to be affected. +However, customer workloads, such as virtual machines (VMs) and VM instances +(VMIs), are not likely to be affected. ## Diagnosis @@ -46,13 +53,16 @@ However, customer workloads, such as virtual machines (VMs) and VM instances (VM ## Mitigation -If the `virt-operator` pod cannot connect to the API server, delete the pod to force a restart: +If the `virt-operator` pod cannot connect to the API server, delete the pod to +force a restart: ```bash $ kubectl delete -n $NAMESPACE ``` - + If you cannot resolve the issue, see the following resources: diff --git a/docs/runbooks/VirtOperatorRESTErrorsHigh.md b/docs/runbooks/VirtOperatorRESTErrorsHigh.md index 899c4dbb..12382c61 100644 --- a/docs/runbooks/VirtOperatorRESTErrorsHigh.md +++ b/docs/runbooks/VirtOperatorRESTErrorsHigh.md @@ -3,19 +3,26 @@ ## Meaning -This alert fires when more than 5% of the REST calls in `virt-operator` pods failed in the last 60 minutes.
This usually indicates the `virt-operator` pods cannot connect to the API server. +This alert fires when more than 5% of the REST calls in `virt-operator` pods +failed in the last 60 minutes. This usually indicates the `virt-operator` pods +cannot connect to the API server. This error is frequently caused by one of the following problems: -- The API server is overloaded, which causes timeouts. To verify if this is the case, check the metrics of the API server, and view its response times and overall calls. +- The API server is overloaded, which causes timeouts. To verify if this is the +case, check the metrics of the API server, and view its response times and +overall calls. -- The `virt-operator` pod cannot reach the API server. This is commonly caused by DNS issues on the node and networking connectivity issues. +- The `virt-operator` pod cannot reach the API server. This is commonly caused +by DNS issues on the node and networking connectivity issues. ## Impact -Cluster-level actions, such as upgrading and controller reconciliation, might be delayed. +Cluster-level actions, such as upgrading and controller reconciliation, might be +delayed. -However, customer workloads, such as virtual machines (VMs) and VM instances (VMIs), are not likely to be affected. +However, customer workloads, such as virtual machines (VMs) and VM instances +(VMIs), are not likely to be affected. ## Diagnosis @@ -45,13 +52,16 @@ However, customer workloads, such as virtual machines (VMs) and VM instances (VM ## Mitigation -If the `virt-operator` pod cannot connect to the API server, delete the pod to force a restart: +If the `virt-operator` pod cannot connect to the API server, delete the pod to +force a restart: ```bash $ kubectl delete -n ``` - + If you cannot resolve the issue, see the following resources: From 199820e293ac8830395ae31bf74ef9a7f7a855a8 Mon Sep 17 00:00:00 2001 From: machadovilaca Date: Thu, 18 Apr 2024 15:03:11 +0100 Subject: [PATCH 4/4] Add GitHub Action Signed-off-by: machadovilaca --- .github/workflows/sanity.yaml | 20 ++++++++++++++++++++ 1 file changed, 20 insertions(+) create mode 100644 .github/workflows/sanity.yaml diff --git a/.github/workflows/sanity.yaml b/.github/workflows/sanity.yaml new file mode 100644 index 00000000..498b45e6 --- /dev/null +++ b/.github/workflows/sanity.yaml @@ -0,0 +1,20 @@ +name: Sanity Checks + +on: + push: + branches: [ main ] + pull_request: + branches: [ main ] + workflow_dispatch: + +jobs: + build: + name: Sanity Checks + runs-on: ubuntu-latest + + steps: + - uses: actions/checkout@v3 + + - uses: DavidAnson/markdownlint-cli2-action@v16 + with: + globs: 'docs/*runbooks/*.md'
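For a local spot-check of the same rules the workflow above enforces, one possible approach, assuming Node.js and npm are available on the developer machine (no specific package version is pinned here), is to invoke markdownlint-cli2 directly from the repository root against the same globs; it discovers the repository's `.markdownlint-cli2.yaml` on its own:

```bash
# Local spot-check sketch, assuming Node.js/npm are installed.
# markdownlint-cli2 reads .markdownlint-cli2.yaml from the current
# working directory, so run this from the repository root.
npx markdownlint-cli2 "docs/*runbooks/*.md"
```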