Add markdown linter for runbooks #234

Merged: 4 commits, Apr 29, 2024
20 changes: 20 additions & 0 deletions .github/workflows/sanity.yaml
@@ -0,0 +1,20 @@
name: Sanity Checks

on:
  push:
    branches: [ main ]
  pull_request:
    branches: [ main ]
  workflow_dispatch:

jobs:
  build:
    name: Sanity Checks
    runs-on: ubuntu-latest

    steps:
      - uses: actions/checkout@v3

      - uses: DavidAnson/markdownlint-cli2-action@v16
        with:
          globs: 'docs/*runbooks/*.md'
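
As a local sanity check, the same lint run can be reproduced before pushing; a
minimal sketch assuming Node.js/npx is available (markdownlint-cli2 is fetched
on demand and picks up the repo's `.markdownlint-cli2.yaml` automatically):

```bash
# Lint the same globs the workflow checks; exits non-zero on any violation.
npx markdownlint-cli2 "docs/*runbooks/*.md"
```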
47 changes: 47 additions & 0 deletions .markdownlint-cli2.yaml
@@ -0,0 +1,47 @@
# See https://github.com/DavidAnson/markdownlint#optionsconfig
# and https://github.com/DavidAnson/markdownlint-cli2

config:
  # front matter metadata seems to trigger this
  single-title: false

  # hard tabs are used when pasting go example code into files
  no-hard-tabs: false

  # we commonly paste bare urls in the middle of paragraphs
  no-bare-urls: false

  # really, this is a rule?
  commands-show-output: false

  # We like to use really long lines
  line-length:
    line_length: 80
    code_blocks: false

  # Sometimes we repeat headings, and it's easier to just turn those
  # into emphasis
  no-emphasis-as-heading: false

  # We only publish HTML, so allow all HTML inline.
  no-inline-html: false

  ## Rules we may want to turn on later, but that will require editing
  ## existing files:

  # We tend to use `*` instead of `-` for list bullets but we aren't
  # consistent, even within a single file. Ideally we would want
  # `style: consistent`.
  ul-style: false

  # We have at least one document that breaks up a numbered list
  # with headings. Ideally we would set `style: one_or_ordered`.
  ol-prefix: false

  # Vertical whitespace helps the reader, so we should turn these on
  # again when someone has time to fix our existing files.
  blanks-around-fences: false
  blanks-around-headings: false
  blanks-around-lists: false
  single-trailing-newline: false
  no-multiple-blanks: false
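
For the rules that stay enabled, markdownlint-cli2 can also apply automatic
fixes where a rule supports them; a hedged local sketch (the `--fix` flag
rewrites files in place, so review the resulting diff):

```bash
# Auto-fix fixable violations (trailing whitespace, blank-line rules, etc.)
# in the runbook files, using the config above.
npx markdownlint-cli2 --fix "docs/*runbooks/*.md"
```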
5 changes: 5 additions & 0 deletions Makefile
@@ -31,3 +31,8 @@ monitoringlinter-build:
.PHONY: monitoringlinter-test
monitoringlinter-test: monitoringlinter-build
	cd monitoringlinter && ./tests/e2e.sh

.PHONY: lint-markdown
lint-markdown:
	echo "Linting markdown files"
	podman run -v ${PWD}:/workdir:Z docker.io/davidanson/markdownlint-cli2:v0.13.0 "/workdir/docs/*runbooks/*.md"
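
A usage sketch for the new target; the docker variant is an assumption for
machines without podman (the `:Z` SELinux label is podman-specific and dropped
here):

```bash
# Via the Makefile target (uses podman):
make lint-markdown

# Roughly equivalent invocation with docker:
docker run -v "${PWD}":/workdir docker.io/davidanson/markdownlint-cli2:v0.13.0 "/workdir/docs/*runbooks/*.md"
```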
12 changes: 7 additions & 5 deletions docs/deprecated_runbooks/KubeMacPoolDown.md
@@ -5,11 +5,12 @@

## Meaning

`KubeMacPool` is down. `KubeMacPool` is responsible for allocating MAC addresses and preventing MAC address conflicts.
`KubeMacPool` is down. `KubeMacPool` is responsible for allocating MAC addresses
and preventing MAC address conflicts.

## Impact

If `KubeMacPool` is down, `VirtualMachine` objects cannot be created.
If `KubeMacPool` is down, `VirtualMachine` objects cannot be created.

## Diagnosis

@@ -19,7 +20,7 @@ If `KubeMacPool` is down, `VirtualMachine` objects cannot be created.
$ export KMP_NAMESPACE="$(kubectl get pod -A --no-headers -l \
control-plane=mac-controller-manager | awk '{print $1}')"
```

2. Set the `KMP_NAME` environment variable:

```bash
@@ -41,11 +42,12 @@ If `KubeMacPool` is down, `VirtualMachine` objects cannot be created.

## Mitigation

<!--DS: If you cannot resolve the issue, log in to the link:https://access.redhat.com[Customer Portal] and open a support case, attaching the artifacts gathered during the Diagnosis procedure.-->
<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the Diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->

@@ -1,4 +1,5 @@
# KubeVirtComponentExceedsRequestedCPU [Deprecated]
<!-- Edited by apinnick, Nov 2022-->

This alert has been deprecated; it does not indicate a genuine issue. If triggered, it may be safely ignored and silenced.
This alert has been deprecated; it does not indicate a genuine issue. If
triggered, it may be safely ignored and silenced.
@@ -1,4 +1,5 @@
# KubeVirtComponentExceedsRequestedMemory [Deprecated]
<!-- Edited by apinnick, Nov 2022-->

This alert has been deprecated; it does not indicate a genuine issue. If triggered, it may be safely ignored and silenced.
This alert has been deprecated; it does not indicate a genuine issue. If
triggered, it may be safely ignored and silenced.
4 changes: 2 additions & 2 deletions docs/deprecated_runbooks/KubeVirtVMStuckInErrorState.md
@@ -145,11 +145,11 @@ $ kubectl get nodes -l node-role.kubernetes.io/worker= -o json | jq '.items | .[

## Mitigation

First, ensure that the VirtualMachine configuration is correct and all necessary
First, ensure that the VirtualMachine configuration is correct and all necessary
resources exist. For example, if a PVC is missing, it should be created. Also,
verify that the cluster's infrastructure is healthy and there are enough
resources to run the VirtualMachine.

This problem can have several causes. Therefore, we advise you to try
to identify and fix the root cause. If you cannot resolve this issue, please
open an issue and attach the artifacts gathered in the Diagnosis section.
open an issue and attach the artifacts gathered in the Diagnosis section.
@@ -3,16 +3,27 @@

## Meaning

<!--DS: This alert fires when _integrated_ Node Maintenance Operator (NMO) custom resources (CRs) are detected. This alert only affects {VirtProductName} 4.10.-->
<!--DS: This alert fires when _integrated_ Node Maintenance Operator (NMO)
custom resources (CRs) are detected. This alert only affects {VirtProductName}
4.10.-->

<!--DS: The Node Maintenance Operator is not included with {VirtProductName} 4.11.0 or later. Instead, the Operator is installed from OperatorHub.-->
<!--DS: The Node Maintenance Operator is not included with {VirtProductName}
4.11.0 or later. Instead, the Operator is installed from OperatorHub.-->

<!--DS: The presence of `NodeMaintenance` CRs belonging to the `nodemaintenance.kubevirt.io` API group indicates that the node specified in `spec.nodeName` was put into maintenance mode. The target node has been cordoned off and drained.-->
<!--DS: The presence of `NodeMaintenance` CRs belonging to the
`nodemaintenance.kubevirt.io` API group indicates that the node specified in
`spec.nodeName` was put into maintenance mode. The target node has been cordoned
off and drained.-->

<!--USstart-->
This alert fires when _integrated_ Node Maintenance Operator (NMO) custom resources (CRs) are detected. This alert only affects OKD 1.6.

The presence of `NodeMaintenance` CRs belonging to the `nodemaintenance.kubevirt.io` API group indicates that the node specified in `spec.nodeName` was put into maintenance mode. The target node has been [cordoned off](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cordon) and [drained](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service).
This alert fires when _integrated_ Node Maintenance Operator (NMO) custom
resources (CRs) are detected. This alert only affects OKD 1.6.

The presence of `NodeMaintenance` CRs belonging to the
`nodemaintenance.kubevirt.io` API group indicates that the node specified in
`spec.nodeName` was put into maintenance mode. The target node has been
[cordoned off](https://kubernetes.io/docs/reference/generated/kubectl/kubectl-commands#cordon)
and [drained](https://kubernetes.io/docs/tasks/administer-cluster/safely-drain-node/#use-kubectl-drain-to-remove-a-node-from-service).
<!--USend-->

## Impact
@@ -32,7 +43,7 @@ You cannot upgrade to OKD 1.7.

Example output:

```
```json
{
"lastTransitionTime": "2022-05-26T09:23:21Z",
"message": "NMO custom resources have been found",
@@ -44,7 +55,7 @@

2. Check for a ClusterServiceVersion (CSV) warning event such as the following:

```
```text
Warning NotUpgradeable 2m12s (x5 over 2m50s) kubevirt-hyperconvergedNode
Maintenance Operator custom resources nodemaintenances.nodemaintenance.kubevirt.io
have been found.
@@ -60,20 +71,28 @@ You cannot upgrade to OKD 1.7.

Example output:

```
```text
NAME AGE
nodemaintenance-test 5m33s
```

## Mitigation

Remove all NMO CRs belonging to the `nodemaintenance.nodemaintenance.kubevirt.io/` API group. After the integrated NMO resources are removed, the alert is cleared and you can upgrade.
Remove all NMO CRs belonging to the
`nodemaintenance.nodemaintenance.kubevirt.io/` API group. After the integrated
NMO resources are removed, the alert is cleared and you can upgrade.

If a node must remain in maintenance mode during upgrade, install the Node Maintenance Operator from OperatorHub. Then, create an NMO CR belonging to the `nodemaintenance.nodemaintenance.medik8s.io/v1beta1` API group and version for the node.
If a node must remain in maintenance mode during upgrade, install the Node
Maintenance Operator from OperatorHub. Then, create an NMO CR belonging to the
`nodemaintenance.nodemaintenance.medik8s.io/v1beta1` API group and version for
the node.

<!--DS: If you cannot resolve the issue, log in to the link:https://access.redhat.com[Customer Portal] and open a support case, attaching the artifacts gathered during the Diagnosis procedure.-->
<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the Diagnosis procedure.-->
<!--USstart-->
See the [HCO cluster configuration documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#enablecommonbootimageimport-feature-gate) for more information.
See the [HCO cluster configuration documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#enablecommonbootimageimport-feature-gate)
for more information.

If you cannot resolve the issue, see the following resources:

50 changes: 38 additions & 12 deletions docs/runbooks/CDIDataImportCronOutdated.md
@@ -3,11 +3,18 @@

## Meaning

This alert fires when `DataImportCron` cannot poll or import the latest disk image versions.
This alert fires when `DataImportCron` cannot poll or import the latest disk
image versions.

`DataImportCron` polls disk images, checking for the latest versions, and imports the images into persistent volume claims (PVCs) or VolumeSnapshots. This process ensures that these sources are updated to the latest version so that they can be used as reliable clone sources or golden images for virtual machines (VMs).
`DataImportCron` polls disk images, checking for the latest versions, and
imports the images into persistent volume claims (PVCs) or VolumeSnapshots. This
process ensures that these sources are updated to the latest version so that
they can be used as reliable clone sources or golden images for virtual machines
(VMs).

For golden images, _latest_ refers to the latest operating system of the distribution. For other disk images, _latest_ refers to the latest hash of the image that is available.
For golden images, _latest_ refers to the latest operating system of the
distribution. For other disk images, _latest_ refers to the latest hash of the
image that is available.

## Impact

@@ -23,15 +30,22 @@ VMs might fail to start because no boot source is available for cloning.
$ kubectl get sc
```

The output displays the storage classes with `(default)` beside the name of the default storage class. You must set a default storage class, either on the cluster or in the `DataImportCron` specification, in order for the `DataImportCron` to poll and import golden images. If no storage class is defined, the DataVolume controller fails to create PVCs and the following event is displayed: `DataVolume.storage spec is missing accessMode and no storageClass to choose profile`.
The output displays the storage classes with `(default)` beside the name of
the default storage class. You must set a default storage class, either on
the cluster or in the `DataImportCron` specification, in order for the
`DataImportCron` to poll and import golden images. If no storage class is
defined, the DataVolume controller fails to create PVCs and the following
event is displayed: `DataVolume.storage spec is missing accessMode and no
storageClass to choose profile`.

2. Obtain the `DataImportCron` namespace and name:

```bash
$ kubectl get dataimportcron -A -o json | jq -r '.items[] | select(.status.conditions[] | select(.type == "UpToDate" and .status == "False")) | .metadata.namespace + "/" + .metadata.name'
```

3. If a default storage class is not defined on the cluster, check the `DataImportCron` specification for a default storage class:
3. If a default storage class is not defined on the cluster, check the
`DataImportCron` specification for a default storage class:

```bash
$ kubectl get dataimportcron <dataimportcron> -o yaml | grep -B 5 storageClassName
@@ -48,7 +62,8 @@ VMs might fail to start because no boot source is available for cloning.
storageClassName: rook-ceph-block
```

4. Obtain the name of the `DataVolume` associated with the `DataImportCron` object:
4. Obtain the name of the `DataVolume` associated with the `DataImportCron`
object:

```bash
$ kubectl -n <namespace> get dataimportcron <dataimportcron> -o json | jq .status.lastImportedPVC.name
@@ -74,20 +89,31 @@ VMs might fail to start because no boot source is available for cloning.

## Mitigation

1. Set a default storage class, either on the cluster or in the `DataImportCron` specification, to poll and import golden images. The updated Containerized Data Importer (CDI) should resolve the issue within a few seconds.
1. Set a default storage class, either on the cluster or in the `DataImportCron`
specification, to poll and import golden images (a command sketch follows this
list). The updated Containerized Data Importer (CDI) should resolve the issue
within a few seconds.

2. If the issue does not resolve itself, or, if you have changed the default storage class in the cluster,
you must delete the existing boot sources (datavolumes or volumesnapshots) in the cluster namespace that are configured with the previous default storage class. The CDI will recreate the data volumes with the newly configured default storage class.
2. If the issue does not resolve itself, or if you have changed the default
storage class in the cluster, you must delete the existing boot sources
(datavolumes or volumesnapshots) in the cluster namespace that are configured
with the previous default storage class. The CDI will recreate the data
volumes with the newly configured default storage class.

3. If your cluster is installed in a restricted network environment, disable the `enableCommonBootImageImport` feature gate in order to opt out of automatic updates:
3. If your cluster is installed in a restricted network environment, disable the
`enableCommonBootImageImport` feature gate in order to opt out of automatic
updates:

```bash
$ kubectl patch hco kubevirt-hyperconverged -n $CDI_NAMESPACE --type json -p '[{"op": "replace", "path": "/spec/featureGates/enableCommonBootImageImport", "value": false}]'
```
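
For step 1 above, a minimal sketch of marking an existing storage class as the
cluster default (`rook-ceph-block` is only the example class from the diagnosis
output, not a recommendation):

```bash
$ kubectl patch storageclass rook-ceph-block --type merge -p \
  '{"metadata": {"annotations": {"storageclass.kubernetes.io/is-default-class": "true"}}}'
```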

<!--DS: If you cannot resolve the issue, log in to the link:https://access.redhat.com[Customer Portal] and open a support case, attaching the artifacts gathered during the Diagnosis procedure.-->
<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the Diagnosis procedure.-->
<!--USstart-->
See the [HCO cluster configuration documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#enablecommonbootimageimport-feature-gate) for more information.
See the [HCO cluster configuration documentation](https://github.com/kubevirt/hyperconverged-cluster-operator/blob/main/docs/cluster-configuration.md#enablecommonbootimageimport-feature-gate)
for more information.

If you cannot resolve the issue, see the following resources:

10 changes: 7 additions & 3 deletions docs/runbooks/CDIDataVolumeUnusualRestartCount.md
@@ -7,7 +7,10 @@ This alert fires when a `DataVolume` object restarts more than three times.

## Impact

Data volumes are responsible for importing and creating a virtual machine disk on a persistent volume claim. If a data volume restarts more than three times, these operations are unlikely to succeed. You must diagnose and resolve the issue.
Data volumes are responsible for importing and creating a virtual machine disk
on a persistent volume claim. If a data volume restarts more than three times,
these operations are unlikely to succeed. You must diagnose and resolve the
issue.

## Diagnosis

@@ -33,11 +36,12 @@ Data volumes are responsible for importing and creating a virtual machine disk o

Delete the data volume, resolve the issue, and create a new data volume.

<!--DS: If you cannot resolve the issue, log in to the link:https://access.redhat.com[Customer Portal] and open a support case, attaching the artifacts gathered during the Diagnosis procedure.-->
<!--DS: If you cannot resolve the issue, log in to the
link:https://access.redhat.com[Customer Portal] and open a support case,
attaching the artifacts gathered during the Diagnosis procedure.-->
<!--USstart-->
If you cannot resolve the issue, see the following resources:

- [OKD Help](https://www.okd.io/help/)
- [#virtualization Slack channel](https://kubernetes.slack.com/channels/virtualization)
<!--USend-->
