Commit

KB: add the article for potential risk with fstrim
    - Also mentioned how to avoid this risk.

Signed-off-by: Vicente Cheng <[email protected]>
Vicente-Cheng committed Jan 31, 2024
1 parent 27e6a70 commit 61c6dbb
Showing 1 changed file with 65 additions and 0 deletions.
65 changes: 65 additions & 0 deletions kb/2024-01-30/the_potential_risk_with_fstrim.md
@@ -0,0 +1,65 @@
---
title: The potential risk with fstrim
description: The potential risk with fstrim and how to avoid it
slug: the_potential_risk_with_fstrim
authors:
- name: Vicente Cheng
title: Senior Software Engineer
url: https://github.com/Vicente-Cheng
image_url: https://github.com/Vicente-Cheng.png
tags: [harvester, rancher integration, longhorn, fstrim]
hide_table_of_contents: false
---

`fstrim` is a common way to release unused filesystem space. However, there is a known issue with running `fstrim` on a Longhorn volume. This article describes the potential risk and how to avoid it.

The known issue is that running `fstrim` on a Longhorn volume may result in I/O errors if the volume is rebuilding. The related issues below contain more details:
- https://github.com/harvester/harvester/issues/4739
- https://github.com/longhorn/longhorn/issues/7103

## The potential risk and impact of fstrim

When the known issue above occurs, the volume returns I/O errors, and the pod that uses the volume gets stuck. If the pod is critical, the application becomes unavailable. For example, Harvester usually uses a Longhorn volume as the VM disk. After hitting this issue, the VM flaps between the paused and running states until the volume rebuild completes.

This does not affect data integrity, but it can be alarming for users: the VM hangs and the application becomes unavailable. Consider the guest Kubernetes cluster scenario. When a VM is unavailable, the etcd member running on it is unavailable too. If half or more of the etcd members are unavailable, etcd loses quorum and the Kubernetes cluster becomes unavailable, along with every service running on it.
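The quorum arithmetic behind the guest cluster scenario can be sketched in a few lines of shell. This is only an illustration, assuming the usual majority rule for etcd (quorum is `floor(n/2) + 1` members); the function names `quorum` and `tolerated_failures` are made up for this example.

```shell
#!/bin/sh
# Majority quorum for a cluster of $1 members: floor(n/2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Number of members that can be unavailable while quorum is still held.
tolerated_failures() {
    echo $(( $1 - ($1 / 2 + 1) ))
}

# Print the figures for a few common cluster sizes.
for n in 1 3 5; do
    echo "members=$n quorum=$(quorum "$n") tolerated_failures=$(tolerated_failures "$n")"
done
```

For a three-member cluster this gives a quorum of 2, so the cluster may become unavailable once two of the three VMs hang at the same time.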

## How to avoid the potential risk

The way to avoid the potential risk is to disable `fstrim`, which is enabled by default on various modern Linux distributions.
Check the following items, each of which can trigger trimming:
- Disable it at mount time by adding the `nodiscard` option to the mount options in the `/etc/fstab` file. (`nodiscard` is sometimes the default value, but you can still add it explicitly.)

You can refer to the following cloud-init template.
```
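#cloud-config
package_update: true
packages:
  - qemu-guest-agent
runcmd:
  # Start the guest agent now and enable it for later boots.
  - - systemctl
    - enable
    - '--now'
    - qemu-guest-agent.service
  # Replace the `discard` mount option with `nodiscard` in /etc/fstab.
  # This assumes the image ships with plain `discard`; an existing
  # `nodiscard` entry would be turned into `nonodiscard`.
  - - sed
    - -i
    - 's/discard/nodiscard/'
    - /etc/fstab
  # Remount the fstab entries so the changed options are applied.
  - - mount
    - -a
    - -o
    - remount
password: ubuntu
chpasswd: { expire: False }
ssh_pwauth: True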
```
- Check the `fstrim.timer` unit. You can disable it, or edit the unit file so that `fstrim` does not run at almost the same time on every VM.
Check the following `[Timer]` section and modify it to disable `fstrim` or to spread out its timing.
```
[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true
RandomizedDelaySec=6000
```
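Both checks above can be tried out inside a guest with a short script. The following is a minimal sketch rather than the procedure from the linked issues: `fstab_nodiscard` is a hypothetical helper, the sample fstab lines are made up, and the `sed -E` pattern is an alternative to the template's substitution that leaves an existing `nodiscard` entry untouched.

```shell
#!/bin/sh
# Hypothetical helper: read fstab text on stdin and turn a standalone
# `discard` mount option into `nodiscard`. Options that already read
# `nodiscard` do not match, so the filter can be applied repeatedly.
fstab_nodiscard() {
    sed -E 's/(^|[[:space:],])discard([[:space:],]|$)/\1nodiscard\2/g'
}

# Sample input, invented for illustration (not taken from a real guest).
sample='LABEL=rootfs / ext4 discard,errors=remount-ro 0 1
/dev/vdb1 /data ext4 defaults,nodiscard 0 2'

# Show the rewritten text; nothing on disk is modified here.
printf '%s\n' "$sample" | fstab_nodiscard

# Report whether the timer is enabled when systemd is available;
# failures (for example, no such unit) are ignored.
if command -v systemctl >/dev/null 2>&1; then
    systemctl is-enabled fstrim.timer 2>/dev/null || true
fi
```

To apply the filter for real, one option is to write its output to a temporary file, review it, and only then replace `/etc/fstab`. If the timer is reported as enabled, `systemctl disable --now fstrim.timer` turns it off.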
