KB: add the article for potential risk with fstrim
    - Also mentioned how to avoid this risk.

Signed-off-by: Vicente Cheng <[email protected]>
Co-authored-by: Kiefer Chang <[email protected]>
Vicente-Cheng and bk201 committed Jan 31, 2024
1 parent 27e6a70 commit 9b644d0
Showing 1 changed file with 44 additions and 0 deletions.
44 changes: 44 additions & 0 deletions kb/2024-01-30/the_potential_risk_with_fstrim.md
@@ -0,0 +1,44 @@
---
title: The potential risk with fstrim
description: The potential risk with fstrim and how to avoid it
slug: the_potential_risk_with_fstrim
authors:
- name: Vicente Cheng
title: Senior Software Engineer
url: https://github.com/Vicente-Cheng
image_url: https://github.com/Vicente-Cheng.png
tags: [harvester, rancher integration, longhorn, fstrim]
hide_table_of_contents: false
---

`fstrim` is a common way to release unused filesystem space. However, there is a known issue with running `fstrim` on a Longhorn volume. This article describes the potential risk with `fstrim` and how to avoid it.

The known issue is that running `fstrim` on a Longhorn volume may result in I/O errors if the volume is rebuilding. For more details, see the related issues:
- https://github.com/harvester/harvester/issues/4739
- https://github.com/longhorn/longhorn/issues/7103

## The potential risk and impact of fstrim

If you hit the known issue described above, the volume returns I/O errors, and the VM that uses the volume becomes stuck. If the VM is critical, the applications it hosts become unavailable. For example, Harvester usually uses a Longhorn volume as the VM disk. After hitting this issue, the VM flaps between the paused and running states until the volume rebuild completes.

This does not affect data integrity, but it can be alarming for users: the VM hangs, and its applications become unavailable. Consider the guest Kubernetes cluster scenario. When a VM is unavailable, the etcd member running on it is unavailable too. If half or more of the etcd members are unavailable, etcd loses quorum and the Kubernetes cluster becomes unavailable, along with every service running on it.
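The quorum arithmetic behind this scenario can be sketched in a few lines of shell. This is only an illustration: the function names `quorum_size` and `has_quorum` are hypothetical helpers, and the sketch assumes the usual majority rule, where a cluster of N members needs N/2 + 1 (integer division) of them available.

```shell
#!/bin/sh
# Hypothetical helper: smallest number of members that forms a majority of $1.
quorum_size() {
    echo $(( $1 / 2 + 1 ))
}

# Hypothetical helper: succeeds if a cluster of $1 members still has a
# majority available after $2 members become unavailable.
has_quorum() {
    total="$1"
    unavailable="$2"
    [ $(( total - unavailable )) -ge "$(quorum_size "$total")" ]
}

# Example: a 3-member cluster tolerates one stuck VM, but not two.
# has_quorum 3 1 && echo "cluster available"
# has_quorum 3 2 || echo "cluster unavailable"
```

Under this rule, a three-member cluster survives one stuck VM, while a second stuck VM takes the whole guest cluster down.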

## How to avoid the potential risk

To avoid the potential risk, disable `fstrim` in VMs. `fstrim` is enabled by default on many modern Linux distributions.
Check the following items to find where `fstrim` may be triggered.

:::note
The following items apply to VMs that use a Longhorn volume, where `fstrim` can cause the issue described above.
:::

- Check the `fstrim.timer` service. You can **disable** it, or **edit** the unit file so that `fstrim` does not run on all VMs at almost the same time.

Check the following section and modify it to spread out the `fstrim` timing.
```
[Timer]
OnCalendar=weekly
AccuracySec=1h
Persistent=true
RandomizedDelaySec=6000
```
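One way to apply such a setting without editing the packaged unit file is a systemd drop-in. The following is a minimal sketch, not a definitive procedure: the helper name `write_fstrim_dropin`, the file name `override.conf`, and the delay value are illustrative assumptions, while `/etc/systemd/system/fstrim.timer.d` follows the standard systemd drop-in convention.

```shell
#!/bin/sh
# Hypothetical helper: write a drop-in that overrides RandomizedDelaySec
# for fstrim.timer. Arguments: <drop-in directory> <delay in seconds>.
write_fstrim_dropin() {
    dropin_dir="$1"
    delay_sec="$2"
    mkdir -p "$dropin_dir" || return 1
    # A key set in a drop-in overrides the same key in the packaged unit.
    printf '[Timer]\nRandomizedDelaySec=%s\n' "$delay_sec" \
        > "$dropin_dir/override.conf"
}

# Example (run as root inside the VM; commented out so the sketch has no
# side effects when sourced):
# write_fstrim_dropin /etc/systemd/system/fstrim.timer.d 6000
# systemctl daemon-reload
# systemctl restart fstrim.timer
# systemctl list-timers fstrim.timer   # shows the next scheduled run
```

To turn the timer off instead of spreading it out, `systemctl disable --now fstrim.timer` stops it and prevents it from starting at boot.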
