vdev queue stats #16200
base: master
Conversation
Module names are mapped directly to directory names in procfs, but nothing is done to create the intermediate directories, or remove them. This makes it impossible to sensibly present kstats about sub-objects. This commit loops through '/'-separated names in the full module name, creates a separate module for each, hooks them up with a parent pointer and child counter, and then unrolls this on the other side when deleting a module.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
Previously, if a kstat proc name already existed, the old one would be kept. This makes it so the old one is discarded and the new one kept. Arguably, a collision like this shouldn't ever happen, but during import multiple vdev_t (and so vdev_queue_t, and so vdev_queue stats) can exist at the same time for the same guid. There's no real way to tell which is which without substantial refactoring in the import and vdev init codepaths, which is probably worthwhile but not for today.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
This extends the existing special-case for zfs/poolname to split and create any number of intermediate sysctl names, so that multi-level module names are possible.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
Normally, when trying to add a sysctl name that already exists, the kernel rejects it with a warning. This changes the code to search for a sysctl with the wanted name in the same root. If it exists, it is destroyed, allowing the new one to go in. Arguably, a collision like this shouldn't ever happen, but during import multiple vdev_t (and so vdev_queue_t, and so vdev_queue stats) can exist at the same time for the same guid. There's no real way to tell which is which without substantial refactoring in the import and vdev init codepaths, which is probably worthwhile but not for today.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
Adding a bunch of gauges and counters to show in-flight and total IOs, with per-class breakdowns, and some aggregation counters.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
At the time I had reservations about running it through iostat: kstats can have less of an impact on performance, which I was worried about, but also, I didn't want to think much about the ABI changes required. These days I'm more interested in having uniform interfaces for all platforms, though that might be a uniform kstat-like interface. I will play :)
@robn If you decide to go the kstat route, try reading the new kstats in a tight loop, while exporting the pool. It's a good smoke test for panics.
Motivation and Context
Part of my ongoing quest to understand what's happening inside the box (previously).
This time, it's counters showing what `vdev_queue` is up to.

Description
Adds a bunch of `wmsum_t` counters to every `vdev_queue` instance for a real device. This shows the current count of IOs queued and in-flight (total and broken down by class), total IOs in/out over the lifetime of the queue, and basic aggregation counters.

The counters are exposed under `/proc/spl/kstat/zfs/<pool>/vdev/<guid>/queue` on Linux, or `kstat.zfs.<pool>.vdev.<guid>.misc.queue` on FreeBSD.

FreeBSD:
Notes
The actual stats part is pretty unremarkable, being little more than the normal "sums & stats" boilerplate. They perhaps don't technically need to be `wmsum_t`, since all the changes are made under `vq_lock` anyway, but it's following a common pattern, and part of why I want this is to assist with removing or greatly reducing the scope of `vq_lock`, so this is where they'll need to be anyway.

The more interesting part of the PR is in the SPL kstats changes. These could be a separate PR, even two, but since they have no other application (yet) it seems fair to leave them here so at least there's something to test with. (I will however make them separate PRs on request.)
The main part is allowing for multi-level kstat module names. I want this so I can bolt sub-object stats (like individual vdevs) under the pool stats, as you see. For Linux it's not really complex, just a little more housekeeping. For FreeBSD, every kstat has its own "view" of the tree anyway, attached to the sysctl context, so it's quite trivial as no cleanup code is required.
The name reuse thing, meanwhile, is the least invasive solution I could find to an annoying structural problem that came up. Every `vdev_t` has a `vdev_queue_t` that isn't easily decoupled, and now every `vdev_queue_t` creates some stats. During import, a tree of `vdev_t`s is created with the untrusted config, and then a second set with the trusted config. Both of these register kstats with the same names. The effective policy that falls out of the implementations was that the first to claim the name wins, so the untrusted vdev tree gets them. Once the pool is imported though, that tree is discarded. The trusted tree remains and becomes the active pool, but at that point it never got to register its kstats, and the original ones are gone.

Reordering the import is not really possible, as the two trees briefly coexist to copy "updated" values from the untrusted tree to the trusted one (eg device paths that have changed since last import). There's no comfortable way I could find to know where in the process we are, so as to avoid creating stats until the live tree comes up. There are other options, like delaying kstat creation until first use, but in all these cases it felt dangerous to be mucking around in pool and vdev initialisation just to satisfy a quirk of the kstats system.
So instead, I effectively just changed the policy from first-wins to last-wins, and it all works out ok. There's probably a better structured "correct" way to sort it out, but I'll leave that for the eventual stats subsystem rewrite that of course is now buzzing in the back of my head 😇.
How Has This Been Tested?
Mostly through repeated pool create -> IO -> scrub -> export -> import -> IO -> export -> unload cycles, on both Linux and FreeBSD. Once the numbers looked good and things stopped complaining about replacement names and/or panicking, I declared it good.
Types of changes
Checklist:
`Signed-off-by`.