vdev queue stats #16200
base: master
Conversation
Module names are mapped directly to directory names in procfs, but nothing is done to create the intermediate directories, or remove them. This makes it impossible to sensibly present kstats about sub-objects. This commit loops through '/'-separated names in the full module name, creates a separate module for each, hooks them up with a parent pointer and child counter, and then unrolls this on the other side when deleting a module.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
Previously, if a kstat proc name already existed, the old one would be kept. This makes it so the old one is discarded and the new one kept. Arguably, a collision like this shouldn't ever happen, but during import multiple vdev_t (and so vdev_queue_t, and so vdev_queue stats) can exist at the same time for the same guid. There's no real way to tell which is which without substantial refactoring in the import and vdev init codepaths, which is probably worthwhile but not for today.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
This extends the existing special-case for zfs/poolname to split and create any number of intermediate sysctl names, so that multi-level module names are possible.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
Normally, when trying to add a sysctl name that already exists, the kernel rejects it with a warning. This changes the code to search for a sysctl with the wanted name in the same root. If it exists, it is destroyed, allowing the new one to go in. Arguably, a collision like this shouldn't ever happen, but during import multiple vdev_t (and so vdev_queue_t, and so vdev_queue stats) can exist at the same time for the same guid. There's no real way to tell which is which without substantial refactoring in the import and vdev init codepaths, which is probably worthwhile but not for today.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
Adding a bunch of gauges and counters to show in-flight and total IOs, with per-class breakdowns, and some aggregation counters.

Sponsored-by: Klara, Inc.
Sponsored-by: Syneto
Signed-off-by: Rob Norris <[email protected]>
At the time I had reservations about running it through iostat: kstats can have less of an impact on performance, which I was worried about, but also, I didn't want to think much about the ABI changes required. These days I'm more interested in having uniform interfaces for all platforms, though that might be a uniform kstat-like interface. I will play :)
@robn If you decide to go the kstat route, try reading the new kstats in a tight loop, while exporting the pool. It's a good smoke test for panics.
Motivation and Context
Part of my ongoing quest to understand what's happening inside the box (previously).
This time, it's counters showing what `vdev_queue` is up to.

Description
Adds a bunch of `wmsum_t` counters to every `vdev_queue` instance for a real device. This shows the current count of IOs queued and in-flight (total and broken down by class), total IOs in/out over the lifetime of the queue, and basic aggregation counters.

The counters are exposed under `/proc/spl/kstat/zfs/<pool>/vdev/<guid>/queue` on Linux, or `kstat.zfs.<pool>.vdev.<guid>.misc.queue` on FreeBSD.

FreeBSD:
Notes
The actual stats part is pretty unremarkable, being little more than the normal "sums & stats" boilerplate. They perhaps don't technically need to be `wmsum_t`, since all the changes are made under `vq_lock` anyway, but it's following a common pattern, and part of why I want this is to assist with removing or greatly reducing the scope of `vq_lock`, so this is where they'll need to be anyway.

The more interesting part of the PR is in the SPL kstats changes. These could be a separate PR, even two, but since they have no other application (yet) it seems fair to leave them here so at least there's something to test with. (I will however make them separate PRs on request.)
The main part is allowing for multi-level kstat module names. I want this so I can bolt sub-object stats (like individual vdevs) under the pool stats, as you see. For Linux it's not really complex, just a little more housekeeping. For FreeBSD, every kstat has its own "view" of the tree anyway, attached to the sysctl context, so it's quite trivial as no cleanup code is required.
The name reuse thing, meanwhile, is the least invasive solution I could find to an annoying structural problem that came up. Every `vdev_t` has a `vdev_queue_t` that isn't easily decoupled, and now every `vdev_queue_t` creates some stats. During import, a tree of `vdev_t`s is created with the untrusted config, and then a second set with the trusted config. Both of these register kstats with the same names. The effective policy that falls out of the implementations was that the first to claim the name wins, so the untrusted vdev tree gets them. Once the pool is imported though, that tree is discarded. The trusted tree remains and becomes the active pool, but at that point it never got to register its kstats, and the original ones are gone.

Reordering the import is not really possible, as the two trees briefly coexist to copy "updated" values from the untrusted tree to the trusted one (eg device paths that have changed since last import). There's no comfortable way I could find to know where in the process we are, so as to avoid creating stats until the live tree comes up. There are other options, like delaying kstat creation until first use, but in all these cases it felt dangerous to be mucking around in pool and vdev initialisation just to satisfy a quirk of the kstats system.
So instead, I effectively just changed the policy from first-wins to last-wins, and it all works out ok. There's probably a better structured "correct" way to sort it out, but I'll leave that for the eventual stats subsystem rewrite that of course is now buzzing in the back of my head 😇.
How Has This Been Tested?
Mostly through repeated pool create -> IO -> scrub -> export -> import -> IO -> export -> unload cycles, on both Linux and FreeBSD. Once the numbers looked good and things stopped complaining about replacement names and/or panicking, I declared it good.
Types of changes
Checklist:
`Signed-off-by`.