Detect a slow raidz child during reads #16900
base: master
Conversation
module/zfs/vdev_raidz.c
Outdated
*/
uint64_t two_sectors = 2ULL << zio->io_vd->vdev_top->vdev_ashift;
if (zio->io_type == ZIO_TYPE_READ && zio->io_error == 0 &&
    zio->io_size >= two_sectors && zio->io_delay != 0) {
Could you explain why we care about the two sectors (all data columns) here?
Not accounting for aggregated ZIOs makes this algorithm even more random than periodic sampling alone would be. With RAIDZ splitting ZIOs between vdevs into smaller ones, they are good candidates for aggregation.
The original thinking was to keep small-ish metadata reads, which were not aggregated, out of the latency samples.
Do we know if in practice this extra filtering is really helpful? If it is, great, let's keep it and add the reasoning about avoiding small-ish metadata reads to the comment. If not, perhaps we should drop this artificial two-sector limit.
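For reference, a minimal standalone sketch of the size threshold being debated; the `ashift` values below are only illustrative, not taken from the PR:

```c
#include <stdio.h>
#include <stdint.h>

int
main(void)
{
	/* ashift 9 = 512 B sectors, ashift 12 = 4 KiB sectors */
	uint64_t ashifts[] = { 9, 12 };

	for (int i = 0; i < 2; i++) {
		/* reads smaller than two sectors would be filtered out */
		uint64_t two_sectors = 2ULL << ashifts[i];
		printf("ashift=%llu: threshold=%llu bytes\n",
		    (unsigned long long)ashifts[i],
		    (unsigned long long)two_sectors);
	}
	return (0);
}
```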
module/zfs/vdev_raidz.c
Outdated
latency_sort(lat_data, samples);
uint64_t fence = latency_quartiles_fence(lat_data, samples);
if (lat_data[samples - 1] > fence) {
	/*
	 * Keep track of how many times this child has had
	 * an outlier read. A disk that persistently has a
	 * higher outlier count than its peers will be
	 * considered a slow disk.
	 */
	atomic_add_64(&svd->vdev_outlier_count, 1);
With a small number of children and only one random sample from each, I really doubt this math can be statistically meaningful.
The switch to using a weighted moving average, I believe, addresses the concern here.
Initial thoughts: This is going to be really useful on large pools. We often see SCSI disks "lose their mind" for a few minutes in our larger pools, tanking performance. It's nice that this feature will basically read and rebuild the data from parity first rather than attempt to read from the bad disk. I can see us using this in production.
- Will disks that are "sitting out" still be read during a scrub?
- If a disk is sitting out, and a read is unable to reconstruct from the other "healthy" disks, will ZFS read from the sitting-out disk as a last-ditch effort?
- Could this work for mirrors as well?
- What happens if the algorithm wants to sit out more disks in a raid group than there is parity?
- What happens if a pool has dissimilar disks in its raid group? Like, say, 7 NVMe and 1 HDD. Will it sit out the HDD since it will be so much slower than the other drives? That might actually be a nice side effect...
- It would be good to see which disks are currently sitting out via a vdev property. I would expect users would want to see this for troubleshooting performance problems and locating sick disks. I can see users wanting to manually set it on sick disks as well:
You could tell if ... Alternatively you could name the property ...
Latest changes...
Yes. Both resilver and scrub reads will not sit out.
Possibly, but that would require a different outlier detection.
Currently only one outlier sit-out at a time is allowed.
Yes, it would likely end up sitting out the HDD since it will be an outlier. I did test this exact scenario and could hear the sit out period. 😃
I was using ...
A read-only vdev property seems like the way to go. That would provide some additional visibility into when a drive is sitting out. I don't think we'd want to make it writable for exactly the reasons you mentioned.
Added a read-only `sit_out_reads` vdev property.
Force-pushed from 1140ee3 to 3965f73.
Some recent feedback changes. Rebased to latest master and squashed commits.
I gave this a try today using the script below. I was able to sit out the disk I delayed, but I wasn't able to remove the sit-out state afterwards. Maybe I need non-ARC reads to trigger the sitting-out -> not-sitting-out logic?
#!/bin/bash
# Run this from a zfs git workspace
#
# shorten sit out period for testing
sudo bash -c 'echo 5 > /sys/module/zfs/parameters/raidz_read_sit_out_secs'
truncate -s 200M file{0,1,2,3,4,5,6,7,8,9}
sudo ./zpool create tank raidz2 `pwd`/file{0..9}
sudo dd if=/dev/urandom bs=1M count=100 of=/tank/bigfile
sudo ./zpool export tank
sudo ./zpool import -d . tank
echo "Initial state"
# We should NOT be sitting out yet
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9
# Add 500ms delay on last disk
sudo ./zinject -d `pwd`/file9 -D500:1 tank
# Do some reads
sudo dd if=/tank/bigfile of=/dev/null bs=4k
echo "Should be sitting out"
# We should be sitting out
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9
# Clear fault injection
sudo ./zinject -c all
echo "wait for us to stop sitting out part 1"
sleep 6
# Are we still sitting out?
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9
# Do some more reads to see if we can trigger the vdev to stop sitting out
sudo dd if=/tank/bigfile of=/dev/null bs=4k
echo "wait for us to stop sitting out part 2"
sleep 6
# Are we still sitting out?
sudo ./zpool get -Hp sit_out_reads tank `pwd`/file9
Force-pushed from 3965f73 to 77e4878.
module/zfs/vdev_raidz.c
Outdated
* A zio->io_delay value of zero means this IO was part of
* an aggregation.
*/
if (zio->io_type == ZIO_TYPE_READ && zio->io_error == 0 &&
What do you think about adding `raidz_read_sit_out_secs != 0` to this check, to skip calculating the EWMA when this detection is disabled? It might help a little.
Good idea. Added
* A zio->io_delay value of zero means this IO was part of
* an aggregation.
*/
- if (zio->io_type == ZIO_TYPE_READ && zio->io_error == 0 &&
- zio->io_size > 0 && zio->io_delay != 0) {
+ if (raidz_read_sit_out_secs != 0 && zio->io_type == ZIO_TYPE_READ &&
+ zio->io_error == 0 && zio->io_size > 0 && zio->io_delay != 0) {
vdev_t *vd = zio->io_vd;
uint64_t previous_ewma = atomic_load_64(&vd->vdev_ewma_latency);
if (previous_ewma == 0)
A single slow responding disk can affect the overall read performance of a raidz group. When a raidz child disk is determined to be a persistent slow outlier, then have it sit out during reads for a period of time. The raidz group can use parity to reconstruct the data that was skipped.

Each time a slow disk is placed into a sit out period, its `vdev_stat.vs_slow_ios` count is incremented and a zevent class `ereport.fs.zfs.delay` is posted.

The length of the sit out period can be changed using the `raid_read_sit_out_secs` module parameter. Setting it to zero disables slow outlier detection.

Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
Signed-off-by: Don Brady <[email protected]>
Force-pushed from 77e4878 to 64d5138.
A response said that resilver reads would not be affected. I'd like to suggest that this may be a bad idea. We currently have a RAIDZ2 that is rebuilding. One of the disks is very, very slow but still responds. It's really slowing down the rebuild. I'd like the resilver to avoid reading from it. For a scrub I can see that you want to read everything, but I'm not so sure about resilver.
@@ -501,6 +501,18 @@ For testing, pause RAID-Z expansion when reflow amount reaches this value.
.It Sy raidz_io_aggregate_rows Ns = Ns Sy 4 Pq ulong
For expanded RAID-Z, aggregate reads that have more rows than this.
.
.It Sy raidz_read_sit_out_secs Ns = Ns Sy 600 Ns s Po 10 min Pc Pq ulong
All references to 'raidz' should be renamed to 'raid', since this works with dRAID (and to match the `raid_read_sit_out_secs` name in the commit message).
It was originally; I asked Don to rename it "raidz" in this #16900 (comment) so it's consistent with some other existing names. Perhaps the best thing to do would be to rename it `read_sit_out_secs`. Then we could also generically apply it to mirrors if someday support is added for that.
Can also be incremented when a vdev was determined to be a raidz leaf latency outlier.
I'm not sure I understand what you mean by this. Who would be incrementing `zio_slow_io_ms`?
I agree with this - we would want the resilver to finish as quickly as possible. That would mean sitting out the sick disk if we could reconstruct from parity.
@@ -104,12 +104,19 @@ Comma separated list of children of this vdev
The number of children belonging to this vdev
.It Sy read_errors , write_errors , checksum_errors , initialize_errors , trim_errors
The number of errors of each type encountered by this vdev
.It Sy sit_out_reads
I would rename it to just `sit_out` here and in other places to be brief. We could use it for other things too. For writes we now have the ability to mark top-level vdevs as non-allocating in some cases, and maybe we could converge those two at some point.
if (io_flags & (ZIO_FLAG_SCRUB | ZIO_FLAG_RESILVER))
	return (B_FALSE);

return (vd->vdev_read_sit_out_expire >= gethrtime());
With `raidz_read_sit_out_secs` measured in seconds it makes no sense to measure `vdev_read_sit_out_expire` in nanoseconds. `gethrtime()` might be quite expensive on old hardware. `gethrestime_sec()` or something similar would be more than enough.
Good point. I'll change it to use `gethrestime_sec()`.
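For illustration only, here is a minimal userspace sketch of a seconds-granularity expiry check; `time()` stands in for `gethrestime_sec()`, and the struct and field names are hypothetical, not the kernel's:

```c
#include <stdbool.h>
#include <stdint.h>
#include <time.h>

/* Hypothetical stand-in for the per-vdev field, stored in seconds. */
struct fake_vdev {
	uint64_t sit_out_expire;	/* 0 means not sitting out */
};

/* Begin a sit-out period lasting sit_out_secs seconds. */
static void
sit_out_begin(struct fake_vdev *vd, uint64_t sit_out_secs)
{
	vd->sit_out_expire = (uint64_t)time(NULL) + sit_out_secs;
}

/* True while the sit-out period has not yet expired. */
static bool
sit_out_active(const struct fake_vdev *vd)
{
	return (vd->sit_out_expire != 0 &&
	    vd->sit_out_expire >= (uint64_t)time(NULL));
}

int
main(void)
{
	struct fake_vdev vd = { 0 };
	sit_out_begin(&vd, 600);	/* default of 10 minutes */
	return (sit_out_active(&vd) ? 0 : 1);
}
```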
/*
 * Scale using 16 bits with an effective alpha of 0.50
 */
const uint64_t scale = 16;
const uint64_t alpha = 32768;

return (((alpha * latest_value) + (((1ULL << scale) - alpha) *
    previous_ewma)) >> scale);
I don't understand why you need a 16-bit scale to implement an alpha of 0.5 (you should still have a few orders of magnitude before overflow, but it is just unnecessary). And why is the alpha so big? I would consider a more significant dampening factor, considering we are talking about a decision that will affect us for the next 10 minutes.
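To make the fixed-point discussion concrete, a small standalone sketch of the EWMA update with two different alphas; the second alpha (0.125) is only an example of heavier dampening, not a value from the PR:

```c
#include <stdio.h>
#include <stdint.h>

/*
 * Fixed-point EWMA: alpha is expressed in units of 1/(1 << scale),
 * so with scale = 16, alpha = 32768 is 0.5 and alpha = 8192 is 0.125.
 */
static uint64_t
ewma_update(uint64_t prev, uint64_t latest, uint64_t alpha, uint64_t scale)
{
	return (((alpha * latest) +
	    (((1ULL << scale) - alpha) * prev)) >> scale);
}

int
main(void)
{
	/* A latency step from ~1 ms to ~10 ms, in nanoseconds. */
	uint64_t samples[] = { 1000000, 10000000, 10000000, 10000000 };
	uint64_t fast = samples[0], damped = samples[0];

	for (int i = 1; i < 4; i++) {
		fast = ewma_update(fast, samples[i], 32768, 16);   /* 0.5 */
		damped = ewma_update(damped, samples[i], 8192, 16); /* 0.125 */
		printf("step %d: alpha=0.5 -> %llu ns, alpha=0.125 -> %llu ns\n",
		    i, (unsigned long long)fast, (unsigned long long)damped);
	}
	return (0);
}
```

The smaller alpha reacts more slowly to the latency step, which is the dampening trade-off the comment raises.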
hrtime_t now = gethrtime();
uint64_t last = atomic_load_64(&vd->vdev_last_latency_check);

if ((now - last) < MSEC2NSEC(raid_outlier_check_interval_ms) ||
Same as above, I think this could be switched to seconds.
vdev_child_slow_outlier(zio_t *zio)
{
	vdev_t *vd = zio->io_vd;
	if (raidz_read_sit_out_secs == 0 || vd->vdev_children < LAT_SAMPLES_MIN)
While some limit on the number of samples makes sense, I am not sure it makes sense to apply it to the number of children. A 3-wide RAIDZ1 may also have the same problems, though maybe there we could use a wider safety interval. If anywhere, the limit should be applied to the number of samples we have averaged for each child before we can trust any statistics. Though averaging hides the dispersion, so it might need more thinking about how to account for it better.
spa_t *spa = zio->io_spa;
if (spa_load_state(spa) == SPA_LOAD_TRYIMPORT ||
    spa_load_state(spa) == SPA_LOAD_RECOVER ||
    (spa_load_state(spa) != SPA_LOAD_NONE &&
    spa->spa_last_open_failed)) {
This block is executed every time in production for not much reason.
You're right. I'll remove it.
}

int samples = vd->vdev_children;
uint64_t data[LAT_SAMPLES_STACK];
I hope we have these 512 bytes of stack to waste.
It's more than I'd like, but my feeling is it should be alright. It does raise a few questions though.
- Does the `kmem_alloc()` really hurt performance given that we call this at most once per `raid_outlier_check_interval_ms`?
- If it really is that expensive, it looks like we could allocate this buffer once and stash a pointer to it in the top-level raidz/draid vdev (see the sketch below). The `atomic_cas_64(&vd->vdev_last_latency_check)` check on line 2923 should prevent concurrent access, and this would get rid of the alloc in all cases. We'd definitely want to add a comment explaining this.
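A minimal userspace sketch of the allocate-once idea from the second bullet above; the struct and function names are hypothetical, and exclusive access is simply assumed to come from the interval CAS described:

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical stand-in for the top-level raidz/draid vdev. */
struct fake_top_vdev {
	uint64_t nchildren;
	uint64_t *lat_samples;	/* lazily allocated, freed on vdev close */
};

/*
 * Return the per-child latency scratch buffer, allocating it on first
 * use.  Exclusive access is assumed to be guaranteed by the caller
 * (the CAS on the last-check timestamp), so no locking is done here.
 */
static uint64_t *
lat_samples_get(struct fake_top_vdev *tvd)
{
	if (tvd->lat_samples == NULL) {
		tvd->lat_samples =
		    calloc(tvd->nchildren, sizeof (uint64_t));
	}
	return (tvd->lat_samples);
}

int
main(void)
{
	struct fake_top_vdev tvd = { .nchildren = 10, .lat_samples = NULL };
	uint64_t *buf = lat_samples_get(&tvd);
	int rc = (buf == NULL);

	free(tvd.lat_samples);
	return (rc);
}
```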
64 was chosen for the benefit of the draid path. Stack function depth is bounded and not that deep (below). But still, 512 is big and perhaps outside the spirit of being a good stack usage citizen.
vdev_child_slow_outlier
vdev_raidz_io_done
zio_vdev_io_done
zio_execute
taskq_thread
q3 = latency_median_value(&data[(n+1) >> 1], n>>1);

uint64_t iqr = q3 - q1;
uint64_t fence = q3 + iqr;
In the comment you write `fence = Q3 + 2 x (Q3 - Q1)`, but I don't see the `2x` here.
Looking back across branches not sure how it mutated. I'll put it back to:
return (q3 + ((q3 - q1) << 1));
if (atomic_add_64_nv(&svd->vdev_outlier_count, 1) >
    LAT_OUTLIER_LIMIT && svd == ovd &&
    svd->vdev_read_sit_out_expire == 0) {
I am not sure why you care about the vdev with the maximum number of outliers here, considering the first of them reaching `LAT_OUTLIER_LIMIT` will reset all to 0? Also, since this code should be executed only by one thread, I am not sure `vdev_outlier_count` needs atomic accesses after the `atomic_cas_64()` above. Also, `vdev_read_sit_out_expire` should always be 0 here, since you've checked that above.
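For readers following the thread, a simplified single-threaded sketch of the trigger logic being discussed; names like `fake_child` and the exact reset behavior are illustrative, not the actual implementation:

```c
#include <stdbool.h>
#include <stdint.h>

#define	LAT_OUTLIER_LIMIT	50

struct fake_child {
	uint64_t outlier_count;
	uint64_t sit_out_expire;	/* seconds; 0 = not sitting out */
};

/*
 * Called when child 'slow' was the latency outlier this round.
 * Returns true if that child should now start a sit-out period.
 */
static bool
outlier_noted(struct fake_child *children, int nchildren, int slow)
{
	children[slow].outlier_count++;

	/* Only trip when this child leads its peers in outlier count. */
	int leader = 0;
	for (int c = 1; c < nchildren; c++) {
		if (children[c].outlier_count >
		    children[leader].outlier_count)
			leader = c;
	}

	if (children[slow].outlier_count > LAT_OUTLIER_LIMIT &&
	    slow == leader && children[slow].sit_out_expire == 0) {
		/* Reset everyone so counting starts fresh next round. */
		for (int c = 0; c < nchildren; c++)
			children[c].outlier_count = 0;
		return (true);
	}
	return (false);
}

int
main(void)
{
	struct fake_child kids[4] = { { 0 } };
	bool tripped = false;

	/* Child 3 is repeatedly the outlier until it trips the limit. */
	for (int i = 0; i < 60 && !tripped; i++)
		tripped = outlier_noted(kids, 4, 3);
	return (tripped ? 0 : 1);
}
```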
@don-brady I added a test case in my https://github.com/tonyhutter/zfs/tree/slow_vdev_sit_out branch. Commit is here: tonyhutter@8a58493. Would you mind pulling it into your patch stack? I have run it locally in a VM but have not tested it on the qemu runners yet.
Motivation and Context
There is a concern, which has been observed in practice, that a slow disk can bring down the overall read performance of raidz. Currently in ZFS, a slow disk is detected by comparing the disk read latency to a custom threshold value, such as 30 seconds. This can be tuned to a lower threshold, but that requires understanding the context in which it will be applied, and hybrid pools can have a wide range of expected disk latencies.
A better approach might be to identify the presence of a slow disk outlier based on its latency distance from the latencies of its peers. This offers a more dynamic solution that can adapt to different types of media and workloads.
Description
The solution proposed here comes in two parts:
Detecting Outliers
The most recent latency value for each child is saved in the `vdev_t`. Then periodically, the samples from all the children are sorted and a statistical outlier can be detected if present. The code uses a Tukey's fence, with K = 2, for detecting extreme outliers. This rule defines extreme outliers as data points outside the fence of the third quartile plus two times the Interquartile Range (IQR). This range is the distance between the first and third quartile.
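As a concrete illustration of the fence computation described above (not the kernel code), a small standalone sketch with made-up latency samples:

```c
#include <stdio.h>
#include <stdint.h>
#include <stdlib.h>

static int
cmp_u64(const void *a, const void *b)
{
	uint64_t x = *(const uint64_t *)a, y = *(const uint64_t *)b;
	return ((x < y) ? -1 : (x > y));
}

/* Median of a sorted slice of n values. */
static uint64_t
median(const uint64_t *v, int n)
{
	return ((n % 2) ? v[n / 2] : (v[n / 2 - 1] + v[n / 2]) / 2);
}

int
main(void)
{
	/* Per-child latency samples in microseconds; one slow outlier. */
	uint64_t lat[] = { 180, 210, 200, 190, 220, 205, 195, 2500 };
	int n = sizeof (lat) / sizeof (lat[0]);

	qsort(lat, n, sizeof (uint64_t), cmp_u64);

	uint64_t q1 = median(lat, n / 2);		/* lower half */
	uint64_t q3 = median(&lat[(n + 1) / 2], n / 2);	/* upper half */
	uint64_t iqr = q3 - q1;
	uint64_t fence = q3 + 2 * iqr;			/* K = 2 */

	printf("Q1=%llu Q3=%llu IQR=%llu fence=%llu\n",
	    (unsigned long long)q1, (unsigned long long)q3,
	    (unsigned long long)iqr, (unsigned long long)fence);
	printf("slowest sample %llu %s an extreme outlier\n",
	    (unsigned long long)lat[n - 1],
	    lat[n - 1] > fence ? "is" : "is not");
	return (0);
}
```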
Sitting Out
After a vdev has encountered multiple outlier detections (> 50), it is marked for being in a sit out period that by default lasts for 10 minutes.
Each time a slow disk is placed into a sit out period, its `vdev_stat.vs_slow_ios` count is incremented and a zevent class `ereport.fs.zfs.delay` is posted.
The length of the sit out period can be changed using the `raid_read_sit_out_secs` module parameter. Setting it to zero disables slow outlier detection.
Sponsored-by: Klara, Inc.
Sponsored-by: Wasabi Technology, Inc.
How Has This Been Tested?
Tested with various configs, including dRAID.
For an extreme example, an HDD was used in an 8-wide SSD raidz2 and it was compared to taking the HDD offline. This was using a `fio(1)` streaming read workload across 4 threads to 20GB files. Both the record size and IO request size were 1MB.
Also measured the cost over time of vdev_child_slow_outlier() where the statistical analysis occurs (every 20ms).