Implement `statx(STATX_DIOALIGN)` so applications can discover correct `O_DIRECT` alignment #16972

robn · 2025-01-22T06:10:46Z

Motivation and Context

statx(STATX_DIOALIGN) is a Linux-specific syscall to let an application discover if direct IO (O_DIRECT) to a file is possible and, if so, what the alignment requirements are. This is useful since Linux does not mandate what happens when an unaligned access is requested; the filesystem may return EINVAL (as OpenZFS does), or may fall back to a non-aligned write.

This PR wires this call up to OpenZFS.

Description

Adds a ZPL function, zfs_get_direct_alignment(), to return the direct read and write alignment for the given open znode;
Extends zpl_getattr_impl to handle STATX_DIOALIGN by filling it out with information derived from zfs_get_direct_alignment();
Adds a ZTS helper program to call statx(2) requesting specific fields and display the results;
Adds a ZTS test to use the program to check the values are returned correctly for various combinations of direct=, recordsize= and object size and structure.

Notes

STATX_DIOALIGN is required to return two alignment values:

dio_mem_align: the required alignment for a userspace memory region used in O_DIRECT calls;
dio_offset_align: the required alignment for DMA to the underlying storage device.

Both set to 0 indicates that direct IO is not possible on the file. If direct IO is possible, both must be non-zero. It's not valid to set one to zero and the other non-zero.

Since its a stat-like query, it's not even necessary that the file be open, so our decision on whether or not direct is possible is simply checking that DIO would be possible on this dataset: zfs_dio_enabled set, and direct=always or standard.

OpenZFS has different alignment requirements for reads and writes, which can't be expressed through this interface, so I've taken the larger of the read and write alignment, so at least the caller gets something that is guaranteed to work for both cases.

Write alignment for direct IO is documented to be recordsize. This is implemented as z_blksz, which makes sense, but presents a conundrum. If an object is only one L0 block, then z_blksz may be smaller than the recordsize right now, because it will grow if more data is added. If we only returned z_blksz, then it would be possible for the apparent alignment requirement to change after a single write, which seems like it would make things difficult for an application that wanted to get the alignment once, when the file is opened, and then go to work. So, I've simply made it that if the object is still in the first block and the z_blksz is less than the dataset's z_max_blksz (ie recordsize), we use the dataset max instead.

I'm not sure if this really lines up with what really happens in zfs_setup_direct(), but it seems that it falling out as the recordsize most of the time is probably what's expected, and honestly, direct IO for files smaller than recordsize is kind of a misuse of the whole thing? If anyone knows a program that uses this STATX_DIOALIGN, I'd like to find out what it actually expects.

For the IO alignment, I've just gone with the pool ashift. Again, without knowing how its used I don't really know, but it seems like the closest to not outright lying, even if we don't ever pass user pages down to IO devices. Possibly we should return the same value for both, given that an arbitrary 4K alignment probably doesn't work for direct access to a 128K record, so we don't really want anything to choose it.

How Has This Been Tested?

New test included.

Test passes on Linux 6.6.x and 6.1.x. Test skipped on Linux 5.4.x and 4.19.x. Test also skipped on FreeBSD 14.2.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

This statx(2) mask returns the alignment restrictions for O_DIRECT access on the given file. We're expected to return both the user memory alignment and the IO DMA alignment for the underlying device. These concepts don't map perfectly to OpenZFS, so we choose values that are guaranteed to work. - for memory alignment, the object's block size, or if the object is still in the first block, the maximum size it could grow to (ie the dataset's recordsize) - for IO alignment, the pool minimum ashift. This value is largely irrelevant since its intended to indicate the DMA alignment, which doesn't make sense for OpenZFS. We have to return something though, and this value seems logically correct. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>

A standalone helper program to call statx(2) with various masks, and display the returned value, or fail if the kernel does not support it. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>

Uses the statx helper to test the results of the STATX_DIOALIGN request as we manipulate DIO enable, dataset recordsize and file size and structure. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>

robn · 2025-01-22T06:16:12Z

@gamanakis @hanneskasp FYI. Hannes, I'd love any feedback from your development team; it's hard to guess what applications are expecting here. Thank you!

hanneskasp · 2025-01-22T06:38:23Z

Hello,
we try to solve the O_DIRECT misalignment problem mentioned here #16953 (there is an strace about what fails today in #16953 (comment))

we use the following code for testing today

#include <iostream>
#define _GNU_SOURCE
#include <fcntl.h>
#include <sys/stat.h>

int main(int argc, char* argv[])
{
        struct statx statxbuf{};
        if (statx(0, argv[1], 0, STATX_DIOALIGN, &statxbuf))
        {
                perror("Cannot execute statx");
                return -1;
        }
        std::cout << "mask" << statxbuf.stx_mask << " memalign: " << statxbuf.stx_dio_mem_align << " offsetalign: " << statxbuf.stx_dio_offset_align << std::endl;
        return 0;
}

Server 1

Rocky Linux 9.5 with 5.14.0-503.21.1.el9_5.x86_64 (we know that statx needs kernel 6.1, but I did not find a quick way to get the mainline kernel working with ZFS on Rocky 9.5 and it seems to work fine for XFS)
zfs-2.3.0-1 & zfs-kmod-2.3.0-1
[root@rocky-zfs23 ~]# ./a.out /etc/passwd (this is XFS)
mask14335 memalign: 512 offsetalign: 512
[root@rocky-zfs23 ~]# ./a.out /datareduction/recordsize16k/VBR-FK/BJ-HK-zfs-fastdedup_DC01/HK-DC01_7B18C.vbm
mask6143 memalign: 0 offsetalign: 0

Server 2

Ubuntu 24.04 with 6.8.0-51-generic kernel
zfs-2.2.2-0ubuntu9.1 and zfs-kmod-2.2.2-0ubuntu9.1 root@ubuntu:# ./a.out /etc/passwd (this is ext4)
mask14335 memalign: 4 offsetalign: 512
root@ubuntu:# ./a.out /alignment/test.txt
mask6143 memalign: 0 offsetalign: 0

Does that help to answer the question (I'm not a developer and would bring one in if needed)?

Best regards
Hannes

gamanakis · 2025-01-22T12:59:37Z

@hanneskasp OpenZFS 2.3.0 does not include the present pull-request, that's why memalign=0 and offsetalign=0 in your example. In order for your program to output some meaningful memalign and offsetalign you have to apply this pull-request locally and then compile OpenZFS.

@robn is actually asking how you would use memalign and offsetalign, what values do you expect?

hanneskasp · 2025-01-22T13:43:25Z

quoting my colleague:

We're expecting values that will tell us the needed alignment for correct execution of direct IO requests. We will align the memory address of the memory buffer we're passing to the request, offset of the data for IO and size of the data for IO according to these values.

Does that help?

The plan is to continue testing once the PR is in the 2.3 RPM (the main testing system is Rocky Linux 9 with ZFS 2.3)

gamanakis · 2025-01-22T15:45:50Z

@robn on my veeam backup OpenZFS 2.3.0 filesystem with 128k recordsize, with this PR applied and using the script from @hanneskasp I get:

mask14303 memalign: 131072 offsetalign: 4096

amotin · 2025-01-22T16:27:00Z

@robn May be I misremember something, but the patch looks weird to me. I'd expect you to statically set dio_mem_align to PAGE_SIZE, since Direct I/O code strictly requires all buffers to be page-aligned/-multiple. dio_offset_align is indeed tricky since it is different for reads and writes. For reads it can be anything page-aligned/-multiple, but preferably the file block size to avoid partial reads. For writes should be a full block size to not fall back to ARC, but as you pointed the block size might change. I think we should return there the file's current block size if the file has more than one block, or either recordsize or the file's current block size (rounded to the next power of 2 if needed) if it is bigger, but in either case no less than PAGE_SIZE.

tonyhutter · 2025-01-22T18:39:25Z

module/os/linux/zfs/zpl_inode.c

+		} else
+			stat->dio_mem_align = stat->dio_offset_align = 0;
+		stat->result_mask |= STATX_DIOALIGN;


I don't know if this is the right answer, but does it make more sense to just not set the STATX_DIOALGIN bit if DIO is off?

I'm concerned people will call statx(), check that STATX_DIOALIGN was set in the response, and then blindly posix_memalign() a buffer using the dio_mem_align value (which would be 0 if DIO is disabled). Considering some dataset could have DIO on, and others off, it makes it so the user has to check both STATX_DIOALIGN is set, and dio_mem_align != 0 if they want to proceed.

It's a little awkward with the statx iterface, because there's no easy way to say "I know you're querying STATX_DIOALIGN, and I do support DIO in general and know how to respond to that, but it's not supported on this particular dataset".

https://manpages.debian.org/bookworm/manpages-dev/statx.2.en.html explicitly tells "or 0 if direct I/O is not supported on this file.". So the code looks right to me.

I agree dio_mem_align should be 0 if DIO is disabled. I don't know if that necessarily means the STATX_DIOALGIN bit should be set though:

If a filesystem does not support a field or if it has an unrepresentable value (for instance, a file with an exotic type), then the mask bit corresponding to that field will be cleared in stx_mask even if the user asked for it and a dummy value will be filled in for compatibility purposes if one is available (e.g., a dummy UID and GID may be specified to mount under some circumstances).

It gets down to the definition what "does not support" means. Does ZFS "not support" the field if DIO is disabled? I would lean towards yes, since I'm guessing most users would look at the bit as a "can I do DIO or not" flag.

For what it's worth, the Linux Test Project is marking the combination of "STATX_DIOALIGN bit set + stx_dio_mem_align == 0" as a test failure:
https://github.com/linux-test-project/ltp/blob/master/testcases/kernel/syscalls/statx/statx10.c#L39

"the Linux Test Project is marking" -- it feels wrong to me, but whatever, I am not going to fight it.

I agree with @tonyhutter on this one. It is fine to set them to 0, but the result_mask should not have STATX_DIOALIGN set if it is disabled or zfs_dio_enabled=0. I think that makes the most sense reading the documentation and looking al the LTP stuff.

amotin · 2025-01-22T19:38:17Z

What also makes me wonder is how this API supposed to be used for creating and writing new files? I wonder if applications in that case call it on directory, in which case we might not care about specific zp properties, but only the dataset in general? Or it specifies name of still absent file, in which case I wonder what call ZFS will receive?

bwatkinson · 2025-01-22T20:49:56Z

module/os/linux/zfs/zpl_inode.c

+			 * really do that, but we have to return something
+			 * non-zero, and this seems reasonably correct.
+			 */
+			stat->dio_offset_align = MAX(PAGE_SIZE, 1 <<


As @amotin stated, the offset alignment will always be PAGE_SIZE. There was a point were @behlendorf and I considered just using ashift, but the plumbing got a bit messy from what I recall. So this should always just be set to PAGE_SIZE for dio_offset_align.

bwatkinson · 2025-01-22T20:50:34Z

module/zfs/zfs_vnops.c

+	if (!zfs_dio_enabled || zfsvfs->z_os->os_direct == ZFS_DIRECT_DISABLED)
+		return (SET_ERROR(EOPNOTSUPP));
+
+	/* Read only needs to be page-aligned */


The comment is not correct. All read and writes must be PAGE_SIZE aligned.

bwatkinson · 2025-01-22T21:08:35Z

module/zfs/zfs_vnops.c

+		if (zp->z_size <= zp->z_blksz)
+			*walignp = MAX(zp->z_blksz, zfsvfs->z_max_blksz);
+		else
+			*walignp = zp->z_blksz;
+		*walignp = MAX(PAGE_SIZE, *walignp);


This is a bit tricky as recordsize aligned write buffers will only be issued as Direct I/O writes. The good news with the growing blocksize is the first block of a file is never written using O_DIRECT as we do not allow it in that case anyways. See the o_direct_defer variable in zfs_write().

In the write case for both direct=standard and direct=always it will just silently pass it off to the ARC if the buffer is not recordsize aligned. However, the only true restrictions on alignment is PAGE_SIZE to not return EINVAL with direct=standard. The recordsize alignment is to avoid the RMW that would be required with checksums. For reads, the only alignment requirements are PAGE_SIZE. So I guess the question here is what does statx really say about the offset? If it is alignment, then PAGE_SIZE should be used for both dio_mem_align and dio_offset_align. I could see how this would trip a user up though if they are issuing a write with a 64k page-aligned buffer to a 128k recordsize block and thinking it is going to do an Direct I/O write (it would fall into the ARC silently in this case); however, EINVAL would still not be returned as an error. I just wonder if it would misleading to set dio_offset_align to the blocksize? In the case of reads this is not at all true. If a partial block is read using O_DIRECT (and the buffer is PAGE_SIZE aligned) then the entire block will be read to validate the checksum. I guess my question is, are we saying dio_offset_align is the optimal offset alignment to guarantee Direct I/O writes take place and no inflated blcocksize Direct I/O reads will occur? If so, then I guess we stick with that value even if it is misleading for a read operation.

In either case, we are only strict that the buffer memory and buffer size must be PAGE_SIZE in order to not return EINVAL with direct=standard.

are we saying dio_offset_align is the optimal offset alignment to guarantee Direct I/O writes take place

Yes, exactly this. What I think we want to do is advertise the optimal offset size to ensure Direct I/O can be performed without incurring any RMW penalty on write. Otherwise, you'd probably be better of going through the ARC anyway. There's no harm in accepting page-aligned offsets, bit it seems best not to recommend it via statx.

This does potentially make reads a little misleading but it shouldn't significantly impact performance. Under the covers we're always going to be reading the full block from disk anyway to validate the checksum. Which I believe boils down to:

dio_mem_align = PAGE_SIZE; dio_offset_align = MAX(zp->z_blksz, PAGE_SIZE);

robn added 3 commits January 22, 2025 13:01

ZTS: statx test helper

981b1e0

A standalone helper program to call statx(2) with various masks, and display the returned value, or fail if the kernel does not support it. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>

ZTS: test for STATX_DIOALIGN

3775cf2

Uses the statx helper to test the results of the STATX_DIOALIGN request as we manipulate DIO enable, dataset recordsize and file size and structure. Sponsored-by: https://despairlabs.com/sponsor/ Signed-off-by: Rob Norris <[email protected]>

robn requested review from amotin and bwatkinson January 22, 2025 06:11

tonyhutter reviewed Jan 22, 2025

View reviewed changes

behlendorf added the Status: Code Review Needed Ready for review and testing label Jan 22, 2025

bwatkinson reviewed Jan 22, 2025

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Implement `statx(STATX_DIOALIGN)` so applications can discover correct `O_DIRECT` alignment #16972

Implement `statx(STATX_DIOALIGN)` so applications can discover correct `O_DIRECT` alignment #16972

robn commented Jan 22, 2025

robn commented Jan 22, 2025

hanneskasp commented Jan 22, 2025

gamanakis commented Jan 22, 2025

hanneskasp commented Jan 22, 2025

gamanakis commented Jan 22, 2025

amotin commented Jan 22, 2025 •

edited

Loading

tonyhutter Jan 22, 2025

amotin Jan 22, 2025

tonyhutter Jan 22, 2025

amotin Jan 22, 2025

bwatkinson Jan 23, 2025

amotin commented Jan 22, 2025

bwatkinson Jan 22, 2025 •

edited

Loading

bwatkinson Jan 22, 2025

bwatkinson Jan 22, 2025 •

edited

Loading

behlendorf Jan 23, 2025

Implement statx(STATX_DIOALIGN) so applications can discover correct O_DIRECT alignment #16972

Are you sure you want to change the base?

Implement statx(STATX_DIOALIGN) so applications can discover correct O_DIRECT alignment #16972

Conversation

robn commented Jan 22, 2025

Motivation and Context

Description

Notes

How Has This Been Tested?

Types of changes

Checklist:

robn commented Jan 22, 2025

hanneskasp commented Jan 22, 2025

gamanakis commented Jan 22, 2025

hanneskasp commented Jan 22, 2025

gamanakis commented Jan 22, 2025

amotin commented Jan 22, 2025 • edited Loading

tonyhutter Jan 22, 2025

Choose a reason for hiding this comment

amotin Jan 22, 2025

Choose a reason for hiding this comment

tonyhutter Jan 22, 2025

Choose a reason for hiding this comment

amotin Jan 22, 2025

Choose a reason for hiding this comment

bwatkinson Jan 23, 2025

Choose a reason for hiding this comment

amotin commented Jan 22, 2025

bwatkinson Jan 22, 2025 • edited Loading

Choose a reason for hiding this comment

bwatkinson Jan 22, 2025

Choose a reason for hiding this comment

bwatkinson Jan 22, 2025 • edited Loading

Choose a reason for hiding this comment

behlendorf Jan 23, 2025

Choose a reason for hiding this comment

Implement `statx(STATX_DIOALIGN)` so applications can discover correct `O_DIRECT` alignment #16972

Implement `statx(STATX_DIOALIGN)` so applications can discover correct `O_DIRECT` alignment #16972

amotin commented Jan 22, 2025 •

edited

Loading

bwatkinson Jan 22, 2025 •

edited

Loading

bwatkinson Jan 22, 2025 •

edited

Loading