
Direct IO Support #10018

Merged: 1 commit merged into openzfs:master from direct_page_aligned on Sep 14, 2024
Conversation


@bwatkinson bwatkinson commented Feb 18, 2020

Adding O_DIRECT support to ZFS.

Motivation and Context

By adding Direct IO support to ZFS, the ARC can be bypassed when issuing reads/writes.
There are certain cases where caching data in the ARC can decrease overall performance.
In particular, zpools composed of NVMe devices displayed poor read/write
performance due to the extra overhead of the memcpy's issued to the ARC.

There are also cases where caching in the ARC may not make sense such as when data
will not be referenced later. By using the O_DIRECT flag, unnecessary data copies to the
ARC can be avoided.

Closes Issue: #8381

Description

O_DIRECT support in ZFS will always ensure there is coherency between buffered and O_DIRECT IO requests.
This ensures that all IO requests, whether buffered or direct, will see the same file contents at all times. Just
as in other filesystems, O_DIRECT does not imply O_SYNC. While data is written directly to VDEV disks, metadata will
not be synced until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request sizes must, at a minimum, be PAGE_SIZE aligned.
If they are not, EINVAL is returned, except when the direct property is set to always.

For O_DIRECT writes:
The request must also be block aligned (recordsize) or the write request will take the normal (buffered) write path.
If the request is block aligned and a cached copy of the buffer exists in the ARC, that copy will be discarded
from the ARC, forcing all further reads to retrieve the data from disk.

For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. If the requested data is buffered
(in the ARC), it will simply be copied from the ARC into the user buffer.
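
For illustration, here is a minimal user-space sketch (not part of this PR) of an aligned O_DIRECT write and read. The path /tank/ds/testfile and the 128K recordsize are assumptions, and error handling is omitted for brevity:

#define _GNU_SOURCE	/* O_DIRECT on Linux/glibc */
#include <fcntl.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

int
main(void)
{
	const size_t recordsize = 128 * 1024;	/* assumed dataset recordsize */
	void *buf;
	int fd = open("/tank/ds/testfile", O_RDWR | O_CREAT | O_DIRECT, 0644);

	/* The buffer, offset, and length must all be at least PAGE_SIZE aligned. */
	if (posix_memalign(&buf, sysconf(_SC_PAGESIZE), recordsize) != 0)
		return (1);
	memset(buf, 0xA5, recordsize);

	/* recordsize-aligned write: eligible for the direct write path. */
	pwrite(fd, buf, recordsize, 0);

	/* PAGE_SIZE-aligned read: served directly, or copied from the ARC if cached. */
	pread(fd, buf, recordsize, 0);

	close(fd);
	free(buf);
	return (0);
}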

To ensure data integrity for all data written using O_DIRECT, all user pages are made stable in the event one
of the following is required:
Checksum
Compression
Encryption
Parity

By making the user pages stable, we make sure the contents of the user provided buffer can not be changed after
any of the above operations have taken place.
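
As an illustration of why stable pages matter (a hypothetical sketch, not code from this PR): if another thread modifies the user buffer while an O_DIRECT write is in flight, the data that lands on disk may no longer match the checksum or parity computed from it. The file path is an assumption.

#define _GNU_SOURCE
#include <fcntl.h>
#include <pthread.h>
#include <stdlib.h>
#include <string.h>
#include <unistd.h>

#define BUFSZ	(128 * 1024)

static char *buf;

static void *
scribbler(void *arg)
{
	(void) arg;
	memset(buf, 0xAB, BUFSZ);	/* modifies the buffer while the write may be in flight */
	return (NULL);
}

int
main(void)
{
	pthread_t t;
	int fd = open("/tank/ds/file", O_WRONLY | O_CREAT | O_DIRECT, 0644);

	posix_memalign((void **)&buf, sysconf(_SC_PAGESIZE), BUFSZ);
	memset(buf, 0xCD, BUFSZ);

	pthread_create(&t, NULL, scribbler, NULL);
	pwrite(fd, buf, BUFSZ, 0);	/* O_DIRECT write racing with the memset() above */
	pthread_join(t, NULL);

	close(fd);
	free(buf);
	return (0);
}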

A new dataset property direct has been added with the following 3
allowable values:

  • disabled - Accepts O_DIRECT flag, but silently ignores it and treats the request as a buffered IO request.

  • standard - Follows the alignment restrictions outlined above for write/read IO requests when the O_DIRECT flag is used.

  • always - Treats every write/read IO request as though it passed O_DIRECT. In the event the request is not page aligned, it will be redirected through the ARC. All other alignment restrictions are followed.

Direct IO does not bypass the ZIO pipeline, so checksums, compression, etc. are still fully
supported with Direct IO.
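
To make the standard semantics concrete, here is a hypothetical user-space fallback sketch (not from this PR): under direct=standard a misaligned request fails with EINVAL, so an application may retry through the buffered path, which is roughly what direct=always does internally.

#define _GNU_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <sys/types.h>
#include <unistd.h>

static ssize_t
write_with_fallback(int fd, const void *buf, size_t len, off_t off)
{
	ssize_t rc = pwrite(fd, buf, len, off);

	if (rc == -1 && errno == EINVAL) {
		/* Clear O_DIRECT and retry through the normal (buffered) path. */
		int flags = fcntl(fd, F_GETFL);
		(void) fcntl(fd, F_SETFL, flags & ~O_DIRECT);
		rc = pwrite(fd, buf, len, off);
	}
	return (rc);
}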

Some issues that still need to be addressed:

  • Create ZTS tests for O_DIRECT
  • Possibly allow for DVA throttle with O_DIRECT writes
  • Further testing/verification of FreeBSD (majority of debugging has been on Linux)
  • Possibly allow for O_DIRECT with zvols
  • Address race conditions in dbuf code with O_DIRECT

How Has This Been Tested?

Testing was primarily done using FIO and XDD with striped, mirror, raidz, and dRAID VDEV zpools.

Tests were performed on CentOS using various kernels (3.10, 4.18, and 4.20).

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Performance enhancement (non-breaking change which improves efficiency)
  • Code cleanup (non-breaking change which makes code smaller or more readable)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation (a change to man pages or other documentation)

Checklist:

  • My code follows the ZFS on Linux code style requirements.
  • I have updated the documentation accordingly.
  • I have read the contributing document.
  • I have added tests to cover my changes.
  • I have run the ZFS Test Suite with this change applied.
  • All commit messages are properly formatted and contain Signed-off-by.

@behlendorf behlendorf self-requested a review February 18, 2020 17:17
@behlendorf behlendorf added the Type: Feature Feature request or new feature label Feb 18, 2020
@behlendorf behlendorf added the Status: Work in Progress Not yet ready for general review label Feb 18, 2020
@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 2 times, most recently from 21eddef to 3464559 Compare February 19, 2020 20:06
@ahrens ahrens left a comment


I'd like to understand the use cases for the various property values. Could we do something simpler like:

directio=standard | always | disabled

where standard means: if you request DIRECTIO, we’ll do it directly if we think it’s a good idea (e.g. writes are recordsize-aligned), and otherwise we'll do the i/o non-directly (we won't fail it for poor alignment). This is the default.

always means act like DIRECTIO was always requested (may be actually direct or indirect depending on i/o alignment, won't fail for poor alignment).

disabled means act like DIRECTIO was never requested (which is the current behavior).

@behlendorf behlendorf left a comment


Couple quick comments.

@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 3 times, most recently from 865bfc2 to 428962a Compare February 25, 2020 00:25

codecov bot commented Feb 25, 2020

Codecov Report

Attention: Patch coverage is 63.17044% with 309 lines in your changes missing coverage. Please review.

Project coverage is 61.94%. Comparing base (161ed82) to head (04e3a35).
Report is 2509 commits behind head on master.

Current head 04e3a35 differs from pull request most recent head 76a8337

Please upload reports for the commit 76a8337 to get more accurate results.

Files Patch % Lines
module/zfs/dmu.c 51.01% 265 Missing ⚠️
module/os/linux/zfs/abd.c 88.30% 20 Missing ⚠️
module/zfs/dbuf.c 75.71% 17 Missing ⚠️
lib/libzpool/kernel.c 0.00% 5 Missing ⚠️
include/sys/abd.h 50.00% 2 Missing ⚠️
Additional details and impacted files
@@             Coverage Diff             @@
##           master   #10018       +/-   ##
===========================================
- Coverage   75.17%   61.94%   -13.24%     
===========================================
  Files         402      260      -142     
  Lines      128071    73582    -54489     
===========================================
- Hits        96283    45578    -50705     
+ Misses      31788    28004     -3784     
Flag Coverage Δ
kernel 51.01% <43.78%> (-27.75%) ⬇️
user 59.10% <59.33%> (+11.67%) ⬆️


@bwatkinson bwatkinson force-pushed the direct_page_aligned branch 3 times, most recently from f405fed to a6894d1 Compare February 28, 2020 00:27
Comment on lines 1761 to 1799
	zio = zio_write(pio, os->os_spa, txg, bp, data,
	    db->db.db_size, db->db.db_size, &zp,
	    dmu_write_direct_ready, NULL, NULL, dmu_write_direct_done, dsa,
	    ZIO_PRIORITY_SYNC_WRITE, ZIO_FLAG_CANFAIL, &zb);
Member


I'm a little concerned about bypassing zio_dva_throttle(). Background: slides 11-16 and video from my talk at BSDCAN 2016.

This means that DIRECTIO writes will be spread out among the vdevs using the old round-robin algorithm. This could potentially result in poor performance due to allocating from the slowest / most fragmented vdev, and we could potentially make the vdevs even more imbalanced (at least in terms of performance/fragmentation). @grwilson do you have any thoughts on this? How big could the impact be, and what are potential ways to mitigate it? Could we make this use the throttle?

@snajpa snajpa left a comment


Is such a large reorganization really needed? Couldn't things be solved by more prototypes at the beginning/in the header files? I'm just asking, because this will make debugging by git blame more difficult.


snajpa commented Jun 20, 2020

Overall, I have to say, thanks for taking this one on! This looks like it wasn't trivial to figure out.

With regards to zio_dva_throttle() and performance, I'd like to point to an older PR here: #7560 - so it looks like skipping it might have some justification. Ideally, IMHO, it would be best to leave it up to the user (i.e. configurable).

I'm excited about this PR; it looks to be a solid basis for supporting .splice_read()/.splice_write() for IO to/from pipes. I was looking at it this week because of vpsfreecz/linux@1a980b8 - with OverlayFS on top of ZFS, this patch makes all apps using sendfile(2) go tra-la. Issue about that one: #1156

Comment on lines +85 to +90
static inline boolean_t
zfs_dio_page_aligned(void *buf)
{
	return ((((unsigned long)(buf) & (PAGESIZE - 1)) == 0) ?
	    B_TRUE : B_FALSE);
}
Member


This function is under #ifdef _KERNEL in include/sys, since PAGESIZE may be unavailable in user-space, but it is still present here in libspl.

Contributor Author


I wound up doing a deep dive on this today, as I had forgotten why zfs_dio_page_aligned() was not originally put under a _KERNEL guard in uio_impl.h. Way back in the day, I updated zfs_context.h to include uio_impl.h for the kernel and uio.h for the user-space code. I had just completely forgotten about that. In the latest commit I went back and removed the uio_impl.h includes where they were not necessary. Also, there is user-space code in ztest that uses zfs_dio_page_aligned(): calls to dmu_read() -> dmu_read_impl(). This just exercises part of the Direct I/O code for reads, outside of the kernel pinning user pages. If you still believe there are places that are not covered by the guards, though, please let me know.

Member


My original point was that user-space is not supposed to have compile-time defines for PAGESIZE, etc., since the same user-space binary may run on systems (or just kernel configs) with different page sizes. If you look into lib/libzpool/abd_os.c, there are completely arbitrary defines for ABD_PAGESIZE and ABD_PAGEMASK.
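
A runtime check along these lines is one way to avoid the compile-time constant (a sketch under the assumption that user-space wants an equivalent of zfs_dio_page_aligned(); the helper name here is hypothetical):

#include <stdint.h>
#include <unistd.h>

static int
dio_page_aligned_runtime(const void *buf)
{
	long pagesize = sysconf(_SC_PAGESIZE);	/* page size of the running system */

	return (((uintptr_t)buf & (pagesize - 1)) == 0);
}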

Contributor Author


Ah, okay. I am open to any suggestions on how to resolve the issue ASAP. Do you have an idea of how to get this resolved in a better way?


@amotin amotin Sep 13, 2024


I hoped your answer would be: "Sure, we don't need it, since we do not implement Direct I/O in user-space; let's just stub it." But I see that you set DMU_DIRECTIO in a few places in ztest, so the question becomes: "what part of the Direct I/O functionality can and do we want to reasonably implement in user-space, and how can we give it a coherent notion of the PAGE* macros in all places that really need it?"

Member


I don't know how many platforms with PAGESIZE != 4096 in user-space we have that could cause trouble. From an ASAP perspective, I guess it might not be a show-stopper if we do not care about ztest there right now.

@xuyufenghz

Hello everyone, I tested the direct IO branch, and I found that there was no significant improvement in a 4K IOPS 100% random write scenario. We used 5 enterprise-class 6.4T U.2 NVMe drives, created a RAIDZ1, and then created a 1T zvol (4K block size, compression off, checksum off). Tested with fio, IOPS were in the 50,000-60,000 range with direct. I would like to ask whether this is normal?


amotin commented Sep 11, 2024

@xuyufenghz A 5-wide RAIDZ1 is not a configuration for measuring IOPS performance, nor for storing 4KB blocks efficiently. Still, 50-60K IOPS sounds low to me, but I can't say whether it is an attribute of direct I/O. With such small data blocks, direct I/O obviously can't save much on memory traffic, so the question is what your real bottleneck is.


tonyhutter commented Sep 11, 2024

@xuyufenghz just wanted to double check:

  1. Did you test with direct=always set on the dataset?
  2. Did you see the Direct IO stats increasing (cat /proc/spl/zfs/<pool_name>/iostats | grep direct) when you ran your test?

Comment on lines +254 to +259
	} else if (os->os_direct == ZFS_DIRECT_ALWAYS && (ioflag & O_DIRECT)) {
		/*
		 * Direct I/O was requested through the direct=always, but it
		 * is not properly PAGE_SIZE aligned. The request will be
		 * directed through the ARC.
		 */
		ioflag &= ~O_DIRECT;
Member


Maybe I am missing what else may set O_DIRECT besides the user and this function, but I think this section should be removed. I can't see why the fact that direct is set to always should mean an amnesty for applications using O_DIRECT explicitly but with incorrect alignment.

Contributor Author


So, this is an interesting case. With the direct property set to always, it is a best attempt at acting as though O_DIRECT was passed on every open call. However, the main thing with the always property setting is that alignment restrictions are relaxed. By relaxed, I mean that I/O that is not PAGE_SIZE aligned just happily gets passed off to the ARC. This was the entire idea behind always: EINVAL can never be returned for misalignment. If users want that kind of strict enforcement of alignment checks, we happily accommodate that; that is the direct property being set to standard. In my opinion, the semantics of the dataset property supersede the fact that O_DIRECT is passed as a flag. Otherwise, there would be no reason to even have the always property value. It is a best attempt, with the assurance that it will not fail under misalignment.

	uio->uio_extflg |= UIO_DIRECT;

	if (error != 0)
		n -= dio_remaining_resid;
Member


Suggested change:
-	n -= dio_remaining_resid;
+	n += dio_remaining_resid;

You've already subtracted dio_remaining_resid from n above when calculating it. So in case of error you should add it back here, saying that there is more left to read, not subtract it again.

Also, this error handling does not replicate the error == ECKSUM case from above. I wonder if, instead of duplicating code, we could repeat the loop once more with a little dirt.

Contributor Author


Yeah... that was dumb logic on my part. I changed this over to +=. I was looking at the while loop above, but I feel like the code as it stands is fine. However, if you have a better way of refactoring it, I am happy to replace the code with anything you could provide.


amotin commented Sep 11, 2024

While reviewing the dedup log patches I had a thought that needs deeper thinking: dedup's ddt_tree is pool-wide and is normally accessed only from syncing context. It is locked against concurrent inserts/modifications by ZIO threads, but its flushing does not seem to be locked, since at that point no new inserts are expected. Running ZIO write/free pipelines from open context seems to break that assumption. And I think it is more than just locking, since we allocate space in the open TXG while adding the record into the DDT of the syncing one, which may cause inconsistencies in case of a crash. Please correct me if I am wrong.

I will have to look into this more myself. I am not at all familiar with the DDT code, unfortunately... We definitely are allocating space in the open TXG with Direct I/O, though.

@bwatkinson Any updates on this? IIRC BRT has per-TXG accounting (see brt_pending_tree), which allows tracking of changes done in each different TXG. DDT AFAIK has no such mechanism, relying on all changes done only in syncing context.


robn commented Sep 12, 2024

@bwatkinson @amotin I thought about interaction between O_DIRECT and dedup.

So yes, as currently constructed, it's assumed that ddt_tree will be emptied in syncing context, and the updates written back to disk (ZAP tables or the newer log, but the principle remains the same for both). The assumption that it won't be updated at other times is pretty deeply baked into it.

I think the answer might be pretty simple though: just disable dedup for O_DIRECT blocks. This is exactly what we do for ZIL blocks, which are also written outside syncing context. I'd do the same as we do there: turn off dedup, but still use a dedup-capable checksum. See WP_DMU_SYNC in dmu_write_policy().

As justification, I think you simply sell it as a performance feature: the ultimate goal of O_DIRECT is to be fast? From cold, dedup requires a disk read, or a cache hit, to see if the block is already in the table, and then the writing block has to wait for that lookup, and then there's another write at the end of the txg to get the updated dedup entry to disk (DDT-ZAP or FDT-log). Performance gets worse at smaller block sizes too, which (as I understand it) is when you're likely to get the most benefit from O_DIRECT?

If there is a case where you want dedup and O_DIRECT to do the right thing, the workaround would just be to set direct=disabled, and then I would really like to know about that use case!

I can't think of a simple way to extend dedup as it stands to support changes at arbitrary times. Probably it is some sort of per-txg list, but like I say, it's pretty deeply baked in and I think it would effectively be a redesign of the entire facility to support it. I'd be surprised if there's any real need or interest.


bwatkinson commented Sep 12, 2024

(quoting @robn's comment above in full)

@robn thank you for clarifying this. Also, I just noticed I am already passing WP_DMU_SYNC to dmu_write_policy() in dmu_write_direct(). So, all this time, we were never doing dedup with Direct I/O writes anyway. I will still add some comments in the PR explaining this reasoning, as well as maybe an ASSERT() or two in the code to make sure we never get Direct I/O writes in the DDT write ZIO pipeline.

@xuyufenghz

@xuyufenghz just wanted to double check:

1. Did you test with `direct=always` set on the dataset?

2. Did you see the Direct IO stats increasing (`cat /proc/spl/zfs/<pool_name>/iostats  | grep direct`) when you ran your test?

Thank you for your reply. I confirm that the direct property is turned on. I also found that I could not see the Direct IO stats increasing when testing with volumes, while I could see the Direct IO stats when testing a file system with the same configuration. Does the volume support the Direct IO property?

@xuyufenghz

@xuyufenghz 5-wide RAIDZ1 is not a configuration to measure IOPS performance, neither to store 4KB blocks efficiently. Still 50-60K IOPS sounds low to me, but I can't say if it is an attribute of direct I/O. With such a small data blocks obviously direct I/O can't save much on memory traffic, so the question is what is your real bottleneck.

Hi, I retested it several times and the IOPS was still very low. I generated a FlameGraph; let's see what went wrong.
[image]
Uploading fio_randwrite_4k_zfs_2.99.1_ori.svg… (upload did not complete)


amotin commented Sep 13, 2024

@xuyufenghz I don't know if it is the case here, but my practice is to specify iomem_align=2m to fio. Otherwise it may sometimes allocate misaligned memory, which would be fatal for Direct I/O here.

Also, as a note: --iodepth=512 does not work with --ioengine=psync, since the latter can execute only one request per thread at a time, so the setting does nothing and you have 32 requests at a time, matching the number of jobs you've configured.

PS: Your svg link is broken.

Adding O_DIRECT support to ZFS to bypass the ARC for writes/reads.

O_DIRECT support in ZFS will always ensure there is coherency between
buffered and O_DIRECT IO requests. This ensures that all IO requests,
whether buffered or direct, will see the same file contents at all
times. Just as in other filesystems, O_DIRECT does not imply O_SYNC.
While data is written directly to VDEV disks, metadata will not be
synced until the associated TXG is synced.
For both O_DIRECT read and write requests, the offset and request
sizes, at a minimum, must be PAGE_SIZE aligned. In the event they are
not, EINVAL is returned unless the direct property is set to always
(see below).

For O_DIRECT writes:
The request must also be block aligned (recordsize) or the write
request will take the normal (buffered) write path. In the event the
request is block aligned and a cached copy of the buffer exists in the
ARC, that copy will be discarded from the ARC, forcing all further
reads to retrieve the data from disk.

For O_DIRECT reads:
The only alignment restriction is PAGE_SIZE alignment. In the event
that the requested data is buffered (in the ARC), it will just be
copied from the ARC into the user buffer.

For both O_DIRECT writes and reads, the O_DIRECT flag will be ignored
in the event that the file's contents are mmap'ed. In this case, all
requests that are at least PAGE_SIZE aligned will just fall back to
the buffered paths. If the request is not PAGE_SIZE aligned, however,
EINVAL will be returned as always, regardless of whether the file's
contents are mmap'ed.

Since O_DIRECT writes go through the normal ZIO pipeline, the
following operations are supported just as with normal buffered writes:
Checksum
Compression
Encryption
Erasure Coding
There is one caveat for the data integrity of O_DIRECT writes that is
distinct for each of the OS's supported by ZFS.
FreeBSD - FreeBSD is able to place user pages under write protection,
          so any data in the user buffers that is written directly
          down to the VDEV disks is guaranteed not to change. There is
          no concern with data integrity and O_DIRECT writes.
Linux   - Linux is not able to place anonymous user pages under write
          protection. Because of this, if the user decides to
          manipulate the page contents while the write operation is
          occurring, data integrity can not be guaranteed. However,
          there is a module parameter `zfs_vdev_direct_write_verify`
          that controls if a checksum verify is run on O_DIRECT writes
          to a top-level VDEV before the contents of the I/O buffer
          are committed to disk. In the event of a checksum
          verification failure the write will return EIO. The number
          of O_DIRECT write checksum verification errors can be
          observed by running `zpool status -d`, which will list all
          verification errors that have occurred on a top-level VDEV.
          Along with `zpool status`, a ZED event will be issued as
          `dio_verify` when a checksum verification error occurs.

ZVOLs and dedup are not currently supported with Direct I/O.

A new dataset property `direct` has been added with the following 3
allowable values:
disabled - Accepts the O_DIRECT flag, but silently ignores it and
           treats the request as a buffered IO request.
standard - Follows the alignment restrictions outlined above for
           write/read IO requests when the O_DIRECT flag is used.
always   - Treats every write/read IO request as though it passed
           O_DIRECT and will do O_DIRECT if the alignment restrictions
           are met; otherwise the request is redirected through the
           ARC. This property will not allow a request to fail.

There is also a module parameter zfs_dio_enabled that can be used to
force all reads and writes through the ARC. Setting this module
parameter to 0 mimics the direct dataset property being set to
disabled.

Signed-off-by: Brian Atkinson <[email protected]>
Co-authored-by: Mark Maybee <[email protected]>
Co-authored-by: Matt Macy <[email protected]>
Co-authored-by: Brian Behlendorf <[email protected]>
@xuyufenghz

Sorry, I re-uploaded the svg as an attachment. Thank you!
[attachment: fio_randwrite_4k_zfs_2.99.1_ori.svg]

@behlendorf behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Sep 14, 2024
@behlendorf behlendorf merged commit a10e552 into openzfs:master Sep 14, 2024
19 of 21 checks passed
@Powdered8502

Happy to learn that this has been merged. But is this compatible with a pool that has an L2ARC cache device? I do not understand how it would work, since the working principle of L2ARC is caching data from the ARC that is about to be evicted.

HWXLR8 commented Sep 29, 2024

Thrilled to see this finally merged! Can't wait to test on my nvme pool. Thanks everyone for the great work.

@liyimeng

How does this work in a mixed pool? Let's say I have a zpool with HDDs, but NVMe as a special device. If I turn on O_DIRECT, what happens? Can it optimize writes to the special vdev and the raidz HDDs separately?


amotin commented Oct 31, 2024

@liyimeng A mirror/stripe of NVMe and a raidz of HDDs are two different worlds. You should optimize for the slowest one, i.e. allow as much caching as possible, since it will be the bottleneck, with or without direct I/O to the NVMe. BTW, if the special vdev is used mostly for metadata, then O_DIRECT just does not apply to it, since metadata is (almost) never written by the user directly.
