NVMe Read Performance Issues with ZFS (submit_bio to io_schedule) #8381

bwatkinson · 2019-02-05T17:20:46Z

System information

Type	Version/Name
Distribution Name	CentOS
Distribution Version	7.5
Linux Kernel	4.18.20-100
Architecture	x86_64
ZFS Version	0.7.12-1
SPL Version	0.7.12-1

Describe the problem you're observing

We are currently seeing poor read performance with ZFS 0.7.12 with our Samsung PM1725a devices in a Dell PowerEdge R7425.

Describe how to reproduce the problem

Briefly describing our setup, we currently have four PM1724a devices attached to the PCIe root complex in NUMA domain 1 on an AMD EPYC 7401 processor. In order to measure read throughput of ZFS, XFS, and the Raw Devices, the XDD tool was used which is available at:

[email protected]:bwatkinson/xdd.git

In all cases I am presenting, kernel 4.18.20-100 was used and I disabled all CPU's not on socket 0 within the kernel. I also issued asynchronous sequential reads to the file systems/devices while pinning all XDD threads to NUMA domain 1 and Socket 0's memory banks. I conducted four tests consisting of measuring throughput for the raw devices, XFS, and ZFS 0.7.12. For the Raw Device tests, I had 6 I/O threads per devices with a request sizes of 1 MB and a total of 32 GB read from each device using Direct I/O. In the XFS case, I created a single XFS file system on each of the 4 devices. In each of the XFS file systems, I read a 32 GB file of random data using 6 I/O threads per file with request sizes of 1 MB using Direct I/O. In the ZFS Single ZPool case I created a single ZPool composed of 4 VDEVs and read 128 GB file of random data using 24 I/O threads with request sizes of 1 MB. In the ZFS Multiple ZPool case I create 4 separate ZPools each consisting of a single VDEV. In each of the ZPools, I read a 32 GB file of random data using 6 I/O threads per file with request sizes of 1 MB. In both the Single ZPool and Multipl ZPool cases I set the record sizes for all pools to 1 MB and I set the primarycache=none. We decided to disable the ARC in all cases, because we were reading 128 GB of data, which was exactly equal to 2x available memory on Socket 0. Even with the ARC enabled we were seeing no performance benefits. Below are the throughput measurements I collected for each of these case.

Performance Results:
Raw Device - 12,7246.724 MB/s
XFS Direct I/O - 12,734.633 MB/s
ZFS Single Zpool - 6,452.946 MB/s
ZFS Multiple Zpool – 6,344.915 MB/s

In order to try and solve what was cutting ZFS read performance in half, I generated flame graphs using the following tool:

http://www.brendangregg.com/flamegraphs.html

In general I found most of the perf samples were occurring in zio_wait. It is in this call that io_scheudle is called. Comparing the ZFS flame graphs to XFS flame graphs, I found that the number of samples between the submit_bio and io_schedule was significantly larger in the ZFS case. I decided to take timestamps of each call to io_schedule for both ZFS and XFS to measure the latency between the calls. Below is a link to to histograms as well as total elapsed time in microseconds between io_schedule calls for the tests I described above. In total I collected 110,000 timestamps. In plotted data the first 10,000 timestamps were ignored to allow for the file systems to reach a steady state.

https://drive.google.com/drive/folders/1t7tLXxyhurcxGjry4WYkw_wryh6eMqdh?usp=sharing

In general, ZFS has a significant latency between io_scheudule calls. I have also verified that the output from iostat shows a larger await_r value for ZFS over XFS for these tests as well. In general it seems ZFS is letting request sit in the hardware queues longer than XFS and the Raw Devices causing a huge performance penalty in ZFS reads (effectively cutting the available device bandwidth in half).

In general, has this issue been noticed with NVMe SSD’s and ZFS and is there a current fix to this issue? If there is no current fix, is this issue being worked on?

Also, I have tried to duplicate the XFS results using ZFS 0.8.0-rc2 using Direct I/O, but the performance for 0.8.0 read almost exactly matched ZFS 0.7.12 read performance without Direct I/O.

Include any warning/errors/backtraces from the system logs

gcs-github · 2019-03-14T11:52:19Z

When testing ZFS, what was your block devices IO scheduler set to?

cat /sys/block/your_nvme_drive_here/queue/scheduler

Normally it should be set to noop (which ZFS should set on its own if you've created your vdevs using full drives, which is recommended) so as to let ZFS do the heavy lifting instead of regular in-kernel IO schedulers. In some cases however, it can be necessary to set the scheduler yourself.

I'm not saying it'd necessarily explain your issues or fix them, but it might be a relevant piece of the puzzle here.

bwatkinson · 2019-03-14T16:12:25Z

In the data I have presented all the NVMe drive schedulers were set to none under /sys/block/nvme#n1/queue/scheduler. Also the ZFS module parameter for zfs_vdev_scheduler was set to noop.

I think only achieving half of the available bandwidth for synchronous reads is more than just under performance at this point. I could see only getting 75-80% as under performance; however, 50% is just way too low.

I have been continuing to try and track this issue down. I have taken my time stamp approach and placed time stamps in ZFS dmu.c (dmu_buf_hold_array_by_dnode), vdev_disk.c (vdev_submit_bio_impl), and SPL spl-condvar.c (cv_wait_common). My idea is try and narrow the scope down of what is causing such a drastic latencies in ZFS between when a read enters the ZFS software stack, to submitting the bio request to the linux block mulitqueue layer, and finally when the io_schdule call occurs. I am however seeing an issue where I am not collected the same number of time stamps in the vdev_submit_bio_impl call as the other 2 call sites for 1 MB request for a total of 128 GB of data with 24 I/O threads. I added a variable inside of the zio_t struct to flag the zio as a read and set this variable to true, for reads, inside of the dmu_buf_hold_array_by_node call right after the zio_root call occurs. I also updated the zio_create function to add the flag from the parent zio to the child if a parent is passed. Not sure why I am seeing less time stamps collected in just this one function, but I working to try and figure that out now.

I am sticking with this time stamp approach as nothing is blatantly obvious, when reading the source code, as the cause of the latencies. I thought maybe the issue might be occurring the SPL taskq's; however, for synchronous reads, there are 8 queues with 12 threads each to service the checksum verify in the read zio pipeline as well as finally calling zio_wait to hit the io_schedule call.

Hopefully collecting the timestamps and narrowing down the area of collection will lead to more insight to why this issue is present. Any advice on places to look in the source code or even why I am only getting a fraction of the time stamps collected in the vdev_submit_bio_impl call would greatly be appreciated.

stevenh · 2019-03-15T20:37:20Z

Have you tried increasing max number of I/O requests active for each device?

bwatkinson · 2019-03-15T20:56:30Z

Yes, I have done this for zfs_vdev_sync_read_max_active and zfs_vdev_max_active. Currently the amount of work does not seem to be causing the poor synchronous read performance. ZFS seems more than capable of handing off enough work to the NVMe SSD drives (even with default module parameter values). The real issue is that the requests are sitting the devices hardware queues for far too long (aka between when ZFS issues the requests and finally asks for the data). This is why I am trying to figure out what is causing the large latencies in the source code between calls to submit_bio and io_schedule for a synchronous read requests.

prometheanfire · 2019-03-15T21:01:56Z

NVME devs still have queue depth, maybe it's some issue with that (or other controller issues)

jwittlincohen · 2019-03-15T21:30:14Z

@bwatkinson Did you also try increasing zfs_vdev_sync_read_min_active?

bwatkinson · 2019-03-15T22:17:50Z

@jwittlincohen I did not increase the min. Is there a reason why you think this would decrease the latency issue I am seeing with read requests sitting the NVMe Device queues?

@prometheanfire The data (on the Google Drive) I presented in the first post shows that the hardware and hardware paths seem perfectly fine. XFS easily gets the full NVMe SDD's bandwidth for reads for the exact same workloads. Of course I am using Direct I/O with XFS. Also, directly reading from the devices I do not experience the same issue I am seeing in ZFS. That is also why I believe this has something to do with the ZFS software stack. I have tried using the ZFS 0.8.0-rc which allows for Direct I/O but this same issue is still present.

bwatkinson · 2019-03-19T19:32:24Z

I just wanted to give an update on this and see if there was any other suggestions where I should narrow my focus down in the ZFS source code to resolve the latency issues I am see with NVMe SDD's for synchronous reads.

I was able to solve where the missing timestamps were for the submit_bio calls. I had to trace the zio_t's to the parent from the children to get the correct matching PID's. I found that some of the submit_bio calls were being handled by the kernel threads in the SPL taskq's. I have attached the new results in the hrtime_dmu_submit_bio_io_schedule_data.zip. These timestamps were collected using a single ZPool in a RAIDZ0 configuration using 4 NVMe SDD's. The three call sites, where the timestamps were collected, were in the functions dmu_buf_hold_array_by_dnode, vdev_submit_bio_impl, and cv_wait_common. The total data read was 128 GB using 24 IO threads with each read request being 1 MB. I set the primarycache to none and set the recordsize to 1 MB for the ZPool.

In general there is a larger mean, median latency between the submit_bio to the io_schudule calls. Surprising, it is not as significant as I was expecting in comparison between the dmu_buf_hold_array_by_dnode call site and the submit_bio call site. One area I currently focusing on is the SPL taskq's. I have noticed the read queue is dynamic. I am planning on statically allocating all the kernel threads for this queue when the ZFS, SPL module is loaded to see if this has any effect. Any suggestions or advice related to places to focus on in the source code would be greatly be appreciated.

hrtime_dmu_submit_bio_io_schedule_data.zip

bwatkinson · 2019-03-19T19:36:10Z

Also, I meant to say having the kernel threads in the read taskq statically allocated when the ZPool is created.

behlendorf · 2019-03-19T20:46:33Z

@bwatkinson the taskq's are a good place to start investigating. I'd suggest setting the spl_taskq_thread_dynamic module option to 0 and see if it helps. This will disable all of the dynamic task queues and statistically allocate the threads.

I'd also suggest looking at the number of threads per taskq in the I/O pipeline. You may be encountering increased contention of the taskq locks due to the very low latency of your devices. Increasing the number of tasks and decreasing the number of threads may help reduce this contention. There is no tuning for this, but it's straight forward to make the change in the source here.

bwatkinson · 2019-03-19T21:11:15Z

@behlendorf that is the exact source file I am currently manipulating to conduct further testing. I am glad to know I wasn't that far off. At the moment I am just searching for the z_rd_int_* named taskq's in spa_taskqs_init (spa.c) and removing the TASKQ_DYNAMIC flag as a quick test. I wasn't sure if I should mess around with the other taskq's and their dynamic settings at the moment, but I will also disable the spa_taskq_thread_dynamic module parameter with further testing to see if this resolves the present issue. The contention issue also makes sense. I will explore adjusting the number of queues as well as the threads per queue. Thank you for you help with this.

dweeezil · 2019-03-21T17:01:19Z

Here's a couple of comments from my observations over time which have been gleaned mainly from running various tests and benchmarks on tmpfs-based pools: First off, on highly-concurrent workloads, as @behlendorf alluded to, the overhead in managing dynamic taskqs can become rather significant.

Also, as the devices used for vdevs become ever faster and lower latency, I think the overhead, in general, of the entire zio pipeline can start to become a bottleneck w.r.t. overall latency. I've also discovered that builds containing 1ce23dc (which is not in 0.7.x) can experience a sort-of fixed amount of latency due to zfs_commit_timeout_pct. There are likely other friction points within the ZIL code, but I've done done any more extensive research.

In general, however, it seems that as devices get ever faster and have ever lower latency, some of the rationale for pipelining starts to go away and the overhead in doing so becomes an ever larger contributor to user-visible latency. I think this is an area that's ripe for much more extensive performance analyses and it seems that @bwatkinson's work is a good start.

janetcampbell · 2019-04-06T12:17:16Z

primarycache=none will kill your read performance. ZFS normally brings in a fair amount of metadata from aggregated reads that you will not see any advantage from. Use primarycache=all and put in a smaller ARC max if you don't like how much memory it takes. Other settings of primarycache are only to be used in dire situations, they are major warning signs that something is wrong.

If you don't have a SLOG, you likely have a pool full of indirect writes. These will also kill read performance, as data and metadata end up fragmented from each other.

Get zpool iostat -r data while you are running a zfs send and while you are running a zfs receive, and post your kernel config and zfs get all - these will help significantly. I suspect you will see a lot of unaggregated 4K reads. If you set the vdev agg max to 1.5m or so and they still don't merge, then you have fragmented metadata. zfs send | receive it to another data set and repeat the test, if it helps significantly then you know what the problem is.

Hope this helps.

bwatkinson · 2019-04-08T15:19:57Z

@janetcampbell I am no way implying that we are not using the ARC when we run our normal ZPool configurations. Reads requested serviced from the ARC are obviously preferable; however, the issue is that not all reads will be serviced from the ARC (AKA hitting the underlying VDEV NVMe SDD's). When the ZFS code path for reads that are hitting the VDEV devices queues occur, only getting half of the total available bandwidth of the underlying devices is achieved. Inevitably not all read requests will be serviced at the ARC and that is where the primary concern is. This is why I set primarycache=none to help track down in the ZFS/SPL code path where such large latencies are occurring.
That is an interesting point about indirect writes. I currently am only issuing asynchronous writes to the ZPool, so I am not certain if a SLOG will help resolve this issue. Please correct me if I am wrong, but my understanding of the SLOG is that it only helps service synchronous writes.
I have modified just about every ZFS (some SPL) tunable parameter, but the performance hit of issuing reads to the devices is still present. I will try out your suggestion though to see if that helps.

bwatkinson · 2019-04-08T15:31:08Z

I have also added more data points to the Google drive:

https://drive.google.com/drive/u/2/folders/1t7tLXxyhurcxGjry4WYkw_wryh6eMqdh

Unfortunately, adjusting the SPL interrupt queues (Number of Threads vs Number of Queues and Static Allocation) did not have a significant impact. I have moved on to trying to identify other possible sources of contention within the ZFS pipeline source at this point. It has been brought to my attention from others, experiencing this same issue, that the read performance dips significantly as more VDEV's are added to the ZPool. I have started to explore why this might be the case as this might lead to uncovering what is leading to the performance penalties with NVMe VDEV's.

joebauman · 2019-05-06T15:48:31Z

@bwatkinson I've been looking at this as well, though not yet to the level of your investigation. We would like to use the NVMe for an L2ARC device but when that didn't give us the performance we were expecting I started experimenting with the performance of the NVMe on its own. I'm also seeing 50% or so of expected performance. I'm curious if you've found any additional leads since your last post?

dweeezil · 2019-05-08T02:32:00Z

Something I've been meaning to test for quite awhile on pools with with ultra low-latency vdevs, but haven't had the time to do so yet is to try various combinations of:

intel_idle.max_cstate=0
processor.max_cstate=0
idle=poll

It would be interesting to see whether any of these sufficiently lower latency to measurably impact performance on these types of pools. See the second and third paragraphs in this comment for rationale. I'm mentioning it here as a suggestion to anyone currently performing this type of testing.

joebauman · 2019-05-09T16:07:12Z

@dweeezil I've been doing read testing with fio using ZFS on an NVMe and it actually looks like read bandwidth dropped about 15-20% when I added those options to the boot config. I tried a couple different combinations (including max_cstate=1) and performance either didn't change or dropped 15-20%.

bwatkinson · 2019-05-09T16:19:26Z

@dweeezil @joebauman I have tried multiple kernel settings myself and found they either had negligible effects or actually reduced ZFS NVMe read performance.

@joebauman I am still working on solving this issue. I looked into the Space Map allocation code to see if I could find anything that would be causing this issue, but nothing in particular stood out. I started hunting in this part of the source when discovering that NVMe read performance levels out completely when more than 2 VDEV's are in a single ZPool. I am working on collecting individual timings between each of the ZIO pipeline stages at the moment to see if I can exactly narrow down where in the pipeline read requests are stalling out leading to these low performance numbers. Hopefully will be sharing some of those results soon, but I am working on getting a new testbed up and running before giving credence to any results I collect going forward.

richardelling · 2019-05-09T18:39:03Z

Setting max_cstate=0 is not desirable for many cases. The goal you're trying to achieve can be reached on modern CPUs with max_cstate=1 FWIW, we do this for all systems where high performance is desired. Use powertop to observe the impacts.

IMHO, the jury is still out on polling, this is an area of relatively intense research as more polling drivers are becoming available (DPDK, SPDK, et.al.)

recklessnl · 2019-09-04T22:49:14Z

@bwatkinson Have you made any progress in finding the cause? How are the NVMe SSDs running for you now with the newer version of ZFS?

bwatkinson · 2019-09-04T22:56:25Z

@recklessnl we actually have discovered that the bottle neck had to do with the overhead of memory copies. I am working the @behlendorf and @mmaybee on getting Direct IO to work in ZFS 0.8. The plan is to get things ironed out and make an official pull request against zfsonlinux/zfs master. I was planning on updating this ticket once an official pull request has been made.

recklessnl · 2019-09-04T23:26:15Z

@bwatkinson That's very interesting, thanks for sharing. Wasn't Direct I/O already added in ZFS 0.8? See here #7823

Or have you found a bug with the current implementation?

bwatkinson · 2019-09-05T11:57:30Z

@recklessnl so the addition in 0.8 was just to accept the O_DIRECT flag; however, the IO path for both reads/writes remained the same (AKA they traveled through the ARC). If you run ZFS 0.8 and use Direct IO you will see the ARC is still in play (watch output from arcstat). This still means memory copies are being performed and limiting performance. We are actually working on implementing Direct IO by mapping the user’s pages into kernel space and reading/writing to them directly. This allows us to bypass the ARC all together.

h1z1 · 2019-09-07T09:30:58Z

Curious why would you want to bypass the ARC? There are other problems if accessing through it is slower.

Please never make that default.

mmaybee · 2019-09-07T14:46:01Z

@h1z1 the ARC bypass is a side effect of the Direct IO support (although one could argue that it is implied by "direct"). The primary goal was to eliminate data copies to improve performance. Because the Direct IO path maps the user buffer directly into the kernel to avoid the data copy, there is no kernel copy of the data made so it cannot be cached in the ARC. That said, there are good reasons to avoid the ARC sometimes: If the data being read/written is not going to be accessed again (known from the application level), then avoiding the cache may be a benefit because other, more relevant, data can be kept in cache.

Direct IO will not be the default. It must be explicitly requested from the application layer via the O_DIRECT flag.

bwatkinson · 2019-09-07T16:28:12Z

@h1z1 as @mmaybee said, this will not be default behavior in ZFS going forward. It has to be explicitly requested from the application layer.

There has been quite a bit of concern on this issue about bypassing the ARC or avoiding it all together. I think it is also important to remember the one of the key purposes of the ARC, which is to hide disk latency for read requests. That is what is unique about NVMe SDD’s though, the latency is no longer a giant bottleneck. Essentially by trying to mask latency from a NVMe device, we are actually inducing a much higher latency overhead. Even using Direct IO with reads from these devices well out paces any performance gains from caching reads or prefetching them. Direct IO will allow ZFS another path to work with very low latency devices while still providing all the data protection, reduction, and consistency that one expects from ZFS.

richardelling · 2019-09-07T17:10:24Z

well, not quite that fast. Today, the most common NVMe devices write at 60-90 usec and read at 80-120 usec. That is 2-3 orders of magnitude slower than RAM. The bigger win will be more efficient lookup and block management in ARC.

fricmonkey · 2020-01-22T03:51:46Z

Any new info on this front?
Working on a build at the moment with 8 nvme, and coming across these issues myself even down to just a single drive alone in a zfs pool. I'm not nearly as technically knowledgeable as yourselves, but I've done a lot of research into this and have come up dry outside of this thread.

prometheanfire · 2020-01-22T03:59:01Z

I guess I'll add myself more explicitly to this, have a storage server that's all nvme that I'd prefer to use zfs on as well.

bwatkinson · 2020-01-22T17:25:20Z

@fricmonkey,
@mmaybee and I are currently still working on the code for Direct IO in ZFS, but we are getting closer to the finish line with the work. Hoping to have an official pull request into ZoL master soon. No exact date yet, but I will reference the pull request in this thread once we have created it. I think once we have created the pull request, it would be extremely beneficial for people to experiment with using Direct IO in ZFS with NVMe Zpools. That will help in assessing the overall performance of the Direct IO code paths with multiple architectures.

fricmonkey · 2020-01-22T17:31:39Z

@bwatkinson Thanks for the update! With the state of zfs and nvme as it is now, I will likely be holding off putting the machine into our production environment. Whenever you have something ready to test, I'd be happy to be involved in testing and experimentation if one of you could help walk me through it.

randomsamples · 2020-03-19T01:27:08Z

Just wanted to add that I've been seeing similar performance issues with spinny drives. I have tried numerous experiments with various file systems and zfs tweaks on my hardware, including 8 WD Blue 6TB drives and a second set of 8 WD Gold 10TB drives (to eliminate the possibility of SMR being the culprit) but nothing has solved the problem. With 8 drives at 120-260MB/s each I would expect around 2GB/s throughput in striped configuration (or mirrored reads) but I never get over around 1GB/s. Almost exactly half the expected performance. I also suspect a software issue and found this bug in my search for possible answers. Anyway, plus one to this, I hope it fixes my issues as well! Would love to see better read performance from my ZFS nas.

Some details here: https://www.reddit.com/r/freenas/comments/fax1wl/reads_12_the_speed_of_writes/

shodanshok · 2020-03-19T08:33:17Z

@randomsamples it is unlikely that you are facing that specific problem with spinning disks. Rather, single-threaded sequential access does not scale linearly with stripe count, as in this case the only method to extract performance is via read-ahead. You can try increasing your read-ahead setting by tuning zfetch_max_distance

randomsamples · 2020-03-19T23:39:34Z

HI @shodanshok, thank you for the tip. II did try this out today and saw a nice gain, but writes are still way faster than reads on my array which I find quite surprising. FWIW this is 8 WD Gold 10 TB drives, 4 sets of mirrors, the drives will do up to 230MB/s each, so I would expect about 2GB/s reads and 1GB/s writes in this configuration (raw dev reads/writes demonstrate this). Dataset is standard sync, no compression, atime off, dedupe off, 1M records, case insensitive (SMB share), and in this case I have SLOG ssd and L2Arc attached (its a working pool so I didn't want to change too much but I used huge files to make sure cache is not making perf magic happen). I started with zfetch_max_distance at the default (I think it about 8MB) and doubled it until perf stopped improving, which was 64MB (probably a huge value for this setting?). Anyway I ran some more dd tests, both single and 8 parallel instances, using 128GB of zeros (all independent files per worker). All tests show incredibly stable and fast writes, about 1.86GB/s (928MB/s visible throughput due to RAID 10) with 100% utilization of every disk, and about 1.2GB/s reads with around 80% utilization (much more fluctuation). Large streaming reads are about 20% faster with max_distance set larger, but I still feel like there has to be a good reason why the reads are so much slower than the writes, shouldn't it be the other way around? Anyway, I don't want to hijack this thread so maybe if you want to continue this conversation (I would love to get to the bottom of this) then you can ping me on reddit (find me through the post I linked to above) and we can find a place to discuss further. Thanks again, this definitely was helpful!

beren12 · 2020-04-23T22:54:51Z

Does your DirectIO still do zfs checksums?

bwatkinson · 2020-04-24T16:12:56Z

Checksums will still be supported; however, the buffer will be temporarily copied on writes but not cached in the event that checksums are enabled. This will affect DirectIO requests going through the normal ZPL path (write system call), but we can avoid the temporary memory copy in the case of DirectIO with Lustre. I have attached a link (https://docs.google.com/document/d/1C8AgqoRodxutvYIodH39J8hRQ2xcP9poELU6Z6CXmzY/edit#heading=h.r28mljyq7a2b) to the document @ahrens created discussing the semantics that we settled on for DirectIO in ZFS. I am currently in the processes of updating the PR #10018 to use these new semantics.

KosmiK2001 · 2020-07-21T18:13:28Z

I want to suggest the following to the author of the topic:
Re-create the pool from scratch. And when creating! pool, disable ALL feature's. In general, everything. And repeat your tests and linear reading to.
In the future, I can suggest installing FreeBSD on a separate PC. And repeat the tests in this OS. And create a pool from scratch. With all feature's and without. And repeat the tests.

I actually ran into this problem on an old Samsung 2TB hard drive. Pure linear reading from the start starts at 118mb/s. By the end of the disc - 52.
With ZFS, I can see half or around. When recording - everything is fine.
And by the way, why did I mention FreeBSD. I have a corei5-4430 + 8gb_ram. There is an old 320GB notebook drive. This disk from the start gives 60mb/s. So, reading with compression enabled, dedup and 16MB blocks gives 45MB/s. So is the writing on disk!
I can only advise you to check these possibilities.

stale · 2021-07-21T21:21:38Z

This issue has been automatically marked as "stale" because it has not had any activity for a while. It will be closed in 90 days if no further activity occurs. Thank you for your contributions.

tlvu · 2021-11-10T19:08:54Z

We have a new server that's all nvme (12 disks of 8TB each) that I'd like to use ZFS for its on-the-fly compression to squeeze more storage out of those disks, that are quite expensive.

With ZFS we are thinking of on Vdev RaidZ2 with all those 12 disks for 2 Vdev RaidZ1 in stripe mode.

Our other alternative is hybrid raid with VTOC built into the Intel CPU with LVM (2 raid-5 in stripe mode) and XFS.

We are on Centos 7.

So what is the current nvme support state with ZFS on Centos 7, in terms of performance, drive longevity (write amplification on ssd) and data integrity (write hole with raid in case of power failure)?

Thanks.

rnz · 2021-11-25T18:59:58Z

Write same slow:

Increase zfs_dirty_data_max (4294967296 -> 10737418240 -> 21474836480 -> 42949672960) compensate performance penalties, but this is background record same slow per nvme devices ~10k iops per device:

# fio --time_based --name=benchmark --size=15G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=32 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=32
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=355MiB/s][w=90.0k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=14610: Thu Nov 25 18:34:14 2021
  write: IOPS=170k, BW=663MiB/s (695MB/s)(19.4GiB/30001msec); 0 zone resets
    slat (usec): min=4, max=81198, avg=22.68, stdev=51.44
    clat (nsec): min=1280, max=84040k, avg=730981.18, stdev=635737.86
     lat (usec): min=13, max=84140, avg=753.74, stdev=654.34
    clat percentiles (usec):
     |  1.00th=[  465],  5.00th=[  474], 10.00th=[  482], 20.00th=[  490],
     | 30.00th=[  502], 40.00th=[  510], 50.00th=[  529], 60.00th=[  545],
     | 70.00th=[  570], 80.00th=[  709], 90.00th=[ 1156], 95.00th=[ 1844],
     | 99.00th=[ 3359], 99.50th=[ 3458], 99.90th=[ 3654], 99.95th=[ 3752],
     | 99.99th=[ 4047]
   bw (  KiB/s): min=145080, max=973275, per=100.00%, avg=682153.22, stdev=78295.26, samples=236
   iops        : min=36270, max=243318, avg=170538.22, stdev=19573.82, samples=236
  lat (usec)   : 2=0.01%, 4=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=29.68%, 750=51.75%, 1000=5.93%
  lat (msec)   : 2=7.83%, 4=4.80%, 10=0.01%, 50=0.01%, 100=0.01%
  cpu          : usr=4.04%, sys=77.27%, ctx=219363, majf=0, minf=256
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=100.0%, >=64=0.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.1%, 64=0.0%, >=64=0.0%
     issued rwts: total=0,5091203,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=32

Run status group 0 (all jobs):
  WRITE: bw=663MiB/s (695MB/s), 663MiB/s-663MiB/s (695MB/s-695MB/s), io=19.4GiB (20.9GB), run=30001-30001msec

After test backround writing is still process:

# iostat  -x 1 | awk '{print $1"\t"$8"\t"$9}'
Device    w/s    wkB/s
loop0    1.00    4.00
loop1    1.00    4.00
nvme0n1    0.00    0.00
nvme1n1    7963.00    398872.00
nvme2n1    6197.00    393752.00
nvme3n1    8052.00    403096.00
nvme4n1    7933.00    398872.00
nvme5n1    0.00    0.00
nvme6n1    0.00    0.00

Than single nvme device have raw speed 700k iops:

# fio --time_based --name=benchmark --size=15G --runtime=30 --filename=/dev/nvme6n1 --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=2761MiB/s][w=707k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=3828468: Thu Nov 25 21:30:06 2021
  write: IOPS=706k, BW=2758MiB/s (2892MB/s)(80.8GiB/30001msec); 0 zone resets
    slat (nsec): min=1300, max=258439, avg=2391.93, stdev=1159.86
    clat (usec): min=314, max=2934, avg=722.11, stdev=111.84
     lat (usec): min=316, max=2936, avg=724.57, stdev=111.82
    clat percentiles (usec):
     |  1.00th=[  502],  5.00th=[  553], 10.00th=[  586], 20.00th=[  627],
     | 30.00th=[  660], 40.00th=[  685], 50.00th=[  717], 60.00th=[  750],
     | 70.00th=[  783], 80.00th=[  816], 90.00th=[  865], 95.00th=[  906],
     | 99.00th=[  988], 99.50th=[ 1020], 99.90th=[ 1106], 99.95th=[ 1205],
     | 99.99th=[ 1942]
   bw (  MiB/s): min= 2721, max= 2801, per=100.00%, avg=2760.62, stdev= 3.03, samples=236
   iops        : min=696742, max=717228, avg=706717.86, stdev=775.87, samples=236
  lat (usec)   : 500=0.97%, 750=59.81%, 1000=38.47%
  lat (msec)   : 2=0.74%, 4=0.01%
  cpu          : usr=24.41%, sys=44.96%, ctx=5787827, majf=0, minf=70
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,21184205,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=2758MiB/s (2892MB/s), 2758MiB/s-2758MiB/s (2892MB/s-2892MB/s), io=80.8GiB (86.8GB), run=30001-30001msec

Disk stats (read/write):
  nvme6n1: ios=50/21098501, merge=0/0, ticks=4/15168112, in_queue=15168116, util=99.77%

# uname -a
Linux host1 5.11.22-7-pve #1 SMP PVE 5.11.22-12 (Sun, 07 Nov 2021 21:46:36 +0100) x86_64 GNU/Linux

# zpool --version
zfs-2.1.1-pve3
zfs-kmod-2.0.6-pve1

# zpool status
  pool: zfs-p1
 state: ONLINE
config:

	NAME                                             STATE     READ WRITE CKSUM
	zfs-p1                                           ONLINE       0     0     0
	  raidz1-0                                       ONLINE       0     0     0
	    nvme-INTEL_SSDPE2KE076T8_PHLN039400AT7P6DGN  ONLINE       0     0     0
	    nvme-INTEL_SSDPE2KE076T8_PHLN0393005J7P6DGN  ONLINE       0     0     0
	    nvme-INTEL_SSDPE2KE076T8_PHLN039400AF7P6DGN  ONLINE       0     0     0
	    nvme-INTEL_SSDPE2KE076T8_PHLN0393005E7P6DGN  ONLINE       0     0     0

# zpool get all | egrep 'ashift|trim'
zfs-p1  ashift                         12                             local
zfs-p1  autotrim                       on                             local

# zfs get all | egrep 'ashift|compression|trim|time|record'
zfs-p1                    recordsize            128K                       default
zfs-p1                    compression           off                        default
zfs-p1                    atime                 on                         default
zfs-p1                    relatime              off                        default
zfs-p1/subvol-700-disk-0  recordsize            128K                       local
zfs-p1/subvol-700-disk-0  compression           off                        default
zfs-p1/subvol-700-disk-0  atime                 on                         default
zfs-p1/subvol-700-disk-0  relatime              off                        default

rnz · 2021-11-25T19:30:20Z

Also increase:
zfs_vdev_async_write_min_active
zfs_vdev_async_write_max_active

increased iops:

Device	w/s	wkB/s
loop0	0.00	0.00
loop1	0.00	0.00
nvme0n1	0.00	0.00
nvme1n1	48071.00	1595316.00
nvme2n1	47334.00	1496244.00
nvme3n1	48044.00	1595120.00
nvme4n1	47676.00	1549908.00
nvme5n1	0.00	0.00
nvme6n1	0.00	0.00

richardelling · 2021-11-26T23:36:47Z

Increasing zfs_vdev_async_write_max_active can make a big difference for SSDs, but can kill performance for HDDs.

Changing zfs_vdev_async_write_min_active will not make a performance impact unless you have a lot of different write types concurrently. Use zpool iostat -q and friends to see the breakdown of I/Os for each I/O type.

rnz · 2021-11-27T12:59:48Z

Increasing zfs_vdev_async_write_max_active can make a big difference for SSDs, but can kill performance for HDDs.

As I understand it, we are discussing performance on nvme ssd

Changing zfs_vdev_async_write_min_active will not make a performance impact unless you have a lot of different write types concurrently.

Increasing zfs_vdev_async_write_max_active only - was not effects on performance, iops per nvme device was still low and load was lower 50% and and overall performance was low. Performance changed significantly after increasing zfs_vdev_async_write_min_active too and nvme devices began to load at 90%

Use zpool iostat -q and friends to see the breakdown of I/Os for each I/O type.

I used it and -r and -w and other options

Also I increase zfs_vdev_sync_write_min_active and zfs_vdev_sync_write_max_active and per nvme device:
load increased 90% -> 99-100%
iops increased 48k -> 60k
bw increased 1.5G -> 2G

From these results, the conclusion suggests itself: ZFS groups 4kb blocks into 30-50kb block for writing to device in raidz

richardelling · 2021-12-04T23:00:36Z

Increasing zfs_vdev_sync_write_min_active can starve the other storage I/O types, which can have unexpected and potentially awful results. The only way to observe this interaction is by monitoring with zpool iostat -q and friends.

rnz · 2021-12-20T19:58:41Z

Increasing zfs_vdev_sync_write_min_active can starve the other storage I/O types,

For example? (note: I read you article about ZIO scheduling)

which can have unexpected and potentially awful results.

Can you show in the code exactly how this might affect?

The only way to observe this interaction is by monitoring with zpool iostat -q and friends.

I create mirror instead raidz, same result - w/o tune zfs_vdev_sync_write_min_active and zfs_vdev_async_write_min_active - nvme devices is not utilized and performance zfs is poor. When I Increase zfs_vdev_sync_write_min_active and zfs_vdev_async_write_min_active - performance is grow:

# fio --time_based --name=benchmark --size=8G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=539MiB/s][w=138k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=11695: Mon Dec 20 19:41:21 2021
  write: IOPS=184k, BW=720MiB/s (755MB/s)(21.1GiB/30001msec); 0 zone resets
    slat (usec): min=3, max=14615, avg=20.75, stdev=28.53
    clat (usec): min=2, max=23491, avg=2754.57, stdev=897.84
     lat (usec): min=20, max=23527, avg=2775.40, stdev=904.34
    clat percentiles (usec):
     |  1.00th=[ 1926],  5.00th=[ 2089], 10.00th=[ 2180], 20.00th=[ 2245],
     | 30.00th=[ 2278], 40.00th=[ 2343], 50.00th=[ 2409], 60.00th=[ 2540],
     | 70.00th=[ 2737], 80.00th=[ 3097], 90.00th=[ 3884], 95.00th=[ 4686],
     | 99.00th=[ 6128], 99.50th=[ 6718], 99.90th=[ 8586], 99.95th=[11207],
     | 99.99th=[14615]
   bw (  KiB/s): min=396919, max=965040, per=100.00%, avg=739064.49, stdev=37556.33, samples=236
   iops        : min=99229, max=241260, avg=184766.00, stdev=9389.11, samples=236
  lat (usec)   : 4=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=2.30%, 4=88.74%, 10=8.89%, 20=0.06%, 50=0.01%
  cpu          : usr=4.67%, sys=94.65%, ctx=7882, majf=0, minf=983
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,5532699,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=720MiB/s (755MB/s), 720MiB/s-720MiB/s (755MB/s-755MB/s), io=21.1GiB (22.7GB), run=30001-30001msec

# zpool iostat -l 1 -v
...

               capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool         alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
------------ -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
zfs-p1       46.0G  6.94T      0  48.6K  4.00K  6.03G   98us   10ms   98us   10ms  768ns  864ns      -    5us      -      -
  mirror     46.0G  6.94T      0  48.6K  4.00K  6.03G   98us   10ms   98us   10ms  768ns  864ns      -    5us      -      -
    nvme1n1      -      -      0  24.2K      0  3.01G      -   12ms      -   12ms      -  864ns      -    3us      -      -
    nvme2n1      -      -      0  24.3K  4.00K  3.02G   98us    7ms   98us    7ms  768ns  864ns      -    7us      -      -
------------ -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool         alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
------------ -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
zfs-p1       46.0G  6.94T      0  48.6K      0  6.08G      -    9ms      -    9ms      -      -      -   12us      -      -
  mirror     46.0G  6.94T      0  48.6K      0  6.08G      -    9ms      -    9ms      -      -      -   12us      -      -
    nvme1n1      -      -      0  24.3K      0  3.04G      -   13ms      -   13ms      -      -      -    5us      -      -
    nvme2n1      -      -      0  24.3K      0  3.04G      -    4ms      -    4ms      -      -      -   20us      -      -
------------ -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim
pool         alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait
------------ -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
zfs-p1       46.0G  6.94T      0  48.7K  4.00K  6.04G   98us    8ms   98us    8ms  768ns  576ns      -   14us      -      -
  mirror     46.0G  6.94T      0  48.7K  4.00K  6.04G   98us    8ms   98us    8ms  768ns  576ns      -   14us      -      -
    nvme1n1      -      -      0  24.4K  4.00K  3.03G   98us   12ms   98us   12ms  768ns  624ns      -    3us      -      -
    nvme2n1      -      -      0  24.3K      0  3.01G      -    4ms      -    4ms      -  528ns      -   24us      -      -
...

and one more time:

# fio --time_based --name=benchmark --size=8G --runtime=30 --filename=/mnt/zfs/g-fio.test --ioengine=libaio --randrepeat=0 --iodepth=128 --direct=1 --invalidate=1 --verify=0 --verify_fatal=0 --numjobs=4 --rw=randwrite --blocksize=4k --group_reporting
benchmark: (g=0): rw=randwrite, bs=(R) 4096B-4096B, (W) 4096B-4096B, (T) 4096B-4096B, ioengine=libaio, iodepth=128
...
fio-3.25
Starting 4 processes
Jobs: 4 (f=4): [w(4)][100.0%][w=1126MiB/s][w=288k IOPS][eta 00m:00s]
benchmark: (groupid=0, jobs=4): err= 0: pid=11977: Mon Dec 20 20:05:57 2021
  write: IOPS=265k, BW=1035MiB/s (1086MB/s)(30.3GiB/30001msec); 0 zone resets
    slat (usec): min=3, max=12762, avg=14.21, stdev=17.34
    clat (nsec): min=1770, max=21001k, avg=1916627.53, stdev=728140.34
     lat (usec): min=9, max=21035, avg=1930.92, stdev=733.58
    clat percentiles (usec):
     |  1.00th=[ 1418],  5.00th=[ 1483], 10.00th=[ 1500], 20.00th=[ 1549],
     | 30.00th=[ 1565], 40.00th=[ 1598], 50.00th=[ 1631], 60.00th=[ 1696],
     | 70.00th=[ 1811], 80.00th=[ 2073], 90.00th=[ 2737], 95.00th=[ 3523],
     | 99.00th=[ 4817], 99.50th=[ 5407], 99.90th=[ 6915], 99.95th=[ 7570],
     | 99.99th=[13304]
   bw (  MiB/s): min=  512, max= 1345, per=99.70%, avg=1032.30, stdev=58.82, samples=236
   iops        : min=131156, max=344480, avg=264268.41, stdev=15058.95, samples=236
  lat (usec)   : 2=0.01%, 10=0.01%, 20=0.01%, 50=0.01%, 100=0.01%
  lat (usec)   : 250=0.01%, 500=0.01%, 750=0.01%, 1000=0.01%
  lat (msec)   : 2=78.16%, 4=18.76%, 10=3.06%, 20=0.02%, 50=0.01%
  cpu          : usr=6.30%, sys=93.07%, ctx=8914, majf=0, minf=310
  IO depths    : 1=0.1%, 2=0.1%, 4=0.1%, 8=0.1%, 16=0.1%, 32=0.1%, >=64=100.0%
     submit    : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.0%
     complete  : 0=0.0%, 4=100.0%, 8=0.0%, 16=0.0%, 32=0.0%, 64=0.0%, >=64=0.1%
     issued rwts: total=0,7952176,0,0 short=0,0,0,0 dropped=0,0,0,0
     latency   : target=0, window=0, percentile=100.00%, depth=128

Run status group 0 (all jobs):
  WRITE: bw=1035MiB/s (1086MB/s), 1035MiB/s-1035MiB/s (1086MB/s-1086MB/s), io=30.3GiB (32.6GB), run=30001-30001msec

# zpool iostat -l 1 -v -q
...
               capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim   syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write
pool         alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
-----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
zfs-p1       46.0G  6.94T      0  48.3K  4.00K  6.00G   98us   18ms   98us   18ms    1us  672ns      -    7us      -      -      0      0      0      0      0      0      0    738      0      0      0      0
  mirror     46.0G  6.94T      0  48.3K  4.00K  6.00G   98us   18ms   98us   18ms    1us  672ns      -    7us      -      -      0      0      0      0      0      0      0    737      0      0      0      0
    nvme1n1      -      -      0  24.2K      0  3.00G      -   24ms      -   24ms      -  672ns      -    3us      -      -      0      0      0      0      0      0      0    395      0      0      0      0
    nvme2n1      -      -      0  24.2K  4.00K  3.00G   98us   12ms   98us   12ms    1us  672ns      -   11us      -      -      0      0      0      0      0      0      0    348      0      0      0      0
-----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim   syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write
pool         alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
-----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
zfs-p1       46.0G  6.94T      0  48.7K      0  6.08G      -   18ms      -   18ms      -      -      -    7us      -      -      0      0      0      0      0      0      0    818      0      0      0      0
  mirror     46.0G  6.94T      0  48.7K      0  6.08G      -   18ms      -   18ms      -      -      -    7us      -      -      0      0      0      0      0      0      0    820      0      0      0      0
    nvme1n1      -      -      0  24.3K      0  3.04G      -   24ms      -   24ms      -      -      -    3us      -      -      0      0      0      0      0      0      0    503      0      0      0      0
    nvme2n1      -      -      0  24.3K      0  3.04G      -   12ms      -   12ms      -      -      -   11us      -      -      0      0      0      0      0      0      0    320      0      0      0      0
-----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
               capacity     operations     bandwidth    total_wait     disk_wait    syncq_wait    asyncq_wait  scrub   trim   syncq_read    syncq_write   asyncq_read  asyncq_write   scrubq_read   trimq_write
pool         alloc   free   read  write   read  write   read  write   read  write   read  write   read  write   wait   wait   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ   pend  activ
-----------  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----  -----
zfs-p1       46.0G  6.94T      0  48.4K      0  6.04G      -   18ms      -   18ms      -      -      -    6us      -      -      0      0      0      0      0      0      0    789      0      0      0      0
  mirror     46.0G  6.94T      0  48.4K      0  6.04G      -   18ms      -   18ms      -      -      -    6us      -      -      0      0      0      0      0      0      0    811      0      0      0      0
    nvme1n1      -      -      0  24.2K      0  3.02G      -   25ms      -   25ms      -      -      -    3us      -      -      0      0      0      0      0      0      0    512      0      0      0      0
    nvme2n1      -      -      0  24.2K      0  3.02G      -   12ms      -   12ms      -      -      -    8us      -      -      0      0      0      0      0      0      0    304      0      0      0      0
...

richardelling · 2021-12-24T23:11:01Z

Your test workload is a write-only workload with no overwrites. So you won't see the penalty of raising the _min.

rnz · 2022-01-09T18:01:34Z

Your test workload is a write-only workload with no overwrites. So you won't see the penalty of raising the _min.

File /mnt/zfs/g-fio.test not recreated when running multiple tests (exclude creating new fs dataset). Each fio run overwrite data in this file. Tests runned many times.

dm17 · 2022-12-13T08:44:33Z

For those working on this for years: is there some consensus for the best settings, or are they too configuration specific to state? As I understand it, we should not expect ZFS to automatically set the best defaults based on what hardware is being attached (HDD, SDD, NVME), correct?

I'm curious if most ZFS devs are using ZFS on their PCs - considering all of their PCs are likely to have NVMEs by now. Dogfooding can be inspiring :)

Reading through this today, and found it interesting: https://zfsonlinux.topicbox.com/groups/zfs-discuss/T5122ffd3e191f75f/zfs-cache-speed-vs-os-cache-speed

0xTomDaniel · 2023-01-06T04:21:10Z

@fricmonkey, @mmaybee and I are currently still working on the code for Direct IO in ZFS, but we are getting closer to the finish line with the work. Hoping to have an official pull request into ZoL master soon. No exact date yet, but I will reference the pull request in this thread once we have created it. I think once we have created the pull request, it would be extremely beneficial for people to experiment with using Direct IO in ZFS with NVMe Zpools. That will help in assessing the overall performance of the Direct IO code paths with multiple architectures.

Any updates? Thanks.

amotin · 2023-01-06T14:18:52Z

@0xTomDaniel It is not exactly a DirectIO, but a nice small sampler was just integrated: #14243 . It can dramatically improve performance in case of primarycache=metadata.

rnz · 2023-01-06T23:46:35Z

For those working on this for years: is there some consensus for the best settings, or are they too configuration specific to state? As I understand it, we should not expect ZFS to automatically set the best defaults based on what hardware is being attached (HDD, SDD, NVME), correct?

I'm curious if most ZFS devs are using ZFS on their PCs - considering all of their PCs are likely to have NVMEs by now. Dogfooding can be inspiring :)

Reading through this today, and found it interesting: https://zfsonlinux.topicbox.com/groups/zfs-discuss/T5122ffd3e191f75f/zfs-cache-speed-vs-os-cache-speed

My current settings for servers with 2 TiB RAM and 2x 7.68TB (SSDPE2KE076T8) NVMe drives in mirrored pool:

options zfs zfs_arc_max=137400000000
options zfs zfs_dirty_data_max=42949672960
options zfs zfs_dirty_data_max_percent=30
options zfs zfs_txg_timeout=100
options zfs zfs_vdev_async_read_max_active=2048
options zfs zfs_vdev_async_read_min_active=1024
options zfs zfs_vdev_async_write_max_active=2048
options zfs zfs_vdev_async_write_min_active=1024
options zfs zfs_vdev_queue_depth_pct=100
options zfs zfs_vdev_sync_read_max_active=2048
options zfs zfs_vdev_sync_read_min_active=1024
options zfs zfs_vdev_sync_write_max_active=2048
options zfs zfs_vdev_sync_write_min_active=1024

bunder2015 added the Type: Performance Performance improvement or performance problem label Feb 6, 2019

This comment has been minimized.

Sign in to view

bwatkinson mentioned this issue Feb 18, 2020

Direct IO Support #10018

Merged

17 tasks

gmelikov mentioned this issue Aug 31, 2020

bad performance on NVME SSD #10856

Closed

stale bot added the Status: Stale No recent activity for issue label Jul 21, 2021

stale bot closed this as completed Oct 20, 2021

gmelikov reopened this Nov 4, 2021

stale bot removed the Status: Stale No recent activity for issue label Nov 4, 2021

NVMe Read Performance Issues with ZFS (submit_bio to io_schedule) #8381

NVMe Read Performance Issues with ZFS (submit_bio to io_schedule) #8381

Comments

bwatkinson commented Feb 5, 2019

System information

Describe the problem you're observing

Describe how to reproduce the problem

Include any warning/errors/backtraces from the system logs

This comment has been minimized.

gcs-github commented Mar 14, 2019

bwatkinson commented Mar 14, 2019

stevenh commented Mar 15, 2019

bwatkinson commented Mar 15, 2019

prometheanfire commented Mar 15, 2019

jwittlincohen commented Mar 15, 2019

bwatkinson commented Mar 15, 2019

bwatkinson commented Mar 19, 2019

bwatkinson commented Mar 19, 2019

behlendorf commented Mar 19, 2019

bwatkinson commented Mar 19, 2019

dweeezil commented Mar 21, 2019

janetcampbell commented Apr 6, 2019 • edited Loading

bwatkinson commented Apr 8, 2019

bwatkinson commented Apr 8, 2019

joebauman commented May 6, 2019

dweeezil commented May 8, 2019

joebauman commented May 9, 2019

bwatkinson commented May 9, 2019

richardelling commented May 9, 2019

recklessnl commented Sep 4, 2019

bwatkinson commented Sep 4, 2019 • edited Loading

recklessnl commented Sep 4, 2019

bwatkinson commented Sep 5, 2019

h1z1 commented Sep 7, 2019

mmaybee commented Sep 7, 2019

bwatkinson commented Sep 7, 2019

richardelling commented Sep 7, 2019

fricmonkey commented Jan 22, 2020

prometheanfire commented Jan 22, 2020

bwatkinson commented Jan 22, 2020

fricmonkey commented Jan 22, 2020

randomsamples commented Mar 19, 2020 • edited Loading

shodanshok commented Mar 19, 2020

randomsamples commented Mar 19, 2020

beren12 commented Apr 23, 2020

bwatkinson commented Apr 24, 2020

KosmiK2001 commented Jul 21, 2020 • edited Loading

stale bot commented Jul 21, 2021

tlvu commented Nov 10, 2021

rnz commented Nov 25, 2021 • edited Loading

rnz commented Nov 25, 2021 • edited Loading

richardelling commented Nov 26, 2021

rnz commented Nov 27, 2021 • edited Loading

richardelling commented Dec 4, 2021

rnz commented Dec 20, 2021 • edited Loading

richardelling commented Dec 24, 2021

rnz commented Jan 9, 2022

dm17 commented Dec 13, 2022 • edited Loading

0xTomDaniel commented Jan 6, 2023

amotin commented Jan 6, 2023

rnz commented Jan 6, 2023

janetcampbell commented Apr 6, 2019 •

edited

Loading

bwatkinson commented Sep 4, 2019 •

edited

Loading

randomsamples commented Mar 19, 2020 •

edited

Loading

KosmiK2001 commented Jul 21, 2020 •

edited

Loading

rnz commented Nov 25, 2021 •

edited

Loading

rnz commented Nov 25, 2021 •

edited

Loading

rnz commented Nov 27, 2021 •

edited

Loading

rnz commented Dec 20, 2021 •

edited

Loading

dm17 commented Dec 13, 2022 •

edited

Loading