
bad performance on NVME SSD #10856

Closed
mabod opened this issue Aug 31, 2020 · 23 comments
Labels
Type: Performance (Performance improvement or performance problem)

Comments

@mabod

mabod commented Aug 31, 2020

System information

Type Version/Name
Distribution Name Manjaro
Distribution Version Testing
Linux Kernel tested with 4.19.141 and 5.8.3
Architecture amd64
ZFS Version 0.8.4-1
SPL Version 0.8.4-1

Describe the problem you're observing

zfs performance measured with fio on an NVMe SSD is poor compared to xfs. I expected that. But what I also find is that with the second or third fio run the write performance drops significantly, to about 30% of the initial rate, and it stays there until I do a trim. Then it starts off at full speed again, only to drop once more after the second fio run.

XFS performance is the benchmark:

   READ: bw=2149MiB/s (2253MB/s), 2149MiB/s-2149MiB/s (2253MB/s-2253MB/s), io=64.0GiB (68.7GB), run=30498-30498msec
  WRITE: bw=2234MiB/s (2343MB/s), 2234MiB/s-2234MiB/s (2343MB/s-2343MB/s), io=64.0GiB (68.7GB), run=29332-29332msec

zfs performance with a fresh new pool PCIE3:

   READ: bw=1434MiB/s (1504MB/s), 1434MiB/s-1434MiB/s (1504MB/s-1504MB/s), io=64.0GiB (68.7GB), run=45691-45691msec
  WRITE: bw=2083MiB/s (2184MB/s), 2083MiB/s-2083MiB/s (2184MB/s-2184MB/s), io=64.0GiB (68.7GB), run=31465-31465msec

But with the 2nd and all subsequent fio runs the zfs performance drops to this:

   READ: bw=1299MiB/s (1362MB/s), 1299MiB/s-1299MiB/s (1362MB/s-1362MB/s), io=64.0GiB (68.7GB), run=50460-50460msec
  WRITE: bw=581MiB/s (609MB/s), 581MiB/s-581MiB/s (609MB/s-609MB/s), io=34.0GiB (36.5GB), run=60002-60002msec

And it stays there until I do a zpool trim PCIE3. Then it starts out with higher performance, just to drop again. XFS performance does not depend on trim; it is consistently good.
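
For reference, a minimal sketch of the manual trim step referred to above (assuming the pool is named PCIE3, as in this report):

zpool trim PCIE3        # start a manual TRIM of the whole pool
zpool status -t PCIE3   # -t shows per-vdev TRIM progress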

Additional Info

Hardware
The NVMe is a KINGSTON A2000 1TB.
The PC is an AMD Ryzen 7 3700X with 64 GB RAM.

zpool get all PCIE3:

42# zpool get all PCIE3
NAME   PROPERTY                       VALUE                          SOURCE
PCIE3  size                           928G                           -
PCIE3  capacity                       1%                             -
PCIE3  altroot                        -                              default
PCIE3  health                         ONLINE                         -
PCIE3  guid                           7875971917660632396            -
PCIE3  version                        -                              default
PCIE3  bootfs                         -                              default
PCIE3  delegation                     on                             default
PCIE3  autoreplace                    off                            default
PCIE3  cachefile                      -                              default
PCIE3  failmode                       wait                           default
PCIE3  listsnapshots                  off                            default
PCIE3  autoexpand                     off                            default
PCIE3  dedupditto                     0                              default
PCIE3  dedupratio                     1.00x                          -
PCIE3  free                           918G                           -
PCIE3  allocated                      9,88G                          -
PCIE3  readonly                       off                            -
PCIE3  ashift                         12                             local
PCIE3  comment                        -                              default
PCIE3  expandsize                     -                              -
PCIE3  freeing                        0                              -
PCIE3  fragmentation                  0%                             -
PCIE3  leaked                         0                              -
PCIE3  multihost                      off                            default
PCIE3  checkpoint                     -                              -
PCIE3  load_guid                      15081823422894068691           -
PCIE3  autotrim                       off                            default
PCIE3  feature@async_destroy          enabled                        local
PCIE3  feature@empty_bpobj            active                         local
PCIE3  feature@lz4_compress           active                         local
PCIE3  feature@multi_vdev_crash_dump  enabled                        local
PCIE3  feature@spacemap_histogram     active                         local
PCIE3  feature@enabled_txg            active                         local
PCIE3  feature@hole_birth             active                         local
PCIE3  feature@extensible_dataset     active                         local
PCIE3  feature@embedded_data          active                         local
PCIE3  feature@bookmarks              enabled                        local
PCIE3  feature@filesystem_limits      enabled                        local
PCIE3  feature@large_blocks           active                         local
PCIE3  feature@large_dnode            enabled                        local
PCIE3  feature@sha512                 enabled                        local
PCIE3  feature@skein                  enabled                        local
PCIE3  feature@edonr                  enabled                        local
PCIE3  feature@userobj_accounting     active                         local
PCIE3  feature@encryption             enabled                        local
PCIE3  feature@project_quota          active                         local
PCIE3  feature@device_removal         enabled                        local
PCIE3  feature@obsolete_counts        enabled                        local
PCIE3  feature@zpool_checkpoint       enabled                        local
PCIE3  feature@spacemap_v2            active                         local
PCIE3  feature@allocation_classes     enabled                        local
PCIE3  feature@resilver_defer         enabled                        local
PCIE3  feature@bookmark_v2            enabled                        local

zfs get all PCIE3/BENCHMARK:

38# zfs get all PCIE3/BENCHMARK
NAME             PROPERTY              VALUE                 SOURCE
PCIE3/BENCHMARK  type                  filesystem            -
PCIE3/BENCHMARK  creation              Mo Aug 31  7:32 2020  -
PCIE3/BENCHMARK  used                  13,9G                 -
PCIE3/BENCHMARK  available             885G                  -
PCIE3/BENCHMARK  referenced            13,9G                 -
PCIE3/BENCHMARK  compressratio         1.00x                 -
PCIE3/BENCHMARK  mounted               yes                   -
PCIE3/BENCHMARK  quota                 none                  default
PCIE3/BENCHMARK  reservation           none                  default
PCIE3/BENCHMARK  recordsize            1M                    default
PCIE3/BENCHMARK  mountpoint            /mnt/PCIE3/BENCHMARK  inherited from PCIE3
PCIE3/BENCHMARK  sharenfs              off                   default
PCIE3/BENCHMARK  checksum              on                    default
PCIE3/BENCHMARK  compression           off                   default
PCIE3/BENCHMARK  atime                 on                    default
PCIE3/BENCHMARK  devices               on                    default
PCIE3/BENCHMARK  exec                  on                    default
PCIE3/BENCHMARK  setuid                on                    default
PCIE3/BENCHMARK  readonly              off                   default
PCIE3/BENCHMARK  zoned                 off                   default
PCIE3/BENCHMARK  snapdir               hidden                default
PCIE3/BENCHMARK  aclinherit            restricted            default
PCIE3/BENCHMARK  createtxg             17                    -
PCIE3/BENCHMARK  canmount              on                    default
PCIE3/BENCHMARK  xattr                 on                    default
PCIE3/BENCHMARK  copies                1                     default
PCIE3/BENCHMARK  version               5                     -
PCIE3/BENCHMARK  utf8only              off                   -
PCIE3/BENCHMARK  normalization         none                  -
PCIE3/BENCHMARK  casesensitivity       sensitive             -
PCIE3/BENCHMARK  vscan                 off                   default
PCIE3/BENCHMARK  nbmand                off                   default
PCIE3/BENCHMARK  sharesmb              off                   default
PCIE3/BENCHMARK  refquota              none                  default
PCIE3/BENCHMARK  refreservation        none                  default
PCIE3/BENCHMARK  guid                  2757755237307636527   -
PCIE3/BENCHMARK  primarycache          all                   default
PCIE3/BENCHMARK  secondarycache        all                   default
PCIE3/BENCHMARK  usedbysnapshots       128K                  -
PCIE3/BENCHMARK  usedbydataset         13,9G                 -
PCIE3/BENCHMARK  usedbychildren        0B                    -
PCIE3/BENCHMARK  usedbyrefreservation  0B                    -
PCIE3/BENCHMARK  logbias               latency               default
PCIE3/BENCHMARK  objsetid              515                   -
PCIE3/BENCHMARK  dedup                 off                   default
PCIE3/BENCHMARK  mlslabel              none                  default
PCIE3/BENCHMARK  sync                  standard              default
PCIE3/BENCHMARK  dnodesize             legacy                default
PCIE3/BENCHMARK  refcompressratio      1.00x                 -
PCIE3/BENCHMARK  written               13,9G                 -
PCIE3/BENCHMARK  logicalused           13,9G                 -
PCIE3/BENCHMARK  logicalreferenced     13,9G                 -
PCIE3/BENCHMARK  volmode               default               default
PCIE3/BENCHMARK  filesystem_limit      none                  default
PCIE3/BENCHMARK  snapshot_limit        none                  default
PCIE3/BENCHMARK  filesystem_count      none                  default
PCIE3/BENCHMARK  snapshot_count        none                  default
PCIE3/BENCHMARK  snapdev               hidden                default
PCIE3/BENCHMARK  acltype               off                   default
PCIE3/BENCHMARK  context               none                  default
PCIE3/BENCHMARK  fscontext             none                  default
PCIE3/BENCHMARK  defcontext            none                  default
PCIE3/BENCHMARK  rootcontext           none                  default
PCIE3/BENCHMARK  relatime              off                   default
PCIE3/BENCHMARK  redundant_metadata    all                   default
PCIE3/BENCHMARK  overlay               off                   default
PCIE3/BENCHMARK  encryption            off                   default
PCIE3/BENCHMARK  keylocation           none                  default
PCIE3/BENCHMARK  keyformat             none                  default
PCIE3/BENCHMARK  pbkdf2iters           0                     default
PCIE3/BENCHMARK  special_small_blocks  0                     default

fio command:

fio --runtime=60s --output=/mnt/PCIE3/BENCHMARK/read.out  /root/Benchmark-Results/fio//fio-bench-generic-seq-read.options
fio --runtime=60s --output=/mnt/PCIE3/BENCHMARK/write.out /root/Benchmark-Results/fio//fio-bench-generic-seq-write.options

fio configs:

SIZE=64G
NUMJOBS=1
BS=1M
5# cat fio-bench-generic-seq-read.options
[global]
#bs=1M
bs=${BS}
ioengine=libaio
#ioengine=psync
#ioengine=sync
invalidate=1
refill_buffers
numjobs=${NUMJOBS}
#fallocate=none
size=${SIZE}

[seq-read]
rw=read
stonewall
6# cat fio-bench-generic-seq-write.options
[global]
#bs=1M
bs=${BS}
ioengine=libaio
#ioengine=psync
#ioengine=sync
invalidate=1
refill_buffers
numjobs=${NUMJOBS}
# fallocate=none
size=${SIZE}

[seq-write]
rw=write
stonewall
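
The job files rely on fio's environment-variable substitution (${VAR}), so a run presumably looks something like this sketch (same paths and values as above):

SIZE=64G NUMJOBS=1 BS=1M \
  fio --runtime=60s \
      --output=/mnt/PCIE3/BENCHMARK/read.out \
      /root/Benchmark-Results/fio/fio-bench-generic-seq-read.options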
@mabod added the Status: Triage Needed and Type: Defect labels Aug 31, 2020
@gmelikov
Member

gmelikov commented Aug 31, 2020

Due to the CoW nature of ZFS, it simply exhausts your NVMe's cache faster than XFS does, as described in this review of the disk https://www.tomshardware.com/reviews/kingston-a2000-m2-nvme-ssd/2 :

As mentioned, the Kingston A2000 features a pSLC write cache. The drive can absorb about 165GB of writes before performance degrades from 2,200 MBps down to roughly 490 MBps.

Please try to test with a fio job SIZE of more than 200GB on XFS as well.

ZFS doesn't do anything special with trim; it's the same as other filesystems. But due to its CoW nature it may exhaust whatever cache the disk has (depending, of course, on the disk's cache implementation) faster.

Not a bug for me.
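
A sketch of what such a run could look like (the mount point and output file are placeholders; the size is chosen to exceed the ~165GB pSLC cache, and there is no runtime cap):

fio --name=seq-write --rw=write --bs=1M --ioengine=libaio \
    --size=300G --numjobs=1 --refill_buffers \
    --directory=/mnt/xfs-test --output=write-300g.out
    # /mnt/xfs-test is a placeholder for the XFS mount under test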

@gmelikov
Member

IIRC the IO codepath before and after trim in ZFS is the same, so this issue is about the behavior of this exact NVMe model.

@mabod if there is degradation on XFS after 300GB, I think it will degrade further on tests up to the full NVMe size. Could you test that too?

Ah, your tests didn't produce 300GB+ of writes. For example:

WRITE: bw=1412MiB/s (1480MB/s), 1412MiB/s-1412MiB/s (1480MB/s-1480MB/s), io=82.7GiB (88.8GB), run=60001-60001msec

IIRC io=82.7GiB means that only 82.7GiB of write IO was issued. I'm pretty sure that's due to the --runtime=60s limit, so you may not be filling the 165GB cache of your NVMe. Could you retest without the runtime limit?

From my experience, you can't test any SSD/NVMe well in just 60 seconds; some models may only show a huge degradation after hours of load.

@mabod
Author

mabod commented Aug 31, 2020

I deleted my previous post because I realized that I had runtime=60s set for fio, which did not allow the full test to run.
I will recreate my post when I have redone all the tests.

Please delete your answer as well. You can reply when my new entry is available.

Sorry for the inconvenience.

@mabod
Author

mabod commented Aug 31, 2020

@gmelikov

You are right when it comes to write performance. Without any runtime limit, the fio write performance with 300 GB is pretty similar for xfs and zfs, with btrfs lagging behind:

xfs:
WRITE: bw=775MiB/s (813MB/s), 775MiB/s-775MiB/s (813MB/s-813MB/s), io=300GiB (322GB), run=396316-396316msec
zfs:
WRITE: bw=689MiB/s (722MB/s), 689MiB/s-689MiB/s (722MB/s-722MB/s), io=40.3GiB (43.3GB), run=60002-60002msec
(still with runtime limit)
btrfs:
WRITE: bw=446MiB/s (468MB/s), 446MiB/s-446MiB/s (468MB/s-468MB/s), io=300GiB (322GB), run=688768-688768msec

But the read performance differences are still remarkable:

xfs:
READ: bw=2026MiB/s (2125MB/s), 2026MiB/s-2026MiB/s (2125MB/s-2125MB/s), io=300GiB (322GB), run=151620-151620msec
zfs:
READ: bw=1328MiB/s (1392MB/s), 1328MiB/s-1328MiB/s (1392MB/s-1392MB/s), io=77.8GiB (83.5GB), run=60001-60001msec
(still with runtime limit)
btrfs:
READ: bw=2173MiB/s (2278MB/s), 2173MiB/s-2173MiB/s (2278MB/s-2278MB/s), io=300GiB (322GB), run=141394-141394msec

Why is zfs so bad at reading?

@gmelikov
Member

@mabod Please show all the variables from the tests, especially BS:

PCIE3/BENCHMARK recordsize 1M default

If you test with a BS smaller than the recordsize, it may give you huge read amplification. A 1M recordsize is not the default 128k.
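
A quick sketch of checking this before a run (dataset name taken from this issue):

zfs get recordsize PCIE3/BENCHMARK
# it reports 1M here, so the fio jobs should use bs=1M, or the recordsize
# should be lowered to 128k to match a smaller bs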

@mabod
Author

mabod commented Aug 31, 2020

All parameters are listed in my first post

BS is 1M and so is recordsize

@PrivatePuffin
Contributor

Sidenote: this is basically the same reason the Phoronix tests comparing ZFS with XFS and other filesystems were a bust. ZFS on a single NVMe is not the best test case for comparing ZFS against other filesystems.

@mabod
Author

mabod commented Aug 31, 2020

Now the zfs test with 300 GB file size and no runtime limit has also finished. It does not look good for zfs:

   READ: bw=1208MiB/s (1267MB/s), 1208MiB/s-1208MiB/s (1267MB/s-1267MB/s), io=300GiB (322GB), run=254278-254278msec
  WRITE: bw=357MiB/s (374MB/s), 357MiB/s-357MiB/s (374MB/s-374MB/s), io=300GiB (322GB), run=861310-861310msec

The performance is really bad.

@mabod
Author

mabod commented Aug 31, 2020

@Ornias1993
Your answer is too easy. zfs performance is also very bad compared to btrfs, which is also a CoW filesystem. And we have several other issues open here on GitHub that discuss performance problems with zfs 0.8.x.

@PrivatePuffin
Contributor

PrivatePuffin commented Aug 31, 2020

@mabod As I said: it wasn't an answer, it was a sidenote. ZFS has been notoriously bad with single NVMe drives for quite some time, so it isn't the best test case for general testing of ZFS vs other filesystems. That's all I'm saying. Like I said: that's a sidenote.

I'm not saying anything about this specific issue, just that this is well known to be a somewhat problematic scenario with ZFS. That doesn't mean it should be problematic, I agree.

@gmelikov
Member

Now the zfs test with 300 GB file size and no runtime limit has also finished. It does not look good for zfs:

As it appears that this is indeed down to the NVMe's cache, you may want to run fio with ramp_time to get results that exclude the warm-up period from the statistics.

Regarding read performance on NVMe: this looks like a duplicate of #8381.
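
For reference, a sketch of what adding a warm-up window to the existing write job file could look like (ramp_time is a standard fio option; 60s is just an illustrative value):

[global]
bs=${BS}
ioengine=libaio
invalidate=1
refill_buffers
numjobs=${NUMJOBS}
size=${SIZE}
; illustrative warm-up: run 60s before fio starts recording statistics
ramp_time=60s

[seq-write]
rw=write
stonewall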

@ahrens added the Type: Performance label and removed the Type: Defect label Aug 31, 2020
@behlendorf removed the Status: Triage Needed label Aug 31, 2020
@IvanVolosyuk

IvanVolosyuk commented Sep 1, 2020

autotrim is disabled. What are the results with autotrim on?
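
For reference, a sketch of toggling it (autotrim is a pool-level property):

zpool set autotrim=on PCIE3
zpool get autotrim PCIE3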

@mabod
Author

mabod commented Sep 1, 2020

autotrim makes the performance even worse. Not the read speed, which basically stays the same, but the write speed for a 64 GB file size goes down to 483 MB/s:

WRITE: bw=461MiB/s (483MB/s), 461MiB/s-461MiB/s (483MB/s-483MB/s), io=64.0GiB (68.7GB), run=142220-142220msec

That is not much slower, but it is slower than the roughly 600 MB/s I get without autotrim.

@IvanVolosyuk

I also noticed that you have xattr=on and casesensitivity=sensitive. I would disable both or set xattr=sa. I'm not sure what kind of operations fio does, so this might be irrelevant. And change the recordsize back to 128k, at least to make random reads faster.
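
A sketch of those property changes (recordsize only affects files written after the change, so the benchmark files would need to be rewritten; casesensitivity can only be set when a dataset is created):

zfs set xattr=sa PCIE3/BENCHMARK
zfs set recordsize=128k PCIE3/BENCHMARK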

@mabod
Author

mabod commented Sep 4, 2020

I typically have xattr=sa. I don't know why it was xattr=on this time. Anyway, it has no impact on the fio performance figures; I tested it.

@orange888

Does it behave similarly poorly on a SATA SSD?

@mabod
Author

mabod commented Oct 17, 2020

I do not have a SATA SSD.

@adamdmoss
Contributor

I'd like to suggest the zfs_abd_scatter_enabled=0 module param. It removes one or two memcpy()s per read, especially if you're using compression (it might not affect the above case, where you're not testing with compression, but there might still be extraneous memcpy()s for other reasons).
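
A sketch of how a ZFS module parameter like this is typically set (the runtime change applies to newly allocated buffers; making it persistent requires a modprobe.d entry, the path below being the conventional location):

# at runtime
echo 0 > /sys/module/zfs/parameters/zfs_abd_scatter_enabled

# persistent across reboots / module reloads
echo "options zfs zfs_abd_scatter_enabled=0" >> /etc/modprobe.d/zfs.conf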

@mabod
Author

mabod commented May 21, 2021

I am closing this now. I am not seeing any performance issues with my NVME drives and zfs 2.0.4.

@mabod closed this as completed May 21, 2021
@dm17

dm17 commented Dec 13, 2022

@mabod As I said: it wasn't an answer, it was a sidenote. ZFS has been notoriously bad with single NVMe drives for quite some time, so it isn't the best test case for general testing of ZFS vs other filesystems. That's all I'm saying. Like I said: that's a sidenote.

I'm not saying anything about this specific issue, just that this is well known to be a somewhat problematic scenario with ZFS. That doesn't mean it should be problematic, I agree.

How much better is it supposed to be for mirrored NVMes? That should be considered a primary use case of ZFS, shouldn't it? If this is an "NVMe cache size" issue, then perhaps there should be recommended NVMe products for those planning to buy NVMes and use ZFS on them?

@mabod
Author

mabod commented Dec 13, 2022

I closed this issue in 2021. It was for zfs 0.8.4.
My closing comment was: "I am not seeing any performance issues with my NVME drives and zfs 2.0.4."
No need to open up this discussion again. It is obsolete.

@dm17

dm17 commented Dec 13, 2022

Right - thanks. Did you end up running the numbers again on your same hardware to compare modern ZFS to XFS, @mabod? Just curious what it ended up being.

@PrivatePuffin
Contributor

What it ended up being is you bumping this closed(!) 2021(!) issue into eight people's GitHub notifications.
Dick move to push your question that way.
