Performance regression during rclone copy with preallocation enabled #318
What is weird is that this seems to happen only with rclone so far; everything else does not seem to slow down. I tried a …
Most likely it uses sync writes to make sure the data has landed on disk. That would be the equivalent of sync=always on the dataset for the other tools.
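For illustration, here is a minimal sketch of what a synchronous (write-through) write looks like on Windows. This is an assumption about the kind of I/O rclone might be issuing, not taken from rclone's source; the file path is hypothetical. FILE_FLAG_WRITE_THROUGH and FlushFileBuffers are the Win32 analogues of O_SYNC/fsync.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical test file, for illustration only. */
	HANDLE h = CreateFileA("H:\\testdir\\syncwrite.bin",
	    GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
	    FILE_ATTRIBUTE_NORMAL | FILE_FLAG_WRITE_THROUGH, NULL);
	if (h == INVALID_HANDLE_VALUE) {
		fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
		return (1);
	}

	char buf[4096] = { 0 };
	DWORD written = 0;

	/* With FILE_FLAG_WRITE_THROUGH every WriteFile must reach stable
	 * storage before completing, roughly what sync=always forces on
	 * a dataset for tools that do not ask for it themselves. */
	if (!WriteFile(h, buf, sizeof (buf), &written, NULL))
		fprintf(stderr, "WriteFile failed: %lu\n", GetLastError());

	/* An explicit flush (the fsync equivalent) for good measure. */
	FlushFileBuffers(h);
	CloseHandle(h);
	return (0);
}
```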
Wow, when I try rclone with 512 in blksz and preallocation disabled, it instantly hangs the virtual machine, and really hard for some reason: I can't even reset the machine in VMware, and the kernel debugger will not break. After a few minutes it recovers. I'm not seeing any IO errors on the hardware though, and I also tried restarting the machine :\ I'm not sure if this also happens with 4096 in blksz, but this used to work.
So the system recovers after a while and the debugger breaks, but it did not reveal anything interesting. However, when disabling preallocation in rclone, the copy seems to proceed at a normal speed.
Hmmm, this only seems to happen when using VMware; I don't see the same when using libvirt/QEMU. Actually, it works as expected when I don't use VMware Shared Folders as the copy source. It does still crash like in #319 when aborting the copy, though.

Can this performance regression be related to preallocation/sparse files? On 1ecc0ca that is the only difference I see between a test run that is fast and one that is slow. When slow, it starts out fast but then drops over the course of a minute to < 10MB/s, and aborting takes minutes. When fast, it starts fast and stays fast, and aborting takes seconds. When I set blksz to 4096 it does not matter whether the test scripts use preallocation/sparse files or not. I guess I should find out whether specifying only …
It is definitely the …
I also see the same behavior on my second host, which uses QEMU instead of VMware.
OK so, you feel confident that setting z_blksz to ashift gives you a considerable performance boost. That is with us only setting z_blksz to 4096 if it is 0; does it help if we remove the …
Hmm, I wonder if it is just …
That definitely improves things. I'm unsure if it is the right thing to do; I just noticed performance drop sharply after 1ecc0ca. What is still weird is that I have been unable to measure this performance drop with anything other than …
I'll try that.
Or try the …
Setting …

```diff
diff --git a/module/os/windows/zfs/zfs_znode.c b/module/os/windows/zfs/zfs_znode.c
index f501b3ac1..eff94188f 100644
--- a/module/os/windows/zfs/zfs_znode.c
+++ b/module/os/windows/zfs/zfs_znode.c
@@ -624,9 +624,6 @@ zfs_znode_alloc(zfsvfs_t *zfsvfs, dmu_buf_t *db, int blksz,
 	zp->z_zfsvfs = zfsvfs;
 	mutex_exit(&zfsvfs->z_znodes_lock);
 
-	if (zp->z_blksz == 0)
-		zp->z_blksz = zfs_blksz(zp);
-
 	return (zp);
 }
 
diff --git a/module/zfs/zfs_log.c b/module/zfs/zfs_log.c
index 613fcdb23..ad0dfec0d 100644
--- a/module/zfs/zfs_log.c
+++ b/module/zfs/zfs_log.c
@@ -634,6 +634,9 @@ zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype,
 	uint64_t gen = 0;
 	ssize_t size = resid;
 
+	if (blocksize == 0)
+		blocksize = 4096;
+
 	if (zil_replaying(zilog, tx) || zp->z_unlinked ||
 	    zfs_xattr_owner_unlinked(zp)) {
 		if (callback != NULL)
```
Excellent, I'll do a commit to fish out the value properly when I wake up.
I moved it back into zfs_znode.c; there's too much change to zfs_log.c otherwise.
I think this is fixed.
I had to change the way we do this: running the same commit on macOS caused a panic in the removal tests. It is likely a bug in that code (see …), but I don't want to wait for that to be fixed, so let's handle the performance change differently, in just …
I also just did some more tests; I doubt the performance issue actually comes from …

Good (as described above), going @ ~220MB/s:
…

Now with only

```c
if (zp->z_blksz == 0)
	zp->z_blksz = zfs_blksz(zp);
```

in …

Bad (starts at > 200MB/s, slows down over multiple seconds and ends up at a few hundred kilobytes per second):
…
To make it clearer what I mean by good/bad, here are the code diffs against the latest …

Good:

```diff
diff --git a/module/os/windows/zfs/zfs_znode.c b/module/os/windows/zfs/zfs_znode.c
index f501b3ac1..eff94188f 100644
--- a/module/os/windows/zfs/zfs_znode.c
+++ b/module/os/windows/zfs/zfs_znode.c
@@ -624,9 +624,6 @@ zfs_znode_alloc(zfsvfs_t *zfsvfs, dmu_buf_t *db, int blksz,
 	zp->z_zfsvfs = zfsvfs;
 	mutex_exit(&zfsvfs->z_znodes_lock);
 
-	if (zp->z_blksz == 0)
-		zp->z_blksz = zfs_blksz(zp);
-
 	return (zp);
 }
 
diff --git a/module/zfs/zfs_log.c b/module/zfs/zfs_log.c
index f3013ba2d..8210e3a94 100644
--- a/module/zfs/zfs_log.c
+++ b/module/zfs/zfs_log.c
@@ -646,8 +646,8 @@ zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype,
 	}
 
 #ifdef _WIN32
-	if (zp->z_zfsvfs->z_os->os_spa->spa_min_alloc > blocksize)
-		blocksize = zp->z_zfsvfs->z_os->os_spa->spa_min_alloc;
+	if (blocksize == 0)
+		blocksize = 512;
 #endif
 
 	if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
```

Bad:

```diff
diff --git a/module/zfs/zfs_log.c b/module/zfs/zfs_log.c
index f3013ba2d..70b16e373 100644
--- a/module/zfs/zfs_log.c
+++ b/module/zfs/zfs_log.c
@@ -645,11 +645,6 @@ zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype,
 		return;
 	}
 
-#ifdef _WIN32
-	if (zp->z_zfsvfs->z_os->os_spa->spa_min_alloc > blocksize)
-		blocksize = zp->z_zfsvfs->z_os->os_spa->spa_min_alloc;
-#endif
-
 	if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
 		write_state = WR_INDIRECT;
 	else if (!spa_has_slogs(zilog->zl_spa) &&
```

I'm pretty sure this is not about …
Also, the …
OK OK, sorry. So you are saying that, performance-wise, you want the change in zfs_znode.c? I thought zfs_log_write was the slow issue, and it shouldn't be 0 if the new code is there, since it should set it …
Yes, it seems that performance-wise it only makes a difference if set in zfs_znode.c. I also just noticed something else interesting regarding this; give me a few minutes to investigate.
OK, never mind; I thought performance had recovered, but now, after 30 minutes, it is still slow.
Outdated, see #318 (comment)
OK, so? Go back to the znode_alloc code, and just make sure blocksize is not 0?
So, setting the … This also means that the latest …
I think it should be enough to set …
So you are saying it's enough to not change anything in znode, and just change zfs_log?
Yes
```diff
diff --git a/module/os/windows/zfs/zfs_znode.c b/module/os/windows/zfs/zfs_znode.c
index f501b3ac1..eff94188f 100644
--- a/module/os/windows/zfs/zfs_znode.c
+++ b/module/os/windows/zfs/zfs_znode.c
@@ -624,9 +624,6 @@ zfs_znode_alloc(zfsvfs_t *zfsvfs, dmu_buf_t *db, int blksz,
 	zp->z_zfsvfs = zfsvfs;
 	mutex_exit(&zfsvfs->z_znodes_lock);
 
-	if (zp->z_blksz == 0)
-		zp->z_blksz = zfs_blksz(zp);
-
 	return (zp);
 }
 
diff --git a/module/zfs/zfs_log.c b/module/zfs/zfs_log.c
index f3013ba2d..8210e3a94 100644
--- a/module/zfs/zfs_log.c
+++ b/module/zfs/zfs_log.c
@@ -646,8 +646,8 @@ zfs_log_write(zilog_t *zilog, dmu_tx_t *tx, int txtype,
 	}
 
 #ifdef _WIN32
-	if (zp->z_zfsvfs->z_os->os_spa->spa_min_alloc > blocksize)
-		blocksize = zp->z_zfsvfs->z_os->os_spa->spa_min_alloc;
+	if (blocksize == 0)
+		blocksize = 512;
 #endif
 
 	if (zilog->zl_logbias == ZFS_LOGBIAS_THROUGHPUT)
```

Basically this.
But this spa_min_alloc check should work: if blocksize is 0, it should set it to something greater than that?
That works too, but by setting it to a fixed value you don't need …
Yeah, but I thought we wanted to set it to ashift, so that it is faster.
My tests showed that it does not matter what we set in … The performance difference comes only from setting …
OK, so then: check for blocksize == 0 and set it to, say, SPA_MINBLOCKSIZE or whatever it's called. Then in znode, we want to set it to ashift?
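A minimal sketch of what that two-part suggestion could look like, reusing SPA_MINBLOCKSIZE and the spa_min_alloc expression already quoted in the diffs above. This is only an illustration of the proposal as worded here, not the committed fix.

```c
/*
 * 1) In zfs_log_write() (module/zfs/zfs_log.c): never log with a
 *    blocksize of 0; fall back to the smallest legal block size.
 */
if (blocksize == 0)
	blocksize = SPA_MINBLOCKSIZE;	/* 512 bytes */

/*
 * 2) In zfs_znode_alloc() (module/os/windows/zfs/zfs_znode.c): seed
 *    z_blksz from the pool's minimum allocation size (derived from
 *    ashift) instead of leaving it at 0.
 */
if (zp->z_blksz == 0)
	zp->z_blksz = zp->z_zfsvfs->z_os->os_spa->spa_min_alloc;
```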
If we have the …
It also seems clear that …
OK, if it's left as 0, then zfs_write() tunes it up, usually by powers of 2. So we just need a 0 test in zfs_log. Understood. What is unusual is the path it takes in zfs_log_write(); Linux etc. does not generally enter …
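To illustrate the "tunes it up by powers of 2" behaviour mentioned above, here is a standalone sketch of the rounding concept. It is not the actual zfs_write() code (which grows the block via zfs_grow_blocksize() up to the dataset recordsize); the function name and the simplified loop are purely illustrative.

```c
#include <stdint.h>

#define	SPA_MINBLOCKSIZE	(1ULL << 9)	/* 512 bytes */
#define	SPA_MAXBLOCKSIZE	(1ULL << 24)	/* 16 MiB */

/*
 * Illustration only: round a write size up to the next power-of-two
 * block size, the way a file's block size is conceptually "tuned up"
 * as the file grows, capped at the dataset recordsize.
 */
static uint64_t
tune_up_blocksize(uint64_t write_size, uint64_t recordsize)
{
	uint64_t blksz = SPA_MINBLOCKSIZE;

	while (blksz < write_size && blksz < recordsize &&
	    blksz < SPA_MAXBLOCKSIZE)
		blksz <<= 1;
	return (blksz);
}
```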
Oh, that makes sense. I don't really have a good grasp of the code yet. Did you check the …? With preallocation disabled I never saw any performance difference. Edit: swapped good/bad.
I managed to add preallocation support to the Windows fio build; I used the MSYS2 64-bit version. You'll need a configuration like this:

```ini
[global]
ioengine=windowsaio
direct=1
iodepth=8
time_based
runtime=600
directory=H\:\testdir
filename_format=testfile_$jobname_$jobnum_$filenum.bin
bs=32k
size=16G
fallocate=native

[sequential_write]
rw=write
numjobs=8
```

Change the directory parameter (note the escaped colon). If you change it to …
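For reference, a minimal sketch of how a file can be natively preallocated on Windows. This is an assumption about the kind of call fallocate=native (and rclone's preallocation) ends up making, not a quote of fio's or rclone's source; the path and size are hypothetical.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
	/* Hypothetical test file, matching the fio directory above. */
	HANDLE h = CreateFileA("H:\\testdir\\prealloc.bin",
	    GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
	    FILE_ATTRIBUTE_NORMAL, NULL);
	if (h == INVALID_HANDLE_VALUE) {
		fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
		return (1);
	}

	/* Ask the filesystem to reserve 16 GiB of allocation up front
	 * without writing any data (the logical file size / EOF is not
	 * changed by this call). */
	FILE_ALLOCATION_INFO fai;
	fai.AllocationSize.QuadPart = 16LL * 1024 * 1024 * 1024;

	if (!SetFileInformationByHandle(h, FileAllocationInfo,
	    &fai, sizeof (fai)))
		fprintf(stderr, "SetFileInformationByHandle failed: %lu\n",
		    GetLastError());

	CloseHandle(h);
	return (0);
}
```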
I noticed that the write speed instantly drops; the gradual dropping of displayed speed in …
Bad results:
…
Hmmm, very weird: fio is also slow in the case where rclone is fast???? Disabling preallocation makes fio fast again. Now I'm wondering if rclone actually changes its IO pattern based on some kind of block size detection?
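For what it's worth, here is a minimal sketch of how an application on Windows could detect the volume's sector/cluster size and adapt its IO pattern to it. Whether rclone actually does anything like this is speculation on my part; the GetDiskFreeSpace call itself is standard Win32, and the drive letter is just the test pool used above.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
	DWORD sectors_per_cluster, bytes_per_sector, free_clusters,
	    total_clusters;

	/* Hypothetical drive letter, matching the test pool above. */
	if (!GetDiskFreeSpaceA("H:\\", &sectors_per_cluster,
	    &bytes_per_sector, &free_clusters, &total_clusters)) {
		fprintf(stderr, "GetDiskFreeSpace failed: %lu\n",
		    GetLastError());
		return (1);
	}

	/* An application could size and align its writes off these values,
	 * e.g. issuing 512-byte IO on a volume that reports 512-byte
	 * sectors, which would interact badly with a 4K-ashift pool. */
	printf("bytes per sector:  %lu\n", bytes_per_sector);
	printf("bytes per cluster: %lu\n",
	    bytes_per_sector * sectors_per_cluster);
	return (0);
}
```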
I'm reopening this because I think we still need to figure this out further.
Yea, so the …
Ok, so the difference with … This is the config:

```ini
[global]
ioengine=windowsaio
direct=0
iodepth=8
time_based
runtime=600
directory=D\:\testdir
filename_format=testfile_$jobname_$jobnum_$filenum.bin
bs=32k
size=16G
fallocate=native

[sequential_write]
rw=write
numjobs=8
```

I wonder why direct IO is always so much slower.
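As an aside on the direct=1 vs direct=0 difference: on Windows, unbuffered (direct) IO is typically requested with FILE_FLAG_NO_BUFFERING, which bypasses the system cache but requires sector-aligned buffers, offsets and transfer sizes. That alignment requirement is one plausible reason it behaves so differently from buffered writes. A minimal sketch follows; the file path is hypothetical, and mapping fio's direct=1 to this flag is an assumption.

```c
#include <windows.h>
#include <stdio.h>

int main(void)
{
	/* FILE_FLAG_NO_BUFFERING is roughly what direct IO amounts to
	 * with the windowsaio engine (assumption, not from fio's source). */
	HANDLE h = CreateFileA("D:\\testdir\\direct.bin",
	    GENERIC_WRITE, 0, NULL, CREATE_ALWAYS,
	    FILE_FLAG_NO_BUFFERING | FILE_FLAG_WRITE_THROUGH, NULL);
	if (h == INVALID_HANDLE_VALUE) {
		fprintf(stderr, "CreateFile failed: %lu\n", GetLastError());
		return (1);
	}

	/* Unbuffered IO requires buffer address, size and file offset to
	 * be multiples of the sector size; VirtualAlloc returns
	 * page-aligned memory, which satisfies 512/4096-byte sectors. */
	DWORD blksz = 32 * 1024;	/* matches bs=32k in the fio job */
	void *buf = VirtualAlloc(NULL, blksz, MEM_COMMIT | MEM_RESERVE,
	    PAGE_READWRITE);
	DWORD written = 0;

	if (buf == NULL || !WriteFile(h, buf, blksz, &written, NULL))
		fprintf(stderr, "write failed: %lu\n", GetLastError());

	if (buf != NULL)
		VirtualFree(buf, 0, MEM_RELEASE);
	CloseHandle(h);
	return (0);
}
```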
So, to summarize what I found out:
…

Bad: …

Good:

```diff
diff --git a/module/os/windows/zfs/zfs_znode.c b/module/os/windows/zfs/zfs_znode.c
index f501b3ac1..eff94188f 100644
--- a/module/os/windows/zfs/zfs_znode.c
+++ b/module/os/windows/zfs/zfs_znode.c
@@ -624,9 +624,6 @@ zfs_znode_alloc(zfsvfs_t *zfsvfs, dmu_buf_t *db, int blksz,
 	zp->z_zfsvfs = zfsvfs;
 	mutex_exit(&zfsvfs->z_znodes_lock);
 
-	if (zp->z_blksz == 0)
-		zp->z_blksz = zfs_blksz(zp);
-
 	return (zp);
 }
```
OK, if I have understood things correctly, I cleaned up the code in both places. If we are happy with it, we can close this. Sorry for the delay; I wanted to push out the macOS version.
The code you pushed is pretty much the same as what I've been testing for the last few days, so I'm going to close this.
This seems to be a regression that started with 1ecc0ca. See #283 for some more context.
I just logged zpool iostat 5 for an 8-threaded rclone copy to a newly created pool, and it seems like there is some write amplification going on when blksz is not set to 4096.

iostat_512.txt
iostat_4096.txt
Note that in the 512 byte case iostat reports ~80-100MB/s IO with a high write operation count, but rclone itself only reports 5-10MB/s. What I also noticed is that it takes a long time to even get the rclone process to abort the copy.
In the 4096 case iostat reports ~160-240MB/s and a much lower write operation count, which is about what I would expect and have seen before 1ecc0ca. rclone itself also reports a very similar speed range as iostat during the copy. I can also abort rclone within seconds instead of minutes.
Specifying --local-no-preallocate on the rclone command line also makes it go fast again.