improve batching done in zil_commit() #6337

prakashsurya · 2017-07-11T17:24:42Z

This is a first attempt at porting the OpenZFS pull request here:

0000 improve batching done in zil_commit() openzfs#421

This hasn't seen much testing apart from verifying it compiles in an
Ubuntu VM, and it doesn't panic when using a very light sync write
workload using fio. I'll have to rely on the automated testing and
volunteers to verify the correctness and performance; if needed, I can
try to allocate some time to do this myself, but I doubt I'll have the
time to do that in the near future.

One big difference between this version, and the version posted to
OpenZFS, is how the lifetime of an itx_t structure differs between the
codebases. In this Linux port, the itx_t structure maintains an
itx_callback field, and this function needs to be called after the itx
has been committed to disk. Thus, the itx_t must wait to be destroyed
until it's been committed to disk. This is in contrast to how the
itx_t is immediately destroyed once it's committed to an lwb in the
OpenZFS codebase, but prior to the lwb being committed to disk.

To adapt the new algorithm to this itx_callback requirement, an
lwb_itxs list was added to the lwb_t structure. Each time an itx is
committed to an lwb, it is added to the lwb's lwb_itxs list. This list
is used to keep track of the itxs such that their callback can be called
after the itx is safe on disk, and only then will the itx be destroyed.

Additionally, this attempts to address the same deficiencies that
Richard Yao is addressing in the following pull requests:

I'd be great to perform some head-to-head comparisons of this with
Richard's work, and see which parts of each design work best.

Signed-off-by: Prakash Surya [email protected]

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the ZFS on Linux code style requirements.
I have updated the documentation accordingly.
I have read the CONTRIBUTING document.
I have added tests to cover my changes.
All new and existing tests passed.
All commit messages are properly formatted and contain Signed-off-by.
Change has been approved by a ZFS on Linux member.

behlendorf · 2017-07-11T20:52:31Z

It would be great if you could update the commit message comment to reference the SPL PR. With that hint the buildbot will be able to build and test this PR.

Requires-spl: refs/pull/627/head

[edit] Oh you did! Nevermind!

prakashsurya · 2017-07-11T21:48:37Z

yup, I added that directive. I'm not quite sure how to interpret the test failures yet (although I haven't spent much time trying); is there any documentation to help me out with that?

behlendorf · 2017-07-11T23:09:48Z

The console and ztest logs are my favorite place to start if the test suite doesn't run to completion. We're currently not collecting cores but stack trace is often more than enough. We've add a little bit of documention describing how to run the same tests locally but it'd be nice to add more.

[ 2562.047197] VERIFY(!list_is_empty(list)) failed
[ 2562.050654] PANIC at list.h:126:list_remove()
[ 2562.054008] Showing stack for process 24511
[ 2562.054013] CPU: 1 PID: 24511 Comm: z_null_int Tainted: P           OE   4.9.2-kmemleak #1
[ 2562.054014] Hardware name: Xen HVM domU, BIOS 4.2.amazon 11/11/2016
[ 2562.054017]  ffff93b70b063b50 ffffffffad619156 ffffffffc0e749ff 000000000000007e
[ 2562.054023]  ffff93b70b063b60 ffffffffc0a6a802 ffff93b70b063ce8 ffffffffc0a6aad9
[ 2562.054029]  ffff8828b3158000 ffff882900000028 ffff93b70b063cf8 ffff93b70b063c98
[ 2562.054035] Call Trace:
[ 2562.054044]  [<ffffffffad619156>] dump_stack+0xb0/0xea
[ 2562.054065]  [<ffffffffc0a6a802>] spl_dumpstack+0x62/0x70 [spl]
[ 2562.054080]  [<ffffffffc0a6aad9>] spl_panic+0xd9/0x120 [spl]
[ 2562.054087]  [<ffffffffad21d7e7>] ? stack_trace_call+0x67/0xb0
[ 2562.054093]  [<ffffffffadc9e387>] ? ftrace_call+0x5/0x34
[ 2562.054098]  [<ffffffffad34c27a>] ? __delete_object+0x5a/0xb0
[ 2562.054116]  [<ffffffffc0a6aa05>] ? spl_panic+0x5/0x120 [spl]
[ 2562.054274]  [<ffffffffc0d4eced>] zil_lwb_flush_vdevs_done+0x22d/0x640 [zfs]
[ 2562.054402]  [<ffffffffc0d65dd6>] zio_done+0xa16/0x2310 [zfs]
[ 2562.054533]  [<ffffffffc0d653c5>] ? zio_done+0x5/0x2310 [zfs]
[ 2562.054663]  [<ffffffffc0d59749>] zio_execute+0x149/0x3f0 [zfs]
[ 2562.054680]  [<ffffffffc0a667ed>] taskq_thread+0x36d/0x760 [spl]
[ 2562.054687]  [<ffffffffad10bc80>] ? try_to_wake_up+0x680/0x680
[ 2562.054701]  [<ffffffffc0a66480>] ? taskq_thread_spawn+0x80/0x80 [spl]
[ 2562.054707]  [<ffffffffad0fabd9>] kthread+0x119/0x150
[ 2562.054713]  [<ffffffffad0faac0>] ? kthread_flush_work+0x1f0/0x1f0
[ 2562.054717]  [<ffffffffadc9ccf5>] ret_from_fork+0x25/0x30

prakashsurya · 2017-07-12T00:37:07Z

perfect, thanks!

prakashsurya · 2017-07-12T16:37:28Z

Any clues about these build failures:

In file included from /usr/src/spl-0.7.0/include/sys/sysmacros.h:31:0,
                 from /tmp/zfs-build-buildbot-DH81uCAS/BUILD/zfs-kmod-0.7.0/_kmod_build_3.13.0-117-generic/../zfs-0.7.0/module/zfs/zio.c:27:
/tmp/zfs-build-buildbot-DH81uCAS/BUILD/zfs-kmod-0.7.0/_kmod_build_3.13.0-117-generic/../zfs-0.7.0/module/zfs/zio.c: In function 'zio_shrink':
/usr/src/spl-0.7.0/include/sys/debug.h:76:64: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
 #define VERIFY3P(x,y,z) VERIFY3_IMPL(x, y, z, uintptr_t, "%p", (void *))
                                                                ^
/usr/src/spl-0.7.0/include/sys/debug.h:70:6: note: in definition of macro 'VERIFY3_IMPL'
      CAST (LEFT), CAST (RIGHT)))
      ^
/usr/src/spl-0.7.0/include/sys/debug.h:113:26: note: in expansion of macro 'VERIFY3P'
 #define ASSERT3P(x,y,z)  VERIFY3P(x, y, z)
                          ^
/tmp/zfs-build-buildbot-DH81uCAS/BUILD/zfs-kmod-0.7.0/_kmod_build_3.13.0-117-generic/../zfs-0.7.0/module/zfs/zio.c:1163:2: note: in expansion of macro 'ASSERT3P'
  ASSERT3P(zio->io_orig_size, ==, zio->io_size);
  ^
/usr/src/spl-0.7.0/include/sys/debug.h:76:64: error: cast to pointer from integer of different size [-Werror=int-to-pointer-cast]
 #define VERIFY3P(x,y,z) VERIFY3_IMPL(x, y, z, uintptr_t, "%p", (void *))
                                                                ^
/usr/src/spl-0.7.0/include/sys/debug.h:70:19: note: in definition of macro 'VERIFY3_IMPL'
      CAST (LEFT), CAST (RIGHT)))
                   ^
/usr/src/spl-0.7.0/include/sys/debug.h:113:26: note: in expansion of macro 'VERIFY3P'
 #define ASSERT3P(x,y,z)  VERIFY3P(x, y, z)
                          ^
/tmp/zfs-build-buildbot-DH81uCAS/BUILD/zfs-kmod-0.7.0/_kmod_build_3.13.0-117-generic/../zfs-0.7.0/module/zfs/zio.c:1163:2: note: in expansion of macro 'ASSERT3P'
  ASSERT3P(zio->io_orig_size, ==, zio->io_size);
  ^
cc1: all warnings being treated as errors

?

I don't see these when I build locally, and I didn't modify the ASSERT3P macro, so I'm confused as to how my change might be causing these. are these build failures normal and/or safe to ignore?

behlendorf · 2017-07-13T16:58:39Z

module/zfs/zio.c

-	ASSERT(zio->io_orig_size == zio->io_size);
-	ASSERT(size <= zio->io_size);
+	ASSERT3P(zio->io_executor, ==, NULL);
+	ASSERT3P(zio->io_orig_size, ==, zio->io_size);


This should be ASSERT3U. The build failure you're seeing is coming from 32-bit system where the pointer size is 32-bit and not 64-bit. The ASSERT3P macro is forcing a cast of a uint64_t to a uintptr_t which is only 32-bit on failing architectures. It looks like ASSERT3P is used incorrectly in a zio_wait() too and maybe elsewhere.

thanks! I should have caught that myself, sorry for the trouble.

This is a first attempt at porting the OpenZFS pull request here: - openzfs/openzfs#421 This hasn't seen much testing apart from verifying it compiles in an Ubuntu VM, and it doesn't panic when using a very light sync write workload using fio. I'll have to rely on the automated testing and volunteers to verify the correctness and performance; if needed, I can try to allocate some time to do this myself, but I doubt I'll have the time to do that in the near future. One big difference between this version, and the version posted to OpenZFS, is how the lifetime of an `itx_t` structure differs between the codebases. In this Linux port, the `itx_t` structure maintains an `itx_callback` field, and this function needs to be called after the itx has been committed to disk. Thus, the `itx_t` must wait to be destroyed until it's been committed to disk. This is in contrast to how the `itx_t` is immediately destroyed once it's committed to an lwb in the OpenZFS codebase, but prior to the lwb being committed to disk. To adapt the new algorithm to this `itx_callback` requirement, an `lwb_itxs` list was added to the `lwb_t` structure. Each time an itx is committed to an lwb, it is added to the lwb's `lwb_itxs` list. This list is used to keep track of the itxs such that their callback can be called after the itx is safe on disk, and only then will the itx be destroyed. Additionally, this attempts to address the same deficiencies that Richard Yao is addressing in the following pull requests: - #6133 - openzfs/openzfs#404 I'd be great to perform some head-to-head comparisons of this with Richard's work, and see which parts of each design work best. Signed-off-by: Prakash Surya <[email protected]>

prakashsurya · 2017-08-22T00:52:34Z

@behlendorf this is causing a hang when running the mmap_read_001_pos test case, due to readmmap hanging when it calls msync; I can easily reproduce it in my VM:

delphix@ubuntu:~$ pstree -p 2337
test-runner.py(2337)-+-sudo(3183)---mmap_read_001_p(3185)---readmmap(3198)
                     `-{test-runner.py}(3184)
delphix@ubuntu:~$ sudo cat /proc/3198/stack
[<ffffffff9bdafb68>] wait_on_page_bit_common+0x118/0x1d0
[<ffffffff9bdafd34>] __filemap_fdatawait_range+0x114/0x190
[<ffffffff9bdafdc4>] filemap_fdatawait_range+0x14/0x30
[<ffffffff9bdb2477>] filemap_write_and_wait_range+0x57/0x90
[<ffffffffc08f049d>] zpl_fsync+0x3d/0x110 [zfs]
[<ffffffff9be7b93b>] vfs_fsync_range+0x4b/0xb0
[<ffffffff9bdf6af2>] SyS_msync+0x182/0x200
[<ffffffff9c4d453b>] entry_SYSCALL_64_fastpath+0x1e/0xad
[<ffffffffffffffff>] 0xffffffffffffffff

do you have any clues as to what might be going on? I'm not familiar with the mmap code, is there linux specific changes that I may have broken in this patch? I'll have to do some reading to better understand how filemap_write_and_wait_range is supposed to work...

FWIW, I can run the readmmap binary outside of the test framework (with the test process still stuck) and it appears to work:

delphix@ubuntu:~$ sudo ./tests/zfs-tests/cmd/readmmap/readmmap /var/tmp/testdir/test-read-file
good data from read: buf[313]=1

behlendorf · 2017-08-22T04:33:32Z

do you have any clues as to what might be going on?

I'm not sure of the detail but I've got a good guess where to start looking.

The stack you posted is what I'd expect to happen if for some reason the zil_callback_t passed to zfs_log_write() didn't run on each page. filemap_write_and_wait_range() is waiting to be notified that each page is safe on disk. The callback (zfs_putpage_commit_cb) is responsible for that notification by ending the writeback on each dirty page and marking it clean.

I haven't had a chance to carefully go over the patch but is it possible there's a case which was missed where the callback doesn't run?

For background there's a nice description in commit 119a394 regarding why this was needed for Linux.

prakashsurya · 2017-08-22T13:44:22Z

Thanks, Brian. That's basically what I was thinking too, so it's good to get a second opinion. I'll revisit the code and see if botched something. OpenZFS doesn't have the "itx callback", so I wouldn't be surprised if I got that subtly wrong when pulling this code over to linux.

prakashsurya mentioned this pull request Jul 11, 2017

improve batching done in zil_commit() openzfs/spl#627

Closed

behlendorf requested changes Jul 13, 2017

View reviewed changes

prakashsurya mentioned this pull request Aug 19, 2017

ZIL Pipelining #6133

Closed

12 tasks

prakashsurya closed this Aug 25, 2017

prakashsurya deleted the zol-DLPX-52375 branch August 25, 2017 19:24

prakashsurya mentioned this pull request Aug 25, 2017

OpenZFS 8585 - improve batching done in zil_commit() #6566

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

improve batching done in zil_commit() #6337

improve batching done in zil_commit() #6337

prakashsurya commented Jul 11, 2017 •

edited

Loading

behlendorf commented Jul 11, 2017 •

edited

Loading

prakashsurya commented Jul 11, 2017

behlendorf commented Jul 11, 2017

prakashsurya commented Jul 12, 2017

prakashsurya commented Jul 12, 2017

behlendorf Jul 13, 2017

prakashsurya Jul 13, 2017

prakashsurya commented Aug 22, 2017 •

edited

Loading

behlendorf commented Aug 22, 2017

prakashsurya commented Aug 22, 2017

improve batching done in zil_commit() #6337

improve batching done in zil_commit() #6337

Conversation

prakashsurya commented Jul 11, 2017 • edited Loading

Types of changes

Checklist:

behlendorf commented Jul 11, 2017 • edited Loading

prakashsurya commented Jul 11, 2017

behlendorf commented Jul 11, 2017

prakashsurya commented Jul 12, 2017

prakashsurya commented Jul 12, 2017

behlendorf Jul 13, 2017

Choose a reason for hiding this comment

prakashsurya Jul 13, 2017

Choose a reason for hiding this comment

prakashsurya commented Aug 22, 2017 • edited Loading

behlendorf commented Aug 22, 2017

prakashsurya commented Aug 22, 2017

prakashsurya commented Jul 11, 2017 •

edited

Loading

behlendorf commented Jul 11, 2017 •

edited

Loading

prakashsurya commented Aug 22, 2017 •

edited

Loading