some copied files are still corrupted (chunks replaced by zeros) #15933
cc @robn |
<sam_> thulle: also, it matters as to when the corruption occurs. is it when

Ran the steps manually first without issue, then made a script that passed the first loop, but stopped on the 2nd loop:
On the 2nd loop I got:
|
<PMT> neat. the next question i'd have, is if you have brt enabled, and set zfs_bclone_wait_dirty=1, does the issue also go away?
Started the script again, but it stalled and something seemed stuck writing at ~10MB/s to the nvme. Killed the script and saw this in syslog:
Checking the process list, I suspect this straggler is the culprit, the only remaining child of the ebuild python process:
edit: As soon as I disabled zfs_bclone_wait_dirty the write load disappeared and the cp process died. |
Rebooted and tried the same thing again, but not killing the script this time, instead just disabling zfs_bclone_wait_dirty.
With the script left running and zfs_bclone_wait_dirty disabled, it completes instead, with no corruption detected.
Two theories:
<thulle> tiny window that's only hit w higher frequencies?
Stabilizing the core speeds a bit:
Letting it vary more:
|
<PMT> Please be sure to include the output of "zfs get all [DATASET THIS HAPPENS ON]"
|
<PMT> the one other experiment would be to try zfs_dmu_offset_next_sync
|
Yeah this reproduces really easily on 6.7.6 and 2.2.3 with a dataset with all default settings and the tunable flipped to enable BRT. Gross. |
How about #15588 - I wonder if that changes the outcome? |
I tried, and no. |
Thanks for trying - and damn! |
"Really easily" as in "just as easily as before"? |
Rebased and running this all night... numinit/nixpkgs@c0712aa
I'll update in case anything happens but it didn't repro immediately which is a good sign. |
@thulle Are you able to reproduce this with |
See #15994 for a possible explanation for |
10,000,000 iterations of 16k blocks on each of 8 cores, and... no trouble.
Is this another problem? |
Yes. |
So, not racing holes; probably very specifically pointing toward something else related to block cloning, or at least a second-order issue thereof, then. Maybe something ARC related? |
I think this is

The file is miscopied if it has been synced out once, pages are dirty, and the kernel has not flushed any of those pages yet. (See

Here is a trivial reproducer:
Output:
What's happening here?
This happens because none of the pages have been written back yet, so the file looks clean to the clone code, but the contents are the last-synced state rather than the latest mmap'ed state. (Also note that merely reading the file without using mmap or clone will flush the dirty pages, hiding the bug.) The golang build is a reproducer because it builds the toolchain multiple times in the work directory and overwrites the contents with mmap, which reliably triggers the sequence above. The script above succeeds with no differences for |
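Since the reproducer script itself isn't shown above, here is a minimal C sketch of the same failure sequence, assuming fsync stands in for "synced out once" and the FICLONE ioctl stands in for the clone that cp performs; it is an illustration of the described race, not the original reproducer:

/* write and sync a file, dirty it via mmap, then clone before writeback */
#include <fcntl.h>
#include <linux/fs.h>           /* FICLONE */
#include <string.h>
#include <sys/ioctl.h>
#include <sys/mman.h>
#include <unistd.h>

int
main(void)
{
        char buf[4096] = { 0 };
        int fd = open("f", O_RDWR | O_CREAT | O_TRUNC, 0644);
        write(fd, buf, sizeof (buf));
        fsync(fd);                      /* file has been synced out once */

        char *p = mmap(NULL, 4096, PROT_READ | PROT_WRITE,
            MAP_SHARED, fd, 0);
        memset(p, 0xab, 4096);          /* pages dirty, not yet flushed */

        int fd2 = open("f.clone", O_RDWR | O_CREAT | O_TRUNC, 0644);
        ioctl(fd2, FICLONE, fd);        /* clone while pages are dirty */
        /* on affected versions f.clone may hold the last-synced zeros
         * rather than the 0xab bytes written through the mapping */
        return (0);
}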
That tracks with what I privately found, which was that turning the conditional check for cached pages triggering a flush in

My theory when I last looked at this was that because we didn't hold the range lock until after we did this check, there was a nasty window for us to have passed that check, not triggered the flush, and then gone bang. But attempting to move that check inside the range lock (in addition to sometimes causing fun with nested locks) seemed to still break where unconditional flushing did not, suggesting that there was more to it. Then life got busy for the past few days and I haven't looked again since.

Another interesting question is whether this reproduces on FreeBSD. My theory for the issue I was observing (which was that, in my testing at least, I ended up with everything past a certain point zeroed where it shouldn't have been) was that the file size was changing between when we read it and used it for the check, or between the check and when we actually did the probe, resulting in comedy. But since FreeBSD's mapping of the check doesn't actually use the range arguments, it might not have that problem, if that was indeed the problem.

Anyway, if someone wants to just slap a FIXME on this, I imagine you could do something like turning
into
with a new tunable plumbed in, and have a nice day for now. But this feels like the problem there is that a clone should be triggering that before it tries it, not flushing before anything that might reference it is hitting it later. (cf. #15772). So similar to how
after the clone, we probably want something like (untested):
. |
Where that tunable is an if guard around the above patch. |
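For concreteness, a sketch of what that FIXME-style change could look like on the Linux side; the tunable name and exact placement are my assumptions, while zn_has_cached_data(), ZTOI(), and filemap_write_and_wait_range() are taken from the existing OpenZFS/Linux code:

/* hypothetical tunable: flush even when no cached pages are detected */
static int zfs_bclone_always_flush = 1;

/* in the clone path, before the clone inspects DMU state */
if (zfs_bclone_always_flush ||
    zn_has_cached_data(inzp, inoff, inoff + len - 1)) {
        /* push mmap'ed dirty pages down first; ZTOI() maps the
         * znode to its Linux inode */
        filemap_write_and_wait_range(ZTOI(inzp)->i_mapping,
            inoff, inoff + len - 1);
}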
Don't try it, probably. This ameliorates the reproducer script listed, but trying the original report, of

Wonder if what I'm seeing is this fix turning this test case into #15994. |
Nah, just me rushing in trying things. What I actually wanted was:
But I didn't think carefully about what

...and if I do that in there, I deadlock on the rangelock. So let's move that around... |
rrevans/zfs@ae5efbe fixes both emerge go (using @thulle's step-at-a-time testing script) and the reproducer script for me. I've run both in a loop for 30 minutes with no failures so far. The dirty page flush happens before the rangelock in

I am not sure this is correct or sufficient, or if there are other bclone/mmap gaps. (And it flushes the whole file when it could maybe flush only the cloned range.) That said, it does pass ZTS |
That doesn't fix it, for me, as I already said above. |
Sorry, to be less terse: I wound up with rincebrain@e55d3b0 because you can't flush that under the rangelock without sometimes deadlocking, so there's going to be a window where it might change between the flush and the lock again. So I added a case for the check inside the lock as well, told it to retry once to hopefully catch more of the trivial cases of it occasionally being dirtied with inconvenient timing, and then fall back to not reflinking in the event that it's still dirty.

I confirmed that it sometimes did still reflink with that code (since I was slightly worried I might be missing some detail of how it worked that implied the check would always be true), and then pushed it to see whether the CI blew up. It can probably be improved to avoid that window, but similarly, I could not produce incorrect behavior after hammering it for some time, and anything simpler still failed some of the time. (Not, admittedly, very often, but "one in 50 runs" is still "not zero times".)

(Also, to be clear, I'm redoing my tests with your patch to see if my memory of having tried that is faulty, because I would certainly prefer a simpler solution...) |
I'll need to look closer at the area, but just a first thought: if it is impossible to reliably flush the caches, we could abort cloning in the zn_has_cached_data() case. I.e. we could flush the caches first before taking the lock, but if somebody is still writing the mmap()'ed region when we get the lock, then give up completely. Though since we'd need to commit the txg before we'd be able to clone the file, we may just give up on it from the beginning and not bother to even flush the cache, leaving it to the fall-back copy routine to do. |
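A rough sketch of that give-up path as described (the surrounding shape is assumed; zfs_rangelock_enter()/zfs_rangelock_exit() and EAGAIN triggering the copy fallback are from the existing OpenZFS code):

zfs_locked_range_t *inlr;

/* in zfs_clone_range, after taking the source rangelock */
inlr = zfs_rangelock_enter(&inzp->z_rangelock, inoff, len, RL_READER);
if (zn_has_cached_data(inzp, inoff, inoff + len - 1)) {
        /* somebody still has (possibly dirty) pages over the range:
         * abort the clone and let the caller fall back to a plain copy */
        zfs_rangelock_exit(inlr);
        return (SET_ERROR(EAGAIN));
}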
@amotin +1

Looking more at this, I suspect background writeback is a confounding factor (and in turn also system load). Staring at

But note the fallback copy routine might seek holes and end up calling

p.s. I don't think it's correct to use rangelocks to coordinate with mmap writes, since page locking is entirely separate and mmap writes are inherently racy. IOW, if there are racing page writes there's no way to guarantee any outcome anyway, right? This is true even for filesystems that use the pagecache directly. That said, copying should always reliably observe page writes guaranteed to be dirtied before the clone starts (e.g. the build tool process |
@rincebrain As I currently understand, the problem for non-BRT is that sometimes mmap page dirtiness doesn't reach the DMU before That said, that approach is a regression for sparse copies without mmap, and I'm not sure we need to break that case. I'm thoroughly convinced that the original fix works for the original problem and its root cause, though it's still got theoretical data races which I think are unreachable in practice. I've been studying this space and all the code ever since in pursuit of an implementation that doesn't need to force txg syncs because I think ZFS shouldn't have limits in this regard (and I have working code that indeed does that and passes ZTS, ztest, and my own stress tests). While #16000 is super conservative I'd vote for adding code that keeps llseek functional EXCEPT for files with mappings (dirty or not) which is an easy test to add. Maybe leave in the option to fully disable as cheap advice for users in case this analysis is wrong somehow...? |
Given the history, I'm really in favor of "super conservative" at the moment. I'm not opposed to other improvements, obviously, but given that distros tend to ship releases and take forever to cherrypick fixes from the future if they won't update (if they do it at all), I'm really in favor of safety models that don't require assuming people run a newer version if we're wrong about it being safe. Also, I have a similar document detailing the history, with commits, so it's interesting to read someone else's deep dive, thank you. A little curious why you prepared it, but. |
@Gendra13 can you please try with 20 or 200 files instead of 2000 on 2.1.5 / 5.15.0 without bclone but with Occasionally when developing the repro I saw smaller numbers of files being more effective for some cases including this one if memory serves. |
Breaks on 5.10.0-27
e: Were you actually asking about 2.1.5, or no? Because bclone only went in on 2.2, so that setting wouldn't exist there... |
Same applies to 2.1.5 stock with either dmu_offset_next_sync setting.
0:
|
That was weird. When reducing the files down to 200 (or 20) I couldn't trigger the bug at all, even with

Then I repeated the test with
|
I don't think the problems at hand are possible without fancy llseek or block clone. If we attempt a direct read, then the cached mapping pages are used as the source of truth for the file state instead of the dirty buffers in the DMU/ARC. This happens as long as

Thus, assuming those paths work, skipping fancy llseek and reflink is always going to be correct and avoids any dependence on writeback or flushing dirty pages to the DMU.
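That read-side behavior is visible in the Linux zfs_read path, simplified here (the O_DIRECT exclusion and bounds arithmetic are elided): resident pages win over whatever the DMU currently holds for the same range.

if (zn_has_cached_data(zp, zfs_uio_offset(uio),
    zfs_uio_offset(uio) + nbytes - 1))
        error = mappedread(zp, nbytes, uio);    /* copy from page cache */
else
        error = dmu_read_uio_dbuf(sa_get_db(zp->z_sa_hdl), uio, nbytes);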
No, I can't think of any other cases. Fancy llseek getting it wrong is the proximate cause of all known problems pre-BRT. It's the same problem as #15526, but triggered by mmap instead of racing syncs.

Edit: I'm also convinced Linux has the correct state already, because the other seek flavors work, especially SEEK_END (see also O_APPEND, maybe). In short, while at least some older releases are affected, I'm convinced that triggering the writeback flavor of this bug for
Note that golang seems to rewrite its outputs during the build, so the files are briefly sparse during the build process, but this is enough to create a window for (2): stat reports on-disk block usage from the prior moment the file was sparse if it happens to be flushed as part of a normal txg sync during the window where it is sparse. (I think this explains why golang repros are really hard pre-BRT: in addition to hitting a writeback race, the system also needs to hit a txg flush in the small window that the output file is sparse.) |
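For reference, the sparse window in (2) matters because cp only takes its hole-seeking path when a file looks sparse; a sketch of the st_blocks heuristic coreutils uses (st_blocks is in 512-byte units per POSIX; the function name here is mine):

#include <stdbool.h>
#include <sys/stat.h>

/* does the file occupy fewer allocated bytes than its apparent size? */
static bool
looks_sparse(const struct stat *st)
{
        /* if allocation lags the size, cp assumes holes and may use
         * SEEK_DATA/SEEK_HOLE to skip over them while copying */
        return ((off_t)st->st_blocks * 512 < st->st_size);
}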
Okay... kernel tracing to the rescue.

TL;DR I ran my repro with the

On flush, the kernel calls

After a bit more poking, I gathered stacks for
generated from
invoked as (from ZFS src root):
The only place that

Sure enough, if I trace the
And with a bit more poking, I find that
The final failing sequence prints something like:
My repro has new instrumentation to dump out blocks that don't match, and I'm running that with

This outputs something like this immediately after the traces above:
All of blocks 993, 994, and 995 failed putpage in the trace. Some others did too, but they were probably zeros that didn't change from the last write to the mapping in the repro loop.

The new instrumentation:

# write, copy, compare, and print an error on mismatch
write() {
    truncate --size=4M "${f?}"
    python3 -c "${tool}" "${f?}" &>/dev/null
    cpout=$(cp --debug "${f?}" "${f?}.2" 2>&1)
    diff=$(diff -q "${f?}" "${f?}.2")
    if (($?)); then
        echo "$diff" "($(echo "$cpout" | tail -1))"
        python3 -c "
import sys

def blocks(f):
    while True:
        b = f.read(1 << 12)
        if b:
            yield b
        else:
            break

with open(sys.argv[1], 'rb') as f1, open(sys.argv[2], 'rb') as f2:
    for n, (b1, b2) in enumerate(zip(blocks(f1), blocks(f2))):
        if b1 != b2:
            print(sys.argv[1], 'block', n, 'differs')
" "${f?}" "${f?}.2"
    fi
}

TL;DR With more tracing, I confirmed all this by comparing the mismatched blocks reported in the repro to those that failed to write during the second call to

Meanwhile, Linux only retries failed writeback pages for

Note this is happening block-by-block with the first

A simple fix, I guess, is to plumb the error up and loop on flush until it succeeds if it returns ERESTART.

Edit: See trace_mmap.bt for my full bpftrace tooling. |
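A sketch of that simple fix, assuming the putpage error actually gets plumbed up so the flush returns ERESTART instead of swallowing it; the helper name is mine:

/* hypothetical helper: flush a file's pages, retrying when a page was
 * punted because zfs_putpage's tx assignment would have blocked */
static int
zfs_flush_mapped_pages(struct inode *ip)
{
        int err;

        do {
                /* push all dirty pages and wait for writeback */
                err = filemap_write_and_wait(ip->i_mapping);
        } while (err == -ERESTART);     /* a page was skipped; retry */
        return (err);
}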
rrevans/zfs@c7100fc is my attempt at a simple retry loop to fix writeback gaps. I'm a few minutes into testing, and so far I am unable to trigger failures with any of:
Write throughput is lower due to the synchronous ZIL commits. I would appreciate any feedback or attempts at reproducing with this patch (especially any glaring problems caused by synchronous flushing). EDIT: I've run the repro loop and the golang emerge test script for 6 hours without reproducing any corruption. During this time I had a tuning script randomly setting the above module parameters to different values every 90 seconds. ZTS also passes running locally. |
@rrevans Stupid question, will the fix still work with sync=disabled? |
|
@IvanVolosyuk Yes, and I've confirmed it works. |
@rrevans - Kudos on your outstanding investigation. While I haven't been able to reproduce the corruption running your script overnight, it's worth noting that I got several instances of the following stack trace. Not sure if these are related to your patch though. I ran with
|
@ixhamza Thanks, I appreciate the feedback. This message is the userland thread calling the sync system call taking a long time to finish. The kernel notices after a while and flags it. It's somewhat expected, as the repro script is spamming a lot of load, but I suspect that there may be a separate bug causing sync to livelock (block on ongoing writes vs. triggering a single pass of writeback work). If you stop the script, does the sync process eventually terminate? It would be a large concern if the thread has become stuck forever. Note that lots and lots of I/O may have to flush first - you can look at dirty and writeback page counts in /proc/meminfo. These should be decreasing with writeback after interrupting the script. Investigation of the sync call is on my to-do list. I've seen other unusual behaviors, like sync(3) blocking on background txg sync instead of triggering one, and I don't understand enough there yet to know what is going on. |
@rrevans - I just ran the script again for about 12 minutes and noticed that only one sync process was able to spawn; it wasn't able to finish during the script's lifetime. After interrupting the script with |
@rrevans The addition of error handling looks right to me, unless the kernel expects errors to be silently ignored and is supposed to silently reissue the writes later by itself. What makes me really wonder, though, is why we even use TXG_NOWAIT with the dmu_tx_assign() in the Linux zfs_putpage(). I see Illumos switched to TXG_WAIT 10 years ago: https://www.illumos.org/issues/4347 , FreeBSD did it 7 years ago: freebsd/freebsd-src@16b46572fa997 , so why does Linux still use TXG_NOWAIT? My first guess was that the context where it is called is not sleepable, but I see there:
, where dmu_tx_wait() does sleep, though only once, without retrying. It makes no sense to me. Could we switch it to TXG_WAIT, similar to FreeBSD, and call it a day? |
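For reference, a sketch of what that switch looks like in zfs_putpage's tx setup; the hold calls match the existing OpenZFS code, while the error path here is abbreviated:

tx = dmu_tx_create(zfsvfs->z_os);
dmu_tx_hold_write(tx, zp->z_id, pgoff, pglen);
dmu_tx_hold_sa(tx, zp->z_sa_hdl, B_FALSE);

/* TXG_WAIT sleeps until the tx can be assigned, replacing the old
 * TXG_NOWAIT plus a single dmu_tx_wait() and then giving up */
err = dmu_tx_assign(tx, TXG_WAIT);
if (err != 0) {
        dmu_tx_abort(tx);
        /* redirty and unlock the page here, as the existing code does */
}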
@amotin I agree it makes no sense, and

As for how we got here: 3c0e5c0f45 removed the loop but left the wait behind. I don't know why TXG_WAIT was not used at that time, since it is still sleeping. |
@rrevans nice work running this down!
For the most part I think so. It looks like when the original Illumos change was ported
But as @rrevans pointed out, here's where things went wrong:
Over the years the kernel has significantly changed exactly how this page writeback is done and how retries are handled. Unfortunately, not so much that it obviously broke things. When this change was originally ported from Illumos, the Linux 3.10 kernel was hot off the presses. I'd also suggest we switch to |
1) Make mmap flushes synchronous. Linux may skip flushing dirty pages already in writeback unless data-integrity sync is requested.
2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be skipped due to DMU pushing back on TX assign.
3) Add missing mmap flush when doing block cloning.
4) While here, pass errors from putpage to writepage/writepages.

This change fixes corruption edge cases, but unfortunately adds synchronous ZIL flushes for dirty mmap pages to llseek and bclone operations. It may be possible to avoid these sync writes later but would need more tricky refactoring of the writeback code.

Fixes: openzfs#15933
Signed-off-by: Robert Evans <[email protected]>
#16019 fixes this with synchronous writeback and I have some ideas how to improve the writeback code to avoid synchronous writes, but they need more invasive refactoring of how writeback works. |
1) Make mmap flushes synchronous. Linux may skip flushing dirty pages already in writeback unless data-integrity sync is requested.
2) Change zfs_putpage to use TXG_WAIT. Otherwise dirty pages may be skipped due to DMU pushing back on TX assign.
3) Add missing mmap flush when doing block cloning.
4) While here, pass errors from putpage to writepage/writepages.

This change fixes corruption edge cases, but unfortunately adds synchronous ZIL flushes for dirty mmap pages to llseek and bclone operations. It may be possible to avoid these sync writes later but would need more tricky refactoring of the writeback code.

Reviewed-by: Alexander Motin <[email protected]>
Reviewed-by: Brian Behlendorf <[email protected]>
Signed-off-by: Robert Evans <[email protected]>
Closes #15933
Closes #16019
This release contains a fix for openzfs/zfs#15933.

Closes: https://bugs.gentoo.org/928518
Bug: https://bugs.gentoo.org/815469
Bug: https://bugs.gentoo.org/917224
Signed-off-by: Sam James <[email protected]>
@thulle if possible would you mind confirming your compiling |
Is that still open or done? Yours, lopiuh |
System information
Describe the problem you're observing
After seeing 2.2.3 released and #15526 closed, I enabled block cloning again and tried my unfailing reproducer: compiling go.
It still seems to trigger the/some bug.
Worth noting is that I'm running on top of LUKS. Scrub reports no issues.
Describe how to reproduce the problem