dmu_buf_will_clone: fix race in transition back to NOFILL #15566

robn · 2023-11-23T13:03:44Z

Motivation and Context

#15526 and the related unpleasantness of the week.

Description

Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state.

This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state.

Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data.

Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data.

This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down.

Notes

How I got here is interesting. First, added DBUF_VERIFY(db) in dmu_buf_will_clone(), just after taking and just before releasing db_mtx.

Then ran:

dd if=/dev/random of=/tank/file bs=1M count=64 status=progress
zpool sync

while true ; do clonefile -t -c /tank/file /tank/clone ; done &
while true ; do dd if=/dev/random of=/tank/clone bs=4K count=16384 status=progress ; done &

wait

This immediatly blew up on the second DBUF_VERIFY, before releasing the lock. Examing the dbuf, it was in state DB_CACHED , no block pointer, no dirty records, but db_data set, so assertions tripped there.

And then I just kept following the breadcrumbs until arrived here, with DBUF_VERIFY passing.

Of course this is all assuming that DBUF_VERIFY is even correct for a clone write, but I think it is (as much as I can be sure of anything anymore).

More notes

I am not sure if this is right. I wonder what happens if the dbuf is written to, then is cloned late while that write is being synced out. I don't know if this even makes sense but dbuf_fix_old_data() exists and it seems like this is not unlike the != NOFILL case in dbuf_dirty(). But I'm very tired so.

Even if this is right, it isn't the whole story for #15526. There's a similar thing that can be tripped in 2.1 (pre block cloning). So even if you love this, don't assume its the whole game.

How Has This Been Tested?

block_cloning tests pass, so at least clones still sorta work. DBUF_VERIFY is happy. My gut says its not worse, but this stuff is mad. Past performance is no guarantee of future results.

Types of changes

Bug fix (non-breaking change which fixes an issue)
New feature (non-breaking change which adds functionality)
Performance enhancement (non-breaking change which improves efficiency)
Code cleanup (non-breaking change which makes code smaller or more readable)
Breaking change (fix or feature that would cause existing functionality to change)
Library ABI change (libzfs, libzfs_core, libnvpair, libuutil and libzfsbootenv)
Documentation (a change to man pages or other documentation)

Checklist:

My code follows the OpenZFS code style requirements.
I have updated the documentation accordingly.
I have read the contributing document.
I have added tests to cover my changes.
I have run the ZFS Test Suite with this change applied.
All commit messages are properly formatted and contain Signed-off-by.

robn · 2023-11-24T13:55:01Z

A day later, I believe #15571 is the actual fix for #15526, but I suspect that the block cloning bug fixed here made it a little easier to hit because in the gap between dmu_buf_will_clone() and dmu_buf_will_not_fill(), the dbuf would be clean, and so the dnode too.

(that and coreutils 9.2/9.3 doing clones and sparse by default).

Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Signed-off-by: Rob Norris <[email protected]>

pjd · 2023-11-28T09:00:59Z

I run my tests (mostly testing various corner cases) with those changes and they all passed. The change also looks good.
Thank you for working on this!

Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#15566 Closes openzfs#15526

Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15566 Closes #15526

Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#15566 Closes openzfs#15526

robn mentioned this pull request Nov 23, 2023

some copied files are corrupted (chunks replaced by zeros) #15526

Closed

mtippmann mentioned this pull request Nov 23, 2023

Page Fault on Pool Import #15513

Closed

robn force-pushed the clone-nofill-race-fix branch from fcb666f to d83b567 Compare November 23, 2023 22:56

robn force-pushed the clone-nofill-race-fix branch from d83b567 to 430453f Compare November 25, 2023 01:27

behlendorf added the Status: Code Review Needed Ready for review and testing label Nov 28, 2023

behlendorf approved these changes Nov 28, 2023

View reviewed changes

behlendorf added Status: Accepted Ready to integrate (reviewed, tested) and removed Status: Code Review Needed Ready for review and testing labels Nov 28, 2023

behlendorf merged commit 688514e into openzfs:master Nov 28, 2023
24 of 26 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

dmu_buf_will_clone: fix race in transition back to NOFILL #15566

dmu_buf_will_clone: fix race in transition back to NOFILL #15566

robn commented Nov 23, 2023

robn commented Nov 24, 2023

pjd commented Nov 28, 2023

dmu_buf_will_clone: fix race in transition back to NOFILL #15566

dmu_buf_will_clone: fix race in transition back to NOFILL #15566

Conversation

robn commented Nov 23, 2023

Motivation and Context

Description

Notes

More notes

How Has This Been Tested?

Types of changes

Checklist:

robn commented Nov 24, 2023

pjd commented Nov 28, 2023