-
Notifications
You must be signed in to change notification settings - Fork 1.7k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
dmu_buf_will_clone: fix race in transition back to NOFILL #15566
Conversation
fcb666f
to
d83b567
Compare
A day later, I believe #15571 is the actual fix for #15526, but I suspect that the block cloning bug fixed here made it a little easier to hit because in the gap between (that and coreutils 9.2/9.3 doing clones and sparse by default). |
Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Signed-off-by: Rob Norris <[email protected]>
d83b567
to
430453f
Compare
I run my tests (mostly testing various corner cases) with those changes and they all passed. The change also looks good. |
Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#15566 Closes openzfs#15526
Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes #15566 Closes #15526
Previously, dmu_buf_will_clone() would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data in db_data, but the dbuf in the wrong state. This is eventually corrected when the dbuf state is made NOFILL, and dbuf_noread() called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state. Any caller acquiring the lock before the call into dmu_buf_will_not_fill() can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data. Even after the state was switched to NOFILL, the old data was still not cleaned out until dbuf_noread(), which is another gap for a caller to take the lock and read the wrong data. This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The DBUF_VERIFY() calls confirm that the dbuf is in a valid state when the lock is down. Sponsored-by: Klara, Inc. Sponsored-By: OpenDrives Inc. Reviewed-by: Brian Behlendorf <[email protected]> Reviewed-by: Pawel Jakub Dawidek <[email protected]> Signed-off-by: Rob Norris <[email protected]> Closes openzfs#15566 Closes openzfs#15526
Motivation and Context
#15526 and the related unpleasantness of the week.
Description
Previously,
dmu_buf_will_clone()
would roll back any dirty record, but would not clean out the modified data nor reset the state before releasing the lock. That leaves the last-written data indb_data
, but the dbuf in the wrong state.This is eventually corrected when the dbuf state is made
NOFILL
, anddbuf_noread()
called (which clears out the old data), but at this point its too late, because the lock was already dropped with that invalid state.Any caller acquiring the lock before the call into
dmu_buf_will_not_fill()
can find what appears to be a clean, readable buffer, and would take the wrong state from it: it should be getting the data from the cloned block, not from earlier (unwritten) dirty data.Even after the state was switched to
NOFILL
, the old data was still not cleaned out untildbuf_noread()
, which is another gap for a caller to take the lock and read the wrong data.This commit fixes all this by properly cleaning up the previous state and then setting the new state before dropping the lock. The
DBUF_VERIFY()
calls confirm that the dbuf is in a valid state when the lock is down.Notes
How I got here is interesting. First, added
DBUF_VERIFY(db)
indmu_buf_will_clone()
, just after taking and just before releasingdb_mtx
.Then ran:
This immediatly blew up on the second
DBUF_VERIFY
, before releasing the lock. Examing the dbuf, it was in stateDB_CACHED
, no block pointer, no dirty records, butdb_data
set, so assertions tripped there.And then I just kept following the breadcrumbs until arrived here, with
DBUF_VERIFY
passing.Of course this is all assuming that
DBUF_VERIFY
is even correct for a clone write, but I think it is (as much as I can be sure of anything anymore).More notes
I am not sure if this is right. I wonder what happens if the dbuf is written to, then is cloned late while that write is being synced out. I don't know if this even makes sense but
dbuf_fix_old_data()
exists and it seems like this is not unlike the!= NOFILL
case indbuf_dirty()
. But I'm very tired so.Even if this is right, it isn't the whole story for #15526. There's a similar thing that can be tripped in 2.1 (pre block cloning). So even if you love this, don't assume its the whole game.
How Has This Been Tested?
block_cloning
tests pass, so at least clones still sorta work.DBUF_VERIFY
is happy. My gut says its not worse, but this stuff is mad. Past performance is no guarantee of future results.Types of changes
Checklist:
Signed-off-by
.