Remove recursive read lock that could deadlock Blockstore #30203
Conversation
```rust
let _lock = self.check_lowest_cleanup_slot(slot)?;
```
I initially started on a more complicated change but landed here. My reasoning for simply ripping this out is:
- The lock is only held for the scope of get_completed_ranges(), so it wasn't providing any guarantees for when we subsequently read the shreds from rocksdb.
- This function issues a single (atomic) rocksdb read operation, so the lock is not needed for this function.

So, by removing this, the only thing we lose is distinguishing a slot being unavailable vs. the slot having been cleaned up when calling get_completed_ranges() directly. The only spot where we care about BlockstoreError::SlotCleanedUp is in RPC after calling get_rooted_block(). But, since get_rooted_block() calls check_lowest_cleanup_slot() to check the slot and hold the lock, there is no change in behavior from RPC's point of view.
Lines 1017 to 1038 in 375f9ae
```rust
fn check_slot_cleaned_up<T>(
    &self,
    result: &std::result::Result<T, BlockstoreError>,
    slot: Slot,
) -> Result<()> {
    let first_available_block = self
        .blockstore
        .get_first_available_block()
        .unwrap_or_default();
    let err: Error = RpcCustomError::BlockCleanedUp {
        slot,
        first_available_block,
    }
    .into();
    if let Err(BlockstoreError::SlotCleanedUp) = result {
        return Err(err);
    }
    if slot < first_available_block {
        return Err(err);
    }
    Ok(())
}
```
We attempt to grab a read lock within lowest_cleanup_slot() here as well; however, when we're executing this portion, shred_bytes.is_none() should NOT ever be true, because the lock claimed from get_rooted_block() at the top prevents cleanup.
solana/ledger/src/blockstore.rs
Lines 2942 to 2945 in 375f9ae
```rust
if shred_bytes.is_none() {
    if let Some(slot_meta) = slot_meta {
        if slot > self.lowest_cleanup_slot() {
            panic!(
```
Lastly, there is another issue around this lock that has been discussed in #27195. This PR does not aim to solve the issue discussed in #27195; that issue describes a similar type of race. One solution to that issue could be to hold the lock for the duration of …
Has Jito tested this patch at all?
I just chatted with @segfaultdoc; he was going to apply the patch and give it a spin, so I think we can hold off on shipping until we get some runtime there. And one more note: we didn't hit this because our warehouse nodes do not set …
Started running it roughly 5hrs ago with …
Good deal. For what it is worth, the time-to-deadlock should scale with the size of …
The point being, you can shorten the time window by reducing …
Good to know, looks like we'll find out in 4ish hours 🙏
Cool, thanks for the testing @segfaultdoc. This fix lgtm.
However, one question:

> Blockstore::get_rooted_block() grabs a read lock on lowest_cleanup_slot right at the start to check if the block has been cleaned up, and to ensure it doesn't get cleaned up during execution.

Does the RwLock change to favoring writers mean that in v1.15 a block could now get cleaned up while get_rooted_block is running?
Good question, and no. That being said, the above fact is probably worthy of a comment in …
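To illustrate why the answer is no (a self-contained illustration, not Solana code): writer preference only changes who gets the lock next; it never lets a writer in while a read guard is already held, so the guard taken at the top of get_rooted_block() still blocks cleanup for the function's whole duration:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::{Duration, Instant};

fn main() {
    let lock = Arc::new(RwLock::new(0u64));
    let guard = lock.read().unwrap(); // get_rooted_block()-style read guard

    let writer = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let start = Instant::now();
            let mut slot = lock.write().unwrap(); // blocks until the guard drops
            *slot += 1;
            start.elapsed()
        })
    };

    thread::sleep(Duration::from_millis(200)); // "execution" under the guard
    drop(guard); // only now can the cleanup-style writer proceed

    let waited = writer.join().unwrap();
    // Generous lower bound to keep the demo robust to scheduling jitter.
    assert!(waited >= Duration::from_millis(100));
    println!("writer waited {waited:?} for the read guard to drop");
}
```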
Shipping this in given two more things: …
Problem
This deadlock could only occur on nodes that call Blockstore::get_rooted_block(). Regular validators don't call this function; RPC nodes and nodes that have BigTableUploadService enabled do.

Blockstore::get_rooted_block() grabs a read lock on lowest_cleanup_slot right at the start, both to check whether the block has been cleaned up and to ensure it doesn't get cleaned up during execution. As part of the callstack of get_rooted_block(), Blockstore::get_completed_ranges() will get called, which also grabs a read lock on lowest_cleanup_slot.

If LedgerCleanupService attempts to grab a write lock between the two read-lock calls, we could hit a deadlock if priority is given to the write lock request in this scenario.
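A minimal sketch of this interleaving (an illustration, not Solana code): on a writer-preferring RwLock, such as the pthread-backed std::sync::RwLock on Linux since Rust 1.62, this program can hang forever, which is exactly the deadlock described:

```rust
use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lowest_cleanup_slot = Arc::new(RwLock::new(0u64));

    // Plays the role of get_rooted_block() -> get_completed_ranges().
    let reader = {
        let lock = Arc::clone(&lowest_cleanup_slot);
        thread::spawn(move || {
            let _outer = lock.read().unwrap(); // 1. outer read lock
            thread::sleep(Duration::from_millis(100)); // let the writer queue up
            // 3. recursive read lock: with a writer queued and writer
            // preference, this blocks forever -- the writer waits on the
            // outer read guard, and this read waits on the writer.
            let _inner = lock.read().unwrap();
        })
    };

    // Plays the role of LedgerCleanupService.
    let writer = {
        let lock = Arc::clone(&lowest_cleanup_slot);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            let mut slot = lock.write().unwrap(); // 2. write lock request queues
            *slot += 1;
        })
    };

    reader.join().unwrap();
    writer.join().unwrap();
}
```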
Summary of Changes
This change removes the call to get the read lock in get_completed_ranges(). The lock was only held for the scope of that function, which issues a single rocksdb read, so the lock is not needed there.

This does mean that a different error will be returned if the requested slot was below lowest_cleanup_slot: previously, a BlockstoreError::SlotCleanedUp would have been thrown; now the rocksdb error will be bubbled up. Note that callers of get_rooted_block() will still get the SlotCleanedUp error when appropriate, because get_rooted_block() grabs the lock. If the slot is unavailable, it returns immediately; if the slot is available, get_rooted_block() holding the lock means the slot will remain available.
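A hedged sketch of the resulting locking discipline (simplified, assumed names and error types; not the verbatim Solana code): the read guard is taken exactly once, at the top of get_rooted_block(), and helpers run under it without re-locking:

```rust
use std::sync::RwLock;

type Slot = u64;

struct Blockstore {
    lowest_cleanup_slot: RwLock<Slot>,
}

impl Blockstore {
    fn get_rooted_block(&self, slot: Slot) -> Result<String, String> {
        // Single read-lock acquisition guards the whole function body, so
        // cleanup cannot advance lowest_cleanup_slot past `slot` mid-call.
        let guard = self.lowest_cleanup_slot.read().unwrap();
        if slot <= *guard {
            return Err("SlotCleanedUp".to_string());
        }
        // get_completed_ranges() no longer takes the lock itself, so there
        // is no recursive read for a queued writer to wedge.
        self.get_completed_ranges(slot)
    }

    fn get_completed_ranges(&self, slot: Slot) -> Result<String, String> {
        // Stand-in for the single (atomic) rocksdb read; a cleaned-up slot
        // would surface here as a plain lookup error, not SlotCleanedUp.
        Ok(format!("completed ranges for slot {slot}"))
    }
}

fn main() {
    let blockstore = Blockstore {
        lowest_cleanup_slot: RwLock::new(10),
    };
    assert!(blockstore.get_rooted_block(5).is_err());
    assert!(blockstore.get_rooted_block(42).is_ok());
}
```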
#29901 has more information on when the issue was first observed, as well as additional information such as actually observed callstacks from a deadlocked warehouse (bigtable-using) node.
Callstack That Could Hit Deadlock
In regards …

One thing that isn't 100% clear to me is that the behavior to make RwLock favor writers on Linux was added in Rust 1.62. However, the report came in from Jito, where they were rebased on top of v1.13 or v1.14 (which have older Rust versions). @segfaultdoc confirmed that they were using the ./cargo script to build, so they were using the "supported" Rust version. Additionally, segfaultdoc mentioned that he saw the issue with Jito v1.13.5+ and v1.14.13 (they rebase their commits on top of our tags). But the issue didn't show in Jito v1.13.4 for them ... this is pretty peculiar, as the diff between v1.13.4 and v1.13.5 isn't too big: …
88c87d1 obviously sticks out as it changes LedgerCleanupService behavior, but that commit didn't change the locking behavior. Not observing the issue on a given version isn't definitive (not seeing it doesn't mean it couldn't happen), so maybe that is the case or maybe there is some nth-order effect from one of these other changes.
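Given that uncertainty, a quick self-contained probe (an illustration, not part of the PR) can show whether a given toolchain/platform exhibits the writer-preferring behavior that enables the deadlock: with a read lock held and a writer queued, a recursive try_read() fails on a writer-preferring lock:

```rust
use std::sync::{Arc, RwLock, TryLockError};
use std::thread;
use std::time::Duration;

fn main() {
    let lock = Arc::new(RwLock::new(()));
    let outer = lock.read().unwrap(); // hold a read lock, as get_rooted_block() does

    let writer = {
        let lock = Arc::clone(&lock);
        thread::spawn(move || {
            let _w = lock.write().unwrap(); // queues behind the held read lock
        })
    };
    thread::sleep(Duration::from_millis(100)); // give the writer time to queue

    // A recursive read attempt with a writer queued is exactly the
    // get_completed_ranges() situation.
    match lock.try_read() {
        Ok(_) => println!("recursive read succeeded: readers not blocked by queued writer"),
        Err(TryLockError::WouldBlock) => {
            println!("recursive read would block: writer-preferring lock, deadlock possible")
        }
        Err(TryLockError::Poisoned(e)) => println!("lock poisoned: {e}"),
    }

    drop(outer); // release so the writer (and the program) can finish
    writer.join().unwrap();
}
```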
Fixes #29901