Remove recursive read lock that could deadlock Blockstore #30203

Merged: 1 commit merged into solana-labs:master on Feb 13, 2023

Conversation

@steviez commented Feb 8, 2023

Problem

This deadlock could only occur on nodes that call
Blockstore::get_rooted_block(). Regular validators don't call this function; RPC nodes and nodes that have BigTableUploadService enabled do.

Blockstore::get_rooted_block() grabs a read lock on lowest_cleanup_slot right at the start to check if the block has been cleaned up, and to ensure it doesn't get cleaned up during execution. As part of the callstack of get_rooted_block(), Blockstore::get_completed_ranges() will get called, which also grabs a read lock on lowest_cleanup_slot.

If LedgerCleanupService attempts to grab a write lock between the read lock calls, we could hit a deadlock if priority is given to the write lock request in this scenario.

Summary of Changes

This change removes the call to get the read lock in get_completed_ranges(). The lock is only held for the scope of this function, which performs a single rocksdb read, so it isn't needed. This does mean that a different error will be returned if the requested slot was below lowest_cleanup_slot: previously, a BlockstoreError::SlotCleanedUp would have been returned; now the rocksdb error will be bubbled up. Note that callers of get_rooted_block() will still get the SlotCleanedUp error when appropriate because get_rooted_block() grabs the lock. If the slot is unavailable, it returns immediately. If the slot is available, get_rooted_block() holding the lock means the slot will remain available.

#29901 has more information on when the issue was first observed, as well as additional details such as callstacks actually observed on a deadlocked warehouse (BigTable-uploading) node.

Callstack That Could Hit Deadlock

+ Blockstore::get_rooted_block()
+!!! Blockstore::check_lowest_cleanup_slot() (line 1926) to obtain read lock
+--+ Blockstore::get_complete_block() (line 1929)
   +--+ Blockstore::get_slot_entries() (line 1948)
      +--+ Blockstore::get_slot_entries_with_shred_info() (line 2778)
         +--+ Blockstore::get_completed_ranges() (line 2790)
            +!!! Blockstore::check_lowest_cleanup_slot() (line 2856) to obtain read lock
// RPC Thread                       |  // LedgerCleanupService Thread
                                    |
// First call that gets read lock   |
let _ = lowest_cleanup_slot.read(); |
                                    |  // LCS gets write lock next
                                    |  let _ = lowest_cleanup_slot.write();
// Second read lock that could      |
// deadlock us                      |
let _ = lowest_cleanup_slot.read(); |
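
For illustration, here is a minimal, self-contained Rust sketch of the same lock ordering. This is not the actual Blockstore code; the names, sleeps, and watchdog are purely illustrative. On a writer-preferring RwLock implementation (e.g., std::sync::RwLock on Linux with Rust >= 1.62), the second read blocks behind the queued writer and the program would hang, so a watchdog thread is included to report the likely deadlock and exit instead.

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::{Arc, RwLock};
    use std::thread;
    use std::time::Duration;

    fn main() {
        // Stand-in for Blockstore's lowest_cleanup_slot
        let lowest_cleanup_slot = Arc::new(RwLock::new(0u64));
        let finished = Arc::new(AtomicBool::new(false));

        // Watchdog: if the "RPC thread" below hasn't finished in a few seconds,
        // report the likely deadlock and exit instead of hanging forever.
        {
            let finished = Arc::clone(&finished);
            thread::spawn(move || {
                thread::sleep(Duration::from_secs(3));
                if !finished.load(Ordering::SeqCst) {
                    eprintln!("likely deadlocked: recursive read blocked behind a queued writer");
                    std::process::exit(1);
                }
            });
        }

        // "RPC thread": first read lock (analogous to get_rooted_block()).
        let outer_read = lowest_cleanup_slot.read().unwrap();
        println!("rpc: acquired first read lock");

        // "LedgerCleanupService": request a write lock while the read lock is held.
        let cleaner_lock = Arc::clone(&lowest_cleanup_slot);
        thread::spawn(move || {
            println!("cleaner: requesting write lock");
            let mut slot = cleaner_lock.write().unwrap(); // blocks until all readers release
            *slot += 1;
            println!("cleaner: acquired write lock");
        });
        thread::sleep(Duration::from_millis(200)); // give the writer time to queue up

        // "RPC thread": recursive read lock (analogous to get_completed_ranges()).
        // On a writer-preferring implementation this blocks behind the queued
        // writer, which in turn is blocked on `outer_read` -- a deadlock.
        println!("rpc: requesting second (recursive) read lock");
        let inner_read = lowest_cleanup_slot.read().unwrap();
        println!("rpc: acquired second read lock (this platform let the reader jump the queue)");

        finished.store(true, Ordering::SeqCst);
        drop(inner_read);
        drop(outer_read);
    }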

Regarding RwLock read/write priority across Rust versions:

  • Rust < 1.62 behavior would allow the second read lock to be granted ahead of the write lock, since read locks could be held in parallel. This could hypothetically starve the writer, but with our workload I don't think this was happening in any meaningful way (i.e., probably seconds of starvation at most).
  • Rust >= 1.62 behavior will block the second read lock since a write lock request is queued ahead of it.

One thing that isn't 100% clear to me: the behavior that makes RwLock favor writers on Linux was added in Rust 1.62. However, the report came in from Jito, where they were rebased on top of v1.13 or v1.14 (which have older Rust versions). @segfaultdoc confirmed that they were using the ./cargo script to build, so they were using the "supported" Rust version.

Additionally, segfaultdoc mentioned that he saw the issue with Jito v1.13.5+ and v1.14.13 (they rebase their commits on top of our tags). But the issue didn't show up in Jito v1.13.4 for them ... this is pretty peculiar, as the diff between v1.13.4 and v1.13.5 isn't too big:

$ git log v1.13.4 v1.13.5 --pretty=oneline
9138494335acec3e5406e77587607aaba30ab95e (tag: v1.13.5) MultiIteratorScanner - improve banking stage performance with high contention
88c87d12d9c85f508e4e83528a68567b8e3c095c Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger (backport #26651) (#28721)
f3471e54560388a10cc56d7a0990fafaf967132d Backport timeout async send tasks to v1.13 (#28750)
959b760ca89ff192107d4a1115fc5a69ed071c8b chore: bump tj-actions/changed-files to v34 (backport #28674) (#28675)
3d4a5f69998fee65a65fb94b6a7cc94e72223fcc clean_dead_slots_from_accounts_index unrefs correctly (backport #2746… (#28582)
2e69d6d6426422f90a96edbf30fd72006a40a274 Limit async tasks spawn to runtime for quic transaction send (backport #28452) (#28603)
9d211065cc4ffc3d3b85a50a8425da0dc82df7f9 Backport of #28599 to v1.13 (#28602)
40920b4ebe9f73cecad28ac93910d677f86f0bcb chore: fix solana-private pipeline (#28609)
bf2acd04846538cd8926459891f8a7e6e072905f stats for staked/unstaked repair requests (backport #28215) (#28598)
9382e90c08613be460d77cce6118882f0ada1a3c make ping cache rate limit delay configurable (backport #27955) (#28595)
dcf4cc6471ce96f08a6c3ae4957721a17238869b ledger-tool: remove inefficient base58 encoding options (backport #28488) (#28494)
c090a6b446ac4da7fd3063fac0eee73eb4abb5b2 Bump Version to 1.13.5 (#28506)
3b81ee9c2602620ca4904df20bdf098bff47aaf9 (tag: v1.13.4) Fixes typo in bootstrap log (backport #28483) (#28491)

88c87d1 obviously sticks out as it changes LedgerCleanupService behavior, but that commit didn't change the locking behavior. Not observing the issue on a given version isn't definitive (not seeing it doesn't mean it couldn't happen), so maybe that is the case or maybe there is some nth-order effect from one of these other changes.

Fixes #29901

Comment on lines -2881 to -2882
let _lock = self.check_lowest_cleanup_slot(slot)?;

@steviez (author) commented:

I initially started on a more complicated change but landed here. My reasoning for simply ripping this out is:

  • The lock is only held for get_completed_ranges() so it wasn't providing us any guarantees for when we subsequently read the shreds from rocksdb
  • This function issues a single (atomic) rocksdb read operation, so the lock isn't needed for this function

So, by removing this, the only thing that we lose is distinguishing a slot being unavailable vs. the slot having been cleaned up when calling get_completed_ranges() directly. The only spot where we care about BlockstoreError::SlotCleanedUp is in RPC after calling get_rooted_block(). But since get_rooted_block() calls check_lowest_cleanup_slot() to check the slot and hold the lock, there is no change of behavior from RPC's point of view.

solana/rpc/src/rpc.rs

Lines 1017 to 1038 in 375f9ae

fn check_slot_cleaned_up<T>(
    &self,
    result: &std::result::Result<T, BlockstoreError>,
    slot: Slot,
) -> Result<()> {
    let first_available_block = self
        .blockstore
        .get_first_available_block()
        .unwrap_or_default();
    let err: Error = RpcCustomError::BlockCleanedUp {
        slot,
        first_available_block,
    }
    .into();
    if let Err(BlockstoreError::SlotCleanedUp) = result {
        return Err(err);
    }
    if slot < first_available_block {
        return Err(err);
    }
    Ok(())
}

@steviez (author) left a comment:

We attempt to grab a read lock within lowest_cleanup_slot() here as well; however, when we're executing this portion, shred_bytes.is_none() should never be true, because the lock claimed in get_rooted_block() at the top prevents cleanup.

if shred_bytes.is_none() {
    if let Some(slot_meta) = slot_meta {
        if slot > self.lowest_cleanup_slot() {
            panic!(

@steviez commented Feb 8, 2023

Lastly, there is another issue around this lock that has been discussed in #27195. This PR does not aim to solve that issue.

#27195 describes a similar type of race where the following could happen (line numbers refer to the code block below):

  • Thread X issues a call to Blockstore::get_entries_with_shred_info()

  • Thread X reads the SlotMeta for some slot S (S > lowest_cleanup_slot) on line 2790

  • LedgerCleanupService updates lowest_cleanup_slot to T where T > S

  • Thread X attempts to read shreds / deserialize into Entries via Blockstore::get_entries_in_data_block() on line 2812 / line 2820

    pub fn get_slot_entries_with_shred_info(
        &self,
        slot: Slot,
        start_index: u64,
        allow_dead_slots: bool,
    ) -> Result<(Vec<Entry>, u64, bool)> {
        let (completed_ranges, slot_meta) = self.get_completed_ranges(slot, start_index)?;

        // Check if the slot is dead *after* fetching completed ranges to avoid a race
        // where a slot is marked dead by another thread before the completed range query finishes.
        // This should be sufficient because full slots will never be marked dead from another thread,
        // this can only happen during entry processing during replay stage.
        if self.is_dead(slot) && !allow_dead_slots {
            return Err(BlockstoreError::DeadSlot);
        } else if completed_ranges.is_empty() {
            return Ok((vec![], 0, false));
        }

        let slot_meta = slot_meta.unwrap();
        let num_shreds = completed_ranges
            .last()
            .map(|(_, end_index)| u64::from(*end_index) - start_index + 1)
            .unwrap_or(0);

        let entries: Result<Vec<Vec<Entry>>> = if completed_ranges.len() <= 1 {
            completed_ranges
                .into_iter()
                .map(|(start_index, end_index)| {
                    self.get_entries_in_data_block(slot, start_index, end_index, Some(&slot_meta))
                })
                .collect()
        } else {
            PAR_THREAD_POOL.install(|| {
                completed_ranges
                    .into_par_iter()
                    .map(|(start_index, end_index)| {
                        self.get_entries_in_data_block(
                            slot,
                            start_index,
                            end_index,
                            Some(&slot_meta),
                        )
                    })
                    .collect()
            })
        };
        let entries: Vec<Entry> = entries?.into_iter().flatten().collect();
        Ok((entries, num_shreds, slot_meta.is_full()))
    }

  • The above fails since the shreds are no longer available for slot S. But, since S is older than lowest_cleanup_slot, we acknowledge that the slot was cleaned up and bail out. If S > lowest_cleanup_slot, something else is wrong and we panic

    if shred_bytes.is_none() {
        if let Some(slot_meta) = slot_meta {
            if slot > self.lowest_cleanup_slot() {
                panic!(

One solution to that issue could be to hold the lock for the duration of get_slot_entries_with_shred_info(); however, this could have other repercussions, so as mentioned, this PR does not aim to tackle that.
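
For concreteness, here is a toy sketch of what that alternative shape would look like. ToyBlockstore, read_slot_meta(), and read_shreds() are hypothetical stand-ins, not the real Blockstore API; the only point is the lock scope.

    use std::sync::RwLock;

    // Hypothetical, simplified model -- not the real Blockstore. It only shows
    // the lock scope of the alternative fix: hold the read lock across *both*
    // reads so the cleaner (which needs the write lock) cannot purge in between.
    struct ToyBlockstore {
        lowest_cleanup_slot: RwLock<u64>,
    }

    impl ToyBlockstore {
        fn get_slot_entries_guarded(&self, slot: u64) -> Result<Vec<String>, String> {
            // Guard held until the end of the function.
            let lowest = self.lowest_cleanup_slot.read().unwrap();
            if slot <= *lowest {
                return Err("SlotCleanedUp".to_string());
            }
            let meta = self.read_slot_meta(slot)?; // analogous to get_completed_ranges()
            let entries = self.read_shreds(slot, &meta)?; // analogous to get_entries_in_data_block()
            Ok(entries)
            // `lowest` is dropped here; the cleaner may now take the write lock.
        }

        // Placeholders standing in for rocksdb reads.
        fn read_slot_meta(&self, slot: u64) -> Result<String, String> {
            Ok(format!("meta for slot {slot}"))
        }
        fn read_shreds(&self, slot: u64, _meta: &str) -> Result<Vec<String>, String> {
            Ok(vec![format!("entries for slot {slot}")])
        }
    }

    fn main() {
        let store = ToyBlockstore {
            lowest_cleanup_slot: RwLock::new(10),
        };
        println!("{:?}", store.get_slot_entries_guarded(42)); // Ok(...)
        println!("{:?}", store.get_slot_entries_guarded(5)); // Err("SlotCleanedUp")
    }

One obvious repercussion of that shape is that the read lock would then be held across potentially slow rocksdb reads, delaying LedgerCleanupService from acquiring the write lock.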

@CriesofCarrots commented:

Has Jito tested this patch at all?

@steviez commented Feb 8, 2023

Has Jito tested this patch at all?

I just chatted with @segfaultdoc; he was going to apply the patch and give it a spin, so I think we can hold off on shipping until we get some runtime there.

And one more note - we didn't hit this because our warehouse nodes do not set --limit-ledger-size. Without this flag, nothing is ever purged, so LedgerCleanupService never grabs a write lock on lowest_cleanup_slot.

@t-nelson commented Feb 9, 2023

My eyebrows tend to go up when I see a callstack like this. Recursive locks are almost always only needed when the architecture is busted. Also...
[Screenshot from 2023-02-09 00-18-22]

@segfaultdoc commented:

started running it roughly 5hrs ago with limit-ledger-size set to 100000000, typically see the node lock up within 12-20hrs

@steviez commented Feb 10, 2023

started running it roughly 5hrs ago with limit-ledger-size set to 100000000, typically see the node lock up within 12-20hrs

Good deal. For what it's worth, the time-to-deadlock should scale with the size of --limit-ledger-size, as LedgerCleanupService won't start grabbing the write lock until the ledger is full:

  • 100M shreds / (1k shreds / slot) = 100k slots
  • 100k slots * 0.5s / slot = 50k seconds
  • 50k seconds * (1 hour / 3600 seconds) ~= 14 hours

The point being, you can shorten the time window by reducing --limit-ledger-size down to 50M if you'd like.
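
As a quick sanity check of that arithmetic (same rough assumptions as above: ~1k shreds per slot and ~0.5s per slot; purely illustrative):

    // Back-of-envelope estimate of how long until the ledger reaches
    // --limit-ledger-size and LedgerCleanupService starts taking the write lock.
    fn hours_until_cleanup_starts(limit_ledger_size_shreds: f64) -> f64 {
        const SHREDS_PER_SLOT: f64 = 1_000.0; // assumed
        const SECONDS_PER_SLOT: f64 = 0.5; // assumed
        let slots = limit_ledger_size_shreds / SHREDS_PER_SLOT;
        slots * SECONDS_PER_SLOT / 3_600.0
    }

    fn main() {
        println!("{:.1} h", hours_until_cleanup_starts(100_000_000.0)); // ~13.9 h
        println!("{:.1} h", hours_until_cleanup_starts(50_000_000.0)); // ~6.9 h
    }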

@segfaultdoc commented:

  • 50k seconds * (1 hour / 3600 seconds) ~= 14 hours

good to know, looks like we'll find out in 4ish hours 🙏

@segfaultdoc commented:

[Screen Shot 2023-02-12 at 6 06 41 PM]
Still running like a champ. The dotted vertical line on the CPU metrics chart is when I deployed this PR

@CriesofCarrots left a comment:

Cool, thanks for the testing @segfaultdoc. This fix lgtm.

However, one question:

Blockstore::get_rooted_block() grabs a read lock on lowest_cleanup_slot right at the start to check if the block has been cleaned up, and to ensure it doesn't get cleaned up during execution.

Does the RwLock change to favoring writers mean that in v1.15 a block could now get cleaned up while get_rooted_block is running?

@steviez commented Feb 13, 2023

Does the RwLock change to favoring writers mean that in v1.15 a block could now get cleaned up while get_rooted_block is running?

Good question, and no - Blockstore::get_rooted_block() still grabs a read lock first thing, which will block LedgerCleanupService from executing the purge.

That being said, Blockstore::get_complete_block() (which is called by RPC if the block is confirmed but not rooted) doesn't get a read lock. But, LedgerCleanupService only cleans up stuff older than the latest root, so we wouldn't attempt to clean up a confirmed block.

The above fact is probably worthy of a comment in the get_complete_block() implementation to call out why it isn't necessary to grab the lock there.
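
Something along these lines would capture that reasoning (wording is only a suggestion, not code that exists in the repo):

    // Suggested comment for Blockstore::get_complete_block() -- illustrative wording only:
    //
    // No lowest_cleanup_slot read lock is taken here. Callers either already hold
    // the lock (get_rooted_block()), or are requesting a confirmed-but-not-yet-rooted
    // block, which LedgerCleanupService will not purge because it only cleans up
    // slots older than the latest root.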

@steviez commented Feb 13, 2023

Shipping this now, given two more things:

  • t-nelson confirmed he had "no complaints"
  • segfaultdoc confirmed that the node with this patch is still running, which is further evidence that this PR seems to avoid the issue

@steviez merged commit 328b674 into solana-labs:master Feb 13, 2023
@steviez added the v1.15 (abandoned) label Feb 13, 2023
@steviez deleted the blockstore_deadlock branch February 13, 2023 23:33
mergify bot pushed a commit that referenced this pull request Feb 13, 2023
(cherry picked from commit 328b674)
steviez pushed a commit that referenced this pull request Feb 14, 2023
…kport of #30203) (#30300)

Remove recursive read lock that could deadlock Blockstore (#30203)

(cherry picked from commit 328b674)

Co-authored-by: steviez <[email protected]>
nickfrosty pushed a commit to nickfrosty/solana that referenced this pull request Mar 12, 2023
…s#30203)

Labels
v1.15 (abandoned) The v1.15 branch has been abandoned

Successfully merging this pull request may close these issues:

2x Warehouse Nodes Stopped Replaying and Uploading Blocks But Remained Online