Remove recursive read lock that could deadlock Blockstore #30203

Merged: 1 commit merged into solana-labs:master on Feb 13, 2023

Conversation

@steviez commented Feb 8, 2023

Problem

This deadlock could only occur on nodes that call
Blockstore::get_rooted_block(). Regular validators don't call this function; RPC nodes and nodes that have BigTableUploadService enabled do.

Blockstore::get_rooted_block() grabs a read lock on lowest_cleanup_slot right at the start to check if the block has been cleaned up, and to ensure it doesn't get cleaned up during execution. As part of the callstack of get_rooted_block(), Blockstore::get_completed_ranges() will get called, which also grabs a read lock on lowest_cleanup_slot.

If LedgerCleanupService attempts to grab a write lock between the read lock calls, we could hit a deadlock if priority is given to the write lock request in this scenario.

Summary of Changes

This change removes the call to get the read lock in get_completed_ranges(). The lock is only held for the scope of this function, which performs a single rocksdb read, so it isn't needed. This does mean that a different error will be returned if the requested slot was below lowest_cleanup_slot: previously, a BlockstoreError::SlotCleanedUp would have been returned; now the rocksdb error will be bubbled up. Note that callers of get_rooted_block() will still get the SlotCleanedUp error when appropriate because get_rooted_block() grabs the lock. If the slot is unavailable, it returns immediately. If the slot is available, get_rooted_block() holding the lock means the slot will remain available.

#29901 has more information on when the issue was first observed, as well as additional details such as callstacks actually observed on a deadlocked warehouse (BigTable-uploading) node.

Callstack That Could Hit Deadlock

+ Blockstore::get_rooted_block()
+!!! Blockstore::check_lowest_cleanup_slot() (line 1926) to obtain read lock
+--+ Blockstore::get_complete_block() (line 1929)
   +--+ Blockstore::get_slot_entries() (line 1948)
      +--+ Blockstore::get_slot_entries_with_shred_info() (line 2778)
         +--+ Blockstore::get_completed_ranges() (line 2790)
            +!!! Blockstore::check_lowest_cleanup_slot() (line 2856) to obtain read lock
// RPC Thread                       |  // LedgerCleanupService Thread
                                    |
// First call that gets read lock   |
let _ = lowest_cleanup_slot.read(); |
                                    |  // LCS gets write lock next
                                    |  let _ = lowest_cleanup_slot.write();
// Second read lock that could      |
// deadlock us                      |
let _ = lowest_cleanup_slot.read(); |
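
For illustration, here is a minimal, self-contained Rust sketch of the same lock ordering. This is not the actual Blockstore code; the names, sleeps, and watchdog are purely illustrative. On a writer-preferring RwLock implementation (e.g., std::sync::RwLock on Linux with Rust >= 1.62), the second read blocks behind the queued writer and the program would hang, so a watchdog thread is included to report the likely deadlock and exit instead.

    use std::sync::atomic::{AtomicBool, Ordering};
    use std::sync::{Arc, RwLock};
    use std::thread;
    use std::time::Duration;

    fn main() {
        // Stand-in for Blockstore's lowest_cleanup_slot
        let lowest_cleanup_slot = Arc::new(RwLock::new(0u64));
        let finished = Arc::new(AtomicBool::new(false));

        // Watchdog: if the "RPC thread" below hasn't finished in a few seconds,
        // report the likely deadlock and exit instead of hanging forever.
        {
            let finished = Arc::clone(&finished);
            thread::spawn(move || {
                thread::sleep(Duration::from_secs(3));
                if !finished.load(Ordering::SeqCst) {
                    eprintln!("likely deadlocked: recursive read blocked behind a queued writer");
                    std::process::exit(1);
                }
            });
        }

        // "RPC thread": first read lock (analogous to get_rooted_block()).
        let outer_read = lowest_cleanup_slot.read().unwrap();
        println!("rpc: acquired first read lock");

        // "LedgerCleanupService": request a write lock while the read lock is held.
        let cleaner_lock = Arc::clone(&lowest_cleanup_slot);
        thread::spawn(move || {
            println!("cleaner: requesting write lock");
            let mut slot = cleaner_lock.write().unwrap(); // blocks until all readers release
            *slot += 1;
            println!("cleaner: acquired write lock");
        });
        thread::sleep(Duration::from_millis(200)); // give the writer time to queue up

        // "RPC thread": recursive read lock (analogous to get_completed_ranges()).
        // On a writer-preferring implementation this blocks behind the queued
        // writer, which in turn is blocked on `outer_read` -- a deadlock.
        println!("rpc: requesting second (recursive) read lock");
        let inner_read = lowest_cleanup_slot.read().unwrap();
        println!("rpc: acquired second read lock (this platform let the reader jump the queue)");

        finished.store(true, Ordering::SeqCst);
        drop(inner_read);
        drop(outer_read);
    }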

Regarding RwLock read/write priority across Rust versions:

  • Rust < 1.62 behavior would allow the second read lock to be granted ahead of the write lock, since read locks could be held in parallel. This could hypothetically starve the writer, but with our workload I don't think this was happening in any meaningful way (i.e., probably seconds of starvation at most).
  • Rust >= 1.62 behavior will block the second read lock since a write lock request is queued ahead of it.

One thing that isn't 100% clear to me: the behavior that makes RwLock favor writers on Linux was added in Rust 1.62. However, the report came in from Jito, where they were rebased on top of v1.13 or v1.14 (which have older Rust versions). @segfaultdoc confirmed that they were using the ./cargo script to build, so they were using the "supported" Rust version.

Additionally, segfaultdoc mentioned that he saw the issue with Jito v1.13.5+ and v1.14.13 (they rebase their commits on top of our tags). But the issue didn't show up in Jito v1.13.4 for them ... this is pretty peculiar, as the diff between v1.13.4 and v1.13.5 isn't too big:

$ git log v1.13.4 v1.13.5 --pretty=oneline
9138494335acec3e5406e77587607aaba30ab95e (tag: v1.13.5) MultiIteratorScanner - improve banking stage performance with high contention
88c87d12d9c85f508e4e83528a68567b8e3c095c Delete files older than the lowest_cleanup_slot in LedgerCleanupService::cleanup_ledger (backport #26651) (#28721)
f3471e54560388a10cc56d7a0990fafaf967132d Backport timeout async send tasks to v1.13 (#28750)
959b760ca89ff192107d4a1115fc5a69ed071c8b chore: bump tj-actions/changed-files to v34 (backport #28674) (#28675)
3d4a5f69998fee65a65fb94b6a7cc94e72223fcc clean_dead_slots_from_accounts_index unrefs correctly (backport #2746… (#28582)
2e69d6d6426422f90a96edbf30fd72006a40a274 Limit async tasks spawn to runtime for quic transaction send (backport #28452) (#28603)
9d211065cc4ffc3d3b85a50a8425da0dc82df7f9 Backport of #28599 to v1.13 (#28602)
40920b4ebe9f73cecad28ac93910d677f86f0bcb chore: fix solana-private pipeline (#28609)
bf2acd04846538cd8926459891f8a7e6e072905f stats for staked/unstaked repair requests (backport #28215) (#28598)
9382e90c08613be460d77cce6118882f0ada1a3c make ping cache rate limit delay configurable (backport #27955) (#28595)
dcf4cc6471ce96f08a6c3ae4957721a17238869b ledger-tool: remove inefficient base58 encoding options (backport #28488) (#28494)
c090a6b446ac4da7fd3063fac0eee73eb4abb5b2 Bump Version to 1.13.5 (#28506)
3b81ee9c2602620ca4904df20bdf098bff47aaf9 (tag: v1.13.4) Fixes typo in bootstrap log (backport #28483) (#28491)

88c87d1 obviously sticks out as it changes LedgerCleanupService behavior, but that commit didn't change the locking behavior. Not observing the issue on a given version isn't definitive (not seeing it doesn't mean it couldn't happen), so maybe that is the case or maybe there is some nth-order effect from one of these other changes.

Fixes #29901

Comment on lines -2881 to -2882
let _lock = self.check_lowest_cleanup_slot(slot)?;

@steviez (author) commented:

I initially started on a more complicated change but landed here. My reasoning for simply ripping this out is:

  • The lock is only held for get_completed_ranges() so it wasn't providing us any guarantees for when we subsequently read the shreds from rocksdb
  • This function issues a single (atomic) rocksdb read operation, so the lock isn't needed for this function

So, by removing this, the only thing that we lose is distinguishing a slot being unavailable vs. the slot having been cleaned up when calling get_completed_ranges() directly. The only spot where we care about BlockstoreError::SlotCleanedUp is in RPC after calling get_rooted_block(). But since get_rooted_block() calls check_lowest_cleanup_slot() to check the slot and hold the lock, there is no change of behavior from RPC's point of view.

solana/rpc/src/rpc.rs

Lines 1017 to 1038 in 375f9ae

fn check_slot_cleaned_up<T>(
    &self,
    result: &std::result::Result<T, BlockstoreError>,
    slot: Slot,
) -> Result<()> {
    let first_available_block = self
        .blockstore
        .get_first_available_block()
        .unwrap_or_default();
    let err: Error = RpcCustomError::BlockCleanedUp {
        slot,
        first_available_block,
    }
    .into();
    if let Err(BlockstoreError::SlotCleanedUp) = result {
        return Err(err);
    }
    if slot < first_available_block {
        return Err(err);
    }
    Ok(())
}

@steviez (author) left a comment:

We attempt to grab a read lock within lowest_cleanup_slot() here as well; however, when we're executing this portion, shred_bytes.is_none() should never be true, because the lock claimed in get_rooted_block() at the top prevents cleanup.

if shred_bytes.is_none() {
    if let Some(slot_meta) = slot_meta {
        if slot > self.lowest_cleanup_slot() {
            panic!(

@steviez commented Feb 8, 2023

Lastly, there is another issue around this lock that has been discussed in #27195. This PR does not aim to solve that issue.

#27195 describes a similar type of race where the following could happen (line numbers refer to the code block below):

  • Thread X issues a call to Blockstore::get_entries_with_shred_info()

  • Thread X reads the SlotMeta for some slot S (S > lowest_cleanup_slot) on line 2790

  • LedgerCleanupService updates lowest_cleanup_slot to T where T > S

  • Thread X attempts to read shreds / deserialize into Entries via Blockstore::get_entries_in_data_block() on line 2812 / line 2820

    pub fn get_slot_entries_with_shred_info(
        &self,
        slot: Slot,
        start_index: u64,
        allow_dead_slots: bool,
    ) -> Result<(Vec<Entry>, u64, bool)> {
        let (completed_ranges, slot_meta) = self.get_completed_ranges(slot, start_index)?;

        // Check if the slot is dead *after* fetching completed ranges to avoid a race
        // where a slot is marked dead by another thread before the completed range query finishes.
        // This should be sufficient because full slots will never be marked dead from another thread,
        // this can only happen during entry processing during replay stage.
        if self.is_dead(slot) && !allow_dead_slots {
            return Err(BlockstoreError::DeadSlot);
        } else if completed_ranges.is_empty() {
            return Ok((vec![], 0, false));
        }

        let slot_meta = slot_meta.unwrap();
        let num_shreds = completed_ranges
            .last()
            .map(|(_, end_index)| u64::from(*end_index) - start_index + 1)
            .unwrap_or(0);

        let entries: Result<Vec<Vec<Entry>>> = if completed_ranges.len() <= 1 {
            completed_ranges
                .into_iter()
                .map(|(start_index, end_index)| {
                    self.get_entries_in_data_block(slot, start_index, end_index, Some(&slot_meta))
                })
                .collect()
        } else {
            PAR_THREAD_POOL.install(|| {
                completed_ranges
                    .into_par_iter()
                    .map(|(start_index, end_index)| {
                        self.get_entries_in_data_block(
                            slot,
                            start_index,
                            end_index,
                            Some(&slot_meta),
                        )
                    })
                    .collect()
            })
        };
        let entries: Vec<Entry> = entries?.into_iter().flatten().collect();
        Ok((entries, num_shreds, slot_meta.is_full()))
    }

  • The above fails since the shreds are no longer available for slot S. But, since S is older than lowest_cleanup_slot, we acknowledge that the slot was cleaned up and bail out. If S > lowest_cleanup_slot, something else is wrong and we panic

    if shred_bytes.is_none() {
        if let Some(slot_meta) = slot_meta {
            if slot > self.lowest_cleanup_slot() {
                panic!(

One solution to that issue could be to hold the lock for the duration of get_slot_entries_with_shred_info(); however, this could have other repercussions, so as mentioned, this PR does not aim to tackle that.
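
For concreteness, here is a toy sketch of what that alternative shape would look like. ToyBlockstore, read_slot_meta(), and read_shreds() are hypothetical stand-ins, not the real Blockstore API; the only point is the lock scope.

    use std::sync::RwLock;

    // Hypothetical, simplified model -- not the real Blockstore. It only shows
    // the lock scope of the alternative fix: hold the read lock across *both*
    // reads so the cleaner (which needs the write lock) cannot purge in between.
    struct ToyBlockstore {
        lowest_cleanup_slot: RwLock<u64>,
    }

    impl ToyBlockstore {
        fn get_slot_entries_guarded(&self, slot: u64) -> Result<Vec<String>, String> {
            // Guard held until the end of the function.
            let lowest = self.lowest_cleanup_slot.read().unwrap();
            if slot <= *lowest {
                return Err("SlotCleanedUp".to_string());
            }
            let meta = self.read_slot_meta(slot)?; // analogous to get_completed_ranges()
            let entries = self.read_shreds(slot, &meta)?; // analogous to get_entries_in_data_block()
            Ok(entries)
            // `lowest` is dropped here; the cleaner may now take the write lock.
        }

        // Placeholders standing in for rocksdb reads.
        fn read_slot_meta(&self, slot: u64) -> Result<String, String> {
            Ok(format!("meta for slot {slot}"))
        }
        fn read_shreds(&self, slot: u64, _meta: &str) -> Result<Vec<String>, String> {
            Ok(vec![format!("entries for slot {slot}")])
        }
    }

    fn main() {
        let store = ToyBlockstore {
            lowest_cleanup_slot: RwLock::new(10),
        };
        println!("{:?}", store.get_slot_entries_guarded(42)); // Ok(...)
        println!("{:?}", store.get_slot_entries_guarded(5)); // Err("SlotCleanedUp")
    }

One obvious repercussion of that shape is that the read lock would then be held across potentially slow rocksdb reads, delaying LedgerCleanupService from acquiring the write lock.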

@CriesofCarrots commented:

Has Jito tested this patch at all?

@steviez commented Feb 8, 2023

Has Jito tested this patch at all?

I just chatted with @segfaultdoc; he was going to apply the patch and give it a spin, so I think we can hold off on shipping until we get some runtime there.

And one more note - we didn't hit this because our warehouse nodes do not set --limit-ledger-size. Without this flag, nothing is ever purged, so LedgerCleanupService never grabs a write lock on lowest_cleanup_slot.

@t-nelson commented Feb 9, 2023

My eyebrows tend to go up when I see a callstack like this. Recursive locks are almost always only needed when the architecture is busted. Also...
[Screenshot from 2023-02-09 00-18-22]

@segfaultdoc commented:

started running it roughly 5hrs ago with limit-ledger-size set to 100000000, typically see the node lock up within 12-20hrs

@steviez commented Feb 10, 2023

started running it roughly 5hrs ago with limit-ledger-size set to 100000000, typically see the node lock up within 12-20hrs

Good deal. For what it's worth, the time-to-deadlock should scale with the size of --limit-ledger-size, as LedgerCleanupService won't start grabbing the write lock until the ledger is full:

  • 100M shreds / (1k shreds / slot) = 100k slots
  • 100k slots * 0.5s / slot = 50k seconds
  • 50k seconds * (1 hour / 3600 seconds) ~= 14 hours

The point being, you can shorten the time window by reducing --limit-ledger-size down to 50M if you'd like.
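
As a quick sanity check of that arithmetic (same rough assumptions as above: ~1k shreds per slot and ~0.5s per slot; purely illustrative):

    // Back-of-envelope estimate of how long until the ledger reaches
    // --limit-ledger-size and LedgerCleanupService starts taking the write lock.
    fn hours_until_cleanup_starts(limit_ledger_size_shreds: f64) -> f64 {
        const SHREDS_PER_SLOT: f64 = 1_000.0; // assumed
        const SECONDS_PER_SLOT: f64 = 0.5; // assumed
        let slots = limit_ledger_size_shreds / SHREDS_PER_SLOT;
        slots * SECONDS_PER_SLOT / 3_600.0
    }

    fn main() {
        println!("{:.1} h", hours_until_cleanup_starts(100_000_000.0)); // ~13.9 h
        println!("{:.1} h", hours_until_cleanup_starts(50_000_000.0)); // ~6.9 h
    }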

@segfaultdoc commented:

  • 50k seconds * (1 hour / 3600 seconds) ~= 14 hours

good to know, looks like we'll find out in 4ish hours 🙏

@segfaultdoc commented:

[Screen Shot 2023-02-12 at 6 06 41 PM]
Still running like a champ. The dotted vertical line on the CPU metrics chart is when I deployed this PR

@CriesofCarrots left a comment:

Cool, thanks for the testing @segfaultdoc. This fix lgtm.

However, one question:

Blockstore::get_rooted_block() grabs a read lock on lowest_cleanup_slot right at the start to check if the block has been cleaned up, and to ensure it doesn't get cleaned up during execution.

Does the RwLock change to favoring writers mean that in v1.15 a block could now get cleaned up while get_rooted_block is running?

@steviez commented Feb 13, 2023

Does the RwLock change to favoring writers mean that in v1.15 a block could now get cleaned up while get_rooted_block is running?

Good question, and no - Blockstore::get_rooted_block() still grabs a read lock first thing, which will block LedgerCleanupService from executing the purge.

That being said, Blockstore::get_complete_block() (which is called by RPC if the block is confirmed but not rooted) doesn't get a read lock. But, LedgerCleanupService only cleans up stuff older than the latest root, so we wouldn't attempt to clean up a confirmed block.

The above fact is probably worthy of a comment in the get_complete_block() implementation to call out why it isn't necessary to grab the lock there.
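
Something along these lines would capture that reasoning (wording is only a suggestion, not code that exists in the repo):

    // Suggested comment for Blockstore::get_complete_block() -- illustrative wording only:
    //
    // No lowest_cleanup_slot read lock is taken here. Callers either already hold
    // the lock (get_rooted_block()), or are requesting a confirmed-but-not-yet-rooted
    // block, which LedgerCleanupService will not purge because it only cleans up
    // slots older than the latest root.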

@steviez commented Feb 13, 2023

Shipping this now, given two more things:

  • t-nelson confirmed he had "no complaints"
  • segfaultdoc confirmed that the node with this patch is still running, which is further evidence that this PR seems to avoid the issue

@steviez merged commit 328b674 into solana-labs:master Feb 13, 2023
@steviez added the v1.15 (abandoned) label Feb 13, 2023
@steviez deleted the blockstore_deadlock branch February 13, 2023 23:33
mergify bot pushed a commit that referenced this pull request Feb 13, 2023
(cherry picked from commit 328b674)
steviez pushed a commit that referenced this pull request Feb 14, 2023
…kport of #30203) (#30300)

Remove recursive read lock that could deadlock Blockstore (#30203)

(cherry picked from commit 328b674)

Co-authored-by: steviez <[email protected]>
nickfrosty pushed a commit to nickfrosty/solana that referenced this pull request Mar 12, 2023
…s#30203)

Labels
v1.15 (abandoned) The v1.15 branch has been abandoned

Successfully merging this pull request may close these issues:

2x Warehouse Nodes Stopped Replaying and Uploading Blocks But Remained Online