2x Warehouse Nodes Stopped Replaying and Uploading Blocks But Remained Online #29901

Closed
buffalu opened this issue Jan 25, 2023 · 18 comments · Fixed by #30203

@buffalu
Contributor

buffalu commented Jan 25, 2023

Problem

Our two warehouse nodes running v1.14.13-jito stopped processing slots last night. They were still online but weren't able to make progress. One stopped around 3AM CT and the other around 7:45AM CT.

They also stopped uploading to bigtable.

Identities:

  • 5oXsXDXM8kmcK3KjSpurxB8DsJUqWCrSiD2kcLWEniSy
  • ATb9AnWw12SsUGtG4LadUhHL276Tf51u1XTmt8kFY3iG

Proposed Solution

Determine why they stopped freezing slots and fix it.

@buffalu buffalu added the community Community contribution label Jan 25, 2023
@buffalu
Contributor Author

buffalu commented Jan 25, 2023

Screen Shot 2023-01-25 at 9 48 40 AM

@buffalu
Contributor Author

buffalu commented Jan 25, 2023

bigtable slot upload chart

Screen Shot 2023-01-25 at 9 55 51 AM

@buffalu buffalu changed the title 2x Warehouse Nodes Stopped Replaying and Uploading Blocks 2x Warehouse Nodes Stopped Replaying and Uploading Blocks But Remained Online Jan 25, 2023
@steviez
Contributor

steviez commented Jan 25, 2023

I believe bigtable will only upload rooted slots. It is possible that banks were getting frozen but not rooted (ie a consensus divergence); can you check your logs or metrics to see if your nodes were freezing banks?

@buffalu
Contributor Author

buffalu commented Jan 26, 2023

it happened again on 5oXsXDXM8kmcK3KjSpurxB8DsJUqWCrSiD2kcLWEniSy

logs: https://drive.google.com/file/d/1WVpB8pjVoUHg7WWlWdtd4uggE8SmE_fv/view?usp=share_link

bigtable upload (times are Central Time, Jan 25, 2023)
22:39:24 was the last block uploaded

Screen Shot 2023-01-25 at 11 01 29 PM

@buffalu
Contributor Author

buffalu commented Jan 26, 2023

last frozen bank:

Jan 26 04:39:35 ny-mainnet-warehouse-1 solana-validator[2643753]: [2023-01-26T04:39:35.044682103Z INFO  solana_runtime::bank] bank frozen: 174429976 hash: 2Fn27poQeRZUqbuyQgFPvzqjkReuYH2SDnAkupofpFuC accounts_delta: FUg7UnKSqy4K4W4TBCkHx8y6pPfamFWNU2LfTfdu3NWy signature_count: 2524 last_blockhash: 2avz5dUychyLytLUCpcpDHpYP16fMC6JEeHHYadRuwmw capitalization: 539312136320050691

@buffalu
Contributor Author

buffalu commented Jan 26, 2023

looks like a total deadlock? this is replay-loop-timing-stats, which seems like it should be emitted every iteration with a 1s rate limit

Screen Shot 2023-01-25 at 11 14 47 PM

@steviez
Contributor

steviez commented Jan 26, 2023

Taking a quick look at your log, I would agree that a deadlock seems like a possibility. I think getting some stack traces would be helpful to figure out where things are stuck. General process:

  1. Rebuild the binary with debug symbols:
RUSTFLAGS='-g -C force-frame-pointers=yes' cargo build --release --bin solana-validator
  2. Restart the process and wait for the hang to occur again
  3. Attach gdb and dump backtraces for all threads (the (gdb) lines are run within gdb):
$ gdb -p <PID>
(gdb) set logging file ~/gdb.log
(gdb) set logging on
(gdb) thread apply all bt full
(gdb) quit
$ cat ~/gdb.log

With the log, you should be able to inspect the backtraces and see which threads are stuck.

@segfaultdoc
Contributor

Taking a quick look at your log, I would agree that a deadlock seems like a possibility. [...]

Trying to get gdb installed on a package-manager-less system is a much more daunting task than I would have imagined. In the meantime I downgraded one of the nodes to 1.13.6 and am seeing the same symptoms of no bigtable uploads.

@kunal365roy

We've seen issues similar to this and have found that removing the accounts, ledger, and snapshots directories and restarting while fetching the snapshot will reestablish the bt connection + upload reliably. I was hoping this patch in 1.13.6 would stop the bt upload sync issue: #28728

@steviez
Contributor

steviez commented Jan 31, 2023

We've seen issues similar to this

Did your node completely lock up (ie stop replaying slots), or did your node just stop uploading to bigtable while continuing to keep up with the tip of the cluster? The issue you linked sounds like a session issue (ie the latter), whereas this issue is exploring a fundamental issue where the validator client grinds to a halt across the board.

removing the accounts, ledger and snapshots directory and restarting while fetching the snapshot will reestablish bt connection + upload reliably

This is a bandaid and should really be a last resort. Did you ever try stopping/restarting your nodes along the way, without wiping everything? If you're still encountering this, do you currently have any open issues and/or PRs?

Additionally, if you're uploading to bigtable, suppose:

  • Your nodes last uploads at slot X
  • You then wipe your local state and restart
  • Upon restart, your node fetches a snapshot at slot Y

You now have a hole in your history from (X, Y] ...

@steviez steviez self-assigned this Feb 1, 2023
@segfaultdoc
Contributor

Here's a gdb stacktrace of our locked up node: [gdb.log](url)

@kunal365roy

Did your node completely lock up (ie stop replaying slots), or did your node just stop uploading to bigtable while continuing to keep up with the tip of the cluster? The issue you linked sounds like a session issue (ie the latter), whereas this issue is exploring a fundamental issue where the validator client grinds to a halt across the board.

Ah, I read too fast, apologies; we also have this as a separate issue. I'm referring to the one where the validator continues to sync but loses the bt session, which is fixed by the latest release.

My validators have been very inconsistently syncing as of late

Additionally, if you're uploading to bigtable, suppose:
Your nodes last uploads at slot X
You then wipe your local state and restart
Upon restart, your node fetches a snapshot at slot Y
You now have a hole in your history from (X, Y] ...

Yeah, we have an alerter and an async backfill job that fills that gap with solana-ledger-tool

@segfaultdoc
Contributor

Adding to this thread that 1.13.5 ran into the issue, bumping down to 1.13.4

@steviez
Contributor

steviez commented Feb 6, 2023

Here's a gdb stacktrace of our locked up node: [gdb.log](url)

Hi @segfaultdoc - the log you posted appears to be truncated a bit, I don't see all threads; any chance you saved the output and have the full log available?

@segfaultdoc
Contributor

Here it is:
gdb.log

@steviez
Contributor

steviez commented Feb 7, 2023

Thanks for the updated gdb output. It looks like some of the threads are still missing, but this second capture has output for more threads, which helps shed some light. It looks like some threads are stuck on this lock:

pub lowest_cleanup_slot: RwLock<Slot>,

I see the solana-ledger-p (solana-ledger-purge) thread waiting on this lock:

*blockstore.lowest_cleanup_slot.write().unwrap() = lowest_cleanup_slot;

Thread 967 (Thread 0x7ea9d1d1a640 (LWP 1010644) "solana-ledger-p"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1  0x000055a9efa6a825 in std::sys::unix::futex::futex_wait () at library/std/src/sys/unix/futex.rs:62
No locals.
#2  0x000055a9edb68602 in std::sys::unix::locks::futex_rwlock::RwLock::write_contended () at library/std/src/sys/unix/locks/futex_rwlock.rs:220
No locals.
#3  0x000055a9ee0e15e3 in std::sys::unix::locks::futex_rwlock::RwLock::write (self=0x7f02562e5138) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/sys/unix/locks/futex_rwlock.rs:156
No locals.
#4  std::sys_common::rwlock::MovableRwLock::write (self=0x7f02562e5138) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/sys_common/rwlock.rs:102
No locals.
#5  std::sync::rwlock::RwLock<u64>::write<u64> (self=0x7f02562e5138) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/sync/rwlock.rs:298
No locals.
#6  solana_core::ledger_cleanup_service::{impl#0}::cleanup_ledger::{closure#0} () at core/src/ledger_cleanup_service.rs:240

Several bigtable-upload threads are also waiting on the same lock (through a series of function calls):

let lowest_cleanup_slot = self.lowest_cleanup_slot.read().unwrap();

Thread 783 (Thread 0x7eac2d4d0640 (LWP 1010407) "bigtable-upload"):
#0  syscall () at ../sysdeps/unix/sysv/linux/x86_64/syscall.S:38
No locals.
#1  0x000055a9efa6a825 in std::sys::unix::futex::futex_wait () at library/std/src/sys/unix/futex.rs:62
No locals.
#2  0x000055a9edb684cc in std::sys::unix::locks::futex_rwlock::RwLock::read_contended () at library/std/src/sys/unix/locks/futex_rwlock.rs:139
No locals.
#3  0x000055a9ee85750f in std::sys::unix::locks::futex_rwlock::RwLock::read (self=0x7f02562e5138) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/sys/unix/locks/futex_rwlock.rs:88
        state = <optimized out>
        state = <optimized out>
#4  std::sys_common::rwlock::MovableRwLock::read (self=0x7f02562e5138) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/sys_common/rwlock.rs:86
No locals.
#5  std::sync::rwlock::RwLock<u64>::read<u64> (self=0x7f02562e5138) at /rustc/a55dd71d5fb0ec5a6a3a9e8c27b2127ba491ce52/library/std/src/sync/rwlock.rs:210
No locals.
#6  solana_ledger::blockstore::Blockstore::check_lowest_cleanup_slot (self=0x7f02562e5010, slot=175822617) at ledger/src/blockstore.rs:2576
No locals.
#7  0x000055a9ee85d3ca in solana_ledger::blockstore::Blockstore::get_completed_ranges (self=0x89, slot=3221225478, start_index=139648018373181) at ledger/src/blockstore.rs:3193
No locals.
#8  0x000055a9ee85ce73 in solana_ledger::blockstore::Blockstore::get_slot_entries_with_shred_info (self=0x89, slot=175822617, start_index=0, allow_dead_slots=false) at ledger/src/blockstore.rs:3150
No locals.
#9  0x000055a9ee855db0 in solana_ledger::blockstore::Blockstore::get_slot_entries (self=0x7f02562e5010, shred_start_index=0, slot=<optimized out>) at ledger/src/blockstore.rs:3138
No locals.

We do .unwrap() when we attempt to grab the locks; these callsites didn't panic, so the lock seemingly is NOT poisoned. Also, I didn't see any mention of panics from y'all, but it might be a worthwhile sanity check real quick.

The full output could be helpful in narrowing down the search space, although there should only be a handful of threads that grab a writer on that lock, so we might be able to proceed without it as well.
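
As a quick aside on the poisoning point above, here is a minimal sketch (not project code, just std::sync::RwLock behavior) of why the absence of panics rules out poisoning: a lock only becomes poisoned when a thread panics while holding a guard, and every later lock call then returns a PoisonError, so the .unwrap()s at those callsites would have panicked:

use std::sync::RwLock;
use std::thread;

fn main() {
    let lock = RwLock::new(0u64);

    // Poison the lock by panicking while the write guard is held.
    let _ = thread::scope(|s| {
        s.spawn(|| {
            let _guard = lock.write().unwrap();
            panic!("poisons the lock");
        })
        .join() // join here so the panic is captured and scope() doesn't re-panic
    });

    // Every later lock attempt now yields Err(PoisonError); an .unwrap() here
    // would panic, which is why the lack of panics at the callsites implies
    // the lock is merely contended, not poisoned.
    assert!(lock.read().is_err());
}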

@steviez
Contributor

steviez commented Feb 8, 2023

Got back to this; it looks like we have a call-path that will attempt to obtain a read lock a second time while it is already held. The callpath is:

+ Blockstore::get_rooted_block()
+!!! Blockstore::check_lowest_cleanup_slot() (line 1926) to obtain read lock
+--+ Blockstore::get_complete_block() (line 1929)
   +--+ Blockstore::get_slot_entries() (line 1948)
      +--+ Blockstore::get_slot_entries_with_shred_info() (line 2778)
         +--+ Blockstore::get_completed_ranges() (line 2790)
            +!!! Blockstore::check_lowest_cleanup_slot() (line 2856) to obtain read lock

pub fn get_rooted_block(
    &self,
    slot: Slot,
    require_previous_blockhash: bool,
) -> Result<VersionedConfirmedBlock> {
    datapoint_info!("blockstore-rpc-api", ("method", "get_rooted_block", String));
    let _lock = self.check_lowest_cleanup_slot(slot)?;
    if self.is_root(slot) {
        return self.get_complete_block(slot, require_previous_blockhash);
    }
    Err(BlockstoreError::SlotNotRooted)
}

pub fn get_complete_block(
    &self,
    slot: Slot,
    require_previous_blockhash: bool,
) -> Result<VersionedConfirmedBlock> {
    let slot_meta_cf = self.db.column::<cf::SlotMeta>();
    let slot_meta = match slot_meta_cf.get(slot)? {
        Some(slot_meta) => slot_meta,
        None => {
            info!("SlotMeta not found for slot {}", slot);
            return Err(BlockstoreError::SlotUnavailable);
        }
    };
    if slot_meta.is_full() {
        let slot_entries = self.get_slot_entries(slot, 0)?;
        if !slot_entries.is_empty() {
            let blockhash = slot_entries
                .last()
                .map(|entry| entry.hash)
                .unwrap_or_else(|| panic!("Rooted slot {slot:?} must have blockhash"));
            let slot_transaction_iterator = slot_entries
                .into_iter()
                .flat_map(|entry| entry.transactions)
                .map(|transaction| {
                    if let Err(err) = transaction.sanitize(
                        // Don't enable additional sanitization checks until
                        // all clusters have activated the static program id
                        // feature gate so that bigtable upload isn't affected
                        false, // require_static_program_ids
                    ) {
                        warn!(
                            "Blockstore::get_block sanitize failed: {:?}, \
                            slot: {:?}, \
                            {:?}",
                            err, slot, transaction,
                        );
                    }
                    transaction
                });
            let parent_slot_entries = slot_meta
                .parent_slot
                .and_then(|parent_slot| {
                    self.get_slot_entries(parent_slot, /*shred_start_index:*/ 0)
                        .ok()
                })
                .unwrap_or_default();
            if parent_slot_entries.is_empty() && require_previous_blockhash {
                return Err(BlockstoreError::ParentEntriesUnavailable);
            }
            let previous_blockhash = if !parent_slot_entries.is_empty() {
                get_last_hash(parent_slot_entries.iter()).unwrap()
            } else {
                Hash::default()
            };
            let rewards = self
                .rewards_cf
                .get_protobuf_or_bincode::<StoredExtendedRewards>(slot)?
                .unwrap_or_default()
                .into();
            // The Blocktime and BlockHeight column families are updated asynchronously; they
            // may not be written by the time the complete slot entries are available. In this
            // case, these fields will be `None`.
            let block_time = self.blocktime_cf.get(slot)?;
            let block_height = self.block_height_cf.get(slot)?;
            let block = VersionedConfirmedBlock {
                previous_blockhash: previous_blockhash.to_string(),
                blockhash: blockhash.to_string(),
                // If the slot is full it should have parent_slot populated
                // from shreds received.
                parent_slot: slot_meta.parent_slot.unwrap(),
                transactions: self
                    .map_transactions_to_statuses(slot, slot_transaction_iterator)?,
                rewards,
                block_time,
                block_height,
            };
            return Ok(block);
        }
    }
    Err(BlockstoreError::SlotUnavailable)
}

pub fn get_slot_entries(&self, slot: Slot, shred_start_index: u64) -> Result<Vec<Entry>> {
    self.get_slot_entries_with_shred_info(slot, shred_start_index, false)
        .map(|x| x.0)
}

pub fn get_slot_entries_with_shred_info(
    &self,
    slot: Slot,
    start_index: u64,
    allow_dead_slots: bool,
) -> Result<(Vec<Entry>, u64, bool)> {
    let (completed_ranges, slot_meta) = self.get_completed_ranges(slot, start_index)?;
    // Check if the slot is dead *after* fetching completed ranges to avoid a race
    // where a slot is marked dead by another thread before the completed range query finishes.
    // This should be sufficient because full slots will never be marked dead from another thread,
    // this can only happen during entry processing during replay stage.
    if self.is_dead(slot) && !allow_dead_slots {
        return Err(BlockstoreError::DeadSlot);
    } else if completed_ranges.is_empty() {
        return Ok((vec![], 0, false));
    }
    let slot_meta = slot_meta.unwrap();
    let num_shreds = completed_ranges
        .last()
        .map(|(_, end_index)| u64::from(*end_index) - start_index + 1)
        .unwrap_or(0);
    let entries: Result<Vec<Vec<Entry>>> = if completed_ranges.len() <= 1 {
        completed_ranges
            .into_iter()
            .map(|(start_index, end_index)| {
                self.get_entries_in_data_block(slot, start_index, end_index, Some(&slot_meta))
            })
            .collect()
    } else {
        PAR_THREAD_POOL.install(|| {
            completed_ranges
                .into_par_iter()
                .map(|(start_index, end_index)| {
                    self.get_entries_in_data_block(
                        slot,
                        start_index,
                        end_index,
                        Some(&slot_meta),
                    )
                })
                .collect()
        })
    };
    let entries: Vec<Entry> = entries?.into_iter().flatten().collect();
    Ok((entries, num_shreds, slot_meta.is_full()))
}

fn get_completed_ranges(
    &self,
    slot: Slot,
    start_index: u64,
) -> Result<(CompletedRanges, Option<SlotMeta>)> {
    let _lock = self.check_lowest_cleanup_slot(slot)?;
    let slot_meta_cf = self.db.column::<cf::SlotMeta>();
    let slot_meta = slot_meta_cf.get(slot)?;
    if slot_meta.is_none() {
        return Ok((vec![], slot_meta));
    }
    let slot_meta = slot_meta.unwrap();
    // Find all the ranges for the completed data blocks
    let completed_ranges = Self::get_completed_data_ranges(
        start_index as u32,
        &slot_meta.completed_data_indexes,
        slot_meta.consumed as u32,
    );
    Ok((completed_ranges, Some(slot_meta)))
}

So, if LedgerCleanupService attempts to purge between the start and end of the above callpath, we could enter a deadlock: the queued writer blocks the second read attempt, while the outer read guard (held by the same thread) blocks the writer. I'll look at re-arranging stuff to avoid this and throw a PR up.
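
To make the interleaving concrete, below is a minimal, self-contained sketch of the same locking pattern. This is not the Blockstore code; the names and sleeps are illustrative, and it assumes (as std::sync::RwLock explicitly permits, and as the futex-based implementation in the backtraces above does) that new readers block while a writer is queued:

use std::sync::{Arc, RwLock};
use std::thread;
use std::time::Duration;

fn main() {
    let lowest_cleanup_slot = Arc::new(RwLock::new(0u64));

    // Stand-in for the bigtable-upload path: an outer read guard is held
    // (get_rooted_block) while a nested read is attempted (get_completed_ranges).
    let reader = {
        let lock = Arc::clone(&lowest_cleanup_slot);
        thread::spawn(move || {
            let _outer = lock.read().unwrap();
            thread::sleep(Duration::from_millis(100)); // give the writer time to queue
            let _inner = lock.read().unwrap(); // blocks forever once a writer is queued
        })
    };

    // Stand-in for the solana-ledger-purge path: a writer arrives between
    // the two reads and queues behind the outer read guard.
    let writer = {
        let lock = Arc::clone(&lowest_cleanup_slot);
        thread::spawn(move || {
            thread::sleep(Duration::from_millis(50));
            *lock.write().unwrap() = 42;
        })
    };

    reader.join().unwrap();
    writer.join().unwrap();
    println!("no deadlock on this platform's RwLock");
}

On such an implementation the program hangs: the writer queues behind the outer read guard, the nested read() waits for the queued writer, and the outer guard is never released because it lives on the same thread as the nested read.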

@steviez
Contributor

steviez commented Feb 9, 2023

@buffalu @segfaultdoc - The proposed fix is a little scary given that we're playing with locks, so we just want to be extra sure that there isn't anything else at play.

Could you please confirm whether this warehouse node is doing anything else (ie serving RPC requests)? Additionally, can you comment on CPU utilization before/during the observed hang?
