-
Notifications
You must be signed in to change notification settings - Fork 4.4k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Avoid to miss to root for local slots before the hard fork #19912
Avoid to miss to root for local slots before the hard fork #19912
Conversation
Codecov Report
@@ Coverage Diff @@
## master #19912 +/- ##
=========================================
- Coverage 82.1% 81.9% -0.3%
=========================================
Files 628 631 +3
Lines 171471 174118 +2647
=========================================
+ Hits 140878 142615 +1737
- Misses 30593 31503 +910 |
fd0abd5
to
f616d28
Compare
core/src/validator.rs
Outdated
reconcile_blockstore_roots_with_external_source( | ||
ExternalRootSource::HardFork(hard_fork_restart_slot), | ||
&blockstore, | ||
&mut last_blockstore_root, | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
so this is the meat of this pr
core/src/consensus.rs
Outdated
"Reconciling slots as root based on tower root: {:?} ({}..{}) ", | ||
new_roots, tower_root, last_blockstore_root | ||
"Reconciling slots as root based on external root: {:?} ({}..{}) ", | ||
new_roots, external_root, last_blockstore_root | ||
); | ||
blockstore.set_roots(new_roots.iter())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@carllin I wonder this .set_roots(...)
should be accompanied with set_duplicate_confirmed_slots_and_hashes(...)
like in blockstore_processor?:
solana/ledger/src/blockstore_processor.rs
Line 1169 in 5bbb0da
blockstore.set_duplicate_confirmed_slots_and_hashes(rooted_slots.into_iter()).expect("Blockstore::set_duplicate_confirmed should succeed"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@ryoqun ah yeah it should, good catch!
Although I think you'll see that there are no calls to fn set_duplicate_confirmed_slots_and_hashes
in 1.6 and 1.7 currently 😄
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
after some code reading, i had to make a kind of compromise regarding the treatment of duplicate confirmed slots
...: https://github.com/solana-labs/solana/pull/19912/files#r902224585
so, i've defined small fn
(mark_slots_as_if_rooted_normally_at_startup
) for the sole sake of documentation to make the nuance very explicit, to my taste. i hope it'll be more immune to comment rots. ;)
db859c1
to
195f20a
Compare
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. |
@ryoqun , any chance of resuscitating this?? |
@CriesofCarrots thanks for caring this pr. yeah, i haven't forgetting this pr at all... so stressful that i can't get to this for this extended period of time.... I'm planning to work on this, this week really. |
350c018
to
cb4e410
Compare
0ef68ea
to
c924ff8
Compare
status: I resuscitated this pr with all review so far being addressed. I'll write replies to review comments later. |
// Unfortunately, we can't supply duplicate-confirmed hashes, | ||
// because it can't be guaranteed to be able to replay these slots | ||
// under this code-path's limited condition (i.e. those shreds | ||
// might not be available, etc...) also correctly overcoming this | ||
// limitation is hard... | ||
blockstore.mark_slots_as_if_rooted_normally_at_startup( | ||
new_roots.into_iter().map(|root| (root, None)).collect(), | ||
false, | ||
)?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hmmm yeah this is yet another tricky edge case to reason about, but I think it should be ok.
Not marking these slots as duplicate confirmed just means:
- we can't serve requests to people asking for the correct version of these slots via
AncestorHashesRepairType
- Should be safe as long as we're not freezing the slot again during replay (not happening b/c rooted)
// blockstore.last_root() might have been updated already. | ||
// so take a &mut param both to input (and output iff we update root) | ||
last_blockstore_root: &mut Slot, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
there
@@ -430,7 +429,7 @@ pub mod tests { | |||
OptimisticallyConfirmedBank::locked_from_bank_forks_root(&bank_forks), | |||
)), | |||
&poh_recorder, | |||
tower, | |||
None, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
info!("Tower state: {:?}", tower); | ||
tower | ||
} else { | ||
warn!("creating default tower...."); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
|
||
impl<'a> From<ProcessBlockStore<'a>> for Tower { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@mvines I don't think this is a correct use of From
, which is introduced at #23852 (https://github.com/solana-labs/solana/pull/23852/files#r902237878).
This is converting quite different things. ideally, From
should be a cast for conceptually similar things. Also, this made code reading a bit hard because i can't easily spot when ProcessBlockStore::process
is exactly executed now. (i.e. searching the correct .into()
is hard)
it seems this trait workaround is only needed for this test test_tvu_exit and i think alternative approach should arguably be tolerable (still not ideal, i skipped properly introducing enum or trait just for this purpose...): https://github.com/solana-labs/solana/pull/19912/files#r902243960
btw, i didn't know there is an identity/no-op blanket From
/Into
impl provided here
https://doc.rust-lang.org/src/core/convert/mod.rs.html#559:
// From (and thus Into) is reflexive
#[stable(feature = "rust1", since = "1.0.0")]
#[rustc_const_unstable(feature = "const_convert", issue = "88674")]
impl<T> const From<T> for T {
/// Returns the argument unchanged.
fn from(t: T) -> T {
t
}
}
granted, this refactor isn't relevant to this pr but i couldn't resist the urge, when i was code-reading to guarantee the safety of the scary blockstore mutation with hard fork params.
happy to separate this change into another preparatory pr, if you desire. :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, this looks fine to me
this is finally done! @carllin could you review this? cc: @CriesofCarrots |
sanity checked and got:
seems working properly. |
seems this worked for 2023-02-26 mainnet-beta outage:
|
assert_eq!(&slots_a[slots_a.len() - roots_a.len()..].to_vec(), &roots_a); | ||
assert_eq!(&slots_b[slots_b.len() - roots_b.len()..].to_vec(), &roots_b); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
here
/// Dangerous. Currently only needed for a local-cluster test | ||
pub fn unset_parent(&mut self) { | ||
self.parent_slot = None; | ||
} | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Problem
some rooted slots could be omitted to be marked.
some rooted slots could be omitted after restart (hard fork), if the hard fork slot is ahead of the validator's latest ledger root before the restart.
This is usually the case because current practice of cluster restart is based on the opt-conf-ed slot. (i.e. no node should have rooted that slot yet)
ultimately, this leads to hole in bigtable.
cc: @mvines @sakridge @carllin
Summary of Changes
fix it. (this is wip; I'm working on to fix the actual hole now).impl finishedrelated: #13536