Add RestartLastVotedForkSlots for wen_restart. #33239
Changes from 14 commits
@@ -4,7 +4,7 @@ use {
         contact_info::ContactInfo,
         deprecated,
         duplicate_shred::{DuplicateShred, DuplicateShredIndex, MAX_DUPLICATE_SHREDS},
-        epoch_slots::EpochSlots,
+        epoch_slots::{CompressedSlotsVec, EpochSlots},
         legacy_contact_info::LegacyContactInfo,
     },
     bincode::{serialize, serialized_size},
@@ -38,6 +38,8 @@ pub const MAX_VOTES: VoteIndex = 32;

 pub type EpochSlotsIndex = u8;
 pub const MAX_EPOCH_SLOTS: EpochSlotsIndex = 255;
+// We now keep 81000 slots, 81000/MAX_SLOTS_PER_ENTRY = 5.
+pub(crate) const MAX_RESTART_LAST_VOTED_FORK_SLOTS: EpochSlotsIndex = 5;

 /// CrdsValue that is replicated across the cluster
 #[derive(Serialize, Deserialize, Clone, Debug, PartialEq, Eq, AbiExample)]
@@ -94,6 +96,7 @@ pub enum CrdsData {
     DuplicateShred(DuplicateShredIndex, DuplicateShred),
     SnapshotHashes(SnapshotHashes),
     ContactInfo(ContactInfo),
+    RestartLastVotedForkSlots(EpochSlotsIndex, RestartLastVotedForkSlots),
 }

 impl Sanitize for CrdsData {
@@ -132,6 +135,12 @@ impl Sanitize for CrdsData {
             }
             CrdsData::SnapshotHashes(val) => val.sanitize(),
             CrdsData::ContactInfo(node) => node.sanitize(),
+            CrdsData::RestartLastVotedForkSlots(ix, slots) => {
+                if *ix >= MAX_RESTART_LAST_VOTED_FORK_SLOTS {
+                    return Err(SanitizeError::ValueOutOfBounds);
+                }
+                slots.sanitize()
+            }
         }
     }
 }
@@ -145,7 +154,7 @@ pub(crate) fn new_rand_timestamp<R: Rng>(rng: &mut R) -> u64 {
 impl CrdsData {
     /// New random CrdsData for tests and benchmarks.
     fn new_rand<R: Rng>(rng: &mut R, pubkey: Option<Pubkey>) -> CrdsData {
-        let kind = rng.gen_range(0..7);
+        let kind = rng.gen_range(0..8);
         // TODO: Implement other kinds of CrdsData here.
         // TODO: Assign ranges to each arm proportional to their frequency in
         // the mainnet crds table.
@@ -157,6 +166,10 @@ impl CrdsData {
             3 => CrdsData::AccountsHashes(AccountsHashes::new_rand(rng, pubkey)),
             4 => CrdsData::Version(Version::new_rand(rng, pubkey)),
             5 => CrdsData::Vote(rng.gen_range(0..MAX_VOTES), Vote::new_rand(rng, pubkey)),
+            6 => CrdsData::RestartLastVotedForkSlots(
+                rng.gen_range(0..MAX_RESTART_LAST_VOTED_FORK_SLOTS),
+                RestartLastVotedForkSlots::new_rand(rng, pubkey),
+            ),
             _ => CrdsData::EpochSlots(
                 rng.gen_range(0..MAX_EPOCH_SLOTS),
                 EpochSlots::new_rand(rng, pubkey),
@@ -485,6 +498,50 @@ impl Sanitize for NodeInstance {
     }
 }

+#[derive(Serialize, Deserialize, Clone, Default, PartialEq, Eq, AbiExample, Debug)]
+pub struct RestartLastVotedForkSlots {
+    pub from: Pubkey,
+    pub wallclock: u64,
+    pub slots: CompressedSlotsVec,
+    pub last_voted_slot: Slot,
+    pub last_voted_hash: Hash,
+}
+
+impl Sanitize for RestartLastVotedForkSlots {
+    fn sanitize(&self) -> std::result::Result<(), SanitizeError> {
+        self.slots.sanitize()?;
+        self.last_voted_hash.sanitize()
+    }
+}
+
+impl RestartLastVotedForkSlots {
+    pub fn new(from: Pubkey, now: u64, last_voted_slot: Slot, last_voted_hash: Hash) -> Self {
+        Self {
+            from,
+            wallclock: now,
+            slots: CompressedSlotsVec::new(),
+            last_voted_slot,
+            last_voted_hash,
+        }
+    }
+
+    /// New random RestartLastVotedForkSlots for tests and benchmarks.
+    pub fn new_rand<R: Rng>(rng: &mut R, pubkey: Option<Pubkey>) -> Self {
+        let pubkey = pubkey.unwrap_or_else(solana_sdk::pubkey::new_rand);
+        Self {
+            from: pubkey,
+            wallclock: new_rand_timestamp(rng),
+            slots: CompressedSlotsVec::new_rand(rng),
+            last_voted_slot: rng.gen_range(0..512),
+            last_voted_hash: Hash::new_unique(),
+        }
+    }
+
+    pub fn fill(&mut self, update: &[Slot]) -> usize {
+        self.slots.fill(update)
+    }
+}

 /// Type of the replicated value
 /// These are labels for values in a record that is associated with `Pubkey`
 #[derive(PartialEq, Hash, Eq, Clone, Debug)]

Review (on the `last_voted_slot` field): "With this new struct, do we still need […]?" — Author: "Removed."
Review (on the struct definition): "I would add […]" — Author: "Done."
@@ -501,6 +558,7 @@ pub enum CrdsValueLabel {
     DuplicateShred(DuplicateShredIndex, Pubkey),
     SnapshotHashes(Pubkey),
     ContactInfo(Pubkey),
+    RestartLastVotedForkSlots(EpochSlotsIndex, Pubkey),
 }

 impl fmt::Display for CrdsValueLabel {
@@ -524,6 +582,9 @@ impl fmt::Display for CrdsValueLabel {
                 write!(f, "SnapshotHashes({})", self.pubkey())
             }
             CrdsValueLabel::ContactInfo(_) => write!(f, "ContactInfo({})", self.pubkey()),
+            CrdsValueLabel::RestartLastVotedForkSlots(ix, _) => {
+                write!(f, "RestartLastVotedForkSlots({}, {})", ix, self.pubkey())
+            }
         }
     }
 }
@@ -543,6 +604,7 @@ impl CrdsValueLabel {
             CrdsValueLabel::DuplicateShred(_, p) => *p,
             CrdsValueLabel::SnapshotHashes(p) => *p,
             CrdsValueLabel::ContactInfo(pubkey) => *pubkey,
+            CrdsValueLabel::RestartLastVotedForkSlots(_, p) => *p,
         }
     }
 }
@@ -593,6 +655,7 @@ impl CrdsValue {
             CrdsData::DuplicateShred(_, shred) => shred.wallclock,
             CrdsData::SnapshotHashes(hash) => hash.wallclock,
             CrdsData::ContactInfo(node) => node.wallclock(),
+            CrdsData::RestartLastVotedForkSlots(_, slots) => slots.wallclock,
         }
     }
     pub fn pubkey(&self) -> Pubkey {
@@ -609,6 +672,7 @@ impl CrdsValue {
             CrdsData::DuplicateShred(_, shred) => shred.from,
             CrdsData::SnapshotHashes(hash) => hash.from,
             CrdsData::ContactInfo(node) => *node.pubkey(),
+            CrdsData::RestartLastVotedForkSlots(_, slots) => slots.from,
         }
     }
     pub fn label(&self) -> CrdsValueLabel {
@@ -627,6 +691,9 @@ impl CrdsValue {
             CrdsData::DuplicateShred(ix, shred) => CrdsValueLabel::DuplicateShred(*ix, shred.from),
             CrdsData::SnapshotHashes(_) => CrdsValueLabel::SnapshotHashes(self.pubkey()),
             CrdsData::ContactInfo(node) => CrdsValueLabel::ContactInfo(*node.pubkey()),
+            CrdsData::RestartLastVotedForkSlots(ix, _) => {
+                CrdsValueLabel::RestartLastVotedForkSlots(*ix, self.pubkey())
+            }
         }
     }
     pub fn contact_info(&self) -> Option<&LegacyContactInfo> {
@@ -1073,4 +1140,38 @@ mod test {
         assert!(node.should_force_push(&pubkey));
         assert!(!node.should_force_push(&Pubkey::new_unique()));
     }
+
+    #[test]
+    fn test_restart_last_voted_fork_slots() {
+        let keypair = Keypair::new();
+        let slot = 53;
+        let slot_parent = slot - 5;
+        let mut slots =
+            RestartLastVotedForkSlots::new(keypair.pubkey(), timestamp(), slot, Hash::default());
+        let original_slots_vec = [slot_parent, slot];
+        slots.fill(&original_slots_vec);
+        let ix = 1;
+        let value = CrdsValue::new_signed(
+            CrdsData::RestartLastVotedForkSlots(ix, slots.clone()),
+            &keypair,
+        );
+        assert_eq!(value.sanitize(), Ok(()));
+        let label = value.label();
+        assert_eq!(
+            label,
+            CrdsValueLabel::RestartLastVotedForkSlots(ix, keypair.pubkey())
+        );
+        assert_eq!(label.pubkey(), keypair.pubkey());
+        assert_eq!(value.wallclock(), slots.wallclock);
+        let retrieved_slots = slots.slots.to_slots(0);
+        assert_eq!(retrieved_slots.len(), 2);
+        assert_eq!(retrieved_slots[0], slot_parent);
+        assert_eq!(retrieved_slots[1], slot);
+
+        let bad_value = CrdsValue::new_signed(
+            CrdsData::RestartLastVotedForkSlots(MAX_RESTART_LAST_VOTED_FORK_SLOTS, slots),
+            &keypair,
+        );
+        assert_eq!(bad_value.sanitize(), Err(SanitizeError::ValueOutOfBounds))
+    }
 }
Where is this 81000 coming from? These CRDS values aren't really cheap. Do we really need this many?
The number comes from solana-foundation/solana-improvement-documents#46.

During an outage we don't know each other's status, so there is no way to know how lagged others are. So we send out 9 hours of slots, hoping that we can find out about an outage within 9 hours.

The values are big, but we thought it was okay because these messages are only transmitted during an outage. Later, after I add the command line --wen_restart flag, I can come back and filter out these CRDS values during non-restart mode if that makes you feel safer.
81_000 slots is ~5MB of data per node in the cluster! Also, I am not sure that a node which is 81_000 slots behind other nodes can repair all those slots, so what is the point? I think we really need to revise this number lower.
As I understand it, 9 hours is chosen because in the past our outage resolution normally took 4~7 hours just for the validator-restart part, and we add a few hours to be on the cautious side. Not every validator operator has a pager, but most of them will tend to things within 9 hours.

Sending 81k slots in the gossip messages doesn't mean you need to repair all those slots. Most of them are probably older than your local root, so you will instantly drop them on the floor. The tricky part here is that you don't know what the status of the others is, so you really want to send more than they probably need to make the restart successful.
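As a side note, the 9-hour window is exactly where the 81_000 figure in this thread comes from. A minimal sketch of that arithmetic, assuming Solana's 400ms target slot time (the 400ms constant is background knowledge, not stated in this thread):

```rust
fn main() {
    let slot_time_ms: u64 = 400; // Solana's target slot duration
    let hours: u64 = 9; // coverage window proposed in the discussion

    // 9 h * 3600 s/h * 1000 ms/s / 400 ms/slot = 81_000 slots
    let slots = hours * 3600 * 1000 / slot_time_ms;
    assert_eq!(slots, 81_000);
    println!("{slots}");
}
```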
There are 2 different scenarios: […]

Even before all of that, why do we need to send the parent slots to begin with? We are already able to repair a fork when we are missing some of its parent slots by sending a repair orphan request: https://github.com/solana-labs/solana/blob/bc2b37276/core/src/repair/serve_repair.rs#L240-L243

Why do we need to duplicate that protocol here? The node can already recover the whole fork by repairing off the last slot.
By the way, I think 81k slots is less than 5MB, because it's a bitmap. I assume the current MAX_SLOTS_PER_ENTRY is chosen so each EpochSlots piece fits within a 1460-byte UDP packet, and MAX_RESTART_LAST_VOTED_FORK_SLOTS is 5, which is ~7KB per validator. Even if it were a 64KB UDP packet, that's still smaller than 5MB per validator. How much memory do the current EpochSlots messages occupy? Considering MAX_EPOCH_SLOTS is 255, this should be much smaller than that. Is my calculation off?

Also, if the whole cluster is stuck, I think operators will be okay with consuming some bandwidth to quickly bring the cluster back up. If sending 81k slots of information means they don't actually need to repair all the slots, maybe we would even save more on repair bandwidth.

I think the "repair a batch at a time backwards" method you mentioned sounds interesting, but you never know what random validators can throw at you, so a few validators throwing you slots far into the future can keep you busy at repair for quite some time.
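The size estimates traded in this exchange can be checked back-of-the-envelope. This is only a sketch of the thread's own numbers: it assumes a plain one-bit-per-slot bitmap and the 1460-byte payload figure quoted above, not the actual CompressedSlots wire encoding (which may be run-length or flate2-compressed and thus smaller):

```rust
fn main() {
    const SLOTS_TO_SEND: u64 = 81_000; // 9 hours of slots, per the thread
    const MAX_SLOTS_PER_ENTRY: u64 = 16_384; // slots per EpochSlots-style entry
    const PACKET_PAYLOAD: u64 = 1_460; // UDP payload size assumed in the thread

    // A plain bitmap needs one bit per slot: ~10KB, nowhere near 5MB.
    let bitmap_bytes = SLOTS_TO_SEND / 8;
    // CRDS entries needed to cover all slots (ceiling division).
    let entries = (SLOTS_TO_SEND + MAX_SLOTS_PER_ENTRY - 1) / MAX_SLOTS_PER_ENTRY;
    // Rough gossip cost per validator if each entry fills one packet: ~7KB.
    let gossip_bytes = entries * PACKET_PAYLOAD;

    println!("bitmap: {bitmap_bytes} B, entries: {entries}, gossip: {gossip_bytes} B");
    // bitmap: 10125 B, entries: 5, gossip: 7300 B
}
```
This supports the "~7KB per validator, not 5MB" side of the argument, modulo per-entry CRDS overhead.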
@behzad assume the worst: turbine / repair is congested (similar to the last outage). This protocol will always find the OC restart slot, whereas picking it by hand is prone to human error.
@AshwinSekar gossip has a lot more load than turbine and repair combined. I don't think we can solve congestion on one protocol by adding more load to the one that is already more overloaded.

@wen-coding When this was initially discussed we were talking about only 2 new CRDS values. This commit itself is adding 5 new pretty big values, where each one fills up a packet. It also has some overlap with the already existing repair orphan request, and I am not convinced this duplication will buy us anything in practice. The design where you cannot tell that slots.last() is not the parent of last_voted_slot, and where you rely on receiving all the CRDS values, is also not ideal, because gossip does not provide such guarantees. Can you revisit the design addressing these issues?
Thought about it more; I think you had a point that we probably do not need to send 81k slots in most cases. Reading previous outage reports, validators around the world normally entered the outage at about the same time, so the difference between their roots is normally much smaller than the duration of the outages.

That said, I still want to play it safe and send enough slots to rescue the occasional outliers, so I propose lowering the original 81k number to MAX_SLOTS_PER_ENTRY, which is 16384. This lowers the number of packets per validator from 5 to 1, and the code handling the packet doesn't need to deal with holes in between any more.

@carllin @mvines @t-nelson Are you okay with lowering 81k slots to 16k slots given the reasoning above? @behzadnouri are you okay with it given we are only adding 1 RestartLastVotedForkSlots per validator now?

Regarding "some overlap with already existing repair orphan request", I think: […]

I'd be happy to organize a quick meeting if we can't reach agreement over GitHub reviews.
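The packet-count claim in this proposal is simple ceiling arithmetic, sketched below with the constants named in the thread (`entries_needed` is a hypothetical helper for illustration, not a function in the codebase):

```rust
// How many EpochSlots-style CRDS entries are needed to cover a slot window,
// given that each entry holds at most `per_entry` slots.
fn entries_needed(slots: u64, per_entry: u64) -> u64 {
    (slots + per_entry - 1) / per_entry // ceiling division
}

fn main() {
    const MAX_SLOTS_PER_ENTRY: u64 = 16_384;
    // Original proposal: 81k slots -> 5 entries (5 packets per validator).
    assert_eq!(entries_needed(81_000, MAX_SLOTS_PER_ENTRY), 5);
    // Lowered proposal: cap the window at MAX_SLOTS_PER_ENTRY -> 1 entry.
    assert_eq!(entries_needed(16_384, MAX_SLOTS_PER_ENTRY), 1);
    println!("ok");
}
```
With a single entry there are no inter-entry holes to handle, which is the simplification the proposal relies on.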
I think that's ok; with only one such CRDS value the parent is also inferred, which should address @behzadnouri's concern.