Add RestartLastVotedForkSlots for wen_restart. #33239
Conversation
Can you please add each new crds value in its own separate PR?
Sure, reverted RestartHeaviestFork, this PR now only has RestartLastVotedForkSlots.
The coverage CI is failing:
You need to update the abi digest.
gossip/src/cluster_info.rs (Outdated)
fn push_epoch_slots_or_restart_last_voted_fork_slots(
    &self,
    mut update: &[Slot],
    last_vote_bankhash: Option<Hash>,
) {
Why are you mixing this value with EpochSlots? This function has become more convoluted with several if is_epoch_slot branches.
Split into two functions now.
gossip/src/cluster_info.rs (Outdated)
let origin = entry.value.pubkey();
gossip_crds.get_shred_version(&origin) == self_shred_version
If we need this check, shouldn't shred version be included in the RestartLastVotedForkSlots? There is no guarantee that the RestartLastVotedForkSlots and the ContactInfo you obtain shred-version from are in sync.
@wen-coding do we need to include shred_version in RestartLastVotedForkSlots?
This code was copied from get_epoch_slots, so I think all other Gossip messages which check whether shred_version matches would have the same problem, since we could always have a lag on ContactInfo shred-version.
So shred_version is just a sanity check, because it is so easy to circumvent in a real attack. But it's nice to have for naive users pointing mainnet validators at testnet or vice versa. I could go either way here, but would prefer to not include shred_version in RestartLastVotedForkSlots for brevity. Using a contact-info shred_version which may occasionally go out of sync is fine, we will drop some messages but can pick them up later I believe? For RestartLastVotedForkSlots since we draw a line between Gossip messages in restart and those not in restart, this check is less important.
> This code was copied from get_epoch_slots

- get_epoch_slots may have been done improperly. Copying the same mistakes will not help.
- For EpochSlots we are not expecting a shred-version change. Here there is always a shred-version change when the RestartLastVotedForkSlots is pushed or needed.

> I think all other Gossip messages which check whether shred_version matches would have the same problem

Most of those which check shred-version are node's sockets, which are obtained from the same contact-info message which has shred-version.

> For RestartLastVotedForkSlots since we draw a line between Gossip messages in restart and those not in restart, this check is less important.

I don't understand this, why? My understanding is that for RestartLastVotedForkSlots it is even more important, because, again, there is always a shred-version change when the RestartLastVotedForkSlots is pushed or needed.
Okay, maybe it is an overstatement to say this check is less important.
The shred_version change was put in mainly to prevent RestartLastVotedForkSlots messages from occupying CRDS table space in non-restarting validators, because they are big. But later we decided to add an extra check ("normal validators just don't accept RestartLastVotedForkSlots at all, no matter what shred_version it has"), so the shred_version change becomes an extra protection.
Let's say one validator has its contact info propagating slower than RestartLastVotedForkSlots, so some neighbors dropped its RestartLastVotedForkSlots because of shred_version mismatch. I was hoping newly restarted validators will do a Gossip pull for all messages and somehow help spread the lost RestartLastVotedForkSlots messages to others?
If that's not the case then it might be better to add shred_version in the messages themselves.
gossip/src/cluster_info.rs (Outdated)
})
.map(|entry| match &entry.value.data {
    CrdsData::RestartLastVotedForkSlots(index, slots, last_vote, last_vote_hash) => {
        (*index, slots.clone(), *last_vote, *last_vote_hash)
A tuple with 4 fields is not ideal. I don't think you would need to expose EpochSlotsIndex outside of gossip code. You probably also need to define a struct for RestartLastVotedForkSlots.
Changed.
gossip/src/crds_value.rs (Outdated)
@@ -38,6 +38,8 @@ pub const MAX_VOTES: VoteIndex = 32;
pub type EpochSlotsIndex = u8;
pub const MAX_EPOCH_SLOTS: EpochSlotsIndex = 255;
// We now keep 81000 slots, 81000/MAX_SLOTS_PER_ENTRY = 5.
pub const MAX_RESTART_LAST_VOTED_FORK_SLOTS: EpochSlotsIndex = 5;
does this have to be pub?
Changed to pub(crate).
gossip/src/crds_value.rs (Outdated)
6 => CrdsData::RestartLastVotedForkSlots(
    rng.gen_range(0..MAX_EPOCH_SLOTS),
    EpochSlots::new_rand(rng, pubkey),
    rng.gen_range(0..512),
    Hash::new_unique(),
),
_ => CrdsData::EpochSlots(
    rng.gen_range(0..MAX_RESTART_LAST_VOTED_FORK_SLOTS),
    EpochSlots::new_rand(rng, pubkey),
This looks wrong. You are using MAX_EPOCH_SLOTS for RestartLastVotedForkSlots and MAX_RESTART_LAST_VOTED_FORK_SLOTS for EpochSlots.
Oops, good catch, thanks for pointing that out.
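(The fix is presumably just swapping the two constants back, i.e. something along these lines:)

```rust
6 => CrdsData::RestartLastVotedForkSlots(
    // index bounded by the new, smaller constant
    rng.gen_range(0..MAX_RESTART_LAST_VOTED_FORK_SLOTS),
    EpochSlots::new_rand(rng, pubkey),
    rng.gen_range(0..512),
    Hash::new_unique(),
),
_ => CrdsData::EpochSlots(
    // EpochSlots keeps its original 0..255 index range
    rng.gen_range(0..MAX_EPOCH_SLOTS),
    EpochSlots::new_rand(rng, pubkey),
),
```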
gossip/src/crds_value.rs (Outdated)
@@ -38,6 +38,8 @@ pub const MAX_VOTES: VoteIndex = 32;
pub type EpochSlotsIndex = u8;
pub const MAX_EPOCH_SLOTS: EpochSlotsIndex = 255;
// We now keep 81000 slots, 81000/MAX_SLOTS_PER_ENTRY = 5.
Where is this 81000 coming from? These CRDS values aren't really cheap. Do we really need this many?
The number comes from solana-foundation/solana-improvement-documents#46.
During an outage we don't know each other's status, so there is no way to know how lagged others are. So we send out 9 hours of slots, hoping that we can find out about an outage in 9 hours.
The values are big, but we thought it was okay because these messages are only transmitted during an outage. Later after I add the command line --wen_restart flag I can come back and filter out these CRDS values during non-restart mode if that makes you feel safer.
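(As a rough sanity check of those numbers, assuming the nominal 400 ms slot time and the MAX_SLOTS_PER_ENTRY = 16384 mentioned later in this thread:)

```rust
// Back-of-the-envelope check, not taken from the PR: 81_000 slots at ~400 ms per
// slot is 9 hours of chain time, and it takes ceil(81_000 / 16_384) = 5 CRDS entries.
fn main() {
    let slots: u64 = 81_000;
    let hours = slots as f64 * 0.4 / 3600.0; // 400 ms per slot
    let entries = (slots + 16_384 - 1) / 16_384; // MAX_SLOTS_PER_ENTRY-sized pieces
    println!("{slots} slots ~= {hours} hours, {entries} CRDS entries"); // 9 hours, 5 entries
}
```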
81_000 slots is ~5MB of data per node in the cluster!!!
Also, I am not sure that, if a node is 81_000 slots behind other nodes, it can repair all those slots, so what is the point?
I think we really need to revise this number lower.
As I understand it, 9 hours was chosen because in past outages the validator-restart part of the resolution alone normally took 4~7 hours, and we add a few hours to be on the cautious side. Not every validator operator has a pager, but most of them will tend to things within 9 hours.
Sending 81k slots on the Gossip messages doesn't mean you need to repair all those slots. Most of them are probably older than your local root so you will instantly dump it on the floor.
The tricky part here is that you don't know what the status of others is, so you really want to send more than they probably need to make the restart successful.
> Sending 81k slots on the Gossip messages doesn't mean you need to repair all those slots. Most of them are probably older than your local root so you will instantly dump it on the floor.

There are 2 different scenarios:

- If a node has some of the intermediate slots, then it can already repair missing slots and complete the chain. No need to send 81k slots to it.
- If a node does not have any of the intermediate slots, do we really expect it to be able to repair 81k slots?

Even before all these, why do we need to send the parent slots to begin with? We are already able to repair a fork if we are missing some of the parent slots in the fork by sending repair orphan request:
https://github.com/solana-labs/solana/blob/bc2b37276/core/src/repair/serve_repair.rs#L240-L243
Why do we need to duplicate that protocol here? The node can already recover the whole fork by repairing off the last slot.
By the way, I think 81k slots is less than 5MB, because it's a bitmap. I assume the current MAX_SLOTS_PER_ENTRY is chosen so each EpochSlots piece fits within a 1460 byte UDP packet, and MAX_RESTART_LAST_VOTED_FORK_SLOTS is 5, which is ~7KB per validator? Let's say it's a 64K UDP packet, that's still smaller than 5MB per validator.
How much memory do the current EpochSlots messages occupy? Considering MAX_EPOCH_SLOTS is 255, this should be much smaller than that.
Is my calculation off?
Also, if the whole cluster is stuck, I think operators will be okay with consuming some bandwidth to quickly bring the cluster back up. If sending 81k slot information means they actually don't need to repair all the slots, maybe we would even save more on the repair bandwidth.
I think the "repair a batch at a time backwards" method you mentioned sounds interesting, but you never know what random validators can throw at you, so a few validators throwing you slots into the far future can make you really busy at repair for quite some time.
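(For reference, a rough size check of the figures in this comment; the 1460-byte packet budget and the 5-entry count are the assumptions stated above, not measured numbers:)

```rust
// Sketch: if each RestartLastVotedForkSlots CRDS entry fits one ~1460-byte UDP
// payload and a validator pushes 5 of them, the total is far below 5 MB.
fn main() {
    let udp_payload: u64 = 1460; // assumed per-entry budget
    let entries: u64 = 5; // MAX_RESTART_LAST_VOTED_FORK_SLOTS
    println!("~{} bytes per validator", udp_payload * entries); // ~7.3 KB
    // Even as a raw, uncompressed bitmap, 81_000 slots is only ~10 KB:
    println!("raw bitmap: ~{} bytes", 81_000 / 8);
}
```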
@behzad assume the worst: turbine / repair is congested (similar to the last outage). This protocol will always find the OC restart slot, which is otherwise prone to human error.
@AshwinSekar gossip has a lot more load than turbine and repair combined. I don't think we can solve congestion on one protocol by adding more load to the one already more overloaded.
@wen-coding When this was initially discussed we were talking about only 2 new crds values. This commit itself is adding 5 new pretty big values, where each one fills up a packet.
It also has some overlap with the already existing repair orphan request, and I am not convinced this duplication will buy us anything in practice.
The design where you cannot tell that slots.last() is not the parent of last_voted_slot, and you rely on receiving all crds values, is also not ideal, because gossip does not provide such guarantees.
Can you revisit the design to address these issues?
Thought about it more, I think you had a point that we probably do not need to send 81k slots in most cases. Reading previous outage reports, validators around the world normally entered the outage at about the same time, so the difference between their roots is normally much smaller than the duration of the outage.
That said, I still want to play it safe and send enough slots to rescue the occasional outliers, so I propose lowering the original 81k number to MAX_SLOTS_PER_ENTRY, which is 16384. This lowers the number of packets per validator from 5 to 1, and the code handling the packet doesn't need to deal with holes in between any more.
@carllin @mvines @t-nelson Are you okay with lowering 81k slots to 16k slots given the reasoning above?
@behzadnouri are you okay with it given we are only adding 1 RestartLastVotedForkSlots per validator now?
Regarding "some overlap with already existing repair orphan request", I think:
- If we are in an outage, apparently that's because the repair orphan request didn't save us.
- The repair orphan request can figure out the parent relationship between blocks and repair missing blocks on the fork, but it's not a global view, so it can't tell us whether some blocks can be ignored. To quickly reach consensus between validators you do need a global view, which is something the current repair service can't give you; you need some sort of global consensus to make validators agree on something again.
I'd be happy to organize a quick meeting if we can't reach agreement over github reviews.
I think that's ok; with only one single such crds value the parent is also inferred, which should address @behzadnouri's concern.
Codecov Report
@@ Coverage Diff @@
## master #33239 +/- ##
========================================
Coverage 81.7% 81.7%
========================================
Files 807 807
Lines 218252 218366 +114
========================================
+ Hits 178438 178618 +180
+ Misses 39814 39748 -66
This code is several separate changes:

- Defining the new RestartLastVotedForkSlots struct and adding it to the gossip CRDS table.
- An internal CRDS index for retrieval, which I believe is unnecessary.
- The api for pushing and retrieving the values in cluster_info.rs.

It would save everyone's time and reduce the number of back and forths if we just do one at a time and review smaller changes.
gossip/src/crds.rs (Outdated)
@@ -169,6 +171,7 @@ impl Default for Crds {
    votes: BTreeMap::default(),
    epoch_slots: BTreeMap::default(),
    duplicate_shreds: BTreeMap::default(),
    restart_last_voted_fork_slots: BTreeMap::default(),
If these values are only going to matter during restart, then we don't need the overhead of this index.
You can just use this function:
https://github.com/solana-labs/solana/blob/6db57f81d/gossip/src/crds.rs#L380-L390
and filter on RestartLastVotedForkSlots.
That's fair, removed now.
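(For reference, the suggestion amounts to scanning CRDS entries on demand instead of maintaining a dedicated index; a minimal sketch, assuming the linked get_entries cursor helper and a single-field enum variant, neither of which is necessarily the exact merged code:)

```rust
// Sketch only: pull RestartLastVotedForkSlots values out of the CRDS table on demand.
let gossip_crds = self.gossip.crds.read().unwrap();
let slots: Vec<_> = gossip_crds
    .get_entries(cursor) // iterator over entries newer than the cursor (name assumed)
    .filter_map(|entry| match &entry.value.data {
        CrdsData::RestartLastVotedForkSlots(slots) => Some(slots.clone()),
        _ => None,
    })
    .collect();
```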
gossip/src/crds_value.rs (Outdated)
@@ -132,6 +135,12 @@ impl Sanitize for CrdsData {
    }
    CrdsData::SnapshotHashes(val) => val.sanitize(),
    CrdsData::ContactInfo(node) => node.sanitize(),
    CrdsData::RestartLastVotedForkSlots(ix, slots) => {
        if *ix as usize >= MAX_RESTART_LAST_VOTED_FORK_SLOTS as usize {
Why do you need as usize here? Aren't both values of type EpochSlotsIndex?
Removed.
gossip/src/crds_value.rs (Outdated)
@@ -157,6 +166,10 @@ impl CrdsData {
    3 => CrdsData::AccountsHashes(AccountsHashes::new_rand(rng, pubkey)),
    4 => CrdsData::Version(Version::new_rand(rng, pubkey)),
    5 => CrdsData::Vote(rng.gen_range(0..MAX_VOTES), Vote::new_rand(rng, pubkey)),
    6 => CrdsData::RestartLastVotedForkSlots(
        0,
Why is this hardcoded to 0?
Changed.
gossip/src/crds_value.rs (Outdated)
@@ -485,6 +498,69 @@ impl Sanitize for NodeInstance {
    }
}

#[derive(Serialize, Deserialize, Clone, Default, PartialEq, Eq, AbiExample)]
pub struct RestartLastVotedForkSlots {
    pub slots: EpochSlots,
Why are you using EpochSlots here? How are the EpochSlots related to RestartLastVotedForkSlots?
I mainly want to reuse the compress/uncompress logic of EpochSlots; all I need is to add last_vote to EpochSlots.
I think you want to use something like Vec<CompressedSlots>:
https://github.com/solana-labs/solana/blob/5dbc19ccb/gossip/src/epoch_slots.rs#L230
not the entire EpochSlots.
Also, why do we need last_voted_slot separately? Isn't that already the last slot in the slots field?
Because 81k slots may not fit in one Gossip message, we need to split them into multiple RestartLastVotedForkSlots messages; only the last message has last_voted_slot.
If I get one of those RestartLastVotedForkSlots messages before the last one, which says, for example, last_voted_slot == 1000 and slots.last() == 10, how are we going to identify that 10 is not the parent of 1000? We may not ever get the last RestartLastVotedForkSlots for that node.
Right now the protocol only does pure aggregation: when you get RestartLastVotedForkSlots from any staked node, it means that node has voted for all the slots in this message. last_voted_slot, last_voted_hash, and timestamp are used together with pubkey as the unique key, in case some validator sends out a new message and we want to replace its previous votes.
We can tolerate about 5% packet loss; if we have more than 5% packet loss then this restart protocol might not give you a definite answer, but it has at least made an effort to replace some important blocks which >42% of the validators already voted for.
Might need a counter on the last vote to tell people what the expected number of RestartLastVotedForkSlots is, so we don't misinterpret the parent.
We won't misinterpret the parent, because when we traverse to select HeaviestFork we are using the parent information on the BankFork. But it's nice not to miss part of the RestartLastVotedForkSlots, because that might mean you didn't count all of the stakes. I'll add the counter.
gossip/src/crds_value.rs (Outdated)
impl fmt::Debug for RestartLastVotedForkSlots {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "RestartLastVotedForkSlots {{ slots: {:?} last_voted: {}({}) }}",
            self.slots, self.last_voted_slot, self.last_voted_hash
        )
    }
}
Why do we need this? Can't it be just #[derive(Debug)]?
Removed.
gossip/src/crds_value.rs (Outdated)
pub fn new_from_fields(
    slots: EpochSlots,
    last_voted_slot: Slot,
    last_voted_hash: Hash,
) -> Self {
    Self {
        slots,
        last_voted_slot,
        last_voted_hash,
    }
}
If all fields are pub, this is redundant.
Removed.
gossip/src/crds.rs (Outdated)
@@ -1232,7 +1233,7 @@ mod tests {
     let keypair = &keypairs[rng.gen_range(0..keypairs.len())];
     let value = CrdsValue::new_rand(&mut rng, Some(keypair));
     let local_timestamp = new_rand_timestamp(&mut rng);
-    if let Ok(()) = crds.insert(value, local_timestamp, GossipRoute::LocalMessage) {
+    if let Ok(()) = crds.insert(value.clone(), local_timestamp, GossipRoute::LocalMessage) {
Why .clone() here?
Removed.
gossip/src/crds_value.rs (Outdated)
pub slots: CompressedSlotsVec,
pub last_voted_slot: Slot,
With this new struct, do we still need last_voted_slot? Can it be just slots.last()?
Removed.
pub slots: CompressedSlotsVec,
pub last_voted_slot: Slot,
pub last_voted_hash: Hash,
}
I would add shred_version here too. It is only 2 bytes, and we need to filter on matching shred_version.
Done.
gossip/src/epoch_slots.rs (Outdated)
pub struct CompressedSlotsVec {
    slots: Vec<CompressedSlots>,
Can you remove this new type and just use Vec<CompressedSlots> in RestartLastVotedForkSlots? We can refactor common code once the code is finalized and stable. No need to do that in this PR while also modifying EpochSlots.
Changed.
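(Putting the thread's feedback together, the value presumably ends up shaped roughly like this; a sketch of only the fields discussed above, with any other CRDS bookkeeping fields such as from/wallclock omitted:)

```rust
// Sketch, not the merged definition: Vec<CompressedSlots> instead of a full
// EpochSlots or the CompressedSlotsVec wrapper, last_voted_slot dropped in favor
// of slots.last(), and shred_version added so receivers can filter on it.
#[derive(Serialize, Deserialize, Clone, Debug, Default, PartialEq, Eq, AbiExample)]
pub struct RestartLastVotedForkSlots {
    pub slots: Vec<CompressedSlots>,
    pub last_voted_hash: Hash,
    pub shred_version: u16,
}
```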
Problem
Gossip change part 1 of wen_restart.
Summary of Changes
Add the following new Gossip message:
- RestartLastVotedForkSlots
It will only be propagated during wen_restart, and we will use a new shred_version during wen_restart to avoid polluting a normal cluster.