
Add RestartLastVotedForkSlots for wen_restart. #33239

Merged
merged 20 commits into solana-labs:master from wen_restart_gossip_change on Oct 9, 2023

Conversation


@wen-coding wen-coding commented Sep 13, 2023

Problem

Gossip change part 1 of wen_restart.

Summary of Changes

Add the following new Gossip messages:

  • RestartLastVotedForkSlots

It will only be propagated during wen_restart, and we will use a new shred_version during wen_restart to avoid polluting a normal cluster.

@wen-coding wen-coding self-assigned this Sep 13, 2023
@behzadnouri (Contributor)

Can you please add each new crds value in its own separate PR?
(probably the 2nd once the 1st one is merged).

@wen-coding wen-coding changed the title Add RestartLastVotedForkSlots and RestartHeaviestFork for wen_restart. Add RestartLastVotedForkSlots for wen_restart. Sep 13, 2023
@wen-coding (Contributor Author)

> Can you please add each new crds value in its own separate PR? (probably the 2nd once the 1st one is merged).

Sure, reverted RestartHeaviestFork; this PR now only has RestartLastVotedForkSlots.

@behzadnouri (Contributor)

The coverage CI is failing:
https://buildkite.com/solana-labs/solana/builds/101766#018a9572-1d6c-460e-827d-686358eee39f
in this test:

cluster_info::Protocol_frozen_abi::test_abi_digest

You need to update the ABI digest.
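(For context: the digest is the string pinned by the #[frozen_abi(digest = "...")] attribute on the gossip Protocol enum in cluster_info.rs; the failing test prints the newly computed digest, which replaces the pinned string. The exact attribute placement is an assumption based on how frozen-abi is used elsewhere in the repo.)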

Comment on lines 912 to 916
    fn push_epoch_slots_or_restart_last_voted_fork_slots(
        &self,
        mut update: &[Slot],
        last_vote_bankhash: Option<Hash>,
    ) {
Contributor

Why are you mixing this value with EpochSlots?
This function has become more convoluted with several if is_epoch_slot branches.

Contributor Author

Split into two functions now.

Comment on lines 1297 to 1298
let origin = entry.value.pubkey();
gossip_crds.get_shred_version(&origin) == self_shred_version
Contributor

If we need this check, shouldn't shred version be included in the RestartLastVotedForkSlots?
There is no guarantee that the RestartLastVotedForkSlots and the ContactInfo you obtain shred-version from are in sync.

Contributor

@wen-coding do we need to include shred_version in RestartLastVotedForkSlots?

Contributor Author

This code was copied from get_epoch_slots, so I think all other Gossip messages which check whether shred_version matches would have the same problem, since we could always have a lag on ContactInfo shred-version.

So shred_version is just a sanity check, because it is easy to circumvent in a real attack; but it's nice to have for naive users pointing mainnet validators at testnet or vice versa. I could go either way here, but would prefer not to include shred_version in RestartLastVotedForkSlots, for brevity. Using a contact-info shred_version which may occasionally go out of sync is fine; we will drop some messages but can pick them up later, I believe? For RestartLastVotedForkSlots, since we draw a line between Gossip messages in restart and those not in restart, this check is less important.

Contributor

> This code was copied from get_epoch_slots

  • get_epoch_slots may have been done improperly. copying the same mistakes will not help.
  • For EpochSlots we are not expecting a shred-version change. Here there is always a shred-version change when the RestartLastVotedForkSlots is pushed or needed.

> I think all other Gossip messages which check whether shred_version matches would have the same problem

most of the places which check shred-version are looking up a node's sockets, which are obtained from the same contact-info message that carries the shred-version.

> For RestartLastVotedForkSlots since we draw a line between Gossip messages in restart and those not in restart, this check is less important.

I don't understand this, why?
My understanding is that for RestartLastVotedForkSlots it is even more important, because, again, there is always a shred-version change when the RestartLastVotedForkSlots is pushed or needed.

Contributor Author

Okay, maybe it is an overstatement to say this check is less important.

The shred_version change was put in mainly to prevent RestartLastVotedForkSlots messages from occupying CRDS table space in non-restarting validators, because they are big. But later we decided to add an extra check ("normal validators just don't accept RestartLastVotedForkSlots at all, no matter what shred_version it has"), so the shred_version change becomes an extra layer of protection.

Let's say one validator has its contact info propagating slower than RestartLastVotedForkSlots, so some neighbors dropped its RestartLastVotedForkSlots because of shred_version mismatch. I was hoping newly restarted validators will do a Gossip pull for all messages and somehow help spread the lost RestartLastVotedForkSlots messages to others?

If that's not the case then it might be better to add shred_version in the messages themselves.

            })
            .map(|entry| match &entry.value.data {
                CrdsData::RestartLastVotedForkSlots(index, slots, last_vote, last_vote_hash) => {
                    (*index, slots.clone(), *last_vote, *last_vote_hash)
Contributor

A tuple with 4 fields is not ideal.
I don't think you would need to expose EpochSlotsIndex outside of gossip code.
You probably also need to define a struct for RestartLastVotedForkSlots.

Contributor Author

Changed.

@@ -38,6 +38,8 @@ pub const MAX_VOTES: VoteIndex = 32;

pub type EpochSlotsIndex = u8;
pub const MAX_EPOCH_SLOTS: EpochSlotsIndex = 255;
// We now keep 81000 slots, 81000/MAX_SLOTS_PER_ENTRY = 5.
pub const MAX_RESTART_LAST_VOTED_FORK_SLOTS: EpochSlotsIndex = 5;
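(For reference on the constant: each EpochSlots-style entry covers MAX_SLOTS_PER_ENTRY = 16,384 slots, as noted later in this thread, so 81,000 / 16,384 ≈ 4.94, rounded up to 5 entries.)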
Contributor

does this have to be pub?

Contributor Author

Changed to pub(crate).

Comment on lines 169 to 177
6 => CrdsData::RestartLastVotedForkSlots(
    rng.gen_range(0..MAX_EPOCH_SLOTS),
    EpochSlots::new_rand(rng, pubkey),
    rng.gen_range(0..512),
    Hash::new_unique(),
),
_ => CrdsData::EpochSlots(
    rng.gen_range(0..MAX_RESTART_LAST_VOTED_FORK_SLOTS),
    EpochSlots::new_rand(rng, pubkey),
Contributor

This looks wrong.
You are using MAX_EPOCH_SLOTS for RestartLastVotedForkSlots
and MAX_RESTART_LAST_VOTED_FORK_SLOTS for EpochSlots.

Contributor Author

Oops, good catch, thanks for pointing that out.

@@ -38,6 +38,8 @@ pub const MAX_VOTES: VoteIndex = 32;

pub type EpochSlotsIndex = u8;
pub const MAX_EPOCH_SLOTS: EpochSlotsIndex = 255;
// We now keep 81000 slots, 81000/MAX_SLOTS_PER_ENTRY = 5.
Contributor

Where is this 81000 coming from?
These CRDS values aren't really cheap. Do we really need this many?

Contributor Author

The number comes from solana-foundation/solana-improvement-documents#46.

During an outage we don't know each other's status, so there is no way to know how lagged others are. So we send out 9 hours of slots, hoping that we can find out about an outage within 9 hours.
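(The 81,000 figure is consistent with the nominal 400 ms slot time: 9 hours ≈ 32,400 seconds, and 32,400 s / 0.4 s per slot = 81,000 slots.)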

The values are big, but we thought it was okay because these messages are only transmitted during an outage. Later, after I add the command-line --wen_restart flag, I can come back and filter out these CRDS values during non-restart mode if that makes you feel safer.

Contributor

81_000 slots is ~5MB of data per node in the cluster!!!

Also, if a node is 81_000 slots behind other nodes, I am not sure it can repair all those slots, so what is the point?
I think we really need to revise this number lower.

Contributor Author

As I understand it, 9 hours was chosen because in the past our outage resolution normally took 4~7 hours just for the validator restart part, and we added a few hours to be on the cautious side. Not every validator operator has a pager, but most of them will tend to things within 9 hours.

Sending 81k slots on the Gossip messages doesn't mean you need to repair all those slots. Most of them are probably older than your local root, so you will instantly dump them on the floor.

The tricky part here is that you don't know what the status of the others is, so you really want to send more than they probably need to make the restart successful.

@behzadnouri (Contributor) Sep 19, 2023

> Sending 81k slots on the Gossip messages doesn't mean you need to repair all those slots. Most of them are probably older than your local root, so you will instantly dump them on the floor.

There are 2 different scenarios:

  • If a node has some of the intermediate slots, then it can already repair missing slots and complete the chain. No need to send 81k slots to it.
  • If a node does not have any of the intermediate slots, do we really expect it to be able to repair 81k slots?

Even before all these, why do we need to send the parent slots to begin with?
We are already able to repair a fork if we are missing some of the parent slots in the fork by sending repair orphan request:
https://github.com/solana-labs/solana/blob/bc2b37276/core/src/repair/serve_repair.rs#L240-L243
Why do we need to duplicate that protocol here? The node can already recover the whole fork by repairing off the last slot.

Contributor Author

By the way, I think 81k slots is less than 5MB, because it's a bitmap. I assume the current MAX_SLOTS_PER_ENTRY is chosen so each EpochSlots piece fits within a 1460-byte UDP packet, and MAX_RESTART_LAST_VOTED_FORK_SLOTS is 5, which is ~7KB per validator? Even if it were a 64K UDP packet, that's still smaller than 5MB per validator.
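(Rough numbers supporting this: a raw bitmap over 81,000 slots is about 81,000 / 8 ≈ 10 KB uncompressed, and 5 packet-sized entries at ~1,460 bytes each come to roughly 7.3 KB per validator, both far below 5 MB. The per-packet size here is the MTU assumption stated above.)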

How much memory do the current EpochSlots messages occupy? Considering MAX_EPOCH_SLOTS is 255, this should be much smaller than that.

Is my calculation off?

Also, if the whole cluster is stuck, I think operators will be okay with consuming some bandwidth to quickly bring the cluster back up. If sending 81k slot information means they actually don't need to repair all the slots, maybe we would even save more on the repair bandwidth.

I think the "repair a batch at a time backwards" method you mentioned sounds interesting, but you never know what random validators can throw at you; a few validators throwing you slots far into the future can keep you busy at repair for quite some time.

Contributor

@behzad assume the worst, turbine / repair is congested (similar to last outage) this protocol will always find the OC restart slot, which is prone to human error

Contributor

@AshwinSekar gossip has a lot more load than turbine and repair combined. I don't think we can solve congestion on one protocol by adding more load to the one already more overloaded.

@wen-coding When this was initially discussed we were talking about only 2 new crds values. This commit itself is adding 5 new, pretty big values, each of which fills up a packet.
It also has some overlap with already existing repair orphan request and I am not convinced this duplication will buy us anything in practice.
The design where you cannot tell that slots.last() is not the parent of last_voted_slot, and where you rely on receiving all the crds values, is also not ideal, because gossip does not provide such guarantees.
Can you revisit the design to address these issues?

Contributor Author

I thought about it more, and I think you had a point that we probably do not need to send 81k slots in most cases. Reading previous outage reports, validators around the world normally entered the outage at about the same time, so the difference between their roots is normally much smaller than the duration of the outage.

That said, I still want to play it safe and send enough slots to rescue the occasional outliers, so I propose lowering the original 81k number to MAX_SLOTS_PER_ENTRY, which is 16384. This lowers the number of packets per validator from 5 to 1, and the code handling the packet doesn't need to deal with holes in between any more.
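(For scale, under the same nominal 400 ms slot time used for the 81k estimate, 16,384 slots corresponds to roughly 16,384 × 0.4 s ≈ 1.8 hours of slots.)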

@carllin @mvines @t-nelson Are you okay with lowering 81k slots to 16k slots given the reasoning above?

@behzadnouri are you okay with it given we are only adding 1 RestartLastVotedForkSlots per validator now?

regarding "some overlap with already existing repair orphan request", I think:

  1. If we are in an outage, apparently that's because repair orphan request didn't save us
  2. Repair orphan request can figure out the parent relationship between blocks and repair missing blocks on the fork, but it's not a global view, so it can't tell us whether some blocks can be ignored. To quickly reach consensus between validators you do need a global view, which is something the current repair service can't give you; you need some sort of global consensus to make validators agree on something again.

I'd be happy to organize a quick meeting if we can't reach agreement over github reviews.

Contributor

I think that's ok; with only one such crds value the parent is also inferred, which should address @behzadnouri's concern.

@codecov

codecov bot commented Sep 15, 2023

Codecov Report

Merging #33239 (866228b) into master (1d91b60) will increase coverage by 0.0%.
Report is 8 commits behind head on master.
The diff coverage is 93.2%.

@@           Coverage Diff            @@
##           master   #33239    +/-   ##
========================================
  Coverage    81.7%    81.7%            
========================================
  Files         807      807            
  Lines      218252   218366   +114     
========================================
+ Hits       178438   178618   +180     
+ Misses      39814    39748    -66     

@behzadnouri (Contributor) left a comment

This code is several separate changes:

  1. Defining the new RestartLastVotedForkSlots struct and adding it to gossip CRDS table.
  2. An internal CRDS index for retrieval, which I believe is unnecessary.
  3. The api for pushing and retrieving the values in cluster_info.rs.

It would save everyone's time and reduce the back and forth if we just do one at a time and review smaller changes.

@@ -169,6 +171,7 @@ impl Default for Crds {
            votes: BTreeMap::default(),
            epoch_slots: BTreeMap::default(),
            duplicate_shreds: BTreeMap::default(),
            restart_last_voted_fork_slots: BTreeMap::default(),
Contributor

If these values are only going to matter during restart, then we don't need the overhead of this index.
You can just use this function:
https://github.com/solana-labs/solana/blob/6db57f81d/gossip/src/crds.rs#L380-L390
and filter on RestartLastVotedForkSlots.
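A rough sketch of what index-free retrieval could look like, using the iterator from the linked crds.rs; the get_entries name, the Cursor argument, and the single-field RestartLastVotedForkSlots variant (see the struct discussion below) are illustrative assumptions, not the merged code:

    // Illustrative sketch only: walk CRDS entries newer than `cursor` and keep
    // just the restart values, instead of maintaining a dedicated index for them.
    let restart_slots: Vec<RestartLastVotedForkSlots> = gossip_crds
        .get_entries(cursor)
        .filter_map(|entry| match &entry.value.data {
            CrdsData::RestartLastVotedForkSlots(slots) => Some(slots.clone()),
            _ => None,
        })
        .collect();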

Contributor Author

That's fair, removed now.

@@ -132,6 +135,12 @@ impl Sanitize for CrdsData {
            }
            CrdsData::SnapshotHashes(val) => val.sanitize(),
            CrdsData::ContactInfo(node) => node.sanitize(),
            CrdsData::RestartLastVotedForkSlots(ix, slots) => {
                if *ix as usize >= MAX_RESTART_LAST_VOTED_FORK_SLOTS as usize {
Contributor

Why do you need as usize here?
Aren't both values of type EpochSlotsIndex?

Contributor Author

Removed.

@@ -157,6 +166,10 @@ impl CrdsData {
            3 => CrdsData::AccountsHashes(AccountsHashes::new_rand(rng, pubkey)),
            4 => CrdsData::Version(Version::new_rand(rng, pubkey)),
            5 => CrdsData::Vote(rng.gen_range(0..MAX_VOTES), Vote::new_rand(rng, pubkey)),
            6 => CrdsData::RestartLastVotedForkSlots(
                0,
Contributor

Why is this hardcoded to 0?

Contributor Author

Changed.

@@ -485,6 +498,69 @@ impl Sanitize for NodeInstance {
    }
}

#[derive(Serialize, Deserialize, Clone, Default, PartialEq, Eq, AbiExample)]
pub struct RestartLastVotedForkSlots {
    pub slots: EpochSlots,
Contributor

Why are you using EpochSlots here?
How are the EpochSlots related to RestartLastVotedForkSlots?

Contributor Author

I mainly want to reuse the compress/uncompress logic of EpochSlots; all I need is to add last_vote to EpochSlots.

Contributor

I think you want to use something like Vec<CompressedSlots>:
https://github.com/solana-labs/solana/blob/5dbc19ccb/gossip/src/epoch_slots.rs#L230
not the entire EpochSlots.
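For reference, EpochSlots at the time bundled more than just the compressed slot ranges, which is why reusing only the inner Vec<CompressedSlots> is the lighter option. The field list below is an approximation of that era's definition, not a verified copy:

    // Approximate shape of EpochSlots in gossip/src/epoch_slots.rs (illustrative):
    pub struct EpochSlots {
        pub from: Pubkey,                // origin pubkey
        pub slots: Vec<CompressedSlots>, // the bitmap-compressed slot ranges needed here
        pub wallclock: u64,              // wallclock timestamp
    }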

Contributor

Also why do we need last_voted_slot separately?
Isn't that already the last slot in the slots field?

Contributor Author

Because one Gossip message may not fit 81k slots, we need to split it into multiple RestartLastVotedForkSlots messages; only the last message has last_voted_slot.

Contributor

If I get one of those RestartLastVotedForkSlots messages before the last one which says, for example:

last_voted_slot == 1000
slots.last() == 10

how are we going to identify that 10 is not the parent of 1000? We may not ever get the last RestartLastVotedForkSlots for that node.

Contributor Author

Right now the protocol only does pure aggregation: when you get RestartLastVotedForkSlots from any staked node, it means that node has voted for all the slots in this message. last_voted_slot, last_voted_hash, and timestamp are used together with pubkey as the unique key, in case some validator sends out a new message and we want to replace its previous votes.

We can tolerate about 5% packet loss; if we have more than 5% packet loss then this restart protocol might not give you a definite answer, but it has at least made an effort to replace some important blocks which >42% of the validators already voted for.

@carllin (Contributor) Sep 29, 2023

Might need a counter on the last vote to tell people what the expected number of RestartLastVotedForkSlots is so we don't misinterpret the parent

Contributor Author

We won't misinterpret the parent, because when we traverse to select HeaviestFork we are using the parent information on the BankFork. But it's nice not to miss part of the RestartLastVotedForkSlots, because that might mean you didn't count all of the stakes. I'll add the counter.

Comment on lines 514 to 522
impl fmt::Debug for RestartLastVotedForkSlots {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "RestartLastVotedForkSlots {{ slots: {:?} last_voted: {}({}) }}",
            self.slots, self.last_voted_slot, self.last_voted_hash
        )
    }
}
Contributor

Why do we need this?
Can't it be just #[derive(Debug)]?

Contributor Author

Removed.

Comment on lines 537 to 547
pub fn new_from_fields(
    slots: EpochSlots,
    last_voted_slot: Slot,
    last_voted_hash: Hash,
) -> Self {
    Self {
        slots,
        last_voted_slot,
        last_voted_hash,
    }
}
Contributor

If all fields are pub this is redundant.

Contributor Author

Removed.

@@ -1232,7 +1233,7 @@ mod tests {
        let keypair = &keypairs[rng.gen_range(0..keypairs.len())];
        let value = CrdsValue::new_rand(&mut rng, Some(keypair));
        let local_timestamp = new_rand_timestamp(&mut rng);
-       if let Ok(()) = crds.insert(value, local_timestamp, GossipRoute::LocalMessage) {
+       if let Ok(()) = crds.insert(value.clone(), local_timestamp, GossipRoute::LocalMessage) {
Contributor

Why .clone() here?

Contributor Author

Removed.

Comment on lines 497 to 498
pub slots: CompressedSlotsVec,
pub last_voted_slot: Slot,
Contributor

With this new struct, do we still need last_voted_slot?
Can it be just slots.last()?

Contributor Author

Removed.

pub slots: CompressedSlotsVec,
pub last_voted_slot: Slot,
pub last_voted_hash: Hash,
}
Contributor

I would add shred_version here too.
It is only 2 bytes, and we need to filter on matching shred_version.

Contributor Author

Done.

Comment on lines 229 to 230
pub struct CompressedSlotsVec {
    slots: Vec<CompressedSlots>,
Contributor

Can you remove this new type and just use Vec<CompressedSlots> in RestartLastVotedForkSlots?

We can refactor common code once the code is finalized and stable.
No need to do that in this PR while also modifying EpochSlots.

Contributor Author

Changed.
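Putting the review feedback together (Vec<CompressedSlots> instead of the wrapper type, last_voted_slot dropped in favor of slots.last(), shred_version added), the value ends up shaped roughly as below; the exact field set is an assumption, not the merged definition:

    // Rough sketch of RestartLastVotedForkSlots after this round of review (illustrative only).
    pub struct RestartLastVotedForkSlots {
        pub slots: Vec<CompressedSlots>, // bitmap-compressed slots on the last voted fork
        pub last_voted_hash: Hash,       // bank hash of the last voted slot, i.e. slots.last()
        pub shred_version: u16,          // receivers drop values whose shred_version does not match their own
    }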

@carllin carllin mentioned this pull request Oct 6, 2023
behzadnouri previously approved these changes Oct 9, 2023
@wen-coding wen-coding merged commit 0a38108 into solana-labs:master Oct 9, 2023
16 checks passed
@wen-coding wen-coding deleted the wen_restart_gossip_change branch October 9, 2023 22:08