
Add RestartLastVotedForkSlots for wen_restart. #33239

Merged
merged 20 commits into solana-labs:master from wen_restart_gossip_change on Oct 9, 2023

Conversation


@wen-coding wen-coding commented Sep 13, 2023

Problem

Gossip change part 1 of wen_restart.

Summary of Changes

Add the following new Gossip messages:

  • RestartLastVotedForkSlots

It will only be propagated during wen_restart, and we will use a new shred_version during wen_restart to avoid polluting a normal cluster.

@wen-coding wen-coding self-assigned this Sep 13, 2023
@behzadnouri (Contributor)

Can you please add each new crds value in its own separate PR?
(probably the 2nd once the 1st one is merged).

@wen-coding wen-coding changed the title Add RestartLastVotedForkSlots and RestartHeaviestFork for wen_restart. Add RestartLastVotedForkSlots for wen_restart. Sep 13, 2023
@wen-coding (Contributor Author)

> Can you please add each new crds value in its own separate PR? (probably the 2nd once the 1st one is merged).

Sure, reverted RestartHeaviestFork; this PR now only has RestartLastVotedForkSlots.

@behzadnouri (Contributor)

The coverage CI is failing:
https://buildkite.com/solana-labs/solana/builds/101766#018a9572-1d6c-460e-827d-686358eee39f
in this test:

cluster_info::Protocol_frozen_abi::test_abi_digest

You need to update the ABI digest.
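(For context: the digest is the string pinned by the #[frozen_abi(digest = "...")] attribute on the gossip Protocol enum in cluster_info.rs; the failing test prints the newly computed digest, which replaces the pinned string. The exact attribute placement is an assumption based on how frozen-abi is used elsewhere in the repo.)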

Comment on lines 912 to 916
    fn push_epoch_slots_or_restart_last_voted_fork_slots(
        &self,
        mut update: &[Slot],
        last_vote_bankhash: Option<Hash>,
    ) {
Contributor

Why are you mixing this value with EpochSlots?
This function has become more convoluted with several if is_epoch_slot branches.

Contributor Author

Split into two functions now.

Comment on lines 1297 to 1298
let origin = entry.value.pubkey();
gossip_crds.get_shred_version(&origin) == self_shred_version
Contributor

If we need this check, shouldn't shred version be included in the RestartLastVotedForkSlots?
There is no guarantee that the RestartLastVotedForkSlots and the ContactInfo you obtain shred-version from are in sync.

Contributor

@wen-coding do we need to include shred_version in RestartLastVotedForkSlots?

Contributor Author

This code was copied from get_epoch_slots, so I think all other Gossip messages which check whether shred_version matches would have the same problem, since we could always have a lag on ContactInfo shred-version.

So shred_version is just a sanity check, because it is easy to circumvent in a real attack; but it's nice to have for naive users pointing mainnet validators at testnet or vice versa. I could go either way here, but would prefer not to include shred_version in RestartLastVotedForkSlots, for brevity. Using a contact-info shred_version which may occasionally go out of sync is fine; we will drop some messages but can pick them up later, I believe? For RestartLastVotedForkSlots, since we draw a line between Gossip messages in restart and those not in restart, this check is less important.

Contributor

> This code was copied from get_epoch_slots

  • get_epoch_slots may have been done improperly. copying the same mistakes will not help.
  • For EpochSlots we are not expecting a shred-version change. Here there is always a shred-version change when the RestartLastVotedForkSlots is pushed or needed.

> I think all other Gossip messages which check whether shred_version matches would have the same problem

most of the places which check shred-version are looking up a node's sockets, which are obtained from the same contact-info message that carries the shred-version.

> For RestartLastVotedForkSlots since we draw a line between Gossip messages in restart and those not in restart, this check is less important.

I don't understand this, why?
My understanding is that for RestartLastVotedForkSlots it is even more important, because, again, there is always a shred-version change when the RestartLastVotedForkSlots is pushed or needed.

Contributor Author

Okay, maybe it is an overstatement to say this check is less important.

The shred_version change was put in mainly to prevent RestartLastVotedForkSlots messages from occupying CRDS table space in non-restarting validators, because they are big. But later we decided to add an extra check ("normal validators just don't accept RestartLastVotedForkSlots at all, no matter what shred_version it has"), so the shred_version change becomes an extra layer of protection.

Let's say one validator has its contact info propagating slower than RestartLastVotedForkSlots, so some neighbors dropped its RestartLastVotedForkSlots because of shred_version mismatch. I was hoping newly restarted validators will do a Gossip pull for all messages and somehow help spread the lost RestartLastVotedForkSlots messages to others?

If that's not the case then it might be better to add shred_version in the messages themselves.

            })
            .map(|entry| match &entry.value.data {
                CrdsData::RestartLastVotedForkSlots(index, slots, last_vote, last_vote_hash) => {
                    (*index, slots.clone(), *last_vote, *last_vote_hash)
Contributor

A tuple with 4 fields is not ideal.
I don't think you would need to expose EpochSlotsIndex outside of gossip code.
You probably also need to define a struct for RestartLastVotedForkSlots.

Contributor Author

Changed.

@@ -38,6 +38,8 @@ pub const MAX_VOTES: VoteIndex = 32;

pub type EpochSlotsIndex = u8;
pub const MAX_EPOCH_SLOTS: EpochSlotsIndex = 255;
// We now keep 81000 slots, 81000/MAX_SLOTS_PER_ENTRY = 5.
pub const MAX_RESTART_LAST_VOTED_FORK_SLOTS: EpochSlotsIndex = 5;
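(For reference on the constant: each EpochSlots-style entry covers MAX_SLOTS_PER_ENTRY = 16,384 slots, as noted later in this thread, so 81,000 / 16,384 ≈ 4.94, rounded up to 5 entries.)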
Contributor

does this have to be pub?

Contributor Author

Changed to pub(crate).

Comment on lines 169 to 177
6 => CrdsData::RestartLastVotedForkSlots(
    rng.gen_range(0..MAX_EPOCH_SLOTS),
    EpochSlots::new_rand(rng, pubkey),
    rng.gen_range(0..512),
    Hash::new_unique(),
),
_ => CrdsData::EpochSlots(
    rng.gen_range(0..MAX_RESTART_LAST_VOTED_FORK_SLOTS),
    EpochSlots::new_rand(rng, pubkey),
Contributor

This looks wrong.
You are using MAX_EPOCH_SLOTS for RestartLastVotedForkSlots
and MAX_RESTART_LAST_VOTED_FORK_SLOTS for EpochSlots.

Contributor Author

Oops, good catch, thanks for pointing that out.

@@ -38,6 +38,8 @@ pub const MAX_VOTES: VoteIndex = 32;

pub type EpochSlotsIndex = u8;
pub const MAX_EPOCH_SLOTS: EpochSlotsIndex = 255;
// We now keep 81000 slots, 81000/MAX_SLOTS_PER_ENTRY = 5.
Contributor

Where is this 81000 coming from?
These CRDS values aren't really cheap. Do we really need this many?

Contributor Author

The number comes from solana-foundation/solana-improvement-documents#46.

During an outage we don't know each other's status, so there is no way to know how lagged others are. So we send out 9 hours of slots, hoping that we can find out about an outage within 9 hours.
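(The 81,000 figure is consistent with the nominal 400 ms slot time: 9 hours ≈ 32,400 seconds, and 32,400 s / 0.4 s per slot = 81,000 slots.)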

The values are big, but we thought it was okay because these messages are only transmitted during an outage. Later, after I add the command-line --wen_restart flag, I can come back and filter out these CRDS values during non-restart mode if that makes you feel safer.

Contributor

81_000 slots is ~5MB of data per node in the cluster!!!

Also, if a node is 81_000 slots behind other nodes, I am not sure it can repair all those slots, so what is the point?
I think we really need to revise this number lower.

Contributor Author

As I understand it, 9 hours was chosen because in the past our outage resolution normally took 4~7 hours just for the validator restart part, and we added a few hours to be on the cautious side. Not every validator operator has a pager, but most of them will tend to things within 9 hours.

Sending 81k slots on the Gossip messages doesn't mean you need to repair all those slots. Most of them are probably older than your local root, so you will instantly dump them on the floor.

The tricky part here is that you don't know what the status of the others is, so you really want to send more than they probably need to make the restart successful.

@behzadnouri (Contributor) Sep 19, 2023

> Sending 81k slots on the Gossip messages doesn't mean you need to repair all those slots. Most of them are probably older than your local root, so you will instantly dump them on the floor.

There are 2 different scenarios:

  • If a node has some of the intermediate slots, then it can already repair missing slots and complete the chain. No need to send 81k slots to it.
  • If a node does not have any of the intermediate slots, do we really expect it to be able to repair 81k slots?

Even before all these, why do we need to send the parent slots to begin with?
We are already able to repair a fork if we are missing some of the parent slots in the fork by sending repair orphan request:
https://github.com/solana-labs/solana/blob/bc2b37276/core/src/repair/serve_repair.rs#L240-L243
Why do we need to duplicate that protocol here? The node can already recover the whole fork by repairing off the last slot.

Contributor Author

By the way, I think 81k slots is less than 5MB, because it's a bitmap. I assume the current MAX_SLOTS_PER_ENTRY is chosen so each EpochSlots piece fits within a 1460-byte UDP packet, and MAX_RESTART_LAST_VOTED_FORK_SLOTS is 5, which is ~7KB per validator? Even if it were a 64K UDP packet, that's still smaller than 5MB per validator.
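(Rough numbers supporting this: a raw bitmap over 81,000 slots is about 81,000 / 8 ≈ 10 KB uncompressed, and 5 packet-sized entries at ~1,460 bytes each come to roughly 7.3 KB per validator, both far below 5 MB. The per-packet size here is the MTU assumption stated above.)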

How much memory do the current EpochSlots messages occupy? Considering MAX_EPOCH_SLOTS is 255, this should be much smaller than that.

Is my calculation off?

Also, if the whole cluster is stuck, I think operators will be okay with consuming some bandwidth to quickly bring the cluster back up. If sending 81k slot information means they actually don't need to repair all the slots, maybe we would even save more on the repair bandwidth.

I think the "repair a batch at a time backwards" method you mentioned sounds interesting, but you never know what random validators can throw at you; a few validators throwing you slots far into the future can keep you busy at repair for quite some time.

Contributor

@behzad assume the worst, turbine / repair is congested (similar to last outage) this protocol will always find the OC restart slot, which is prone to human error

Contributor

@AshwinSekar gossip has a lot more load than turbine and repair combined. I don't think we can solve congestion on one protocol by adding more load to the one already more overloaded.

@wen-coding When this was initially discussed we were talking about only 2 new crds values. This commit itself is adding 5 new, pretty big values, each of which fills up a packet.
It also has some overlap with already existing repair orphan request and I am not convinced this duplication will buy us anything in practice.
The design where you cannot tell that slots.last() is not the parent of last_voted_slot, and where you rely on receiving all the crds values, is also not ideal, because gossip does not provide such guarantees.
Can you revisit the design to address these issues?

Contributor Author

I thought about it more, and I think you had a point that we probably do not need to send 81k slots in most cases. Reading previous outage reports, validators around the world normally entered the outage at about the same time, so the difference between their roots is normally much smaller than the duration of the outage.

That said, I still want to play it safe and send enough slots to rescue the occasional outliers, so I propose lowering the original 81k number to MAX_SLOTS_PER_ENTRY, which is 16384. This lowers the number of packets per validator from 5 to 1, and the code handling the packet doesn't need to deal with holes in between any more.
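(For scale, under the same nominal 400 ms slot time used for the 81k estimate, 16,384 slots corresponds to roughly 16,384 × 0.4 s ≈ 1.8 hours of slots.)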

@carllin @mvines @t-nelson Are you okay with lowering 81k slots to 16k slots given the reasoning above?

@behzadnouri are you okay with it given we are only adding 1 RestartLastVotedForkSlots per validator now?

regarding "some overlap with already existing repair orphan request", I think:

  1. If we are in an outage, apparently that's because repair orphan request didn't save us
  2. Repair orphan request can figure out the parent relationship between blocks and repair missing blocks on the fork, but it's not a global view, so it can't tell us whether some blocks can be ignored. To quickly reach consensus between validators you do need a global view, which is something the current repair service can't give you; you need some sort of global consensus to make validators agree on something again.

I'd be happy to organize a quick meeting if we can't reach agreement over github reviews.

Contributor

I think that's ok; with only one such crds value the parent is also inferred, which should address @behzadnouri's concern.

@codecov

codecov bot commented Sep 15, 2023

Codecov Report

Merging #33239 (866228b) into master (1d91b60) will increase coverage by 0.0%.
Report is 8 commits behind head on master.
The diff coverage is 93.2%.

@@           Coverage Diff            @@
##           master   #33239    +/-   ##
========================================
  Coverage    81.7%    81.7%            
========================================
  Files         807      807            
  Lines      218252   218366   +114     
========================================
+ Hits       178438   178618   +180     
+ Misses      39814    39748    -66     

@behzadnouri (Contributor) left a comment

This code is several separate changes:

  1. Defining the new RestartLastVotedForkSlots struct and adding it to gossip CRDS table.
  2. An internal CRDS index for retrieval, which I believe is unnecessary.
  3. The api for pushing and retrieving the values in cluster_info.rs.

It would save everyone's time and reduce the back and forth if we just do one at a time and review smaller changes.

@@ -169,6 +171,7 @@ impl Default for Crds {
            votes: BTreeMap::default(),
            epoch_slots: BTreeMap::default(),
            duplicate_shreds: BTreeMap::default(),
            restart_last_voted_fork_slots: BTreeMap::default(),
Contributor

If these values are only going to matter during restart, then we don't need the overhead of this index.
You can just use this function:
https://github.com/solana-labs/solana/blob/6db57f81d/gossip/src/crds.rs#L380-L390
and filter on RestartLastVotedForkSlots.
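A rough sketch of what index-free retrieval could look like, using the iterator from the linked crds.rs; the get_entries name, the Cursor argument, and the single-field RestartLastVotedForkSlots variant (see the struct discussion below) are illustrative assumptions, not the merged code:

    // Illustrative sketch only: walk CRDS entries newer than `cursor` and keep
    // just the restart values, instead of maintaining a dedicated index for them.
    let restart_slots: Vec<RestartLastVotedForkSlots> = gossip_crds
        .get_entries(cursor)
        .filter_map(|entry| match &entry.value.data {
            CrdsData::RestartLastVotedForkSlots(slots) => Some(slots.clone()),
            _ => None,
        })
        .collect();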

Contributor Author

That's fair, removed now.

@@ -132,6 +135,12 @@ impl Sanitize for CrdsData {
            }
            CrdsData::SnapshotHashes(val) => val.sanitize(),
            CrdsData::ContactInfo(node) => node.sanitize(),
            CrdsData::RestartLastVotedForkSlots(ix, slots) => {
                if *ix as usize >= MAX_RESTART_LAST_VOTED_FORK_SLOTS as usize {
Contributor

Why do you need as usize here?
Aren't both values of type EpochSlotsIndex?

Contributor Author

Removed.

@@ -157,6 +166,10 @@ impl CrdsData {
            3 => CrdsData::AccountsHashes(AccountsHashes::new_rand(rng, pubkey)),
            4 => CrdsData::Version(Version::new_rand(rng, pubkey)),
            5 => CrdsData::Vote(rng.gen_range(0..MAX_VOTES), Vote::new_rand(rng, pubkey)),
            6 => CrdsData::RestartLastVotedForkSlots(
                0,
Contributor

Why is this hardcoded to 0?

Contributor Author

Changed.

@@ -485,6 +498,69 @@ impl Sanitize for NodeInstance {
    }
}

#[derive(Serialize, Deserialize, Clone, Default, PartialEq, Eq, AbiExample)]
pub struct RestartLastVotedForkSlots {
    pub slots: EpochSlots,
Contributor

Why are you using EpochSlots here?
How are the EpochSlots related to RestartLastVotedForkSlots?

Contributor Author

I mainly want to reuse the compress/uncompress logic of EpochSlots; all I need is to add last_vote to EpochSlots.

Contributor

I think you want to use something like Vec<CompressedSlots>:
https://github.com/solana-labs/solana/blob/5dbc19ccb/gossip/src/epoch_slots.rs#L230
not the entire EpochSlots.
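For reference, EpochSlots at the time bundled more than just the compressed slot ranges, which is why reusing only the inner Vec<CompressedSlots> is the lighter option. The field list below is an approximation of that era's definition, not a verified copy:

    // Approximate shape of EpochSlots in gossip/src/epoch_slots.rs (illustrative):
    pub struct EpochSlots {
        pub from: Pubkey,                // origin pubkey
        pub slots: Vec<CompressedSlots>, // the bitmap-compressed slot ranges needed here
        pub wallclock: u64,              // wallclock timestamp
    }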

Contributor

Also why do we need last_voted_slot separately?
Isn't that already the last slot in the slots field?

Contributor Author

Because one Gossip message may not fit 81k slots, we need to split it into multiple RestartLastVotedForkSlots messages; only the last message has last_voted_slot.

Contributor

If I get one of those RestartLastVotedForkSlots messages before the last one which says, for example:

last_voted_slot == 1000
slots.last() == 10

how are we going to identify that 10 is not the parent of 1000? We may not ever get the last RestartLastVotedForkSlots for that node.

Contributor Author

Right now the protocol only does pure aggregation: when you get RestartLastVotedForkSlots from any staked node, it means that node has voted for all the slots in this message. last_voted_slot, last_voted_hash, and timestamp are used together with pubkey as the unique key, in case some validator sends out a new message and we want to replace its previous votes.

We can tolerate about 5% packet loss; if we have more than 5% packet loss then this restart protocol might not give you a definite answer, but it has at least made an effort to replace some important blocks which >42% of the validators already voted for.

@carllin (Contributor) Sep 29, 2023

Might need a counter on the last vote to tell people what the expected number of RestartLastVotedForkSlots is so we don't misinterpret the parent

Contributor Author

We won't misinterpret the parent, because when we traverse to select HeaviestFork we are using the parent information on the BankFork. But it's nice not to miss part of the RestartLastVotedForkSlots, because that might mean you didn't count all of the stakes. I'll add the counter.

Comment on lines 514 to 522
impl fmt::Debug for RestartLastVotedForkSlots {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        write!(
            f,
            "RestartLastVotedForkSlots {{ slots: {:?} last_voted: {}({}) }}",
            self.slots, self.last_voted_slot, self.last_voted_hash
        )
    }
}
Contributor

Why do we need this?
Can't it be just #[derive(Debug)]?

Contributor Author

Removed.

Comment on lines 537 to 547
pub fn new_from_fields(
    slots: EpochSlots,
    last_voted_slot: Slot,
    last_voted_hash: Hash,
) -> Self {
    Self {
        slots,
        last_voted_slot,
        last_voted_hash,
    }
}
Contributor

If all fields are pub this is redundant.

Contributor Author

Removed.

@@ -1232,7 +1233,7 @@ mod tests {
        let keypair = &keypairs[rng.gen_range(0..keypairs.len())];
        let value = CrdsValue::new_rand(&mut rng, Some(keypair));
        let local_timestamp = new_rand_timestamp(&mut rng);
-       if let Ok(()) = crds.insert(value, local_timestamp, GossipRoute::LocalMessage) {
+       if let Ok(()) = crds.insert(value.clone(), local_timestamp, GossipRoute::LocalMessage) {
Contributor

Why .clone() here?

Contributor Author

Removed.

Comment on lines 497 to 498
pub slots: CompressedSlotsVec,
pub last_voted_slot: Slot,
Contributor

With this new struct, do we still need last_voted_slot?
Can it be just slots.last()?

Contributor Author

Removed.

pub slots: CompressedSlotsVec,
pub last_voted_slot: Slot,
pub last_voted_hash: Hash,
}
Contributor

I would add shred_version here too.
It is only 2 bytes, and we need to filter on matching shred_version.

Contributor Author

Done.

Comment on lines 229 to 230
pub struct CompressedSlotsVec {
    slots: Vec<CompressedSlots>,
Contributor

Can you remove this new type and just use Vec<CompressedSlots> in RestartLastVotedForkSlots?

We can refactor common code once the code is finalized and stable.
No need to do that in this PR while also modifying EpochSlots.

Contributor Author

Changed.
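Putting the review feedback together (Vec<CompressedSlots> instead of the wrapper type, last_voted_slot dropped in favor of slots.last(), shred_version added), the value ends up shaped roughly as below; the exact field set is an assumption, not the merged definition:

    // Rough sketch of RestartLastVotedForkSlots after this round of review (illustrative only).
    pub struct RestartLastVotedForkSlots {
        pub slots: Vec<CompressedSlots>, // bitmap-compressed slots on the last voted fork
        pub last_voted_hash: Hash,       // bank hash of the last voted slot, i.e. slots.last()
        pub shred_version: u16,          // receivers drop values whose shred_version does not match their own
    }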

@carllin carllin mentioned this pull request Oct 6, 2023
behzadnouri previously approved these changes Oct 9, 2023
@wen-coding wen-coding merged commit 0a38108 into solana-labs:master Oct 9, 2023
16 checks passed
@wen-coding wen-coding deleted the wen_restart_gossip_change branch October 9, 2023 22:08