[CATCHUP] Repro Restart Bug + Fix #3686
Conversation
Just one question
```rust
HotShotEvent::TimeoutVoteSend(vote) => {
    *maybe_action = Some(HotShotAction::Vote);
    Some((
        vote.signing_key(),
```
I checked over the `update_action` logic and this looks good 👍
```diff
@@ -217,7 +228,7 @@ where
             self.last_decided_leaf.clone(),
             TestInstanceState::new(self.async_delay_config.clone()),
             None,
-            view_number,
+            read_storage.last_actioned_view().await,
```
Is this equivalent to what we do in the sequencer? Do we use the last actioned view or something else?
This should be equivalent; the sequencer only has access to storage and has a very similar function to get the view number.
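For illustration only, here is a minimal sketch of the idea of deriving the restart view from storage rather than from an in-memory view counter. The struct and field names are made up for the example and are not the actual HotShot test-storage or sequencer APIs; the real `last_actioned_view()` shown in the diff above may be computed differently.

```rust
// Toy stand-in for the node's persistent storage; names are illustrative only.
struct ToyStorage {
    last_proposed_view: u64,
    last_voted_view: u64, // with this PR, timeout votes also bump this
}

impl ToyStorage {
    /// The view a node should restart in: the last view it actually acted in.
    fn last_actioned_view(&self) -> u64 {
        self.last_proposed_view.max(self.last_voted_view)
    }
}

fn main() {
    let storage = ToyStorage { last_proposed_view: 10, last_voted_view: 12 };
    // The restarted node is handed this view instead of a stale in-memory `view_number`.
    assert_eq!(storage.last_actioned_view(), 12);
}
```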
Closes #3681, #3682
This PR does two main things:

- Adds a test that reproduces the restart bug
- Counts `TimeoutVote`s as actions, just as for `QuorumVote`s

The scenario the test repros, and the bug we had, is as follows:
Nodes restart from their `last_actioned_view`, which is the last view they proposed in or sent a `QuorumVote` in (no DA actions are counted, nor were TimeoutVotes or view sync related messages).
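As a rough sketch of what "counts as an action" means here, the mapping looks conceptually like the following. The enum and function below are illustrative stand-ins, not the real `HotShotEvent`/`HotShotAction` definitions; the `TimeoutVoteSend` arm is the one this PR adds.

```rust
// Illustrative stand-ins for the real event/action types.
#[allow(dead_code)]
enum ToyEvent {
    QuorumProposalSend(u64),
    QuorumVoteSend(u64),
    TimeoutVoteSend(u64),
    DaVoteSend(u64),
    ViewSyncVoteSend(u64),
}

#[derive(Debug, PartialEq)]
enum ToyAction {
    Propose,
    Vote,
}

/// Which events bump the last actioned view, and with what action.
fn action_for(event: &ToyEvent) -> Option<(ToyAction, u64)> {
    match event {
        ToyEvent::QuorumProposalSend(view) => Some((ToyAction::Propose, *view)),
        ToyEvent::QuorumVoteSend(view) => Some((ToyAction::Vote, *view)),
        // The fix in this PR: timeout votes now count just like quorum votes.
        ToyEvent::TimeoutVoteSend(view) => Some((ToyAction::Vote, *view)),
        // DA and view sync traffic still does not count as an action.
        ToyEvent::DaVoteSend(_) | ToyEvent::ViewSyncVoteSend(_) => None,
    }
}

fn main() {
    assert_eq!(action_for(&ToyEvent::TimeoutVoteSend(42)), Some((ToyAction::Vote, 42)));
    assert_eq!(action_for(&ToyEvent::DaVoteSend(42)), None);
}
```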
What went wrong:

- View `n` succeeds
- In `n+1` a proposal is formed with the QC for `n`
- `n+1` times out because no DAC can be formed (1/4 of the DA committee members are voting)
- `n+2` has a valid proposal with a TC
- `n+2` times out because of the lack of a DAC and we enter view sync

When the next node is taken down a quorum will no longer be able to be formed and the cycle is broken.
So what view does each node store as the last actioned view it will restart from? There are 3 possibilities:

1. For the first `f` nodes taken down, it'll just be the last regular view they voted and proposed in. Typically this will be much lower than the rest, as progress could still be made.
2. Nodes whose last action was a proposal (one that failed to gather a QC).
3. Nodes whose last action was a `QuorumVote` in the last successful view.

After restart the nodes in bucket 1 will be able to join view sync for any view of the others (since they restart in a lower view). The nodes in bucket 2 will restart into a higher view than the nodes in bucket 3. The nodes in bucket 2 will also be in unique views (only one leader per view, so only one node will action per view since nobody is sending `QuorumVotes`). The nodes in bucket 3 will have the same view stored.

Exactly `2f+1` nodes are in either bucket 2 or 3. The issue is that if more than `f` nodes are in bucket 2 (they proposed, but their proposal failed due to lack of a QC) then after restart consensus will stall. Even with the help of the bucket 1 nodes (if by chance they are in the same view) there will be fewer than `2f+1` nodes able to form a view sync commit certificate. Even with `f+1` nodes agreeing on the view, it's still possible the nodes in bucket 2 will not join view sync because it's for a lower view than they are in.
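As a concrete check of that counting argument, here is a toy worked example (assuming `n = 3f + 1 = 10` nodes, so `f = 3` and a view sync certificate needs `2f + 1 = 7`); the split across buckets is just one instance of the bad case described above:

```rust
fn main() {
    let f = 3u32;
    let threshold = 2 * f + 1; // nodes needed on one view for a view sync commit certificate

    // One bad split, with TimeoutVotes NOT counted as actions:
    let bucket1 = f;                     // taken down early, stuck in low views
    let bucket2 = f + 1;                 // failed proposers, each alone in a unique view
    let bucket3 = (2 * f + 1) - bucket2; // last-view voters, all sharing one view

    // The most any single view can attract is bucket 3 plus the bucket 1
    // stragglers (who will sync up to any higher view): 3 + 3 = 6 < 7.
    assert!(bucket3 + bucket1 < threshold);
    // A bucket 2 node's view is even worse off: only itself plus bucket 1.
    assert!(1 + bucket1 < threshold);
    println!("no single view reaches {threshold} nodes, so view sync stalls");
}
```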
What happens if we count Timeout Votes as actions in this scenario? Again there are 3 possibilities, but the situation is much different. Consider what happens when the `f+1`th node goes down. No more TCs can be formed, and no view sync certs can be formed. This means there can be no more valid proposals and everyone stalls in view sync. Let's call the last view with a Quorum of nodes participating `e`. A TC may or will be formed for `e` (if it were not, then `e-1` would be the last view with a Quorum and the same logic would apply, i.e. there must be some last view with a certificate and we are calling it `e`). A proposal for `e+1` may or may not get sent out, and nodes may or may not see it and may or may not time out.

1. The `f` nodes taken down may be in any view equal to or earlier than the last successful view.
2. Nodes that do not time out `e+1` and are not the leader make up this bucket. These nodes may either not see a proposal for `e+1` or they may be taken offline before timing out that view. They will have `e` as their last voted view.
3. The remaining nodes time out `e+1` and send a TimeoutVote. They will restart in `e+1`.

So on restart, in the worst case there are `f` nodes in a view `<= e` (bucket 1) and `2f+1` nodes in either view `e` or `e+1`. This means one of bucket 2 or bucket 3 must contain at least `f+1` nodes. Since the bucket 1 nodes are in a lesser or equal view, they will join the nodes in whichever bucket has `f+1` nodes when they form a precommit certificate, meaning at least `2f+1` nodes will sync to the same view.

In other words, by counting timeout votes as an action we make this exactly the same as the case for quorum votes.
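And the same toy numbers with the fix in place, showing why some view now always gathers `2f + 1` nodes:

```rust
fn main() {
    let f = 3u32;
    let threshold = 2 * f + 1; // 7

    let bucket1 = f; // still in views <= e, will sync up to whatever is ahead
    // The 2f + 1 remaining nodes are spread over just the two views e and e+1,
    // so by pigeonhole one of those views holds at least f + 1 of them.
    let larger_bucket = f + 1;

    // f + 1 + f = 2f + 1: enough for the view sync (pre)commit certificate.
    assert!(larger_bucket + bucket1 >= threshold);
    println!("{} nodes land on one view, so view sync succeeds", larger_bucket + bucket1);
}
```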
The test I added fails without the change to store the timeout view, and the logs look very much like the real logs we saw in Datadog.
What are the changes:

- The `up` action reuses the storage and external channel of the node when it comes back up. We need the external channel because other tasks need to read from it to verify the decide events etc., and it's not trivial to add new streams to e.g. the safety task.
- Removes `view_change.rs`; it looks like it was an abandoned refactor.

This PR does not:
Key places to review:

- Make sure the test I added simulates the scenario I am describing
- Make sure the change to `network.rs` does what I'm saying
- Run the new integration test