Compute Switch Threshold #9218

Merged
merged 18 commits into solana-labs:master from the FixReplayStage3 branch on May 12, 2020
Conversation

@carllin (Contributor) commented Apr 1, 2020

Problem

Missing computation of switch threshold for optimistic confirmation

Summary of Changes

  1. Refactor ReplayStage to not vote when the switch threshold fails, and instead reset to the heaviest descendant of the last vote (will factor this out into another PR); a rough sketch of this decision follows below

  2. Compute the switching threshold

Fixes #
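
As a rough illustration of change (1) above (illustrative names only, not the actual ReplayStage types or code paths):

```rust
// Hedged sketch of the voting decision in change (1); `Decision`, the flags,
// and the function are illustrative stand-ins, not actual ReplayStage code.
enum Decision {
    VoteOnHeaviestBank,
    ResetToHeaviestDescendantOfLastVote,
}

fn vote_or_reset(
    heaviest_bank_descends_from_last_vote: bool,
    switch_threshold_met: bool,
) -> Decision {
    if heaviest_bank_descends_from_last_vote || switch_threshold_met {
        // Either we are extending our own fork, or enough stake is provably
        // locked out on other forks to make switching safe.
        Decision::VoteOnHeaviestBank
    } else {
        // Switch check failed: don't vote, and reset onto the heaviest
        // descendant of the last-voted fork instead.
        Decision::ResetToHeaviestDescendantOfLastVote
    }
}
```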

// 2) Not from before the current root as we can't determine if
// anything before the root was an ancestor of `last_vote` or not
if !last_vote_ancestors.contains(lockout_interval_start)
    && ancestors.contains_key(lockout_interval_start)
@carllin (Contributor Author):

@aeyakovenko, I'm filtering out any branches from before the root, so those forks can't be included in the switching proofs, even though they may be locked out above our last vote. I don't think it's a huge issue, but I could be missing something...
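
To illustrate the point, ancestry is only tracked for slots at or above the current root, so an interval that starts before the root can be neither confirmed nor ruled out as an ancestor of `last_vote`. A minimal sketch of the filter above, using a hypothetical helper name:

```rust
use std::collections::{HashMap, HashSet};

type Slot = u64;

// Hypothetical helper (not in the PR) restating the condition in the diff:
// `ancestors` is only keyed by slots at or above the current root, so an
// interval starting before the root has unknown ancestry and is skipped.
fn counts_toward_switch_proof(
    lockout_interval_start: Slot,
    last_vote_ancestors: &HashSet<Slot>,
    ancestors: &HashMap<Slot, HashSet<Slot>>,
) -> bool {
    !last_vote_ancestors.contains(&lockout_interval_start)
        && ancestors.contains_key(&lockout_interval_start)
}
```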

lockout_intervals
    .entry(vote.expiration_slot())
    .or_insert_with(|| vec![])
    .push((vote.slot, key));
@carllin (Contributor Author):

Will add these keys to the replay_stage all_pubkeys so they pull from the same reference pool to reduce memory usage.
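
A sketch of that reference-pool idea, assuming an interning set of `Rc<Pubkey>` (the actual `all_pubkeys` type in ReplayStage may differ); the `.push((vote.slot, key))` above would then push the interned `Rc` instead of a fresh copy of the key:

```rust
use std::collections::HashSet;
use std::rc::Rc;

// Stand-in for solana_sdk::pubkey::Pubkey, to keep the sketch self-contained.
type Pubkey = [u8; 32];

// Intern each vote-account key once and hand out cheap `Rc` clones, so the
// many lockout-interval entries share one allocation per key.
fn intern(all_pubkeys: &mut HashSet<Rc<Pubkey>>, key: Pubkey) -> Rc<Pubkey> {
    if let Some(existing) = all_pubkeys.get(&key) {
        return Rc::clone(existing);
    }
    let interned = Rc::new(key);
    all_pubkeys.insert(Rc::clone(&interned));
    interned
}
```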

}
(locked_out_stake as f64 / total_stake as f64) > SWITCH_FORK_THRESHOLD
})
.unwrap_or(true)
@carllin (Contributor Author):

This doesn't generate the proof yet; I want to make sure these incremental changes don't break anything first.
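
Putting the snippets above together, the check roughly amounts to the following sketch; the parameter names, the slot range, and the absence of per-voter deduplication are simplifications, not the real Tower/ReplayStage code:

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

type Slot = u64;
type Pubkey = [u8; 32]; // stand-in for solana_sdk::pubkey::Pubkey

// Rough sketch: sum the stake of validators whose lockouts on other forks
// have not yet expired, skipping ancestors of the last vote and anything
// from before the root, then compare the ratio against the threshold.
fn switch_threshold_check(
    last_voted_slot: Slot,
    last_vote_ancestors: &HashSet<Slot>,
    ancestors: &HashMap<Slot, HashSet<Slot>>,
    // expiration slot -> (lockout interval start, vote account key)
    lockout_intervals: &BTreeMap<Slot, Vec<(Slot, Pubkey)>>,
    stakes: &HashMap<Pubkey, u64>,
    total_stake: u64,
    switch_fork_threshold: f64,
) -> bool {
    let mut locked_out_stake: u64 = 0;
    // Only lockouts that expire at or after the last-voted slot still pin
    // their voter to the other fork.
    for (_expiration, intervals) in lockout_intervals.range(last_voted_slot..) {
        for (interval_start, vote_account) in intervals {
            if !last_vote_ancestors.contains(interval_start)
                && ancestors.contains_key(interval_start)
            {
                locked_out_stake += stakes.get(vote_account).copied().unwrap_or(0);
            }
        }
    }
    (locked_out_stake as f64 / total_stake as f64) > switch_fork_threshold
}
```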

carllin force-pushed the FixReplayStage3 branch 2 times, most recently from a6e9a44 to bfc974f on April 2, 2020 04:18
codecov bot commented Apr 2, 2020

Codecov Report

Merging #9218 into master will increase coverage by 0.0%.
The diff coverage is 96.5%.

@@           Coverage Diff           @@
##           master   #9218    +/-   ##
=======================================
  Coverage    80.4%   80.4%            
=======================================
  Files         284     285     +1     
  Lines       66235   66388   +153     
=======================================
+ Hits        53263   53420   +157     
+ Misses      12972   12968     -4     

carllin force-pushed the FixReplayStage3 branch 5 times, most recently from 08b4b28 to 1deaba9 on April 3, 2020 09:16
stale bot commented Apr 12, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Apr 12, 2020
@aeyakovenko (Member):

Looks pretty good. We really need to pull the consensus stuff out into a non-networked simulation environment.

stale bot removed the stale label Apr 12, 2020
aeyakovenko previously approved these changes Apr 12, 2020
mergify bot dismissed aeyakovenko's stale review April 13, 2020 00:21

Pull request has been modified.

stale bot commented Apr 20, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Apr 20, 2020
stale bot commented Apr 27, 2020

This stale pull request has been automatically closed. Thank you for your contributions.

stale bot closed this Apr 27, 2020
carllin reopened this May 4, 2020
stale bot removed the stale label May 4, 2020
carllin marked this pull request as ready for review May 9, 2020 22:49
@carllin (Contributor Author) commented May 12, 2020

Seems to be stable for last 5 hours with regular partitioning every 10 mins, and a few hours of stability in between.
[Screenshot: Screen Shot 2020-05-11 at 6 55 46 PM]

carllin merged commit 59de1b3 into solana-labs:master May 12, 2020
mvines mentioned this pull request May 13, 2020
@aeyakovenko (Member):

@carllin, do these tests verify that the network can recover from a 33 - 4 failure, assuming the threshold is 4?

@carllin (Contributor Author) commented May 13, 2020

@aeyakovenko, if the threshold is 4, then the switching threshold is (100 + 4 - 66.66) = 37.34%. Then you can tolerate at most 100 - 2 * 37.34 = 25.32% failure. Any more than that and, assuming the worst-case 50-50 partition, validators will get stuck on their respective partitions without being able to switch.
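
A back-of-the-envelope check of those numbers (the 4% margin and the 66.66% optimistic-confirmation supermajority are the assumptions stated above, not constants taken from the code):

```rust
fn main() {
    let supermajority = 66.66; // assumed optimistic confirmation threshold, in percent
    let margin = 4.0; // the assumed "threshold" of 4, in percent
    let switch_threshold = 100.0 + margin - supermajority; // 37.34
    let max_dead_stake = 100.0 - 2.0 * switch_threshold; // 25.32
    println!(
        "switch threshold: {:.2}%, tolerable failure: {:.2}%",
        switch_threshold, max_dead_stake
    );
}
```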

There's currently no test that sets up this exact scenario, but this one is pretty close: https://github.com/solana-labs/solana/blob/master/local-cluster/tests/local_cluster.rs#L376. It currently kills 9/29 ~= 31% of the stake, but I think it passes because it kills the validator after the partition is resolved: https://github.com/solana-labs/solana/blob/master/local-cluster/tests/local_cluster.rs#L319-L324, which I think means the validator has voted sufficiently during/after the partition for its votes to be used in a switching proof.

If we were instead to toggle the test to kill the leader before the partition, that would probably test this case.

@aeyakovenko (Member):

@carllin, but once the partition recovers, the nodes can come back. Ideally we have a local cluster test and a nightly partition test that induce this scenario.

@carllin (Contributor Author) commented May 14, 2020

@aeyakovenko, even after the partition resolves, if > 25% of the stake is dead then each side of the partition may get stuck, as there's not enough stake on the other side of the partition to generate a switching proof.

The only possible way out of this hole is if the smaller/less-staked partition then itself sub-partitions/forks, and validators on that side of the partition also vote on their own fork (they would have to think their fork is the heaviest, which may not happen if they detect the major fork), allowing some of the validators on that side of the partition to generate a switching proof and switch. But this seems like a very unlikely escape.

We can add a local cluster test; the nightly partition test can be an expansion of the existing nightly partition tests, plus the ability to kill some of the nodes.

@aeyakovenko (Member):

@carllin, I meant the network should recover as soon as we are under 25% dead.
