Compute Switch Threshold #9218

Merged
merged 18 commits into solana-labs:master from the FixReplayStage3 branch on May 12, 2020
Conversation

@carllin (Contributor) commented Apr 1, 2020

Problem

Missing computation of switch threshold for optimistic confirmation

Summary of Changes

  1. Refactor ReplayStage to not vote when the switch threshold fails, and instead reset to the heaviest descendant of the last vote (will factor this out into another PR); a rough sketch of this decision follows below

  2. Compute the switching threshold

Fixes #
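
As a rough illustration of change (1) above (illustrative names only, not the actual ReplayStage types or code paths):

```rust
// Hedged sketch of the voting decision in change (1); `Decision`, the flags,
// and the function are illustrative stand-ins, not actual ReplayStage code.
enum Decision {
    VoteOnHeaviestBank,
    ResetToHeaviestDescendantOfLastVote,
}

fn vote_or_reset(
    heaviest_bank_descends_from_last_vote: bool,
    switch_threshold_met: bool,
) -> Decision {
    if heaviest_bank_descends_from_last_vote || switch_threshold_met {
        // Either we are extending our own fork, or enough stake is provably
        // locked out on other forks to make switching safe.
        Decision::VoteOnHeaviestBank
    } else {
        // Switch check failed: don't vote, and reset onto the heaviest
        // descendant of the last-voted fork instead.
        Decision::ResetToHeaviestDescendantOfLastVote
    }
}
```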

// 2) Not from before the current root as we can't determine if
// anything before the root was an ancestor of `last_vote` or not
if !last_vote_ancestors.contains(lockout_interval_start)
    && ancestors.contains_key(lockout_interval_start)
@carllin (Contributor Author):

@aeyakovenko, I'm filtering out any branches from before the root, so those forks can't be included in the switching proofs, even though they may be locked out above our last vote. I don't think it's a huge issue, but I could be missing something...
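
To illustrate the point, ancestry is only tracked for slots at or above the current root, so an interval that starts before the root can be neither confirmed nor ruled out as an ancestor of `last_vote`. A minimal sketch of the filter above, using a hypothetical helper name:

```rust
use std::collections::{HashMap, HashSet};

type Slot = u64;

// Hypothetical helper (not in the PR) restating the condition in the diff:
// `ancestors` is only keyed by slots at or above the current root, so an
// interval starting before the root has unknown ancestry and is skipped.
fn counts_toward_switch_proof(
    lockout_interval_start: Slot,
    last_vote_ancestors: &HashSet<Slot>,
    ancestors: &HashMap<Slot, HashSet<Slot>>,
) -> bool {
    !last_vote_ancestors.contains(&lockout_interval_start)
        && ancestors.contains_key(&lockout_interval_start)
}
```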

lockout_intervals
    .entry(vote.expiration_slot())
    .or_insert_with(|| vec![])
    .push((vote.slot, key));
@carllin (Contributor Author):

Will add these keys to the replay_stage all_pubkeys so they pull from the same reference pool to reduce memory usage.
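
A sketch of that reference-pool idea, assuming an interning set of `Rc<Pubkey>` (the actual `all_pubkeys` type in ReplayStage may differ); the `.push((vote.slot, key))` above would then push the interned `Rc` instead of a fresh copy of the key:

```rust
use std::collections::HashSet;
use std::rc::Rc;

// Stand-in for solana_sdk::pubkey::Pubkey, to keep the sketch self-contained.
type Pubkey = [u8; 32];

// Intern each vote-account key once and hand out cheap `Rc` clones, so the
// many lockout-interval entries share one allocation per key.
fn intern(all_pubkeys: &mut HashSet<Rc<Pubkey>>, key: Pubkey) -> Rc<Pubkey> {
    if let Some(existing) = all_pubkeys.get(&key) {
        return Rc::clone(existing);
    }
    let interned = Rc::new(key);
    all_pubkeys.insert(Rc::clone(&interned));
    interned
}
```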

}
(locked_out_stake as f64 / total_stake as f64) > SWITCH_FORK_THRESHOLD
})
.unwrap_or(true)
@carllin (Contributor Author):

This doesn't generate the proof yet; I want to make sure these incremental changes don't break anything first.
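
Putting the snippets above together, the check roughly amounts to the following sketch; the parameter names, the slot range, and the absence of per-voter deduplication are simplifications, not the real Tower/ReplayStage code:

```rust
use std::collections::{BTreeMap, HashMap, HashSet};

type Slot = u64;
type Pubkey = [u8; 32]; // stand-in for solana_sdk::pubkey::Pubkey

// Rough sketch: sum the stake of validators whose lockouts on other forks
// have not yet expired, skipping ancestors of the last vote and anything
// from before the root, then compare the ratio against the threshold.
fn switch_threshold_check(
    last_voted_slot: Slot,
    last_vote_ancestors: &HashSet<Slot>,
    ancestors: &HashMap<Slot, HashSet<Slot>>,
    // expiration slot -> (lockout interval start, vote account key)
    lockout_intervals: &BTreeMap<Slot, Vec<(Slot, Pubkey)>>,
    stakes: &HashMap<Pubkey, u64>,
    total_stake: u64,
    switch_fork_threshold: f64,
) -> bool {
    let mut locked_out_stake: u64 = 0;
    // Only lockouts that expire at or after the last-voted slot still pin
    // their voter to the other fork.
    for (_expiration, intervals) in lockout_intervals.range(last_voted_slot..) {
        for (interval_start, vote_account) in intervals {
            if !last_vote_ancestors.contains(interval_start)
                && ancestors.contains_key(interval_start)
            {
                locked_out_stake += stakes.get(vote_account).copied().unwrap_or(0);
            }
        }
    }
    (locked_out_stake as f64 / total_stake as f64) > switch_fork_threshold
}
```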

carllin force-pushed the FixReplayStage3 branch 2 times, most recently from a6e9a44 to bfc974f on April 2, 2020 04:18
codecov bot commented Apr 2, 2020

Codecov Report

Merging #9218 into master will increase coverage by 0.0%.
The diff coverage is 96.5%.

@@           Coverage Diff           @@
##           master   #9218    +/-   ##
=======================================
  Coverage    80.4%   80.4%            
=======================================
  Files         284     285     +1     
  Lines       66235   66388   +153     
=======================================
+ Hits        53263   53420   +157     
+ Misses      12972   12968     -4     

carllin force-pushed the FixReplayStage3 branch 5 times, most recently from 08b4b28 to 1deaba9 on April 3, 2020 09:16
stale bot commented Apr 12, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Apr 12, 2020
@aeyakovenko (Member):

Looks pretty good. We really need to pull the consensus stuff out into a non-networked simulation environment.

stale bot removed the stale label Apr 12, 2020
aeyakovenko previously approved these changes Apr 12, 2020
mergify bot dismissed aeyakovenko's stale review April 13, 2020 00:21

Pull request has been modified.

stale bot commented Apr 20, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

stale bot added the stale label Apr 20, 2020
stale bot commented Apr 27, 2020

This stale pull request has been automatically closed. Thank you for your contributions.

stale bot closed this Apr 27, 2020
carllin reopened this May 4, 2020
stale bot removed the stale label May 4, 2020
carllin marked this pull request as ready for review May 9, 2020 22:49
@carllin (Contributor Author) commented May 12, 2020

Seems to be stable for last 5 hours with regular partitioning every 10 mins, and a few hours of stability in between.
[Screenshot: Screen Shot 2020-05-11 at 6 55 46 PM]

carllin merged commit 59de1b3 into solana-labs:master May 12, 2020
mvines mentioned this pull request May 13, 2020
@aeyakovenko (Member):

@carllin, do these tests verify that the network can recover from a 33 - 4 failure, assuming the threshold is 4?

@carllin (Contributor Author) commented May 13, 2020

@aeyakovenko, if the threshold is 4, then the switching threshold is (100 + 4 - 66.66) = 37.34%. Then you can tolerate at most 100 - 2 * 37.34 = 25.32% failure. Any more than that and, assuming the worst-case 50-50 partition, validators will get stuck on their respective partitions without being able to switch.
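
A back-of-the-envelope check of those numbers (the 4% margin and the 66.66% optimistic-confirmation supermajority are the assumptions stated above, not constants taken from the code):

```rust
fn main() {
    let supermajority = 66.66; // assumed optimistic confirmation threshold, in percent
    let margin = 4.0; // the assumed "threshold" of 4, in percent
    let switch_threshold = 100.0 + margin - supermajority; // 37.34
    let max_dead_stake = 100.0 - 2.0 * switch_threshold; // 25.32
    println!(
        "switch threshold: {:.2}%, tolerable failure: {:.2}%",
        switch_threshold, max_dead_stake
    );
}
```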

There's currently no test that sets up this exact scenario, but this one is pretty close: https://github.com/solana-labs/solana/blob/master/local-cluster/tests/local_cluster.rs#L376. It currently kills 9/29 ~= 31% of the stake, but I think it passes because it kills the validator after the partition is resolved: https://github.com/solana-labs/solana/blob/master/local-cluster/tests/local_cluster.rs#L319-L324, which I think means the validator has voted sufficiently during/after the partition for its votes to be used in a switching proof.

If we were instead to toggle the test to kill the leader before the partition, that would probably test this case.

@aeyakovenko (Member):

@carllin, but once the partition recovers, the nodes can come back. Ideally we have a local cluster test and a nightly partition test that induce this scenario.

@carllin (Contributor Author) commented May 14, 2020

@aeyakovenko, even after the partition resolves, if > 25% of the stake is dead then each side of the partition may get stuck, as there's not enough stake on the other side of the partition to generate a switching proof.

The only possible way out of this hole is if the smaller/less-staked partition then itself sub-partitions/forks, and validators on that side of the partition also vote on their own fork (they would have to think their fork is the heaviest, which may not happen if they detect the major fork), allowing some of the validators on that side of the partition to generate a switching proof and switch. But this seems like a very unlikely escape.

We can add a local cluster test; the nightly partition test can be an expansion of the existing nightly partition tests, plus the ability to kill some of the nodes.

@aeyakovenko (Member):

@carllin, I meant the network should recover as soon as we are under 25% dead.
