[leader election] Improve leader election for first 20 rounds #5683

Merged
igor-aptos merged 1 commit into main from igor/first_20_rounds_leader_election
Nov 23, 2022

Conversation

igor-aptos
Contributor

exclude_rounds is 20 by default, which means the last 20 rounds are left out of the leader reputation history. That exclusion only applies within the current epoch; we assume everyone is caught up with the previous epoch.

Genesis is epoch=0, round=0.
The first block is epoch=1 round=1. It is a reconfig, which triggers epoch 2. Because the votes in block metadata are for the previous round, they are empty in that round, so the only "active" node in that round is the proposer.

For the first 20 rounds of epoch 2, that is the only history we have, so the proposer of epoch=1 round=1 will always be elected during those rounds.
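A minimal sketch of why the history is so thin at the start of epoch 2. It assumes history is selected by comparing (epoch, round) pairs against (current epoch, current round minus exclude_rounds); `BlockId` and `visible_history` are hypothetical names used for illustration, not the aptos-core API.

```rust
/// (epoch, round) of a committed block; a simplified stand-in for block metadata.
type BlockId = (u64, u64);

/// Hypothetical helper: returns the blocks that may feed leader reputation
/// when electing the leader for `current_round` of `current_epoch`.
/// Assumption: blocks are kept if their (epoch, round) pair is not later than
/// (current_epoch, current_round - exclude_rounds), compared lexicographically.
fn visible_history(
    history: &[BlockId],
    current_epoch: u64,
    current_round: u64,
    exclude_rounds: u64,
) -> Vec<BlockId> {
    // The most recent `exclude_rounds` rounds of the current epoch are skipped.
    let target_round = current_round.saturating_sub(exclude_rounds);
    history
        .iter()
        .copied()
        .filter(|&block| block <= (current_epoch, target_round))
        .collect()
}
```

Under these assumptions, with exclude_rounds = 20, any round from 1 to 20 of epoch 2 sees only the epoch=1 round=1 block; blocks from epoch 2 start to appear from round 21 onward.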

That can cause issues in tests, as they might run for only a few rounds. So this change removes epoch=1 round=1 from the history considered in epoch 2.

We can make this change without backward-compatibility gating, as it only affects the first window of rounds (10 * num validators), and the chain will continue successfully after that (and all active chains are past epoch 2).
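The rule can be sketched as a small function. The floor `if epoch == 1 { 1 } else { 2 }` comes from the diff in this PR. Wrapping it in a standalone function, and taking the other argument of `max` to be the current epoch minus use_history_from_previous_epoch_max_count, are assumptions made for illustration.

```rust
/// Hypothetical helper: earliest epoch whose blocks may be used as
/// leader-reputation history while electing leaders in `current_epoch`.
fn first_epoch_to_consider(
    current_epoch: u64,
    use_history_from_previous_epoch_max_count: u64,
) -> u64 {
    // While still in epoch 1, its own rounds are valid history.
    // From epoch 2 onward, epoch 1 is never looked at.
    let floor = if current_epoch == 1 { 1 } else { 2 };
    // Assumption: history reaches back at most `max_count` epochs.
    let oldest_allowed =
        current_epoch.saturating_sub(use_history_from_previous_epoch_max_count);
    std::cmp::max(floor, oldest_allowed)
}
```

With a max count of 5, this sketch maps epochs 1, 2, 6 and 10 to first epochs 1, 2, 2 and 5 respectively.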

Description

Test Plan

Used the test_basic_fault_tolerance test and confirmed that before the change the leader doesn't change for the first 20 rounds, and after the change it does.

let first_epoch_to_consider = std::cmp::max(
1,
if epoch_state.epoch == 1 { 1 } else { 2 },
Contributor

I don’t quite understand why we want to consider epoch=1 round=1 while we are still in epoch=1.
I.e., why can’t we just change the minimum epoch to consider from 1 to 2?

Contributor Author

We do have multiple rounds in epoch 1; they are just short-circuited and removed after we move to epoch 2.
So leader election should work normally in epoch 1. We just don't want to carry history from epoch 1 into epoch 2 leader election.

https://gist.githubusercontent.com/igor-aptos/8127b53add8fd9e9a83d9dc299a4eb92/raw/6f6506f2f6ca8a583ab830e26e3a93a7f9940947/gistfile1.txt

Contributor

this expression is getting ugly, maybe this is simpler?

if epoch <= 2 {
  use_history_from_previous_epoch_max_count = 0;
}

Contributor Author

That's not the same. use_history_from_previous_epoch_max_count is 5 by default (in case we have multiple consecutive reconfigs/governance proposals). So we want first_epoch_to_consider to be 2 for epoch = 6 too.
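To make the difference concrete, here is a sketch comparing the two variants. All function names are hypothetical, and the surrounding computation (the max of a floor and the epoch minus the max count) is an assumption about the code around the snippet under review.

```rust
/// Variant in this PR: the floor is 2 for every epoch after the first.
fn first_epoch_as_merged(epoch: u64, max_count: u64) -> u64 {
    let floor = if epoch == 1 { 1 } else { 2 };
    std::cmp::max(floor, epoch.saturating_sub(max_count))
}

/// Variant suggested in review: zero the max count for epochs 1 and 2,
/// and keep the original floor of 1.
fn first_epoch_as_suggested(epoch: u64, max_count: u64) -> u64 {
    let max_count = if epoch <= 2 { 0 } else { max_count };
    std::cmp::max(1, epoch.saturating_sub(max_count))
}

/// Epochs in 1..=last_epoch for which the two variants disagree.
fn diverging_epochs(last_epoch: u64, max_count: u64) -> Vec<u64> {
    (1..=last_epoch)
        .filter(|&e| {
            first_epoch_as_merged(e, max_count) != first_epoch_as_suggested(e, max_count)
        })
        .collect()
}
```

Under these assumptions and the default of 5, the two variants disagree for epochs 3 through 6, where the suggested form would still reach back to epoch 1.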

Contributor

Hmm, as long as we have more than 20 rounds after epoch 1, it should be good enough?

Contributor Author

@igor-aptos igor-aptos Nov 23, 2022

I see what you are saying, but even though the expression is shorter, I find it much more confusing as to what it does. I think with the comment, the code is understandable.

So I am going to land as is.

Contributor

@zekun000 zekun000 left a comment

good findings!

@igor-aptos igor-aptos added the CICD:run-e2e-tests label (when this label is present, GitHub Actions will run all land-blocking e2e tests from the PR) Nov 23, 2022
Contributor

@bchocho bchocho left a comment

LGTM

@github-actions
Contributor

✅ Forge suite land_blocking success on 0a0e99eee17535b391ab7371c2144220068ee62a

performance benchmark with full nodes : 6278 TPS, 6152 ms latency, 16400 ms p99 latency,(!) expired 8560 out of 2689520 txns
Test Ok

@github-actions
Contributor

✅ Forge suite compat success on testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0a0e99eee17535b391ab7371c2144220068ee62a

Compatibility test results for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0a0e99eee17535b391ab7371c2144220068ee62a (PR)
1. Check liveness of validators at old version: testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b
compatibility::simple-validator-upgrade::liveness-check : 7521 TPS, 5123 ms latency, 6800 ms p99 latency,no expired txns
2. Upgrading first Validator to new version: 0a0e99eee17535b391ab7371c2144220068ee62a
compatibility::simple-validator-upgrade::single-validator-upgrade : 4543 TPS, 9148 ms latency, 12300 ms p99 latency,no expired txns
3. Upgrading rest of first batch to new version: 0a0e99eee17535b391ab7371c2144220068ee62a
compatibility::simple-validator-upgrade::half-validator-upgrade : 4738 TPS, 8534 ms latency, 10800 ms p99 latency,no expired txns
4. upgrading second batch to new version: 0a0e99eee17535b391ab7371c2144220068ee62a
compatibility::simple-validator-upgrade::rest-validator-upgrade : 6904 TPS, 5562 ms latency, 9600 ms p99 latency,no expired txns
5. check swarm health
Compatibility test for testnet_2d8b1b57553d869190f61df1aaf7f31a8fc19a7b ==> 0a0e99eee17535b391ab7371c2144220068ee62a passed
Test Ok

@igor-aptos igor-aptos merged commit 49d77e6 into main Nov 23, 2022
@igor-aptos igor-aptos deleted the igor/first_20_rounds_leader_election branch November 23, 2022 22:56
areshand pushed a commit to areshand/aptos-core-1 that referenced this pull request Dec 18, 2022
…labs#5683)
