Skip leader slots until a vote lands #15607
Conversation
@carllin does it look reasonable at all?
force-pushed from 8aa468f to f5cfa8f
hmmm so if somebody is blowing away their ledger and restarting, I'm not sure this would help, i.e.
They then blow away the ledger, vote on slot 1, set
Slot 2 will be frozen already if they created slot 3 on top of it. So they won't be able to land their vote for slot 1 in slot 2. The
yeah it seems at the very least you'll have to check that the vote signature landed in a rooted fork to ensure you're near the tip. Also of note, I think there's still an edge case if you were a bit ahead of the major fork on a different fork, i.e.
If you restart on the major fork and vote on slot 6, which lands in slot 7, you may recreate your vote slot again for slot 20 on the major fork.
I was thinking we'd suppress this behavior when
force-pushed from ffa2626 to e6c7867
force-pushed from e6c7867 to a78cb26
force-pushed from a78cb26 to e654f4b
I need to handle the case where the node is the sole bootstrap leader, or one of many, or any other case where the node might not be able to land a vote and needs to produce slots anyway. I was trying to think whether there is an elegant way to figure that out.
force-pushed from eab57d7 to 73dced3
Codecov Report
@@           Coverage Diff           @@
##           master   #15607   +/-  ##
=======================================
  Coverage    80.0%    80.0%
=======================================
  Files         410      410
  Lines      109070   109102    +32
=======================================
+ Hits        87338    87389    +51
+ Misses      21732    21713    -19
lgtm with nits!
@@ -293,6 +296,8 @@ impl ReplayStage {
        let mut partition_exists = false;
        let mut skipped_slots_info = SkippedSlotsInfo::default();
        let mut replay_timing = ReplayTiming::default();
+       let mut voted_signatures = Vec::new();
Suggested change:
-       let mut voted_signatures = Vec::new();
+       let mut voted_signatures = Vec::with_capacity(201);
Why 201?
I chose a 200-signature limit for how large this can be. I'm not sure we need to size it initially; most cases should not use the full 200, and the path isn't that performance sensitive.
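A minimal sketch of how a bounded list like this could be maintained; the constant name and the helper function are assumptions for illustration, not the PR's exact code:

use solana_sdk::signature::Signature;

// Assumed cap from the discussion above: keep at most 200 vote signatures.
const MAX_VOTE_SIGNATURES: usize = 200;

// Record a newly sent vote's signature, dropping the oldest entry once the
// list exceeds the cap so memory stays bounded.
fn remember_vote_signature(voted_signatures: &mut Vec<Signature>, signature: Signature) {
    voted_signatures.push(signature);
    if voted_signatures.len() > MAX_VOTE_SIGNATURES {
        voted_signatures.remove(0);
    }
}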
@@ -1360,6 +1360,13 @@ pub fn main() {
                .help("After processing the ledger and the next slot is SLOT, wait until a \
                     supermajority of stake is visible on gossip before starting PoH"),
        )
+       .arg(
+           Arg::with_name("no_wait_for_vote_to_start_leader")
might be good to add a warning here that if > 33% of the cluster goes down and restarts, everyone will need to set this flag; otherwise, even on restart, nobody will make progress.
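A hedged sketch of what the flag with such a warning in its help text could look like; the long-option spelling, builder calls, and wording are assumptions here, since only Arg::with_name("no_wait_for_vote_to_start_leader") appears in the diff above:

.arg(
    Arg::with_name("no_wait_for_vote_to_start_leader")
        .long("no-wait-for-vote-to-start-leader")
        .takes_value(false)
        .help(
            "Do not wait for a vote to land in a rooted slot before producing \
             leader slots. WARNING: if more than a third of the cluster goes \
             down and restarts without --wait-for-supermajority, every node \
             must set this flag or nobody will make progress after the restart.",
        ),
)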
If 33% goes down, we have to do a manual restart with the --wait-for-supermajority flag, right? In that case we detect that we had to wait for supermajority and we skip this check for the vote, so it shouldn't be necessary to use the flag there. The only case we need it is if you are starting a bootstrap leader (or leaders) without --wait-for-supermajority.
core/src/validator.rs
Outdated
@@ -627,16 +629,19 @@ impl Validator {
            check_poh_speed(&genesis_config, None);
        }

-       if wait_for_supermajority(
+       let (failed, did_wait) = wait_for_supermajority(
nit: did_wait -> is_wait_unnecessary, since in some cases like on hard forks, a wait didn't occur but is necessary (or even better, is_wait_necessary, but then we have to flip the booleans around below)
core/src/validator.rs
Outdated
        if let Some(wait_for_supermajority) = config.wait_for_supermajority {
            match wait_for_supermajority.cmp(&bank.slot()) {
-               std::cmp::Ordering::Less => return false,
+               std::cmp::Ordering::Less => return (false, false),
is the second bool equal to false ever an acceptable case? I.e. if everyone restarts together and is waiting for their vote to land in a root, nobody will vote, and so everyone will stall, right?
did_wait can be false, yes, because nodes could still have --wait-for-supermajority set even when the network has passed the given wait slot and they are trying to join the network. They wouldn't wait at all because the slot they loaded from is already past the wait-for slot and the shred version matches. In this case they should wait to vote, because they have joined a network that has already started producing blocks.
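A minimal sketch of the (failed, did_wait) semantics being discussed; the function signature and slot handling are simplified assumptions, not the actual validator code:

// Returns (failed, did_wait): `failed` aborts startup, `did_wait` records
// whether the node actually waited for supermajority at the restart slot.
fn wait_for_supermajority_sketch(configured_wait_slot: Option<u64>, bank_slot: u64) -> (bool, bool) {
    match configured_wait_slot {
        // Flag not set: nothing failed, and no wait happened.
        None => (false, false),
        // The loaded bank is already past the wait slot, e.g. the node is
        // joining a cluster that has moved on: no wait happens, so the node
        // should still hold off on leader slots until a vote lands.
        Some(wait_slot) if wait_slot < bank_slot => (false, false),
        // Otherwise the node waits for supermajority here; having waited, it
        // can skip the landed-vote check.
        Some(_) => (false, true),
    }
}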
force-pushed from 73dced3 to d77e912
force-pushed from d77e912 to 3652411
I bet we'll need a no_wait_for_vote_to_start_leader: true in this config block as well for TestValidator:
solana/core/src/test_validator.rs
Lines 402 to 426 in d76ad33
let validator_config = ValidatorConfig {
    rpc_addrs: Some((
        SocketAddr::new(IpAddr::V4(Ipv4Addr::new(0, 0, 0, 0)), node.info.rpc.port()),
        SocketAddr::new(
            IpAddr::V4(Ipv4Addr::new(0, 0, 0, 0)),
            node.info.rpc_pubsub.port(),
        ),
    )),
    rpc_config,
    accounts_hash_interval_slots: 100,
    account_paths: vec![ledger_path.join("accounts")],
    poh_verify: false, // Skip PoH verification of ledger on startup for speed
    snapshot_config: Some(SnapshotConfig {
        snapshot_interval_slots: 100,
        snapshot_path: ledger_path.join("snapshot"),
        snapshot_package_output_path: ledger_path.to_path_buf(),
        archive_format: ArchiveFormat::Tar,
        snapshot_version: SnapshotVersion::default(),
    }),
    enforce_ulimit_nofile: false,
    warp_slot: config.warp_slot,
    bpf_jit: !config.no_bpf_jit,
    validator_exit: config.validator_exit.clone(),
    ..ValidatorConfig::default()
};
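A sketch of the suggested addition; the field name comes from the discussion above, while its exact position in the struct is an assumption:

let validator_config = ValidatorConfig {
    // ... fields as in the block above ...
    no_wait_for_vote_to_start_leader: true,
    ..ValidatorConfig::default()
};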
force-pushed from 3652411 to e37884f
added, thanks
force-pushed from e37884f to 220b5bd
force-pushed from 220b5bd to 2aea352
force-pushed from 29ac6eb to dac0512
(cherry picked from commit b99ae8f)

# Conflicts:
#   core/src/consensus.rs
#   core/src/replay_stage.rs
(cherry picked from commit b99ae8f)
(cherry picked from commit b99ae8f) Co-authored-by: sakridge <[email protected]>
Should have
let me look
I don't think that is the cause of it. Those validators are not staked at the beginning, so they can wait to land a rooted vote. I had that print in a bad spot, but #16156 should fix it.
This reverts commit b99ae8f.
This reverts commit b99ae8f.
can you try with #16156?
#16156 is also failing
This reverts commit b99ae8f.
Stable passed here with it: It looks like a flaky condition in the test to me; it shouldn't really affect that test. The test doesn't actually trigger any of the modified paths of this PR.
Problem
Nodes starting up after clearing their ledger can double-sign a slot: they start their leader slot again without being aware that they already produced a block for that slot.
Summary of Changes
Check whether we have landed a vote since starting the node, and do not produce leader slots until then.
Fixes #
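A minimal sketch of the gating described in the summary, assuming hypothetical helper names (has_new_vote_been_rooted, should_start_leader) and a caller-supplied rooted-bank lookup; the real implementation lives in ReplayStage and may differ:

use solana_sdk::signature::Signature;

// True once any vote sent since startup is observed in a rooted bank.
fn has_new_vote_been_rooted(
    voted_signatures: &[Signature],
    signature_is_in_rooted_bank: impl Fn(&Signature) -> bool,
) -> bool {
    voted_signatures.iter().any(|sig| signature_is_in_rooted_bank(sig))
}

// Leader slots are skipped until a vote lands, unless the operator opted out
// with --no-wait-for-vote-to-start-leader or the node already waited for
// supermajority at the restart slot.
fn should_start_leader(
    waited_for_supermajority: bool,
    no_wait_for_vote_to_start_leader: bool,
    new_vote_rooted: bool,
) -> bool {
    no_wait_for_vote_to_start_leader || waited_for_supermajority || new_vote_rooted
}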