Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

tower: when syncing from vote state, update last_vote #32944

Merged
merged 2 commits into from
Aug 23, 2023

Conversation

AshwinSekar
Copy link
Contributor

@AshwinSekar AshwinSekar commented Aug 22, 2023

Problem

On replay, if validators observe a newer vote state from a frozen bank, they will adopt that vote state in favor of their local tower. This is to avoid an outdated local tower from sending out slashable votes, or in case of large divergences, having to wait forever for lockout to expire.

However when we adopt vote state, we do not update the tower's last_vote field. This causes problems in situations where our fork choice does not agree with the on chain vote state.
Consider the following:

[...] - 115 - 124 - 125 - 126 - 127
         \
          - 120 - 121 - 122 - 123

Previously our validator had voted on slots 124 and 126 which landed in block 127. Currently, our validator has replayed only up to 115 and has most recently voted on 115.

Next it receives the remaining blocks up to 127 to replay. Upon replay of 127 it realizes it has voted on slots 124 and 126 in the past, and updates the tower vote state to reflect that.

"Frozen bank vote state slot {:?}
is newer than our local vote state slot {:?},
adopting the bank vote state as our own.

Suppose that previously, our validator did not get a chance to observe the 120 fork which is why it voted on 124. However in our current run, fork choice rates 123 as the best slot. When it comes time to select forks, we select 123 as the heaviest slot and 123 as the heaviest slot descended from our last vote:

fn heaviest_slot_on_same_voted_fork(&self, tower: &Tower) -> Option<SlotHashKey> {
tower
.last_voted_slot_hash()

Because we have not updated our last vote, we think that the heaviest slot descends from our last vote, and incorrectly SwitchForkDecision::SameFork
if switch_slot == last_voted_slot || switch_slot_ancestors.contains(&last_voted_slot) {
// If the `switch_slot is a descendant of the last vote,
// no switching proof is necessary
return SwitchForkDecision::SameFork;

At this point we vote on 123, however because we adopted the vote state from 127 which contains votes for 124 and 126, this vote will fail. Unfortunately we use process_vote_unchecked which silently ignores this error. We think that the vote has succeeded, and construct our new last_vote from the tower. This last_vote is garbage, it contains slot 126, but has a vote_hash for slot 123.

process_vote_unchecked(&mut self.vote_state, vote);
VoteTransaction::from(VoteStateUpdate::new(
self.vote_state
.votes
.iter()
.map(|vote| vote.lockout)
.collect(),
self.vote_state.root_slot,
vote_hash,
))

Any further operations involving the tower + fork choice will now panic, as the last vote in the tower does not exist in fork choice (invalid SlotHashKey).

Summary of Changes

  • When adopting on chain vote state, update tower.last_vote as well.
  • tower.record_bank_vote... variants should use process_vote_unfiltered. Although it is unlikely we can recover in case this vote fails, we can log the error to make debugging such situations possible.

Fixes #32880
This is the proper fix for #32894

@AshwinSekar AshwinSekar marked this pull request as ready for review August 23, 2023 00:15
@codecov
Copy link

codecov bot commented Aug 23, 2023

Codecov Report

Merging #32944 (f24821a) into master (14d0759) will increase coverage by 0.0%.
The diff coverage is 90.1%.

@@           Coverage Diff           @@
##           master   #32944   +/-   ##
=======================================
  Coverage    82.0%    82.0%           
=======================================
  Files         784      784           
  Lines      212512   212565   +53     
=======================================
+ Hits       174313   174398   +85     
+ Misses      38199    38167   -32     

Comment on lines 587 to 594
let result = process_vote_unfiltered(
&mut self.vote_state,
&vote.slots,
&vote,
&[(vote_slot, vote_hash)],
epoch,
);
if result.is_err() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

maybe just bubble up the error from process_vote_unchecked and error log here?

epoch,
);
if result.is_err() {
warn!(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could be as strong as error!

vote_slot, vote_hash, result
);
}
self.update_last_vote_from_vote_state(vote_hash);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we set this properly when we ingest a new on chain tower in ReplayStage::compute_bank_stats(), is this still necessary here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We still need it because we are recording a new vote here. this takes care of this block we were doing previously:

new_vote.set_timestamp(self.maybe_timestamp(self.last_voted_slot().unwrap_or_default()));
self.last_vote = new_vote;

Comment on lines +3079 to +3080
// If our local root is higher than the highest slot in `bank_vote_state` due to
// supermajority roots, then it's expected that the vote state will be empty.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not exactly clear to me how supermajority roots cause tower to be empty

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Anything that artificially increases the root could trigger this. I think snapshot is the best example, but I think supermajority roots could also cause this:

let _ = bank_forks.write().unwrap().set_root(
root,
accounts_background_request_sender,
None,
);

Comment on lines +3081 to +3083
// In this case we use the root as our last vote. This root cannot be None, because
// `tower.vote_state.last_voted_slot()` is None only if `tower.vote_state.root_slot`
// is Some.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

where is this logic guaranteed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

None <= None. If tower.vote_state.last_voted_slot() is None, this means bank_vote_state.last_voted_slot() must have originally been Some in order to pass this check:

if bank_vote_state.last_voted_slot()
                                > tower.vote_state.last_voted_slot()

That means it must have been adjusted here:

bank_vote_state .votes .retain(|lockout| lockout.slot() > local_root);

For it to have been adjusted, it must have passed this check:

if let Some(local_root) = tower.vote_state.root_slot {

So it follows that iff tower.vote_state.root_slot is Some then tower.vote_state.last_voted_slot() is None

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ah yeah, i missed the logic where we adjust the root and filter out slots less than the root

@@ -3074,6 +3074,37 @@ impl ReplayStage {

tower.vote_state.root_slot = bank_vote_state.root_slot;
tower.vote_state.votes = bank_vote_state.votes;

let last_voted_slot = tower.vote_state.last_voted_slot().unwrap_or(
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

doesn't this have to be Some because we checked

if bank_vote_state.last_voted_slot() > tower.vote_state.last_voted_slot()

above, so there must be at least some vote in bank_vote_state.last_voted_slot()

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

not necessarily, see here:

bank_vote_state
    .votes
    .retain(|lockout| lockout.slot() > local_root);

@AshwinSekar AshwinSekar added v1.14 v1.16 PRs that should be backported to v1.16 labels Aug 23, 2023
Comment on lines +3104 to +3106
progress
.get_hash(last_voted_slot)
.expect("Must exist for us to have frozen descendant"),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just wanted to make sure root bank exists in bank forks even if we load from a snapshot, I think it does because of bank_from_snapshot_archives()

@AshwinSekar AshwinSekar merged commit 329c6f1 into solana-labs:master Aug 23, 2023
18 checks passed
mergify bot pushed a commit that referenced this pull request Aug 23, 2023
* tower: when syncing from vote state, update last_vote

* pr: bubble error through unchecked

(cherry picked from commit 329c6f1)

# Conflicts:
#	core/src/consensus.rs
#	programs/vote/src/vote_state/mod.rs
mergify bot pushed a commit that referenced this pull request Aug 23, 2023
* tower: when syncing from vote state, update last_vote

* pr: bubble error through unchecked

(cherry picked from commit 329c6f1)

# Conflicts:
#	programs/vote/src/vote_state/mod.rs
AshwinSekar added a commit that referenced this pull request Aug 24, 2023
…t of #32944) (#32960)

* tower: when syncing from vote state, update last_vote (#32944)

* tower: when syncing from vote state, update last_vote

* pr: bubble error through unchecked

(cherry picked from commit 329c6f1)

# Conflicts:
#	programs/vote/src/vote_state/mod.rs

* fix conflicts

* fix bad merge

---------

Co-authored-by: Ashwin Sekar <[email protected]>
AshwinSekar added a commit that referenced this pull request Aug 24, 2023
…t of #32944) (#32959)

* tower: when syncing from vote state, update last_vote (#32944)

* tower: when syncing from vote state, update last_vote

* pr: bubble error through unchecked

(cherry picked from commit 329c6f1)

# Conflicts:
#	core/src/consensus.rs
#	programs/vote/src/vote_state/mod.rs

* Fix conflicts

---------

Co-authored-by: Ashwin Sekar <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.16 PRs that should be backported to v1.16
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Panic in core/src/heaviest_subtree_fork_choice.rs, thread solReplayStage
2 participants