Save/restore Tower #9902

Closed
wants to merge 2 commits into from
Conversation

mvines
Member

@mvines mvines commented May 6, 2020

Reboot of #7436

TODO:

context:
#6936

@codecov

codecov bot commented May 6, 2020

Codecov Report

Merging #9902 into master will increase coverage by 0.0%.
The diff coverage is 80.0%.

@@           Coverage Diff           @@
##           master   #9902    +/-   ##
=======================================
  Coverage    81.5%   81.5%            
=======================================
  Files         288     280     -8     
  Lines       66475   65919   -556     
=======================================
- Hits        54198   53775   -423     
+ Misses      12277   12144   -133     

@mvines mvines force-pushed the tower branch 2 times, most recently from 5582bed to fe5c230 on May 7, 2020 08:36
@mvines mvines added the CI Pull Request is ready to enter CI label May 7, 2020
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label May 7, 2020
@mvines mvines force-pushed the tower branch 5 times, most recently from 26d9df3 to 0f02ffa on May 8, 2020 17:10
@mvines mvines requested review from carllin and t-nelson May 8, 2020 22:14
@mvines
Member Author

mvines commented May 8, 2020

I think this is ready for review

t-nelson
t-nelson previously approved these changes May 8, 2020
Contributor

@t-nelson t-nelson left a comment

LGTM! Thanks for dragging this one over the line 🙏

error!("Tower restore failed: {:?}", err);
process::exit(1);
}
info!("Rebuilding tower from the latest vote account");
Contributor

@t-nelson, do you remember what the original design was for the case where tower restoration failed? The danger with restoring from the latest vote account is that it could be missing some votes that have been submitted onto the network, but have not landed in any bank.

I think we said we should either signal to the user to make a new vote account and eat the warmup/cooldown for staking, or have them accept the risk of potentially submitting conflicting votes.

Contributor

Ultimately I think it comes down to what the slashing design is (which we haven't finalized, so it's hard to reason about this). For instance, if we decide slashing can only occur for votes that land in a bank, that all votes before the root are not slashable (not sure if this is a reasonable assumption), and you have a reasonable guess for the range of your last vote, maybe you can just wait for a root to be made that is sufficiently far past your estimated last vote... just food for thought.

Contributor

Hmm... Yeah, there should definitely be user intervention here (at least by the time slashing is enabled). It is behind a CLI flag. Maybe it'd be sufficient to reverse its default behavior? That is, require the tower by default.

Contributor

agreed, there should be some level of intervention/notification here

@@ -273,6 +293,14 @@ impl Tower {
self.lockouts.root_slot
}

pub fn last_lockout_vote_slot(&self) -> Option<Slot> {
Contributor

@carllin carllin May 10, 2020

will this always just return the last item in self.lockouts.votes since the votes are always ordered smallest to largest?

Member Author

Yeah but the vote ordering doesn't matter since this implementation does a max_by()

Contributor

@carllin carllin May 12, 2020

oh I meant it seems like the implementation can be simplified to self.lockouts.votes.first().map(|v| v.slot)

Member Author

Ah, right. I was scared to make that sneaky assumption. Feels like it could silently break on me in the future. Should I just find some courage? 🦁

Contributor

hehe, true, but we could write a test that asserts this finds the largest!
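A minimal sketch of such a test, assuming Tower implements Default and exposes record_vote(slot, hash) as in the consensus.rs tests, plus the last_lockout_vote_slot() accessor from this PR (names and setup are illustrative, not the exact test added here; Hash is solana_sdk::hash::Hash):

    #[test]
    fn test_last_lockout_vote_slot_finds_largest() {
        // Sketch: record votes in increasing slot order, then assert that
        // last_lockout_vote_slot() reports the largest slot held in the tower.
        let mut tower = Tower::default();
        for slot in &[1u64, 5, 9, 42] {
            tower.record_vote(*slot, Hash::default());
        }
        assert_eq!(tower.last_lockout_vote_slot(), Some(42));
    }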

@@ -432,6 +460,14 @@ impl Tower {
bank_weights.pop().map(|b| b.2)
}

pub fn adjust_lockouts_if_newer_root(&mut self, root_slot: Slot) {
let my_root_slot = self.lockouts.root_slot.unwrap_or(0);
if root_slot > my_root_slot {
Contributor

@carllin carllin May 10, 2020

@t-nelson what happens in this case if there are votes in the tower that are still locked out past the snapshot root?

Aka you have a fork structure that looks like:

           0
         /   \
       2       3

Most recently you voted for 2 in your tower, but your snapshot root is 3, so you shouldn't vote until at least slot 4.

I thought that was why you had to consult the large set of ancestors in the snapshot root to see if you're still locked out/when you can start voting, or have we ditched that design?
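For reference, a sketch of the lockout arithmetic behind "you shouldn't vote until at least slot 4", assuming the tower's usual doubling lockouts (INITIAL_LOCKOUT = 2); the struct shape here is illustrative, not this PR's code:

    // Illustrative lockout entry: a vote's lockout doubles with each confirmation.
    struct Lockout {
        slot: u64,
        confirmation_count: u32,
    }

    impl Lockout {
        // Number of slots this vote is locked out for: 2^confirmation_count.
        fn lockout(&self) -> u64 {
            2u64.pow(self.confirmation_count)
        }
        // Last slot on which switching away from `slot` still violates lockout.
        fn last_locked_out_slot(&self) -> u64 {
            self.slot + self.lockout()
        }
    }

    // In the example: a fresh vote on slot 2 has confirmation_count 1, so it is
    // locked out on the conflicting fork containing 3 through slot 2 + 2 = 4.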

Contributor

Yeah that sounds right. IIRC we decided to punt those changes to a later PR, being as it's a corner case of a corner case.

@mvines
Member Author

mvines commented May 13, 2020

Crud, rebased and am now hitting an .unwrap() added by #9218

thread 'solana-replay-stage' panicked at 'called `Option::unwrap()` on a `None` value', core/src/consensus.rs:381:43

@mvines
Member Author

mvines commented May 13, 2020

@carllin - hey can you please check out 89642e2. test_snapshot_restart_tower now hits this unwrap when the validator restarts from a saved tower due to how Tower::adjust_lockouts_if_newer_root() retains any votes the validator previously made that are higher than the snapshot root

let my_root_slot = self.lockouts.root_slot.unwrap_or(0);
if root_slot > my_root_slot {
self.lockouts.root_slot = Some(root_slot);
self.lockouts.votes.retain(|v| v.slot > root_slot);
Contributor

@carllin carllin May 13, 2020

@t-nelson I think this part is why we need to consult the ancestors: what if the votes higher than root_slot don't descend from root_slot here? It may not be that much of an edge case. Say the fork structure looks like:

           0
         /   \
       2       3

I vote for slot 3, crash, and promptly boot back up. The rest of the cluster has rooted slot 2 and I get a snapshot for slot 2. Now my tower is 3, 2, which is inconsistent, as the tower is assumed to all be on one fork (see the giant comment below for a proposal to handle this).
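Purely to illustrate why the ancestors are needed here, a hedged sketch of an ancestry-aware retain that distinguishes votes descending from the new root from votes stranded on a sibling fork (reusing the illustrative Lockout shape sketched earlier; the ancestors map is an assumed slot-to-ancestor-set mapping like the one ReplayStage builds, and this is not, by itself, the recovery procedure proposed below):

    use std::collections::{HashMap, HashSet, VecDeque};

    // Keep only tower votes that are newer than the new root AND actually
    // descend from it; votes stranded on a sibling fork are dropped here (or,
    // per the proposal below, handled via lockouts and a switching proof).
    fn retain_descendant_votes(
        votes: &mut VecDeque<Lockout>,
        root_slot: u64,
        ancestors: &HashMap<u64, HashSet<u64>>, // slot -> its ancestor slots
    ) {
        votes.retain(|v| {
            v.slot > root_slot
                && ancestors
                    .get(&v.slot)
                    .map(|a| a.contains(&root_slot))
                    .unwrap_or(false)
        });
    }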

@carllin
Contributor

carllin commented May 13, 2020

@t-nelson @mvines doh! The only way that a slot X in the tower doesn't have ancestors when restarting from a snapshot for slot Y is if X doesn't descend from Y, right? Otherwise, on restart, the blockstore_processor logic should replay all descendants of Y, and thus if X was a descendant of Y, it should appear in BankForks and thus should be in the ancestors map.

But the situation described above can occur, and does need to be handled. @aeyakovenko feel free to double-check me here, but I think this is how it should be handled:

The situation looks something like:

                                         0 (saved tower root)
                                          |
                                          2 (saved tower vote)
                                      /         \
                       (snapshot root) 3          4 (saved tower vote)

Your saved tower has 0, 2, 4, but the snapshot you are starting from has root 3.

Because the blockstore processor will only play descendants of 3 on startup when generating BankForks, BankForks won't include bank 4, so there will be no ancestor information for your latest voted slot 4. Thus if on startup we see this case: 89642e2#diff-d848cba681535238849e7e33ed32702bR382, then we know the snapshot root is not an ancestor of the latest vote.

So then how does this validator rejoin the main fork at root 3 in a safe way?

  1. In order to maintain the tower's integrity, I don't think we can naively retain here: https://github.com/solana-labs/solana/pull/9902/files#diff-d848cba681535238849e7e33ed32702bR571, and set the root to the snapshot here: https://github.com/solana-labs/solana/pull/9902/files#diff-d848cba681535238849e7e33ed32702bR570. We need to first determine which vote in the saved tower was the latest ancestor of the snapshot slot; in the example above this would be 2. This ancestor can be found by consulting the chain of saved ancestors in the account state of the bank. Let everything greater than this ancestor in the tower be called the set G. G is the set of slots that is "locked out", i.e. lockouts would be violated if the validator were to switch forks. The validator should not vote on any slot less than max(g_i + lockout(g_i)) across all g_i in G (it should wait for such a slot).

  2. A valid switching proof is needed in order for this to happen, because the validator is switching forks from the last vote 4. In order to generate a valid switching proof, you need to see > SWITCH_FORK_THRESHOLD of the stake locked out on a slot X that is not an ancestor or descendant of 4, and the lockout(X) + X > 4. In this case the snapshot root 3 is an acceptable such X because we know by this point it's not an ancestor or descendant of 4, so we just need to wait until we see > SWITCH_FORK_THRESHOLD of the stake voting on 3 that are locked out past 4 before we can return true here: 89642e2#diff-d848cba681535238849e7e33ed32702bR382.
    To do this I think it's as simple as replacing the unwrap in:

        let last_vote_ancestors = ancestors.get(&last_vote).unwrap();

    with unwrap_or_default(). With that change, these checks should all be skipped:

        if switch_slot == *last_vote || switch_slot_ancestors.contains(last_vote) {
            // If the `switch_slot is a descendant of the last vote,
            // no switching proof is necessary
            return true;
        }

        // Should never consider switching to an ancestor
        // of your last vote
        assert!(!last_vote_ancestors.contains(&switch_slot));

    as they should all be true. This check here:

        for (_, value) in lockout_intervals.range((Included(last_vote), Unbounded)) {

    will then naturally only count lockouts that are past that latest vote (in this example, latest vote = 4) on descendants of 3 (as those descendants are the only ones being played anyway).

When both conditions 1) and 2) are met, the validator can finally vote on some descendant Y of the snapshot root. At this time, Y will be at the top of the tower, and any non-descendants of the snapshot root (the entire set G from 1)) will have been popped off. We can then set the root of the tower to the snapshot root (3 in this case), and the vote for Y will also write the tower to disk, thus committing and finalizing the recovery process.
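A compact sketch of the waiting rule in 1), computing the earliest slot the restarting validator may vote on from the locked-out set G (reusing the illustrative lockout() from earlier; a sketch of the described condition, not this PR's implementation):

    // G: tower votes newer than the latest tower vote that is an ancestor of
    // the snapshot root (in the example, G = {4} and the common ancestor is 2).
    // Per 1), the validator should not vote on any slot less than
    // max(g + lockout(g)) over all g in G.
    fn min_allowed_vote_slot(locked_out_set: &[Lockout]) -> u64 {
        locked_out_set
            .iter()
            .map(|g| g.slot + g.lockout())
            .max()
            .unwrap_or(0)
    }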

@stale

stale bot commented Jun 1, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 1, 2020
@ryoqun ryoqun self-assigned this Jun 2, 2020
@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 2, 2020
@stale

stale bot commented Jun 9, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 9, 2020
@stale

stale bot commented Jun 16, 2020

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Jun 16, 2020
@mvines mvines reopened this Jun 16, 2020
@ryoqun
Member

ryoqun commented Jun 16, 2020

@mvines thanks for opening this! Finally, I think I can get my hands on this...

@stale stale bot removed the stale [bot only] Added to stale content; results in auto-close after a week. label Jun 16, 2020
}

// Given an untimely crash, tower may have roots that are not reflected in blockstore because
// `ReplayState::handle_votable_bank()` saves tower before setting blockstore roots
Member

@t-nelson @carllin (Hi! I'm taking over this PR from @mvines to really get it merged this time.)

A bit of a dumb question: why are we saving this into a plain old file instead of rocksdb?
I think we can just save this under rocksdb with a new column family and get atomicity for free with rocksdb's write batch: https://github.com/facebook/rocksdb/wiki/Column-Families#writebatch https://docs.rs/rocksdb/0.14.0/rocksdb/struct.Options.html#method.set_atomic_flush (we're using the WAL).

Also, "saves tower before setting blockstore roots" cannot be guaranteed because we currently aren't doing any fdatasync() or equivalent, considering a validator process crash or even an OS-level crash.

And a naive fdatasync() of the tower bin file here would hurt performance. Also, this doesn't guarantee the write ordering between the plain old file and rocksdb: not just tower before blockstore, the opposite is also possible.

Overall, I think it's a lot easier if we just rely on rocksdb? I'm glad to hear the reason if I'm missing something. :)
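For concreteness, a sketch of what's being suggested with the rust rocksdb crate's column families and a WriteBatch (the "tower" column family and key are made up, and the call signatures follow the 0.14 docs linked above; error handling is elided):

    use rocksdb::{ColumnFamilyDescriptor, Options, WriteBatch, DB};

    fn save_tower_in_rocksdb(path: &str, serialized_tower: &[u8]) -> Result<(), rocksdb::Error> {
        let mut opts = Options::default();
        opts.create_if_missing(true);
        opts.create_missing_column_families(true);

        // Hypothetical "tower" column family alongside the blockstore's existing ones.
        let cf = ColumnFamilyDescriptor::new("tower", Options::default());
        let db = DB::open_cf_descriptors(&opts, path, vec![cf])?;
        let tower_cf = db.cf_handle("tower").expect("column family was just created");

        // All writes in a WriteBatch are applied atomically, which is the
        // atomicity being referred to above.
        let mut batch = WriteBatch::default();
        batch.put_cf(tower_cf, b"tower", serialized_tower)?;
        db.write(batch)
    }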

Contributor

@ryoqun storing in rocks should be fine, yeah. I think when @TristanDebrunner originally started this, there was an initiative to drop our RocksDB dependency, so we avoided leaning on it more.

Contributor

@carllin carllin Jun 18, 2020

@ryoqun I think it's a durability issue. Votes are sensitive enough (slashing condition!) that we want to make sure the vote is persisted to disk before submitting it to the network. Even with a write batch + write-ahead log, Rocks buffers a lot of the writes in memory without guaranteeing they are immediately flushed.

Yeah I think we should probably be fsync'ing the file in tower.save

It would be good to measure the fsync performance; if it's really bad, we can pipeline it with the rest of the replay logic so we don't halt on replaying further blocks. We'd just need a queue of votes that are pending commit to disk and have not yet been submitted to the network.
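A hedged sketch of the durable save being discussed: write to a temp file, fsync it, atomically rename over the previous tower file, then fsync the directory so the rename itself survives a crash (paths and the serialization step are placeholders, not tower.save()'s actual code):

    use std::fs::{self, File};
    use std::io::{self, Write};
    use std::path::Path;

    fn save_tower_durably(tower_path: &Path, serialized_tower: &[u8]) -> io::Result<()> {
        let tmp_path = tower_path.with_extension("tmp");

        // 1. Write the serialized tower to a temporary file and fsync it.
        let mut tmp = File::create(&tmp_path)?;
        tmp.write_all(serialized_tower)?;
        tmp.sync_all()?;

        // 2. Atomically replace the previous tower file.
        fs::rename(&tmp_path, tower_path)?;

        // 3. fsync the parent directory so the rename is durable (Unix-specific).
        if let Some(dir) = tower_path.parent() {
            File::open(dir)?.sync_all()?;
        }
        Ok(())
    }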

@stale

stale bot commented Jul 2, 2020

This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.

@stale stale bot added the stale [bot only] Added to stale content; results in auto-close after a week. label Jul 2, 2020
@stale

stale bot commented Jul 9, 2020

This stale pull request has been automatically closed. Thank you for your contributions.

@stale stale bot closed this Jul 9, 2020
Labels
stale [bot only] Added to stale content; results in auto-close after a week.

5 participants