Save/restore Tower #9902
Conversation
Codecov Report
@@           Coverage Diff            @@
##           master    #9902    +/-   ##
=========================================
  Coverage    81.5%    81.5%
=========================================
  Files         288      280       -8
  Lines       66475    65919     -556
=========================================
- Hits        54198    53775     -423
+ Misses      12277    12144     -133
Force-pushed 5582bed to fe5c230
Force-pushed 26d9df3 to 0f02ffa
I think this is ready for review
LGTM! Thanks for dragging this one over the line 🙏
error!("Tower restore failed: {:?}", err); | ||
process::exit(1); | ||
} | ||
info!("Rebuilding tower from the latest vote account"); |
@t-nelson, do you remember what the original design was for the case where tower restoration failed? The danger with restoring from the latest vote account is that it could be missing some votes that have been submitted to the network but have not landed in any bank.
I think we said we should either signal to the user to make a new vote account and eat the warmup/cooldown for staking, or have them accept the risk of potentially submitting conflicting votes.
Ultimately I think it comes down to what the slashing design is (which we don't have finalized, so it's hard to reason about this). For instance, if we decide slashing can only occur for votes that land in a bank, and that all votes before the root are not slashable (not sure if this is a reasonable assumption), and you have a reasonable guess for the range of your last vote, maybe you can just wait for a root to be made that is sufficiently far past your estimated last vote... just food for thought
Hmm... Yeah, there should definitely be user intervention here (at least by the time slashing is enabled). It is behind a CLI flag. Maybe it'd be sufficient to reverse its default behavior? That is, require the tower by default.
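For example, roughly this (sketch only, not from this diff; the inverted flag name is just illustrative):

```rust
use clap::{App, Arg};

fn main() {
    // Sketch: make "require the tower" the default and let operators opt out
    // explicitly, instead of opting in. Flag name is illustrative only.
    let matches = App::new("solana-validator")
        .arg(
            Arg::with_name("no_require_tower")
                .long("no-require-tower")
                .takes_value(false)
                .help("Allow booting without a restored tower (accept the risk of conflicting votes)"),
        )
        .get_matches();

    // Tower restore is mandatory unless explicitly waived.
    let require_tower = !matches.is_present("no_require_tower");
    println!("require_tower = {}", require_tower);
}
```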
agreed, there should be some level of intervention/notification here
@@ -273,6 +293,14 @@ impl Tower {
        self.lockouts.root_slot
    }

    pub fn last_lockout_vote_slot(&self) -> Option<Slot> {
will this always just return the last item in self.lockouts.votes, since the votes are always ordered smallest to largest?
Yeah but the vote ordering doesn't matter since this implementation does a max_by()
oh I meant it seems like the implementation can be simplified to self.lockouts.votes.back().map(|v| v.slot)
Ah, right. I was scared to make that sneaky assumption. Feels like it could silently break on me in the future. Should I just find some courage? 🦁
hehe, true, but we could write a test that asserts this finds the largest!
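e.g. a sketch of that test (assuming Tower::default() and the record_vote(slot, hash) test helper are available in the consensus test module):

```rust
use solana_sdk::hash::Hash;

#[test]
fn test_last_lockout_vote_slot_returns_largest() {
    // Sketch: record votes in increasing slot order, the way replay does,
    // and assert that last_lockout_vote_slot() is the largest (i.e. last)
    // slot in the tower. If the ordering assumption ever breaks, this fails.
    let mut tower = Tower::default();
    for slot in 0..10 {
        tower.record_vote(slot, Hash::default());
    }
    assert_eq!(tower.last_lockout_vote_slot(), Some(9));
}
```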
@@ -432,6 +460,14 @@ impl Tower {
        bank_weights.pop().map(|b| b.2)
    }

    pub fn adjust_lockouts_if_newer_root(&mut self, root_slot: Slot) {
        let my_root_slot = self.lockouts.root_slot.unwrap_or(0);
        if root_slot > my_root_slot {
@t-nelson what happens in this case if there are votes in the tower that are still locked out past the snapshot root?

Aka you have a fork structure that looks like:

      0
     / \
    2   3

Most recently you voted for 2 in your tower, but your snapshot root is 3; you shouldn't vote until at least slot 4. I thought that was why you had to consult the large set of ancestors in the snapshot root to see if you're still locked out/when you can start voting, or have we ditched that design?
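For reference, the lockout arithmetic behind "at least slot 4" (self-contained sketch of the usual INITIAL_LOCKOUT = 2 doubling rule, not the actual Lockout type):

```rust
const INITIAL_LOCKOUT: u64 = 2;

/// A vote's lockout doubles with each confirmation:
///   lockout = INITIAL_LOCKOUT ^ confirmation_count
///   last_locked_out_slot = voted_slot + lockout
fn last_locked_out_slot(voted_slot: u64, confirmation_count: u32) -> u64 {
    voted_slot + INITIAL_LOCKOUT.pow(confirmation_count)
}

fn main() {
    // The freshest vote on slot 2 has confirmation_count 1, so the validator
    // is locked out of slot 3's fork until slot 2 + 2 = 4.
    assert_eq!(last_locked_out_slot(2, 1), 4);
}
```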
Yeah that sounds right. IIRC we decided to punt those changes to a later PR, being as it's a corner case of a corner case.
Crud, rebased and am now hitting an .unwrap() added by #9218
        let my_root_slot = self.lockouts.root_slot.unwrap_or(0);
        if root_slot > my_root_slot {
            self.lockouts.root_slot = Some(root_slot);
            self.lockouts.votes.retain(|v| v.slot > root_slot);
@t-nelson I think this part is why we need to consult the ancestors: what if the votes higher than the root_slot don't descend from root_slot here? It may not be that much of an edge case. Say the fork structure is:

      0
     / \
    2   3

I vote for slot 3, crash, and promptly boot back up. The rest of the cluster has rooted slot 2 and I get a snapshot for slot 2. Now my tower is 3, 2, which is inconsistent, as it's assumed the tower is all one fork (see giant comment below for a proposal to handle this).
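Rough sketch of the ancestor check I mean (free-standing, made-up signature; the ancestor map would come from the banks replayed out of the snapshot, and simply dropping the offending vote still leaves the lockout question open):

```rust
use std::collections::{HashMap, HashSet};

type Slot = u64;

/// Sketch: keep a tower vote only if it is newer than the new root AND is
/// known (via the ancestor map) to actually descend from it.
fn retain_votes_descending_from_root(
    vote_slots: &mut Vec<Slot>,
    root_slot: Slot,
    ancestors: &HashMap<Slot, HashSet<Slot>>,
) {
    vote_slots.retain(|&slot| {
        slot > root_slot
            && ancestors
                .get(&slot)
                .map(|a| a.contains(&root_slot))
                .unwrap_or(false)
    });
}

fn main() {
    // The example above: the tower voted on 3, the snapshot root is 2, and 3
    // is on the other fork (its only ancestor is 0), so the vote can't be kept.
    let mut vote_slots = vec![3];
    let mut ancestors: HashMap<Slot, HashSet<Slot>> = HashMap::new();
    ancestors.insert(3, [0].iter().copied().collect());
    retain_votes_descending_from_root(&mut vote_slots, 2, &ancestors);
    assert!(vote_slots.is_empty());
    // Dropping the vote silently is not the whole answer: the lockout on
    // slot 3 still forbids voting on 2's fork until slot 5.
}
```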
@t-nelson @mvines doh! The situation described above can occur, and does need to be handled. @aeyakovenko feel free to double check me here, but I think this is how it should be handled. The situation looks something like: your saved tower has 0, 2, 4, but the snapshot you are starting from has root 3. Because the blockstore processor will only play descendants of 3 on startup, how does this validator then rejoin the main fork at root 3?

When both conditions 1) and 2) are met, then the validator can finally vote on some descendant.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale pull request has been automatically closed. Thank you for your contributions.
@mvines thanks for opening this! Finally, I think I can get my hands on this...
    }

    // Given an untimely crash, tower may have roots that are not reflected in blockstore because
    // `ReplayState::handle_votable_bank()` saves tower before setting blockstore roots
@t-nelson @carllin (Hi! I'm taking over this PR to merge it this time, really, from @mvines.)

A bit of a dumb question: why are we saving this into a plain old file instead of rocksdb? I think we can just save this under rocksdb with a new column family and get atomicity for free with rocksdb's write batch: https://github.com/facebook/rocksdb/wiki/Column-Families#writebatch https://docs.rs/rocksdb/0.14.0/rocksdb/struct.Options.html#method.set_atomic_flush (we're using wal)

Also, "saves tower before setting blockstore roots" cannot be guaranteed because we currently aren't doing any fdatasync() or equivalent, considering a validator process crash or even an os-level crash. And a naive fdatasync(tower bin file) here would hurt performance. Also, this doesn't guarantee the write ordering between the plain old file and rocksdb: not just tower before blockstore, the opposite will be possible.

Overall, I think it's a lot easier if we just rely on rocksdb? I'd be glad to hear the reason if I'm missing something. :)
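To make that concrete, roughly this kind of thing (sketch against the rust rocksdb crate; the tower column family and key are made up, and the real change would add the column to the blockstore's existing set so it shares its write batches):

```rust
use rocksdb::{ColumnFamilyDescriptor, Options, WriteBatch, DB};

fn main() -> Result<(), rocksdb::Error> {
    let mut db_opts = Options::default();
    db_opts.create_if_missing(true);
    db_opts.create_missing_column_families(true);

    // Hypothetical extra column family holding the serialized tower.
    let cfs = vec![ColumnFamilyDescriptor::new("tower", Options::default())];
    let db = DB::open_cf_descriptors(&db_opts, "/tmp/rocksdb-tower-sketch", cfs)?;
    let tower_cf = db.cf_handle("tower").expect("column family was just created");

    let serialized_tower = vec![0u8; 32]; // stand-in for a bincode-serialized Tower

    // One write batch, so the tower update and (in the real blockstore) the
    // root-marking writes would commit atomically.
    let mut batch = WriteBatch::default();
    batch.put_cf(tower_cf, b"tower", &serialized_tower);
    db.write(batch)?;
    Ok(())
}
```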
@ryoqun storing in rocks should be fine, yeah. I think when @TristanDebrunner originally started this, there was an initiative to drop our RocksDB dependency, so we avoided leaning on it more
@ryoqun I think it's a durability issue. Votes are sensitive enough (slashing condition!) that we want to make sure the vote is persisted to disk before submitting the vote to the network. Even with a write batch + write-ahead log, Rocks buffers a lot of the writes in memory without guaranteeing they are immediately flushed.

Yeah, I think we should probably be fsync'ing the file in tower.save. It would be good to measure the fsync performance; if it's really bad we can pipeline it with the rest of the replay logic so we don't halt on replaying further blocks. We just need to have a queue of votes that are pending commit to disk and have not yet been submitted to the network.
This pull request has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs.
This stale pull request has been automatically closed. Thank you for your contributions.
Reboot of #7436
TODO:
context:
#6936