-
Notifications
You must be signed in to change notification settings - Fork 4.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix repair behavior concerning our own leader slots #30200
Conversation
4fc02ff
to
be83c95
Compare
be83c95
to
5ba4cd9
Compare
core/src/sigverify_shreds.rs
Outdated
let leader = leader_schedule_cache.slot_leader_at(slot, Some(bank))?; | ||
// Discard the shred if the slot leader is the node itself. | ||
(&leader != self_pubkey).then_some(leader) | ||
Some(leader) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this and the previous line can just be simplified to
leader_schedule_cache.slot_leader_at(slot, Some(bank))
how was this working before then? |
I don't think it was. While rare that leader would dump their own block, it seems they'd have to manually copy over or wipe and rely on snapshot to be able to restart. |
Wouldn't they need a new snapshot in the case of "ledger corruption/deletion" anyways? I think there were some reasons that sigverify was filtering out own's shred, and I am wondering if we can just keep it that way for now if it does not have much utility anyways. |
Yeah I think that's fine, the most important part is to panic before the dump occurs so that on downgrade and restart we still have the block and can play it correctly. The removal of the retransmission check was to cover any other cases where we don't have the block on restart which probably does not add that much utility. wdyt @carllin |
5ba4cd9
to
bd4d6be
Compare
Confirmed with carl that there is little utility in removing the circular transmission check as we now panic before we dump. This change will not modify the sigverify check. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
lgtm, but would be good to wait for Carl to stamp as well.
core/src/replay_stage.rs
Outdated
if let Some(leader_pubkey) = leader_schedule_cache.slot_leader_at(*duplicate_slot, None) { | ||
if leader_pubkey == *my_pubkey { | ||
panic!("We are attempting to dump a block that we produced. This indicates that we are producing duplicate blocks, or that there is a bug in our runtime/replay code causing us to compute different bank hashes than the rest of the cluster. We froze slot {} with hash {:?} while the cluster hash is {}", *duplicate_slot, frozen_hash, *correct_hash); | ||
} | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Instead of nested if
can you just do:
if Some(*my_pubkey) == leader_schedule_cache.slot_leader_at(*duplicate_slot, None) {
Can you please break the panic
line to shorter lines using the \
at line breaks?
similar to :
https://github.com/solana-labs/solana/blob/2b4a6a4dd/core/src/consensus.rs#L677
The panic description says "dump"; other places say "purge".
Would be good to use consistent terms throughout.
also the format arguments can be inlined:
https://rust-lang.github.io/rust-clippy/master/#uninlined_format_args
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The panic description says "dump"; other places say "purge".
Would be good to use consistent terms throughout.
I agree we do this quite often in both replay and repair. I fixed up some occurrences here and created #30225 to track cleanup for the rest. Will discuss with carl and wen on what the proper convention should be.
core/src/replay_stage.rs
Outdated
@@ -1172,6 +1175,8 @@ impl ReplayStage { | |||
poh_bank_slot: Option<Slot>, | |||
purge_repair_slot_counter: &mut PurgeRepairSlotCounter, | |||
dumped_slots_sender: &DumpedSlotsSender, | |||
my_pubkey: &Pubkey, | |||
leader_schedule_cache: &Arc<LeaderScheduleCache>, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This does not need to be Arc
, just &LeaderScheduleCache
bd4d6be
to
587d35f
Compare
panic when trying to dump & repair a block that we produced
Problem
We should not dump our own leader slots ever, this indicates that manual intervention is required.
Summary of Changes
Panic when dumping our own leader slots.
Remove the circular transmission check so that we can repair our own leader slots in case of ledger corruption/deletion on restart.
Fixes #30126 #30197