
Fix repair behavior concerning our own leader slots #30200

Merged
merged 1 commit into solana-labs:master on Feb 9, 2023

Conversation

@AshwinSekar (Contributor) commented Feb 8, 2023

Problem

We should never dump our own leader slots; if we ever attempt to, it indicates that manual intervention is required.

Summary of Changes

- Panic when dumping our own leader slots (see the sketch below).
- Remove the circular transmission check so that we can repair our own leader slots in case of ledger corruption/deletion on restart.

Fixes #30126 #30197
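
For illustration, a minimal sketch of the panic guard described in the first bullet, using names (`my_pubkey`, `duplicate_slot`, `leader_schedule_cache`) that appear in the diffs quoted later in this conversation; this sketches the intent, not the exact merged code:

```rust
// Sketch: before dumping (purging) a frozen slot so it can be repaired,
// refuse to proceed if this node was the leader for that slot.
if Some(*my_pubkey) == leader_schedule_cache.slot_leader_at(*duplicate_slot, None) {
    panic!("attempting to dump our own leader slot {duplicate_slot}; manual intervention required");
}
// ...otherwise continue with the normal dump-and-repair path.
```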

@AshwinSekar changed the title from Dump leader panic to Panic when dumping our own leader slots Feb 8, 2023
@AshwinSekar changed the title from Panic when dumping our own leader slots to Fix repair behavior concerning our own leader slots Feb 8, 2023
@AshwinSekar marked this pull request as ready for review February 8, 2023 20:58
@AshwinSekar linked an issue Feb 8, 2023 that may be closed by this pull request
Comment on lines 146 to 147
```diff
 let leader = leader_schedule_cache.slot_leader_at(slot, Some(bank))?;
-// Discard the shred if the slot leader is the node itself.
-(&leader != self_pubkey).then_some(leader)
+Some(leader)
```
Contributor

I think this and the previous line can just be simplified to

```rust
leader_schedule_cache.slot_leader_at(slot, Some(bank))
```
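
For context, with the self-leader check gone, unwrapping with `?` only to re-wrap in `Some` is redundant, so the body collapses to a single call. A sketch, with a hypothetical enclosing function reconstructed from the quoted lines:

```rust
// Hypothetical signature, reconstructed from the quoted diff.
fn slot_leader(
    leader_schedule_cache: &LeaderScheduleCache,
    slot: Slot,
    bank: &Bank,
) -> Option<Pubkey> {
    // Before: `?` unwraps the Option only for `Some` to re-wrap it.
    // let leader = leader_schedule_cache.slot_leader_at(slot, Some(bank))?;
    // Some(leader)

    // After: forward the Option directly.
    leader_schedule_cache.slot_leader_at(slot, Some(bank))
}
```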

carllin previously approved these changes Feb 8, 2023
@behzadnouri (Contributor)

> Remove the circular transmission check so that we can repair our own leader slots in case of ledger corruption/deletion on restart.

How was this working before, then?

@AshwinSekar (Contributor, Author)

>> Remove the circular transmission check so that we can repair our own leader slots in case of ledger corruption/deletion on restart.
>
> How was this working before, then?

I don't think it was. While it's rare that a leader would dump their own block, it seems they'd have to manually copy the ledger over, or wipe it and rely on a snapshot, to be able to restart.

@behzadnouri (Contributor)

>>> Remove the circular transmission check so that we can repair our own leader slots in case of ledger corruption/deletion on restart.
>>
>> How was this working before, then?
>
> I don't think it was. While it's rare that a leader would dump their own block, it seems they'd have to manually copy the ledger over, or wipe it and rely on a snapshot, to be able to restart.

Wouldn't they need a new snapshot in the case of "ledger corruption/deletion" anyways?

I think there were some reasons that sigverify was filtering out one's own shreds, and I am wondering if we can just keep it that way for now if it does not have much utility anyway.

@AshwinSekar (Contributor, Author) commented Feb 8, 2023

Yeah, I think that's fine. The most important part is to panic before the dump occurs, so that on downgrade and restart we still have the block and can replay it correctly. Removing the retransmission check was meant to cover any other cases where we don't have the block on restart, which probably does not add much utility. wdyt @carllin

@AshwinSekar (Contributor, Author) commented Feb 8, 2023

Confirmed with Carl that there is little utility in removing the circular transmission check, as we now panic before we dump. This change will not modify the sigverify check.

behzadnouri previously approved these changes Feb 9, 2023

@behzadnouri (Contributor) left a comment


lgtm, but would be good to wait for Carl to stamp as well.

Comment on lines 1242 to 1248
```rust
if let Some(leader_pubkey) = leader_schedule_cache.slot_leader_at(*duplicate_slot, None) {
    if leader_pubkey == *my_pubkey {
        panic!("We are attempting to dump a block that we produced. This indicates that we are producing duplicate blocks, or that there is a bug in our runtime/replay code causing us to compute different bank hashes than the rest of the cluster. We froze slot {} with hash {:?} while the cluster hash is {}", *duplicate_slot, frozen_hash, *correct_hash);
    }
}
```
Contributor
Instead of the nested ifs, can you just do:

```rust
if Some(*my_pubkey) == leader_schedule_cache.slot_leader_at(*duplicate_slot, None) {
```

Also, can you please break the panic line into shorter lines using `\` at line breaks? Similar to:
https://github.com/solana-labs/solana/blob/2b4a6a4dd/core/src/consensus.rs#L677

The panic description says "dump"; other places say "purge".
Would be good to use consistent terms throughout.

Also, the format arguments can be inlined:
https://rust-lang.github.io/rust-clippy/master/#uninlined_format_args
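
Taken together, a sketch of what the three suggestions might look like once applied (variable names are taken from the quoted code; this is not necessarily the final merged version):

```rust
if Some(*my_pubkey) == leader_schedule_cache.slot_leader_at(*duplicate_slot, None) {
    // Long string literal broken with `\` continuations, and the
    // format arguments inlined per clippy::uninlined_format_args.
    panic!(
        "We are attempting to dump a block that we produced. This indicates \
         that we are producing duplicate blocks, or that there is a bug in \
         our runtime/replay code causing us to compute different bank hashes \
         than the rest of the cluster. We froze slot {duplicate_slot} with \
         hash {frozen_hash:?} while the cluster hash is {correct_hash}"
    );
}
```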

@AshwinSekar (Contributor, Author)

> The panic description says "dump"; other places say "purge".
> Would be good to use consistent terms throughout.

I agree; we do this quite often in both replay and repair. I fixed up some occurrences here and created #30225 to track cleanup for the rest. Will discuss with Carl and Wen on what the proper convention should be.

```diff
@@ -1172,6 +1175,8 @@ impl ReplayStage {
         poh_bank_slot: Option<Slot>,
         purge_repair_slot_counter: &mut PurgeRepairSlotCounter,
         dumped_slots_sender: &DumpedSlotsSender,
+        my_pubkey: &Pubkey,
+        leader_schedule_cache: &Arc<LeaderScheduleCache>,
```
Contributor

This does not need to be an `Arc`, just `&LeaderScheduleCache`.
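
For illustration, a sketch of the suggested parameter change; the enclosing function name and the other parameters are placeholders:

```rust
fn dump_then_repair(
    // ...other parameters elided...
    leader_schedule_cache: &LeaderScheduleCache, // was: &Arc<LeaderScheduleCache>
) {
    // Call sites that own an Arc<LeaderScheduleCache> can still pass
    // `&cache` unchanged: &Arc<T> coerces to &T via Deref.
    let _ = leader_schedule_cache;
}
```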

@AshwinSekar merged commit 67f6444 into solana-labs:master Feb 9, 2023
nickfrosty pushed a commit to nickfrosty/solana that referenced this pull request Mar 12, 2023
panic when trying to dump & repair a block that we produced