
Fix repair behavior concerning our own leader slots #30200

Merged
merged 1 commit into solana-labs:master on Feb 9, 2023

Conversation

@AshwinSekar (Contributor) commented Feb 8, 2023

Problem

We should never dump our own leader slots; if we ever attempt to, it indicates that manual intervention is required.

Summary of Changes

- Panic when dumping our own leader slots (see the sketch below).
- Remove the circular transmission check so that we can repair our own leader slots in case of ledger corruption/deletion on restart.

Fixes #30126 #30197
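
For illustration, a minimal sketch of the panic guard described in the first bullet, using names (`my_pubkey`, `duplicate_slot`, `leader_schedule_cache`) that appear in the diffs quoted later in this conversation; this sketches the intent, not the exact merged code:

```rust
// Sketch: before dumping (purging) a frozen slot so it can be repaired,
// refuse to proceed if this node was the leader for that slot.
if Some(*my_pubkey) == leader_schedule_cache.slot_leader_at(*duplicate_slot, None) {
    panic!("attempting to dump our own leader slot {duplicate_slot}; manual intervention required");
}
// ...otherwise continue with the normal dump-and-repair path.
```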

@AshwinSekar changed the title from Dump leader panic to Panic when dumping our own leader slots Feb 8, 2023
@AshwinSekar changed the title from Panic when dumping our own leader slots to Fix repair behavior concerning our own leader slots Feb 8, 2023
@AshwinSekar marked this pull request as ready for review February 8, 2023 20:58
@AshwinSekar linked an issue Feb 8, 2023 that may be closed by this pull request
Comment on lines 146 to 147
```diff
 let leader = leader_schedule_cache.slot_leader_at(slot, Some(bank))?;
-// Discard the shred if the slot leader is the node itself.
-(&leader != self_pubkey).then_some(leader)
+Some(leader)
```
Contributor

I think this and the previous line can just be simplified to

```rust
leader_schedule_cache.slot_leader_at(slot, Some(bank))
```
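
For context, with the self-leader check gone, unwrapping with `?` only to re-wrap in `Some` is redundant, so the body collapses to a single call. A sketch, with a hypothetical enclosing function reconstructed from the quoted lines:

```rust
// Hypothetical signature, reconstructed from the quoted diff.
fn slot_leader(
    leader_schedule_cache: &LeaderScheduleCache,
    slot: Slot,
    bank: &Bank,
) -> Option<Pubkey> {
    // Before: `?` unwraps the Option only for `Some` to re-wrap it.
    // let leader = leader_schedule_cache.slot_leader_at(slot, Some(bank))?;
    // Some(leader)

    // After: forward the Option directly.
    leader_schedule_cache.slot_leader_at(slot, Some(bank))
}
```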

carllin previously approved these changes Feb 8, 2023
@behzadnouri (Contributor)

> Remove the circular transmission check so that we can repair our own leader slots in case of ledger corruption/deletion on restart.

How was this working before, then?

@AshwinSekar (Contributor, Author)

>> Remove the circular transmission check so that we can repair our own leader slots in case of ledger corruption/deletion on restart.
>
> How was this working before, then?

I don't think it was. While it's rare that a leader would dump their own block, it seems they'd have to manually copy the ledger over, or wipe it and rely on a snapshot, to be able to restart.

@behzadnouri (Contributor)

>>> Remove the circular transmission check so that we can repair our own leader slots in case of ledger corruption/deletion on restart.
>>
>> How was this working before, then?
>
> I don't think it was. While it's rare that a leader would dump their own block, it seems they'd have to manually copy the ledger over, or wipe it and rely on a snapshot, to be able to restart.

Wouldn't they need a new snapshot in the case of "ledger corruption/deletion" anyways?

I think there were some reasons that sigverify was filtering out one's own shreds, and I am wondering if we can just keep it that way for now if it does not have much utility anyway.

@AshwinSekar (Contributor, Author) commented Feb 8, 2023

Yeah, I think that's fine. The most important part is to panic before the dump occurs, so that on downgrade and restart we still have the block and can replay it correctly. Removing the retransmission check was meant to cover any other cases where we don't have the block on restart, which probably does not add much utility. wdyt @carllin

@AshwinSekar (Contributor, Author) commented Feb 8, 2023

Confirmed with Carl that there is little utility in removing the circular transmission check, as we now panic before we dump. This change will not modify the sigverify check.

behzadnouri previously approved these changes Feb 9, 2023

@behzadnouri (Contributor) left a comment


lgtm, but would be good to wait for Carl to stamp as well.

Comment on lines 1242 to 1248
```rust
if let Some(leader_pubkey) = leader_schedule_cache.slot_leader_at(*duplicate_slot, None) {
    if leader_pubkey == *my_pubkey {
        panic!("We are attempting to dump a block that we produced. This indicates that we are producing duplicate blocks, or that there is a bug in our runtime/replay code causing us to compute different bank hashes than the rest of the cluster. We froze slot {} with hash {:?} while the cluster hash is {}", *duplicate_slot, frozen_hash, *correct_hash);
    }
}
```
Contributor
Instead of the nested ifs, can you just do:

```rust
if Some(*my_pubkey) == leader_schedule_cache.slot_leader_at(*duplicate_slot, None) {
```

Also, can you please break the panic line into shorter lines using `\` at line breaks? Similar to:
https://github.com/solana-labs/solana/blob/2b4a6a4dd/core/src/consensus.rs#L677

The panic description says "dump"; other places say "purge".
Would be good to use consistent terms throughout.

Also, the format arguments can be inlined:
https://rust-lang.github.io/rust-clippy/master/#uninlined_format_args
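
Taken together, a sketch of what the three suggestions might look like once applied (variable names are taken from the quoted code; this is not necessarily the final merged version):

```rust
if Some(*my_pubkey) == leader_schedule_cache.slot_leader_at(*duplicate_slot, None) {
    // Long string literal broken with `\` continuations, and the
    // format arguments inlined per clippy::uninlined_format_args.
    panic!(
        "We are attempting to dump a block that we produced. This indicates \
         that we are producing duplicate blocks, or that there is a bug in \
         our runtime/replay code causing us to compute different bank hashes \
         than the rest of the cluster. We froze slot {duplicate_slot} with \
         hash {frozen_hash:?} while the cluster hash is {correct_hash}"
    );
}
```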

@AshwinSekar (Contributor, Author)

> The panic description says "dump"; other places say "purge".
> Would be good to use consistent terms throughout.

I agree; we do this quite often in both replay and repair. I fixed up some occurrences here and created #30225 to track cleanup for the rest. Will discuss with Carl and Wen on what the proper convention should be.

```diff
@@ -1172,6 +1175,8 @@ impl ReplayStage {
         poh_bank_slot: Option<Slot>,
         purge_repair_slot_counter: &mut PurgeRepairSlotCounter,
         dumped_slots_sender: &DumpedSlotsSender,
+        my_pubkey: &Pubkey,
+        leader_schedule_cache: &Arc<LeaderScheduleCache>,
```
Contributor

This does not need to be an `Arc`, just `&LeaderScheduleCache`.
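
For illustration, a sketch of the suggested parameter change; the enclosing function name and the other parameters are placeholders:

```rust
fn dump_then_repair(
    // ...other parameters elided...
    leader_schedule_cache: &LeaderScheduleCache, // was: &Arc<LeaderScheduleCache>
) {
    // Call sites that own an Arc<LeaderScheduleCache> can still pass
    // `&cache` unchanged: &Arc<T> coerces to &T via Deref.
    let _ = leader_schedule_cache;
}
```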

@AshwinSekar merged commit 67f6444 into solana-labs:master Feb 9, 2023
nickfrosty pushed a commit to nickfrosty/solana that referenced this pull request Mar 12, 2023
panic when trying to dump & repair a block that we produced