
Check poh_recorder.start_slot() hasn't been dumped previously before checking it in ProgressMap. #2676

Merged · 7 commits · Aug 22, 2024

Conversation

@wen-coding

Problem

It's possible that poh_recorder.start_slot() has been dumped because of a duplicate proof; in that case we would currently panic inside ProgressMap::get_propagated_stats_must_exist().

Summary of Changes

Check that start_slot still exists in ProgressMap before proceeding.

@AshwinSekar

Should check the rest of them as well:

```
1966:            progress_map.get_latest_leader_slot_must_exist(parent_slot)
1980:            .get_leader_propagation_slot_must_exist(parent_slot)
2086:                let latest_unconfirmed_leader_slot = progress_map.get_latest_leader_slot_must_exist(parent_slot)
3591:        if progress.get_leader_propagation_slot_must_exist(slot).0 {
3907:                    .get_leader_propagation_slot_must_exist(candidate_vote_bank.slot())
3982:        let mut current_leader_slot = progress.get_latest_leader_slot_must_exist(fork_tip);
4409:                        progress.get_latest_leader_slot_must_exist(*slot)
```

It looks like some of them might be safe, but good to double check.

Additionally it looks like we can get screwed here:

agave/core/src/replay_stage.rs

Lines 2037 to 2041 in 951bb04

```rust
let parent = bank_forks
    .read()
    .unwrap()
    .get(parent_slot)
    .expect("parent_slot doesn't exist in bank forks");
```

parent_slot is also the poh_recorder.start_slot() so it could have been dumped.

@wen-coding (Author)

> Should check the rest of them as well: […]
>
> Additionally it looks like we can get screwed here: […]
>
> parent_slot is also the poh_recorder.start_slot() so it could have been dumped.

I wanted to put each check in a different PR, is it better to put all of them in the same PR?

@AshwinSekar

> I wanted to put each check in a different PR, is it better to put all of them in the same PR?

one PR is probably fine, i imagine it'll be a hassle managing and backporting multiple with possible merge conflicts

@wen-coding (Author)

> one PR is probably fine, i imagine it'll be a hassle managing and backporting multiple with possible merge conflicts

That's fair, let me take a look at the other instances.

@wen-coding wen-coding marked this pull request as draft August 20, 2024 18:18
@wen-coding wen-coding marked this pull request as ready for review August 20, 2024 19:13
@wen-coding (Author)

> That's fair, let me take a look at the other instances.

I checked all the _must_exist calls in replay, and I think these are the only two places where we would be in trouble; I added comments in the other places.

core/src/replay_stage.rs — 3 review threads (outdated, resolved)
@AshwinSekar

overall looks good, added some more reviewers to get a second opinion

@bw-solana left a comment

LGTM

@AshwinSekar AshwinSekar added the v2.0 Backport to v2.0 branch label Aug 21, 2024

mergify bot commented Aug 21, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

Comment on lines +1312 to +1316
```rust
if !progress.contains(&start_slot) {
    warn!("Poh start slot {start_slot}, is missing from progress map. This indicates that we are in the middle of a dump and repair. Skipping retransmission of unpropagated leader slots");
    return;
}
```

@carllin (Aug 21, 2024)

hmm if we're about to dump the slot, shouldn't we be resetting PoH to an ancestor of that slot? @AshwinSekar

Seems like select_vote_and_reset_forks should be handling the reset to account for latest duplicate we reset to before we get here

@wen-coding (Author)

```
[2024-08-20T07:07:42.150220136Z WARN  solana_core::repair::cluster_slot_state_verifier] Cluster duplicate confirmed slot 288668257 with hash 2kQ6z6XvYiFc3hnw5PqhwREa9yDChBDqneVrjYvcabeu, but our version has hash AytETM4WSS9AheCCDLdiBwXVWkkmQY59EEZzdUKzDuBo
[2024-08-20T07:07:42.150226519Z INFO  solana_core::consensus::heaviest_subtree_fork_choice] marking fork starting at: (288668257, AytETM4WSS9AheCCDLdiBwXVWkkmQY59EEZzdUKzDuBo) invalid candidate
[2024-08-20T07:07:42.150244693Z INFO  solana_core::consensus::heaviest_subtree_fork_choice] Fork choice for (288668259, D4281472KYmdbJg8YFUHuJc59oGMTzJhNFeEUUtaAKuH) setting latest invalid ancestor from None to 288668257
[2024-08-20T07:07:42.150252709Z INFO  solana_core::consensus::heaviest_subtree_fork_choice] Fork choice for (288668258, Eq2j5UoAQFsJftEp7X5gtvF7MiykGznFySLT3mku3zwy) setting latest invalid ancestor from None to 288668257
[2024-08-20T07:07:42.150259652Z INFO  solana_core::consensus::heaviest_subtree_fork_choice] Fork choice for (288668257, AytETM4WSS9AheCCDLdiBwXVWkkmQY59EEZzdUKzDuBo) setting latest invalid ancestor from None to 288668257
[2024-08-20T07:07:42.162790246Z INFO  solana_core::replay_stage] 4mJV1MwoVG1Qztpwagnw396QGALvWQxbGXw5cChWKaJJ slot_weight: 288668260 0.0% 288668254
[2024-08-20T07:07:42.162816507Z INFO  solana_core::replay_stage] validator fork confirmed 288668225 21957ms
[2024-08-20T07:07:42.162821536Z INFO  solana_core::replay_stage] validator fork confirmed 288668219 25771ms
[2024-08-20T07:07:42.162825554Z INFO  solana_core::replay_stage] validator fork confirmed 288668238 13693ms
[2024-08-20T07:07:42.165829839Z INFO  solana_core::replay_stage] Waiting to switch vote to 288668260,
                    resetting to slot Some(288668259) for now,
                    switch proof stake: 0,
                    threshold stake: 106674872336469650,
                    total stake: 280723348253867482
```

It looks like in this case we shouldn't reset to 288668259, which links to an invalid ancestor?

@AshwinSekar (Aug 21, 2024)

dump_then_repair happens in between the reset and maybe_start_leader.
It does have a check to ensure that it doesn't dump a slot that is an ancestor of a working_bank we currently have, however that doesn't help if maybe_start_leader is just about to tick to our leader slot.

> Seems like select_vote_and_reset_forks should be handling the reset to account for latest duplicate we reset to before we get here

We do reset to the duplicate fork deepest descended from our last vote in the case of switching, such as the example wen posted above. To achieve what you're hinting at, we should modify dump_then_repair to reset to the heaviest block descended from the last vote that is not dumped (or in the case of lockout, from the switch slot), OR have select forks talk to the state machine in order to exclude blocks that could potentially be dumped.

I have little confidence in making a fork choice modification that we can backport to 2.0 at this time. I would prefer to hold off on starting leader until dump and repair has succeeded or we have reset to another bank. In the meantime we can make the fork choice change on master.

@carllin

hmm so the current order of events is:

  1. Ingest new duplicate and update state machine and thus fork choice in process_duplicate_slots()
  2. select_vote_and_reset_forks runs and picks the reset bank, which currently does not account for the new duplicates from 1?
  3. Reset PoH happens to a block that will get dumped
  4. dump and repair dumps block from progress map
  5. Panic happens in retransmit_latest_unpropagated_leader_slot

So the issue is that even though we've updated fork choice in 1, we're still picking an invalid fork as the reset bank in 2, right?

@carllin (Aug 21, 2024)

I thought select_vote_and_reset_forks in 2 already accounted for slots marked duplicate by 1 that should be marked as invalid in fork choice.

@wen-coding (Author)

As far as I can see, yes. We currently use heaviest_bank_on_same_voted_fork for the reset bank if the switch threshold fails, and we don't seem to set heaviest_bank_on_same_voted_fork to a valid block; instead we change to the deepest slot:
https://github.com/anza-xyz/agave/blob/master/core/src/consensus/heaviest_subtree_fork_choice.rs#L1138

Did I read that correctly?

@wen-coding wen-coding merged commit 053faa6 into anza-xyz:master Aug 22, 2024
41 checks passed
@wen-coding wen-coding deleted the replay_audit_must_exist branch August 22, 2024 00:12
mergify bot pushed a commit that referenced this pull request Aug 22, 2024
…checking it in ProgressMap. (#2676)

* Check poh_recorder.start_slot() hasn't been dumped previously before checking it in progress_map.

* Add more comments and put in checks for maybe_start_leader.

* Update core/src/replay_stage.rs

Co-authored-by: Ashwin Sekar <[email protected]>

* Use a slot which I am not leader to avoid dumping my own slot panic.

* Address reviewer comments.

* Address reviewer comments.

---------

Co-authored-by: Ashwin Sekar <[email protected]>
(cherry picked from commit 053faa6)
wen-coding added a commit that referenced this pull request Aug 22, 2024
…efore checking it in ProgressMap. (backport of #2676) (#2691)

(cherry picked from commit 053faa6)

Co-authored-by: Ashwin Sekar <[email protected]>
Co-authored-by: Wen <[email protected]>
ray-kast pushed a commit to abklabs/agave that referenced this pull request Nov 27, 2024
…checking it in ProgressMap. (anza-xyz#2676)
Labels: v2.0 Backport to v2.0 branch

Successfully merging this pull request may close these issues:
Fix panic in ProgressMap when the underlying block may have been dumped and repaired

4 participants