Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consensus (bank hash) divergence between v1.14. and master (mnb) #32889

Closed
steviez opened this issue Aug 18, 2023 · 15 comments
Closed

Consensus (bank hash) divergence between v1.14. and master (mnb) #32889

steviez opened this issue Aug 18, 2023 · 15 comments
Assignees

Comments

@steviez
Copy link
Contributor

steviez commented Aug 18, 2023

Problem

A consensus divergence was observed on the canary nodes running the tip of master against mnb. Sample log:

[2023-08-18T06:35:40.571950193Z WARN  ...]
Cluster duplicate confirmed slot 212112050 with hash 9FGFC7MDkVU3igJzk9uavzaFFRdxuDNpgpFrUCf3pjgh,
but our version has hash GoZ4MYBB21YiiVCHWZHzKaMebtvk1sFWBqpzT3nZwk6s

Running git-bisect with solana-ledger-tool, I was able to determine that #31291 causes the divergence. With 35ec7bf,

[... solana_runtime::bank]
bank frozen: 212112050 hash: GoZ4MYBB21YiiVCHWZHzKaMebtvk1sFWBqpzT3nZwk6s
accounts_delta: HVnSyWR9HYrhstNFfY76xK2S8Z3bRWnbYmqJeorjuTa2
signature_count: 1995
last_blockhash: CbHLuMAEFQdh7QZRvwgevKygZwEoB5Di9NJ3srboiCFK
capitalization: 555288903189881885,
stats: BankHashStats {
    num_updated_accounts: 7669,
    num_removed_accounts: 150,
    num_lamports_stored: 4668553361866345,
    total_data_len: 43475445,
    num_executable_accounts: 0
}

With d26e3ff (the parent of 35ec7bf), I got the correct hash:

[...  solana_runtime::bank] bank frozen: 212112050
hash: 9FGFC7MDkVU3igJzk9uavzaFFRdxuDNpgpFrUCf3pjgh
accounts_delta: HQZamW8eEtaF9Zdo4CEWTnvm67WUVs4gqayTffhfwKkx
signature_count: 1995
last_blockhash: CbHLuMAEFQdh7QZRvwgevKygZwEoB5Di9NJ3srboiCFK
capitalization: 555288903189881885,
stats: BankHashStats {
    num_updated_accounts: 7669,
    num_removed_accounts: 150,
    num_lamports_stored: 4668553361866345,
    total_data_len: 43475445,
    num_executable_accounts: 0
}

Proposed Solution

@steviez
Copy link
Contributor Author

steviez commented Aug 18, 2023

CC: @AshwinSekar @carllin @bji

@steviez
Copy link
Contributor Author

steviez commented Aug 18, 2023

Using the functionality introduced with #32632, I have debug files that are accessible here. Diffing these files, I see only account is different between the two:

Correct (from 9FGFC7MDkVU3igJzk9uavzaFFRdxuDNpgpFrUCf3pjgh file):

      "pubkey": "Fzrwo4KAmX7eGfQ5emmAHXs1hqoaiFXeH69P8LJNce5H",
      "hash": "6AozLvvt2bP8Ba3a9MGPcyUznVdYPLyLBsA1BUoSQLgF",
      "owner": "Vote111111111111111111111111111111111111111",
      "lamports": 30718862,
      "rent_epoch": 0,

Incorrect (from GoZ4MYBB21YiiVCHWZHzKaMebtvk1sFWBqpzT3nZwk6s file):

      "pubkey": "Fzrwo4KAmX7eGfQ5emmAHXs1hqoaiFXeH69P8LJNce5H",
      "hash": "FmNZaJAALnMfmpfDY7ie8TrrHkvvkznJGvxVFXjyBTgs",
      "owner": "Vote111111111111111111111111111111111111111",
      "lamports": 30718862,
      "rent_epoch": 0,

The files contain the account data as well (base64 encoded); can probably past these base64 strings into some rust code and decode into vote accounts to see what is different.

@bji
Copy link
Contributor

bji commented Aug 18, 2023

The files contain the account data as well (base64 encoded); can probably past these base64 strings into some rust code and decode into vote accounts to see what is different.

Thank you for taking these steps. I will decode the accounts and take a look.

@bji
Copy link
Contributor

bji commented Aug 18, 2023

The difference is in credits accounting; the 9FGFC7MDkVU3igJzk9uavzaFFRdxuDNpgpFrUCf3pjgh version has 607035 credits for that vote account, and the GoZ4MYBB21YiiVCHWZHzKaMebtvk1sFWBqpzT3nZwk6s version has 607034 credits.

@bji
Copy link
Contributor

bji commented Aug 18, 2023

I wonder - is it possible to get the most recent Vote or VoteStateUpdate instruction that preceeded the vote account snapshots for the vote account Fzrwo4KAmX7eGfQ5emmAHXs1hqoaiFXeH69P8LJNce5H?

Also the fact that the divergence was after the VoteStateUpdate feature was enabled, suggests to me that it's a difference in how credits are counted for VoteStateUpdate transactions.

@bji
Copy link
Contributor

bji commented Aug 18, 2023

Also I wonder if this is something that only happens on transition between VoteState built from Vote tx, to one built using VoteStateUpdate. The log line with the consensus divergence occurred just after the start of epoch 491, when the VoteStateUpdate feature had just become enabled.

@steviez
Copy link
Contributor Author

steviez commented Aug 18, 2023

Here is the transaction from 212112050 by the way:

      Version: legacy
      Recent Blockhash: AQV3Y3MXgWpXXaYXP1Ca7pGfa4UeShhwZZ6mPBKPAi1Q
      Signature 0: 4f1K9fevkEwui1bwd1H85Bq2r8CzcJCNZqvQe8PJZJM3qDwhzEsBWB9xtVZphMviHwYQ3UxCnLhiKsMdA8C9z5TG
      Signature 1: 4gyAHuvfVsNbxHVFkNd1yYQbc9WUebZarzGPDULN8pUNqksKGv1TeFfkqTsi32Dqo6X9xQxhkNywZNjDoBfNtozZ
      Account 0: srw- GuwejjjLXBcR31unxezBWvcjiGspba7spVUTUfcCwyu6 (fee payer)
      Account 1: sr-- BZZfR5nNzsUiXDLoFMf3gmv2GvNXNG8CRp61FwG4qMAe
      Account 2: -rw- Fzrwo4KAmX7eGfQ5emmAHXs1hqoaiFXeH69P8LJNce5H
      Account 3: -r-x Vote111111111111111111111111111111111111111
      Instruction 0
        Program:   Vote111111111111111111111111111111111111111 (3)
        Account 0: Fzrwo4KAmX7eGfQ5emmAHXs1hqoaiFXeH69P8LJNce5H (2)
        Account 1: BZZfR5nNzsUiXDLoFMf3gmv2GvNXNG8CRp61FwG4qMAe (1)
        CompactUpdateVoteState(VoteStateUpdate { lockouts: [
          Lockout { slot: 212111970, confirmation_count: 31 }, 
          Lockout { slot: 212111971, confirmation_count: 30 }, 
          Lockout { slot: 212111972, confirmation_count: 29 }, 
          Lockout { slot: 212111973, confirmation_count: 28 }, 
          Lockout { slot: 212111974, confirmation_count: 27 }, 
          Lockout { slot: 212111975, confirmation_count: 26 }, 
          Lockout { slot: 212111976, confirmation_count: 25 }, 
          Lockout { slot: 212111977, confirmation_count: 24 }, 
          Lockout { slot: 212111978, confirmation_count: 23 }, 
          Lockout { slot: 212111979, confirmation_count: 22 }, 
          Lockout { slot: 212111980, confirmation_count: 21 }, 
          Lockout { slot: 212111981, confirmation_count: 20 }, 
          Lockout { slot: 212111982, confirmation_count: 19 }, 
          Lockout { slot: 212111983, confirmation_count: 18 }, 
          Lockout { slot: 212111984, confirmation_count: 17 }, 
          Lockout { slot: 212111985, confirmation_count: 16 }, 
          Lockout { slot: 212111986, confirmation_count: 15 }, 
          Lockout { slot: 212111987, confirmation_count: 14 }, 
          Lockout { slot: 212111988, confirmation_count: 13 }, 
          Lockout { slot: 212111989, confirmation_count: 12 }, 
          Lockout { slot: 212111990, confirmation_count: 11 }, 
          Lockout { slot: 212111991, confirmation_count: 10 }, 
          Lockout { slot: 212111992, confirmation_count: 9 }, 
          Lockout { slot: 212111993, confirmation_count: 8 }, 
          Lockout { slot: 212111994, confirmation_count: 7 }, 
          Lockout { slot: 212111995, confirmation_count: 6 }, 
          Lockout { slot: 212111996, confirmation_count: 5 }, 
          Lockout { slot: 212111997, confirmation_count: 4 }, 
          Lockout { slot: 212111998, confirmation_count: 3 }, 
          Lockout { slot: 212111999, confirmation_count: 2 }, 
          Lockout { slot: 212112000, confirmation_count: 1 }],
        root: Some(212111969),
        hash: F4qq27pBxSyone6viKeoXFCxZ3YUANSuzimGoP1Vggze,
        timestamp: Some(1692340536) 
      })
      Status: Ok
        Fee: ◎0.00001
        Account 0 balance: ◎2.800891892 -> ◎2.800881892
        Account 1 balance: ◎0.07308136
        Account 2 balance: ◎0.030718862
        Account 3 balance: ◎0.000000001
      Compute Units Consumed: 0

@bji
Copy link
Contributor

bji commented Aug 18, 2023

Assuming that only the Lockouts, Root, and Credits are of importance for diagnosing this, here is the account data:

(It just confirms that the vote account tower state was updated from the contents of the VoteStateUpdate tx you referenced)

{
  "votes": [
    {
      "latency": 0,
      "lockout": {
        "slot": 212111970,
        "confirmation_count": 31
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111971,
        "confirmation_count": 30
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111972,
        "confirmation_count": 29
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111973,
        "confirmation_count": 28
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111974,
        "confirmation_count": 27
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111975,
        "confirmation_count": 26
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111976,
        "confirmation_count": 25
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111977,
        "confirmation_count": 24
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111978,
        "confirmation_count": 23
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111979,
        "confirmation_count": 22
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111980,
        "confirmation_count": 21
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111981,
        "confirmation_count": 20
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111982,
        "confirmation_count": 19
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111983,
        "confirmation_count": 18
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111984,
        "confirmation_count": 17
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111985,
        "confirmation_count": 16
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111986,
        "confirmation_count": 15
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111987,
        "confirmation_count": 14
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111988,
        "confirmation_count": 13
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111989,
        "confirmation_count": 12
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111990,
        "confirmation_count": 11
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111991,
        "confirmation_count": 10
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111992,
        "confirmation_count": 9
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111993,
        "confirmation_count": 8
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111994,
        "confirmation_count": 7
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111995,
        "confirmation_count": 6
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111996,
        "confirmation_count": 5
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111997,
        "confirmation_count": 4
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111998,
        "confirmation_count": 3
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212111999,
        "confirmation_count": 2
      }
    },
    {
      "latency": 0,
      "lockout": {
        "slot": 212112000,
        "confirmation_count": 1
      }
    }
  ],
  "root_slot": 212111969,
  "epoch_credits": [
    [
      489,
      278610,
      0
    ],
    [
      490,
      607003,
      278610
    ],
    [
      491,
      607035,
      607003
    ]
  ]
}

@bji
Copy link
Contributor

bji commented Aug 18, 2023

Need to see the vote account state prior to the VoteStateUpdate tx (and possibly the prior VoteStateUpdate tx as well, that put the vote account in that state).

Some more info: that vote account had no transactions in epoch 491 before the VoteStateUpdate that steviez referenced in slot 212112050. Therefore this was definitely a mismatch after the first VoteStateUpdate for this vote account.

@bji
Copy link
Contributor

bji commented Aug 18, 2023

I'm having a hard time finding any vote tx for this validator leading up to slot 212112050. Have gone all the way back to 212111960 (90 slots prior) without finding one yet.

@bji
Copy link
Contributor

bji commented Aug 18, 2023

Yeah the validator was delinquent for over 16,000 slots. So the edge case here seems to be when a VoteStateUpdate replaced the entirety of a vote state with new slots.

@bji
Copy link
Contributor

bji commented Aug 18, 2023

I think the error may be the change from this:

    // Count the number of slots at and before the new root within the current vote state lockouts.  Start with 1
    // for the new root.  The purpose of this is to know how many slots were rooted by this state update:
    // - The new root was rooted
    // - As were any slots that were in the current state but are not in the new state.  The only slots which
    //   can be in this set are those oldest slots in the current vote state that are not present in the
    //   new vote state; these have been "popped off the back" of the tower and thus represent finalized slots
    let mut finalized_slot_count = 1_u64;

    if let Some(new_root) = new_root {
        for current_vote in &vote_state.votes {
            // Find the first vote in the current vote state for a slot greater
            // than the new proposed root
            if current_vote.slot() <= new_root {
                current_vote_state_index = current_vote_state_index
                    .checked_add(1)
                    .expect("`current_vote_state_index` is bounded by `MAX_LOCKOUT_HISTORY` when processing new root");
                if current_vote.slot() != new_root {
                    finalized_slot_count = finalized_slot_count
                        .checked_add(1)
                        .expect("`finalized_slot_count` is bounded by `MAX_LOCKOUT_HISTORY` when processing new root");
                }
                continue;
            }

            break;
        }
    }

To this:

    // Accumulate credits earned by newly rooted slots.
    let mut earned_credits = 0_u64;

    if let Some(new_root) = new_root {
        for current_vote in &vote_state.votes {
            // Find the first vote in the current vote state for a slot greater
            // than the new proposed root
            if current_vote.slot() <= new_root {
                earned_credits = earned_credits
                    .checked_add(vote_state.credits_for_vote_at_index(current_vote_state_index))
                    .expect("`earned_credits` does not overflow");
                current_vote_state_index = current_vote_state_index
                    .checked_add(1)
                    .expect("`current_vote_state_index` is bounded by `MAX_LOCKOUT_HISTORY` when processing new root");
                continue;
            }

            break;
        }
    }

The if current_vote.slot() != new_root test was removed, which I guess would make an off-by-one error if the current_vote.slot() == new_root, where the old version of the code would not award a credit, but the new version would.

@bji
Copy link
Contributor

bji commented Aug 18, 2023

Sorry I have to run now but I think the bug is actually that the code prior to timely vote credits adds a credit when it shouldn't -- when the previous vote state had no slots and the new vote state has a new root. It would add 1 credit anyway (because finalized_slot_count is initialized to 1), even though no votes had been rooted.

The timely vote credits feature rewrites this logic and I believe properly adds 0 credits in this case, but this differs from the "normal" VoteStateUpdate path.

Still trying to work out the specifics of which situation causes a divergence in credits accounting before and after the change, but my strike thoughed text was wrong because if there is a new root, then a credit should be awarded.

@bji
Copy link
Contributor

bji commented Aug 19, 2023

OK I think I understand the problem.

Indeed the existing code is flawed, and the timely_vote_credits change fixes it.

What is happening is that when a validator is delinquent for a long time, then their first VoteStateUpdate after that long delinquency will include a new root, but it's not a slot that they ever voted on. The existing code (which came from the "one_credit_per_dequeue" feature) adds a credit just because there is a new root in the new vote state. But this is incorrect, as that new root was never voted on by the validator so should not earn a credit.

But the timely_vote_credits feature correctly only adds credits for slots from the old vote state that were "popped off of the tower" by the new vote, and doesn't add a credit just because there is a new root in the new vote state (although it would have added the credit if the new root were a slot that was actually voted on in the prior vote state; which is not the case in the circumstances leading to this bug).

I have written a test case which demonstrates this flaw:

bji@30b750c

I think that the way to deal with this is to fix the existing code to not have this flaw, which will require a new feature; and then once this feature is enabled, the timely vote credits feature can be re-enabled.

bji added a commit to bji/solana that referenced this issue Aug 19, 2023
bji added a commit to bji/solana that referenced this issue Aug 19, 2023
bji added a commit to bji/solana that referenced this issue Aug 20, 2023
bji added a commit to bji/solana that referenced this issue Aug 21, 2023
bji added a commit to bji/solana that referenced this issue Aug 23, 2023
bji added a commit to bji/solana that referenced this issue Aug 23, 2023
@steviez
Copy link
Contributor Author

steviez commented Sep 8, 2023

Closing this issue out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants