Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Hash mismatch on v1.5 in accounts simulation #14948

Closed
carllin opened this issue Jan 30, 2021 · 24 comments · Fixed by #15334
Closed

Hash mismatch on v1.5 in accounts simulation #14948

carllin opened this issue Jan 30, 2021 · 24 comments · Fixed by #15334
Assignees
Milestone

Comments

@carllin
Copy link
Contributor

carllin commented Jan 30, 2021

Problem

Bootstrap leader seems to have encountered an erroneous hash, causing it to fork off during the accounts migration tests

[2021-01-29T01:06:13.269096499Z INFO  solana_runtime::bank]
bank frozen: 181896 hash: 312MfFWZ9hpLWjaxbxZAAXiKdBE6dMwcHi3rghaPDN3F accounts_delta: GNBLwxAitac48gyceVwsdUUYNQYJV4AFaCKba7g7Rwjt signature_count: 10 last_blockhash: HFXB6GLYKH32nAhXeHUdVuqx5eCgFNYDJNx7wZiyAnb9 capitalization: 5500954163744417169

[2021-01-29T01:06:13.192369211Z INFO  solana_runtime::bank]
bank frozen: 181896 hash: Cq9w8kKAQsMe9YCwz75j5m2UzjEA5HEbKknQgNXLnbmq accounts_delta: A6Q4wAxkYcxo6eK3fe9q6FUfgSJ9nUwijVf2W7Fv1zRJ signature_count: 10 last_blockhash: HFXB6GLYKH32nAhXeHUdVuqx5eCgFNYDJNx7wZiyAnb9 capitalization: 5500954163744417169

Proposed Solution

Debug and fix.

The first obstacle seems to be that the snapshots generated during the test run into 'Load from snapshot failed: Serialize(Io(Custom { kind: Other, error: "incorrect layout/length/data" }))' when trying to boot from them. This may somehow be related to the update_accounts_hash() being commented out in AccountsBackgroundService. Trying to see if I can salvage these to avoid replaying from genesis.

@sakridge @CriesofCarrots @ryoqun

@carllin carllin added the v1.5 label Jan 30, 2021
@carllin carllin self-assigned this Jan 30, 2021
@carllin
Copy link
Contributor Author

carllin commented Jan 30, 2021

The first obstacle seems to be that the snapshots generated during the test run into 'Load from snapshot failed: Serialize(Io(Custom { kind: Other, error: "incorrect layout/length/data" }))' when trying to boot from them. This may somehow be related to the update_accounts_hash() being commented out in AccountsBackgroundService. Trying to see if I can salvage these to avoid replaying from genesis.

This actually only seems to be happening on the node that forked off...interesting. The check that is failing is the self.account_meta.lamports != 0 || self.clone_account() == Account::default() in append_vec::sanitize_lamports(). The "account" that's failing this check looks like:

StoredAccountMeta { meta: StoredMeta { write_version: 0, pubkey: 11111111111111111111111111111111, data_len: 0 }, account_meta: AccountMeta { lamports: 0, owner: 11111111111111111111111111111111, executable: false, rent_epoch: 4294967296 }, data: [], offset: 31192, stored_size: 136, hash: 1111111111111111VJBRrGXMNSx9EXYe92GTF5 }

I believe the above is a corrupted storage entry, as these appear multiple times, just usually they are all default so do not trigger the check to fail, looking like this:

System account: StoredAccountMeta { meta: StoredMeta { write_version: 0, pubkey: 11111111111111111111111111111111, data_len: 0 }, account_meta: AccountMeta { lamports: 0, owner: 11111111111111111111111111111111, executable: false, rent_epoch: 0 }, data: [], offset: 30240, stored_size: 136, hash: 11111111111111111111111111111111 }

Note these are not the system account, which looks like: StoredAccountMeta { meta: StoredMeta { write_version: 509, pubkey: 11111111111111111111111111111111, data_len: 14 }, account_meta: AccountMeta { lamports: 1, owner: NativeLoader1111111111111111111111111111111, executable: true, rent_epoch: 0 }, data: [115, 121, 115, 116, 101, 109, 95, 112, 114, 111, 103, 114, 97, 109], offset: 396536, stored_size: 152, hash: APNoNXDhUuQ8CLG8xRaQUUQxof1jsA8qLrhaQAM3NyJp }

This leads me to believe that some of the AppendVec's are not being stored properly/corrupted, maybe I've introduced a bug in append_accounts()...

@carllin
Copy link
Contributor Author

carllin commented Jan 30, 2021

So in both of the available snapshots, the corrupted AppendVec is for slot 174160, id: 331582

AppendVec { path: "/home/carl_solana_com/DebugConsensus/accounts/174160.331582", map: MmapMut { ptr: 0x7f0cd4ed2000, len: 286720 }, append_offset: Mutex { data: 285149 }, current_len: 285149, file_size: 286720, remove_on_drop: true }

The first ~70 accounts in the AppendVec are all readable, and then it becomes a bunch of the above garbage...

Given the mismatch popped up almost 7000 slots later in slot 181896, I'm guessing something must have corrupted the AppendVec, and then we read from the corrupted AppendVec. Maybe shrink or recycle. Would be interesting to see if all the accounts in that storage entry for that slot were actual accounts touched by transactions in that slot.

@mvines mvines added this to the v1.5.6 milestone Feb 5, 2021
@mvines
Copy link
Member

mvines commented Feb 5, 2021

I can confirm this occurred on devnet overnight:

da/dw both on v1.4:
bank frozen: 34795614 hash: aFdVTcLCPADoZRqUVkPTq4Rks8WF77a3kVvB4kAGTwF accounts_delta: DWx2PpTV4yJAY1s5YqknQEPwiruDU8MHSNKAdMxyqQhh signature_count: 4 last_blockhash: ELdRqmugxrpHNPXt7QgyUHY9UwNShsd5ta9KNL7KgD7k capitalization: 1010820021966907084

dv on v1.5:
bank frozen: 34795614 hash: 12UMLk4rZkkRRBrgPXDEaz8c85jBHzydqdER6j9M5TJR accounts_delta: DrsL2ZPEs9FNUhbMtNWFDXCaU9h75dkrfsB7hBXu8wDC signature_count: 4 last_blockhash: ELdRqmugxrpHNPXt7QgyUHY9UwNShsd5ta9KNL7KgD7k capitalization: 1010820021966907084

@mvines
Copy link
Member

mvines commented Feb 5, 2021

On dv, replaying slot 34795614 with solana-ledger-tool verify produces the correct hash (that da/dw had)

@carllin
Copy link
Contributor Author

carllin commented Feb 5, 2021

ooft, thanks for the heads up. I'll add more logging to both the 1.4 and 1.5 nodes and hopefully it'll repro. Was dv upgraded recently?

@mvines
Copy link
Member

mvines commented Feb 5, 2021

Within the last week or so when I rolled dw and da back to v1.4 while keeping dv on v1.5 to better simulate the current state of mainnet-beta.

I'm collecting all the logs/snapshots/ledgers from the machines and will share details here soon

@mvines
Copy link
Member

mvines commented Feb 5, 2021

Ok logs are in the ~/issue-14948 in dv/dw/da now. Also ~/issue-14948 on dv contains the full ledger (which is consistent across the machines) and you can run:

sol@validator-us-west1-b:~$ solana-ledger-tool --ledger issue-14948/ verify --halt-at-slot 34795615

to observe how dv using v1.5 calculates the correct bank hash with ledger-tool

@mvines mvines removed the v1.5 label Feb 5, 2021
@sakridge
Copy link
Member

sakridge commented Feb 5, 2021

min ledger: dv@~/issue-14948/min-ledger-14948

@sakridge
Copy link
Member

sakridge commented Feb 6, 2021

  Entry 61 - num_hashes: 1925, hashes: D8A1wu5bCLq4oExR1GY1K1TGRxg1nzSnHap8rbYmSThn, transactions: 1
    Transaction 0
      Recent Blockhash: 76dYvw5XE7guX7JUzDEatLxFxMyCEtZTCU2RQyBmFfTd
      Signature 0: 5wweTBQoMzwfwAheb8rxBW35eyydWkA9xZgJ28fPZqpjRDQ8st96jqzAa4Dot26b95KK6ePo3wn1Hadvuq9Q7eoF
      Signature 1: 5bV6Tg1JrbURFn5Z4LkdBQ3W4eZHLSvsnMBgzjN8GELAv1YZzzBdn74igEVeos9WFBSkZMq2RVL7ELsukT7ptQzv
      Signature 2: 6CZyTSKSUArmTHcEytj5Bgiib73zVpNqpKytwyN2HcygbA4VdqY2Jfw2hXDXmLSzQr9z4EHaWRk6N2tSwo9KvCA
      MessageHeader { num_required_signatures: 3, num_readonly_signed_accounts: 0, num_readonly_unsigned_accounts: 4 }
      Account 0: UXszRsAuGAYH9HHKBdRgx9E5ZU2fM2im8BeNUeyuZVR
      Account 1: 4mZ9cR7Tx9gKRtWnp1EySKye16uPXad2AEE754F8Rcp4
      Account 2: 7ePbDTiv3aXGt5W1V1gi6gdWJYM4WMvfYKNWYYcyHYgo
      Account 3: 42FuDW9zi4f2N2eDSkcejepUSGYkK1AmZbyQFAEed5vB
      Account 4: SysvarRent111111111111111111111111111111111
      Account 5: 11111111111111111111111111111111
      Account 6: DrDgTGU5CZehEvjEnSrVuELPKyQ1zkhe1WNbYMMCbH4r
      Instruction 0
        Program: 11111111111111111111111111111111 (5)
        Account 0: UXszRsAuGAYH9HHKBdRgx9E5ZU2fM2im8BeNUeyuZVR (0)
        Account 1: 4mZ9cR7Tx9gKRtWnp1EySKye16uPXad2AEE754F8Rcp4 (1)
        CreateAccount { lamports: 1955760, space: 153, owner: DrDgTGU5CZehEvjEnSrVuELPKyQ1zkhe1WNbYMMCbH4r }
      Instruction 1
        Program: 11111111111111111111111111111111 (5)
        Account 0: UXszRsAuGAYH9HHKBdRgx9E5ZU2fM2im8BeNUeyuZVR (0)
        Account 1: 7ePbDTiv3aXGt5W1V1gi6gdWJYM4WMvfYKNWYYcyHYgo (2)
        CreateAccount { lamports: 58310880, space: 8250, owner: DrDgTGU5CZehEvjEnSrVuELPKyQ1zkhe1WNbYMMCbH4r }
      Instruction 2
              Program: DrDgTGU5CZehEvjEnSrVuELPKyQ1zkhe1WNbYMMCbH4r (6)
        Account 0: 4mZ9cR7Tx9gKRtWnp1EySKye16uPXad2AEE754F8Rcp4 (1)
        Account 1: 7ePbDTiv3aXGt5W1V1gi6gdWJYM4WMvfYKNWYYcyHYgo (2)
        Account 2: 42FuDW9zi4f2N2eDSkcejepUSGYkK1AmZbyQFAEed5vB (3)
        Account 3: SysvarRent111111111111111111111111111111111 (4)
        Data: [197, 45, 4, 78, 201, 65, 227, 90, 60, 13, 86, 125, 70, 182, 243, 0, 64, 145, 106, 181, 101, 72, 164, 89, 201, 119, 245, 181, 174, 1
01, 210, 14, 167, 67, 207, 87, 132, 156, 177, 233, 7, 13, 152, 213, 6, 184, 151, 196, 41, 112, 199, 143, 212, 112, 182, 32, 144, 122, 136, 202, 62
, 103, 115, 216, 132, 210, 187, 138, 241, 109, 176, 240, 251, 4, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 170, 0, 0, 0]
      Status: Unavailable

This is one of 2 instructions in the slot 34795614, the other is a vote transaction.

The missing data length is 8403 = 169775 - 161372 and this transaction does two CreateAccount for 8403 bytes of space total.

@sakridge
Copy link
Member

sakridge commented Feb 6, 2021

load UXszRsAuGAYH9HHKBdRgx9E5ZU2fM2im8BeNUeyuZVR => 
Some((Account { lamports: 2997420724080 data.len: 0 owner: 11111111111111111111111111111111 
executable: false rent_epoch: 80 }, 34795611))

@carllin
Copy link
Contributor Author

carllin commented Feb 6, 2021

@sakridge I think it's as we suspected, it's some sort of InstructionError. I simulated the instruction error on that transaction, here's a familiar looking result:

[2021-02-06T03:50:32.194655519Z INFO  solana_runtime::bank] bank frozen: 34795614 hash: 12UMLk4rZkkRRBrgPXDEaz8c85jBHzydqdER6j9M5TJR accounts_delta: DrsL2ZPEs9FNUhbMtNWFDXCaU9h75dkrfsB7hBXu8wDC signature_count: 4 last_blockhash: ELdRqmugxrpHNPXt7QgyUHY9UwNShsd5ta9KNL7KgD7k capitalization: 1010820021966907084
[2021-02-06T03:50:32.194724499Z INFO  solana_runtime::bank] accounts hash slot: 34795614 stats: BankHashStats { num_updated_accounts: 14, num_removed_accounts: 0, num_lamports_stored: 5519723870721, total_data_len: 161372, num_executable_accounts: 0 }

@sakridge
Copy link
Member

sakridge commented Feb 6, 2021

Looks like a difference with JIT:

This is the status with bpf_jit enabled:

[2021-02-06T04:13:19.016641109Z INFO  solana_runtime::bank] 
slot: 34795614 result: Err(InstructionError(2, ProgramFailedToComplete)) 
tx: Transaction { 
signatures: [
5wweTBQoMzwfwAheb8rxBW35eyydWkA9xZgJ28fPZqpjRDQ8st96jqzAa4Dot26b95KK6ePo3wn1Hadvuq9Q7eoF, 5bV6Tg1JrbURFn5Z4LkdBQ3W4eZHLSvsnMBgzjN8GELAv1YZzzBdn74igEVeos9WFBSkZMq2RVL7ELsukT7ptQzv, 
6CZyTSKSUArmTHcEytj5Bgiib73zVpNqpKytwyN2HcygbA4VdqY2Jfw2hXDXmLSzQr9z4EHaWRk6N2tSwo9KvCA], 
message: 
Message { header: MessageHeader { num_required_signatures: 3, 
num_readonly_signed_accounts: 0, 
num_readonly_unsigned_accounts: 4 }, 
account_keys: [
UXszRsAuGAYH9HHKBdRgx9E5ZU2fM2im8BeNUeyuZVR, 
4mZ9cR7Tx9gKRtWnp1EySKye16uPXad2AEE754F8Rcp4, 
7ePbDTiv3aXGt5W1V1gi6gdWJYM4WMvfYKNWYYcyHYgo, 
42FuDW9zi4f2N2eDSkcejepUSGYkK1AmZbyQFAEed5vB, 
SysvarRent111111111111111111111111111111111, 
11111111111111111111111111111111, 
DrDgTGU5CZehEvjEnSrVuELPKyQ1zkhe1WNbYMMCbH4r], 
recent_blockhash: 76dYvw5XE7guX7JUzDEatLxFxMyCEtZTCU2RQyBmFfTd, 
instructions: [CompiledInstruction { 
program_id_index: 5, accounts: [0, 1], 
data: [0, 0, 0, 0, 176, 215, 29, 0, 0, 0, 0, 0, 153, 0, 0, 0, 0, 0, 0, 0, 190, 232, 10, 83, 77, 59, 142, 208, 168, 148, 182, 127,
113, 55, 52, 145, 132, 139, 231, 113, 16, 251, 76, 26, 227, 211, 122, 16, 24, 57, 215, 31] }, CompiledInstruction { program_id_index: 5, accounts
: [0, 2], data: [0, 0, 0, 0, 224, 192, 121, 3, 0, 0, 0, 0, 58, 32, 0, 0, 0, 0, 0, 0, 190, 232, 10, 83, 77, 59, 142, 208, 168, 148, 182, 127, 113,
55, 52, 145, 132, 139, 231, 113, 16, 251, 76, 26, 227, 211, 122, 16, 24, 57, 215, 31] }, CompiledInstruction { program_id_index: 6, accounts: [1,
2, 3, 4], data: [197, 45, 4, 78, 201, 65, 227, 90, 60, 13, 86, 125, 70, 182, 243, 0, 64, 145, 106, 181, 101, 72, 164, 89, 201, 119, 245, 181, 174,
101, 210, 14, 167, 67, 207, 87, 132, 156, 177, 233, 7, 13, 152, 213, 6, 184, 151, 196, 41, 112, 199, 143, 212, 112, 182, 32, 144, 122, 136, 202,
62, 103, 115, 216, 132, 210, 187, 138, 241, 109, 176, 240, 251, 4, 0, 0, 0, 0, 0, 0, 0, 2, 0, 0, 0, 0, 0, 0, 0, 170, 0, 0, 0] }] } }

cc @Lichtso @jackcmay

@mvines
Copy link
Member

mvines commented Feb 6, 2021

Oh! @Lichtso, can you please plumb the --bpf-jit flag into solana-ledger-tool verify .. as well

@mvines mvines changed the title Hash mismatch on v1.5 Hash mismatch on v1.5 (only with BPF JIT enabled?) Feb 6, 2021
@sakridge sakridge changed the title Hash mismatch on v1.5 (only with BPF JIT enabled?) Hash mismatch on v1.5 in accounts simulation Feb 6, 2021
@sakridge
Copy link
Member

sakridge commented Feb 6, 2021

The devnet occurance and the original testnet occurance are likely two different issues. I filed #15175 for the JIT issue specifically. Propose we move that debug to there.

@jackcmay
Copy link
Contributor

jackcmay commented Feb 6, 2021

@sakridge That transaction doesn't seem to have anything to do with jit from first look, three instructions, first one creates an account for key 1, second creates an account for key 2 and the third references key 1, 2, 3 (which does not exist on devnet), and 4 (rent) but references a program account (key 6) that does not exist (and therefore also not executable). Maybe jit is having some side effect (mem scribbler?) that is leading to corruption?

@jackcmay
Copy link
Contributor

jackcmay commented Feb 6, 2021

Replaying ledger, looks like those accounts are not found by the explorer but are in the snapshot

@jackcmay
Copy link
Contributor

jackcmay commented Feb 7, 2021

@Lichtso Looks like there is a divide by zero that happens only with the jit:

tx status: Err(InstructionError(2, ProgramFailedToComplete))
logs: [
"Program 11111111111111111111111111111111 invoke [1]",
"Program 11111111111111111111111111111111 success",
"Program 11111111111111111111111111111111 invoke [1]",
"Program 11111111111111111111111111111111 success", 
"Program DrDgTGU5CZehEvjEnSrVuELPKyQ1zkhe1WNbYMMCbH4r invoke [1]", 
"Program DrDgTGU5CZehEvjEnSrVuELPKyQ1zkhe1WNbYMMCbH4r consumed 200000 of 200000 compute units", 
"Program failed to complete: devide by zero at instruction 140603453572235", 
"Program DrDgTGU5CZehEvjEnSrVuELPKyQ1zkhe1WNbYMMCbH4r failed: Program failed to complete"]

(besides the spelling error (fixed), the pc that the error is referencing also looks bogus (140603453572235), I plan to debug further tonight but could use your eyes too, and I probably won't have much time to spend on it tomorrow.

To recreate:

  • pull the folder debug-14948 from ssh [email protected]
  • $ cd <solana>/ledger-tool
  • $ cargo run -- verify -l debug-14948 --skip-poh-verify

ledger-tool verify will re-run the transaction, you can add debug messages, etc...

@Mrmaxmeier
Copy link
Contributor

Mrmaxmeier commented Feb 7, 2021

The pc that the error is referencing also looks bogus (140603453572235), I plan to debug further tonight but could use your eyes too, and I probably won't have much time to spend on it tomorrow.

I believe I've hit these before. Here's a small reproducer that might help with debugging:

	.text
	.globl entry
	.p2align 3
func:
	r0 = 0
	r0 = 0
	r0 = 0
	.byte 0x18, 0x72, 0x72, 0x74, 0x00, 0x00, 0x00, 0x00
entry:
	.byte 0x00, 0x38, 0x07, 0x00, 0x00, 0x00, 0x00, 0x00

llvm-mc -triple bpf -filetype=obj -o out.o test.s && ld.lld -shared --entry=entry -o out.so out.o

This'll leak (non-deterministic) host pointers into the pc value:

⋊> ~/_/_/s/cli on main ⨯ cargo run --release -- --use jit -e ../out.so
Result: Err(UnsupportedInstruction(140729161669453))

I'm not quite sure it's the original issue though as it should only affect the program log, not the actual chain state.

out.so.gz

@jackcmay
Copy link
Contributor

jackcmay commented Feb 7, 2021

@Mrmaxmeier You've seen this issue on non-jit builds?

@jackcmay
Copy link
Contributor

jackcmay commented Feb 7, 2021

@Lichtso
Looks like we have at least three issues:

  • The failure is probably not a divide by zero issue, but instead a run-away program. The program consumed all of its compute budget. I bumped up the budget in testing and it again consumed all of the larger budget too and still reported a divide by zero
  • Reporting divide by zero when the issue is actually a compute budget exhaustion
  • instruction number reported in the divide by zero error (and possibly other errors (invalid instruction, ...)) is being reported incorrectly.

@Lichtso
Copy link
Contributor

Lichtso commented Feb 8, 2021

@jackcmay I think all three of these effects are caused by one thing: An invalid jump.

  • The jump ends up landing in e.g. the divide-by-zero exception handler, thus reporting that error.
  • As the compute budget is calculated using the now corrupted program-counter, that is incorrect as well.
  • And so is the reported program-counter value, as that needs to be set before running the exception handler, so it is just uninitialized.

@Lichtso
Copy link
Contributor

Lichtso commented Feb 9, 2021

@Mrmaxmeier The last inconsistency you found should be fixed with this:
https://github.com/solana-labs/rbpf/tree/fix/lddw_related_bugs

I also updated the CLI tool so that you can run the verifier on the executables.

@mvines mvines removed this from the v1.5.6 milestone Feb 10, 2021
@mvines mvines added this to the v1.5.7 milestone Feb 10, 2021
@Lichtso
Copy link
Contributor

Lichtso commented Feb 12, 2021

solana-labs/rbpf#140

@mvines
Copy link
Member

mvines commented Feb 13, 2021

@Lichtso - hey can we get a new rbpf crate shipped with the fix and into the v1.5 branch early next week?

@Lichtso Lichtso mentioned this issue Feb 15, 2021
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Mar 30, 2022
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants