Hash mismatch on v1.5 in accounts simulation #14948
Comments
This actually only seems to be happening on the node that forked off...interesting. The check that is failing is the
I believe the above is a corrupted storage entry, as these appear multiple times, just usually they are all default so do not trigger the check to fail, looking like this:
Note these are not the system account, which looks like:
This leads me to believe that some of the AppendVecs are not being stored properly or are corrupted; maybe I've introduced a bug in |
So in both of the available snapshots, the corrupted AppendVec is for slot
The first ~70 accounts in the AppendVec are all readable, and then it becomes a bunch of the above garbage... Given the mismatch popped up almost |
I can confirm this occurred on devnet overnight:
|
On |
ooft, thanks for the heads up. I'll add more logging to both the 1.4 and 1.5 nodes and hopefully it'll repro. Was |
Within the last week or so when I rolled I'm collecting all the logs/snapshots/ledgers from the machines and will share details here soon |
Ok logs are in the
to observe how |
min ledger: |
This is one of 2 instructions in the slot 34795614, the other is a vote transaction. The missing data length is 8403 = 169775 - 161372 and this transaction does two CreateAccount for 8403 bytes of space total. |
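The size arithmetic above can be checked mechanically. The following is a minimal sketch, not code from the validator: the function names are hypothetical, and the thread only states the two lengths and the 8403-byte total, not how the space is split between the two CreateAccount instructions.

```rust
/// Bytes missing from the shorter data relative to the full data.
/// Returns None if the "short" length is actually the larger one.
/// (Hypothetical helper, named for illustration only.)
fn missing_len(full_len: u64, short_len: u64) -> Option<u64> {
    full_len.checked_sub(short_len)
}

/// True if the total space requested by a set of CreateAccount
/// instructions accounts for the whole gap.
/// (Hypothetical helper; the per-account sizes are inputs, not known values.)
fn gap_explained_by(gap: u64, create_account_spaces: &[u64]) -> bool {
    create_account_spaces.iter().sum::<u64>() == gap
}
```

With the lengths quoted in the comment, `missing_len(169775, 161372)` yields `Some(8403)`, which matches the total space the two CreateAccount instructions are reported to request.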
|
@sakridge I think it's as we suspected, it's some sort of InstructionError. I simulated the instruction error on that transaction, here's a familiar looking result:
|
Looks like a difference with JIT: This is the status with bpf_jit enabled:
|
Oh! @Lichtso, can you please plumb the |
The devnet occurrence and the original testnet occurrence are likely two different issues. I filed #15175 for the JIT issue specifically. I propose we move that debugging there. |
@sakridge At first look, that transaction doesn't seem to have anything to do with the JIT. It has three instructions: the first creates an account for key 1, the second creates an account for key 2, and the third references keys 1, 2, 3 (which does not exist on devnet), and 4 (rent), but also references a program account (key 6) that does not exist (and is therefore not executable either). Maybe the JIT is having some side effect (memory scribbler?) that is leading to corruption? |
Replaying ledger, looks like those accounts are not found by the explorer but are in the snapshot |
@Lichtso Looks like there is a divide by zero that happens only with the jit:
Besides the spelling error (fixed), the pc that the error references also looks bogus (140603453572235). I plan to debug further tonight but could use your eyes too, since I probably won't have much time to spend on it tomorrow. To recreate:
ledger-tool verify will re-run the transaction, you can add debug messages, etc... |
I believe I've hit these before. Here's a small reproducer that might help with debugging:
.text
.globl entry
.p2align 3
func:
r0 = 0
r0 = 0
r0 = 0
.byte 0x18, 0x72, 0x72, 0x74, 0x00, 0x00, 0x00, 0x00
entry:
.byte 0x00, 0x38, 0x07, 0x00, 0x00, 0x00, 0x00, 0x00
This'll leak (non-deterministic) host pointers into the pc value:
⋊> ~/_/_/s/cli on main ⨯ cargo run --release -- --use jit -e ../out.so
Result: Err(UnsupportedInstruction(140729161669453))
I'm not quite sure it's the original issue though, as it should only affect the program log, not the actual chain state. |
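One way to flag a reported pc as bogus is to compare it against the number of instruction slots in the program. The sketch below assumes the pc is reported as an instruction index and that every eBPF instruction slot is 8 bytes; `pc_in_bounds` is a hypothetical helper, not part of rbpf, and any fixed offset the VM adds when reporting errors is ignored here.

```rust
/// Size of one eBPF instruction slot in bytes.
const INSN_SIZE: usize = 8;

/// True if `pc` could index an instruction slot in a text section of
/// `text_len` bytes. (Hypothetical helper, for illustration only.)
fn pc_in_bounds(pc: u64, text_len: usize) -> bool {
    pc < (text_len / INSN_SIZE) as u64
}
```

The reproducer above has five 8-byte slots (40 bytes), so values such as 140729161669453 or 140603453572235 fail this check by many orders of magnitude, which is consistent with them being leaked host addresses rather than program counters.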
@Mrmaxmeier You've seen this issue on non-jit builds? |
@Lichtso
|
@jackcmay I think all three of these effects are caused by one thing: An invalid jump.
|
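To illustrate what an invalid jump or entry target can look like, here is a minimal sketch of a scan that marks which 8-byte slots begin an instruction. It is an assumption-laden illustration, not the verifier that rbpf ships: the function names are hypothetical, and the only encoding fact relied on is that opcode 0x18 (lddw, the 64-bit immediate load) occupies two consecutive slots.

```rust
/// Size of one eBPF instruction slot in bytes.
const INSN_SIZE: usize = 8;
/// Opcode of lddw, which occupies two consecutive slots.
const LD_DW_IMM: u8 = 0x18;

/// For each 8-byte slot, reports whether an instruction starts there.
/// The second slot of an lddw is marked false.
/// (Hypothetical helper, for illustration only.)
fn instruction_starts(text: &[u8]) -> Vec<bool> {
    let slots = text.len() / INSN_SIZE;
    let mut starts = vec![false; slots];
    let mut i = 0;
    while i < slots {
        starts[i] = true;
        // lddw consumes the following slot as the upper half of its immediate.
        i += if text[i * INSN_SIZE] == LD_DW_IMM { 2 } else { 1 };
    }
    starts
}

/// True if `target` is in bounds and lands on the start of an instruction.
fn is_valid_target(text: &[u8], target: usize) -> bool {
    instruction_starts(text).get(target).copied().unwrap_or(false)
}
```

Under this reading, the `entry` label in the reproducer earlier in the thread sits in the slot directly after the bytes beginning with 0x18, so a scan like this would classify it as the second half of an lddw and reject it as a target.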
@Mrmaxmeier The last inconsistency you found should be fixed with this: I also updated the CLI tool so that you can run the verifier on the executables. |
@Lichtso - hey can we get a new rbpf crate shipped with the fix and into the v1.5 branch early next week? |
Problem
The bootstrap leader seems to have encountered an erroneous hash, causing it to fork off during the accounts migration tests.
Proposed Solution
Debug and fix.
The first obstacle seems to be that the snapshots generated during the test run into
'Load from snapshot failed: Serialize(Io(Custom { kind: Other, error: "incorrect layout/length/data" }))'
when trying to boot from them. This may somehow be related to the update_accounts_hash() being commented out in AccountsBackgroundService. Trying to see if I can salvage these to avoid replaying from genesis.
@sakridge @CriesofCarrots @ryoqun