validator version 1.16.16 crashed with "unable to freeze a block with bankhash" #33740
Comments
@bji - Assuming you haven't completely wiped your ledger, can you please post the file that was created within If you have it, I'll replay the slot myself to generate a hopefully good file, and then compare mine vs yours to figure out what account(s) / transaction(s) might have caused the issue |
Happened again today, both primary and secondary failing with the same error:
|
I provided the entire ledger in the bug description. The original ledger is long since gone; it has to be completely deleted in order for the validator to start successfully in this situation. |
For completeness, just realized I misremembered and this feature did not get backported to v1.16 - sorry for the wild goose chase |
No problem - but is the complete ledger I linked to in the bug description not sufficient? Erm I guess it's 10,000 copied slots, not the whole ledger. Is it enough? |
The ledger you provided is sufficient for me to download and replay. However, replaying is not guaranteed to reproduce the failure (i.e. the majority of the cluster voted otherwise). The file I requested dumps the account state of every account written in that slot. So, even if we can't reproduce the failure, this file would allow me to see which account differed, and exactly how it differed. This is typically enough to work backwards to a specific TX, or to see whether it is a single bit flip or something similar |
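For anyone following along, the comparison described above can be sketched roughly as follows. This is a minimal illustration, not the actual tooling: it assumes each dump has already been parsed into a list of per-account records keyed by a `pubkey` field, and the other field names (`lamports`, `data`) are hypothetical stand-ins for whatever the real file contains.

```python
from typing import Any, Dict, List

Account = Dict[str, Any]  # hypothetical record, e.g. {"pubkey": ..., "lamports": ..., "data": ...}


def diff_accounts(reference: List[Account], failing: List[Account]) -> Dict[str, List[str]]:
    """Return, per pubkey, the names of the fields that differ between two dumps.

    `reference` is the dump from a correct replay and `failing` is the dump
    from the node that produced the bad hash. Accounts present in only one
    dump are reported with a short note instead of a field list.
    """
    ref_by_key = {acct["pubkey"]: acct for acct in reference}
    bad_by_key = {acct["pubkey"]: acct for acct in failing}

    report: Dict[str, List[str]] = {}
    for pubkey in sorted(ref_by_key.keys() | bad_by_key.keys()):
        ref = ref_by_key.get(pubkey)
        bad = bad_by_key.get(pubkey)
        if ref is None:
            report[pubkey] = ["only in failing dump"]
        elif bad is None:
            report[pubkey] = ["only in reference dump"]
        else:
            # Collect every field whose value is not identical in both dumps.
            changed = sorted(k for k in ref.keys() | bad.keys() if ref.get(k) != bad.get(k))
            if changed:
                report[pubkey] = changed
    return report
```

Accounts that match in every field are left out of the report, so an empty result means the two dumps agree on all written accounts.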
By single bit flip do you mean, a hardware error on the machine? This is impossible in this particular instance because two machines had the same error at the same time. There is definitely something that actually happened to create consensus deviation. |
Sure, I'm not trying to brush off your report as a hardware fault. Whether it is a single bit or a whole 10 MB account, the file from when your node failed + a regenerated file from correct execution allow me to diff the account state. You may recall I had similar files when we had to momentarily back out one of your PRs, and those files seemingly helped us home in on the issue much faster (I guess you could tell me since you did the in-depth debugging 😉). Without this extra level of detail, we are at the mercy of being able to reproduce the error reliably. But, given that the rest of the network deviated from how your node voted, I'm not optimistic we'd be able to reproduce. I'm pretty sure I pulled down those slots and attempted to reproduce a few weeks back, but I'm honestly not positive so I'll give it another go to be sure. And to confirm - are you still running |
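To make the "single bit or a whole account" distinction concrete, here is a small sketch of how two copies of an account's raw data could be compared bit by bit. It assumes the account data has already been decoded into `bytes` by some other means; the function name is made up for illustration and is not part of any Solana tool.

```python
def differing_bits(reference: bytes, failing: bytes) -> int:
    """Count the bits that differ between two equal-length byte strings."""
    if len(reference) != len(failing):
        # A length change is a structural difference, not a bit flip.
        raise ValueError("inputs differ in length; compare sizes first")
    # XOR leaves a 1 exactly where the two inputs disagree.
    return sum(bin(x ^ y).count("1") for x, y in zip(reference, failing))
```

A result of 1 would point toward a lone bit flip, while a large count suggests the two nodes genuinely computed different account contents.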
Sorry, I didn't mean to imply that you were trying to brush off the report. I just wanted to make sure that it was clear that this one is very unlikely to be a hardware fault. Not still running 1.16.16-jito. The most recent occurrence - a few hours ago - was on 1.16.18-jito. If you don't have time to do the state diff, I could do it with some hinting on the tooling I'll need to accomplish this task. Don't mean to put this issue all on you! |
Ha, well that comment brings us full circle to my earlier comment: So, whether there is any additional work to do boils down to whether the bad hash can be created with the ledger you provided. As mentioned, I thought I tried but I'm not positive, so I'll pull these down and give it a run |
Just happened again. I'll save the full ledger. |
For context, the original error:
I downloaded your ledger and replayed with several versions, namely:
All of these runs produced the correct hash. Given that you have seemingly been able to reproduce this problem several times, I would advise cherry-picking the commit that #32632 introduced onto whichever |
Thanks, much appreciated. Is it possible that there was something restricted just to an in-memory issue that never made it to the ledger itself? Anyway, I will cherry pick that change and run it from now on. Will let you know if this happens again. |
So I was unable to apply that change to a 1.16 branch, there were a lot of conflicts and I didn't know how to resolve them, having little experience with the accounts database code. However, my secondary is running 1.17.5 and that branch has your bank_hash_details change, so IF this problem occurs again, then:
|
Yes, absolutely. That's what's nice about that debug file - it does an accounts-db scan which will pull from memory or disk.
Ohh duh, sorry, I should have helped you with that 😬 . Let me see how bad the resolution looks
Good deal, and I agree with your logic here: if we see the failure again, we should be able to isolate |
Crash occurred again, but this time ONLY on my primary, which was running 1.16.19-jito. The secondary was running 1.17.5 (not JITO), and it did not crash. This is the first time I've run different software versions on the primary and secondary and the first time that the crash did not duplicate across the two. Therefore, it seems really likely that the issue is isolated to:
|
I am going to switch JITO off of the primary, and run just 1.16.19 there. If that doesn't reproduce the issue, then JITO may be the cause. |
I recently became aware of this bug on the Jito side: jito-foundation/jito-solana#449. However, that fix looks to have landed in v1.16.19-jito, so the fact that you saw the crash with .19 would rule that out as a suspect. Also, here is a branch where you can find the debug PR I had mentioned cherry-picked onto v1.16: |
I build my own version of the JITO patches and it's entirely possible that I was missing that fix. In fact it seems really likely! |
That being said, I did just reproduce the crash with a stock 1.16.19 not running JITO. So I am now switching my primary and secondary to 1.17.5 which seems to be impervious to this issue. |
Recall that |
this bug was only related to simulating bundles, which zan's validator most likely wasn't doing. we are looking into this issue on our side as well |
Most likely wasn't doing? Or definitely wasn't doing? |
I appreciate that, but thus far the only software that hasn't crashed on me in this way is 1.17.5. So I'm sticking with that on my primary. I'll run your patch on 1.16.19 on my secondary. |
it would only be doing this if you had your RPC ports open and people were calling simulate_bundle |
My RPC ports aren't open, they are firewalled to only a few systems that only I have access to. However, I do run my own relayer -- can these simulates be routed through that? |
no |
I've tried downloading the snapshot twice, and both times I get this error while inflating:
@bji Do you happen to still have this snapshot available? |
No sorry I only have what is hosted on amazon aws. The files are so big that it's possible that something failed after a very long upload almost completed and I didn't notice it. Sorry. |
Given that we have progressed to Also, v1.17 has some of the debugging features baked in (whereas we didn't backport them until later v1.16 releases) so we should be able to handle any reports in a more effective manner |
Problem
I am running a primary/secondary validator setup with two physically separate validators. Both validators were running 1.16.16-jito w/local vote mods (the vote mods are not believed to alter any aspect of accounts state, because the mods only alter which slots are voted on in replay_stage.rs).
Both primary and secondary crashed simultaneously with the same error message. This is two separate validators experiencing the same issue. Note that the secondary likely took the tower from the primary shortly before the primary crashed as is common behavior when a secondary's tower gets out of sync with the primary.
The log message on both validators was:
I have collected the following files from one of the validators:
snapshot:
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/snapshot-224107039-GTo6rcFciwNew8Bx9Hggv7n9oS5AZVeQGSXQMkaFfkfy.tar.zst
incremental snapshot:
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/incremental-snapshot-224107039-224116213-qg6B13UGAAMKDm6XqG6p6PchnZ7RAgjhbe759igiFce.tar.zst
ledger (copied 10,000 slots leading up to the crash):
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/ledger.tar.gz
validator logs leading up to the crash:
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/validator.log.gz