This repository has been archived by the owner on Jan 22, 2025. It is now read-only.

validator version 1.16.16 crashed with "unable to freeze a block with bankhash" #33740

Closed
bji opened this issue Oct 17, 2023 · 31 comments
Labels
community Community contribution

Comments

@bji
Contributor

bji commented Oct 17, 2023

Problem

I am running a primary/secondary validator setup with two physically separate validators. Both validators were running 1.16.16-jito w/local vote mods (the vote mods are not believed to alter any aspect of accounts state, because the mods only alter which slots are voted on in replay_stage.rs).

Both primary and secondary crashed simultaneously with the same error message. This is two separate validators experiencing the same issue. Note that the secondary likely took the tower from the primary shortly before the primary crashed as is common behavior when a secondary's tower gets out of sync with the primary.

The log message on both validators was:

[2023-10-16T22:47:33.894214958Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at 'We have tried to repair duplicate slot: 224116920 more than 10 times and are unable to freeze a block with bankhash 38rFNeVT1FGHResoXsgTN2cyyjBiJd6ZFcXWCPnXzZn7, instead we have a block with bankhash Some(GVZoNMHn1o8CB76i3LLRHjTA1p8f4iyTa6hn9eopWJh1). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting', core/src/replay_stage.rs:1585:25" location="core/src/replay_stage.rs:1585:25" version="1.16.16 (src:00000000; feat:4033350765, client:JitoLabs)"
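For anyone triaging similar logs, the slot and the two bank hashes can be pulled out of this panic text mechanically. The following is a minimal Python sketch, not part of any validator tooling; the regular expression is an assumption based only on the wording of the log line above, and `parse_panic` is a hypothetical helper name.

```python
import re

# Pattern for the replay-stage panic text quoted above. The wording is taken
# from the log line in this report; other releases may phrase it differently.
PANIC_RE = re.compile(
    r"repair duplicate slot: (?P<slot>\d+) more than \d+ times "
    r"and are unable to freeze a block with bankhash (?P<expected>\w+), "
    r"instead we have a block with bankhash Some\((?P<actual>\w+)\)"
)


def parse_panic(line):
    """Return (slot, expected_hash, actual_hash), or None if the line does not match."""
    m = PANIC_RE.search(line)
    if m is None:
        return None
    return int(m.group("slot")), m.group("expected"), m.group("actual")


# Message text copied from the log line in this report.
sample = (
    "panicked at 'We have tried to repair duplicate slot: 224116920 more than "
    "10 times and are unable to freeze a block with bankhash "
    "38rFNeVT1FGHResoXsgTN2cyyjBiJd6ZFcXWCPnXzZn7, instead we have a block "
    "with bankhash Some(GVZoNMHn1o8CB76i3LLRHjTA1p8f4iyTa6hn9eopWJh1)."
)
print(parse_panic(sample))
```

The first hash is the one the validator was trying to freeze, and the second is the one it actually computed locally.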

I have collected the following files from one of the validators:

snapshot:
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/snapshot-224107039-GTo6rcFciwNew8Bx9Hggv7n9oS5AZVeQGSXQMkaFfkfy.tar.zst

incremental snapshot:
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/incremental-snapshot-224107039-224116213-qg6B13UGAAMKDm6XqG6p6PchnZ7RAgjhbe759igiFce.tar.zst

ledger (copied 10,000 slots leading up to the crash):
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/ledger.tar.gz

validator logs leading up to the crash:
https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/validator.log.gz

@bji bji added the community Community contribution label Oct 17, 2023
@steviez
Contributor

steviez commented Nov 2, 2023

@bji - Assuming you haven't completely wiped your ledger, can you please post the file that was created within <LEDGER_DIR>/bank_hash_details/. If you have multiple (ie if your node auto restarts on failure), please post the lowest slot.

If you have it, I'll replay the slot myself to generate a hopefully good file, and then compare mine vs yours to figure out what account(s) / transaction(s) might have caused the issue

@bji
Contributor Author

bji commented Nov 14, 2023

Happened again today, both primary and secondary failing with the same error:

thread 'solReplayStage' panicked at 'We have tried to repair duplicate slot: 229873482 more than 10 times and are unable to freeze a block with bankhash 51idF9qg8ntNMzhmUaMTvygV2nrPNCJ11uqQPvLsfVYU, instead we have a block with bankhash Some(ASJhPCw4LYXouujQSMZSMVLcfSU8dVwYTfv2czncF6Fv). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting', core/src/replay_stage.rs:1587:25
[2023-11-14T01:48:20.839284352Z ERROR solana_metrics::metrics] datapoint: panic program="validator" thread="solReplayStage" one=1i message="panicked at 'We have tried to repair duplicate slot: 229873482 more than 10 times and are unable to freeze a block with bankhash 51idF9qg8ntNMzhmUaMTvygV2nrPNCJ11uqQPvLsfVYU, instead we have a block with bankhash Some(ASJhPCw4LYXouujQSMZSMVLcfSU8dVwYTfv2czncF6Fv). This is most likely a bug in the runtime. At this point manual intervention is needed to make progress. Exiting', core/src/replay_stage.rs:1587:25" location="core/src/replay_stage.rs:1587:25" version="1.16.18 (src:00000000; feat:4033350765, client:JitoLabs)"

@bji
Contributor Author

bji commented Nov 14, 2023

@bji - Assuming you haven't completely wiped your ledger, can you please post the file that was created within <LEDGER_DIR>/bank_hash_details/. If you have multiple (ie if your node auto restarts on failure), please post the lowest slot.

If you have it, I'll replay the slot myself to generate a hopefully good file, and then compare mine vs yours to figure out what account(s) / transaction(s) might have caused the issue

I provided the entire ledger in the bug description. The original ledger is long since gone; it has to be deleted completely for the validator to start successfully in this situation.

@steviez
Contributor

steviez commented Nov 14, 2023

Assuming you haven't completely wiped your ledger, can you please post the file that was created within <LEDGER_DIR>/bank_hash_details/. If you have multiple (ie if your node auto restarts on failure), please post the lowest slot.

If you have it, I'll replay the slot myself to generate a hopefully good file, and then compare mine vs yours to figure out what account(s) / transaction(s) might have caused the issue

For completeness, just realized I misremembered and this feature did not get backported to v1.16 - sorry for the wild goose chase

@bji
Contributor Author

bji commented Nov 14, 2023

No problem - but is the complete ledger I linked to in the bug description not sufficient? Erm I guess it's 10,000 copied slots, not the whole ledger. Is it enough?

@steviez
Contributor

steviez commented Nov 14, 2023

No problem - but is the complete ledger I linked to in the bug description not sufficient?

The ledger you provided is sufficient for me to download and replay. However, replaying is not guaranteed to reproduce the failure (ie the majority of the cluster voted otherwise). That file I requested dumps the account state of every written account for that slot. So, even if we can't reproduce the failure, this file would allow me to see which account differed, and exactly how it differed. This is typically enough to work backwards to a specific TX, or we can see if it is a single bit flip or something
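The comparison described above amounts to indexing both dumps by account and listing which fields differ. A minimal Python sketch follows. It assumes the dump is JSON with a top-level `accounts` list whose entries carry a `pubkey` field; that layout and the helper names `load_accounts` and `diff_accounts` are illustrative assumptions, not the confirmed file format.

```python
import json


def load_accounts(path):
    """Load a per-slot dump and index its accounts by pubkey.

    ASSUMPTION: the file is JSON with a top-level "accounts" list whose
    entries each carry a "pubkey" field. This layout is hypothetical;
    adjust the keys to match the real file.
    """
    with open(path) as f:
        doc = json.load(f)
    return {acct["pubkey"]: acct for acct in doc["accounts"]}


def diff_accounts(good, bad):
    """Compare two pubkey -> account mappings.

    Returns (only_in_good, only_in_bad, changed), where `changed` maps each
    shared pubkey to the sorted list of field names whose values differ.
    """
    only_good = sorted(good.keys() - bad.keys())
    only_bad = sorted(bad.keys() - good.keys())
    changed = {}
    for key in sorted(good.keys() & bad.keys()):
        a, b = good[key], bad[key]
        fields = sorted(f for f in a.keys() | b.keys() if a.get(f) != b.get(f))
        if fields:
            changed[key] = fields
    return only_good, only_bad, changed


# Tiny in-memory example with made-up accounts (not real cluster data).
good = {
    "A": {"pubkey": "A", "lamports": 10, "data": "AAAA"},
    "B": {"pubkey": "B", "lamports": 5, "data": ""},
}
bad = {
    "A": {"pubkey": "A", "lamports": 10, "data": "AAAB"},
    "C": {"pubkey": "C", "lamports": 1, "data": ""},
}
print(diff_accounts(good, bad))
```

In practice `good` would come from `load_accounts` on the regenerated file and `bad` from the file written by the failing node.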

@bji
Contributor Author

bji commented Nov 14, 2023

By single bit flip do you mean, a hardware error on the machine?

This is impossible in this particular instance because two machines had the same error at the same time. There is definitely something that actually happened to create consensus deviation.

@steviez
Contributor

steviez commented Nov 14, 2023

Sure, I'm not trying to brush off your report as a hardware fault. Whether it is a single bit or a whole 10 MB account, the file from when your node failed + a regenerated file from correct execution allow me to diff the account state. You may recall I had similar files when we had to momentarily back out one of your PRs, and those files seemingly helped us hone in on the issue much faster (I guess you could tell me since you did the in depth debugging 😉 )
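Once a differing account has been identified, telling a single flipped bit apart from a larger divergence is a matter of XOR-ing the two byte strings and counting the set bits. A small Python sketch, assuming the two versions of the account data are already available as raw bytes; `differing_bits` and `classify` are hypothetical helper names.

```python
def differing_bits(a: bytes, b: bytes) -> int:
    """Count the bit positions that differ between two equal-length byte strings."""
    if len(a) != len(b):
        raise ValueError("inputs must have the same length")
    # XOR leaves a 1 in every position where the two bytes disagree.
    return sum(bin(x ^ y).count("1") for x, y in zip(a, b))


def classify(a: bytes, b: bytes) -> str:
    """Give a rough label for how two versions of the same data differ."""
    if len(a) != len(b):
        return "length mismatch"
    n = differing_bits(a, b)
    if n == 0:
        return "identical"
    if n == 1:
        return "single bit flip"
    return f"{n} bits differ"


print(classify(b"\x00\x01", b"\x00\x00"))
print(classify(b"\xff", b"\x00"))
```

A one-bit difference would point toward memory or storage corruption, while a many-bit difference is more consistent with diverging execution.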

Without this extra level of detail, we are at the mercy of being able to reproduce the error reliably. But, given that the rest of the network deviated from how your node voted, I'm not optimistic we'd be able to reproduce it. I'm pretty sure I pulled down those slots and attempted to reproduce a few weeks back, but I'm honestly not positive, so I'll give it another go to be sure.

And to confirm - are you still running v1.16.16-jito ?

@bji
Contributor Author

bji commented Nov 14, 2023

Sorry, I didn't mean to imply that you were trying to brush off the report. I just wanted to make sure that it was clear that this one is very unlikely to be a hardware fault.

Not still running 1.16.16-jito. The most recent occurrence - a few hours ago - was on 1.16.18-jito.

If you don't have time to do the state diff, I could do it with some hinting on the tooling I'll need to accomplish this task. Don't mean to put this issue all on you!

@steviez
Contributor

steviez commented Nov 14, 2023

Sorry, I didn't mean to imply that you were trying to brush off the report. I just wanted to make sure that it was clear that this one is very unlikely to be a hardware fault.

👍

Not still running 1.16.16-jito. The most recent occurrence - a few hours ago - was on 1.16.18-jito.

👍

If you don't have time to do the state diff, I could do it with some hinting on the tooling I'll need to accomplish this task. Don't mean to put this issue all on you!

Ha, well that comment brings us full circle to my earlier comment:
"just realized I misremembered and this feature did not get backported to v1.16"

So, whether there is any additional work to do boils down to whether the bad hash can be created with the ledger you provided. As mentioned, I thought I tried but not positive so I'll pull these down and give it a run

@bji
Contributor Author

bji commented Nov 14, 2023

Just happened again. I'll save the full ledger.

@steviez
Contributor

steviez commented Nov 14, 2023

For context, the original error:

We have tried to repair duplicate slot: 224116920 more than 10 times and are unable to freeze a block with 
bankhash 38rFNeVT1FGHResoXsgTN2cyyjBiJd6ZFcXWCPnXzZn7, instead we have a block with
bankhash Some(GVZoNMHn1o8CB76i3LLRHjTA1p8f4iyTa6hn9eopWJh1)

I downloaded your ledger and replayed with several versions, namely:

v1.16.16
v1.16.16-jito
v1.16.16-jito w/ jemalloc debug enabled (this has caught one or two bugs in the past)

All of these runs produced the correct hash 38rFNeVT1FGHResoXsgTN2cyyjBiJd6ZFcXWCPnXzZn7 on slot 224116920.

Given that you have seemingly been able to reproduce this problem several times, I would advise cherry-picking the commit that #32632 introduced onto whichever v1.16 branch you're running. If it fails again, it will produce a file in <LEDGER_DIR>/bank_hash_details that will be useful for us to track down the issue

@bji
Contributor Author

bji commented Nov 15, 2023

Thanks, much appreciated. Is it possible that there was something restricted just to an in-memory issue that never made it to the ledger itself?

Anyway, I will cherry pick that change and run it from now on. Will let you know if this happens again.

@bji
Contributor Author

bji commented Nov 15, 2023

Given that you have seemingly been able to reproduce this problem several times, I would advise cherry-picking the commit that #32632 introduced onto whichever v1.16 branch you're running. If it fails again, it will produce a file in <LEDGER_DIR>/bank_hash_details that will be useful for us to track down the issue

So I was unable to apply that change to a 1.16 branch, there were a lot of conflicts and I didn't know how to resolve them, having little experience with the accounts database code.

However, my secondary is running 1.17.5 and that branch has your bank_hash_details change, so IF this problem occurs again, then:

  • If it doesn't happen on the secondary running the 1.17.5 branch, then we'll know that this was an issue isolated to the 1.16 branch
  • If it does happen on the secondary running the 1.17.5 branch, the bank_hash_details contents should be created and will be available for use in debugging

@steviez
Contributor

steviez commented Nov 15, 2023

Thanks, much appreciated. Is it possible that there was something restricted just to an in-memory issue that never made it to the ledger itself?

Yes, absolutely. That's what's nice about that debug file - it does an accounts-db scan which will pull from memory or disk.

So I was unable to apply that change to a 1.16 branch, there were a lot of conflicts and I didn't know how to resolve them, having little experience with the accounts database code.

Ohh duh, sorry, I should have helped you with that 😬 . Let me see how bad the resolution looks

However, my secondary is running 1.17.5 and that branch has your bank_hash_details change, so IF this problem occurs again, then:
...

Good deal, and I agree with your logic here: if we see the failure again, we should be able to isolate it.

@bji
Contributor Author

bji commented Nov 16, 2023

Crash occurred again, but this time ONLY on my primary, which was running 1.16.19-jito.

The secondary was running 1.17.5 (not JITO), and it did not crash. This is the first time I've run different software versions on the primary and secondary and the first time that the crash did not duplicate across the two.

Therefore, it seems really likely that the issue is isolated to:

  • 1.16 branch
  • or JITO
  • or JITO on 1.16

@bji
Contributor Author

bji commented Nov 16, 2023

I am going to switch JITO off of the primary, and run just 1.16.19 there. If that doesn't reproduce the issue, then JITO may be the cause.

@steviez
Contributor

steviez commented Nov 16, 2023

Crash occurred again, but this time ONLY on my primary, which was running 1.16.19-jito.

The secondary was running 1.17.5 (not JITO), and it did not crash. This is the first time I've run different software versions on the primary and secondary and the first time that the crash did not duplicate across the two.

Therefore, it seems really likely that the issue is isolated to:

  • 1.16 branch
  • or JITO
  • or JITO on 1.16

I recently became aware of this bug on Jito side: jito-foundation/jito-solana#449. However, that looks to have landed in v1.16.19-jito, so the fact that you saw with .19 would rule that out as a suspect.

Also, here is a branch where you can find the debug PR I had mentioned cherry-picked onto v1.16:
https://github.com/steviez/solana/commits/v1.16.19_bank_file

@bji
Contributor Author

bji commented Nov 16, 2023

I build my own version of the JITO patches and it's entirely possible that I was missing that fix.

In fact it seems really likely!

@bji
Contributor Author

bji commented Nov 16, 2023

That being said, I did just reproduce the crash with a stock 1.16.19 not running JITO. So I am now switching my primary and secondary to 1.17.5 which seems to be impervious to this issue.

@steviez
Contributor

steviez commented Nov 17, 2023

That being said, I did just reproduce the crash with a stock 1.16.19 not running JITO. So I am now switching my primary and secondary to 1.17.5 which seems to be impervious to this issue.

Recall that v1.17.5 isn't officially suggested for mnb so you may encounter other incompatibilities (hopefully not). If you're up for it, I did do the cherry-pick of that debug file to v1.16; that file would be pretty helpful for getting some insight into what the issue might be that you're facing.

@buffalu
Contributor

buffalu commented Nov 17, 2023

I recently became aware of this bug on Jito side: jito-foundation/jito-solana#449. However, that looks to have landed in v1.16.19-jito, so the fact that you saw with .19 would rule that out as a suspect.

this bug was only related to simulating bundles, which zan's validator most likely wasn't doing.

we are looking into this issue on our side as well

@bji
Contributor Author

bji commented Nov 17, 2023

Most likely wasn't doing? Or definitely wasn't doing?

@bji
Contributor Author

bji commented Nov 17, 2023

That being said, I did just reproduce the crash with a stock 1.16.19 not running JITO. So I am now switching my primary and secondary to 1.17.5 which seems to be impervious to this issue.

Recall that v1.17.5 isn't officially suggested for mnb so you may encounter other incompatibilities (hopefully not). If you're up for it, I did do the cherry-pick of that debug file to v1.16; that file would be pretty helpful for getting some insight into what the issue might be that you're facing.

I appreciate that, but thus far the only software that hasn't crashed on me in this way is 1.17.5. So I'm sticking with that on my primary.

I'll run your patch on 1.16.19 on my secondary.

@buffalu
Contributor

buffalu commented Nov 17, 2023

Most likely wasn't doing? Or definitely wasn't doing?

it would only be doing this if you had your RPC ports open and people were calling simulate_bundle

@bji
Contributor Author

bji commented Nov 17, 2023

My RPC ports aren't open, they are firewalled to only a few systems that only I have access to.

However, I do run my own relayer -- can these simulates be routed through that?

@buffalu
Contributor

buffalu commented Nov 17, 2023

no

@brooksprumo
Contributor

I have collected the following files from one of the validators:

snapshot: https://s3.us-west-1.amazonaws.com/shinobi-systems.com/incident-2023.10.16/snapshot-224107039-GTo6rcFciwNew8Bx9Hggv7n9oS5AZVeQGSXQMkaFfkfy.tar.zst

I've tried downloading the snapshot twice, and both times I get this error while inflating:

GSXQMkaFfkfy.tar.zst : 147160 MB...     zstd: snapshot-224107039-GTo6rcFciwNew8Bx9Hggv7n9oS5AZVeQGSXQMkaFfkfy.tar.zst: unsupported format

@bji Do you happen to still have this snapshot available?

@bji
Contributor Author

bji commented Dec 8, 2023

No, sorry, I only have what is hosted on Amazon AWS. The files are so big that it's possible something failed after a very long upload almost completed and I didn't notice it. Sorry.

@buffalu
Contributor

buffalu commented Dec 14, 2023

SELECT "message", "version" FROM "panic" WHERE ("message"::field =~ /^*We have tried to repair/) AND $timeFilter GROUP BY "host_id"::tag

[Screenshot 2023-12-14 at 11:52:09 AM, image not preserved]
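The same kind of grouping can be approximated offline against collected validator logs. A rough Python sketch, assuming log lines shaped like the `datapoint: panic` entries quoted earlier in this thread; it groups by version because those log lines carry no `host_id` tag, and `panics_by_version` is a hypothetical helper name.

```python
import re
from collections import Counter

# Captures the leading x.y.z of the version field, e.g. version="1.16.16 (src:...)".
VERSION_RE = re.compile(r'version="(\d+\.\d+\.\d+)')


def panics_by_version(lines):
    """Count 'We have tried to repair' panic datapoints per validator version."""
    counts = Counter()
    for line in lines:
        if "datapoint: panic" not in line or "We have tried to repair" not in line:
            continue
        m = VERSION_RE.search(line)
        counts[m.group(1) if m else "unknown"] += 1
    return counts


# Shortened forms of the two log lines quoted in this thread ("..." marks
# text cut for brevity), plus one unrelated line that should be ignored.
sample_lines = [
    'datapoint: panic program="validator" message="panicked at \'We have tried to repair duplicate slot: 224116920 ...\'" version="1.16.16 (src:00000000; feat:4033350765, client:JitoLabs)"',
    'datapoint: panic program="validator" message="panicked at \'We have tried to repair duplicate slot: 229873482 ...\'" version="1.16.18 (src:00000000; feat:4033350765, client:JitoLabs)"',
    'datapoint: some-other-metric one=1i',
]
print(panics_by_version(sample_lines))
```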

@steviez
Contributor

steviez commented Feb 5, 2024

Given that we have progressed to v1.17 on mnb, I'm going to close this issue out. There were a small handful of intermittent reports that we didn't get to the bottom of on v1.16, but enough has changed that I think debugging will need to "start fresh". We can always look back here if we have reason to believe any v1.17+ reports (hopefully there are none) are the same issue / related.

Also, v1.17 has some of the debugging features baked in (whereas we didn't backport them until later v1.16 releases), so we should be able to handle any reports in a more effective manner.

@steviez steviez closed this as completed Feb 5, 2024