op-geth stuck with "State not available" #130
Comments
Same issue here with a mainnet node |
Can you both provide the following information?
- System specs:
- Config:
- Any other context about the problem: |
System specs
Config:
Not sure what other context could be useful. Here is the complete log output with verbosity set to debug:
and then an endless stream of:
|
Many folks in the Optimism Discord have the same issue and there is no fix. Anyone finding this issue: just nuke your node, because this is unfixable. The OP team will tell you to try debug_setHead, then tell you "well, it's never worked before, but maybe it will for you". |
@eyooooo I'm 0xChupa from vfat lol, yes indeed, the only way to get our mainnet node running again was a fresh install |
It took 10 days to resync and, guess what, it died again two days after it finished syncing. This is such a shame and carelessness on the part of the people responsible. |
|
@imtipi I am sure this is not the case. It never happened before and it can't start happening to multiple people all of a sudden. |
Are you saying that anytime op-node is down while op-geth is running (e.g. because it's being updated in a docker-compose setting), there's a risk of the DB getting corrupted? |
No, you need to shut down op-node first, to make sure op-geth stops receiving data before it is shut down itself |
Doesn't really make sense to me. If I shut down op-geth first, it can't get data anyway, because it's not running. As I initially wrote, I suspect that in my case the data breakage was caused by the L1 being temporarily unavailable. |
more like a speeding car suddenly crashing into a wall |
Geth needs to persist in-memory state on shutdown, at least with the hash-based DB format. They introduced an experimental new format, which we're working on adopting, but for now the hash-based DB format is considered the more stable option.

After DB corruption from a force-shutdown, what ends up happening is that the "state" (storage/account mutations) is missing for the latest few blocks, but the block data is there. The corrupted geth node does not expose that the state is "missing", and ends up misleading the op-node into trying to build on top of the latest state, which is not there. And so, unless geth finishes the "regenerating state" phase (which it often does not, due to tricky assumptions around the corrupted DB), it gets stuck.

If you want to manually repair the DB, to avoid a sync from scratch, the steps to follow are:
It's not a pretty fix, but geth DB corruption is already a bad situation, and due to issues in geth we cannot automate this recovery process. Luckily, the new Path-DB format fixes this DB corruption issue, and we will be adopting these improvements from upstream geth as soon as we can. |
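As an illustration of the failure mode described above (a sketch, not the elided repair steps; assumes the node's HTTP RPC is exposed on localhost:8545): a state-dependent call such as eth_getBalance errors out when the state for the requested block is missing, while eth_getBlockByNumber still succeeds because the block data survived.

```bash
# Probe state availability at the head: if the state was lost on shutdown,
# geth answers with a "missing trie node"-style error instead of a balance.
curl -s -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_getBalance","params":["0x0000000000000000000000000000000000000000","latest"],"id":1}' \
  http://localhost:8545
```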
The geth DB corruption that causes "state not available" happens when geth shuts down, and is unrelated to any op-node or L1 interactions. You must configure your op-geth setup to allow it to shut down gracefully, without hitting some docker/systemd/other time-out. If it does time out, the op-geth process is killed, state is not fully written to disk, and it ends up unavailable. |
thx @protolambda for this detailed explanation! Will try the recovery procedure, thanks again! |
Hope to see your feedback on the fix here. |
I'll defer to @protolambda's opinion, but I think you're going to want a much longer grace period. Something like 20 minutes, to be safe. |
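For a docker-compose setup, the knob that controls this is `stop_grace_period`; a minimal sketch using the 20-minute value suggested above (service name and image are placeholders for your own setup):

```yaml
services:
  op-geth:
    image: op-geth:local     # placeholder image
    stop_grace_period: 20m   # give geth up to 20 minutes to flush state on SIGTERM
```

Under systemd, the equivalent is raising `TimeoutStopSec` in the unit's `[Service]` section.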
@protolambda Could you please help me with some questions? I got the same issue as the author, due to a power cut. At step 2 I got this result. What number should I use in step 4?
I tried putting in this number and the return was null.
I checked the latest block and it seems to have changed, but I'm not sure whether it's correct or not.
I cannot run the last step to finish; it always returns an error code.
Could you please help me? I'm new to this. Thanks a lot. |
Try converting 110291757 to a hexadecimal string and run it under the optimism path |
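For reference, a sketch of that conversion and the subsequent head rewind (assumes the `debug` API namespace is enabled on the node's HTTP RPC at localhost:8545):

```bash
# Convert the decimal block number to the 0x-prefixed hex string geth expects
printf '0x%x\n' 110291757    # prints 0x692eb2d

# Rewind the chain head to that block via the debug API
curl -s -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"debug_setHead","params":["0x692eb2d"],"id":1}' \
  http://localhost:8545
```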
Completed a repair as per the instructions, and I wanted to share the process here. I might provide more test results (currently testing for stability and working to reduce data loss). @sbvegan @protolambda @dandavid3000
|
Overview: after completing the aforementioned fix process, the node has been observed running for 7 days. @sbvegan @protolambda
|
Some questions: regarding the snapshotRecoveryNumber, sometimes it remains unchanged for extended periods, and at times querying it returns 'nil' (even though the node can sync properly after a restart). |
Thanks for the instructions. I tried your method many times and did not succeed. op-node keeps saying "Walking back ..." to a really old tx that was too far away from the current one. It never stopped, because the last command to finish the process always returned an error. I ended up starting a fresh node. It may take long, but it's better, and I will keep backing up the database once every 2 weeks. |
"Walking back ..." is not bad; it takes time, and I believe there is value in waiting. It is a good habit to back up your data regularly |
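A minimal backup sketch along those lines (service names and the datadir path are hypothetical; both processes must be stopped first so the copy is consistent):

```bash
docker compose stop op-node   # stop the driver first so geth stops receiving payloads
docker compose stop op-geth   # then let geth flush its state within the grace period

# Archive the geth datadir (hypothetical path) with a dated name
tar -czf "op-geth-backup-$(date +%F).tar.gz" /data/op-geth

docker compose start op-geth op-node
```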
@kaliubuntu0206 can you explain what this does?
|
After synchronizing using the |
I don't think the |
|
I had a similar problem after pruning state. Very nice, and it worked! Thanks! |
My optimism-goerli node got stuck with op-geth outputting this:
I can connect to the http RPC interface, but the head block is stuck at 12573071.
op-node meanwhile shows:
It's not clear to me what this means and what to do about it.
I suspect it's related to the connected goerli L1 having been unavailable for a while. But it's now good again, yet this Optimism node seems to not be recovering.
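One way to confirm the stuck head over the HTTP RPC (a sketch; 12573071 is 0xbfd98f in hex):

```bash
# If eth_blockNumber keeps returning 0xbfd98f (12573071) across calls,
# the node is not advancing past that block.
curl -s -X POST -H "Content-Type: application/json" \
  --data '{"jsonrpc":"2.0","method":"eth_blockNumber","params":[],"id":1}' \
  http://localhost:8545
```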