core: fix snapshot missing when recovery from crash #23496
Conversation
@zzyalbert are you running a regular full-node or an archive node?
Also, is it on mainnet (PoW) or a testnet (Clique) or maybe a private Clique network? Trying to figure out the trigger for the issue.
a regular full-node
Yes, it was first discovered on my own private Clique network. Then I read the code and found out that any full node may suffer the same problem when shut down accidentally (like a power failure or crash, which can be simulated by …). In this case, the trieDB dirty cache and the snapshot diffLayers will be lost, and the states they have on disk may differ. After recovery from restart, the function …
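To make the failure mode concrete, here is a minimal standalone sketch (not actual go-ethereum code; the struct and root values are hypothetical) of what survives on disk after a crash and why the two views can disagree:

```go
package main

import "fmt"

// diskState is a hypothetical stand-in for what geth reads back after a
// crash: the in-memory trieDB dirty cache and the snapshot diff layers
// are gone, so only what was flushed to disk survives.
type diskState struct {
	headStateRoot    string // newest block state root whose trie made it to disk
	snapshotDiskRoot string // state root covered by the snapshot's persisted disk layer
}

func main() {
	s := diskState{
		headStateRoot:    "0xabc", // trie committed for a recent block
		snapshotDiskRoot: "0x123", // snapshot persisted only up to an older block
	}
	if s.headStateRoot != s.snapshotDiskRoot {
		// The gap: head state exists, but no snapshot layer matches it,
		// so the snapshot silently goes unused for later block execution.
		fmt.Println("snapshot disk layer lags behind the head state")
	}
}
```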
I think this PR is wrong -- but I'd be willing to look into the root cause which prompted this PR to be created. When we have a missing snapshot layer at startup, we expect the snapshot generator to kick in. Meanwhile, blocks should continue to be processed.
If a crash happens (without persisting the state), the blockchain will rewind its head to a position lower than the disk layer, so that all the missing diff layers can be regenerated. This should always happen. But the corner cases are: …
Then the gap will be created. In these two cases, Geth should pick it up by rebuilding.
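A minimal sketch of the expected recovery rewind, using a hypothetical helper rather than go-ethereum's actual SetHead logic:

```go
// rewindToAvailableState walks the head backwards to the newest block
// whose state trie is still on disk; re-executing forward from there
// normally rebuilds the missing diff layers on top of the disk layer.
// hasState is a hypothetical predicate over block numbers.
func rewindToAvailableState(head uint64, hasState func(num uint64) bool) uint64 {
	for head > 0 && !hasState(head) {
		head--
	}
	return head
}
```

The corner case discussed below is that a state trie may survive at a block above the snapshot disk layer, so the rewind stops too early and the skipped blocks never regenerate the missing layers.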
@holiman @rjl493456442 Maybe I should explain it more clearly; it was a bit obscure to describe, so I'll give an example. Look at the following picture, which shows the condition of blocks, states, and snapshot after geth crashes. Because of the loss of the memory cache, what we have are all the blocks, the state tries at some blocks, and the disk layer of the snapshot at some block. After geth restarts, it begins to rewind its head …
@zzyalbert thanks for the clarification, I understand now, and it's an interesting corner case!
One thing worth considering: if geth rewinds the head, going to an earlier state, wouldn't it make sense to also remove the trie roots for the blocks that were 'forgotten'? Essentially, when we do a setHead, for whatever reason, it's a bit odd to leave the roots intact, meaning geth can just skip ahead back to where it was previously.
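For illustration, a minimal sketch of this suggestion, with a hypothetical deleteStateRoot helper (note the triage comment further down, which ultimately judged this approach too risky):

```go
// pruneSkippedRoots drops the trie roots of every block between the new
// and old head during a setHead, so geth cannot silently fast-forward
// over them via the known-block path. deleteStateRoot is hypothetical.
func pruneSkippedRoots(newHead, oldHead uint64, deleteStateRoot func(num uint64)) {
	for num := newHead + 1; num <= oldHead; num++ {
		deleteStateRoot(num)
	}
}
```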
Force-pushed from 15821f6 to 5d74404
@holiman Yep, it totally makes sense. I just updated my PR; it now deletes any state roots of skipped blocks while rewinding the head, as I replied to your comment in the code. Looking forward to your review.
We've discussed this in triage today; we're leaning towards the view that deleting roots is maybe a bit dangerous -- we also have refcounts and rollbacks from the downloader, and it may cause problems to start deleting stuff (also, it's not sufficient to delete only the latest; we'd need to delete the entire sequence of roots back to the block we're aiming for). So some version of the original fix is preferred, but we also don't want to introduce a new must-have dependency on the snapshots if we can avoid it. @rjl493456442 will look into it more.
Force-pushed from 5d74404 to eb87187
@rjl493456442 @holiman OK, I updated my PR, moving the snapshot check out so it looks less like a must-have dependency on the snapshot.
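The shape of that change, as a hedged sketch rather than the actual core.BlockChain code (Snapshots, Block, and canSkipExecution are illustrative stand-ins):

```go
// Snapshots is a stand-in for the snapshot tree; Snapshot returns nil
// when no layer exists for the given state root.
type Snapshots interface {
	Snapshot(stateRoot string) interface{}
}

// Block is a stand-in carrying only the state root we care about here.
type Block struct {
	Root string
}

// canSkipExecution gates the write-known-block fast path: the state must
// be present, and -- only if snapshots are enabled at all -- a matching
// snapshot layer must exist too, keeping snapshots an optional dependency.
func canSkipExecution(block Block, hasState func(root string) bool, snaps Snapshots) bool {
	if !hasState(block.Root) {
		return false // state missing: must execute the block anyway
	}
	if snaps != nil && snaps.Snapshot(block.Root) == nil {
		return false // known state but no snapshot layer: re-execute to heal the gap
	}
	return true
}
```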
@rjl493456442 could you please review my PR?
… failure). It is because writing a known block only checks block and state, without the snapshot, which will lead to a gap between the newest snapshot and the newest block state, and the snapshot would not be used in following block execution.
Force-pushed from 053a940 to 9937430
@zzyalbert Your latest fix looks good to me, although a snapshot layer is identified by the block state root instead of the block hash. I have fixed this issue, added the tests, and rebased against master. But the changes in the core package are really sensitive, especially in core.BlockChain. We definitely need more eyes to check the change.
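A tiny hypothetical sketch of that keying detail -- the snapshot tree indexes layers by state root, so a lookup by block hash never matches:

```go
// snapshotTree is an illustrative stand-in; its layers map is keyed by
// state root, which is why a hash-based lookup always comes up empty.
type snapshotTree struct {
	layers map[string]struct{}
}

func (t *snapshotTree) hasLayerFor(blockHash, stateRoot string) bool {
	// Buggy form: _, ok := t.layers[blockHash] -- never matches.
	_, ok := t.layers[stateRoot] // fixed: key by the block's state root
	return ok
}
```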
Minor nits, LGTM
@zzyalbert I pushed one more commit to fix a panic (introduced by me); it uses peek to retrieve the currently iterated block from the iterator.
LGTM
It is because writing a known block only checks block and state, without the snapshot, which could lead to a gap between the newest snapshot and the newest block state. However, new blocks which would have caused the snapshot to become fixed were ignored, since the state was already known. Co-authored-by: Gary Rong <[email protected]> Co-authored-by: Martin Holst Swende <[email protected]>
Fix snapshot missing when recovering from a crash (like OOM or power failure). It is because writing a known block only checks block and state, without the snapshot, when recovering, which will lead to a gap between the newest snapshot and the newest block state, and the snapshot would not be used in following block executions.
Related logs:
I found the log above when geth shut down gracefully; before that, I had just restarted geth after a crash.