
core: fix snapshot missing when recovery from crash #23496

Merged (8 commits) on Nov 1, 2021

Conversation

zzyalbert
Contributor

Fix the snapshot going missing when recovering from a crash (like OOM or power failure). The cause is that the write-known-block path checks only the block and state, not the snapshot, when recovering. This leaves a gap between the newest snapshot and the newest block state, and the snapshot is then not used in the following block executions.

Related logs:

ERROR[08-29|03:30:33.524] Failed to journal state snapshot err="snapshot [0xa85858f689482c30a06655a011698caeeedcbe76d2f4293a0b642b4f70027c93] missing"

I saw the log above when geth shut down gracefully; before that, I had just restarted geth after a crash.

@holiman
Contributor

holiman commented Sep 7, 2021

@zzyalbert are you running a regular full-node or an archive node?

@karalabe
Member

karalabe commented Sep 7, 2021

Also, is it on mainnet (PoW) or a testnet (Clique) or maybe a private Clique network? Trying to figure out the trigger for the issue.

@zzyalbert
Contributor Author

@zzyalbert are you running a regular full-node or an archive node?

a regular full-node

@zzyalbert
Contributor Author

Also, is it on mainnet (PoW) or a testnet (Clique) or maybe a private Clique network? Trying to figure out the trigger for the issue.

Yes, it was first discovered on my own private Clique network. Then I read the code and found that any full node may suffer the same problem when shut down unexpectedly (like a power failure or crash, which can be simulated by kill -9).

In this case, the trieDB dirty cache and the snapshot diff layers are lost, and the states each has on disk may differ. After recovering from the restart, the function BlockChain.HasBlockAndState() triggers this issue, because it only checks the state in the trieDB, not the state in the snapshot. When a new block from a remote peer is inserted in BlockChain.insertChain(), a block whose body and trieDB state already exist locally is recognized as a known block and skips execution. But if that block has no state in the snapshot, the snapshot layer for it will be missing, and so will the layers for all following blocks.

core/block_validator.go (review thread, outdated, resolved)
@holiman
Contributor

holiman commented Sep 21, 2021

I think this PR is wrong -- but I'd be willing to look into the root cause which prompted this PR to be created. When we have a missing snapshot layer at startup, we expect the snapshot generator to kick in. Meanwhile, blocks should continue to be processed.
Was that not what happened?
Could you provide logs about where you encountered these problems?

@rjl493456442
Member

rjl493456442 commented Sep 21, 2021

In this case, trieDB dirty cache and snapshots diffLayers will be lost and the state they has in disk may differ. After recovery from restart, the function

If a crash happens (without persisting the state), the blockchain will rewind its head to a position lower than the disk layer, so that all the missing diff layers can be regenerated. That should always happen.

But the corner cases are:

  • Geth restarts with the snapshot disabled, and then starts again with the snapshot enabled (the gap is created "manually")
  • Geth persists the head state and then a crash happens before the snapshot journal is stored (the chance is really low)

In these two cases the gap will be created, and Geth should pick it up by rebuilding the snapshot.

@zzyalbert
Contributor Author

zzyalbert commented Sep 24, 2021

@holiman @rjl493456442 Maybe I should explain it more clearly; it was a bit obscure to describe, so I'll give an example.

Look at the following picture; it shows the condition of the blocks, states, and snapshot after geth crashes. Because the in-memory caches are lost, all that survives on disk are the blocks, state tries at some blocks, and the snapshot disk layer at some block.

[diagram: block/state/snapshot layout on disk after the crash]

After geth restart, it begins to rewind its head

  • It will search backward from the block that the snapshot root refers to, which is block 1002
  • Then it will find the first block that has a state trie and set it as the head, which is block 1000
  • Then it will sync blocks from block 1000 and start building new snapshot diff layers from block 1003
  • But when it comes to block 1004, both the block and its state data are on disk. It is therefore recognized as a known block and execution is skipped, which also skips generating the snapshot diff layer for block 1004
  • The execution of the following blocks, from 1005 on, will then fail to find a parent snapshot layer. The snapshot is interrupted.

@holiman
Contributor

holiman commented Sep 27, 2021

@zzyalbert thanks for the clarification, I understand now, and it's an interesting corner case!

@holiman
Contributor

holiman commented Sep 27, 2021

One thing worth considering: if geth rewinds the head, going back to an earlier state, wouldn't it make sense to also remove the trie roots for the blocks that were 'forgotten'? Essentially, when we do a setHead, for whatever reason, it's a bit odd to leave the roots intact, meaning geth can just skip ahead back to where it was previously.

@zzyalbert zzyalbert force-pushed the fix_snapshot_missing_after_crash branch from 15821f6 to 5d74404 Compare September 28, 2021 03:44
@zzyalbert
Contributor Author

zzyalbert commented Sep 28, 2021

One thing worth considering: if geth rewinds the head, going to an earlier state. Wouldn't it make sense to also remove the trie roots for the blocks that were 'forgotten'? Essentially, when we do a setHead, for whatever reason, it's a bit odd to leave the roots intact, meaning geth can just skip ahead back to where it was previously.

@holiman Yep, it totally makes sense. I just updated my PR so that it deletes the state roots of any skipped blocks while rewinding the head, as I replied to your comment in the code.

Looking forward to your review

@zzyalbert zzyalbert changed the title core: fix snapshot missing when recovery from crash core, trie: fix snapshot missing when recovery from crash Sep 28, 2021
@holiman
Contributor

holiman commented Sep 28, 2021

We've discussed this in triage today; we're leaning towards the view that deleting roots is a bit dangerous -- we also have refcounts and rollbacks from the downloader, and it may cause problems to start deleting stuff (also, it's not sufficient to delete only the latest root; we'd need to delete the entire sequence of roots back to the block we're aiming for).

So some version of the original fix is preferred, but we also don't want to introduce a new must-have dependency on the snapshots, if we can avoid it.

@rjl493456442 will look into it more

@zzyalbert zzyalbert force-pushed the fix_snapshot_missing_after_crash branch from 5d74404 to eb87187 Compare September 29, 2021 15:37
@zzyalbert
Contributor Author

We've discussed this in triage today; we're leaning towards the view that deleting roots is a bit dangerous -- we also have refcounts and rollbacks from the downloader, and it may cause problems to start deleting stuff (also, it's not sufficient to delete only the latest root; we'd need to delete the entire sequence of roots back to the block we're aiming for).

So some version of the original fix is preferred, but we also don't want to introduce a new must-have dependency on the snapshots, if we can avoid it.

@rjl493456442 will look into it more

@rjl493456442 @holiman ok, I updated my PR, moving the snapshot check out so that it looks less like a must-have dependency on the snapshot

@zzyalbert zzyalbert changed the title core, trie: fix snapshot missing when recovery from crash core: fix snapshot missing when recovery from crash Sep 30, 2021
@zzyalbert
Contributor Author

@rjl493456442 could you please review my PR?

zzyalbert and others added 3 commits October 11, 2021 14:30
… failure).

It is because the write-known-block path checks only the block and state, not the snapshot,
which will lead to a gap between the newest snapshot and the newest block state,
and the snapshot would not be used in the following block executions.
@rjl493456442 rjl493456442 force-pushed the fix_snapshot_missing_after_crash branch from 053a940 to 9937430 Compare October 11, 2021 06:32
@rjl493456442
Member

@zzyalbert Your latest fix looks good to me, though a snapshot layer is identified by the block's state root rather than the block hash. I have fixed this issue, added tests, and rebased against master.

But the changes in the core package are really sensitive, especially in core.BlockChain. We definitely need more eyes on the change.

@holiman @karalabe Please take a look.


@holiman holiman left a comment


Minor nits, LGTM

core/blockchain.go (review thread, outdated, resolved)
core/blockchain_repair_test.go (review thread, outdated, resolved)
core/blockchain.go (review thread, outdated, resolved)
@rjl493456442
Member

@zzyalbert I pushed one more commit to fix a panic (introduced by me); it uses peek to retrieve the block the iterator is currently on.


@holiman holiman left a comment


LGTM

@fjl fjl removed the status:triage label Oct 19, 2021
@fjl fjl added this to the 1.10.11 milestone Oct 19, 2021
@holiman holiman modified the milestones: 1.10.11, 1.10.12 Oct 20, 2021
@holiman holiman merged commit c576fa1 into ethereum:master Nov 1, 2021
sidhujag pushed a commit to syscoin/go-ethereum that referenced this pull request Nov 1, 2021
This happens because the write-known-block path checks only the block and state, not the snapshot, which could leave a gap between the newest snapshot and the newest block state. New blocks that would have repaired the snapshot were then ignored, since their state was already known.


Co-authored-by: Gary Rong <[email protected]>
Co-authored-by: Martin Holst Swende <[email protected]>
yongjun925 pushed a commit to DODOEX/go-ethereum that referenced this pull request Dec 3, 2022