
eth/downloader: fix skeleton cleanup #28581

Merged
merged 4 commits into ethereum:master on Jan 31, 2024

Conversation

rjl493456442 (Member) commented Nov 22, 2023

This pull request fixes a corner case in skeleton header deletion.

Background: Geth uses a backward-sync mechanism to fetch and assemble the skeleton header chain locally (guided by the consensus client) and then extends the block chain by consuming the skeleton headers. Consumed skeleton headers are deleted from the database to avoid accumulating junk.

The invariant should always hold that the skeleton header chain links to the local blockchain, specifically skeleton.tail.parent == blockchain.head. However, the original deletion logic always set blockchain.head as the new tail and deleted the headers before it.

With this fix, the header after the last filled block is regarded as the new tail, keeping the guarantee that the skeleton chain links to the blockchain instead of overlapping with it.

It's not a really critical issue, since we just leave one extra skeleton header in the database, but it results in some weird behaviors which we really want to avoid.
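To make the intent concrete, here is a minimal sketch of the tail-selection rule. It is an illustration only, with simplified stand-in types rather than Geth's actual skeleton internals (subchain and newTail are hypothetical names; the real code tracks *types.Header objects and persists them through the rawdb skeleton accessors):

	package main

	import "fmt"

	// Simplified stand-in for a skeleton subchain: just the boundary numbers.
	type subchain struct {
		Head uint64 // highest skeleton header number
		Tail uint64 // lowest skeleton header number
	}

	// newTail returns the skeleton tail that preserves the invariant
	// skeleton.tail.parent == blockchain.head once headers up to `filled`
	// have been imported into the local chain.
	func newTail(sc subchain, filled uint64) (uint64, bool) {
		if filled >= sc.Head {
			// The whole subchain was consumed; nothing is left to track.
			return 0, false
		}
		// Partial consumption: the new tail is the first header not yet
		// imported, i.e. filled+1, so it links to the chain head rather
		// than overlapping with it.
		return filled + 1, true
	}

	func main() {
		sc := subchain{Head: 100, Tail: 90}
		tail, ok := newTail(sc, 95)
		fmt.Println(tail, ok) // 96 true: the tail's parent is the local head (95)
	}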


A: Avoid unexpected chain truncation

Before spinning up a new sync cycle, the downloader needs to figure out the common ancestor of the local chain and the skeleton header chain via the findBeaconAncestor function.

Because the original deletion logic sets the tail to the local chain head, the common ancestor ends up being chain.head-1.

Therefore, the excess block, along with its associated data (e.g. receipts), will be truncated from the ancient store by this logic:

	// Rewind the ancient store and blockchain if reorg happens.
	if origin+1 < frozen {
		if err := d.lightchain.SetHead(origin); err != nil {
			return err
		}
		log.Info("Truncated excess ancient chain segment", "oldhead", frozen-1, "newhead", origin)
	}

This behavior is really unexpected in normal cases; it should only occur if a network reorg is deeper than our local chain.
Unfortunately, it occurs a lot, e.g.

Nov 15 18:21:30 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:30.426] Syncing beacon headers                   downloaded=18,579,869 left=3,223,551  eta=5m40.255s
Nov 15 18:21:30 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-15|18:21:30.428] Reject duplicated disable operation
Nov 15 18:21:30 ip-10-0-0-11.ec2.internal geth[18425]: WARN [11-15|18:21:30.430] Rewinding blockchain to block            target=3,223,551
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded most recent local header          number=3,223,551  hash=a0aaff..914938 td=146,234,259,464,463,532,817 age=6y9mo4w
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded most recent local block           number=0          hash=d4e567..cb8fa3 td=17,179,869,184              age=54y7mo3w
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded most recent local snap block      number=3,223,551  hash=a0aaff..914938 td=146,234,259,464,463,532,817 age=6y9mo4w
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded last snap-sync pivot marker       number=18,578,974
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-15|18:21:33.424] Failed to reset txpool state             err="missing trie node d7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544 (path ) layer stale"
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-15|18:21:33.424] Failed to reset blobpool state           err="missing trie node d7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544 (path ) layer stale"

I think we can avoid it by applying this fix.


B: Annoying error log for deleting skeleton headers

Failed to clean stale beacon headers err="filled header below beacon header tail: 16554016 < 16554017"

This kind of log is pretty common, and I realized that the filled header is just one block before the tail. My hunch is that we spin up the backfiller but terminate it without extending the local chain by any blocks. It can be avoided with this fix.

holiman (Contributor) commented Nov 22, 2023

This looks good to me, but it's really one of those PRs where the effects are non-trivial. Maybe @karalabe can review, or we'll need to have a quick chat about it.

@fjl fjl self-requested a review December 5, 2023 17:48
	if number < s.progress.Subchains[0].Tail {
	// If the filled header is below and discontinuous with the linked subchain,
	// something's corrupted internally. Report and error and refuse to do anything.
	if number+1 < s.progress.Subchains[0].Tail {
Member

I'm unsure that this is correct.

Number is the last item newly filled. When I start a sync loop, let's say I have N blocks in my chain. Then I skeleton sync, and will reconstruct headers from HEAD to N+1, at which point I have my subchain linked to the local chain.

I start backfilling. If I fail to fill anything, cleanStales does not get called. If I successfully fill 1 block, then filled will be N+1 (the newly filled one block).

In that case number == filled == N+1, which is == s.progress.Subchains[0].Tail. The only way to hit this error would be to backfill something we already had in our chain.

With the proposed modification however, backfilling my local head block would make the check pass, but that should not be possible, because I can only backfill stuff from my subchain, not 1 block below it.

Member Author

> If I fail to fill anything, cleanStales does not get called

I am not sure about that. cleanStales will be invoked anyway, regardless of whether we successfully fill something or not; CurrentSnapBlock() is returned as the newly filled block.

Thus it's theoretically possible that (1) the tail is N+1 and (2) filled is N, in which case the error will be reported.
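For illustration, with hypothetical numbers (not taken from the PR): if the backfiller is suspended without importing anything, filled stays at the current snap head N while the tail is N+1, and only the old check fires:

	package main

	import "fmt"

	func main() {
		// Hypothetical numbers: local snap head N = 100, skeleton tail = 101.
		filled, tail := uint64(100), uint64(101)

		// Old check (number < tail): triggers the spurious
		// "filled header below beacon header tail" error.
		fmt.Println(filled < tail) // true

		// Relaxed check from this PR (number+1 < tail): only a real gap errors.
		fmt.Println(filled+1 < tail) // false
	}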

Member

According to the code below, cleanStales will only be called if something was filled.

			if filled := s.filler.suspend(); filled != nil {
				// If something was filled, try to delete stale sync helpers. If
				// unsuccessful, warn the user, but not much else we can do (it's
				// a programming error, just let users report an issue and don't
				// choke in the meantime).
				if err := s.cleanStales(filled); err != nil {
					log.Error("Failed to clean stale beacon headers", "err", err)
				}
			}

That said, you are also right that resuming the filler will always return the head snap block (or genesis I guess in case we're brand new).

		defer func() {
			b.lock.Lock()
			b.filling = false
			b.filled = b.downloader.blockchain.CurrentSnapBlock()
			b.lock.Unlock()
		}()

Interesting, I guess this was a refactor after I designed the original idea. It kind of makes the filled check in my first code segment moot.

Member

Ok, I agree with this +1, but we need to fix some comments that mention that it cannot happen. Namely,

		batch = s.db.NewBatch()
	)
	s.progress.Subchains[0].Tail = number
Member

Ah hmm, indeed this is wonky. If the tail is the first "missing" header, then it should be filled+1.

	// If more headers were filled than available, push the entire subchain
	// forward to keep tracking the node's block imports.
	//
	// Note that the new tail will be the filled one in this case, which is
karalabe (Member) commented Dec 11, 2023

This comment is not valid any more. The new tail might be either filled (if the HEAD itself was filled) or filled+1 if something earlier was filled. This is problematic for L1159, which sets the Head to number. In that case, you could end up with Tail > Head.

L1164 also feels broken with the fixed logic because we're writing filled as a skeleton header, but that's not the case any more if it was just filled. This might arguably be broken even in the current code, but it's at least uniformly broken. With the new code, you can end up with a skeleton header of filled, but maybe a Tail of filled+1, which is inconsistent.

Member Author

Sorry, I don't get it. I don't think it's possible to have HEAD as filled but TAIL as filled+1.

In the case of filling everything, both HEAD and TAIL should be set to filled in the new logic, right?

	// chain is consumed.
	newTail := filled
	if number < s.progress.Subchains[0].Head {
		newTail = rawdb.ReadSkeletonHeader(s.db, number+1)
Member

This is problematic. newTail can either be filled or filled+1, but the code below acts as if it were one or the other. It would be very hard to reason about.

IMO if the happy path is filled+1, then we should special-case filled and have a condition that does all the cleanups necessary for "filling the entire subchain"; then we can have the remainder of the code handle the partial fill scenario.

Handling both cases together below makes the code very hard to digest IMO.

Member Author

I am wondering: if we fill the whole chain, should we delete the skeleton chain entirely? Theoretically that's the best way to handle this case; we just need to confirm that a non-existent skeleton won't break the other parts.
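For reference, a rough sketch of the special-cased shape under discussion, using stand-in types and a plain map instead of the database (cleanStalesSketch is a hypothetical name, not the actual skeleton.go code): a full fill drops the skeleton entirely, while a partial fill verifies linkage and moves the tail to filled+1.

	package main

	import "fmt"

	// Stand-ins for what cleanStales works with; the real code uses
	// *types.Header plus the rawdb skeleton-header accessors.
	type header struct {
		Number     uint64
		Hash       string
		ParentHash string
	}

	type store map[uint64]header // number -> stored skeleton header

	// cleanStalesSketch: if the whole subchain was consumed, drop every
	// skeleton header; otherwise keep filled+1 as the new tail, after
	// checking that it really links to the filled header.
	func cleanStalesSketch(db store, head uint64, filled header) (uint64, error) {
		if filled.Number >= head {
			for n := range db { // full consumption: delete the skeleton entirely
				delete(db, n)
			}
			return 0, nil
		}
		tail, ok := db[filled.Number+1]
		if !ok || tail.ParentHash != filled.Hash {
			return 0, fmt.Errorf("filled header is discontinuous with subchain: %d %s", filled.Number, filled.Hash)
		}
		for n := range db { // partial consumption: drop headers below the new tail
			if n < tail.Number {
				delete(db, n)
			}
		}
		return tail.Number, nil
	}

	func main() {
		db := store{
			95: {Number: 95, Hash: "h95", ParentHash: "h94"},
			96: {Number: 96, Hash: "h96", ParentHash: "h95"},
			97: {Number: 97, Hash: "h97", ParentHash: "h96"},
		}
		tail, err := cleanStalesSketch(db, 97, header{Number: 95, Hash: "h95", ParentHash: "h94"})
		fmt.Println(tail, err) // 96 <nil>: header 95 is deleted, 96 becomes the new tail
	}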

rjl493456442 (Member Author)

@karalabe I updated the cleanup logic; not sure if this version is easier to understand. Please take another look.

rjl493456442 (Member Author)

[attached screenshot]

It turns out this error is never reported during the snap sync, but it did occur all the time after the initial sync.

fjl (Contributor) commented Jan 11, 2024

Wait, we still haven't merged this!?

fjl (Contributor) commented Jan 11, 2024

@karalabe PTAL

	// The skeleton chain is partially consumed, set the new tail as filled+1.
	tail := rawdb.ReadSkeletonHeader(s.db, number+1)
	if tail.ParentHash != filled.Hash() {
		return fmt.Errorf("filled header is discontinuous with subchain: %d %s", number, filled.Hash())
Member

Whilst I agree that this clause is an error since it should never happen, I'm wondering if returning an error could end up getting us stuck in some weird way if a bug is hit, vs. being able to self-recover if we blindly delete the "overlapping" stuff that was side-filled?

Member Author

I guess the worst case is that we keep all the skeleton headers in the database, but the normal sync procedure won't be aborted or stuck?

I would prefer to print out the error for now and see if anyone really hits this situation.

rjl493456442 (Member Author)

I have addressed the comments, @karalabe, please take another look!

karalabe (Member) left a comment

SGTM

@karalabe karalabe added this to the 1.13.12 milestone Jan 31, 2024
@karalabe karalabe merged commit 5c67066 into ethereum:master Jan 31, 2024
2 of 3 checks passed
lukanus pushed a commit to blocknative/go-ethereum that referenced this pull request Feb 19, 2024