
eth/downloader: fix skeleton cleanup #28581

Merged
merged 4 commits into ethereum:master on Jan 31, 2024

Conversation

rjl493456442 (Member) commented Nov 22, 2023

This pull request fixes a corner case in skeleton header deletion.

Background: Geth uses a backward-sync mechanism to fetch and assemble the skeleton header chain locally (guided by the consensus client) and then extends the block chain by consuming the skeleton headers. Consumed skeleton headers are deleted from the database to avoid accumulating junk.

The invariant should always hold that the skeleton header chain links to the local blockchain, specifically skeleton.tail.parent == blockchain.head. However, the original deletion logic always set blockchain.head as the new tail and deleted the headers before it.

With this fix, the header after the last filled block is regarded as the new tail, keeping the guarantee that the skeleton chain links to the blockchain instead of overlapping with it.

It's not a really critical issue, since we just leave one extra skeleton header in the database, but it results in some weird behaviors which we really want to avoid.
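To make the intent concrete, here is a minimal sketch of the tail-selection rule. It is an illustration only, with simplified stand-in types rather than Geth's actual skeleton internals (subchain and newTail are hypothetical names; the real code tracks *types.Header objects and persists them through the rawdb skeleton accessors):

	package main

	import "fmt"

	// Simplified stand-in for a skeleton subchain: just the boundary numbers.
	type subchain struct {
		Head uint64 // highest skeleton header number
		Tail uint64 // lowest skeleton header number
	}

	// newTail returns the skeleton tail that preserves the invariant
	// skeleton.tail.parent == blockchain.head once headers up to `filled`
	// have been imported into the local chain.
	func newTail(sc subchain, filled uint64) (uint64, bool) {
		if filled >= sc.Head {
			// The whole subchain was consumed; nothing is left to track.
			return 0, false
		}
		// Partial consumption: the new tail is the first header not yet
		// imported, i.e. filled+1, so it links to the chain head rather
		// than overlapping with it.
		return filled + 1, true
	}

	func main() {
		sc := subchain{Head: 100, Tail: 90}
		tail, ok := newTail(sc, 95)
		fmt.Println(tail, ok) // 96 true: the tail's parent is the local head (95)
	}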


A: Avoid unexpected chain truncation

Before spinning up a new sync cycle, the downloader needs to figure out the common ancestor of the local chain and the skeleton header chain via the findBeaconAncestor function.

Because the original deletion logic sets the tail to the local chain head, the common ancestor ends up being chain.head-1.

Therefore, the excess block, along with its associated data (e.g. receipts), will be truncated from the ancient store by this logic:

	// Rewind the ancient store and blockchain if reorg happens.
	if origin+1 < frozen {
		if err := d.lightchain.SetHead(origin); err != nil {
			return err
		}
		log.Info("Truncated excess ancient chain segment", "oldhead", frozen-1, "newhead", origin)
	}

This behavior is really unexpected in normal cases; it should only occur if a network reorg is deeper than our local chain.
Unfortunately, it occurs a lot, e.g.

Nov 15 18:21:30 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:30.426] Syncing beacon headers                   downloaded=18,579,869 left=3,223,551  eta=5m40.255s
Nov 15 18:21:30 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-15|18:21:30.428] Reject duplicated disable operation
Nov 15 18:21:30 ip-10-0-0-11.ec2.internal geth[18425]: WARN [11-15|18:21:30.430] Rewinding blockchain to block            target=3,223,551
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded most recent local header          number=3,223,551  hash=a0aaff..914938 td=146,234,259,464,463,532,817 age=6y9mo4w
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded most recent local block           number=0          hash=d4e567..cb8fa3 td=17,179,869,184              age=54y7mo3w
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded most recent local snap block      number=3,223,551  hash=a0aaff..914938 td=146,234,259,464,463,532,817 age=6y9mo4w
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: INFO [11-15|18:21:33.424] Loaded last snap-sync pivot marker       number=18,578,974
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-15|18:21:33.424] Failed to reset txpool state             err="missing trie node d7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544 (path ) layer stale"
Nov 15 18:21:33 ip-10-0-0-11.ec2.internal geth[18425]: ERROR[11-15|18:21:33.424] Failed to reset blobpool state           err="missing trie node d7f8974fb5ac78d9ac099b9ad5018bedc2ce0a72dad1827a1709da30580f0544 (path ) layer stale"

I think we can avoid it by applying this fix.


B: Annoying error log for deleting skeleton headers

Failed to clean stale beacon headers err="filled header below beacon header tail: 16554016 < 16554017"

This kind of log is pretty common, and I realized that the filled header is just one block before the tail. My hunch is that we spin up the backfiller but terminate it without extending the local chain by any blocks. It can be avoided with this fix.

holiman (Contributor) commented Nov 22, 2023

This looks good to me, but it's really one of those PRs where the effects are non-trivial. Maybe @karalabe can review, or we'll need to have a quick chat about it.

@fjl fjl self-requested a review December 5, 2023 17:48
	if number < s.progress.Subchains[0].Tail {
	// If the filled header is below and discontinuous with the linked subchain,
	// something's corrupted internally. Report and error and refuse to do anything.
	if number+1 < s.progress.Subchains[0].Tail {
Member

I'm unsure that this is correct.

Number is the last item newly filled. When I start a sync loop, let's say I have N blocks in my chain. Then I skeleton sync, and will reconstruct headers from HEAD to N+1, at which point I have my subchain linked to the local chain.

I start backfilling. If I fail to fill anything, cleanStales does not get called. If I successfully fill 1 block, then filled will be N+1 (the newly filled one block).

In that case number == filled == N+1, which is == s.progress.Subchains[0].Tail. The only way to hit this error would be to backfill something we already had in our chain.

With the proposed modification however, backfilling my local head block would make the check pass, but that should not be possible, because I can only backfill stuff from my subchain, not 1 block below it.

Member Author

> If I fail to fill anything, cleanStales does not get called

I am not sure about that. cleanStales will be invoked anyway, regardless of whether we successfully fill something or not; CurrentSnapBlock() is returned as the newly filled block.

Thus it's theoretically possible that (1) the tail is N+1 and (2) filled is N, in which case the error will be reported.
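For illustration, with hypothetical numbers (not taken from the PR): if the backfiller is suspended without importing anything, filled stays at the current snap head N while the tail is N+1, and only the old check fires:

	package main

	import "fmt"

	func main() {
		// Hypothetical numbers: local snap head N = 100, skeleton tail = 101.
		filled, tail := uint64(100), uint64(101)

		// Old check (number < tail): triggers the spurious
		// "filled header below beacon header tail" error.
		fmt.Println(filled < tail) // true

		// Relaxed check from this PR (number+1 < tail): only a real gap errors.
		fmt.Println(filled+1 < tail) // false
	}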

Member

According to the code below, cleanStales will only be called if something was filled.

			if filled := s.filler.suspend(); filled != nil {
				// If something was filled, try to delete stale sync helpers. If
				// unsuccessful, warn the user, but not much else we can do (it's
				// a programming error, just let users report an issue and don't
				// choke in the meantime).
				if err := s.cleanStales(filled); err != nil {
					log.Error("Failed to clean stale beacon headers", "err", err)
				}
			}

That said, you are also right that resuming the filler will always return the head snap block (or genesis I guess in case we're brand new).

		defer func() {
			b.lock.Lock()
			b.filling = false
			b.filled = b.downloader.blockchain.CurrentSnapBlock()
			b.lock.Unlock()
		}()

Interesting, I guess this was a refactor after I designed the original idea. It kind of makes the filled check in my first code segment moot.

Member

Ok, I agree with this +1, but we need to fix some comments that mention that it cannot happen. Namely,

		batch = s.db.NewBatch()
	)
	s.progress.Subchains[0].Tail = number
Member

Ah hmm, indeed this is wonky. If the tail is the first "missing" header, then it should be filled+1.

	// If more headers were filled than available, push the entire subchain
	// forward to keep tracking the node's block imports.
	//
	// Note that the new tail will be the filled one in this case, which is
karalabe (Member) commented Dec 11, 2023

This comment is not valid any more. The new tail might be either filled (if the HEAD itself was filled) or filled+1 if something earlier was filled. This is problematic for L1159, which sets the Head to number. In that case, you could end up with Tail > Head.

L1164 also feels broken with the fixed logic because we're writing filled as a skeleton header, but that's not the case any more if it was just filled. This might arguably be broken even in the current code, but it's at least uniformly broken. With the new code, you can end up with a skeleton header of filled, but maybe a Tail of filled+1, which is inconsistent.

Member Author

Sorry, I don't get it. I don't think it's possible to have HEAD as filled but TAIL as filled+1.

In the case of filling everything, both HEAD and TAIL should be set to filled in the new logic, right?

	// chain is consumed.
	newTail := filled
	if number < s.progress.Subchains[0].Head {
		newTail = rawdb.ReadSkeletonHeader(s.db, number+1)
Member

This is problematic. newTail can either be filled or filled+1, but the code below acts as if it were one or the other. It would be very hard to reason about.

IMO if the happy path is filled+1, then we should special-case filled and have a condition that does all the cleanups necessary for "filling the entire subchain"; then we can have the remainder of the code handle the partial fill scenario.

Handling both cases together below makes the code very hard to digest IMO.

Member Author

I am wondering: if we fill the whole chain, should we delete the skeleton chain entirely? Theoretically that's the best way to handle this case; we just need to confirm that a non-existent skeleton won't break the other parts.
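For reference, a rough sketch of the special-cased shape under discussion, using stand-in types and a plain map instead of the database (cleanStalesSketch is a hypothetical name, not the actual skeleton.go code): a full fill drops the skeleton entirely, while a partial fill verifies linkage and moves the tail to filled+1.

	package main

	import "fmt"

	// Stand-ins for what cleanStales works with; the real code uses
	// *types.Header plus the rawdb skeleton-header accessors.
	type header struct {
		Number     uint64
		Hash       string
		ParentHash string
	}

	type store map[uint64]header // number -> stored skeleton header

	// cleanStalesSketch: if the whole subchain was consumed, drop every
	// skeleton header; otherwise keep filled+1 as the new tail, after
	// checking that it really links to the filled header.
	func cleanStalesSketch(db store, head uint64, filled header) (uint64, error) {
		if filled.Number >= head {
			for n := range db { // full consumption: delete the skeleton entirely
				delete(db, n)
			}
			return 0, nil
		}
		tail, ok := db[filled.Number+1]
		if !ok || tail.ParentHash != filled.Hash {
			return 0, fmt.Errorf("filled header is discontinuous with subchain: %d %s", filled.Number, filled.Hash)
		}
		for n := range db { // partial consumption: drop headers below the new tail
			if n < tail.Number {
				delete(db, n)
			}
		}
		return tail.Number, nil
	}

	func main() {
		db := store{
			95: {Number: 95, Hash: "h95", ParentHash: "h94"},
			96: {Number: 96, Hash: "h96", ParentHash: "h95"},
			97: {Number: 97, Hash: "h97", ParentHash: "h96"},
		}
		tail, err := cleanStalesSketch(db, 97, header{Number: 95, Hash: "h95", ParentHash: "h94"})
		fmt.Println(tail, err) // 96 <nil>: header 95 is deleted, 96 becomes the new tail
	}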

rjl493456442 (Member Author)

@karalabe I updated the cleanup logic; not sure if this version is easier to understand. Please take another look.

rjl493456442 (Member Author)

[attached screenshot]

It turns out this error is never reported during the snap sync, but it did occur all the time after the initial sync.

fjl (Contributor) commented Jan 11, 2024

Wait, we still haven't merged this!?

fjl (Contributor) commented Jan 11, 2024

@karalabe PTAL

	// The skeleton chain is partially consumed, set the new tail as filled+1.
	tail := rawdb.ReadSkeletonHeader(s.db, number+1)
	if tail.ParentHash != filled.Hash() {
		return fmt.Errorf("filled header is discontinuous with subchain: %d %s", number, filled.Hash())
Member

Whilst I agree that this clause is an error since it should never happen, I'm wondering if returning an error could end up getting us stuck in some weird way if a bug is hit, vs. being able to self-recover if we blindly delete the "overlapping" stuff that was side-filled?

Member Author

I guess the worst case is that we keep all the skeleton headers in the database, but the normal sync procedure won't be aborted or stuck?

I would prefer to print out the error for now and see if anyone really hits this situation.

rjl493456442 (Member Author)

I have addressed the comments, @karalabe, please take another look!

karalabe (Member) left a comment

SGTM

@karalabe karalabe added this to the 1.13.12 milestone Jan 31, 2024
@karalabe karalabe merged commit 5c67066 into ethereum:master Jan 31, 2024
2 of 3 checks passed
lukanus pushed a commit to blocknative/go-ethereum that referenced this pull request Feb 19, 2024