eth/downloader: concurrent header downloads #2315
Conversation
force-pushed from 74a736d to 88a7812
Current coverage is 56.20%

```
@@            develop    #2315    diff @@
==========================================
  Files           215      215
  Lines         24238    24457    +219
  Methods           0        0
  Messages          0        0
  Branches          0        0
==========================================
+ Hits          13541    13745    +204
- Misses        10694    10709     +15
  Partials          3        3
```
force-pushed from 1b9b9a4 to c79b32c
force-pushed from c79b32c to 5eeb60b
```go
// No skeleton retrieval can be in progress, fail hard if so (huge implementation bug)
if q.headerResults != nil {
	panic("skeleton assembly already in progress")
}
```
Considered using glog.Fatal? Question: would it matter if you didn't get the entire trace? Since Go 1.6 a panic only lists the current goroutine.
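For reference, a minimal standalone sketch contrasting the two options being discussed (the queue type and field here are stand-ins, not the real downloader types):

```go
package main

import (
	"flag"

	"github.com/golang/glog"
)

// queue is a stand-in for the real downloader queue, reproduced only to
// frame the nil-check from the diff.
type queue struct {
	headerResults []int // placeholder for the real field type
}

func (q *queue) scheduleSkeleton() {
	if q.headerResults != nil {
		// Option A: a naked panic; since Go 1.6 this prints only the
		// current goroutine's stack by default.
		// panic("skeleton assembly already in progress")

		// Option B: glog.Fatal prefixes the message with file:line, dumps
		// the stacks of all running goroutines, then exits.
		glog.Fatal("skeleton assembly already in progress")
	}
}

func main() {
	flag.Parse() // glog registers flags such as -logtostderr
	q := &queue{headerResults: make([]int, 1)}
	q.scheduleSkeleton()
}
```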
This check is mostly for code quality reasons, so that if we screw up the code in the future it fails hard. I don't think the stack traces alone would be enough to figure out the sequence of events that led to this scenario. We can change it of course; it just seems a plain panic is cleaner than a log-embedded one.
I still think it's worth using glog instead of a naked panic. While we can grep or ack on the string, it would be nice to see where the panic originated from in the message itself.
What is causing the following failure:
force-pushed from b46d2a1 to b0463c8
force-pushed from b0463c8 to e86619e
```go
idle := func(p *peer) bool {
	return atomic.LoadInt32(&p.headerIdle) == 0
}
throughput := func(p *peer) float64 {
```
Why isn't this a method on the peer itself? It seems a bit odd to create a lambda when it's mandatory that all peers should be able to return their header throughput anyway. If a peer isn't considered a header-idle peer, perhaps it should return 0?
Can't say why it was originally done like this. This specific version is in line with the rest of the method closures, but I don't see why all of the closures couldn't be separated out as fully qualified methods on the peer. We can do that in a follow-up PR if need be; I'd rather not modify all the other code too in this PR.
👍
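For illustration, a hedged sketch of what hoisting the closures into peer methods could look like (the field names are assumptions taken from the snippet above, not the final API):

```go
package downloader

import "sync/atomic"

// peer is a pared-down stand-in; the field names mirror the snippet above,
// but the real type carries much more state (and guards throughput with a
// lock, omitted here for brevity).
type peer struct {
	headerIdle       int32   // 0 == idle, 1 == a header request is in flight
	headerThroughput float64 // measured headers/sec from past deliveries
}

// HeaderIdle reports whether the peer can accept a new header request.
func (p *peer) HeaderIdle() bool {
	return atomic.LoadInt32(&p.headerIdle) == 0
}

// HeaderThroughput returns the peer's measured header throughput, or 0 for
// a busy peer, per the suggestion above.
func (p *peer) HeaderThroughput() float64 {
	if !p.HeaderIdle() {
		return 0
	}
	return p.headerThroughput
}
```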
force-pushed from 982cd37 to 8906b2f
Geth currently uses a single peer to pull all the headers of the blockchain from, and uses all of its peers to fill in the block bodies, receipts, state, etc. that the header chain defines. This unfortunately introduces a serious bottleneck for newly joining peers: if they select a shitty peer to pull the headers from, the entire sync takes ages.
The obvious solution is to download headers from multiple peers concurrently, but that also has its gotchas, specifically that consecutive headers downloaded from different peers might not match up with each other. This might happen due to malicious peers deliberately feeding junk, or simply due to a large enough fork (e.g. the homestead transition)... and this was actually something the C++ codebase was bitten by, causing hard crashes during the testnet homestead fork. The important lesson is that we need a mechanism to easily check whether a batch of headers (seemingly) fits into our chain or not, without needing to actually import it.
Another important property we need to maintain is that even if we have bad peers, we should be able to sync in a reasonable time. If we keep piecing together malicious header batches and throwing them away, valid downloaded data is also lost and sync time inherently suffers. This is currently avoided by the single-peer header download, which ensures that if we "happen upon" a good peer, bad peers cannot affect us any more (and bad peers get thrown out quite quickly). The important takeaway is that if a single peer drives the header downloads, and it is good, bad peers cannot screw us any more, so it's essential to retain this property even with concurrent header downloads.
Algorithm
The solution implemented by this PR is based on the concept of a header skeleton: the master peer is only asked for a sparse skeleton of the chain (every K-th header), and all the other peers are tasked with filling in the batches of headers between consecutive skeleton headers.
With this algorithm the concurrently pulled headers are guaranteed to fit the skeleton: a filled batch is only accepted if each of its headers links to its predecessor via the parent hash, and if the batch as a whole links up with the skeleton headers on both sides.
With the above algorithm, we can guarantee that only headers that map cleanly onto the master peer's skeleton will be accepted, and we can do this with only lightweight hashing, without needing to process anything. Thus if the master peer is good, sync completes fast and clean.
On the other hand, if the master peer is bad and tries to feed us an invalid skeleton chain, either no peers can send us the correct data to fill the gaps (in which case we assume an attack and drop the master peer), or the master peer itself feeds us junk, which we can detect later during processing and drop the master peer for.
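To make the fitting rule concrete, here is a simplified sketch of the hash-linking check (the header type and function are illustrative stand-ins, not the actual queue code):

```go
package downloader

import (
	"bytes"
	"errors"
)

// header is a pared-down stand-in for the real types.Header.
type header struct {
	Hash       []byte // hash of this header
	ParentHash []byte // hash of its predecessor
}

// verifyFill checks that a batch of headers delivered by an arbitrary peer
// fits between two consecutive skeleton headers: each header must link to
// its predecessor via ParentHash, and the next skeleton header must link
// to the batch's last header. Only hashing is involved, no chain import.
func verifyFill(prevSkeleton *header, batch []*header, nextSkeleton *header) error {
	prev := prevSkeleton
	for _, h := range batch {
		if !bytes.Equal(h.ParentHash, prev.Hash) {
			return errors.New("header batch broke the hash chain")
		}
		prev = h
	}
	if !bytes.Equal(nextSkeleton.ParentHash, prev.Hash) {
		return errors.New("header batch does not link into the skeleton")
	}
	return nil
}
```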
Caveats
The eth protocol doesn't support request/response IDs, so we can only have one request of a given type (e.g. a header request) in flight per peer and still be able to match up the replies with the requests. This means we cannot dedicate the master peer to skeleton retrieval only and use the others to fill the gaps, since we might have only a single peer (e.g. in a private network setting). To work around this, the PR retrieves headers in batches: the master peer fetches one batch of skeleton headers, then joins the other peers in filling the gaps, and only requests the next skeleton batch once the current one is assembled.
Although the above batching solution works, it also introduces an annoying latency when one batch is finished and another is started, whereby during the filling of a skeleton batch other parts of the downloader might starve (e.g. receipts aren't being pulled). This however can be solved quite easily by checking whether a header delivery completes a prefix of the skeleton and immediately streaming the already completed parts to the rest of the downloader, without waiting for all the subsequent gaps to be filled.
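A simplified sketch of the prefix-streaming idea (the gap bookkeeping is invented for illustration and reuses the header stand-in from the sketch above; the real queue is considerably more involved):

```go
// filledGap is a stand-in for one inter-skeleton gap; the real queue also
// tracks requests, timeouts and peer assignments.
type filledGap struct {
	headers []*header // headers delivered for this gap
	done    bool      // whether the gap is completely filled
}

// completedPrefix pops every gap from the front of the skeleton that is
// already filled and returns its headers, so the body/receipt fetchers can
// start on them immediately instead of idling until the whole skeleton
// batch is assembled.
func completedPrefix(gaps []*filledGap) (ready []*header, rest []*filledGap) {
	i := 0
	for ; i < len(gaps) && gaps[i].done; i++ {
		ready = append(ready, gaps[i].headers...)
	}
	return ready, gaps[i:]
}
```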
Performance
With this PR, syncing the current testnet (~781K blocks) took 7 minutes and 50 seconds.