Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix/loop-sync-mode #78

Merged
merged 13 commits into from
Apr 15, 2020
Merged

Fix/loop-sync-mode #78

merged 13 commits into from
Apr 15, 2020

Conversation

meowsbits
Copy link
Member

@meowsbits meowsbits commented Mar 19, 2020

Fixes #75

Refines #76, removing non-essential debug log statements... etc. Trying to minimize unnecessary refactorings or along-the-way improvements to limit the complexity of the science here.


Notes on the issue this resolves and how

  • Return an error if attempting an "impossible sync," instead of just logging that an issue should be filed.
  • Why was there an "impossible sync" in the first place?
    • What is an impossible sync?
      • Either new chain or old chain is not greater than 0.
      • In our case, new and old blocks are the same.
        • Old chain is large (eg. 452695), and new chain is empty.
        • Old and new blocks have same number and hash.
    • What was the context of the impossible sync?
      • In all cases I've seen personally, the resync scenario has followed a harsh kill (eg kill -9).
      • There have also been BAD BLOCKS reported, and I've added them to the ETC config.
  • Optimize download algorithm.
    • Don't attempt any find ancestor negotiation if we're at genesis (all sync modes). We already know the origin should be 1.
    • Don't attempt span-search common ancestor negotiation if the remote is sufficiently higher than our local.
    • Don't ever allow the resulting common ancestor to be above our local head.
      • How could it be common if our downloader (its blockchain actually) thinks its head is below it?
        • Because our blockchain's head marker (depending on sync mode) might be beneath the block that has been found to exist in the database.
          Blockchain head marking is a totally separate mechanism from the actual blockchain "state." It is a fallible map
          of the territory.
          If a blockchain head marker is rolled back, but block data above that new marker is not removed, then
          the database will contain "secret" data.
          In this case, the common ancestor negotiation will ask the database if a block is available, and if it is, it will
          satisfy the existing algorithmic condition and be returned positively.
  • Why does the chain want to (re)sync from ~3735? (That number in particular?)
    • Not sure.

Multiple clients have reported BAD BLOCKS
with this data.

Signed-off-by: meows <[email protected]>
Now, the spanSearch algorithm is only used if local
head is within MaxHeaderFetch distance of remote.

Span searches with a remote with a distant head don't
make sense.

The only thing this changes for the 'API' is that
the log lines are now Debug instead of Trace
for the different ancestor search algorithms.

Signed-off-by: meows <[email protected]>
return err
var origin = uint64(0)
if d.blockchain.CurrentHeader().Number.Uint64() > 0 {
origin, err = d.findAncestor(p, latest)
Copy link
Member Author

@meowsbits meowsbits Mar 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only negotiate a non-zero common ancestor if our current Header (should be common to all sync modes) is actually above genesis.

if remoteHeight-localHeight < uint64(MaxHeaderFetch) {
a, err := d.findAncestorSpanSearch(p, remoteHeight, localHeight, floor)
if err == nil {
if a > localHeight {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Never allow a "common ancestor" to be above our local head.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Again, PTAL my logic.

}

// Only attempt span search if local head is "close" to reported remote
if remoteHeight-localHeight < uint64(MaxHeaderFetch) {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only attempt the spanSearch algo if we're within throwing distance of the remote. Otherwise this would only return blocks above our local if anything at all.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

But please double check my logic here.

Some tests apparently don't use a rreeeallll
downloader. This makes them not panic.

Signed-off-by: meows <[email protected]>
known = d.lightchain.HasHeader(h, n)
default:
log.Crit("unknown sync mode", "mode", d.mode)
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a new potential but theoretically-noop logic change, and like all audacious panics, it thinks it never should or will happen. The d.mode.String() will return "unknown."

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks legit.

These are nearly after the original ECIP1061 fork.

Signed-off-by: meows <[email protected]>
@meowsbits meowsbits marked this pull request as ready for review March 23, 2020 11:59
@meowsbits meowsbits changed the title WIP: Dev/20200319 Fix/loop-sync-mode Mar 23, 2020
@meowsbits meowsbits requested a review from soc1c March 23, 2020 12:01
@meowsbits meowsbits mentioned this pull request Mar 23, 2020
@q9f
Copy link
Contributor

q9f commented Mar 25, 2020

This branch gives me the following on kotti:

INFO [03-25|21:49:23.412] Imported new chain segment               blocks=1    txs=0 mgas=0.000 elapsed=676.413µs mgasps=0.000 number=2092502 hash=b46f4b…87f45c age=6m11s dirty=0.00B
INFO [03-25|21:49:23.418] Imported new chain segment               blocks=1    txs=2 mgas=0.051 elapsed=4.860ms   mgasps=10.432 number=2092503 hash=38bbe8…7efc7d age=5m56s dirty=4.59KiB
INFO [03-25|21:49:23.422] Imported new chain segment               blocks=1    txs=2 mgas=0.051 elapsed=3.530ms   mgasps=14.345 number=2092504 hash=4542bc…c2ef66 age=5m41s dirty=8.84KiB
INFO [03-25|21:49:23.430] Imported new chain segment               blocks=1    txs=2 mgas=0.051 elapsed=4.919ms   mgasps=10.307 number=2092505 hash=a084e0…d572a1 age=5m26s dirty=12.82KiB
INFO [03-25|21:49:23.433] Imported new chain segment               blocks=1    txs=2 mgas=0.051 elapsed=2.032ms   mgasps=24.944 number=2092506 hash=48eb17…0b3803 age=5m11s dirty=17.45KiB
INFO [03-25|21:49:23.435] Imported new chain segment               blocks=1    txs=0 mgas=0.000 elapsed=606.592µs mgasps=0.000  number=2092506 hash=0ff777…f23f6e age=5m11s dirty=17.45KiB
INFO [03-25|21:49:23.754] Imported new chain segment               blocks=1    txs=0 mgas=0.000 elapsed=8.369ms   mgasps=0.000  number=2092507 hash=93a2ee…bae189 age=4m56s dirty=17.45KiB
INFO [03-25|21:49:23.837] Imported new chain segment               blocks=2    txs=4 mgas=0.101 elapsed=5.614ms   mgasps=18.049 number=2092509 hash=dda05c…5c0e10 age=4m26s dirty=25.67KiB
INFO [03-25|21:49:23.842] Imported new chain segment               blocks=1    txs=0 mgas=0.000 elapsed=515.329µs mgasps=0.000  number=2092510 hash=34c5fb…28dd99 age=4m11s dirty=25.67KiB
INFO [03-25|21:49:23.915] Chain reorg detected                     number=2092509 hash=dda05c…5c0e10 drop=1 dropfrom=34c5fb…28dd99 add=1 addfrom=bd238b…a29fbb
INFO [03-25|21:49:23.920] Imported new chain segment               blocks=4    txs=8 mgas=0.203 elapsed=8.310ms   mgasps=24.389 number=2092513 hash=be2237…7e05cb age=3m26s dirty=43.12KiB
INFO [03-25|21:49:25.860] Deep froze chain segment                 blocks=13   elapsed=226.280ms number=2002512 hash=1ccd45…0c0883
WARN [03-25|21:49:44.857] Synchronisation failed, dropping peer    peer=e34e54dfc274445f err=timeout
INFO [03-25|21:52:02.486] Importing heavy sidechain segment        blocks=2048 start=3735    end=5782
ERROR[03-25|21:57:36.853] Impossible reorg, please file an issue   oldnum=3736 oldhash=cc596e…129218 newnum=3736 newhash=cc596e…129218
WARN [03-25|21:57:36.853] Synchronisation failed, dropping peer    peer=3af499147df72f36 err="retrieved hash chain is invalid"
INFO [03-25|21:59:27.143] Importing heavy sidechain segment        blocks=2048 start=32579   end=34626

I didn't observe that on #76

~/core-geth dev/20200319 one@ginan
❯ git log -n1
commit 6530aa15694751fd1c67a7579af1f0827639d8f4 (HEAD -> dev/20200319, origin/dev/20200319)
Author: meows <[email protected]>
Date:   Wed Mar 25 05:28:43 2020 -0500

    downloader: reorder methods for readability, add godoc-happy comments
    
    Signed-off-by: meows <[email protected]>

~/core-geth dev/20200319 one@ginan
❯ build/bin/geth version
CoreGeth
Version: 1.11.3-unstable
Git Commit: 6530aa15694751fd1c67a7579af1f0827639d8f4
Git Commit Date: 20200325
Architecture: amd64
Protocol Versions: [65 64 63]
Go Version: go1.14.1
Operating System: linux
GOPATH=
GOROOT=/usr/lib/go

@q9f
Copy link
Contributor

q9f commented Mar 25, 2020

Another node the same

INFO [03-25|22:01:53.413] Imported new chain segment               blocks=1    txs=2  mgas=0.051 elapsed=2.344ms   mgasps=21.598 number=2092549 hash=cbd3d3…dcc20a age=6m56s dirty=174.37KiB
INFO [03-25|22:01:53.417] Imported new chain segment               blocks=1    txs=0  mgas=0.000 elapsed=491.854µs mgasps=0.000  number=2092550 hash=098a76…b5be1c age=6m41s dirty=174.37KiB
INFO [03-25|22:01:53.421] Imported new chain segment               blocks=1    txs=2  mgas=0.051 elapsed=3.065ms   mgasps=16.541 number=2092550 hash=1e7491…9facae age=6m41s dirty=178.99KiB
INFO [03-25|22:01:53.843] Chain reorg detected                     number=2092549 hash=cbd3d3…dcc20a drop=1 dropfrom=098a76…b5be1c add=2 addfrom=692063…c084bd
INFO [03-25|22:01:53.845] Imported new chain segment               blocks=2    txs=4  mgas=0.101 elapsed=6.283ms   mgasps=16.130 number=2092552 hash=0b8282…f69264 age=6m11s dirty=187.83KiB
INFO [03-25|22:01:53.879] Imported new chain segment               blocks=1    txs=2  mgas=0.051 elapsed=2.053ms   mgasps=24.697 number=2092553 hash=5ad984…1dbd38 age=5m56s dirty=191.81KiB
INFO [03-25|22:01:53.881] Imported new chain segment               blocks=1    txs=0  mgas=0.000 elapsed=906.704µs mgasps=0.000  number=2092554 hash=286062…d41004 age=5m41s dirty=191.81KiB
INFO [03-25|22:01:53.886] Imported new chain segment               blocks=1    txs=2  mgas=0.051 elapsed=3.078ms   mgasps=16.473 number=2092554 hash=326f67…5bfe6b age=5m41s dirty=196.44KiB
INFO [03-25|22:01:53.887] Imported new chain segment               blocks=1    txs=0  mgas=0.000 elapsed=614.925µs mgasps=0.000  number=2092555 hash=4e4d17…6132f7 age=5m17s dirty=196.44KiB
INFO [03-25|22:01:53.895] Imported new chain segment               blocks=1    txs=4  mgas=0.101 elapsed=5.688ms   mgasps=17.828 number=2092556 hash=e044fc…2c06a7 age=5m2s  dirty=200.69KiB
INFO [03-25|22:01:53.897] Imported new chain segment               blocks=1    txs=0  mgas=0.000 elapsed=590.06µs  mgasps=0.000  number=2092557 hash=63184d…7bda42 age=4m47s dirty=200.69KiB
INFO [03-25|22:01:53.904] Imported new chain segment               blocks=1    txs=2  mgas=0.051 elapsed=3.592ms   mgasps=14.077 number=2092558 hash=18e55e…79fd3c age=4m32s dirty=205.31KiB
INFO [03-25|22:01:53.917] Imported new chain segment               blocks=6    txs=14 mgas=0.152 elapsed=12.831ms  mgasps=11.844 number=2092564 hash=aa616c…aa43e7 age=2m54s dirty=220.40KiB ignored=6
INFO [03-25|22:01:53.920] Imported new chain segment               blocks=1    txs=0  mgas=0.000 elapsed=458.301µs mgasps=0.000  number=2092562 hash=a38ef8…d568f8 age=3m29s dirty=220.40KiB
INFO [03-25|22:01:55.998] Looking for peers                        peercount=8 tried=97  static=0
INFO [03-25|22:02:12.359] Deep froze chain segment                 blocks=17   elapsed=239.178ms number=2002563 hash=9a22f3…13b7f1
INFO [03-25|22:04:10.754] Importing heavy sidechain segment        blocks=2048 start=3735    end=5782

@meowsbits
Copy link
Member Author

Thanks for the report!

  • Can you please include --verbosity 4 or greater logs? Seeing DEBUG levels will be useful.
  • Did the node in question ever suffer a harsh kill (eg. kill -9)?
  • Is it possible to share your chaindata database for reproduction?

I'll take another look at it. @tzdybal can you lend another set of 👀 as well?

@meowsbits
Copy link
Member Author

I'm tempted to add a bc.SetHead in the case the bc.import_ wants to reorg beyond the vars.ImmutabilityThreshold. This would ensure corrupt state(s) get purged if the downloader common ancestor mechanism misfires due to a bad HasBlockAndState.

@q9f
Copy link
Contributor

q9f commented Mar 25, 2020

  • Sending logs on Discord
  • I did not kill the nodes, this happens in regular operation
  • The db is 1.0 GB can be shared if we run out of ideas

@tzdybal
Copy link
Contributor

tzdybal commented Mar 26, 2020

I'm looking on this as well.

@meowsbits
Copy link
Member Author

P212: 716617 EIP213: 716617 EIP214: 716617 EIP2200: 2200013 EIP658: 716617 EIP7: 0 EthashECIP1010Continue: 2000000 EthashECIP1010Pause: 0 EthashECIP1017: 5000000 EthashECIP1041: 0 EthashEIP100B: 716617 EthashEIP2: 0 EthashHomestead: 0 "
INFO [03-25|22:55:41.287|eth/backend.go:170]                 Initialising Ethereum protocol           versions="[65 64 63]" network=6 dbversion=7
WARN [03-25|22:55:41.290|core/blockchain.go:344]             Head state missing, repairing chain      number=2092568 hash=08c64c…ce8434
INFO [03-25|22:55:41.291|core/blockchain.go:580]             Rewound blockchain to past state         number=2092565 hash=e80520…eff927

This line in the first few of boot-kotti-core-polis.log

@q9f
Copy link
Contributor

q9f commented Mar 26, 2020

Yeah, I keep getting head state missing on most of my nodes. Is anything corrupting the database?

@meowsbits
Copy link
Member Author

meowsbits commented Mar 26, 2020

Head state missing means something has gone wrong. The question is where when and why.

Clarity: I mean it could have been caused by a number of things.

@q9f
Copy link
Contributor

q9f commented Mar 26, 2020

But @meowsbits this is entirely a different issue! Let's fix the loop sync first.

Why didn't we see loops with #76? It just came back after compiling this branch.

@meowsbits meowsbits mentioned this pull request Mar 26, 2020
@meowsbits
Copy link
Member Author

I've read these logs, and don't see the loop sync issue there. ?

(And the logs in this gist are different than those originally reported at #78 (comment))

@q9f can you point to or highlight the lines in either of the logs in the gist that show the issue?

@q9f
Copy link
Contributor

q9f commented Mar 26, 2020

I've read these logs, and don't see the loop sync issue there. ?

That's bad luck. I cannot not turn on the bad-sync-loop switch and increasing the verbosity to 4 or 5 makes it impossible for me to see what's going on. I was just hoping it contained the sync loop.

(And the logs in this gist are different than those originally reported at #78 (comment))

Yeah, because I had to restart the node to increase verbosity.

@meowsbits
Copy link
Member Author

meowsbits commented Mar 26, 2020

Got it.

Just FYI, you can use the Javascript console debug.verbosity(n) to set it during runtime.

@q9f
Copy link
Contributor

q9f commented Mar 26, 2020

Ah good to know

Copy link
Contributor

@q9f q9f left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGMT in general as this actually improves the situation.

There is, however, a small performance impact for searching the entire tree for common ancestors.

INFO [03-30|12:00:25.754] Imported new chain segment               blocks=1    txs=0    mgas=0.000  elapsed=1.222ms   mgasps=0.000   number=2118251 hash=df27b9…ccb312 dirty=287.70KiB
INFO [03-30|12:01:10.595] Deep froze chain segment                 blocks=2    elapsed=1.189s    number=2028250 hash=5be331…dc36a9
INFO [03-30|12:02:14.851] Importing heavy sidechain segment        blocks=2048 start=3735    end=5782
ERROR[03-30|12:14:07.815] Impossible reorg, please file an issue   oldnum=3736  oldhash=cc596e…129218 newnum=3736  newhash=cc596e…129218
WARN [03-30|12:14:07.816] Synchronisation failed, dropping peer    peer=00759cb5fd8a35c8 err="retrieved hash chain is invalid"
WARN [03-30|12:14:08.485] Header broke chain ordering              number=2118276 hash=9267da…98f5df expected=2118252
WARN [03-30|12:14:08.489] Synchronisation failed, dropping peer    peer=5b1fe926e0ed8818 err="action from bad peer ignored"
INFO [03-30|12:14:08.529] Chain reorg detected                     number=2118250 hash=036a16…fc4211 drop=1 dropfrom=df27b9…ccb312 add=2 addfrom=2a9bb4…7629c0
INFO [03-30|12:14:08.530] Imported new chain segment               blocks=1    txs=0    mgas=0.000  elapsed=713.749ms mgasps=0.000   number=2118252 hash=2a9bb4…7629c0 age=13m30s   dirty=287.70KiB
INFO [03-30|12:14:08.535] Imported new chain segment               blocks=1    txs=0    mgas=0.000  elapsed=4.176ms   mgasps=0.000   number=2118252 hash=a5c57c…7f6950 age=13m30s   dirty=287.70KiB

^ here: 12 minutes.

@meowsbits
Copy link
Member Author

a small performance impact for searching the entire tree for common ancestors.

Can you say more about what do you mean here? This change actually improves the common ancestor networking step by skipping a useless span search for far-away peers, and skipping binary search when local is at genesis.

@q9f
Copy link
Contributor

q9f commented Mar 30, 2020

Ok, unsure then, I didn't really trace what's it doing these 12 minutes. I agree that it improves the situation overall. But it appears that the node entirely stops operating during that time it tries to reorganize.

@meowsbits
Copy link
Member Author

But it appears that the node entirely stops operating during that time it tries to reorganize.

👏 ver 👏 bose 👏 logs 👏 please 👏

😹 🕹️

Are you using a service runner to manage your instances? (I like systemd for a lot of reasons... one being that manages my logs for services very nicely.)

@meowsbits
Copy link
Member Author

meowsbits commented Mar 30, 2020

But it appears that the node entirely stops operating during that time it tries to reorganize.

@q9f
I might have an intuition what's going on here. Here's a hypothesis:

You know that reorg from block 3735? The protocol is attempting that insert and returning a known block error(or not) for those blocks if it needs a download. I don't think this means that you are "silently resyncing" the chain (because 12 minutes would be awful fast), but that the chain is "revalidating/reprocessing" the existing chain in search of a stable head
as it iterates forward from that original low number.

@meowsbits meowsbits merged commit 316b3e8 into master Apr 15, 2020
@meowsbits
Copy link
Member Author

Just FYI, rel 23cc3fa :

?       github.com/ethereum/go-ethereum/internal/web3ext        [no test files]
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x28 pc=0x9a3e5d]

goroutine 830644 [running]:
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).syncWithPeer(0xc01637ac40, 0xc015aa9d10, 0x9d16c1768a526be3, 0xb4f565cf9249fb63, 0xf029c88ca52145dc, 0x8a6f09394d5aed73, 0xc015af28c0, 0x0, 0x0)
        /home/ia/go/src/github.com/ethereum/go-ethereum/eth/downloader/downloader.go:447 +0x3cd
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).synchronise(0xc01637ac40, 0xc016208d80, 0x40, 0x9d16c1768a526be3, 0xb4f565cf9249fb63, 0xf029c88ca52145dc, 0x8a6f09394d5aed73, 0xc015af28c0, 0x2, 0x0, ...)
        /home/ia/go/src/github.com/ethereum/go-ethereum/eth/downloader/downloader.go:413 +0x36c
github.com/ethereum/go-ethereum/eth/downloader.(*Downloader).Synchronise(0xc01637ac40, 0xc016208d80, 0x40, 0x9d16c1768a526be3, 0xb4f565cf9249fb63, 0xf029c88ca52145dc, 0x8a6f09394d5aed73, 0xc015af28c0, 0x2, 0x187f739, ...)
        /home/ia/go/src/github.com/ethereum/go-ethereum/eth/downloader/downloader.go:325 +0x8e
github.com/ethereum/go-ethereum/les.(*clientHandler).synchronise(0xc016301ea0, 0xc01637bc00)
        /home/ia/go/src/github.com/ethereum/go-ethereum/les/sync.go:181 +0x69a
github.com/ethereum/go-ethereum/les.(*lightFetcher).newFetcherDistReqForSync.func3.1(0x11f1560, 0xc01637bc00, 0xc010a286e0)
        /home/ia/go/src/github.com/ethereum/go-ethereum/les/fetcher.go:509 +0x91
created by github.com/ethereum/go-ethereum/les.(*lightFetcher).newFetcherDistReqForSync.func3
        /home/ia/go/src/github.com/ethereum/go-ethereum/les/fetcher.go:506 +0xb5
FAIL    github.com/ethereum/go-ethereum/les     17.034s

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

clique perpetual resync from ~block3000
3 participants