Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Check + repair state on corrupted database #21650

Closed
wants to merge 2 commits into from

Conversation

holiman
Copy link
Contributor

@holiman holiman commented Oct 2, 2020

This is a work in progress to heal the state.

If you are reading this, and have encountered a corrupt db (e.g. due to crash), then I'd be very glad if you give this PR a shot at repairing it. It may totally fail, but it would be highly valuable to get feedback from actual runs on actual problematic databases.

The idea is to run geth --repairtrie, which iterates the state and checks if everythiing is present.
If it is not, it finds out which parent nodes are present, and nukes them from the database, whereupon regular fast-sync can be performed to heal the database (to a more recent state). Essentially partial-fast-sync without wiping the entire datadir.

Current status:

INFO [10-02|12:18:55.891] State root                               blockhash="e0ecd1…d1808d" state root="886c86…c415e1"
INFO [10-02|12:18:55.891] Starting verification procedure 
ERROR[10-02|12:18:56.577] Trie error                               path=000e060e hash="eaf01a…b19a0a" parent="792b1b…77fddc" error="missing trie node 577f89cbf0c3d8f1ad051412318251f638fd753f778d034dfe0eac53909521da (path 000e060e09)"
ERROR[10-02|12:18:56.577] Parent                                   hash="792b1b…77fddc" parent="98cd0a…a15607" path=000e06
ERROR[10-02|12:18:56.577] Parent                                   hash="98cd0a…a15607" parent="550caf…a25a64" path=000e
ERROR[10-02|12:18:56.577] Parent                                   hash="550caf…a25a64" parent="886c86…c415e1" path=00
ERROR[10-02|12:18:56.577] Parent                                   hash="886c86…c415e1" parent="000000…000000" path=
Elements that need to be removed:

792b1bcade654c1c98d2899836b786feefaf2160ba96545067860aa32577fddc
98cd0a1b9bcbc19194f029db952b3675e192a000672ca88b3eb0818f79a15607
550caf10590e7218e3bad0c10688b148c7a15089edb1bf9aa3125c39a0a25a64
886c865949500f3155f6219b313b7ae67118cfff3791b306885575d3f0c415e1

trie error: missing trie node 577f89cbf0c3d8f1ad051412318251f638fd753f778d034dfe0eac53909521da (path 000e060e09) , parent 3cf17d15f40cc111edcb47ec623ef937c8d7944dec939f0efacd14ec3b0b1d01

Some TODOs

  • delete the entries from DB
  • trigger a repair-fastsync
  • also iterate/heal storage tries

@holiman
Copy link
Contributor Author

holiman commented Oct 2, 2020

First run, where I instrumented the trie thus:

--- a/trie/trie.go
+++ b/trie/trie.go
@@ -482,6 +482,10 @@ func (t *Trie) resolve(n node, prefix []byte) (node, error) {

 func (t *Trie) resolveHash(n hashNode, prefix []byte) (node, error) {
        hash := common.BytesToHash(n)
+       if bytes.Equal(prefix, []byte("\x00\x0e\x06\x0e\x09")){
+               return nil, &MissingNodeError{NodeHash: hash, Path: prefix}
+       }
+
        if node := t.db.node(hash); node != nil {
                return node, nil
        }

In order to trigger a missing node error. Running it triggers a fault inside a storage trie:

[user@work go-ethereum]$ make geth && ./build/bin/geth --goerli  repairtrie
env GO111MODULE=on go run build/ci.go install ./cmd/geth
>>> /usr/local/go/bin/go install -ldflags -X main.gitCommit=b7003af2b39910058dc7e493dca7f692866b8f82 -X main.gitDate=20201002 -trimpath -v ./cmd/geth
Done building.
Run "./build/bin/geth" to launch geth.
INFO [10-02|18:18:58.889] Maximum peer count                       ETH=50 LES=0 total=50
INFO [10-02|18:18:58.889] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
ERROR[10-02|18:18:58.891] Failed to enumerate USB devices          hub=ledger vendor=11415 failcount=1 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:18:58.891] Failed to enumerate USB devices          hub=trezor vendor=21324 failcount=1 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:18:58.891] Failed to enumerate USB devices          hub=trezor vendor=4617  failcount=1 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:18:58.891] Failed to enumerate USB devices          hub=ledger vendor=11415 failcount=2 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:18:58.891] Failed to enumerate USB devices          hub=trezor vendor=21324 failcount=2 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:18:58.891] Failed to enumerate USB devices          hub=trezor vendor=4617  failcount=2 err="failed to initialize libusb: libusb: unknown error [code -99]"
INFO [10-02|18:18:58.893] Set global gas cap                       cap=25000000
INFO [10-02|18:18:58.893] Allocated cache and file handles         database=/home/user/.ethereum/goerli/geth/chaindata cache=512.00MiB handles=262144
ERROR[10-02|18:18:59.894] Failed to enumerate USB devices          hub=trezor vendor=4617  failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:18:59.894] Failed to enumerate USB devices          hub=trezor vendor=21324 failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:18:59.902] Failed to enumerate USB devices          hub=ledger vendor=11415 failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
INFO [10-02|18:19:04.169] Opened ancient database                  database=/home/user/.ethereum/goerli/geth/chaindata/ancient
INFO [10-02|18:19:04.185] Persisted trie from memory database      nodes=361 size=51.17KiB time="857.461µs" gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [10-02|18:19:04.194] Loaded most recent local header          number=3505559 hash="13e0ab…8f844b" td=5091612 age=4m
INFO [10-02|18:19:04.194] Loaded most recent local full block      number=3505559 hash="13e0ab…8f844b" td=5091612 age=4m
INFO [10-02|18:19:04.194] Loaded most recent local fast block      number=3505559 hash="13e0ab…8f844b" td=5091612 age=4m
INFO [10-02|18:19:04.196] Loaded last fast-sync pivot marker       number=3505461
INFO [10-02|18:19:04.255] State root                               blockhash="13e0ab…8f844b" state root="402ddf…35123d"
INFO [10-02|18:19:04.255] Starting verification procedure
INFO [10-02|18:19:04.255] Verifying state trie                     elapsed="6.558µs" nodes=2   accounts=0 slots=0 lastAccount=0x path=0x
INFO [10-02|18:19:12.255] Verifying storage trie                   elapsed=8.000096209s nodes=49955 accounts=2040 slots=16782 lastAccount=0x014482cb6f805dba8fe8b14472f583d3b6c00b28f008cbe5eb4f221d8ae52e8e path=0x0e0a0a06
INFO [10-02|18:19:20.256] Verifying storage trie                   elapsed=16.000738904s nodes=124307 accounts=5128 slots=41760 lastAccount=0x034f39843a563a1a8ebc41ed1bd1c89978db2e78f475915a8e79cd286a283863 path=0x0400
INFO [10-02|18:19:28.257] Verifying storage trie                   elapsed=24.00149943s  nodes=219630 accounts=10384 slots=70896 lastAccount=0x06abb624a0c2f006d3f1be327dc7b01de77739a38f52ec6f5aa96191a2110417 path=0x060007
INFO [10-02|18:19:36.257] Verifying storage trie                   elapsed=32.002240519s nodes=320805 accounts=14603 slots=104660 lastAccount=0x09534c0dacce9e3614e635a6fc024f2cbd9e6f6ec30112cbe6a8b28b287e622a path=0x0f03
INFO [10-02|18:19:44.259] Verifying storage trie                   elapsed=40.004123094s nodes=429775 accounts=17484 slots=144451 lastAccount=0x0b2eb16afade5164e942381593704695829ba4cc8010e82be9f5e36d0f376216 path=0x01020c0e
INFO [10-02|18:19:52.260] Verifying storage trie                   elapsed=48.005070881s nodes=595518 accounts=21764 slots=205685 lastAccount=0x0def036ddaf660c3bc1dda63bbb828da3ea5aa39237241d38e7edb1f5572c5ef path=0x04080001
ERROR[10-02|18:19:59.136] Storage trie error                       path=000e060e   hash="f8a04e…cd4a4f" parent="66e626…147278" error="missing trie node a70cf18665863276809553bff241807e8021b219bdbd6d7d1c7c6c9e20149da6 (path 000e060e09)"
Elements that need to be removed:

66e626049926c0d4131b867c79ce1ddb8f654ee73a1df13e565e2757aa147278
ERROR[10-02|18:19:59.136] Parent                                   hash="66e626…147278" parent="496229…26d306" path=0x000e06
4962296a513bff78b8e6a6cc03a18a43c3a4c951f143bbb48a8618ff5f26d306
ERROR[10-02|18:19:59.136] Parent                                   hash="496229…26d306" parent="ae6931…58ef5a" path=0x000e
ae69311532b4b6a7a7f841e7103ab1502a55cdeaeff433fd20c0accf5d58ef5a
ERROR[10-02|18:19:59.136] Parent                                   hash="ae6931…58ef5a" parent="4f0730…f366ff" path=0x00
4f073081496508e1119102469ab1ea833b19cec9ad344fa397193a01d9f366ff
ERROR[10-02|18:19:59.136] Parent                                   hash="4f0730…f366ff" parent="000000…000000" path=0x
ERROR[10-02|18:19:59.136] Trie error                               path=000e060104000c070f05080c050c0f0e0901090e000000020301000a0f0a07020a070a0f030b0e0a03050c050907000e030301060a05030805050d0109050c0910 hash="000000…000000" parent="522644…4b674b" error="storage trie error: missing trie node a70cf18665863276809553bff241807e8021b219bdbd6d7d1c7c6c9e20149da6 (path 000e060e09)"
5226448e48cf6c9ff1e1fac1f9f3ab34de5809faca550eab156c522a144b674b
ERROR[10-02|18:19:59.136] Parent                                   hash="522644…4b674b" parent="ead9ca…40a50d" path=0x000e060104
ead9ca312a6f91635f0a775af8cd761dd56536838cc1d153190436718640a50d
ERROR[10-02|18:19:59.136] Parent                                   hash="ead9ca…40a50d" parent="1dbe13…1a0102" path=0x000e0601
1dbe13f1b1ce2c36419f247dc2db3008b545666fe4b8ee271bd72c53ca1a0102
ERROR[10-02|18:19:59.136] Parent                                   hash="1dbe13…1a0102" parent="c159b3…441c29" path=0x000e06
c159b3a88ade2c9a073e3e581b5ac9b4057223f3d264c2ea4e1f3f4a30441c29
ERROR[10-02|18:19:59.136] Parent                                   hash="c159b3…441c29" parent="8faa0b…abe14a" path=0x000e
8faa0b0b7445fd1fd254d64575bc7f178aa693e543054ad7f779cd0b2eabe14a
ERROR[10-02|18:19:59.136] Parent                                   hash="8faa0b…abe14a" parent="402ddf…35123d" path=0x00
402ddf9bcbd6bcdff16f967fb92ed0cae9a00b48b3908bc6cdb8de758435123d
ERROR[10-02|18:19:59.136] Parent                                   hash="402ddf…35123d" parent="000000…000000" path=0x
Elements that need to be removed:

66e626049926c0d4131b867c79ce1ddb8f654ee73a1df13e565e2757aa147278
4962296a513bff78b8e6a6cc03a18a43c3a4c951f143bbb48a8618ff5f26d306
ae69311532b4b6a7a7f841e7103ab1502a55cdeaeff433fd20c0accf5d58ef5a
4f073081496508e1119102469ab1ea833b19cec9ad344fa397193a01d9f366ff
5226448e48cf6c9ff1e1fac1f9f3ab34de5809faca550eab156c522a144b674b
ead9ca312a6f91635f0a775af8cd761dd56536838cc1d153190436718640a50d
1dbe13f1b1ce2c36419f247dc2db3008b545666fe4b8ee271bd72c53ca1a0102
c159b3a88ade2c9a073e3e581b5ac9b4057223f3d264c2ea4e1f3f4a30441c29
8faa0b0b7445fd1fd254d64575bc7f178aa693e543054ad7f779cd0b2eabe14a
402ddf9bcbd6bcdff16f967fb92ed0cae9a00b48b3908bc6cdb8de758435123d
Deleting hash 66e626049926c0d4131b867c79ce1ddb8f654ee73a1df13e565e2757aa147278
Deleting hash 4962296a513bff78b8e6a6cc03a18a43c3a4c951f143bbb48a8618ff5f26d306
Deleting hash ae69311532b4b6a7a7f841e7103ab1502a55cdeaeff433fd20c0accf5d58ef5a
Deleting hash 4f073081496508e1119102469ab1ea833b19cec9ad344fa397193a01d9f366ff
Deleting hash 5226448e48cf6c9ff1e1fac1f9f3ab34de5809faca550eab156c522a144b674b
Deleting hash ead9ca312a6f91635f0a775af8cd761dd56536838cc1d153190436718640a50d
Deleting hash 1dbe13f1b1ce2c36419f247dc2db3008b545666fe4b8ee271bd72c53ca1a0102
Deleting hash c159b3a88ade2c9a073e3e581b5ac9b4057223f3d264c2ea4e1f3f4a30441c29
Deleting hash 8faa0b0b7445fd1fd254d64575bc7f178aa693e543054ad7f779cd0b2eabe14a
Deleting hash 402ddf9bcbd6bcdff16f967fb92ed0cae9a00b48b3908bc6cdb8de758435123d
INFO [10-02|18:19:59.136] Writing genesis as headblock             hash="bf7e33…b88c1a"
Please restart the node in fast-sync mode, and hope that it works!

For the second run, I removed the trie-hack, and ran:

[user@work go-ethereum]$ make geth && ./build/bin/geth --goerli  --txlookuplimit 1
env GO111MODULE=on go run build/ci.go install ./cmd/geth
>>> /usr/local/go/bin/go install -ldflags -X main.gitCommit=b7003af2b39910058dc7e493dca7f692866b8f82 -X main.gitDate=20201002 -trimpath -v ./cmd/geth
github.com/ethereum/go-ethereum/core/state
github.com/ethereum/go-ethereum/consensus/clique
github.com/ethereum/go-ethereum/consensus/ethash
github.com/ethereum/go-ethereum/core
github.com/ethereum/go-ethereum/eth/fetcher
github.com/ethereum/go-ethereum/light
github.com/ethereum/go-ethereum/eth/downloader
github.com/ethereum/go-ethereum/miner
github.com/ethereum/go-ethereum/eth
github.com/ethereum/go-ethereum/les
github.com/ethereum/go-ethereum/cmd/geth
Done building.
Run "./build/bin/geth" to launch geth.
INFO [10-02|18:20:38.724] Starting Geth on Görli testnet...
INFO [10-02|18:20:38.725] Maximum peer count                       ETH=50 LES=0 total=50
INFO [10-02|18:20:38.725] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
ERROR[10-02|18:20:38.725] Failed to enumerate USB devices          hub=ledger vendor=11415 failcount=1 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:20:38.725] Failed to enumerate USB devices          hub=trezor vendor=21324 failcount=1 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:20:38.726] Failed to enumerate USB devices          hub=trezor vendor=4617  failcount=1 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:20:38.726] Failed to enumerate USB devices          hub=ledger vendor=11415 failcount=2 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:20:38.726] Failed to enumerate USB devices          hub=trezor vendor=21324 failcount=2 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:20:38.726] Failed to enumerate USB devices          hub=trezor vendor=4617  failcount=2 err="failed to initialize libusb: libusb: unknown error [code -99]"
INFO [10-02|18:20:38.727] Set global gas cap                       cap=25000000
INFO [10-02|18:20:38.727] Allocated trie memory caches             clean=256.00MiB dirty=256.00MiB
INFO [10-02|18:20:38.727] Allocated cache and file handles         database=/home/user/.ethereum/goerli/geth/chaindata cache=512.00MiB handles=262144
INFO [10-02|18:20:38.774] Opened ancient database                  database=/home/user/.ethereum/goerli/geth/chaindata/ancient
INFO [10-02|18:20:38.795] Persisted trie from memory database      nodes=361 size=51.17KiB time="790.915µs" gcnodes=0 gcsize=0.00B gctime=0s livenodes=1 livesize=0.00B
INFO [10-02|18:20:38.795] Initialised chain configuration          config="{ChainID: 5 Homestead: 0 DAO: <nil> DAOSupport: true EIP150: 0 EIP155: 0 EIP158: 0 Byzantium: 0 Constantinople: 0 Petersburg: 0 Istanbul: 1561651, Muir Glacier: <nil>, YOLO v1: <nil>, Engine: clique}"
INFO [10-02|18:20:38.796] Initialising Ethereum protocol           versions="[65 64 63]" network=5 dbversion=8
INFO [10-02|18:20:39.103] Loaded most recent local header          number=3505559 hash="13e0ab…8f844b" td=5091612 age=5m35s
INFO [10-02|18:20:39.103] Loaded most recent local full block      number=0       hash="bf7e33…b88c1a" td=1       age=1y8mo1w
INFO [10-02|18:20:39.103] Loaded most recent local fast block      number=3505559 hash="13e0ab…8f844b" td=5091612 age=5m35s
INFO [10-02|18:20:39.103] Loaded last fast-sync pivot marker       number=3505461
INFO [10-02|18:20:39.165] Loaded local transaction journal         transactions=0 dropped=0
INFO [10-02|18:20:39.165] Regenerated local transaction journal    transactions=0 accounts=0
INFO [10-02|18:20:39.174] Allocated fast sync bloom                size=512.00MiB
INFO [10-02|18:20:39.174] Starting peer-to-peer node               instance=Geth/v1.9.23-unstable-b7003af2-20201002/linux-amd64/go1.14.4
INFO [10-02|18:20:39.352] New local node record                    seq=81 id=b23ed698170364bf ip=127.0.0.1 udp=30303 tcp=30303
INFO [10-02|18:20:39.356] IPC endpoint opened                      url=/home/user/.ethereum/goerli/geth.ipc
INFO [10-02|18:20:39.356] Started P2P networking                   self=enode://edeb01f0a7e397bc50b663a33458d5c2414db6458e272bbabbbd5ded34c935725219f55665dd4d86efd2d123de8da928da7bdf7f1fa82ce2d5fa1eb217a7ffd3@127.0.0.1:30303
ERROR[10-02|18:20:39.730] Failed to enumerate USB devices          hub=trezor vendor=4617  failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:20:39.731] Failed to enumerate USB devices          hub=trezor vendor=21324 failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
ERROR[10-02|18:20:39.731] Failed to enumerate USB devices          hub=ledger vendor=11415 failcount=3 err="failed to initialize libusb: libusb: unknown error [code -99]"
INFO [10-02|18:20:47.188] Initializing fast sync bloom             items=3067262 errorrate=0.000 elapsed=8.005s
INFO [10-02|18:20:49.363] Looking for peers                        peercount=0 tried=3 static=0
INFO [10-02|18:20:53.938] New local node record                    seq=82 id=b23ed698170364bf ip=46.59.37.153 udp=30303 tcp=30303
INFO [10-02|18:20:55.188] Initializing fast sync bloom             items=6371288 errorrate=0.000 elapsed=16.005s
INFO [10-02|18:20:59.502] Looking for peers                        peercount=0 tried=13 static=0
INFO [10-02|18:21:03.188] Initializing fast sync bloom             items=7637369 errorrate=0.000 elapsed=24.005s
INFO [10-02|18:21:09.574] Looking for peers                        peercount=0 tried=6  static=0
INFO [10-02|18:21:11.189] Initializing fast sync bloom             items=9916656 errorrate=0.000 elapsed=32.006s
INFO [10-02|18:21:19.189] Initializing fast sync bloom             items=11941456 errorrate=0.000 elapsed=40.006s
INFO [10-02|18:21:21.633] Looking for peers                        peercount=0 tried=45 static=0
INFO [10-02|18:21:27.189] Initializing fast sync bloom             items=16857822 errorrate=0.000 elapsed=48.007s
INFO [10-02|18:21:31.651] Looking for peers                        peercount=0 tried=46 static=0
INFO [10-02|18:21:35.190] Initializing fast sync bloom             items=22263406 errorrate=0.000 elapsed=56.007s
INFO [10-02|18:21:42.279] Initialized fast sync bloom              items=27006042 errorrate=0.000 elapsed=1m3.096s
INFO [10-02|18:21:43.070] Looking for peers                        peercount=0 tried=21 static=0
INFO [10-02|18:21:55.003] Looking for peers                        peercount=0 tried=33 static=0
INFO [10-02|18:22:05.072] Looking for peers                        peercount=0 tried=57 static=0
INFO [10-02|18:22:15.396] Looking for peers                        peercount=0 tried=55 static=0
INFO [10-02|18:22:20.213] Block synchronisation started
INFO [10-02|18:22:21.463] Disabling direct-ancient mode            origin=3505523 ancient=3473407
INFO [10-02|18:22:21.675] Imported new block headers               count=29 elapsed=23.097ms number=3505588 hash="06b6d6…4f4ce6" ignored=36
INFO [10-02|18:22:21.754] Downloader queue stats                   receiptTasks=9 blockTasks=8 itemSize=362.72B throttle=8192
INFO [10-02|18:22:21.858] Imported new state entries               count=17 elapsed="144.864µs" processed=27020352 pending=77 trieretry=0 coderetry=0 duplicate=0 unexpected=0
INFO [10-02|18:22:23.566] Imported new state entries               count=385 elapsed="723.857µs" processed=27020737 pending=345 trieretry=0 coderetry=0 duplicate=0 unexpected=0
INFO [10-02|18:22:24.619] Imported new state entries               count=245 elapsed=1.716ms     processed=27020982 pending=0   trieretry=0 coderetry=0 duplicate=0 unexpected=0
INFO [10-02|18:22:24.620] Imported new block receipts              count=0   elapsed="133.553µs" number=3505524 hash="c81874…009564" age=16m5s   size=0.00B     ignored=1
INFO [10-02|18:22:24.623] Committed new head block                 number=3505524 hash="c81874…009564"
INFO [10-02|18:22:24.717] Deallocated fast sync bloom              items=27006673 errorrate=0.000
ERROR[10-02|18:22:24.877] Please file an issue, skip known block execution without receipt hash="9a74e8…c4e301" number=3505558
ERROR[10-02|18:22:24.878] Please file an issue, skip known block execution without receipt hash="13e0ab…8f844b" number=3505559
INFO [10-02|18:22:25.143] Imported new chain segment               blocks=63 txs=228 mgas=48.137 elapsed=425.197ms   mgasps=113.211 number=3505588 hash="06b6d6…4f4ce6" dirty=3.27MiB
INFO [10-02|18:22:25.143] Fast sync complete, auto disabling
INFO [10-02|18:22:25.145] Unindexed transactions                   blocks=29 txs=108 tail=3505588 elapsed="491.757µs"
INFO [10-02|18:22:25.500] Looking for peers                        peercount=1 tried=64 static=0
INFO [10-02|18:22:35.754] Looking for peers                        peercount=1 tried=47 static=0
INFO [10-02|18:22:45.765] Looking for peers                        peercount=1 tried=47 static=0
INFO [10-02|18:22:55.800] Looking for peers                        peercount=1 tried=37 static=0
INFO [10-02|18:23:05.823] Looking for peers                        peercount=1 tried=54 static=0
^CINFO [10-02|18:23:12.818] Got interrupt, shutting down...
INFO [10-02|18:23:12.820] IPC endpoint closed                      url=/home/user/.ethereum/goerli/geth.ipc
INFO [10-02|18:23:12.820] Ethereum protocol stopped
INFO [10-02|18:23:12.820] Transaction pool stopped
INFO [10-02|18:23:12.820] Writing cached state to disk             block=3505588 hash="06b6d6…4f4ce6" root="44c184…c14fcc"
INFO [10-02|18:23:12.872] Persisted trie from memory database      nodes=4890 size=1.50MiB   time=51.982966ms gcnodes=0 gcsize=0.00B gctime=0s livenodes=2824 livesize=1.05MiB
INFO [10-02|18:23:12.872] Writing cached state to disk             block=3505587 hash="29bdf8…397c69" root="6ed00a…45d2f4"
INFO [10-02|18:23:12.873] Persisted trie from memory database      nodes=52   size=19.55KiB  time="435.613µs" gcnodes=0 gcsize=0.00B gctime=0s livenodes=2772 livesize=1.03MiB
INFO [10-02|18:23:12.873] Writing cached state to disk             block=3505461 hash="e8a7eb…0dbd13" root="71aeff…ed3028"
INFO [10-02|18:23:12.873] Persisted trie from memory database      nodes=0    size=0.00B     time="4.336µs"   gcnodes=0 gcsize=0.00B gctime=0s livenodes=2772 livesize=1.03MiB
INFO [10-02|18:23:12.879] Writing clean trie cache to disk         path=/home/user/.ethereum/goerli/geth/triecache threads=6
INFO [10-02|18:23:12.914] Persisted the clean trie cache           path=/home/user/.ethereum/goerli/geth/triecache elapsed=34.907ms
INFO [10-02|18:23:12.915] Blockchain stopped
[user@work go-ethereum]$

It required only 647 state entries to heal to the latest state.
Note: this is an extreme sunshine-path, since my corrupted state had not had any time to become stale. If I had let a couple of
days pass, the situation would have been different indeed.

@holiman holiman changed the title Verify state Check + repair state on corrupted database Oct 2, 2020
@holiman holiman marked this pull request as ready for review October 2, 2020 16:29
@holiman
Copy link
Contributor Author

holiman commented Oct 2, 2020

ERROR[10-02|18:22:24.877] Please file an issue, skip known block execution without receipt hash="9a74e8…c4e301" number=3505558
ERROR[10-02|18:22:24.878] Please file an issue, skip known block execution without receipt hash="13e0ab…8f844b" number=3505559

Oops -- I guess there are still some rough edges to even out

@holiman
Copy link
Contributor Author

holiman commented Oct 5, 2020

I've been considering

ERROR[10-02|18:22:24.877] Please file an issue, skip known block execution without receipt hash="9a74e8…c4e301" number=3505558

What happened here was,

  1. I synced to block N (3505559). Then exited, which saved N, N-1 and N-128 (?).
  2. Then corrupted the db for N.
  3. Then deleted upwards along the path, so the stateroot for N got removed.
  4. Then did a fast-heal, which pivoted at block 3505524, which is N-35.
  5. When importing blocks after the pivot, I hit both 3505558 and 3505559 again. These blocks were already fully imported, and the importer yielded ErrKnownBlock -- which means that we fully had the block and state.
    • This is very odd, because it indicates that the state for N was not successfully deleted. I should have only gotten that for block N-1, not N.

In most cases, this would not happen -- since in a less synthetic setting, the on-disk state would be stale, and not ahead of the repair pivot-block. In this case, it's harmless.

However, it might be nice to not only check the latest block, but also check the last ~128 blocks or so. After finding some bad paths on state N, check the same paths on the older roots, and see if they hit the same nodes. And then remove those aswell.

Of course there may still be corrupted states that are even older, but if one wants a fully pristine state, a total wipe + resync is needed.

@holiman
Copy link
Contributor Author

holiman commented Oct 5, 2020

Did another test where I had a sync from Friday, then on monday morning made it corrupt and heal-synced it.
It required 493K states to heal, (total 27.5M nodes processed).
Block went from 3505588 to 3520751, so ~15K blocks.

@noprom
Copy link

noprom commented Feb 7, 2021

geth repairtire --datadir /data takes more than 14hours, and it haven't completed yet.
Some logs:

INFO [02-07|23:52:00.143] Verifying storage trie                   elapsed=14h38m28.352392114s nodes=188746475 accounts=12042217 slots=56382150 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b09020d0b0906
INFO [02-07|23:52:08.143] Verifying storage trie                   elapsed=14h38m36.352767309s nodes=188769037 accounts=12042217 slots=56391765 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b090a0f0609
INFO [02-07|23:52:16.143] Verifying storage trie                   elapsed=14h38m44.353043499s nodes=188791778 accounts=12042217 slots=56401455 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0a0302020a
INFO [02-07|23:52:24.143] Verifying storage trie                   elapsed=14h38m52.353116048s nodes=188814510 accounts=12042217 slots=56411156 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0a0b08000f
INFO [02-07|23:52:32.144] Verifying storage trie                   elapsed=14h39m0.353244479s  nodes=188837171 accounts=12042217 slots=56420800 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0b030a0b
INFO [02-07|23:52:40.144] Verifying storage trie                   elapsed=14h39m8.353601364s  nodes=188859966 accounts=12042217 slots=56430511 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0b0c000d
INFO [02-07|23:52:48.144] Verifying storage trie                   elapsed=14h39m16.353693372s nodes=188882706 accounts=12042217 slots=56440188 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0c04040c07
INFO [02-07|23:52:56.145] Verifying storage trie                   elapsed=14h39m24.354208763s nodes=188905364 accounts=12042217 slots=56449830 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0c0c090601
INFO [02-07|23:53:04.145] Verifying storage trie                   elapsed=14h39m32.354555963s nodes=188927904 accounts=12042217 slots=56459442 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0d040d0e06
INFO [02-07|23:53:12.145] Verifying storage trie                   elapsed=14h39m40.354819053s nodes=188950641 accounts=12042217 slots=56469122 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0d0d010f0a08
INFO [02-07|23:53:20.145] Verifying storage trie                   elapsed=14h39m48.355001168s nodes=188973406 accounts=12042217 slots=56478831 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0e05060a03
INFO [02-07|23:53:28.146] Verifying storage trie                   elapsed=14h39m56.355451029s nodes=188996011 accounts=12042217 slots=56488464 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0e0d090f
INFO [02-07|23:53:36.147] Verifying storage trie                   elapsed=14h40m4.356204735s  nodes=189018733 accounts=12042217 slots=56498134 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0f050f060b07
INFO [02-07|23:53:44.147] Verifying storage trie                   elapsed=14h40m12.356804778s nodes=189041409 accounts=12042217 slots=56507797 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0b0f0e05010e
INFO [02-07|23:53:52.148] Verifying storage trie                   elapsed=14h40m20.357361536s nodes=189064166 accounts=12042217 slots=56517479 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0c0006090801
INFO [02-07|23:54:00.148] Verifying storage trie                   elapsed=14h40m28.358056905s nodes=189086994 accounts=12042217 slots=56527197 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0c000e0d0d0002
INFO [02-07|23:54:08.149] Verifying storage trie                   elapsed=14h40m36.358570852s nodes=189109670 accounts=12042217 slots=56536844 codes=341898 lastAccount=0x1a4fa162e70315921486693f1d5943b7704232081b39206774caa567d63f633f path=0x0c0107020e01

How long does it take to repair the tire data? I start a fast-sync node from scratch, and want to access some history block state, but receive the errors like this:

missing trie node d568540a6966bdcab54d73d4bdb8ae67f85eb6f15a4f5f5c10000f0396c37e49 (path )"

So I found this PR and try it, but it haven't finished yet, is that normal?

@ChihChengLiang
Copy link

Hi Martin, I have a log of failing repair. Hope this helps the PR.

$ geth version
Geth
Version: 1.10.2-unstable
Git Commit: 4bd4106dfb850873873f62179baa6f82bf939b7c
Git Commit Date: 20210407
Architecture: amd64
Go Version: go1.15.4
Operating System: linux
GOPATH=/home/liangcc/go
GOROOT=/home/liangcc/.go

$ geth repairtrie
INFO [04-15|14:51:00.916] Maximum peer count                       ETH=50 LES=0 total=50
INFO [04-15|14:51:00.916] Smartcard socket not found, disabling    err="stat /run/pcscd/pcscd.comm: no such file or directory"
INFO [04-15|14:51:00.916] Set global gas cap                       cap=25000000
INFO [04-15|14:51:00.916] Allocated cache and file handles         database=/home/liangcc/.ethereum/geth/chaindata cache=512.00MiB handles=524288
INFO [04-15|14:51:01.245] Opened ancient database                  database=/home/liangcc/.ethereum/geth/chaindata/ancient readonly=false
INFO [04-15|14:51:01.246] Disk storage enabled for ethash caches   dir=/home/liangcc/.ethereum/geth/ethash count=3
INFO [04-15|14:51:01.246] Disk storage enabled for ethash DAGs     dir=/home/liangcc/.ethash               count=2
INFO [04-15|14:51:01.246] Loaded most recent local header          number=12243114 hash="d0956f…3fe488" td=23310484264173868737539 age=5m2s
INFO [04-15|14:51:01.246] Loaded most recent local full block      number=0        hash="d4e567…cb8fa3" td=17179869184             age=52y1w5d
INFO [04-15|14:51:01.246] Loaded most recent local fast block      number=12243042 hash="099766…6e634c" td=23309999106476756878991 age=19m
INFO [04-15|14:51:01.246] Loaded last fast-sync pivot marker       number=12243043
WARN [04-15|14:51:01.247] Enabling snapshot recovery               chainhead=0 diskbase=12240417
WARN [04-15|14:51:01.251] Snapshot is not continuous with chain    snaproot="16015b…282f26" chainroot="d7f897…0f0544"
INFO [04-15|14:51:01.251] State root                               blockhash="d4e567…cb8fa3" state root="d7f897…0f0544"
INFO [04-15|14:51:01.251] Starting verification procedure 
INFO [04-15|14:51:01.251] Verifying state trie                     elapsed="10.39µs" nodes=2 accounts=0 slots=0 codes=0 lastAccount=0x path=0x
INFO [04-15|14:51:01.251] Resuming state snapshot generation       root="04e562…1565ce" in="0042a5…807326" at="71a3ed…ae4176" accounts=128691 slots=1593095 storage=118.56MiB elapsed="479.407µs"
ERROR[04-15|14:51:01.251] Generator failed to iterate account trie root="04e562…1565ce" err="missing trie node 6416aa27a71d9dbe1e3679d41991dd0fc54fd545a0ae34874a54a2c3fe3e4585 (path 000004)"
INFO [04-15|14:51:02.023] Verified state trie                      elapsed=772.112798ms nodes=42499 accounts=8893   slots=0       codes=0 lastAccount=0xfffbd1e64a6554703c53cb7ab942bbf611cd44949ffb1fcec7a635054dbb39be path=0x0f0f0f0b0d010e06040a060505040700030c05030c0b070a0b0904020b0b0f0601010c0d04040904090f0f0b010f0c0e0c070a0603050005040d0b0b03090b0e10

@petejkim
Copy link

What's the status of this?

@Febl
Copy link

Febl commented Nov 8, 2021

Hello
I have pulled the master from the main branch. I forked the master from this repo and pulled verify_state. I then merged the main master back into the "local" master before merging it all back into my forked verify_state. From there I built the 1.10.12 unstable version with repairtrie which I promptly ran to attempt to fix one of these canonical block issue that started after an out of memory error I had 2 weeks ago.

So far, I ran this for 36.5 h straight and no fix were offered but I guess my block issue is at the end of the db and unfortunately I 'Ctrl+C'ed the process by mistake (multiple pc with 1 switch keyboard kind). At this stage, I do not understand the difference with doing a full resynch.
What did I miss? Isn't there an option to point straight to a range or the parent block which has the problem ?
Apologies for my ignorance on these matters.

Thank you
Kind Regards
F.

PS: actually, with a resynch, I could just continue where I left off while the repair start back at square 1.

@holiman
Copy link
Contributor Author

holiman commented Feb 3, 2023

This PR will be moot when we transition to path-based trie storage, closing

@holiman holiman closed this Feb 3, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants