Check + repair state on corrupted database #21650
First run, where I instrumented the trie thus:

--- a/trie/trie.go
+++ b/trie/trie.go
@@ -482,6 +482,10 @@ func (t *Trie) resolve(n node, prefix []byte) (node, error) {
 func (t *Trie) resolveHash(n hashNode, prefix []byte) (node, error) {
 	hash := common.BytesToHash(n)
+	if bytes.Equal(prefix, []byte("\x00\x0e\x06\x0e\x09")){
+		return nil, &MissingNodeError{NodeHash: hash, Path: prefix}
+	}
+
 	if node := t.db.node(hash); node != nil {
 		return node, nil
 	}

The hack is there in order to trigger a missing node error at that path. Running it triggers a fault inside a storage trie.
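For anyone who wants to play with the same fault-injection trick outside of geth, here is a minimal standalone sketch. It does not touch the real trie package: `toyDB`, `faultyResolver` and the simplified `MissingNodeError` are hypothetical stand-ins, and the only thing carried over from the diff is the idea of returning a missing-node error whenever the lookup path equals a chosen prefix.

```go
package main

import (
	"bytes"
	"errors"
	"fmt"
)

// MissingNodeError is a simplified stand-in for the error type in the diff:
// it records the hash and path of the node that could not be loaded.
type MissingNodeError struct {
	NodeHash [32]byte
	Path     []byte
}

func (e *MissingNodeError) Error() string {
	return fmt.Sprintf("missing trie node %x (path %x)", e.NodeHash, e.Path)
}

// toyDB is a hypothetical in-memory node store keyed by node hash.
type toyDB map[[32]byte][]byte

// faultyResolver looks nodes up in db, but pretends that the node at
// failPath is absent even if it is actually stored.
type faultyResolver struct {
	db       toyDB
	failPath []byte
}

// resolveHash returns the stored blob, or a MissingNodeError if the node is
// absent or the lookup path matches the injected failure path.
func (r *faultyResolver) resolveHash(hash [32]byte, prefix []byte) ([]byte, error) {
	if r.failPath != nil && bytes.Equal(prefix, r.failPath) {
		return nil, &MissingNodeError{NodeHash: hash, Path: prefix}
	}
	if blob, ok := r.db[hash]; ok {
		return blob, nil
	}
	return nil, &MissingNodeError{NodeHash: hash, Path: prefix}
}

func main() {
	h := [32]byte{0x01}
	bad := []byte("\x00\x0e\x06\x0e\x09")
	r := &faultyResolver{db: toyDB{h: []byte("node")}, failPath: bad}

	// Same node, two different paths: only the injected path fails.
	_, err := r.resolveHash(h, bad)
	var missing *MissingNodeError
	fmt.Println("injected path is reported missing:", errors.As(err, &missing))

	_, err = r.resolveHash(h, []byte("\x00\x0e"))
	fmt.Println("other path error:", err)
}
```

Keying the failure on the path rather than on the hash keeps the underlying data intact, so removing the hack restores normal behaviour without touching the database.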
For the second run, I removed the trie-hack, and ran:
It required only
Oops -- I guess there are still some rough edges to even out.
I've been considering
What happened here was,
In most cases, this would not happen -- in a less synthetic setting, the on-disk state would be stale, not ahead of the repair pivot block. In this case, it's harmless. However, it might be nice to check not only the latest block, but also the last ~128 blocks or so.
After finding some bad paths on state
Of course there may still be corrupted states that are even older, but if one wants a fully pristine state, a total wipe + resync is needed.
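The "also check the last ~128 blocks" idea could look roughly like the sketch below. This is only an illustration, not code from the PR: `checkRecent`, `rootOf` and `check` are hypothetical names, state roots are plain strings, and the actual completeness check is passed in as a callback.

```go
package main

import "fmt"

// checkRecent runs check on the state root of block head and of up to
// window-1 blocks before it, newest first. It returns the numbers of the
// blocks whose state was reported incomplete.
//
// rootOf and check are hypothetical callbacks: rootOf maps a block number to
// its state root, check reports whether the state under that root is complete.
func checkRecent(head, window uint64, rootOf func(uint64) string, check func(root string) bool) []uint64 {
	var bad []uint64
	// i <= head keeps head-i from underflowing on short chains.
	for i := uint64(0); i < window && i <= head; i++ {
		num := head - i
		if !check(rootOf(num)) {
			bad = append(bad, num)
		}
	}
	return bad
}

func main() {
	rootOf := func(n uint64) string { return fmt.Sprintf("root-%d", n) }
	// Pretend that only the state of block 950 has a hole in it.
	check := func(root string) bool { return root != "root-950" }

	fmt.Println("blocks with incomplete state:", checkRecent(1000, 128, rootOf, check))
}
```

Anything older than the window stays unchecked, which matches the caveat above that a fully pristine state still needs a wipe + resync.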
Did another test where I had a sync from Friday, then on Monday morning made it corrupt and heal-synced it.
aa5a38c to fbd4c3d
How long does it take to repair the trie data? I started a fast-sync node from scratch and want to access the state of some historical blocks, but I receive errors like this:
So I found this PR and tried it, but it hasn't finished yet. Is that normal?
Hi Martin, I have a log of a failing repair. Hope this helps the PR.
What's the status of this?
Hello. So far, I ran this for 36.5 h straight and no fix was offered, but I guess my block issue is at the end of the db. Unfortunately, I hit Ctrl+C on the process by mistake (multiple PCs sharing one keyboard through a switch). At this stage, I do not understand the difference from doing a full resync. Thank you. PS: actually, with a resync I could just continue where I left off, while the repair starts back at square one.
This PR will be moot when we transition to path-based trie storage, closing.
This is a work in progress to heal the state.
If you are reading this, and have encountered a corrupt db (e.g. due to crash), then I'd be very glad if you give this PR a shot at repairing it. It may totally fail, but it would be highly valuable to get feedback from actual runs on actual problematic databases.
The idea is to run geth --repairtrie, which iterates the state and checks if everything is present. If it is not, it finds out which parent nodes are present, and nukes them from the database, whereupon regular fast-sync can be performed to heal the database (to a more recent state). Essentially partial-fast-sync without wiping the entire datadir.
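As a rough illustration of the check-and-nuke idea, here is a toy sketch in Go. It is not the code in this PR and does not use the real trie or database types: `hash`, `node`, `nodeDB`, `findBroken` and `repair` are hypothetical names, and the "trie" is just a map from a node hash to the hashes of its children.

```go
package main

import "fmt"

// hash is a hypothetical stand-in for a 32-byte node hash.
type hash string

// node only records which children it references.
type node struct {
	children []hash
}

// nodeDB is a toy node database: a node is "present" if its hash is a key.
type nodeDB map[hash]node

// findBroken walks the trie from root and returns every present node that
// lies on a path from the root down to a missing node.
func findBroken(db nodeDB, root hash) map[hash]bool {
	broken := make(map[hash]bool)
	var walk func(h hash, ancestors []hash)
	walk = func(h hash, ancestors []hash) {
		n, ok := db[h]
		if !ok {
			// h is missing: all of its (present) ancestors are suspect.
			for _, a := range ancestors {
				broken[a] = true
			}
			return
		}
		// Copy so sibling subtrees do not share a backing array.
		path := append(append([]hash{}, ancestors...), h)
		for _, c := range n.children {
			walk(c, path)
		}
	}
	walk(root, nil)
	return broken
}

// repair deletes the present ancestors of every missing node and returns how
// many nodes were removed. Intact subtries are left alone.
func repair(db nodeDB, root hash) int {
	broken := findBroken(db, root)
	for h := range broken {
		delete(db, h)
	}
	return len(broken)
}

func main() {
	// "a2" is referenced by "a" but absent from the database.
	db := nodeDB{
		"root": {children: []hash{"a", "b"}},
		"a":    {children: []hash{"a1", "a2"}},
		"a1":   {},
		"b":    {children: []hash{"b1"}},
		"b1":   {},
	}
	removed := repair(db, "root")
	fmt.Println("nodes removed:", removed)
	fmt.Println("nodes kept:", len(db))
}
```

The reason for deleting the ancestors, rather than just reporting the hole, is presumably that the sync treats a node already on disk as a sign that its whole subtrie is there; with the ancestors gone, it has to walk down to the missing node again while skipping the subtries that were left intact.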
Current status:
Some TODOs