
[core] Race condition on diffLayer #22540

Merged
merged 1 commit into from
Apr 6, 2021

Conversation

fxfactorial
Contributor

I encountered this race condition: difflayer.go:223 writes the field (dl.origin = origin), while .Storage reads dl.origin concurrently at return dl.origin.Storage(accountHash, storageHash) in difflayer.go.

@holiman
Contributor

holiman commented Mar 21, 2021

I encountered this race condition

Could you provide some more info on how you encountered it? Do you have a stack trace? Was it during a test?

Your change does two things: turning an RLock into a Lock, and changing the scope. I don't see the need for either of them, since 1) both storage and Storage internally obtain the RLock, and 2) the internals are not modified, so a read lock should suffice.

So any more info about how you encountered this would likely clear this up for me.

@fxfactorial
Contributor Author

@holiman Sorry - was late and I was too brief.

I found it the usual way, with -race turned on: one thread hit a call to rebloom https://github.com/ethereum/go-ethereum/blob/70a8d2cbacae6378e0da73097035c27a8114672f/core/state/snapshot/difflayer.go#L223 which sets the .origin field, while another thread hit the call to .Storage, which reads the .origin field.

So

  1. the lock doesn't protect the usage of the field .origin
  2. a read lock isn't enough since rebloom will set it

I'll try to look through my tmux history for the -race stack traces.
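The failure mode can be reproduced in miniature: a writer takes the full lock to swap the pointer (as rebloom does), and a reader copies the pointer out while still holding the read lock, dereferencing only the local copy afterwards. This is a hedged sketch with hypothetical names (layer, setOrigin, read), not the real snapshot types; run it with go run -race to confirm it is clean:

```go
package main

import (
	"fmt"
	"sync"
)

// layer is a hypothetical stand-in for diffLayer: origin is read on the
// query path and rewritten by rebloom, so it must only be touched under lock.
type layer struct {
	lock   sync.RWMutex
	origin *string
}

// setOrigin mimics rebloom: it takes the write lock before mutating origin.
func (l *layer) setOrigin(o *string) {
	l.lock.Lock()
	defer l.lock.Unlock()
	l.origin = o
}

// read mimics the fixed Storage path: copy origin out while still holding
// the read lock, then dereference only the local copy after releasing it.
func (l *layer) read() string {
	l.lock.RLock()
	origin := l.origin // extract while holding the lock
	l.lock.RUnlock()

	if origin != nil {
		return *origin
	}
	return "none"
}

func main() {
	l := &layer{}
	disk := "disk"

	// Hammer the reader and writer concurrently; under `go run -race`
	// this pattern produces no race report.
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			l.setOrigin(&disk)
			_ = l.read()
		}()
	}
	wg.Wait()
	fmt.Println(l.read()) // prints "disk"
}
```

Reading dl.origin directly after RUnlock, by contrast, is exactly the pattern -race flags: the dereference is no longer ordered against the write in rebloom.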

@karalabe
Member

The PR is definitely problematic because it serializes reads in the snapshots, and even keeps the lock held across disk access.

If the underlying issue is the .origin, we can work around that by extracting it while still in the read lock:

	// Check the bloom filter first whether there's even a point in reaching into
	// all the maps in all the layers below
	dl.lock.RLock()
	hit := dl.diffed.Contains(storageBloomHasher{accountHash, storageHash})
	if !hit {
		hit = dl.diffed.Contains(destructBloomHasher(accountHash))
	}
	var origin *diskLayer
	if !hit {
		origin = dl.origin // extract origin while holding the lock
	}
	dl.lock.RUnlock()

	// If the bloom filter misses, don't even bother with traversing the memory
	// diff layers, reach straight into the bottom persistent disk layer
	if origin != nil {
		snapshotBloomStorageMissMeter.Mark(1)
		return origin.Storage(accountHash, storageHash)
	}
	// The bloom filter hit, start poking in the internal maps
	return dl.storage(accountHash, storageHash, 0)

Would this solve the issue @fxfactorial?

@karalabe
Member

karalabe commented Mar 22, 2021

Though I guess we'd need to look through the code now, because account and whatnot accessors will use the same patterns as the faulty storage above.

We definitely need the same fix in AccountRLP too.
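Transplanting the same extract-under-RLock pattern onto an AccountRLP-style accessor might look roughly like the following. This is a simplified stand-in, not the real go-ethereum code: a map replaces the bloom filter, and byte strings replace real account RLP:

```go
package main

import (
	"fmt"
	"sync"
)

// diskLayer is an illustrative stub for the bottom persistent layer.
type diskLayer struct{}

func (dl *diskLayer) AccountRLP(hash string) ([]byte, error) {
	return []byte("disk:" + hash), nil
}

// diffLayer is a stub for the in-memory diff layer; blooms stands in
// for dl.diffed.
type diffLayer struct {
	lock   sync.RWMutex
	origin *diskLayer
	blooms map[string]bool
}

// AccountRLP applies the fix proposed for Storage: on a bloom miss, copy
// origin out before releasing the read lock, so a concurrent rebloom
// cannot race with the dereference.
func (dl *diffLayer) AccountRLP(hash string) ([]byte, error) {
	dl.lock.RLock()
	hit := dl.blooms[hash]
	var origin *diskLayer
	if !hit {
		origin = dl.origin // extract origin while holding the lock
	}
	dl.lock.RUnlock()

	// Bloom miss: go straight to the persistent disk layer.
	if origin != nil {
		return origin.AccountRLP(hash)
	}
	// Bloom hit: resolve from the in-memory diff layers.
	return []byte("diff:" + hash), nil
}

func main() {
	dl := &diffLayer{origin: &diskLayer{}, blooms: map[string]bool{"a": true}}
	hitVal, _ := dl.AccountRLP("a")
	missVal, _ := dl.AccountRLP("b")
	fmt.Println(string(hitVal), string(missVal)) // prints "diff:a disk:b"
}
```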

@karalabe
Member

I think that would suffice. There are 1-2 more accesses into .origin via parent.origin paths, but those are snapshot mutation operations, and I think we only ever mutate serialized.

@fxfactorial
Contributor Author

> The PR is definitely problematic because it serializes reads in the snapshots [...] If the underlying issue is the .origin, we can work around that by extracting it while still in the read lock [...] Would this solve the issue @fxfactorial?

yes - i think so - I can clean up the other spots if you like as well (lmk where to look, i see you mentioned some places)

@holiman
Contributor

holiman commented Mar 29, 2021

@fxfactorial Do you want to fix this? Would be nice to get it merged.
I think you can leave the other spots out, unless you find something that looks suspicious to you.

@fxfactorial
Contributor Author

@holiman force pushed; covered the AccountRLP method as well.

Contributor

@holiman holiman left a comment


LGTM, but please remove the iterate.sh, probably an accidental addition :)

Contributor

@holiman holiman left a comment


LGTM, thanks!

@fjl
Contributor

fjl commented Mar 30, 2021

@karalabe Please merge if you think this fix is OK.

@fxfactorial
Copy link
Contributor Author

@karalabe ping - anything else for merge?

Member

@karalabe karalabe left a comment


SGTM

@karalabe karalabe added this to the 1.10.2 milestone Apr 6, 2021
@karalabe karalabe merged commit c79fc20 into ethereum:master Apr 6, 2021
@fxfactorial fxfactorial deleted the snapshot-race branch April 6, 2021 11:45
atif-konasl pushed a commit to frozeman/pandora-execution-engine that referenced this pull request Oct 15, 2021
tony-ricciardi pushed a commit to tony-ricciardi/go-ethereum that referenced this pull request Jan 20, 2022
Cherry pick bug fixes from upstream for snapshots, which will enable higher transaction throughput. It also enables snapshots by default (which is one of the commits pulled from upstream).

Upstream commits included:

68754f3 cmd/utils: grant snapshot cache to trie if disabled (ethereum#21416)
3ee91b9 core/state/snapshot: reduce disk layer depth during generation
a15d71a core/state/snapshot: stop generator if it hits missing trie nodes (ethereum#21649)
43c278c core/state: disable snapshot iteration if it's not fully constructed (ethereum#21682)
b63e3c3 core: improve snapshot journal recovery (ethereum#21594)
e640267 core/state/snapshot: fix journal recovery from generating old journal (ethereum#21775)
7b7b327 core/state/snapshot: update generator marker in sync with flushes
167ff56 core/state/snapshot: gethring -> gathering typo (ethereum#22104)
d2e1b17 snapshot, trie: fixed typos, mostly in snapshot pkg (ethereum#22133)
c4deebb core/state/snapshot: add generation logs to storage too
5e9f5ca core/state/snapshot: write snapshot generator in batch (ethereum#22163)
18145ad core/state: maintain one more diff layer (ethereum#21730)
04a7226 snapshot: merge loops for better performance (ethereum#22160)
994cdc6 cmd/utils: enable snapshots by default
9ec3329 core/state/snapshot: ensure Cap retains a min number of layers
52e5c38 core/state: copy the snap when copying the state (ethereum#22340)
a31f6d5 core/state/snapshot: fix panic on missing parent
61ff3e8 core/state/snapshot, ethdb: track deletions more accurately (ethereum#22582)
c79fc20 core/state/snapshot: fix data race in diff layer (ethereum#22540)

Other changes
Commit f9b5530 (not from upstream) fixes an incorrect default DatabaseCache value due to an earlier bad merge.

Tested

  - Automated tests
  - Testing on a private testnet

Backwards compatibility

Enabling snapshots by default is a breaking change in terms of the CLI flags, but will not cause backwards incompatibility between the node and other nodes.

Co-authored-by: Péter Szilágyi <[email protected]>
Co-authored-by: gary rong <[email protected]>
Co-authored-by: Melvin Junhee Woo <[email protected]>
Co-authored-by: Martin Holst Swende <[email protected]>
Co-authored-by: Edgar Aroutiounian <[email protected]>