
core/state/snapshot: reuse memory data instead of hitting disk when generating #22667

Merged 4 commits into ethereum:master on Apr 23, 2021

Conversation

@holiman (Contributor) commented Apr 14, 2021

This PR is a successor to rjl493456442#6 (some charts can be found there).

This PR attempts to minimize the number of disk reads during snapshot generation.

When we have a slice of snapshot values which do not match the trie data, we currently iterate the disk to retrieve the canonical data.

What this PR does is commit the (incorrect) snapshot data to a trie, which will be 99% correct. When iterating the trie, we then use the snapshot-trie database for resolving hashes. By doing so, we can read 99% of the leaves from the memory db instead of resolving them from disk.

It can also be implemented differently (this was my third variation). Other variants which also "work":

  • Use a differenceIterator to walk only the nodes which are on disk but not in memory (see the sketch after this list). The drawback with this (otherwise pretty clean) approach is that it needs two walks: one to find out what needs to be added from trie to snapshot, and another (in the reverse direction) to figure out what needs to be deleted. That could be solved by writing a new difference iterator which emits both "addition" and "deletion" events, if we prefer that approach.
  • Use a relay database which implements ethdb.KeyValueStore: shove all nodes into a memorydb, have the relaydb use the memdb as primary and the diskdb as secondary, and plug it into a new trie.Database as backend. It's pretty clunky, and it doesn't get to use any of the clean cache in the existing trie.Database.
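
For illustration, a hedged sketch of that first (rejected) differenceIterator variant, assuming go-ethereum's trie package and its NewDifferenceIterator helper (the function name diffWalk is made up here):

func diffWalk(memTrie, diskTrie *trie.Trie) {
    // Yields the nodes reachable in diskTrie but absent from memTrie.
    it, _ := trie.NewDifferenceIterator(memTrie.NodeIterator(nil), diskTrie.NodeIterator(nil))
    for it.Next(true) {
        if it.Leaf() {
            // Entries that would have to be added from trie to snapshot;
            // a second, opposite-direction walk would still be needed to
            // discover deletions.
            _ = it.LeafKey()
            _ = it.LeafBlob()
        }
    }
}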

The upside of this particular implementation is that most of the modifications are performed in the node iterator and do not touch the trie database.

This approach can probably be polished/optimized further if we retain the trie used by the prover instead of rebuilding it once again, but this PR focuses on the disk-access part, not so much on getting the CPU cycles down.
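
To make the resolution order concrete, here is a minimal, self-contained sketch (illustrative types, not the PR's actual code): the nodes of the rebuilt snapshot trie sit in memory, and hash resolution consults them before falling back to the disk database.

// nodeReader resolves a trie node by its hash.
type nodeReader interface {
    node(hash []byte) ([]byte, error)
}

// fallbackReader tries the in-memory nodes rebuilt from the (mostly
// correct) snapshot first, and only hits disk for the nodes that differ.
type fallbackReader struct {
    mem  map[string][]byte // nodes of the trie committed from snapshot data
    disk nodeReader        // the persistent trie database
}

func (r *fallbackReader) node(hash []byte) ([]byte, error) {
    if blob, ok := r.mem[string(hash)]; ok {
        return blob, nil // ~99% of resolutions are served from memory
    }
    return r.disk.node(hash) // the remaining ~1%: genuinely divergent nodes
}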

@karalabe (Member) commented

If I understand this PR correctly, you are adding a new cache layer into the iterator, so that while traversing the trie, if anything is already in memory (in your bastard-trie) we just read that, but otherwise treat it the same as if it were on disk already?

@holiman (Contributor, Author) commented Apr 19, 2021

That is correct. We just resolve from a primary source (which is fully in memory), and if that fails, resolve from the secondary source (the disk trie).

@karalabe (Member) left a comment

I've pushed a minor change on top, which we've discussed, to use ethdb.KeyValueStore instead of a trie.Database, since the former is easy to construct (and potentially useful elsewhere too), whereas the latter requires consistent key/value data and can't be fed just a random soup of trie nodes.

The PR is a bit more complicated than I'd like, but an essential detail is that all the simpler variations placed the memcache below the live trie.Database, which is problematic as we really don't want potentially non-persisted data to end up in the trie.Database clean caches.
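
As a hedged illustration of that change (the resolver hook and exact signatures may differ between go-ethereum versions, and the helper name newResolver is made up): a plain ethdb.KeyValueStore such as memorydb can be filled with an arbitrary bag of trie nodes keyed by hash, entirely separate from the live trie.Database.

import (
    "github.com/ethereum/go-ethereum/common"
    "github.com/ethereum/go-ethereum/ethdb/memorydb"
)

// newResolver dumps loose trie nodes (hash -> RLP blob) into a standalone
// in-memory key/value store. Nothing here needs to form a committed,
// consistent trie, and none of it can leak into the live clean cache.
func newResolver(nodes map[common.Hash][]byte) *memorydb.Database {
    db := memorydb.New()
    for hash, blob := range nodes {
        db.Put(hash.Bytes(), blob) // memorydb puts cannot fail in practice
    }
    return db
}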

@karalabe merged commit 49281ab into ethereum:master on Apr 23, 2021
@jon-chuang commented Apr 28, 2021

Why isn't generate concurrent, even though we are doing a lot of async IO?

From what I understand, the purpose of generating in ranges is to avoid building the trie for the entire state, which would OOM. But then how does one decide which trie nodes to delete when using the partially-built trie as a memcache? Is there a pruning mechanism that makes this easy?


Further, I am writing the following tool: an async trie resolver.

The model is that one submits keys to fetch to the asynctrieresolver. The resolver fetches the data asynchronously and stores the resolved nodes in a resolvedNodeCache map. Once a key is available, a channel signal for that key is fired, and a new trie can be built sequentially with assistance from the resolvedNodeCache map.
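
A minimal, self-contained sketch of that model, using only the standard sync package (all names here are illustrative, taken from the proposal rather than from go-ethereum):

type asyncResolver struct {
    mu       sync.Mutex
    resolved map[string][]byte        // resolvedNodeCache: key -> node blob
    ready    map[string]chan struct{} // closed once the key is available
    fetch    func(key string) []byte  // the slow, asynchronous IO
}

// submit schedules a key for background resolution (idempotent).
func (r *asyncResolver) submit(key string) {
    r.mu.Lock()
    if _, ok := r.ready[key]; ok {
        r.mu.Unlock()
        return
    }
    done := make(chan struct{})
    r.ready[key] = done
    r.mu.Unlock()
    go func() {
        blob := r.fetch(key) // IO happens off the sequential update thread
        r.mu.Lock()
        r.resolved[key] = blob
        r.mu.Unlock()
        close(done) // the channel signal for this key
    }()
}

// wait blocks until the key has been resolved, then returns the node.
func (r *asyncResolver) wait(key string) []byte {
    r.mu.Lock()
    done := r.ready[key]
    r.mu.Unlock()
    <-done
    r.mu.Lock()
    defer r.mu.Unlock()
    return r.resolved[key]
}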

Notice that since async IO is usually the bottleneck for trie operations, if one only requires eventual consistency we can use this AsyncTrie: one just calls trie.sync(), which blocks until all pending insert, delete and get operations have completed. The limitation is that calling sync frequently can lead to low performance.

Further, maintaining sequential consistency in the order in which the operations are called is also a difficulty (but this can be resolved with a sequential applyOperation queue).

The guarantees are: 1) thread safety on copy (regardless of whether copy calls trie.sync() first or simply copies the trie with updates yet unapplied), and 2) sequential consistency.

A programming example is as follows:

// Pseudocode for the proposed AsyncTrie API: Get returns a future (a
// channel carrying the eventual value), so lookups overlap with later
// mutations, and sync() blocks until all pending operations have applied.
func (t *AsyncTrie) example(..) common.Hash {
    for .. := range .. {
        t.Insert(..)
        t.Delete(..)
    }
    valueFuture := t.Get(..) // resolves in the background

    for .. := range .. {
        t.Insert(..)
        t.Delete(..)
    }

    value, err := <-valueFuture // block only when the value is needed
    if err == nil {
        changeValue(value)
        t.Insert(value, ..)
    }
    t.Insert(..)

    t.sync() // wait for all pending inserts/deletes/gets to complete
    return t.root
}

Whenever trie.sync() is called, the resolvedNodeCache can be cleared, since by the end of sync all resolved nodes have been applied to the trie and all node-resolver workers have terminated.

In the concurrent tryGet model (rather than the trie-building model), as is the case in snapshot generation, one can skip building the trie altogether (past a certain depth) and go straight to using a resolvedNodeCache. This has the advantage of not blocking on earlier tryGet operations, which are sequential.


Question: does this really work for Insert and Delete, given that they update the trie itself and change how it is walked?

Yes, since walking for hash resolution is stable under mutation.

A cached resolved node is "live" for two purposes: updating the in-memory trie, and caching intermediate paths for the async resolver workers that walk the trie.

The sequential update thread only queries for a hashNode if it does not yet live in memory. If it does not live in memory, then it did not lie along any path of mutation. Thus, whenever such a node is queried by the sequential updater, it is the correct, as-yet-unmodified version.

One can enforce this strictly by having only the updater thread mark an isRetrieved flag on the node's entry in the map; the AsyncTrie should panic if a node is retrieved more than once.
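
As a sketch of that enforcement (again with illustrative names from the proposal, not from go-ethereum):

type cachedNode struct {
    blob        []byte
    isRetrieved bool
}

// retrieve is called only from the sequential update thread.
func (t *AsyncTrie) retrieve(hash string) []byte {
    n := t.cache[hash] // cache: map[string]*cachedNode
    if n.isRetrieved {
        // A second retrieval means the node was already live along a
        // mutated path, so treating it as fatal surfaces logic bugs early.
        panic("async trie: node retrieved twice")
    }
    n.isRetrieved = true
    return n.blob
}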


In essence, one can replace the prefetcher entirely using this async approach. One just calls trie.Get(..) and then drops the future (via the GC, or manually if you want). The AsyncTrie builds in the background; then, when one wants to compute the root, one just calls trie.sync() and we are done.

This works best if the order in which operations are applied does not matter (unfortunately for us, most of the time it does).

@jon-chuang mentioned this pull request May 3, 2021
atif-konasl pushed a commit to frozeman/pandora-execution-engine that referenced this pull request Oct 15, 2021
core/state/snapshot: reuse memory data instead of hitting disk when generating (ethereum#22667)

* core/state/snapshot: reuse memory data instead of hitting disk when generating

* trie: minor nitpicks wrt the resolver optimization

* core/state/snapshot, trie: use key/value store for resolver

* trie: fix linter

Co-authored-by: Péter Szilágyi <[email protected]>