feat: Fast storage optimization for queries and iterations #468
… is deleted, fix all tests but random and with index
…not being cleared when latest version is saved
* fix data race related to VersionExists * use regular lock instead of RW in mutable_tree.go
2437420
to
ff9f32d
Compare
Updates:
|
Thanks for the update, and sorry for the delay; I was off last week. We will get to this ASAP.
Could you run the benchmarks with https://pkg.go.dev/golang.org/x/perf/cmd/benchstat ?
@robert-zaremba I will do a new one in the next few days, sorry for the delay. However, here's the latest one.
Hi, why was mtx changed from sync.RWMutex to sync.Mutex? I think sync.RWMutex should also work. @marbar3778 thanks
@lyh169 there was a benchmark suggesting that a normal Mutex was faster.
orphans map[string]int64 // Nodes removed by changes to working tree.
versions map[int64]bool // The previous, saved versions of the tree.
allRootLoaded bool // Whether all roots are loaded or not(by LazyLoadVersion)
unsavedFastNodeAdditions map[string]*FastNode // FastNodes that have not yet been saved to disk
Why don't unsavedFastNodeAdditions and unsavedFastNodeRemovals need mtx to protect them? Are these maps never accessed concurrently? @p0mvn
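For context on the locking questions above: Go maps are not safe for concurrent use when at least one goroutine writes, so a map shared across goroutines needs a lock. Whether these particular maps are ever shared is a question for the author. The following is only a minimal sketch of the general pattern; the type unsavedAdditions and its methods are hypothetical and are not the PR's code.

```go
package main

import (
	"fmt"
	"sync"
)

// unsavedAdditions is a hypothetical stand-in for a map such as
// unsavedFastNodeAdditions. It is not the type used in the PR.
type unsavedAdditions struct {
	mtx   sync.Mutex
	nodes map[string][]byte
}

func newUnsavedAdditions() *unsavedAdditions {
	return &unsavedAdditions{nodes: make(map[string][]byte)}
}

// Add records a pending value for key while holding the lock.
func (u *unsavedAdditions) Add(key string, value []byte) {
	u.mtx.Lock()
	defer u.mtx.Unlock()
	u.nodes[key] = value
}

// Len returns the number of pending entries while holding the lock.
func (u *unsavedAdditions) Len() int {
	u.mtx.Lock()
	defer u.mtx.Unlock()
	return len(u.nodes)
}

func main() {
	u := newUnsavedAdditions()
	var wg sync.WaitGroup
	for i := 0; i < 100; i++ {
		wg.Add(1)
		go func(i int) {
			defer wg.Done()
			u.Add(fmt.Sprintf("key-%03d", i), []byte("v"))
		}(i)
	}
	wg.Wait()
	fmt.Println(u.Len()) // 100 distinct keys were added
}
```

Running a program like this with the Go race detector (go run -race) is one way to check whether unguarded access would actually race.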
versions map[int64]bool // The previous, saved versions of the tree.
allRootLoaded bool // Whether all roots are loaded or not(by LazyLoadVersion)
unsavedFastNodeAdditions map[string]*FastNode // FastNodes that have not yet been saved to disk
unsavedFastNodeRemovals map[string]interface{} // FastNodes that have not yet been removed from disk
Hi, I reviewed the code, and the value in unsavedFastNodeRemovals is only a bool. So why not use map[string]struct{}? @p0mvn
that is a good optimization
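The suggestion relies on a common Go idiom: an empty struct occupies zero bytes, so map[string]struct{} behaves as a set that stores only its keys. A minimal sketch follows; the removals type and its methods are hypothetical names used for illustration only.

```go
package main

import "fmt"

// removals is a hypothetical set of keys pending removal.
// struct{} has zero size, so the map effectively stores only keys.
type removals map[string]struct{}

// Add marks key as pending removal.
func (r removals) Add(key string) {
	r[key] = struct{}{}
}

// Has reports whether key is pending removal.
func (r removals) Has(key string) bool {
	_, ok := r[key]
	return ok
}

func main() {
	r := removals{}
	r.Add("k1")
	fmt.Println(r.Has("k1"), r.Has("k2")) // true false
}
```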
## 0.17.3 (December 1, 2021)

### Improvements
Please remove the 3 lines above. We are targeting master, so #468 should go under Unreleased.
{"badgerdb", 1000, 100, 4, 10},
// {"cleveldb", 1000, 100, 4, 10},
// {"boltdb", 1000, 100, 4, 10},
// {"rocksdb", 1000, 100, 4, 10},
could you enable rocksdb benchmarks as well?
versions = 32 // number of versions to generate
versionOps = 4096 // number of operations (create/update/delete) per version
versions = 8 // number of versions to generate
versionOps = 1024 // number of operations (create/update/delete) per version
Why was this changed?
start, end []byte

valid bool

ascending bool

err error

ndb *nodeDB

nextFastNode *FastNode

fastIterator dbm.Iterator
start, end []byte
valid bool
ascending bool
err error
ndb *nodeDB
nextFastNode *FastNode
fastIterator dbm.Iterator
start, end := iter.fastIterator.Domain()

if start != nil {
start = start[1:]
why do we remove the first byte? Could you add a doc comment?
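One plausible reading, given that the PR description says fast nodes are stored under a prefix f, is that the first byte of each database key is that storage prefix and stripping it recovers the logical key. The sketch below illustrates that assumption only; fastPrefix, toDBKey and fromDBKey are hypothetical names, not functions from the PR.

```go
package main

import "fmt"

// fastPrefix is assumed to be the one-byte prefix under which
// fast nodes are stored ('f' per the PR description).
const fastPrefix byte = 'f'

// toDBKey prepends the storage prefix to a logical key.
func toDBKey(key []byte) []byte {
	out := make([]byte, 0, len(key)+1)
	out = append(out, fastPrefix)
	return append(out, key...)
}

// fromDBKey strips the one-byte storage prefix. A nil or empty
// input is returned unchanged, so an unbounded domain stays unbounded.
func fromDBKey(dbKey []byte) []byte {
	if len(dbKey) == 0 {
		return dbKey
	}
	return dbKey[1:]
}

func main() {
	dbKey := toDBKey([]byte("account/42"))
	fmt.Println(string(dbKey))            // faccount/42
	fmt.Println(string(fromDBKey(dbKey))) // account/42
}
```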
iter.valid = iter.valid && iter.fastIterator.Valid()
if iter.valid {
iter.nextFastNode, iter.err = DeserializeFastNode(iter.fastIterator.Key()[1:], iter.fastIterator.Value())
iter.valid = iter.err == nil
Shouldn't we check iter.fastIterator.Valid() here as well?
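The concern behind this question is the usual iterator contract: Key() and Value() are only meaningful while Valid() returns true. A minimal sketch of that pattern follows, using a hypothetical kvIterator interface and an in-memory sliceIterator rather than the dbm.Iterator used in the PR.

```go
package main

import "fmt"

// kvIterator is a hypothetical, minimal iterator interface.
type kvIterator interface {
	Valid() bool
	Next()
	Key() []byte
	Value() []byte
}

// sliceIterator walks two parallel in-memory slices.
type sliceIterator struct {
	keys, values [][]byte
	pos          int
}

func (s *sliceIterator) Valid() bool   { return s.pos < len(s.keys) }
func (s *sliceIterator) Next()         { s.pos++ }
func (s *sliceIterator) Key() []byte   { return s.keys[s.pos] }
func (s *sliceIterator) Value() []byte { return s.values[s.pos] }

// collect reads every pair, checking Valid before each Key/Value access.
func collect(it kvIterator) []string {
	var out []string
	for ; it.Valid(); it.Next() {
		out = append(out, string(it.Key())+"="+string(it.Value()))
	}
	return out
}

func main() {
	it := &sliceIterator{
		keys:   [][]byte{[]byte("a"), []byte("b")},
		values: [][]byte{[]byte("1"), []byte("2")},
	}
	fmt.Println(collect(it)) // [a=1 b=2]
}
```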
Wow, I've just realized that I had not submitted my comments.
Background
Link to the original spec
Link to the original PR in Osmosis
Historically, IAVL has been very slow during state machine execution and when responding to queries against live state. This release speeds up these routines by an order of magnitude, relieving a large amount of pressure on all users of the IAVL database.
Details
This PR introduces an auxiliary fast storage system to IAVL, which keeps a copy of the latest state in a form much more amenable to efficient querying and iteration.
Prior to this PR, all data gets and iterations suffered two significant performance drawbacks:
All nodes were indexed by their Merkle tree inner node hash. This breaks data locality and turns every Get() that should be served from RAM or CPU caches into a random leveldb file open.
The fast storage nodes are instead indexed by the logical key on disk. This preserves data locality for the latest state, significantly improving iterations and queries (between 5x and 30x, depending on the particular benchmark). This implementation introduces a negligible overhead for writes.
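The locality argument can be illustrated with a small sketch. As a simplification, it orders entries by the SHA-256 of the key instead of by the real node hash, which is an assumption made purely for illustration; orderByKey and orderByHash are hypothetical helpers. Ordering by logical key keeps related keys adjacent, while ordering by hash scatters them.

```go
package main

import (
	"bytes"
	"crypto/sha256"
	"fmt"
	"sort"
)

// orderByKey returns a copy of keys sorted by logical key,
// approximating the on-disk order of a key-indexed store.
func orderByKey(keys []string) []string {
	out := append([]string(nil), keys...)
	sort.Strings(out)
	return out
}

// orderByHash returns a copy of keys sorted by the SHA-256 of each key,
// approximating the on-disk order of a hash-indexed store.
func orderByHash(keys []string) []string {
	out := append([]string(nil), keys...)
	sort.Slice(out, func(i, j int) bool {
		hi := sha256.Sum256([]byte(out[i]))
		hj := sha256.Sum256([]byte(out[j]))
		return bytes.Compare(hi[:], hj[:]) < 0
	})
	return out
}

func main() {
	keys := []string{"acc/3", "acc/1", "acc/2", "bank/1", "bank/2"}
	fmt.Println("by key: ", orderByKey(keys))
	fmt.Println("by hash:", orderByHash(keys))
}
```

A range scan over the acc/ keys touches one contiguous run in the first ordering, whereas in the second the same keys may be spread across the whole key space.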
Downgrade-re-upgrade protection
We introduced downgrade and re-upgrade protection, which guards against a downgrade of iavl followed by fast storage being enabled again. It works by storing metadata about the current storage version and the latest live state stored.
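The check described above can be sketched roughly as follows. The storageMeta record, the fastStorageVersion constant and needsMigration are all hypothetical; the actual metadata format and threshold used by the PR are not shown here, so this is only an illustration of the idea.

```go
package main

import "fmt"

// storageMeta is a hypothetical record of what the fast storage was built from.
type storageMeta struct {
	StorageVersion int   // layout version of the store
	LiveVersion    int64 // tree version the fast storage mirrored when last written
}

// fastStorageVersion is a hypothetical threshold: layouts below it
// have no fast storage at all.
const fastStorageVersion = 2

// needsMigration reports whether the fast storage must be (re)built.
func needsMigration(meta storageMeta, latestTreeVersion int64) bool {
	if meta.StorageVersion < fastStorageVersion {
		return true // fast storage was never enabled
	}
	// Fast storage exists but mirrors a different version, for example
	// after a downgrade wrote new versions without updating it.
	return meta.LiveVersion != latestTreeVersion
}

func main() {
	fmt.Println(needsMigration(storageMeta{StorageVersion: 1, LiveVersion: 10}, 10)) // true
	fmt.Println(needsMigration(storageMeta{StorageVersion: 2, LiveVersion: 10}, 10)) // false
	fmt.Println(needsMigration(storageMeta{StorageVersion: 2, LiveVersion: 10}, 12)) // true
}
```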
Summary of Changes
IAVL is divided into two trees, mutable_tree and immutable_tree. Sets only happen on the mutable tree.
Things that need to change and be investigated for getting and setting, and the fast node:
mutable tree
- GetVersioned
- Set
- Remove
- SaveVersion
- Iterate
- Iterator
- Get
- enableFastStorageAndCommit and its variations
- mstorage_version, where m is a new prefix. If the version is lower than the fastStorageVersionValue threshold, migration is triggered.
- LoadVersion, LazyLoadVersion
immutable_tree
- Get and GetWithIndex
- [...] Get to GetWithIndex. GetWithIndex always uses the default live state traversal strategy.
- [...] Get method. Get attempts to use the fast cache first. It only falls back to the regular tree traversal strategy if the fast cache is disabled or the tree is not of the latest version.
- Iterator
nodedb
fast_iterator
- [...] f, which stands for fast. All fast nodes are sorted on disk by key in ascending order, so we can simply traverse the disk, ensuring efficient hardware access.
unsaved_fast_iterator
testing
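The Get behaviour described in the summary (try the fast storage first, fall back to tree traversal when fast storage is disabled or the tree is not the latest version) can be sketched as below. The tree type and its fields are hypothetical simplifications, and treating a fast storage miss on the latest version as "key absent" is an assumption that the fast storage mirrors the latest state completely.

```go
package main

import "fmt"

// tree is a hypothetical, simplified stand-in for an immutable tree.
type tree struct {
	version       int64
	latestVersion int64
	fastEnabled   bool
	fastNodes     map[string][]byte // fast storage: logical key -> latest value
	data          map[string][]byte // stands in for the Merkle tree contents
	slowLookups   int               // counts fallbacks to traversal
}

// traverse stands in for the regular tree traversal strategy.
func (t *tree) traverse(key string) []byte {
	t.slowLookups++
	return t.data[key]
}

// Get uses the fast storage when it is enabled and the tree is the
// latest version; otherwise it falls back to traversal.
func (t *tree) Get(key string) []byte {
	if t.fastEnabled && t.version == t.latestVersion {
		// Assumption: fast storage holds the complete latest state,
		// so a miss here means the key does not exist.
		return t.fastNodes[key]
	}
	return t.traverse(key)
}

func main() {
	fast := map[string][]byte{"k": []byte("v5")}
	latest := &tree{version: 5, latestVersion: 5, fastEnabled: true,
		fastNodes: fast, data: map[string][]byte{"k": []byte("v5")}}
	old := &tree{version: 3, latestVersion: 5, fastEnabled: true,
		fastNodes: fast, data: map[string][]byte{"k": []byte("v3")}}

	v := latest.Get("k")
	fmt.Println(string(v), latest.slowLookups) // v5 0
	v = old.Get("k")
	fmt.Println(string(v), old.slowLookups) // v3 1
}
```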
Benchstat
Old Benchmark
Date: 2022-01-22 12:33 AM PST
Branch: dev/iavl_data_locality with some modifications to the bench tests
Latest Benchmark
Date: 2022-01-22 10:15 AM PST
Branch: roman/fast-node-get-set
Benchmarks Interpretation
Highlighting the difference in performance from the latest benchmarks:
Old branch is dev/iavl_data_locality
New branch is roman/fast-node-get-set
Initial size: 100,000 key-val pairs
Block size: 100 keys
Key length: 16 bytes
Value length: 40 bytes
Query with no guarantee of the key being in the tree:
22354 ns/op
18046 ns/op
4938 ns/op
Query with the key guaranteed to be in the latest tree:
27137 ns/op
23126 ns/op
1684 ns/op
Iteration:
2285116100 ns/op
1716585400 ns/op
94702442 ns/op
Update:
Run Set. If the try number is divisible by blockSize, attempt to SaveVersion and, if the history of saved versions exceeds 20, delete the oldest version.
307266 ns/op
257683 ns/op
Block:
Run Get and Set for blockSize keys. At the end of the block, SaveVersion and, if the history of saved versions exceeds 20, delete the oldest version.
40663600 ns/op
44907345 ns/op