storage/cmdq: create new specialized augmented interval btree #32165
Conversation
Force-pushed from 18b27e4 to d499efa
I need to give the interval-btree and overlap-scan code more thorough scrutiny. The benchmark improvements are very compelling.
Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 4 of 4 files at r6, 1 of 1 files at r7, 2 of 2 files at r8.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/cmdq/interval_btree.go, line 223 at r9 (raw file):
```go
func (n *node) find(c *cmd) (index int, found bool) {
	// Logic copied from sort.Search. Inlining this gave
	// an 11% speedup on BenchmarkBTreeDeleteInsert.
```
Interesting.
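For context, a minimal standalone sketch of the pattern under discussion: hand-inlining the body of `sort.Search` removes the `func(int) bool` closure call from every probe of the binary search. The `int` keys and the free-standing `find` function here are illustrative stand-ins for the real `cmd` comparison, not the PR's actual code.

```go
package main

import "fmt"

// find returns the index of the first key >= k and whether k was found.
// The loop is sort.Search's body inlined by hand, which avoids an
// indirect call through a func(int) bool closure on every probe.
func find(keys []int, k int) (index int, found bool) {
	i, j := 0, len(keys)
	for i < j {
		h := int(uint(i+j) >> 1) // midpoint, written to avoid overflow
		if keys[h] < k {
			i = h + 1
		} else {
			j = h
		}
	}
	return i, i < len(keys) && keys[i] == k
}

func main() {
	keys := []int{2, 4, 8, 16}
	i, ok := find(keys, 8)
	fmt.Println(i, ok) // 2 true
	i, ok = find(keys, 5)
	fmt.Println(i, ok) // 2 false
}
```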
pkg/util/interval/btree_based_interval.go, line 1101 at r7 (raw file):
```go
}

// Clear removes all items from the btree. If addNodesToFreelist is true,
```
s/Clear/ClearWithOpt/g
pkg/util/interval/btree_based_interval.go, line 1111 at r7 (raw file):
```go
// somewhat faster than creating a new tree to replace the old one, because
// nodes from the old tree are reclaimed into the freelist for use by the new
// one, instead of being lost to the garbage collector.
```
Nit: reflow this paragraph.
Force-pushed from d499efa to a6ea84f
I added one more commit that rips out the parent pointers and position fields in nodes. These were preventing the immutable structure from working correctly because they create cyclical references. Luckily it's not too hard to avoid them - we can just use a stack in our iterator to remember our position at higher levels of the tree.
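The stack idea can be sketched as follows. This is an illustrative toy, not the PR's actual types: instead of `parent`/`pos` fields on nodes (which create cyclic references and defeat structural sharing), the iterator remembers the path it took down from the root and pops it to climb back up.

```go
package main

import "fmt"

// node is a hypothetical stand-in for a btree node with no parent pointer.
type node struct {
	keys     []int
	children []*node // empty for leaf nodes
}

// frame records a node and the child position taken within it.
type frame struct {
	n   *node
	pos int
}

// iterator keeps the root-to-current path on its own stack.
type iterator struct {
	stack []frame
}

// descend pushes the current position and moves into child i.
func (it *iterator) descend(n *node, i int) *node {
	it.stack = append(it.stack, frame{n, i})
	return n.children[i]
}

// ascend pops back to the parent, restoring the saved position.
func (it *iterator) ascend() (*node, int) {
	f := it.stack[len(it.stack)-1]
	it.stack = it.stack[:len(it.stack)-1]
	return f.n, f.pos
}

func main() {
	leaf := &node{keys: []int{1, 2}}
	root := &node{keys: []int{3}, children: []*node{leaf}}
	var it iterator
	n := it.descend(root, 0)
	fmt.Println(n == leaf) // true
	p, pos := it.ascend()
	fmt.Println(p == root, pos) // true 0
}
```

In the real code the stack's backing storage can live in a small inline array, so no allocation is needed for trees of modest height.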
pkg/storage/cmdq/interval_btree.go, line 223 at r9 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Interesting.
Done.
pkg/util/interval/btree_based_interval.go, line 1101 at r7 (raw file):
Previously, petermattis (Peter Mattis) wrote…
s/Clear/ClearWithOpt/g
Done.
pkg/util/interval/btree_based_interval.go, line 1111 at r7 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: reflow this paragraph.
Done.
…policy

All commits from cockroachdb#32165 except the last one.

This change introduces O(1) btree cloning and a new copy-on-write scheme, essentially giving the btree an immutable API (for which I took inspiration from https://docs.rs/crate/im/). This is made efficient by the second part of the change - a new garbage collection policy for btrees. Nodes are now reference counted atomically and freed into global `sync.Pool`s when they are no longer referenced.

One of the main ideas in cockroachdb#31997 is to treat the btrees backing the command queue as immutable structures. In doing so, we adopt a copy-on-write scheme. Trees are cloned under lock and then accessed concurrently. When future writers want to modify the tree, they can do so by cloning any nodes that they touch.

This commit provides this functionality in a much more elegant manner than 6994347. Instead of giving each node a "copy-on-write context", we instead give each node a reference count. We then use the following rules:
1. Trees with exclusive ownership (refcount == 1) over a node can modify it in-place.
2. Trees without exclusive ownership over a node must clone the node in order to modify it. Once cloned, the tree will have exclusive ownership over that node. When cloning the node, the reference count of all of the node's children must be incremented.

In following these simple rules, we end up with a really nice property - trees gain more and more "ownership" as they make modifications, meaning that subsequent modifications are much less likely to need to clone nodes. Essentially, we transparently incorporate the idea of local mutations (e.g. Clojure's transients or Haskell's ST monad) without any external API needed.

Even better, reference counting internal nodes ties directly into the new GC policy, which allows us to recycle old nodes and make the copy-on-write scheme zero-allocation in almost all cases. When a node's reference count drops to 0, we simply toss it into a `sync.Pool`. We keep two separate pools - one for leaf nodes and one for non-leaf nodes. This wasn't possible with the previous "copy-on-write context" approach.

The atomic reference counting does have an effect on benchmarks, but it's not a big one (single/double digit ns) and is negligible compared to the speedup observed in cockroachdb#32165.

```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```

We can see how important the GC and memory re-use policy is by comparing the following few benchmarks. Specifically, notice the difference in operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime` between the tests that `Reset` old clones (allowing nodes to be freed into `sync.Pool`s) and the tests that don't `Reset` old clones.

```
name                                                      time/op
BTreeDeleteInsert/count=16-4                               198ns ±28%
BTreeDeleteInsert/count=128-4                              375ns ± 3%
BTreeDeleteInsert/count=1024-4                             577ns ± 2%
BTreeDeleteInsert/count=8192-4                             798ns ± 1%
BTreeDeleteInsert/count=65536-4                           2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4                      173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4                     379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4                    584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4                    800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4                  2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4    1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4    1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4    1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4   3.52µs ±18%

name                                                      alloc/op
BTreeDeleteInsert/count=16-4                               0.00B
BTreeDeleteInsert/count=128-4                              0.00B
BTreeDeleteInsert/count=1024-4                             0.00B
BTreeDeleteInsert/count=8192-4                             0.00B
BTreeDeleteInsert/count=65536-4                            0.00B
BTreeDeleteInsertCloneOnce/count=16-4                      0.00B
BTreeDeleteInsertCloneOnce/count=128-4                     0.00B
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00B
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00B
BTreeDeleteInsertCloneOnce/count=65536-4                   1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00B

name                                                      allocs/op
BTreeDeleteInsert/count=16-4                               0.00
BTreeDeleteInsert/count=128-4                              0.00
BTreeDeleteInsert/count=1024-4                             0.00
BTreeDeleteInsert/count=8192-4                             0.00
BTreeDeleteInsert/count=65536-4                            0.00
BTreeDeleteInsertCloneOnce/count=16-4                      0.00
BTreeDeleteInsertCloneOnce/count=128-4                     0.00
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00
BTreeDeleteInsertCloneOnce/count=65536-4                   0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4     2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4   4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00
```

Release note: None
Force-pushed from 10774c3 to d554c0e
The removal of `node.parent` and `node.pos` looks good.
Reviewed 2 of 4 files at r15, 1 of 1 files at r16, 3 of 3 files at r18.
pkg/storage/cmdq/interval_btree.go, line 683 at r19 (raw file):
```go
// asc ascends up to the current node's parent and resets the position to the
// one previously set for this parent node.
func (i *iterator) asc() {
```
Nit: let's spell out `ascend` and `descend`. My mind doesn't want to interpret `asc`, and it interprets `desc` as short for `descriptor`.
Force-pushed from d554c0e to 4a02b8c
TFTR. Did you get/want a chance to look at the overlap-scan code?
pkg/storage/cmdq/interval_btree.go, line 683 at r19 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: let's spell out `ascend` and `descend`. My mind doesn't want to interpret `asc`, and it interprets `desc` as short for `descriptor`.
Done.
Did you get/want a chance to look at the overlap-scan code?
I haven't had a chance yet. Either I need to take a look or someone else does. Perhaps this is a good opportunity for @ajwerner to get his feet wet reviewing code.
Good idea. @ajwerner do you mind adding a second review to this PR? Mainly just of the last 3 commits.
I added one more commit to prevent iterators from escaping to the heap upon creation. They were escaping for two reasons:
1. they were capturing a reference to the tree itself.
2. they were pointing a slice into their own array.

The commit addresses both of these problems and prevents the iterator from escaping when used. The fixes were:
1. copy the tree's root pointer reference instead of a reference to the tree itself.
2. avoid creating the self-referential slice reference (see golang/go#7921).
pkg/storage/cmdq/interval_btree.go
Outdated

```diff
-// MakeIter returns a new iterator object. Note that it is safe for an
-// iterator to be copied by value.
+// MakeIter returns a new iterator object. Note that it is safe for an iterator
+// to be copied by value. However, it is not safe to make modification to the
```
Is that really true? If you create an iterator, use it until it uses `s`, and then copy it by value, won't the copy and the original share its `s`, which gets mutated by each iterator?
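The hazard being raised can be shown generically: copying a Go struct by value copies only the slice header, so once an iterator has spilled its stack into a slice field, a value copy of the iterator shares that backing array. The `iter` type and field name `s` below are illustrative.

```go
package main

import "fmt"

// iter stands in for an iterator whose position stack has spilled from
// an inline array into a heap-allocated slice s.
type iter struct {
	s []int
}

func main() {
	a := iter{s: make([]int, 1, 4)}
	b := a      // value copy: b.s aliases a.s's backing array
	a.s[0] = 42 // a mutation through a...
	fmt.Println(b.s[0]) // ...is visible through b: prints 42
}
```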
LGTM. Very cool use of interval trees in the btree. Overlap scan I scrutinized more than the CoW code and couldn't find anything.
I'm lacking context on how the CoW changes to interval.btree relates to the cmdq.btree implementation, is it just to ensure that the technique is valid?
Reviewed 2 of 4 files at r25, 1 of 1 files at r26, 1 of 3 files at r28, 2 of 2 files at r30.
pkg/storage/cmdq/interval_btree.go, line 74 at r30 (raw file):
```go
}

type keyBound struct {
```
Totally a nit, but just for clarity: could you add a comment to indicate that this represents an upper bound, or alternatively a comment to define `contains` on `keyBound`, or maybe even just move the `upperBound` function up to just precede the type declaration?
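For readers without the file open, a hypothetical reconstruction of the type under discussion: an upper bound on the end keys in a subtree, with a flag marking whether the bound itself is attainable. The byte-slice keys and `contains` semantics here are illustrative assumptions, not the PR's exact code.

```go
package main

import (
	"bytes"
	"fmt"
)

// keyBound is an upper bound on a set of end keys. inc marks whether
// the bound is inclusive (i.e. the key itself can be reached).
type keyBound struct {
	key []byte
	inc bool
}

// contains reports whether a span starting at start lies below the
// bound, i.e. whether a subtree with this bound could hold an
// overlapping command.
func (b keyBound) contains(start []byte) bool {
	c := bytes.Compare(start, b.key)
	if b.inc {
		return c <= 0
	}
	return c < 0
}

func main() {
	b := keyBound{key: []byte("m"), inc: false}
	fmt.Println(b.contains([]byte("a")), b.contains([]byte("m"))) // true false
}
```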
pkg/storage/cmdq/interval_btree.go, line 593 at r30 (raw file):
Previously, tbg (Tobias Schottdorf) wrote…
Is that really true? If you create an iterator, use it until it uses `s`, and then copy it by value, won't the copy and the original share its `s`, which gets mutated by each iterator?
Seems like a good catch to me. Maybe add a copy method that copies the slice if it's in use?
google/btree#9 Release note: None
google/btree#11 Release note: None
These were all taken directly from https://github.com/google/btree Release note: None
google/btree#5 Release note: None
Release note: None
google/btree#16 Release note: None
This commit forks https://github.com/petermattis/pebble/blob/master/internal/btree/btree.go and rips out any references to pebble. There are a number of changes we'll need to make to it:
1. Make the structure an augmented interval tree
2. Add synchronized node and leafNode freelists
3. Introduce immutability and a copy-on-write policy

NOTE: this commit is meant to be as clean of a fork as possible. The next commit will clean the code up significantly.

Release note: None
This commit modifies the btree type added in the previous commit and turns it into an augmented interval tree. The tree represents intervals and permits an interval search operation following the approach laid out in CLRS, Chapter 14. The B-Tree stores cmds in order based on their start key and each B-Tree node maintains the upper-bound end key of all cmds in its subtree. This is close to what `util/interval.btree` does, although the new version doesn't maintain the lower-bound start key of all cmds in each node.

The new interval btree is significantly faster than both the old interval btree and the old interval llrb tree because it minimizes key comparisons while scanning for overlaps. This includes avoiding all key comparisons for cmds with start keys that are greater than the search range's start key. See the comment on `overlapScan` for an explanation of how this is possible.

The new interval btree is also faster because it has been specialized for the `storage/cmdq` package. This allows it to avoid interfaces and dynamic dispatch throughout its operations, which showed up prominently on profiles of the other two implementations.

A third benefit of the rewrite is that it inherits the optimizations made in pebble's btree. This includes inlining the btree items and child pointers in nodes instead of using slices.

### Benchmarks:

_The new interval btree:_
```
Insert/count=16-4               76.1ns ± 4%
Insert/count=128-4               156ns ± 4%
Insert/count=1024-4              259ns ± 8%
Insert/count=8192-4              386ns ± 1%
Insert/count=65536-4             735ns ± 5%
Delete/count=16-4                129ns ±16%
Delete/count=128-4               189ns ±12%
Delete/count=1024-4              338ns ± 7%
Delete/count=8192-4              547ns ± 4%
Delete/count=65536-4            1.22µs ±12%
DeleteInsert/count=16-4          168ns ± 2%
DeleteInsert/count=128-4         375ns ± 8%
DeleteInsert/count=1024-4        562ns ± 1%
DeleteInsert/count=8192-4        786ns ± 3%
DeleteInsert/count=65536-4      2.31µs ±26%
IterSeekGE/count=16-4           87.2ns ± 3%
IterSeekGE/count=128-4           141ns ± 3%
IterSeekGE/count=1024-4          227ns ± 4%
IterSeekGE/count=8192-4          379ns ± 2%
IterSeekGE/count=65536-4         882ns ± 1%
IterSeekLT/count=16-4           89.5ns ± 3%
IterSeekLT/count=128-4           145ns ± 1%
IterSeekLT/count=1024-4          226ns ± 6%
IterSeekLT/count=8192-4          379ns ± 1%
IterSeekLT/count=65536-4         891ns ± 1%
IterFirstOverlap/count=16-4      184ns ± 1%
IterFirstOverlap/count=128-4     260ns ± 3%
IterFirstOverlap/count=1024-4    685ns ± 7%
IterFirstOverlap/count=8192-4   1.23µs ± 2%
IterFirstOverlap/count=65536-4  2.14µs ± 1%
IterNext-4                      3.82ns ± 2%
IterPrev-4                      14.8ns ± 2%
IterNextOverlap-4               8.57ns ± 2%
IterOverlapScan-4               25.8µs ± 3%
```

_Compared to old llrb interval tree (currently in use):_
```
Insert/count=16-4            323ns ± 7%    76ns ± 4%  -76.43%  (p=0.008 n=5+5)
Insert/count=128-4           539ns ± 2%   156ns ± 4%  -71.05%  (p=0.008 n=5+5)
Insert/count=1024-4          797ns ± 1%   259ns ± 8%  -67.52%  (p=0.008 n=5+5)
Insert/count=8192-4         1.30µs ± 5%  0.39µs ± 1%  -70.38%  (p=0.008 n=5+5)
Insert/count=65536-4        2.69µs ±11%  0.74µs ± 5%  -72.65%  (p=0.008 n=5+5)
Delete/count=16-4            438ns ± 7%   129ns ±16%  -70.44%  (p=0.008 n=5+5)
Delete/count=128-4           785ns ± 6%   189ns ±12%  -75.89%  (p=0.008 n=5+5)
Delete/count=1024-4         1.38µs ± 2%  0.34µs ± 7%  -75.44%  (p=0.008 n=5+5)
Delete/count=8192-4         2.36µs ± 2%  0.55µs ± 4%  -76.82%  (p=0.008 n=5+5)
Delete/count=65536-4        4.73µs ±13%  1.22µs ±12%  -74.19%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      920ns ± 2%   168ns ± 2%  -81.76%  (p=0.008 n=5+5)
DeleteInsert/count=128-4    1.73µs ± 4%  0.37µs ± 8%  -78.35%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   2.69µs ± 3%  0.56µs ± 1%  -79.15%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   4.55µs ±25%  0.79µs ± 3%  -82.70%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  7.53µs ± 6%  2.31µs ±26%  -69.32%  (p=0.008 n=5+5)
IterOverlapScan-4            285µs ± 7%    26µs ± 3%  -90.96%  (p=0.008 n=5+5)
```

_Compared to old btree interval tree (added in a61191e, never enabled):_
```
Insert/count=16-4            231ns ± 1%    76ns ± 4%  -66.99%  (p=0.008 n=5+5)
Insert/count=128-4           351ns ± 2%   156ns ± 4%  -55.53%  (p=0.008 n=5+5)
Insert/count=1024-4          515ns ± 5%   259ns ± 8%  -49.73%  (p=0.008 n=5+5)
Insert/count=8192-4          786ns ± 3%   386ns ± 1%  -50.85%  (p=0.008 n=5+5)
Insert/count=65536-4        1.50µs ± 3%  0.74µs ± 5%  -50.97%  (p=0.008 n=5+5)
Delete/count=16-4            363ns ±11%   129ns ±16%  -64.33%  (p=0.008 n=5+5)
Delete/count=128-4           466ns ± 9%   189ns ±12%  -59.42%  (p=0.008 n=5+5)
Delete/count=1024-4          806ns ± 6%   338ns ± 7%  -58.01%  (p=0.008 n=5+5)
Delete/count=8192-4         1.43µs ±13%  0.55µs ± 4%  -61.71%  (p=0.008 n=5+5)
Delete/count=65536-4        2.75µs ± 1%  1.22µs ±12%  -55.57%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      557ns ± 1%   168ns ± 2%  -69.87%  (p=0.008 n=5+5)
DeleteInsert/count=128-4     953ns ± 8%   375ns ± 8%  -60.71%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   1.19µs ± 4%  0.56µs ± 1%  -52.72%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   1.84µs ±17%  0.79µs ± 3%  -57.22%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  3.20µs ± 3%  2.31µs ±26%  -27.86%  (p=0.008 n=5+5)
IterOverlapScan-4           70.1µs ± 2%  25.8µs ± 3%  -63.23%  (p=0.008 n=5+5)
```

Release note: None
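The augmentation the commit describes - each node carrying the maximum end key of its subtree, per CLRS Chapter 14 - can be sketched on a toy tree. This is not the PR's `overlapScan`; it is a simplified recursive scan over hypothetical integer intervals that shows the two pruning ideas: skip a subtree whose max end is at or below the query's start, and stop once start keys pass the query's end.

```go
package main

import "fmt"

// interval is the half-open span [start, end).
type interval struct{ start, end int }

// node of a toy augmented btree: items sorted by start key, children
// interleaved around them, and max = greatest end key in the subtree.
type node struct {
	items    []interval
	children []*node // empty for leaf nodes
	max      int
}

// overlaps reports whether a and b intersect.
func overlaps(a, b interval) bool { return a.start < b.end && b.start < a.end }

// collect appends every interval in n's subtree overlapping q.
func collect(n *node, q interval, out *[]interval) {
	if n == nil || n.max <= q.start {
		return // prune: nothing in this subtree ends after q.start
	}
	for i, it := range n.items {
		if len(n.children) > 0 {
			collect(n.children[i], q, out)
		}
		if it.start >= q.end {
			return // items are start-ordered: nothing later can overlap
		}
		if overlaps(it, q) {
			*out = append(*out, it)
		}
	}
	if len(n.children) > 0 {
		collect(n.children[len(n.items)], q, out)
	}
}

func main() {
	leafA := &node{items: []interval{{1, 3}}, max: 3}
	leafB := &node{items: []interval{{5, 9}}, max: 9}
	root := &node{items: []interval{{4, 6}}, children: []*node{leafA, leafB}, max: 9}
	var out []interval
	collect(root, interval{2, 6}, &out)
	fmt.Println(out) // [{1 3} {4 6} {5 9}]
}
```

The real implementation goes further, avoiding comparisons entirely for the "contiguous middle" of the scan, but the max-end pruning above is the core of the augmented-tree approach.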
This commit rips out `node.parent` and `node.pos`. These won't work with a copy-on-write strategy. This makes the iterator code a little more complicated because the iterator now needs to maintain a stack of nodes and positions. We avoid any allocations for this stack for trees up to height 4 (~130k items). Because of this, the change does not seem to have any effect on the benchmarks. Theoretically, it should actually speed up mutations to the tree. Release note: None
The iterator returned by `tree.MakeIter` was escaping to the heap for two reasons:
1. it was capturing a reference to the tree itself.
2. it was pointing a slice into its own array.

This change addresses both of these problems and prevents the iterator from escaping when used. The fixes were:
1. copy the tree's root pointer reference instead of a reference to the tree itself.
2. avoid creating the self-referential slice reference. This mistakenly escapes because of golang/go#7921, which also caused issues with `bytes.Buffer` (https://golang.org/cl/133715).

This change also adds a new benchmark which demonstrates whether `MakeIter` escapes or not:

```
name             old time/op    new time/op    delta
BTreeMakeIter-4    131ns ±14%      25ns ± 1%    -81.23%  (p=0.000 n=9+9)

name             old alloc/op   new alloc/op   delta
BTreeMakeIter-4     144B ± 0%        0B        -100.00%  (p=0.000 n=10+10)

name             old allocs/op  new allocs/op  delta
BTreeMakeIter-4     1.00 ± 0%      0.00        -100.00%  (p=0.000 n=10+10)
```

Release note: None
Force-pushed from a0955d2 to cc9c227
TFTRs!
bors r+
I'm lacking context on how the CoW changes to interval.btree relates to the cmdq.btree implementation, is it just to ensure that the technique is valid?
The CoW stuff that we're going to use is actually all in #32251. The idea is that it will enable the access pattern described in #31997, which should significantly improve concurrency in command queue operations.
pkg/storage/cmdq/interval_btree.go, line 74 at r30 (raw file):
Previously, ajwerner wrote…
Totally a nit, but just for clarity: could you add a comment to indicate that this represents an upper bound, or alternatively a comment to define `contains` on `keyBound`, or maybe even just move the `upperBound` function up to just precede the type declaration?
Good suggestion. Done.
pkg/storage/cmdq/interval_btree.go, line 593 at r30 (raw file):
Previously, ajwerner wrote…
Seems like a good catch to me. Maybe add a copy method that copies the slice if it's in use?
Ah you're totally right. Removed the comment. We don't actually rely on this anywhere.
Build failed (retrying...)
Failure was with #32368.
32165: storage/cmdq: create new specialized augmented interval btree r=nvanbenschoten a=nvanbenschoten

This is a component of the larger change in #31997.

The first few commits here modify the existing interval btree implementation, allowing us to properly benchmark against it.

The second to last commit forks https://github.com/petermattis/pebble/blob/master/internal/btree/btree.go, specializes it to the command queue, and rips out any references to pebble. There are a number of changes we'll need to make to it:
1. Add synchronized node and leafNode freelists
2. Add Clear method to release owned nodes into freelists
3. Introduce immutability and a copy-on-write policy

The next commit modifies the btree type added in the previous commit and turns it into an augmented interval tree. The tree represents intervals and permits an interval search operation following the approach laid out in CLRS, Chapter 14. The B-Tree stores cmds in order based on their start key and each B-Tree node maintains the upper-bound end key of all cmds in its subtree. This is close to what `util/interval.btree` does, although the new version doesn't maintain the lower-bound start key of all cmds in each node.

The new interval btree is significantly faster than both the old interval btree and the old interval llrb tree because it minimizes key comparisons while scanning for overlaps. This includes avoiding all key comparisons for cmds with start keys that are greater than the search range's start key. See the comment on `overlapScan` for an explanation of how this is possible.

The new interval btree is also faster because it has been specialized for the `storage/cmdq` package. This allows it to avoid interfaces and dynamic dispatch throughout its operations, which showed up prominently on profiles of the other two implementations.

A third benefit of the rewrite is that it inherits the optimizations made in pebble's btree. This includes inlining the btree items and child pointers in nodes instead of using slices.

### Benchmarks:

_The new interval btree:_
```
Insert/count=16-4               76.1ns ± 4%
Insert/count=128-4               156ns ± 4%
Insert/count=1024-4              259ns ± 8%
Insert/count=8192-4              386ns ± 1%
Insert/count=65536-4             735ns ± 5%
Delete/count=16-4                129ns ±16%
Delete/count=128-4               189ns ±12%
Delete/count=1024-4              338ns ± 7%
Delete/count=8192-4              547ns ± 4%
Delete/count=65536-4            1.22µs ±12%
DeleteInsert/count=16-4          168ns ± 2%
DeleteInsert/count=128-4         375ns ± 8%
DeleteInsert/count=1024-4        562ns ± 1%
DeleteInsert/count=8192-4        786ns ± 3%
DeleteInsert/count=65536-4      2.31µs ±26%
IterSeekGE/count=16-4           87.2ns ± 3%
IterSeekGE/count=128-4           141ns ± 3%
IterSeekGE/count=1024-4          227ns ± 4%
IterSeekGE/count=8192-4          379ns ± 2%
IterSeekGE/count=65536-4         882ns ± 1%
IterSeekLT/count=16-4           89.5ns ± 3%
IterSeekLT/count=128-4           145ns ± 1%
IterSeekLT/count=1024-4          226ns ± 6%
IterSeekLT/count=8192-4          379ns ± 1%
IterSeekLT/count=65536-4         891ns ± 1%
IterFirstOverlap/count=16-4      184ns ± 1%
IterFirstOverlap/count=128-4     260ns ± 3%
IterFirstOverlap/count=1024-4    685ns ± 7%
IterFirstOverlap/count=8192-4   1.23µs ± 2%
IterFirstOverlap/count=65536-4  2.14µs ± 1%
IterNext-4                      3.82ns ± 2%
IterPrev-4                      14.8ns ± 2%
IterNextOverlap-4               8.57ns ± 2%
IterOverlapScan-4               25.8µs ± 3%
```

_Compared to old llrb interval tree (currently in use):_
```
Insert/count=16-4            323ns ± 7%    76ns ± 4%  -76.43%  (p=0.008 n=5+5)
Insert/count=128-4           539ns ± 2%   156ns ± 4%  -71.05%  (p=0.008 n=5+5)
Insert/count=1024-4          797ns ± 1%   259ns ± 8%  -67.52%  (p=0.008 n=5+5)
Insert/count=8192-4         1.30µs ± 5%  0.39µs ± 1%  -70.38%  (p=0.008 n=5+5)
Insert/count=65536-4        2.69µs ±11%  0.74µs ± 5%  -72.65%  (p=0.008 n=5+5)
Delete/count=16-4            438ns ± 7%   129ns ±16%  -70.44%  (p=0.008 n=5+5)
Delete/count=128-4           785ns ± 6%   189ns ±12%  -75.89%  (p=0.008 n=5+5)
Delete/count=1024-4         1.38µs ± 2%  0.34µs ± 7%  -75.44%  (p=0.008 n=5+5)
Delete/count=8192-4         2.36µs ± 2%  0.55µs ± 4%  -76.82%  (p=0.008 n=5+5)
Delete/count=65536-4        4.73µs ±13%  1.22µs ±12%  -74.19%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      920ns ± 2%   168ns ± 2%  -81.76%  (p=0.008 n=5+5)
DeleteInsert/count=128-4    1.73µs ± 4%  0.37µs ± 8%  -78.35%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   2.69µs ± 3%  0.56µs ± 1%  -79.15%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   4.55µs ±25%  0.79µs ± 3%  -82.70%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  7.53µs ± 6%  2.31µs ±26%  -69.32%  (p=0.008 n=5+5)
IterOverlapScan-4            285µs ± 7%    26µs ± 3%  -90.96%  (p=0.008 n=5+5)
```

_Compared to old btree interval tree (added in a61191e, never enabled):_
```
Insert/count=16-4            231ns ± 1%    76ns ± 4%  -66.99%  (p=0.008 n=5+5)
Insert/count=128-4           351ns ± 2%   156ns ± 4%  -55.53%  (p=0.008 n=5+5)
Insert/count=1024-4          515ns ± 5%   259ns ± 8%  -49.73%  (p=0.008 n=5+5)
Insert/count=8192-4          786ns ± 3%   386ns ± 1%  -50.85%  (p=0.008 n=5+5)
Insert/count=65536-4        1.50µs ± 3%  0.74µs ± 5%  -50.97%  (p=0.008 n=5+5)
Delete/count=16-4            363ns ±11%   129ns ±16%  -64.33%  (p=0.008 n=5+5)
Delete/count=128-4           466ns ± 9%   189ns ±12%  -59.42%  (p=0.008 n=5+5)
Delete/count=1024-4          806ns ± 6%   338ns ± 7%  -58.01%  (p=0.008 n=5+5)
Delete/count=8192-4         1.43µs ±13%  0.55µs ± 4%  -61.71%  (p=0.008 n=5+5)
Delete/count=65536-4        2.75µs ± 1%  1.22µs ±12%  -55.57%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      557ns ± 1%   168ns ± 2%  -69.87%  (p=0.008 n=5+5)
DeleteInsert/count=128-4     953ns ± 8%   375ns ± 8%  -60.71%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   1.19µs ± 4%  0.56µs ± 1%  -52.72%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   1.84µs ±17%  0.79µs ± 3%  -57.22%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  3.20µs ± 3%  2.31µs ±26%  -27.86%  (p=0.008 n=5+5)
IterOverlapScan-4           70.1µs ± 2%  25.8µs ± 3%  -63.23%  (p=0.008 n=5+5)
```

Co-authored-by: Nathan VanBenschoten <[email protected]>
Build succeeded
32251: storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy r=nvanbenschoten a=nvanbenschoten

All commits from #32165 except the last one.

This change introduces O(1) btree cloning and a new copy-on-write scheme, essentially giving the btree an immutable API (for which I took inspiration from https://docs.rs/crate/im/). This is made efficient by the second part of the change - a new garbage collection policy for btrees. Nodes are now reference counted atomically and freed into global `sync.Pools` when they are no longer referenced.

One of the main ideas in #31997 is to treat the btrees backing the command queue as immutable structures. In doing so, we adopt a copy-on-write scheme. Trees are cloned under lock and then accessed concurrently. When future writers want to modify the tree, they can do so by cloning any nodes that they touch.

This commit provides this functionality in a much more elegant manner than 6994347. Instead of giving each node a "copy-on-write context", we instead give each node a reference count. We then use the following rule:
1. Trees with exclusive ownership (refcount == 1) over a node can modify it in-place.
2. Trees without exclusive ownership over a node must clone the node in order to modify it. Once cloned, the tree will now have exclusive ownership over that node. When cloning the node, the reference count of all of the node's children must be incremented.

In following these simple rules, we end up with a really nice property - trees gain more and more "ownership" as they make modifications, meaning that subsequent modifications are much less likely to need to clone nodes. Essentially, we transparently incorporate the idea of local mutations (e.g. Clojure's transients or Haskell's ST monad) without any external API needed.

Even better, reference counting internal nodes ties directly into the new GC policy, which allows us to recycle old nodes and make the copy-on-write scheme zero-allocation in almost all cases. When a node's reference count drops to 0, we simply toss it into a `sync.Pool`. We keep two separate pools - one for leaf nodes and one for non-leaf nodes. This wasn't possible with the previous "copy-on-write context" approach.

The atomic reference counting does have an effect on benchmarks, but it's not a big one (single/double digit ns) and is negligible compared to the speedup observed in #32165.

```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```

We can see how important the GC and memory re-use policy is by comparing the following few benchmarks. Specifically, notice the difference in operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime` between the tests that `Reset` old clones (allowing nodes to be freed into `sync.Pools`) and the tests that don't `Reset` old clones.

```
name                                                     time/op
BTreeDeleteInsert/count=16-4                              198ns ±28%
BTreeDeleteInsert/count=128-4                             375ns ± 3%
BTreeDeleteInsert/count=1024-4                            577ns ± 2%
BTreeDeleteInsert/count=8192-4                            798ns ± 1%
BTreeDeleteInsert/count=65536-4                          2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4                     173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4                    379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4                   584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4                   800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4                 2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4     535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4   1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4  2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4  2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4 5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4      240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4     610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4   1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4   1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4  3.52µs ±18%

name                                                     alloc/op
BTreeDeleteInsert/count=16-4                              0.00B
BTreeDeleteInsert/count=128-4                             0.00B
BTreeDeleteInsert/count=1024-4                            0.00B
BTreeDeleteInsert/count=8192-4                            0.00B
BTreeDeleteInsert/count=65536-4                           0.00B
BTreeDeleteInsertCloneOnce/count=16-4                     0.00B
BTreeDeleteInsertCloneOnce/count=128-4                    0.00B
BTreeDeleteInsertCloneOnce/count=1024-4                   0.00B
BTreeDeleteInsertCloneOnce/count=8192-4                   0.00B
BTreeDeleteInsertCloneOnce/count=65536-4                  1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4     897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4  1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4  1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4 2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4      0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4    0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4    0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4   0.00B

name                                                     allocs/op
BTreeDeleteInsert/count=16-4                               0.00
BTreeDeleteInsert/count=128-4                              0.00
BTreeDeleteInsert/count=1024-4                             0.00
BTreeDeleteInsert/count=8192-4                             0.00
BTreeDeleteInsert/count=65536-4                            0.00
BTreeDeleteInsertCloneOnce/count=16-4                      0.00
BTreeDeleteInsertCloneOnce/count=128-4                     0.00
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00
BTreeDeleteInsertCloneOnce/count=65536-4                   0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4     2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4   4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00
```

Co-authored-by: Nathan VanBenschoten <[email protected]>
This is a component of the larger change in #31997.
The first few commits here modify the existing interval btree implementation,
allowing us to properly benchmark against it.
The second to last commit forks https://github.com/petermattis/pebble/blob/master/internal/btree/btree.go, specializes
it to the command queue, and rips out any references to pebble. There are a number
of changes we'll need to make to it:
1. Add synchronized node and leafNode freelists
2. Add Clear method to release owned nodes into freelists
3. Introduce immutability and a copy-on-write policy
The next commit modifies the btree type added in the previous commit
and turns it into an augmented interval tree. The tree represents
intervals and permits an interval search operation following the
approach laid out in CLRS, Chapter 14. The B-Tree stores cmds in
order based on their start key and each B-Tree node maintains the
upper-bound end key of all cmds in its subtree. This is close to
what `util/interval.btree` does, although the new version doesn't maintain the lower-bound start key of all cmds in each node.
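The augmentation described above can be sketched as follows: cmds sit in a fixed-size array inside each node (inlined, not a slice), and the node caches the largest end key of its subtree. This is a simplified illustration, not the real implementation; `maxCmds`, `recomputeMax`, and the field names are assumptions for the example.

```go
package main

import (
	"bytes"
	"fmt"
)

const maxCmds = 4 // the real tree uses a much larger node fanout

type cmd struct {
	start, end []byte
}

// node inlines its items and child pointers, and caches `max`, the
// upper-bound end key over every cmd in its subtree.
type node struct {
	count    int
	leaf     bool
	cmds     [maxCmds]*cmd
	children [maxCmds + 1]*node
	max      []byte
}

// recomputeMax re-derives the subtree upper bound after a mutation by
// taking the max over this node's cmds and its children's cached maxes.
func (n *node) recomputeMax() {
	var max []byte
	for i := 0; i < n.count; i++ {
		if max == nil || bytes.Compare(n.cmds[i].end, max) > 0 {
			max = n.cmds[i].end
		}
	}
	if !n.leaf {
		for i := 0; i <= n.count; i++ {
			if c := n.children[i]; c != nil && bytes.Compare(c.max, max) > 0 {
				max = c.max
			}
		}
	}
	n.max = max
}

// buildExample fills one leaf with cmds sorted by start key.
func buildExample() *node {
	leaf := &node{leaf: true}
	for _, kv := range [][2]string{{"a", "c"}, {"b", "f"}, {"d", "e"}} {
		leaf.cmds[leaf.count] = &cmd{start: []byte(kv[0]), end: []byte(kv[1])}
		leaf.count++
	}
	leaf.recomputeMax()
	return leaf
}

func main() {
	fmt.Println(string(buildExample().max)) // largest end key in the subtree
}
```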
The new interval btree is significantly faster than both the old
interval btree and the old interval llrb tree because it minimizes
key comparisons while scanning for overlaps. This includes avoiding
all key comparisons for cmds with start keys that are greater than
the search range's start key. See the comment on `overlapScan` for an explanation of how this is possible.
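The pruning idea behind this kind of overlap scan can be illustrated on a plain binary interval tree in the CLRS Chapter 14 style: any subtree whose cached maximum end key does not reach past the search start can be skipped without examining its intervals. This is a hedged sketch with made-up names (`ival`, `tnode`, `overlaps`), not the B-tree from this PR.

```go
package main

import "fmt"

// ival is a half-open interval [start, end).
type ival struct{ start, end int }

// tnode augments a binary search tree (keyed on start) with the max end
// key of its subtree, as in CLRS Chapter 14.
type tnode struct {
	iv          ival
	max         int
	left, right *tnode
}

func insert(n *tnode, iv ival) *tnode {
	if n == nil {
		return &tnode{iv: iv, max: iv.end}
	}
	if iv.start < n.iv.start {
		n.left = insert(n.left, iv)
	} else {
		n.right = insert(n.right, iv)
	}
	if iv.end > n.max {
		n.max = iv.end
	}
	return n
}

// overlaps appends every interval overlapping [start, end) to out,
// pruning any subtree whose cached max end key is <= start.
func overlaps(n *tnode, start, end int, out *[]ival) {
	if n == nil || n.max <= start {
		return // nothing in this subtree reaches past the search start
	}
	overlaps(n.left, start, end, out)
	if n.iv.start < end && start < n.iv.end {
		*out = append(*out, n.iv)
	}
	if n.iv.start < end { // right subtree holds only larger start keys
		overlaps(n.right, start, end, out)
	}
}

func buildTree() *tnode {
	var root *tnode
	for _, iv := range []ival{{1, 3}, {2, 6}, {5, 8}, {9, 12}} {
		root = insert(root, iv)
	}
	return root
}

func main() {
	var out []ival
	overlaps(buildTree(), 4, 10, &out)
	fmt.Println(out)
}
```

The real btree goes a step further than this sketch: once the scan is positioned past the search range's start key, it stops comparing start keys per cmd entirely, which is the optimization the `overlapScan` comment explains.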
The new interval btree is also faster because it has been specialized
for the `storage/cmdq` package. This allows it to avoid interfaces and dynamic dispatch throughout its operations, which showed up
prominently on profiles of the other two implementations.
A third benefit of the rewrite is that it inherits the optimizations
made in pebble's btree. This includes inlining the btree items and
child pointers in nodes instead of using slices.
### Benchmarks:

_The new interval btree:_

_Compared to old llrb interval tree (currently in use):_

_Compared to old btree interval tree (added in a61191e, never enabled):_