storage/cmdq: create new specialized augmented interval btree #32165

Merged
merged 11 commits into cockroachdb:master from nvanbenschoten/cmdqTree on Nov 15, 2018

Conversation

nvanbenschoten
Member

This is a component of the larger change in #31997.

The first few commits here modify the existing interval btree implementation,
allowing us to properly benchmark against it.

The second to last commit forks https://github.com/petermattis/pebble/blob/master/internal/btree/btree.go, specializes
it to the command queue, and rips out any references to pebble. There are a number
of changes we'll need to make to it:

  1. Add synchronized node and leafNode freelists
  2. Add Clear method to release owned nodes into freelists
  3. Introduce immutability and a copy-on-write policy

The next commit modifies the btree type added in the previous commit
and turns it into an augmented interval tree. The tree represents
intervals and permits an interval search operation following the
approach laid out in CLRS, Chapter 14. The B-Tree stores cmds in
order based on their start key and each B-Tree node maintains the
upper-bound end key of all cmds in its subtree. This is close to
what util/interval.btree does, although the new version doesn't
maintain the lower-bound start key of all cmds in each node.

The new interval btree is significantly faster than both the old
interval btree and the old interval llrb tree because it minimizes
key comparisons while scanning for overlaps. This includes avoiding
all key comparisons for cmds with start keys that are greater than
the search range's start key. See the comment on overlapScan for
an explanation of how this is possible.
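
As a rough, usage-level sketch of that strategy (the iterator methods `Valid` and `Cmd` and the function below are assumed names for illustration, not the actual cmdq API; `FirstOverlap` and `NextOverlap` mirror the operations benchmarked below): any cmd whose start key falls inside the search range necessarily overlaps it, so once the iterator is positioned at the first overlapping cmd, later steps can emit results without comparing end keys.

```
func visitOverlapping(tr *btree, search *cmd, visit func(*cmd)) {
	it := tr.MakeIter()
	// Seek to the first cmd that overlaps the search range, using the
	// augmented subtree bounds to prune subtrees that cannot contain a match.
	for it.FirstOverlap(search); it.Valid(); it.NextOverlap() {
		// Every cmd visited here overlaps the search range; most steps of
		// NextOverlap require no key comparisons at all.
		visit(it.Cmd())
	}
}
```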

The new interval btree is also faster because it has been specialized
for the storage/cmdq package. This allows it to avoid interfaces
and dynamic dispatch throughout its operations, which showed up
prominently on profiles of the other two implementations.

A third benefit of the rewrite is that it inherits the optimizations
made in pebble's btree. This includes inlining the btree items and
child pointers in nodes instead of using slices.
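
For illustration, a node in such a specialized tree might look roughly like the following. The field names, the degree constant, and the keyBound layout are assumptions for this sketch rather than the exact cmdq definitions; `cmd` is the package's existing command type.

```
const (
	degree   = 16           // assumed; controls node fan-out
	maxCmds  = 2*degree - 1 // cmds per node
	maxChild = 2 * degree   // children per interior node
)

// keyBound is an end-key bound. inclusive distinguishes point cmds (bounded
// by their start key, inclusively) from ranged cmds (bounded by their
// exclusive end key).
type keyBound struct {
	key       []byte
	inclusive bool
}

type node struct {
	count    int16           // number of cmds currently stored in the node
	leaf     bool            // true if the node has no children
	max      keyBound        // upper bound over every cmd in this subtree
	cmds     [maxCmds]*cmd   // inlined array, kept sorted by start key
	children [maxChild]*node // inlined array, unused in leaf nodes
}
```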

Benchmarks:

The new interval btree:

```
Insert/count=16-4               76.1ns ± 4%
Insert/count=128-4               156ns ± 4%
Insert/count=1024-4              259ns ± 8%
Insert/count=8192-4              386ns ± 1%
Insert/count=65536-4             735ns ± 5%
Delete/count=16-4                129ns ±16%
Delete/count=128-4               189ns ±12%
Delete/count=1024-4              338ns ± 7%
Delete/count=8192-4              547ns ± 4%
Delete/count=65536-4            1.22µs ±12%
DeleteInsert/count=16-4          168ns ± 2%
DeleteInsert/count=128-4         375ns ± 8%
DeleteInsert/count=1024-4        562ns ± 1%
DeleteInsert/count=8192-4        786ns ± 3%
DeleteInsert/count=65536-4      2.31µs ±26%
IterSeekGE/count=16-4           87.2ns ± 3%
IterSeekGE/count=128-4           141ns ± 3%
IterSeekGE/count=1024-4          227ns ± 4%
IterSeekGE/count=8192-4          379ns ± 2%
IterSeekGE/count=65536-4         882ns ± 1%
IterSeekLT/count=16-4           89.5ns ± 3%
IterSeekLT/count=128-4           145ns ± 1%
IterSeekLT/count=1024-4          226ns ± 6%
IterSeekLT/count=8192-4          379ns ± 1%
IterSeekLT/count=65536-4         891ns ± 1%
IterFirstOverlap/count=16-4      184ns ± 1%
IterFirstOverlap/count=128-4     260ns ± 3%
IterFirstOverlap/count=1024-4    685ns ± 7%
IterFirstOverlap/count=8192-4   1.23µs ± 2%
IterFirstOverlap/count=65536-4  2.14µs ± 1%
IterNext-4                      3.82ns ± 2%
IterPrev-4                      14.8ns ± 2%
IterNextOverlap-4               8.57ns ± 2%
IterOverlapScan-4               25.8µs ± 3%
```

Compared to old llrb interval tree (currently in use):

```
Insert/count=16-4            323ns ± 7%    76ns ± 4%  -76.43%  (p=0.008 n=5+5)
Insert/count=128-4           539ns ± 2%   156ns ± 4%  -71.05%  (p=0.008 n=5+5)
Insert/count=1024-4          797ns ± 1%   259ns ± 8%  -67.52%  (p=0.008 n=5+5)
Insert/count=8192-4         1.30µs ± 5%  0.39µs ± 1%  -70.38%  (p=0.008 n=5+5)
Insert/count=65536-4        2.69µs ±11%  0.74µs ± 5%  -72.65%  (p=0.008 n=5+5)
Delete/count=16-4            438ns ± 7%   129ns ±16%  -70.44%  (p=0.008 n=5+5)
Delete/count=128-4           785ns ± 6%   189ns ±12%  -75.89%  (p=0.008 n=5+5)
Delete/count=1024-4         1.38µs ± 2%  0.34µs ± 7%  -75.44%  (p=0.008 n=5+5)
Delete/count=8192-4         2.36µs ± 2%  0.55µs ± 4%  -76.82%  (p=0.008 n=5+5)
Delete/count=65536-4        4.73µs ±13%  1.22µs ±12%  -74.19%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      920ns ± 2%   168ns ± 2%  -81.76%  (p=0.008 n=5+5)
DeleteInsert/count=128-4    1.73µs ± 4%  0.37µs ± 8%  -78.35%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   2.69µs ± 3%  0.56µs ± 1%  -79.15%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   4.55µs ±25%  0.79µs ± 3%  -82.70%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  7.53µs ± 6%  2.31µs ±26%  -69.32%  (p=0.008 n=5+5)
IterOverlapScan-4            285µs ± 7%    26µs ± 3%  -90.96%  (p=0.008 n=5+5)
```

Compared to old btree interval tree (added in a61191e, never enabled):

```
Insert/count=16-4            231ns ± 1%    76ns ± 4%  -66.99%  (p=0.008 n=5+5)
Insert/count=128-4           351ns ± 2%   156ns ± 4%  -55.53%  (p=0.008 n=5+5)
Insert/count=1024-4          515ns ± 5%   259ns ± 8%  -49.73%  (p=0.008 n=5+5)
Insert/count=8192-4          786ns ± 3%   386ns ± 1%  -50.85%  (p=0.008 n=5+5)
Insert/count=65536-4        1.50µs ± 3%  0.74µs ± 5%  -50.97%  (p=0.008 n=5+5)
Delete/count=16-4            363ns ±11%   129ns ±16%  -64.33%  (p=0.008 n=5+5)
Delete/count=128-4           466ns ± 9%   189ns ±12%  -59.42%  (p=0.008 n=5+5)
Delete/count=1024-4          806ns ± 6%   338ns ± 7%  -58.01%  (p=0.008 n=5+5)
Delete/count=8192-4         1.43µs ±13%  0.55µs ± 4%  -61.71%  (p=0.008 n=5+5)
Delete/count=65536-4        2.75µs ± 1%  1.22µs ±12%  -55.57%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      557ns ± 1%   168ns ± 2%  -69.87%  (p=0.008 n=5+5)
DeleteInsert/count=128-4     953ns ± 8%   375ns ± 8%  -60.71%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   1.19µs ± 4%  0.56µs ± 1%  -52.72%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   1.84µs ±17%  0.79µs ± 3%  -57.22%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  3.20µs ± 3%  2.31µs ±26%  -27.86%  (p=0.008 n=5+5)
IterOverlapScan-4           70.1µs ± 2%  25.8µs ± 3%  -63.23%  (p=0.008 n=5+5)
```

@nvanbenschoten nvanbenschoten requested review from petermattis and a team November 6, 2018 15:16
@cockroach-teamcity
Member

This change is Reviewable

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cmdqTree branch from 18b27e4 to d499efa on November 6, 2018 20:47
Collaborator

@petermattis petermattis left a comment


:lgtm:

I need to give the interval-btree and overlap-scan code more thorough scrutiny. The benchmark improvements are very compelling.

Reviewed 1 of 1 files at r1, 1 of 1 files at r2, 1 of 1 files at r3, 1 of 1 files at r4, 1 of 1 files at r5, 4 of 4 files at r6, 1 of 1 files at r7, 2 of 2 files at r8.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/cmdq/interval_btree.go, line 223 at r9 (raw file):

func (n *node) find(c *cmd) (index int, found bool) {
	// Logic copied from sort.Search. Inlining this gave
	// an 11% speedup on BenchmarkBTreeDeleteInsert.

Interesting.
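
(For reference, an inlined sort.Search over a node's sorted cmds amounts to a plain lower-bound binary search followed by an equality check. The sketch below uses an assumed `cmp` helper and field names, not the exact cmdq code; inlining removes the closure and function-value indirection that sort.Search itself would introduce on this hot path.)

```
func (n *node) find(c *cmd) (index int, found bool) {
	i, j := 0, int(n.count)
	for i < j {
		h := int(uint(i+j) >> 1) // midpoint without overflow
		if cmp(n.cmds[h], c) < 0 {
			i = h + 1
		} else {
			j = h
		}
	}
	// i is now the first position whose cmd sorts >= c.
	if i < int(n.count) && cmp(n.cmds[i], c) == 0 {
		return i, true
	}
	return i, false
}
```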


pkg/util/interval/btree_based_interval.go, line 1101 at r7 (raw file):

}

// Clear removes all items from the btree.  If addNodesToFreelist is true,

s/Clear/ClearWithOpt/g


pkg/util/interval/btree_based_interval.go, line 1111 at r7 (raw file):

// somewhat faster than creating a new tree to replace the old one, because
// nodes from the old tree are reclaimed into the freelist for use by the new
// one, instead of being lost to the garbage collector.

Nit: reflow this paragraph.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cmdqTree branch from d499efa to a6ea84f on November 13, 2018 07:05
Member Author

@nvanbenschoten nvanbenschoten left a comment


I added one more commit that rips out the parent pointers and position fields in nodes. These were preventing the immutable structure from working correctly because they cause cyclical references. Luckily it's not too hard to avoid them - we can just use a stack in our iterator to remember our position in higher levels in the tree.
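
A minimal sketch of that iterator shape, with assumed names (`iterFrame`, `sBuf`) and a stack depth chosen to match the "no allocations up to height 4" claim in the commit message below:

```
type iterFrame struct {
	n   *node
	pos int16
}

type iterator struct {
	n    *node        // current node
	pos  int16        // current position within n
	s    []iterFrame  // stack of ancestor (node, position) frames
	sBuf [4]iterFrame // inline backing array for s
}

// descend saves the current (node, position) and steps into the child at pos.
func (i *iterator) descend() {
	if i.s == nil {
		i.s = i.sBuf[:0]
	}
	i.s = append(i.s, iterFrame{n: i.n, pos: i.pos})
	i.n = i.n.children[i.pos]
	i.pos = 0
}

// ascend restores the parent node and the position we descended from.
func (i *iterator) ascend() {
	f := i.s[len(i.s)-1]
	i.s = i.s[:len(i.s)-1]
	i.n = f.n
	i.pos = f.pos
}
```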

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/cmdq/interval_btree.go, line 223 at r9 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Interesting.

Done.


pkg/util/interval/btree_based_interval.go, line 1101 at r7 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/Clear/ClearWithOpt/g

Done.


pkg/util/interval/btree_based_interval.go, line 1111 at r7 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: reflow this paragraph.

Done.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 13, 2018
…policy

All commits from cockroachdb#32165 except the last one.

This change introduces O(1) btree cloning and a new copy-on-write scheme,
essentially giving the btree an immutable API (for which I took inspiration
from https://docs.rs/crate/im/). This is made efficient by the second part
of the change - a new garbage collection policy for btrees. Nodes are now
reference counted atomically and freed into global `sync.Pools` when they
are no longer referenced.

One of the main ideas in cockroachdb#31997 is to treat the btrees backing the command
queue as immutable structures. In doing so, we adopt a copy-on-write scheme.
Trees are cloned under lock and then accessed concurrently. When future
writers want to modify the tree, they can do so by cloning any nodes that
they touch. This commit provides this functionality in a much more elegant
manner than 6994347. Instead of giving each node a "copy-on-write context",
we instead give each node a reference count. We then use the following rule:
1. trees with exclusive ownership (refcount == 1) over a node can modify
   it in-place.
2. trees without exclusive ownership over a node must clone the node
   in order to modify it. Once cloned, the tree will now have exclusive
   ownership over that node. When cloning the node, the reference count
   of all of the node's children must be incremented.

In following these simple rules, we end up with a really nice property -
trees gain more and more "ownership" as they make modifications, meaning
that subsequent modifications are much less likely to need to clone nodes.
Essentially, we transparently incorporate the idea of local mutations
(e.g. Clojure's transients or Haskell's ST monad) without needing any
external API.

Even better, reference counting internal nodes ties directly into the
new GC policy, which allows us to recycle old nodes and make the copy-on-write
scheme zero-allocation in almost all cases. When a node's reference count
drops to 0, we simply toss it into a `sync.Pool`. We keep two separate
pools - one for leaf nodes and one for non-leaf nodes. This wasn't possible
with the previous "copy-on-write context" approach.
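
A minimal sketch of the mutation rule and recycling path described above, assuming sync/atomic, a `clone` helper that bumps each child's refcount, and two package-level sync.Pools named `leafNodePool` and `nodePool` (all names illustrative, not the actual cmdq code):

```
func mut(n **node) *node {
	if atomic.LoadInt32(&(*n).ref) == 1 {
		// Exclusive ownership: this tree holds the only reference, so it
		// may mutate the node in place.
		return *n
	}
	// Shared node: clone it (the clone starts with ref == 1 and increments
	// each child's refcount), release our reference to the original, and
	// point the tree at the clone.
	c := (*n).clone()
	(*n).decRef()
	*n = c
	return c
}

func (n *node) decRef() {
	if atomic.AddInt32(&n.ref, -1) > 0 {
		return // still referenced by another tree
	}
	// Last reference dropped: release the children, then recycle the node
	// into the pool for its kind so later allocations are nearly free.
	if n.leaf {
		leafNodePool.Put(n)
		return
	}
	for i := int16(0); i <= n.count; i++ {
		n.children[i].decRef()
	}
	nodePool.Put(n)
}
```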

The atomic reference counting does have an effect on benchmarks, but
it's not a big one (single/double digit ns) and is negligible compared to
the speedup observed in cockroachdb#32165.
```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```

We can see how important the GC and memory re-use policy is by comparing
the following few benchmarks. Specifically, notice the difference in
operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime`
between the tests that `Reset` old clones (allowing nodes to be freed into
`sync.Pool`s) and the tests that don't `Reset` old clones.
```
name                                                      time/op
BTreeDeleteInsert/count=16-4                               198ns ±28%
BTreeDeleteInsert/count=128-4                              375ns ± 3%
BTreeDeleteInsert/count=1024-4                             577ns ± 2%
BTreeDeleteInsert/count=8192-4                             798ns ± 1%
BTreeDeleteInsert/count=65536-4                           2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4                      173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4                     379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4                    584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4                    800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4                  2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4    1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4    1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4    1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4   3.52µs ±18%

name                                                      alloc/op
BTreeDeleteInsert/count=16-4                               0.00B
BTreeDeleteInsert/count=128-4                              0.00B
BTreeDeleteInsert/count=1024-4                             0.00B
BTreeDeleteInsert/count=8192-4                             0.00B
BTreeDeleteInsert/count=65536-4                            0.00B
BTreeDeleteInsertCloneOnce/count=16-4                      0.00B
BTreeDeleteInsertCloneOnce/count=128-4                     0.00B
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00B
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00B
BTreeDeleteInsertCloneOnce/count=65536-4                   1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00B

name                                                      allocs/op
BTreeDeleteInsert/count=16-4                                0.00
BTreeDeleteInsert/count=128-4                               0.00
BTreeDeleteInsert/count=1024-4                              0.00
BTreeDeleteInsert/count=8192-4                              0.00
BTreeDeleteInsert/count=65536-4                             0.00
BTreeDeleteInsertCloneOnce/count=16-4                       0.00
BTreeDeleteInsertCloneOnce/count=128-4                      0.00
BTreeDeleteInsertCloneOnce/count=1024-4                     0.00
BTreeDeleteInsertCloneOnce/count=8192-4                     0.00
BTreeDeleteInsertCloneOnce/count=65536-4                    0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4     3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4     3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4    4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4        0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4       0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4     0.00
```

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cmdqTree branch 3 times, most recently from 10774c3 to d554c0e on November 13, 2018 16:26
Collaborator

@petermattis petermattis left a comment


The removal of node.parent and node.pos looks good.

Reviewed 2 of 4 files at r15, 1 of 1 files at r16, 3 of 3 files at r18.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/cmdq/interval_btree.go, line 683 at r19 (raw file):

// asc ascends up to the current node's parent and resets the position to the
// one previously set for this parent node.
func (i *iterator) asc() {

Nit: let's spell out ascend and descend. My mind doesn't want to interpret asc and it interprets desc as short for descriptor.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cmdqTree branch from d554c0e to 4a02b8c on November 13, 2018 19:16
Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTR. Did you get/want a chance to look at the overlap-scan code?

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/cmdq/interval_btree.go, line 683 at r19 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: let's spell out ascend and descend. My mind doesn't want to interpret asc and it interprets desc as short for descriptor.

Done.

Collaborator

@petermattis petermattis left a comment


Did you get/want a chance to look at the overlap-scan code?

I haven't had a chance yet. Either I need to take a look or someone else does. Perhaps this is a good opportunity for @ajwerner to get his feet wet reviewing code.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)

@nvanbenschoten
Member Author

Good idea. @ajwerner do you mind adding a second review to this PR? Mainly just of the last 3 commits.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 14, 2018
@nvanbenschoten
Member Author

I added one more commit to prevent iterators from escaping to the heap upon creation. They were escaping for two reasons:

  1. they captured a reference to the tree itself.
  2. they pointed a slice into their own array.

The commit addresses both of these problems and prevents the iterator from escaping when used. The fixes were:

  1. copy the tree's root pointer reference instead of a reference to the tree itself.
  2. avoid creating the self-referential slice reference. This mistakenly escapes because of golang/go#7921 ("cmd/compile, bytes: bootstrap array causes bytes.Buffer to always be heap-allocated"), which also caused issues with bytes.Buffer (https://golang.org/cl/133715).

```
name             old time/op    new time/op    delta
BTreeMakeIter-4     131ns ±14%      25ns ± 1%   -81.23%  (p=0.000 n=9+9)

name             old alloc/op   new alloc/op   delta
BTreeMakeIter-4      144B ± 0%        0B       -100.00%  (p=0.000 n=10+10)

name             old allocs/op  new allocs/op  delta
BTreeMakeIter-4      1.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
```
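
A sketch of what the constructor fix amounts to, with assumed field names: MakeIter copies only the root pointer, and it deliberately does not point the iterator's stack slice at its own backing array, since that self-referential slice is what forced the result to the heap.

```
func (t *btree) MakeIter() iterator {
	// Copy only the root pointer; retaining *btree would let the iterator
	// capture the tree and escape. The ancestor-stack slice is also left nil
	// here: pointing it at the iterator's own inline array in this function
	// would make the returned value self-referential and heap-allocated
	// (golang/go#7921). It is wired up lazily, once the iterator already
	// lives on the caller's stack.
	return iterator{n: t.root, pos: -1}
}
```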

// MakeIter returns a new iterator object. Note that it is safe for an
// iterator to be copied by value.
// MakeIter returns a new iterator object. Note that it is safe for an iterator
// to be copied by value. However, it is not safe to make modification to the
Member

@tbg tbg left a comment


Is that really true? If you create an iterator, use it until it uses s, and then copy it by value, won't the copy and the original share the same s, which then gets mutated by each iterator?

Contributor

@ajwerner ajwerner left a comment


LGTM. Very cool use of interval trees in the btree. I scrutinized the overlap scan more closely than the CoW code and couldn't find any issues.

I'm lacking context on how the CoW changes to interval.btree relate to the cmdq.btree implementation. Is it just to ensure that the technique is valid?

Reviewed 2 of 4 files at r25, 1 of 1 files at r26, 1 of 3 files at r28, 2 of 2 files at r30.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/cmdq/interval_btree.go, line 74 at r30 (raw file):

}

type keyBound struct {

Totally a nit, but just for clarity: could you add a comment to indicate that this represents an upper bound, or alternatively a comment to define contains on keyBound, or maybe even just move the upperBound function up to just precede the type declaration?


pkg/storage/cmdq/interval_btree.go, line 593 at r30 (raw file):

Previously, tbg (Tobias Schottdorf) wrote…

Is that really true? If you create an iterator, use it until it uses s and then copy it by value, won't the copy and the original share its s which gets mutated by each iterator?

Seems like a good catch to me. Maybe add a copy method that copies the slice if it's in use?

These were all taken directly from https://github.com/google/btree

Release note: None
This commit forks https://github.com/petermattis/pebble/blob/master/internal/btree/btree.go
and rips out any references to pebble. There are a number of changes
we'll need to make to it:
1. Make the structure an augmented interval tree
2. Add synchronized node and leafNode freelists
3. Introduce immutability and a copy-on-write policy

NOTE: this commit is meant to be as clean of a fork as possible.
The next commit will clean the code up significantly.

Release note: None
This commit modifies the btree type added in the previous commit
and turns it into an augmented interval tree. The tree represents
intervals and permits an interval search operation following the
approach laid out in CLRS, Chapter 14. The B-Tree stores cmds in
order based on their start key and each B-Tree node maintains the
upper-bound end key of all cmds in its subtree. This is close to
what `util/interval.btree` does, although the new version doesn't
maintain the lower-bound start key of all cmds in each node.

The new interval btree is significantly faster than both the old
interval btree and the old interval llrb tree because it minimizes
key comparisons while scanning for overlaps. This includes avoiding
all key comparisons for cmds with start keys that are greater than
the search range's start key. See the comment on `overlapScan` for
an explanation of how this is possible.

The new interval btree is also faster because it has been specialized
for the `storage/cmdq` package. This allows it to avoid interfaces
and dynamic dispatch throughout its operations, which showed up
prominently on profiles of the other two implementations.

A third benefit of the rewrite is that it inherits the optimizations
made in pebble's btree. This includes inlining the btree items and
child pointers in nodes instead of using slices.

### Benchmarks:

```
Insert/count=16-4               76.1ns ± 4%
Insert/count=128-4               156ns ± 4%
Insert/count=1024-4              259ns ± 8%
Insert/count=8192-4              386ns ± 1%
Insert/count=65536-4             735ns ± 5%
Delete/count=16-4                129ns ±16%
Delete/count=128-4               189ns ±12%
Delete/count=1024-4              338ns ± 7%
Delete/count=8192-4              547ns ± 4%
Delete/count=65536-4            1.22µs ±12%
DeleteInsert/count=16-4          168ns ± 2%
DeleteInsert/count=128-4         375ns ± 8%
DeleteInsert/count=1024-4        562ns ± 1%
DeleteInsert/count=8192-4        786ns ± 3%
DeleteInsert/count=65536-4      2.31µs ±26%
IterSeekGE/count=16-4           87.2ns ± 3%
IterSeekGE/count=128-4           141ns ± 3%
IterSeekGE/count=1024-4          227ns ± 4%
IterSeekGE/count=8192-4          379ns ± 2%
IterSeekGE/count=65536-4         882ns ± 1%
IterSeekLT/count=16-4           89.5ns ± 3%
IterSeekLT/count=128-4           145ns ± 1%
IterSeekLT/count=1024-4          226ns ± 6%
IterSeekLT/count=8192-4          379ns ± 1%
IterSeekLT/count=65536-4         891ns ± 1%
IterFirstOverlap/count=16-4      184ns ± 1%
IterFirstOverlap/count=128-4     260ns ± 3%
IterFirstOverlap/count=1024-4    685ns ± 7%
IterFirstOverlap/count=8192-4   1.23µs ± 2%
IterFirstOverlap/count=65536-4  2.14µs ± 1%
IterNext-4                      3.82ns ± 2%
IterPrev-4                      14.8ns ± 2%
IterNextOverlap-4               8.57ns ± 2%
IterOverlapScan-4               25.8µs ± 3%
```

Compared to old llrb interval tree (currently in use):
```
Insert/count=16-4            323ns ± 7%    76ns ± 4%  -76.43%  (p=0.008 n=5+5)
Insert/count=128-4           539ns ± 2%   156ns ± 4%  -71.05%  (p=0.008 n=5+5)
Insert/count=1024-4          797ns ± 1%   259ns ± 8%  -67.52%  (p=0.008 n=5+5)
Insert/count=8192-4         1.30µs ± 5%  0.39µs ± 1%  -70.38%  (p=0.008 n=5+5)
Insert/count=65536-4        2.69µs ±11%  0.74µs ± 5%  -72.65%  (p=0.008 n=5+5)
Delete/count=16-4            438ns ± 7%   129ns ±16%  -70.44%  (p=0.008 n=5+5)
Delete/count=128-4           785ns ± 6%   189ns ±12%  -75.89%  (p=0.008 n=5+5)
Delete/count=1024-4         1.38µs ± 2%  0.34µs ± 7%  -75.44%  (p=0.008 n=5+5)
Delete/count=8192-4         2.36µs ± 2%  0.55µs ± 4%  -76.82%  (p=0.008 n=5+5)
Delete/count=65536-4        4.73µs ±13%  1.22µs ±12%  -74.19%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      920ns ± 2%   168ns ± 2%  -81.76%  (p=0.008 n=5+5)
DeleteInsert/count=128-4    1.73µs ± 4%  0.37µs ± 8%  -78.35%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   2.69µs ± 3%  0.56µs ± 1%  -79.15%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   4.55µs ±25%  0.79µs ± 3%  -82.70%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  7.53µs ± 6%  2.31µs ±26%  -69.32%  (p=0.008 n=5+5)
IterOverlapScan-4            285µs ± 7%    26µs ± 3%  -90.96%  (p=0.008 n=5+5)
```

Compared to old btree interval tree (added in a61191e, never enabled):
```
Insert/count=16-4            231ns ± 1%    76ns ± 4%  -66.99%  (p=0.008 n=5+5)
Insert/count=128-4           351ns ± 2%   156ns ± 4%  -55.53%  (p=0.008 n=5+5)
Insert/count=1024-4          515ns ± 5%   259ns ± 8%  -49.73%  (p=0.008 n=5+5)
Insert/count=8192-4          786ns ± 3%   386ns ± 1%  -50.85%  (p=0.008 n=5+5)
Insert/count=65536-4        1.50µs ± 3%  0.74µs ± 5%  -50.97%  (p=0.008 n=5+5)
Delete/count=16-4            363ns ±11%   129ns ±16%  -64.33%  (p=0.008 n=5+5)
Delete/count=128-4           466ns ± 9%   189ns ±12%  -59.42%  (p=0.008 n=5+5)
Delete/count=1024-4          806ns ± 6%   338ns ± 7%  -58.01%  (p=0.008 n=5+5)
Delete/count=8192-4         1.43µs ±13%  0.55µs ± 4%  -61.71%  (p=0.008 n=5+5)
Delete/count=65536-4        2.75µs ± 1%  1.22µs ±12%  -55.57%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      557ns ± 1%   168ns ± 2%  -69.87%  (p=0.008 n=5+5)
DeleteInsert/count=128-4     953ns ± 8%   375ns ± 8%  -60.71%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   1.19µs ± 4%  0.56µs ± 1%  -52.72%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   1.84µs ±17%  0.79µs ± 3%  -57.22%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  3.20µs ± 3%  2.31µs ±26%  -27.86%  (p=0.008 n=5+5)
IterOverlapScan-4           70.1µs ± 2%  25.8µs ± 3%  -63.23%  (p=0.008 n=5+5)
```

Release note: None
This commit rips out `node.parent` and `node.pos`. These won't work
with a copy-on-write strategy. This makes the iterator code a little
more complicated because the iterator now needs to maintain a stack
of nodes and positions.

We avoid any allocations for this stack for trees up to height 4 (~130k
items). Because of this, the change does not seem to have any effect on
the benchmarks. Theoretically, it should actually speed up mutations
to the tree.

Release note: None
The iterator returned by `tree.MakeIter` was escaping to the heap
for two reasons:
1. it was capturing a reference to the tree itself.
2. it was pointing a slice into its own array.

This change addresses both of these problems and prevents the iterator
from escaping when used. The fixes were:
1. copy the tree's root pointer reference instead of a reference
   to the tree itself.
2. avoid creating the self-referential slice reference. This mistakenly
   escapes because of golang/go#7921, which
   also caused issues with `bytes.Buffer` (https://golang.org/cl/133715).

This change also adds a new benchmark which demonstrates whether `MakeIter`
escapes or not:

```
name             old time/op    new time/op    delta
BTreeMakeIter-4     131ns ±14%      25ns ± 1%   -81.23%  (p=0.000 n=9+9)

name             old alloc/op   new alloc/op   delta
BTreeMakeIter-4      144B ± 0%        0B       -100.00%  (p=0.000 n=10+10)

name             old allocs/op  new allocs/op  delta
BTreeMakeIter-4      1.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
```

Release note: None
@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cmdqTree branch from a0955d2 to cc9c227 on November 15, 2018 20:49
Member Author

@nvanbenschoten nvanbenschoten left a comment


TFTRs!

bors r+

I'm lacking context on how the CoW changes to interval.btree relate to the cmdq.btree implementation. Is it just to ensure that the technique is valid?

The CoW stuff that we're going to use is actually all in #32251. The idea is that it will enable the access pattern described in #31997, which should significantly improve concurrency in command queue operations.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale)


pkg/storage/cmdq/interval_btree.go, line 74 at r30 (raw file):

Previously, ajwerner wrote…

Totally a nit, but just for clarity: could you add a comment to indicate that this represents an upper bound, or alternatively a comment to define contains on keyBound, or maybe even just move the upperBound function up to just precede the type declaration?

Good suggestion. Done.


pkg/storage/cmdq/interval_btree.go, line 593 at r30 (raw file):

Previously, ajwerner wrote…

Seems like a good catch to me. Maybe add a copy method that copies the slice if it's in use?

Ah you're totally right. Removed the comment. We don't actually rely on this anywhere.

@craig
Contributor

craig bot commented Nov 15, 2018

Build failed (retrying...)

@nvanbenschoten
Member Author

Failure was with #32368.

craig bot pushed a commit that referenced this pull request Nov 15, 2018
32165: storage/cmdq: create new specialized augmented interval btree r=nvanbenschoten a=nvanbenschoten

@craig
Contributor

craig bot commented Nov 15, 2018

Build succeeded

@craig craig bot merged commit cc9c227 into cockroachdb:master Nov 15, 2018
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/cmdqTree branch November 15, 2018 22:19
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 15, 2018
craig bot pushed a commit that referenced this pull request Nov 16, 2018
32251: storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy r=nvanbenschoten a=nvanbenschoten

All commits from #32165 except the last one.

This change introduces O(1) btree cloning and a new copy-on-write scheme, essentially giving the btree an immutable API (for which I took inspiration from https://docs.rs/crate/im/). This is made efficient by the second part of the change - a new garbage collection policy for btrees. Nodes are now reference counted atomically and freed into global `sync.Pools` when they are no longer referenced.

One of the main ideas in #31997 is to treat the btrees backing the command queue as immutable structures. In doing so, we adopt a copy-on-write scheme. Trees are cloned under lock and then accessed concurrently. When future writers want to modify the tree, they can do so by cloning any nodes that they touch. This commit provides this functionality in a much more elegant manner than 6994347. Instead of giving each node a "copy-on-write context", we instead give each node a reference count. We then use the following rule:
1. trees with exclusive ownership (refcount == 1) over a node can modify it in-place.
2. trees without exclusive ownership over a node must clone the node in order to modify it. Once cloned, the tree will now have exclusive ownership over that node. When cloning the node, the reference count of all of the node's children must be incremented.
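
To make the rules concrete, the sketch below shows one plausible shape of the ownership check, using a hypothetical `node` type with slices. The real cmdq btree inlines its items and children in fixed-size arrays and differs in naming and details, so treat this purely as an illustration of the refcount rules.
```go
package cmdqsketch

import "sync/atomic"

// item is a hypothetical stand-in for the command queue's *cmd.
type item struct{ key string }

// node carries an atomic reference count; a tree may mutate a node in
// place only while it holds the sole reference to it.
type node struct {
	ref      int32
	items    []item
	children []*node // nil for leaf nodes
}

// mut returns a node that the caller is allowed to modify. With exclusive
// ownership (ref == 1) the node is returned as-is (rule 1); otherwise the
// node is cloned and every child's reference count is bumped (rule 2).
func mut(np **node) *node {
	n := *np
	if atomic.LoadInt32(&n.ref) == 1 {
		return n // exclusive ownership: modify in place
	}
	c := &node{
		ref:      1,
		items:    append([]item(nil), n.items...),
		children: append([]*node(nil), n.children...),
	}
	for _, child := range c.children {
		atomic.AddInt32(&child.ref, 1)
	}
	atomic.AddInt32(&n.ref, -1) // drop our reference to the shared original
	*np = c
	return c
}
```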

By following these simple rules, we end up with a really nice property - trees gain more and more "ownership" as they make modifications, meaning that subsequent modifications are much less likely to need to clone nodes. Essentially, we transparently incorporate the idea of local mutations (e.g. Clojure's transients or Haskell's ST monad) without the need for any external API.

Even better, reference counting internal nodes ties directly into the new GC policy, which allows us to recycle old nodes and make the copy-on-write scheme zero-allocation in almost all cases. When a node's reference count drops to 0, we simply toss it into a `sync.Pool`. We keep two separate pools - one for leaf nodes and one for non-leaf nodes. This wasn't possible with the previous "copy-on-write context" approach.
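
The recycling side can be sketched the same way, extending the hypothetical types above (the real implementation differs; `sync` would also need to be imported): when the last reference to a node is dropped, the node is cleared and returned to one of two pools, one for leaf nodes and one for interior nodes.
```go
// Two global pools, one per node kind, so leaf nodes (smaller in the real
// tree) and interior nodes are recycled separately.
// (Assumes "sync" is added to the imports of the sketch above.)
var (
	leafPool = sync.Pool{New: func() interface{} { return &node{} }}
	nodePool = sync.Pool{New: func() interface{} { return &node{} }}
)

// decRef drops one reference to n. The last owner optionally releases the
// children as well and then puts the node back into the appropriate pool.
func (n *node) decRef(recursive bool) {
	if atomic.AddInt32(&n.ref, -1) > 0 {
		return // other trees still reference this node
	}
	leaf := n.children == nil
	if recursive {
		for _, child := range n.children {
			child.decRef(true)
		}
	}
	// Clear the node for reuse, keeping slice capacity where possible.
	n.items = n.items[:0]
	n.children = n.children[:0]
	if leaf {
		leafPool.Put(n)
	} else {
		nodePool.Put(n)
	}
}
```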

The atomic reference counting does have an effect on benchmarks, but it's not a big one (single- to double-digit ns) and is negligible compared to the speedup observed in #32165.
```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```

We can see how important the GC and memory re-use policy is by comparing the following few benchmarks. Specifically, notice the difference in operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime` between the tests that `Reset` old clones (allowing nodes to be freed into `sync.Pool`s) and the tests that don't `Reset` old clones.
```
name                                                      time/op
BTreeDeleteInsert/count=16-4                               198ns ±28%
BTreeDeleteInsert/count=128-4                              375ns ± 3%
BTreeDeleteInsert/count=1024-4                             577ns ± 2%
BTreeDeleteInsert/count=8192-4                             798ns ± 1%
BTreeDeleteInsert/count=65536-4                           2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4                      173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4                     379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4                    584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4                    800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4                  2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4    1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4    1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4    1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4   3.52µs ±18%

name                                                      alloc/op
BTreeDeleteInsert/count=16-4                               0.00B
BTreeDeleteInsert/count=128-4                              0.00B
BTreeDeleteInsert/count=1024-4                             0.00B
BTreeDeleteInsert/count=8192-4                             0.00B
BTreeDeleteInsert/count=65536-4                            0.00B
BTreeDeleteInsertCloneOnce/count=16-4                      0.00B
BTreeDeleteInsertCloneOnce/count=128-4                     0.00B
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00B
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00B
BTreeDeleteInsertCloneOnce/count=65536-4                   1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00B

name                                                      allocs/op
BTreeDeleteInsert/count=16-4                                0.00
BTreeDeleteInsert/count=128-4                               0.00
BTreeDeleteInsert/count=1024-4                              0.00
BTreeDeleteInsert/count=8192-4                              0.00
BTreeDeleteInsert/count=65536-4                             0.00
BTreeDeleteInsertCloneOnce/count=16-4                       0.00
BTreeDeleteInsertCloneOnce/count=128-4                      0.00
BTreeDeleteInsertCloneOnce/count=1024-4                     0.00
BTreeDeleteInsertCloneOnce/count=8192-4                     0.00
BTreeDeleteInsertCloneOnce/count=65536-4                    0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4     3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4     3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4    4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4        0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4       0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4     0.00
```
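
To make the reset=true/reset=false distinction concrete, here is a hypothetical sketch of the clone-each-time loop, continuing the toy types from the sketches above. The `cmdTree` interface and its method names are assumptions inferred from the description, not the actual cmdq API, and `testing` would need to be imported.
```go
// cmdTree captures only the operations the benchmark shape needs.
type cmdTree interface {
	Clone() cmdTree
	Set(it item)
	Delete(it item)
	Reset()
}

// With reset=true the previous tree is Reset after each clone, returning its
// exclusively owned nodes to the pools; with reset=false those nodes can only
// be reclaimed by the Go GC, which shows up as the extra allocations above.
func benchmarkDeleteInsertCloneEachTime(b *testing.B, tr cmdTree, items []item, reset bool) {
	for i := 0; i < b.N; i++ {
		it := items[i%len(items)]
		cl := tr.Clone() // O(1): bumps the root's reference count
		cl.Delete(it)
		cl.Set(it)
		if reset {
			tr.Reset() // release the old tree's nodes back into the pools
		}
		tr = cl
	}
}
```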


Co-authored-by: Nathan VanBenschoten <[email protected]>