
storage/spanlatch: create spanlatch.Manager using immutable btrees #31997

Merged
merged 7 commits on Nov 29, 2018

Conversation

nvanbenschoten
Member

@nvanbenschoten nvanbenschoten commented Oct 30, 2018

Informs #4768.
Informs #31904.

This change was inspired by #31904 and is a progression of the thinking started in #4768 (comment).

The change introduces spanlatch.Manager, which will replace the CommandQueue in a future PR. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in the storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes #31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because Replica.beginCmds repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges.
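The flow, sketched as a toy program (a small persistent tree stands in for the immutable btree here; all names are illustrative, not the actual spanlatch API):

```go
package main

import (
	"fmt"
	"sync"
)

// node is an immutable tree node: once published, it is never mutated.
type node struct {
	span        string
	left, right *node
}

// manager guards a pointer to the current immutable tree. Taking a
// snapshot is O(1): just copy the root pointer under the mutex.
type manager struct {
	mu   sync.Mutex
	root *node
}

// acquire inserts a latch for span and returns a snapshot of the tree as
// it was before the insertion. The caller then waits on overlapping
// latches by walking the snapshot entirely outside the lock.
func (m *manager) acquire(span string) *node {
	m.mu.Lock()
	snap := m.root                // O(1) snapshot
	m.root = insert(m.root, span) // path-copying insert
	m.mu.Unlock()
	return snap
}

// insert copies only the nodes on the path to the insertion point,
// leaving the old tree (and every snapshot of it) untouched.
func insert(n *node, span string) *node {
	if n == nil {
		return &node{span: span}
	}
	cp := *n // copy-on-write: never mutate a published node
	if span < n.span {
		cp.left = insert(n.left, span)
	} else {
		cp.right = insert(n.right, span)
	}
	return &cp
}

func main() {
	var m manager
	fmt.Println(m.acquire("a") == nil) // true: first command has no predecessors
	fmt.Println(m.acquire("b") == nil) // false: the snapshot contains "a"
}
```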

The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock.

While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the spanlatch.Guard. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting immutable btree nodes, which allows us to release them back into the btree node pools when they're no longer needed. This means that we don't expect any allocations when inserting into the internal trees, even with the copy-on-write policy.
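A minimal self-contained sketch of that recycling scheme (toy types and sizes, not the real btree nodes):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// node carries an atomic reference count; when the count drops to zero
// the node is recycled into a sync.Pool instead of becoming garbage.
type node struct {
	ref  int32
	keys [16]int64 // inlined storage; reused wholesale on recycle
}

var nodePool = sync.Pool{New: func() interface{} { return new(node) }}

func newNode() *node {
	n := nodePool.Get().(*node)
	n.ref = 1
	return n
}

func (n *node) incRef() { atomic.AddInt32(&n.ref, 1) }

// decRef returns the node to the pool once no tree or snapshot holds a
// reference to it anymore, making copy-on-write nearly allocation-free.
func (n *node) decRef() {
	if atomic.AddInt32(&n.ref, -1) == 0 {
		*n = node{} // clear before pooling
		nodePool.Put(n)
	}
}

func main() {
	n := newNode()
	n.incRef() // a tree snapshot shares the node
	n.decRef() // the snapshot is released
	n.decRef() // last reference dropped: node goes back to the pool
	fmt.Println("node recycled for reuse")
}
```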

Finally, this will make the approach taken in #31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free.

Naming changes

| Before                     | After                             |
|----------------------------|-----------------------------------|
| `CommandQueue`             | `spanlatch.Manager`               |
| "enter the command queue"  | "acquire span latches"            |
| "exit the command queue"   | "release span latches"            |
| "wait for prereq commands" | "wait for latches to be released" |

The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is to make sure we don't confuse the operational locking performed by the CommandQueue/spanlatch.Manager with the transaction-scoped locking enforced by intents and our transactional concurrency control model.

Microbenchmarks

NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure.

```
name                              old time/op    new time/op    delta
ReadOnlyMix/size=1-4                 706ns ±20%     404ns ±10%  -42.81%  (p=0.008 n=5+5)
ReadOnlyMix/size=4-4                 649ns ±23%     382ns ± 5%  -41.13%  (p=0.008 n=5+5)
ReadOnlyMix/size=16-4                611ns ±16%     367ns ± 5%  -39.83%  (p=0.008 n=5+5)
ReadOnlyMix/size=64-4                692ns ±14%     370ns ± 1%  -46.49%  (p=0.016 n=5+4)
ReadOnlyMix/size=128-4               637ns ±22%     398ns ±14%  -37.48%  (p=0.008 n=5+5)
ReadOnlyMix/size=256-4               676ns ±15%     385ns ± 4%  -43.01%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4      12.2µs ± 4%     0.6µs ±17%  -94.85%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4      7.88µs ± 2%    0.55µs ± 7%  -92.99%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4      4.19µs ± 3%    0.58µs ± 5%  -86.26%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4     2.09µs ± 6%    0.54µs ±13%  -74.13%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4      875ns ±17%     423ns ±29%  -51.64%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4     655ns ± 6%     362ns ±16%  -44.71%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=256-4     549ns ±16%     314ns ±13%  -42.73%  (p=0.008 n=5+5)

name                              old alloc/op   new alloc/op   delta
ReadOnlyMix/size=1-4                  223B ± 0%      160B ± 0%  -28.25%  (p=0.079 n=4+5)
ReadOnlyMix/size=4-4                  223B ± 0%      160B ± 0%  -28.25%  (p=0.008 n=5+5)
ReadOnlyMix/size=16-4                 223B ± 0%      160B ± 0%  -28.25%  (p=0.008 n=5+5)
ReadOnlyMix/size=64-4                 223B ± 0%      160B ± 0%  -28.25%  (p=0.008 n=5+5)
ReadOnlyMix/size=128-4                217B ± 4%      160B ± 0%  -26.27%  (p=0.008 n=5+5)
ReadOnlyMix/size=256-4                223B ± 0%      160B ± 0%  -28.25%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=0-4      1.25kB ± 0%    0.16kB ± 0%  -87.15%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4      1.00kB ± 0%    0.16kB ± 0%  -84.00%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=4-4        708B ± 0%      160B ± 0%  -77.40%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=16-4       513B ± 0%      160B ± 0%  -68.81%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4       264B ± 0%      160B ± 0%  -39.39%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4      221B ± 0%      160B ± 0%  -27.60%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=256-4      198B ± 0%      160B ± 0%  -19.35%  (p=0.008 n=5+5)

name                              old allocs/op  new allocs/op  delta
ReadOnlyMix/size=1-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=4-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=16-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=64-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=128-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=256-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=0-4        38.0 ± 0%       1.0 ± 0%  -97.37%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4        24.0 ± 0%       1.0 ± 0%  -95.83%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4        12.0 ± 0%       1.0 ± 0%  -91.67%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4       5.00 ± 0%      1.00 ± 0%  -80.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=256-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
```

There are a few interesting things to point out about these benchmark results:

  • The ReadOnlyMix results demonstrate a fixed improvement, regardless of size. This is due to the replacement of the hash-map with a linked-list for the readSet structure.
  • The ReadWriteMix is more interesting. We see that the spanlatch implementation is faster across the board. This is especially true with a high write/read ratio.
  • We see that the allocated memory stays constant regardless of the write/read ratio in the spanlatch implementation. This is due to the memory recycling that it performs on btree nodes. This is not the case for the CommandQueue implementation.

Release note: None

@nvanbenschoten nvanbenschoten requested a review from a team October 30, 2018 05:58
@cockroach-teamcity
Member

This change is Reviewable

@nvanbenschoten
Member Author

There's still plenty of room to optimize the btree implementation that we use here. Until this point we've closely followed github.com/google/btree, but there are some good ideas that we could pull from pebble's btree. For instance, we could inline the item and node pointer arrays in nodes to avoid unnecessary indirection. We could also specialize the implementation to this use case to avoid interfaces. Short of that, we could at least cache the interface's Range() value to avoid repeatedly calling the method when searching.
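To illustrate the inlining idea with assumed sizes (the real pebble-derived layout differs in detail):

```go
package main

import (
	"fmt"
	"unsafe"
)

const maxItems = 16 // illustrative degree

type latch struct{ start, end string }

// Indirect layout (google/btree style): items and children are slices,
// so every node drags along two extra allocations and pointer hops.
type nodeIndirect struct {
	items    []latch
	children []*nodeIndirect
}

// Inlined layout (pebble style): storage lives directly in the node.
// One allocation per node, and scans stay within the node's own memory.
type nodeInline struct {
	count    int16
	items    [maxItems]latch
	children [maxItems + 1]*nodeInline
}

func main() {
	fmt.Println(unsafe.Sizeof(nodeIndirect{})) // two slice headers: 48 bytes on 64-bit
	fmt.Println(unsafe.Sizeof(nodeInline{}))   // much larger, but self-contained
}
```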

craig bot pushed a commit that referenced this pull request Nov 12, 2018
32164: storage/cmdq: create new signal type for cmd completion signaling r=nvanbenschoten a=nvanbenschoten

`signal` is a type that can signal the completion of an operation.
This is a component of the larger change in #31997.

The type has three benefits over using a channel directly and
closing the channel when the operation completes:
1. signaled() uses atomics to provide a fast-path for checking
   whether the operation has completed. It is ~75x faster than
   using a channel for this purpose.
2. the type's channel is lazily initialized when signalChan()
   is called, avoiding the allocation when one is not needed.
3. because of 2, the type's zero value can be used directly.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
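A simplified sketch of such a signal type, assuming a mutex-guarded lazy channel plus an atomic word for the fast path (the actual #32164 implementation may differ):

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// signal's zero value is ready to use. signaled() is a cheap atomic
// load, and the channel is only allocated if a caller blocks on it.
type signal struct {
	sig int32 // accessed atomically; 0 = unset, 1 = signaled

	mu sync.Mutex
	ch chan struct{} // lazily allocated under mu
}

func (s *signal) signaled() bool {
	return atomic.LoadInt32(&s.sig) == 1 // fast path: no channel involved
}

func (s *signal) signal() {
	s.mu.Lock()
	defer s.mu.Unlock()
	atomic.StoreInt32(&s.sig, 1)
	if s.ch != nil {
		close(s.ch)
	}
}

// signalChan allocates the channel on first use, so callers that only
// ever poll signaled() never pay for it.
func (s *signal) signalChan() <-chan struct{} {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.ch == nil {
		s.ch = make(chan struct{})
		if s.sig == 1 {
			close(s.ch) // signaled before anyone started listening
		}
	}
	return s.ch
}

func main() {
	var s signal
	go s.signal()
	<-s.signalChan()
	fmt.Println(s.signaled()) // true
}
```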
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 13, 2018
…policy

All commits from cockroachdb#32165 except the last one.

This change introduces O(1) btree cloning and a new copy-on-write scheme,
essentially giving the btree an immutable API (for which I took inspiration
from https://docs.rs/crate/im/). This is made efficient by the second part
of the change - a new garbage collection policy for btrees. Nodes are now
reference counted atomically and freed into global `sync.Pools` when they
are no longer referenced.

One of the main ideas in cockroachdb#31997 is to treat the btrees backing the command
queue as immutable structures. In doing so, we adopt a copy-on-write scheme.
Trees are cloned under lock and then accessed concurrently. When future
writers want to modify the tree, they can do so by cloning any nodes that
they touch. This commit provides this functionality in a much more elegant
manner than 6994347. Instead of giving each node a "copy-on-write context",
we instead give each node a reference count. We then use the following rules:
1. trees with exclusive ownership (refcount == 1) over a node can modify
   it in-place.
2. trees without exclusive ownership over a node must clone the node
   in order to modify it. Once cloned, the tree will now have exclusive
   ownership over that node. When cloning the node, the reference count
   of all of the node's children must be incremented.
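A toy rendering of these two rules (illustrative only; the real btree
additionally recycles nodes whose refcount reaches zero):

```go
package main

import (
	"fmt"
	"sync/atomic"
)

type node struct {
	ref      int32
	val      int
	children []*node
}

// mutableFor applies the two rules above: exclusively owned nodes are
// written in place, shared nodes are cloned first.
func mutableFor(n *node) *node {
	if atomic.LoadInt32(&n.ref) == 1 {
		return n // rule 1: exclusive owner may mutate in place
	}
	// rule 2: shared, so clone, taking a reference on every child.
	c := &node{ref: 1, val: n.val, children: append([]*node(nil), n.children...)}
	for _, child := range c.children {
		atomic.AddInt32(&child.ref, 1)
	}
	atomic.AddInt32(&n.ref, -1) // release our reference to the original
	return c
}

func main() {
	shared := &node{ref: 2, val: 1} // held by two tree clones
	mine := mutableFor(shared)
	mine.val = 2
	fmt.Println(shared.val, mine.val) // 1 2: the other tree is untouched
}
```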

In following these simple rules, we end up with a really nice property -
trees gain more and more "ownership" as they make modifications, meaning
that subsequent modifications are much less likely to need to clone nodes.
Essentially, we transparently incorporate the idea of local mutations
(e.g. Clojure's transients or Haskell's ST monad) without any external
API needed.

Even better, reference counting internal nodes ties directly into the
new GC policy, which allows us to recycle old nodes and make the copy-on-write
scheme zero-allocation in almost all cases. When a node's reference count
drops to 0, we simply toss it into a `sync.Pool`. We keep two separate
pools - one for leaf nodes and one for non-leaf nodes. This wasn't possible
with the previous "copy-on-write context" approach.

The atomic reference counting does have an effect on benchmarks, but
it's not a big one (single/double digit ns) and is negligible compared to
the speedup observed in cockroachdb#32165.
```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```

We can see how important the GC and memory re-use policy is by comparing
the following few benchmarks. Specifically, notice the difference in
operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime`
between the tests that `Reset` old clones (allowing nodes to be freed into
`sync.Pool`s) and the tests that don't `Reset` old clones.
```
name                                                      time/op
BTreeDeleteInsert/count=16-4                               198ns ±28%
BTreeDeleteInsert/count=128-4                              375ns ± 3%
BTreeDeleteInsert/count=1024-4                             577ns ± 2%
BTreeDeleteInsert/count=8192-4                             798ns ± 1%
BTreeDeleteInsert/count=65536-4                           2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4                      173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4                     379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4                    584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4                    800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4                  2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4    1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4    1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4    1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4   3.52µs ±18%

name                                                      alloc/op
BTreeDeleteInsert/count=16-4                               0.00B
BTreeDeleteInsert/count=128-4                              0.00B
BTreeDeleteInsert/count=1024-4                             0.00B
BTreeDeleteInsert/count=8192-4                             0.00B
BTreeDeleteInsert/count=65536-4                            0.00B
BTreeDeleteInsertCloneOnce/count=16-4                      0.00B
BTreeDeleteInsertCloneOnce/count=128-4                     0.00B
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00B
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00B
BTreeDeleteInsertCloneOnce/count=65536-4                   1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00B

name                                                      allocs/op
BTreeDeleteInsert/count=16-4                                0.00
BTreeDeleteInsert/count=128-4                               0.00
BTreeDeleteInsert/count=1024-4                              0.00
BTreeDeleteInsert/count=8192-4                              0.00
BTreeDeleteInsert/count=65536-4                             0.00
BTreeDeleteInsertCloneOnce/count=16-4                       0.00
BTreeDeleteInsertCloneOnce/count=128-4                      0.00
BTreeDeleteInsertCloneOnce/count=1024-4                     0.00
BTreeDeleteInsertCloneOnce/count=8192-4                     0.00
BTreeDeleteInsertCloneOnce/count=65536-4                    0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4     3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4     3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4    4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4        0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4       0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4     0.00
```

Release note: None
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 14, 2018
…policy

craig bot pushed a commit that referenced this pull request Nov 15, 2018
32165: storage/cmdq: create new specialized augmented interval btree r=nvanbenschoten a=nvanbenschoten

This is a component of the larger change in #31997.

The first few commits here modify the existing interval btree implementation,
allowing us to properly benchmark against it.

The second to last commit forks https://github.com/petermattis/pebble/blob/master/internal/btree/btree.go, specializes
it to the command queue, and rips out any references to pebble. There are a number
of changes we'll need to make to it:
1. Add synchronized node and leafNode freelists
2. Add Clear method to release owned nodes into freelists
3. Introduce immutability and a copy-on-write policy

The next commit modifies the btree type added in the previous commit
and turns it into an augmented interval tree. The tree represents
intervals and permits an interval search operation following the
approach laid out in CLRS, Chapter 14. The B-Tree stores cmds in
order based on their start key and each B-Tree node maintains the
upper-bound end key of all cmds in its subtree. This is close to
what `util/interval.btree` does, although the new version doesn't
maintain the lower-bound start key of all cmds in each node.

The new interval btree is significantly faster than both the old
interval btree and the old interval llrb tree because it minimizes
key comparisons while scanning for overlaps. This includes avoiding
all key comparisons for cmds with start keys that are greater than
the search range's start key. See the comment on `overlapScan` for
an explanation of how this is possible.
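As a rough illustration of the augmentation, here is a toy binary version
with half-open integer intervals (the actual structure is a B-Tree with a
more refined overlapScan):

```go
package main

import "fmt"

// Toy augmented interval tree in the CLRS style described above: nodes
// are ordered by start key and each node caches the maximum end key in
// its subtree, letting overlap scans prune whole subtrees.
type interval struct{ start, end int }

type node struct {
	iv          interval
	maxEnd      int
	left, right *node
}

func insert(n *node, iv interval) *node {
	if n == nil {
		return &node{iv: iv, maxEnd: iv.end}
	}
	if iv.start < n.iv.start {
		n.left = insert(n.left, iv)
	} else {
		n.right = insert(n.right, iv)
	}
	if iv.end > n.maxEnd {
		n.maxEnd = iv.end
	}
	return n
}

// overlaps collects intervals overlapping [q.start, q.end). The maxEnd
// augmentation is what lets us skip subtrees entirely.
func overlaps(n *node, q interval, out *[]interval) {
	if n == nil || n.maxEnd <= q.start {
		return // nothing in this subtree ends after the query starts
	}
	overlaps(n.left, q, out)
	if n.iv.start < q.end && q.start < n.iv.end {
		*out = append(*out, n.iv)
	}
	if n.iv.start < q.end { // right subtree can still overlap
		overlaps(n.right, q, out)
	}
}

func main() {
	var root *node
	for _, iv := range []interval{{1, 3}, {2, 6}, {8, 9}} {
		root = insert(root, iv)
	}
	var out []interval
	overlaps(root, interval{4, 8}, &out)
	fmt.Println(out) // [{2 6}]
}
```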

The new interval btree is also faster because it has been specialized
for the `storage/cmdq` package. This allows it to avoid interfaces
and dynamic dispatch throughout its operations, which showed up
prominently on profiles of the other two implementations.

A third benefit of the rewrite is that it inherits the optimizations
made in pebble's btree. This includes inlining the btree items and
child pointers in nodes instead of using slices.

### Benchmarks:

_The new interval btree:_
```
Insert/count=16-4               76.1ns ± 4%
Insert/count=128-4               156ns ± 4%
Insert/count=1024-4              259ns ± 8%
Insert/count=8192-4              386ns ± 1%
Insert/count=65536-4             735ns ± 5%
Delete/count=16-4                129ns ±16%
Delete/count=128-4               189ns ±12%
Delete/count=1024-4              338ns ± 7%
Delete/count=8192-4              547ns ± 4%
Delete/count=65536-4            1.22µs ±12%
DeleteInsert/count=16-4          168ns ± 2%
DeleteInsert/count=128-4         375ns ± 8%
DeleteInsert/count=1024-4        562ns ± 1%
DeleteInsert/count=8192-4        786ns ± 3%
DeleteInsert/count=65536-4      2.31µs ±26%
IterSeekGE/count=16-4           87.2ns ± 3%
IterSeekGE/count=128-4           141ns ± 3%
IterSeekGE/count=1024-4          227ns ± 4%
IterSeekGE/count=8192-4          379ns ± 2%
IterSeekGE/count=65536-4         882ns ± 1%
IterSeekLT/count=16-4           89.5ns ± 3%
IterSeekLT/count=128-4           145ns ± 1%
IterSeekLT/count=1024-4          226ns ± 6%
IterSeekLT/count=8192-4          379ns ± 1%
IterSeekLT/count=65536-4         891ns ± 1%
IterFirstOverlap/count=16-4      184ns ± 1%
IterFirstOverlap/count=128-4     260ns ± 3%
IterFirstOverlap/count=1024-4    685ns ± 7%
IterFirstOverlap/count=8192-4   1.23µs ± 2%
IterFirstOverlap/count=65536-4  2.14µs ± 1%
IterNext-4                      3.82ns ± 2%
IterPrev-4                      14.8ns ± 2%
IterNextOverlap-4               8.57ns ± 2%
IterOverlapScan-4               25.8µs ± 3%
```

_Compared to old llrb interval tree (currently in use):_
```
Insert/count=16-4            323ns ± 7%    76ns ± 4%  -76.43%  (p=0.008 n=5+5)
Insert/count=128-4           539ns ± 2%   156ns ± 4%  -71.05%  (p=0.008 n=5+5)
Insert/count=1024-4          797ns ± 1%   259ns ± 8%  -67.52%  (p=0.008 n=5+5)
Insert/count=8192-4         1.30µs ± 5%  0.39µs ± 1%  -70.38%  (p=0.008 n=5+5)
Insert/count=65536-4        2.69µs ±11%  0.74µs ± 5%  -72.65%  (p=0.008 n=5+5)
Delete/count=16-4            438ns ± 7%   129ns ±16%  -70.44%  (p=0.008 n=5+5)
Delete/count=128-4           785ns ± 6%   189ns ±12%  -75.89%  (p=0.008 n=5+5)
Delete/count=1024-4         1.38µs ± 2%  0.34µs ± 7%  -75.44%  (p=0.008 n=5+5)
Delete/count=8192-4         2.36µs ± 2%  0.55µs ± 4%  -76.82%  (p=0.008 n=5+5)
Delete/count=65536-4        4.73µs ±13%  1.22µs ±12%  -74.19%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      920ns ± 2%   168ns ± 2%  -81.76%  (p=0.008 n=5+5)
DeleteInsert/count=128-4    1.73µs ± 4%  0.37µs ± 8%  -78.35%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   2.69µs ± 3%  0.56µs ± 1%  -79.15%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   4.55µs ±25%  0.79µs ± 3%  -82.70%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  7.53µs ± 6%  2.31µs ±26%  -69.32%  (p=0.008 n=5+5)
IterOverlapScan-4            285µs ± 7%    26µs ± 3%  -90.96%  (p=0.008 n=5+5)
```

_Compared to old btree interval tree (added in a61191e, never enabled):_
```
Insert/count=16-4            231ns ± 1%    76ns ± 4%  -66.99%  (p=0.008 n=5+5)
Insert/count=128-4           351ns ± 2%   156ns ± 4%  -55.53%  (p=0.008 n=5+5)
Insert/count=1024-4          515ns ± 5%   259ns ± 8%  -49.73%  (p=0.008 n=5+5)
Insert/count=8192-4          786ns ± 3%   386ns ± 1%  -50.85%  (p=0.008 n=5+5)
Insert/count=65536-4        1.50µs ± 3%  0.74µs ± 5%  -50.97%  (p=0.008 n=5+5)
Delete/count=16-4            363ns ±11%   129ns ±16%  -64.33%  (p=0.008 n=5+5)
Delete/count=128-4           466ns ± 9%   189ns ±12%  -59.42%  (p=0.008 n=5+5)
Delete/count=1024-4          806ns ± 6%   338ns ± 7%  -58.01%  (p=0.008 n=5+5)
Delete/count=8192-4         1.43µs ±13%  0.55µs ± 4%  -61.71%  (p=0.008 n=5+5)
Delete/count=65536-4        2.75µs ± 1%  1.22µs ±12%  -55.57%  (p=0.008 n=5+5)
DeleteInsert/count=16-4      557ns ± 1%   168ns ± 2%  -69.87%  (p=0.008 n=5+5)
DeleteInsert/count=128-4     953ns ± 8%   375ns ± 8%  -60.71%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4   1.19µs ± 4%  0.56µs ± 1%  -52.72%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4   1.84µs ±17%  0.79µs ± 3%  -57.22%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4  3.20µs ± 3%  2.31µs ±26%  -27.86%  (p=0.008 n=5+5)
IterOverlapScan-4           70.1µs ± 2%  25.8µs ± 3%  -63.23%  (p=0.008 n=5+5)
```

Co-authored-by: Nathan VanBenschoten <[email protected]>
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Nov 15, 2018
…policy

craig bot pushed a commit that referenced this pull request Nov 16, 2018
32251: storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy r=nvanbenschoten a=nvanbenschoten

All commits from #32165 except the last one.

Co-authored-by: Nathan VanBenschoten <[email protected]>
@nvanbenschoten nvanbenschoten changed the title [WIP] storage/cmdq: rewrite CommandQueue using a copy-on-write btree strategy storage/spanlatch: create spanlatch.Manager using immutable btrees Nov 22, 2018
@nvanbenschoten
Member Author

I've updated this PR to use the interval btree type and the signal type added in #32165, #32251, and #32164 to create a new spanlatch.Manager. The commits here don't replace the CommandQueue yet, but that's the immediate follow-up to this PR.

@nvanbenschoten
Member Author

nvanbenschoten commented Nov 22, 2018

A few things to note about the testing here:

  1. I copied the CommandQueue benchmarks so that we could perform apples-to-apples microbenchmark comparisons. The results are posted above.
  2. The ReadOnlyMix benchmarks aren't particularly interesting because read-only access without any intermingled writes doesn't hit the interval trees in either implementation. The ReadWriteMix variants are much more telling.
  3. The tests in manager_test.go are direct adaptations of the ones in command_queue_test.go.

@nvanbenschoten nvanbenschoten force-pushed the nvanbenschoten/cmdq2 branch 4 times, most recently from d96bd89 to 148cb30 on November 22, 2018 06:08
Contributor

@ajwerner ajwerner left a comment

I definitely need another pass. This is just the nits I've spotted in the first skim

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/spanlatch/doc.go, line 35 at r5 (raw file):

      key ranges was permitted. Conceptually, the structure became an interval
      tree of sync.RWMutexes.
    * The structure become timestamp-aware and concurrent access of non-causal

s/become/became/


pkg/storage/spanlatch/manager.go, line 77 at r5 (raw file):

}

// latches are stored in the Manager's btrees. The represent the latching of a

s/The/They/


pkg/storage/spanlatch/manager.go, line 137 at r5 (raw file):

	}

	// Guard would be an ideal candidate for object pooling, but without

Nit: move the guard and latch construction to a helper


pkg/storage/spanlatch/manager.go, line 280 at r5 (raw file):

	realloc := len(sm.rSet) > 16
	for latch := range sm.rSet {
		latch.setInRSet(false)

if the TODO is about exploiting the efficient map clearing idiom, I think it needs to be the only statement in the loop https://go-review.googlesource.com/c/go/+/110055/.

perhaps rewrite this as:

```go
for latch := range sm.rSet {
    latch.setInRSet(false)
    sm.trees[spanset.SpanReadOnly].Set(latch)
}
if realloc := len(sm.rSet) > 16; realloc {
    sm.rSet = make(map[*latch]struct{})
} else {
    for latch := range sm.rSet {
        delete(sm.rSet, latch)
    }
}
```

This change renames `storage/cmdq` to `storage/spanlatch`. The package
will house the new `spanlatch.Manager` type, which will handle the
acquisition and release of span-latches.

This works off of the definition for latches presented by Goetz Graefe
in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf
(see https://i.stack.imgur.com/fSRzd.png).

The files are not changed in this commit.

Release note: None
This commit replaces all references to cmds with references to latches.

Release note: None
Informs cockroachdb#4768.
Informs cockroachdb#31904.

This change was inspired by cockroachdb#31904 and is a progression of the thinking
started in cockroachdb#4768 (comment).

The change introduces `spanlatch.Manager`, which will replace the `CommandQueue`
**in a future PR**. The new type isn't hooked up yet because doing so will
require a lot of plumbing changes in the storage package that are best kept
in a separate PR. The structure uses a new strategy that reduces lock contention,
simplifies the code, avoids allocations, and makes cockroachdb#31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing
the amount of work we perform under the exclusive "sequencing" mutex while
locking the structure. This is made possible by employing a copy-on-write
strategy. Before this change, commands would lock the queue, create a large
slice of prerequisites, insert into the queue and unlock. After the change,
commands lock the manager, grab an immutable snapshot of the manager's trees in
O(1) time, insert into the manager, and unlock. They can then iterate over the
immutable tree snapshot outside of the lock. Effectively, this means that
the work performed under lock is linear with respect to the number of spans
that a command declares but NO LONGER linear with respect to the number of
other commands that it will wait on. This is important because `Replica.beginCmds`
repeatedly comes up as the largest source of mutex contention in our system,
especially on hot ranges.

The use of immutable snapshots also simplifies the code significantly. We're
no longer copying our prereqs into a slice so we no longer need to carefully
determine which transitive dependencies we do or don't need to wait on
explicitly. This also makes lock cancellation trivial because we no longer
explicitly hold on to our prereqs at all. Instead, we simply iterate through
the snapshot outside of the lock.

While rewriting the structure, I also spent some time optimizing its allocations.
Under normal operation, acquiring a latch now incurs only a single allocation -
that being for the `spanlatch.Guard`. All other allocations are avoided through
object pooling where appropriate. The overhead of using a copy-on-write
technique is almost entirely avoided by atomically reference counting btree nodes,
which allows us to release them back into the btree node pools when they're no
longer referenced by any btree snapshots. This means that we don't expect any
allocations when inserting into the internal trees, even with the COW policy.

Finally, this will make the approach taken in cockroachdb#31904 much more natural.
Instead of tracking dependents and prerequisites for speculative reads
and then iterating through them to find overlaps after, we can use the
immutable snapshots directly! We can grab a snapshot and sequence ourselves
as usual, but avoid waiting for prereqs. We then execute optimistically
before finally checking whether we overlapped any of our prereqs. The
great thing about this is that we already have the prereqs in an interval
tree structure, so we get an efficient validation check for free.
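
As a purely hypothetical sketch of that flow (none of these names exist in
this PR; the real work belongs to the follow-up), the control flow would
look something like:

```
package spanlatch

import "context"

// evalOptimistic sequences and snapshots as usual, evaluates without
// waiting on prereqs, and only falls back to waiting when the snapshot's
// interval trees report an overlap. All signatures here are assumptions.
func evalOptimistic(
	ctx context.Context,
	sequence func() (overlapped func() bool, wait func(context.Context) error),
	eval func(context.Context) error,
) error {
	overlapped, wait := sequence() // acquire latches + immutable snapshot
	if err := eval(ctx); err != nil {
		return err
	}
	if !overlapped() { // cheap validation: prereqs are in interval trees
		return nil
	}
	if err := wait(ctx); err != nil { // conflicted: wait as before
		return err
	}
	return eval(ctx) // then re-evaluate
}
```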

### Naming changes

| Before                     | After                             |
|----------------------------|-----------------------------------|
| `CommandQueue`             | `spanlatch.Manager`               |
| "enter the command queue"  | "acquire span latches"            |
| "exit the command queue"   | "release span latches"            |
| "wait for prereq commands" | "wait for latches to be released" |

The use of the word "latch" is based on the definition of
latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf
(see https://i.stack.imgur.com/fSRzd.png). An important reason
for avoiding the word "lock" here is that it is critical not to
confuse the operational locking performed by the
CommandQueue/spanlatch.Manager with the transaction-scoped
locking enforced by intents and our transactional concurrency
control model.

### Microbenchmarks

NOTE: these are single-threaded benchmarks that don't benefit at all
from the concurrency improvements enabled by this new structure.

```
name                              cmdq time/op    spanlatch time/op    delta
ReadOnlyMix/size=1-4                  897ns ±21%           917ns ±18%     ~     (p=0.897 n=8+10)
ReadOnlyMix/size=4-4                  827ns ±22%           772ns ±15%     ~     (p=0.448 n=10+10)
ReadOnlyMix/size=16-4                 905ns ±19%           770ns ±10%  -14.90%  (p=0.004 n=10+10)
ReadOnlyMix/size=64-4                 907ns ±20%           730ns ±15%  -19.51%  (p=0.001 n=10+10)
ReadOnlyMix/size=128-4                926ns ±17%           731ns ±11%  -21.04%  (p=0.000 n=9+10)
ReadOnlyMix/size=256-4                977ns ±19%           726ns ± 9%  -25.65%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=0-4       12.5µs ± 4%           0.7µs ±17%  -94.70%  (p=0.000 n=8+9)
ReadWriteMix/readsPerWrite=1-4       8.18µs ± 5%          0.63µs ± 6%  -92.24%  (p=0.000 n=10+9)
ReadWriteMix/readsPerWrite=4-4       3.80µs ± 2%          0.66µs ± 5%  -82.58%  (p=0.000 n=8+10)
ReadWriteMix/readsPerWrite=16-4      1.82µs ± 2%          0.70µs ± 5%  -61.43%  (p=0.000 n=9+10)
ReadWriteMix/readsPerWrite=64-4       894ns ±12%           514ns ± 6%  -42.48%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=128-4      717ns ± 5%           472ns ± 1%  -34.21%  (p=0.000 n=10+8)
ReadWriteMix/readsPerWrite=256-4      607ns ± 5%           453ns ± 3%  -25.35%  (p=0.000 n=7+10)

name                              cmdq alloc/op   spanlatch alloc/op   delta
ReadOnlyMix/size=1-4                   223B ± 0%            191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=4-4                   223B ± 0%            191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=16-4                  223B ± 0%            191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=64-4                  223B ± 0%            191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=128-4                 223B ± 0%            191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=256-4                 223B ± 0%            191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=0-4         915B ± 0%            144B ± 0%  -84.26%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=1-4         730B ± 0%            144B ± 0%  -80.29%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=4-4         486B ± 0%            144B ± 0%  -70.35%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=16-4        350B ± 0%            144B ± 0%  -58.86%  (p=0.000 n=9+10)
ReadWriteMix/readsPerWrite=64-4        222B ± 0%            144B ± 0%  -35.14%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=128-4       199B ± 0%            144B ± 0%  -27.64%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=256-4       188B ± 0%            144B ± 0%  -23.40%  (p=0.000 n=10+10)

name                              cmdq allocs/op  spanlatch allocs/op  delta
ReadOnlyMix/size=1-4                   1.00 ± 0%            1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=4-4                   1.00 ± 0%            1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=16-4                  1.00 ± 0%            1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=64-4                  1.00 ± 0%            1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=128-4                 1.00 ± 0%            1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=256-4                 1.00 ± 0%            1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=0-4         34.0 ± 0%             1.0 ± 0%  -97.06%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=1-4         22.0 ± 0%             1.0 ± 0%  -95.45%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=4-4         10.0 ± 0%             1.0 ± 0%  -90.00%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=16-4        4.00 ± 0%            1.00 ± 0%  -75.00%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=64-4        1.00 ± 0%            1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=128-4       1.00 ± 0%            1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=256-4       1.00 ± 0%            1.00 ± 0%     ~     (all equal)
```

Release note: None
…g removal

This change modifies `adjustUpperBoundOnRemoval` to avoid a degenerate
case in element removal where all intervals have the same end key. In
this case, we would previously adjust the upper bound of every node from
the root of the tree to the node that the interval was being removed
from. We now check whether removing the element with the largest end key
is actually changing the upper bound of the node. If there are other
elements with the same end key then this is not the case and we can
avoid repeat calls to `adjustUpperBoundOnRemoval` while traversing back
up the tree.

This came up while profiling a benchmark that was giving surprising
results.
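
A simplified sketch of the check, assuming byte-slice keys and illustrative
field names (the real tree stores latches with an interval and a per-node
upper bound):

```
package spanlatch

import "bytes"

type latch struct{ start, end []byte }

type node struct {
	latches []latch // latches remaining in the node after the removal
	max     []byte  // upper bound: max end key reachable in this subtree
}

// needsAdjustOnRemoval reports whether removing rem can actually lower
// n.max. If another latch shares the same end key, the bound is unchanged
// and the walk back up the tree can stop re-adjusting immediately.
func (n *node) needsAdjustOnRemoval(rem latch) bool {
	if !bytes.Equal(rem.end, n.max) {
		return false // rem never defined the bound
	}
	for _, la := range n.latches {
		if bytes.Equal(la.end, n.max) {
			return false // a remaining latch still pins the bound
		}
	}
	return true
}
```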

Release note: None
Member Author

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/command_queue_test.go, line 809 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I suppose you'll be renaming all of these instances of CommandQueue as well in a future PR.

Yes, I'll be ripping out every single reference I can find to it.


pkg/storage/spanlatch/doc.go, line 20 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps mention that this is the evolution of complexity. Something like: s/Managers's/The evolution of/g.

Done.


pkg/storage/spanlatch/manager.go, line 62 at r10 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: I'd prefer to see this spelled out as readSet and inReadSet.

We could avoid the use of a map by instead using a circularly linked list. latch would need next, prev *latch fields. You can remove an element from such a list without knowing its position. See util/cache.Entry and util/cache.entryList for an example of what I'm thinking of.

That's a really cool idea! It provides a nice speedup:

name                                          old time/op    new time/op    delta
LatchManagerReadOnlyMix/size=1-4                 683ns ± 9%     404ns ±10%  -40.85%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4                 660ns ± 7%     382ns ± 5%  -42.17%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4                684ns ±10%     367ns ± 5%  -46.27%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4                683ns ± 8%     370ns ± 1%  -45.75%  (p=0.016 n=5+4)
LatchManagerReadOnlyMix/size=128-4               678ns ± 4%     398ns ±14%  -41.27%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4               652ns ± 4%     385ns ± 4%  -40.95%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4       594ns ±16%     629ns ±17%     ~     (p=0.222 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4       603ns ± 1%     552ns ± 7%   -8.39%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4       621ns ± 4%     576ns ± 5%   -7.28%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4      649ns ± 2%     541ns ±13%  -16.69%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4      474ns ± 5%     423ns ±29%     ~     (p=0.151 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4     413ns ± 2%     362ns ±16%     ~     (p=0.095 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4     448ns ±14%     314ns ±13%  -29.85%  (p=0.008 n=5+5)

name                                          old alloc/op   new alloc/op   delta
LatchManagerReadOnlyMix/size=1-4                  191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4                  191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4                 191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4                 191B ± 0%      160B ± 0%     ~     (p=0.079 n=4+5)
LatchManagerReadOnlyMix/size=128-4                191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4                191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4        144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4        144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4        144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4       144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4       144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4      144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4      144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)

name                                          old allocs/op  new allocs/op  delta
LatchManagerReadOnlyMix/size=1-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=4-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=16-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=64-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=128-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=256-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=0-4        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=1-4        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=4-4        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=16-4       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=64-4       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=128-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=256-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
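
For reference, a sketch of the intrusive list shape being adopted here,
modeled on the container/list-style sentinel that was suggested; this is an
illustrative reconstruction, not the final code:

```
package spanlatch

// latch embeds its own list links, so membership costs no extra
// allocation and removal is O(1) without knowing the element's position.
type latch struct {
	next, prev *latch
	// ... span, timestamp, done channel, etc.
}

// latchList is a circular doubly-linked list with a sentinel root.
type latchList struct {
	root latch // root.next is the front, root.prev is the back
	len  int
}

func (ll *latchList) lazyInit() {
	if ll.root.next == nil {
		ll.root.next = &ll.root
		ll.root.prev = &ll.root
	}
}

func (ll *latchList) pushBack(la *latch) {
	ll.lazyInit()
	at := ll.root.prev // current back
	la.prev, la.next = at, at.next
	at.next.prev = la
	at.next = la
	ll.len++
}

// remove needs only the latch itself, not its position in the list.
func (ll *latchList) remove(la *latch) {
	la.prev.next = la.next
	la.next.prev = la.prev
	la.next, la.prev = nil, nil // avoid dangling references
	ll.len--
}
```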

pkg/storage/spanlatch/manager.go, line 228 at r10 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Looks like you always have a snapshot associated with a guard. Rather than passing the snapshot on the stack, it might be better (faster) to embed the snapshot in the guard and to change Manager.snapshot() to take a *snapshot which it fills in.

But then we have to allocate that entire object on the heap and keep the memory around for the entire lifetime of the Guard. Do you think that will be faster?


pkg/storage/spanlatch/manager.go, line 250 at r10 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Perhaps follow the Locked naming convention. E.g. snapshotLocked and insertLocked.

Done.


pkg/storage/spanlatch/interval_btree.go, line 15 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Note that there are various bits in the UI that refer to "Command Queue". Let's file an issue to change the name there as well.

I have a series of changes lined up to eradicate that word.

Contributor

@ajwerner ajwerner left a comment

Reviewed 1 of 7 files at r4, 1 of 1 files at r5, 2 of 4 files at r7, 3 of 7 files at r9, 1 of 1 files at r10, 1 of 7 files at r14, 1 of 5 files at r16.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained


pkg/storage/spanlatch/list.go, line 20 at r16 (raw file):

type latchList struct {
	root latch
	len  int

while it's reasonable and clean to track len (and it's done in container/list to be able to implement O(1) Length), it seems like given the general memory consciousness of this package, it's safe to omit latchList.len if in front() you make the nil condition ll.root.next == nil || ll.root.next == &ll.root


pkg/storage/spanlatch/manager.go, line 246 at r16 (raw file):

// flushReadSetLocked flushes the read set into the read interval tree.
func (sm *scopedManager) flushReadSetLocked() {
	for sm.readSet.len > 0 {

if you decide to eliminate .len then I guess this could look like:

for latch := sm.readSet.front(); latch != nil; latch = sm.readSet.front() {
   sm.readSet.remove(latch)
   sm.trees[spanset.SpanReadOnly].Set(latch)
}

Collaborator

@petermattis petermattis left a comment

:lgtm:

I didn't fully scrutinize all of the details or testing here. Let me know if you think something deserves particular attention and I'll give it a thorough look.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained


pkg/storage/spanlatch/list.go, line 20 at r16 (raw file):

Previously, ajwerner wrote…

while it's reasonable and clean to track len (and it's done in container/list to be able to implement O(1) Length), it seems like given the general memory consciousness of this package, it's safe to omit latchList.len if in front() you make the nil condition ll.root.next == nil || ll.root.next == &ll.root

The memory savings are minimal as there are a constant number of latchLists per Manager. That said, I'd remove len because it doesn't seem necessary per @ajwerner's suggestion.


pkg/storage/spanlatch/list.go, line 30 at r16 (raw file):

}

func (ll *latchList) lazyInit() {

Do you need this lazyInit stuff? For the usage in spanlatch.Manager I think an init method could be called when the Manager is created.


pkg/storage/spanlatch/manager.go, line 62 at r10 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

That's a really cool idea! It provides a nice speedup:

name                                          old time/op    new time/op    delta
LatchManagerReadOnlyMix/size=1-4                 683ns ± 9%     404ns ±10%  -40.85%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4                 660ns ± 7%     382ns ± 5%  -42.17%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4                684ns ±10%     367ns ± 5%  -46.27%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4                683ns ± 8%     370ns ± 1%  -45.75%  (p=0.016 n=5+4)
LatchManagerReadOnlyMix/size=128-4               678ns ± 4%     398ns ±14%  -41.27%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4               652ns ± 4%     385ns ± 4%  -40.95%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4       594ns ±16%     629ns ±17%     ~     (p=0.222 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4       603ns ± 1%     552ns ± 7%   -8.39%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4       621ns ± 4%     576ns ± 5%   -7.28%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4      649ns ± 2%     541ns ±13%  -16.69%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4      474ns ± 5%     423ns ±29%     ~     (p=0.151 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4     413ns ± 2%     362ns ±16%     ~     (p=0.095 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4     448ns ±14%     314ns ±13%  -29.85%  (p=0.008 n=5+5)

name                                          old alloc/op   new alloc/op   delta
LatchManagerReadOnlyMix/size=1-4                  191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4                  191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4                 191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4                 191B ± 0%      160B ± 0%     ~     (p=0.079 n=4+5)
LatchManagerReadOnlyMix/size=128-4                191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4                191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4        144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4        144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4        144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4       144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4       144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4      144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4      144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)

name                                          old allocs/op  new allocs/op  delta
LatchManagerReadOnlyMix/size=1-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=4-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=16-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=64-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=128-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=256-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=0-4        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=1-4        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=4-4        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=16-4       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=64-4       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=128-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=256-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)

💯


pkg/storage/spanlatch/manager.go, line 228 at r10 (raw file):

Do you think that will be faster?

I don't know. Perhaps add it to a TODO list to investigate after this PR goes in. Probably a very minor benefit if any.

Contributor

@ajwerner ajwerner left a comment

:lgtm:

Reviewed 4 of 9 files at r11, 3 of 4 files at r12, 1 of 2 files at r13, 5 of 7 files at r14, 1 of 1 files at r15, 2 of 5 files at r16.
Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained


pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):

// before returning.
func (m *Manager) wait(ctx context.Context, lg *Guard, ts hlc.Timestamp, snap snapshot) error {
	for s := spanset.SpanScope(0); s < spanset.NumSpanScope; s++ {

Just a question for discussion, can the order in which latches are examined impact performance? It seems like if we could wait on the longest-blocking item first then we'd increase the rate of hitting the fast path on the signal and reduce the number of goroutine yields on the select. I don't have good intuition about what it would take to come up with a heuristic to guess when a latch will be removed. Do we expect reads to happen faster than writes? Do we expect global things to take longer than local? All of this may be premature optimization. It might be worth trying to see how often you hit the fast path and, if the number is low (maybe even as low as something like 50%), then maybe there's a cheap win here.


pkg/storage/spanlatch/manager.go, line 265 at r16 (raw file):

				switch a {
				case spanset.SpanReadOnly:
					// Add reads to the rSet. They only need to enter the read

total nit: s/rSet/readSet/

This change replaces the Manager's `readSet` map implementation with
a linked-list implementation. This provides the following speedup:

```
name                                          old time/op    new time/op    delta
LatchManagerReadOnlyMix/size=1-4                 683ns ± 9%     404ns ±10%  -40.85%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4                 660ns ± 7%     382ns ± 5%  -42.17%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4                684ns ±10%     367ns ± 5%  -46.27%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4                683ns ± 8%     370ns ± 1%  -45.75%  (p=0.016 n=5+4)
LatchManagerReadOnlyMix/size=128-4               678ns ± 4%     398ns ±14%  -41.27%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4               652ns ± 4%     385ns ± 4%  -40.95%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4       594ns ±16%     629ns ±17%     ~     (p=0.222 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4       603ns ± 1%     552ns ± 7%   -8.39%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4       621ns ± 4%     576ns ± 5%   -7.28%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4      649ns ± 2%     541ns ±13%  -16.69%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4      474ns ± 5%     423ns ±29%     ~     (p=0.151 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4     413ns ± 2%     362ns ±16%     ~     (p=0.095 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4     448ns ±14%     314ns ±13%  -29.85%  (p=0.008 n=5+5)

name                                          old alloc/op   new alloc/op   delta
LatchManagerReadOnlyMix/size=1-4                  191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4                  191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4                 191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4                 191B ± 0%      160B ± 0%     ~     (p=0.079 n=4+5)
LatchManagerReadOnlyMix/size=128-4                191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4                191B ± 0%      160B ± 0%  -16.23%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4        144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4        144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4        144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4       144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4       144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4      144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4      144B ± 0%      160B ± 0%  +11.11%  (p=0.008 n=5+5)

name                                          old allocs/op  new allocs/op  delta
LatchManagerReadOnlyMix/size=1-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=4-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=16-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=64-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=128-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadOnlyMix/size=256-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=0-4        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=1-4        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=4-4        1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=16-4       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=64-4       1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=128-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
LatchManagerReadWriteMix/readsPerWrite=256-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
```

The change also makes Manager's zero value completely usable.

Release note: None
It is cheaper to wait on an already released latch than on an unreleased
latch, so we prefer waiting on the longest-held latches first. We expect
writes to take longer than reads to release their latches, so we wait on
them first.
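
A minimal sketch of the reordered wait loop under assumed names (the real
code iterates the spanset access types; everything else here is
illustrative):

```
package spanlatch

import "context"

const (
	spanReadOnly = iota
	spanReadWrite
	numAccess
)

// waiter exposes the channel closed when a conflicting latch is released.
type waiter interface{ done() <-chan struct{} }

// waitOn drains write latches before read latches: writes are expected to
// be held longest, so by the time we reach the readers most of their
// channels are already closed and the select hits the fast path.
func waitOn(ctx context.Context, conflicts [numAccess][]waiter) error {
	for _, a := range []int{spanReadWrite, spanReadOnly} {
		for _, w := range conflicts[a] {
			select {
			case <-w.done(): // fast path: latch already released
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}
	return nil
}
```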

Release note: None
Member Author

@nvanbenschoten nvanbenschoten left a comment

TFTRs!

bors r+

Reviewable status: :shipit: complete! 2 of 0 LGTMs obtained


pkg/storage/spanlatch/list.go, line 20 at r16 (raw file):

Previously, petermattis (Peter Mattis) wrote…

The memory savings are minimal as there are a constant number of latchLists per Manager. That said, I'd remove len because it doesn't seem necessary per @ajwerner's suggestion.

I actually did exactly what's being suggested here at first, but I realized that we're going to want metrics on this soon enough and being able to track how many reads are in the readSet will be important.


pkg/storage/spanlatch/list.go, line 30 at r16 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Do you need this lazyInit stuff? For the usage in spanlatch.Manager I think an init method could be called when the Manager is created.

This allows the zero value for the entire spanlatch.Manager to be used directly, which is super nice. We don't have or need a Manager constructor.


pkg/storage/spanlatch/manager.go, line 228 at r10 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Do you think that will be faster?

I don't know. Perhaps add it to a TODO list to investigate after this PR goes in. Probably a very minor benefit if any.

I gave it a shot and it didn't seem to help:

name                              old time/op    new time/op    delta
ReadOnlyMix/size=1-4                 404ns ±10%     561ns ±14%  +38.91%  (p=0.008 n=5+5)
ReadOnlyMix/size=4-4                 382ns ± 5%     533ns ±17%  +39.60%  (p=0.008 n=5+5)
ReadOnlyMix/size=16-4                367ns ± 5%     500ns ±17%  +36.04%  (p=0.008 n=5+5)
ReadOnlyMix/size=64-4                370ns ± 1%     518ns ± 8%  +39.92%  (p=0.016 n=4+5)
ReadOnlyMix/size=128-4               398ns ±14%     548ns ± 8%  +37.50%  (p=0.008 n=5+5)
ReadOnlyMix/size=256-4               385ns ± 4%     546ns ± 5%  +41.92%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4       629ns ±17%     755ns ±14%     ~     (p=0.056 n=5+5)
ReadWriteMix/readsPerWrite=1-4       552ns ± 7%     729ns ± 9%  +31.93%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4       576ns ± 5%     673ns ±20%  +16.84%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4      541ns ±13%     632ns ± 1%  +16.89%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4      423ns ±29%     552ns ±31%  +30.50%  (p=0.032 n=5+5)
ReadWriteMix/readsPerWrite=128-4     362ns ±16%     426ns ± 3%  +17.44%  (p=0.016 n=5+5)
ReadWriteMix/readsPerWrite=256-4     314ns ±13%     405ns ± 6%  +28.94%  (p=0.008 n=5+5)

name                              old alloc/op   new alloc/op   delta
ReadOnlyMix/size=1-4                  160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadOnlyMix/size=4-4                  160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadOnlyMix/size=16-4                 160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadOnlyMix/size=64-4                 160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadOnlyMix/size=128-4                160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadOnlyMix/size=256-4                160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4        160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4        160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4        160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4       160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4       160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4      160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=256-4      160B ± 0%      224B ± 0%  +40.00%  (p=0.008 n=5+5)

pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):
This is an interesting idea. We expect writes to hold their latches significantly longer than reads, so it should be a clear win to wait on them first so that we select from fewer channels in total. Done.

Do we expect reads to happen faster than writes?

Yes.

Do we expect global things to take longer than local?

Not necessarily. I don't think there's any real correlation here.


pkg/storage/spanlatch/manager.go, line 246 at r16 (raw file):

Previously, ajwerner wrote…

if you decide to eliminate .len then I guess this could look like:

for latch := sm.readSet.front(); latch != nil; latch = sm.readSet.front() {
   sm.readSet.remove(latch)
   sm.trees[spanset.SpanReadOnly].Set(latch)
}

See discussion above.


pkg/storage/spanlatch/manager.go, line 265 at r16 (raw file):

Previously, ajwerner wrote…

total nit: s/rSet/readSet/

Not a nit, a botched refactor :) Done.

@nvanbenschoten
Member Author

bors r-

@craig
Contributor

craig bot commented Nov 29, 2018

Canceled

Contributor

@ajwerner ajwerner left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

This is an interesting idea. We expect writes to hold their latches significantly longer than reads, so it should be a clear win to wait on them first so that we select from fewer channels in total. Done.

Do we expect reads to happen faster than writes?

Yes.

Do we expect global things to take longer than local?

Not necessarily. I don't think there's any real correlation here.

Cool, the next steps to push this idea further would be:

  1. set a to SpanReadWrite before setting it to SpanReadOnly (0) in the for loop
  2. sort the latches in newGuard with the highest timestamps first as my intuition is that the high timestamp latches are expected to finish last.

Member Author

@nvanbenschoten nvanbenschoten left a comment

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):

set a to SpanReadWrite before setting it SpanReadOnly (0) in the for loop

But we would want a=SpanReadOnly before a=SpanReadWrite, right? Because then the order of access will be readSpan+tree[SpanReadWrite], writeSpan+tree[SpanReadWrite], writeSpan+tree[SpanReadOnly]. Either way, in practice we never actually see requests with read and write spans together.

sort the latches in newGuard with the highest timestamps first as my intuition is that the high timestamp latches are expected to finish last.

All of the latches in newGuard have the same timestamp. Also, anything that requires sorting will almost certainly cost more than doing nothing at all. We're dealing on the order of double-digit ns at this point.

craig bot pushed a commit that referenced this pull request Nov 29, 2018
31997: storage/spanlatch: create spanlatch.Manager using immutable btrees r=nvanbenschoten a=nvanbenschoten

Informs #4768.
Informs #31904.

This change was inspired by #31904 and is a progression of the thinking started in #4768 (comment).

The change introduces `spanlatch.Manager`, which will replace the `CommandQueue` **in a future PR**. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in that storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes #31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because `Replica.beginCmds` repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges.

The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock.

While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the `spanlatch.Guard`. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting immutable btree nodes, which allows us to release them back into the btree node pools when they're no longer needed. This means that we don't expect any allocations when inserting into the internal trees, even with the copy-on-write policy.

Finally, this will make the approach taken in #31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free.

### Naming changes

| Before                     | After                             |
|----------------------------|-----------------------------------|
| `CommandQueue`             | `spanlatch.Manager`               |
| "enter the command queue"  | "acquire span latches"            |
| "exit the command queue"   | "release span latches"            |
| "wait for prereq commands" | "wait for latches to be released" |

The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is that it is critical not to confuse the operational locking performed by the CommandQueue/spanlatch.Manager with the transaction-scoped locking enforced by intents and our transactional concurrency control model.

### Microbenchmarks

NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure.

```
name                              old time/op    new time/op    delta
ReadOnlyMix/size=1-4                 706ns ±20%     404ns ±10%  -42.81%  (p=0.008 n=5+5)
ReadOnlyMix/size=4-4                 649ns ±23%     382ns ± 5%  -41.13%  (p=0.008 n=5+5)
ReadOnlyMix/size=16-4                611ns ±16%     367ns ± 5%  -39.83%  (p=0.008 n=5+5)
ReadOnlyMix/size=64-4                692ns ±14%     370ns ± 1%  -46.49%  (p=0.016 n=5+4)
ReadOnlyMix/size=128-4               637ns ±22%     398ns ±14%  -37.48%  (p=0.008 n=5+5)
ReadOnlyMix/size=256-4               676ns ±15%     385ns ± 4%  -43.01%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4      12.2µs ± 4%     0.6µs ±17%  -94.85%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4      7.88µs ± 2%    0.55µs ± 7%  -92.99%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4      4.19µs ± 3%    0.58µs ± 5%  -86.26%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4     2.09µs ± 6%    0.54µs ±13%  -74.13%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4      875ns ±17%     423ns ±29%  -51.64%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4     655ns ± 6%     362ns ±16%  -44.71%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=256-4     549ns ±16%     314ns ±13%  -42.73%  (p=0.008 n=5+5)

name                              old alloc/op   new alloc/op   delta
ReadOnlyMix/size=1-4                  223B ± 0%      160B ± 0%  -28.25%  (p=0.079 n=4+5)
ReadOnlyMix/size=4-4                  223B ± 0%      160B ± 0%  -28.25%  (p=0.008 n=5+5)
ReadOnlyMix/size=16-4                 223B ± 0%      160B ± 0%  -28.25%  (p=0.008 n=5+5)
ReadOnlyMix/size=64-4                 223B ± 0%      160B ± 0%  -28.25%  (p=0.008 n=5+5)
ReadOnlyMix/size=128-4                217B ± 4%      160B ± 0%  -26.27%  (p=0.008 n=5+5)
ReadOnlyMix/size=256-4                223B ± 0%      160B ± 0%  -28.25%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=0-4      1.25kB ± 0%    0.16kB ± 0%  -87.15%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4      1.00kB ± 0%    0.16kB ± 0%  -84.00%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=4-4        708B ± 0%      160B ± 0%  -77.40%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=16-4       513B ± 0%      160B ± 0%  -68.81%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4       264B ± 0%      160B ± 0%  -39.39%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4      221B ± 0%      160B ± 0%  -27.60%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=256-4      198B ± 0%      160B ± 0%  -19.35%  (p=0.008 n=5+5)

name                              old allocs/op  new allocs/op  delta
ReadOnlyMix/size=1-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=4-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=16-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=64-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=128-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=256-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=0-4        38.0 ± 0%       1.0 ± 0%  -97.37%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4        24.0 ± 0%       1.0 ± 0%  -95.83%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4        12.0 ± 0%       1.0 ± 0%  -91.67%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4       5.00 ± 0%      1.00 ± 0%  -80.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=256-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
```

There are a few interesting things to point out about these benchmark results:
- The `ReadOnlyMix` results demonstrate a fixed improvement, regardless of size. This is due to the replacement of the hash-map with a linked-list for the readSet structure.
- The `ReadWriteMix` is more interesting. We see that the spanlatch implementation is faster across the board. This is especially true with a high write/read ratio.
- We see that the allocated memory stays constant regardless of the write/read ratio in the spanlatch implementation. This is due to the memory recycling that it performs on btree nodes. This is not the case for the CommandQueue implementation.

Release note: None

32416:  scripts: enhance the release notes r=knz a=knz

Fixes #25180.

With this the amount of release notes for the first 2.2 alpha in cockroachdb/docs#4051 is reduced to just under two pages.

Also this PR makes it easier to monitor progress during the execution of the script.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
@craig
Contributor

craig bot commented Nov 29, 2018

Build succeeded

@craig craig bot merged commit b2ab370 into cockroachdb:master Nov 29, 2018
@nvanbenschoten nvanbenschoten deleted the nvanbenschoten/cmdq2 branch November 29, 2018 17:56
Member

@tbg tbg left a comment

💯

Reviewed 9 of 9 files at r11, 4 of 4 files at r12, 2 of 2 files at r13, 7 of 7 files at r14, 1 of 1 files at r15, 4 of 5 files at r16, 1 of 1 files at r17.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/spanlatch/manager.go, line 314 at r14 (raw file):

}

func (m *Manager) nextID() uint64 {

nit: nextIDLocked()


pkg/storage/spanlatch/manager_test.go, line 122 at r14 (raw file):

	m := New()

	// Try latch with no overlapping already-acquired lathes.

lathes

Member

@tbg tbg left a comment

Reviewed 1 of 1 files at r18.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)

Member Author

@nvanbenschoten nvanbenschoten left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 2 stale)


pkg/storage/spanlatch/manager.go, line 314 at r14 (raw file):

Previously, tbg (Tobias Grieger) wrote…

nit: nextIDLocked()

Will address in next PR.

nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 4, 2018
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 5, 2018
This commit replaces the CommandQueue with the spanlatch.Manager, which
was introduced in cockroachdb#31997. See that PR for an introduction to how the
structure differs from the CommandQueue and how it improves performance
on microbenchmarks.

This is mostly a mechanical change. One important detail is that it removes
the CommandQueue debug page. We found that the page was buggy (or straight
up broken) and it wasn't actively used by members of Core when debugging problems.
In its place, the commit revives the "slow requests" metric for latching, which
hasn't been hooked up in over a year.

### Benchmarks

#### Standard Benchmarks

These benchmarks are standard benchmarks that we commonly run. They were run with
varying node sizes, cluster sizes, and pre-split counts.

```
name                              old ops/sec  new ops/sec  delta
kv0/cores=4/nodes=1/splits=0       1.99k ± 2%   2.06k ± 1%   +3.22%  (p=0.008 n=5+5)
kv0/cores=4/nodes=1/splits=100     2.25k ± 1%   2.38k ± 1%   +6.01%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0       1.60k ± 0%   1.69k ± 2%   +5.53%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100     3.52k ± 6%   3.65k ± 9%     ~     (p=0.421 n=5+5)
kv0/cores=16/nodes=1/splits=0      19.9k ± 1%   21.8k ± 1%   +9.34%  (p=0.008 n=5+5)
kv0/cores=16/nodes=1/splits=100    24.4k ± 1%   26.1k ± 1%   +7.17%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0      14.9k ± 1%   16.1k ± 1%   +8.03%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100    20.6k ± 1%   22.8k ± 1%  +10.79%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=0      31.2k ± 2%   35.3k ± 1%  +13.28%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100    45.7k ± 1%   51.1k ± 1%  +11.80%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0      23.7k ± 2%   27.1k ± 2%  +14.39%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=100    34.9k ± 2%   45.1k ± 1%  +29.44%  (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0      12.7k ± 2%   12.9k ± 2%   +1.39%  (p=0.151 n=5+5)
kv95/cores=4/nodes=1/splits=100    12.8k ± 2%   13.1k ± 2%   +2.10%  (p=0.032 n=5+5)
kv95/cores=4/nodes=3/splits=0      10.6k ± 1%   10.8k ± 1%   +1.58%  (p=0.056 n=5+5)
kv95/cores=4/nodes=3/splits=100    12.3k ± 7%   12.6k ± 8%   +2.61%  (p=0.095 n=5+5)
kv95/cores=16/nodes=1/splits=0     50.9k ± 1%   52.2k ± 1%   +2.37%  (p=0.008 n=5+5)
kv95/cores=16/nodes=1/splits=100   52.2k ± 1%   53.0k ± 1%   +1.49%  (p=0.008 n=5+5)
kv95/cores=16/nodes=3/splits=0     46.2k ± 1%   46.8k ± 1%   +1.32%  (p=0.032 n=5+5)
kv95/cores=16/nodes=3/splits=100   51.0k ± 1%   53.2k ± 1%   +4.25%  (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0     79.8k ± 2%  101.6k ± 1%  +27.31%  (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=100    104k ± 1%    107k ± 1%   +2.60%  (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0     85.8k ± 1%   91.8k ± 1%   +7.08%  (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=100    106k ± 1%    112k ± 1%   +5.51%  (p=0.008 n=5+5)

name                              old p50(ms)  new p50(ms)  delta
kv0/cores=4/nodes=1/splits=0        3.52 ± 5%    3.40 ± 0%   -3.41%  (p=0.016 n=5+4)
kv0/cores=4/nodes=1/splits=100      3.30 ± 0%    3.00 ± 0%   -9.09%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0        4.70 ± 0%    4.14 ± 9%  -11.91%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100      1.50 ± 0%    1.48 ± 8%     ~     (p=0.968 n=4+5)
kv0/cores=16/nodes=1/splits=0       1.40 ± 0%    1.40 ± 0%     ~     (all equal)
kv0/cores=16/nodes=1/splits=100     1.20 ± 0%    1.20 ± 0%     ~     (all equal)
kv0/cores=16/nodes=3/splits=0       2.00 ± 0%    1.90 ± 0%   -5.00%  (p=0.000 n=5+4)
kv0/cores=16/nodes=3/splits=100     1.40 ± 0%    1.40 ± 0%     ~     (all equal)
kv0/cores=36/nodes=1/splits=0       1.76 ± 3%    1.60 ± 0%   -9.09%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100     1.40 ± 0%    1.30 ± 0%   -7.14%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0       2.56 ± 2%    2.40 ± 0%   -6.25%  (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100     1.70 ± 0%    1.40 ± 0%  -17.65%  (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0       0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=4/nodes=1/splits=100     0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=4/nodes=3/splits=0       0.60 ± 0%    0.60 ± 0%     ~     (all equal)
kv95/cores=4/nodes=3/splits=100     0.60 ± 0%    0.60 ± 0%     ~     (all equal)
kv95/cores=16/nodes=1/splits=0      0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=16/nodes=1/splits=100    0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=16/nodes=3/splits=0      0.70 ± 0%    0.64 ± 9%   -8.57%  (p=0.167 n=5+5)
kv95/cores=16/nodes=3/splits=100    0.60 ± 0%    0.60 ± 0%     ~     (all equal)
kv95/cores=36/nodes=1/splits=0      0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=36/nodes=1/splits=100    0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=36/nodes=3/splits=0      0.66 ± 9%    0.60 ± 0%   -9.09%  (p=0.167 n=5+5)
kv95/cores=36/nodes=3/splits=100    0.60 ± 0%    0.60 ± 0%     ~     (all equal)

name                              old p99(ms)  new p99(ms)  delta
kv0/cores=4/nodes=1/splits=0        11.0 ± 0%    10.5 ± 0%   -4.55%  (p=0.000 n=5+4)
kv0/cores=4/nodes=1/splits=100      7.90 ± 0%    7.60 ± 0%   -3.80%  (p=0.000 n=5+4)
kv0/cores=4/nodes=3/splits=0        15.7 ± 0%    15.2 ± 0%   -3.18%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100      8.90 ± 0%    8.12 ± 3%   -8.76%  (p=0.016 n=4+5)
kv0/cores=16/nodes=1/splits=0       3.46 ± 2%    3.00 ± 0%  -13.29%  (p=0.000 n=5+4)
kv0/cores=16/nodes=1/splits=100     4.50 ± 0%    3.36 ± 2%  -25.33%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0       4.50 ± 0%    3.90 ± 0%  -13.33%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100     5.80 ± 0%    4.10 ± 0%  -29.31%  (p=0.029 n=4+4)
kv0/cores=36/nodes=1/splits=0       6.80 ± 0%    5.20 ± 0%  -23.53%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100     5.80 ± 0%    4.32 ± 4%  -25.52%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0       7.72 ± 2%    6.30 ± 0%  -18.39%  (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100     7.98 ± 2%    5.20 ± 0%  -34.84%  (p=0.000 n=5+4)
kv95/cores=4/nodes=1/splits=0       5.38 ± 3%    5.20 ± 0%   -3.35%  (p=0.167 n=5+5)
kv95/cores=4/nodes=1/splits=100     5.00 ± 0%    5.00 ± 0%     ~     (all equal)
kv95/cores=4/nodes=3/splits=0       5.68 ± 3%    5.50 ± 0%   -3.17%  (p=0.095 n=5+4)
kv95/cores=4/nodes=3/splits=100     3.60 ±31%    2.93 ± 3%  -18.75%  (p=0.016 n=5+4)
kv95/cores=16/nodes=1/splits=0      4.10 ± 0%    4.10 ± 0%     ~     (all equal)
kv95/cores=16/nodes=1/splits=100    4.50 ± 0%    4.10 ± 0%   -8.89%  (p=0.000 n=5+4)
kv95/cores=16/nodes=3/splits=0      2.60 ± 0%    2.60 ± 0%     ~     (all equal)
kv95/cores=16/nodes=3/splits=100    2.50 ± 0%    1.90 ± 5%  -24.00%  (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0      6.60 ± 0%    6.00 ± 0%   -9.09%  (p=0.029 n=4+4)
kv95/cores=36/nodes=1/splits=100    5.50 ± 0%    5.12 ± 2%   -6.91%  (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0      4.18 ± 2%    4.02 ± 3%   -3.71%  (p=0.000 n=4+5)
kv95/cores=36/nodes=3/splits=100    3.80 ± 0%    2.80 ± 0%  -26.32%  (p=0.008 n=5+5)
```

#### Large-machine Benchmarks

These benchmarks are standard benchmarks run on a single-node cluster with 72 vCPUs.

```
name                              old ops/sec  new ops/sec  delta
kv0/cores=72/nodes=1/splits=0      31.0k ± 4%   36.4k ± 1%  +17.57%  (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100    44.0k ± 0%   49.0k ± 1%  +11.41%  (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0     52.7k ±18%   72.6k ±26%  +37.70%  (p=0.016 n=5+5)
kv95/cores=72/nodes=1/splits=100   66.8k ±17%   68.5k ± 5%     ~     (p=0.286 n=5+4)

name                              old p50(ms)  new p50(ms)  delta
kv0/cores=72/nodes=1/splits=0       2.30 ±13%    2.52 ± 5%     ~     (p=0.214 n=5+5)
kv0/cores=72/nodes=1/splits=100     3.00 ± 0%    2.90 ± 0%   -3.33%  (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0      0.46 ±13%    0.50 ± 0%     ~     (p=0.444 n=5+5)
kv95/cores=72/nodes=1/splits=100    0.44 ±14%    0.50 ± 0%  +13.64%  (p=0.167 n=5+5)

name                              old p99(ms)  new p99(ms)  delta
kv0/cores=72/nodes=1/splits=0       18.9 ± 6%    13.3 ± 5%  -29.56%  (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100     13.4 ± 2%    11.0 ± 0%  -17.91%  (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0      34.4 ±34%    23.5 ±24%  -31.74%  (p=0.048 n=5+5)
kv95/cores=72/nodes=1/splits=100    21.0 ± 0%    19.1 ± 4%   -8.81%  (p=0.029 n=4+4)
```

#### Motivating Benchmarks

These are benchmarks that used to generate a lot of contention in the CommandQueue.
They have small cycle-lengths, indicated by the `c` specifier. The last one also includes
20% scan operations, which increases contention between non-overlapping point operations.

```
name                                    old ops/sec  new ops/sec  delta
kv95-c5/cores=16/nodes=1/splits=0        45.1k ± 1%   47.2k ± 4%   +4.59%  (p=0.008 n=5+5)
kv95-c5/cores=36/nodes=1/splits=0        44.6k ± 1%   76.3k ± 1%  +71.05%  (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0      27.2k ± 2%   29.4k ± 1%   +8.12%  (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0      42.6k ± 2%   50.0k ± 1%  +17.39%  (p=0.008 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0   28.7k ± 1%   29.8k ± 3%   +3.87%  (p=0.008 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0   41.9k ± 4%   52.8k ± 2%  +25.97%  (p=0.008 n=5+5)

name                                    old p50(ms)  new p50(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0         0.60 ± 0%    0.60 ± 0%     ~     (all equal)
kv95-c5/cores=36/nodes=1/splits=0         0.90 ± 0%    0.80 ± 0%  -11.11%  (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0       1.10 ± 0%    1.06 ± 6%     ~     (p=0.444 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0       1.26 ± 5%    1.30 ± 0%     ~     (p=0.444 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0    0.66 ± 9%    0.60 ± 0%   -9.09%  (p=0.167 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0    0.70 ± 0%    0.50 ± 0%  -28.57%  (p=0.008 n=5+5)

name                                    old p99(ms)  new p99(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0         2.40 ± 0%    2.10 ± 0%  -12.50%  (p=0.000 n=5+4)
kv95-c5/cores=36/nodes=1/splits=0         5.80 ± 0%    3.30 ± 0%  -43.10%  (p=0.000 n=5+4)
kv50-c128/cores=16/nodes=1/splits=0       3.50 ± 0%    3.00 ± 0%  -14.29%  (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0       6.80 ± 0%    4.70 ± 0%  -30.88%  (p=0.079 n=4+5)
kv70-20-c128/cores=16/nodes=1/splits=0    5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.029 n=4+4)
kv70-20-c128/cores=36/nodes=1/splits=0    11.0 ± 0%     6.8 ± 0%  -38.18%  (p=0.008 n=5+5)
```

#### Batching Benchmarks

One optimization left out of the new spanlatch.Manager was the "covering" optimization,
where commands were initially added to the interval tree as a single spanning interval
and only expanded later. I ran a series of benchmarks to verify that this optimization
was not needed. My hypothesis was that the order-of-magnitude increase in the speed of the
interval tree would make the optimization unnecessary.

It turns out that removing the optimization hurt a few benchmarks to a small
degree but sped up others tremendously (some benchmarks improved by over 400%).
I suspect that the covering optimization could actually hurt in cases where it
causes non-overlapping requests to overlap. It is interesting how quickly a few
of these benchmarks oscillate from small losses to big wins. It makes me think
that there's some non-linear behavior with the old CommandQueue that would cause
its performance to quickly degrade once it became a contention bottleneck.
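
For context, the essence of the removed optimization is just inserting the
hull of a command's spans up front; a sketch with illustrative names:

```
package spanlatch

import "bytes"

type span struct{ start, end []byte }

// coveringSpan computes the single interval that covers all of a command's
// declared spans (assumes len(spans) > 0). The old CommandQueue inserted
// this hull first and only expanded to the individual spans later; the
// hull is also how two commands with non-overlapping spans could be made
// to appear to overlap.
func coveringSpan(spans []span) span {
	cover := spans[0]
	for _, s := range spans[1:] {
		if bytes.Compare(s.start, cover.start) < 0 {
			cover.start = s.start
		}
		if bytes.Compare(s.end, cover.end) > 0 {
			cover.end = s.end
		}
	}
	return cover
}
```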

```
name                                    old ops/sec  new ops/sec  delta
kv0-b16/cores=4/nodes=1/splits=0         2.41k ± 0%   2.06k ± 3%   -14.75%  (p=0.008 n=5+5)
kv0-b16/cores=4/nodes=1/splits=100         514 ± 0%     534 ± 1%    +3.88%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0        2.95k ± 0%   4.35k ± 0%   +47.74%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100      1.80k ± 1%   1.88k ± 1%    +4.46%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0        2.74k ± 0%   4.92k ± 1%   +79.55%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100      2.39k ± 1%   2.45k ± 1%    +2.41%  (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0          422 ± 0%     518 ± 1%   +22.60%  (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100       98.4 ± 1%    98.8 ± 1%      ~     (p=0.810 n=5+5)
kv0-b128/cores=16/nodes=1/splits=0         532 ± 0%    1059 ± 0%   +99.16%  (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100       291 ± 1%     307 ± 1%    +5.18%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0         483 ± 0%    1288 ± 1%  +166.37%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100       394 ± 1%     408 ± 1%    +3.51%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0        49.7 ± 1%    72.8 ± 1%   +46.52%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100      30.8 ± 0%    23.4 ± 0%   -24.03%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0       48.9 ± 2%   160.6 ± 0%  +228.38%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=100      101 ± 1%      80 ± 0%   -21.64%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0       37.5 ± 0%   208.1 ± 1%  +454.99%  (p=0.016 n=4+5)
kv0-b1024/cores=36/nodes=1/splits=100      162 ± 0%     124 ± 0%   -23.22%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0        5.93k ± 0%   6.20k ± 1%    +4.55%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100      2.27k ± 1%   2.32k ± 1%    +2.28%  (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=0       5.15k ± 1%  18.79k ± 1%  +264.73%  (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=100     8.31k ± 1%   8.57k ± 1%    +3.16%  (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0       3.96k ± 0%  10.67k ± 1%  +169.81%  (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=100     15.7k ± 2%   16.2k ± 4%    +2.75%  (p=0.151 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0       1.12k ± 1%   1.27k ± 0%   +13.28%  (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100       290 ± 1%     299 ± 1%    +3.02%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0      1.06k ± 0%   3.31k ± 0%  +213.09%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100      662 ±91%    1095 ± 1%   +65.42%  (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0        715 ± 2%    3586 ± 0%  +401.21%  (p=0.008 n=5+5)
kv95-b128/cores=36/nodes=1/splits=100    1.15k ±90%   2.01k ± 2%   +74.79%  (p=0.016 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0        134 ± 1%     170 ± 1%   +26.59%  (p=0.008 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=100     54.8 ± 3%    53.3 ± 3%    -2.84%  (p=0.056 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=0       104 ± 3%     367 ± 1%  +252.37%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100     210 ± 1%     214 ± 1%    +1.86%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0      76.5 ± 2%   383.9 ± 1%  +401.67%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100     431 ± 1%     436 ± 1%    +1.17%  (p=0.016 n=5+5)

name                                    old p50(ms)  new p50(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0          3.00 ± 0%    3.40 ± 0%   +13.33%  (p=0.016 n=5+4)
kv0-b16/cores=4/nodes=1/splits=100        15.2 ± 0%    14.7 ± 0%    -3.29%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0         10.5 ± 0%     7.7 ± 2%   -26.48%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100       17.8 ± 0%    16.8 ± 0%    -5.62%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0         26.2 ± 0%    14.2 ± 0%   -45.80%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100       29.0 ± 2%    28.3 ± 0%    -2.28%  (p=0.095 n=5+4)
kv0-b128/cores=4/nodes=1/splits=0         17.8 ± 0%    15.2 ± 0%   -14.61%  (p=0.000 n=5+4)
kv0-b128/cores=4/nodes=1/splits=100       79.7 ± 0%    79.7 ± 0%      ~     (all equal)
kv0-b128/cores=16/nodes=1/splits=0        65.0 ± 0%    32.5 ± 0%   -50.00%  (p=0.029 n=4+4)
kv0-b128/cores=16/nodes=1/splits=100       109 ± 0%     105 ± 0%    -3.85%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0         168 ± 0%      50 ± 0%   -70.02%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100       184 ± 0%     176 ± 0%    -4.50%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0         159 ± 0%     109 ± 0%   -31.56%  (p=0.000 n=5+4)
kv0-b1024/cores=4/nodes=1/splits=100       252 ± 0%     319 ± 0%   +26.66%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0        705 ± 0%     193 ± 0%   -72.62%  (p=0.000 n=5+4)
kv0-b1024/cores=16/nodes=1/splits=100      319 ± 0%     386 ± 0%   +21.05%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0      1.88k ± 0%   0.24k ± 0%   -87.05%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100      436 ± 0%     570 ± 0%   +30.77%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0         1.20 ± 0%    1.20 ± 0%      ~     (all equal)
kv95-b16/cores=4/nodes=1/splits=100       2.60 ± 0%    2.60 ± 0%      ~     (all equal)
kv95-b16/cores=16/nodes=1/splits=0        6.30 ± 0%    1.40 ± 0%   -77.78%  (p=0.000 n=5+4)
kv95-b16/cores=16/nodes=1/splits=100      1.74 ± 3%    1.76 ± 3%      ~     (p=1.000 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0        11.5 ± 0%     5.5 ± 0%   -52.17%  (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100      2.42 ±20%    2.42 ±45%      ~     (p=0.579 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0        6.60 ± 0%    6.00 ± 0%    -9.09%  (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100      21.4 ± 3%    21.0 ± 0%      ~     (p=0.444 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0       30.4 ± 0%     9.4 ± 0%   -69.08%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100     38.2 ±76%    21.2 ± 4%   -44.31%  (p=0.063 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0       88.1 ± 0%    16.8 ± 0%   -80.93%  (p=0.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100     56.6 ±85%    29.6 ±15%      ~     (p=0.873 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0       52.4 ± 0%    44.0 ± 0%   -16.03%  (p=0.029 n=4+4)
kv95-b1024/cores=4/nodes=1/splits=100      132 ± 2%     143 ± 0%    +8.29%  (p=0.016 n=5+4)
kv95-b1024/cores=16/nodes=1/splits=0       325 ± 3%      80 ± 0%   -75.51%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100     151 ± 0%     151 ± 0%      ~     (all equal)
kv95-b1024/cores=36/nodes=1/splits=0       973 ± 0%     180 ± 3%   -81.55%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100     168 ± 0%     168 ± 0%      ~     (all equal)

name                                    old p99(ms)  new p99(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0          8.40 ± 0%   10.30 ± 3%   +22.62%  (p=0.016 n=4+5)
kv0-b16/cores=4/nodes=1/splits=100        29.4 ± 0%    27.3 ± 0%    -7.14%  (p=0.000 n=5+4)
kv0-b16/cores=16/nodes=1/splits=0         16.3 ± 0%    15.5 ± 2%    -4.91%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100       31.5 ± 0%    29.4 ± 0%    -6.67%  (p=0.000 n=5+4)
kv0-b16/cores=36/nodes=1/splits=0         37.7 ± 0%    28.7 ± 2%   -23.77%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100       62.1 ± 2%    68.4 ±10%   +10.15%  (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0         37.7 ± 0%    39.4 ± 6%    +4.46%  (p=0.167 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100        143 ± 0%     151 ± 0%    +5.89%  (p=0.016 n=4+5)
kv0-b128/cores=16/nodes=1/splits=0        79.7 ± 0%    55.8 ± 2%   -30.04%  (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100       198 ± 3%     188 ± 3%    -5.09%  (p=0.048 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0         184 ± 0%     126 ± 3%   -31.82%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100       319 ± 0%     336 ± 0%    +5.24%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0         322 ± 6%     253 ± 4%   -21.35%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100       470 ± 0%     772 ± 4%   +64.28%  (p=0.016 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=0      1.41k ± 0%   0.56k ±11%   -60.00%  (p=0.000 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=100      530 ± 2%     772 ± 0%   +45.57%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0      4.05k ± 7%   1.17k ± 3%   -71.19%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100      792 ±14%    1020 ± 2%   +28.81%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0         3.90 ± 0%    3.22 ± 4%   -17.44%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100       21.0 ± 0%    19.9 ± 0%    -5.24%  (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=0        15.2 ± 0%     7.1 ± 0%   -53.29%  (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=100      38.5 ± 3%    37.7 ± 0%      ~     (p=0.333 n=5+4)
kv95-b16/cores=36/nodes=1/splits=0         128 ± 2%      52 ± 0%   -59.16%  (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100      41.1 ±13%    39.2 ±33%      ~     (p=0.984 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0        17.8 ± 0%    14.7 ± 0%   -17.42%  (p=0.079 n=4+5)
kv95-b128/cores=4/nodes=1/splits=100       107 ± 2%     106 ± 5%      ~     (p=0.683 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0       75.5 ± 0%    23.1 ± 0%   -69.40%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100      107 ±34%     120 ± 2%      ~     (p=1.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0        253 ± 4%      71 ± 0%   -71.86%  (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100      166 ±19%     164 ±74%      ~     (p=0.310 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=0        146 ± 3%     101 ± 0%   -31.01%  (p=0.000 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=100      348 ± 4%     366 ± 6%      ~     (p=0.317 n=4+5)
kv95-b1024/cores=16/nodes=1/splits=0       624 ± 3%     221 ± 2%   -64.52%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100     325 ± 3%     319 ± 0%      ~     (p=0.444 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0     1.56k ± 5%   0.41k ± 2%   -73.71%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100     336 ± 0%     336 ± 0%      ~     (all equal)
```

Release note (performance improvement): Replaced the Replica latching mechanism
with a new optimized data structure that improves throughput, especially
under heavy contention.
nvanbenschoten added a commit to nvanbenschoten/cockroach that referenced this pull request Dec 8, 2018
This commit replaces the CommandQueue with the spanlatch.Manager, which
was introduced in cockroachdb#31997. See that PR for an introduction to how the
structure differs from the CommandQueue and how it improves performance
on microbenchmarks.

This is mostly a mechanical change. One important detail is that it removes
the CommandQueue debug page. We found that the page was buggy (or straight
up broken) and that it wasn't actively used by members of Core when debugging
problems. In its place, the commit revives the "slow requests" metric for
latching, which hadn't been hooked up in over a year.
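
As a rough sketch of what hooking the metric up involves: a gauge is bumped when a
latch wait outlives some threshold and dropped once the latches are finally acquired.
The identifiers below (`slowRequestThreshold`, `slowLatchRequests`, `waitForLatches`)
are hypothetical stand-ins for illustration, not the codebase's actual names.

```go
package main

import (
	"fmt"
	"sync/atomic"
	"time"
)

// Hypothetical stand-ins for the real threshold constant and metrics gauge;
// the actual identifiers in the codebase may differ.
var (
	slowRequestThreshold = 60 * time.Second
	slowLatchRequests    int64 // gauge: requests waiting longer than the threshold
)

// waitForLatches blocks until the acquired channel closes. If the wait
// outlives slowRequestThreshold, the slow-request gauge is incremented for
// the remainder of the wait and decremented again on return.
func waitForLatches(acquired <-chan struct{}) {
	timer := time.NewTimer(slowRequestThreshold)
	defer timer.Stop()
	select {
	case <-acquired:
	case <-timer.C:
		atomic.AddInt64(&slowLatchRequests, 1)
		defer atomic.AddInt64(&slowLatchRequests, -1)
		<-acquired
	}
}

func main() {
	slowRequestThreshold = time.Millisecond // shrunk so the demo trips it
	acquired := make(chan struct{})
	go func() {
		time.Sleep(50 * time.Millisecond)
		// The waiter has outlived the threshold, so it shows up in the gauge.
		fmt.Println("mid-wait gauge:", atomic.LoadInt64(&slowLatchRequests)) // 1
		close(acquired)
	}()
	waitForLatches(acquired)
	fmt.Println("post-wait gauge:", atomic.LoadInt64(&slowLatchRequests)) // 0
}
```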

### Benchmarks

#### Standard Benchmarks

These are the standard benchmarks that we commonly run. They were run with
varying node sizes, cluster sizes, and pre-split counts.

```
name                              old ops/sec  new ops/sec  delta
kv0/cores=4/nodes=1/splits=0       1.99k ± 2%   2.06k ± 1%   +3.22%  (p=0.008 n=5+5)
kv0/cores=4/nodes=1/splits=100     2.25k ± 1%   2.38k ± 1%   +6.01%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0       1.60k ± 0%   1.69k ± 2%   +5.53%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100     3.52k ± 6%   3.65k ± 9%     ~     (p=0.421 n=5+5)
kv0/cores=16/nodes=1/splits=0      19.9k ± 1%   21.8k ± 1%   +9.34%  (p=0.008 n=5+5)
kv0/cores=16/nodes=1/splits=100    24.4k ± 1%   26.1k ± 1%   +7.17%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0      14.9k ± 1%   16.1k ± 1%   +8.03%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100    20.6k ± 1%   22.8k ± 1%  +10.79%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=0      31.2k ± 2%   35.3k ± 1%  +13.28%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100    45.7k ± 1%   51.1k ± 1%  +11.80%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0      23.7k ± 2%   27.1k ± 2%  +14.39%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=100    34.9k ± 2%   45.1k ± 1%  +29.44%  (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0      12.7k ± 2%   12.9k ± 2%   +1.39%  (p=0.151 n=5+5)
kv95/cores=4/nodes=1/splits=100    12.8k ± 2%   13.1k ± 2%   +2.10%  (p=0.032 n=5+5)
kv95/cores=4/nodes=3/splits=0      10.6k ± 1%   10.8k ± 1%   +1.58%  (p=0.056 n=5+5)
kv95/cores=4/nodes=3/splits=100    12.3k ± 7%   12.6k ± 8%   +2.61%  (p=0.095 n=5+5)
kv95/cores=16/nodes=1/splits=0     50.9k ± 1%   52.2k ± 1%   +2.37%  (p=0.008 n=5+5)
kv95/cores=16/nodes=1/splits=100   52.2k ± 1%   53.0k ± 1%   +1.49%  (p=0.008 n=5+5)
kv95/cores=16/nodes=3/splits=0     46.2k ± 1%   46.8k ± 1%   +1.32%  (p=0.032 n=5+5)
kv95/cores=16/nodes=3/splits=100   51.0k ± 1%   53.2k ± 1%   +4.25%  (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0     79.8k ± 2%  101.6k ± 1%  +27.31%  (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=100    104k ± 1%    107k ± 1%   +2.60%  (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0     85.8k ± 1%   91.8k ± 1%   +7.08%  (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=100    106k ± 1%    112k ± 1%   +5.51%  (p=0.008 n=5+5)

name                              old p50(ms)  new p50(ms)  delta
kv0/cores=4/nodes=1/splits=0        3.52 ± 5%    3.40 ± 0%   -3.41%  (p=0.016 n=5+4)
kv0/cores=4/nodes=1/splits=100      3.30 ± 0%    3.00 ± 0%   -9.09%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0        4.70 ± 0%    4.14 ± 9%  -11.91%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100      1.50 ± 0%    1.48 ± 8%     ~     (p=0.968 n=4+5)
kv0/cores=16/nodes=1/splits=0       1.40 ± 0%    1.40 ± 0%     ~     (all equal)
kv0/cores=16/nodes=1/splits=100     1.20 ± 0%    1.20 ± 0%     ~     (all equal)
kv0/cores=16/nodes=3/splits=0       2.00 ± 0%    1.90 ± 0%   -5.00%  (p=0.000 n=5+4)
kv0/cores=16/nodes=3/splits=100     1.40 ± 0%    1.40 ± 0%     ~     (all equal)
kv0/cores=36/nodes=1/splits=0       1.76 ± 3%    1.60 ± 0%   -9.09%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100     1.40 ± 0%    1.30 ± 0%   -7.14%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0       2.56 ± 2%    2.40 ± 0%   -6.25%  (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100     1.70 ± 0%    1.40 ± 0%  -17.65%  (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0       0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=4/nodes=1/splits=100     0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=4/nodes=3/splits=0       0.60 ± 0%    0.60 ± 0%     ~     (all equal)
kv95/cores=4/nodes=3/splits=100     0.60 ± 0%    0.60 ± 0%     ~     (all equal)
kv95/cores=16/nodes=1/splits=0      0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=16/nodes=1/splits=100    0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=16/nodes=3/splits=0      0.70 ± 0%    0.64 ± 9%   -8.57%  (p=0.167 n=5+5)
kv95/cores=16/nodes=3/splits=100    0.60 ± 0%    0.60 ± 0%     ~     (all equal)
kv95/cores=36/nodes=1/splits=0      0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=36/nodes=1/splits=100    0.50 ± 0%    0.50 ± 0%     ~     (all equal)
kv95/cores=36/nodes=3/splits=0      0.66 ± 9%    0.60 ± 0%   -9.09%  (p=0.167 n=5+5)
kv95/cores=36/nodes=3/splits=100    0.60 ± 0%    0.60 ± 0%     ~     (all equal)

name                              old p99(ms)  new p99(ms)  delta
kv0/cores=4/nodes=1/splits=0        11.0 ± 0%    10.5 ± 0%   -4.55%  (p=0.000 n=5+4)
kv0/cores=4/nodes=1/splits=100      7.90 ± 0%    7.60 ± 0%   -3.80%  (p=0.000 n=5+4)
kv0/cores=4/nodes=3/splits=0        15.7 ± 0%    15.2 ± 0%   -3.18%  (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100      8.90 ± 0%    8.12 ± 3%   -8.76%  (p=0.016 n=4+5)
kv0/cores=16/nodes=1/splits=0       3.46 ± 2%    3.00 ± 0%  -13.29%  (p=0.000 n=5+4)
kv0/cores=16/nodes=1/splits=100     4.50 ± 0%    3.36 ± 2%  -25.33%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0       4.50 ± 0%    3.90 ± 0%  -13.33%  (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100     5.80 ± 0%    4.10 ± 0%  -29.31%  (p=0.029 n=4+4)
kv0/cores=36/nodes=1/splits=0       6.80 ± 0%    5.20 ± 0%  -23.53%  (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100     5.80 ± 0%    4.32 ± 4%  -25.52%  (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0       7.72 ± 2%    6.30 ± 0%  -18.39%  (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100     7.98 ± 2%    5.20 ± 0%  -34.84%  (p=0.000 n=5+4)
kv95/cores=4/nodes=1/splits=0       5.38 ± 3%    5.20 ± 0%   -3.35%  (p=0.167 n=5+5)
kv95/cores=4/nodes=1/splits=100     5.00 ± 0%    5.00 ± 0%     ~     (all equal)
kv95/cores=4/nodes=3/splits=0       5.68 ± 3%    5.50 ± 0%   -3.17%  (p=0.095 n=5+4)
kv95/cores=4/nodes=3/splits=100     3.60 ±31%    2.93 ± 3%  -18.75%  (p=0.016 n=5+4)
kv95/cores=16/nodes=1/splits=0      4.10 ± 0%    4.10 ± 0%     ~     (all equal)
kv95/cores=16/nodes=1/splits=100    4.50 ± 0%    4.10 ± 0%   -8.89%  (p=0.000 n=5+4)
kv95/cores=16/nodes=3/splits=0      2.60 ± 0%    2.60 ± 0%     ~     (all equal)
kv95/cores=16/nodes=3/splits=100    2.50 ± 0%    1.90 ± 5%  -24.00%  (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0      6.60 ± 0%    6.00 ± 0%   -9.09%  (p=0.029 n=4+4)
kv95/cores=36/nodes=1/splits=100    5.50 ± 0%    5.12 ± 2%   -6.91%  (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0      4.18 ± 2%    4.02 ± 3%   -3.71%  (p=0.000 n=4+5)
kv95/cores=36/nodes=3/splits=100    3.80 ± 0%    2.80 ± 0%  -26.32%  (p=0.008 n=5+5)
```

#### Large-machine Benchmarks

These are the same standard benchmarks, run on a single-node cluster with 72 vCPUs.

```
name                              old ops/sec  new ops/sec  delta
kv0/cores=72/nodes=1/splits=0      31.0k ± 4%   36.4k ± 1%  +17.57%  (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100    44.0k ± 0%   49.0k ± 1%  +11.41%  (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0     52.7k ±18%   72.6k ±26%  +37.70%  (p=0.016 n=5+5)
kv95/cores=72/nodes=1/splits=100   66.8k ±17%   68.5k ± 5%     ~     (p=0.286 n=5+4)

name                              old p50(ms)  new p50(ms)  delta
kv0/cores=72/nodes=1/splits=0       2.30 ±13%    2.52 ± 5%     ~     (p=0.214 n=5+5)
kv0/cores=72/nodes=1/splits=100     3.00 ± 0%    2.90 ± 0%   -3.33%  (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0      0.46 ±13%    0.50 ± 0%     ~     (p=0.444 n=5+5)
kv95/cores=72/nodes=1/splits=100    0.44 ±14%    0.50 ± 0%  +13.64%  (p=0.167 n=5+5)

name                              old p99(ms)  new p99(ms)  delta
kv0/cores=72/nodes=1/splits=0       18.9 ± 6%    13.3 ± 5%  -29.56%  (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100     13.4 ± 2%    11.0 ± 0%  -17.91%  (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0      34.4 ±34%    23.5 ±24%  -31.74%  (p=0.048 n=5+5)
kv95/cores=72/nodes=1/splits=100    21.0 ± 0%    19.1 ± 4%   -8.81%  (p=0.029 n=4+4)
```

#### Motivating Benchmarks

These are benchmarks that used to generate a lot of contention in the CommandQueue.
They have small cycle lengths (i.e., small key spaces), indicated by the `c` specifier.
The last one also includes 20% scan operations, which increases contention between
non-overlapping point operations.
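
As a toy illustration of why scans couple otherwise-independent point operations:
two point writes on different keys never conflict with each other, but a scan's
ranged span can conflict with both. The half-open span model and the `"\x00"`
point-key encoding below are simplifying assumptions for the example, not the
real keys package.

```go
package main

import "fmt"

// overlaps reports whether [aStart, aEnd) and [bStart, bEnd) intersect.
// A point operation on key k is modeled as the tiny span [k, k+"\x00").
func overlaps(aStart, aEnd, bStart, bEnd string) bool {
	return aStart < bEnd && bStart < aEnd
}

func main() {
	// Point writes on "a" and "q" are disjoint and can proceed in parallel.
	fmt.Println(overlaps("a", "a\x00", "q", "q\x00")) // false

	// A scan over [a, z) overlaps both point writes, so each write must
	// wait on the scan (or vice versa), creating contention between
	// operations that never conflict with one another directly.
	fmt.Println(overlaps("a", "z", "a", "a\x00")) // true
	fmt.Println(overlaps("a", "z", "q", "q\x00")) // true
}
```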

```
name                                    old ops/sec  new ops/sec  delta
kv95-c5/cores=16/nodes=1/splits=0        45.1k ± 1%   47.2k ± 4%   +4.59%  (p=0.008 n=5+5)
kv95-c5/cores=36/nodes=1/splits=0        44.6k ± 1%   76.3k ± 1%  +71.05%  (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0      27.2k ± 2%   29.4k ± 1%   +8.12%  (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0      42.6k ± 2%   50.0k ± 1%  +17.39%  (p=0.008 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0   28.7k ± 1%   29.8k ± 3%   +3.87%  (p=0.008 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0   41.9k ± 4%   52.8k ± 2%  +25.97%  (p=0.008 n=5+5)

name                                    old p50(ms)  new p50(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0         0.60 ± 0%    0.60 ± 0%     ~     (all equal)
kv95-c5/cores=36/nodes=1/splits=0         0.90 ± 0%    0.80 ± 0%  -11.11%  (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0       1.10 ± 0%    1.06 ± 6%     ~     (p=0.444 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0       1.26 ± 5%    1.30 ± 0%     ~     (p=0.444 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0    0.66 ± 9%    0.60 ± 0%   -9.09%  (p=0.167 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0    0.70 ± 0%    0.50 ± 0%  -28.57%  (p=0.008 n=5+5)

name                                    old p99(ms)  new p99(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0         2.40 ± 0%    2.10 ± 0%  -12.50%  (p=0.000 n=5+4)
kv95-c5/cores=36/nodes=1/splits=0         5.80 ± 0%    3.30 ± 0%  -43.10%  (p=0.000 n=5+4)
kv50-c128/cores=16/nodes=1/splits=0       3.50 ± 0%    3.00 ± 0%  -14.29%  (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0       6.80 ± 0%    4.70 ± 0%  -30.88%  (p=0.079 n=4+5)
kv70-20-c128/cores=16/nodes=1/splits=0    5.00 ± 0%    4.70 ± 0%   -6.00%  (p=0.029 n=4+4)
kv70-20-c128/cores=36/nodes=1/splits=0    11.0 ± 0%     6.8 ± 0%  -38.18%  (p=0.008 n=5+5)
```

#### Batching Benchmarks

One optimization left out of the new spanlatch.Manager was the "covering" optimization,
where commands were initially added to the interval tree as a single spanning interval
and only expanded into their individual spans later. I ran a series of benchmarks to
verify that this optimization was no longer needed. My hypothesis was that the
order-of-magnitude increase in the speed of the interval tree would make the
optimization unnecessary.

It turns out that removing the optimization hurt a few benchmarks to a small
degree but sped up others tremendously (some benchmarks improved by over 400%).
I suspect that the covering optimization could actually hurt in cases where it
caused non-overlapping requests to appear to overlap. It is interesting how
quickly a few of these benchmarks swing from small losses to big wins. It makes
me think that there's some non-linear behavior in the old CommandQueue that
would cause its performance to degrade quickly once it became a contention
bottleneck.

```
name                                    old ops/sec  new ops/sec  delta
kv0-b16/cores=4/nodes=1/splits=0         2.41k ± 0%   2.06k ± 3%   -14.75%  (p=0.008 n=5+5)
kv0-b16/cores=4/nodes=1/splits=100         514 ± 0%     534 ± 1%    +3.88%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0        2.95k ± 0%   4.35k ± 0%   +47.74%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100      1.80k ± 1%   1.88k ± 1%    +4.46%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0        2.74k ± 0%   4.92k ± 1%   +79.55%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100      2.39k ± 1%   2.45k ± 1%    +2.41%  (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0          422 ± 0%     518 ± 1%   +22.60%  (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100       98.4 ± 1%    98.8 ± 1%      ~     (p=0.810 n=5+5)
kv0-b128/cores=16/nodes=1/splits=0         532 ± 0%    1059 ± 0%   +99.16%  (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100       291 ± 1%     307 ± 1%    +5.18%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0         483 ± 0%    1288 ± 1%  +166.37%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100       394 ± 1%     408 ± 1%    +3.51%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0        49.7 ± 1%    72.8 ± 1%   +46.52%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100      30.8 ± 0%    23.4 ± 0%   -24.03%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0       48.9 ± 2%   160.6 ± 0%  +228.38%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=100      101 ± 1%      80 ± 0%   -21.64%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0       37.5 ± 0%   208.1 ± 1%  +454.99%  (p=0.016 n=4+5)
kv0-b1024/cores=36/nodes=1/splits=100      162 ± 0%     124 ± 0%   -23.22%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0        5.93k ± 0%   6.20k ± 1%    +4.55%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100      2.27k ± 1%   2.32k ± 1%    +2.28%  (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=0       5.15k ± 1%  18.79k ± 1%  +264.73%  (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=100     8.31k ± 1%   8.57k ± 1%    +3.16%  (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0       3.96k ± 0%  10.67k ± 1%  +169.81%  (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=100     15.7k ± 2%   16.2k ± 4%    +2.75%  (p=0.151 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0       1.12k ± 1%   1.27k ± 0%   +13.28%  (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100       290 ± 1%     299 ± 1%    +3.02%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0      1.06k ± 0%   3.31k ± 0%  +213.09%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100      662 ±91%    1095 ± 1%   +65.42%  (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0        715 ± 2%    3586 ± 0%  +401.21%  (p=0.008 n=5+5)
kv95-b128/cores=36/nodes=1/splits=100    1.15k ±90%   2.01k ± 2%   +74.79%  (p=0.016 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0        134 ± 1%     170 ± 1%   +26.59%  (p=0.008 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=100     54.8 ± 3%    53.3 ± 3%    -2.84%  (p=0.056 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=0       104 ± 3%     367 ± 1%  +252.37%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100     210 ± 1%     214 ± 1%    +1.86%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0      76.5 ± 2%   383.9 ± 1%  +401.67%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100     431 ± 1%     436 ± 1%    +1.17%  (p=0.016 n=5+5)

name                                    old p50(ms)  new p50(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0          3.00 ± 0%    3.40 ± 0%   +13.33%  (p=0.016 n=5+4)
kv0-b16/cores=4/nodes=1/splits=100        15.2 ± 0%    14.7 ± 0%    -3.29%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0         10.5 ± 0%     7.7 ± 2%   -26.48%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100       17.8 ± 0%    16.8 ± 0%    -5.62%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0         26.2 ± 0%    14.2 ± 0%   -45.80%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100       29.0 ± 2%    28.3 ± 0%    -2.28%  (p=0.095 n=5+4)
kv0-b128/cores=4/nodes=1/splits=0         17.8 ± 0%    15.2 ± 0%   -14.61%  (p=0.000 n=5+4)
kv0-b128/cores=4/nodes=1/splits=100       79.7 ± 0%    79.7 ± 0%      ~     (all equal)
kv0-b128/cores=16/nodes=1/splits=0        65.0 ± 0%    32.5 ± 0%   -50.00%  (p=0.029 n=4+4)
kv0-b128/cores=16/nodes=1/splits=100       109 ± 0%     105 ± 0%    -3.85%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0         168 ± 0%      50 ± 0%   -70.02%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100       184 ± 0%     176 ± 0%    -4.50%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0         159 ± 0%     109 ± 0%   -31.56%  (p=0.000 n=5+4)
kv0-b1024/cores=4/nodes=1/splits=100       252 ± 0%     319 ± 0%   +26.66%  (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0        705 ± 0%     193 ± 0%   -72.62%  (p=0.000 n=5+4)
kv0-b1024/cores=16/nodes=1/splits=100      319 ± 0%     386 ± 0%   +21.05%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0      1.88k ± 0%   0.24k ± 0%   -87.05%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100      436 ± 0%     570 ± 0%   +30.77%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0         1.20 ± 0%    1.20 ± 0%      ~     (all equal)
kv95-b16/cores=4/nodes=1/splits=100       2.60 ± 0%    2.60 ± 0%      ~     (all equal)
kv95-b16/cores=16/nodes=1/splits=0        6.30 ± 0%    1.40 ± 0%   -77.78%  (p=0.000 n=5+4)
kv95-b16/cores=16/nodes=1/splits=100      1.74 ± 3%    1.76 ± 3%      ~     (p=1.000 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0        11.5 ± 0%     5.5 ± 0%   -52.17%  (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100      2.42 ±20%    2.42 ±45%      ~     (p=0.579 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0        6.60 ± 0%    6.00 ± 0%    -9.09%  (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100      21.4 ± 3%    21.0 ± 0%      ~     (p=0.444 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0       30.4 ± 0%     9.4 ± 0%   -69.08%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100     38.2 ±76%    21.2 ± 4%   -44.31%  (p=0.063 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0       88.1 ± 0%    16.8 ± 0%   -80.93%  (p=0.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100     56.6 ±85%    29.6 ±15%      ~     (p=0.873 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0       52.4 ± 0%    44.0 ± 0%   -16.03%  (p=0.029 n=4+4)
kv95-b1024/cores=4/nodes=1/splits=100      132 ± 2%     143 ± 0%    +8.29%  (p=0.016 n=5+4)
kv95-b1024/cores=16/nodes=1/splits=0       325 ± 3%      80 ± 0%   -75.51%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100     151 ± 0%     151 ± 0%      ~     (all equal)
kv95-b1024/cores=36/nodes=1/splits=0       973 ± 0%     180 ± 3%   -81.55%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100     168 ± 0%     168 ± 0%      ~     (all equal)

name                                    old p99(ms)  new p99(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0          8.40 ± 0%   10.30 ± 3%   +22.62%  (p=0.016 n=4+5)
kv0-b16/cores=4/nodes=1/splits=100        29.4 ± 0%    27.3 ± 0%    -7.14%  (p=0.000 n=5+4)
kv0-b16/cores=16/nodes=1/splits=0         16.3 ± 0%    15.5 ± 2%    -4.91%  (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100       31.5 ± 0%    29.4 ± 0%    -6.67%  (p=0.000 n=5+4)
kv0-b16/cores=36/nodes=1/splits=0         37.7 ± 0%    28.7 ± 2%   -23.77%  (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100       62.1 ± 2%    68.4 ±10%   +10.15%  (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0         37.7 ± 0%    39.4 ± 6%    +4.46%  (p=0.167 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100        143 ± 0%     151 ± 0%    +5.89%  (p=0.016 n=4+5)
kv0-b128/cores=16/nodes=1/splits=0        79.7 ± 0%    55.8 ± 2%   -30.04%  (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100       198 ± 3%     188 ± 3%    -5.09%  (p=0.048 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0         184 ± 0%     126 ± 3%   -31.82%  (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100       319 ± 0%     336 ± 0%    +5.24%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0         322 ± 6%     253 ± 4%   -21.35%  (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100       470 ± 0%     772 ± 4%   +64.28%  (p=0.016 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=0      1.41k ± 0%   0.56k ±11%   -60.00%  (p=0.000 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=100      530 ± 2%     772 ± 0%   +45.57%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0      4.05k ± 7%   1.17k ± 3%   -71.19%  (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100      792 ±14%    1020 ± 2%   +28.81%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0         3.90 ± 0%    3.22 ± 4%   -17.44%  (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100       21.0 ± 0%    19.9 ± 0%    -5.24%  (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=0        15.2 ± 0%     7.1 ± 0%   -53.29%  (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=100      38.5 ± 3%    37.7 ± 0%      ~     (p=0.333 n=5+4)
kv95-b16/cores=36/nodes=1/splits=0         128 ± 2%      52 ± 0%   -59.16%  (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100      41.1 ±13%    39.2 ±33%      ~     (p=0.984 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0        17.8 ± 0%    14.7 ± 0%   -17.42%  (p=0.079 n=4+5)
kv95-b128/cores=4/nodes=1/splits=100       107 ± 2%     106 ± 5%      ~     (p=0.683 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0       75.5 ± 0%    23.1 ± 0%   -69.40%  (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100      107 ±34%     120 ± 2%      ~     (p=1.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0        253 ± 4%      71 ± 0%   -71.86%  (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100      166 ±19%     164 ±74%      ~     (p=0.310 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=0        146 ± 3%     101 ± 0%   -31.01%  (p=0.000 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=100      348 ± 4%     366 ± 6%      ~     (p=0.317 n=4+5)
kv95-b1024/cores=16/nodes=1/splits=0       624 ± 3%     221 ± 2%   -64.52%  (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100     325 ± 3%     319 ± 0%      ~     (p=0.444 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0     1.56k ± 5%   0.41k ± 2%   -73.71%  (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100     336 ± 0%     336 ± 0%      ~     (all equal)
```

Release note (performance improvement): Replaced the Replica latching mechanism
with a new optimized data structure that improves throughput, especially
under heavy contention.
craig bot pushed a commit that referenced this pull request Dec 8, 2018
32865: storage: replace CommandQueue with spanlatch.Manager r=nvanbenschoten a=nvanbenschoten

Co-authored-by: Nathan VanBenschoten <[email protected]>