storage/spanlatch: create spanlatch.Manager using immutable btrees #31997
Conversation
There's still plenty of room to optimize the btree implementation that we use here. Until this point we've closely followed …
32164: storage/cmdq: create new signal type for cmd completion signaling r=nvanbenschoten a=nvanbenschoten

`signal` is a type that can signal the completion of an operation. This is a component of the larger change in #31997.

The type has three benefits over using a channel directly and closing the channel when the operation completes:
1. signaled() uses atomics to provide a fast-path for checking whether the operation has completed. It is ~75x faster than using a channel for this purpose.
2. the type's channel is lazily initialized when signalChan() is called, avoiding the allocation when one is not needed.
3. because of 2, the type's zero value can be used directly.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
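As a rough sketch of those three properties, something like the following would work; the names and details below are illustrative assumptions, not the PR's actual code:

```
package main

import (
	"sync"
	"sync/atomic"
)

// closedChan is a shared, already-closed channel handed to callers
// that ask for a channel after the signal has fired.
var closedChan = func() chan struct{} {
	ch := make(chan struct{})
	close(ch)
	return ch
}()

// sketchSignal approximates the described `signal` type: an atomic
// word gives the fast-path check, and the channel is only allocated
// for callers that actually block. Its zero value works as-is.
type sketchSignal struct {
	done int32
	mu   sync.Mutex
	ch   chan struct{}
}

// signaled is the fast path: one atomic load, no channel operations.
func (s *sketchSignal) signaled() bool {
	return atomic.LoadInt32(&s.done) == 1
}

// signalChan lazily allocates the channel, so callers that never
// block never pay for the allocation.
func (s *sketchSignal) signalChan() <-chan struct{} {
	if s.signaled() {
		return closedChan
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.signaled() { // re-check under the lock
		return closedChan
	}
	if s.ch == nil {
		s.ch = make(chan struct{})
	}
	return s.ch
}

// signal marks the operation complete and wakes any waiters. It must
// be called at most once.
func (s *sketchSignal) signal() {
	s.mu.Lock()
	defer s.mu.Unlock()
	atomic.StoreInt32(&s.done, 1)
	if s.ch != nil {
		close(s.ch)
	}
}
```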
32165: storage/cmdq: create new specialized augmented interval btree r=nvanbenschoten a=nvanbenschoten

This is a component of the larger change in #31997.

The first few commits here modify the existing interval btree implementation, allowing us to properly benchmark against it.

The second to last commit forks https://github.com/petermattis/pebble/blob/master/internal/btree/btree.go, specializes it to the command queue, and rips out any references to pebble. There are a number of changes we'll need to make to it:
1. Add synchronized node and leafNode freelists
2. Add Clear method to release owned nodes into freelists
3. Introduce immutability and a copy-on-write policy

The next commit modifies the btree type added in the previous commit and turns it into an augmented interval tree. The tree represents intervals and permits an interval search operation following the approach laid out in CLRS, Chapter 14. The B-Tree stores cmds in order based on their start key and each B-Tree node maintains the upper-bound end key of all cmds in its subtree. This is close to what `util/interval.btree` does, although the new version doesn't maintain the lower-bound start key of all cmds in each node.

The new interval btree is significantly faster than both the old interval btree and the old interval llrb tree because it minimizes key comparisons while scanning for overlaps. This includes avoiding all key comparisons for cmds with start keys that are greater than the search range's start key. See the comment on `overlapScan` for an explanation of how this is possible.

The new interval btree is also faster because it has been specialized for the `storage/cmdq` package. This allows it to avoid interfaces and dynamic dispatch throughout its operations, which showed up prominently on profiles of the other two implementations.

A third benefit of the rewrite is that it inherits the optimizations made in pebble's btree. This includes inlining the btree items and child pointers in nodes instead of using slices.
### Benchmarks:

_The new interval btree:_
```
Insert/count=16-4               76.1ns ± 4%
Insert/count=128-4               156ns ± 4%
Insert/count=1024-4              259ns ± 8%
Insert/count=8192-4              386ns ± 1%
Insert/count=65536-4             735ns ± 5%
Delete/count=16-4                129ns ±16%
Delete/count=128-4               189ns ±12%
Delete/count=1024-4              338ns ± 7%
Delete/count=8192-4              547ns ± 4%
Delete/count=65536-4            1.22µs ±12%
DeleteInsert/count=16-4          168ns ± 2%
DeleteInsert/count=128-4         375ns ± 8%
DeleteInsert/count=1024-4        562ns ± 1%
DeleteInsert/count=8192-4        786ns ± 3%
DeleteInsert/count=65536-4      2.31µs ±26%
IterSeekGE/count=16-4           87.2ns ± 3%
IterSeekGE/count=128-4           141ns ± 3%
IterSeekGE/count=1024-4          227ns ± 4%
IterSeekGE/count=8192-4          379ns ± 2%
IterSeekGE/count=65536-4         882ns ± 1%
IterSeekLT/count=16-4           89.5ns ± 3%
IterSeekLT/count=128-4           145ns ± 1%
IterSeekLT/count=1024-4          226ns ± 6%
IterSeekLT/count=8192-4          379ns ± 1%
IterSeekLT/count=65536-4         891ns ± 1%
IterFirstOverlap/count=16-4      184ns ± 1%
IterFirstOverlap/count=128-4     260ns ± 3%
IterFirstOverlap/count=1024-4    685ns ± 7%
IterFirstOverlap/count=8192-4   1.23µs ± 2%
IterFirstOverlap/count=65536-4  2.14µs ± 1%
IterNext-4                      3.82ns ± 2%
IterPrev-4                      14.8ns ± 2%
IterNextOverlap-4               8.57ns ± 2%
IterOverlapScan-4               25.8µs ± 3%
```

_Compared to old llrb interval tree (currently in use):_
```
Insert/count=16-4           323ns ± 7%    76ns ± 4%  -76.43%  (p=0.008 n=5+5)
Insert/count=128-4          539ns ± 2%   156ns ± 4%  -71.05%  (p=0.008 n=5+5)
Insert/count=1024-4         797ns ± 1%   259ns ± 8%  -67.52%  (p=0.008 n=5+5)
Insert/count=8192-4        1.30µs ± 5%  0.39µs ± 1%  -70.38%  (p=0.008 n=5+5)
Insert/count=65536-4       2.69µs ±11%  0.74µs ± 5%  -72.65%  (p=0.008 n=5+5)
Delete/count=16-4           438ns ± 7%   129ns ±16%  -70.44%  (p=0.008 n=5+5)
Delete/count=128-4          785ns ± 6%   189ns ±12%  -75.89%  (p=0.008 n=5+5)
Delete/count=1024-4        1.38µs ± 2%  0.34µs ± 7%  -75.44%  (p=0.008 n=5+5)
Delete/count=8192-4        2.36µs ± 2%  0.55µs ± 4%  -76.82%  (p=0.008 n=5+5)
Delete/count=65536-4       4.73µs ±13%  1.22µs ±12%  -74.19%  (p=0.008 n=5+5)
DeleteInsert/count=16-4     920ns ± 2%   168ns ± 2%  -81.76%  (p=0.008 n=5+5)
DeleteInsert/count=128-4   1.73µs ± 4%  0.37µs ± 8%  -78.35%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4  2.69µs ± 3%  0.56µs ± 1%  -79.15%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4  4.55µs ±25%  0.79µs ± 3%  -82.70%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4 7.53µs ± 6%  2.31µs ±26%  -69.32%  (p=0.008 n=5+5)
IterOverlapScan-4           285µs ± 7%    26µs ± 3%  -90.96%  (p=0.008 n=5+5)
```

_Compared to old btree interval tree (added in a61191e, never enabled):_
```
Insert/count=16-4           231ns ± 1%    76ns ± 4%  -66.99%  (p=0.008 n=5+5)
Insert/count=128-4          351ns ± 2%   156ns ± 4%  -55.53%  (p=0.008 n=5+5)
Insert/count=1024-4         515ns ± 5%   259ns ± 8%  -49.73%  (p=0.008 n=5+5)
Insert/count=8192-4         786ns ± 3%   386ns ± 1%  -50.85%  (p=0.008 n=5+5)
Insert/count=65536-4       1.50µs ± 3%  0.74µs ± 5%  -50.97%  (p=0.008 n=5+5)
Delete/count=16-4           363ns ±11%   129ns ±16%  -64.33%  (p=0.008 n=5+5)
Delete/count=128-4          466ns ± 9%   189ns ±12%  -59.42%  (p=0.008 n=5+5)
Delete/count=1024-4         806ns ± 6%   338ns ± 7%  -58.01%  (p=0.008 n=5+5)
Delete/count=8192-4        1.43µs ±13%  0.55µs ± 4%  -61.71%  (p=0.008 n=5+5)
Delete/count=65536-4       2.75µs ± 1%  1.22µs ±12%  -55.57%  (p=0.008 n=5+5)
DeleteInsert/count=16-4     557ns ± 1%   168ns ± 2%  -69.87%  (p=0.008 n=5+5)
DeleteInsert/count=128-4    953ns ± 8%   375ns ± 8%  -60.71%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4  1.19µs ± 4%  0.56µs ± 1%  -52.72%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4  1.84µs ±17%  0.79µs ± 3%  -57.22%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4 3.20µs ± 3%  2.31µs ±26%  -27.86%  (p=0.008 n=5+5)
IterOverlapScan-4          70.1µs ± 2%  25.8µs ± 3%  -63.23%  (p=0.008 n=5+5)
```

Co-authored-by: Nathan VanBenschoten <[email protected]>
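To make the pruning concrete, here is a hedged sketch of the CLRS-style search that the upper-bound augmentation enables. The types, the half-open `[start, end)` span convention, and the function names are assumptions for illustration, not the PR's actual `overlapScan`:

```
package main

import "bytes"

// cmd and node are simplified stand-ins for the specialized tree's
// types; only the fields needed for the augmented search are shown.
type cmd struct {
	start, end []byte // the interval [start, end)
}

type node struct {
	cmds     []cmd
	children []*node // empty for leaf nodes
	max      []byte  // upper bound on end keys of the whole subtree
}

// overlaps reports whether [qStart, qEnd) intersects the cmd's span.
func (c *cmd) overlaps(qStart, qEnd []byte) bool {
	return bytes.Compare(c.start, qEnd) < 0 && bytes.Compare(qStart, c.end) < 0
}

// anyOverlap walks the tree CLRS-style: a subtree whose max end key
// is <= the query start key can be pruned without any per-cmd key
// comparisons, which is where the augmentation pays off.
func anyOverlap(n *node, qStart, qEnd []byte) bool {
	if n == nil || bytes.Compare(n.max, qStart) <= 0 {
		return false // the whole subtree ends before the query starts
	}
	for i := range n.cmds {
		if n.cmds[i].overlaps(qStart, qEnd) {
			return true
		}
		if len(n.children) > i && anyOverlap(n.children[i], qStart, qEnd) {
			return true
		}
		// cmds are sorted by start key, so once a cmd starts at or
		// after the query end, nothing further right can overlap.
		if bytes.Compare(n.cmds[i].start, qEnd) >= 0 {
			return false
		}
	}
	if len(n.children) > len(n.cmds) {
		return anyOverlap(n.children[len(n.cmds)], qStart, qEnd)
	}
	return false
}
```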
32251: storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy r=nvanbenschoten a=nvanbenschoten

All commits from #32165 except the last one.

This change introduces O(1) btree cloning and a new copy-on-write scheme, essentially giving the btree an immutable API (for which I took inspiration from https://docs.rs/crate/im/). This is made efficient by the second part of the change - a new garbage collection policy for btrees. Nodes are now reference counted atomically and freed into global `sync.Pool`s when they are no longer referenced.

One of the main ideas in #31997 is to treat the btrees backing the command queue as immutable structures. In doing so, we adopt a copy-on-write scheme. Trees are cloned under lock and then accessed concurrently. When future writers want to modify the tree, they can do so by cloning any nodes that they touch. This commit provides this functionality in a much more elegant manner than 6994347. Instead of giving each node a "copy-on-write context", we instead give each node a reference count. We then use the following rules:
1. Trees with exclusive ownership (refcount == 1) over a node can modify it in-place.
2. Trees without exclusive ownership over a node must clone the node in order to modify it. Once cloned, the tree will now have exclusive ownership over that node. When cloning the node, the reference count of all of the node's children must be incremented.

In following these simple rules, we end up with a really nice property - trees gain more and more "ownership" as they make modifications, meaning that subsequent modifications are much less likely to need to clone nodes. Essentially, we transparently incorporate the idea of local mutations (e.g. Clojure's transients or Haskell's ST monad) without any external API needed.

Even better, reference counting internal nodes ties directly into the new GC policy, which allows us to recycle old nodes and make the copy-on-write scheme zero-allocation in almost all cases. When a node's reference count drops to 0, we simply toss it into a `sync.Pool`. We keep two separate pools - one for leaf nodes and one for non-leaf nodes. This wasn't possible with the previous "copy-on-write context" approach.

The atomic reference counting does have an effect on benchmarks, but it's not a big one (single/double digit ns) and is negligible compared to the speedup observed in #32165.

```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```

We can see how important the GC and memory re-use policy is by comparing the following few benchmarks. Specifically, notice the difference in operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime` between the tests that `Reset` old clones (allowing nodes to be freed into `sync.Pool`s) and the tests that don't `Reset` old clones.

```
name                                                      time/op
BTreeDeleteInsert/count=16-4                               198ns ±28%
BTreeDeleteInsert/count=128-4                              375ns ± 3%
BTreeDeleteInsert/count=1024-4                             577ns ± 2%
BTreeDeleteInsert/count=8192-4                             798ns ± 1%
BTreeDeleteInsert/count=65536-4                           2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4                      173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4                     379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4                    584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4                    800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4                  2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4    1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4    1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4    1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4   3.52µs ±18%

name                                                      alloc/op
BTreeDeleteInsert/count=16-4                               0.00B
BTreeDeleteInsert/count=128-4                              0.00B
BTreeDeleteInsert/count=1024-4                             0.00B
BTreeDeleteInsert/count=8192-4                             0.00B
BTreeDeleteInsert/count=65536-4                            0.00B
BTreeDeleteInsertCloneOnce/count=16-4                      0.00B
BTreeDeleteInsertCloneOnce/count=128-4                     0.00B
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00B
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00B
BTreeDeleteInsertCloneOnce/count=65536-4                   1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00B

name                                                      allocs/op
BTreeDeleteInsert/count=16-4                               0.00
BTreeDeleteInsert/count=128-4                              0.00
BTreeDeleteInsert/count=1024-4                             0.00
BTreeDeleteInsert/count=8192-4                             0.00
BTreeDeleteInsert/count=65536-4                            0.00
BTreeDeleteInsertCloneOnce/count=16-4                      0.00
BTreeDeleteInsertCloneOnce/count=128-4                     0.00
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00
BTreeDeleteInsertCloneOnce/count=65536-4                   0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4     2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4   4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00
```

Co-authored-by: Nathan VanBenschoten <[email protected]>
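The two ownership rules lend themselves to a compact sketch. The following is illustrative only, with assumed names like `mutableFor`; the real node also carries keys and items:

```
package main

import "sync/atomic"

// node is a pared-down btree node: a refcount plus child pointers,
// with keys and items elided. All names here are illustrative.
type node struct {
	ref      int32
	children []*node
}

func (n *node) incRef() { atomic.AddInt32(&n.ref, 1) }

// decRef drops a reference; at zero, the real scheme would push the
// node back into a leaf or non-leaf sync.Pool (elided here).
func (n *node) decRef() {
	if atomic.AddInt32(&n.ref, -1) == 0 {
		// return n to the appropriate sync.Pool in the real code
	}
}

// mutableFor applies the two ownership rules from the commit message.
func mutableFor(n *node) *node {
	if atomic.LoadInt32(&n.ref) == 1 {
		// Rule 1: exclusive ownership, so mutate in place.
		return n
	}
	// Rule 2: shared, so clone. The cloning tree now exclusively owns
	// the clone, and the clone takes a new reference on every child.
	c := &node{ref: 1, children: append([]*node(nil), n.children...)}
	for _, child := range c.children {
		child.incRef()
	}
	// The caller's tree gives up its reference to the original node.
	n.decRef()
	return c
}
```

Because a fresh clone starts with `ref == 1`, the next mutation of that same node hits rule 1 directly; this is the "trees gain more and more ownership" property described above.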
A few things to note about the testing here:
I definitely need another pass. These are just the nits I've spotted in a first skim.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/spanlatch/doc.go, line 35 at r5 (raw file):
key ranges was permitted. Conceptually, the structure became an interval tree of sync.RWMutexes.
* The structure become timestamp-aware and concurrent access of non-causal
s/become/became/
pkg/storage/spanlatch/manager.go, line 77 at r5 (raw file):
}

// latches are stored in the Manager's btrees. The represent the latching of a
s/The/They/
pkg/storage/spanlatch/manager.go, line 137 at r5 (raw file):
}

// Guard would be an ideal candidate for object pooling, but without
Nit: move the guard and latch construction to a helper
pkg/storage/spanlatch/manager.go, line 280 at r5 (raw file):
realloc := len(sm.rSet) > 16
for latch := range sm.rSet {
	latch.setInRSet(false)
if the TODO is about exploiting the efficient map clearing idiom, I think it needs to be the only statement in the loop https://go-review.googlesource.com/c/go/+/110055/.
perhaps rewrite this as:
for latch := range sm.rSet {
latch.setInRSet(false)
sm.trees[spanset.SpanReadOnly].Set(latch)
}
if realloc := len(sm.rSet) > 16; realloc {
sm.rSet = make(map[*latch]struct{})
} else {
for latch := range sm.rSet {
delete(sm.rSet, latch)
}
}
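For reference, the idiom that linked Go change recognizes has exactly this shape, with the delete as the loop's only statement (lowered to a single runtime map clear from Go 1.11 on):

```
// Compiles to one runtime map-clear call because the delete is the
// loop's only statement; adding anything else disables the rewrite.
for latch := range sm.rSet {
	delete(sm.rSet, latch)
}
```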
This change renames `storage/cmdq` to `storage/spanlatch`. The package will house the new `spanlatch.Manager` type, which will handle the acquisition and release of span-latches. This works off of the definition for latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). The files are not changed in this commit.

Release note: None
This commit replaces all references to cmds with references to latches.

Release note: None
Informs cockroachdb#4768. Informs cockroachdb#31904.

This change was inspired by cockroachdb#31904 and is a progression of the thinking started in cockroachdb#4768 (comment).

The change introduces `spanlatch.Manager`, which will replace the `CommandQueue` **in a future PR**. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in the storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes cockroachdb#31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue, and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because `Replica.beginCmds` repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges.

The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock.

While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the `spanlatch.Guard`. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting btree nodes, which allows us to release them back into the btree node pools when they're no longer referenced by any btree snapshots. This means that we don't expect any allocations when inserting into the internal trees, even with the COW policy.

Finally, this will make the approach taken in cockroachdb#31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free.

### Naming changes

| Before | After |
|----------------------------|-----------------------------------|
| `CommandQueue` | `spanlatch.Manager` |
| "enter the command queue" | "acquire span latches" |
| "exit the command queue" | "release span latches" |
| "wait for prereq commands" | "wait for latches to be released" |

The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is that it is critical for understanding that we don't confuse the operational locking performed by the CommandQueue/spanlatch.Manager with the transaction-scoped locking enforced by intents and our transactional concurrency control model.

### Microbenchmarks

NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure.

```
name                              cmdq time/op  spanlatch time/op  delta
ReadOnlyMix/size=1-4               897ns ±21%    917ns ±18%     ~     (p=0.897 n=8+10)
ReadOnlyMix/size=4-4               827ns ±22%    772ns ±15%     ~     (p=0.448 n=10+10)
ReadOnlyMix/size=16-4              905ns ±19%    770ns ±10%  -14.90%  (p=0.004 n=10+10)
ReadOnlyMix/size=64-4              907ns ±20%    730ns ±15%  -19.51%  (p=0.001 n=10+10)
ReadOnlyMix/size=128-4             926ns ±17%    731ns ±11%  -21.04%  (p=0.000 n=9+10)
ReadOnlyMix/size=256-4             977ns ±19%    726ns ± 9%  -25.65%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=0-4    12.5µs ± 4%    0.7µs ±17%  -94.70%  (p=0.000 n=8+9)
ReadWriteMix/readsPerWrite=1-4    8.18µs ± 5%   0.63µs ± 6%  -92.24%  (p=0.000 n=10+9)
ReadWriteMix/readsPerWrite=4-4    3.80µs ± 2%   0.66µs ± 5%  -82.58%  (p=0.000 n=8+10)
ReadWriteMix/readsPerWrite=16-4   1.82µs ± 2%   0.70µs ± 5%  -61.43%  (p=0.000 n=9+10)
ReadWriteMix/readsPerWrite=64-4    894ns ±12%    514ns ± 6%  -42.48%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=128-4   717ns ± 5%    472ns ± 1%  -34.21%  (p=0.000 n=10+8)
ReadWriteMix/readsPerWrite=256-4   607ns ± 5%    453ns ± 3%  -25.35%  (p=0.000 n=7+10)

name                              cmdq alloc/op  spanlatch alloc/op  delta
ReadOnlyMix/size=1-4                223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=4-4                223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=16-4               223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=64-4               223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=128-4              223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=256-4              223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=0-4      915B ± 0%      144B ± 0%  -84.26%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=1-4      730B ± 0%      144B ± 0%  -80.29%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=4-4      486B ± 0%      144B ± 0%  -70.35%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=16-4     350B ± 0%      144B ± 0%  -58.86%  (p=0.000 n=9+10)
ReadWriteMix/readsPerWrite=64-4     222B ± 0%      144B ± 0%  -35.14%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=128-4    199B ± 0%      144B ± 0%  -27.64%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=256-4    188B ± 0%      144B ± 0%  -23.40%  (p=0.000 n=10+10)

name                              cmdq allocs/op  spanlatch allocs/op  delta
ReadOnlyMix/size=1-4                1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=4-4                1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=16-4               1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=64-4               1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=128-4              1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=256-4              1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=0-4      34.0 ± 0%        1.0 ± 0%  -97.06%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=1-4      22.0 ± 0%        1.0 ± 0%  -95.45%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=4-4      10.0 ± 0%        1.0 ± 0%  -90.00%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=16-4     4.00 ± 0%       1.00 ± 0%  -75.00%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=64-4     1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=128-4    1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=256-4    1.00 ± 0%       1.00 ± 0%     ~     (all equal)
```

Release note: None
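In outline, the sequencing fast path described above could look like the sketch below. All names and fields are assumptions for illustration; the real `spanlatch.Manager` API differs:

```
package main

import "sync"

// btree stands in for the immutable interval btree; Clone is O(1)
// because it only copies the root pointer (refcount bump elided).
type btree struct{ root *node }

type node struct{ ref int32 }

func (t *btree) Clone() btree { return btree{root: t.root} }

// Guard and Manager are pared-down stand-ins for this sketch.
type Guard struct{}

type Manager struct {
	mu    sync.Mutex
	trees [2]btree // e.g. one read-only tree and one read-write tree
}

// acquire shows the locking pattern: an O(1) snapshot plus O(spans)
// insertion under the mutex, with all waiting done outside it.
func (m *Manager) acquire(g *Guard) [2]btree {
	m.mu.Lock()
	snap := [2]btree{m.trees[0].Clone(), m.trees[1].Clone()}
	m.insertLocked(g) // add this request's latches to the live trees
	m.mu.Unlock()
	// The caller now iterates snap and waits on overlapping latches
	// without holding mu, so waiting never blocks other acquirers.
	return snap
}

func (m *Manager) insertLocked(g *Guard) { /* elided */ }
```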
…g removal

This change modifies `adjustUpperBoundOnRemoval` to avoid a degenerate case in element removal where all intervals have the same end key. In this case, we would previously adjust the upper bound of every node from the root of the tree to the node that the interval was being removed from.

We now check whether removing the element with the largest end key is actually changing the upper bound of the node. If there are other elements with the same end key then this is not the case and we can avoid repeat calls to `adjustUpperBoundOnRemoval` while traversing back up the tree.

This came up while profiling a benchmark that was giving surprising results.

Release note: None
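The fix can be pictured as a cheap guard before any upper-bound propagation. This is an illustrative sketch with assumed names, not the actual `adjustUpperBoundOnRemoval`:

```
package main

import "bytes"

// node is a pared-down stand-in: max is the subtree's upper-bound end
// key, and ends holds the end keys remaining after the removal.
type node struct {
	max  []byte
	ends [][]byte
}

// boundChangedOnRemoval mirrors the described check: removing an
// element only shrinks the node's upper bound if it carried the
// maximal end key and no surviving element ties it. With many equal
// end keys this returns false, stopping the walk back up the tree.
func (n *node) boundChangedOnRemoval(removedEnd []byte) bool {
	if bytes.Compare(removedEnd, n.max) < 0 {
		return false // the removed element never defined the bound
	}
	for _, e := range n.ends {
		if bytes.Equal(e, n.max) {
			return false // another element still pins the bound
		}
	}
	return true // recompute max here and in ancestors as needed
}
```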
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/command_queue_test.go, line 809 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I suppose you'll be renaming all of these instances of `CommandQueue` as well in a future PR.
Yes, I'll be ripping out every single reference I can find to it.
pkg/storage/spanlatch/doc.go, line 20 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Perhaps mention that this is the evolution of complexity. Something like: `s/Managers's/The evolution of/g`.
Done.
pkg/storage/spanlatch/manager.go, line 62 at r10 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: I'd prefer to see this spelled out as `readSet` and `inReadSet`. We could avoid the use of a `map` by instead using a circularly linked list. `latch` would need `next, prev *latch` fields. You can remove an element from such a list without knowing its position. See `util/cache.Entry` and `util/cache.entryList` for an example of what I'm thinking of.
That's a really cool idea! It provides a nice speedup:
name old time/op new time/op delta
LatchManagerReadOnlyMix/size=1-4 683ns ± 9% 404ns ±10% -40.85% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4 660ns ± 7% 382ns ± 5% -42.17% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4 684ns ±10% 367ns ± 5% -46.27% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4 683ns ± 8% 370ns ± 1% -45.75% (p=0.016 n=5+4)
LatchManagerReadOnlyMix/size=128-4 678ns ± 4% 398ns ±14% -41.27% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4 652ns ± 4% 385ns ± 4% -40.95% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4 594ns ±16% 629ns ±17% ~ (p=0.222 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4 603ns ± 1% 552ns ± 7% -8.39% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4 621ns ± 4% 576ns ± 5% -7.28% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4 649ns ± 2% 541ns ±13% -16.69% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4 474ns ± 5% 423ns ±29% ~ (p=0.151 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4 413ns ± 2% 362ns ±16% ~ (p=0.095 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4 448ns ±14% 314ns ±13% -29.85% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
LatchManagerReadOnlyMix/size=1-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4 191B ± 0% 160B ± 0% ~ (p=0.079 n=4+5)
LatchManagerReadOnlyMix/size=128-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
LatchManagerReadOnlyMix/size=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=0-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
pkg/storage/spanlatch/manager.go, line 228 at r10 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Looks like you always have a snapshot associated with a guard. Rather than passing the snapshot on the stack, it might be better (faster) to embed the snapshot in the guard and to change `Manager.snapshot()` to take a `*snapshot` which it fills in.
But then we have to allocate that entire object on the heap and keep the memory around for the entire lifetime of the `Guard`. Do you think that will be faster?
pkg/storage/spanlatch/manager.go, line 250 at r10 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Perhaps follow the `Locked` naming convention, e.g. `snapshotLocked` and `insertLocked`.
Done.
pkg/storage/spanlatch/interval_btree.go, line 15 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Note that there are various bits in the UI that refer to "Command Queue". Let's file an issue to change the name there as well.
I have a series of changes lined up to eradicate that word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 7 files at r4, 1 of 1 files at r5, 2 of 4 files at r7, 3 of 7 files at r9, 1 of 1 files at r10, 1 of 7 files at r14, 1 of 5 files at r16.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/spanlatch/list.go, line 20 at r16 (raw file):
type latchList struct {
	root latch
	len  int
While it's reasonable and clean to track len (and it's done in container/list to be able to implement O(1) Length), it seems like, given the general memory consciousness of this package, it's safe to omit `latchList.len` if in `front()` you make the nil condition `ll.root.next == nil || ll.root.next == &ll.root`.
pkg/storage/spanlatch/manager.go, line 246 at r16 (raw file):
// flushReadSetLocked flushes the read set into the read interval tree.
func (sm *scopedManager) flushReadSetLocked() {
	for sm.readSet.len > 0 {

If you decide to eliminate `.len`, then I guess this could look like:
for latch := sm.readSet.front(); latch != nil; latch = sm.readSet.front() {
sm.readSet.remove(latch)
sm.trees[spanset.SpanReadOnly].Set(latch)
}
I didn't fully scrutinize all of the details or testing here. Let me know if you think something deserves particular attention and I'll give it a thorough look.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/spanlatch/list.go, line 20 at r16 (raw file):
Previously, ajwerner wrote…
While it's reasonable and clean to track len (and it's done in container/list to be able to implement O(1) Length), it seems like, given the general memory consciousness of this package, it's safe to omit `latchList.len` if in `front()` you make the nil condition `ll.root.next == nil || ll.root.next == &ll.root`.
The memory savings are minimal as there are a constant number of `latchList`s per `Manager`. That said, I'd remove `len` because it doesn't seem necessary, per @ajwerner's suggestion.
pkg/storage/spanlatch/list.go, line 30 at r16 (raw file):
}

func (ll *latchList) lazyInit() {
Do you need this `lazyInit` stuff? For the usage in `spanlatch.Manager` I think an `init` method could be called when the `Manager` is created.
pkg/storage/spanlatch/manager.go, line 62 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
That's a really cool idea! It provides a nice speedup (benchmarks quoted in full above).
💯
pkg/storage/spanlatch/manager.go, line 228 at r10 (raw file):
Do you think that will be faster?
I don't know. Perhaps add it to a TODO list to investigate after this PR goes in. Probably a very minor benefit if any.
Reviewed 4 of 9 files at r11, 3 of 4 files at r12, 1 of 2 files at r13, 5 of 7 files at r14, 1 of 1 files at r15, 2 of 5 files at r16.
Reviewable status: complete! 2 of 0 LGTMs obtained
pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):
```go
// before returning.
func (m *Manager) wait(ctx context.Context, lg *Guard, ts hlc.Timestamp, snap snapshot) error {
	for s := spanset.SpanScope(0); s < spanset.NumSpanScope; s++ {
```
Just a question for discussion: can the order in which latches are examined impact performance? It seems like if we could wait on the longest-blocking item first, then we'd increase the rate of hitting the fast path on the signal and reduce the number of goroutine yields on the select. I don't have good intuition about what it would take to come up with a heuristic to guess when a latch will be removed. Do we expect reads to happen faster than writes? Do we expect global things to take longer than local? All of this may be premature optimization. It might be worth trying to see how often you hit the fast path and, if the number is low (maybe even as low as something like 50%), then maybe there's a cheap win here.
pkg/storage/spanlatch/manager.go, line 265 at r16 (raw file):
```go
switch a {
case spanset.SpanReadOnly:
	// Add reads to the rSet. They only need to enter the read
```
total nit: s/rSet/readSet/
This change replaces the Manager's `readSet` map implementation with a linked-list implementation. This provides the following speedup:

```
name old time/op new time/op delta
LatchManagerReadOnlyMix/size=1-4 683ns ± 9% 404ns ±10% -40.85% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4 660ns ± 7% 382ns ± 5% -42.17% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4 684ns ±10% 367ns ± 5% -46.27% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4 683ns ± 8% 370ns ± 1% -45.75% (p=0.016 n=5+4)
LatchManagerReadOnlyMix/size=128-4 678ns ± 4% 398ns ±14% -41.27% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4 652ns ± 4% 385ns ± 4% -40.95% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4 594ns ±16% 629ns ±17% ~ (p=0.222 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4 603ns ± 1% 552ns ± 7% -8.39% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4 621ns ± 4% 576ns ± 5% -7.28% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4 649ns ± 2% 541ns ±13% -16.69% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4 474ns ± 5% 423ns ±29% ~ (p=0.151 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4 413ns ± 2% 362ns ±16% ~ (p=0.095 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4 448ns ±14% 314ns ±13% -29.85% (p=0.008 n=5+5)

name old alloc/op new alloc/op delta
LatchManagerReadOnlyMix/size=1-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4 191B ± 0% 160B ± 0% ~ (p=0.079 n=4+5)
LatchManagerReadOnlyMix/size=128-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)

name old allocs/op new allocs/op delta
LatchManagerReadOnlyMix/size=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=0-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
```

The change also makes the Manager's zero value completely usable.

Release note: None
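As a rough illustration of the data structure swap in this commit, here is a self-contained Go sketch of an intrusive linked list with the `lazyInit`/`front`/`remove`/`len` shape discussed in the review comments below. It is a simplified stand-in, not the actual `latchList` implementation; the `latch` fields are hypothetical.

```go
package main

import "fmt"

// latch is a simplified stand-in for spanlatch's latch type; the real type
// carries span and signal state. Field names here are hypothetical.
type latch struct {
	id         uint64
	prev, next *latch // intrusive links, so insert/remove allocate nothing
}

// latchList is a minimal sketch of a linked-list readSet. Unlike a
// map[uint64]*latch, it needs no hashing or map buckets, and its zero
// value is immediately usable.
type latchList struct {
	root latch // sentinel; root.next is the front, root.prev is the back
	len  int
}

func (ll *latchList) lazyInit() {
	if ll.root.next == nil {
		ll.root.next = &ll.root
		ll.root.prev = &ll.root
	}
}

func (ll *latchList) pushBack(la *latch) {
	ll.lazyInit()
	at := ll.root.prev
	la.prev = at
	la.next = &ll.root
	at.next = la
	ll.root.prev = la
	ll.len++
}

func (ll *latchList) remove(la *latch) {
	la.prev.next = la.next
	la.next.prev = la.prev
	la.next, la.prev = nil, nil
	ll.len--
}

func (ll *latchList) front() *latch {
	if ll.len == 0 {
		return nil
	}
	return ll.root.next
}

func main() {
	var ll latchList // zero value works; no constructor needed
	a, b := &latch{id: 1}, &latch{id: 2}
	ll.pushBack(a)
	ll.pushBack(b)
	for la := ll.front(); la != nil; la = ll.front() {
		ll.remove(la)
		fmt.Println("flushed latch", la.id)
	}
}
```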
It is cheaper to wait on an already released latch than on an unreleased latch, so we prefer waiting on the longer-held latches first. We expect writes to take longer than reads to release their latches, so we wait on them first.

Release note: None
TFTRs!
bors r+
Reviewable status: complete! 2 of 0 LGTMs obtained
pkg/storage/spanlatch/list.go, line 20 at r16 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The memory savings are minimal as there are a constant number of `latchList`s per `Manager`. That said, I'd remove `len` because it doesn't seem necessary per @ajwerner's suggestion.
I actually did exactly what's being suggested here at first, but I realized that we're going to want metrics on this soon enough, and being able to track how many reads are in the `readSet` will be important.
pkg/storage/spanlatch/list.go, line 30 at r16 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Do you need this `lazyInit` stuff? For the usage in `spanlatch.Manager` I think an `init` method could be called when the `Manager` is created.
This allows the zero value for the entire `spanlatch.Manager` to be used directly, which is super nice. We don't have or need a `Manager` constructor.
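A tiny sketch of the zero-value pattern being described, with hypothetical names: each method that touches lazily allocated state calls `lazyInit` first, so a `var m manager` declaration needs no constructor.

```go
package main

import "fmt"

// manager is a hypothetical illustration, not the real spanlatch.Manager.
type manager struct {
	byID map[uint64]string // allocated on first use
}

func (m *manager) lazyInit() {
	if m.byID == nil {
		m.byID = make(map[uint64]string)
	}
}

func (m *manager) add(id uint64, name string) {
	m.lazyInit()
	m.byID[id] = name
}

func main() {
	var m manager // zero value, used directly
	m.add(1, "latch")
	fmt.Println(m.byID[1])
}
```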
pkg/storage/spanlatch/manager.go, line 228 at r10 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Do you think that will be faster?
I don't know. Perhaps add it to a TODO list to investigate after this PR goes in. Probably a very minor benefit if any.
I gave it a shot and it didn't seem to help:
```
name old time/op new time/op delta
ReadOnlyMix/size=1-4 404ns ±10% 561ns ±14% +38.91% (p=0.008 n=5+5)
ReadOnlyMix/size=4-4 382ns ± 5% 533ns ±17% +39.60% (p=0.008 n=5+5)
ReadOnlyMix/size=16-4 367ns ± 5% 500ns ±17% +36.04% (p=0.008 n=5+5)
ReadOnlyMix/size=64-4 370ns ± 1% 518ns ± 8% +39.92% (p=0.016 n=4+5)
ReadOnlyMix/size=128-4 398ns ±14% 548ns ± 8% +37.50% (p=0.008 n=5+5)
ReadOnlyMix/size=256-4 385ns ± 4% 546ns ± 5% +41.92% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4 629ns ±17% 755ns ±14% ~ (p=0.056 n=5+5)
ReadWriteMix/readsPerWrite=1-4 552ns ± 7% 729ns ± 9% +31.93% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4 576ns ± 5% 673ns ±20% +16.84% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4 541ns ±13% 632ns ± 1% +16.89% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 423ns ±29% 552ns ±31% +30.50% (p=0.032 n=5+5)
ReadWriteMix/readsPerWrite=128-4 362ns ±16% 426ns ± 3% +17.44% (p=0.016 n=5+5)
ReadWriteMix/readsPerWrite=256-4 314ns ±13% 405ns ± 6% +28.94% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
ReadOnlyMix/size=1-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=4-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=16-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=64-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=128-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=256-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=256-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
```
pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):
This is an interesting idea. We expect writes to hold their latches significantly longer than reads, so it should be a clear win to wait on them first so that we select from fewer channels in total. Done.
Do we expect reads to happen faster than writes?
Yes.
Do we expect global things to take longer than local?
Not necessarily. I don't think there's any real correlation here.
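To illustrate the ordering idea settled on here, the following is a hedged Go sketch of waiting on write latches before read latches, using a simplified completion signal with an atomic fast path. None of these names are the actual spanlatch identifiers; they are modeled on the fast-path/select behavior discussed above.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// sig is a simplified stand-in for a latch's completion signal: an atomic
// fast path plus a channel for the slow path.
type sig struct {
	done int32
	ch   chan struct{}
}

func newSig() *sig                          { return &sig{ch: make(chan struct{})} }
func (s *sig) signal()                      { atomic.StoreInt32(&s.done, 1); close(s.ch) }
func (s *sig) signaled() bool               { return atomic.LoadInt32(&s.done) == 1 }
func (s *sig) signalChan() <-chan struct{}  { return s.ch }

// waitAll sketches the ordering idea: wait on the latches expected to be
// held longest (writes) before the rest, so that by the time we reach the
// short-lived ones their signaled() fast path usually avoids a channel
// select entirely.
func waitAll(ctx context.Context, writes, reads []*sig) error {
	for _, group := range [][]*sig{writes, reads} {
		for _, s := range group {
			if s.signaled() { // fast path: no select, no goroutine yield
				continue
			}
			select {
			case <-s.signalChan():
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}
	return nil
}

func main() {
	w, r := newSig(), newSig()
	w.signal()
	r.signal()
	fmt.Println(waitAll(context.Background(), []*sig{w}, []*sig{r})) // <nil>
}
```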
pkg/storage/spanlatch/manager.go, line 246 at r16 (raw file):
Previously, ajwerner wrote…
if you decide to eliminate `.len` then I guess this could look like:

```go
for latch := sm.readSet.front(); latch != nil; latch = sm.readSet.front() {
	sm.readSet.remove(latch)
	sm.trees[spanset.SpanReadOnly].Set(latch)
}
```
See discussion above.
pkg/storage/spanlatch/manager.go, line 265 at r16 (raw file):
Previously, ajwerner wrote…
total nit: s/rSet/readSet/
Not a nit, a botched refactor :) Done.
bors r-
Canceled
3cf960e to b2ab370
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is an interesting idea. We expect writes to hold their latches significantly longer than reads, so it should be a clear win to wait on them first so that we select from fewer channels in total. Done.
Do we expect reads to happen faster than writes?
Yes.
Do we expect global things to take longer than local?
Not necessarily. I don't think there's any real correlation here.
Cool, the next steps to push this idea further would be:
- set `a` to `SpanReadWrite` before setting it to `SpanReadOnly` (0) in the for loop
- sort the latches in `newGuard` with the highest timestamps first, as my intuition is that the high-timestamp latches are expected to finish last.
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):
set `a` to `SpanReadWrite` before setting it to `SpanReadOnly` (0) in the for loop

But we would want `a=SpanReadOnly` before `a=SpanReadWrite`, right? Because then the order of access will be `readSpan+tree[SpanReadWrite]`, `writeSpan+tree[SpanReadWrite]`, `writeSpan+tree[SpanReadOnly]`. Either way, in practice we never actually see requests with read and write spans together.
sort the latches in newGuard with the highest timestamps first as my intuition is that the high timestamp latches are expected to finish last.
All of the latches in `newGuard` have the same timestamp. Also, anything that requires sorting will almost certainly cost more than doing nothing at all. We're dealing with costs on the order of double-digit nanoseconds at this point.
31997: storage/spanlatch: create spanlatch.Manager using immutable btrees r=nvanbenschoten a=nvanbenschoten

Informs #4768.
Informs #31904.

This change was inspired by #31904 and is a progression of the thinking started in #4768 (comment).

The change introduces `spanlatch.Manager`, which will replace the `CommandQueue` **in a future PR**. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in the storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes #31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue, and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because `Replica.beginCmds` repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges.

The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice, so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock.

While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the `spanlatch.Guard`. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting immutable btree nodes, which allows us to release them back into the btree node pools when they're no longer needed. This means that we don't expect any allocations when inserting into the internal trees, even with the copy-on-write policy.

Finally, this will make the approach taken in #31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free.
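To make the copy-on-write sequencing pattern above concrete, here is a minimal Go sketch. All names are illustrative, not the real spanlatch API, and the O(1) clone of the real btree is replaced by a plain slice copy for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// tree is a stand-in for the immutable interval btree. The real structure
// clones in O(1) via refcounted copy-on-write nodes; this sketch copies a
// slice instead, for simplicity.
type tree struct{ latches []string }

func (t tree) clone() tree {
	c := make([]string, len(t.latches))
	copy(c, t.latches)
	return tree{latches: c}
}

// manager sketches the sequencing pattern: the mutex covers only the
// snapshot plus our own insert, which is O(spans declared). Waiting on
// overlapping latches happens against the snapshot, outside the lock.
type manager struct {
	mu sync.Mutex
	t  tree
}

func (m *manager) acquire(span string) tree {
	m.mu.Lock()
	snap := m.t.clone()                 // immutable snapshot
	m.t.latches = append(m.t.latches, span) // insert our own latch
	m.mu.Unlock()
	return snap // caller checks overlaps in snap without holding m.mu
}

func main() {
	var m manager
	m.acquire("[a,c)")
	snap := m.acquire("[b,d)")
	fmt.Println("checking", len(snap.latches), "existing latch(es) outside the lock")
}
```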
### Naming changes

| Before | After |
|----------------------------|-----------------------------------|
| `CommandQueue` | `spanlatch.Manager` |
| "enter the command queue" | "acquire span latches" |
| "exit the command queue" | "release span latches" |
| "wait for prereq commands" | "wait for latches to be released" |

The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is to avoid confusing the operational locking performed by the CommandQueue/spanlatch.Manager with the transaction-scoped locking enforced by intents and our transactional concurrency control model.

### Microbenchmarks

NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure.

```
name old time/op new time/op delta
ReadOnlyMix/size=1-4 706ns ±20% 404ns ±10% -42.81% (p=0.008 n=5+5)
ReadOnlyMix/size=4-4 649ns ±23% 382ns ± 5% -41.13% (p=0.008 n=5+5)
ReadOnlyMix/size=16-4 611ns ±16% 367ns ± 5% -39.83% (p=0.008 n=5+5)
ReadOnlyMix/size=64-4 692ns ±14% 370ns ± 1% -46.49% (p=0.016 n=5+4)
ReadOnlyMix/size=128-4 637ns ±22% 398ns ±14% -37.48% (p=0.008 n=5+5)
ReadOnlyMix/size=256-4 676ns ±15% 385ns ± 4% -43.01% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4 12.2µs ± 4% 0.6µs ±17% -94.85% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4 7.88µs ± 2% 0.55µs ± 7% -92.99% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4 4.19µs ± 3% 0.58µs ± 5% -86.26% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4 2.09µs ± 6% 0.54µs ±13% -74.13% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 875ns ±17% 423ns ±29% -51.64% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4 655ns ± 6% 362ns ±16% -44.71% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=256-4 549ns ±16% 314ns ±13% -42.73% (p=0.008 n=5+5)

name old alloc/op new alloc/op delta
ReadOnlyMix/size=1-4 223B ± 0% 160B ± 0% -28.25% (p=0.079 n=4+5)
ReadOnlyMix/size=4-4 223B ± 0% 160B ± 0% -28.25% (p=0.008 n=5+5)
ReadOnlyMix/size=16-4 223B ± 0% 160B ± 0% -28.25% (p=0.008 n=5+5)
ReadOnlyMix/size=64-4 223B ± 0% 160B ± 0% -28.25% (p=0.008 n=5+5)
ReadOnlyMix/size=128-4 217B ± 4% 160B ± 0% -26.27% (p=0.008 n=5+5)
ReadOnlyMix/size=256-4 223B ± 0% 160B ± 0% -28.25% (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=0-4 1.25kB ± 0% 0.16kB ± 0% -87.15% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4 1.00kB ± 0% 0.16kB ± 0% -84.00% (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=4-4 708B ± 0% 160B ± 0% -77.40% (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=16-4 513B ± 0% 160B ± 0% -68.81% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 264B ± 0% 160B ± 0% -39.39% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4 221B ± 0% 160B ± 0% -27.60% (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=256-4 198B ± 0% 160B ± 0% -19.35% (p=0.008 n=5+5)

name old allocs/op new allocs/op delta
ReadOnlyMix/size=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadWriteMix/readsPerWrite=0-4 38.0 ± 0% 1.0 ± 0% -97.37% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4 24.0 ± 0% 1.0 ± 0% -95.83% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4 12.0 ± 0% 1.0 ± 0% -91.67% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4 5.00 ± 0% 1.00 ± 0% -80.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 2.00 ± 0% 1.00 ± 0% -50.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadWriteMix/readsPerWrite=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
```

There are a few interesting things to point out about these benchmark results:
- The `ReadOnlyMix` results demonstrate a fixed improvement, regardless of size. This is due to the replacement of the hash-map with a linked-list for the readSet structure.
- The `ReadWriteMix` results are more interesting. We see that the spanlatch implementation is faster across the board. This is especially true with a high write/read ratio.
- The allocated memory stays constant regardless of the write/read ratio in the spanlatch implementation. This is due to the memory recycling that it performs on btree nodes. It is not the case for the CommandQueue implementation.

Release note: None

32416: scripts: enhance the release notes r=knz a=knz

Fixes #25180.

With this the amount of release notes for the first 2.2 alpha in cockroachdb/docs#4051 is reduced to just under two pages. Also this PR makes it easier to monitor progress during the execution of the script.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Build succeeded
💯
Reviewed 9 of 9 files at r11, 4 of 4 files at r12, 2 of 2 files at r13, 7 of 7 files at r14, 1 of 1 files at r15, 4 of 5 files at r16, 1 of 1 files at r17.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/spanlatch/manager.go, line 314 at r14 (raw file):
```go
}

func (m *Manager) nextID() uint64 {
```
nit: nextIDLocked()
pkg/storage/spanlatch/manager_test.go, line 122 at r14 (raw file):
```go
m := New()

// Try latch with no overlapping already-acquired lathes.
```
lathes
Reviewed 1 of 1 files at r18.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/spanlatch/manager.go, line 314 at r14 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: nextIDLocked()
Will address in next PR.
cockroachdb#31997 (review)

Release note: None
This commit replaces the CommandQueue with the spanlatch.Manager, which was introduced in cockroachdb#31997. See that PR for an introduction to how the structure differs from the CommandQueue and how it improves performance on microbenchmarks. This is mostly a mechanical change. One important detail is that it removes the CommandQueue debug page. We found that the page was buggy (or straight up broken) and it wasn't actively used by members of Core when debugging problems. In its place, the commit revives the "slow requests" metric for latching, which hasn't been hooked up in over a year.

### Benchmarks

#### Standard Benchmarks

These benchmarks are standard benchmarks that we commonly run. They were run with varying node sizes, cluster sizes, and pre-split counts.

```
name old ops/sec new ops/sec delta
kv0/cores=4/nodes=1/splits=0 1.99k ± 2% 2.06k ± 1% +3.22% (p=0.008 n=5+5)
kv0/cores=4/nodes=1/splits=100 2.25k ± 1% 2.38k ± 1% +6.01% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0 1.60k ± 0% 1.69k ± 2% +5.53% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100 3.52k ± 6% 3.65k ± 9% ~ (p=0.421 n=5+5)
kv0/cores=16/nodes=1/splits=0 19.9k ± 1% 21.8k ± 1% +9.34% (p=0.008 n=5+5)
kv0/cores=16/nodes=1/splits=100 24.4k ± 1% 26.1k ± 1% +7.17% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0 14.9k ± 1% 16.1k ± 1% +8.03% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100 20.6k ± 1% 22.8k ± 1% +10.79% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=0 31.2k ± 2% 35.3k ± 1% +13.28% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100 45.7k ± 1% 51.1k ± 1% +11.80% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0 23.7k ± 2% 27.1k ± 2% +14.39% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=100 34.9k ± 2% 45.1k ± 1% +29.44% (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0 12.7k ± 2% 12.9k ± 2% +1.39% (p=0.151 n=5+5)
kv95/cores=4/nodes=1/splits=100 12.8k ± 2% 13.1k ± 2% +2.10% (p=0.032 n=5+5)
kv95/cores=4/nodes=3/splits=0 10.6k ± 1% 10.8k ± 1% +1.58% (p=0.056 n=5+5)
kv95/cores=4/nodes=3/splits=100 12.3k ± 7% 12.6k ± 8% +2.61% (p=0.095 n=5+5)
kv95/cores=16/nodes=1/splits=0 50.9k ± 1% 52.2k ± 1% +2.37% (p=0.008 n=5+5)
kv95/cores=16/nodes=1/splits=100 52.2k ± 1% 53.0k ± 1% +1.49% (p=0.008 n=5+5)
kv95/cores=16/nodes=3/splits=0 46.2k ± 1% 46.8k ± 1% +1.32% (p=0.032 n=5+5)
kv95/cores=16/nodes=3/splits=100 51.0k ± 1% 53.2k ± 1% +4.25% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0 79.8k ± 2% 101.6k ± 1% +27.31% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=100 104k ± 1% 107k ± 1% +2.60% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0 85.8k ± 1% 91.8k ± 1% +7.08% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=100 106k ± 1% 112k ± 1% +5.51% (p=0.008 n=5+5)

name old p50(ms) new p50(ms) delta
kv0/cores=4/nodes=1/splits=0 3.52 ± 5% 3.40 ± 0% -3.41% (p=0.016 n=5+4)
kv0/cores=4/nodes=1/splits=100 3.30 ± 0% 3.00 ± 0% -9.09% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0 4.70 ± 0% 4.14 ± 9% -11.91% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100 1.50 ± 0% 1.48 ± 8% ~ (p=0.968 n=4+5)
kv0/cores=16/nodes=1/splits=0 1.40 ± 0% 1.40 ± 0% ~ (all equal)
kv0/cores=16/nodes=1/splits=100 1.20 ± 0% 1.20 ± 0% ~ (all equal)
kv0/cores=16/nodes=3/splits=0 2.00 ± 0% 1.90 ± 0% -5.00% (p=0.000 n=5+4)
kv0/cores=16/nodes=3/splits=100 1.40 ± 0% 1.40 ± 0% ~ (all equal)
kv0/cores=36/nodes=1/splits=0 1.76 ± 3% 1.60 ± 0% -9.09% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100 1.40 ± 0% 1.30 ± 0% -7.14% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0 2.56 ± 2% 2.40 ± 0% -6.25% (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100 1.70 ± 0% 1.40 ± 0% -17.65% (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=4/nodes=1/splits=100 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=4/nodes=3/splits=0 0.60 ± 0% 0.60 ± 0% ~ (all equal)
kv95/cores=4/nodes=3/splits=100 0.60 ± 0% 0.60 ± 0% ~ (all equal)
kv95/cores=16/nodes=1/splits=0 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=16/nodes=1/splits=100 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=16/nodes=3/splits=0 0.70 ± 0% 0.64 ± 9% -8.57% (p=0.167 n=5+5)
kv95/cores=16/nodes=3/splits=100 0.60 ± 0% 0.60 ± 0% ~ (all equal)
kv95/cores=36/nodes=1/splits=0 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=36/nodes=1/splits=100 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=36/nodes=3/splits=0 0.66 ± 9% 0.60 ± 0% -9.09% (p=0.167 n=5+5)
kv95/cores=36/nodes=3/splits=100 0.60 ± 0% 0.60 ± 0% ~ (all equal)

name old p99(ms) new p99(ms) delta
kv0/cores=4/nodes=1/splits=0 11.0 ± 0% 10.5 ± 0% -4.55% (p=0.000 n=5+4)
kv0/cores=4/nodes=1/splits=100 7.90 ± 0% 7.60 ± 0% -3.80% (p=0.000 n=5+4)
kv0/cores=4/nodes=3/splits=0 15.7 ± 0% 15.2 ± 0% -3.18% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100 8.90 ± 0% 8.12 ± 3% -8.76% (p=0.016 n=4+5)
kv0/cores=16/nodes=1/splits=0 3.46 ± 2% 3.00 ± 0% -13.29% (p=0.000 n=5+4)
kv0/cores=16/nodes=1/splits=100 4.50 ± 0% 3.36 ± 2% -25.33% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0 4.50 ± 0% 3.90 ± 0% -13.33% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100 5.80 ± 0% 4.10 ± 0% -29.31% (p=0.029 n=4+4)
kv0/cores=36/nodes=1/splits=0 6.80 ± 0% 5.20 ± 0% -23.53% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100 5.80 ± 0% 4.32 ± 4% -25.52% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0 7.72 ± 2% 6.30 ± 0% -18.39% (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100 7.98 ± 2% 5.20 ± 0% -34.84% (p=0.000 n=5+4)
kv95/cores=4/nodes=1/splits=0 5.38 ± 3% 5.20 ± 0% -3.35% (p=0.167 n=5+5)
kv95/cores=4/nodes=1/splits=100 5.00 ± 0% 5.00 ± 0% ~ (all equal)
kv95/cores=4/nodes=3/splits=0 5.68 ± 3% 5.50 ± 0% -3.17% (p=0.095 n=5+4)
kv95/cores=4/nodes=3/splits=100 3.60 ±31% 2.93 ± 3% -18.75% (p=0.016 n=5+4)
kv95/cores=16/nodes=1/splits=0 4.10 ± 0% 4.10 ± 0% ~ (all equal)
kv95/cores=16/nodes=1/splits=100 4.50 ± 0% 4.10 ± 0% -8.89% (p=0.000 n=5+4)
kv95/cores=16/nodes=3/splits=0 2.60 ± 0% 2.60 ± 0% ~ (all equal)
kv95/cores=16/nodes=3/splits=100 2.50 ± 0% 1.90 ± 5% -24.00% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0 6.60 ± 0% 6.00 ± 0% -9.09% (p=0.029 n=4+4)
kv95/cores=36/nodes=1/splits=100 5.50 ± 0% 5.12 ± 2% -6.91% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0 4.18 ± 2% 4.02 ± 3% -3.71% (p=0.000 n=4+5)
kv95/cores=36/nodes=3/splits=100 3.80 ± 0% 2.80 ± 0% -26.32% (p=0.008 n=5+5)
```

#### Large-machine Benchmarks

These benchmarks are standard benchmarks run on a single-node cluster with 72 vCPUs.

```
name old ops/sec new ops/sec delta
kv0/cores=72/nodes=1/splits=0 31.0k ± 4% 36.4k ± 1% +17.57% (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100 44.0k ± 0% 49.0k ± 1% +11.41% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0 52.7k ±18% 72.6k ±26% +37.70% (p=0.016 n=5+5)
kv95/cores=72/nodes=1/splits=100 66.8k ±17% 68.5k ± 5% ~ (p=0.286 n=5+4)

name old p50(ms) new p50(ms) delta
kv0/cores=72/nodes=1/splits=0 2.30 ±13% 2.52 ± 5% ~ (p=0.214 n=5+5)
kv0/cores=72/nodes=1/splits=100 3.00 ± 0% 2.90 ± 0% -3.33% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0 0.46 ±13% 0.50 ± 0% ~ (p=0.444 n=5+5)
kv95/cores=72/nodes=1/splits=100 0.44 ±14% 0.50 ± 0% +13.64% (p=0.167 n=5+5)

name old p99(ms) new p99(ms) delta
kv0/cores=72/nodes=1/splits=0 18.9 ± 6% 13.3 ± 5% -29.56% (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100 13.4 ± 2% 11.0 ± 0% -17.91% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0 34.4 ±34% 23.5 ±24% -31.74% (p=0.048 n=5+5)
kv95/cores=72/nodes=1/splits=100 21.0 ± 0% 19.1 ± 4% -8.81% (p=0.029 n=4+4)
```

#### Motivating Benchmarks

These are benchmarks that used to generate a lot of contention in the CommandQueue. They have small cycle-lengths, indicated by the `c` specifier. The last one also includes 20% scan operations, which increases contention between non-overlapping point operations.

```
name old ops/sec new ops/sec delta
kv95-c5/cores=16/nodes=1/splits=0 45.1k ± 1% 47.2k ± 4% +4.59% (p=0.008 n=5+5)
kv95-c5/cores=36/nodes=1/splits=0 44.6k ± 1% 76.3k ± 1% +71.05% (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0 27.2k ± 2% 29.4k ± 1% +8.12% (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0 42.6k ± 2% 50.0k ± 1% +17.39% (p=0.008 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0 28.7k ± 1% 29.8k ± 3% +3.87% (p=0.008 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0 41.9k ± 4% 52.8k ± 2% +25.97% (p=0.008 n=5+5)

name old p50(ms) new p50(ms) delta
kv95-c5/cores=16/nodes=1/splits=0 0.60 ± 0% 0.60 ± 0% ~ (all equal)
kv95-c5/cores=36/nodes=1/splits=0 0.90 ± 0% 0.80 ± 0% -11.11% (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0 1.10 ± 0% 1.06 ± 6% ~ (p=0.444 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0 1.26 ± 5% 1.30 ± 0% ~ (p=0.444 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0 0.66 ± 9% 0.60 ± 0% -9.09% (p=0.167 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0 0.70 ± 0% 0.50 ± 0% -28.57% (p=0.008 n=5+5)

name old p99(ms) new p99(ms) delta
kv95-c5/cores=16/nodes=1/splits=0 2.40 ± 0% 2.10 ± 0% -12.50% (p=0.000 n=5+4)
kv95-c5/cores=36/nodes=1/splits=0 5.80 ± 0% 3.30 ± 0% -43.10% (p=0.000 n=5+4)
kv50-c128/cores=16/nodes=1/splits=0 3.50 ± 0% 3.00 ± 0% -14.29% (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0 6.80 ± 0% 4.70 ± 0% -30.88% (p=0.079 n=4+5)
kv70-20-c128/cores=16/nodes=1/splits=0 5.00 ± 0% 4.70 ± 0% -6.00% (p=0.029 n=4+4)
kv70-20-c128/cores=36/nodes=1/splits=0 11.0 ± 0% 6.8 ± 0% -38.18% (p=0.008 n=5+5)
```

#### Batching Benchmarks

One optimization left out of the new spanlatch.Manager was the "covering" optimization, where commands were initially added to the interval tree as a single spanning interval and only expanded later. I ran a series of benchmarks to verify that this optimization was not needed. My hypothesis was that the order-of-magnitude increase in the speed of the interval tree would make the optimization unnecessary. It turns out that removing the optimization hurt a few benchmarks to a small degree but sped up others tremendously (some benchmarks improved by over 400%). I suspect that the covering optimization could actually hurt in cases where it causes non-overlapping requests to overlap.

It is interesting how quickly a few of these benchmarks oscillate from small losses to big wins. It makes me think that there's some non-linear behavior with the old CommandQueue that would cause its performance to quickly degrade once it became a contention bottleneck.

```
name old ops/sec new ops/sec delta
kv0-b16/cores=4/nodes=1/splits=0 2.41k ± 0% 2.06k ± 3% -14.75% (p=0.008 n=5+5)
kv0-b16/cores=4/nodes=1/splits=100 514 ± 0% 534 ± 1% +3.88% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0 2.95k ± 0% 4.35k ± 0% +47.74% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100 1.80k ± 1% 1.88k ± 1% +4.46% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0 2.74k ± 0% 4.92k ± 1% +79.55% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100 2.39k ± 1% 2.45k ± 1% +2.41% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0 422 ± 0% 518 ± 1% +22.60% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100 98.4 ± 1% 98.8 ± 1% ~ (p=0.810 n=5+5)
kv0-b128/cores=16/nodes=1/splits=0 532 ± 0% 1059 ± 0% +99.16% (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100 291 ± 1% 307 ± 1% +5.18% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0 483 ± 0% 1288 ± 1% +166.37% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100 394 ± 1% 408 ± 1% +3.51% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0 49.7 ± 1% 72.8 ± 1% +46.52% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100 30.8 ± 0% 23.4 ± 0% -24.03% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0 48.9 ± 2% 160.6 ± 0% +228.38% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=100 101 ± 1% 80 ± 0% -21.64% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0 37.5 ± 0% 208.1 ± 1% +454.99% (p=0.016 n=4+5)
kv0-b1024/cores=36/nodes=1/splits=100 162 ± 0% 124 ± 0% -23.22% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0 5.93k ± 0% 6.20k ± 1% +4.55% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100 2.27k ± 1% 2.32k ± 1% +2.28% (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=0 5.15k ± 1% 18.79k ± 1% +264.73% (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=100 8.31k ± 1% 8.57k ± 1% +3.16% (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0 3.96k ± 0% 10.67k ± 1% +169.81% (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=100 15.7k ± 2% 16.2k ± 4% +2.75% (p=0.151 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0 1.12k ± 1% 1.27k ± 0% +13.28% (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100 290 ± 1% 299 ± 1% +3.02% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0 1.06k ± 0% 3.31k ± 0% +213.09% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100 662 ±91% 1095 ± 1% +65.42% (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0 715 ± 2% 3586 ± 0% +401.21% (p=0.008 n=5+5)
kv95-b128/cores=36/nodes=1/splits=100 1.15k ±90% 2.01k ± 2% +74.79% (p=0.016 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0 134 ± 1% 170 ± 1% +26.59% (p=0.008 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=100 54.8 ± 3% 53.3 ± 3% -2.84% (p=0.056 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=0 104 ± 3% 367 ± 1% +252.37% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 210 ± 1% 214 ± 1% +1.86% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0 76.5 ± 2% 383.9 ± 1% +401.67% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 431 ± 1% 436 ± 1% +1.17% (p=0.016 n=5+5)

name old p50(ms) new p50(ms) delta
kv0-b16/cores=4/nodes=1/splits=0 3.00 ± 0% 3.40 ± 0% +13.33% (p=0.016 n=5+4)
kv0-b16/cores=4/nodes=1/splits=100 15.2 ± 0% 14.7 ± 0% -3.29% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0 10.5 ± 0% 7.7 ± 2% -26.48% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100 17.8 ± 0% 16.8 ± 0% -5.62% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0 26.2 ± 0% 14.2 ± 0% -45.80% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100 29.0 ± 2% 28.3 ± 0% -2.28% (p=0.095 n=5+4)
kv0-b128/cores=4/nodes=1/splits=0 17.8 ± 0% 15.2 ± 0% -14.61% (p=0.000 n=5+4)
kv0-b128/cores=4/nodes=1/splits=100 79.7 ± 0% 79.7 ± 0% ~ (all equal)
kv0-b128/cores=16/nodes=1/splits=0 65.0 ± 0% 32.5 ± 0% -50.00% (p=0.029 n=4+4)
kv0-b128/cores=16/nodes=1/splits=100 109 ± 0% 105 ± 0% -3.85% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0 168 ± 0% 50 ± 0% -70.02% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100 184 ± 0% 176 ± 0% -4.50% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0 159 ± 0% 109 ± 0% -31.56% (p=0.000 n=5+4)
kv0-b1024/cores=4/nodes=1/splits=100 252 ± 0% 319 ± 0% +26.66% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0 705 ± 0% 193 ± 0% -72.62% (p=0.000 n=5+4)
kv0-b1024/cores=16/nodes=1/splits=100 319 ± 0% 386 ± 0% +21.05% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0 1.88k ± 0% 0.24k ± 0% -87.05% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100 436 ± 0% 570 ± 0% +30.77% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0 1.20 ± 0% 1.20 ± 0% ~ (all equal)
kv95-b16/cores=4/nodes=1/splits=100 2.60 ± 0% 2.60 ± 0% ~ (all equal)
kv95-b16/cores=16/nodes=1/splits=0 6.30 ± 0% 1.40 ± 0% -77.78% (p=0.000 n=5+4)
kv95-b16/cores=16/nodes=1/splits=100 1.74 ± 3% 1.76 ± 3% ~ (p=1.000 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0 11.5 ± 0% 5.5 ± 0% -52.17% (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100 2.42 ±20% 2.42 ±45% ~ (p=0.579 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0 6.60 ± 0% 6.00 ± 0% -9.09% (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100 21.4 ± 3% 21.0 ± 0% ~ (p=0.444 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0 30.4 ± 0% 9.4 ± 0% -69.08% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100 38.2 ±76% 21.2 ± 4% -44.31% (p=0.063 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0 88.1 ± 0% 16.8 ± 0% -80.93% (p=0.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100 56.6 ±85% 29.6 ±15% ~ (p=0.873 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0 52.4 ± 0% 44.0 ± 0% -16.03% (p=0.029 n=4+4)
kv95-b1024/cores=4/nodes=1/splits=100 132 ± 2% 143 ± 0% +8.29% (p=0.016 n=5+4)
kv95-b1024/cores=16/nodes=1/splits=0 325 ± 3% 80 ± 0% -75.51% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 151 ± 0% 151 ± 0% ~ (all equal)
kv95-b1024/cores=36/nodes=1/splits=0 973 ± 0% 180 ± 3% -81.55% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 168 ± 0% 168 ± 0% ~ (all equal)

name old p99(ms) new p99(ms) delta
kv0-b16/cores=4/nodes=1/splits=0 8.40 ± 0% 10.30 ± 3% +22.62% (p=0.016 n=4+5)
kv0-b16/cores=4/nodes=1/splits=100 29.4 ± 0% 27.3 ± 0% -7.14% (p=0.000 n=5+4)
kv0-b16/cores=16/nodes=1/splits=0 16.3 ± 0% 15.5 ± 2% -4.91% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100 31.5 ± 0% 29.4 ± 0% -6.67% (p=0.000 n=5+4)
kv0-b16/cores=36/nodes=1/splits=0 37.7 ± 0% 28.7 ± 2% -23.77% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100 62.1 ± 2% 68.4 ±10% +10.15% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0 37.7 ± 0% 39.4 ± 6% +4.46% (p=0.167 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100 143 ± 0% 151 ± 0% +5.89% (p=0.016 n=4+5)
kv0-b128/cores=16/nodes=1/splits=0 79.7 ± 0% 55.8 ± 2% -30.04% (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100 198 ± 3% 188 ± 3% -5.09% (p=0.048 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0 184 ± 0% 126 ± 3% -31.82% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100 319 ± 0% 336 ± 0% +5.24% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0 322 ± 6% 253 ± 4% -21.35% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100 470 ± 0% 772 ± 4% +64.28% (p=0.016 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=0 1.41k ± 0% 0.56k ±11% -60.00% (p=0.000 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=100 530 ± 2% 772 ± 0% +45.57% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0 4.05k ± 7% 1.17k ± 3% -71.19% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100 792 ±14% 1020 ± 2% +28.81% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0 3.90 ± 0% 3.22 ± 4% -17.44% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100 21.0 ± 0% 19.9 ± 0% -5.24% (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=0 15.2 ± 0% 7.1 ± 0% -53.29% (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=100 38.5 ± 3% 37.7 ± 0% ~ (p=0.333 n=5+4)
kv95-b16/cores=36/nodes=1/splits=0 128 ± 2% 52 ± 0% -59.16% (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100 41.1 ±13% 39.2 ±33% ~ (p=0.984 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0 17.8 ± 0% 14.7 ± 0% -17.42% (p=0.079 n=4+5)
kv95-b128/cores=4/nodes=1/splits=100 107 ± 2% 106 ± 5% ~ (p=0.683 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0 75.5 ± 0% 23.1 ± 0% -69.40% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100 107 ±34% 120 ± 2% ~ (p=1.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0 253 ± 4% 71 ± 0% -71.86% (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100 166 ±19% 164 ±74% ~ (p=0.310 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=0 146 ± 3% 101 ± 0% -31.01% (p=0.000 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=100 348 ± 4% 366 ± 6% ~ (p=0.317 n=4+5)
kv95-b1024/cores=16/nodes=1/splits=0 624 ± 3% 221 ± 2% -64.52% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 325 ± 3% 319 ± 0% ~ (p=0.444 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0 1.56k ± 5% 0.41k ± 2% -73.71% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 336 ± 0% 336 ± 0% ~ (all equal)
```

Release note (performance improvement): Replace Replica latching mechanism with new optimized data structure that improves throughput, especially under heavy contention.
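For readers unfamiliar with the "covering" optimization described above, here is a minimal, hypothetical Go sketch of the idea and of how a cover can introduce false overlap between requests whose individual spans are disjoint. The `span` type and helpers are simplified stand-ins for roachpb-style key spans, not CockroachDB code; point spans are elided for simplicity.

```go
package main

import (
	"bytes"
	"fmt"
)

// span is a simplified [key, endKey) key span.
type span struct {
	key, endKey []byte
}

func (s span) overlaps(o span) bool {
	return bytes.Compare(s.key, o.endKey) < 0 && bytes.Compare(o.key, s.endKey) < 0
}

// cover returns the single span enclosing all input spans (assumed
// non-empty). The removed optimization inserted this one span into the
// interval tree up front and expanded it into the individual spans later.
func cover(spans []span) span {
	c := spans[0]
	for _, s := range spans[1:] {
		if bytes.Compare(s.key, c.key) < 0 {
			c.key = s.key
		}
		if bytes.Compare(s.endKey, c.endKey) > 0 {
			c.endKey = s.endKey
		}
	}
	return c
}

func main() {
	declared := []span{
		{key: []byte("a"), endKey: []byte("b")},
		{key: []byte("y"), endKey: []byte("z")},
	}
	other := span{key: []byte("m"), endKey: []byte("n")}

	c := cover(declared)
	// Neither declared span overlaps [m,n), but the cover [a,z) does:
	// this is the false contention suspected above.
	fmt.Println(declared[0].overlaps(other), declared[1].overlaps(other), c.overlaps(other))
}
```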
32865: storage: replace CommandQueue with spanlatch.Manager r=nvanbenschoten a=nvanbenschoten

This commit replaces the CommandQueue with the spanlatch.Manager, which was introduced in #31997. See that PR for an introduction to how the structure differs from the CommandQueue and how it improves performance on microbenchmarks. This is mostly a mechanical change.

One important detail is that it removes the CommandQueue debug page. We found that the page was buggy (or straight up broken) and it wasn't actively used by members of Core when debugging problems. In its place, the commit revives the "slow requests" metric for latching, which hasn't been hooked up in over a year.

### Benchmarks

#### Standard Benchmarks

These are standard benchmarks that we commonly run. They were run with varying node sizes, cluster sizes, and pre-split counts.

```
name                            old ops/sec  new ops/sec  delta
kv0/cores=4/nodes=1/splits=0     1.99k ± 2%   2.06k ± 1%   +3.22% (p=0.008 n=5+5)
kv0/cores=4/nodes=1/splits=100   2.25k ± 1%   2.38k ± 1%   +6.01% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0     1.60k ± 0%   1.69k ± 2%   +5.53% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100   3.52k ± 6%   3.65k ± 9%   ~      (p=0.421 n=5+5)
kv0/cores=16/nodes=1/splits=0    19.9k ± 1%   21.8k ± 1%   +9.34% (p=0.008 n=5+5)
kv0/cores=16/nodes=1/splits=100  24.4k ± 1%   26.1k ± 1%   +7.17% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0    14.9k ± 1%   16.1k ± 1%   +8.03% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100  20.6k ± 1%   22.8k ± 1%  +10.79% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=0    31.2k ± 2%   35.3k ± 1%  +13.28% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100  45.7k ± 1%   51.1k ± 1%  +11.80% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0    23.7k ± 2%   27.1k ± 2%  +14.39% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=100  34.9k ± 2%   45.1k ± 1%  +29.44% (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0    12.7k ± 2%   12.9k ± 2%   +1.39% (p=0.151 n=5+5)
kv95/cores=4/nodes=1/splits=100  12.8k ± 2%   13.1k ± 2%   +2.10% (p=0.032 n=5+5)
kv95/cores=4/nodes=3/splits=0    10.6k ± 1%   10.8k ± 1%   +1.58% (p=0.056 n=5+5)
kv95/cores=4/nodes=3/splits=100  12.3k ± 7%   12.6k ± 8%   +2.61% (p=0.095 n=5+5)
kv95/cores=16/nodes=1/splits=0   50.9k ± 1%   52.2k ± 1%   +2.37% (p=0.008 n=5+5)
kv95/cores=16/nodes=1/splits=100 52.2k ± 1%   53.0k ± 1%   +1.49% (p=0.008 n=5+5)
kv95/cores=16/nodes=3/splits=0   46.2k ± 1%   46.8k ± 1%   +1.32% (p=0.032 n=5+5)
kv95/cores=16/nodes=3/splits=100 51.0k ± 1%   53.2k ± 1%   +4.25% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0   79.8k ± 2%  101.6k ± 1%  +27.31% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=100  104k ± 1%    107k ± 1%   +2.60% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0   85.8k ± 1%   91.8k ± 1%   +7.08% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=100  106k ± 1%    112k ± 1%   +5.51% (p=0.008 n=5+5)

name                            old p50(ms)  new p50(ms)  delta
kv0/cores=4/nodes=1/splits=0     3.52 ± 5%    3.40 ± 0%   -3.41% (p=0.016 n=5+4)
kv0/cores=4/nodes=1/splits=100   3.30 ± 0%    3.00 ± 0%   -9.09% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0     4.70 ± 0%    4.14 ± 9%  -11.91% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100   1.50 ± 0%    1.48 ± 8%   ~      (p=0.968 n=4+5)
kv0/cores=16/nodes=1/splits=0    1.40 ± 0%    1.40 ± 0%   ~      (all equal)
kv0/cores=16/nodes=1/splits=100  1.20 ± 0%    1.20 ± 0%   ~      (all equal)
kv0/cores=16/nodes=3/splits=0    2.00 ± 0%    1.90 ± 0%   -5.00% (p=0.000 n=5+4)
kv0/cores=16/nodes=3/splits=100  1.40 ± 0%    1.40 ± 0%   ~      (all equal)
kv0/cores=36/nodes=1/splits=0    1.76 ± 3%    1.60 ± 0%   -9.09% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100  1.40 ± 0%    1.30 ± 0%   -7.14% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0    2.56 ± 2%    2.40 ± 0%   -6.25% (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100  1.70 ± 0%    1.40 ± 0%  -17.65% (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0    0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=4/nodes=1/splits=100  0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=4/nodes=3/splits=0    0.60 ± 0%    0.60 ± 0%   ~      (all equal)
kv95/cores=4/nodes=3/splits=100  0.60 ± 0%    0.60 ± 0%   ~      (all equal)
kv95/cores=16/nodes=1/splits=0   0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=16/nodes=1/splits=100 0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=16/nodes=3/splits=0   0.70 ± 0%    0.64 ± 9%   -8.57% (p=0.167 n=5+5)
kv95/cores=16/nodes=3/splits=100 0.60 ± 0%    0.60 ± 0%   ~      (all equal)
kv95/cores=36/nodes=1/splits=0   0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=36/nodes=1/splits=100 0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=36/nodes=3/splits=0   0.66 ± 9%    0.60 ± 0%   -9.09% (p=0.167 n=5+5)
kv95/cores=36/nodes=3/splits=100 0.60 ± 0%    0.60 ± 0%   ~      (all equal)

name                            old p99(ms)  new p99(ms)  delta
kv0/cores=4/nodes=1/splits=0     11.0 ± 0%    10.5 ± 0%   -4.55% (p=0.000 n=5+4)
kv0/cores=4/nodes=1/splits=100   7.90 ± 0%    7.60 ± 0%   -3.80% (p=0.000 n=5+4)
kv0/cores=4/nodes=3/splits=0     15.7 ± 0%    15.2 ± 0%   -3.18% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100   8.90 ± 0%    8.12 ± 3%   -8.76% (p=0.016 n=4+5)
kv0/cores=16/nodes=1/splits=0    3.46 ± 2%    3.00 ± 0%  -13.29% (p=0.000 n=5+4)
kv0/cores=16/nodes=1/splits=100  4.50 ± 0%    3.36 ± 2%  -25.33% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0    4.50 ± 0%    3.90 ± 0%  -13.33% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100  5.80 ± 0%    4.10 ± 0%  -29.31% (p=0.029 n=4+4)
kv0/cores=36/nodes=1/splits=0    6.80 ± 0%    5.20 ± 0%  -23.53% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100  5.80 ± 0%    4.32 ± 4%  -25.52% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0    7.72 ± 2%    6.30 ± 0%  -18.39% (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100  7.98 ± 2%    5.20 ± 0%  -34.84% (p=0.000 n=5+4)
kv95/cores=4/nodes=1/splits=0    5.38 ± 3%    5.20 ± 0%   -3.35% (p=0.167 n=5+5)
kv95/cores=4/nodes=1/splits=100  5.00 ± 0%    5.00 ± 0%   ~      (all equal)
kv95/cores=4/nodes=3/splits=0    5.68 ± 3%    5.50 ± 0%   -3.17% (p=0.095 n=5+4)
kv95/cores=4/nodes=3/splits=100  3.60 ±31%    2.93 ± 3%  -18.75% (p=0.016 n=5+4)
kv95/cores=16/nodes=1/splits=0   4.10 ± 0%    4.10 ± 0%   ~      (all equal)
kv95/cores=16/nodes=1/splits=100 4.50 ± 0%    4.10 ± 0%   -8.89% (p=0.000 n=5+4)
kv95/cores=16/nodes=3/splits=0   2.60 ± 0%    2.60 ± 0%   ~      (all equal)
kv95/cores=16/nodes=3/splits=100 2.50 ± 0%    1.90 ± 5%  -24.00% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0   6.60 ± 0%    6.00 ± 0%   -9.09% (p=0.029 n=4+4)
kv95/cores=36/nodes=1/splits=100 5.50 ± 0%    5.12 ± 2%   -6.91% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0   4.18 ± 2%    4.02 ± 3%   -3.71% (p=0.000 n=4+5)
kv95/cores=36/nodes=3/splits=100 3.80 ± 0%    2.80 ± 0%  -26.32% (p=0.008 n=5+5)
```

#### Large-machine Benchmarks

These are the standard benchmarks run on a single-node cluster with 72 vCPUs.
```
name                            old ops/sec  new ops/sec  delta
kv0/cores=72/nodes=1/splits=0    31.0k ± 4%   36.4k ± 1%  +17.57% (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100  44.0k ± 0%   49.0k ± 1%  +11.41% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0   52.7k ±18%   72.6k ±26%  +37.70% (p=0.016 n=5+5)
kv95/cores=72/nodes=1/splits=100 66.8k ±17%   68.5k ± 5%   ~      (p=0.286 n=5+4)

name                            old p50(ms)  new p50(ms)  delta
kv0/cores=72/nodes=1/splits=0    2.30 ±13%    2.52 ± 5%   ~      (p=0.214 n=5+5)
kv0/cores=72/nodes=1/splits=100  3.00 ± 0%    2.90 ± 0%   -3.33% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0   0.46 ±13%    0.50 ± 0%   ~      (p=0.444 n=5+5)
kv95/cores=72/nodes=1/splits=100 0.44 ±14%    0.50 ± 0%  +13.64% (p=0.167 n=5+5)

name                            old p99(ms)  new p99(ms)  delta
kv0/cores=72/nodes=1/splits=0    18.9 ± 6%    13.3 ± 5%  -29.56% (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100  13.4 ± 2%    11.0 ± 0%  -17.91% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0   34.4 ±34%    23.5 ±24%  -31.74% (p=0.048 n=5+5)
kv95/cores=72/nodes=1/splits=100 21.0 ± 0%    19.1 ± 4%   -8.81% (p=0.029 n=4+4)
```

#### Motivating Benchmarks

These are benchmarks that used to generate a lot of contention in the CommandQueue. They have small cycle lengths, indicated by the `c` specifier. The last one also includes 20% scan operations, which increases contention between non-overlapping point operations.

```
name                                    old ops/sec  new ops/sec  delta
kv95-c5/cores=16/nodes=1/splits=0        45.1k ± 1%   47.2k ± 4%   +4.59% (p=0.008 n=5+5)
kv95-c5/cores=36/nodes=1/splits=0        44.6k ± 1%   76.3k ± 1%  +71.05% (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0      27.2k ± 2%   29.4k ± 1%   +8.12% (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0      42.6k ± 2%   50.0k ± 1%  +17.39% (p=0.008 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0   28.7k ± 1%   29.8k ± 3%   +3.87% (p=0.008 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0   41.9k ± 4%   52.8k ± 2%  +25.97% (p=0.008 n=5+5)

name                                    old p50(ms)  new p50(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0        0.60 ± 0%    0.60 ± 0%   ~      (all equal)
kv95-c5/cores=36/nodes=1/splits=0        0.90 ± 0%    0.80 ± 0%  -11.11% (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0      1.10 ± 0%    1.06 ± 6%   ~      (p=0.444 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0      1.26 ± 5%    1.30 ± 0%   ~      (p=0.444 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0   0.66 ± 9%    0.60 ± 0%   -9.09% (p=0.167 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0   0.70 ± 0%    0.50 ± 0%  -28.57% (p=0.008 n=5+5)

name                                    old p99(ms)  new p99(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0        2.40 ± 0%    2.10 ± 0%  -12.50% (p=0.000 n=5+4)
kv95-c5/cores=36/nodes=1/splits=0        5.80 ± 0%    3.30 ± 0%  -43.10% (p=0.000 n=5+4)
kv50-c128/cores=16/nodes=1/splits=0      3.50 ± 0%    3.00 ± 0%  -14.29% (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0      6.80 ± 0%    4.70 ± 0%  -30.88% (p=0.079 n=4+5)
kv70-20-c128/cores=16/nodes=1/splits=0   5.00 ± 0%    4.70 ± 0%   -6.00% (p=0.029 n=4+4)
kv70-20-c128/cores=36/nodes=1/splits=0   11.0 ± 0%    6.8 ± 0%   -38.18% (p=0.008 n=5+5)
```

#### Batching Benchmarks

One optimization left out of the new spanlatch.Manager was the "covering" optimization, where commands were initially added to the interval tree as a single spanning interval and only expanded later. I ran a series of benchmarks to verify that this optimization was not needed. My hypothesis was that the order-of-magnitude increase in the speed of the interval tree would make the optimization unnecessary. It turns out that removing the optimization hurt a few benchmarks to a small degree but sped up others tremendously (some benchmarks improved by over 400%). I suspect that the covering optimization could actually hurt in cases where it causes non-overlapping requests to overlap (e.g. a batch writing only to keys "a" and "z" would initially be inserted as the single covering span [a,z], forcing an unrelated write to "m" to wait on it).
It is interesting how quickly a few of these benchmarks oscillate from small losses to big wins. It makes me think that there's some non-linear behavior in the old CommandQueue that would cause its performance to degrade quickly once it became a contention bottleneck.

```
name                              old ops/sec  new ops/sec  delta
kv0-b16/cores=4/nodes=1/splits=0     2.41k ± 0%  2.06k ± 3%   -14.75% (p=0.008 n=5+5)
kv0-b16/cores=4/nodes=1/splits=100     514 ± 0%    534 ± 1%    +3.88% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0    2.95k ± 0%  4.35k ± 0%   +47.74% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100  1.80k ± 1%  1.88k ± 1%    +4.46% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0    2.74k ± 0%  4.92k ± 1%   +79.55% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100  2.39k ± 1%  2.45k ± 1%    +2.41% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0      422 ± 0%    518 ± 1%   +22.60% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100   98.4 ± 1%   98.8 ± 1%    ~      (p=0.810 n=5+5)
kv0-b128/cores=16/nodes=1/splits=0     532 ± 0%   1059 ± 0%   +99.16% (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100   291 ± 1%    307 ± 1%    +5.18% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0     483 ± 0%   1288 ± 1%  +166.37% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100   394 ± 1%    408 ± 1%    +3.51% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0    49.7 ± 1%   72.8 ± 1%   +46.52% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100  30.8 ± 0%   23.4 ± 0%   -24.03% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0   48.9 ± 2%  160.6 ± 0%  +228.38% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=100  101 ± 1%     80 ± 0%   -21.64% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0   37.5 ± 0%  208.1 ± 1%  +454.99% (p=0.016 n=4+5)
kv0-b1024/cores=36/nodes=1/splits=100  162 ± 0%    124 ± 0%   -23.22% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0    5.93k ± 0%  6.20k ± 1%    +4.55% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100  2.27k ± 1%  2.32k ± 1%    +2.28% (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=0   5.15k ± 1% 18.79k ± 1%  +264.73% (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=100 8.31k ± 1%  8.57k ± 1%    +3.16% (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0   3.96k ± 0% 10.67k ± 1%  +169.81% (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=100 15.7k ± 2%  16.2k ± 4%    +2.75% (p=0.151 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0   1.12k ± 1%  1.27k ± 0%   +13.28% (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100   290 ± 1%    299 ± 1%    +3.02% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0  1.06k ± 0%  3.31k ± 0%  +213.09% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100  662 ±91%   1095 ± 1%   +65.42% (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0    715 ± 2%   3586 ± 0%  +401.21% (p=0.008 n=5+5)
kv95-b128/cores=36/nodes=1/splits=100 1.15k ±90% 2.01k ± 2%   +74.79% (p=0.016 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0    134 ± 1%    170 ± 1%   +26.59% (p=0.008 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=100 54.8 ± 3%   53.3 ± 3%    -2.84% (p=0.056 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=0   104 ± 3%    367 ± 1%  +252.37% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 210 ± 1%    214 ± 1%    +1.86% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0  76.5 ± 2%  383.9 ± 1%  +401.67% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 431 ± 1%    436 ± 1%    +1.17% (p=0.016 n=5+5)

name                              old p50(ms)  new p50(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0      3.00 ± 0%   3.40 ± 0%   +13.33% (p=0.016 n=5+4)
kv0-b16/cores=4/nodes=1/splits=100    15.2 ± 0%   14.7 ± 0%    -3.29% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0     10.5 ± 0%    7.7 ± 2%   -26.48% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100   17.8 ± 0%   16.8 ± 0%    -5.62% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0     26.2 ± 0%   14.2 ± 0%   -45.80% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100   29.0 ± 2%   28.3 ± 0%    -2.28% (p=0.095 n=5+4)
kv0-b128/cores=4/nodes=1/splits=0     17.8 ± 0%   15.2 ± 0%   -14.61% (p=0.000 n=5+4)
kv0-b128/cores=4/nodes=1/splits=100   79.7 ± 0%   79.7 ± 0%    ~      (all equal)
kv0-b128/cores=16/nodes=1/splits=0    65.0 ± 0%   32.5 ± 0%   -50.00% (p=0.029 n=4+4)
kv0-b128/cores=16/nodes=1/splits=100   109 ± 0%    105 ± 0%    -3.85% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0     168 ± 0%     50 ± 0%   -70.02% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100   184 ± 0%    176 ± 0%    -4.50% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0     159 ± 0%    109 ± 0%   -31.56% (p=0.000 n=5+4)
kv0-b1024/cores=4/nodes=1/splits=100   252 ± 0%    319 ± 0%   +26.66% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0    705 ± 0%    193 ± 0%   -72.62% (p=0.000 n=5+4)
kv0-b1024/cores=16/nodes=1/splits=100  319 ± 0%    386 ± 0%   +21.05% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0  1.88k ± 0%  0.24k ± 0%   -87.05% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100  436 ± 0%    570 ± 0%   +30.77% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0     1.20 ± 0%   1.20 ± 0%    ~      (all equal)
kv95-b16/cores=4/nodes=1/splits=100   2.60 ± 0%   2.60 ± 0%    ~      (all equal)
kv95-b16/cores=16/nodes=1/splits=0    6.30 ± 0%   1.40 ± 0%   -77.78% (p=0.000 n=5+4)
kv95-b16/cores=16/nodes=1/splits=100  1.74 ± 3%   1.76 ± 3%    ~      (p=1.000 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0    11.5 ± 0%    5.5 ± 0%   -52.17% (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100  2.42 ±20%   2.42 ±45%    ~      (p=0.579 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0    6.60 ± 0%   6.00 ± 0%    -9.09% (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100  21.4 ± 3%   21.0 ± 0%    ~      (p=0.444 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0   30.4 ± 0%    9.4 ± 0%   -69.08% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100 38.2 ±76%   21.2 ± 4%   -44.31% (p=0.063 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0   88.1 ± 0%   16.8 ± 0%   -80.93% (p=0.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100 56.6 ±85%   29.6 ±15%    ~      (p=0.873 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0   52.4 ± 0%   44.0 ± 0%   -16.03% (p=0.029 n=4+4)
kv95-b1024/cores=4/nodes=1/splits=100  132 ± 2%    143 ± 0%    +8.29% (p=0.016 n=5+4)
kv95-b1024/cores=16/nodes=1/splits=0   325 ± 3%     80 ± 0%   -75.51% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 151 ± 0%    151 ± 0%    ~      (all equal)
kv95-b1024/cores=36/nodes=1/splits=0   973 ± 0%    180 ± 3%   -81.55% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 168 ± 0%    168 ± 0%    ~      (all equal)

name                              old p99(ms)  new p99(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0      8.40 ± 0%  10.30 ± 3%   +22.62% (p=0.016 n=4+5)
kv0-b16/cores=4/nodes=1/splits=100    29.4 ± 0%   27.3 ± 0%    -7.14% (p=0.000 n=5+4)
kv0-b16/cores=16/nodes=1/splits=0     16.3 ± 0%   15.5 ± 2%    -4.91% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100   31.5 ± 0%   29.4 ± 0%    -6.67% (p=0.000 n=5+4)
kv0-b16/cores=36/nodes=1/splits=0     37.7 ± 0%   28.7 ± 2%   -23.77% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100   62.1 ± 2%   68.4 ±10%   +10.15% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0     37.7 ± 0%   39.4 ± 6%    +4.46% (p=0.167 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100    143 ± 0%    151 ± 0%    +5.89% (p=0.016 n=4+5)
kv0-b128/cores=16/nodes=1/splits=0    79.7 ± 0%   55.8 ± 2%   -30.04% (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100   198 ± 3%    188 ± 3%    -5.09% (p=0.048 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0     184 ± 0%    126 ± 3%   -31.82% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100   319 ± 0%    336 ± 0%    +5.24% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0     322 ± 6%    253 ± 4%   -21.35% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100   470 ± 0%    772 ± 4%   +64.28% (p=0.016 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=0  1.41k ± 0%  0.56k ±11%   -60.00% (p=0.000 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=100  530 ± 2%    772 ± 0%   +45.57% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0  4.05k ± 7%  1.17k ± 3%   -71.19% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100  792 ±14%   1020 ± 2%   +28.81% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0     3.90 ± 0%   3.22 ± 4%   -17.44% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100   21.0 ± 0%   19.9 ± 0%    -5.24% (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=0    15.2 ± 0%    7.1 ± 0%   -53.29% (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=100  38.5 ± 3%   37.7 ± 0%    ~      (p=0.333 n=5+4)
kv95-b16/cores=36/nodes=1/splits=0     128 ± 2%     52 ± 0%   -59.16% (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100  41.1 ±13%   39.2 ±33%    ~      (p=0.984 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0    17.8 ± 0%   14.7 ± 0%   -17.42% (p=0.079 n=4+5)
kv95-b128/cores=4/nodes=1/splits=100   107 ± 2%    106 ± 5%    ~      (p=0.683 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0   75.5 ± 0%   23.1 ± 0%   -69.40% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100  107 ±34%    120 ± 2%    ~      (p=1.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0    253 ± 4%     71 ± 0%   -71.86% (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100  166 ±19%    164 ±74%    ~      (p=0.310 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=0    146 ± 3%    101 ± 0%   -31.01% (p=0.000 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=100  348 ± 4%    366 ± 6%    ~      (p=0.317 n=4+5)
kv95-b1024/cores=16/nodes=1/splits=0   624 ± 3%    221 ± 2%   -64.52% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 325 ± 3%    319 ± 0%    ~      (p=0.444 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0 1.56k ± 5%  0.41k ± 2%   -73.71% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 336 ± 0%    336 ± 0%    ~      (all equal)
```

Release note (performance improvement): Replace the Replica latching mechanism with a new optimized data structure that improves throughput, especially under heavy contention.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Informs #4768.
Informs #31904.
This change was inspired by #31904 and is a progression of the thinking started in #4768 (comment).
The change introduces `spanlatch.Manager`, which will replace the `CommandQueue` in a future PR. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in the storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes #31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue, and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because `Replica.beginCmds` repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges.

The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice, so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock.
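To make that sequencing flow concrete, here is a minimal Go sketch of the acquire path under the copy-on-write scheme. This is an illustration only: the stubbed types and the helper names (`insertLocked`, `wait`, `snap`) are assumptions for the sketch, not the actual `spanlatch` API.

```go
package spanlatch

import (
	"context"
	"sync"
)

// Stub types standing in for the real declared-span set and the
// immutable interval btrees; details elided.
type spanSet struct{}
type btree struct{}

// Clone returns an O(1) copy-on-write snapshot of the tree.
func (t *btree) Clone() btree { return *t }

// snap is an immutable view of the manager's trees at acquisition time.
type snap struct{ read, write btree }

// Guard is the handle held by a request while its latches are acquired.
type Guard struct{ spans *spanSet }

type Manager struct {
	mu    sync.Mutex
	read  btree // latches held for reading
	write btree // latches held for writing
}

// Acquire sequences a request. All work under mu is proportional to the
// number of spans the request declares, not to the number of requests
// it must wait for.
func (m *Manager) Acquire(ctx context.Context, spans *spanSet) (*Guard, error) {
	lg := &Guard{spans: spans} // the single expected allocation

	m.mu.Lock()
	s := snap{read: m.read.Clone(), write: m.write.Clone()} // O(1)
	m.insertLocked(lg)
	m.mu.Unlock()

	// Wait on overlapping latches by iterating over the immutable
	// snapshot, entirely outside of the mutex. Cancellation is simple:
	// stop iterating when ctx is done.
	if err := m.wait(ctx, lg, s); err != nil {
		m.Release(lg)
		return nil, err
	}
	return lg, nil
}

func (m *Manager) insertLocked(lg *Guard) {} // insert lg's latches; elided

func (m *Manager) wait(ctx context.Context, lg *Guard, s snap) error {
	return nil // block on overlapping latches in s, respecting ctx; elided
}

func (m *Manager) Release(lg *Guard) {} // remove lg's latches; elided
```

However the trees are organized internally, the key property the sketch tries to capture is that the mutex is dropped before any waiting happens.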
While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the `spanlatch.Guard`. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting the immutable btree nodes, which allows us to release them back into the btree node pools when they're no longer needed. This means that we don't expect any allocations when inserting into the internal trees, even with the copy-on-write policy.

Finally, this will make the approach taken in #31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after the fact, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free.
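To make the node-recycling point above concrete, here is a minimal sketch of atomic reference counting over pooled btree nodes. Everything here (field names, the `mut` helper, the two-pool split) is an assumed illustration of the technique rather than the actual implementation.

```go
package spanlatch

import (
	"sync"
	"sync/atomic"
)

type node struct {
	ref      int32 // atomic reference count
	leaf     bool
	children []*node // empty for leaf nodes; keys/latches elided
}

// Separate pools because leaf and non-leaf nodes have different footprints.
var (
	leafPool  = sync.Pool{New: func() interface{} { return &node{leaf: true} }}
	innerPool = sync.Pool{New: func() interface{} { return &node{} }}
)

func newNode(leaf bool) *node {
	var n *node
	if leaf {
		n = leafPool.Get().(*node)
	} else {
		n = innerPool.Get().(*node)
	}
	n.ref = 1 // sole owner; not yet shared
	return n
}

func (n *node) incRef() { atomic.AddInt32(&n.ref, 1) }

// decRef releases the node back into its pool once no tree references
// it, recursively releasing its children.
func (n *node) decRef() {
	if atomic.AddInt32(&n.ref, -1) > 0 {
		return
	}
	for _, c := range n.children {
		c.decRef()
	}
	n.children = n.children[:0]
	if n.leaf {
		leafPool.Put(n)
	} else {
		innerPool.Put(n)
	}
}

// mut applies the copy-on-write rule: a tree may mutate a node in place
// only if it owns it exclusively (ref == 1); otherwise it clones the
// node from the pool, bumping the refcount of each adopted child.
func mut(n **node) *node {
	if atomic.LoadInt32(&(*n).ref) == 1 {
		return *n
	}
	c := newNode((*n).leaf)
	c.children = append(c.children, (*n).children...)
	for _, child := range c.children {
		child.incRef()
	}
	(*n).decRef()
	*n = c
	return c
}
```

Because ownership is tracked per node, repeated mutations by the same tree hit the `ref == 1` fast path and allocate nothing.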
### Naming changes

`CommandQueue` → `spanlatch.Manager`

The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is that it is critical that we don't confuse the operational locking performed by the `CommandQueue`/`spanlatch.Manager` with the transaction-scoped locking enforced by intents and our transactional concurrency control model.
### Microbenchmarks
NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure.
There are a few interesting things to point out about these benchmark results:
- The `ReadOnlyMix` results demonstrate a fixed improvement, regardless of size. This is due to the replacement of the hash-map with a linked-list for the readSet structure (see the sketch below).
- `ReadWriteMix` is more interesting. We see that the spanlatch implementation is faster across the board. This is especially true with a high write/read ratio.
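As a rough illustration of that first point, a readSet kept as an intrusive doubly-linked list makes insertion and removal O(1) with no per-operation allocation, which is plausibly where the fixed improvement over a hash-map comes from. The layout below is an assumption for illustration, not the actual `spanlatch` code.

```go
package spanlatch

// latch embeds its own list links, so membership in the readSet costs
// no allocation beyond the latch itself.
type latch struct {
	// span and sequencing fields elided.
	next, prev *latch
}

// readSet is a circular intrusive list with a sentinel root, in the
// spirit of container/list but without per-element wrapper allocations.
type readSet struct {
	root latch
	len  int
}

func (s *readSet) init() {
	s.root.next = &s.root
	s.root.prev = &s.root
}

// insert pushes la onto the back of the list in O(1).
func (s *readSet) insert(la *latch) {
	if s.root.next == nil {
		s.init()
	}
	la.prev = s.root.prev
	la.next = &s.root
	la.prev.next = la
	la.next.prev = la
	s.len++
}

// remove unlinks la in O(1) given only a pointer to it; a hash-map
// would instead pay for hashing and occasional rehash growth.
func (s *readSet) remove(la *latch) {
	la.prev.next = la.next
	la.next.prev = la.prev
	la.next, la.prev = nil, nil
	s.len--
}
```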
Release note: None