storage/spanlatch: create spanlatch.Manager using immutable btrees #31997
Conversation
There's still plenty of room to optimize the btree implementation that we use here. Until this point we've closely followed …
32164: storage/cmdq: create new signal type for cmd completion signaling r=nvanbenschoten a=nvanbenschoten

`signal` is a type that can signal the completion of an operation. This is a component of the larger change in #31997.

The type has three benefits over using a channel directly and closing the channel when the operation completes:
1. signaled() uses atomics to provide a fast-path for checking whether the operation has completed. It is ~75x faster than using a channel for this purpose.
2. the type's channel is lazily initialized when signalChan() is called, avoiding the allocation when one is not needed.
3. because of 2, the type's zero value can be used directly.

Release note: None

Co-authored-by: Nathan VanBenschoten <[email protected]>
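As a rough sketch of those three properties, something like the following would work; the names and details below are illustrative assumptions, not the PR's actual code:

```
package main

import (
	"sync"
	"sync/atomic"
)

// closedChan is a shared, already-closed channel handed to callers
// that ask for a channel after the signal has fired.
var closedChan = func() chan struct{} {
	ch := make(chan struct{})
	close(ch)
	return ch
}()

// sketchSignal approximates the described `signal` type: an atomic
// word gives the fast-path check, and the channel is only allocated
// for callers that actually block. Its zero value works as-is.
type sketchSignal struct {
	done int32
	mu   sync.Mutex
	ch   chan struct{}
}

// signaled is the fast path: one atomic load, no channel operations.
func (s *sketchSignal) signaled() bool {
	return atomic.LoadInt32(&s.done) == 1
}

// signalChan lazily allocates the channel, so callers that never
// block never pay for the allocation.
func (s *sketchSignal) signalChan() <-chan struct{} {
	if s.signaled() {
		return closedChan
	}
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.signaled() { // re-check under the lock
		return closedChan
	}
	if s.ch == nil {
		s.ch = make(chan struct{})
	}
	return s.ch
}

// signal marks the operation complete and wakes any waiters. It must
// be called at most once.
func (s *sketchSignal) signal() {
	s.mu.Lock()
	defer s.mu.Unlock()
	atomic.StoreInt32(&s.done, 1)
	if s.ch != nil {
		close(s.ch)
	}
}
```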
32165: storage/cmdq: create new specialized augmented interval btree r=nvanbenschoten a=nvanbenschoten

This is a component of the larger change in #31997.

The first few commits here modify the existing interval btree implementation, allowing us to properly benchmark against it.

The second to last commit forks https://github.com/petermattis/pebble/blob/master/internal/btree/btree.go, specializes it to the command queue, and rips out any references to pebble. There are a number of changes we'll need to make to it:
1. Add synchronized node and leafNode freelists
2. Add Clear method to release owned nodes into freelists
3. Introduce immutability and a copy-on-write policy

The next commit modifies the btree type added in the previous commit and turns it into an augmented interval tree. The tree represents intervals and permits an interval search operation following the approach laid out in CLRS, Chapter 14. The B-Tree stores cmds in order based on their start key and each B-Tree node maintains the upper-bound end key of all cmds in its subtree. This is close to what `util/interval.btree` does, although the new version doesn't maintain the lower-bound start key of all cmds in each node.

The new interval btree is significantly faster than both the old interval btree and the old interval llrb tree because it minimizes key comparisons while scanning for overlaps. This includes avoiding all key comparisons for cmds with start keys that are greater than the search range's start key. See the comment on `overlapScan` for an explanation of how this is possible.

The new interval btree is also faster because it has been specialized for the `storage/cmdq` package. This allows it to avoid interfaces and dynamic dispatch throughout its operations, which showed up prominently on profiles of the other two implementations.

A third benefit of the rewrite is that it inherits the optimizations made in pebble's btree. This includes inlining the btree items and child pointers in nodes instead of using slices.
### Benchmarks:

_The new interval btree:_
```
Insert/count=16-4               76.1ns ± 4%
Insert/count=128-4               156ns ± 4%
Insert/count=1024-4              259ns ± 8%
Insert/count=8192-4              386ns ± 1%
Insert/count=65536-4             735ns ± 5%
Delete/count=16-4                129ns ±16%
Delete/count=128-4               189ns ±12%
Delete/count=1024-4              338ns ± 7%
Delete/count=8192-4              547ns ± 4%
Delete/count=65536-4            1.22µs ±12%
DeleteInsert/count=16-4          168ns ± 2%
DeleteInsert/count=128-4         375ns ± 8%
DeleteInsert/count=1024-4        562ns ± 1%
DeleteInsert/count=8192-4        786ns ± 3%
DeleteInsert/count=65536-4      2.31µs ±26%
IterSeekGE/count=16-4           87.2ns ± 3%
IterSeekGE/count=128-4           141ns ± 3%
IterSeekGE/count=1024-4          227ns ± 4%
IterSeekGE/count=8192-4          379ns ± 2%
IterSeekGE/count=65536-4         882ns ± 1%
IterSeekLT/count=16-4           89.5ns ± 3%
IterSeekLT/count=128-4           145ns ± 1%
IterSeekLT/count=1024-4          226ns ± 6%
IterSeekLT/count=8192-4          379ns ± 1%
IterSeekLT/count=65536-4         891ns ± 1%
IterFirstOverlap/count=16-4      184ns ± 1%
IterFirstOverlap/count=128-4     260ns ± 3%
IterFirstOverlap/count=1024-4    685ns ± 7%
IterFirstOverlap/count=8192-4   1.23µs ± 2%
IterFirstOverlap/count=65536-4  2.14µs ± 1%
IterNext-4                      3.82ns ± 2%
IterPrev-4                      14.8ns ± 2%
IterNextOverlap-4               8.57ns ± 2%
IterOverlapScan-4               25.8µs ± 3%
```

_Compared to old llrb interval tree (currently in use):_
```
Insert/count=16-4           323ns ± 7%    76ns ± 4%  -76.43%  (p=0.008 n=5+5)
Insert/count=128-4          539ns ± 2%   156ns ± 4%  -71.05%  (p=0.008 n=5+5)
Insert/count=1024-4         797ns ± 1%   259ns ± 8%  -67.52%  (p=0.008 n=5+5)
Insert/count=8192-4        1.30µs ± 5%  0.39µs ± 1%  -70.38%  (p=0.008 n=5+5)
Insert/count=65536-4       2.69µs ±11%  0.74µs ± 5%  -72.65%  (p=0.008 n=5+5)
Delete/count=16-4           438ns ± 7%   129ns ±16%  -70.44%  (p=0.008 n=5+5)
Delete/count=128-4          785ns ± 6%   189ns ±12%  -75.89%  (p=0.008 n=5+5)
Delete/count=1024-4        1.38µs ± 2%  0.34µs ± 7%  -75.44%  (p=0.008 n=5+5)
Delete/count=8192-4        2.36µs ± 2%  0.55µs ± 4%  -76.82%  (p=0.008 n=5+5)
Delete/count=65536-4       4.73µs ±13%  1.22µs ±12%  -74.19%  (p=0.008 n=5+5)
DeleteInsert/count=16-4     920ns ± 2%   168ns ± 2%  -81.76%  (p=0.008 n=5+5)
DeleteInsert/count=128-4   1.73µs ± 4%  0.37µs ± 8%  -78.35%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4  2.69µs ± 3%  0.56µs ± 1%  -79.15%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4  4.55µs ±25%  0.79µs ± 3%  -82.70%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4 7.53µs ± 6%  2.31µs ±26%  -69.32%  (p=0.008 n=5+5)
IterOverlapScan-4           285µs ± 7%    26µs ± 3%  -90.96%  (p=0.008 n=5+5)
```

_Compared to old btree interval tree (added in a61191e, never enabled):_
```
Insert/count=16-4           231ns ± 1%    76ns ± 4%  -66.99%  (p=0.008 n=5+5)
Insert/count=128-4          351ns ± 2%   156ns ± 4%  -55.53%  (p=0.008 n=5+5)
Insert/count=1024-4         515ns ± 5%   259ns ± 8%  -49.73%  (p=0.008 n=5+5)
Insert/count=8192-4         786ns ± 3%   386ns ± 1%  -50.85%  (p=0.008 n=5+5)
Insert/count=65536-4       1.50µs ± 3%  0.74µs ± 5%  -50.97%  (p=0.008 n=5+5)
Delete/count=16-4           363ns ±11%   129ns ±16%  -64.33%  (p=0.008 n=5+5)
Delete/count=128-4          466ns ± 9%   189ns ±12%  -59.42%  (p=0.008 n=5+5)
Delete/count=1024-4         806ns ± 6%   338ns ± 7%  -58.01%  (p=0.008 n=5+5)
Delete/count=8192-4        1.43µs ±13%  0.55µs ± 4%  -61.71%  (p=0.008 n=5+5)
Delete/count=65536-4       2.75µs ± 1%  1.22µs ±12%  -55.57%  (p=0.008 n=5+5)
DeleteInsert/count=16-4     557ns ± 1%   168ns ± 2%  -69.87%  (p=0.008 n=5+5)
DeleteInsert/count=128-4    953ns ± 8%   375ns ± 8%  -60.71%  (p=0.008 n=5+5)
DeleteInsert/count=1024-4  1.19µs ± 4%  0.56µs ± 1%  -52.72%  (p=0.016 n=5+4)
DeleteInsert/count=8192-4  1.84µs ±17%  0.79µs ± 3%  -57.22%  (p=0.008 n=5+5)
DeleteInsert/count=65536-4 3.20µs ± 3%  2.31µs ±26%  -27.86%  (p=0.008 n=5+5)
IterOverlapScan-4          70.1µs ± 2%  25.8µs ± 3%  -63.23%  (p=0.008 n=5+5)
```

Co-authored-by: Nathan VanBenschoten <[email protected]>
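To make the pruning concrete, here is a hedged sketch of the CLRS-style search that the upper-bound augmentation enables. The types, the half-open `[start, end)` span convention, and the function names are assumptions for illustration, not the PR's actual `overlapScan`:

```
package main

import "bytes"

// cmd and node are simplified stand-ins for the specialized tree's
// types; only the fields needed for the augmented search are shown.
type cmd struct {
	start, end []byte // the interval [start, end)
}

type node struct {
	cmds     []cmd
	children []*node // empty for leaf nodes
	max      []byte  // upper bound on end keys of the whole subtree
}

// overlaps reports whether [qStart, qEnd) intersects the cmd's span.
func (c *cmd) overlaps(qStart, qEnd []byte) bool {
	return bytes.Compare(c.start, qEnd) < 0 && bytes.Compare(qStart, c.end) < 0
}

// anyOverlap walks the tree CLRS-style: a subtree whose max end key
// is <= the query start key can be pruned without any per-cmd key
// comparisons, which is where the augmentation pays off.
func anyOverlap(n *node, qStart, qEnd []byte) bool {
	if n == nil || bytes.Compare(n.max, qStart) <= 0 {
		return false // the whole subtree ends before the query starts
	}
	for i := range n.cmds {
		if n.cmds[i].overlaps(qStart, qEnd) {
			return true
		}
		if len(n.children) > i && anyOverlap(n.children[i], qStart, qEnd) {
			return true
		}
		// cmds are sorted by start key, so once a cmd starts at or
		// after the query end, nothing further right can overlap.
		if bytes.Compare(n.cmds[i].start, qEnd) >= 0 {
			return false
		}
	}
	if len(n.children) > len(n.cmds) {
		return anyOverlap(n.children[len(n.cmds)], qStart, qEnd)
	}
	return false
}
```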
32251: storage/cmdq: O(1) copy-on-write btree clones and atomic refcount GC policy r=nvanbenschoten a=nvanbenschoten

All commits from #32165 except the last one.

This change introduces O(1) btree cloning and a new copy-on-write scheme, essentially giving the btree an immutable API (for which I took inspiration from https://docs.rs/crate/im/). This is made efficient by the second part of the change - a new garbage collection policy for btrees. Nodes are now reference counted atomically and freed into global `sync.Pool`s when they are no longer referenced.

One of the main ideas in #31997 is to treat the btrees backing the command queue as immutable structures. In doing so, we adopt a copy-on-write scheme. Trees are cloned under lock and then accessed concurrently. When future writers want to modify the tree, they can do so by cloning any nodes that they touch. This commit provides this functionality in a much more elegant manner than 6994347. Instead of giving each node a "copy-on-write context", we instead give each node a reference count. We then use the following rules:
1. Trees with exclusive ownership (refcount == 1) over a node can modify it in-place.
2. Trees without exclusive ownership over a node must clone the node in order to modify it. Once cloned, the tree will now have exclusive ownership over that node. When cloning the node, the reference count of all of the node's children must be incremented.

In following these simple rules, we end up with a really nice property - trees gain more and more "ownership" as they make modifications, meaning that subsequent modifications are much less likely to need to clone nodes. Essentially, we transparently incorporate the idea of local mutations (e.g. Clojure's transients or Haskell's ST monad) without any external API needed.

Even better, reference counting internal nodes ties directly into the new GC policy, which allows us to recycle old nodes and make the copy-on-write scheme zero-allocation in almost all cases. When a node's reference count drops to 0, we simply toss it into a `sync.Pool`. We keep two separate pools - one for leaf nodes and one for non-leaf nodes. This wasn't possible with the previous "copy-on-write context" approach.

The atomic reference counting does have an effect on benchmarks, but it's not a big one (single/double digit ns) and is negligible compared to the speedup observed in #32165.

```
name                             old time/op  new time/op  delta
BTreeInsert/count=16-4           73.2ns ± 4%  84.4ns ± 4%  +15.30%  (p=0.008 n=5+5)
BTreeInsert/count=128-4           152ns ± 4%   167ns ± 4%   +9.89%  (p=0.008 n=5+5)
BTreeInsert/count=1024-4          250ns ± 1%   263ns ± 2%   +5.21%  (p=0.008 n=5+5)
BTreeInsert/count=8192-4          381ns ± 1%   394ns ± 2%   +3.36%  (p=0.008 n=5+5)
BTreeInsert/count=65536-4         720ns ± 6%   746ns ± 1%     ~     (p=0.119 n=5+5)
BTreeDelete/count=16-4            127ns ±15%   131ns ± 9%     ~     (p=0.690 n=5+5)
BTreeDelete/count=128-4           182ns ± 8%   192ns ± 8%     ~     (p=0.222 n=5+5)
BTreeDelete/count=1024-4          323ns ± 3%   340ns ± 4%   +5.20%  (p=0.032 n=5+5)
BTreeDelete/count=8192-4          532ns ± 2%   556ns ± 1%   +4.55%  (p=0.008 n=5+5)
BTreeDelete/count=65536-4        1.15µs ± 2%  1.22µs ± 7%     ~     (p=0.222 n=5+5)
BTreeDeleteInsert/count=16-4      166ns ± 4%   174ns ± 3%   +4.70%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=128-4     370ns ± 2%   383ns ± 1%   +3.57%  (p=0.008 n=5+5)
BTreeDeleteInsert/count=1024-4    548ns ± 3%   575ns ± 5%   +4.89%  (p=0.032 n=5+5)
BTreeDeleteInsert/count=8192-4    775ns ± 1%   789ns ± 1%   +1.86%  (p=0.016 n=5+5)
BTreeDeleteInsert/count=65536-4  2.20µs ±22%  2.10µs ±18%     ~     (p=0.841 n=5+5)
```

We can see how important the GC and memory re-use policy is by comparing the following few benchmarks. Specifically, notice the difference in operation speed and allocation count in `BenchmarkBTreeDeleteInsertCloneEachTime` between the tests that `Reset` old clones (allowing nodes to be freed into `sync.Pool`s) and the tests that don't `Reset` old clones.

```
name                                                      time/op
BTreeDeleteInsert/count=16-4                               198ns ±28%
BTreeDeleteInsert/count=128-4                              375ns ± 3%
BTreeDeleteInsert/count=1024-4                             577ns ± 2%
BTreeDeleteInsert/count=8192-4                             798ns ± 1%
BTreeDeleteInsert/count=65536-4                           2.00µs ±13%
BTreeDeleteInsertCloneOnce/count=16-4                      173ns ± 2%
BTreeDeleteInsertCloneOnce/count=128-4                     379ns ± 2%
BTreeDeleteInsertCloneOnce/count=1024-4                    584ns ± 4%
BTreeDeleteInsertCloneOnce/count=8192-4                    800ns ± 2%
BTreeDeleteInsertCloneOnce/count=65536-4                  2.04µs ±32%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      535ns ± 8%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4    1.29µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   2.22µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   2.55µs ± 5%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  5.89µs ±20%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       240ns ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      610ns ± 4%
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4    1.20µs ± 2%
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4    1.69µs ± 1%
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4   3.52µs ±18%

name                                                      alloc/op
BTreeDeleteInsert/count=16-4                               0.00B
BTreeDeleteInsert/count=128-4                              0.00B
BTreeDeleteInsert/count=1024-4                             0.00B
BTreeDeleteInsert/count=8192-4                             0.00B
BTreeDeleteInsert/count=65536-4                            0.00B
BTreeDeleteInsertCloneOnce/count=16-4                      0.00B
BTreeDeleteInsertCloneOnce/count=128-4                     0.00B
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00B
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00B
BTreeDeleteInsertCloneOnce/count=65536-4                   1.00B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4       288B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4      897B ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4   1.61kB ± 1%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4   1.47kB ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4  2.40kB ±12%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00B
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00B

name                                                      allocs/op
BTreeDeleteInsert/count=16-4                               0.00
BTreeDeleteInsert/count=128-4                              0.00
BTreeDeleteInsert/count=1024-4                             0.00
BTreeDeleteInsert/count=8192-4                             0.00
BTreeDeleteInsert/count=65536-4                            0.00
BTreeDeleteInsertCloneOnce/count=16-4                      0.00
BTreeDeleteInsertCloneOnce/count=128-4                     0.00
BTreeDeleteInsertCloneOnce/count=1024-4                    0.00
BTreeDeleteInsertCloneOnce/count=8192-4                    0.00
BTreeDeleteInsertCloneOnce/count=65536-4                   0.00
BTreeDeleteInsertCloneEachTime/reset=false/count=16-4      1.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=128-4     2.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=1024-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=8192-4    3.00 ± 0%
BTreeDeleteInsertCloneEachTime/reset=false/count=65536-4   4.40 ±14%
BTreeDeleteInsertCloneEachTime/reset=true/count=16-4       0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=128-4      0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=1024-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=8192-4     0.00
BTreeDeleteInsertCloneEachTime/reset=true/count=65536-4    0.00
```

Co-authored-by: Nathan VanBenschoten <[email protected]>
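The two ownership rules lend themselves to a compact sketch. The following is illustrative only, with assumed names like `mutableFor`; the real node also carries keys and items:

```
package main

import "sync/atomic"

// node is a pared-down btree node: a refcount plus child pointers,
// with keys and items elided. All names here are illustrative.
type node struct {
	ref      int32
	children []*node
}

func (n *node) incRef() { atomic.AddInt32(&n.ref, 1) }

// decRef drops a reference; at zero, the real scheme would push the
// node back into a leaf or non-leaf sync.Pool (elided here).
func (n *node) decRef() {
	if atomic.AddInt32(&n.ref, -1) == 0 {
		// return n to the appropriate sync.Pool in the real code
	}
}

// mutableFor applies the two ownership rules from the commit message.
func mutableFor(n *node) *node {
	if atomic.LoadInt32(&n.ref) == 1 {
		// Rule 1: exclusive ownership, so mutate in place.
		return n
	}
	// Rule 2: shared, so clone. The cloning tree now exclusively owns
	// the clone, and the clone takes a new reference on every child.
	c := &node{ref: 1, children: append([]*node(nil), n.children...)}
	for _, child := range c.children {
		child.incRef()
	}
	// The caller's tree gives up its reference to the original node.
	n.decRef()
	return c
}
```

Because a fresh clone starts with `ref == 1`, the next mutation of that same node hits rule 1 directly; this is the "trees gain more and more ownership" property described above.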
A few things to note about the testing here:
I definitely need another pass. These are just the nits I've spotted in a first skim.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/spanlatch/doc.go, line 35 at r5 (raw file):
key ranges was permitted. Conceptually, the structure became an interval tree of sync.RWMutexes.
* The structure become timestamp-aware and concurrent access of non-causal
s/become/became/
pkg/storage/spanlatch/manager.go, line 77 at r5 (raw file):
}

// latches are stored in the Manager's btrees. The represent the latching of a
s/The/They/
pkg/storage/spanlatch/manager.go, line 137 at r5 (raw file):
}

// Guard would be an ideal candidate for object pooling, but without
Nit: move the guard and latch construction to a helper
pkg/storage/spanlatch/manager.go, line 280 at r5 (raw file):
realloc := len(sm.rSet) > 16
for latch := range sm.rSet {
	latch.setInRSet(false)
if the TODO is about exploiting the efficient map clearing idiom, I think it needs to be the only statement in the loop https://go-review.googlesource.com/c/go/+/110055/.
perhaps rewrite this as:
for latch := range sm.rSet {
latch.setInRSet(false)
sm.trees[spanset.SpanReadOnly].Set(latch)
}
if realloc := len(sm.rSet) > 16; realloc {
sm.rSet = make(map[*latch]struct{})
} else {
for latch := range sm.rSet {
delete(sm.rSet, latch)
}
}
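For reference, the idiom that linked Go change recognizes has exactly this shape, with the delete as the loop's only statement (lowered to a single runtime map clear from Go 1.11 on):

```
// Compiles to one runtime map-clear call because the delete is the
// loop's only statement; adding anything else disables the rewrite.
for latch := range sm.rSet {
	delete(sm.rSet, latch)
}
```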
This change renames `storage/cmdq` to `storage/spanlatch`. The package will house the new `spanlatch.Manager` type, which will handle the acquisition and release of span-latches. This works off of the definition for latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). The files are not changed in this commit.

Release note: None
This commit replaces all references to cmds with references to latches.

Release note: None
Informs cockroachdb#4768. Informs cockroachdb#31904.

This change was inspired by cockroachdb#31904 and is a progression of the thinking started in cockroachdb#4768 (comment).

The change introduces `spanlatch.Manager`, which will replace the `CommandQueue` **in a future PR**. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in the storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes cockroachdb#31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue, and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because `Replica.beginCmds` repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges.

The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock.

While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the `spanlatch.Guard`. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting btree nodes, which allows us to release them back into the btree node pools when they're no longer referenced by any btree snapshots. This means that we don't expect any allocations when inserting into the internal trees, even with the COW policy.

Finally, this will make the approach taken in cockroachdb#31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free.

### Naming changes

| Before | After |
|----------------------------|-----------------------------------|
| `CommandQueue` | `spanlatch.Manager` |
| "enter the command queue" | "acquire span latches" |
| "exit the command queue" | "release span latches" |
| "wait for prereq commands" | "wait for latches to be released" |

The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is that it is critical for understanding that we don't confuse the operational locking performed by the CommandQueue/spanlatch.Manager with the transaction-scoped locking enforced by intents and our transactional concurrency control model.

### Microbenchmarks

NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure.

```
name                              cmdq time/op  spanlatch time/op  delta
ReadOnlyMix/size=1-4               897ns ±21%    917ns ±18%     ~     (p=0.897 n=8+10)
ReadOnlyMix/size=4-4               827ns ±22%    772ns ±15%     ~     (p=0.448 n=10+10)
ReadOnlyMix/size=16-4              905ns ±19%    770ns ±10%  -14.90%  (p=0.004 n=10+10)
ReadOnlyMix/size=64-4              907ns ±20%    730ns ±15%  -19.51%  (p=0.001 n=10+10)
ReadOnlyMix/size=128-4             926ns ±17%    731ns ±11%  -21.04%  (p=0.000 n=9+10)
ReadOnlyMix/size=256-4             977ns ±19%    726ns ± 9%  -25.65%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=0-4    12.5µs ± 4%    0.7µs ±17%  -94.70%  (p=0.000 n=8+9)
ReadWriteMix/readsPerWrite=1-4    8.18µs ± 5%   0.63µs ± 6%  -92.24%  (p=0.000 n=10+9)
ReadWriteMix/readsPerWrite=4-4    3.80µs ± 2%   0.66µs ± 5%  -82.58%  (p=0.000 n=8+10)
ReadWriteMix/readsPerWrite=16-4   1.82µs ± 2%   0.70µs ± 5%  -61.43%  (p=0.000 n=9+10)
ReadWriteMix/readsPerWrite=64-4    894ns ±12%    514ns ± 6%  -42.48%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=128-4   717ns ± 5%    472ns ± 1%  -34.21%  (p=0.000 n=10+8)
ReadWriteMix/readsPerWrite=256-4   607ns ± 5%    453ns ± 3%  -25.35%  (p=0.000 n=7+10)

name                              cmdq alloc/op  spanlatch alloc/op  delta
ReadOnlyMix/size=1-4                223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=4-4                223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=16-4               223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=64-4               223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=128-4              223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadOnlyMix/size=256-4              223B ± 0%      191B ± 0%  -14.35%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=0-4      915B ± 0%      144B ± 0%  -84.26%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=1-4      730B ± 0%      144B ± 0%  -80.29%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=4-4      486B ± 0%      144B ± 0%  -70.35%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=16-4     350B ± 0%      144B ± 0%  -58.86%  (p=0.000 n=9+10)
ReadWriteMix/readsPerWrite=64-4     222B ± 0%      144B ± 0%  -35.14%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=128-4    199B ± 0%      144B ± 0%  -27.64%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=256-4    188B ± 0%      144B ± 0%  -23.40%  (p=0.000 n=10+10)

name                              cmdq allocs/op  spanlatch allocs/op  delta
ReadOnlyMix/size=1-4                1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=4-4                1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=16-4               1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=64-4               1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=128-4              1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=256-4              1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=0-4      34.0 ± 0%        1.0 ± 0%  -97.06%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=1-4      22.0 ± 0%        1.0 ± 0%  -95.45%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=4-4      10.0 ± 0%        1.0 ± 0%  -90.00%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=16-4     4.00 ± 0%       1.00 ± 0%  -75.00%  (p=0.000 n=10+10)
ReadWriteMix/readsPerWrite=64-4     1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=128-4    1.00 ± 0%       1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=256-4    1.00 ± 0%       1.00 ± 0%     ~     (all equal)
```

Release note: None
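In outline, the sequencing fast path described above could look like the sketch below. All names and fields are assumptions for illustration; the real `spanlatch.Manager` API differs:

```
package main

import "sync"

// btree stands in for the immutable interval btree; Clone is O(1)
// because it only copies the root pointer (refcount bump elided).
type btree struct{ root *node }

type node struct{ ref int32 }

func (t *btree) Clone() btree { return btree{root: t.root} }

// Guard and Manager are pared-down stand-ins for this sketch.
type Guard struct{}

type Manager struct {
	mu    sync.Mutex
	trees [2]btree // e.g. one read-only tree and one read-write tree
}

// acquire shows the locking pattern: an O(1) snapshot plus O(spans)
// insertion under the mutex, with all waiting done outside it.
func (m *Manager) acquire(g *Guard) [2]btree {
	m.mu.Lock()
	snap := [2]btree{m.trees[0].Clone(), m.trees[1].Clone()}
	m.insertLocked(g) // add this request's latches to the live trees
	m.mu.Unlock()
	// The caller now iterates snap and waits on overlapping latches
	// without holding mu, so waiting never blocks other acquirers.
	return snap
}

func (m *Manager) insertLocked(g *Guard) { /* elided */ }
```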
…g removal

This change modifies `adjustUpperBoundOnRemoval` to avoid a degenerate case in element removal where all intervals have the same end key. In this case, we would previously adjust the upper bound of every node from the root of the tree to the node that the interval was being removed from.

We now check whether removing the element with the largest end key is actually changing the upper bound of the node. If there are other elements with the same end key then this is not the case and we can avoid repeat calls to `adjustUpperBoundOnRemoval` while traversing back up the tree.

This came up while profiling a benchmark that was giving surprising results.

Release note: None
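The fix can be pictured as a cheap guard before any upper-bound propagation. This is an illustrative sketch with assumed names, not the actual `adjustUpperBoundOnRemoval`:

```
package main

import "bytes"

// node is a pared-down stand-in: max is the subtree's upper-bound end
// key, and ends holds the end keys remaining after the removal.
type node struct {
	max  []byte
	ends [][]byte
}

// boundChangedOnRemoval mirrors the described check: removing an
// element only shrinks the node's upper bound if it carried the
// maximal end key and no surviving element ties it. With many equal
// end keys this returns false, stopping the walk back up the tree.
func (n *node) boundChangedOnRemoval(removedEnd []byte) bool {
	if bytes.Compare(removedEnd, n.max) < 0 {
		return false // the removed element never defined the bound
	}
	for _, e := range n.ends {
		if bytes.Equal(e, n.max) {
			return false // another element still pins the bound
		}
	}
	return true // recompute max here and in ancestors as needed
}
```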
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/command_queue_test.go, line 809 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
I suppose you'll be renaming all of these instances of `CommandQueue` as well in a future PR.
Yes, I'll be ripping out every single reference I can find to it.
pkg/storage/spanlatch/doc.go, line 20 at r4 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Perhaps mention that this is the evolution of complexity. Something like: `s/Managers's/The evolution of/g`.
Done.
pkg/storage/spanlatch/manager.go, line 62 at r10 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Nit: I'd prefer to see this spelled out as `readSet` and `inReadSet`. We could avoid the use of a `map` by instead using a circularly linked list. `latch` would need `next, prev *latch` fields. You can remove an element from such a list without knowing its position. See `util/cache.Entry` and `util/cache.entryList` for an example of what I'm thinking of.
That's a really cool idea! It provides a nice speedup:
name old time/op new time/op delta
LatchManagerReadOnlyMix/size=1-4 683ns ± 9% 404ns ±10% -40.85% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4 660ns ± 7% 382ns ± 5% -42.17% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4 684ns ±10% 367ns ± 5% -46.27% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4 683ns ± 8% 370ns ± 1% -45.75% (p=0.016 n=5+4)
LatchManagerReadOnlyMix/size=128-4 678ns ± 4% 398ns ±14% -41.27% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4 652ns ± 4% 385ns ± 4% -40.95% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4 594ns ±16% 629ns ±17% ~ (p=0.222 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4 603ns ± 1% 552ns ± 7% -8.39% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4 621ns ± 4% 576ns ± 5% -7.28% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4 649ns ± 2% 541ns ±13% -16.69% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4 474ns ± 5% 423ns ±29% ~ (p=0.151 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4 413ns ± 2% 362ns ±16% ~ (p=0.095 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4 448ns ±14% 314ns ±13% -29.85% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
LatchManagerReadOnlyMix/size=1-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4 191B ± 0% 160B ± 0% ~ (p=0.079 n=4+5)
LatchManagerReadOnlyMix/size=128-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
name old allocs/op new allocs/op delta
LatchManagerReadOnlyMix/size=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=0-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
pkg/storage/spanlatch/manager.go, line 228 at r10 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Looks like you always have a snapshot associated with a guard. Rather than passing the snapshot on the stack, it might be better (faster) to embed the snapshot in the guard and to change `Manager.snapshot()` to take a `*snapshot` which it fills in.
But then we have to allocate that entire object on the heap and keep the memory around for the entire lifetime of the `Guard`. Do you think that will be faster?
pkg/storage/spanlatch/manager.go, line 250 at r10 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Perhaps follow the `Locked` naming convention, e.g. `snapshotLocked` and `insertLocked`.
Done.
pkg/storage/spanlatch/interval_btree.go, line 15 at r2 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Note that there are various bits in the UI that refer to "Command Queue". Let's file an issue to change the name there as well.
I have a series of changes lined up to eradicate that word.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Reviewed 1 of 7 files at r4, 1 of 1 files at r5, 2 of 4 files at r7, 3 of 7 files at r9, 1 of 1 files at r10, 1 of 7 files at r14, 1 of 5 files at r16.
Reviewable status: complete! 0 of 0 LGTMs obtained
pkg/storage/spanlatch/list.go, line 20 at r16 (raw file):
type latchList struct {
	root latch
	len  int
While it's reasonable and clean to track len (and it's done in container/list to be able to implement O(1) Length), it seems like, given the general memory consciousness of this package, it's safe to omit `latchList.len` if in `front()` you make the nil condition `ll.root.next == nil || ll.root.next == &ll.root`.
pkg/storage/spanlatch/manager.go, line 246 at r16 (raw file):
// flushReadSetLocked flushes the read set into the read interval tree.
func (sm *scopedManager) flushReadSetLocked() {
	for sm.readSet.len > 0 {

If you decide to eliminate `.len`, then I guess this could look like:
for latch := sm.readSet.front(); latch != nil; latch = sm.readSet.front() {
sm.readSet.remove(latch)
sm.trees[spanset.SpanReadOnly].Set(latch)
}
I didn't fully scrutinize all of the details or testing here. Let me know if you think something deserves particular attention and I'll give it a thorough look.
Reviewable status: complete! 1 of 0 LGTMs obtained
pkg/storage/spanlatch/list.go, line 20 at r16 (raw file):
Previously, ajwerner wrote…
While it's reasonable and clean to track len (and it's done in container/list to be able to implement O(1) Length), it seems like, given the general memory consciousness of this package, it's safe to omit `latchList.len` if in `front()` you make the nil condition `ll.root.next == nil || ll.root.next == &ll.root`.
The memory savings are minimal as there are a constant number of `latchList`s per `Manager`. That said, I'd remove `len` because it doesn't seem necessary, per @ajwerner's suggestion.
pkg/storage/spanlatch/list.go, line 30 at r16 (raw file):
}

func (ll *latchList) lazyInit() {
Do you need this `lazyInit` stuff? For the usage in `spanlatch.Manager` I think an `init` method could be called when the `Manager` is created.
pkg/storage/spanlatch/manager.go, line 62 at r10 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
That's a really cool idea! It provides a nice speedup (benchmarks quoted in full above).
💯
pkg/storage/spanlatch/manager.go, line 228 at r10 (raw file):
Do you think that will be faster?
I don't know. Perhaps add it to a TODO list to investigate after this PR goes in. Probably a very minor benefit if any.
Reviewed 4 of 9 files at r11, 3 of 4 files at r12, 1 of 2 files at r13, 5 of 7 files at r14, 1 of 1 files at r15, 2 of 5 files at r16.
Reviewable status: complete! 2 of 0 LGTMs obtained
pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):
```go
// before returning.
func (m *Manager) wait(ctx context.Context, lg *Guard, ts hlc.Timestamp, snap snapshot) error {
	for s := spanset.SpanScope(0); s < spanset.NumSpanScope; s++ {
```
Just a question for discussion: can the order in which latches are examined impact performance? It seems like if we could wait on the longest-blocking item first, then we'd increase the rate of hitting the fast path on the signal and reduce the number of goroutine yields on the select. I don't have good intuition about what it would take to come up with a heuristic to guess when a latch will be removed. Do we expect reads to happen faster than writes? Do we expect global things to take longer than local? All of this may be premature optimization. It might be worth trying to see how often you hit the fast path and, if the number is low (maybe even as low as something like 50%), then maybe there's a cheap win here.
pkg/storage/spanlatch/manager.go, line 265 at r16 (raw file):
```go
switch a {
case spanset.SpanReadOnly:
	// Add reads to the rSet. They only need to enter the read
```
total nit: s/rSet/readSet/
This change replaces the Manager's `readSet` map implementation with a linked-list implementation. This provides the following speedup:

```
name old time/op new time/op delta
LatchManagerReadOnlyMix/size=1-4 683ns ± 9% 404ns ±10% -40.85% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4 660ns ± 7% 382ns ± 5% -42.17% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4 684ns ±10% 367ns ± 5% -46.27% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4 683ns ± 8% 370ns ± 1% -45.75% (p=0.016 n=5+4)
LatchManagerReadOnlyMix/size=128-4 678ns ± 4% 398ns ±14% -41.27% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4 652ns ± 4% 385ns ± 4% -40.95% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4 594ns ±16% 629ns ±17% ~ (p=0.222 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4 603ns ± 1% 552ns ± 7% -8.39% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4 621ns ± 4% 576ns ± 5% -7.28% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4 649ns ± 2% 541ns ±13% -16.69% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4 474ns ± 5% 423ns ±29% ~ (p=0.151 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4 413ns ± 2% 362ns ±16% ~ (p=0.095 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4 448ns ±14% 314ns ±13% -29.85% (p=0.008 n=5+5)

name old alloc/op new alloc/op delta
LatchManagerReadOnlyMix/size=1-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=4-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=16-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=64-4 191B ± 0% 160B ± 0% ~ (p=0.079 n=4+5)
LatchManagerReadOnlyMix/size=128-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadOnlyMix/size=256-4 191B ± 0% 160B ± 0% -16.23% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=0-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=1-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=4-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=16-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=64-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=128-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)
LatchManagerReadWriteMix/readsPerWrite=256-4 144B ± 0% 160B ± 0% +11.11% (p=0.008 n=5+5)

name old allocs/op new allocs/op delta
LatchManagerReadOnlyMix/size=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadOnlyMix/size=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=0-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
LatchManagerReadWriteMix/readsPerWrite=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
```

The change also makes the Manager's zero value completely usable.

Release note: None
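As a rough illustration of the data structure swap in this commit, here is a self-contained Go sketch of an intrusive linked list with the `lazyInit`/`front`/`remove`/`len` shape discussed in the review comments below. It is a simplified stand-in, not the actual `latchList` implementation; the `latch` fields are hypothetical.

```go
package main

import "fmt"

// latch is a simplified stand-in for spanlatch's latch type; the real type
// carries span and signal state. Field names here are hypothetical.
type latch struct {
	id         uint64
	prev, next *latch // intrusive links, so insert/remove allocate nothing
}

// latchList is a minimal sketch of a linked-list readSet. Unlike a
// map[uint64]*latch, it needs no hashing or map buckets, and its zero
// value is immediately usable.
type latchList struct {
	root latch // sentinel; root.next is the front, root.prev is the back
	len  int
}

func (ll *latchList) lazyInit() {
	if ll.root.next == nil {
		ll.root.next = &ll.root
		ll.root.prev = &ll.root
	}
}

func (ll *latchList) pushBack(la *latch) {
	ll.lazyInit()
	at := ll.root.prev
	la.prev = at
	la.next = &ll.root
	at.next = la
	ll.root.prev = la
	ll.len++
}

func (ll *latchList) remove(la *latch) {
	la.prev.next = la.next
	la.next.prev = la.prev
	la.next, la.prev = nil, nil
	ll.len--
}

func (ll *latchList) front() *latch {
	if ll.len == 0 {
		return nil
	}
	return ll.root.next
}

func main() {
	var ll latchList // zero value works; no constructor needed
	a, b := &latch{id: 1}, &latch{id: 2}
	ll.pushBack(a)
	ll.pushBack(b)
	for la := ll.front(); la != nil; la = ll.front() {
		ll.remove(la)
		fmt.Println("flushed latch", la.id)
	}
}
```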
It is cheaper to wait on an already released latch than on an unreleased latch, so we prefer waiting on the longer-held latches first. We expect writes to take longer than reads to release their latches, so we wait on them first.

Release note: None
TFTRs!
bors r+
Reviewable status: complete! 2 of 0 LGTMs obtained
pkg/storage/spanlatch/list.go, line 20 at r16 (raw file):
Previously, petermattis (Peter Mattis) wrote…
The memory savings are minimal as there are a constant number of `latchList`s per `Manager`. That said, I'd remove `len` because it doesn't seem necessary per @ajwerner's suggestion.
I actually did exactly what's being suggested here at first, but I realized that we're going to want metrics on this soon enough, and being able to track how many reads are in the `readSet` will be important.
pkg/storage/spanlatch/list.go, line 30 at r16 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Do you need this `lazyInit` stuff? For the usage in `spanlatch.Manager` I think an `init` method could be called when the `Manager` is created.
This allows the zero value for the entire `spanlatch.Manager` to be used directly, which is super nice. We don't have or need a `Manager` constructor.
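A tiny sketch of the zero-value pattern being described, with hypothetical names: each method that touches lazily allocated state calls `lazyInit` first, so a `var m manager` declaration needs no constructor.

```go
package main

import "fmt"

// manager is a hypothetical illustration, not the real spanlatch.Manager.
type manager struct {
	byID map[uint64]string // allocated on first use
}

func (m *manager) lazyInit() {
	if m.byID == nil {
		m.byID = make(map[uint64]string)
	}
}

func (m *manager) add(id uint64, name string) {
	m.lazyInit()
	m.byID[id] = name
}

func main() {
	var m manager // zero value, used directly
	m.add(1, "latch")
	fmt.Println(m.byID[1])
}
```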
pkg/storage/spanlatch/manager.go, line 228 at r10 (raw file):
Previously, petermattis (Peter Mattis) wrote…
Do you think that will be faster?
I don't know. Perhaps add it to a TODO list to investigate after this PR goes in. Probably a very minor benefit if any.
I gave it a shot and it didn't seem to help:
```
name old time/op new time/op delta
ReadOnlyMix/size=1-4 404ns ±10% 561ns ±14% +38.91% (p=0.008 n=5+5)
ReadOnlyMix/size=4-4 382ns ± 5% 533ns ±17% +39.60% (p=0.008 n=5+5)
ReadOnlyMix/size=16-4 367ns ± 5% 500ns ±17% +36.04% (p=0.008 n=5+5)
ReadOnlyMix/size=64-4 370ns ± 1% 518ns ± 8% +39.92% (p=0.016 n=4+5)
ReadOnlyMix/size=128-4 398ns ±14% 548ns ± 8% +37.50% (p=0.008 n=5+5)
ReadOnlyMix/size=256-4 385ns ± 4% 546ns ± 5% +41.92% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4 629ns ±17% 755ns ±14% ~ (p=0.056 n=5+5)
ReadWriteMix/readsPerWrite=1-4 552ns ± 7% 729ns ± 9% +31.93% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4 576ns ± 5% 673ns ±20% +16.84% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4 541ns ±13% 632ns ± 1% +16.89% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 423ns ±29% 552ns ±31% +30.50% (p=0.032 n=5+5)
ReadWriteMix/readsPerWrite=128-4 362ns ±16% 426ns ± 3% +17.44% (p=0.016 n=5+5)
ReadWriteMix/readsPerWrite=256-4 314ns ±13% 405ns ± 6% +28.94% (p=0.008 n=5+5)
name old alloc/op new alloc/op delta
ReadOnlyMix/size=1-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=4-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=16-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=64-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=128-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadOnlyMix/size=256-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=256-4 160B ± 0% 224B ± 0% +40.00% (p=0.008 n=5+5)
```
pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):
This is an interesting idea. We expect writes to hold their latches significantly longer than reads, so it should be a clear win to wait on them first so that we select from fewer channels in total. Done.
Do we expect reads to happen faster than writes?
Yes.
Do we expect global things to take longer than local?
Not necessarily. I don't think there's any real correlation here.
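To illustrate the ordering idea settled on here, the following is a hedged Go sketch of waiting on write latches before read latches, using a simplified completion signal with an atomic fast path. None of these names are the actual spanlatch identifiers; they are modeled on the fast-path/select behavior discussed above.

```go
package main

import (
	"context"
	"fmt"
	"sync/atomic"
)

// sig is a simplified stand-in for a latch's completion signal: an atomic
// fast path plus a channel for the slow path.
type sig struct {
	done int32
	ch   chan struct{}
}

func newSig() *sig                          { return &sig{ch: make(chan struct{})} }
func (s *sig) signal()                      { atomic.StoreInt32(&s.done, 1); close(s.ch) }
func (s *sig) signaled() bool               { return atomic.LoadInt32(&s.done) == 1 }
func (s *sig) signalChan() <-chan struct{}  { return s.ch }

// waitAll sketches the ordering idea: wait on the latches expected to be
// held longest (writes) before the rest, so that by the time we reach the
// short-lived ones their signaled() fast path usually avoids a channel
// select entirely.
func waitAll(ctx context.Context, writes, reads []*sig) error {
	for _, group := range [][]*sig{writes, reads} {
		for _, s := range group {
			if s.signaled() { // fast path: no select, no goroutine yield
				continue
			}
			select {
			case <-s.signalChan():
			case <-ctx.Done():
				return ctx.Err()
			}
		}
	}
	return nil
}

func main() {
	w, r := newSig(), newSig()
	w.signal()
	r.signal()
	fmt.Println(waitAll(context.Background(), []*sig{w}, []*sig{r})) // <nil>
}
```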
pkg/storage/spanlatch/manager.go, line 246 at r16 (raw file):
Previously, ajwerner wrote…
if you decide to eliminate `.len` then I guess this could look like:

```go
for latch := sm.readSet.front(); latch != nil; latch = sm.readSet.front() {
	sm.readSet.remove(latch)
	sm.trees[spanset.SpanReadOnly].Set(latch)
}
```
See discussion above.
pkg/storage/spanlatch/manager.go, line 265 at r16 (raw file):
Previously, ajwerner wrote…
total nit: s/rSet/readSet/
Not a nit, a botched refactor :) Done.
bors r-
Canceled
3cf960e to b2ab370
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):
Previously, nvanbenschoten (Nathan VanBenschoten) wrote…
This is an interesting idea. We expect writes to hold their latches significantly longer than reads, so it should be a clear win to wait on them first so that we select from fewer channels in total. Done.
Do we expect reads to happen faster than writes?
Yes.
Do we expect global things to take longer than local?
Not necessarily. I don't think there's any real correlation here.
Cool, the next steps to push this idea further would be:
- set `a` to `SpanReadWrite` before setting it to `SpanReadOnly` (0) in the for loop
- sort the latches in `newGuard` with the highest timestamps first, as my intuition is that the high-timestamp latches are expected to finish last.
bors r+
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/spanlatch/manager.go, line 350 at r14 (raw file):
set `a` to `SpanReadWrite` before setting it to `SpanReadOnly` (0) in the for loop

But we would want `a=SpanReadOnly` before `a=SpanReadWrite`, right? Because then the order of access will be `readSpan+tree[SpanReadWrite]`, `writeSpan+tree[SpanReadWrite]`, `writeSpan+tree[SpanReadOnly]`. Either way, in practice we never actually see requests with read and write spans together.
sort the latches in newGuard with the highest timestamps first as my intuition is that the high timestamp latches are expected to finish last.
All of the latches in `newGuard` have the same timestamp. Also, anything that requires sorting will almost certainly cost more than doing nothing at all. We're dealing with costs on the order of double-digit nanoseconds at this point.
31997: storage/spanlatch: create spanlatch.Manager using immutable btrees r=nvanbenschoten a=nvanbenschoten

Informs #4768.
Informs #31904.

This change was inspired by #31904 and is a progression of the thinking started in #4768 (comment).

The change introduces `spanlatch.Manager`, which will replace the `CommandQueue` **in a future PR**. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in the storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes #31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue, and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because `Replica.beginCmds` repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges.

The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice, so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock.

While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the `spanlatch.Guard`. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting immutable btree nodes, which allows us to release them back into the btree node pools when they're no longer needed. This means that we don't expect any allocations when inserting into the internal trees, even with the copy-on-write policy.

Finally, this will make the approach taken in #31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free.
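To make the copy-on-write sequencing pattern above concrete, here is a minimal Go sketch. All names are illustrative, not the real spanlatch API, and the O(1) clone of the real btree is replaced by a plain slice copy for brevity.

```go
package main

import (
	"fmt"
	"sync"
)

// tree is a stand-in for the immutable interval btree. The real structure
// clones in O(1) via refcounted copy-on-write nodes; this sketch copies a
// slice instead, for simplicity.
type tree struct{ latches []string }

func (t tree) clone() tree {
	c := make([]string, len(t.latches))
	copy(c, t.latches)
	return tree{latches: c}
}

// manager sketches the sequencing pattern: the mutex covers only the
// snapshot plus our own insert, which is O(spans declared). Waiting on
// overlapping latches happens against the snapshot, outside the lock.
type manager struct {
	mu sync.Mutex
	t  tree
}

func (m *manager) acquire(span string) tree {
	m.mu.Lock()
	snap := m.t.clone()                 // immutable snapshot
	m.t.latches = append(m.t.latches, span) // insert our own latch
	m.mu.Unlock()
	return snap // caller checks overlaps in snap without holding m.mu
}

func main() {
	var m manager
	m.acquire("[a,c)")
	snap := m.acquire("[b,d)")
	fmt.Println("checking", len(snap.latches), "existing latch(es) outside the lock")
}
```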
### Naming changes

| Before | After |
|----------------------------|-----------------------------------|
| `CommandQueue` | `spanlatch.Manager` |
| "enter the command queue" | "acquire span latches" |
| "exit the command queue" | "release span latches" |
| "wait for prereq commands" | "wait for latches to be released" |

The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is to avoid confusing the operational locking performed by the CommandQueue/spanlatch.Manager with the transaction-scoped locking enforced by intents and our transactional concurrency control model.

### Microbenchmarks

NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure.

```
name old time/op new time/op delta
ReadOnlyMix/size=1-4 706ns ±20% 404ns ±10% -42.81% (p=0.008 n=5+5)
ReadOnlyMix/size=4-4 649ns ±23% 382ns ± 5% -41.13% (p=0.008 n=5+5)
ReadOnlyMix/size=16-4 611ns ±16% 367ns ± 5% -39.83% (p=0.008 n=5+5)
ReadOnlyMix/size=64-4 692ns ±14% 370ns ± 1% -46.49% (p=0.016 n=5+4)
ReadOnlyMix/size=128-4 637ns ±22% 398ns ±14% -37.48% (p=0.008 n=5+5)
ReadOnlyMix/size=256-4 676ns ±15% 385ns ± 4% -43.01% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4 12.2µs ± 4% 0.6µs ±17% -94.85% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4 7.88µs ± 2% 0.55µs ± 7% -92.99% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4 4.19µs ± 3% 0.58µs ± 5% -86.26% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4 2.09µs ± 6% 0.54µs ±13% -74.13% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 875ns ±17% 423ns ±29% -51.64% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4 655ns ± 6% 362ns ±16% -44.71% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=256-4 549ns ±16% 314ns ±13% -42.73% (p=0.008 n=5+5)

name old alloc/op new alloc/op delta
ReadOnlyMix/size=1-4 223B ± 0% 160B ± 0% -28.25% (p=0.079 n=4+5)
ReadOnlyMix/size=4-4 223B ± 0% 160B ± 0% -28.25% (p=0.008 n=5+5)
ReadOnlyMix/size=16-4 223B ± 0% 160B ± 0% -28.25% (p=0.008 n=5+5)
ReadOnlyMix/size=64-4 223B ± 0% 160B ± 0% -28.25% (p=0.008 n=5+5)
ReadOnlyMix/size=128-4 217B ± 4% 160B ± 0% -26.27% (p=0.008 n=5+5)
ReadOnlyMix/size=256-4 223B ± 0% 160B ± 0% -28.25% (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=0-4 1.25kB ± 0% 0.16kB ± 0% -87.15% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4 1.00kB ± 0% 0.16kB ± 0% -84.00% (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=4-4 708B ± 0% 160B ± 0% -77.40% (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=16-4 513B ± 0% 160B ± 0% -68.81% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 264B ± 0% 160B ± 0% -39.39% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4 221B ± 0% 160B ± 0% -27.60% (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=256-4 198B ± 0% 160B ± 0% -19.35% (p=0.008 n=5+5)

name old allocs/op new allocs/op delta
ReadOnlyMix/size=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadOnlyMix/size=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadWriteMix/readsPerWrite=0-4 38.0 ± 0% 1.0 ± 0% -97.37% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4 24.0 ± 0% 1.0 ± 0% -95.83% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4 12.0 ± 0% 1.0 ± 0% -91.67% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4 5.00 ± 0% 1.00 ± 0% -80.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4 2.00 ± 0% 1.00 ± 0% -50.00% (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
ReadWriteMix/readsPerWrite=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal)
```

There are a few interesting things to point out about these benchmark results:
- The `ReadOnlyMix` results demonstrate a fixed improvement, regardless of size. This is due to the replacement of the hash-map with a linked-list for the readSet structure.
- The `ReadWriteMix` results are more interesting. We see that the spanlatch implementation is faster across the board. This is especially true with a high write/read ratio.
- The allocated memory stays constant regardless of the write/read ratio in the spanlatch implementation. This is due to the memory recycling that it performs on btree nodes. It is not the case for the CommandQueue implementation.

Release note: None

32416: scripts: enhance the release notes r=knz a=knz

Fixes #25180.

With this the amount of release notes for the first 2.2 alpha in cockroachdb/docs#4051 is reduced to just under two pages. Also this PR makes it easier to monitor progress during the execution of the script.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
Build succeeded
💯
Reviewed 9 of 9 files at r11, 4 of 4 files at r12, 2 of 2 files at r13, 7 of 7 files at r14, 1 of 1 files at r15, 4 of 5 files at r16, 1 of 1 files at r17.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/spanlatch/manager.go, line 314 at r14 (raw file):
```go
}

func (m *Manager) nextID() uint64 {
```
nit: nextIDLocked()
pkg/storage/spanlatch/manager_test.go, line 122 at r14 (raw file):
```go
m := New()

// Try latch with no overlapping already-acquired lathes.
```
lathes
Reviewed 1 of 1 files at r18.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
Reviewable status: complete! 0 of 0 LGTMs obtained (and 2 stale)
pkg/storage/spanlatch/manager.go, line 314 at r14 (raw file):
Previously, tbg (Tobias Grieger) wrote…
nit: nextIDLocked()
Will address in next PR.
cockroachdb#31997 (review)

Release note: None
This commit replaces the CommandQueue with the spanlatch.Manager, which was introduced in cockroachdb#31997. See that PR for an introduction to how the structure differs from the CommandQueue and how it improves performance on microbenchmarks. This is mostly a mechanical change. One important detail is that it removes the CommandQueue debug page. We found that the page was buggy (or straight up broken) and it wasn't actively used by members of Core when debugging problems. In its place, the commit revives the "slow requests" metric for latching, which hasn't been hooked up in over a year.

### Benchmarks

#### Standard Benchmarks

These benchmarks are standard benchmarks that we commonly run. They were run with varying node sizes, cluster sizes, and pre-split counts.

```
name old ops/sec new ops/sec delta
kv0/cores=4/nodes=1/splits=0 1.99k ± 2% 2.06k ± 1% +3.22% (p=0.008 n=5+5)
kv0/cores=4/nodes=1/splits=100 2.25k ± 1% 2.38k ± 1% +6.01% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0 1.60k ± 0% 1.69k ± 2% +5.53% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100 3.52k ± 6% 3.65k ± 9% ~ (p=0.421 n=5+5)
kv0/cores=16/nodes=1/splits=0 19.9k ± 1% 21.8k ± 1% +9.34% (p=0.008 n=5+5)
kv0/cores=16/nodes=1/splits=100 24.4k ± 1% 26.1k ± 1% +7.17% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0 14.9k ± 1% 16.1k ± 1% +8.03% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100 20.6k ± 1% 22.8k ± 1% +10.79% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=0 31.2k ± 2% 35.3k ± 1% +13.28% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100 45.7k ± 1% 51.1k ± 1% +11.80% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0 23.7k ± 2% 27.1k ± 2% +14.39% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=100 34.9k ± 2% 45.1k ± 1% +29.44% (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0 12.7k ± 2% 12.9k ± 2% +1.39% (p=0.151 n=5+5)
kv95/cores=4/nodes=1/splits=100 12.8k ± 2% 13.1k ± 2% +2.10% (p=0.032 n=5+5)
kv95/cores=4/nodes=3/splits=0 10.6k ± 1% 10.8k ± 1% +1.58% (p=0.056 n=5+5)
kv95/cores=4/nodes=3/splits=100 12.3k ± 7% 12.6k ± 8% +2.61% (p=0.095 n=5+5)
kv95/cores=16/nodes=1/splits=0 50.9k ± 1% 52.2k ± 1% +2.37% (p=0.008 n=5+5)
kv95/cores=16/nodes=1/splits=100 52.2k ± 1% 53.0k ± 1% +1.49% (p=0.008 n=5+5)
kv95/cores=16/nodes=3/splits=0 46.2k ± 1% 46.8k ± 1% +1.32% (p=0.032 n=5+5)
kv95/cores=16/nodes=3/splits=100 51.0k ± 1% 53.2k ± 1% +4.25% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0 79.8k ± 2% 101.6k ± 1% +27.31% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=100 104k ± 1% 107k ± 1% +2.60% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0 85.8k ± 1% 91.8k ± 1% +7.08% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=100 106k ± 1% 112k ± 1% +5.51% (p=0.008 n=5+5)

name old p50(ms) new p50(ms) delta
kv0/cores=4/nodes=1/splits=0 3.52 ± 5% 3.40 ± 0% -3.41% (p=0.016 n=5+4)
kv0/cores=4/nodes=1/splits=100 3.30 ± 0% 3.00 ± 0% -9.09% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0 4.70 ± 0% 4.14 ± 9% -11.91% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100 1.50 ± 0% 1.48 ± 8% ~ (p=0.968 n=4+5)
kv0/cores=16/nodes=1/splits=0 1.40 ± 0% 1.40 ± 0% ~ (all equal)
kv0/cores=16/nodes=1/splits=100 1.20 ± 0% 1.20 ± 0% ~ (all equal)
kv0/cores=16/nodes=3/splits=0 2.00 ± 0% 1.90 ± 0% -5.00% (p=0.000 n=5+4)
kv0/cores=16/nodes=3/splits=100 1.40 ± 0% 1.40 ± 0% ~ (all equal)
kv0/cores=36/nodes=1/splits=0 1.76 ± 3% 1.60 ± 0% -9.09% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100 1.40 ± 0% 1.30 ± 0% -7.14% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0 2.56 ± 2% 2.40 ± 0% -6.25% (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100 1.70 ± 0% 1.40 ± 0% -17.65% (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=4/nodes=1/splits=100 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=4/nodes=3/splits=0 0.60 ± 0% 0.60 ± 0% ~ (all equal)
kv95/cores=4/nodes=3/splits=100 0.60 ± 0% 0.60 ± 0% ~ (all equal)
kv95/cores=16/nodes=1/splits=0 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=16/nodes=1/splits=100 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=16/nodes=3/splits=0 0.70 ± 0% 0.64 ± 9% -8.57% (p=0.167 n=5+5)
kv95/cores=16/nodes=3/splits=100 0.60 ± 0% 0.60 ± 0% ~ (all equal)
kv95/cores=36/nodes=1/splits=0 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=36/nodes=1/splits=100 0.50 ± 0% 0.50 ± 0% ~ (all equal)
kv95/cores=36/nodes=3/splits=0 0.66 ± 9% 0.60 ± 0% -9.09% (p=0.167 n=5+5)
kv95/cores=36/nodes=3/splits=100 0.60 ± 0% 0.60 ± 0% ~ (all equal)

name old p99(ms) new p99(ms) delta
kv0/cores=4/nodes=1/splits=0 11.0 ± 0% 10.5 ± 0% -4.55% (p=0.000 n=5+4)
kv0/cores=4/nodes=1/splits=100 7.90 ± 0% 7.60 ± 0% -3.80% (p=0.000 n=5+4)
kv0/cores=4/nodes=3/splits=0 15.7 ± 0% 15.2 ± 0% -3.18% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100 8.90 ± 0% 8.12 ± 3% -8.76% (p=0.016 n=4+5)
kv0/cores=16/nodes=1/splits=0 3.46 ± 2% 3.00 ± 0% -13.29% (p=0.000 n=5+4)
kv0/cores=16/nodes=1/splits=100 4.50 ± 0% 3.36 ± 2% -25.33% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0 4.50 ± 0% 3.90 ± 0% -13.33% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100 5.80 ± 0% 4.10 ± 0% -29.31% (p=0.029 n=4+4)
kv0/cores=36/nodes=1/splits=0 6.80 ± 0% 5.20 ± 0% -23.53% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100 5.80 ± 0% 4.32 ± 4% -25.52% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0 7.72 ± 2% 6.30 ± 0% -18.39% (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100 7.98 ± 2% 5.20 ± 0% -34.84% (p=0.000 n=5+4)
kv95/cores=4/nodes=1/splits=0 5.38 ± 3% 5.20 ± 0% -3.35% (p=0.167 n=5+5)
kv95/cores=4/nodes=1/splits=100 5.00 ± 0% 5.00 ± 0% ~ (all equal)
kv95/cores=4/nodes=3/splits=0 5.68 ± 3% 5.50 ± 0% -3.17% (p=0.095 n=5+4)
kv95/cores=4/nodes=3/splits=100 3.60 ±31% 2.93 ± 3% -18.75% (p=0.016 n=5+4)
kv95/cores=16/nodes=1/splits=0 4.10 ± 0% 4.10 ± 0% ~ (all equal)
kv95/cores=16/nodes=1/splits=100 4.50 ± 0% 4.10 ± 0% -8.89% (p=0.000 n=5+4)
kv95/cores=16/nodes=3/splits=0 2.60 ± 0% 2.60 ± 0% ~ (all equal)
kv95/cores=16/nodes=3/splits=100 2.50 ± 0% 1.90 ± 5% -24.00% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0 6.60 ± 0% 6.00 ± 0% -9.09% (p=0.029 n=4+4)
kv95/cores=36/nodes=1/splits=100 5.50 ± 0% 5.12 ± 2% -6.91% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0 4.18 ± 2% 4.02 ± 3% -3.71% (p=0.000 n=4+5)
kv95/cores=36/nodes=3/splits=100 3.80 ± 0% 2.80 ± 0% -26.32% (p=0.008 n=5+5)
```

#### Large-machine Benchmarks

These benchmarks are standard benchmarks run on a single-node cluster with 72 vCPUs.

```
name old ops/sec new ops/sec delta
kv0/cores=72/nodes=1/splits=0 31.0k ± 4% 36.4k ± 1% +17.57% (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100 44.0k ± 0% 49.0k ± 1% +11.41% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0 52.7k ±18% 72.6k ±26% +37.70% (p=0.016 n=5+5)
kv95/cores=72/nodes=1/splits=100 66.8k ±17% 68.5k ± 5% ~ (p=0.286 n=5+4)

name old p50(ms) new p50(ms) delta
kv0/cores=72/nodes=1/splits=0 2.30 ±13% 2.52 ± 5% ~ (p=0.214 n=5+5)
kv0/cores=72/nodes=1/splits=100 3.00 ± 0% 2.90 ± 0% -3.33% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0 0.46 ±13% 0.50 ± 0% ~ (p=0.444 n=5+5)
kv95/cores=72/nodes=1/splits=100 0.44 ±14% 0.50 ± 0% +13.64% (p=0.167 n=5+5)

name old p99(ms) new p99(ms) delta
kv0/cores=72/nodes=1/splits=0 18.9 ± 6% 13.3 ± 5% -29.56% (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100 13.4 ± 2% 11.0 ± 0% -17.91% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0 34.4 ±34% 23.5 ±24% -31.74% (p=0.048 n=5+5)
kv95/cores=72/nodes=1/splits=100 21.0 ± 0% 19.1 ± 4% -8.81% (p=0.029 n=4+4)
```

#### Motivating Benchmarks

These are benchmarks that used to generate a lot of contention in the CommandQueue. They have small cycle-lengths, indicated by the `c` specifier. The last one also includes 20% scan operations, which increases contention between non-overlapping point operations.

```
name old ops/sec new ops/sec delta
kv95-c5/cores=16/nodes=1/splits=0 45.1k ± 1% 47.2k ± 4% +4.59% (p=0.008 n=5+5)
kv95-c5/cores=36/nodes=1/splits=0 44.6k ± 1% 76.3k ± 1% +71.05% (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0 27.2k ± 2% 29.4k ± 1% +8.12% (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0 42.6k ± 2% 50.0k ± 1% +17.39% (p=0.008 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0 28.7k ± 1% 29.8k ± 3% +3.87% (p=0.008 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0 41.9k ± 4% 52.8k ± 2% +25.97% (p=0.008 n=5+5)

name old p50(ms) new p50(ms) delta
kv95-c5/cores=16/nodes=1/splits=0 0.60 ± 0% 0.60 ± 0% ~ (all equal)
kv95-c5/cores=36/nodes=1/splits=0 0.90 ± 0% 0.80 ± 0% -11.11% (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0 1.10 ± 0% 1.06 ± 6% ~ (p=0.444 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0 1.26 ± 5% 1.30 ± 0% ~ (p=0.444 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0 0.66 ± 9% 0.60 ± 0% -9.09% (p=0.167 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0 0.70 ± 0% 0.50 ± 0% -28.57% (p=0.008 n=5+5)

name old p99(ms) new p99(ms) delta
kv95-c5/cores=16/nodes=1/splits=0 2.40 ± 0% 2.10 ± 0% -12.50% (p=0.000 n=5+4)
kv95-c5/cores=36/nodes=1/splits=0 5.80 ± 0% 3.30 ± 0% -43.10% (p=0.000 n=5+4)
kv50-c128/cores=16/nodes=1/splits=0 3.50 ± 0% 3.00 ± 0% -14.29% (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0 6.80 ± 0% 4.70 ± 0% -30.88% (p=0.079 n=4+5)
kv70-20-c128/cores=16/nodes=1/splits=0 5.00 ± 0% 4.70 ± 0% -6.00% (p=0.029 n=4+4)
kv70-20-c128/cores=36/nodes=1/splits=0 11.0 ± 0% 6.8 ± 0% -38.18% (p=0.008 n=5+5)
```

#### Batching Benchmarks

One optimization left out of the new spanlatch.Manager was the "covering" optimization, where commands were initially added to the interval tree as a single spanning interval and only expanded later. I ran a series of benchmarks to verify that this optimization was not needed. My hypothesis was that the order-of-magnitude increase in the speed of the interval tree would make the optimization unnecessary. It turns out that removing the optimization hurt a few benchmarks to a small degree but sped up others tremendously (some benchmarks improved by over 400%). I suspect that the covering optimization could actually hurt in cases where it causes non-overlapping requests to overlap.

It is interesting how quickly a few of these benchmarks oscillate from small losses to big wins. It makes me think that there's some non-linear behavior with the old CommandQueue that would cause its performance to quickly degrade once it became a contention bottleneck.

```
name old ops/sec new ops/sec delta
kv0-b16/cores=4/nodes=1/splits=0 2.41k ± 0% 2.06k ± 3% -14.75% (p=0.008 n=5+5)
kv0-b16/cores=4/nodes=1/splits=100 514 ± 0% 534 ± 1% +3.88% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0 2.95k ± 0% 4.35k ± 0% +47.74% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100 1.80k ± 1% 1.88k ± 1% +4.46% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0 2.74k ± 0% 4.92k ± 1% +79.55% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100 2.39k ± 1% 2.45k ± 1% +2.41% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0 422 ± 0% 518 ± 1% +22.60% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100 98.4 ± 1% 98.8 ± 1% ~ (p=0.810 n=5+5)
kv0-b128/cores=16/nodes=1/splits=0 532 ± 0% 1059 ± 0% +99.16% (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100 291 ± 1% 307 ± 1% +5.18% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0 483 ± 0% 1288 ± 1% +166.37% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100 394 ± 1% 408 ± 1% +3.51% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0 49.7 ± 1% 72.8 ± 1% +46.52% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100 30.8 ± 0% 23.4 ± 0% -24.03% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0 48.9 ± 2% 160.6 ± 0% +228.38% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=100 101 ± 1% 80 ± 0% -21.64% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0 37.5 ± 0% 208.1 ± 1% +454.99% (p=0.016 n=4+5)
kv0-b1024/cores=36/nodes=1/splits=100 162 ± 0% 124 ± 0% -23.22% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0 5.93k ± 0% 6.20k ± 1% +4.55% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100 2.27k ± 1% 2.32k ± 1% +2.28% (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=0 5.15k ± 1% 18.79k ± 1% +264.73% (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=100 8.31k ± 1% 8.57k ± 1% +3.16% (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0 3.96k ± 0% 10.67k ± 1% +169.81% (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=100 15.7k ± 2% 16.2k ± 4% +2.75% (p=0.151 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0 1.12k ± 1% 1.27k ± 0% +13.28% (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100 290 ± 1% 299 ± 1% +3.02% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0 1.06k ± 0% 3.31k ± 0% +213.09% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100 662 ±91% 1095 ± 1% +65.42% (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0 715 ± 2% 3586 ± 0% +401.21% (p=0.008 n=5+5)
kv95-b128/cores=36/nodes=1/splits=100 1.15k ±90% 2.01k ± 2% +74.79% (p=0.016 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0 134 ± 1% 170 ± 1% +26.59% (p=0.008 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=100 54.8 ± 3% 53.3 ± 3% -2.84% (p=0.056 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=0 104 ± 3% 367 ± 1% +252.37% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 210 ± 1% 214 ± 1% +1.86% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0 76.5 ± 2% 383.9 ± 1% +401.67% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 431 ± 1% 436 ± 1% +1.17% (p=0.016 n=5+5)

name old p50(ms) new p50(ms) delta
kv0-b16/cores=4/nodes=1/splits=0 3.00 ± 0% 3.40 ± 0% +13.33% (p=0.016 n=5+4)
kv0-b16/cores=4/nodes=1/splits=100 15.2 ± 0% 14.7 ± 0% -3.29% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0 10.5 ± 0% 7.7 ± 2% -26.48% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100 17.8 ± 0% 16.8 ± 0% -5.62% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0 26.2 ± 0% 14.2 ± 0% -45.80% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100 29.0 ± 2% 28.3 ± 0% -2.28% (p=0.095 n=5+4)
kv0-b128/cores=4/nodes=1/splits=0 17.8 ± 0% 15.2 ± 0% -14.61% (p=0.000 n=5+4)
kv0-b128/cores=4/nodes=1/splits=100 79.7 ± 0% 79.7 ± 0% ~ (all equal)
kv0-b128/cores=16/nodes=1/splits=0 65.0 ± 0% 32.5 ± 0% -50.00% (p=0.029 n=4+4)
kv0-b128/cores=16/nodes=1/splits=100 109 ± 0% 105 ± 0% -3.85% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0 168 ± 0% 50 ± 0% -70.02% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100 184 ± 0% 176 ± 0% -4.50% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0 159 ± 0% 109 ± 0% -31.56% (p=0.000 n=5+4)
kv0-b1024/cores=4/nodes=1/splits=100 252 ± 0% 319 ± 0% +26.66% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0 705 ± 0% 193 ± 0% -72.62% (p=0.000 n=5+4)
kv0-b1024/cores=16/nodes=1/splits=100 319 ± 0% 386 ± 0% +21.05% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0 1.88k ± 0% 0.24k ± 0% -87.05% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100 436 ± 0% 570 ± 0% +30.77% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0 1.20 ± 0% 1.20 ± 0% ~ (all equal)
kv95-b16/cores=4/nodes=1/splits=100 2.60 ± 0% 2.60 ± 0% ~ (all equal)
kv95-b16/cores=16/nodes=1/splits=0 6.30 ± 0% 1.40 ± 0% -77.78% (p=0.000 n=5+4)
kv95-b16/cores=16/nodes=1/splits=100 1.74 ± 3% 1.76 ± 3% ~ (p=1.000 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0 11.5 ± 0% 5.5 ± 0% -52.17% (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100 2.42 ±20% 2.42 ±45% ~ (p=0.579 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0 6.60 ± 0% 6.00 ± 0% -9.09% (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100 21.4 ± 3% 21.0 ± 0% ~ (p=0.444 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0 30.4 ± 0% 9.4 ± 0% -69.08% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100 38.2 ±76% 21.2 ± 4% -44.31% (p=0.063 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0 88.1 ± 0% 16.8 ± 0% -80.93% (p=0.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100 56.6 ±85% 29.6 ±15% ~ (p=0.873 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0 52.4 ± 0% 44.0 ± 0% -16.03% (p=0.029 n=4+4)
kv95-b1024/cores=4/nodes=1/splits=100 132 ± 2% 143 ± 0% +8.29% (p=0.016 n=5+4)
kv95-b1024/cores=16/nodes=1/splits=0 325 ± 3% 80 ± 0% -75.51% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 151 ± 0% 151 ± 0% ~ (all equal)
kv95-b1024/cores=36/nodes=1/splits=0 973 ± 0% 180 ± 3% -81.55% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 168 ± 0% 168 ± 0% ~ (all equal)

name old p99(ms) new p99(ms) delta
kv0-b16/cores=4/nodes=1/splits=0 8.40 ± 0% 10.30 ± 3% +22.62% (p=0.016 n=4+5)
kv0-b16/cores=4/nodes=1/splits=100 29.4 ± 0% 27.3 ± 0% -7.14% (p=0.000 n=5+4)
kv0-b16/cores=16/nodes=1/splits=0 16.3 ± 0% 15.5 ± 2% -4.91% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100 31.5 ± 0% 29.4 ± 0% -6.67% (p=0.000 n=5+4)
kv0-b16/cores=36/nodes=1/splits=0 37.7 ± 0% 28.7 ± 2% -23.77% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100 62.1 ± 2% 68.4 ±10% +10.15% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0 37.7 ± 0% 39.4 ± 6% +4.46% (p=0.167 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100 143 ± 0% 151 ± 0% +5.89% (p=0.016 n=4+5)
kv0-b128/cores=16/nodes=1/splits=0 79.7 ± 0% 55.8 ± 2% -30.04% (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100 198 ± 3% 188 ± 3% -5.09% (p=0.048 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0 184 ± 0% 126 ± 3% -31.82% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100 319 ± 0% 336 ± 0% +5.24% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0 322 ± 6% 253 ± 4% -21.35% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100 470 ± 0% 772 ± 4% +64.28% (p=0.016 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=0 1.41k ± 0% 0.56k ±11% -60.00% (p=0.000 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=100 530 ± 2% 772 ± 0% +45.57% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0 4.05k ± 7% 1.17k ± 3% -71.19% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100 792 ±14% 1020 ± 2% +28.81% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0 3.90 ± 0% 3.22 ± 4% -17.44% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100 21.0 ± 0% 19.9 ± 0% -5.24% (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=0 15.2 ± 0% 7.1 ± 0% -53.29% (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=100 38.5 ± 3% 37.7 ± 0% ~ (p=0.333 n=5+4)
kv95-b16/cores=36/nodes=1/splits=0 128 ± 2% 52 ± 0% -59.16% (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100 41.1 ±13% 39.2 ±33% ~ (p=0.984 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0 17.8 ± 0% 14.7 ± 0% -17.42% (p=0.079 n=4+5)
kv95-b128/cores=4/nodes=1/splits=100 107 ± 2% 106 ± 5% ~ (p=0.683 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0 75.5 ± 0% 23.1 ± 0% -69.40% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100 107 ±34% 120 ± 2% ~ (p=1.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0 253 ± 4% 71 ± 0% -71.86% (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100 166 ±19% 164 ±74% ~ (p=0.310 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=0 146 ± 3% 101 ± 0% -31.01% (p=0.000 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=100 348 ± 4% 366 ± 6% ~ (p=0.317 n=4+5)
kv95-b1024/cores=16/nodes=1/splits=0 624 ± 3% 221 ± 2% -64.52% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 325 ± 3% 319 ± 0% ~ (p=0.444 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0 1.56k ± 5% 0.41k ± 2% -73.71% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 336 ± 0% 336 ± 0% ~ (all equal)
```

Release note (performance improvement): Replace Replica latching mechanism with new optimized data structure that improves throughput, especially under heavy contention.
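For readers unfamiliar with the "covering" optimization described above, here is a minimal, hypothetical Go sketch of the idea and of how a cover can introduce false overlap between requests whose individual spans are disjoint. The `span` type and helpers are simplified stand-ins for roachpb-style key spans, not CockroachDB code; point spans are elided for simplicity.

```go
package main

import (
	"bytes"
	"fmt"
)

// span is a simplified [key, endKey) key span.
type span struct {
	key, endKey []byte
}

func (s span) overlaps(o span) bool {
	return bytes.Compare(s.key, o.endKey) < 0 && bytes.Compare(o.key, s.endKey) < 0
}

// cover returns the single span enclosing all input spans (assumed
// non-empty). The removed optimization inserted this one span into the
// interval tree up front and expanded it into the individual spans later.
func cover(spans []span) span {
	c := spans[0]
	for _, s := range spans[1:] {
		if bytes.Compare(s.key, c.key) < 0 {
			c.key = s.key
		}
		if bytes.Compare(s.endKey, c.endKey) > 0 {
			c.endKey = s.endKey
		}
	}
	return c
}

func main() {
	declared := []span{
		{key: []byte("a"), endKey: []byte("b")},
		{key: []byte("y"), endKey: []byte("z")},
	}
	other := span{key: []byte("m"), endKey: []byte("n")}

	c := cover(declared)
	// Neither declared span overlaps [m,n), but the cover [a,z) does:
	// this is the false contention suspected above.
	fmt.Println(declared[0].overlaps(other), declared[1].overlaps(other), c.overlaps(other))
}
```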
32865: storage: replace CommandQueue with spanlatch.Manager r=nvanbenschoten a=nvanbenschoten

This commit replaces the CommandQueue with the spanlatch.Manager, which was introduced in #31997. See that PR for an introduction to how the structure differs from the CommandQueue and how it improves performance on microbenchmarks. This is mostly a mechanical change.

One important detail is that it removes the CommandQueue debug page. We found that the page was buggy (or straight up broken) and it wasn't actively used by members of Core when debugging problems. In its place, the commit revives the "slow requests" metric for latching, which hasn't been hooked up in over a year.

### Benchmarks

#### Standard Benchmarks

These are standard benchmarks that we commonly run. They were run with varying node sizes, cluster sizes, and pre-split counts.

```
name                            old ops/sec  new ops/sec  delta
kv0/cores=4/nodes=1/splits=0     1.99k ± 2%   2.06k ± 1%   +3.22% (p=0.008 n=5+5)
kv0/cores=4/nodes=1/splits=100   2.25k ± 1%   2.38k ± 1%   +6.01% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0     1.60k ± 0%   1.69k ± 2%   +5.53% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100   3.52k ± 6%   3.65k ± 9%   ~      (p=0.421 n=5+5)
kv0/cores=16/nodes=1/splits=0    19.9k ± 1%   21.8k ± 1%   +9.34% (p=0.008 n=5+5)
kv0/cores=16/nodes=1/splits=100  24.4k ± 1%   26.1k ± 1%   +7.17% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0    14.9k ± 1%   16.1k ± 1%   +8.03% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100  20.6k ± 1%   22.8k ± 1%  +10.79% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=0    31.2k ± 2%   35.3k ± 1%  +13.28% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100  45.7k ± 1%   51.1k ± 1%  +11.80% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0    23.7k ± 2%   27.1k ± 2%  +14.39% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=100  34.9k ± 2%   45.1k ± 1%  +29.44% (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0    12.7k ± 2%   12.9k ± 2%   +1.39% (p=0.151 n=5+5)
kv95/cores=4/nodes=1/splits=100  12.8k ± 2%   13.1k ± 2%   +2.10% (p=0.032 n=5+5)
kv95/cores=4/nodes=3/splits=0    10.6k ± 1%   10.8k ± 1%   +1.58% (p=0.056 n=5+5)
kv95/cores=4/nodes=3/splits=100  12.3k ± 7%   12.6k ± 8%   +2.61% (p=0.095 n=5+5)
kv95/cores=16/nodes=1/splits=0   50.9k ± 1%   52.2k ± 1%   +2.37% (p=0.008 n=5+5)
kv95/cores=16/nodes=1/splits=100 52.2k ± 1%   53.0k ± 1%   +1.49% (p=0.008 n=5+5)
kv95/cores=16/nodes=3/splits=0   46.2k ± 1%   46.8k ± 1%   +1.32% (p=0.032 n=5+5)
kv95/cores=16/nodes=3/splits=100 51.0k ± 1%   53.2k ± 1%   +4.25% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0   79.8k ± 2%  101.6k ± 1%  +27.31% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=100  104k ± 1%    107k ± 1%   +2.60% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0   85.8k ± 1%   91.8k ± 1%   +7.08% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=100  106k ± 1%    112k ± 1%   +5.51% (p=0.008 n=5+5)

name                            old p50(ms)  new p50(ms)  delta
kv0/cores=4/nodes=1/splits=0     3.52 ± 5%    3.40 ± 0%   -3.41% (p=0.016 n=5+4)
kv0/cores=4/nodes=1/splits=100   3.30 ± 0%    3.00 ± 0%   -9.09% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=0     4.70 ± 0%    4.14 ± 9%  -11.91% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100   1.50 ± 0%    1.48 ± 8%   ~      (p=0.968 n=4+5)
kv0/cores=16/nodes=1/splits=0    1.40 ± 0%    1.40 ± 0%   ~      (all equal)
kv0/cores=16/nodes=1/splits=100  1.20 ± 0%    1.20 ± 0%   ~      (all equal)
kv0/cores=16/nodes=3/splits=0    2.00 ± 0%    1.90 ± 0%   -5.00% (p=0.000 n=5+4)
kv0/cores=16/nodes=3/splits=100  1.40 ± 0%    1.40 ± 0%   ~      (all equal)
kv0/cores=36/nodes=1/splits=0    1.76 ± 3%    1.60 ± 0%   -9.09% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100  1.40 ± 0%    1.30 ± 0%   -7.14% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0    2.56 ± 2%    2.40 ± 0%   -6.25% (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100  1.70 ± 0%    1.40 ± 0%  -17.65% (p=0.008 n=5+5)
kv95/cores=4/nodes=1/splits=0    0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=4/nodes=1/splits=100  0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=4/nodes=3/splits=0    0.60 ± 0%    0.60 ± 0%   ~      (all equal)
kv95/cores=4/nodes=3/splits=100  0.60 ± 0%    0.60 ± 0%   ~      (all equal)
kv95/cores=16/nodes=1/splits=0   0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=16/nodes=1/splits=100 0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=16/nodes=3/splits=0   0.70 ± 0%    0.64 ± 9%   -8.57% (p=0.167 n=5+5)
kv95/cores=16/nodes=3/splits=100 0.60 ± 0%    0.60 ± 0%   ~      (all equal)
kv95/cores=36/nodes=1/splits=0   0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=36/nodes=1/splits=100 0.50 ± 0%    0.50 ± 0%   ~      (all equal)
kv95/cores=36/nodes=3/splits=0   0.66 ± 9%    0.60 ± 0%   -9.09% (p=0.167 n=5+5)
kv95/cores=36/nodes=3/splits=100 0.60 ± 0%    0.60 ± 0%   ~      (all equal)

name                            old p99(ms)  new p99(ms)  delta
kv0/cores=4/nodes=1/splits=0     11.0 ± 0%    10.5 ± 0%   -4.55% (p=0.000 n=5+4)
kv0/cores=4/nodes=1/splits=100   7.90 ± 0%    7.60 ± 0%   -3.80% (p=0.000 n=5+4)
kv0/cores=4/nodes=3/splits=0     15.7 ± 0%    15.2 ± 0%   -3.18% (p=0.008 n=5+5)
kv0/cores=4/nodes=3/splits=100   8.90 ± 0%    8.12 ± 3%   -8.76% (p=0.016 n=4+5)
kv0/cores=16/nodes=1/splits=0    3.46 ± 2%    3.00 ± 0%  -13.29% (p=0.000 n=5+4)
kv0/cores=16/nodes=1/splits=100  4.50 ± 0%    3.36 ± 2%  -25.33% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=0    4.50 ± 0%    3.90 ± 0%  -13.33% (p=0.008 n=5+5)
kv0/cores=16/nodes=3/splits=100  5.80 ± 0%    4.10 ± 0%  -29.31% (p=0.029 n=4+4)
kv0/cores=36/nodes=1/splits=0    6.80 ± 0%    5.20 ± 0%  -23.53% (p=0.008 n=5+5)
kv0/cores=36/nodes=1/splits=100  5.80 ± 0%    4.32 ± 4%  -25.52% (p=0.008 n=5+5)
kv0/cores=36/nodes=3/splits=0    7.72 ± 2%    6.30 ± 0%  -18.39% (p=0.000 n=5+4)
kv0/cores=36/nodes=3/splits=100  7.98 ± 2%    5.20 ± 0%  -34.84% (p=0.000 n=5+4)
kv95/cores=4/nodes=1/splits=0    5.38 ± 3%    5.20 ± 0%   -3.35% (p=0.167 n=5+5)
kv95/cores=4/nodes=1/splits=100  5.00 ± 0%    5.00 ± 0%   ~      (all equal)
kv95/cores=4/nodes=3/splits=0    5.68 ± 3%    5.50 ± 0%   -3.17% (p=0.095 n=5+4)
kv95/cores=4/nodes=3/splits=100  3.60 ±31%    2.93 ± 3%  -18.75% (p=0.016 n=5+4)
kv95/cores=16/nodes=1/splits=0   4.10 ± 0%    4.10 ± 0%   ~      (all equal)
kv95/cores=16/nodes=1/splits=100 4.50 ± 0%    4.10 ± 0%   -8.89% (p=0.000 n=5+4)
kv95/cores=16/nodes=3/splits=0   2.60 ± 0%    2.60 ± 0%   ~      (all equal)
kv95/cores=16/nodes=3/splits=100 2.50 ± 0%    1.90 ± 5%  -24.00% (p=0.008 n=5+5)
kv95/cores=36/nodes=1/splits=0   6.60 ± 0%    6.00 ± 0%   -9.09% (p=0.029 n=4+4)
kv95/cores=36/nodes=1/splits=100 5.50 ± 0%    5.12 ± 2%   -6.91% (p=0.008 n=5+5)
kv95/cores=36/nodes=3/splits=0   4.18 ± 2%    4.02 ± 3%   -3.71% (p=0.000 n=4+5)
kv95/cores=36/nodes=3/splits=100 3.80 ± 0%    2.80 ± 0%  -26.32% (p=0.008 n=5+5)
```

#### Large-machine Benchmarks

These are the standard benchmarks run on a single-node cluster with 72 vCPUs.
```
name                            old ops/sec  new ops/sec  delta
kv0/cores=72/nodes=1/splits=0    31.0k ± 4%   36.4k ± 1%  +17.57% (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100  44.0k ± 0%   49.0k ± 1%  +11.41% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0   52.7k ±18%   72.6k ±26%  +37.70% (p=0.016 n=5+5)
kv95/cores=72/nodes=1/splits=100 66.8k ±17%   68.5k ± 5%   ~      (p=0.286 n=5+4)

name                            old p50(ms)  new p50(ms)  delta
kv0/cores=72/nodes=1/splits=0    2.30 ±13%    2.52 ± 5%   ~      (p=0.214 n=5+5)
kv0/cores=72/nodes=1/splits=100  3.00 ± 0%    2.90 ± 0%   -3.33% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0   0.46 ±13%    0.50 ± 0%   ~      (p=0.444 n=5+5)
kv95/cores=72/nodes=1/splits=100 0.44 ±14%    0.50 ± 0%  +13.64% (p=0.167 n=5+5)

name                            old p99(ms)  new p99(ms)  delta
kv0/cores=72/nodes=1/splits=0    18.9 ± 6%    13.3 ± 5%  -29.56% (p=0.008 n=5+5)
kv0/cores=72/nodes=1/splits=100  13.4 ± 2%    11.0 ± 0%  -17.91% (p=0.008 n=5+5)
kv95/cores=72/nodes=1/splits=0   34.4 ±34%    23.5 ±24%  -31.74% (p=0.048 n=5+5)
kv95/cores=72/nodes=1/splits=100 21.0 ± 0%    19.1 ± 4%   -8.81% (p=0.029 n=4+4)
```

#### Motivating Benchmarks

These are benchmarks that used to generate a lot of contention in the CommandQueue. They have small cycle lengths, indicated by the `c` specifier. The last one also includes 20% scan operations, which increases contention between non-overlapping point operations.

```
name                                    old ops/sec  new ops/sec  delta
kv95-c5/cores=16/nodes=1/splits=0        45.1k ± 1%   47.2k ± 4%   +4.59% (p=0.008 n=5+5)
kv95-c5/cores=36/nodes=1/splits=0        44.6k ± 1%   76.3k ± 1%  +71.05% (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0      27.2k ± 2%   29.4k ± 1%   +8.12% (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0      42.6k ± 2%   50.0k ± 1%  +17.39% (p=0.008 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0   28.7k ± 1%   29.8k ± 3%   +3.87% (p=0.008 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0   41.9k ± 4%   52.8k ± 2%  +25.97% (p=0.008 n=5+5)

name                                    old p50(ms)  new p50(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0        0.60 ± 0%    0.60 ± 0%   ~      (all equal)
kv95-c5/cores=36/nodes=1/splits=0        0.90 ± 0%    0.80 ± 0%  -11.11% (p=0.008 n=5+5)
kv50-c128/cores=16/nodes=1/splits=0      1.10 ± 0%    1.06 ± 6%   ~      (p=0.444 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0      1.26 ± 5%    1.30 ± 0%   ~      (p=0.444 n=5+5)
kv70-20-c128/cores=16/nodes=1/splits=0   0.66 ± 9%    0.60 ± 0%   -9.09% (p=0.167 n=5+5)
kv70-20-c128/cores=36/nodes=1/splits=0   0.70 ± 0%    0.50 ± 0%  -28.57% (p=0.008 n=5+5)

name                                    old p99(ms)  new p99(ms)  delta
kv95-c5/cores=16/nodes=1/splits=0        2.40 ± 0%    2.10 ± 0%  -12.50% (p=0.000 n=5+4)
kv95-c5/cores=36/nodes=1/splits=0        5.80 ± 0%    3.30 ± 0%  -43.10% (p=0.000 n=5+4)
kv50-c128/cores=16/nodes=1/splits=0      3.50 ± 0%    3.00 ± 0%  -14.29% (p=0.008 n=5+5)
kv50-c128/cores=36/nodes=1/splits=0      6.80 ± 0%    4.70 ± 0%  -30.88% (p=0.079 n=4+5)
kv70-20-c128/cores=16/nodes=1/splits=0   5.00 ± 0%    4.70 ± 0%   -6.00% (p=0.029 n=4+4)
kv70-20-c128/cores=36/nodes=1/splits=0   11.0 ± 0%    6.8 ± 0%   -38.18% (p=0.008 n=5+5)
```

#### Batching Benchmarks

One optimization left out of the new spanlatch.Manager was the "covering" optimization, where commands were initially added to the interval tree as a single spanning interval and only expanded later. I ran a series of benchmarks to verify that this optimization was not needed. My hypothesis was that the order-of-magnitude increase in the speed of the interval tree would make the optimization unnecessary. It turns out that removing the optimization hurt a few benchmarks to a small degree but sped up others tremendously (some benchmarks improved by over 400%). I suspect that the covering optimization could actually hurt in cases where it causes non-overlapping requests to overlap (e.g. a batch writing only to keys "a" and "z" would initially be inserted as the single covering span [a,z], forcing an unrelated write to "m" to wait on it).
It is interesting how quickly a few of these benchmarks oscillate from small losses to big wins. It makes me think that there's some non-linear behavior in the old CommandQueue that would cause its performance to degrade quickly once it became a contention bottleneck.

```
name                              old ops/sec  new ops/sec  delta
kv0-b16/cores=4/nodes=1/splits=0     2.41k ± 0%  2.06k ± 3%   -14.75% (p=0.008 n=5+5)
kv0-b16/cores=4/nodes=1/splits=100     514 ± 0%    534 ± 1%    +3.88% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0    2.95k ± 0%  4.35k ± 0%   +47.74% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100  1.80k ± 1%  1.88k ± 1%    +4.46% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0    2.74k ± 0%  4.92k ± 1%   +79.55% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100  2.39k ± 1%  2.45k ± 1%    +2.41% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0      422 ± 0%    518 ± 1%   +22.60% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100   98.4 ± 1%   98.8 ± 1%    ~      (p=0.810 n=5+5)
kv0-b128/cores=16/nodes=1/splits=0     532 ± 0%   1059 ± 0%   +99.16% (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100   291 ± 1%    307 ± 1%    +5.18% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0     483 ± 0%   1288 ± 1%  +166.37% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100   394 ± 1%    408 ± 1%    +3.51% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0    49.7 ± 1%   72.8 ± 1%   +46.52% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100  30.8 ± 0%   23.4 ± 0%   -24.03% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0   48.9 ± 2%  160.6 ± 0%  +228.38% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=100  101 ± 1%     80 ± 0%   -21.64% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0   37.5 ± 0%  208.1 ± 1%  +454.99% (p=0.016 n=4+5)
kv0-b1024/cores=36/nodes=1/splits=100  162 ± 0%    124 ± 0%   -23.22% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0    5.93k ± 0%  6.20k ± 1%    +4.55% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100  2.27k ± 1%  2.32k ± 1%    +2.28% (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=0   5.15k ± 1% 18.79k ± 1%  +264.73% (p=0.008 n=5+5)
kv95-b16/cores=16/nodes=1/splits=100 8.31k ± 1%  8.57k ± 1%    +3.16% (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0   3.96k ± 0% 10.67k ± 1%  +169.81% (p=0.008 n=5+5)
kv95-b16/cores=36/nodes=1/splits=100 15.7k ± 2%  16.2k ± 4%    +2.75% (p=0.151 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0   1.12k ± 1%  1.27k ± 0%   +13.28% (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100   290 ± 1%    299 ± 1%    +3.02% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0  1.06k ± 0%  3.31k ± 0%  +213.09% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100  662 ±91%   1095 ± 1%   +65.42% (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0    715 ± 2%   3586 ± 0%  +401.21% (p=0.008 n=5+5)
kv95-b128/cores=36/nodes=1/splits=100 1.15k ±90% 2.01k ± 2%   +74.79% (p=0.016 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0    134 ± 1%    170 ± 1%   +26.59% (p=0.008 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=100 54.8 ± 3%   53.3 ± 3%    -2.84% (p=0.056 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=0   104 ± 3%    367 ± 1%  +252.37% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 210 ± 1%    214 ± 1%    +1.86% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0  76.5 ± 2%  383.9 ± 1%  +401.67% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 431 ± 1%    436 ± 1%    +1.17% (p=0.016 n=5+5)

name                              old p50(ms)  new p50(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0      3.00 ± 0%   3.40 ± 0%   +13.33% (p=0.016 n=5+4)
kv0-b16/cores=4/nodes=1/splits=100    15.2 ± 0%   14.7 ± 0%    -3.29% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=0     10.5 ± 0%    7.7 ± 2%   -26.48% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100   17.8 ± 0%   16.8 ± 0%    -5.62% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=0     26.2 ± 0%   14.2 ± 0%   -45.80% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100   29.0 ± 2%   28.3 ± 0%    -2.28% (p=0.095 n=5+4)
kv0-b128/cores=4/nodes=1/splits=0     17.8 ± 0%   15.2 ± 0%   -14.61% (p=0.000 n=5+4)
kv0-b128/cores=4/nodes=1/splits=100   79.7 ± 0%   79.7 ± 0%    ~      (all equal)
kv0-b128/cores=16/nodes=1/splits=0    65.0 ± 0%   32.5 ± 0%   -50.00% (p=0.029 n=4+4)
kv0-b128/cores=16/nodes=1/splits=100   109 ± 0%    105 ± 0%    -3.85% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0     168 ± 0%     50 ± 0%   -70.02% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100   184 ± 0%    176 ± 0%    -4.50% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0     159 ± 0%    109 ± 0%   -31.56% (p=0.000 n=5+4)
kv0-b1024/cores=4/nodes=1/splits=100   252 ± 0%    319 ± 0%   +26.66% (p=0.008 n=5+5)
kv0-b1024/cores=16/nodes=1/splits=0    705 ± 0%    193 ± 0%   -72.62% (p=0.000 n=5+4)
kv0-b1024/cores=16/nodes=1/splits=100  319 ± 0%    386 ± 0%   +21.05% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0  1.88k ± 0%  0.24k ± 0%   -87.05% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100  436 ± 0%    570 ± 0%   +30.77% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0     1.20 ± 0%   1.20 ± 0%    ~      (all equal)
kv95-b16/cores=4/nodes=1/splits=100   2.60 ± 0%   2.60 ± 0%    ~      (all equal)
kv95-b16/cores=16/nodes=1/splits=0    6.30 ± 0%   1.40 ± 0%   -77.78% (p=0.000 n=5+4)
kv95-b16/cores=16/nodes=1/splits=100  1.74 ± 3%   1.76 ± 3%    ~      (p=1.000 n=5+5)
kv95-b16/cores=36/nodes=1/splits=0    11.5 ± 0%    5.5 ± 0%   -52.17% (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100  2.42 ±20%   2.42 ±45%    ~      (p=0.579 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0    6.60 ± 0%   6.00 ± 0%    -9.09% (p=0.008 n=5+5)
kv95-b128/cores=4/nodes=1/splits=100  21.4 ± 3%   21.0 ± 0%    ~      (p=0.444 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0   30.4 ± 0%    9.4 ± 0%   -69.08% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100 38.2 ±76%   21.2 ± 4%   -44.31% (p=0.063 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0   88.1 ± 0%   16.8 ± 0%   -80.93% (p=0.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100 56.6 ±85%   29.6 ±15%    ~      (p=0.873 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=0   52.4 ± 0%   44.0 ± 0%   -16.03% (p=0.029 n=4+4)
kv95-b1024/cores=4/nodes=1/splits=100  132 ± 2%    143 ± 0%    +8.29% (p=0.016 n=5+4)
kv95-b1024/cores=16/nodes=1/splits=0   325 ± 3%     80 ± 0%   -75.51% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 151 ± 0%    151 ± 0%    ~      (all equal)
kv95-b1024/cores=36/nodes=1/splits=0   973 ± 0%    180 ± 3%   -81.55% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 168 ± 0%    168 ± 0%    ~      (all equal)

name                              old p99(ms)  new p99(ms)  delta
kv0-b16/cores=4/nodes=1/splits=0      8.40 ± 0%  10.30 ± 3%   +22.62% (p=0.016 n=4+5)
kv0-b16/cores=4/nodes=1/splits=100    29.4 ± 0%   27.3 ± 0%    -7.14% (p=0.000 n=5+4)
kv0-b16/cores=16/nodes=1/splits=0     16.3 ± 0%   15.5 ± 2%    -4.91% (p=0.008 n=5+5)
kv0-b16/cores=16/nodes=1/splits=100   31.5 ± 0%   29.4 ± 0%    -6.67% (p=0.000 n=5+4)
kv0-b16/cores=36/nodes=1/splits=0     37.7 ± 0%   28.7 ± 2%   -23.77% (p=0.008 n=5+5)
kv0-b16/cores=36/nodes=1/splits=100   62.1 ± 2%   68.4 ±10%   +10.15% (p=0.008 n=5+5)
kv0-b128/cores=4/nodes=1/splits=0     37.7 ± 0%   39.4 ± 6%    +4.46% (p=0.167 n=5+5)
kv0-b128/cores=4/nodes=1/splits=100    143 ± 0%    151 ± 0%    +5.89% (p=0.016 n=4+5)
kv0-b128/cores=16/nodes=1/splits=0    79.7 ± 0%   55.8 ± 2%   -30.04% (p=0.008 n=5+5)
kv0-b128/cores=16/nodes=1/splits=100   198 ± 3%    188 ± 3%    -5.09% (p=0.048 n=5+5)
kv0-b128/cores=36/nodes=1/splits=0     184 ± 0%    126 ± 3%   -31.82% (p=0.008 n=5+5)
kv0-b128/cores=36/nodes=1/splits=100   319 ± 0%    336 ± 0%    +5.24% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=0     322 ± 6%    253 ± 4%   -21.35% (p=0.008 n=5+5)
kv0-b1024/cores=4/nodes=1/splits=100   470 ± 0%    772 ± 4%   +64.28% (p=0.016 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=0  1.41k ± 0%  0.56k ±11%   -60.00% (p=0.000 n=4+5)
kv0-b1024/cores=16/nodes=1/splits=100  530 ± 2%    772 ± 0%   +45.57% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=0  4.05k ± 7%  1.17k ± 3%   -71.19% (p=0.008 n=5+5)
kv0-b1024/cores=36/nodes=1/splits=100  792 ±14%   1020 ± 2%   +28.81% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=0     3.90 ± 0%   3.22 ± 4%   -17.44% (p=0.008 n=5+5)
kv95-b16/cores=4/nodes=1/splits=100   21.0 ± 0%   19.9 ± 0%    -5.24% (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=0    15.2 ± 0%    7.1 ± 0%   -53.29% (p=0.079 n=4+5)
kv95-b16/cores=16/nodes=1/splits=100  38.5 ± 3%   37.7 ± 0%    ~      (p=0.333 n=5+4)
kv95-b16/cores=36/nodes=1/splits=0     128 ± 2%     52 ± 0%   -59.16% (p=0.000 n=5+4)
kv95-b16/cores=36/nodes=1/splits=100  41.1 ±13%   39.2 ±33%    ~      (p=0.984 n=5+5)
kv95-b128/cores=4/nodes=1/splits=0    17.8 ± 0%   14.7 ± 0%   -17.42% (p=0.079 n=4+5)
kv95-b128/cores=4/nodes=1/splits=100   107 ± 2%    106 ± 5%    ~      (p=0.683 n=5+5)
kv95-b128/cores=16/nodes=1/splits=0   75.5 ± 0%   23.1 ± 0%   -69.40% (p=0.008 n=5+5)
kv95-b128/cores=16/nodes=1/splits=100  107 ±34%    120 ± 2%    ~      (p=1.000 n=5+4)
kv95-b128/cores=36/nodes=1/splits=0    253 ± 4%     71 ± 0%   -71.86% (p=0.016 n=5+4)
kv95-b128/cores=36/nodes=1/splits=100  166 ±19%    164 ±74%    ~      (p=0.310 n=5+5)
kv95-b1024/cores=4/nodes=1/splits=0    146 ± 3%    101 ± 0%   -31.01% (p=0.000 n=5+4)
kv95-b1024/cores=4/nodes=1/splits=100  348 ± 4%    366 ± 6%    ~      (p=0.317 n=4+5)
kv95-b1024/cores=16/nodes=1/splits=0   624 ± 3%    221 ± 2%   -64.52% (p=0.008 n=5+5)
kv95-b1024/cores=16/nodes=1/splits=100 325 ± 3%    319 ± 0%    ~      (p=0.444 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=0 1.56k ± 5%  0.41k ± 2%   -73.71% (p=0.008 n=5+5)
kv95-b1024/cores=36/nodes=1/splits=100 336 ± 0%    336 ± 0%    ~      (all equal)
```

Release note (performance improvement): Replace the Replica latching mechanism with a new optimized data structure that improves throughput, especially under heavy contention.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Informs #4768.
Informs #31904.
This change was inspired by #31904 and is a progression of the thinking started in #4768 (comment).
The change introduces `spanlatch.Manager`, which will replace the `CommandQueue` in a future PR. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in the storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes #31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue, and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because `Replica.beginCmds` repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges.

The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice, so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock.
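To make that sequencing flow concrete, here is a minimal Go sketch of the acquire path under the copy-on-write scheme. This is an illustration only: the stubbed types and the helper names (`insertLocked`, `wait`, `snap`) are assumptions for the sketch, not the actual `spanlatch` API.

```go
package spanlatch

import (
	"context"
	"sync"
)

// Stub types standing in for the real declared-span set and the
// immutable interval btrees; details elided.
type spanSet struct{}
type btree struct{}

// Clone returns an O(1) copy-on-write snapshot of the tree.
func (t *btree) Clone() btree { return *t }

// snap is an immutable view of the manager's trees at acquisition time.
type snap struct{ read, write btree }

// Guard is the handle held by a request while its latches are acquired.
type Guard struct{ spans *spanSet }

type Manager struct {
	mu    sync.Mutex
	read  btree // latches held for reading
	write btree // latches held for writing
}

// Acquire sequences a request. All work under mu is proportional to the
// number of spans the request declares, not to the number of requests
// it must wait for.
func (m *Manager) Acquire(ctx context.Context, spans *spanSet) (*Guard, error) {
	lg := &Guard{spans: spans} // the single expected allocation

	m.mu.Lock()
	s := snap{read: m.read.Clone(), write: m.write.Clone()} // O(1)
	m.insertLocked(lg)
	m.mu.Unlock()

	// Wait on overlapping latches by iterating over the immutable
	// snapshot, entirely outside of the mutex. Cancellation is simple:
	// stop iterating when ctx is done.
	if err := m.wait(ctx, lg, s); err != nil {
		m.Release(lg)
		return nil, err
	}
	return lg, nil
}

func (m *Manager) insertLocked(lg *Guard) {} // insert lg's latches; elided

func (m *Manager) wait(ctx context.Context, lg *Guard, s snap) error {
	return nil // block on overlapping latches in s, respecting ctx; elided
}

func (m *Manager) Release(lg *Guard) {} // remove lg's latches; elided
```

However the trees are organized internally, the key property the sketch tries to capture is that the mutex is dropped before any waiting happens.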
While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the `spanlatch.Guard`. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting the immutable btree nodes, which allows us to release them back into the btree node pools when they're no longer needed. This means that we don't expect any allocations when inserting into the internal trees, even with the copy-on-write policy.

Finally, this will make the approach taken in #31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after the fact, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free.
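To make the node-recycling point above concrete, here is a minimal sketch of atomic reference counting over pooled btree nodes. Everything here (field names, the `mut` helper, the two-pool split) is an assumed illustration of the technique rather than the actual implementation.

```go
package spanlatch

import (
	"sync"
	"sync/atomic"
)

type node struct {
	ref      int32 // atomic reference count
	leaf     bool
	children []*node // empty for leaf nodes; keys/latches elided
}

// Separate pools because leaf and non-leaf nodes have different footprints.
var (
	leafPool  = sync.Pool{New: func() interface{} { return &node{leaf: true} }}
	innerPool = sync.Pool{New: func() interface{} { return &node{} }}
)

func newNode(leaf bool) *node {
	var n *node
	if leaf {
		n = leafPool.Get().(*node)
	} else {
		n = innerPool.Get().(*node)
	}
	n.ref = 1 // sole owner; not yet shared
	return n
}

func (n *node) incRef() { atomic.AddInt32(&n.ref, 1) }

// decRef releases the node back into its pool once no tree references
// it, recursively releasing its children.
func (n *node) decRef() {
	if atomic.AddInt32(&n.ref, -1) > 0 {
		return
	}
	for _, c := range n.children {
		c.decRef()
	}
	n.children = n.children[:0]
	if n.leaf {
		leafPool.Put(n)
	} else {
		innerPool.Put(n)
	}
}

// mut applies the copy-on-write rule: a tree may mutate a node in place
// only if it owns it exclusively (ref == 1); otherwise it clones the
// node from the pool, bumping the refcount of each adopted child.
func mut(n **node) *node {
	if atomic.LoadInt32(&(*n).ref) == 1 {
		return *n
	}
	c := newNode((*n).leaf)
	c.children = append(c.children, (*n).children...)
	for _, child := range c.children {
		child.incRef()
	}
	(*n).decRef()
	*n = c
	return c
}
```

Because ownership is tracked per node, repeated mutations by the same tree hit the `ref == 1` fast path and allocate nothing.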
### Naming changes

`CommandQueue` → `spanlatch.Manager`

The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is that it is critical that we don't confuse the operational locking performed by the `CommandQueue`/`spanlatch.Manager` with the transaction-scoped locking enforced by intents and our transactional concurrency control model.
### Microbenchmarks
NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure.
There are a few interesting things to point out about these benchmark results:
- The `ReadOnlyMix` results demonstrate a fixed improvement, regardless of size. This is due to the replacement of the hash-map with a linked-list for the readSet structure (see the sketch below).
- `ReadWriteMix` is more interesting. We see that the spanlatch implementation is faster across the board. This is especially true with a high write/read ratio.
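As a rough illustration of that first point, a readSet kept as an intrusive doubly-linked list makes insertion and removal O(1) with no per-operation allocation, which is plausibly where the fixed improvement over a hash-map comes from. The layout below is an assumption for illustration, not the actual `spanlatch` code.

```go
package spanlatch

// latch embeds its own list links, so membership in the readSet costs
// no allocation beyond the latch itself.
type latch struct {
	// span and sequencing fields elided.
	next, prev *latch
}

// readSet is a circular intrusive list with a sentinel root, in the
// spirit of container/list but without per-element wrapper allocations.
type readSet struct {
	root latch
	len  int
}

func (s *readSet) init() {
	s.root.next = &s.root
	s.root.prev = &s.root
}

// insert pushes la onto the back of the list in O(1).
func (s *readSet) insert(la *latch) {
	if s.root.next == nil {
		s.init()
	}
	la.prev = s.root.prev
	la.next = &s.root
	la.prev.next = la
	la.next.prev = la
	s.len++
}

// remove unlinks la in O(1) given only a pointer to it; a hash-map
// would instead pay for hashing and occasional rehash growth.
func (s *readSet) remove(la *latch) {
	la.prev.next = la.next
	la.next.prev = la.prev
	la.next, la.prev = nil, nil
	s.len--
}
```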
Release note: None