Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
storage/spanlatch: create spanlatch.Manager using immutable btrees
Informs cockroachdb#4768. Informs cockroachdb#31904. This change was inspired by cockroachdb#31904 and is a progression of the thinking started in cockroachdb#4768 (comment). The change introduces `spanlatch.Manager`, which will replace the `CommandQueue` **in a future PR**. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in that storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes cockroachdb#31904 easier to implement. The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because `Replica.beginCmds` repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges. The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock. While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the `spanlatch.Guard`. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting btree nodes, which allows us to release them back into the btree node pools when they're no longer references by any btree snapshots. This means that we don't expect any allocations when inserting into the internal trees, even with the COW policy. Finally, this will make the approach taken in cockroachdb#31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free. _### Naming changes | Before | After | |----------------------------|-----------------------------------| | `CommandQueue` | `spanlatch.Manager` | | "enter the command queue" | "acquire span latches" | | "exit the command queue" | "release span latches" | | "wait for prereq commands" | "wait for latches to be released" | The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is that it is critical for understanding that we don't confuse the operational locking performed by the CommandQueue/spanlatch.Manager with the transaction-scoped locking enforced by intents and our transactional concurrency control model. _### Microbenchmarks NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure. ``` name cmdq time/op spanlatch time/op delta ReadOnlyMix/size=1-4 897ns ±21% 917ns ±18% ~ (p=0.897 n=8+10) ReadOnlyMix/size=4-4 827ns ±22% 772ns ±15% ~ (p=0.448 n=10+10) ReadOnlyMix/size=16-4 905ns ±19% 770ns ±10% -14.90% (p=0.004 n=10+10) ReadOnlyMix/size=64-4 907ns ±20% 730ns ±15% -19.51% (p=0.001 n=10+10) ReadOnlyMix/size=128-4 926ns ±17% 731ns ±11% -21.04% (p=0.000 n=9+10) ReadOnlyMix/size=256-4 977ns ±19% 726ns ± 9% -25.65% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=0-4 12.5µs ± 4% 0.7µs ±17% -94.70% (p=0.000 n=8+9) ReadWriteMix/readsPerWrite=1-4 8.18µs ± 5% 0.63µs ± 6% -92.24% (p=0.000 n=10+9) ReadWriteMix/readsPerWrite=4-4 3.80µs ± 2% 0.66µs ± 5% -82.58% (p=0.000 n=8+10) ReadWriteMix/readsPerWrite=16-4 1.82µs ± 2% 0.70µs ± 5% -61.43% (p=0.000 n=9+10) ReadWriteMix/readsPerWrite=64-4 894ns ±12% 514ns ± 6% -42.48% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=128-4 717ns ± 5% 472ns ± 1% -34.21% (p=0.000 n=10+8) ReadWriteMix/readsPerWrite=256-4 607ns ± 5% 453ns ± 3% -25.35% (p=0.000 n=7+10) name cmdq alloc/op spanlatch alloc/op delta ReadOnlyMix/size=1-4 223B ± 0% 191B ± 0% -14.35% (p=0.000 n=10+10) ReadOnlyMix/size=4-4 223B ± 0% 191B ± 0% -14.35% (p=0.000 n=10+10) ReadOnlyMix/size=16-4 223B ± 0% 191B ± 0% -14.35% (p=0.000 n=10+10) ReadOnlyMix/size=64-4 223B ± 0% 191B ± 0% -14.35% (p=0.000 n=10+10) ReadOnlyMix/size=128-4 223B ± 0% 191B ± 0% -14.35% (p=0.000 n=10+10) ReadOnlyMix/size=256-4 223B ± 0% 191B ± 0% -14.35% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=0-4 915B ± 0% 144B ± 0% -84.26% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=1-4 730B ± 0% 144B ± 0% -80.29% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=4-4 486B ± 0% 144B ± 0% -70.35% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=16-4 350B ± 0% 144B ± 0% -58.86% (p=0.000 n=9+10) ReadWriteMix/readsPerWrite=64-4 222B ± 0% 144B ± 0% -35.14% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=128-4 199B ± 0% 144B ± 0% -27.64% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=256-4 188B ± 0% 144B ± 0% -23.40% (p=0.000 n=10+10) name cmdq allocs/op spanlatch allocs/op delta ReadOnlyMix/size=1-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) ReadOnlyMix/size=4-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) ReadOnlyMix/size=16-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) ReadOnlyMix/size=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) ReadOnlyMix/size=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) ReadOnlyMix/size=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) ReadWriteMix/readsPerWrite=0-4 34.0 ± 0% 1.0 ± 0% -97.06% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=1-4 22.0 ± 0% 1.0 ± 0% -95.45% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=4-4 10.0 ± 0% 1.0 ± 0% -90.00% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=16-4 4.00 ± 0% 1.00 ± 0% -75.00% (p=0.000 n=10+10) ReadWriteMix/readsPerWrite=64-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) ReadWriteMix/readsPerWrite=128-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) ReadWriteMix/readsPerWrite=256-4 1.00 ± 0% 1.00 ± 0% ~ (all equal) ``` Release note: None
- Loading branch information