Merge #31997 #32416
31997: storage/spanlatch: create spanlatch.Manager using immutable btrees r=nvanbenschoten a=nvanbenschoten

Informs #4768.
Informs #31904.

This change was inspired by #31904 and is a progression of the thinking started in #4768 (comment).

The change introduces `spanlatch.Manager`, which will replace the `CommandQueue` **in a future PR**. The new type isn't hooked up yet because doing so will require a lot of plumbing changes in the storage package that are best kept in a separate PR. The structure uses a new strategy that reduces lock contention, simplifies the code, avoids allocations, and makes #31904 easier to implement.

The primary objective, reducing lock contention, is addressed by minimizing the amount of work we perform under the exclusive "sequencing" mutex while locking the structure. This is made possible by employing a copy-on-write strategy. Before this change, commands would lock the queue, create a large slice of prerequisites, insert into the queue and unlock. After the change, commands lock the manager, grab an immutable snapshot of the manager's trees in O(1) time, insert into the manager, and unlock. They can then iterate over the immutable tree snapshot outside of the lock. Effectively, this means that the work performed under lock is linear with respect to the number of spans that a command declares but NO LONGER linear with respect to the number of other commands that it will wait on. This is important because `Replica.beginCmds` repeatedly comes up as the largest source of mutex contention in our system, especially on hot ranges.
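
To make this concrete, here is a minimal sketch of the pattern. This is not the actual API: the `tree`, `insert`, and `wait` names (and everything else below) are illustrative assumptions.

```go
package spanlatchsketch

import "sync"

// tree stands in for an immutable interval btree: copying the struct is
// an O(1) snapshot, and copy-on-write inserts never mutate nodes that a
// previously taken snapshot can still reach.
type tree struct{ root *node }

type node struct{} // refcounted children, latches, ...

// Guard tracks a single command's latches (assumed shape).
type Guard struct{ spans []span }

type span struct{ key, endKey string }

type Manager struct {
	mu sync.Mutex
	t  tree
}

func (m *Manager) insert(g *Guard)          { /* copy-on-write insert */ }
func (m *Manager) wait(g *Guard, snap tree) { /* block on overlapping latches in snap */ }

// Acquire performs only O(len(g.spans)) work while holding the mutex;
// waiting on overlapping latches happens against the snapshot, after
// the lock is released, so it never blocks other commands.
func (m *Manager) Acquire(g *Guard) {
	m.mu.Lock()
	snap := m.t // O(1) immutable snapshot
	m.insert(g)
	m.mu.Unlock()

	m.wait(g, snap)
}
```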

The use of immutable snapshots also simplifies the code significantly. We're no longer copying our prereqs into a slice so we no longer need to carefully determine which transitive dependencies we do or don't need to wait on explicitly. This also makes lock cancellation trivial because we no longer explicitly hold on to our prereqs at all. Instead, we simply iterate through the snapshot outside of the lock.

While rewriting the structure, I also spent some time optimizing its allocations. Under normal operation, acquiring a latch now incurs only a single allocation - that being for the `spanlatch.Guard`. All other allocations are avoided through object pooling where appropriate. The overhead of using a copy-on-write technique is almost entirely avoided by atomically reference counting immutable btree nodes, which allows us to release them back into the btree node pools when they're no longer needed. This means that we don't expect any allocations when inserting into the internal trees, even with the copy-on-write policy.
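
A standalone sketch of that reference-counting scheme, again with assumed names (`refcount`, `nodePool`, and the node layout are all made up for illustration):

```go
package spanlatchsketch

import (
	"sync"
	"sync/atomic"
)

// node is an immutable-btree node shared between trees and snapshots.
type node struct {
	refcount int32
	children []*node
	// keys, latches, ...
}

var nodePool = sync.Pool{New: func() interface{} { return new(node) }}

func newNode() *node {
	n := nodePool.Get().(*node)
	n.refcount = 1
	return n
}

// incRef marks the node as shared by one more tree or snapshot.
func (n *node) incRef() { atomic.AddInt32(&n.refcount, 1) }

// decRef returns the node to the pool once the last tree or snapshot
// referencing it lets go, so steady-state inserts allocate nothing.
func (n *node) decRef() {
	if atomic.AddInt32(&n.refcount, -1) == 0 {
		for _, c := range n.children {
			c.decRef()
		}
		*n = node{} // clear before pooling
		nodePool.Put(n)
	}
}
```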

Finally, this will make the approach taken in #31904 much more natural. Instead of tracking dependents and prerequisites for speculative reads and then iterating through them to find overlaps after, we can use the immutable snapshots directly! We can grab a snapshot and sequence ourselves as usual, but avoid waiting for prereqs. We then execute optimistically before finally checking whether we overlapped any of our prereqs. The great thing about this is that we already have the prereqs in an interval tree structure, so we get an efficient validation check for free.
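
Extending the first sketch above (with `"errors"` added to its imports), the optimistic flow might look roughly like this; `overlaps` and `errConflict` are assumed helpers, not anything #31904 has committed to:

```go
// errConflict and overlaps extend the earlier sketch; both are assumed.
var errConflict = errors.New("overlapped a prerequisite latch")

// overlaps would run an interval-tree intersection between g's spans
// and the latches captured in snap.
func (m *Manager) overlaps(g *Guard, snap tree) bool { return false }

// AcquireOptimistic sequences g as usual but skips waiting, lets the
// caller execute, then validates against the snapshot of prereqs.
func (m *Manager) AcquireOptimistic(g *Guard, run func() error) error {
	m.mu.Lock()
	snap := m.t // the prereqs, already in interval-tree form
	m.insert(g)
	m.mu.Unlock()

	if err := run(); err != nil { // execute without waiting
		return err
	}
	if m.overlaps(g, snap) { // efficient check: snap is a btree
		return errConflict // caller falls back to waiting
	}
	return nil
}
```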

### Naming changes

| Before                     | After                             |
|----------------------------|-----------------------------------|
| `CommandQueue`             | `spanlatch.Manager`               |
| "enter the command queue"  | "acquire span latches"            |
| "exit the command queue"   | "release span latches"            |
| "wait for prereq commands" | "wait for latches to be released" |

The use of the word "latch" is based on the definition of latches presented by Goetz Graefe in https://15721.courses.cs.cmu.edu/spring2016/papers/a16-graefe.pdf (see https://i.stack.imgur.com/fSRzd.png). An important reason for avoiding the word "lock" here is to avoid confusing the operational locking performed by the `CommandQueue`/`spanlatch.Manager` with the transaction-scoped locking enforced by intents and our transactional concurrency control model.

### Microbenchmarks

NOTE: these are single-threaded benchmarks that don't benefit at all from the concurrency improvements enabled by this new structure.

```
name                              old time/op    new time/op    delta
ReadOnlyMix/size=1-4                 706ns ±20%     404ns ±10%  -42.81%  (p=0.008 n=5+5)
ReadOnlyMix/size=4-4                 649ns ±23%     382ns ± 5%  -41.13%  (p=0.008 n=5+5)
ReadOnlyMix/size=16-4                611ns ±16%     367ns ± 5%  -39.83%  (p=0.008 n=5+5)
ReadOnlyMix/size=64-4                692ns ±14%     370ns ± 1%  -46.49%  (p=0.016 n=5+4)
ReadOnlyMix/size=128-4               637ns ±22%     398ns ±14%  -37.48%  (p=0.008 n=5+5)
ReadOnlyMix/size=256-4               676ns ±15%     385ns ± 4%  -43.01%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=0-4      12.2µs ± 4%     0.6µs ±17%  -94.85%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4      7.88µs ± 2%    0.55µs ± 7%  -92.99%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4      4.19µs ± 3%    0.58µs ± 5%  -86.26%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4     2.09µs ± 6%    0.54µs ±13%  -74.13%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4      875ns ±17%     423ns ±29%  -51.64%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4     655ns ± 6%     362ns ±16%  -44.71%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=256-4     549ns ±16%     314ns ±13%  -42.73%  (p=0.008 n=5+5)

name                              old alloc/op   new alloc/op   delta
ReadOnlyMix/size=1-4                  223B ± 0%      160B ± 0%  -28.25%  (p=0.079 n=4+5)
ReadOnlyMix/size=4-4                  223B ± 0%      160B ± 0%  -28.25%  (p=0.008 n=5+5)
ReadOnlyMix/size=16-4                 223B ± 0%      160B ± 0%  -28.25%  (p=0.008 n=5+5)
ReadOnlyMix/size=64-4                 223B ± 0%      160B ± 0%  -28.25%  (p=0.008 n=5+5)
ReadOnlyMix/size=128-4                217B ± 4%      160B ± 0%  -26.27%  (p=0.008 n=5+5)
ReadOnlyMix/size=256-4                223B ± 0%      160B ± 0%  -28.25%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=0-4      1.25kB ± 0%    0.16kB ± 0%  -87.15%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4      1.00kB ± 0%    0.16kB ± 0%  -84.00%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=4-4        708B ± 0%      160B ± 0%  -77.40%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=16-4       513B ± 0%      160B ± 0%  -68.81%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4       264B ± 0%      160B ± 0%  -39.39%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4      221B ± 0%      160B ± 0%  -27.60%  (p=0.079 n=4+5)
ReadWriteMix/readsPerWrite=256-4      198B ± 0%      160B ± 0%  -19.35%  (p=0.008 n=5+5)

name                              old allocs/op  new allocs/op  delta
ReadOnlyMix/size=1-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=4-4                  1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=16-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=64-4                 1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=128-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadOnlyMix/size=256-4                1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=0-4        38.0 ± 0%       1.0 ± 0%  -97.37%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=1-4        24.0 ± 0%       1.0 ± 0%  -95.83%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=4-4        12.0 ± 0%       1.0 ± 0%  -91.67%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=16-4       5.00 ± 0%      1.00 ± 0%  -80.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=64-4       2.00 ± 0%      1.00 ± 0%  -50.00%  (p=0.008 n=5+5)
ReadWriteMix/readsPerWrite=128-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
ReadWriteMix/readsPerWrite=256-4      1.00 ± 0%      1.00 ± 0%     ~     (all equal)
```

There are a few interesting things to point out about these benchmark results:
- The `ReadOnlyMix` results demonstrate a fixed improvement, regardless of size. This is due to the replacement of the hash-map with a linked-list for the readSet structure (see the sketch after this list).
- The `ReadWriteMix` is more interesting. We see that the spanlatch implementation is faster across the board. This is especially true with a high write/read ratio.
- We see that the allocated memory stays constant regardless of the write/read ratio in the spanlatch implementation. This is due to the memory recycling that it performs on btree nodes. This is not the case for the CommandQueue implementation.
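
The fixed `ReadOnlyMix` cost is what you would expect from an intrusive doubly-linked list: O(1), allocation-free insert and remove, independent of how many reads are held. A minimal sketch under assumed names:

```go
package readsetsketch

// latch is an element of the read set; the links are intrusive, so no
// per-insert allocation is needed.
type latch struct {
	prev, next *latch
	// span, timestamp, ...
}

// readSet is a sentinel-based circular doubly-linked list.
type readSet struct{ root latch }

func (s *readSet) init() {
	s.root.prev, s.root.next = &s.root, &s.root
}

// insert splices la in after the sentinel in O(1).
func (s *readSet) insert(la *latch) {
	la.prev, la.next = &s.root, s.root.next
	la.prev.next, la.next.prev = la, la
}

// remove unlinks la in O(1), regardless of set size.
func (s *readSet) remove(la *latch) {
	la.prev.next, la.next.prev = la.next, la.prev
	la.prev, la.next = nil, nil
}
```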

Release note: None

32416: scripts: enhance the release notes r=knz a=knz

Fixes #25180.

With this change, the release notes for the first 2.2 alpha in cockroachdb/docs#4051 are reduced to just under two pages.

This PR also makes it easier to monitor progress during the execution of the script.

Co-authored-by: Nathan VanBenschoten <[email protected]>
Co-authored-by: Raphael 'kena' Poss <[email protected]>
3 people committed Nov 29, 2018
3 parents c3d535c + b2ab370 + 8cb12f3 commit 142476f
Showing 21 changed files with 1,692 additions and 472 deletions.
65 changes: 39 additions & 26 deletions pkg/storage/command_queue_test.go
@@ -25,6 +25,7 @@ import (
"github.com/cockroachdb/cockroach/pkg/roachpb"
"github.com/cockroachdb/cockroach/pkg/util/hlc"
"github.com/cockroachdb/cockroach/pkg/util/leaktest"
"github.com/cockroachdb/cockroach/pkg/util/syncutil"
)

var zeroTS = hlc.Timestamp{}
@@ -805,13 +806,14 @@ func assertExpectedPrereqs(
}
}

func BenchmarkCommandQueueGetPrereqsAllReadOnly(b *testing.B) {
func BenchmarkCommandQueueReadOnlyMix(b *testing.B) {
// Test read-only getPrereqs performance for various number of command queue
// entries. See #13627 where a previous implementation of
// CommandQueue.getOverlaps had O(n) performance in this setup. Since reads
// do not wait on other reads, expected performance is O(1).
for _, size := range []int{1, 4, 16, 64, 128, 256} {
b.Run(fmt.Sprintf("size=%d", size), func(b *testing.B) {
var mu syncutil.Mutex
cq := NewCommandQueue(true)
spans := []roachpb.Span{{
Key: roachpb.Key("aaaaaaaaaa"),
@@ -823,7 +825,10 @@ func BenchmarkCommandQueueGetPrereqsAllReadOnly(b *testing.B) {

b.ResetTimer()
for i := 0; i < b.N; i++ {
_ = cq.getPrereqs(true, zeroTS, spans)
mu.Lock()
prereqs := cq.getPrereqs(true, zeroTS, spans)
cq.add(true, zeroTS, prereqs, spans)
mu.Unlock()
}
})
}
@@ -833,32 +838,40 @@ func BenchmarkCommandQueueReadWriteMix(b *testing.B) {
// Test performance with a mixture of reads and writes with a high number
// of reads per write.
// See #15544.
for _, readsPerWrite := range []int{1, 4, 16, 64, 128, 256} {
for _, readsPerWrite := range []int{0, 1, 4, 16, 64, 128, 256} {
b.Run(fmt.Sprintf("readsPerWrite=%d", readsPerWrite), func(b *testing.B) {
for i := 0; i < b.N; i++ {
totalCmds := 1 << 10
liveCmdQueue := make(chan *cmd, 16)
cq := NewCommandQueue(true /* coveringOptimization */)
for j := 0; j < totalCmds; j++ {
a, b := randBytes(100), randBytes(100)
// Overwrite first byte so that we do not mix local and global ranges
a[0], b[0] = 'a', 'a'
if bytes.Compare(a, b) > 0 {
a, b = b, a
}
spans := []roachpb.Span{{
Key: roachpb.Key(a),
EndKey: roachpb.Key(b),
}}
var cmd *cmd
readOnly := j%(readsPerWrite+1) != 0
prereqs := cq.getPrereqs(readOnly, zeroTS, spans)
cmd = cq.add(readOnly, zeroTS, prereqs, spans)
if len(liveCmdQueue) == cap(liveCmdQueue) {
cq.remove(<-liveCmdQueue)
}
liveCmdQueue <- cmd
var mu syncutil.Mutex
cq := NewCommandQueue(true /* coveringOptimization */)
liveCmdQueue := make(chan *cmd, 16)

spans := make([][]roachpb.Span, b.N)
for i := range spans {
a, b := randBytes(100), randBytes(100)
// Overwrite first byte so that we do not mix local and global ranges
a[0], b[0] = 'a', 'a'
if bytes.Compare(a, b) > 0 {
a, b = b, a
}
spans[i] = []roachpb.Span{{
Key: roachpb.Key(a),
EndKey: roachpb.Key(b),
}}
}

b.ResetTimer()
for i := range spans {
mu.Lock()
readOnly := i%(readsPerWrite+1) != 0
prereqs := cq.getPrereqs(readOnly, zeroTS, spans[i])
cmd := cq.add(readOnly, zeroTS, prereqs, spans[i])
mu.Unlock()

if len(liveCmdQueue) == cap(liveCmdQueue) {
mu.Lock()
cq.remove(<-liveCmdQueue)
mu.Unlock()
}
liveCmdQueue <- cmd
}
})
}
42 changes: 42 additions & 0 deletions pkg/storage/spanlatch/doc.go
@@ -0,0 +1,42 @@
// Copyright 2018 The Cockroach Authors.
//
// Licensed under the Apache License, Version 2.0 (the "License");
// you may not use this file except in compliance with the License.
// You may obtain a copy of the License at
//
// http://www.apache.org/licenses/LICENSE-2.0
//
// Unless required by applicable law or agreed to in writing, software
// distributed under the License is distributed on an "AS IS" BASIS,
// WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or
// implied. See the License for the specific language governing
// permissions and limitations under the License.

/*
Package spanlatch provides a latch management structure for serializing access
to keys and key ranges. Latch acquisitions affecting keys or key ranges must wait
on already-acquired latches which overlap their key range to be released.
The evolution of complexity can best be understood as a series of incremental
changes, each in the name of increased lock granularity to reduce contention and
enable more concurrency between requests. The structure can trace its lineage
back to a simple sync.Mutex. From there, the structure evolved through the
following progression:
* The structure began by enforcing strict mutual exclusion for access to any
keys. Conceptually, it was a sync.Mutex.
* Concurrent read-only access to keys and key ranges was permitted. Reads and
writes were serialized with each other, writes were serialized with each other,
but no ordering was enforced between reads. Conceptually, the structure became
a sync.RWMutex.
* The structure became key range-aware and concurrent access to non-overlapping
key ranges was permitted. Conceptually, the structure became an interval
tree of sync.RWMutexes.
* The structure became timestamp-aware and concurrent access of non-causal
read and write pairs was permitted. The effect of this was that reads no
longer waited for writes at higher timestamps and writes no longer waited
for reads at lower timestamps. Conceptually, the structure became an interval
tree of timestamp-aware sync.RWMutexes.
*/
package spanlatch
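
For the final, timestamp-aware step above, the wait rule it implies might look like the following sketch; the `latch` shape and `mustWait` helper are assumptions for illustration, not the package's API.

```go
package spanlatchsketch

import "github.com/cockroachdb/cockroach/pkg/util/hlc"

type latch struct {
	write bool
	ts    hlc.Timestamp
}

// mustWait reports whether a new latch must wait for an overlapping,
// already-held latch under the timestamp-aware rules sketched above.
func mustWait(waiter, holder latch) bool {
	switch {
	case !waiter.write && !holder.write:
		return false // reads never wait on reads
	case !waiter.write && holder.write:
		// A read waits only for writes at or below its timestamp; a
		// write at a higher timestamp is invisible to it.
		return !waiter.ts.Less(holder.ts)
	case waiter.write && !holder.write:
		// A write waits only for reads at or above its timestamp; it
		// cannot invalidate a read at a lower timestamp.
		return !holder.ts.Less(waiter.ts)
	default:
		return true // overlapping writes always serialize
	}
}
```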