Skip to content

Commit

Permalink
storage: optimize MVCCGarbageCollect
Browse files Browse the repository at this point in the history
Prior to this change, MVCCGarbageCollect performed a linear scan of all
versions of a key, not just the versions being garbage collected. Given
the pagination of deleting versions above this call, the linear behavior
can result in quadratic runtime of GC when the number of versions vastly
exceeds the page size. The benchmark results demonstrate the change's
effectiveness.

It's worth noting that for a single key with a single version, the change
has a negative performance impact. I suspect this is due to the allocation
of a key in order to construct the iterator. In cases involving more keys,
I theorize the positive change is due to the fact that now the iterator
is never seeked backwards due to the sorting of the keys. It's worth
noting that since 20.1, the GC queue has been sending keys in the GC
request in reverse order. I anticipate that this sorting is likely a good
thing in that case too.

The stepping optimization seemed important in the microbenchmarks for cases
where most of the data was garbage. Without it, the change had small negative
impact on performance.

```
name                                                                                                     old time/op  new time/op  delta
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24           4.69µs ± 0%  5.12µs ± 1%   +9.12%  (p=0.004 n=5+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24         342µs ± 1%   337µs ± 0%   -1.35%  (p=0.004 n=6+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24        348µs ± 2%   347µs ± 1%     ~     (p=0.699 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24        359µs ± 4%   359µs ± 4%     ~     (p=0.699 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24       572µs ± 2%   580µs ± 3%     ~     (p=0.132 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24      799µs ± 1%   811µs ± 3%     ~     (p=0.126 n=5+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24      793µs ± 0%   819µs ± 2%   +3.19%  (p=0.004 n=5+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24        3.38ms ± 2%  3.11ms ± 2%   -8.07%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24      433ms ± 2%   374ms ± 2%  -13.70%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24     443ms ± 1%   380ms ± 1%  -14.16%  (p=0.004 n=6+5)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24     449ms ± 0%   386ms ± 1%  -14.22%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24    727ms ± 2%   653ms ± 2%  -10.17%  (p=0.002 n=6+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24   1.00s ± 1%   0.95s ± 3%   -5.84%  (p=0.004 n=5+6)
MVCCGarbageCollect/rocksdb/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24   1.02s ± 2%   0.98s ±10%     ~     (p=0.180 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=2/deleteVersions=1-24            2.09µs ± 3%  2.42µs ± 1%  +15.56%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1-24          140µs ± 1%     5µs ± 1%  -96.78%  (p=0.010 n=6+4)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=16-24         142µs ± 1%     9µs ± 6%  -93.59%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=32-24         144µs ± 0%    14µs ± 2%  -90.50%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=512-24        216µs ± 1%   156µs ± 1%  -27.80%  (p=0.004 n=5+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1015-24       290µs ± 1%   303µs ± 1%   +4.52%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1/numVersions=1024/deleteVersions=1023-24       294µs ± 1%   304µs ± 2%   +3.24%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=2/deleteVersions=1-24         1.93ms ± 2%  1.70ms ± 2%  -11.80%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1-24       390ms ± 1%     7ms ±25%  -98.26%  (p=0.008 n=5+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=16-24      394ms ± 2%    20ms ±22%  -95.00%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=32-24      391ms ± 1%    38ms ± 7%  -90.38%  (p=0.004 n=6+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=512-24     782ms ± 1%   394ms ± 0%  -49.59%  (p=0.008 n=5+5)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1015-24    1.59s ± 4%   0.80s ± 4%  -49.65%  (p=0.002 n=6+6)
MVCCGarbageCollect/pebble/keySize=128/valSize=128/numKeys=1024/numVersions=1024/deleteVersions=1023-24    1.58s ± 2%   0.80s ± 2%  -49.23%  (p=0.002 n=6+6)
```

Release note (performance improvement): Improved the efficiency of garbage
collection when there are a large number of versions of a single key,
commonly found when utilizing sequences.
  • Loading branch information
ajwerner committed Jul 15, 2020
1 parent 989525a commit 8585416
Show file tree
Hide file tree
Showing 2 changed files with 310 additions and 33 deletions.
150 changes: 117 additions & 33 deletions pkg/storage/mvcc.go
Original file line number Diff line number Diff line change
Expand Up @@ -17,6 +17,7 @@ import (
"math"
"os"
"path/filepath"
"sort"
"sync"
"time"

Expand Down Expand Up @@ -3151,24 +3152,44 @@ func MVCCResolveWriteIntentRangeUsingIter(
// keys slice. The iterator is seeked in turn to each listed
// key, clearing all values with timestamps <= to expiration. The
// timestamp parameter is used to compute the intent age on GC.
//
// Note that this method will be sorting the keys.
func MVCCGarbageCollect(
ctx context.Context,
rw ReadWriter,
ms *enginepb.MVCCStats,
keys []roachpb.GCRequest_GCKey,
timestamp hlc.Timestamp,
) error {
// We're allowed to use a prefix iterator because we always Seek() the
// iterator when handling a new user key.
iter := rw.NewIterator(IterOptions{Prefix: true})
defer iter.Close()

var count int64
defer func(begin time.Time) {
log.Eventf(ctx, "done with GC evaluation for %d keys at %.2f keys/sec. Deleted %d entries",
len(keys), float64(len(keys))*1e9/float64(timeutil.Since(begin)), count)
}(timeutil.Now())

// If there are no keys then there is no work.
if len(keys) == 0 {
return nil
}

// Sort the slice to both determine the bounds and ensure that we're seeking
// in increasing order.
sort.Slice(keys, func(i, j int) bool {
iKey := MVCCKey{Key: keys[i].Key, Timestamp: keys[i].Timestamp}
jKey := MVCCKey{Key: keys[j].Key, Timestamp: keys[j].Timestamp}
return iKey.Less(jKey)
})

// Bound the iterator appropriately for the set of keys we'll be garbage
// collecting.
iter := rw.NewIterator(IterOptions{
LowerBound: keys[0].Key,
UpperBound: keys[len(keys)-1].Key.Next(),
})
defer iter.Close()
supportsPrev := iter.SupportsPrev()

// Iterate through specified GC keys.
meta := &enginepb.MVCCMetadata{}
for _, gcKey := range keys {
Expand Down Expand Up @@ -3222,19 +3243,84 @@ func MVCCGarbageCollect(
iter.Next()
}

// TODO(tschottdorf): Can't we just Seek() to a key with timestamp
// gcKey.Timestamp to avoid potentially cycling through a large prefix
// of versions we can't GC? The batching mechanism in the GC queue sends
// requests susceptible to that happening when there are lots of versions.
// A minor complication there will be that we need to know the first non-
// deletable value's timestamp (for prevNanos).

// Now, iterate through all values, GC'ing ones which have expired.
// For GCBytesAge, this requires keeping track of the previous key's
// timestamp (prevNanos). See ComputeStatsGo for a more easily digested
// and better commented version of this logic.

// and better commented version of this logic. The below block will set
// prevNanos to the appropriate value and position the iterator at the
// first garbage version.
prevNanos := timestamp.WallTime
{

var foundPrevNanos bool
{
// If reverse iteration is supported (supportsPrev), we'll step the
// iterator a few time before attempting to seek.
var foundNextKey bool

// If there are a large number of versions which are not garbage,
// iterating through all of them is very inefficient. However, if there
// are few, SeekLT is inefficient. MVCCGarbageCollect will try to step
// the iterator a few times to find the predecessor of gcKey before
// resorting to seeking.
//
// In a synthetic benchmark where there is one version of garbage and
// one not, this optimization showed a 50% improvement. More
// importantly, this optimization mitigated the overhead of the Seek
// approach when almost all of the versions are garbage.
const nextsBeforeSeekLT = 4
for i := 0; !supportsPrev || i < nextsBeforeSeekLT; i++ {
if i > 0 {
iter.Next()
}
if ok, err := iter.Valid(); err != nil {
return err
} else if !ok {
foundNextKey = true
break
}
unsafeIterKey := iter.UnsafeKey()
if !unsafeIterKey.Key.Equal(encKey.Key) {
foundNextKey = true
break
}
if unsafeIterKey.Timestamp.LessEq(gcKey.Timestamp) {
foundPrevNanos = true
break
}
prevNanos = unsafeIterKey.Timestamp.WallTime
}

// We have nothing to GC for this key if we found the next key.
if foundNextKey {
continue
}
}

// Stepping with the iterator did not get us to our target garbage key or
// its predecessor. Seek to the predecessor to find the right value for
// prevNanos and position the iterator on the gcKey.
if !foundPrevNanos {
if !supportsPrev {
log.Fatalf(ctx, "failed to find first garbage key without"+
"support for reverse iteration")
}
gcKeyMVCC := MVCCKey{Key: gcKey.Key, Timestamp: gcKey.Timestamp}
iter.SeekLT(gcKeyMVCC)
if ok, err := iter.Valid(); err != nil {
return err
} else if ok {
// Use the previous version's timestamp if it's for this key.
if iter.UnsafeKey().Key.Equal(gcKey.Key) {
prevNanos = iter.UnsafeKey().Timestamp.WallTime
}
// Seek to the first version for deletion.
iter.Next()
}
}
}

// Iterate through the garbage versions, accumulating their stats and
// issuing clear operations.
for ; ; iter.Next() {
if ok, err := iter.Valid(); err != nil {
return err
Expand All @@ -3248,26 +3334,24 @@ func MVCCGarbageCollect(
if !unsafeIterKey.IsValue() {
break
}
if unsafeIterKey.Timestamp.LessEq(gcKey.Timestamp) {
if ms != nil {
// FIXME: use prevNanos instead of unsafeIterKey.Timestamp, except
// when it's a deletion.
valSize := int64(len(iter.UnsafeValue()))

// A non-deletion becomes non-live when its newer neighbor shows up.
// A deletion tombstone becomes non-live right when it is created.
fromNS := prevNanos
if valSize == 0 {
fromNS = unsafeIterKey.Timestamp.WallTime
}

ms.Add(updateStatsOnGC(gcKey.Key, MVCCVersionTimestampSize,
valSize, nil, fromNS))
}
count++
if err := rw.Clear(unsafeIterKey); err != nil {
return err
if ms != nil {
// FIXME: use prevNanos instead of unsafeIterKey.Timestamp, except
// when it's a deletion.
valSize := int64(len(iter.UnsafeValue()))

// A non-deletion becomes non-live when its newer neighbor shows up.
// A deletion tombstone becomes non-live right when it is created.
fromNS := prevNanos
if valSize == 0 {
fromNS = unsafeIterKey.Timestamp.WallTime
}

ms.Add(updateStatsOnGC(gcKey.Key, MVCCVersionTimestampSize,
valSize, nil, fromNS))
}
count++
if err := rw.Clear(unsafeIterKey); err != nil {
return err
}
prevNanos = unsafeIterKey.Timestamp.WallTime
}
Expand Down
193 changes: 193 additions & 0 deletions pkg/storage/mvcc_test.go
Original file line number Diff line number Diff line change
Expand Up @@ -5018,6 +5018,199 @@ func TestMVCCGarbageCollectIntent(t *testing.T) {
}
}

// readWriterReturningSeekLTTrackingIterator is used in a test to inject errors
// and ensure that SeekLT is returned an appropriate number of times.
type readWriterReturningSeekLTTrackingIterator struct {
it seekLTTrackingIterator
ReadWriter
}

// NewIterator injects a seekLTTrackingIterator over the engine's real iterator.
func (rw *readWriterReturningSeekLTTrackingIterator) NewIterator(opts IterOptions) Iterator {
rw.it.Iterator = rw.ReadWriter.NewIterator(opts)
return &rw.it
}

// seekLTTrackingIterator is used to determine the number of times seekLT is
// called.
type seekLTTrackingIterator struct {
seekLTCalled int
Iterator
}

func (it *seekLTTrackingIterator) SeekLT(k MVCCKey) {
it.seekLTCalled++
it.Iterator.SeekLT(k)
}

// TestMVCCGarbageCollectUsesSeekLTAppropriately ensures that the garbage
// collection only utilizes SeekLT if there are enough undeleted versions.
func TestMVCCGarbageCollectUsesSeekLTAppropriately(t *testing.T) {
defer leaktest.AfterTest(t)()

type testCaseKey struct {
key string
timestamps []int
gcTimestamp int
expSeekLT bool
}
type testCase struct {
name string
keys []testCaseKey
}
bytes := []byte("value")
toHLC := func(seconds int) hlc.Timestamp {
return hlc.Timestamp{WallTime: (time.Duration(seconds) * time.Second).Nanoseconds()}
}
engineBatchIteratorSupportsPrev := func(engine Engine) bool {
batch := engine.NewBatch()
defer batch.Close()
it := batch.NewIterator(IterOptions{
UpperBound: keys.UserTableDataMin,
LowerBound: keys.MaxKey,
})
defer it.Close()
return it.SupportsPrev()
}
runTestCase := func(t *testing.T, tc testCase, engine Engine) {
ctx := context.Background()
ms := &enginepb.MVCCStats{}
for _, key := range tc.keys {
for _, seconds := range key.timestamps {
val := roachpb.MakeValueFromBytes(bytes)
ts := toHLC(seconds)
if err := MVCCPut(
ctx, engine, ms, roachpb.Key(key.key), ts, val, nil,
); err != nil {
t.Fatal(err)
}
}
}

supportsPrev := engineBatchIteratorSupportsPrev(engine)

var keys []roachpb.GCRequest_GCKey
var expectedSeekLTs int
for _, key := range tc.keys {
keys = append(keys, roachpb.GCRequest_GCKey{
Key: roachpb.Key(key.key),
Timestamp: toHLC(key.gcTimestamp),
})
if supportsPrev && key.expSeekLT {
expectedSeekLTs++
}
}

batch := engine.NewBatch()
defer batch.Close()
rw := readWriterReturningSeekLTTrackingIterator{ReadWriter: batch}

require.NoError(t, MVCCGarbageCollect(ctx, &rw, ms, keys, toHLC(10)))
require.Equal(t, expectedSeekLTs, rw.it.seekLTCalled)
}
cases := []testCase{
{
name: "basic",
keys: []testCaseKey{
{
key: "a",
timestamps: []int{1, 2},
gcTimestamp: 1,
},
{
key: "b",
timestamps: []int{1, 2, 3},
gcTimestamp: 1,
},
{
key: "c",
timestamps: []int{1, 2, 3, 4},
gcTimestamp: 1,
},
{
key: "d",
timestamps: []int{1, 2, 3, 4, 5},
gcTimestamp: 1,
expSeekLT: true,
},
{
key: "e",
timestamps: []int{1, 2, 3, 4, 5, 6, 7, 8, 9},
gcTimestamp: 1,
expSeekLT: true,
},
{
key: "f",
timestamps: []int{1, 2, 3, 4, 5, 6, 7, 8, 9},
gcTimestamp: 6,
expSeekLT: false,
},
},
},
{
name: "SeekLT to the end",
keys: []testCaseKey{
{
key: "ee",
timestamps: []int{2, 3, 4, 5, 6, 7, 8, 9},
gcTimestamp: 1,
expSeekLT: true,
},
},
},
{
name: "Next to the end",
keys: []testCaseKey{
{
key: "eee",
timestamps: []int{8, 9},
gcTimestamp: 1,
expSeekLT: false,
},
},
},
{
name: "Next to the next key",
keys: []testCaseKey{
{
key: "eeee",
timestamps: []int{8, 9},
gcTimestamp: 1,
expSeekLT: false,
},
{
key: "eeeee",
timestamps: []int{8, 9},
gcTimestamp: 1,
expSeekLT: false,
},
},
},
{
name: "Next to the end on the first version",
keys: []testCaseKey{
{
key: "h",
timestamps: []int{9},
gcTimestamp: 1,
expSeekLT: false,
},
},
},
}
for _, engineImpl := range mvccEngineImpls {
t.Run(engineImpl.name, func(t *testing.T) {
for _, tc := range cases {
t.Run(tc.name, func(t *testing.T) {
engine := engineImpl.create()
defer engine.Close()
runTestCase(t, tc, engine)
})
}
})
}
}

// TestResolveIntentWithLowerEpoch verifies that trying to resolve
// an intent at an epoch that is lower than the epoch of the intent
// leaves the intent untouched.
Expand Down

0 comments on commit 8585416

Please sign in to comment.