[DNM] release-19.2: storage/gc: create gc pkg, reverse iteration in rditer, paginate versions during GC #44257

Closed

ajwerner
Contributor

Backport of #43862 for a hot fix.

SeekLT is exclusive on the key being seeked. It should be allowed to SeekLT
to the end of a span. This commit makes that possible.
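
The exclusive semantics can be illustrated with a minimal sketch over a sorted slice of string keys. The `seekLT` helper below is hypothetical and only stands in for the engine iterator's method; it is not the code touched by this commit. It shows why seeking to a span's (exclusive) end key is meaningful: the iterator lands on the last key inside the span.

```go
package main

import (
	"fmt"
	"sort"
)

// seekLT returns the index of the largest key strictly less than target,
// or -1 if no such key exists. keys must be sorted in ascending order.
// This is a hypothetical stand-in for an iterator's SeekLT.
func seekLT(keys []string, target string) int {
	// sort.SearchStrings returns the first index whose key is >= target,
	// so the entry just before it is the largest key < target.
	return sort.SearchStrings(keys, target) - 1
}

func main() {
	keys := []string{"a", "b", "c", "d"}
	// For the span [b, d), the end key "d" is exclusive. Seeking to it
	// positions us on the last key that belongs to the span.
	i := seekLT(keys, "d")
	fmt.Println(keys[i]) // prints "c"
}
```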

Release note: None

Prior to this commit, the ReplicaDataIterator only permitted forward iteration.
This PR exposes the ability to iterate in reverse by adding a `Prev()` method
as well as a constructor option to seek the iterator to the end of the data
range.
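
As a rough illustration of the idea, the sketch below models an iterator that walks several sorted key ranges in either direction. All names here (`multiRangeIter`, `newIter`, `seekEnd`, `collect`) are assumptions made for the example and do not reflect the actual `rditer` API; in-memory string slices stand in for engine key ranges.

```go
package main

import "fmt"

// multiRangeIter walks a sequence of sorted key ranges as if they were one
// ordered keyspace. It is a hypothetical simplification of a replica data
// iterator.
type multiRangeIter struct {
	ranges [][]string
	r, i   int // current range index and position within that range
}

// newIter positions the iterator at the first key, or at the last key when
// seekEnd is true.
func newIter(ranges [][]string, seekEnd bool) *multiRangeIter {
	it := &multiRangeIter{ranges: ranges}
	if seekEnd {
		it.r = len(ranges) - 1
		if it.r >= 0 {
			it.i = len(ranges[it.r]) - 1
		}
		it.skipBackward()
	} else {
		it.skipForward()
	}
	return it
}

// skipForward advances past exhausted or empty ranges.
func (it *multiRangeIter) skipForward() {
	for it.r < len(it.ranges) && it.i >= len(it.ranges[it.r]) {
		it.r++
		it.i = 0
	}
}

// skipBackward retreats past exhausted or empty ranges.
func (it *multiRangeIter) skipBackward() {
	for it.r >= 0 && it.i < 0 {
		it.r--
		if it.r >= 0 {
			it.i = len(it.ranges[it.r]) - 1
		}
	}
}

func (it *multiRangeIter) Valid() bool { return it.r >= 0 && it.r < len(it.ranges) }
func (it *multiRangeIter) Key() string { return it.ranges[it.r][it.i] }
func (it *multiRangeIter) Next()       { it.i++; it.skipForward() }
func (it *multiRangeIter) Prev()       { it.i--; it.skipBackward() }

// collect drains an iterator in the requested direction.
func collect(ranges [][]string, reverse bool) []string {
	var out []string
	for it := newIter(ranges, reverse); it.Valid(); {
		out = append(out, it.Key())
		if reverse {
			it.Prev()
		} else {
			it.Next()
		}
	}
	return out
}

func main() {
	ranges := [][]string{{"a", "b"}, {}, {"c"}}
	fmt.Println(collect(ranges, false)) // prints [a b c]
	fmt.Println(collect(ranges, true))  // prints [c b a]
}
```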

Release note: None

This commit is just code movement. It moves the helper functions of `RunGC()`,
`processLocalKeyRange()` and `processAbortSpan()`, from above `RunGC()` to below it.

Release note: None

This commit moves the logic of RunGC as well as the engine.GC struct into a
separate subpackage. As of this commit that subpackage contains no testing
other than what existed for the engine.GC code.

The RunGC function is a reasonably well specified interface to separate
the logic of scanning a range and collecting the garbage from the gcQueue.

I was getting overwhelmed by testing boundaries and unit testing in general.

Release note: None

This commit reworks the processing of replicated state underneath the gcQueue
for the purpose of determining and sending GC requests. The primary intention
of this commit is to remove the need to buffer all of the versions of a key
in memory. As we learned in cockroachdb#42531, this buffering can be extremely
unfortunate when using sequence data types which are written to frequently.

Prior to this commit, the code forward iterates through the range's data and
eagerly reads all versions of a key into memory. It then binary searches
those versions to find the latest timestamp for the key which can be GC'd.
It then reverse iterates through all of those versions to determine the latest
version of the key which would put the current batch over its limit. This last
step works to paginate the process of actually deleting the data for many
versions of the same key. I suppose this pagination was added to ensure that
write batches due to GC requests don't get too large. Unfortunately this logic
was unable to paginate the loading of versions from the storage engine.
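
To make the memory cost concrete, here is a simplified sketch of the buffer-then-binary-search step described above. The `version` type, the integer timestamps, and `firstGarbageIndex` are hypothetical; the GC rule used (everything older than the newest version at or below the threshold is garbage, and that version itself is garbage only if it is a deletion) is an assumption about the policy, and intents are ignored. Note that the whole `versions` slice must already be in memory before the search can run.

```go
package main

import (
	"fmt"
	"sort"
)

// version is a hypothetical, simplified MVCC version of a single key.
type version struct {
	ts        int64 // timestamp; larger is newer
	tombstone bool  // true if this version is a deletion
}

// firstGarbageIndex takes all versions of one key, sorted newest first,
// and returns the index of the first version that may be collected.
// It returns len(versions) when nothing is garbage.
func firstGarbageIndex(versions []version, threshold int64) int {
	// Binary search for the newest version at or below the threshold.
	i := sort.Search(len(versions), func(i int) bool {
		return versions[i].ts <= threshold
	})
	if i == len(versions) {
		return i // every version is newer than the threshold
	}
	if versions[i].tombstone {
		return i // a deletion at or below the threshold is itself garbage
	}
	return i + 1 // keep the live version, collect everything older
}

func main() {
	versions := []version{{ts: 50}, {ts: 30}, {ts: 20}, {ts: 10}}
	fmt.Println(firstGarbageIndex(versions, 35)) // prints 2
}
```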

In this new commit, the entire process of computing data to GC now uses reverse
iteration; for each key we examine versions from oldest to newest. The commit
adds a `gcIterator` which wraps this reverse iteration with some useful
lookahead. During this GC process, at most two additional versions need to be
examined to determine whether a given version is garbage.
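
A minimal sketch of the oldest-to-newest decision follows. It is not the `gcIterator` from this commit: the `version` type and `garbageOldestFirst` are hypothetical, a slice stands in for the reverse engine iterator, intents are ignored, and only the single next-newer version is consulted. The assumed rule is that a version is garbage if the next newer version is also at or below the threshold (so it is shadowed), or if it is itself a deletion at or below the threshold.

```go
package main

import "fmt"

// version is a hypothetical, simplified MVCC version of a single key.
type version struct {
	ts        int64 // timestamp; larger is newer
	tombstone bool  // true if this version is a deletion
}

// garbageOldestFirst walks the versions of one key from oldest to newest,
// the order produced by reverse iteration, and returns the timestamps of
// the versions that may be collected. Each decision needs only the current
// version and a peek at the next newer one.
func garbageOldestFirst(versions []version, threshold int64) []int64 {
	var garbage []int64
	for i, cur := range versions {
		if cur.ts > threshold {
			break // this and all newer versions are above the threshold
		}
		// Lookahead: a newer version at or below the threshold shadows cur.
		shadowed := i+1 < len(versions) && versions[i+1].ts <= threshold
		if shadowed || cur.tombstone {
			garbage = append(garbage, cur.ts)
		}
	}
	return garbage
}

func main() {
	versions := []version{{ts: 10}, {ts: 20}, {ts: 30}, {ts: 50}}
	fmt.Println(garbageOldestFirst(versions, 35)) // prints [10 20]
}
```

Because the garbage set is produced incrementally, a caller could flush a batch whenever it grows past a size limit instead of holding every version of the key at once.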

While this approach relies on reverse iteration which is known to be less
efficient than forward iteration, it offers the opportunity to avoid allocating
memory for versions of a key which are not going to end up as a part of a GC
request. This reduction in memory usage shows up in benchmarks (see below).
The change retains the old implementation as a testing strategy and as a basis
for the benchmarks.

```
name                                                                                                                      old time/op    new time/op    delta
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8                  924ns ± 8%     590ns ± 1%   -36.13%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8               976ns ± 2%     578ns ± 1%   -40.75%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8               944ns ± 0%     570ns ± 9%   -39.63%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8               903ns ± 0%     612ns ± 3%   -32.29%  (p=0.016 n=4+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8               994ns ± 9%     592ns ± 9%   -40.47%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8               669ns ± 4%     526ns ± 1%   -21.34%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8            624ns ± 0%     529ns ± 2%   -15.16%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8            636ns ± 4%     534ns ± 2%   -16.04%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8            612ns ± 1%     532ns ± 3%   -13.08%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8            638ns ± 2%     534ns ±10%   -16.35%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8        603ns ± 6%     527ns ± 8%   -12.51%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8     613ns ± 5%     517ns ± 6%   -15.78%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8     619ns ± 6%     534ns ± 4%   -13.61%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8     607ns ± 7%     520ns ± 2%   -14.39%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8     599ns ± 4%     501ns ± 7%   -16.36%  (p=0.008 n=5+5)

name                                                                                                                      old speed      new speed      delta
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8               23.9MB/s ± 8%  37.3MB/s ± 1%   +56.23%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8            22.6MB/s ± 2%  38.1MB/s ± 1%   +68.81%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8            23.3MB/s ± 0%  38.7MB/s ± 9%   +66.06%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8            24.4MB/s ± 0%  36.0MB/s ± 3%   +47.73%  (p=0.016 n=4+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8            22.2MB/s ± 8%  37.3MB/s ± 9%   +68.09%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8            34.4MB/s ± 4%  43.7MB/s ± 1%   +27.08%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8         36.9MB/s ± 0%  43.4MB/s ± 2%   +17.84%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8         36.2MB/s ± 4%  43.1MB/s ± 2%   +19.02%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8         37.6MB/s ± 1%  43.3MB/s ± 3%   +15.02%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8         36.0MB/s ± 2%  43.2MB/s ±10%   +19.87%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8     36.5MB/s ± 5%  41.8MB/s ± 9%   +14.39%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8  35.9MB/s ± 5%  42.7MB/s ± 6%   +18.83%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8  35.6MB/s ± 6%  41.2MB/s ± 4%   +15.66%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8  36.3MB/s ± 6%  42.3MB/s ± 2%   +16.69%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8  36.7MB/s ± 4%  44.0MB/s ± 7%   +19.69%  (p=0.008 n=5+5)

name                                                                                                                      old alloc/op   new alloc/op   delta
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8                   325B ± 0%       76B ± 0%   -76.62%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8                358B ± 0%       49B ± 0%      ~     (p=0.079 n=4+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8                340B ± 0%       29B ± 0%   -91.47%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8                328B ± 0%       18B ± 0%   -94.51%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8                325B ± 0%       14B ± 0%   -95.69%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8                226B ± 0%        2B ± 0%      ~     (p=0.079 n=4+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8             228B ± 0%        3B ± 0%   -98.69%  (p=0.000 n=5+4)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8             228B ± 0%        2B ± 0%   -99.12%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8             228B ± 0%        2B ± 0%   -99.12%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8             226B ± 0%        0B       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8         388B ± 2%        0B       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8      391B ± 2%        0B       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8      389B ± 1%        0B       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8      391B ± 2%        0B       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8      390B ± 1%        0B       -100.00%  (p=0.008 n=5+5)

name                                                                                                                      old allocs/op  new allocs/op  delta
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8                   4.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8                4.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8                4.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8                4.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8                4.00 ± 0%      0.00       -100.00%  (p=0.008 n=5+5)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8                0.00           0.00           ~     (all equal)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8             0.00           0.00           ~     (all equal)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8             0.00           0.00           ~     (all equal)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8             0.00           0.00           ~     (all equal)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8             0.00           0.00           ~     (all equal)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8         0.00           0.00           ~     (all equal)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8      0.00           0.00           ~     (all equal)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8      0.00           0.00           ~     (all equal)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8      0.00           0.00           ~     (all equal)
Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8      0.00           0.00           ~     (all equal)
```

Release note (bug fix): The GC process was improved to paginate the key
versions of a key to fix OOM crashes which can occur when there are
extremely large numbers of versions for a given key.
@ajwerner ajwerner force-pushed the ajwerner/19.2-gc-fix branch from 3853b09 to cad6077 Compare January 23, 2020 02:12
@ajwerner ajwerner closed this Jan 28, 2020