-
Notifications
You must be signed in to change notification settings - Fork 3.8k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
OOM occurred while running replica GC old version #42531
Comments
I try to support a PR to fix this bug, which I had do and test in my environment.
|
Yes, this is known. See #17229. As for the OOM, I think I understand the problem but I'd like to understand how it occurred as I suspect there are easier solutions than your span-oriented GC request. That being said I admire your tenacity in attempting to fix the problem. As I read the code, it seems that we load every version of each individual key into RAM. This seems problematic in the case where there is a very large amount of data underneath a single key. That situation also happens to be exactly the case where a range becomes unsplittable. All that being said, you should receive write backpressure when a range exceeds Did you mess with the range size settings or the backpressure? Generally if there are just kilobytes of a single key in terms of versions I don't see how you would run of out RAM during GC. Is your machine extremely memory constrained? I am happy you filed this issue. I'm amenable to (though not enthusiastic about) the idea of using ClearRange to clear many versions of a key. I find #42640 in its current state to be more invasive than it needs to be. As a first step I'd look to paginate the individual versions of a key during the GC process rather than jumping straight to ClearRange. I'd also like to understand how the backpressure mechanisms we have failed to prevent the OOM you encountered before moving forward with any sort of code change. |
We use docker to limit the CPU and memory. In general, we use 4 CPU and 8GB of memory by default. In our usage scenario, if I keep writing a table with self-incrementing columns, or if I keep updating a row, then A large number of historical versions will be generated in the table that maintains the self-added column ID, but the fragment of this table cannot be split because there is only one valid key, and the original mechanism will actually generate a write error due to backpressure, but this kind of The scene is very common on our side. Based on business considerations, we must properly correct the logic of the split-related source code. When we identify whether we need to split, we only find that the fragment has only one valid key (range size > 64MB). |
The version we are using is V19.1. I used to use another solution to solve the problem of OOM. I used the bucket. With the iteration, I will only keep a limited bucket, and each bucket is small. 1MB, up to 10 buckets, so only the last 10MB of data will be processed, but the test shows that the GC is too slow, because it requires multiple GC schedules, iterates through each time, takes a lot of time, and is easy Cause timeout, in the event of a timeout will encounter another tragedy, that is, there will never be a chance for the GC history version, because the future data will be more and more, vicious circle. |
We are also worried that the introduction of range delete in the GC has some unpredictable risks, so we submitted ISSUE and PR, and we hope to get community discussion and guidance. |
Oh! I finally now understand that this is caused by our SQL sequence type! The code today simply increments the value on a single key. I suspect there are better implementations. There are at least implementations which make different tradeoffs. One such might me to keep sequences as a single row table where the primary key gets incremented every time, deleting the old row. This will lead to much saner GC behavior. The downside is that it can't be done in a single request and becomes quite exposed to retries. Perhaps we could add lower level support to make incrementing cheaper. A hybrid option might to only store so many underneath a given key but need to do an initial read to figure out the key and optimistically assume it doesn't change. Maybe every 16k or something. Another trick would be to dramatically turn down the GC TTL for the sequence table. What do you have it set to?
:) I was just looking into timeouts from the queues today... #42686 I can easily extend that change to include a knob to control the timeout of the GC queue processing.
I don't think I understand this part. Say you did one 10MB batch of versions. You're saying that if you did that continuously until they were all gone, you wouldn't be able to keep up? Or maybe you would be if you hadn't accumulated so much garbage in the sequence table? Short term I think we should do the following:
At a certain size a clear range might be more efficient but I'd generally be pretty stunned if the GC logic couldn't keep up with increments if the GC TTL was set low enough. It's possible that we should default the GC TTL for sequences to a low value though I can see reasons not to. As I come to better understand you use case and how the above steps might be insufficient then we can talk about more ideas. |
The use of buckets means this. For example, if there is a 1GB outdated version that requires GC, I only keep the last 10MB of data processing in the memory. The use of the bucket allows me to discard the data generated by the previous iteration. It is not the key to use the bucket. The key is that in order to avoid OOM, for each key, a GC process will only delete up to 10MB of data, so that a round of GC can not complete all the overdue data cleanup, must wait for the next round, so that the overall need for multiple rounds of scheduling Finished, every RunGC needs to traverse the entire range from scratch, causing duplicates. |
we can set the sequence table GC TTL, but this is time consuming and laborious. Every time you create a table, you need extra work. We hope to avoid this kind of manual intervention. It is often easy to make mistakes. |
If we can solve the problem of the sequence table GC, many times we don't have to keep so many versions. In this case, the probability of the problem we are discussing is very low, only extreme cases. |
Can we keep a mechanism like Fast GC, when users encounter similar extreme situations, users can quickly deal with it through this special channel, because this deteriorating scene is difficult to improve, we have tested and ignored GC timeout It took a day to use the original GC logic, but the effect is not ideal, the range size is still very large, if the business is still written, then this situation will not get better, because each time the GC costs a lot Iterate over all the keys, but only delete the last 10MB (for a key). |
I think that you misunderstand what I mean by pagination. The idea is not to allow the RunGC method to return until all of the garbage has been removed. The RunGC method already subdivides it’s processing into several requests if the data is due to multiple keys. I’m not suggesting only clearing 10MB per gc queue process cycle. I’m suggesting clearing as much as possible during a gc queue cycle but with 10 MB worth of data scanned at a time. The iterator state can be preserved avoiding a rescan. I agree that if you could only clear 10MB per gc process it would be a problem. That problem would be made much worse if the gc ttl is 25 hours.
It sounds like providing a mechanism to have a short default gc ttl for sequences is a good idea and seems like it would have prevented your problem in the first place. |
That “special channel” is something I actively would like to avoid. I get your point that ClearRange can be a more efficient way to delete many versions of the same key. Let’s see about making the existing per-key delete logic more robust to multiple versions before we go and do anything extreme. |
I will also conduct more tests and attempts to fix this problem. The ideas and principles you provided give a new perspective to solve the problem, thank you very much |
This commit reworks the processing of replicated state underneath the gcQueue for the purpose of determining and sending GC requests. The primary intention of this commit is to remove the need to buffer all of the versions of a key in memory. As we learned in cockroachdb#42531, this bufferring can be extremely unfortunate when using sequence data types which are written to frequently. Prior to this commit, the code forward iterates through the range's data and eagerly reads all versions of the a key into memory. It then binary searches those versions to find the latest timestamp for the key which can be GC'd. It then reverse iterates through all of those versions to determine the latest version of the key which would put the current batch over its limit. This last step works to paginate the process of actually deleting the data for many versions of the same key. I suppose this pagination was added to ensure that write batches due to GC requests don't get too large. Unfortunately this logic was unable to paginate the loading of versions from the storage engine. In this new commit, the entire process of computing data to GC now uses reverse iteration; for each key we examine versions from oldest to newest. The commit adds a `gcIterator` which wraps this reverse iteration with some useful lookahead. During this GC process, at most two additional versions need to examined to determine whether a given version is garbage. There is some concern that this reverse iteration is slower and less efficient than the forward iteration which preceded it. I'm not too concerned by this cost as I expect the IO required to load the data to dominate computational overhead of this process. That being said, benchmarks are probably a good idea. TODO(ajwerner): Add more and better testing that exercises the pagination that motivated this change. TODO(ajwerner): Add benchmarks that explore the cost of reverse iteration. Release note (bug fix): The GC process was improved to paginate the key versions of a key to fix OOM crashes which can occur when there are extremely large numbers of versions for a given key.
This commit reworks the processing of replicated state underneath the gcQueue for the purpose of determining and sending GC requests. The primary intention of this commit is to remove the need to buffer all of the versions of a key in memory. As we learned in cockroachdb#42531, this bufferring can be extremely unfortunate when using sequence data types which are written to frequently. Prior to this commit, the code forward iterates through the range's data and eagerly reads all versions of the a key into memory. It then binary searches those versions to find the latest timestamp for the key which can be GC'd. It then reverse iterates through all of those versions to determine the latest version of the key which would put the current batch over its limit. This last step works to paginate the process of actually deleting the data for many versions of the same key. I suppose this pagination was added to ensure that write batches due to GC requests don't get too large. Unfortunately this logic was unable to paginate the loading of versions from the storage engine. In this new commit, the entire process of computing data to GC now uses reverse iteration; for each key we examine versions from oldest to newest. The commit adds a `gcIterator` which wraps this reverse iteration with some useful lookahead. During this GC process, at most two additional versions need to examined to determine whether a given version is garbage. While this approach relies on reverse iteration which is known to be less efficient than forward iteration, it offers the opportunity to avoid allocating memory for versions of a key which are not going to end up as a part of a GC request. This reduction in memory usage shows up in benchmarks (see below). The change retains the old implementation as a testing strategy and as a basis for the benchmarks. ``` name old time/op new time/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 924ns ± 8% 590ns ± 1% -36.13% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 976ns ± 2% 578ns ± 1% -40.75% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 944ns ± 0% 570ns ± 9% -39.63% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 903ns ± 0% 612ns ± 3% -32.29% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 994ns ± 9% 592ns ± 9% -40.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 669ns ± 4% 526ns ± 1% -21.34% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 624ns ± 0% 529ns ± 2% -15.16% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 636ns ± 4% 534ns ± 2% -16.04% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 612ns ± 1% 532ns ± 3% -13.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 638ns ± 2% 534ns ±10% -16.35% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 603ns ± 6% 527ns ± 8% -12.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 613ns ± 5% 517ns ± 6% -15.78% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 619ns ± 6% 534ns ± 4% -13.61% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 607ns ± 7% 520ns ± 2% -14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 599ns ± 4% 501ns ± 7% -16.36% (p=0.008 n=5+5) name old speed new speed delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 23.9MB/s ± 8% 37.3MB/s ± 1% +56.23% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 22.6MB/s ± 2% 38.1MB/s ± 1% +68.81% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 23.3MB/s ± 0% 38.7MB/s ± 9% +66.06% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 24.4MB/s ± 0% 36.0MB/s ± 3% +47.73% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 22.2MB/s ± 8% 37.3MB/s ± 9% +68.09% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 34.4MB/s ± 4% 43.7MB/s ± 1% +27.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 36.9MB/s ± 0% 43.4MB/s ± 2% +17.84% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 36.2MB/s ± 4% 43.1MB/s ± 2% +19.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 37.6MB/s ± 1% 43.3MB/s ± 3% +15.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.0MB/s ± 2% 43.2MB/s ±10% +19.87% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 36.5MB/s ± 5% 41.8MB/s ± 9% +14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 35.9MB/s ± 5% 42.7MB/s ± 6% +18.83% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 35.6MB/s ± 6% 41.2MB/s ± 4% +15.66% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 36.3MB/s ± 6% 42.3MB/s ± 2% +16.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.7MB/s ± 4% 44.0MB/s ± 7% +19.69% (p=0.008 n=5+5) name old alloc/op new alloc/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 325B ± 0% 76B ± 0% -76.62% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 358B ± 0% 49B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 340B ± 0% 29B ± 0% -91.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 328B ± 0% 18B ± 0% -94.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 325B ± 0% 14B ± 0% -95.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 226B ± 0% 2B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 228B ± 0% 3B ± 0% -98.69% (p=0.000 n=5+4) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 226B ± 0% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 388B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 389B ± 1% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 390B ± 1% 0B -100.00% (p=0.008 n=5+5) name old allocs/op new allocs/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) ``` Release note (bug fix): The GC process was improved to paginate the key versions of a key to fix OOM crashes which can occur when there are extremely large numbers of versions for a given key.
This commit reworks the processing of replicated state underneath the gcQueue for the purpose of determining and sending GC requests. The primary intention of this commit is to remove the need to buffer all of the versions of a key in memory. As we learned in cockroachdb#42531, this bufferring can be extremely unfortunate when using sequence data types which are written to frequently. Prior to this commit, the code forward iterates through the range's data and eagerly reads all versions of the a key into memory. It then binary searches those versions to find the latest timestamp for the key which can be GC'd. It then reverse iterates through all of those versions to determine the latest version of the key which would put the current batch over its limit. This last step works to paginate the process of actually deleting the data for many versions of the same key. I suppose this pagination was added to ensure that write batches due to GC requests don't get too large. Unfortunately this logic was unable to paginate the loading of versions from the storage engine. In this new commit, the entire process of computing data to GC now uses reverse iteration; for each key we examine versions from oldest to newest. The commit adds a `gcIterator` which wraps this reverse iteration with some useful lookahead. During this GC process, at most two additional versions need to examined to determine whether a given version is garbage. While this approach relies on reverse iteration which is known to be less efficient than forward iteration, it offers the opportunity to avoid allocating memory for versions of a key which are not going to end up as a part of a GC request. This reduction in memory usage shows up in benchmarks (see below). The change retains the old implementation as a testing strategy and as a basis for the benchmarks. ``` name old time/op new time/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 924ns ± 8% 590ns ± 1% -36.13% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 976ns ± 2% 578ns ± 1% -40.75% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 944ns ± 0% 570ns ± 9% -39.63% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 903ns ± 0% 612ns ± 3% -32.29% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 994ns ± 9% 592ns ± 9% -40.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 669ns ± 4% 526ns ± 1% -21.34% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 624ns ± 0% 529ns ± 2% -15.16% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 636ns ± 4% 534ns ± 2% -16.04% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 612ns ± 1% 532ns ± 3% -13.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 638ns ± 2% 534ns ±10% -16.35% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 603ns ± 6% 527ns ± 8% -12.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 613ns ± 5% 517ns ± 6% -15.78% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 619ns ± 6% 534ns ± 4% -13.61% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 607ns ± 7% 520ns ± 2% -14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 599ns ± 4% 501ns ± 7% -16.36% (p=0.008 n=5+5) name old speed new speed delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 23.9MB/s ± 8% 37.3MB/s ± 1% +56.23% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 22.6MB/s ± 2% 38.1MB/s ± 1% +68.81% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 23.3MB/s ± 0% 38.7MB/s ± 9% +66.06% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 24.4MB/s ± 0% 36.0MB/s ± 3% +47.73% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 22.2MB/s ± 8% 37.3MB/s ± 9% +68.09% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 34.4MB/s ± 4% 43.7MB/s ± 1% +27.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 36.9MB/s ± 0% 43.4MB/s ± 2% +17.84% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 36.2MB/s ± 4% 43.1MB/s ± 2% +19.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 37.6MB/s ± 1% 43.3MB/s ± 3% +15.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.0MB/s ± 2% 43.2MB/s ±10% +19.87% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 36.5MB/s ± 5% 41.8MB/s ± 9% +14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 35.9MB/s ± 5% 42.7MB/s ± 6% +18.83% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 35.6MB/s ± 6% 41.2MB/s ± 4% +15.66% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 36.3MB/s ± 6% 42.3MB/s ± 2% +16.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.7MB/s ± 4% 44.0MB/s ± 7% +19.69% (p=0.008 n=5+5) name old alloc/op new alloc/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 325B ± 0% 76B ± 0% -76.62% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 358B ± 0% 49B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 340B ± 0% 29B ± 0% -91.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 328B ± 0% 18B ± 0% -94.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 325B ± 0% 14B ± 0% -95.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 226B ± 0% 2B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 228B ± 0% 3B ± 0% -98.69% (p=0.000 n=5+4) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 226B ± 0% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 388B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 389B ± 1% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 390B ± 1% 0B -100.00% (p=0.008 n=5+5) name old allocs/op new allocs/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) ``` Release note (bug fix): The GC process was improved to paginate the key versions of a key to fix OOM crashes which can occur when there are extremely large numbers of versions for a given key.
This commit reworks the processing of replicated state underneath the gcQueue for the purpose of determining and sending GC requests. The primary intention of this commit is to remove the need to buffer all of the versions of a key in memory. As we learned in cockroachdb#42531, this bufferring can be extremely unfortunate when using sequence data types which are written to frequently. Prior to this commit, the code forward iterates through the range's data and eagerly reads all versions of the a key into memory. It then binary searches those versions to find the latest timestamp for the key which can be GC'd. It then reverse iterates through all of those versions to determine the latest version of the key which would put the current batch over its limit. This last step works to paginate the process of actually deleting the data for many versions of the same key. I suppose this pagination was added to ensure that write batches due to GC requests don't get too large. Unfortunately this logic was unable to paginate the loading of versions from the storage engine. In this new commit, the entire process of computing data to GC now uses reverse iteration; for each key we examine versions from oldest to newest. The commit adds a `gcIterator` which wraps this reverse iteration with some useful lookahead. During this GC process, at most two additional versions need to examined to determine whether a given version is garbage. While this approach relies on reverse iteration which is known to be less efficient than forward iteration, it offers the opportunity to avoid allocating memory for versions of a key which are not going to end up as a part of a GC request. This reduction in memory usage shows up in benchmarks (see below). The change retains the old implementation as a testing strategy and as a basis for the benchmarks. ``` name old time/op new time/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 924ns ± 8% 590ns ± 1% -36.13% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 976ns ± 2% 578ns ± 1% -40.75% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 944ns ± 0% 570ns ± 9% -39.63% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 903ns ± 0% 612ns ± 3% -32.29% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 994ns ± 9% 592ns ± 9% -40.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 669ns ± 4% 526ns ± 1% -21.34% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 624ns ± 0% 529ns ± 2% -15.16% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 636ns ± 4% 534ns ± 2% -16.04% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 612ns ± 1% 532ns ± 3% -13.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 638ns ± 2% 534ns ±10% -16.35% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 603ns ± 6% 527ns ± 8% -12.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 613ns ± 5% 517ns ± 6% -15.78% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 619ns ± 6% 534ns ± 4% -13.61% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 607ns ± 7% 520ns ± 2% -14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 599ns ± 4% 501ns ± 7% -16.36% (p=0.008 n=5+5) name old speed new speed delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 23.9MB/s ± 8% 37.3MB/s ± 1% +56.23% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 22.6MB/s ± 2% 38.1MB/s ± 1% +68.81% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 23.3MB/s ± 0% 38.7MB/s ± 9% +66.06% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 24.4MB/s ± 0% 36.0MB/s ± 3% +47.73% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 22.2MB/s ± 8% 37.3MB/s ± 9% +68.09% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 34.4MB/s ± 4% 43.7MB/s ± 1% +27.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 36.9MB/s ± 0% 43.4MB/s ± 2% +17.84% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 36.2MB/s ± 4% 43.1MB/s ± 2% +19.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 37.6MB/s ± 1% 43.3MB/s ± 3% +15.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.0MB/s ± 2% 43.2MB/s ±10% +19.87% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 36.5MB/s ± 5% 41.8MB/s ± 9% +14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 35.9MB/s ± 5% 42.7MB/s ± 6% +18.83% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 35.6MB/s ± 6% 41.2MB/s ± 4% +15.66% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 36.3MB/s ± 6% 42.3MB/s ± 2% +16.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.7MB/s ± 4% 44.0MB/s ± 7% +19.69% (p=0.008 n=5+5) name old alloc/op new alloc/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 325B ± 0% 76B ± 0% -76.62% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 358B ± 0% 49B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 340B ± 0% 29B ± 0% -91.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 328B ± 0% 18B ± 0% -94.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 325B ± 0% 14B ± 0% -95.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 226B ± 0% 2B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 228B ± 0% 3B ± 0% -98.69% (p=0.000 n=5+4) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 226B ± 0% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 388B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 389B ± 1% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 390B ± 1% 0B -100.00% (p=0.008 n=5+5) name old allocs/op new allocs/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) ``` Release note (bug fix): The GC process was improved to paginate the key versions of a key to fix OOM crashes which can occur when there are extremely large numbers of versions for a given key.
This commit reworks the processing of replicated state underneath the gcQueue for the purpose of determining and sending GC requests. The primary intention of this commit is to remove the need to buffer all of the versions of a key in memory. As we learned in cockroachdb#42531, this bufferring can be extremely unfortunate when using sequence data types which are written to frequently. Prior to this commit, the code forward iterates through the range's data and eagerly reads all versions of the a key into memory. It then binary searches those versions to find the latest timestamp for the key which can be GC'd. It then reverse iterates through all of those versions to determine the latest version of the key which would put the current batch over its limit. This last step works to paginate the process of actually deleting the data for many versions of the same key. I suppose this pagination was added to ensure that write batches due to GC requests don't get too large. Unfortunately this logic was unable to paginate the loading of versions from the storage engine. In this new commit, the entire process of computing data to GC now uses reverse iteration; for each key we examine versions from oldest to newest. The commit adds a `gcIterator` which wraps this reverse iteration with some useful lookahead. During this GC process, at most two additional versions need to examined to determine whether a given version is garbage. While this approach relies on reverse iteration which is known to be less efficient than forward iteration, it offers the opportunity to avoid allocating memory for versions of a key which are not going to end up as a part of a GC request. This reduction in memory usage shows up in benchmarks (see below). The change retains the old implementation as a testing strategy and as a basis for the benchmarks. ``` name old time/op new time/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 924ns ± 8% 590ns ± 1% -36.13% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 976ns ± 2% 578ns ± 1% -40.75% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 944ns ± 0% 570ns ± 9% -39.63% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 903ns ± 0% 612ns ± 3% -32.29% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 994ns ± 9% 592ns ± 9% -40.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 669ns ± 4% 526ns ± 1% -21.34% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 624ns ± 0% 529ns ± 2% -15.16% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 636ns ± 4% 534ns ± 2% -16.04% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 612ns ± 1% 532ns ± 3% -13.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 638ns ± 2% 534ns ±10% -16.35% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 603ns ± 6% 527ns ± 8% -12.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 613ns ± 5% 517ns ± 6% -15.78% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 619ns ± 6% 534ns ± 4% -13.61% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 607ns ± 7% 520ns ± 2% -14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 599ns ± 4% 501ns ± 7% -16.36% (p=0.008 n=5+5) name old speed new speed delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 23.9MB/s ± 8% 37.3MB/s ± 1% +56.23% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 22.6MB/s ± 2% 38.1MB/s ± 1% +68.81% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 23.3MB/s ± 0% 38.7MB/s ± 9% +66.06% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 24.4MB/s ± 0% 36.0MB/s ± 3% +47.73% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 22.2MB/s ± 8% 37.3MB/s ± 9% +68.09% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 34.4MB/s ± 4% 43.7MB/s ± 1% +27.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 36.9MB/s ± 0% 43.4MB/s ± 2% +17.84% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 36.2MB/s ± 4% 43.1MB/s ± 2% +19.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 37.6MB/s ± 1% 43.3MB/s ± 3% +15.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.0MB/s ± 2% 43.2MB/s ±10% +19.87% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 36.5MB/s ± 5% 41.8MB/s ± 9% +14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 35.9MB/s ± 5% 42.7MB/s ± 6% +18.83% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 35.6MB/s ± 6% 41.2MB/s ± 4% +15.66% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 36.3MB/s ± 6% 42.3MB/s ± 2% +16.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.7MB/s ± 4% 44.0MB/s ± 7% +19.69% (p=0.008 n=5+5) name old alloc/op new alloc/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 325B ± 0% 76B ± 0% -76.62% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 358B ± 0% 49B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 340B ± 0% 29B ± 0% -91.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 328B ± 0% 18B ± 0% -94.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 325B ± 0% 14B ± 0% -95.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 226B ± 0% 2B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 228B ± 0% 3B ± 0% -98.69% (p=0.000 n=5+4) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 226B ± 0% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 388B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 389B ± 1% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 390B ± 1% 0B -100.00% (p=0.008 n=5+5) name old allocs/op new allocs/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) ``` Release note (bug fix): The GC process was improved to paginate the key versions of a key to fix OOM crashes which can occur when there are extremely large numbers of versions for a given key.
This commit reworks the processing of replicated state underneath the gcQueue for the purpose of determining and sending GC requests. The primary intention of this commit is to remove the need to buffer all of the versions of a key in memory. As we learned in cockroachdb#42531, this bufferring can be extremely unfortunate when using sequence data types which are written to frequently. Prior to this commit, the code forward iterates through the range's data and eagerly reads all versions of the a key into memory. It then binary searches those versions to find the latest timestamp for the key which can be GC'd. It then reverse iterates through all of those versions to determine the latest version of the key which would put the current batch over its limit. This last step works to paginate the process of actually deleting the data for many versions of the same key. I suppose this pagination was added to ensure that write batches due to GC requests don't get too large. Unfortunately this logic was unable to paginate the loading of versions from the storage engine. In this new commit, the entire process of computing data to GC now uses reverse iteration; for each key we examine versions from oldest to newest. The commit adds a `gcIterator` which wraps this reverse iteration with some useful lookahead. During this GC process, at most two additional versions need to examined to determine whether a given version is garbage. While this approach relies on reverse iteration which is known to be less efficient than forward iteration, it offers the opportunity to avoid allocating memory for versions of a key which are not going to end up as a part of a GC request. This reduction in memory usage shows up in benchmarks (see below). The change retains the old implementation as a testing strategy and as a basis for the benchmarks. ``` name old time/op new time/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 924ns ± 8% 590ns ± 1% -36.13% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 976ns ± 2% 578ns ± 1% -40.75% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 944ns ± 0% 570ns ± 9% -39.63% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 903ns ± 0% 612ns ± 3% -32.29% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 994ns ± 9% 592ns ± 9% -40.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 669ns ± 4% 526ns ± 1% -21.34% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 624ns ± 0% 529ns ± 2% -15.16% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 636ns ± 4% 534ns ± 2% -16.04% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 612ns ± 1% 532ns ± 3% -13.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 638ns ± 2% 534ns ±10% -16.35% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 603ns ± 6% 527ns ± 8% -12.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 613ns ± 5% 517ns ± 6% -15.78% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 619ns ± 6% 534ns ± 4% -13.61% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 607ns ± 7% 520ns ± 2% -14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 599ns ± 4% 501ns ± 7% -16.36% (p=0.008 n=5+5) name old speed new speed delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 23.9MB/s ± 8% 37.3MB/s ± 1% +56.23% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 22.6MB/s ± 2% 38.1MB/s ± 1% +68.81% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 23.3MB/s ± 0% 38.7MB/s ± 9% +66.06% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 24.4MB/s ± 0% 36.0MB/s ± 3% +47.73% (p=0.016 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 22.2MB/s ± 8% 37.3MB/s ± 9% +68.09% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 34.4MB/s ± 4% 43.7MB/s ± 1% +27.08% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 36.9MB/s ± 0% 43.4MB/s ± 2% +17.84% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 36.2MB/s ± 4% 43.1MB/s ± 2% +19.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 37.6MB/s ± 1% 43.3MB/s ± 3% +15.02% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.0MB/s ± 2% 43.2MB/s ±10% +19.87% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 36.5MB/s ± 5% 41.8MB/s ± 9% +14.39% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 35.9MB/s ± 5% 42.7MB/s ± 6% +18.83% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 35.6MB/s ± 6% 41.2MB/s ± 4% +15.66% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 36.3MB/s ± 6% 42.3MB/s ± 2% +16.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 36.7MB/s ± 4% 44.0MB/s ± 7% +19.69% (p=0.008 n=5+5) name old alloc/op new alloc/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 325B ± 0% 76B ± 0% -76.62% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 358B ± 0% 49B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 340B ± 0% 29B ± 0% -91.47% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 328B ± 0% 18B ± 0% -94.51% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 325B ± 0% 14B ± 0% -95.69% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 226B ± 0% 2B ± 0% ~ (p=0.079 n=4+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 228B ± 0% 3B ± 0% -98.69% (p=0.000 n=5+4) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 228B ± 0% 2B ± 0% -99.12% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 226B ± 0% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 388B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 389B ± 1% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 391B ± 2% 0B -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 390B ± 1% 0B -100.00% (p=0.008 n=5+5) name old allocs/op new allocs/op delta Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#01-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#02-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#03-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[2,3],valueLen=[1,1],keysPerValue=[1,2],deleteFrac=0.000000,intentFrac=0.100000#04-8 4.00 ± 0% 0.00 -100.00% (p=0.008 n=5+5) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1,100],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#01-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#02-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#03-8 0.00 0.00 ~ (all equal) Run/ts=[0,100],keySuffix=[8,8],valueLen=[8,16],keysPerValue=[1000,1000000],deleteFrac=0.100000,intentFrac=0.100000#04-8 0.00 0.00 ~ (all equal) ``` Release note (bug fix): The GC process was improved to paginate the key versions of a key to fix OOM crashes which can occur when there are extremely large numbers of versions for a given key.
Describe the problem
cockroach data node crash due to OOM when a replica running GC.
To Reproduce
What did you do? Describe in your own words.
I load data to a table(load like batch insert a large number of rows ) which has auto-increment, after 25hours later, the replica running GC old version for the table, but there are too many version, and when running GC we copy all version into memory. I known auto-increment is not recommended in cockroach, but I think it is a bug.
And I found another problem, when a key has to many version to GC(more than 256KB), it will Increase normal read and write request latency in that key since the latch.
If possible, provide steps to reproduce the behavior:
Expected behavior
A clear and concise description of what you expected to happen.
Additional data / screenshots
If the problem is SQL-related, include a copy of the SQL query and the schema
of the supporting tables.
If a node in your cluster encountered a fatal error, supply the contents of the
log directories (at minimum of the affected node(s), but preferably all nodes).
Note that log files can contain confidential information. Please continue
creating this issue, but contact [email protected] to submit the log
files in private.
If applicable, add screenshots to help explain your problem.
Environment:
cockroach sql
, JDBC, ...]Additional context
What was the impact?
Add any other context about the problem here.
The text was updated successfully, but these errors were encountered: