
kvserver: make GC intent scoring more aggressive #65001

Merged 5 commits on Jul 7, 2021

Conversation

@erikgrinaker (Contributor) commented May 11, 2021

keys: rename RangeLastGCKey to RangeGCThresholdKey

The key RangeLastGCKey was not actually used to store the last GC
timestamp, but rather the last GC threshold. This commit renames the key
to the more appropriate RangeGCThresholdKey.

Release note: None

kvserver: increase GC queue timeout

The MVCC GC queue timeout was left at the default of 1 minute. Since the
GCThreshold is advanced and persisted at the very start of the GC run
(to coordinate with other processes such as exports and rangefeeds
before removing any data), this means that if a GC run takes more than 1
minute to complete it will be aborted due to the timeout and may not run
again for some time.

This patch increases the timeout to 10 minutes. The primary purpose of
the timeout is to avoid head-of-line blocking in the case of unavailable
ranges, and a larger timeout still satisfies this property while also
allowing more expensive cleanup attempts to complete.

Release note (ops change): increased the timeout for range MVCC
garbage collection from 1 minute to 10 minutes, to allow larger jobs to
run to completion.

kvserver: use PUSH_TOUCH during GC to push intent owners

Using PUSH_ABORT to push intent owners during intent GC can cause it
to block on (or possibly abort) long-running transactions.

Since we would prefer GC to continue with other intents rather than
waiting for active transactions, this patch changes the GC queue to use
PUSH_TOUCH instead. This will still clean up abandoned txns.

Resolves #65777.

Release note (bug fix): Intent garbage collection no longer waits for
or aborts running transactions.
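The behavioral difference can be sketched as follows (a toy model, not CockroachDB code; the status type stands in for the txn status a PUSH_TOUCH reports back):

```go
package main

import "fmt"

// txnStatus is a stand-in for the transaction status observed by a
// PUSH_TOUCH attempt (roughly PENDING / COMMITTED / ABORTED).
type txnStatus int

const (
	pending txnStatus = iota
	committed
	aborted
)

// resolveAfterTouch sketches the new GC behavior: a PUSH_TOUCH that
// finds a still-pending transaction leaves its intents alone rather
// than blocking on or aborting it; abandoned (finalized or expired)
// txns still get their intents resolved.
func resolveAfterTouch(status txnStatus) (resolveIntents bool) {
	switch status {
	case committed, aborted:
		return true // abandoned/finalized: safe to clean up intents
	default:
		return false // live long-running txn: skip, GC moves on
	}
}

func main() {
	fmt.Println(resolveAfterTouch(pending), resolveAfterTouch(aborted)) // false true
}
```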

kvserver: add GC queue cooldown timer

When considering replicas for MVCC GC, the GC queue primarily weighed
the previous GC threshold and the amount of MVCC garbage that would be
removed by advancing the threshold. We would like to also
trigger GC based solely on intent stats. However, we can't know whether
we'll actually be able to clean up any intents, since they may belong to
an in-progress long-running transaction. There is therefore a risk that
we will keep spinning on GC attempts due to high intent counts.

This patch records the last GC queue processing timestamp, and adds a
cooldown timer of 2 hours between each range GC attempt triggered by
intent score alone to prevent spinning on GC attempts.

Release note: None

kvserver: make GC intent scoring more aggressive

Users often experience buildup of intents due to the GC queue not being
aggressive enough in cleaning them up. Currently, the GC queue will only
trigger based on intents if the average intent age is at least 10 days
and there is also MVCC garbage that can be cleaned up.

This patch makes the intent scoring more aggressive, triggering GC when
the average intent age is 8 hours regardless of other MVCC garbage. The
previous commit added a cooldown timer to prevent the GC queue from
spinning on a replica if the intents couldn't be cleaned up (e.g.
because they belong to an in-progress long-running transaction).

Resolves #64266.
Touches #60585.

Release note (ops change): trigger MVCC and intent garbage collection
when the average intent age is 8 hours, down from 10 days.

/cc @cockroachdb/kv

@erikgrinaker erikgrinaker self-assigned this May 11, 2021
@cockroach-teamcity (Member)

This change is Reviewable

@nvanbenschoten (Member)

I think this makes sense as a first step towards addressing #64266.

@erikgrinaker (Contributor, Author)

Yeah, I'm just worried about continually spinning due to long-running txns, so would like to address that before merging this. The simplest solution I can think of is to add a cooldown timer, either to the GC queue or to the base queue.

@erikgrinaker erikgrinaker force-pushed the intent-gc branch 11 times, most recently from 26bccab to ee0ab27 Compare May 18, 2021 19:29
@erikgrinaker erikgrinaker marked this pull request as ready for review May 18, 2021 19:35
@erikgrinaker erikgrinaker requested review from a team and pbardea and removed request for a team May 18, 2021 19:35
@tbg (Member) left a comment

A few comments based only on my reading of the commit messages:

  • the cooldown timer of 2h seems like it could possibly be dangerous. If a range sees sustained overwrites, it needs regular GC or it will become too large and eventually will be backpressured. Regardless of the cooldown timer, I think we want to trigger GC of old values if the score says we should (that score is guaranteed to drop following a GC attempt).
  • while most of GC is local to the range, the pushing of txns is not, and so it can be affected by an unavailable range. We should be pushing with a reasonable sub-timeout (5s?) and make sure one unavailable txn record doesn't prevent queue progress where possible. We're possibly doing this already, not sure, just making sure we do it in the future. Upping the queue processing timeout (while the right thing) makes this a bit more high-stakes, since we don't want to spend 10 minutes doing nothing (& then possibly re-queueing the same range again). Not sure that there is a good way out in general. The cooldown timer will make sure things don't get delayed too much on other ranges, so it's hopefully ok in practice.
  • the commit message about GCThreshold being persisted at the beginning of the run could explain why that is still the right choice (if it even is).

It also renames RangeLastGCKey to RangeGCThresholdKey, since I
initially went down the path of a dedicated RangeLastGCTimestampKey to
track the last timestamp, and even though this approach was later
reverted the key rename was kept since there was a TODO for it
already.

I read this five times but still wondering what it means :-)

Reviewed 1 of 1 files at r1, 18 of 18 files at r2, 2 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nvanbenschoten, and @pbardea)

@erikgrinaker (Contributor, Author) left a comment

the cooldown timer of 2h seems like it could possibly be dangerous. If a range sees sustained overwrites, it needs regular GC or it will become too large and eventually will be backpressured. Regardless of the cooldown timer, I think we want to trigger GC of old values if the score says we should.

Good point, I'll change it to apply specifically for intent-triggered GC.

while most of GC is local to the range, the pushing of txns is not, and so it can be affected by an unavailable range.

Looks like you're right. For txns anchored on the range, it uses the intent resolver's async task pool, but pushing remote txns is synchronous and doesn't appear to have any timeout. I don't know what the expected latency variance is on these operations; is 5s really sufficient? Although I suppose if it fails, the txn would get GCed by its local range eventually anyway, so it doesn't seem critical to go through with these.

I read this five times but still wondering what it means :-)

😄 I suppose I should just skip the added background flavor, and maybe also split it out as a separate commit. Anyway, to stop you wondering: Before I discovered QueueLastProcessedKey, I wanted to add RangeLastGCKey to record the last GC run. Then I discovered that the key already existed, but it didn't actually store the last GC run, it stored the GC threshold. So I renamed it to RangeGCThresholdKey so that I could add the real RangeLastGCKey. After doing that, however, I found QueueLastProcessedKey -- this made RangeLastGCKey obsolete, so I removed it and its implementation. However, there was already a TODO to rename RangeLastGCKey to RangeGCThresholdKey, and since I'd already done that work I figured I might as well just keep that change -- even though it isn't needed for this PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nvanbenschoten, and @pbardea)

@erikgrinaker (Contributor, Author) left a comment

I've updated this with a few changes:

  • Split out the RangeLastGCKey rename to a separate commit.
  • Added a note when increasing the GC queue timeout that we should wait for kv: split up intents into batches during gc cleanup #65847 and use per-batch timeouts to ensure we can make progress even with unavailable ranges.
  • Applied the GC cooldown to intent-triggered GC runs only (MVCC GC scoring ignores the cooldown).
  • Added a commit to use PUSH_TOUCH instead of PUSH_ABORT for intent resolution.

Would be happy to split off some PRs if it'd make reviewing easier, but they're all fairly small in isolation.

There are a couple of things still bothering me about this:

Firstly, since we haven't recorded last GC times on existing clusters, this is likely to trigger lots of GC runs when nodes upgrade and they find ranges with old intents and no last GC timestamp. Is this a problem?

Secondly, we now trigger GC based solely on the average intent age: if there's a single 8-hour-old intent in a huge range, we'll GC the range. We could try to be clever here and introduce some scaling factor or threshold that takes the relative or absolute number of intents in the range into account. On the other hand, we expect intents to generally get cleaned up by other mechanisms (e.g. txn finalization and reads) within a reasonable timeframe, and we want some backstop to make sure any missed intents do get cleaned up before long, so maybe a simple time limit is sufficient.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nvanbenschoten, @pbardea, and @tbg)

@erikgrinaker erikgrinaker requested review from tbg and removed request for pbardea June 1, 2021 11:57
@tbg (Member) left a comment

Thanks for the clean commit history, LGTM!

@erikgrinaker (Contributor, Author)

bors r=tbg

@craig (bot) commented Jul 7, 2021

Build succeeded:

Successfully merging this pull request may close these issues:

  • kvserver: don't PUSH_ABORT encountered intent txns during GC
  • kvserver: tweak GC queue intent heuristics

4 participants