-
Notifications
You must be signed in to change notification settings - Fork 3.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
kvserver: make GC intent scoring more aggressive #65001
Conversation
I think this makes sense as a first step towards addressing #64266. |
Yeah, I'm just worried about continually spinning due to long-running txns, so would like to address that before merging this. The simplest solution I can think of is to add a cooldown timer, either to the GC queue or to the base queue. |
26bccab
to
ee0ab27
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A few comments based only on my reading of the commit messages:
- the cooldown timer of 2h seems like it could possibly be dangerous. If a range sees sustained overwrites, it needs regular GC or it will become too large and eventually will be backpressured. Regardless of the cooldown timer, I think we want to trigger GC of old values if the score says we should (that score is guaranteed to drop following a GC attempt).
- while most of GC is local to the range, the pushing of txns is not, and so it can be affected by an unavailable range. We should be pushing with a reasonable sub-timeout (5s?) and make sure one unavailable txn record doesn't prevent queue progress where it possible. We're possibly doing this already, not sure, just making sure we do it in the future. Upping the queue processing timeout (while the right thing) makes this a bit more high-stakes, since we don't want to spend 10 minutes doing nothing (& then possibly re-queueing the same range again). Not sure that there is a good way out in general. The cooldown timer will make sure things don't get delayed too much on other ranges, so it's hopefully ok in practice.
- the commit message about GCThreshold being persisted at the beginning of the run could explain why that is still the right choice (if it even is).
It also renames
RangeLastGCKey
toRangeGCThresholdKey
, since I
initially went down the path of a dedicatedRangeLastGCTimestampKey
to
track the last timestamp, and even though this approach was later
reverted the key rename was kept since there was aTODO
for it
already.
I read this five times but still wondering what it means :-)
Reviewed 1 of 1 files at r1, 18 of 18 files at r2, 2 of 2 files at r3.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nvanbenschoten, and @pbardea)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the cooldown timer of 2h seems like it could possibly be dangerous. If a range sees sustained overwrites, it needs regular GC or it will become too large and eventually will be backpressured. Regardless of the cooldown timer, I think we want to trigger GC of old values if the score says we should.
Good point, I'll change it to apply specifically for intent-triggered GC.
while most of GC is local to the range, the pushing of txns is not, and so it can be affected by an unavailable range.
Looks like you're right. For txns anchored on the range, it uses the intent resolver's async task pool, but pushing remote txns is synchronous and doesn't appear to have any timeout. I don't know what the expected latency variance is on these operations, is 5s really sufficient? Although I suppose if it fails the txn would get GCed by its local range eventually anyway, so it doesn't seem critical to go through with these.
I read this five times but still wondering what it means :-)
😄 I suppose I should just skip the added background flavor, and maybe also split it out as a separate commit. Anyway, to stop you wondering: Before I discovered QueueLastProcessedKey
, I wanted to add RangeLastGCKey
to record the last GC run. Then I discovered that the key already existed, but it didn't actually store the last GC run, it stored the GC threshold. So I renamed it to RangeGCThresholdKey
so that I could add the real RangeLastGCKey
. After doing that, however, I found QueueLastProcessedKey
-- this made RangeLastGCKey
obsolete, so I removed it and its implementation. However, there was already a TODO
to rename RangeLastGCKey
to RangeGCThresholdKey
, and since I'd already done that work I figured I might as well just keep that change -- even though it isn't needed for this PR.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nvanbenschoten, and @pbardea)
The key `RangeLastGCKey` was not actually used to store the last GC timestamp, but rather the last GC threshold. This commit renames the key to the more appropriate `RangeGCThresholdKey`. Release note: None
The MVCC GC queue timeout was left at the default of 1 minute. Since the `GCThreshold` is advanced and persisted at the very start of the GC run (to coordinate with other processes such as exports and rangefeeds before removing any data), this means that if a GC run takes more than 1 minute to complete it will be aborted due to the timeout and may not run again for some time. This patch increases the timeout to 10 minutes. The primary purpose of the timeout is to avoid head-of-line blocking in the case of unavailable ranges, and a larger timeout still satisfies this property while also allowing more expensive cleanup attempts to complete. Release note (ops change): increased the timeout for range MVCC garbage collection from 1 minute to 10 minutes, to allow larger jobs to run to completion.
Using `PUSH_ABORT` to push intent owners during intent GC can cause it to block on (or possibly abort) long-running transactions. Since we would prefer GC to continue with other intents rather than waiting for active transactions, this patch changes the GC queue to use `PUSH_TOUCH` instead. This will still clean up abandoned txns. Release note (bug fix): Intent garbage collection no longer waits for or aborts running transactions.
When the GC queue considered replicas for MVCC GC, it primarily considered the previous GC threshold and the amount of MVCC garbage that would be removed by advancing the threshold. We would like to also trigger GC based solely on intent stats. However, we can't know whether we'll actually be able to clean up any intents, since they may belong to an in-progress long-running transaction. There is therefore a risk that we will keep spinning on GC attempts due to high intent counts. This patch records the last GC queue processing timestamp, and adds a cooldown timer of 2 hours between each range GC attempt triggered by intent score alone to prevent spinning on GC attempts. Release note: None
Users often experience buildup of intents due to the GC queue not being aggressive enough in cleaning them up. Currently, the GC queue will only trigger based on intents if the average intent age is 10 days and if there is also MVCC garbage that can be cleaned up. This patch makes the intent scoring more aggressive, triggering GC when the average intent age is 8 hours regardless of other MVCC garbage. The previous commit added a cooldown timer to prevent the GC queue from spinning on a replica if the intents couldn't be cleaned up (e.g. because they belong to an in-progress long-running transaction). Release note (ops change): trigger MVCC and intent garbage collection when the average intent age is 8 hours, down from 10 days.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've updated this with a few changes:
- Split out the
RangeLastGCKey
rename to a separate commit. - Added a note when increasing the GC queue timeout that we should wait for kv: split up intents into batches during gc cleanup #65847 and use per-batch timeouts to ensure we can make progress even with unavailable ranges.
- Applied the GC cooldown to intent-triggered GC runs only (MVCC GC scoring ignores the cooldown).
- Added a commit to use
PUSH_TOUCH
instead ofPUSH_ABORT
for intent resolution.
Would be happy to split off some PRs if it'd make reviewing easier, but they're all fairly small in isolation.
There are a couple of things still bothering me about this:
Firstly, since we haven't recorded last GC times on existing clusters, this is likely to trigger lots of GC runs when nodes upgrade and they find ranges with old intents and no last GC timestamp. Is this a problem?
Secondly, we now trigger GC based solely on the average intent age: if there's a single 8-hour-old intent in a huge range, we'll GC the range. We could try to be clever here and introduce some scaling factor or threshold that takes into account the relative or absolute amount of intents in the range. But on the other hand, we expect intents to generally get cleaned up by other mechanisms within a reasonable timeframe (e.g. txn finalization and reads), and we want some backstop to make sure any missed intents do get cleaned up before long, so maybe a simple time limit is sufficient.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nvanbenschoten, @pbardea, and @tbg)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the clean commit history, LGTM!
bors r=tbg |
Build succeeded: |
keys: rename RangeLastGCKey to RangeGCThresholdKey
The key
RangeLastGCKey
was not actually used to store the last GCtimestamp, but rather the last GC threshold. This commit renames the key
to the more appropriate
RangeGCThresholdKey
.Release note: None
kvserver: increase GC queue timeout
The MVCC GC queue timeout was left at the default of 1 minute. Since the
GCThreshold
is advanced and persisted at the very start of the GC run(to coordinate with other processes such as exports and rangefeeds
before removing any data), this means that if a GC run takes more than 1
minute to complete it will be aborted due to the timeout and may not run
again for some time.
This patch increases the timeout to 10 minutes. The primary purpose of
the timeout is to avoid head-of-line blocking in the case of unavailable
ranges, and a larger timeout still satisfies this property while also
allowing more expensive cleanup attempts to complete.
Release note (ops change): increased the timeout for range MVCC
garbage collection from 1 minute to 10 minutes, to allow larger jobs to
run to completion.
kvserver: use PUSH_TOUCH during GC to push intent owners
Using
PUSH_ABORT
to push intent owners during intent GC can cause itto block on (or possibly abort) long-running transactions.
Since we would prefer GC to continue with other intents rather than
waiting for active transactions, this patch changes the GC queue to use
PUSH_TOUCH
instead. This will still clean up abandoned txns.Resolves #65777.
Release note (bug fix): Intent garbage collection no longer waits for
or aborts running transactions.
kvserver: add GC queue cooldown timer
When the GC queue considered replicas for MVCC GC, it primarily
considered the previous GC threshold and the amount of MVCC garbage that
would be removed by advancing the threshold. We would like to also
trigger GC based solely on intent stats. However, we can't know whether
we'll actually be able to clean up any intents, since they may belong to
an in-progress long-running transaction. There is therefore a risk that
we will keep spinning on GC attempts due to high intent counts.
This patch records the last GC queue processing timestamp, and adds a
cooldown timer of 2 hours between each range GC attempt triggered by
intent score alone to prevent spinning on GC attempts.
Release note: None
kvserver: make GC intent scoring more aggressive
Users often experience buildup of intents due to the GC queue not being
aggressive enough in cleaning them up. Currently, the GC queue will only
trigger based on intents if the average intent age is 10 days and if
there is also MVCC garbage that can be cleaned up.
This patch makes the intent scoring more aggressive, triggering GC when
the average intent age is 8 hours regardless of other MVCC garbage. The
previous commit added a cooldown timer to prevent the GC queue from
spinning on a replica if the intents couldn't be cleaned up (e.g.
because they belong to an in-progress long-running transaction).
Resolves #64266.
Touches #60585.
Release note (ops change): trigger MVCC and intent garbage collection
when the average intent age is 8 hours, down from 10 days.
/cc @cockroachdb/kv