
kvserver: make GC intent scoring more aggressive #65001

Merged 5 commits on Jul 7, 2021

Conversation

@erikgrinaker (Contributor) commented May 11, 2021

keys: rename RangeLastGCKey to RangeGCThresholdKey

The key RangeLastGCKey was not actually used to store the last GC
timestamp, but rather the last GC threshold. This commit renames the key
to the more appropriate RangeGCThresholdKey.

Release note: None

kvserver: increase GC queue timeout

The MVCC GC queue timeout was left at the default of 1 minute. Since the
GCThreshold is advanced and persisted at the very start of the GC run
(to coordinate with other processes such as exports and rangefeeds
before removing any data), this means that if a GC run takes more than 1
minute to complete it will be aborted due to the timeout and may not run
again for some time.

This patch increases the timeout to 10 minutes. The primary purpose of
the timeout is to avoid head-of-line blocking in the case of unavailable
ranges, and a larger timeout still satisfies this property while also
allowing more expensive cleanup attempts to complete.

Release note (ops change): increased the timeout for range MVCC
garbage collection from 1 minute to 10 minutes, to allow larger jobs to
run to completion.

kvserver: use PUSH_TOUCH during GC to push intent owners

Using PUSH_ABORT to push intent owners during intent GC can cause it
to block on (or possibly abort) long-running transactions.

Since we would prefer GC to continue with other intents rather than
waiting for active transactions, this patch changes the GC queue to use
PUSH_TOUCH instead. This will still clean up abandoned txns.

Resolves #65777.

Release note (bug fix): Intent garbage collection no longer waits for
or aborts running transactions.
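The behavioral difference can be sketched as follows (a toy model, not CockroachDB code; the status type stands in for the txn status a PUSH_TOUCH reports back):

```go
package main

import "fmt"

// txnStatus is a stand-in for the transaction status observed by a
// PUSH_TOUCH attempt (roughly PENDING / COMMITTED / ABORTED).
type txnStatus int

const (
	pending txnStatus = iota
	committed
	aborted
)

// resolveAfterTouch sketches the new GC behavior: a PUSH_TOUCH that
// finds a still-pending transaction leaves its intents alone rather
// than blocking on or aborting it; abandoned (finalized or expired)
// txns still get their intents resolved.
func resolveAfterTouch(status txnStatus) (resolveIntents bool) {
	switch status {
	case committed, aborted:
		return true // abandoned/finalized: safe to clean up intents
	default:
		return false // live long-running txn: skip, GC moves on
	}
}

func main() {
	fmt.Println(resolveAfterTouch(pending), resolveAfterTouch(aborted)) // false true
}
```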

kvserver: add GC queue cooldown timer

When considering replicas for MVCC GC, the GC queue primarily weighed
the previous GC threshold and the amount of MVCC garbage that would be
removed by advancing the threshold. We would like to also
trigger GC based solely on intent stats. However, we can't know whether
we'll actually be able to clean up any intents, since they may belong to
an in-progress long-running transaction. There is therefore a risk that
we will keep spinning on GC attempts due to high intent counts.

This patch records the last GC queue processing timestamp, and adds a
cooldown timer of 2 hours between each range GC attempt triggered by
intent score alone to prevent spinning on GC attempts.

Release note: None

kvserver: make GC intent scoring more aggressive

Users often experience buildup of intents due to the GC queue not being
aggressive enough in cleaning them up. Currently, the GC queue will only
trigger based on intents if the average intent age is at least 10 days
and there is also MVCC garbage that can be cleaned up.

This patch makes the intent scoring more aggressive, triggering GC when
the average intent age is 8 hours regardless of other MVCC garbage. The
previous commit added a cooldown timer to prevent the GC queue from
spinning on a replica if the intents couldn't be cleaned up (e.g.
because they belong to an in-progress long-running transaction).

Resolves #64266.
Touches #60585.

Release note (ops change): trigger MVCC and intent garbage collection
when the average intent age is 8 hours, down from 10 days.

/cc @cockroachdb/kv

@erikgrinaker erikgrinaker self-assigned this May 11, 2021
@cockroach-teamcity (Member)

This change is Reviewable

@nvanbenschoten (Member)

I think this makes sense as a first step towards addressing #64266.

@erikgrinaker (Contributor, Author)

Yeah, I'm just worried about continually spinning due to long-running txns, so would like to address that before merging this. The simplest solution I can think of is to add a cooldown timer, either to the GC queue or to the base queue.

@erikgrinaker erikgrinaker force-pushed the intent-gc branch 11 times, most recently from 26bccab to ee0ab27 Compare May 18, 2021 19:29
@erikgrinaker erikgrinaker marked this pull request as ready for review May 18, 2021 19:35
@erikgrinaker erikgrinaker requested review from a team and pbardea and removed request for a team May 18, 2021 19:35
@tbg (Member) left a comment

A few comments based only on my reading of the commit messages:

  • the cooldown timer of 2h seems like it could possibly be dangerous. If a range sees sustained overwrites, it needs regular GC or it will become too large and eventually will be backpressured. Regardless of the cooldown timer, I think we want to trigger GC of old values if the score says we should (that score is guaranteed to drop following a GC attempt).
  • while most of GC is local to the range, the pushing of txns is not, and so it can be affected by an unavailable range. We should be pushing with a reasonable sub-timeout (5s?) and make sure one unavailable txn record doesn't prevent queue progress where possible. We're possibly doing this already, not sure, just making sure we do it in the future. Upping the queue processing timeout (while the right thing) makes this a bit more high-stakes, since we don't want to spend 10 minutes doing nothing (& then possibly re-queueing the same range again). Not sure that there is a good way out in general. The cooldown timer will make sure things don't get delayed too much on other ranges, so it's hopefully ok in practice.
  • the commit message about GCThreshold being persisted at the beginning of the run could explain why that is still the right choice (if it even is).

It also renames RangeLastGCKey to RangeGCThresholdKey, since I
initially went down the path of a dedicated RangeLastGCTimestampKey to
track the last timestamp, and even though this approach was later
reverted the key rename was kept since there was a TODO for it
already.

I read this five times but still wondering what it means :-)

Reviewed 1 of 1 files at r1, 18 of 18 files at r2, 2 of 2 files at r3.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nvanbenschoten, and @pbardea)

@erikgrinaker (Contributor, Author) left a comment

the cooldown timer of 2h seems like it could possibly be dangerous. If a range sees sustained overwrites, it needs regular GC or it will become too large and eventually will be backpressured. Regardless of the cooldown timer, I think we want to trigger GC of old values if the score says we should.

Good point, I'll change it to apply specifically for intent-triggered GC.

while most of GC is local to the range, the pushing of txns is not, and so it can be affected by an unavailable range.

Looks like you're right. For txns anchored on the range, it uses the intent resolver's async task pool, but pushing remote txns is synchronous and doesn't appear to have any timeout. I don't know what the expected latency variance is on these operations; is 5s really sufficient? Although I suppose if it fails, the txn would get GCed by its local range eventually anyway, so it doesn't seem critical to go through with these.

I read this five times but still wondering what it means :-)

😄 I suppose I should just skip the added background flavor, and maybe also split it out as a separate commit. Anyway, to stop you wondering: Before I discovered QueueLastProcessedKey, I wanted to add RangeLastGCKey to record the last GC run. Then I discovered that the key already existed, but it didn't actually store the last GC run, it stored the GC threshold. So I renamed it to RangeGCThresholdKey so that I could add the real RangeLastGCKey. After doing that, however, I found QueueLastProcessedKey -- this made RangeLastGCKey obsolete, so I removed it and its implementation. However, there was already a TODO to rename RangeLastGCKey to RangeGCThresholdKey, and since I'd already done that work I figured I might as well just keep that change -- even though it isn't needed for this PR.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nvanbenschoten, and @pbardea)

@erikgrinaker (Contributor, Author) left a comment

I've updated this with a few changes:

  • Split out the RangeLastGCKey rename to a separate commit.
  • Added a note when increasing the GC queue timeout that we should wait for kv: split up intents into batches during gc cleanup #65847 and use per-batch timeouts to ensure we can make progress even with unavailable ranges.
  • Applied the GC cooldown to intent-triggered GC runs only (MVCC GC scoring ignores the cooldown).
  • Added a commit to use PUSH_TOUCH instead of PUSH_ABORT for intent resolution.

Would be happy to split off some PRs if it'd make reviewing easier, but they're all fairly small in isolation.

There are a couple of things still bothering me about this:

Firstly, since we haven't recorded last GC times on existing clusters, this is likely to trigger lots of GC runs when nodes upgrade and they find ranges with old intents and no last GC timestamp. Is this a problem?

Secondly, we now trigger GC based solely on the average intent age: if there's a single 8-hour-old intent in a huge range, we'll GC the range. We could try to be clever here and introduce some scaling factor or threshold that takes the relative or absolute number of intents in the range into account. On the other hand, we expect intents to generally get cleaned up by other mechanisms (e.g. txn finalization and reads) within a reasonable timeframe, and we want some backstop to make sure any missed intents do get cleaned up before long, so maybe a simple time limit is sufficient.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aliher1911, @nvanbenschoten, @pbardea, and @tbg)

@erikgrinaker erikgrinaker requested review from tbg and removed request for pbardea June 1, 2021 11:57
@tbg (Member) left a comment

Thanks for the clean commit history, LGTM!

@erikgrinaker (Contributor, Author)

bors r=tbg

@craig (bot) commented Jul 7, 2021

Build succeeded:

Successfully merging this pull request may close these issues:

  • kvserver: don't PUSH_ABORT encountered intent txns during GC
  • kvserver: tweak GC queue intent heuristics

4 participants