Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

kvserver: lower priority level for mvcc gc work #85823

Merged
merged 2 commits into from
Aug 10, 2022

Conversation

irfansharif
Copy link
Contributor

@irfansharif irfansharif commented Aug 9, 2022

GC could be expected to be LowPri, so that it does not impact
user-facing traffic when resources (e.g. CPU, write capacity of the
store) are scarce. However long delays in GC can slow down user-facing
traffic due to more versions in the store, and can increase write
amplification of the store since there is more live data. Ideally, we
should adjust this priority based on how far behind we are with respect
to GC-ing a particular range. Keeping the priority level static at
NormalPri proved disruptive when a large volume of MVCC GC work is
suddenly accrued (if an old protected timestamp record was just released
for ex. following a long paused backup job being completed/canceled, or
just an old, long running backup job finishing).

After dynamic priority adjustment, it's not yet clear whether we need
additional pacing mechanisms to provide better latency isolation,
similar to ongoing work for backups. MVCC GC work is CPU intensive:
#82955. This patch is also speculative in nature and in response to
observed incidents where NormalPri proved too disruptive. Fuller
treatment would entail working off of reliable reproductions of this
behaviour.

We also added a cluster setting (kv.mvcc_gc.queue_interval)
that controls how long the MVCC GC queue waits between
processing replicas. It was previously hardcoded to 1s (which is the
default value), but changing it would've come in handy in support
incidents as a form of manual pacing of MVCC GC work (we have a similar
useful knob for the merge queue).

Release note (performance improvement): Previously if there was sudden
increase in the volume of pending MVCC GC work, there was an impact on
foreground latencies. These sudden increases commonly occurred when:

  • gc.ttlseconds was reduced dramatically over tables/indexes that accrue
    a lot of MVCC garbage (think "rows being frequently deleted")
  • a paused backup job from a while ago (think > 1 day) was
    canceled/failed
  • a backup job that started a while ago (think > 1 day) just finished

Indicators of a large increase in the volume of pending MVCC GC work is
a steep climb in the "GC Queue" graph found in the DB console page, when
navigating to 'Metrics', and selecting the 'Queues' dashboard. With this
patch, the effect on foreground latencies as a result of this sudden
build up should be reduced.

@irfansharif irfansharif requested review from tbg and sumeerbhola August 9, 2022 16:50
@irfansharif irfansharif requested a review from a team as a code owner August 9, 2022 16:50
@cockroach-teamcity
Copy link
Member

This change is Reviewable

@@ -115,6 +116,20 @@ var MaxIntentKeyBytesPerCleanupBatch = settings.RegisterIntSetting(
settings.NonNegativeInt,
)

// AdmissionPriority determines the admission priority level to use for MVCC GC
// work.
var AdmissionPriority = settings.RegisterEnumSetting(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: I dont think a setting is necessary, but also have no objections to adding one.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

:lgtm:

Should we follow this up with a change that makes mvccGCQueueTimerDuration a cluster setting? We have precedent for that with kv.range_merge.queue_interval.

Reviewed 3 of 3 files at r1.
Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @lunevalex, @sumeerbhola, and @tbg)


-- commits line 16 at r1:
I suspect that we will eventually need additional pacing in the keyspace iteration that the MVCC GC queue performs (e.g. processReplicatedKeyRange) to compute which keys to GC. In cases where many keys are being garbage collected, it is probably the case that we will issue batches of GCRequests frequently enough for this change to serve as a form of pacing. However, for MVCC GC queue passes where few keys are actually GCed, this probably won't be enough. In those cases, we'll loop quickly through the entire range with no intermittent GC requests, which you've demonstrated elsewhere to be a CPU hog with adverse effects on goroutine scheduling latency.


-- commits line 18 at r1:

MVCC GC work is CPU intensive

I think it's also worth considering that a future revision of MVCC GC might do away with most of the CPU intensive work. In such a world, the synchronous keyspace iteration might be the only work left in the MVCC GC queue, assuming we want accurate stats even with delayed GC during compaction.

That doesn't change anything in this PR, but it might shift how we think about the solution space around disruptive MVCC GC going forward.


pkg/kv/kvserver/gc/gc.go line 121 at r1 (raw file):

Previously, lunevalex wrote…

nit: I dont think a setting is necessary, but also have no objections to adding one.

I think it's a good idea to add one, given the potential for a negative feedback loop with starved MVCC GC and foreground traffic whose efficiency depends on timely MVCC GC. We might as well build the tool that we would need to combat this if it ever comes up in a support escalation.


pkg/kv/kvserver/gc/gc.go line 126 at r1 (raw file):

	"the admission priority to use for mvcc gc work",
	"low_pri",
	map[int64]string{

Not for now, but I wouldn't be surprised if we follow this pattern in other places and decide to package it up as a known enum in the settings package.

Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @irfansharif, @lunevalex, @sumeerbhola, and @tbg)


pkg/kv/kvserver/gc/gc.go line 127 at r1 (raw file):

	"low_pri",
	map[int64]string{
		int64(admissionpb.LowPri):    "low_pri",

Actually, are these the correct priority levels, and is this the correct default? We use BulkNormalPri for all other bulk jobs. LowPri doesn't seem to be used outside of the admission package.

@irfansharif irfansharif force-pushed the 220809.mvcc-gc-priority branch from 09a21bf to cec0265 Compare August 9, 2022 18:42
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we follow this up with a change that makes mvccGCQueueTimerDuration a cluster setting?

Sure, just added a commit here. Will merge on green.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @lunevalex, @nvanbenschoten, @sumeerbhola, and @tbg)


-- commits line 16 at r1:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I suspect that we will eventually need additional pacing in the keyspace iteration that the MVCC GC queue performs (e.g. processReplicatedKeyRange) to compute which keys to GC. In cases where many keys are being garbage collected, it is probably the case that we will issue batches of GCRequests frequently enough for this change to serve as a form of pacing. However, for MVCC GC queue passes where few keys are actually GCed, this probably won't be enough. In those cases, we'll loop quickly through the entire range with no intermittent GC requests, which you've demonstrated elsewhere to be a CPU hog with adverse effects on goroutine scheduling latency.

Agreed, more experimentation here will tell us. Added a sentence below


-- commits line 18 at r1:

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

MVCC GC work is CPU intensive

I think it's also worth considering that a future revision of MVCC GC might do away with most of the CPU intensive work. In such a world, the synchronous keyspace iteration might be the only work left in the MVCC GC queue, assuming we want accurate stats even with delayed GC during compaction.

That doesn't change anything in this PR, but it might shift how we think about the solution space around disruptive MVCC GC going forward.

Doing MVCC GC as part of compactions makes a lot of sense (I need to periodically sift through X-noremind PRs to dig up such things). Added a sentence.


pkg/kv/kvserver/gc/gc.go line 121 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

I think it's a good idea to add one, given the potential for a negative feedback loop with starved MVCC GC and foreground traffic whose efficiency depends on timely MVCC GC. We might as well build the tool that we would need to combat this if it ever comes up in a support escalation.

Added it for that same reason, especially given it's something we're looking to backport to 22.1. We should want this knob to go away though after more experimentation work, if that's what you're getting at Alex.


pkg/kv/kvserver/gc/gc.go line 127 at r1 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

Actually, are these the correct priority levels, and is this the correct default? We use BulkNormalPri for all other bulk jobs. LowPri doesn't seem to be used outside of the admission package.

LowPri was mentioned in an earlier revision of a comment, but I think BulkNormalPri wasn't a thing then. Changed, and removed LowPri. The important thing is to have something lower than NormalPri.

@irfansharif irfansharif force-pushed the 220809.mvcc-gc-priority branch from cec0265 to 4d0d98b Compare August 9, 2022 18:46
Copy link
Member

@nvanbenschoten nvanbenschoten left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 2 files at r2, 1 of 1 files at r3, all commit messages.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @sumeerbhola, and @tbg)


pkg/kv/kvserver/mvcc_gc_queue.go line 72 at r3 (raw file):

var mvccGCQueueInterval = settings.RegisterDurationSetting(
	settings.SystemOnly,
	"kv.mvcc_gc_merge.queue_interval",

s/kv.mvcc_gc_merge.queue_interval/kv.mvcc_gc.queue_interval/


pkg/kv/kvserver/gc/gc.go line 127 at r1 (raw file):

Previously, irfansharif (irfan sharif) wrote…

LowPri was mentioned in an earlier revision of a comment, but I think BulkNormalPri wasn't a thing then. Changed, and removed LowPri. The important thing is to have something lower than NormalPri.

Should HighPri also be changed to UserHighPri?

GC could be expected to be LowPri, so that it does not impact
user-facing traffic when resources (e.g. CPU, write capacity of the
store) are scarce. However long delays in GC can slow down user-facing
traffic due to more versions in the store, and can increase write
amplification of the store since there is more live data. Ideally, we
should adjust this priority based on how far behind we are with respect
to GC-ing a particular range. Keeping the priority level static at
NormalPri proved disruptive when a large volume of MVCC GC work is
suddenly accrued (if an old protected timestamp record was just released
for ex. following a long paused backup job being completed/canceled, or
just an old, long running backup job finishing).

After dynamic priority adjustment, it's not yet clear whether we need
additional pacing mechanisms to provide better latency isolation,
similar to ongoing work for backups. MVCC GC work is CPU intensive:
\cockroachdb#82955. This patch is also speculative in nature and in response to
observed incidents where NormalPri proved too disruptive. Fuller
treatment would entail working off of reliable reproductions of this
behaviour.

Release note (performance improvement): Previously if there was sudden
increase in the volume of pending MVCC GC work, there was an impact on
foreground latencies. These sudden increases commonly occurred when:
- gc.ttlseconds was reduced dramatically over tables/indexes that accrue
  a lot of MVCC garbage (think "rows being frequently deleted")
- a paused backup job from a while ago (think > 1 day) was
  canceled/failed
- a backup job that started a while ago (think > 1 day) just finished

Indicators of a large increase in the volume of pending MVCC GC work is
a steep climb in the "GC Queue" graph found in the DB console page, when
navigating to 'Metrics', and selecting the 'Queues' dashboard. With this
patch, the effect on foreground latencies as a result of this sudden
build up should be reduced.
@irfansharif irfansharif force-pushed the 220809.mvcc-gc-priority branch from 4d0d98b to 7119010 Compare August 9, 2022 20:48
Copy link
Contributor Author

@irfansharif irfansharif left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

bors r+

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/mvcc_gc_queue.go line 72 at r3 (raw file):

Previously, nvanbenschoten (Nathan VanBenschoten) wrote…

s/kv.mvcc_gc_merge.queue_interval/kv.mvcc_gc.queue_interval/

🤦

Copy link
Collaborator

@sumeerbhola sumeerbhola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed 1 of 3 files at r1, 2 of 2 files at r4.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @irfansharif, @nvanbenschoten, @sumeerbhola, and @tbg)


pkg/kv/kvserver/mvcc_gc_queue.go line 465 at r4 (raw file):

			// BulkNormalPri.
			//
			// After we implement dynamic priority adjustment, it's not clear

It is good to have this knob to start with. But I hope we never have to tune it, or do dynamic priority adjustment in the code (it will just make things harder to understand). If someone has not provisioned well enough for GC to function, then it is their problem to add more resources.

@craig
Copy link
Contributor

craig bot commented Aug 9, 2022

Build failed (retrying...):

@irfansharif
Copy link
Contributor Author

bors r-

@craig
Copy link
Contributor

craig bot commented Aug 10, 2022

Canceled.

Cluster setting that controls how long the MVCC GC queue waits between
processing replicas. It was previously hardcoded to 1s (which is the
default value), but changing it would've come in handy in support
incidents as a form of manual pacing of MVCC GC work (we have a similar
useful knob for the merge queue).

Release note: None
@irfansharif irfansharif force-pushed the 220809.mvcc-gc-priority branch from 7119010 to f00b130 Compare August 10, 2022 00:01
@irfansharif
Copy link
Contributor Author

bors r+

@craig
Copy link
Contributor

craig bot commented Aug 10, 2022

🕐 Waiting for PR status (Github check) to be set, probably by CI. Bors will automatically try to run when all required PR statuses are set.

@craig
Copy link
Contributor

craig bot commented Aug 10, 2022

Build succeeded:

@craig craig bot merged commit a3cb8aa into cockroachdb:master Aug 10, 2022
@blathers-crl
Copy link

blathers-crl bot commented Aug 10, 2022

Encountered an error creating backports. Some common things that can go wrong:

  1. The backport branch might have already existed.
  2. There was a merge conflict.
  3. The backport branch contained merge commits.

You might need to create your backport manually using the backport tool.


error creating merge commit from d5e7e0a to blathers/backport-release-22.1-85823: POST https://api.github.com/repos/cockroachdb/cockroach/merges: 409 Merge conflict []

you may need to manually resolve merge conflicts with the backport tool.

Backport to branch 22.1.x failed. See errors above.


🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is otan.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants