kvserver: improve suspect replica GC heuristics #65062

erikgrinaker · 2021-05-12T15:17:38Z

The replica GC queue will normally check a replica against the canonical
range descriptor every 12 hours. Under some circumstances the replica
may be considered suspect, which causes it to be checked against the
canonical descriptor every second instead. However, these heuristics
were fairly limited and missed a couple of cases that could cause stale
replicas to linger.

This patch adds two conditions to the suspect replica heuristics:
followers that have lost contact with their leader (which in particular
handles non-voting replicas), and quiescent replicas that lose contact
with any other voters (which could cause false underreplication alerts).

Since this change is expected to increase suspect replica matches, the
ReplicaGCQueueSuspectCheckInterval duration between checking suspect
replica descriptors was also increased from 1 to 5 seconds, and the
replicaGCQueueTimerDuration interval between replica GCs was increased
from 50 to 100 ms.

The previous logic would take into account replica activity such as
store startup and lease proposals as the offset for timeouts, but this
did not appear to have any significant benefit over simply using the
last check time, so these have been removed and the timeouts given more
appropriate names. The previous logic also failed to enforce the check
interval for suspect replicas, and would always check them in a tight
50ms loop, this has been fixed as well.

Resolves #62075, resolves #60259.

Release note (bug fix): Improved garbage collection of stale replicas by
proactively checking certain replicas that have lost contact with other
voting replicas.

/cc @cockroachdb/kv

cockroach-teamcity · 2021-05-12T15:17:44Z

This change is

tbg

Reviewed 4 of 4 files at r1.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @ajwerner and @erikgrinaker)

pkg/kv/kvserver/replica_gc_queue.go, line 191 at r1 (raw file):

		}

	// If the replica is a leader, check that is has a quorum. This handles e.g.

nit: it has a quorum

pkg/kv/kvserver/replica_gc_queue.go, line 231 at r1 (raw file):

	// Only queue for GC if the timeout interval has passed since both the last
	// lease was proposed and since the last check.

lastActivity is weird, you fill it with meaning here in this comment but maybe we can clarify what it indicates. As is, it's confusing. It is really just there to prevent us from considering replicas for replicaGC "prematurely", but is this buying us anything? It almost feels like this is maybe detritus from when we had preemptive snapshots (which were not part of the range desc, i.e. they would get removed whenever replicaGC looked at them). If we can remove it, that would be nice and from the looks of it we can.

erikgrinaker

Reviewed 3 of 3 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @ajwerner and @tbg)

pkg/kv/kvserver/replica_gc_queue.go, line 231 at r1 (raw file):

Previously, tbg (Tobias Grieger) wrote…

lastActivity is weird, you fill it with meaning here in this comment but maybe we can clarify what it indicates. As is, it's confusing. It is really just there to prevent us from considering replicas for replicaGC "prematurely", but is this buying us anything? It almost feels like this is maybe detritus from when we had preemptive snapshots (which were not part of the range desc, i.e. they would get removed whenever replicaGC looked at them). If we can remove it, that would be nice and from the looks of it we can.

Yeah, I don't think it's needed, so I've removed it and renamed the timeout constants. It might serve some purpose in staggering the GC checks, but the replicas are likely to start stores and propose leases around the same time anyway, at least after cluster restarts.

The replica GC queue will normally check a replica against the canonical range descriptor every 12 hours. Under some circumstances the replica may be considered suspect, which causes it to be checked against the canonical descriptor every second instead. However, these heuristics were fairly limited and missed a couple of cases that could cause stale replicas to linger. This patch adds two conditions to the suspect replica heuristics: followers that have lost contact with their leader (which in particular handles non-voting replicas), and quiescent replicas that lose contact with any other voters (which could cause false underreplication alerts). Since this change is expected to increase suspect replica matches, the `ReplicaGCQueueSuspectCheckInterval` duration between checking suspect replica descriptors was also increased from 1 to 5 seconds, and the `replicaGCQueueTimerDuration` interval between replica GCs was increased from 50 to 100 ms. The previous logic would take into account replica activity such as store startup and lease proposals as the offset for timeouts, but this did not appear to have any significant benefit over simply using the last check time, so these have been removed and the timeouts given more appropriate names. The previous logic also failed to enforce the check interval for suspect replicas, and would always check them in a tight 50ms loop, this has been fixed as well. Release note (bug fix): Improved garbage collection of stale replicas by proactively checking certain replicas that have lost contact with other voting replicas.

erikgrinaker · 2021-05-13T20:25:06Z

bors r=tbg

craig · 2021-05-13T21:35:21Z

Build succeeded:

GitHub CI (Cockroach)

In cockroachdb#65062 we added a condition to the replica GC queue that quiescent followers with a single unavailable voter would be considered suspect. This was added to try to detect followers who were partitioned away from the Raft group during its own removal from the range. However, this case has a very high false positive rate, and on second thought it is probably higher than it is worth. There is already a secondary condition that considers followers who have lost touch with their leader suspect, which would be somewhat sufficient in this case, and with a far lower false positive rate. Even though this heuristic is vulnerable to race conditions, it seems a better fit considering that in the worst case the replica will always be GCed within 12 hours anyway. We have also since moved range-level metrics to the leaseholder, which reduces the impact of these stale replicas. This patch therefore removes the quiescent replica condition, and reduces `ReplicaGCQueueSuspectCheckInterval` from 5 to 3 seconds since we now expect far fewer false positives. Release note: None

65126: ui: cluster-ui integration r=koorosh a=koorosh Note: Most of the commits are transferred "as is" from "cockroachdb/ui" repository and only two last commits (be2cf6f and 7391346) include changes related to actual integration of package and adjusting build tools. Initially db-console referenced to `cluster-ui` package as an external package that is hosted in another repository and required to update version of this package every time new changes are made. Now, when `cluster-ui` package is extracted from `cockroachdb/ui` repository and moved in this repo (under `pkg/ui/cluster-ui`), it is possible to reference to this package as a linked dependency and immediately observe all changes. To do this following changes were made: - db-console references to cluster-ui as a linked dependency and doesn't rely on specific version - `cluster-ui` package relies on local version of crdb-protobufjs-client package instead of specific version - Make commands related to db-console are extended to take into account nested `cluster-ui` package. It runs tests, linting and watch commands for both packages. One notable workaround has to mentioned... yarn installation for db-console fails during first run because it cannot properly consume symlinked `cluster-ui` dependency and only with second attempt it passes without issues (this is known issue in yarn project). Depends on: cockroachdb/yarn-vendored#73 65140: ui: rework data handling for the job details page r=dhartunian,nihalpednekar a=oeph Rework the data fetching for the job details page in the admin ui in order to prevent cases, where the jo details page was stuck showing a loading indicator. This was caused by the combined data handling of the job table and the details, leading to loading issues if the job was not listed in the table and therefore could not be shown in the detail page. (see #64281 for reproduction steps) The job details page now uses a dedicated api endpoint to fetch the job details for the given job_id. Release note: None Fixes #64281 65609: kvserver: remove replica GC heuristic for quiesced followers r=tbg a=erikgrinaker In #65062 we added a condition to the replica GC queue that quiescent followers with a single unavailable voter would be considered suspect. This was added to try to detect followers who were partitioned away from the Raft group during its own removal from the range. However, this case has a very high false positive rate, and on second thought it is probably higher than it is worth. There is already a secondary condition that considers followers who have lost touch with their leader suspect, which would be somewhat sufficient in this case, and with a far lower false positive rate. Even though this heuristic is vulnerable to race conditions, it seems a better fit considering that in the worst case the replica will always be GCed within 12 hours anyway. We have also since moved range-level metrics to the leaseholder, which reduces the impact of these stale replicas. This patch therefore removes the quiescent replica condition, and reduces `ReplicaGCQueueSuspectCheckInterval` from 5 to 3 seconds since we now expect far fewer false positives. Touches #65202. Release note: None /cc @cockroachdb/kv 65883: docs: update Life of a Query r=tbg a=erikgrinaker [Rendered](https://github.com/erikgrinaker/cockroach/blob/life-of-a-query/docs/tech-notes/life_of_a_query.md) This is a first pass at updating the Life of a Query document. It primarily focuses on updating links and descriptions to the current code base, and rewriting sections that no longer apply. It has not added new sections on significant new concepts that have since been introduced (except where necessary). Parts of the SQL sections have not been updated -- in particular "Notable planNodes" -- as the SQL execution engine has undergone significant changes recently that are better described by members of the SQL team. Instead, a note has been left informing the reader about this. A second pass is planned for later, which will add missing concepts, expand on existing concepts, and tighten up the prose to make it more cohesive. Touches #65196. Release note: None /cc @cockroachdb/kv Co-authored-by: David Hartunian <[email protected]> Co-authored-by: Andrii Vorobiov <[email protected]> Co-authored-by: Nathan Stilwell <[email protected]> Co-authored-by: Andrii Vorobiov <[email protected]> Co-authored-by: Alfonso Subiotto Marques <[email protected]> Co-authored-by: Alfonso Subiotto Marqués <[email protected]> Co-authored-by: oeph <[email protected]> Co-authored-by: Erik Grinaker <[email protected]>

In cockroachdb#65062 we added a condition to the replica GC queue that quiescent followers with a single unavailable voter would be considered suspect. This was added to try to detect followers who were partitioned away from the Raft group during its own removal from the range. However, this case has a very high false positive rate, and on second thought it is probably higher than it is worth. There is already a secondary condition that considers followers who have lost touch with their leader suspect, which would be somewhat sufficient in this case, and with a far lower false positive rate. Even though this heuristic is vulnerable to race conditions, it seems a better fit considering that in the worst case the replica will always be GCed within 12 hours anyway. We have also since moved range-level metrics to the leaseholder, which reduces the impact of these stale replicas. This patch therefore removes the quiescent replica condition, and reduces `ReplicaGCQueueSuspectCheckInterval` from 5 to 3 seconds since we now expect far fewer false positives. Release note: None

erikgrinaker requested review from ajwerner and tbg May 12, 2021 15:17

erikgrinaker self-assigned this May 12, 2021

erikgrinaker mentioned this pull request May 12, 2021

Replica stuck in under-replicated state because it didn't know it had been removed from its range #60259

Closed

erikgrinaker force-pushed the replica-gc-heuristics branch from 5364a92 to 2b2772f Compare May 13, 2021 08:10

tbg approved these changes May 13, 2021

View reviewed changes

erikgrinaker force-pushed the replica-gc-heuristics branch from 2b2772f to cc33454 Compare May 13, 2021 18:57

erikgrinaker commented May 13, 2021

View reviewed changes

erikgrinaker force-pushed the replica-gc-heuristics branch from cc33454 to 1041afb Compare May 13, 2021 19:03

craig bot merged commit 9279c8f into cockroachdb:master May 13, 2021

erikgrinaker deleted the replica-gc-heuristics branch May 14, 2021 09:59

This was referenced May 14, 2021

release-21.1: kvserver: improve suspect replica GC heuristics #65186

Merged

kvserver: suspect replica GC heuristics should consider lease epoch and validity #65202

Closed

erikgrinaker mentioned this pull request May 24, 2021

kvserver: remove replica GC heuristic for quiesced followers #65609

Merged

erikgrinaker added the backport-21.1.x label May 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

kvserver: improve suspect replica GC heuristics #65062

kvserver: improve suspect replica GC heuristics #65062

erikgrinaker commented May 12, 2021 •

edited

Loading

cockroach-teamcity commented May 12, 2021

tbg left a comment

erikgrinaker left a comment

erikgrinaker commented May 13, 2021

craig bot commented May 13, 2021

kvserver: improve suspect replica GC heuristics #65062

kvserver: improve suspect replica GC heuristics #65062

Conversation

erikgrinaker commented May 12, 2021 • edited Loading

cockroach-teamcity commented May 12, 2021

tbg left a comment

Choose a reason for hiding this comment

erikgrinaker left a comment

Choose a reason for hiding this comment

erikgrinaker commented May 13, 2021

craig bot commented May 13, 2021

erikgrinaker commented May 12, 2021 •

edited

Loading