Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

release-21.1: kvserver: improve suspect replica GC heuristics #65186

Merged

Conversation

erikgrinaker
Copy link
Contributor

@erikgrinaker erikgrinaker commented May 14, 2021

Backport 1/1 commits from #65062.
Backport 1/1 commits from #65609.

/cc @cockroachdb/release @cockroachdb/kv


kvserver: improve suspect replica GC heuristics

The replica GC queue will normally check a replica against the canonical
range descriptor every 12 hours. Under some circumstances the replica
may be considered suspect, which causes it to be checked against the
canonical descriptor every second instead. However, these heuristics
were fairly limited and missed a couple of cases that could cause stale
replicas to linger.

This patch adds two conditions to the suspect replica heuristics:
followers that have lost contact with their leader (which in particular
handles non-voting replicas), and quiescent replicas that lose contact
with any other voters (which could cause false underreplication alerts).

Since this change is expected to increase suspect replica matches, the
ReplicaGCQueueSuspectCheckInterval duration between checking suspect
replica descriptors was also increased from 1 to 5 seconds, and the
replicaGCQueueTimerDuration interval between replica GCs was increased
from 50 to 100 ms.

The previous logic would take into account replica activity such as
store startup and lease proposals as the offset for timeouts, but this
did not appear to have any significant benefit over simply using the
last check time, so these have been removed and the timeouts given more
appropriate names. The previous logic also failed to enforce the check
interval for suspect replicas, and would always check them in a tight
50ms loop, this has been fixed as well.

Resolves #62075, resolves #60259.

Release note (bug fix): Improved garbage collection of stale replicas by
proactively checking certain replicas that have lost contact with other
voting replicas.

kvserver: remove replica GC heuristic for quiesced followers

In #65062 we added a condition to the replica GC queue that quiescent
followers with a single unavailable voter would be considered suspect.
This was added to try to detect followers who were partitioned away from
the Raft group during its own removal from the range. However, this case
has a very high false positive rate, and on second thought it is probably
higher than it is worth.

There is already a secondary condition that considers followers who have
lost touch with their leader suspect, which would be somewhat sufficient
in this case, and with a far lower false positive rate. Even though this
heuristic is vulnerable to race conditions, it seems a better fit
considering that in the worst case the replica will always be GCed
within 12 hours anyway. We have also since moved range-level metrics
to the leaseholder, which reduces the impact of these stale replicas.

This patch therefore removes the quiescent replica condition, and
reduces ReplicaGCQueueSuspectCheckInterval from 5 to 3 seconds since
we now expect far fewer false positives.

Touches #65202.

Release note: None

@erikgrinaker erikgrinaker self-assigned this May 14, 2021
@cockroach-teamcity
Copy link
Member

This change is Reviewable

The replica GC queue will normally check a replica against the canonical
range descriptor every 12 hours. Under some circumstances the replica
may be considered suspect, which causes it to be checked against the
canonical descriptor every second instead. However, these heuristics
were fairly limited and missed a couple of cases that could cause stale
replicas to linger.

This patch adds two conditions to the suspect replica heuristics:
followers that have lost contact with their leader (which in particular
handles non-voting replicas), and quiescent replicas that lose contact
with any other voters (which could cause false underreplication alerts).

Since this change is expected to increase suspect replica matches, the
`ReplicaGCQueueSuspectCheckInterval` duration between checking suspect
replica descriptors was also increased from 1 to 5 seconds, and the
`replicaGCQueueTimerDuration` interval between replica GCs was increased
from 50 to 100 ms.

The previous logic would take into account replica activity such as
store startup and lease proposals as the offset for timeouts, but this
did not appear to have any significant benefit over simply using the
last check time, so these have been removed and the timeouts given more
appropriate names. The previous logic also failed to enforce the check
interval for suspect replicas, and would always check them in a tight
50ms loop, this has been fixed as well.

Release note (bug fix): Improved garbage collection of stale replicas by
proactively checking certain replicas that have lost contact with other
voting replicas.
In cockroachdb#65062 we added a condition to the replica GC queue that quiescent
followers with a single unavailable voter would be considered suspect.
This was added to try to detect followers who were partitioned away from
the Raft group during its own removal from the range. However, this case
has a very high false positive rate, and on second thought it is probably
higher than it is worth.

There is already a secondary condition that considers followers who have
lost touch with their leader suspect, which would be somewhat sufficient
in this case, and with a far lower false positive rate. Even though this
heuristic is vulnerable to race conditions, it seems a better fit
considering that in the worst case the replica will always be GCed
within 12 hours anyway. We have also since moved range-level metrics
to the leaseholder, which reduces the impact of these stale replicas.

This patch therefore removes the quiescent replica condition, and
reduces `ReplicaGCQueueSuspectCheckInterval` from 5 to 3 seconds since
we now expect far fewer false positives.

Release note: None
@erikgrinaker erikgrinaker marked this pull request as ready for review June 3, 2021 12:00
@erikgrinaker erikgrinaker requested a review from tbg June 3, 2021 12:00
@erikgrinaker erikgrinaker merged commit 95321f3 into cockroachdb:release-21.1 Jun 4, 2021
@erikgrinaker erikgrinaker deleted the backport21.1-65062 branch June 14, 2021 13:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants