
kvserver: suspect replica GC heuristics should consider lease epoch and validity #65202

Closed
erikgrinaker opened this issue May 14, 2021 · 4 comments
Assignees
Labels
A-kv-replication Relating to Raft, consensus, and coordination. C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@erikgrinaker
Contributor

erikgrinaker commented May 14, 2021

As seen in #60259, quiesced replicas may not be aware that they have been removed from a range if they were partitioned away during the removal. In #65062 we made some improvements to the GC heuristics that attempt to GC these replicas once they notice other replicas being unavailable. However, this can have a high false positive rate, and it does not catch the case where the other members of the old range descriptor are still available.
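
For context, here is a minimal, self-contained sketch of the idea behind that heuristic (simplified types and names, not the actual replicaGCQueue code):

package main

import "fmt"

// NodeID and replicaDescriptor are simplified stand-ins for the real types.
type NodeID int

type replicaDescriptor struct {
	NodeID NodeID
}

// quiescedFollowerLooksSuspect captures the spirit of the #65062 heuristic:
// a quiesced follower is considered suspect if any voter in its (possibly
// stale) range descriptor appears non-live. A single dead node anywhere in
// the descriptor trips it, even when the range itself still has a live
// leaseholder and quorum -- hence the high false positive rate.
func quiescedFollowerLooksSuspect(voters []replicaDescriptor, isLive map[NodeID]bool) bool {
	for _, d := range voters {
		if !isLive[d.NodeID] {
			return true
		}
	}
	return false
}

func main() {
	voters := []replicaDescriptor{{NodeID: 1}, {NodeID: 2}, {NodeID: 3}}
	liveness := map[NodeID]bool{1: true, 2: false, 3: true}
	// Reports true even though a healthy range can tolerate one dead voter.
	fmt.Println(quiescedFollowerLooksSuspect(voters, liveness))
}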

@ajwerner suggested

We could use the fact that the lease belongs to an old liveness epoch on a quiesced range as part of the heuristic for the replica GC queue.

We should try doing this instead, and also check expired leases -- this should reduce the false positive rate as well.
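
As a rough illustration only -- simplified types and a hypothetical helper, not CockroachDB's actual lease or liveness API -- the suggested check amounts to something like:

package main

import (
	"fmt"
	"time"
)

// Simplified stand-ins for the lease and liveness record types.
type lease struct {
	HolderNodeID int
	Epoch        int64
	Expiration   *time.Time // set for expiration-based leases, nil for epoch-based ones
}

type livenessRecord struct {
	Epoch      int64
	Expiration time.Time
}

// leaseLooksStale is a hypothetical helper for the suggestion above: treat
// the lease as stale if the holder's liveness epoch has moved past the lease
// epoch, or if an expiration-based lease has simply expired.
func leaseLooksStale(l lease, holderLiveness livenessRecord, found bool, now time.Time) bool {
	if l.Expiration != nil {
		return now.After(*l.Expiration)
	}
	return !found || holderLiveness.Epoch > l.Epoch || now.After(holderLiveness.Expiration)
}

func main() {
	now := time.Now()
	l := lease{HolderNodeID: 1, Epoch: 3}
	// The holder's liveness epoch has since been bumped to 4, so an
	// epoch-based lease at epoch 3 can no longer be valid.
	rec := livenessRecord{Epoch: 4, Expiration: now.Add(9 * time.Second)}
	fmt.Println(leaseLooksStale(l, rec, true, now)) // true
}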

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-replication Relating to Raft, consensus, and coordination. labels May 14, 2021
@erikgrinaker erikgrinaker self-assigned this May 14, 2021
@erikgrinaker
Contributor Author

@ajwerner I looked into this, and I don't believe this would work -- in this case, the leaseholder will still be heartbeating at its old epoch, and even though the follower may bump its own epoch when partitioned, I don't think we can compare that with the lease epoch, since epochs are per-node. Am I missing anything?

Having had a second look at the heuristics, and really not liking the high false positive rate of the quiescent follower condition, I think the other existing condition, which considers a follower suspect if it loses touch with the leader, would be sufficient for this case as well. It is vulnerable to race conditions, but in the worst case the replica would be GCed within 12 hours anyway. Submitted #65609 for this.

@ajwerner
Contributor

I'm fine with not adding this heuristic, because I agree it leaves lots of room for false positives. I also agree that the 12-hour fallback is likely okay. I'm not sure I understand the first paragraph, though.

in this case, the leaseholder will still be heartbeating at its old epoch

What heartbeating do you mean? If you mean heartbeating its old liveness record, it will fail to heartbeat once the epoch has been incremented. If you mean heartbeating this follower, that's ideal; it would mean that the follower gets removed.

even though the follower may bump its own epoch when partitioned I don't think we can compare that with the lease epoch since the epoch is per-node.

I'm not sure I follow this part either. We can look up the epoch for any node. We can look at the lease to determine the leaseholder's node.

Here's a stab at what I had in mind:

+
+       // ReplicaGCQueueStaleLeaseTimeout is used to avoid checking whether a replica
+       // should be removed too frequently when the lease is invalid. This can happen
+       // if the leaseholder dies and no new lease has been established. However, if
+       // the condition persists for a long time, it likely means that this replica
+       // has been removed, since a new lease should otherwise be established
+       // eventually.
+       ReplicaGCQueueStaleLeaseTimeout = 10 * time.Minute
 )
 
 // Priorities for the replica GC queue.
@@ -144,7 +152,8 @@ func (rgcq *replicaGCQueue) shouldQueue(
                WallTime: repl.store.startedAt,
        }
 
-       if lease, _ := repl.GetLease(); lease.ProposedTS != nil {
+       lease, _ := repl.GetLease()
+       if lease.ProposedTS != nil {
                lastActivity.Forward(lease.ProposedTS.ToTimestamp())
        }
 
@@ -186,6 +195,19 @@ func (rgcq *replicaGCQueue) shouldQueue(
                                        return livenessMap[d.NodeID].IsLive
                                })
                        }
+               case raft.StateFollower:
+                       // If the replica is a follower, the lease is invalid, and we haven't checked
+                       // in the last few minutes, then we have a clue that we might be a quiesced
+                       // replica which is no longer part of the range. We only treat this as suspect
+                       // after not having checked for minutes, because this state can also occur
+                       // when a leaseholder dies.
+                       if repl.store.cfg.NodeLiveness != nil && lease.Expiration == nil {
+                               livenessRecord, ok := repl.store.cfg.NodeLiveness.GetLiveness(lease.Replica.NodeID)
+                               leaseInvalid := !ok || livenessRecord.Epoch > lease.Epoch ||
+                                       livenessRecord.IsDead(now.ToTimestamp().GoTime(), 0)
+                               isSuspect = leaseInvalid &&
+                                       now.ToTimestamp().GoTime().Sub(lastCheck.GoTime()) > ReplicaGCQueueStaleLeaseTimeout
+                       }
                }

@erikgrinaker
Contributor Author

erikgrinaker commented May 24, 2021

In the scenario that we're trying to address here (#60259), from the perspective of the overall range, the lease is not invalid and the leaseholder does not die. The leaseholder keeps heartbeating the KV liveness record for the entire duration, and the lease remains valid. As such, the leaseholder's liveness epoch will remain equal to the lease epoch.

From the perspective of the partitioned quiesced follower, as far as I can tell, the only condition that will trigger in this scenario is livenessRecord.IsDead(now.ToTimestamp().GoTime(), 0). The follower otherwise believes the lease to be valid, since when the partition is healed (as seen in the original issue screenshots), the quiesced follower still uses the same lease as the rest of the range. This condition therefore effectively means that we have lost contact with the leaseholder for some time, but that case is already covered by another existing condition, so I don't see how it adds anything:

// If the replica is a follower, check that the leader is in our range
// descriptor and that we're still in touch with it. This handles e.g. a
// non-voting replica which has lost its leader.
case raft.StateFollower:
	leadDesc, ok := repl.Desc().GetReplicaDescriptorByID(roachpb.ReplicaID(raftStatus.Lead))
	if !ok || !livenessMap[leadDesc.NodeID].IsLive {
		return true
	}

The slight downside here is that there is a race condition: we could queue the replica for GC, the GC attempt could then fail because the follower is partitioned (it can't fetch the range descriptor), the partition heals, and when the GC queue checks the replica again the condition no longer triggers because the follower sees the leaseholder as live again. I thought your epoch suggestion might be able to tell us after the fact (i.e. after the partition has healed) that we had temporarily lost contact with the leaseholder, but I don't think that's the case. We could work around this by tracking some state in the GC queue, but that seems a bit excessive.

Anyway, I don't consider this to be a big problem, and I think #65609 is sufficient. As for the high frequency of checks, the queue only checks one replica at a time, so the check interval is a lower bound, and we already have equivalent conditions (e.g. when the replica is a Raft candidate) that don't appear to cause problems.
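
To make the lower-bound point concrete, here is a small sketch (hypothetical helper, simplified signature, not the actual queue code) of the time gating involved: a suspect replica is only re-checked once the interval has elapsed since its last check.

package main

import (
	"fmt"
	"time"
)

// suspectCheckInterval stands in for ReplicaGCQueueSuspectCheckInterval,
// which #65609 reduced from 5 to 3 seconds.
const suspectCheckInterval = 3 * time.Second

// shouldRecheckSuspect is a hypothetical helper: a suspect replica is only
// queued for another GC check once the interval has elapsed since the last
// check, so the interval is a lower bound on how often any one replica is
// re-examined rather than a fixed rate of work.
func shouldRecheckSuspect(lastCheck, now time.Time) bool {
	return now.Sub(lastCheck) > suspectCheckInterval
}

func main() {
	now := time.Now()
	fmt.Println(shouldRecheckSuspect(now.Add(-2*time.Second), now))  // false: checked too recently
	fmt.Println(shouldRecheckSuspect(now.Add(-10*time.Second), now)) // true
}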

@ajwerner
Contributor

SGTM

craig bot pushed a commit that referenced this issue May 31, 2021
65126: ui: cluster-ui integration  r=koorosh a=koorosh

Note: most of the commits are transferred "as is" from the "cockroachdb/ui" repository; only the last two commits (be2cf6f and 7391346) include changes related to the actual integration of the package and adjustments to the build tools.

Initially, db-console referenced the `cluster-ui` package as an external package hosted in another repository, which required updating the package version every time new changes were made.

Now that the `cluster-ui` package has been extracted from the `cockroachdb/ui` repository and moved into this repo (under `pkg/ui/cluster-ui`), it is possible to reference it as a linked dependency and immediately observe all changes.

To do this, the following changes were made:
- db-console references cluster-ui as a linked dependency and doesn't rely on a specific version
- the `cluster-ui` package relies on the local version of the crdb-protobufjs-client package instead of a specific version
- the Make commands related to db-console are extended to take the nested `cluster-ui` package into account; they run the test, lint, and watch commands for both packages

One notable workaround has to be mentioned: the yarn installation for db-console fails during the first run because it cannot properly consume the symlinked `cluster-ui` dependency, and only on a second attempt does it pass without issues (this is a known issue in the yarn project).

Depends on: cockroachdb/yarn-vendored#73


65140: ui: rework data handling for the job details page r=dhartunian,nihalpednekar a=oeph

Rework the data fetching for the job details page in the admin UI in order to prevent cases where the job details page was stuck showing a loading indicator.

This was caused by the combined data handling of the job table and the job details, leading to loading issues if the job was not listed in the table and therefore could not be shown on the details page (see #64281 for reproduction steps).

The job details page now uses a dedicated api endpoint to fetch the job details
for the given job_id.

Release note: None

Fixes #64281

65609: kvserver: remove replica GC heuristic for quiesced followers r=tbg a=erikgrinaker

In #65062 we added a condition to the replica GC queue under which quiescent followers with a single unavailable voter are considered suspect. This was added to try to detect followers that were partitioned away from the Raft group during their own removal from the range. However, this check has a very high false positive rate, and on second thought it probably costs more than it is worth.

There is already a secondary condition that considers followers who have lost touch with their leader suspect, which is largely sufficient in this case and has a far lower false positive rate. Even though this heuristic is vulnerable to race conditions, it seems a better fit, considering that in the worst case the replica will always be GCed within 12 hours anyway. We have also since moved range-level metrics to the leaseholder, which reduces the impact of these stale replicas.

This patch therefore removes the quiescent replica condition, and
reduces `ReplicaGCQueueSuspectCheckInterval` from 5 to 3 seconds since
we now expect far fewer false positives.

Touches #65202.

Release note: None

/cc @cockroachdb/kv 

65883: docs: update Life of a Query r=tbg a=erikgrinaker

[Rendered](https://github.com/erikgrinaker/cockroach/blob/life-of-a-query/docs/tech-notes/life_of_a_query.md)

This is a first pass at updating the Life of a Query document. It
primarily focuses on updating links and descriptions to the current code
base, and rewriting sections that no longer apply. It has not added
new sections on significant new concepts that have since been
introduced (except where necessary).

Parts of the SQL sections have not been updated -- in particular
"Notable planNodes" -- as the SQL execution engine has undergone
significant changes recently that are better described by members of the
SQL team. Instead, a note has been left informing the reader about
this.

A second pass is planned for later, which will add missing concepts,
expand on existing concepts, and tighten up the prose to make it more
cohesive.

Touches #65196.

Release note: None

/cc @cockroachdb/kv 

Co-authored-by: David Hartunian <[email protected]>
Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Nathan Stilwell <[email protected]>
Co-authored-by: Andrii Vorobiov <[email protected]>
Co-authored-by: Alfonso Subiotto Marques <[email protected]>
Co-authored-by: Alfonso Subiotto Marqués <[email protected]>
Co-authored-by: oeph <[email protected]>
Co-authored-by: Erik Grinaker <[email protected]>