
kvserver: improve ReplicaUnavailableError #94691

Open
erikgrinaker opened this issue Jan 4, 2023 · 2 comments
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv KV Team

Comments

@erikgrinaker
Contributor

erikgrinaker commented Jan 4, 2023

Loss-of-quorum scenarios can be difficult to detect. Currently, the signals are mostly around "slowness", e.g. replica circuit breakers reporting slow proposals, requests reporting slow latch acquisition, requests and queries timing out, etc. We have also seen situations where we lost quorum on the liveness range, and thus didn't have access to the DB console or any other debugging tools, which took a long time to resolve (https://github.com/cockroachlabs/support/issues/1963).

We should warn loudly, both in logs and the DB console/metrics, when unquiesced ranges are unable to establish a Raft leader for some time. This is a much more specific signal for quorum loss, which immediately rules out lots of other causes of "slowness".

Jira issue: CRDB-23087

Epic CRDB-39898

@erikgrinaker erikgrinaker added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) A-kv-observability T-kv-replication labels Jan 4, 2023
@blathers-crl

blathers-crl bot commented Jan 4, 2023

cc @cockroachdb/replication

@tbg
Member

tbg commented Jan 4, 2023

For reference, here is what is being printed today:

I230104 12:52:46.406039 4347 kv/kvserver/replica_circuit_breaker.go:149 ⋮ [n3,s3,r2/2:‹/System/NodeLiveness{-Max}›] 768 breaker: now tripped with error: replica unavailable: (n3,s3):2 unable to serve request to r2:‹/System/NodeLiveness{-Max}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]: lost quorum (down: (n1,s1):1,(n2,s2):3); closed timestamp: 1672836641.760882000,0 (2023-01-04 12:50:41); raft status: {"id":"2","term":7,"vote":"2","commit":88,"lead":"2","raftState":"StateLeader","applied":88,"progress":{"2":{"match":192,"next":193,"state":"StateReplicate"},"3":{"match":88,"next":89,"state":"StateProbe"},"1":{"match":88,"next":89,"state":"StateProbe"}},"leadtransferee":"0"}: ‹operation "probe" timed out after 1m0.002s (given timeout 1m0s)›: result is ambiguous: after 60.00s of attempting command: context deadline exceeded (previously: replica unavailable: (n3,s3):2 unable to serve request to r2:‹/System/NodeLiveness{-Max}› [(n1,s1):1, (n3,s3):2, (n2,s2):3, next=4, gen=4]: lost quorum (down: (n1,s1):1,(n2,s2):3); closed timestamp: 1672836641.760882000,0 (2023-01-04 12:50:41); raft status: {"id":"2","term":7,"vote":"2","commit":88,"lead":"2","raftState":"StateLeader","applied":88,"progress":{"3":{"match":88,"next":89,"state":"StateProbe"},"1":{"match":88,"next":89,"state":"StateProbe"},"2":{"match":131,"next":132,"state":"StateReplicate"}},"leadtransferee":"0"}: have been waiting 61.60s for slow proposal RequestLease [‹/System/NodeLiveness›,/Min))

Source & repro

func replicaUnavailableError(
	err error,
	desc *roachpb.RangeDescriptor,
	replDesc roachpb.ReplicaDescriptor,
	lm livenesspb.IsLiveMap,
	rs *raft.Status,
	closedTS hlc.Timestamp,
) error {
	nonLiveRepls := roachpb.MakeReplicaSet(nil)
	for _, rDesc := range desc.Replicas().Descriptors() {
		if lm[rDesc.NodeID].IsLive {
			continue
		}
		nonLiveRepls.AddReplica(rDesc)
	}
	canMakeProgress := desc.Replicas().CanMakeProgress(
		func(replDesc roachpb.ReplicaDescriptor) bool {
			return lm[replDesc.NodeID].IsLive
		},
	)
	// Ensure good redaction.
	var _ redact.SafeFormatter = nonLiveRepls
	var _ redact.SafeFormatter = desc
	var _ redact.SafeFormatter = replDesc
	var buf redact.StringBuilder
	if !canMakeProgress {
		buf.Printf("lost quorum (down: %v); ", nonLiveRepls)
	}
	buf.Printf(
		"closed timestamp: %s (%s); raft status: %+v",
		closedTS,
		redact.Safe(timeutil.Unix(0, closedTS.WallTime).UTC().Format("2006-01-02 15:04:05")),
		redact.Safe(rs), /* raft status contains no PII */
	)
	return roachpb.NewReplicaUnavailableError(errors.Wrapf(err, "%s", buf), desc, replDesc)
}

roachprod create -n 3 local
roachprod stage local release v22.2.1
roachprod start local
sleep 60
roachprod stop local:1-2
sleep 65
less ~/local/3/logs/cockroach.log

A few observations:

  • at least for the message above, no raft elections are taking place - we had a stable leader, other nodes died, we don't use CheckQuorum, so leader doesn't step down, so nobody fails elections. But, the progress will be populated and a quorum of followers will be in StateProbe, indicating that there are issues.
  • if there were elections, we'd see either State{Pre,}Candidate or StateFollower here, which are both indications that something's awry at the raft level. After a failed election timeout though, we revert back to StateFollower for a while - but then with a zero state.Lead.
  • the "down nodes" in the message are determined by liveness, so if the liveness range itself is the problem the info will be somewhat misleading, since it will print that all nodes are down - but only some of them may be truly down, others are fine but can't heartbeat liveness. This is difficult to understand.
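The first observation — a stable leader with a quorum of followers stuck in StateProbe — can be checked directly from the raft progress map, without relying on liveness. A minimal sketch (the `replicaProgress` type is illustrative; the real etcd/raft `Status.Progress` type differs, though the state names match the log above):

```go
package main

import "fmt"

// replicaProgress is an illustrative stand-in for an entry in the raft
// leader's progress map; the state strings match the circuit breaker
// log dump ("StateReplicate", "StateProbe").
type replicaProgress struct {
	ID    uint64
	State string
}

// quorumReplicating reports whether a majority of replicas are in
// StateReplicate (the leader always replicates to itself). In the log
// above, 2 of 3 replicas are in StateProbe, so the leader is intact
// but cannot commit: a strong quorum-loss signal that does not depend
// on the liveness range being available.
func quorumReplicating(progress []replicaProgress) bool {
	replicating := 0
	for _, p := range progress {
		if p.State == "StateReplicate" {
			replicating++
		}
	}
	return replicating > len(progress)/2
}

func main() {
	// Progress from the circuit breaker message for r2 (leader is replica 2).
	progress := []replicaProgress{
		{ID: 1, State: "StateProbe"},
		{ID: 2, State: "StateReplicate"},
		{ID: 3, State: "StateProbe"},
	}
	fmt.Println(quorumReplicating(progress)) // false
}
```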

I reviewed https://github.com/cockroachlabs/support/issues/1963 and the circuit breaker did trip there.

Chatted with @92345 and we agreed that we should improve the circuit breaker message instead of adding a new log message, for now:

  • make sure the "now tripped with error" message is logged at error level, not info. This is the authoritative place where we want to surface this in the logs.
  • reformulate the error message so that it takes the shape "<sentence that indicates a really bad problem>\n\n<more debug-type info>"
  • if there is no leader (which is also the case when in {Pre,}Candidate state), or quorum is down according to either ConnHealth or, when in StateLeader, the Progress (i.e. no quorum in StateReplicate), strongly suggest loss of quorum and print out the nodes considered down.

tbg added a commit to tbg/cockroach that referenced this issue Feb 13, 2023
craig bot pushed a commit that referenced this issue Feb 20, 2023
97035: kvserver: log circuit breaker trip events at error severity r=erikgrinaker a=tbg

Informs #94691.

Epic: CRDB-23087
Release note: None


Co-authored-by: Tobias Grieger <[email protected]>
@tbg tbg changed the title kvserver: observability into ranges without leaders kvserver: improve ReplicaUnavailableError Mar 21, 2023
@exalate-issue-sync exalate-issue-sync bot added T-kv KV Team and removed T-kv-replication labels Jun 28, 2024
@github-project-automation github-project-automation bot moved this to Incoming in KV Aug 28, 2024