kvserver: improve ReplicaUnavailableError #94691
cc @cockroachdb/replication
For reference, here is what is being printed today:
Source & repro: cockroach/pkg/kv/kvserver/replica_circuit_breaker.go, lines 238 to 277 in a94858b
A few observations:
I reviewed https://github.com/cockroachlabs/support/issues/1963 and the circuit breaker did trip there. Chatted with @92345 and we agreed that we should improve the circuit breaker message instead of adding a new log message, for now:
Informs cockroachdb#94691. Epic: CRDB-23087 Release note: None
97035: kvserver: log circuit breaker trip events at error severity r=erikgrinaker a=tbg

Informs #94691.
Epic: CRDB-23087
Release note: None

Co-authored-by: Tobias Grieger <[email protected]>
Loss-of-quorum scenarios can be difficult to detect. Currently, the signals are mostly around "slowness", e.g. replica circuit breakers reporting slow proposals, requests reporting slow latch acquisition, requests and queries timing out, etc. We have also seen situations where we lost quorum on the liveness range and thus had no access to the DB console or any other debugging tools, which took a long time to resolve (https://github.com/cockroachlabs/support/issues/1963).
We should warn loudly, both in logs and the DB console/metrics, when unquiesced ranges are unable to acquire a Raft leader for some time. This is a much more specific signal for quorum loss, which immediately rules out lots of other causes of "slowness".
Jira issue: CRDB-23087
Epic CRDB-39898