ui: surface PausedReplicas in DB Console #84489
Comments
CC @cockroachdb/replication.
Just wanted to point out that the paused follower metric is populated on the leaseholders. So if s3 is overloaded, we expect other stores to register nonzero values on the paused follower metric, since they are doing the pausing. We could add a metric that basically just indicates whether a store is pausable (reported by the store itself). This is basically just a metric that reports
On a related note, I think something like this is also lacking for IO admission control (which triggers at iothreshold score 1).
What is the simplest way to simulate pausing replicas?
@Santamaura one reliable way is to rebase on top of #81516 and then invoke the roachtest like this:
The roachtest will "fail" after a few minutes, but it will leave the cluster running with load against it. You can then navigate to the fourth node, port 3000, log in with admin:admin, and then go to dashboards -> browse to open the one dashboard that exists. You will find a graph of pausable replicas there. As the "L0 threshold" graph crosses 0.8, you will see paused replicas. (This might take an hour or two, but once it gets there it will happen periodically.)
86407: ui, server: surface paused replicas in problem ranges, range report, and replication metrics r=Santamaura a=Santamaura
This change surfaces paused replicas in the problem ranges page, in the range report, and as a new chart in the replication metrics.
Release justification: low risk, high benefit changes to existing functionality.
Resolves: #84489
Release note (ui change): surface paused replicas to range report, problem ranges, and replication metrics pages.
Co-authored-by: Santamaura
In #83851, we added a concept of "paused replicas". If a Raft follower's Pebble store is overloaded, then the leaseholder will temporarily pause replication to that follower until Pebble recovers (assuming we can still maintain a quorum).
The number of currently paused replicas is surfaced via the status server's range info as RangeInfo.paused_replicas (cockroach/pkg/kv/kvserver/kvserverpb/state.proto, line 187 in 43a37d5), and via the metric admission.raft.paused_replicas (cockroach/pkg/ts/catalog/chart_catalog.go, lines 567 to 572 in 43a37d5).
We should surface this in the DB Console, e.g.:
Display it in the range report.
Display it in the problem ranges report, which requires extending the ProblemRanges API call with a new response field PausedFollowers and populating it (see cockroach/pkg/server/serverpb/status.proto, line 1176 in 8b07764, and cockroach/pkg/server/problem_ranges.go, line 93 in 0ffc720).
Display it in the replication metrics, either on an existing chart or a new one.
Other places?
Jira issue: CRDB-17690