ui: surface PausedReplicas in DB Console #84489

Closed
erikgrinaker opened this issue Jul 15, 2022 · 4 comments · Fixed by #86407
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Comments

@erikgrinaker
Contributor

erikgrinaker commented Jul 15, 2022

In #83851, we added a concept of "paused replicas". If a Raft follower's Pebble store is overloaded, then the leaseholder will temporarily pause replication to that follower until Pebble recovers (assuming we can still maintain a quorum).
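As a rough illustration of the rule described above, the following sketch pauses overloaded followers only while a quorum of voters stays active. The `follower` type, the `Overloaded` flag, and `pausedFollowers` are hypothetical names invented for this example; this is not the implementation from #83851.

```go
package main

import "fmt"

// ReplicaID stands in for roachpb.ReplicaID in this sketch.
type ReplicaID int32

// follower is a hypothetical view of a Raft follower as seen by the leaseholder.
type follower struct {
	ID         ReplicaID
	Overloaded bool // assumption: the follower's store reports IO overload
}

// pausedFollowers returns the overloaded followers that can be paused while
// still keeping a quorum of voters active. numVoters counts the leader too.
func pausedFollowers(numVoters int, followers []follower) []ReplicaID {
	quorum := numVoters/2 + 1
	active := numVoters // voters still receiving replication traffic
	var paused []ReplicaID
	for _, f := range followers {
		if !f.Overloaded {
			continue
		}
		if active-1 < quorum {
			break // pausing another follower would lose quorum
		}
		active--
		paused = append(paused, f.ID)
	}
	return paused
}

func main() {
	// Three voters: the leader plus followers 2 and 3, both overloaded.
	// Only one of them can be paused without dropping below quorum (2).
	fs := []follower{{ID: 2, Overloaded: true}, {ID: 3, Overloaded: true}}
	fmt.Println(pausedFollowers(3, fs)) // prints [2]
}
```

Which of several overloaded followers gets paused first is arbitrary here (slice order); the real selection policy may differ.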

The currently paused replicas are surfaced via the status server's range info as RangeInfo.paused_replicas, a list of replica IDs:

repeated int32 paused_replicas = 21 [(gogoproto.casttype) = "github.com/cockroachdb/cockroach/pkg/roachpb.ReplicaID"];

And via the metric admission.raft.paused_replicas:

{
	Title: "Paused Followers",
	Metrics: []string{
		"admission.raft.paused_replicas",
	},
},

We should surface this in the DB Console, e.g.:

  • Display it in the range report.

  • Display it in the problem ranges report, which requires extending the ProblemRanges API call with a new response field PausedFollowers and populating it.

    message ProblemRangesResponse {

    for _, info := range resp.resp.Ranges {

  • Display it in the replication metrics, either on an existing chart or a new one.

  • Other places?
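For the problem ranges item above, populating the new field could look roughly like the sketch below. The `rangeInfo` and `problemRanges` types are simplified, hypothetical stand-ins for the real status server messages, and storing range IDs in `PausedFollowers` is an assumption about the field's shape.

```go
package main

import (
	"fmt"
	"sort"
)

// rangeInfo is a simplified, hypothetical stand-in for the status server's
// RangeInfo; only the fields needed here are modeled.
type rangeInfo struct {
	RangeID        int64
	PausedReplicas []int32 // mirrors RangeInfo.paused_replicas
}

// problemRanges is a hypothetical stand-in for one node's entry in the
// ProblemRanges response, reduced to the proposed new field.
type problemRanges struct {
	PausedFollowers []int64 // assumption: IDs of ranges with paused followers
}

// collectPausedFollowers flags every range that reports at least one paused
// replica and returns the range IDs in ascending order.
func collectPausedFollowers(infos []rangeInfo) problemRanges {
	var out problemRanges
	for _, info := range infos {
		if len(info.PausedReplicas) > 0 {
			out.PausedFollowers = append(out.PausedFollowers, info.RangeID)
		}
	}
	sort.Slice(out.PausedFollowers, func(i, j int) bool {
		return out.PausedFollowers[i] < out.PausedFollowers[j]
	})
	return out
}

func main() {
	infos := []rangeInfo{
		{RangeID: 7, PausedReplicas: []int32{3}},
		{RangeID: 2},
		{RangeID: 4, PausedReplicas: []int32{1, 2}},
	}
	fmt.Println(collectPausedFollowers(infos).PausedFollowers) // prints [4 7]
}
```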

Jira issue: CRDB-17690

@erikgrinaker added the C-enhancement, A-kv-observability, and T-kv-observability labels Jul 15, 2022
@erikgrinaker
Contributor Author

CC @cockroachdb/replication.

@tbg
Member

tbg commented Aug 16, 2022

Just wanted to point out that the paused follower metric is populated on the leaseholders. So if s3 is overloaded, we expect other stores to register nonzero values on the paused follower metric, since they are doing the pausing.

We could add a metric that simply indicates whether a store is pausable (reported by the store itself), i.e. one that reports int(iothreshold.Score > configured_pausable_threshold_cluster_setting).
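A minimal sketch of such a per-store gauge value follows. Go has no direct bool-to-int conversion, so the `int(...)` expression above is shorthand; `pausableGauge` and its parameters are hypothetical names, and how the score and threshold are obtained is left out.

```go
package main

import "fmt"

// pausableGauge returns 1 if the store's IO threshold score exceeds the
// configured pause threshold, and 0 otherwise. Both inputs are assumed to be
// supplied by the caller (the store's current score and the cluster setting).
func pausableGauge(score, pauseThreshold float64) int64 {
	if score > pauseThreshold {
		return 1
	}
	return 0
}

func main() {
	fmt.Println(pausableGauge(0.9, 0.8)) // prints 1
	fmt.Println(pausableGauge(0.5, 0.8)) // prints 0
}
```

Because the store reports this value about itself, it would complement the leaseholder-side paused follower count rather than replace it.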

On a related note, I think something like this is also lacking for IO admission control (which triggers at iothreshold score 1).

@Santamaura
Contributor

What is the simplest way to simulate pausing replicas?

@tbg
Member

tbg commented Aug 18, 2022

@Santamaura one reliable way is to rebase on top of #81516 and then invoke the roachtest like this:

export GCE_PROJECT=andrei-jepsen && ./pkg/cmd/roachtest/roachstress.sh -c 1 -u admission/follower-overload/presplit-with-leases -- --debug

The roachtest will "fail" after a few minutes, but it will leave the cluster running with load against it. You can then navigate to the fourth node, port 3000, log in with admin:admin, and then go to dashboards -> browse to open the one dashboard that exists. You will find a graph of pausable replicas there. As the "L0 threshold" graph crosses 0.8, you will see paused replicas. (This might take an hour or two, but once it happens it will recur periodically.)

craig bot pushed a commit that referenced this issue Aug 23, 2022
86407: ui, server: surface paused replicas in problem ranges, range report, and replication metrics r=Santamaura a=Santamaura

This change surfaces paused replicas in the problem ranges
page, in the range report, and as a new chart in the
replication metrics.

Release justification: low risk, high benefit changes to
existing functionality.

Resolves: #84489

Release note (ui change): surface paused replicas to range report,
problem ranges, and replication metrics pages.

Co-authored-by: Santamaura <[email protected]>
@craig craig bot closed this as completed in 2bd238a Aug 23, 2022
3 participants