ui: add snapshots dashboard to metrics page #86599

Closed
Santamaura opened this issue Aug 22, 2022 · 0 comments · Fixed by #86702
Labels
A-kv-observability C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception)

Santamaura commented Aug 22, 2022

Is your feature request related to a problem? Please describe.
Part of #85445.

Describe the solution you'd like
As part of the effort to improve decommissioning observability, it would help to add another dashboard that visualizes the following metrics:

Success/error counts by allocator action

  • queue.replicate.addreplica.(success|error)
  • queue.replicate.removereplica.(success|error)
  • queue.replicate.replacedeadreplica.(success|error)
  • queue.replicate.removedeadreplica.(success|error)
  • queue.replicate.replacedecommissioningreplica.(success|error)
  • queue.replicate.removedecommissioningreplica.(success|error)
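
The `(success|error)` shorthand in the list above stands for two separate metrics per allocator action. As a minimal sketch of how that shorthand expands into concrete metric names (the helper `expand_metric_pattern` and the `ALLOCATOR_ACTIONS` list are hypothetical names for illustration, not part of the CockroachDB codebase):

```python
import re
from typing import List


def expand_metric_pattern(pattern: str) -> List[str]:
    """Expand the first '(a|b)' alternation group into concrete names.

    Patterns without an alternation group are returned unchanged.
    """
    match = re.search(r"\(([^()]*)\)", pattern)
    if match is None:
        return [pattern]
    head, tail = pattern[:match.start()], pattern[match.end():]
    return [head + option + tail for option in match.group(1).split("|")]


# Allocator actions taken from the list in this issue.
ALLOCATOR_ACTIONS = [
    "addreplica",
    "removereplica",
    "replacedeadreplica",
    "removedeadreplica",
    "replacedecommissioningreplica",
    "removedecommissioningreplica",
]

# Twelve concrete metric names: one success and one error counter per action.
allocator_metric_names = [
    name
    for action in ALLOCATOR_ACTIONS
    for name in expand_metric_pattern(f"queue.replicate.{action}.(success|error)")
]
```

Each action would then contribute one success series and one error series to the same graph.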

Snapshots queued and in-progress

  • range.snapshots.send-queue
  • range.snapshots.recv-queue
  • range.snapshots.send-in-progress
  • range.snapshots.recv-in-progress
  • (optional) range.snapshots.send-total-in-progress
  • (optional) range.snapshots.recv-total-in-progress

Queue metrics

  • queue.replicate.process.(success|failure)
  • queue.replicate.purgatory
  • (optional) queue.replicate.processingnanos

Transferred bytes
Note: It might make sense to visualize these as rates.

  • range.snapshots.unknown.rcvd-bytes
  • range.snapshots.unknown.sent-bytes
  • range.snapshots.rebalancing.rcvd-bytes
  • range.snapshots.rebalancing.sent-bytes
  • range.snapshots.recovery.rcvd-bytes
  • range.snapshots.recovery.sent-bytes
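
To illustrate the note about rates: assuming the byte metrics above are cumulative counters sampled at known timestamps, a per-second rate can be derived from consecutive samples. This is only a sketch; the function name `counter_to_rate`, the sample format, and the reset handling are assumptions made for illustration and do not describe how the DB Console computes rates.

```python
from typing import List, Tuple

# (timestamp in seconds, cumulative byte count) -- assumed sample format.
Sample = Tuple[float, float]


def counter_to_rate(samples: List[Sample]) -> List[Tuple[float, float]]:
    """Convert cumulative counter samples into (timestamp, bytes/second) points."""
    rates = []
    for (t0, v0), (t1, v1) in zip(samples, samples[1:]):
        dt = t1 - t0
        if dt <= 0:
            continue  # skip duplicate or out-of-order samples
        delta = v1 - v0
        if delta < 0:
            # Counter went backwards: assume it was reset to zero
            # (for example by a node restart) and count from there.
            delta = v1
        rates.append((t1, delta / dt))
    return rates


# Hypothetical samples for range.snapshots.recovery.rcvd-bytes.
example = [(0, 0), (10, 1000), (20, 4000), (30, 500)]
print(counter_to_rate(example))
```

Plotting the resulting bytes-per-second series shows snapshot throughput directly, whereas a raw cumulative counter only ever grows and hides changes in transfer speed.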

Jira issue: CRDB-18834

Epic CRDB-10792

Santamaura added C-enhancement Solution expected to add code/behavior + preserve backward-compat (pg compat issues are exception) T-kv-observability labels Aug 22, 2022
Santamaura added a commit to Santamaura/cockroach that referenced this issue Aug 24, 2022
This change adds new graphs to the metrics replication
dashboard. New metrics visualized on the dashboard can be used
to help triage decommissioning issues. Metrics visualized
include:
- queue.replicate.addreplica.(success|error)
- queue.replicate.removereplica.(success|error)
- queue.replicate.replacedeadreplica.(success|error)
- queue.replicate.removedeadreplica.(success|error)
- queue.replicate.replacedecommissioningreplica.(success|error)
- queue.replicate.removedecommissioningreplica.(success|error)
- range.snapshots.recv-queue
- queue.replicate.purgatory
- range.snapshots.unknown.rcvd-bytes
- range.snapshots.rebalancing.rcvd-bytes
- range.snapshots.recovery.rcvd-bytes

Release justification: low risk, high benefit changes to
existing functionality.

Resolves cockroachdb#86599

Release note (ui change): introduce new graphs on metrics
replication dashboard to improve decommissioning observability
Santamaura added a commit to Santamaura/cockroach that referenced this issue Aug 25, 2022
This change adds new graphs to the metrics replication
dashboard. New metrics visualized on the dashboard can be used
to help triage decommissioning issues. Metrics visualized
include:
- queue.replicate.addreplica.(success|error)
- queue.replicate.removereplica.(success|error)
- queue.replicate.replacedeadreplica.(success|error)
- queue.replicate.removedeadreplica.(success|error)
- queue.replicate.replacedecommissioningreplica.(success|error)
- queue.replicate.removedecommissioningreplica.(success|error)
- range.snapshots.recv-queue
- range.snapshots.unknown.rcvd-bytes
- range.snapshots.rebalancing.rcvd-bytes
- range.snapshots.recovery.rcvd-bytes

Release justification: low risk, high benefit changes to
existing functionality.

Resolves cockroachdb#86599

Release note (ui change): introduce new graphs on metrics
replication dashboard to improve decommissioning observability
AlexTalks pushed a commit to AlexTalks/cockroach that referenced this issue Aug 26, 2022
Santamaura added a commit to Santamaura/cockroach that referenced this issue Aug 29, 2022
Santamaura added a commit to Santamaura/cockroach that referenced this issue Sep 1, 2022
Santamaura added a commit to Santamaura/cockroach that referenced this issue Sep 1, 2022
craig bot pushed a commit that referenced this issue Sep 6, 2022
86702: ui: add decommissioning relevant graphs to metrics replication dashboard r=Santamaura a=Santamaura

This change adds new graphs to the metrics replication
dashboard. New metrics visualized on the dashboard can be used
to help triage decommissioning issues. Metrics visualized
include:
- queue.replicate.addreplica.(success|error)
- queue.replicate.removereplica.(success|error)
- queue.replicate.replacedeadreplica.(success|error)
- queue.replicate.removedeadreplica.(success|error)
- queue.replicate.replacedecommissioningreplica.(success|error)
- queue.replicate.removedecommissioningreplica.(success|error)
- range.snapshots.recv-queue
- queue.replicate.purgatory
- range.snapshots.rebalancing.rcvd-bytes
- range.snapshots.recovery.rcvd-bytes

Release justification: low risk, high benefit changes to
existing functionality.

Resolves #86599

Release note (ui change): introduce new graphs on metrics
replication dashboard to improve decommissioning observability

86988: kvserver: lazily translate Spans to LockUpdates instead of pre-allocating r=shralex a=shralex

Previously, we called LocksAsLockUpdates before calling ResolveIntents, which
pre-allocated memory for all LockUpdates. In this PR we change the interface
of ResolveIntents to avoid this memory allocation and perform the translation
of Span to LockUpdate lazily, as we iterate over them in ResolveIntents.

Release justification: stability change that may help avoid OOM.
Release note: None

Resolves: #77219
Jira issue: https://cockroachlabs.atlassian.net/browse/CRDB-13478

Co-authored-by: Santamaura <[email protected]>
Co-authored-by: shralex <[email protected]>
craig bot closed this as completed in 3952ab4 Sep 6, 2022