release-24.1: dbconsole: overload page improvements #124509

Merged
aadityasondhi merged 7 commits into release-24.1 from blathers/backport-release-24.1-123522 on May 22, 2024

blathers-crl bot commented May 21, 2024

Backport 7/7 commits from #123522 on behalf of @aadityasondhi.

/cc @cockroachdb/release


This PR contains a series of improvements to the overload page of the DB console as part of #121574. It is separated into multiple commits for ease of review.


dbconsole: remove non useful charts on the overload page

In investigations, we have found that the following charts are not
useful and frequently cause confusion:

  • Admission Work Rate
  • Admission Delay Rate
  • Requests Waiting For Flow Tokens

Informs #121572

Release note (ui change): This patch removes "Admission Delay Rate",
"Admission Work Rate", and "Requests Waiting For Flow Tokens". These
charts often cause confusion and are not useful for general overload
investigations.


dbconsole: reorder overload page metrics for better readability

This patch reorders the existing metrics in a more usable order:

  1. Metrics to help determine which resource is constrained (IO, CPU)
  2. Metrics to narrow down which AC queues are seeing requests waiting
  3. More advanced metrics about the system health (goroutine scheduler,
    L0 sublevels, etc.)

Informs #121572.

Release note (ui change): The metrics on the overload page have been
reordered to categorize them better. They are roughly in the following order:

  1. Metrics to help determine which resource is constrained (IO, CPU)
  2. Metrics to narrow down which AC queues are seeing requests waiting
  3. More advanced metrics about the system health (goroutine scheduler,
    L0 sublevels, etc.)
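
The three-tier ordering above can be illustrated with a small TypeScript sketch. This is not the DB Console's actual code: the `ChartCategory` constants, the `OverloadChart` type, the `orderCharts` helper, and the assignment of each chart title to a tier are assumptions made for illustration. Only the tier descriptions come from the release note.

```typescript
// Hypothetical model of the overload page ordering; not the DB Console source.
const ChartCategory = {
  ConstrainedResource: 1, // which resource is constrained (IO, CPU)
  AdmissionQueues: 2, // which AC queues are seeing requests waiting
  AdvancedHealth: 3, // goroutine scheduler, L0 sublevels, etc.
} as const;

type ChartCategoryValue = (typeof ChartCategory)[keyof typeof ChartCategory];

interface OverloadChart {
  title: string;
  category: ChartCategoryValue;
}

// Sort a copy by tier. Array.prototype.sort is stable (ES2019+), so charts
// in the same tier keep their original relative order.
function orderCharts(charts: OverloadChart[]): OverloadChart[] {
  return [...charts].sort((a, b) => a.category - b.category);
}

// Illustrative titles; the tier assigned to each chart is an assumption.
const charts: OverloadChart[] = [
  { title: "L0 Sublevels", category: ChartCategory.AdvancedHealth },
  { title: "Admission Queueing Delay – Store", category: ChartCategory.AdmissionQueues },
  { title: "Admission IO Tokens Exhausted", category: ChartCategory.ConstrainedResource },
  { title: "Goroutine Scheduling Latency", category: ChartCategory.AdvancedHealth },
];

const ordered: string[] = orderCharts(charts).map((c) => c.title);
console.log(ordered.join("\n"));
```

Keeping the tier as a plain number makes the page order a single sort key, so a new chart only needs a tier assignment rather than a hand-maintained position.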

dbconsole: include better names and descriptions for overload page

This patch improves the metric descriptions for the metrics on the
overload page.

Fixes #120853.

Release note (ui change): The overload page now includes descriptions for all
metrics.


dbconsole: additional higher granularity metrics for overload

This patch adds metrics to the overload page that allow for a
more granular look at the system:

  • cr.store.storage.l0-sublevels
  • cr.node.go.scheduler_latency-p99.9

Informs #121572.

Release note (ui change): The overload page now includes two additional
metrics for better visibility into overloaded resources:

  • cr.store.storage.l0-sublevels
  • cr.node.go.scheduler_latency-p99.9
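
The two names above carry a prefix that indicates whether the time series is recorded per node (`cr.node.`) or per store (`cr.store.`). A minimal TypeScript sketch of splitting such a name follows. The `parseMetricName` helper and the `ParsedMetric` type are hypothetical and are not part of the DB Console.

```typescript
// Hypothetical helper for illustration; not the DB Console's API.
type MetricScope = "node" | "store";

interface ParsedMetric {
  scope: MetricScope; // whether the series is per node or per store
  name: string; // metric name with the "cr.<scope>." prefix removed
}

// Split a time series name such as "cr.store.storage.l0-sublevels" into its
// scope and metric name. Returns null if the prefix is not recognized.
function parseMetricName(full: string): ParsedMetric | null {
  const match = /^cr\.(node|store)\.(.+)$/.exec(full);
  if (match === null) {
    return null;
  }
  return { scope: match[1] as MetricScope, name: match[2] };
}

const l0 = parseMetricName("cr.store.storage.l0-sublevels");
const sched = parseMetricName("cr.node.go.scheduler_latency-p99.9");
console.log(l0, sched);
```

Under this reading, L0 sublevels are reported for each store while the p99.9 scheduler latency is reported for each node, which is why the two charts can show a different number of lines on the same cluster.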

dbconsole: split Admission Queue graphs to avoid overcrowding

Informs #121572.

Release note (ui change): There are now 4 graphs for Admission Queue
Delay:

  1. Foreground (regular) CPU work
  2. Store (IO) work
  3. Background (elastic) CPU work
  4. Replication Admission Control, store overload on replicas

dbconsole: add elastic store metric to the overload page

This patch uses the new separated elastic-stores metrics for queueing
delay from #123890.

Informs #121572.

Release note (ui change): The Admission Queueing Delay – Store chart
now separates elastic (background) work from the regular foreground
work.


dbconsole: add elastic io token exhausted duration to overload page

This patch adds the metric elastic_io_tokens_exhausted_duration.kv
introduced in #124078.

Informs #121572.

Release note (ui change): The Admission IO Tokens Exhausted chart now
separates elastic and regular IO work.


Release justification: Metrics-only change that will significantly help Admission Control escalations.

blathers-crl bot requested a review from a team as a code owner May 21, 2024 18:51
blathers-crl bot force-pushed the blathers/backport-release-24.1-123522 branch from d1f8f90 to 43e1a10 May 21, 2024 18:51
blathers-crl bot requested review from kyle-a-wong and removed request for a team May 21, 2024 18:51
blathers-crl bot added the blathers-backport (This is a backport that Blathers created automatically.) and O-robot (Originated from a bot.) labels May 21, 2024

blathers-crl bot commented May 21, 2024

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Backports should only be created for serious issues or test-only changes.
  • Backports should not break backwards-compatibility.
  • Backports should change as little code as possible.
  • Backports should not change on-disk formats or node communication protocols.
  • Backports should not add new functionality (except as defined here).
  • Backports must not add, edit, or otherwise modify cluster versions; or add version gates.
  • All backports must be reviewed by the owning area's TL and one additional TL. For more information as to how that review should be conducted, please consult the backport policy.

If your backport adds new functionality, please ensure that the following additional criteria are satisfied:
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters. State changes must be further protected such that nodes running old binaries will not be negatively impacted by the new state (with a mixed version test added).
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.
  • Your backport must be accompanied by a post to the appropriate Slack channel (#db-backports-point-releases or #db-backports-XX-X-release) for awareness and discussion.

Also, please add a brief release justification to the body of your PR to justify this
backport.

blathers-crl bot added the backport label (Label PR's that are backports to older release branches) May 21, 2024
@cockroach-teamcity

This change is Reviewable

aadityasondhi requested a review from dhartunian May 21, 2024 19:32
aadityasondhi merged commit e6ef2c0 into release-24.1 May 22, 2024
19 of 20 checks passed
aadityasondhi deleted the blathers/backport-release-24.1-123522 branch May 22, 2024 15:37