dbconsole: overload page improvements #123522

Merged
merged 7 commits on May 21, 2024

Conversation

aadityasondhi
Collaborator

@aadityasondhi aadityasondhi commented May 2, 2024

This PR contains a series of improvements to the overload page of the DB console as part of #121574. It is separated into multiple commits for ease of review.


dbconsole: remove non-useful charts on the overload page

In investigations, we have found that the following charts are not
useful and frequently cause confusion:

  • Admission work rate
  • Admission Delay rate
  • Requests Waiting For Flow Tokens

Informs #121572

Release note (ui change): This patch removes "Admission Delay Rate",
"Admission Work Rate", and "Requests Waiting For Flow Tokens". These
charts often cause confusion and are not useful for general overload
investigations.


dbconsole: reorder overload page metrics for better readability

This patch reorders the existing metrics in a more usable order:

  1. Metrics to help determine which resource is constrained (IO, CPU)
  2. Metrics to narrow down which AC queues are seeing requests waiting
  3. More advanced metrics about the system health (goroutine scheduler,
    L0 sublevels, etc.)

Informs #121572.

Release note (ui change): Metrics on the overload page have been
reordered to help categorize them better. They are roughly in the
following order:

  1. Metrics to help determine which resource is constrained (IO, CPU)
  2. Metrics to narrow down which AC queues are seeing requests waiting
  3. More advanced metrics about the system health (goroutine scheduler,
    L0 sublevels, etc.)

dbconsole: include better names and descriptions for overload page

This patch improves the descriptions for the metrics on the
overload page.

Fixes #120853.

Release note (ui change): The overload page now includes descriptions for all
metrics.


dbconsole: additional higher granularity metrics for overload

This patch adds additional metrics to the overload page that allow for
a more granular look at the system:

  • cr.store.storage.l0-sublevels
  • cr.node.go.scheduler_latency-p99.9

Informs #121572.

Release note (ui change): Two additional metrics on the overload page
for better visibility into overloaded resources:

  • cr.store.storage.l0-sublevels
  • cr.node.go.scheduler_latency-p99.9

dbconsole: split Admission Queue graphs to avoid overcrowding

Informs #121572.

Release note (ui change): There are now 4 graphs for Admission Queue
Delay:

  1. Foreground (regular) CPU work
  2. Store (IO) work
  3. Background (elastic) CPU work
  4. Replication Admission Control, store overload on replicas

dbconsole: add elastic store metric to the overload page

This patch uses the new separated elastic-stores metrics for queueing
delay from #123890.

Informs #121572.

Release note (ui change): The Admission Queueing Delay – Store chart
now separates elastic (background) work from the regular foreground
work.


dbconsole: add elastic io token exhausted duration to overload page

This patch adds the metric elastic_io_tokens_exhausted_duration.kv
introduced in #124078.

Informs #121572.

Release note (ui change): The Admission IO Tokens Exhausted chart now
separates elastic and regular IO work.

@aadityasondhi aadityasondhi force-pushed the 20240502.overload-page-fixes branch from a2c9f5e to 68ca397 Compare May 2, 2024 21:48
@cockroach-teamcity
Member

This change is Reviewable

@aadityasondhi aadityasondhi changed the title 20240502.overload page fixes dbconsole: overload page improvements May 2, 2024
@aadityasondhi aadityasondhi force-pushed the 20240502.overload-page-fixes branch from 68ca397 to 499c4e5 Compare May 6, 2024 17:48
@aadityasondhi
Collaborator Author

Some screenshots of the new page (in order of how the metrics show on the page).

https://docs.google.com/document/d/1txZ6zcAYR_8fnqR6LR53vCZ1dM6sZh0a52VUg_QIs78/edit?usp=sharing

@aadityasondhi aadityasondhi marked this pull request as ready for review May 6, 2024 17:49
@aadityasondhi aadityasondhi requested a review from a team as a code owner May 6, 2024 17:49
@aadityasondhi aadityasondhi requested review from abarganier and sumeerbhola and removed request for a team May 6, 2024 17:49
Collaborator

@sumeerbhola sumeerbhola left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 39 at r4 (raw file):

      tenantSource={tenantSource}
      showMetricsInTooltip={true}
      tooltip={`CPU utilization as measured by the host, displayed per node.`}

CPU utilization for the CRDB process as measured ....


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 57 at r4 (raw file):

      tenantSource={tenantSource}
      showMetricsInTooltip={true}
      tooltip={`Duration of CPU slots exhaustion regular work, in microseconds. This metric indicates whether the CPU is overloaded and how long do we experience token exhaustion for regular work.`}

in microseconds/sec. ... we experience slot exhaustion for regular CPU work.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 80 at r4 (raw file):

      tenantSource={tenantSource}
      showMetricsInTooltip={true}
      tooltip={`Duration of IO token exhaustion, in microseconds per second. This metric indicates whether the disk is overloaded and how long do we experience token exhaustion.`}

whether the store is overloaded ...


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 99 at r4 (raw file):

      sources={storeSources}
      tenantSource={tenantSource}
      tooltip={`A 1-normalized IO overload score. A value above 1 indicates that the store is overloaded.`}

We don't want to change the definition of the metric yet, but the interpretation of this needs to change, and we can capture that in this comment. Something like:

Historically, a 1-normalized IO overload score. Currently, admission control attempts to maintain a score of 0.5.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 140 at r4 (raw file):

    <LineGraph
      title="Flow Tokens Wait Time: 99th percentile"

We should consider using the term "queuing delay" in this and other queue delay metrics. Consider calling this:

Flow Tokens Queueing Delay: 99th percentile


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 289 at r4 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`P99 scheduling latency for goroutines. A value above 1ms typically indicates high load.`}

... indicates high load that causes elastic CPU work to be throttled.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 311 at r4 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`P99.9 scheduling latency for goroutines. A value above 1ms typically indicates high load.`}

we shouldn't say "A value above 1ms typically indicates high load" since we don't really know whether it indicates high load. This is just useful in realizing whether some high percentile user observed latency could be due to goroutine scheduling latency.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 333 at r4 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`The number of Goroutines waiting per CPU. A value above 32 typically indicates high load.`}

We should say: A value above 32 (sampled at 1ms) is used by admission control to throttle regular CPU work.

Instead of saying "A value above 32 typically indicates high load" since the former is more specific. The latter may result in someone thinking < 32 is not high load.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 351 at r4 (raw file):

      sources={storeSources}
      tenantSource={tenantSource}
      tooltip={`Number of sublevels in L0 of the LSM. A value above 20 typically indicates that the store is overloaded.`}

above 10

(which happens to correspond to a score of 0.5)

@aadityasondhi aadityasondhi force-pushed the 20240502.overload-page-fixes branch 2 times, most recently from beb82cc to cce4580 Compare May 7, 2024 21:29
Collaborator Author

@aadityasondhi aadityasondhi left a comment


TFTR! I have made the changes PTAL. I also added a new commit to split the queue graphs as we discussed offline.

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @sumeerbhola)


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 57 at r4 (raw file):

Previously, sumeerbhola wrote…

in microseconds/sec. ... we experience slot exhaustion for regular CPU work.

Done.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 80 at r4 (raw file):

Previously, sumeerbhola wrote…

whether the store is overloaded ...

Done.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 99 at r4 (raw file):

Previously, sumeerbhola wrote…

We don't want to change the definition of the metric yet, but the interpretation of this needs to change, and we can capture that in this comment. Something like:

Historically, a 1-normalized IO overload score. Currently, admission control attempts to maintain a score of 0.5.

Done.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 140 at r4 (raw file):

Previously, sumeerbhola wrote…

We should consider using the term "queuing delay" in this and other queue delay metrics. Consider calling this:

Flow Tokens Queueing Delay: 99th percentile

Done.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 289 at r4 (raw file):

Previously, sumeerbhola wrote…

... indicates high load that causes elastic CPU work to be throttled.

Done.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 311 at r4 (raw file):

Previously, sumeerbhola wrote…

we shouldn't say "A value above 1ms typically indicates high load" since we don't really know whether it indicates high load. This is just useful in realizing whether some high percentile user observed latency could be due to goroutine scheduling latency.

Done.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 333 at r4 (raw file):

Previously, sumeerbhola wrote…

We should say: A value above 32 (sampled at 1ms) is used by admission control to throttle regular CPU work.

Instead of saying "A value above 32 typically indicates high load" since the former is more specific. The latter may result in someone thinking < 32 is not high load.

Done.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 351 at r4 (raw file):

Previously, sumeerbhola wrote…

above 10

(which happens to correspond to a score of 0.5)

Done.

@aadityasondhi aadityasondhi requested a review from sumeerbhola May 7, 2024 21:58
@aadityasondhi aadityasondhi force-pushed the 20240502.overload-page-fixes branch 2 times, most recently from 94b0dac to ba86f54 Compare May 8, 2024 14:52
Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 180 at r10 (raw file):

    <LineGraph
      title="Admission Queueing Delay – Store"

I just noticed there is a problem here in our code instrumentation. The metrics passed to initWorkQueue for the regular WorkQueue and elastic WorkQueue for a store are the same. So this will include queueing delay for elastic and regular. I think it is worth making a note here and mention in the tooltip that to break this down one should look at the per priority metrics.

I think we should fix this in the code and backport. And we could avoid plotting the p99 for the elastic store WorkQueue since that queueing should all be logical queueing due to replication AC, and the physical queueing effect of that is already reflected in kvadmission.flow_controller.elastic_wait_duration-p99. Though thinking some more, since the flow controller metric is at the source and the WorkQueue metric is at the destination, both can be useful -- the latter tells us the cause, and the former about who is affected.


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 326 at r10 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`P99 scheduling latency for goroutines. A high value here indicates high load that causes background (elastic) CPU work to be throttled.`}

A value above 1ms here indicates high load ...


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 348 at r10 (raw file):

      sources={nodeSources}
      tenantSource={tenantSource}
      tooltip={`P99.9 scheduling latency for goroutines. A high value here indicates high load that causes background (elastic) CPU work to be throttled.`}

something generic like the following, since we don't use p99.9 in AC decision making:
A high value here can be indicative of high tail latency in various queries.

@aadityasondhi aadityasondhi added the backport-24.1.x Flags PRs that need to be backported to 24.1. label May 8, 2024
@aadityasondhi aadityasondhi force-pushed the 20240502.overload-page-fixes branch from ba86f54 to 0289a10 Compare May 8, 2024 21:10
Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)

@aadityasondhi aadityasondhi force-pushed the 20240502.overload-page-fixes branch from 0289a10 to 8a475d7 Compare May 17, 2024 16:42
@aadityasondhi
Collaborator Author

@sumeerbhola could you please take a look at the last two commits. They are follow ups from #124078 and #123890.

Just adding those metrics in with their regular counterparts.

@aadityasondhi aadityasondhi force-pushed the 20240502.overload-page-fixes branch from d47a6d6 to f77fbf4 Compare May 17, 2024 18:42
@aadityasondhi aadityasondhi requested review from a team as code owners May 17, 2024 18:42
@aadityasondhi aadityasondhi requested a review from a team May 17, 2024 18:42
@aadityasondhi aadityasondhi requested review from a team as code owners May 17, 2024 18:42
@aadityasondhi aadityasondhi requested a review from andyyang890 May 17, 2024 18:43

blathers-crl bot commented May 17, 2024

Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.

It looks like your PR touches SQL parser code but doesn't add or edit parser tests. Please make sure you add or edit parser tests if you edit the parser.

🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.

@aadityasondhi
Collaborator Author

Apologies for the noise. Bad rebase.

@aadityasondhi aadityasondhi force-pushed the 20240502.overload-page-fixes branch from f77fbf4 to 7258e95 Compare May 17, 2024 18:54
Collaborator

@sumeerbhola sumeerbhola left a comment


:lgtm:

Reviewable status: :shipit: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 155 at r39 (raw file):

    <LineGraph
      title="Admission Queueing Delay – Foreground (Regular) CPU"

nit: I have a mild preference for putting "p99" in the titles too. Since not everyone looks at the tooltip.

@aadityasondhi aadityasondhi force-pushed the 20240502.overload-page-fixes branch from 7258e95 to c397d35 Compare May 21, 2024 18:23
Collaborator Author

@aadityasondhi aadityasondhi left a comment


Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier and @sumeerbhola)


pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx line 155 at r39 (raw file):

Previously, sumeerbhola wrote…

nit: I have a mild preference for putting "p99" in the titles too. Since not everyone looks at the tooltip.

Done.

@aadityasondhi
Collaborator Author

TFTR!

bors r=sumeerbhola

@craig craig bot merged commit 02ca30c into cockroachdb:master May 21, 2024
21 of 22 checks passed
@aadityasondhi aadityasondhi deleted the 20240502.overload-page-fixes branch May 21, 2024 19:32
Labels
backport-24.1.x Flags PRs that need to be backported to 24.1.
Successfully merging this pull request may close these issues.

admission: Provide better description on Overload metrics page
3 participants