dbconsole: overload page improvements #123522
Conversation
Force-pushed from a2c9f5e to 68ca397.
Force-pushed from 68ca397 to 499c4e5.
Some screenshots of the new page (in the order the metrics appear on the page): https://docs.google.com/document/d/1txZ6zcAYR_8fnqR6LR53vCZ1dM6sZh0a52VUg_QIs78/edit?usp=sharing
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 39 at r4 (raw file):
tenantSource={tenantSource} showMetricsInTooltip={true} tooltip={`CPU utilization as measured by the host, displayed per node.`}
CPU utilization for the CRDB process as measured ....
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 57 at r4 (raw file):
tenantSource={tenantSource} showMetricsInTooltip={true} tooltip={`Duration of CPU slots exhaustion regular work, in microseconds. This metric indicates whether the CPU is overloaded and how long do we experience token exhaustion for regular work.`}
in microseconds/sec. ... we experience slot exhaustion for regular CPU work.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 80 at r4 (raw file):
tenantSource={tenantSource} showMetricsInTooltip={true} tooltip={`Duration of IO token exhaustion, in microseconds per second. This metric indicates whether the disk is overloaded and how long do we experience token exhaustion.`}
whether the store is overloaded ...
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 99 at r4 (raw file):
sources={storeSources} tenantSource={tenantSource} tooltip={`A 1-normalized IO overload score. A value above 1 indicates that the store is overloaded.`}
We don't want to change the definition of the metric yet, but the interpretation of this needs to change, and we can capture that in this comment. Something like:
Historically, a 1-normalized IO overload score. Currently, admission control attempts to maintain a score of 0.5.
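For illustration, a minimal sketch of how that wording could slot into the existing chart entry in overload.tsx. It reuses the LineGraph/Metric/Axis components and props already used in this file (and assumes the imports at the top of the file); the chart title, axis label, and metric name shown here are assumptions for the sketch, not taken from this diff:

```tsx
// Sketch only; assumes LineGraph, Metric, and Axis are imported as in overload.tsx.
<LineGraph
  title="IO Overload"
  sources={storeSources}
  tenantSource={tenantSource}
  showMetricsInTooltip={true}
  tooltip={`Historically, a 1-normalized IO overload score. Currently,
    admission control attempts to maintain a score of 0.5.`}
>
  <Axis label="score">
    <Metric
      // Assumed metric name, not taken from this diff.
      name="cr.store.admission.io.overload"
      title="IO Overload Score"
    />
  </Axis>
</LineGraph>
```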
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 140 at r4 (raw file):
<LineGraph title="Flow Tokens Wait Time: 99th percentile"
We should consider using the term "queuing delay" in this and other queue delay metrics. Consider calling this:
Flow Tokens Queueing Delay: 99th percentile
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 289 at r4 (raw file):
sources={nodeSources} tenantSource={tenantSource} tooltip={`P99 scheduling latency for goroutines. A value above 1ms typically indicates high load.`}
... indicates high load that causes elastic CPU work to be throttled.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 311 at r4 (raw file):
sources={nodeSources} tenantSource={tenantSource} tooltip={`P99.9 scheduling latency for goroutines. A value above 1ms typically indicates high load.`}
we shouldn't say "A value above 1ms typically indicates high load" since we don't really know whether it indicates high load. This is just useful in realizing whether some high-percentile user-observed latency could be due to goroutine scheduling latency.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 333 at r4 (raw file):
sources={nodeSources} tenantSource={tenantSource} tooltip={`The number of Goroutines waiting per CPU. A value above 32 typically indicates high load.`}
We should say "A value above 32 (sampled at 1ms) is used by admission control to throttle regular CPU work"
instead of "A value above 32 typically indicates high load", since the former is more specific. The latter may result in someone thinking < 32 is not high load.
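Same idea as the sketch above, applied to this chart; the title, axis label, and metric name are assumptions for illustration, not taken from this diff:

```tsx
// Sketch only; metric name and titles are assumptions.
<LineGraph
  title="Runnable Goroutines per CPU"
  sources={nodeSources}
  tenantSource={tenantSource}
  showMetricsInTooltip={true}
  tooltip={`The number of Goroutines waiting per CPU. A value above 32
    (sampled at 1ms) is used by admission control to throttle regular CPU work.`}
>
  <Axis label="goroutines">
    <Metric
      // Assumed metric name, not taken from this diff.
      name="cr.node.sys.runnable.goroutines.per.cpu"
      title="Runnable Goroutines per CPU"
    />
  </Axis>
</LineGraph>
```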
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 351 at r4 (raw file):
sources={storeSources} tenantSource={tenantSource} tooltip={`Number of sublevels in L0 of the LSM. A value above 20 typically indicates that the store is overloaded.`}
above 10
(which happens to correspond to a score of 0.5)
Force-pushed from beb82cc to cce4580.
TFTR! I have made the changes, PTAL. I also added a new commit to split the queue graphs as we discussed offline.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @abarganier and @sumeerbhola)
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 57 at r4 (raw file):
Previously, sumeerbhola wrote…
in microseconds/sec. ... we experience slot exhaustion for regular CPU work.
Done.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 80 at r4 (raw file):
Previously, sumeerbhola wrote…
whether the store is overloaded ...
Done.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 99 at r4 (raw file):
Previously, sumeerbhola wrote…
We don't want to change the definition of the metric yet, but the interpretation of this needs to change, and we can capture that in this comment. Something like:
Historically, a 1-normalized IO overload score. Currently, admission control attempts to maintain a score of 0.5.
Done.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 140 at r4 (raw file):
Previously, sumeerbhola wrote…
We should consider using the term "queuing delay" in this and other queue delay metrics. Consider calling this:
Flow Tokens Queueing Delay: 99th percentile
Done.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 289 at r4 (raw file):
Previously, sumeerbhola wrote…
... indicates high load that causes elastic CPU work to be throttled.
Done.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 311 at r4 (raw file):
Previously, sumeerbhola wrote…
we shouldn't say "A value above 1ms typically indicates high load" since we don't really know whether it indicates high load. This is just useful in realizing whether some high-percentile user-observed latency could be due to goroutine scheduling latency.
Done.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 333 at r4 (raw file):
Previously, sumeerbhola wrote…
We should say "A value above 32 (sampled at 1ms) is used by admission control to throttle regular CPU work"
instead of "A value above 32 typically indicates high load", since the former is more specific. The latter may result in someone thinking < 32 is not high load.
Done.
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 351 at r4 (raw file):
Previously, sumeerbhola wrote…
above 10
(which happens to correspond to a score of 0.5)
Done.
Force-pushed from 94b0dac to ba86f54.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 180 at r10 (raw file):
<LineGraph title="Admission Queueing Delay – Store"
I just noticed there is a problem here in our code instrumentation. The metrics passed to initWorkQueue for the regular WorkQueue and elastic WorkQueue for a store are the same. So this will include queueing delay for elastic and regular. I think it is worth making a note here and mentioning in the tooltip that to break this down one should look at the per-priority metrics.
I think we should fix this in the code and backport. And we could avoid plotting the p99 for the elastic store WorkQueue, since that queueing should all be logical queueing due to replication AC, and the physical queueing effect of that is already reflected in kvadmission.flow_controller.elastic_wait_duration-p99. Though thinking some more, since the flow controller metric is at the source and the WorkQueue metric is at the destination, both can be useful -- the latter tells us the cause, and the former about who is affected.
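If we go with just the tooltip note for now, a rough sketch of how it could read (the chart title, axis label, and metric name here are assumptions for illustration; the point is the caveat that regular and elastic store work are currently combined):

```tsx
// Sketch only; metric name and titles are assumptions.
<LineGraph
  title="Admission Queueing Delay p99 – Store"
  sources={storeSources}
  tenantSource={tenantSource}
  showMetricsInTooltip={true}
  tooltip={`The 99th percentile latency of requests waiting in the store
    admission queue. This currently includes both regular and elastic
    (background) work; to break the delay down by work class, look at the
    per-priority admission wait duration metrics.`}
>
  <Axis units={AxisUnits.Duration} label="wait duration">
    <Metric
      // Assumed metric name, not taken from this diff.
      name="cr.store.admission.wait_durations.kv-stores-p99"
      title="KV store admission wait durations, p99"
    />
  </Axis>
</LineGraph>
```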
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 326 at r10 (raw file):
sources={nodeSources} tenantSource={tenantSource} tooltip={`P99 scheduling latency for goroutines. A high value here indicates high load that causes background (elastic) CPU work to be throttled.`}
A value above 1ms here indicates high load ...
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 348 at r10 (raw file):
sources={nodeSources} tenantSource={tenantSource} tooltip={`P99.9 scheduling latency for goroutines. A high value here indicates high load that causes background (elastic) CPU work to be throttled.`}
something generic like the following, since we don't use p99.9 in AC decision making:
A high value here can be indicative of high tail latency in various queries.
Force-pushed from ba86f54 to 0289a10.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)
Force-pushed from 0289a10 to 8a475d7.
@sumeerbhola could you please take a look at the last two commits. They are follow-ups from #124078 and #123890. Just adding those metrics in with their
Force-pushed from d47a6d6 to f77fbf4.
Your pull request contains more than 1000 changes. It is strongly encouraged to split big PRs into smaller chunks.
It looks like your PR touches SQL parser code but doesn't add or edit parser tests. Please make sure you add or edit parser tests if you edit the parser.
🦉 Hoot! I am a Blathers, a bot for CockroachDB. My owner is dev-inf.
Apologies for the noise. Bad rebase.
Force-pushed from f77fbf4 to 7258e95.
Reviewable status: complete! 1 of 0 LGTMs obtained (waiting on @aadityasondhi and @abarganier)
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 155 at r39 (raw file):
<LineGraph title="Admission Queueing Delay – Foreground (Regular) CPU"
nit: I have a mild preference for putting "p99" in the titles too. Since not everyone looks at the tooltip.
Force-pushed from 7258e95 to c397d35.
Reviewable status: complete! 0 of 0 LGTMs obtained (and 1 stale) (waiting on @abarganier and @sumeerbhola)
pkg/ui/workspaces/db-console/src/views/cluster/containers/nodeGraphs/dashboards/overload.tsx
line 155 at r39 (raw file):
Previously, sumeerbhola wrote…
nit: I have a mild preference for putting "p99" in the titles too. Since not everyone looks at the tooltip.
Done.
TFTR! bors r=sumeerbhola
This PR contains a series of improvements to the overload page of the DB console as part of #121574. It is separated into multiple commits for ease of review.
dbconsole: remove non useful charts on the overload page
In investigations, we have found that the following charts are not
useful and frequently cause confusion:
- Admission Work Rate
- Admission Delay Rate
- Requests Waiting For Flow Tokens
Informs #121572
Release note (ui change): This patch removes "Admission Delay Rate",
"Admission Work Rate", and "Requests Waiting For Flow Tokens". These
charts often cause confusion and are not useful for general overload
investigations.
dbconsole: reorder overload page metrics for better readability
This patch reorders the existing metrics in a more usable order:
1. Metrics to help determine which resource is constrained (IO, CPU)
2. Metrics to narrow down which AC queues are seeing requests waiting
3. More advanced metrics about the system health (goroutine scheduler,
L0 sublevels, etc.)
Informs #121572.
Release note (ui change): Reordering of metrics on the overload page to
help categorize them better. They are roughly in the following order:
1. Metrics to help determine which resource is constrained (IO, CPU)
2. Metrics to narrow down which AC queues are seeing requests waiting
3. More advanced metrics about the system health (goroutine scheduler,
L0 sublevels, etc.)
dbconsole: include better names and descriptions for overload page
This patch improves the descriptions for the metrics on the overload
page.
Fixes #120853.
Release note (ui change): The overload page now includes descriptions for all
metrics.
dbconsole: additional higher granularity metrics for overload
This patch adds additional metrics to the overload page that allow for a
more granular look at the system:
- cr.store.storage.l0-sublevels
- cr.node.go.scheduler_latency-p99.9
Informs #121572.
Release note (ui change): Two additional metrics on the overload page
for better visibility into overloaded resources:
- cr.store.storage.l0-sublevels
- cr.node.go.scheduler_latency-p99.9
dbconsole: split Admission Queue graphs to avoid overcrowding
Informs #121572.
Release note (ui change): There are now 4 graphs for Admission Queue
Delay:
1. Foreground (regular) CPU work
2. Store (IO) work
3. Background (elastic) CPU work
4. Replication Admission Control, store overload on replicas
dbconsole: add elastic store metric to the overload page
This patch uses the new separated `elastic-stores` metrics for queueing
delay from #123890.
Informs #121572.
Release note (ui change): The `Admission Queueing Delay – Store` chart
now separates elastic (background) work from the regular foreground
work.
dbconsole: add elastic io token exhausted duration to overload page
This patch adds the metric `elastic_io_tokens_exhausted_duration.kv`
introduced in #124078.
Informs #121572.
Release note (ui change): The `Admission IO Tokens Exhausted` chart now
separates elastic and regular IO work.