server/status: add running non-idle jobs metric #79022
Conversation
Reviewed 5 of 9 files at r1, 1 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp and @miretskiy)
pkg/util/metric/prometheus_exporter.go, line 129 at r2 (raw file):
) error {
	pm.muScrapeAndPrint.Lock()
	defer pm.muScrapeAndPrint.Unlock()
should this be a read lock?
pkg/util/metric/registry.go, line 161 at r2 (raw file):
// Select calls the given closure for the selected metric names.
func (r *Registry) Select(metrics map[string]struct{}, f func(name string, val interface{})) {
I don't know if this method is really needed. Couldn't `f` perform the same filtering, and thus you could just use `Each` without expanding the interface?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/util/metric/prometheus_exporter.go, line 129 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
should this be a read lock?
A read lock is shared, so it wouldn't work here. The scrape call adds data to the exporter, so it isn't read-only. The print clears the data after it is printed, so it isn't read-only either.
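For illustration, here is a minimal sketch of why an exclusive mutex rather than a `sync.RWMutex` read lock fits this path; the type and method names below are assumptions, not the actual CockroachDB code, but both the scrape step and the subsequent cleanup write to exporter state.

```go
package metric

import (
	"fmt"
	"io"
	"sync"
)

// Sketch only: type and method names are illustrative, not the exact
// CockroachDB implementation.
type PrometheusExporter struct {
	// An exclusive sync.Mutex, not a sync.RWMutex: both the scrape and
	// the post-print cleanup mutate exporter state.
	muScrapeAndPrint sync.Mutex
	families         map[string][]float64
}

// ScrapeAndPrintAsText gathers metrics, prints them, then clears the buffer.
// A read lock would not be enough: scrape() writes into pm.families, and the
// deferred clear mutates it again, so concurrent callers must be serialized.
func (pm *PrometheusExporter) ScrapeAndPrintAsText(
	w io.Writer, scrape func(*PrometheusExporter),
) error {
	pm.muScrapeAndPrint.Lock()
	defer pm.muScrapeAndPrint.Unlock()

	pm.families = map[string][]float64{}
	scrape(pm)                           // populates pm.families (a write)
	defer func() { pm.families = nil }() // clears state after printing (another write)

	for name, vals := range pm.families {
		if _, err := fmt.Fprintf(w, "%s %v\n", name, vals); err != nil {
			return err
		}
	}
	return nil
}
```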
pkg/util/metric/registry.go, line 161 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I don't know if this method is really needed. Couldn't `f` perform the same filtering, and thus you could just use `Each` without expanding the interface?
`f` can't perform the same filtering, as it doesn't have access to the internals of the registry. It is possible to iterate over all the metrics and let `f` filter out the ones that aren't required, but this isn't optimal from a performance point of view, and we want to make very frequent calls to this method.
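A rough sketch of the trade-off being discussed: the `Select` signature is quoted from the diff above, but the method bodies and the internal `tracked` map are illustrative assumptions, not the real implementation. Filtering inside `f` via `Each` still visits every registered metric, while `Select` can look up only the requested names, which matters when the caller polls at a high rate.

```go
package metric

import "sync"

// Illustrative registry; the real CockroachDB Registry differs in detail.
type Registry struct {
	mu      sync.Mutex
	tracked map[string]interface{} // assumed internal name -> metric map
}

// Each visits every registered metric; any filtering has to happen inside f.
func (r *Registry) Each(f func(name string, val interface{})) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for name, metric := range r.tracked {
		f(name, metric)
	}
}

// Select visits only the metrics whose names appear in the given set. For a
// path that is polled very frequently and needs only a few metrics, this
// avoids walking the full registry on every call.
func (r *Registry) Select(metrics map[string]struct{}, f func(name string, val interface{})) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for name := range metrics {
		if metric, ok := r.tracked[name]; ok {
			f(name, metric)
		}
	}
}
```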
Force-pushed from fd91a32 to 44ab019
Great to see more observability on the way! From the obs-inf review side, can you look into and consider defining some AlertingRule(s) and/or AggregationRule(s) for this? They can go a long way towards helping people understand expected value ranges and how to effectively use CRDB metrics. Let us know if we can provide any more useful context on this.
I don't see the need to define either type of rule for the added metric. It doesn't indicate a problem, so I'm not sure why anyone would need to define an alerting rule. It also doesn't make sense to aggregate it with anything else, since it is already effectively an aggregation of the status of all jobs. While it can be used on the UI side, the main reason for adding this metric is to drive the autoscaler for serverless.
Just giving 👍 from jobs perspective.
I don't see the need to define either type of rule for the added metric. It doesn't indicate a problem, so I'm not sure why anyone would need to define an alerting rule. It also doesn't make sense to aggregate it with anything else, since it is already effectively an aggregation of the status of all jobs. While it can be used on the UI side, the main reason for adding this metric is to drive the autoscaler for serverless.
Observability Infra. always asks those implementing new metrics to consider aggregation rules or alerting rules, but it's not mandatory to add them. Thanks for looking into it & considering!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
Force-pushed from 44ab019 to 09e5af0
bors r+
Build succeeded.
79023: server/status: add load related metrics r=darinpp a=darinpp
Previously, only CPU related metrics were available on the _status/load endpoint. For serverless we need, in addition to these, metrics that show the total number of current SQL connections, the number of SQL queries executed, and the number of jobs currently running that are not idle. This PR adds the three new metrics by using a selective prometheus exporter and scraping the MetricsRecorder.
Release justification: Low risk, high reward changes to existing functionality
Release note: None
Will be rebased once #79021 and #79022 are merged.
Co-authored-by: Darin Peshev <[email protected]>
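As a hypothetical usage sketch building on the `Registry` example above (the metric names shown are assumptions, not necessarily the exact CockroachDB names), the load endpoint could scrape just the handful of metrics it serves:

```go
// Hypothetical sketch of the "selective" scrape for the load endpoint;
// metric names and the visit callback are illustrative only.
func scrapeLoadMetrics(r *Registry, visit func(name string, val interface{})) {
	loadMetrics := map[string]struct{}{
		"sql.conns":             {}, // current SQL connections
		"sql.query.count":       {}, // SQL queries executed
		"jobs.running_non_idle": {}, // running jobs not reporting idle
	}
	r.Select(loadMetrics, visit)
}
```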
Previously serverless was using the sql jobs running metric to determine
if a tenant process is idle and can be shut down. With the introduction
of continuously running jobs this isn't a good indicator anymore. A
recent addition is a set of per-job metrics that show whether each job is
running or idle. The autoscaler doesn't care about the individual jobs
and only cares about the total number of jobs that are running but
haven't reported as being idle. The pull rate is also very high, so
retrieving all the individual running/idle metrics for each job type
isn't optimal. So this PR adds a single metric that simply aggregates and
tracks the total count of jobs running and not idle.
Release justification: Bug fixes and low-risk updates to new functionality
Release note: None
Will be rebased once #79021 is merged.
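A minimal sketch of the kind of aggregate gauge the description refers to (names are assumptions; the real jobs code is more involved): a single count across all job types that goes up when a job starts doing work and down when it finishes or reports idle, so the autoscaler only has to poll one value.

```go
package jobs

import "sync/atomic"

// RunningNonIdle sketches the aggregated metric: one count across all job
// types rather than per-job running/idle gauges. Names are illustrative.
type RunningNonIdle struct {
	n int64
}

// Inc is called when a job starts running (or wakes up from idle).
func (g *RunningNonIdle) Inc() { atomic.AddInt64(&g.n, 1) }

// Dec is called when a running job completes or marks itself idle.
func (g *RunningNonIdle) Dec() { atomic.AddInt64(&g.n, -1) }

// Value is the single number the serverless autoscaler polls: the total
// count of jobs that are running but have not reported as idle.
func (g *RunningNonIdle) Value() int64 { return atomic.LoadInt64(&g.n) }
```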