server/status: add running non-idle jobs metric #79022

Merged
merged 1 commit into cockroachdb:master from running-non-idle-jobs-metric on Apr 5, 2022

Conversation

darinpp
Contributor

@darinpp darinpp commented Mar 30, 2022

Previously, serverless used the running SQL jobs metric to determine
whether a tenant process is idle and can be shut down. With the
introduction of continuously running jobs, this is no longer a good
indicator. A recent addition is a set of per-job metrics that report
whether each job is running or idle. The autoscaler doesn't care about
individual jobs; it only cares about the total number of jobs that are
running but haven't reported themselves as idle. The pull rate is also
very high, so retrieving the individual running/idle metrics for each
job type isn't optimal. This PR therefore adds a single metric that
aggregates and tracks the total count of jobs that are running and not
idle.
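
For illustration, here is a minimal Go sketch of the idea behind the new aggregate gauge; the type and method names below are made up for the example and are not the actual CockroachDB identifiers.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// runningNonIdleJobs sketches the single aggregate gauge described
// above: a count of jobs that are currently running and have not
// reported themselves as idle. All names here are illustrative.
type runningNonIdleJobs struct {
	count int64
}

// jobStarted is called when a job begins running (non-idle by default).
func (g *runningNonIdleJobs) jobStarted() { atomic.AddInt64(&g.count, 1) }

// jobMarkedIdle is called when a running job reports that it is idle.
func (g *runningNonIdleJobs) jobMarkedIdle() { atomic.AddInt64(&g.count, -1) }

// value is what the autoscaler would poll: zero means the tenant
// process has no active job work and is a candidate for shutdown.
func (g *runningNonIdleJobs) value() int64 { return atomic.LoadInt64(&g.count) }

func main() {
	var g runningNonIdleJobs
	g.jobStarted()    // e.g. a changefeed starts
	g.jobStarted()    // e.g. another long-running job starts
	g.jobMarkedIdle() // the changefeed settles and reports idle
	fmt.Println("running non-idle jobs:", g.value()) // prints 1
}
```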

Release justification: Bug fixes and low-risk updates to new functionality
Release note: None

Will be rebased once #79021 is merged.

@darinpp darinpp requested a review from a team March 30, 2022 04:27
@darinpp darinpp requested review from a team as code owners March 30, 2022 04:27
@darinpp darinpp requested review from gh-casper and removed request for a team March 30, 2022 04:27
@cockroach-teamcity
Member

This change is Reviewable

@shermanCRL shermanCRL requested review from miretskiy and removed request for gh-casper March 30, 2022 12:22
Contributor

@miretskiy miretskiy left a comment

Reviewed 5 of 9 files at r1, 1 of 4 files at r2.
Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @darinpp and @miretskiy)


pkg/util/metric/prometheus_exporter.go, line 129 at r2 (raw file):

) error {
	pm.muScrapeAndPrint.Lock()
	defer pm.muScrapeAndPrint.Unlock()

should this be a read lock?


pkg/util/metric/registry.go, line 161 at r2 (raw file):

// Select calls the given closure for the selected metric names.
func (r *Registry) Select(metrics map[string]struct{}, f func(name string, val interface{})) {

I don't know if this method is really needed. Couldn't f perform the same filtering and thus you could
just use Each w/out expanding the interface?

@miretskiy miretskiy self-requested a review March 30, 2022 12:34
Contributor Author

@darinpp darinpp left a comment

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)


pkg/util/metric/prometheus_exporter.go, line 129 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

should this be a read lock?

A read lock is shared, so it wouldn't work here. The scrape call adds data to the exporter, so it isn't read-only. The print call clears the data after it is printed, so it isn't read-only either.
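
To make this concrete, here is a simplified, self-contained sketch (not the actual exporter code) of why the critical section is read-modify-write rather than read-only:

```go
package main

import (
	"fmt"
	"sync"
)

// exporter stands in for the prometheus exporter; the field and method
// names are illustrative only.
type exporter struct {
	mu       sync.Mutex // exclusive lock: both halves below mutate state
	families map[string]int64
}

// scrapeAndPrint adds entries during the scrape and clears them after
// printing, so neither step is read-only, and a shared read lock would
// allow concurrent callers to mutate the map at the same time.
func (e *exporter) scrapeAndPrint(name string, val int64) {
	e.mu.Lock()
	defer e.mu.Unlock()
	if e.families == nil {
		e.families = make(map[string]int64)
	}
	e.families[name] = val           // scrape: write into shared state
	fmt.Printf("%s %d\n", name, val) // print the scraped value
	delete(e.families, name)         // reset after printing: another write
}

func main() {
	var e exporter
	e.scrapeAndPrint("jobs_running_non_idle", 1)
}
```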


pkg/util/metric/registry.go, line 161 at r2 (raw file):

Previously, miretskiy (Yevgeniy Miretskiy) wrote…

I don't know if this method is really needed. Couldn't f perform the same filtering and thus you could
just use Each w/out expanding the interface?

f can't perform the same filtering because it doesn't have access to the internals of the registry. It is possible to iterate over all the metrics and let f filter out the ones that aren't required, but that isn't optimal from a performance point of view, and this method will be called very frequently.
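
Roughly what I have in mind, as a simplified sketch; the real Registry stores richer metric types, and the name-indexed lookup below is only illustrative:

```go
package main

import "fmt"

// registry is a simplified stand-in for metric.Registry.
type registry struct {
	byName map[string]interface{} // illustrative internal index
}

// Select invokes f only for the requested metric names. Because the
// lookup happens inside the registry, a high-frequency caller pays for
// the handful of metrics it wants instead of a full Each() pass over
// every registered metric.
func (r *registry) Select(names map[string]struct{}, f func(name string, val interface{})) {
	for name := range names {
		if v, ok := r.byName[name]; ok {
			f(name, v)
		}
	}
}

func main() {
	r := &registry{byName: map[string]interface{}{
		"jobs.running_non_idle": int64(1),
		"sql.conns":             int64(7),
		"sql.query.count":       int64(123),
	}}
	wanted := map[string]struct{}{"jobs.running_non_idle": {}, "sql.conns": {}}
	r.Select(wanted, func(name string, val interface{}) {
		fmt.Println(name, "=", val)
	})
}
```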

@darinpp darinpp force-pushed the running-non-idle-jobs-metric branch from fd91a32 to 44ab019 on March 30, 2022 15:55
@abarganier
Contributor

Great to see more observability on the way! From the obs-inf review side, can you look into and consider defining some AlertingRule(s) and/or AggregationRule(s) for this? They can go a long way towards helping people understand expected value ranges/how to effectively use CRDB metrics.

Let us know if we can provide any more useful context on this.

@darinpp
Contributor Author

darinpp commented Mar 30, 2022

> Great to see more observability on the way! From the obs-inf review side, can you look into and consider defining some AlertingRule(s) and/or AggregationRule(s) for this? They can go a long way towards helping people understand expected value ranges/how to effectively use CRDB metrics.
>
> Let us know if we can provide any more useful context on this.

I don't see the need to define either type of rule for the added metric. It doesn't indicate a problem, so I'm not sure why anyone would need to define an alerting rule for it. It also doesn't make sense to aggregate it with anything else; it is already, in effect, an aggregation of the status of all jobs. While it can be used on the UI side, the main reason for adding this metric is to drive the autoscaler for serverless.

Contributor

@miretskiy miretskiy left a comment

Just giving 👍 from the jobs perspective.

Contributor

@abarganier abarganier left a comment

> I don't see the need to define either type of rule for the added metric. It doesn't indicate a problem, so I'm not sure why anyone would need to define an alerting rule for it. It also doesn't make sense to aggregate it with anything else; it is already, in effect, an aggregation of the status of all jobs. While it can be used on the UI side, the main reason for adding this metric is to drive the autoscaler for serverless.

Observability Infra. always asks those implementing new metrics to consider aggregation rules or alerting rules, but it's not mandatory to add them. Thanks for looking into it & considering!

Reviewable status: :shipit: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)

@darinpp darinpp force-pushed the running-non-idle-jobs-metric branch from 44ab019 to 09e5af0 on April 5, 2022 16:30
@darinpp
Contributor Author

darinpp commented Apr 5, 2022

bors r+

@craig craig bot merged commit 985344a into cockroachdb:master Apr 5, 2022
@craig
Contributor

craig bot commented Apr 5, 2022

Build succeeded:

@darinpp darinpp deleted the running-non-idle-jobs-metric branch April 5, 2022 21:19
craig bot pushed a commit that referenced this pull request Apr 6, 2022
79023: server/status: add load related metrics r=darinpp a=darinpp

Previously, only CPU-related metrics were available on the
_status/load endpoint. For serverless we need, in addition to these,
metrics that show the total number of current SQL connections, the
number of SQL queries executed, and the number of jobs currently
running that are not idle. This PR adds the three new metrics by using
a selective prometheus exporter and scraping the MetricsRecorder.
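
As a rough illustration only (not the actual CockroachDB code), a handler that exposes just a selected subset of metrics on a load endpoint could look like the sketch below; the metric names, storage, and port are assumptions for the example.

```go
package main

import (
	"fmt"
	"net/http"
)

// loadMetrics stands in for the node's metrics recorder; the names and
// values are placeholders for the example.
var loadMetrics = map[string]int64{
	"sql_conns":             7,
	"sql_query_count":       1234,
	"jobs_running_non_idle": 1,
	"sys_cpu_user_ns":       987654321,
}

// selected is the small, fixed set of metrics the autoscaler polls.
var selected = []string{"sql_conns", "sql_query_count", "jobs_running_non_idle"}

// handleLoad writes only the selected metrics, one "name value" line
// each, in a simplified Prometheus-style text format.
func handleLoad(w http.ResponseWriter, r *http.Request) {
	for _, name := range selected {
		if v, ok := loadMetrics[name]; ok {
			fmt.Fprintf(w, "%s %d\n", name, v)
		}
	}
}

func main() {
	http.HandleFunc("/_status/load", handleLoad)
	_ = http.ListenAndServe("localhost:8080", nil)
}
```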

Release justification: Low risk, high reward changes to existing functionality

Release note: None

Will be rebased once #79021 and #79022 are merged.

Co-authored-by: Darin Peshev <[email protected]>