server/status: add running non-idle jobs metric #79022
Conversation
Reviewed 5 of 9 files at r1, 1 of 4 files at r2.
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @darinpp and @miretskiy)
pkg/util/metric/prometheus_exporter.go, line 129 at r2 (raw file):
) error {
	pm.muScrapeAndPrint.Lock()
	defer pm.muScrapeAndPrint.Unlock()
should this be a read lock?
pkg/util/metric/registry.go, line 161 at r2 (raw file):
// Select calls the given closure for the selected metric names.
func (r *Registry) Select(metrics map[string]struct{}, f func(name string, val interface{})) {
I don't know if this method is really needed. Couldn't `f` perform the same filtering, and thus you could just use `Each` without expanding the interface?
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
pkg/util/metric/prometheus_exporter.go, line 129 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
should this be a read lock?
A read lock is shared, so it wouldn't work here. The scrape call adds data to the exporter, so it isn't read-only. The print clears the data after it is printed, so it isn't read-only either.
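For illustration, here is a minimal sketch of why an exclusive mutex rather than a `sync.RWMutex` read lock fits this path; the type and method names below are assumptions, not the actual CockroachDB code, but both the scrape step and the subsequent cleanup write to exporter state.

```go
package metric

import (
	"fmt"
	"io"
	"sync"
)

// Sketch only: type and method names are illustrative, not the exact
// CockroachDB implementation.
type PrometheusExporter struct {
	// An exclusive sync.Mutex, not a sync.RWMutex: both the scrape and
	// the post-print cleanup mutate exporter state.
	muScrapeAndPrint sync.Mutex
	families         map[string][]float64
}

// ScrapeAndPrintAsText gathers metrics, prints them, then clears the buffer.
// A read lock would not be enough: scrape() writes into pm.families, and the
// deferred clear mutates it again, so concurrent callers must be serialized.
func (pm *PrometheusExporter) ScrapeAndPrintAsText(
	w io.Writer, scrape func(*PrometheusExporter),
) error {
	pm.muScrapeAndPrint.Lock()
	defer pm.muScrapeAndPrint.Unlock()

	pm.families = map[string][]float64{}
	scrape(pm)                           // populates pm.families (a write)
	defer func() { pm.families = nil }() // clears state after printing (another write)

	for name, vals := range pm.families {
		if _, err := fmt.Fprintf(w, "%s %v\n", name, vals); err != nil {
			return err
		}
	}
	return nil
}
```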
pkg/util/metric/registry.go, line 161 at r2 (raw file):
Previously, miretskiy (Yevgeniy Miretskiy) wrote…
I don't know if this method is really needed. Couldn't `f` perform the same filtering, and thus you could just use `Each` without expanding the interface?
`f` can't perform the same filtering, as it doesn't have access to the internals of the registry. It is possible to iterate over all the metrics and let `f` filter out the ones that aren't required, but this isn't optimal from a performance point of view, and we want to make very frequent calls to this method.
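A rough sketch of the trade-off being discussed: the `Select` signature is quoted from the diff above, but the method bodies and the internal `tracked` map are illustrative assumptions, not the real implementation. Filtering inside `f` via `Each` still visits every registered metric, while `Select` can look up only the requested names, which matters when the caller polls at a high rate.

```go
package metric

import "sync"

// Illustrative registry; the real CockroachDB Registry differs in detail.
type Registry struct {
	mu      sync.Mutex
	tracked map[string]interface{} // assumed internal name -> metric map
}

// Each visits every registered metric; any filtering has to happen inside f.
func (r *Registry) Each(f func(name string, val interface{})) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for name, metric := range r.tracked {
		f(name, metric)
	}
}

// Select visits only the metrics whose names appear in the given set. For a
// path that is polled very frequently and needs only a few metrics, this
// avoids walking the full registry on every call.
func (r *Registry) Select(metrics map[string]struct{}, f func(name string, val interface{})) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for name := range metrics {
		if metric, ok := r.tracked[name]; ok {
			f(name, metric)
		}
	}
}
```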
Force-pushed from fd91a32 to 44ab019
Great to see more observability on the way! From the obs-inf review side, can you look into and consider defining some AlertingRule(s) and/or AggregationRule(s) for this? They can go a long way towards helping people understand expected value ranges and how to effectively use CRDB metrics. Let us know if we can provide any more useful context on this.
I don't see the need to define either type of rule for the added metric. It doesn't indicate a problem, so I'm not sure why anyone would need to define an alerting rule. It also doesn't make sense to aggregate it with anything else, since it is already effectively an aggregation of the status of all jobs. While it can be used on the UI side, the main reason for adding this metric is to drive the autoscaler for serverless.
Just giving 👍 from jobs perspective.
I don't see the need to define either type of rule for the added metric. It doesn't indicate a problem, so I'm not sure why anyone would need to define an alerting rule. It also doesn't make sense to aggregate it with anything else, since it is already effectively an aggregation of the status of all jobs. While it can be used on the UI side, the main reason for adding this metric is to drive the autoscaler for serverless.
Observability Infra. always asks those implementing new metrics to consider aggregation rules or alerting rules, but it's not mandatory to add them. Thanks for looking into it & considering!
Reviewable status: complete! 0 of 0 LGTMs obtained (waiting on @miretskiy)
Force-pushed from 44ab019 to 09e5af0
bors r+
Build succeeded.
79023: server/status: add load related metrics r=darinpp a=darinpp
Previously, only CPU related metrics were available on the _status/load endpoint. For serverless we need, in addition to these, metrics that show the total number of current SQL connections, the number of SQL queries executed, and the number of jobs currently running that are not idle. This PR adds the three new metrics by using a selective prometheus exporter and scraping the MetricsRecorder.
Release justification: Low risk, high reward changes to existing functionality
Release note: None
Will be rebased once #79021 and #79022 are merged.
Co-authored-by: Darin Peshev <[email protected]>
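As a hypothetical usage sketch building on the `Registry` example above (the metric names shown are assumptions, not necessarily the exact CockroachDB names), the load endpoint could scrape just the handful of metrics it serves:

```go
// Hypothetical sketch of the "selective" scrape for the load endpoint;
// metric names and the visit callback are illustrative only.
func scrapeLoadMetrics(r *Registry, visit func(name string, val interface{})) {
	loadMetrics := map[string]struct{}{
		"sql.conns":             {}, // current SQL connections
		"sql.query.count":       {}, // SQL queries executed
		"jobs.running_non_idle": {}, // running jobs not reporting idle
	}
	r.Select(loadMetrics, visit)
}
```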
Previously serverless was using the sql jobs running metric to determine
if a tenant process is idle and can be shut down. With the introduction
of continuously running jobs this isn't a good indicator anymore. A
recent addition is a set of per-job metrics that show whether each job is
running or idle. The autoscaler doesn't care about the individual jobs
and only cares about the total number of jobs that are running but
haven't reported as being idle. The pull rate is also very high, so
retrieving all the individual running/idle metrics for each job type
isn't optimal. So this PR adds a single metric that simply aggregates and
tracks the total count of jobs running and not idle.
Release justification: Bug fixes and low-risk updates to new functionality
Release note: None
Will be rebased once #79021 is merged.
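A minimal sketch of the kind of aggregate gauge the description refers to (names are assumptions; the real jobs code is more involved): a single count across all job types that goes up when a job starts doing work and down when it finishes or reports idle, so the autoscaler only has to poll one value.

```go
package jobs

import "sync/atomic"

// RunningNonIdle sketches the aggregated metric: one count across all job
// types rather than per-job running/idle gauges. Names are illustrative.
type RunningNonIdle struct {
	n int64
}

// Inc is called when a job starts running (or wakes up from idle).
func (g *RunningNonIdle) Inc() { atomic.AddInt64(&g.n, 1) }

// Dec is called when a running job completes or marks itself idle.
func (g *RunningNonIdle) Dec() { atomic.AddInt64(&g.n, -1) }

// Value is the single number the serverless autoscaler polls: the total
// count of jobs that are running but have not reported as idle.
func (g *RunningNonIdle) Value() int64 { return atomic.LoadInt64(&g.n) }
```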