release-22.1: Adding metrics required by the serverless autoscaler #79519

Conversation

darinpp
Contributor

@darinpp darinpp commented Apr 6, 2022

Backport:

Please see individual PRs for details.

/cc @cockroachdb/release

Release justification: Low risk, high reward changes to existing functionality

Release note: None

darinpp added 3 commits April 6, 2022 09:56
Previously PrometheusExporter could only export all the metrics in a
registry, with no way to select a subset. For serverless we use a
separate metrics endpoint (_status/load) that currently shows CPU
utilization metrics, which are generated each time the metrics are
pulled. However, we also need some additional metrics that are
currently tracked by MetricRecorder. Exporting all the metrics tracked
by the MetricRecorder is not desirable, as this incurs a performance
penalty given the higher poll rate on the load endpoint.
So this PR modifies PrometheusExporter so that it can scrape only a
selected subset of the metrics.
A second change is how the locking is done when scraping and writing
the scraped output. Previously the lock was external and was the
responsibility of the caller. This PR adds a ScrapeAndPrintAsText
method to the exporter that is thread safe and does the locking
internally.

Release justification: Low risk, high reward changes to existing functionality
Release note: None
Previously serverless was using the sql jobs running metric to determine
if a tenant process is idle and can be shut down. With the introduction
of continuously running jobs this isn't a good indicator anymore. A
recent addition is per-job metrics that show running or idle. The
autoscaler doesn't care about the individual jobs and only cares about
the total number of jobs that are running but haven't reported as being
idle. The pull rate is also very high, so retrieving all the individual
running/idle metrics for each job type isn't optimal. So this PR adds a
single metric that just aggregates and tracks the total count of jobs
running and not idle.

Release justification: Bug fixes and low-risk updates to new functionality
Release note: None
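
One way to picture the aggregate metric described in this commit message is a single gauge that every job adjusts as it starts, goes idle, resumes, or finishes. The sketch below is an assumption-laden illustration, not the code from the PR: `RunningNonIdleJobs`, `Job`, and their methods are hypothetical names, and each `Job` is assumed to be driven by a single goroutine.

```go
package main

import (
	"fmt"
	"sync/atomic"
)

// RunningNonIdleJobs is a hypothetical aggregate gauge: the number of jobs
// that are running and have not reported themselves as idle.
type RunningNonIdleJobs struct {
	count int64
}

// Value returns the current gauge value.
func (g *RunningNonIdleJobs) Value() int64 { return atomic.LoadInt64(&g.count) }

// Job (hypothetical) tracks its own running/idle state and keeps the shared
// gauge in sync on every state transition.
type Job struct {
	gauge   *RunningNonIdleJobs
	running bool
	idle    bool
}

// NewJob creates a job that reports into the given gauge.
func NewJob(g *RunningNonIdleJobs) *Job { return &Job{gauge: g} }

// Start marks the job as running and not idle.
func (j *Job) Start() {
	if j.running {
		return
	}
	j.running, j.idle = true, false
	atomic.AddInt64(&j.gauge.count, 1)
}

// MarkIdle lets a running job report that it is idle (or active again).
func (j *Job) MarkIdle(idle bool) {
	if !j.running || j.idle == idle {
		return
	}
	j.idle = idle
	if idle {
		atomic.AddInt64(&j.gauge.count, -1)
	} else {
		atomic.AddInt64(&j.gauge.count, 1)
	}
}

// Finish marks the job as no longer running.
func (j *Job) Finish() {
	if !j.running {
		return
	}
	if !j.idle {
		atomic.AddInt64(&j.gauge.count, -1)
	}
	j.running, j.idle = false, false
}

func main() {
	g := &RunningNonIdleJobs{}
	continuous := NewJob(g) // e.g. a continuously running job
	oneShot := NewJob(g)    // e.g. a job that runs to completion

	continuous.Start()
	oneShot.Start()
	fmt.Println(g.Value()) // both jobs are running and active

	continuous.MarkIdle(true)
	fmt.Println(g.Value()) // the continuous job no longer counts

	oneShot.Finish()
	fmt.Println(g.Value()) // nothing is running and non-idle
}
```

With this shape, a poller reads one integer instead of iterating over per-job-type running/idle metrics, which matters when the pull rate is high.
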
Previously there were only CPU-related metrics available on the
_status/load endpoint. For serverless we also need metrics that show
the total number of current sql connections, the number of sql queries
executed and the number of jobs currently running that are not idle.
This PR adds the three new metrics by using the selective prometheus
exporter and scraping the MetricsRecorder.

Release justification: Low risk, high reward changes to existing functionality
Release note: None
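
A rough sketch of how a load endpoint could combine CPU values generated on each pull with a few selected metrics from a recorder is shown below. Everything in it is an assumption for illustration: `recorder`, `loadHandler`, `fetch`, and the metric names are hypothetical and are not taken from the CockroachDB code. Only the `_status/load` path comes from the commit message.

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
	"sync"
)

// recorder is a hypothetical stand-in for a metrics recorder.
type recorder struct {
	mu     sync.Mutex
	values map[string]float64
}

// writeSelected writes only the named metrics, in the order given.
func (r *recorder) writeSelected(w io.Writer, names []string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	for _, n := range names {
		if v, ok := r.values[n]; ok {
			fmt.Fprintf(w, "%s %g\n", n, v)
		}
	}
}

// loadHandler serves CPU metrics computed at request time, followed by the
// selected metrics from the recorder.
func loadHandler(cpu func() (user, sys float64), rec *recorder, names []string) http.Handler {
	return http.HandlerFunc(func(w http.ResponseWriter, _ *http.Request) {
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		user, sys := cpu()
		fmt.Fprintf(w, "cpu_user_percent %g\n", user)
		fmt.Fprintf(w, "cpu_sys_percent %g\n", sys)
		rec.writeSelected(w, names)
	})
}

// fetch issues an in-memory GET against the handler and returns the body.
func fetch(h http.Handler) string {
	rr := httptest.NewRecorder()
	h.ServeHTTP(rr, httptest.NewRequest(http.MethodGet, "/_status/load", nil))
	return rr.Body.String()
}

func main() {
	rec := &recorder{values: map[string]float64{
		"sql_conns":             4,
		"sql_query_count":       250,
		"jobs_running_non_idle": 1,
		"unrelated_metric":      99, // present in the recorder, never exported
	}}
	cpu := func() (float64, float64) { return 0.25, 0.05 } // fixed values for the demo
	h := loadHandler(cpu, rec, []string{"sql_conns", "sql_query_count", "jobs_running_non_idle"})
	fmt.Print(fetch(h))
}
```

The handler never iterates over the full recorder contents, so the cost of each poll scales with the size of the allow-list rather than with the total number of tracked metrics.
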
@darinpp darinpp requested a review from a team April 6, 2022 16:58
@darinpp darinpp requested review from a team as code owners April 6, 2022 16:58
@darinpp darinpp requested review from samiskin and removed request for a team April 6, 2022 16:58
@blathers-crl

blathers-crl bot commented Apr 6, 2022

Thanks for opening a backport.

Please check the backport criteria before merging:

  • Patches should only be created for serious issues or test-only changes.
  • Patches should not break backwards-compatibility.
  • Patches should change as little code as possible.
  • Patches should not change on-disk formats or node communication protocols.
  • Patches should not add new functionality.
  • Patches must not add, edit, or otherwise modify cluster versions; or add version gates.
If some of the basic criteria cannot be satisfied, ensure that the exceptional criteria are satisfied within.
  • There is a high priority need for the functionality that cannot wait until the next release and is difficult to address in another way.
  • The new functionality is additive-only and only runs for clusters which have specifically “opted in” to it (e.g. by a cluster setting).
  • New code is protected by a conditional check that is trivial to verify and ensures that it only runs for opt-in clusters.
  • The PM and TL on the team that owns the changed code have signed off that the change obeys the above rules.

Add a brief release justification to the body of your PR to justify this backport.

Some other things to consider:

  • What did we do to ensure that a user that doesn’t know & care about this backport, has no idea that it happened?
  • Will this work in a cluster of mixed patch versions? Did we test that?
  • If a user upgrades a patch version, uses this feature, and then downgrades, what happens?

@cockroach-teamcity
Member

This change is Reviewable

@darinpp darinpp merged commit 2963625 into cockroachdb:release-22.1 Apr 6, 2022
2 participants