Add upjet runtime Prometheus metrics #170
- upjet_terraform_cli_duration: Reports statistics, in seconds, on how long it takes a Terraform CLI invocation to complete
- upjet_terraform_active_cli_invocations: The number of active (running) Terraform CLI invocations
- upjet_terraform_running_processes: The number of running Terraform CLI and Terraform provider processes
- upjet_resource_ttr: Measures, in seconds, the time-to-readiness for managed resources
- terraform.Operation.MarkStart now atomically checks for any previous ongoing operation before starting a new one
- terraform.Operation.{Start,End}Time no longer return pointers that could potentially be used to modify the shared state outside of critical sections.

Signed-off-by: Alper Rifat Ulucinar <[email protected]>
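The two concurrency changes listed in the commit message can be illustrated with a minimal, stdlib-only sketch. The `Operation` type below is a hypothetical stand-in and not upjet's actual `terraform.Operation`: the field names, the `bool` return value of `MarkStart`, and the `MarkEnd` helper are assumptions made for illustration. It shows the check for an ongoing operation and the slot reservation happening under one lock, and `StartTime` returning a copy rather than a pointer into shared state.

```go
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
	"time"
)

// Operation is a hypothetical, simplified stand-in for an operation
// tracker; it is NOT the real upjet type.
type Operation struct {
	mu        sync.RWMutex
	opType    string
	startTime *time.Time
	endTime   *time.Time
}

// MarkStart checks for an ongoing operation and reserves the slot in a
// single critical section. It reports whether the reservation succeeded.
func (o *Operation) MarkStart(opType string) bool {
	o.mu.Lock()
	defer o.mu.Unlock()
	// An operation is ongoing if it has started but not yet ended.
	if o.startTime != nil && o.endTime == nil {
		return false
	}
	now := time.Now()
	o.opType = opType
	o.startTime = &now
	o.endTime = nil
	return true
}

// MarkEnd records the end time, releasing the slot.
func (o *Operation) MarkEnd() {
	o.mu.Lock()
	defer o.mu.Unlock()
	now := time.Now()
	o.endTime = &now
}

// StartTime returns a snapshot copy of the start time (the zero time if
// no operation has started), so callers cannot modify shared state.
func (o *Operation) StartTime() time.Time {
	o.mu.RLock()
	defer o.mu.RUnlock()
	if o.startTime == nil {
		return time.Time{}
	}
	return *o.startTime
}

func main() {
	op := &Operation{}
	var wins int32
	var wg sync.WaitGroup
	// Several goroutines race for the slot; only one can reserve it
	// because the check and the reservation are not separable.
	for i := 0; i < 8; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if op.MarkStart("apply") {
				atomic.AddInt32(&wins, 1)
			}
		}()
	}
	wg.Wait()
	fmt.Println("successful reservations:", wins)
}
```

With a separate check and a separate reservation, two goroutines could both pass the check before either records a start time; merging the two steps removes that window regardless of how the callers are scheduled.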
Thanks @ulucinar LGTM!
@ulucinar this looks really great! I'm however by no means a Prometheus expert.
I asked @AaronME to take a look as well and tagged him for the review.
I think the question we had was around high cardinality metrics, especially the gauge ones that collect frequently. What I can recommend is to see how the metrics are collected and if some of them are too frequent, maybe aggregate over them.
I'm approving to get this unblocked and we can keep iterating.
Hi @sergenyalcin, @Piotr1215,
Yes, correct. When the PR was first opened, it contained
Thanks for looping me in, @Piotr1215! @ulucinar This looks good to me! Thank you for tackling this.
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
Consider this non-blocking, but I'd prefer to use OpenTelemetry to expose Prom metrics. I believe we mostly use Otel for Upbound things internally, and it would open a path to use one SDK for all observability (i.e. traces and logs too).
We've held off on this in the past waiting to see what controller-runtime would do per kubernetes-sigs/controller-runtime#305.
Thanks @negz for the pointer. Makes sense to me.
Let's proceed with the Prometheus metrics for now as the controller-runtime still makes use of them. I have not checked if it's possible with OpenTelemetry metrics but it was convenient to register upjet's custom metrics with the controller-runtime's registry. What do you think?
Opened #171 to track this. Thank you @negz for bringing this up. Let's track it there.
Description of your changes
Fixes #167
This PR adds the following Prometheus metrics to the upjet runtime. These are upjet runtime metrics, meaning that they are exposed by a provider while reconciling its managed resources via upjet:
- `upjet_terraform_cli_duration`: This is a histogram metric and reports statistics, in seconds, on how long it takes a Terraform CLI invocation to complete.
- `upjet_terraform_active_cli_invocations`: This is a gauge metric and it's the number of active (running) Terraform CLI invocations.
- `upjet_terraform_running_processes`: This is a gauge metric and it's the number of running Terraform CLI and Terraform provider processes.
- `upjet_resource_ttr`: This is a histogram metric and it measures, in seconds, the time-to-readiness for managed resources.

`terraform.Operation.MarkStart` now atomically checks for any previous ongoing operation before starting a new one, and `terraform.Operation.{Start,End}Time` no longer return pointers that could potentially be used to modify the shared state outside of critical sections.

The following labels are available for the exposed runtime metrics:
- `upjet_terraform_cli_duration`: `subcommand` and `mode`.
  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`, `apply`, `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI was invoked synchronously as part of a reconcile loop) or `async` (the CLI was invoked asynchronously; the reconciler goroutine will poll and collect the results in the future).
- `upjet_terraform_active_cli_invocations`: `subcommand` and `mode`.
  - `subcommand`: The `terraform` subcommand that's run, e.g., `init`, `apply`, `plan`, `destroy`, etc.
  - `mode`: The execution mode of the Terraform CLI, one of `sync` (the CLI was invoked synchronously as part of a reconcile loop) or `async` (the CLI was invoked asynchronously; the reconciler goroutine will poll and collect the results in the future).
- `upjet_terraform_running_processes`: `type`
  - `type`: Either `cli` for Terraform CLI processes (the `terraform` process) or `provider` for the Terraform provider processes. Please note that this is a best-effort metric that may not be able to precisely catch & report all relevant processes. We may, in the future, improve this if needed by, for example, watching the `fork` system calls. But currently, it may prove useful for watching rogue Terraform provider processes.
- `upjet_resource_ttr`: `group`, `version`, `kind`
  - The `group`, `version` and `kind` labels record the API group, version and kind of the managed resource whose time-to-readiness measurement is captured.

Notes on the concurrency-related changes:
- `terraform.Operation.MarkStart` now atomically checks for ongoing async operations and reserves the "operation slot" (by recording the start time): We were previously checking whether there's an ongoing async operation in a critical section, exiting out of the critical section and then entering another section where we do the reservation as follows:

  From a theoretical perspective this does not look right but in fact, the above section is never executed by two concurrent goroutines (on the same operation) and thus is safe, as long as the controller-runtime behaves according to this assumption. But nevertheless, this PR proposes to change `MarkStart` so that it atomically checks and reserves the slot because:
- `terraform.Operation.{Start,End}Time` no longer return pointers that could potentially be used to modify the shared state outside of critical sections: Not sure if this has practical implications but again from a theoretical point of view, it's good practice to read the data in a critical section, make a copy of it, and return that snapshot copy so that its clients will not have a chance to modify the shared state outside of a critical section.

I have:
- Run `make reviewable` to ensure this PR is ready for review.
- Added `backport release-x.y` labels to auto-backport this PR if necessary.

How has this code been tested
10 `userpool.cognitoidp` resources from `upbound/provider-aws` were provisioned, reconciled with a poll interval of 1m twice after acquiring the `Ready=True` status condition, and they were finally destroyed. Here are some sample screenshots from the Prometheus UI:

- `upjet_terraform_active_cli_invocations` gauge metric showing the sync & async `terraform init/apply/plan/destroy` invocations
- `upjet_terraform_running_processes` gauge metric showing both `cli` and `provider` labels
- `upjet_terraform_cli_duration` histogram metric, showing average Terraform CLI running times for the last 5m
- The medians (0.5-quantiles) for these observations aggregated by the mode and Terraform subcommand being invoked
- `upjet_resource_ttr` histogram metric, showing average resource TTR for the last 10m
- The median (0.5-quantile) for these TTR observations
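For readers less familiar with how the averages and medians above are derived from a histogram metric, here is a rough, stdlib-only sketch. A Prometheus histogram exposes cumulative bucket counts plus a sum and a count; the average is sum divided by count, and a quantile is estimated by linear interpolation inside the bucket that contains the target rank. The bucket bounds and counts below are made-up sample values, not measurements from this PR, and the `bucket`, `quantile` and `mean` names are illustrative only.

```go
package main

import (
	"fmt"
	"math"
)

// bucket is one cumulative histogram bucket: count is the number of
// observations less than or equal to upperBound.
type bucket struct {
	upperBound float64 // math.Inf(1) for the +Inf bucket
	count      float64
}

// sampleBuckets holds made-up values for illustration only.
var sampleBuckets = []bucket{
	{1, 2}, {2.5, 6}, {5, 9}, {math.Inf(1), 10},
}

// mean returns the average observation given a histogram's sum and count.
func mean(sum, count float64) float64 {
	if count == 0 {
		return math.NaN()
	}
	return sum / count
}

// quantile estimates the q-quantile from cumulative buckets sorted by
// upper bound, interpolating linearly within the matching bucket.
func quantile(q float64, buckets []bucket) float64 {
	if len(buckets) == 0 || q < 0 || q > 1 {
		return math.NaN()
	}
	total := buckets[len(buckets)-1].count
	if total == 0 {
		return math.NaN()
	}
	rank := q * total
	for i, b := range buckets {
		if b.count < rank {
			continue // target rank lies in a later bucket
		}
		lower, below := 0.0, 0.0
		if i > 0 {
			lower, below = buckets[i-1].upperBound, buckets[i-1].count
		}
		if math.IsInf(b.upperBound, 1) {
			// No finite upper bound to interpolate towards: fall back
			// to the highest finite bound (0 if there is none).
			return lower
		}
		if b.count == below {
			return lower
		}
		return lower + (b.upperBound-lower)*(rank-below)/(b.count-below)
	}
	return math.NaN()
}

func main() {
	fmt.Println("median:", quantile(0.5, sampleBuckets)) // median: 2.125
	fmt.Println("mean:", mean(27.5, 10))                 // mean: 2.75
}
```

Because the estimate is interpolated within a bucket, its accuracy depends on how well the bucket boundaries match the observed durations; wide buckets around the typical CLI running time or TTR give coarse medians.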