Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

extend metrics for github and provider executions #217

Merged
merged 5 commits into from
Feb 22, 2024

Conversation

bavarianbidi
Copy link
Contributor

@bavarianbidi bavarianbidi commented Feb 21, 2024

This PR add couple of new metrics to count interactions with the github API and external provider binaries.

i've followed the recommendation of the prometheus-folks when it came to metrics for this kind of problem: https://promlabs.com/blog/2023/09/19/errors-successes-totals-which-metrics-should-i-expose-to-prometheus/#recommended-for-binary-outcomes-exposing-errors-and-totals

provider binary example metrics:

# HELP garm_runner_operations_total Total number of instance operation attempts
# TYPE garm_runner_operations_total counter
garm_runner_operation_total{operation="CreateInstance",provider="kubernetes_external"} 2
garm_runner_operation_total{operation="DeleteInstance",provider="kubernetes_external"} 2

github api call example metrics:

# HELP garm_github_errors_total Total number of failed github operation attempts
# TYPE garm_github_errors_total counter
garm_github_errors_total{operation="GenerateOrgJITConfig",scope="Organization"} 2
# HELP garm_github_operations_total Total number of github operation attempts
# TYPE garm_github_operations_total counter
garm_github_operations_total{operation="GenerateOrgJITConfig",scope="Organization"} 2
garm_github_operations_total{operation="ListOrganizationRunnerApplicationDownloads",scope="Organization"} 2
garm_github_operations_total{operation="ListOrganizationRunners",scope="Organization"} 1
garm_github_operations_total{operation="RemoveRunner",scope="Organization"} 4

towards #181

TODO:

  • document new metrics
  • discuss golangci.yml with metrics linter --> will do another PR with a first golangci.yaml for this project
  • do a second round and search for all github api calls

introduce metrics counter for github api calls

Signed-off-by: Mario Constanti <[email protected]>
Signed-off-by: Mario Constanti <[email protected]>
metrics.GithubOperationFailedCount.WithLabelValues(
"GenerateEnterpriseJITConfig", // label: operation
metricsLabelEnterpriseScope, // label: scope
).Inc()
if resp != nil && resp.StatusCode == http.StatusUnauthorized {
return nil, nil, fmt.Errorf("failed to get JIT config: %w", err)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no related to this change.

Note to self: This needs to wrap runnerErrors.ErrUnauthorized

@gabriel-samfira
Copy link
Member

This looks good to me. All this makes me think we might benefit from some telemetry as well. Maybe in a future release we could leverage OpenTelemetry.

I could merge this if you're satisfied with it. Let me know. If you're still working on it, feel free to convert it to draft so I don't accidentally merge it prematurely 😁

@bavarianbidi
Copy link
Contributor Author

This looks good to me. All this makes me think we might benefit from some telemetry as well. Maybe in a future release we could leverage OpenTelemetry.

I could merge this if you're satisfied with it. Let me know. If you're still working on it, feel free to convert it to draft so I don't accidentally merge it prematurely 😁

OpenTelemetry is definitely (IMHO) the next thing when it came to observability. TBH i didn't spend that much time into it to understand how/what we need on our side to use it within our company internal managed prometheus service.
But as you said, something totally different and unrelated to this PR.

Will convert the PR to draft to finalize it tomorrow morning (especially the second round of looking for external api-calls).

@bavarianbidi bavarianbidi marked this pull request as draft February 21, 2024 18:38
@gabriel-samfira
Copy link
Member

OpenTelemetry is definitely (IMHO) the next thing when it came to observability. TBH i didn't spend that much time into it to understand how/what we need on our side to use it within our company internal managed prometheus service.
But as you said, something totally different and unrelated to this PR.

definitely something for future releases, not even for the next one. otel allows us to trace calls throughout the codebase and allows us to add context to those traces, like what we're calling functions with, what error they return. With that info, we can then generate detailed graphs of how a call traveled throughout the code base. Further more, this can be correlated even with external providers.

But again, this is for much further into the future. For now, and especially in this PR, metrics are the focus.

@bavarianbidi
Copy link
Contributor Author

I've found three other API calls i missed in the first round - but for now it looks good - feel free to merge whenever you want

@gabriel-samfira gabriel-samfira merged commit dd6f1e4 into cloudbase:main Feb 22, 2024
4 checks passed
bavarianbidi pushed a commit to mercedes-benz/garm that referenced this pull request May 22, 2024
extend metrics for github and provider executions
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants