Metrics #1477
Comments
@randomvariable: You must be a member of the kubernetes-sigs/cluster-api-maintainers GitHub team to set the milestone. If you believe you should be able to issue the /milestone command, please contact your Cluster API Maintainers and have them propose you as an additional delegate for this responsibility.
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
/priority important-long-term
Are we thinking of adding Prometheus-style metrics and an endpoint on the controller manager?
There should already be rudimentary support for the metrics endpoint through controller-runtime; we should just need to add additional metrics.
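As a rough illustration of what "adding additional metrics" means on the wire, here is a minimal, stdlib-only sketch of a labeled counter rendered in the Prometheus text exposition format. The `counterVec` type and the metric name `capi_example_machine_reconcile_total` are hypothetical and do not exist in Cluster API. In a real controller you would more likely define a Prometheus client collector and register it with the registry that controller-runtime already serves, rather than hand-rolling the format.

```go
package main

import (
	"fmt"
	"sort"
	"strings"
	"sync"
)

// counterVec is a hypothetical, stdlib-only stand-in for a labeled
// Prometheus counter. It is not the controller-runtime or client_golang API.
type counterVec struct {
	mu     sync.Mutex
	name   string
	help   string
	label  string
	values map[string]uint64
}

func newCounterVec(name, help, label string) *counterVec {
	return &counterVec{name: name, help: help, label: label, values: map[string]uint64{}}
}

// Inc adds one to the series identified by labelValue.
func (c *counterVec) Inc(labelValue string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.values[labelValue]++
}

// Render returns the counter in the Prometheus text exposition format,
// with series sorted by label value so the output is deterministic.
// %q is used for quoting, which is adequate for simple ASCII label values.
func (c *counterVec) Render() string {
	c.mu.Lock()
	defer c.mu.Unlock()
	keys := make([]string, 0, len(c.values))
	for k := range c.values {
		keys = append(keys, k)
	}
	sort.Strings(keys)
	var b strings.Builder
	fmt.Fprintf(&b, "# HELP %s %s\n", c.name, c.help)
	fmt.Fprintf(&b, "# TYPE %s counter\n", c.name)
	for _, k := range keys {
		fmt.Fprintf(&b, "%s{%s=%q} %d\n", c.name, c.label, k, c.values[k])
	}
	return b.String()
}

func main() {
	// Hypothetical metric: reconcile outcomes of the machine controller.
	reconciles := newCounterVec(
		"capi_example_machine_reconcile_total",
		"Machine reconciles by result (hypothetical metric).",
		"result",
	)
	reconciles.Inc("success")
	reconciles.Inc("success")
	reconciles.Inc("error")
	fmt.Print(reconciles.Render())
}
```

A counter split by a `result` label is the usual basis for an alerting rule on error rate, which is the kind of operator use case discussed in this thread.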
Yes. We already have the metrics server (line 102 in aa86af2).
Thanks for the tips. I can start looking into adding some metrics.
/assign
Let me know when you start work on it, @wfernandes, and we'll mark it as lifecycle/active.
/lifecycle active
This file (capi-metrics.txt) contains some of the current metrics provided by the controller-runtime framework.
I noticed we have an event recorder that some of the controllers use to track important events, for example in the machine controller.
I would like the additional metrics we expose to be useful to an operator, for example as the basis for an alerting rule.
Adding these for reference:
We've talked about adding additional events to the Machine. For example, I think it would make sense to emit an event based on the following things:
We've also just added https://github.com/kubernetes-sigs/cluster-api-provider-aws/blob/master/docs/observability.md to the AWS provider; a similar doc here would also be useful to consumers.
It would be great if the cluster/machine deployment create/delete/resize latencies could be included in the CRD status field once the objects are created.
@harishspqr if we record them, they wouldn't be included in the status field. They would be exposed as metrics accessible from the metrics server port.
@harishspqr What type of calculation are you looking for out of these latencies?
Issues go stale after 90d of inactivity. If this issue is safe to close now, please do so. Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle frozen
/reopen
@sbueringer: Reopened this issue.
@fabriziopandini to split this issue into multiple ones and provide additional context / resolution
I think the best way forward for the metrics effort is to join forces with https://github.com/mercedes-benz/cluster-api-state-metrics in order to provide a first set of "core metrics" about the Cluster API API objects. With regard to implementing metrics inside CAPI, I will limit this to a small number of use cases (TBD) focused on the internal functioning of the controllers, e.g.:
I will raise the point at the next office hours to see if there is agreement on this course of action.
This makes sense to me; these can be used as basic SLIs that, e.g., CI systems consume to ensure SLOs are met. Regarding https://github.com/mercedes-benz/cluster-api-state-metrics, which looks awesome: in order to move it towards a community pluggable component, I think it'd be good to have a common doc/issue to discuss and agree on all the use cases/goals we want to achieve with those metrics.
We could also ask the maintainers of
If I understood correctly, maintainers offered that already :) (@chrischdi @tobiasgiese please confirm :)) @vincepri What would be the road towards donation?
Yes, that was one possible target, and one reason we published it to GitHub in the first place. I think the donation would help to get more contributions for the project and would increase acceptance of it. We will try to talk to the relevant people internally to clarify whether proceeding is okay for them (to complete the formalities) 👍
So after having a first talk, we would be willing to donate the project :-) Would a donation to kubernetes-sigs be a better fit than donating it to the CNCF? If contributing to kubernetes-sigs, the requirements are documented here: source. This would require us to:
@chrischdi that's great news! Personally, I'm also intrigued by Vince's suggestion, because it can help in having a quicker start (it is just a PR) and, most importantly, it allows closer collaboration with the CAPI core team while migrating the project to v1beta1, consolidating the first set of metrics, etc. But I will be happy with either option.
I agree, it's an interesting option to integrate it in the core CAPI repo. It would definitely make it a lot easier to integrate it in Tilt, the release process, etc. I think in that case we would create a new (probably top-level) folder in the CAPI repo for it. From a high-level perspective, it would essentially be a core component of ClusterAPI.
That sounds good. I will attend next week 👍. @tobiasgiese or @johannesfrey: it would be great if one of you could also free up some time :-)
One additional point: I think it also helps us to better close the metrics feedback loop for CAPI developers, just like we currently try to make logging part of our development workflow. If we're able to include metrics and logging in our development workflows (especially for debugging), it will help to continuously improve both, and we will slowly start considering them part of the ClusterAPI "deliverable". Essentially, we take on responsibility not only for the behavior/implementation of our controllers but also for how observable / operable / debuggable they are. I think this gets us closer to the principle that things usually improve once the folks responsible for development use them themselves and rely on them (in its most extreme form, if they are on call for it; in our case, if we integrate them into dev workflows).
Sure, I'll attend also, looking forward to it
Sure, count me in as well :).
First of all, thanks for the discussion in the office hours. As a follow-up, the next steps would be:
We created a first draft of the Cluster API State Metrics proposal.
If there are no objections, I think we should close this issue now that #6404 is open. For future conversations around metrics, it's better IMO to have issues with a tighter scope.
+1 to close this as soon as #6404 merges
I think #6404 only addresses state metrics. This still leaves metrics exposed by our controllers. Did I understand correctly that we would open a new, tightly scoped issue for this before closing this issue?
@sbueringer I think this issue is very old and has gone through too many transformations to be really useful anymore, so any future work should go in newer, more tightly scoped issues. I'm not sure exactly which specific metrics exposed by controllers we'd be missing by closing this issue now, though.
/close
@fabriziopandini: Closing this issue.
User Story
(copied from kubernetes-sigs/cluster-api-provider-aws#78)
A cluster operator may want metrics from the controllers to be available in their monitoring system of choice, e.g. Prometheus. They would be interested in metrics on the creation, deletion, mutation and failure rates of AWS APIs, and on the state of their objects, at least to help diagnose when limits have been reached.
Goals
Non-Goals
Anything else you would like to add:
Current version of controller-runtime should have OpenCensus support.
/kind feature
/priority important/long-term
/milestone next