Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics bind address and port in CAPI controllers #7957

Closed
MaxFedotov opened this issue Jan 19, 2023 · 8 comments
Closed

Metrics bind address and port in CAPI controllers #7957

MaxFedotov opened this issue Jan 19, 2023 · 8 comments
Labels
area/metrics Issues or PRs related to metrics kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.

Comments

@MaxFedotov
Copy link
Contributor

User Story

As a /user/operator I would like to collect controller-runtime metrics for CAPI controllers using the Prometheus operator without a need to patch each of the CAPI deployments manually after clusterctl install\update procedure.

Detailed Description

All CAPI components expose controller-runtime metrics, but by default metrics-bind-addr is set to localhost, and the metrics port is not added to the CAPI componentsDeployment specs. So in order to be able to collect these metrics using, for example, Prometheus Operator PodMonitor user had to patch manually each of the CAPI component's Deployment specs, update the metrics-bind-addr and add additional metrics port. And these operations should be repeated after every clusterctl update performed. Although it is possible to automate this pathing, this approach could be error-prone, in case of new parameters or ports will be added to Deployment specs in the future.

Maybe it is possible to set 0.0.0.0:8080 as the default value for metrics-bind-addr and add metrics port to each of the CAPI components?

/kind feature

@k8s-ci-robot k8s-ci-robot added kind/feature Categorizes issue or PR as related to a new feature. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 19, 2023
@sbueringer
Copy link
Member

sbueringer commented Jan 20, 2023

I generally understand and agree that something like this is desirable.

Some history:

  • A while back we were using kube-rbac-proxy which provided a secured metrics endpoint
  • This endpoint was exposed per default
  • Then we decided to get rid of the external dependency to not depend on a project hosted on a personal account
  • As a consequence we decided to only bind to localhost as there were concerns in the community that exposing the metrics port unauthenticated is a security issue

Those discussions happened IIRC on this PR (#4640) and also in office hours.

Independent of that, I think we should try to find a way forward which makes it easier to consume metrics from Cluster API controllers.

I'm not sure if it would be productive to try to change the decision to not expose an unsecured metrics port per default (not saying I personally have a strong opinion, I just don't think that that's a good way to resolve this).

I see the following options:

  1. Introduce a var to our YAMLs to allow exposing the metrics port ~ ${METRICS_BIND_ADDR:=localhost:8080} + always add the metrics port to our Deployment ports list.
  2. Find another way to secure the metrics port and then expose it per default. In an effort to find a successor to kube-rbac-proxy I proposed implementing a subset of the kube-rbac-proxy feature set in controller-runtime: Add authorization for metrics endpoint controller-runtime#2073. This would basically mean that every controller using controller-runtime could expose a secured metrics port like Kubernetes core components (apiserver, ...)

I think eventually 2 would be the best option, but it requires some time to actually get it done.

@sbueringer
Copy link
Member

/triage accepted

@k8s-ci-robot k8s-ci-robot added triage/accepted Indicates an issue or PR is ready to be actively worked on. and removed needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Jan 20, 2023
@fabriziopandini
Copy link
Member

I'm +1 for solution 2, the entire ecosystem will benefit from it

@MaxFedotov
Copy link
Contributor Author

hi @sbueringer, thanks! I also vote for the second solution, but it will require quite a lot of dev efforts to implement. May be it is possible to start with the first one (as it is more simple) and then start investigating the second one?

@sbueringer
Copy link
Member

No objection from my side, but I think we have to discuss this with more folks. Especially as solution 1 only becomes really useful if infra providers agree and are adopting this as well.

@fabriziopandini
Copy link
Member

/priority important-longterm
@sbueringer can we close this?

@k8s-ci-robot k8s-ci-robot added the priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. label Apr 11, 2024
@sbueringer
Copy link
Member

Yup. Done in v1.6 and also documented for providers in the migration guide (v1.5 => v1.6)

Here is the user facing documentation: https://cluster-api.sigs.k8s.io/tasks/diagnostics

/close

@MaxFedotov Please let us know if we missed something, but I think we're good

@k8s-ci-robot
Copy link
Contributor

@sbueringer: Closing this issue.

In response to this:

Yup. Done in v1.6 and also documented for providers in the migration guide (v1.5 => v1.6)

Here is the user facing documentation: https://cluster-api.sigs.k8s.io/tasks/diagnostics

/close

@MaxFedotov Please let us know if we missed something, but I think we're good

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/metrics Issues or PRs related to metrics kind/feature Categorizes issue or PR as related to a new feature. priority/important-longterm Important over the long term, but may not be staffed and/or may need multiple releases to complete. triage/accepted Indicates an issue or PR is ready to be actively worked on.
Projects
None yet
Development

No branches or pull requests

5 participants