
Bind controller-manager and scheduler to node ip address to expose /metrics endpoint #2388

Closed
ksa-real opened this issue Feb 7, 2021 · 5 comments
Labels
kind/design (Categorizes issue or PR as related to design.) · kind/feature (Categorizes issue or PR as related to a new feature.) · priority/awaiting-more-evidence (Lowest priority. Possibly useful, but not yet enough support to actually get it done.)
Milestone
v1.21

Comments


ksa-real commented Feb 7, 2021

FEATURE REQUEST

Versions

kubeadm version (use kubeadm version): 1.20.2

Environment:

  • Kubernetes version (use kubectl version): v1.20.2
  • Cloud provider or hardware configuration: Bare-metal
  • OS (e.g. from /etc/os-release): Gentoo 2.7
  • Kernel (e.g. uname -a): 5.10.4-gentoo
  • Others:

What happened?

TL;DR: The controller-manager and scheduler metrics endpoints are inaccessible from outside the node with the default kubeadm setup. The proposal is to use the node IP address instead of 127.0.0.1 in the --bind-address argument and probe configs of kube-controller-manager (KCM) and kube-scheduler (KS).

Both KCM and KS bind to 127.0.0.1 (ports 10257 and 10259 respectively) to expose the /healthz and /metrics endpoints. /healthz is meant mostly for local consumption (probes), whereas /metrics is expected to be scraped by a central metrics scraper (Prometheus or similar), which typically does not run as a daemonset on control-plane nodes. The current workaround is to use --bind-address=0.0.0.0, which may overexpose metrics to unwanted interfaces and, effectively, to the internet. This can be mitigated by applying firewall rules (e.g. iptables) that drop all traffic to these ports except traffic destined to the node IP.
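
For illustration, a rough sketch of such firewall rules, assuming the node IP is 10.0.0.5 (a placeholder) and the default secure ports 10257/10259:

# allow local probes over loopback
iptables -A INPUT -i lo -p tcp -m multiport --dports 10257,10259 -j ACCEPT
# allow scrapes addressed to the node IP
iptables -A INPUT -p tcp -d 10.0.0.5 -m multiport --dports 10257,10259 -j ACCEPT
# drop everything else reaching these ports
iptables -A INPUT -p tcp -m multiport --dports 10257,10259 -j DROP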

Another solution is to patch the kubeadm config file (kubeadm.yml) with the sections below right before running kubeadm init phase control-plane controller-manager --config kubeadm.yml (same for KS):

controllerManager:
  extraArgs:
    bind-address: <NODE IP ADDRESS>
scheduler:
  extraArgs:
    bind-address: <NODE IP ADDRESS>
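
For completeness, a minimal sketch of such a kubeadm.yml; the node IP (10.0.0.5) and Kubernetes version are placeholders, and the apiVersion matches the v1beta2 format used by kubeadm 1.20:

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.20.2
controllerManager:
  extraArgs:
    bind-address: 10.0.0.5   # this node's IP
scheduler:
  extraArgs:
    bind-address: 10.0.0.5

The obvious drawback is that the file has to be rendered per node, since every control-plane node has a different IP, which is what the variable-substitution proposal below is trying to avoid.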

One more workaround is to use kubeadm's --experimental-patches argument. The downside is that it modifies the static manifests rather than the kubeadm config, depends on the position of the --bind-address argument in the argument list, and requires patching multiple places (the command line and two probes for each component).

Arguably, most users would prefer the /metrics endpoint to be usable by default, without such effort.

The proposed solutions are:

  • Bind to the node IP address by default, similarly to the API server and etcd.
  • Provide an argument in the kubeadm config to bind to the node IP instead of 127.0.0.1. For example:
controllerManager:
  bindToNodeAddress: true
scheduler:
  bindToNodeAddress: true
  • Provide a variable for kubeadm that will be replaced with the node IP address. Example:
controllerManager:
  extraArgs:
    bind-address: $NODE_IP_ADDRESS # or template style - {{ nodeIPAddress }}
scheduler:
  extraArgs:
    bind-address: $NODE_IP_ADDRESS

neolit123 commented Feb 8, 2021

binding to the node IP is intuitive, but here is my argument against this:

  • kubeadm users that care about metrics are not a majority.
  • this goes against security hardening. if a component doesn't have to serve on a non-loopback address by default for most users, then it shouldn't.

bindToNodeAddress: true
...
bind-address: $NODE_IP_ADDRESS

these options are unlikely to happen, given that a number of workarounds exist, including phases and patches.

One more workaround is to use kubeadm's --experimental-patches argument. The downside is that it modifies the static manifests rather than the kubeadm config, depends on the position of the --bind-address argument in the argument list, and requires patching multiple places (the command line and two probes for each component).

the positional-argument problem is not real: you can pass an additional --bind-address flag and only the last one will be used.
i.e. your patch can insert at the end of the container args.

the probe host is a key (not positional), so there is no problem there.

the kube-scheduler.json and kube-controller-manager.json patches look like this.

[
	{ "op": "add", "path": "/spec/containers/0/command/-", "value": "--bind-address=SOME_IP" },
	{ "op": "replace", "path": "/spec/containers/0/livenessProbe/httpGet/host", "value": "SOME_IP" },
	{ "op": "replace", "path": "/spec/containers/0/startupProbe/httpGet/host", "value": "SOME_IP" }
]
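
For reference, the patch files are placed in a directory, named after the target component, and that directory is passed at init time, roughly (paths are placeholders):

	kubeadm init --experimental-patches /path/to/patches ...

with /path/to/patches containing kube-controller-manager.json and kube-scheduler.json as above.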

EDIT: i've commented on the prometheus-operator ticket.
prometheus-operator/kube-prometheus#718
let's see if more users want this change.

EDIT: added startup probe.

@neolit123 neolit123 added kind/feature Categorizes issue or PR as related to a new feature. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Feb 8, 2021
@neolit123 neolit123 added this to the v1.21 milestone Feb 8, 2021
@neolit123 neolit123 added the kind/design Categorizes issue or PR as related to design. label Feb 8, 2021

ksa-real commented Feb 8, 2021

Thanks. Confirmed the suggested kubeadm patches work. My slightly modified version:

# kubeadm-patches/kube-controller-manager+json.yaml
- op: add
  path: /spec/containers/0/command/-
  value: --bind-address=SOME_IP
- op: replace
  path: /spec/containers/0/livenessProbe/httpGet/host
  value: SOME_IP
- op: replace
  path: /spec/containers/0/startupProbe/httpGet/host
  value: SOME_IP
  
# kubeadm-patches/kube-scheduler+json.yaml
- op: add
  path: /spec/containers/0/command/-
  value: --bind-address=SOME_IP
- op: replace
  path: /spec/containers/0/livenessProbe/httpGet/host
  value: SOME_IP
- op: replace
  path: /spec/containers/0/startupProbe/httpGet/host
  value: SOME_IP

My small concern about patches is that they are applied to the manifests: if the manifest structure is not guaranteed to be stable, the patches may become incorrect or inapplicable. However, in this particular case breakage is unlikely, so patches are good enough for me.


neolit123 commented Feb 9, 2021

i brought this topic up for discussion in the SIG Cluster Lifecycle meeting today and we agreed that this change is not something we see as necessary, since the user base consuming metrics for these components is not big and it is a topology change.

some discussed alternatives:

  • use a proxy in front of the loopback-only endpoints (apparently kops already has something like that; a rough sketch follows below)
  • use a daemonset that scrapes on localhost
  • use the kubeadm patches (proposed above).

https://docs.google.com/document/d/1Gmc7LyCIL_148a9Tft7pdhdee0NBHdOfHS1SAF0duI4/edit#
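
A rough sketch of the proxy alternative (not something spelled out in the meeting notes): a hostNetwork daemonset on the control-plane nodes that forwards the node IP to the loopback-only KCM port, so an external Prometheus can reach it. The image, names and ports are assumptions, and TLS/authentication of the target port is not covered:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kcm-metrics-proxy            # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kcm-metrics-proxy
  template:
    metadata:
      labels:
        app: kcm-metrics-proxy
    spec:
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/master: ""     # control-plane nodes only
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: socat
          image: alpine/socat              # assumed image that ships socat
          # listen on the node IP, forward to the loopback-only secure port
          args:
            - TCP-LISTEN:10257,bind=$(NODE_IP),fork,reuseaddr
            - TCP:127.0.0.1:10257
          env:
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP

The same idea applies to the scheduler port; the kubeadm patches above remain the simplest option if changing the bind address is acceptable.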

@saphoooo

How can I figure out which instance of kube-scheduler or kube-controller-manager is the leader without this endpoint?
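
For what it's worth (not an authoritative answer): with the default leader-election settings the current holder is recorded in a Lease object in kube-system, so it can be checked without any metrics endpoint, e.g.

	kubectl -n kube-system get lease kube-controller-manager -o jsonpath='{.spec.holderIdentity}'
	kubectl -n kube-system get lease kube-scheduler -o jsonpath='{.spec.holderIdentity}'

On older clusters the same information may instead live in the control-plane.alpha.kubernetes.io/leader annotation on the corresponding Endpoints objects in kube-system.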


kingdonb commented Jan 17, 2022

kubeadm users that care about metrics are not a majority.

I'm not saying I disagree with your conclusions, but this seems like an opinionated take with bad results for UX.

I was just able to configure the kube-prometheus-stack operator Helm chart with Alertmanager on my cluster successfully for the first time, and I spent about two hours tracking down "why do these four important Kubernetes services show a TargetDown status?". While I was able to figure out how to make all four of these services (etcd, kube-controller-manager, kube-scheduler, and kube-proxy) listen on addresses that Prometheus could scrape, I'm not at all clear on how to do this at kubeadm init time, or whether there is a stable way to impose this configuration. Even with that, I had to make four separate manual configuration updates to /etc/kubernetes/manifests, changes that I'm reasonably confident about but that I'm pretty sure would be blown away the next time I run kubeadm reset and kubeadm init.

It would be great if there were a way to centrally configure kubeadm's child components and opt into this. Even if we are not in the majority, I think there are enough people out there using Prometheus and Alertmanager who would care about monitoring these kubeadm components. (I am not a kube-prometheus maintainer, but from my perspective as a user figuring things out for the first time, they clearly cared enough about it to put these targets in the default configuration!)

Admittedly I have not read all the kubeadm docs, and it might be that a page I'm unaware of addresses this use case specifically. Still, I think Prometheus integration should get special treatment: it should be straightforward to configure all of these components at once for metrics, or at least there should be one straightforward way to configure all four that is the same for each of them.

The least I can say is that, based on the steps I describe taking here (which were the minimum needed to turn my running Kubernetes cluster into one with a functioning Alertmanager without silences, starting from the default configuration of kube-prometheus-stack), monitoring a kubeadm-derived Kubernetes installation is not straightforward right now:

Maybe this was the document I needed to find, and maybe there are fixes that could be added to it that would address all of my concerns (I only just found this doc for the first time after hypothesizing that it might exist):

It looks like there are still some gaps; I think a bit more is needed to make this run smoothly. Comparing this with my comment on the Helm chart, I can see there are a few things I needed that aren't mentioned in this doc. Maybe we can take care of it all here. The parts that are missing are for kube-proxy and etcd; I'm not confident that I've configured those correctly, and it felt like I was on my own until I found this doc.
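
For what it's worth, my reading is that the missing pieces can also be expressed in the kubeadm config instead of hand-editing /etc/kubernetes/manifests; a hedged sketch, where the node IP (10.0.0.5), the listen addresses, and the apiVersion are assumptions on my part:

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
      listen-metrics-urls: http://10.0.0.5:2381   # etcd metrics listener
controllerManager:
  extraArgs:
    bind-address: 10.0.0.5
scheduler:
  extraArgs:
    bind-address: 10.0.0.5
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
metricsBindAddress: 0.0.0.0:10249                 # default is 127.0.0.1:10249

I haven't verified every field against the docs linked here, so treat this as a starting point rather than a recipe.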

The more I go through these docs, the more I find things that are just a bit too outdated (like references to CoreOS, which seem to date the documents and explain why there might be a difference between my experience and the documented state of the art here).

The more I look at it, the more I think my issue needs to go to the kube-prometheus community. Anyway, the pretty low traffic on this issue indicates this probably wasn't a very popular option, but I hope that if we make the UX good, it will be more popular in the future. It seems like besides the scheduler and controller-manager, etcd and kube-proxy are also ripe for configuration, and I'm afraid they might not all be configurable in a straightforward way through options in the kubeadm cluster configuration (or else someone would have documented this already, since it would be easy).
