
Bind controller-manager and scheduler to node ip address to expose /metrics endpoint #2388

Closed
ksa-real opened this issue Feb 7, 2021 · 5 comments
Labels
kind/design (Categorizes issue or PR as related to design.) · kind/feature (Categorizes issue or PR as related to a new feature.) · priority/awaiting-more-evidence (Lowest priority. Possibly useful, but not yet enough support to actually get it done.)
Milestone
v1.21

Comments


ksa-real commented Feb 7, 2021

FEATURE REQUEST

Versions

kubeadm version (use kubeadm version): 1.20.2

Environment:

  • Kubernetes version (use kubectl version): v1.20.2
  • Cloud provider or hardware configuration: Bare-metal
  • OS (e.g. from /etc/os-release): Gentoo 2.7
  • Kernel (e.g. uname -a): 5.10.4-gentoo
  • Others:

What happened?

TL;DR: The controller-manager and scheduler metrics endpoints are inaccessible from outside the node with the default kubeadm setup. The proposal is to use the node IP address instead of 127.0.0.1 in the --bind-address argument and probe configs of kube-controller-manager (KCM) and kube-scheduler (KS).

Both KCM and KS bind to 127.0.0.1 (ports 10257 and 10259 respectively) to expose the /healthz and /metrics endpoints. /healthz is meant mostly for local consumption (probes), whereas /metrics is expected to be scraped by a central metrics scraper (Prometheus or similar), which typically does not run as a daemonset on control-plane nodes. The current workaround is to use --bind-address=0.0.0.0, which may overexpose metrics to unwanted interfaces and, effectively, to the internet. This can be mitigated by applying firewall rules (e.g. iptables) that drop all traffic to these ports except traffic destined to the node IP.
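
For illustration, a rough sketch of such firewall rules, assuming the node IP is 10.0.0.5 (a placeholder) and the default secure ports 10257/10259:

# allow local probes over loopback
iptables -A INPUT -i lo -p tcp -m multiport --dports 10257,10259 -j ACCEPT
# allow scrapes addressed to the node IP
iptables -A INPUT -p tcp -d 10.0.0.5 -m multiport --dports 10257,10259 -j ACCEPT
# drop everything else reaching these ports
iptables -A INPUT -p tcp -m multiport --dports 10257,10259 -j DROP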

Another solution is to patch the kubeadm config file (kubeadm.yml) with the sections below right before running kubeadm init phase control-plane controller-manager --config kubeadm.yml (same for KS):

controllerManager:
  extraArgs:
    bind-address: <NODE IP ADDRESS>
scheduler:
  extraArgs:
    bind-address: <NODE IP ADDRESS>
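
For completeness, a minimal sketch of such a kubeadm.yml; the node IP (10.0.0.5) and Kubernetes version are placeholders, and the apiVersion matches the v1beta2 format used by kubeadm 1.20:

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
kubernetesVersion: v1.20.2
controllerManager:
  extraArgs:
    bind-address: 10.0.0.5   # this node's IP
scheduler:
  extraArgs:
    bind-address: 10.0.0.5

The obvious drawback is that the file has to be rendered per node, since every control-plane node has a different IP, which is what the variable-substitution proposal below is trying to avoid.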

One more workaround is to use kubeadm's --experimental-patches argument. The downside is that it modifies the static manifests rather than the kubeadm config, depends on the position of the --bind-address argument in the argument list, and requires patching multiple places (the command line and two probes for each component).

Arguably, most users would prefer the /metrics endpoint to be usable by default, without such effort.

The proposed solutions are:

  • Bind to the node IP address by default, similarly to the API server and etcd.
  • Provide an argument in the kubeadm config to bind to the node IP instead of 127.0.0.1. For example:
controllerManager:
  bindToNodeAddress: true
scheduler:
  bindToNodeAddress: true
  • Provide a variable for kubeadm that will be replaced with the node IP address. Example:
controllerManager:
  extraArgs:
    bind-address: $NODE_IP_ADDRESS # or template style - {{ nodeIPAddress }}
scheduler:
  extraArgs:
    bind-address: $NODE_IP_ADDRESS

neolit123 commented Feb 8, 2021

binding to the node IP is intuitive, but here is my argument against this:

  • kubeadm users that care about metrics are not a majority.
  • this goes against security hardening. if a component doesn't have to serve on a non-loopback address by default for most users, then it shouldn't.

bindToNodeAddress: true
...
bind-address: $NODE_IP_ADDRESS

these options are unlikely to happen, given that a number of workarounds exist, including phases and patches.

One more workaround is to use kubeadm's --experimental-patches argument. The downside is that it modifies the static manifests rather than the kubeadm config, depends on the position of the --bind-address argument in the argument list, and requires patching multiple places (the command line and two probes for each component).

the positional-argument problem is not real: you can pass an additional --bind-address flag and only the last one will be used.
i.e. your patch can insert at the end of the container args.

the probe host is a key (not positional), so there is no problem there.

the kube-scheduler.json and kube-controller-manager.json patches look like this.

[
	{ "op": "add", "path": "/spec/containers/0/command/-", "value": "--bind-address=SOME_IP" },
	{ "op": "replace", "path": "/spec/containers/0/livenessProbe/httpGet/host", "value": "SOME_IP" },
	{ "op": "replace", "path": "/spec/containers/0/startupProbe/httpGet/host", "value": "SOME_IP" }
]
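
For reference, the patch files are placed in a directory, named after the target component, and that directory is passed at init time, roughly (paths are placeholders):

	kubeadm init --experimental-patches /path/to/patches ...

with /path/to/patches containing kube-controller-manager.json and kube-scheduler.json as above.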

EDIT: i've commented on the prometheus-operator ticket.
prometheus-operator/kube-prometheus#718
let's see if more users want this change.

EDIT: added startup probe.

@neolit123 neolit123 added kind/feature Categorizes issue or PR as related to a new feature. priority/awaiting-more-evidence Lowest priority. Possibly useful, but not yet enough support to actually get it done. labels Feb 8, 2021
@neolit123 neolit123 added this to the v1.21 milestone Feb 8, 2021
@neolit123 neolit123 added the kind/design Categorizes issue or PR as related to design. label Feb 8, 2021

ksa-real commented Feb 8, 2021

Thanks. Confirmed the suggested kubeadm patches work. My slightly modified version:

# kubeadm-patches/kube-controller-manager+json.yaml
- op: add
  path: /spec/containers/0/command/-
  value: --bind-address=SOME_IP
- op: replace
  path: /spec/containers/0/livenessProbe/httpGet/host
  value: SOME_IP
- op: replace
  path: /spec/containers/0/startupProbe/httpGet/host
  value: SOME_IP
  
# kubeadm-patches/kube-scheduler+json.yaml
- op: add
  path: /spec/containers/0/command/-
  value: --bind-address=SOME_IP
- op: replace
  path: /spec/containers/0/livenessProbe/httpGet/host
  value: SOME_IP
- op: replace
  path: /spec/containers/0/startupProbe/httpGet/host
  value: SOME_IP

My small concern about patches is that they are applied to the manifests: if the manifest structure is not guaranteed to be stable, the patches may become incorrect or inapplicable. However, in this particular case breakage is unlikely, so patches are good enough for me.


neolit123 commented Feb 9, 2021

i brought this topic up for discussion in the SIG Cluster Lifecycle meeting today and we agreed that this change is not something we see as necessary, since the user base consuming metrics for these components is not big and it is a topology change.

some discussed alternatives:

  • use a proxy in front of the loopback-only endpoints (apparently kops already has something like that; a rough sketch follows below)
  • use a daemonset that scrapes on localhost
  • use the kubeadm patches (proposed above).

https://docs.google.com/document/d/1Gmc7LyCIL_148a9Tft7pdhdee0NBHdOfHS1SAF0duI4/edit#
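
A rough sketch of the proxy alternative (not something spelled out in the meeting notes): a hostNetwork daemonset on the control-plane nodes that forwards the node IP to the loopback-only KCM port, so an external Prometheus can reach it. The image, names and ports are assumptions, and TLS/authentication of the target port is not covered:

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: kcm-metrics-proxy            # hypothetical name
  namespace: kube-system
spec:
  selector:
    matchLabels:
      app: kcm-metrics-proxy
  template:
    metadata:
      labels:
        app: kcm-metrics-proxy
    spec:
      hostNetwork: true
      nodeSelector:
        node-role.kubernetes.io/master: ""     # control-plane nodes only
      tolerations:
        - key: node-role.kubernetes.io/master
          effect: NoSchedule
      containers:
        - name: socat
          image: alpine/socat              # assumed image that ships socat
          # listen on the node IP, forward to the loopback-only secure port
          args:
            - TCP-LISTEN:10257,bind=$(NODE_IP),fork,reuseaddr
            - TCP:127.0.0.1:10257
          env:
            - name: NODE_IP
              valueFrom:
                fieldRef:
                  fieldPath: status.hostIP

The same idea applies to the scheduler port; the kubeadm patches above remain the simplest option if changing the bind address is acceptable.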

@saphoooo

How can I figure out which instance of kube-scheduler or kube-controller-manager is the leader without this endpoint?
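
For what it's worth (not an authoritative answer): with the default leader-election settings the current holder is recorded in a Lease object in kube-system, so it can be checked without any metrics endpoint, e.g.

	kubectl -n kube-system get lease kube-controller-manager -o jsonpath='{.spec.holderIdentity}'
	kubectl -n kube-system get lease kube-scheduler -o jsonpath='{.spec.holderIdentity}'

On older clusters the same information may instead live in the control-plane.alpha.kubernetes.io/leader annotation on the corresponding Endpoints objects in kube-system.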


kingdonb commented Jan 17, 2022

kubeadm users that care about metrics are not a majority.

I'm not saying I disagree with your conclusions, but this seems like an opinionated take with bad results for UX.

I was just able to configure the kube-prometheus-stack operator Helm chart with Alertmanager on my cluster successfully for the first time, and I spent about two hours tracking down "why do these four important Kubernetes services show a TargetDown status?". While I was able to figure out how to make all four of these services (etcd, kube-controller-manager, kube-scheduler, and kube-proxy) listen on addresses that Prometheus could scrape, I'm not at all clear on how to do this at kubeadm init time, or whether there is a stable way to impose this configuration. Even with that, I had to make four separate manual configuration updates to /etc/kubernetes/manifests, changes that I'm reasonably confident about but that I'm pretty sure would be blown away the next time I run kubeadm reset and kubeadm init.

It would be great if there were a way to centrally configure kubeadm's child components and opt into this. Even if we are not in the majority, I think there are enough people out there using Prometheus and Alertmanager who would care about monitoring these kubeadm components. (I am not a kube-prometheus maintainer, but from my perspective as a user figuring things out for the first time, they clearly cared enough about it to put these targets in the default configuration!)

Admittedly I have not read all the kubeadm docs, and it might be that a page I'm unaware of addresses this use case specifically. Still, I think Prometheus integration should get special treatment: it should be straightforward to configure all of these components at once for metrics, or at least there should be one straightforward way to configure all four that is the same for each of them.

The least I can say is that, based on the steps I describe taking here (which were the minimum needed to turn my running Kubernetes cluster into one with a functioning Alertmanager without silences, starting from the default configuration of kube-prometheus-stack), monitoring a kubeadm-derived Kubernetes installation is not straightforward right now:

Maybe this was the document I needed to find, and maybe there are fixes that could be added to it that would address all of my concerns (I only just found this doc for the first time after hypothesizing that it might exist):

It looks like there are still some gaps; I think a bit more is needed to make this run smoothly. Comparing this with my comment on the Helm chart, I can see there are a few things I needed that aren't mentioned in this doc. Maybe we can take care of it all here. The parts that are missing are for kube-proxy and etcd; I'm not confident that I've configured those correctly, and it felt like I was on my own until I found this doc.
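
For what it's worth, my reading is that the missing pieces can also be expressed in the kubeadm config instead of hand-editing /etc/kubernetes/manifests; a hedged sketch, where the node IP (10.0.0.5), the listen addresses, and the apiVersion are assumptions on my part:

apiVersion: kubeadm.k8s.io/v1beta2
kind: ClusterConfiguration
etcd:
  local:
    extraArgs:
      listen-metrics-urls: http://10.0.0.5:2381   # etcd metrics listener
controllerManager:
  extraArgs:
    bind-address: 10.0.0.5
scheduler:
  extraArgs:
    bind-address: 10.0.0.5
---
apiVersion: kubeproxy.config.k8s.io/v1alpha1
kind: KubeProxyConfiguration
metricsBindAddress: 0.0.0.0:10249                 # default is 127.0.0.1:10249

I haven't verified every field against the docs linked here, so treat this as a starting point rather than a recipe.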

The more I go through these docs, the more I find things that are just a bit too outdated (like references to CoreOS, which seem to date the documents and explain why there might be a difference between my experience and the documented state of the art here).

The more I look at it, the more I think my issue needs to go to the kube-prometheus community. Anyway, the pretty low traffic on this issue indicates this probably wasn't a very popular option, but I hope that if we make the UX good, it will be more popular in the future. It seems like besides the scheduler and controller-manager, etcd and kube-proxy are also ripe for configuration, and I'm afraid they might not all be configurable in a straightforward way through options in the kubeadm cluster configuration (or else someone would have documented this already, since it would be easy).
