Provide metrics for the network bandwidth usage between the api server and etcd #15

istvanballok · 2022-06-14T08:12:48Z

What would you like to be added

Provide metrics for end users so that they can check which api server requests contribute to the network bandwidth usage between the api server and etcd.

Why is this needed

The api server filters list requests with label selectors in memory, so a client request that looks reasonable and has a small response size can still incur a high bandwidth usage in the "backend": between the api server and etcd. (See gardener/gardener#5374)
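To illustrate the effect, here is a minimal simulation, not api server code. It assumes a plain dict as a stand-in for etcd and a hypothetical `list_with_label_selector` helper; the key layout and object shape are made up for the example. Every object under the prefix is fetched from the backend, and the label selector is only applied afterwards.

```python
import json


def list_with_label_selector(store, prefix, selector):
    """Simulate a LIST: fetch every object under prefix, then filter in memory."""
    fetched_bytes = 0
    matched = []
    for key, raw in store.items():
        if not key.startswith(prefix):
            continue
        # These bytes cross the backend link whether or not the object matches.
        fetched_bytes += len(raw)
        obj = json.loads(raw)
        labels = obj.get("labels", {})
        if all(labels.get(k) == v for k, v in selector.items()):
            matched.append(obj)
    # Only the matching objects are sent back to the client.
    returned_bytes = len(json.dumps(matched).encode())
    return matched, fetched_bytes, returned_bytes


# Hypothetical data: 100 pods of roughly 1 kB each, only one labelled app=web.
store = {
    f"/registry/pods/default/pod-{i}": json.dumps(
        {
            "name": f"pod-{i}",
            "labels": {"app": "web" if i == 0 else "batch"},
            "spec": "x" * 1000,
        }
    ).encode()
    for i in range(100)
}

matched, fetched_bytes, returned_bytes = list_with_label_selector(
    store, "/registry/pods/", {"app": "web"}
)
print(len(matched), fetched_bytes, returned_bytes)
```

The client sees a response containing a single object, while the backend transfer covers all 100 objects.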

We have seen that when the network link between the api server and etcd is saturated, multiple components start to fail.

The goal of this issue is to provide metrics for shoot owners so that they can identify the clients that contribute to the excessive network usage and can optimize their requests accordingly.

istvanballok · 2022-06-14T08:35:13Z

Initially we explored parsing the etcd debug logs and summing the response sizes of the range requests, broken down by resource type. We used a heuristic to guess the resource type from the etcd object key string.
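A sketch of what such a heuristic could look like follows. This is not the script we used: it assumes keys start with `/registry/`, that the next path segment is the resource type, and that a dotted segment is an API group followed by the resource. The input is a list of already extracted `(key, response_size)` pairs, since the etcd debug log format is not reproduced here; the `example.com` key is made up.

```python
from collections import Counter


def resource_type_from_key(key, prefix="/registry/"):
    """Guess the resource type from an etcd object key (heuristic, assumed layout)."""
    if not key.startswith(prefix):
        return "unknown"
    parts = key[len(prefix):].split("/")
    if not parts[0]:
        return "unknown"
    # Assumption: a dotted first segment is an API group, followed by the resource.
    if "." in parts[0] and len(parts) > 1 and parts[1]:
        return f"{parts[1]}.{parts[0]}"
    return parts[0]


def bytes_by_resource(range_responses):
    """Sum response sizes per guessed resource type."""
    totals = Counter()
    for key, size in range_responses:
        totals[resource_type_from_key(key)] += size
    return totals


# Hypothetical (key, response size in bytes) pairs taken from range requests.
samples = [
    ("/registry/pods/default/nginx", 4200),
    ("/registry/pods/kube-system/coredns", 5100),
    ("/registry/secrets/default/token", 900),
    ("/registry/example.com/widgets/default/w1", 1500),
]

totals = bytes_by_resource(samples)
for resource, size in totals.most_common():
    print(resource, size)
```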

This approach requires a custom etcd image. In the meantime, new metrics have been added to the api server in Kubernetes 1.23.

https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/CHANGELOG/CHANGELOG-1.23.md#feature-6
The kube-apiserver's Prometheus metrics have been extended with some that describe the costs of handling LIST requests. They are as follows.
...
kubernetes/kubernetes#104983

We shall explore those metrics and possibly extend them with a new metric that reports the total size of the fetched objects rather than their number. That should help in reasoning about the network bandwidth usage between the api server and etcd.

We could get the response size from the size of the data variable, around here: https://github.com/kubernetes/kubernetes/blob/5b489e2846a7fb959252dc5a04fe21ec844e9fad/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L773-L778.
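To show why a size-based metric would add information beyond the object count, here is a small model of the idea in plain Python. It is not the api server implementation (which is Go); `ListCostMetrics` and `observe_list` are hypothetical names, and the payloads are made up. The only point taken from the proposal is that the size of each fetched value is added to a per-resource total.

```python
from collections import defaultdict


class ListCostMetrics:
    """Toy per-resource counters: number of fetched objects and their total size."""

    def __init__(self):
        self.fetched_objects = defaultdict(int)
        self.fetched_bytes = defaultdict(int)

    def observe_list(self, resource, values):
        # values: the raw byte payloads fetched from the backend for one LIST.
        for data in values:
            self.fetched_objects[resource] += 1
            self.fetched_bytes[resource] += len(data)


metrics = ListCostMetrics()
# Hypothetical workload: many small configmaps, a few large secrets.
metrics.observe_list("configmaps", [b"x" * 200] * 50)
metrics.observe_list("secrets", [b"y" * 500_000] * 5)

for resource in metrics.fetched_objects:
    print(resource, metrics.fetched_objects[resource], metrics.fetched_bytes[resource])
```

In this example the object count ranks configmaps as the more expensive resource, while the byte total shows that the secrets dominate the transfer.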

istvanballok added the kind/enhancement Enhancement, improvement, extension label Jun 14, 2022

gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 22, 2022

gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 1, 2023

istvanballok mentioned this issue Jun 7, 2024

Avoid hosting API server instances that use much more memory than they request gardener/gardener#9934

Open

vicwicker mentioned this issue Dec 11, 2024

☂️ Continuous Enhancement of the Monitoring Stack #29

Open
