Provide metrics for the network bandwidth usage between the api server and etcd #15

Open
istvanballok opened this issue Jun 14, 2022 · 1 comment
Labels
kind/enhancement Enhancement, improvement, extension lifecycle/rotten Nobody worked on this for 12 months (final aging stage)

Comments

@istvanballok
Member

What would you like to be added

Provide metrics for end users so that they can check which api server requests contribute to the network bandwidth usage between the api server and etcd.

Why is this needed

The api server filters list requests with label selectors in memory, after the objects have been fetched from etcd. As a result, client requests that seem reasonable and have a small response size can still incur high bandwidth usage in the "backend", that is, between the api server and etcd. (See gardener/gardener#5374)
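To make the effect concrete, here is a minimal toy simulation, not api server code. It assumes a fake in-process "backend" (a plain list) in which every tenth object carries the label `app=web`. The helper names `make_objects` and `list_with_selector` are made up for this illustration.

```python
import json


def make_objects(count, payload_bytes):
    """Build fake stored objects; every tenth one carries the label app=web."""
    objects = []
    for i in range(count):
        labels = {"app": "web"} if i % 10 == 0 else {"app": "other"}
        objects.append(
            {"name": f"obj-{i}", "labels": labels, "data": "x" * payload_bytes}
        )
    return objects


def list_with_selector(stored, selector):
    """Fetch every object from the fake backend, then filter in memory.

    Returns (bytes fetched from the backend, bytes returned to the client).
    """
    fetched_bytes = 0
    returned_bytes = 0
    for obj in stored:
        raw = json.dumps(obj).encode("utf-8")  # bytes crossing the backend link
        fetched_bytes += len(raw)
        # The selector is only evaluated after the object has been transferred.
        if all(obj["labels"].get(k) == v for k, v in selector.items()):
            returned_bytes += len(raw)
    return fetched_bytes, returned_bytes


stored = make_objects(count=1000, payload_bytes=1024)
fetched, returned = list_with_selector(stored, {"app": "web"})
print(f"fetched from backend: {fetched} bytes, returned to client: {returned} bytes")
```

In this toy setup the client sees only the matching tenth of the objects, while the backend link carries all of them.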

We have seen that when the network link between the api server and etcd is saturated, multiple components start to fail.

The goal of this issue is to provide metrics for shoot owners so that they can identify the clients that contribute to the excessive network usage and can optimize their requests accordingly.

@istvanballok istvanballok added the kind/enhancement Enhancement, improvement, extension label Jun 14, 2022
@istvanballok
Member Author

Initially we explored parsing the etcd debug logs and adding up the response sizes of the range requests, broken down by resource type. We used a heuristic to guess the resource type from the etcd object key string.
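A rough sketch of that kind of aggregation, for illustration only. It assumes the log lines have already been parsed into `(key, response_size)` pairs, since the etcd debug log format is not reproduced here. It also assumes keys follow the `/registry/<resource>/...` or `/registry/<group>/<resource>/...` layout. The functions `guess_resource` and `sum_by_resource` are hypothetical names, and this heuristic is not necessarily the one we actually used.

```python
from collections import defaultdict


def guess_resource(key, prefix="/registry/"):
    """Guess the resource type from an etcd key (heuristic, assumed key layout)."""
    if not key.startswith(prefix):
        return "unknown"
    parts = key[len(prefix):].split("/")
    if not parts[0]:
        return "unknown"
    # A first segment containing a dot looks like an API group name;
    # in that case the resource is assumed to be the next segment.
    if "." in parts[0] and len(parts) > 1 and parts[1]:
        return f"{parts[1]}.{parts[0]}"
    return parts[0]


def sum_by_resource(samples):
    """Add up response sizes (in bytes) per guessed resource type."""
    totals = defaultdict(int)
    for key, size in samples:
        totals[guess_resource(key)] += size
    return dict(totals)


# Made-up sample values, purely to show the shape of the input and output.
samples = [
    ("/registry/pods/", 5000),
    ("/registry/pods/default/", 1200),
    ("/registry/secrets/kube-system/", 800),
    ("/registry/apiregistration.k8s.io/apiservices/", 300),
    ("/other/key", 10),
]
for resource, total in sorted(sum_by_resource(samples).items()):
    print(resource, total)
```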

This approach requires a custom etcd image. In the meantime, new metrics have been added to the api server in Kubernetes 1.23.

https://github.com/kubernetes/kubernetes/blob/f173d01c011c3574dea73a6fa3e20b0ab94531bb/CHANGELOG/CHANGELOG-1.23.md#feature-6
The kube-apiserver's Prometheus metrics have been extended with some that describe the costs of handling LIST requests. They are as follows.
...
kubernetes/kubernetes#104983

We shall explore those metrics and possibly extend them with a new metric that reports the total size of the fetched objects rather than their number. That should help in reasoning about the network bandwidth usage between the api server and etcd.

We could get the response size from the size of the data variable, around here: https://github.com/kubernetes/kubernetes/blob/5b489e2846a7fb959252dc5a04fe21ec844e9fad/staging/src/k8s.io/apiserver/pkg/storage/etcd3/store.go#L773-L778.
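The sketch below illustrates why a size-based metric would add information beyond the object count. It is a plain Python stand-in, not the api server's Go code and not a real Prometheus client. The class `ListFetchStats` and its fields are hypothetical, and the byte strings are made-up values standing in for the serialized objects read from etcd.

```python
from collections import Counter


class ListFetchStats:
    """Hypothetical per-resource counters; not a real api server type."""

    def __init__(self):
        self.fetched_objects = Counter()  # number of objects fetched
        self.fetched_bytes = Counter()    # total serialized size of those objects

    def observe(self, resource, values):
        """Record one backend response; `values` are the raw serialized objects."""
        self.fetched_objects[resource] += len(values)
        self.fetched_bytes[resource] += sum(len(v) for v in values)


stats = ListFetchStats()
# Two large objects versus thirty small ones (made-up sizes).
stats.observe("secrets", [b"a" * 2_000_000, b"b" * 1_000_000])
stats.observe("configmaps", [b"c" * 100] * 30)

for resource in sorted(stats.fetched_objects):
    print(resource, stats.fetched_objects[resource], stats.fetched_bytes[resource])
```

By object count, `configmaps` looks like the heavier resource here, but by bytes `secrets` dominates. The byte total is the quantity that matters for the link to etcd.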

@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Dec 22, 2022
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Sep 1, 2023