Author: Max Inden ([email protected])
Date: 23 July 2018
Target release: v1.5.0
-
kube-state-metrics: “Simple service that listens to the Kubernetes API server and generates metrics about the state of the objects”
-
Time series: A single line in a /metrics response e.g. “metric_name{label="value"} 1”
There have been repeated reports of two issues when running kube-state-metrics on production Kubernetes clusters. First, kube-state-metrics takes a long time (10s - 20s) to respond on its /metrics endpoint, leading Prometheus instances to drop the scrape request and mark the affected time series as stale. Second, kube-state-metrics uses a lot of memory and is therefore out-of-memory killed due to low Kubernetes resource limits.
The goal of this proposal can be split into the following sub-goals ordered by their priority:
-
Decrease response time on /metrics endpoint
-
Decrease overall runtime memory usage
Instead of requesting the needed information from the Kubernetes API-Server on demand (on scrape), kube-state-metrics uses the Kubernetes client-go cache tool to keep a full in-memory representation of all Kubernetes objects of a given cluster. Using the cache speeds up the performance-critical path of replying to a scrape request and reduces the load on the Kubernetes API-Server, which only sends deltas whenever they occur. Kube-state-metrics does not make use of all properties and sub-objects of the Kubernetes objects that it stores in its cache.
On a scrape request, e.g. by Prometheus, on the /metrics endpoint, kube-state-metrics calculates the configured time series on demand from the objects in its cache and converts them to the Prometheus string representation.
Instead of keeping a full representation of all Kubernetes objects with all their properties in memory via the Kubernetes client-go cache, use a map, keyed by the Kubernetes object UID, that contains all time series of that object as a single multi-line string:
var cache = map[types.UID][]byte{} // types.UID is the Kubernetes object UID type from k8s.io/apimachinery/pkg/types
Kube-state-metrics listens for add, update and delete events via Kubernetes client-go reflectors. On add and update events, kube-state-metrics generates all time series related to the Kubernetes object based on the event’s payload, concatenates the time series into a single byte slice, and sets/replaces that byte slice in the store under the UID of the Kubernetes object. The length of a time series byte slice can be precomputed before allocation as the sum of the lengths of the metric name, the label keys and values, and the metric value in string representation. On delete events, kube-state-metrics deletes the UID entry of the given Kubernetes object from the cache map.
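A minimal sketch of this write path, assuming a hypothetical metricsStore type with a generateMetrics helper; the method signatures mirror client-go's cache.Store, but all names here are illustrative rather than the actual kube-state-metrics code:

package metricsstore

import (
	"sync"

	"k8s.io/apimachinery/pkg/api/meta"
	"k8s.io/apimachinery/pkg/types"
)

// metricsStore maps a Kubernetes object UID to all time series of that
// object, pre-rendered in the Prometheus text format as one byte slice.
type metricsStore struct {
	mutex   sync.RWMutex
	metrics map[types.UID][]byte

	// generateMetrics renders all configured time series for a single
	// object; its real shape in kube-state-metrics may differ.
	generateMetrics func(obj interface{}) []byte
}

// Add renders the object's time series and stores the resulting blob
// under the object's UID.
func (s *metricsStore) Add(obj interface{}) error {
	o, err := meta.Accessor(obj)
	if err != nil {
		return err
	}
	blob := s.generateMetrics(obj)

	s.mutex.Lock()
	defer s.mutex.Unlock()
	s.metrics[o.GetUID()] = blob
	return nil
}

// Update behaves exactly like Add: the new blob replaces the old one.
func (s *metricsStore) Update(obj interface{}) error {
	return s.Add(obj)
}

// Delete removes the entry of the given object from the cache map.
func (s *metricsStore) Delete(obj interface{}) error {
	o, err := meta.Accessor(obj)
	if err != nil {
		return err
	}

	s.mutex.Lock()
	defer s.mutex.Unlock()
	delete(s.metrics, o.GetUID())
	return nil
}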
On a scrape request on the /metrics endpoint, kube-state-metrics iterates over the cache map and concatenates all time series blobs into a single string, which is returned as the response body.
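The read path could then look roughly as follows, reusing the metricsStore sketched above; WriteAll and metricsHandler are assumed names, not the actual kube-state-metrics API (the handler additionally requires importing net/http):

// WriteAll concatenates the pre-rendered blobs of all cached objects
// into a single byte slice that becomes the /metrics response body.
func (s *metricsStore) WriteAll() []byte {
	s.mutex.RLock()
	defer s.mutex.RUnlock()

	var out []byte
	for _, blob := range s.metrics {
		out = append(out, blob...)
	}
	return out
}

// metricsHandler serves the concatenated blobs on /metrics.
func metricsHandler(store *metricsStore) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		w.Header().Set("Content-Type", "text/plain; charset=utf-8")
		w.Write(store.WriteAll())
	}
}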
+---------------+ +-----------+ +---------------+ +-------------------+
| pod_reflector | | pod_store | | pod_collector | | metrics_endpoint |
+---------------+ +-----------+ +---------------+ +-------------------+
-------------\ | | | |
| new pod p1 |-| | | |
|------------| | | | |
| | | |
| Add(p1) | | |
|-------------->| | |
| | ----------------------\ | |
| |-| generateMetrics(p1) | | |
| | |---------------------| | |
| | | |
| nil | | |
|<--------------| | |
| | | | ---------------\
| | | |-| GET /metrics |
| | | | |--------------|
| | | |
| | | Collect() |
| | |<--------------------------|
| | | |
| | GetAll() | |
| |<------------------------------| |
| | | |
| | []string{metrics} | |
| |------------------------------>| |
| | | |
| | | concat(metrics) |
| | |-------------------------->|
| | | |
Code to reproduce diagram
Build via text-diagram
object pod_reflector pod_store pod_collector metrics_endpoint
note left of pod_reflector: new pod p1
pod_reflector -> pod_store: Add(p1)
note right of pod_store: generateMetrics(p1)
pod_store -> pod_reflector: nil
note right of metrics_endpoint: GET /metrics
metrics_endpoint -> pod_collector: Collect()
pod_collector -> pod_store: GetAll()
pod_store -> pod_collector: []string{metrics}
pod_collector -> metrics_endpoint: concat(metrics)
-
If kube-state-metrics only listens for add, update and delete events, how does it become aware of Kubernetes objects that already existed before it was started? Using Kubernetes client-go, reflectors perform an initial list of all existing objects before watching for add, update and delete events. To ensure no events are missed in the long run, periodic resyncs can be triggered via Kubernetes client-go. This extra confidence is not a must and should be weighed against its cost, as Kubernetes client-go already gives decent guarantees on event delivery.
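As an illustration, wiring the store to a client-go reflector could look like the sketch below; the metricsStore above would additionally have to implement the remaining methods of client-go's cache.Store interface, and the 5 minute resync period is an arbitrary example value:

package collectors

import (
	"time"

	v1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/fields"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/cache"
)

// newPodReflector lists all existing pods into the store first and then
// watches for add, update and delete events. A non-zero resync period
// additionally makes the reflector call Resync on the store periodically.
func newPodReflector(client kubernetes.Interface, store cache.Store) *cache.Reflector {
	lw := cache.NewListWatchFromClient(
		client.CoreV1().RESTClient(),
		"pods",
		metav1.NamespaceAll,
		fields.Everything(),
	)
	return cache.NewReflector(lw, &v1.Pod{}, store, 5*time.Minute)
}

// The caller runs the reflector until the stop channel is closed, e.g.:
//	go newPodReflector(client, store).Run(stopCh)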
-
What about metadata (e.g. the # HELP descriptions) in the /metrics output? As a first iteration it would be skipped until we have a better idea of the design.
-
How can the cache map be accessed concurrently? Go's built-in map is not safe for concurrent use. As a first iteration a simple mutex, as in the store sketch above, should be sufficient; Go's sync.Map might be considered later on.
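For illustration, a sync.Map based variant could be sketched as follows; set, del and getAll are hypothetical helper names:

import (
	"sync"

	"k8s.io/apimachinery/pkg/types"
)

// cache maps a Kubernetes object UID to its rendered time series;
// sync.Map is safe for concurrent use without additional locking.
var cache sync.Map // effectively map[types.UID][]byte

func set(uid types.UID, blob []byte) { cache.Store(uid, blob) }

func del(uid types.UID) { cache.Delete(uid) }

// getAll snapshots all blobs for a single scrape.
func getAll() [][]byte {
	var blobs [][]byte
	cache.Range(func(_, value interface{}) bool {
		blobs = append(blobs, value.([]byte))
		return true
	})
	return blobs
}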
-
To handle out-of-order events sent by the Kubernetes API-Server to kube-state-metrics, the cache map can keep the Kubernetes resource version next to each blob of time series. On add and update events, first compare the resource version of the event with the resource version in the cache, and only move forward if the former is higher than the latter.
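A sketch of that comparison; the Kubernetes API exposes resourceVersion as an opaque string, so parsing it as an unsigned integer, as implied above, is an assumption of this sketch:

import "strconv"

// newerThanCached reports whether an event's resourceVersion is newer
// than the resourceVersion the cached blob was rendered from.
func newerThanCached(eventRV, cachedRV string) (bool, error) {
	event, err := strconv.ParseUint(eventRV, 10, 64)
	if err != nil {
		return false, err
	}
	cached, err := strconv.ParseUint(cachedRV, 10, 64)
	if err != nil {
		return false, err
	}
	return event > cached, nil
}

// Add and Update would only regenerate and replace a blob when
// newerThanCached returns true.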
-
In case the memory consumption of the time series string blobs becomes a problem, the following optimization can be considered: among the time series strings, some sub-strings, such as the metric name, are heavily duplicated. Instead of saving unstructured strings inside the cache map, one can structure them and use pointers to deduplicate e.g. metric names.
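A rough sketch of such a structured, deduplicated representation; the field layout is purely illustrative:

// metricFamily stores the metric name exactly once.
type metricFamily struct {
	name string
}

// timeSeries references the shared family instead of embedding the
// metric name in its own string, removing the heaviest duplication.
type timeSeries struct {
	family      *metricFamily
	labelValues []string
	value       float64
}

// On scrape, the Prometheus text representation would be rendered from
// these structs instead of being stored pre-rendered per object.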
-
...
-
Kube-state-metrics does not make use of all properties of all Kubernetes objects. Instead of unmarshalling unused properties, the corresponding fields and their JSON struct tags, or their Protobuf representation, could be removed from the structs that kube-state-metrics decodes into.
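To illustrate the JSON variant, unmarshalling into a trimmed struct drops unused properties at decode time; the selection of fields in slimPod is purely illustrative:

import "encoding/json"

// slimPod keeps only the fields that are turned into time series;
// all other properties of the pod are dropped while unmarshalling.
type slimPod struct {
	Metadata struct {
		UID             string            `json:"uid"`
		Name            string            `json:"name"`
		Namespace       string            `json:"namespace"`
		ResourceVersion string            `json:"resourceVersion"`
		Labels          map[string]string `json:"labels"`
	} `json:"metadata"`
	Status struct {
		Phase string `json:"phase"`
	} `json:"status"`
}

func decodePod(raw []byte) (*slimPod, error) {
	var p slimPod
	if err := json.Unmarshal(raw, &p); err != nil {
		return nil, err
	}
	return &p, nil
}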