WIP: blog: node: kubelet podresources API GA in 1.28

I'm intentionally covering multiple related enhancements with a single blog post.

Enhancements:
- kubernetes/enhancements#606
- kubernetes/enhancements#2403
- kubernetes/enhancements#3743

Signed-off-by: Francesco Romani <[email protected]>
ffromani · Aug 2, 2023 · 3bd2759
1 parent 1eabaee
Showing 1 changed file with 135 additions and 0 deletions.
diff --git a/content/en/blog/_posts/2023-MM-DD-kubelet-podresources-api-ga.md b/content/en/blog/_posts/2023-MM-DD-kubelet-podresources-api-ga.md
@@ -0,0 +1,135 @@
+---
+layout: blog
+title: 'Kubernetes v1.28: Kubelet podresources API GA'
+date: 2023-MM-DD
+slug: kubelet-podresources-api-GA
+---
+
+**Author:**
+Francesco Romani (Red Hat)
+
+The podresources API is served locally by the kubelet on each node and exposes the compute resources exclusively
+allocated to containers. With Kubernetes v1.28, the API is now generally available (GA).
+
+## What problem does it solve?
+
+The kubelet can allocate exclusive resources to containers, such as
+[CPUs, granting exclusive access to full cores](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager)
+or [memory, either regions or hugepages](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager).
+Workloads that require high performance, low latency, or both rely on these features.
+The kubelet can also assign [devices to containers](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3573-device-plugin).
+Collectively, the features that enable these exclusive assignments are known as "resource managers".
+
+The podresources API was [initially proposed to enable device monitoring](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/606-compute-device-assignment#motivation).
+Monitoring agents need a way to inspect the device assignments, which are performed by the kubelet.
+Providing that introspection was the initial goal of the API. At first the API was extremely simple, implementing a single function, `List`,
+which returns information about the assignment of devices to containers.
+
+Without an API like podresources, the only way to learn about resource assignments was to read the state files that the
+resource managers use. Although this was done out of necessity, the path and the format of these files are
+both internal implementation details. They have been very stable so far, but the project reserves the right to change them at any time.
+Consuming the content of the state files is thus fragile and unsupported, and projects that do so are encouraged to
+move to the podresources API or to other supported APIs.
+
+Since its inception, the podresources API has expanded its scope to cover resource managers other than the device manager.
+Starting with Kubernetes 1.20, the `List` API also reports CPU cores and memory regions (including hugepages); it also
+reports the NUMA locality of devices, while the locality of CPUs and memory can be inferred from the system.
+In Kubernetes 1.21, [the API gained the `GetAllocatableResources` function](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2403-pod-resources-allocatable-resources/README.md).
+This new API complements the existing `List` API and lets monitoring agents determine the unallocated resources,
+which in turn enables new features built on top of the podresources API, such as a
+[NUMA-aware scheduler plugin](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/README.md).
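+
+As a rough illustration of how an agent might combine the two calls, the sketch below subtracts the CPUs reported by `List`
+from those reported by `GetAllocatableResources`. It is a minimal sketch, not the kubelet's or any plugin's implementation:
+the `containerCPUs` type and the `unallocatedCPUs` helper are hypothetical, simplified stand-ins for the API messages,
+and the sample data is made up.
+
+```go
+package main
+
+import (
+	"fmt"
+	"sort"
+)
+
+// containerCPUs is a hypothetical, simplified stand-in for the per-container
+// CPU IDs that a List response would carry.
+type containerCPUs struct {
+	Pod       string
+	Container string
+	CPUIDs    []int64
+}
+
+// unallocatedCPUs returns, in ascending order, the allocatable CPU IDs that
+// are not exclusively assigned to any container.
+func unallocatedCPUs(allocatable []int64, assigned []containerCPUs) []int64 {
+	used := make(map[int64]bool)
+	for _, c := range assigned {
+		for _, id := range c.CPUIDs {
+			used[id] = true
+		}
+	}
+	free := []int64{}
+	for _, id := range allocatable {
+		if !used[id] {
+			free = append(free, id)
+		}
+	}
+	sort.Slice(free, func(i, j int) bool { return free[i] < free[j] })
+	return free
+}
+
+func main() {
+	// Made-up data: what GetAllocatableResources and List might report.
+	allocatable := []int64{0, 1, 2, 3, 4, 5, 6, 7}
+	assigned := []containerCPUs{
+		{Pod: "pod-a", Container: "app", CPUIDs: []int64{2, 3}},
+		{Pod: "pod-b", Container: "worker", CPUIDs: []int64{6}},
+	}
+	fmt.Println(unallocatedCPUs(allocatable, assigned)) // prints [0 1 4 5 7]
+}
+```
+
+The same set difference could be applied to devices or memory; CPUs are used here only because they reduce to plain integer IDs.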
+
+## Consuming the API
+
+The podresources API is served by the kubelet over a Unix domain socket, exposed by default at `/var/lib/kubelet/pod-resources/kubelet.sock`.
+For a containerized monitoring application to consume the API, the socket must be mounted inside the container.
+A good practice is to mount the directory that contains the socket, not the socket itself.
+An example manifest for a hypothetical monitoring agent that consumes the podresources API and is deployed as a DaemonSet could look like this:
+
+```yaml
+apiVersion: apps/v1
+kind: DaemonSet
+metadata:
+  name: podresources-monitoring-app
+  namespace: monitoring
+spec:
+  selector:
+    matchLabels:
+      name: podresources-monitoring
+  template:
+    metadata:
+      labels:
+        name: podresources-monitoring
+    spec:
+      containers:
+      - args:
+        - --podresources-socket=unix:///host-podresources/kubelet.sock
+        command:
+        - /bin/podresources-monitor
+        image: podresources-monitor:latest
+        resources: {}
+        volumeMounts:
+        - mountPath: /host-podresources
+          name: host-podresources
+      serviceAccountName: podresources-monitor
+      volumes:
+      - hostPath:
+          path: /var/lib/kubelet/pod-resources
+          type: Directory
+        name: host-podresources
+```
+
+Consuming the API programmatically is expected to be straightforward. The kubelet API package provides ready-to-use
+client code. A simple client wrapper that sets default values for the timeout and the maximum message size could look like this:
+
+```go
+package podresclient // hypothetical package name, pick one that fits your project
+
+import (
+	"fmt"
+	"log"
+	"time"
+
+	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
+	"k8s.io/kubernetes/pkg/kubelet/apis/podresources"
+)
+
+const (
+	// Values taken from the node e2e tests: https://github.com/kubernetes/kubernetes/blob/82baa26905c94398a0d19e1b1ecf54eb8acb6029/test/e2e_node/util.go#L70
+	defaultPodResourcesTimeout = 10 * time.Second
+	defaultPodResourcesMaxSize = 1024 * 1024 * 16 // 16 MiB
+)
+
+// GetPodResClient connects to the podresources socket and returns a ready-to-use client.
+// The underlying gRPC connection is discarded here for brevity; keep it if you need to close it.
+func GetPodResClient(socketPath string) (podresourcesapi.PodResourcesListerClient, error) {
+	podResourceClient, _, err := podresources.GetV1Client(socketPath, defaultPodResourcesTimeout, defaultPodResourcesMaxSize)
+	if err != nil {
+		return nil, fmt.Errorf("failed to create podresources client: %w", err)
+	}
+	log.Printf("Connected to %q", socketPath)
+	return podResourceClient, nil
+}
+```
+
+When operating a containerized monitoring application that consumes the podresources API, a few points are worth highlighting to prevent "gotcha" moments:
+
+- Even though the API only exposes data and, by design, does not allow clients to mutate the kubelet state, the gRPC request/response model requires
+  read-write access to the podresources API socket. In other words, it is not possible to make the container mount read-only.
+- Multiple clients are allowed to connect to the podresources socket and consume the API, since the API is stateless.
+- The kubelet has [built-in rate limits](https://github.com/kubernetes/kubernetes/pull/116459) to mitigate local denial-of-service attacks from
+  misbehaving or malicious consumers. Consumers of the API must tolerate rate limit errors returned by the server. The rate limit is currently
+  hardcoded and global, so a misbehaving client can consume the whole quota and potentially starve well-behaved clients.
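+
+One way a consumer might tolerate rate limit errors is to retry with an increasing delay. The sketch below shows the idea under
+stated assumptions: `errRateLimited` is a hypothetical sentinel standing in for whatever error the client actually receives
+(how to recognize it, for example from the gRPC status, is left to the caller), and `retryOnRateLimit` is an illustrative
+helper, not part of any Kubernetes package.
+
+```go
+package main
+
+import (
+	"errors"
+	"fmt"
+	"time"
+)
+
+// errRateLimited is a hypothetical sentinel standing in for the rate limit
+// error returned by the server.
+var errRateLimited = errors.New("rate limited")
+
+// retryOnRateLimit calls fn until it succeeds, fails with an error other than
+// errRateLimited, or maxAttempts is reached. The delay between attempts
+// doubles after every rate-limited call.
+func retryOnRateLimit(maxAttempts int, initialDelay time.Duration, fn func() error) error {
+	if maxAttempts < 1 {
+		maxAttempts = 1
+	}
+	delay := initialDelay
+	var err error
+	for attempt := 1; attempt <= maxAttempts; attempt++ {
+		err = fn()
+		if err == nil || !errors.Is(err, errRateLimited) {
+			return err
+		}
+		if attempt < maxAttempts {
+			time.Sleep(delay)
+			delay *= 2
+		}
+	}
+	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
+}
+
+func main() {
+	calls := 0
+	// Simulated API call: rate limited twice, then successful.
+	err := retryOnRateLimit(5, 10*time.Millisecond, func() error {
+		calls++
+		if calls < 3 {
+			return errRateLimited
+		}
+		return nil
+	})
+	fmt.Println(calls, err) // prints 3 <nil>
+}
+```
+
+Because the limit is global, backing off also reduces the pressure a retrying client puts on the other consumers of the socket.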
+
+## Future enhancements
+
+For historical reasons, the podresources API is less precisely specified than other Kubernetes APIs.
+This leads to unspecified behavior in corner cases.
+[An effort is ongoing](https://github.com/kubernetes/kubernetes/issues/119423) to rectify this and to arrive at a more precise specification.
+
+The [Dynamic Resource Allocation (DRA) infrastructure](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation)
+is a major overhaul of resource management.
+[Integration with the podresources API](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3695-pod-resources-for-dra)
+is already in progress.
+
+## Getting involved
+
+This feature is driven by the [SIG Node](https://github.com/Kubernetes/community/blob/master/sig-node/README.md) community.
+Please join us to connect with the community and share your ideas and feedback around the above feature and
+beyond. We look forward to hearing from you!