blog: node: kubelet podresources API GA in 1.28
I'm intentionally covering multiple related enhancements with a single blog post.

Enhancements:
- kubernetes/enhancements#606
- kubernetes/enhancements#2403
- kubernetes/enhancements#3743

Signed-off-by: Francesco Romani <[email protected]>
Showing 1 changed file with 135 additions and 0 deletions.
content/en/blog/_posts/2023-MM-DD-kubelet-podresources-api-ga.md
@@ -0,0 +1,135 @@
---
layout: blog
title: 'Kubernetes v1.28: Kubelet podresources API GA'
date: 2023-MM-DD
slug: kubelet-podresources-api-GA
---

**Author:** Francesco Romani (Red Hat)

The podresources API is served by the kubelet locally on the node and exposes the compute resources exclusively
allocated to containers. With Kubernetes v1.28, the API is now Generally Available.

## What problem does it solve?

The kubelet can allocate exclusive resources to containers, such as
[CPUs, granting exclusive access to full cores](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3570-cpumanager)
or [memory, either regions or hugepages](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/1769-memory-manager).
Workloads that require high performance, low latency, or both rely on these features.
The kubelet can also assign [devices to containers](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3573-device-plugin).
Collectively, the features that enable exclusive assignment are known as "resource managers".

The podresources API was [initially proposed to enable device monitoring](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/606-compute-device-assignment#motivation).
Monitoring agents need a way to inspect the device assignments performed by the kubelet, and
serving this purpose was the initial goal of the API. At first the API was extremely simple, implementing a single function, `List`,
which returns information about the assignment of devices to containers.

Without an API like podresources, the only way to learn about resource assignment was to read the state files that the
resource managers use. Although this was done out of necessity, the path and the format of these files are
both internal implementation details. They have been very stable, but the project reserves the right to change them freely.
Consuming the content of the state files is thus fragile and unsupported, and projects that do so are encouraged to
move to the podresources API or to other supported APIs.

Since its inception, the podresources API has grown in scope to cover resource managers other than the device manager.
Since Kubernetes v1.20, the `List` API also reports CPU cores and memory regions (including hugepages); it also
reports the NUMA locality of the devices, while the locality of CPUs and memory can be inferred from the system.
In Kubernetes v1.21, [the API gained the `GetAllocatableResources` function](https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/2403-pod-resources-allocatable-resources/README.md).
This function complements the existing `List` API and lets monitoring agents determine the unallocated resources,
which enables new features built on top of the podresources API, such as a
[NUMA-aware scheduler plugin](https://github.com/kubernetes-sigs/scheduler-plugins/blob/master/pkg/noderesourcetopology/README.md).
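
To make the relationship between `List` and `GetAllocatableResources` more concrete, here is a minimal sketch of how an agent might derive the unallocated exclusive CPUs of a node: start from the allocatable CPU IDs and remove every CPU ID reported as assigned to a container. The `containerResources` type and the `unallocatedCPUs` helper are hypothetical, simplified stand-ins for the generated API messages, chosen so the example runs with the Go standard library alone; a real agent would fill them from the gRPC responses. The data in `main` is made up.

```go
package main

import (
	"fmt"
	"sort"
)

// containerResources is a hypothetical, simplified stand-in for the
// per-container data returned by List. Only exclusively assigned CPU IDs
// are modeled here.
type containerResources struct {
	Name   string
	CPUIDs []int64
}

// unallocatedCPUs returns the allocatable CPU IDs that are not exclusively
// assigned to any container, in ascending order.
func unallocatedCPUs(allocatable []int64, containers []containerResources) []int64 {
	used := make(map[int64]struct{})
	for _, c := range containers {
		for _, id := range c.CPUIDs {
			used[id] = struct{}{}
		}
	}
	free := make([]int64, 0, len(allocatable))
	for _, id := range allocatable {
		if _, taken := used[id]; !taken {
			free = append(free, id)
		}
	}
	sort.Slice(free, func(i, j int) bool { return free[i] < free[j] })
	return free
}

func main() {
	// Made-up data for illustration: 8 allocatable CPUs, two containers.
	allocatable := []int64{0, 1, 2, 3, 4, 5, 6, 7}
	containers := []containerResources{
		{Name: "app", CPUIDs: []int64{2, 3}},
		{Name: "sidecar", CPUIDs: []int64{6}},
	}
	fmt.Println(unallocatedCPUs(allocatable, containers)) // prints [0 1 4 5 7]
}
```

The same subtraction applies to devices and memory, although those carry more structure than a flat list of IDs.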

## Consuming the API

The podresources API is served by the kubelet over a Unix domain socket, exposed by default at `/var/lib/kubelet/pod-resources/kubelet.sock`.
For a containerized monitoring application to consume the API, the socket must be mounted inside the container.
A good practice is to mount the directory that contains the socket rather than the socket itself, so that the application
can still find the socket if the kubelet recreates it after a restart.
An example manifest for a hypothetical monitoring agent that consumes the podresources API and is deployed as a DaemonSet could look like this:

```yaml
apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: podresources-monitoring-app
  namespace: monitoring
spec:
  selector:
    matchLabels:
      name: podresources-monitoring
  template:
    metadata:
      labels:
        name: podresources-monitoring
    spec:
      containers:
      - name: podresources-monitor  # container names are required; this one is illustrative
        args:
        - --podresources-socket=unix:///host-podresources/kubelet.sock
        command:
        - /bin/podresources-monitor
        image: podresources-monitor:latest
        resources: {}
        volumeMounts:
        - mountPath: /host-podresources
          name: host-podresources
      serviceAccountName: podresources-monitor
      volumes:
      - hostPath:
          path: /var/lib/kubelet/pod-resources
          type: Directory
        name: host-podresources
```
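
Before issuing any gRPC call, an agent may want to verify that the socket is actually mounted and reachable inside the container. The following is a minimal sketch of such a preflight check, using only the Go standard library. The helper names `socketPathFromEndpoint` and `checkSocket` are hypothetical, and the endpoint matches the `--podresources-socket` argument in the manifest above. A successful connection only shows that something is listening on the socket; it does not exercise the podresources API itself.

```go
package main

import (
	"fmt"
	"net"
	"net/url"
	"time"
)

// socketPathFromEndpoint extracts the filesystem path from a unix:// endpoint
// such as "unix:///host-podresources/kubelet.sock".
func socketPathFromEndpoint(endpoint string) (string, error) {
	u, err := url.Parse(endpoint)
	if err != nil {
		return "", fmt.Errorf("invalid endpoint %q: %w", endpoint, err)
	}
	if u.Scheme != "unix" || u.Path == "" {
		return "", fmt.Errorf("endpoint %q is not a unix:// socket path", endpoint)
	}
	return u.Path, nil
}

// checkSocket opens a plain connection to the socket and closes it right away.
func checkSocket(endpoint string, timeout time.Duration) error {
	path, err := socketPathFromEndpoint(endpoint)
	if err != nil {
		return err
	}
	conn, err := net.DialTimeout("unix", path, timeout)
	if err != nil {
		return fmt.Errorf("cannot reach %q: %w", path, err)
	}
	return conn.Close()
}

func main() {
	endpoint := "unix:///host-podresources/kubelet.sock"
	if err := checkSocket(endpoint, 2*time.Second); err != nil {
		fmt.Println("podresources socket not ready:", err)
		return
	}
	fmt.Println("podresources socket is reachable")
}
```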
Consuming the API programmatically is expected to be straightforward. The kubelet API package provides client code ready
to be consumed. A simple client wrapper that sets default values for the timeout and the maximum message size could look like this:

```go
import (
	"fmt"
	"log"
	"time"

	podresourcesapi "k8s.io/kubelet/pkg/apis/podresources/v1"
	"k8s.io/kubernetes/pkg/kubelet/apis/podresources"
)

const (
	// Values taken from the node e2e tests: https://github.com/kubernetes/kubernetes/blob/82baa26905c94398a0d19e1b1ecf54eb8acb6029/test/e2e_node/util.go#L70
	defaultPodResourcesTimeout = 10 * time.Second
	defaultPodResourcesMaxSize = 1024 * 1024 * 16 // 16 MiB
)

// GetPodResClient connects to the podresources socket and returns a v1 client.
func GetPodResClient(socketPath string) (podresourcesapi.PodResourcesListerClient, error) {
	// The second return value is the underlying gRPC connection. It is discarded
	// here for brevity, so this wrapper cannot close it.
	podResourceClient, _, err := podresources.GetV1Client(socketPath, defaultPodResourcesTimeout, defaultPodResourcesMaxSize)
	if err != nil {
		return nil, fmt.Errorf("failed to create podresource client: %w", err)
	}
	log.Printf("Connected to %q!", socketPath)
	return podResourceClient, nil
}
```

When operating a containerized monitoring application that consumes the podresources API, a few points are worth highlighting to prevent "gotcha" moments:

- Even though the API only exposes data and, by design, doesn't allow clients to mutate the kubelet state, the gRPC request/response model requires
  read-write access to the podresources API socket. In other words, it is not possible to limit the container mount to `ReadOnly`.
- Multiple clients are allowed to connect to the podresources socket and consume the API, since it is stateless.
- The kubelet has [built-in rate limits](https://github.com/kubernetes/kubernetes/pull/116459) to mitigate local Denial of Service attacks from
  misbehaving or malicious consumers. Consumers of the API must tolerate rate limit errors returned by the server. The rate limit is currently
  hardcoded and global, so a misbehaving client can consume all the quota and potentially starve correctly behaving clients.
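
One way to tolerate rate limit errors is to retry with an increasing delay. The sketch below is only an illustration of that idea, not part of the kubelet client code: `errRateLimited` and `retryOnRateLimit` are hypothetical names, and the sentinel error stands in for whatever error the gRPC call returns when the request is rejected. A real client would inspect the error returned by the call (for example, its gRPC status) instead of comparing against a local sentinel.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// errRateLimited is a hypothetical sentinel standing in for the rate limit
// error returned by the server.
var errRateLimited = errors.New("rejected by rate limit")

// retryOnRateLimit invokes call until it succeeds, fails with an error other
// than errRateLimited, or maxAttempts (which must be at least 1) is reached.
// The wait between attempts doubles each time, starting from initialDelay.
func retryOnRateLimit(maxAttempts int, initialDelay time.Duration, call func() error) error {
	delay := initialDelay
	var err error
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		err = call()
		if err == nil || !errors.Is(err, errRateLimited) {
			return err
		}
		if attempt < maxAttempts {
			time.Sleep(delay)
			delay *= 2
		}
	}
	return fmt.Errorf("giving up after %d attempts: %w", maxAttempts, err)
}

func main() {
	calls := 0
	// fakeList simulates a List call that is rate limited twice, then succeeds.
	fakeList := func() error {
		calls++
		if calls <= 2 {
			return errRateLimited
		}
		return nil
	}
	err := retryOnRateLimit(5, 10*time.Millisecond, fakeList)
	fmt.Printf("calls=%d err=%v\n", calls, err) // prints calls=3 err=<nil>
}
```

Because the limit is global, backing off also helps other well-behaved clients on the same node; polling at a modest, fixed interval in the first place reduces the chance of hitting the limit at all.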

## Future enhancements

For historical reasons, the podresources API is less precisely specified than other Kubernetes APIs.
This leads to unspecified behavior in corner cases.
[An effort is ongoing](https://github.com/kubernetes/kubernetes/issues/119423) to rectify this and to produce a more precise specification.

The [Dynamic Resource Allocation (DRA) infrastructure](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3063-dynamic-resource-allocation)
is a major overhaul of resource management.
[Integration with the podresources API](https://github.com/kubernetes/enhancements/tree/master/keps/sig-node/3695-pod-resources-for-dra)
is already ongoing.

## Getting involved

This feature is driven by the [SIG Node](https://github.com/Kubernetes/community/blob/master/sig-node/README.md) community.
Please join us to connect with the community and to share your ideas and feedback about this feature and
beyond. We look forward to hearing from you!