KEP-4112: Pass down resources to CRI
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The CRI runtime lacks visibility into the resource requirements of applications.
First, the resources required by the containers of a pod are not visible at pod sandbox creation time. This can be problematic, for example, for VM-based runtimes where all resources need to be reserved/prepared when the VM (i.e. the sandbox) is created.
Second, the kubelet does not provide the CRI runtime with complete information about the container resource specification, i.e. the requests and limits of native and extended resources. However, various use cases have been identified where detailed knowledge of all the resources can be utilized in container runtimes for more optimal resource allocation, improving application performance and reducing cross-application interference.
This KEP proposes CRI API extensions for providing a complete view of the pod's resources at sandbox creation, and for providing unobfuscated information about resource requests and limits to container runtimes.
When the pod sandbox is created, the kubelet does not provide the CRI runtime with any information about the resources (such as native resources, host devices, mounts, CDI devices, etc.) that will be required by the application. The CRI runtime only becomes aware of the resources piece by piece, as the containers of the pod are created one by one.
This can cause issues with VM-based runtimes (e.g. Kata Containers and Confidential Containers) that need to prepare the VM before containers are created.
For Kata to handle PCIe devices properly, the CRI needs to tell the kata-runtime at sandbox creation how many PCIe root ports or PCIe switch ports the hypervisor must create, depending on the number of devices allocated by the containers. The PCIe root port configuration is static and the hypervisor cannot adjust it once the sandbox is created. During container creation, the PCIe devices are hot-plugged into the PCIe root ports or switch ports. If the number of pre-allocated pluggable ports is too low (container devices > pre-allocated hot-pluggable ports), the attachment will fail.
In the case of Confidential Containers (which uses Kata under the hood, with additional software components for attestation), the CRI needs to consider the cold-plug (a.k.a. direct attachment) use case. At sandbox creation time the hypervisor needs to know the exact number of pass-through devices and their properties: the VFIO IOMMU group, the actual VFIO device (there can be several devices in an IOMMU group), and whether to attach to a PCIe root port or a PCIe switch port (PCI bridge). In a confidential setting, a user does not want to reconfigure the VM on every CreateContainer request, as that creates an attack vector. The hypervisor needs a fully static view of the resources needed for VM sizing.
Independent of hot- or cold-plug, the hypervisor needs to know at sandbox creation time what the PCI(e) topology needs to look like.
Updating the resources of a container also means resizing the VM, hence the hypervisor needs the complete list of resources to be available at each UpdateContainerResources request.
Another visibility issue is related to native and extended resources. Kubelet manages the native resources (CPU and memory) and communicates resource parameters over the CRI API to the runtime. The following snippet shows the currently supported CRI annotations that are provided by the kubelet to e.g. containerd:
pkg/cri/annotations/annotations.go:

```go
// SandboxCPU annotations are based on the initial CPU configuration for the sandbox. This is calculated as the
// sum of container CPU resources, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig
SandboxCPUPeriod = "io.kubernetes.cri.sandbox-cpu-period"
SandboxCPUQuota  = "io.kubernetes.cri.sandbox-cpu-quota"
SandboxCPUShares = "io.kubernetes.cri.sandbox-cpu-shares"

// SandboxMem is the initial amount of memory associated with this sandbox. This is calculated as the sum
// of container memory, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig.
SandboxMem = "io.kubernetes.cri.sandbox-memory"
```
However, the original details of the resource spec are lost as they are translated (within kubelet) to platform-specific (i.e. Linux or Windows) resource controller parameters such as CPU shares and memory limits. Non-native resources, such as extended resources and device plugin resources, are completely invisible to the CRI runtime. Yet OCI hooks, runc wrappers, NRI plugins, and in some cases even applications themselves would benefit from seeing the original resource requests and limits, e.g. for doing customized resource optimization.
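As a concrete illustration of this information loss, the following minimal Go sketch mirrors the logic of kubelet's MilliCPUToShares helper (constants re-declared locally here for illustration): the CPU request is flattened into cgroup shares, small requests collapse to the kernel minimum, and extended resources are dropped entirely.

```go
// Minimal sketch of the lossy translation kubelet performs today,
// mirroring the logic of its MilliCPUToShares helper.
package main

import "fmt"

const (
	minShares     = 2    // kernel minimum for cpu.shares
	sharesPerCPU  = 1024 // shares representing one full CPU
	milliCPUToCPU = 1000 // milliCPU units per CPU
)

func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		// No request: fall back to the kernel minimum.
		return minShares
	}
	shares := milliCPU * sharesPerCPU / milliCPUToCPU
	if shares < minShares {
		shares = minShares
	}
	return shares
}

func main() {
	// "cpu: 1" (1000m) becomes 1024 shares; requests of 1m and 2m both
	// end up as the kernel-minimum 2 shares, and an extended resource
	// such as example.com/resource has no cgroup representation at all.
	fmt.Println(milliCPUToShares(1000))                   // 1024
	fmt.Println(milliCPUToShares(1), milliCPUToShares(2)) // 2 2
}
```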
Extending the CRI API to communicate all resources already at sandbox creation, and to pass down the resource requests and limits of native and extended resources, would provide a comprehensive and early-enough view of the resource usage of all containers in the pod, allowing improved resource allocation without breaking any existing use cases.
Goals:
- make the information about all required resources (e.g. native and extended resources, devices, mounts, CDI devices) of a pod available to the CRI runtime at sandbox creation time
- make the container resource spec transparently visible to the CRI runtime
Non-goals:
- change kubelet resource management
- change existing behavior of the CRI
As a VM-based container runtime developer, I want to allocate/expose enough RAM, hugepages, hot- or cold-pluggable PCI(e) ports, protected memory sections and other resources for the VM to ensure that all containers in the pod are guaranteed to get the resources they require.
As a developer of a non-runc / non-Linux CRI runtime, I want to know the detailed container resource requests so that I can make correct resource allocations for applications. I cannot rely on cgroup parameters for this; I need to know what the user requested in order to fairly allocate resources between applications.
As a cluster administrator, I want to install an NRI plugin that does customized resource handling. I run kubelet with the CPU manager and memory manager disabled, and instead use my NRI plugin to do customized resource allocation (e.g. CPU and memory pinning). To do that properly, I need the actual resource requests and limits specified by the user.
The proposal only adds new informational data to the CRI API between kubelet and the container runtime, with no user-visible changes, which considerably mitigates possible risks.
Data duplication/inconsistency with native resources could be considered a risk, as those are passed down to the CRI both as "raw" requests and limits and as "translated" resource control parameters (like CPU shares and OOM scoring). But this should be largely mitigated by code reviews and unit tests.
The proposal is that kubelet discloses the full resource information from the PodSpec to the container runtime. This is accomplished by extending the ContainerConfig, UpdateContainerResourcesRequest and PodSandboxConfig messages of the CRI API.
With this information, the runtime can for example do detailed resource allocation so that CPU, memory and other resources for each container are optimally aligned.
The resource information is included in PodSandboxConfig so that the runtime can see the full picture of the pod's resource usage at pod creation time, for example enabling more holistic resource allocation and thus better interoperability between containers inside the pod.
The PodSandboxConfig message (part of the RunPodSandbox request) will be extended to contain information about the resources of all its containers known at pod creation time. The container runtime may use this information to make preparations for all upcoming containers of the pod, e.g. set up all needed resources for a VM-based pod or prepare for optimal allocation of resources across all the containers of the pod. However, the container runtime may also continue to operate as it did before this enhancement; that is, it can safely ignore the per-container resource information and allocate resources for each container separately, one at a time, with the CreateContainer request.
```diff
 message PodSandboxConfig {
     ...
     // Optional configurations specific to Linux hosts.
     LinuxPodSandboxConfig linux = 8;
     // Optional configurations specific to Windows hosts.
     WindowsPodSandboxConfig windows = 9;
+
+    // Kubernetes resource spec of the containers in the pod.
+    PodResourceConfig pod_resources = 10;
 }

+// PodResourceConfig contains information of all resource requirements of
+// the containers of a pod.
+message PodResourceConfig {
+    repeated ContainerResourceConfig init_containers = 1;
+    repeated ContainerResourceConfig containers = 2;
+}

+// ContainerResourceConfig contains information of all resource requirements of
+// one container.
+message ContainerResourceConfig {
+    // Name of the container
+    string name = 1;
+
+    // Kubernetes resource spec of the container
+    KubernetesResources kubernetes_resources = 2;
+
+    // Mounts for the container.
+    repeated Mount mounts = 3;
+
+    // Devices for the container.
+    repeated Device devices = 4;
+
+    // CDI devices for the container.
+    repeated CDIDevice CDI_devices = 5;
+}
```
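To illustrate one way a runtime could consume this information, here is a hypothetical sketch of a VM-based runtime pre-computing its sandbox size from the proposed pod-level resource data. The containerResources type is a hand-written stand-in for the generated ContainerResourceConfig binding; only the Quantity arithmetic (from k8s.io/apimachinery/pkg/api/resource) is existing API.

```go
// Hypothetical sketch: sizing a VM-based sandbox from the proposed
// pod_resources field at RunPodSandbox time.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// containerResources is an illustrative stand-in for the generated
// ContainerResourceConfig message.
type containerResources struct {
	Name   string
	Limits map[string]*resource.Quantity
}

// vmSize sums the CPU and memory limits of all containers so the
// hypervisor can be configured once, before any container exists.
func vmSize(containers []containerResources) (milliCPU, memBytes int64) {
	for _, c := range containers {
		if q, ok := c.Limits["cpu"]; ok {
			milliCPU += q.MilliValue()
		}
		if q, ok := c.Limits["memory"]; ok {
			memBytes += q.Value()
		}
	}
	return milliCPU, memBytes
}

func main() {
	cpu := resource.MustParse("2")
	mem := resource.MustParse("2G")
	pod := []containerResources{{
		Name:   "cnt-1",
		Limits: map[string]*resource.Quantity{"cpu": &cpu, "memory": &mem},
	}}
	mc, mb := vmSize(pod)
	fmt.Printf("VM sizing: %dm CPU, %d bytes memory\n", mc, mb) // 2000m, 2000000000
}
```

Because the sum is available already at RunPodSandbox time, the hypervisor can be configured once instead of being resized as each container arrives.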
The ContainerConfig message (used in the CreateContainer request) is extended to contain the unmodified resource requests from the PodSpec.
+import "k8s.io/apimachinery/pkg/api/resource/generated.proto";
message ContainerConfig {
...
// Configuration specific to Windows containers.
WindowsContainerConfig windows = 16;
// CDI devices for the container.
repeated CDIDevice CDI_devices = 17;
+
+ // Kubernetes resource spec of the container
+ KubernetesResources kubernetes_resources = 18;
}
+// KubernetesResources contains the resource requests and limits as specified
+// in the Kubernetes core API ResourceRequirements.
+message KubernetesResources {
+ // Requests and limits from the Kubernetes container config.
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 1;
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 2;
+}
The resources (mounts, devices, CDI devices, Kubernetes resources) in the CreateContainer request should be identical to what was (pre-)informed in the RunPodSandbox request. If they differ, the CRI runtime may fail the container creation, for example because the changes cannot be applied after a VM-based pod has been created.
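The exact policy is runtime-specific; as a sketch, a runtime might compare the resources requested at CreateContainer against the pre-informed set and reject any divergence (plain string maps stand in for the generated CRI types here):

```go
// Illustrative consistency check a CRI runtime might apply at
// CreateContainer time against what RunPodSandbox pre-informed.
package main

import "fmt"

func checkPreInformed(preInformed, requested map[string]string) error {
	for name, want := range preInformed {
		if got, ok := requested[name]; !ok || got != want {
			return fmt.Errorf("resource %q changed after sandbox creation: pre-informed %q, requested %q",
				name, want, got)
		}
	}
	for name := range requested {
		if _, ok := preInformed[name]; !ok {
			return fmt.Errorf("resource %q was not pre-informed at sandbox creation", name)
		}
	}
	return nil
}

func main() {
	sandbox := map[string]string{"cpu": "2", "memory": "2G"}
	create := map[string]string{"cpu": "2", "memory": "4G"} // diverged
	fmt.Println(checkPreInformed(sandbox, create))
}
```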
The UpdateContainerResourcesRequest message is extended to pass down unmodified resource requests from the PodSpec.
```diff
 message UpdateContainerResourcesRequest {
     // ID of the container to update.
     string container_id = 1;
     // Resource configuration specific to Linux containers.
     LinuxContainerResources linux = 2;
     // Resource configuration specific to Windows containers.
     WindowsContainerResources windows = 3;
     // Unstructured key-value map holding arbitrary additional information for
     // container resources updating. This can be used for specifying experimental
     // resources to update or other options to use when updating the container.
     map<string, string> annotations = 4;
+
+    // Kubernetes resource spec of the container
+    KubernetesResources kubernetes_resources = 5;
 }
```
Kubelet code is refactored/modified so that all container resources are known before sandbox creation. This mainly consists of preparing all mounts (of all containers) early.
Kubelet will be extended to pass down all mounts, devices, CDI devices, and the unmodified resource requests and limits to the container runtime in all related CRI requests, i.e. RunPodSandbox, CreateContainer and UpdateContainerResources.
For example, take a PodSpec:
```yaml
apiVersion: v1
kind: Pod
...
spec:
  containers:
  - name: cnt-1
    image: k8s.gcr.io/pause
    resources:
      requests:
        cpu: 1
        memory: 1G
        example.com/resource: 1
      limits:
        cpu: 2
        memory: 2G
        example.com/resource: 1
    volumeMounts:
    - mountPath: /my-volume
      name: my-volume
  volumes:
  - name: my-volume
    emptyDir:
```
Then kubelet will send the following RunPodSandboxRequest when creating the pod (represented here in YAML format):
```yaml
RunPodSandboxRequest:
  config:
    ...
    podResources:
      containers:
      - name: cnt-1
        kubernetes_resources:
          requests:
            cpu: "1"
            memory: 1G
            example.com/resource: "1"
          limits:
            cpu: "2"
            memory: 2G
            example.com/resource: "1"
        CDI_devices:
        - name: example.com/resource=CDI-Dev-1
        mounts:
        - container_path: /my-volume
          host_path: /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/my-volume
        - container_path: /var/run/secrets/kubernetes.io/serviceaccount
          host_path: /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~projected/kube-api-access-4srqm
          readonly: true
        - container_path: /dev/termination-log
          host_path: /var/lib/kubelet/pods/<pod-uid>/containers/cnt-1/<uuid>
```
Note that all device plugin resources are passed down in the `kubernetes_resources` field, but this does not contain any properties of the device that was actually allocated for the container. However, these properties are exposed through the `CDI_devices`, `mounts` and `devices` fields.
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
No prerequisite testing updates have been identified.
- `k8s.io/kubernetes/pkg/kubelet/kuberuntime`: 2024-02-02 - 68.3%
The `fake_runtime` will be used in unit tests to verify that the kubelet correctly passes down the resource information to the CRI runtime.
For alpha, no new integration tests are planned.
For alpha, no new e2e tests are planned.
Alpha:
- Feature implemented behind a feature flag
- Initial unit tests completed and enabled
Beta:
- Gather feedback from developers and surveys
- Feature gate enabled by default
- containerd and CRI-O runtimes have released versions that have adopted the new CRI API changes
- The NRI API has adopted the feature
GA:
- No bugs reported in the previous cycle
- N examples of real-world usage
- N installs
The feature gate (in kubelet) controls the feature enablement. Existing runtime implementations will continue to work as previously, even if the feature is enabled.
The feature is node-local (kubelet-only), so there are no dependencies on, or effects to, other Kubernetes components.
The behavior is unchanged if either kubelet or the CRI runtime running on a node does not support the feature. If kubelet has the feature enabled but the CRI runtime does not support it, the CRI runtime will ignore the new fields in the CRI API and function as previously. Similarly, if the CRI runtime supports the feature but the kubelet does not, the runtime will resort to the previous behavior.
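On the runtime side this tolerance can be as simple as a nil check on the new field; the sketch below uses illustrative stand-in types for the generated CRI bindings.

```go
// Sketch of version-skew tolerance in a CRI runtime: if pod_resources
// is unset (old kubelet, or feature gate off), keep the pre-KEP,
// per-container allocation path.
package main

import "fmt"

// Illustrative stand-ins for the generated CRI types.
type PodResourceConfig struct{ /* init_containers, containers, ... */ }
type PodSandboxConfig struct{ PodResources *PodResourceConfig }

func runPodSandbox(cfg *PodSandboxConfig) {
	if cfg == nil || cfg.PodResources == nil {
		// Field absent: allocate per container at CreateContainer time,
		// exactly as before this enhancement.
		fmt.Println("no pod-level resource view; using legacy path")
		return
	}
	fmt.Println("pre-computing sandbox resources from pod_resources")
}

func main() {
	runPodSandbox(&PodSandboxConfig{})                                   // old kubelet
	runPodSandbox(&PodSandboxConfig{PodResources: &PodResourceConfig{}}) // new kubelet
}
```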
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name:
  - Components depending on the feature gate:
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Events
  - Event Reason:
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
Container annotations could be used as an alternative way to pass down the resource requests and limits to the container runtime.