KEP-4112: Pass down resources to CRI
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as implementable
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The CRI runtime lacks visibility to the application resource requirements.
First, the resources required by the containers of a pod are not visible at the pod sandbox creation time. This can be problematic for example in the case of VM-based runtimes where all resources need to be reserved/prepared when the VM (i.e. sandbox) is being created.
Second, the kubelet does not provide complete information about the container resources specification of native and extended resources (requests and limits) to CRI. However, various use cases have been identified where detailed knowledge of all the resources can be utilized in container runtimes for more optimal resource allocation to improve application performance and reduce cross-application interference.
This KEP proposes CRI API extensions for providing a complete view of the pod's resources at sandbox creation and for providing unobfuscated information about resource requests and limits to container runtimes.
When the pod sandbox is created, the kubelet does not provide the CRI runtime any information about the resources (such as native resources, host devices, mounts, CDI devices etc) that will be required by the application. The CRI runtime only becomes aware of the resources piece by piece when containers of the pod are created (one-by-one).
This can cause issues with VM-based runtimes (e.g. Kata containers and Confidential Containers) that need to prepare the VM before containers are created.
For Kata to handle PCIe devices properly the CRI needs to tell the kata-runtime how many PCIe root-ports or PCIe switch-ports the hypervisor needs to create at sandbox creation depending on the number of devices allocated by the containers. The PCIe root-port is a static configuration and the hypervisor cannot adjust it once the sandbox is created. During container creation the PCIe devices are hot-plugged to the PCIe root-port or switch-port. If the number of pre-allocated pluggable ports is too low, the attachment will fail (container devices > pre-allocated hot-pluggable ports).
In the case of Confidential Containers (which uses Kata under the hood with additional software components for attestation) the CRI runtime needs to consider the cold-plug, aka direct attachment, use-case. At sandbox creation time the hypervisor needs to know the exact number of pass-through devices and their properties (the VFIO IOMMU group, the actual VFIO device - there can be several devices in an IOMMU group - and whether to attach to a PCIe root-port or a PCIe switch-port (PCI-Bridge)). In a confidential setting a user does not want to reconfigure the VM (this creates an attack vector) on every create container request. The hypervisor needs a fully static view of the resources needed for VM sizing.
Independent of hot- or cold-plug, the hypervisor needs to know what the PCI(e) topology needs to look like at sandbox creation time.
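As an illustration, the sizing decision a VM-based runtime has to make at sandbox creation could be sketched as follows; the types, field names and counts below are illustrative stand-ins, not part of the proposed API:

```go
package main

import "fmt"

// containerDevices is a simplified stand-in for the per-container device
// view the runtime would receive at sandbox creation time.
type containerDevices struct {
	name        string
	pcieDevices int // pass-through PCIe devices requested by this container
}

// requiredPorts returns how many hot-pluggable PCIe root/switch ports the
// hypervisor must create up front: the port count is a static VM
// configuration that cannot grow after the sandbox has been started, so it
// must cover the devices of every container in the pod.
func requiredPorts(containers []containerDevices) int {
	total := 0
	for _, c := range containers {
		total += c.pcieDevices
	}
	return total
}

func main() {
	pod := []containerDevices{
		{name: "cnt-1", pcieDevices: 2},
		{name: "cnt-2", pcieDevices: 1},
	}
	fmt.Println(requiredPorts(pod)) // 3: size the VM for all containers
}
```

Without the pod-level resource view, the runtime only learns of each container's devices at CreateContainer time, when the port count can no longer be adjusted.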
Updating the resources of a container also means resizing the VM, hence the hypervisor needs the complete list of resources to be available in an update container request.
Another visibility issue is related to the native and extended resources.
Kubelet manages the native resources (CPU and memory) and communicates resource parameters over the CRI API to the runtime. The following snippet from pkg/cri/annotations/annotations.go shows the currently supported CRI annotations that are provided by the kubelet to e.g. containerd:
// SandboxCPU annotations are based on the initial CPU configuration for the sandbox. This is calculated as the
// sum of container CPU resources, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig
SandboxCPUPeriod = "io.kubernetes.cri.sandbox-cpu-period"
SandboxCPUQuota = "io.kubernetes.cri.sandbox-cpu-quota"
SandboxCPUShares = "io.kubernetes.cri.sandbox-cpu-shares"
// SandboxMemory is the initial amount of memory associated with this sandbox. This is calculated as the sum
// of container memory, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig.
SandboxMem = "io.kubernetes.cri.sandbox-memory"
However, the original details of the resource spec are lost as they get translated (within kubelet) to platform-specific (i.e. Linux or Windows) resource controller parameters like cpu shares, memory limits etc. Non-native resources such as extended resources and device plugin resources are completely invisible to the CRI runtime. However, OCI hooks, runC wrappers, NRI plugins or in some cases even applications themselves would benefit from seeing the original resource requests and limits, e.g. for doing customized resource optimization.
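To illustrate why the translated parameters are lossy, the following sketch mirrors the kind of conversion the kubelet performs for CPU requests and limits before handing them to the runtime (a simplified approximation of the kubelet's cgroup helpers, not the actual implementation):

```go
package main

import "fmt"

const (
	sharesPerCPU  = 1024   // cgroup cpu.shares granted per whole CPU
	milliCPUToCPU = 1000   // millicores per CPU
	quotaPeriod   = 100000 // default CFS period in microseconds
)

// milliCPUToShares approximates the kubelet's translation of a CPU request
// into cgroup cpu.shares. The runtime only ever sees the resulting share
// value, not the original request.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return 2 // kernel-imposed minimum shares
	}
	return milliCPU * sharesPerCPU / milliCPUToCPU
}

// milliCPUToQuota approximates the translation of a CPU limit into a CFS
// quota for the default period.
func milliCPUToQuota(milliCPU int64) int64 {
	return milliCPU * quotaPeriod / milliCPUToCPU
}

func main() {
	fmt.Println(milliCPUToShares(500)) // request cpu: 500m -> 512 shares
	fmt.Println(milliCPUToQuota(2000)) // limit cpu: 2 -> 200000us per 100000us period
}
```

Recovering "cpu: 500m" from "512 shares" requires knowing the conversion and platform; extended resources have no translated representation at all, which is what this proposal addresses.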
Extending the CRI API to communicate all resources already at sandbox creation and pass down resource requests and limits (of native and extended resources) would provide a comprehensive and early-enough view of the resource usage of all containers of the pod, allowing improved resource allocation without breaking any existing use cases.
- make the information about all required resources (e.g. native and extended resources, devices, mounts, CDI devices) of a Pod available to the CRI at sandbox creation time
- make container resource spec transparently visible to CRI (the container runtime)
- change kubelet resource management
- change existing behavior of CRI
- add UpdatePodSandboxResources CRI rpc (this is covered by KEP-1287, PR)
- add pod-level resource requirements (this is covered by KEP-2837, PR)
As a VM-based container runtime developer, I want to allocate/expose enough RAM, hugepages, hot- or cold-pluggable PCI(e) ports, protected memory sections and other resources for the VM to ensure that all containers in the pod are guaranteed to get the resources they require.
As a developer of a non-runc / non-Linux CRI runtime, I want to know the detailed container resource requests to be able to make correct resource allocations for the applications. I cannot rely on cgroup parameters for this but need to know what the user requested in order to fairly allocate resources between applications.
As a cluster administrator, I want to install an NRI plugin that does customized resource handling. I run kubelet with CPU manager and memory manager disabled (CPU manager policy set to none). Instead, I use my NRI plugin to do customized resource allocation (e.g. cpu and memory pinning). To do that properly I need the actual resource requests and limits requested by the user.
The proposal only adds new informational data to the CRI API between the kubelet and the container runtime, with no user-visible changes, which mitigates possible risks considerably.
Data duplication/inconsistency with native resources could be considered a risk, as those are passed down to the CRI both as "raw" requests and limits and as "translated" resource control parameters (like cpu shares, oom scoring, etc.). However, this should be largely mitigated by code reviews and unit tests.
The proposal is that kubelet discloses full resources information from the PodSpec to the container runtime. This is accomplished by extending the ContainerConfig, UpdateContainerResourcesRequest and PodSandboxConfig messages of the CRI API.
With this information, the runtime can for example do detailed resource allocation so that CPU, memory and other resources for each container are optimally aligned. This applies to scenarios where the kubelet CPU manager is disabled (by using the none CPU manager policy).
The resource information is included in PodSandboxConfig so that the runtime can see the full picture of Pod's resource usage at Pod creation time, for example enabling more holistic resource allocation and thus better interoperability between containers inside the Pod.
Also the CreateContainer request is extended to include the unmodified resource requirements. This makes it possible for the CRI runtime to detect any changes in the pod resources that happen between pod creation and container creation, e.g. in scenarios where in-place pod updates are involved.
KEP-1287 Beta (PR) proposes to add new UpdatePodSandboxResources rpc to the CRI API. If/when KEP-1287 is implemented as proposed, the UpdatePodSandboxResources CRI message is updated to include the resource information of all containers (aligning with UpdateContainerResourcesRequest).
KEP-2837 Alpha (PR) proposes to add a new Pod-level resource requirements field to the PodSpec. This information will be added to the PodResourceConfig message, similar to the container resource information, if/when KEP-2837 is implemented as proposed.
The PodSandboxConfig message (part of the RunPodSandbox request) will be
extended to contain information about resources of all its containers known at
the pod creation time. The container runtime may use this information to make
preparations for all upcoming containers of the pod. E.g. setup all needed
resources for a VM-based pod or prepare for optimal allocation of resources of
all the containers of the Pod. However, the container runtime may continue to
operate as it did (before this enhancement). That is, it can ignore
the resource information presented here and allocate resources for each
container separately at container creation time with the CreateContainer
request.
message PodSandboxConfig {
...
// Optional configurations specific to Linux hosts.
LinuxPodSandboxConfig linux = 8;
// Optional configurations specific to Windows hosts.
WindowsPodSandboxConfig windows = 9;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 10;
}
+// PodResourceConfig contains information of all resources requirements of
+// the containers of a pod.
+message PodResourceConfig {
+ repeated ContainerResourceConfig containers = 1;
+}
+// ContainerResourceConfig contains information of all resource requirements of
+// one container.
+message ContainerResourceConfig {
+  // Name of the container
+  string name = 1;
+
+ // Type of the container
+  ContainerType type = 2;
+
+ // Kubernetes resource spec of the container
+ KubernetesResources kubernetes_resources = 3;
+
+ // Mounts for the container.
+ repeated Mount mounts = 4;
+
+ // Devices for the container.
+ repeated Device devices = 5;
+
+ // CDI devices for the container.
+ repeated CDIDevice CDI_devices = 6;
+}
+enum ContainerType {
+ INIT_CONTAINER = 0;
+ SIDECAR_CONTAINER = 1;
+ CONTAINER = 2;
+}
The Pod-level resources enhancement KEP-2837 (alpha PR) proposes to add new Pod-level resource requirements fields to the PodSpec. This information will be added to the PodResourceConfig message, similar to the container resource information.
message PodResourceConfig {
repeated ContainerResourceConfig containers = 1;
+
+ // Kubernetes resource spec of the pod-level resource requirements.
+ KubernetesResources kubernetes_resources = 2;
}
The implementation of adding the KubernetesResources field to the PodResourceConfig is synced with KEP-2837.
The ContainerConfig message (used in CreateContainer request) is extended to contain unmodified resource requests from the PodSpec.
+import "k8s.io/apimachinery/pkg/api/resource/generated.proto";
message ContainerConfig {
...
// Configuration specific to Windows containers.
WindowsContainerConfig windows = 16;
// CDI devices for the container.
repeated CDIDevice CDI_devices = 17;
+
+ // Kubernetes resource spec of the container
+ KubernetesResources kubernetes_resources = 18;
}
+// KubernetesResources contains the resource requests and limits as specified
+// in the Kubernetes core API ResourceRequirements.
+message KubernetesResources {
+ // Requests and limits from the Kubernetes container config.
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 1;
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 2;
+}
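For illustration, a runtime decoding the KubernetesResources message would see plain resource-name keys mapped to quantity values. A rough sketch of that view, with quantities rendered as canonical strings rather than the actual resource.Quantity wire type:

```go
package main

import "fmt"

// Illustrative view of a decoded KubernetesResources message: resource
// names map to quantity values (shown here as canonical strings; on the
// wire these are k8s.io/apimachinery resource.Quantity values).
var requests = map[string]string{
	"cpu":                  "1",
	"memory":               "1G",
	"example.com/resource": "1",
}

var limits = map[string]string{
	"cpu":                  "2",
	"memory":               "2G",
	"example.com/resource": "1",
}

func main() {
	// Extended resources appear under their full name, exactly as in the
	// PodSpec - no translation to cgroup parameters is involved.
	fmt.Println(requests["example.com/resource"], limits["cpu"])
}
```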
Note that mounts, devices, CDI devices are part of the ContainerConfig message but are left out of the diff snippet above.
Including the KubernetesResources in the ContainerConfig message serves multiple purposes:
- Catch changes that happen between pod sandbox creation and container creation. For example, in-place pod updates might change the container before it was created.
- Catch changes that happen over container restarts in in-place pod update scenarios
- Consistency/completeness. Have enough information to take consistent action based only on the information present in this rpc call.
The resources (mounts, devices, CDI devices, Kubernetes resources) in the CreateContainer request should be identical to what was (pre-)informed in the RunPodSandbox request. If they are different, the CRI runtime may fail the container creation, for example because changes cannot be applied after a VM-based Pod has been created.
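A VM-based runtime could enforce this with a simple equality check at CreateContainer time. The following sketch uses simplified local types in place of the actual CRI messages, and the failure policy shown is one possible choice, not mandated by the proposal:

```go
package main

import (
	"fmt"
	"reflect"
)

// resources is a simplified stand-in for the per-container resource view
// (KubernetesResources requests/limits rendered as strings).
type resources struct {
	requests map[string]string
	limits   map[string]string
}

// checkUnchanged sketches the validation a VM-based runtime might do at
// CreateContainer time: the resources must match what was pre-informed in
// RunPodSandbox, because the VM can no longer be resized at this point.
func checkUnchanged(preInformed, atCreate resources) error {
	if !reflect.DeepEqual(preInformed, atCreate) {
		return fmt.Errorf("container resources changed after sandbox creation")
	}
	return nil
}

func main() {
	sandboxView := resources{requests: map[string]string{"cpu": "1"}, limits: map[string]string{"cpu": "2"}}
	createView := resources{requests: map[string]string{"cpu": "1"}, limits: map[string]string{"cpu": "2"}}
	fmt.Println(checkUnchanged(sandboxView, createView)) // <nil>
}
```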
The UpdateContainerResourcesRequest message is extended to pass down unmodified resource requests from the PodSpec.
message UpdateContainerResourcesRequest {
// ID of the container to update.
string container_id = 1;
// Resource configuration specific to Linux containers.
LinuxContainerResources linux = 2;
// Resource configuration specific to Windows containers.
WindowsContainerResources windows = 3;
// Unstructured key-value map holding arbitrary additional information for
// container resources updating. This can be used for specifying experimental
// resources to update or other options to use when updating the container.
map<string, string> annotations = 4;
+
+ // Kubernetes resource spec of the container
+ KubernetesResources kubernetes_resources = 5;
}
Note that mounts, devices, CDI devices are not part of the UpdateContainerResourcesRequest message and this proposal does not suggest adding them.
The In-Place Update of Pod Resources (KEP-1287) Beta (PR) proposes to add new UpdatePodSandboxResources rpc to inform the CRI runtime about the changes in the pod resources.
The UpdatePodSandboxResourcesRequest message is extended similarly to the PodSandboxConfig message to contain information about resources of all its containers. In UpdatePodSandboxResourcesRequest this will reflect the updated resource requirements of the containers.
message UpdatePodSandboxResourcesRequest {
// ID of the PodSandbox to update.
string pod_sandbox_id = 1;
// Optional overhead represents the overheads associated with this sandbox
LinuxContainerResources overhead = 2;
// Optional resources represents the sum of container resources for this sandbox
LinuxContainerResources resources = 3;
// Unstructured key-value map holding arbitrary additional information for
// sandbox resources updating. This can be used for specifying experimental
// resources to update or other options to use when updating the sandbox.
map<string, string> annotations = 4;
+
+ // Kubernetes resource spec of the containers in the pod.
+ PodResourceConfig pod_resources = 5;
}
The implementation will be synced with KEP-1287.
Kubelet code is refactored/modified so that all container resources are known before sandbox creation. This mainly consists of preparing all mounts (of all containers) early.
Kubelet will be extended to pass down all resources of containers in all related CRI requests (as described in the CRI API section). That is:
- adding mounts, devices, CDI devices and the unmodified resource requests and limits of all containers into RunPodSandbox request
- adding unmodified resource requests and limits into CreateContainer and UpdateContainerResources requests
For example, take a PodSpec:
apiVersion: v1
kind: Pod
...
spec:
containers:
- name: cnt-1
image: k8s.gcr.io/pause
resources:
requests:
cpu: 1
memory: 1G
example.com/resource: 1
limits:
cpu: 2
memory: 2G
example.com/resource: 1
volumeMounts:
- mountPath: /my-volume
name: my-volume
- mountPath: /image-volume
name: image-volume
volumes:
- name: my-volume
emptyDir:
- name: image-volume
image:
reference: example.com/registry/artifact:tag
Then kubelet will send the following RunPodSandboxRequest when creating the Pod (represented here in yaml format):
RunPodSandboxRequest:
config:
...
podResources:
containers:
- name: cnt-1
kubernetes_resources:
requests:
cpu: "1"
memory: 1G
example.com/resource: "1"
limits:
cpu: "2"
memory: 2G
example.com/resource: "1"
CDI_devices:
- name: example.com/resource=CDI-Dev-1
mounts:
- container_path: /my-volume
host_path: /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/my-volume
- container_path: /image-volume
image:
image: example.com/registry/artifact:tag
...
- container_path: /var/run/secrets/kubernetes.io/serviceaccount
host_path: /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~projected/kube-api-access-4srqm
readonly: true
- container_path: /dev/termination-log
host_path: /var/lib/kubelet/pods/<pod-uid>/containers/cnt-1/<uuid>
Note that all device plugin resources are passed down in the kubernetes_resources field, but this does not contain any properties of the device that was actually allocated for the container. However, these properties are exposed through the CDI_devices, mounts and devices fields.
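For example, a runtime or NRI plugin that wants to correlate a requested extended-resource count with the concrete allocated devices could group the fully-qualified CDI device names by their kind. A minimal sketch, with illustrative device names:

```go
package main

import (
	"fmt"
	"strings"
)

// groupCDIDevices splits fully-qualified CDI device names of the form
// "<kind>=<device>" and groups the device names by kind.
func groupCDIDevices(devices []string) map[string][]string {
	byKind := map[string][]string{}
	for _, d := range devices {
		if kind, name, ok := strings.Cut(d, "="); ok {
			byKind[kind] = append(byKind[kind], name)
		}
	}
	return byKind
}

func main() {
	// The resource spec only says how many example.com/resource units were
	// requested; the concrete device properties arrive separately via the
	// CDI_devices field (values here are illustrative).
	byKind := groupCDIDevices([]string{
		"example.com/resource=CDI-Dev-1",
		"example.com/resource=CDI-Dev-2",
	})
	fmt.Println(byKind["example.com/resource"]) // [CDI-Dev-1 CDI-Dev-2]
}
```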
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
No prerequisite testing updates have been identified.
- k8s.io/kubernetes/pkg/kubelet/kuberuntime: 68.3% (as of 2024-02-02)
The fake_runtime will be used in unit tests to verify that the Kubelet correctly passes down the resource information to the CRI runtime.
For alpha, no new integration tests are planned.
For alpha, no new e2e tests are planned.
For Beta: a suite of NRI tests will be added to verify that the runtime receives the resource information correctly and passes it down to the NRI plugins.
- Feature implemented behind a feature flag
- Initial unit tests completed and enabled
- Gather feedback from developers and surveys
- Feature gate enabled by default
- containerd and CRI-O runtimes have released versions that have adopted the new CRI API changes
- No bugs reported in the previous cycle
The feature gate (in kubelet) controls the feature enablement. Existing runtime implementations will continue to work as previously, even if the feature is enabled.
The feature is node-local (kubelet-only), so there are no dependencies on or effects to other Kubernetes components.
The behavior is unchanged if either kubelet or the CRI runtime running on a node does not support the feature. If kubelet has the feature enabled but the CRI runtime does not support it, the CRI runtime will ignore the new fields in the CRI API and function as previously. Similarly, if the CRI runtime supports the feature but the kubelet does not, the runtime will resort to the previous behavior.
- Feature gate
- Feature gate name: KubeletContainerResourcesInPodSandbox
- Components depending on the feature gate:
- kubelet
Yes. The kubelet will start passing the extra information to the CRI runtime for every container it creates. Whether this has any effect depends on whether the underlying CRI runtime supports this feature. For example, an NRI plugin relying on the feature may cause the application to behave differently.
Long-running pods that persist (without restart) over a kubelet and CRI runtime update which enables the feature may experience version skew of the metadata. After enabling the feature, the CRI runtime does not have the aggregated information of all resources of the pod provided with this feature, as the kubelet did not restart these pods (did not send the RunPodSandbox CRI request). This may affect some scenarios, e.g. NRI plugins. This "metadata skew" can be avoided by draining the node before updating the kubelet and the CRI runtime.
Yes, disabling the KubeletContainerResourcesInPodSandbox feature gate will disable the feature. Restarting pods may be needed to reset the information that was passed down to the CRI.
New pods will have the feature enabled. Existing pods will continue to operate as before until restarted.
Unit tests for the feature gate will be added.
Rollback or rollout in the kubelet should not fail - it only enables/disables the information (fields in the CRI messages) passed down to the CRI runtime.
However, if the CRI runtime depends on the feature, a rollout or rollback may cause failures of applications on pod restarts. Running pods are not affected.
Alpha: No new metrics are planned. An increase in the existing kubelet_started_pods_errors_total metric can indicate a problem caused by this feature.
Generally, non-ready pods with CreatePodSandboxError status (reflected by the kubelet_started_pods_errors_total metric) are a possible indicator. The error message will contain details if the CRI failure is related to the feature.
Alpha: Manual testing of the feature gate is performed.
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
No.
By examining the kubelet feature gate and the version of the CRI runtime. The enablement of the kubelet feature gate can be determined from the kubernetes_feature_enabled metric.
The end users do not see the status of the feature directly.
The cluster operator can verify that the feature is working by examining the kubelet and CRI runtime logs.
The CRI runtime or NRI plugin developers depending on the feature can ensure that it is working by verifying that all the required information is available at pod sandbox creation time.
No increase in the kubelet_started_pods_errors_total rate.
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
- Metric name: kubelet_started_pods_errors_total
- Components exposing the metric: kubelet
NOTE: The kubelet_started_pods_errors_total metric is a general metric for any errors that occur when starting pods. The error message (Pod events, kubelet logs) will contain details if the CRI failure is related to the feature.
Are there any missing metrics that would be useful to have to improve observability of this feature?
N/A.
No.
However, the practical usability of this feature requires that the CRI runtime also supports it. The feature is effectively a no-op if the CRI runtime does not support it.
No.
No.
No.
No.
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Not noticeably.
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No. The new data fields in the CRI API do not constitute a significant increase.
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
No.
N/A. The feature is node-local.
The feature in Kubernetes is relatively straightforward - passing extra information to the CRI runtime. The failure scenarios arise in the CRI runtime level, e.g.:
- misbehaving CRI runtime or NRI plugin
- CRI runtime or NRI plugin is depending on the feature but it is not enabled in the kubelet
- configuration skew in the cluster where some nodes have the feature enabled and some do not
Pod events and CRI runtime logs are the primary sources of information for these failure scenarios.
N/A.
Container annotations could be used as an alternative way to pass down the resource requests and limits to the container runtime.