
KEP-4112: Pass down resources to CRI

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
    • e2e Tests for all Beta API Operations (endpoints)
    • (R) Ensure GA e2e tests meet requirements for Conformance Tests
    • (R) Minimum Two Week Window for GA e2e tests to prove flake free
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • (R) Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

The CRI runtime lacks visibility into the application's resource requirements.

First, the resources required by the containers of a pod are not visible at the pod sandbox creation time. This can be problematic for example in the case of VM-based runtimes where all resources need to be reserved/prepared when the VM (i.e. sandbox) is being created.

Second, the kubelet does not provide the CRI with complete information about the containers' resource specification, i.e. the requests and limits of native and extended resources. However, various use cases have been identified where detailed knowledge of all the resources can be utilized by container runtimes for more optimal resource allocation, improving application performance and reducing cross-application interference.

This KEP proposes CRI API extensions that provide a complete view of the pod's resources at sandbox creation time and pass unobfuscated information about the resource requests and limits down to container runtimes.

Motivation

When the pod sandbox is created, the kubelet does not provide the CRI runtime with any information about the resources (such as native resources, host devices, mounts, CDI devices, etc.) that will be required by the application. The CRI runtime only becomes aware of the resources piece by piece, as the containers of the pod are created one by one.

This can cause issues with VM-based runtimes (e.g. Kata containers and Confidential Containers) that need to prepare the VM before containers are created.

For Kata to handle PCIe devices properly the CRI needs to tell the kata-runtime how many PCIe root-ports or PCIe switch-ports the hypervisor needs to create at sandbox creation depending on the number of devices allocated by the containers. The PCIe root-port is a static configuration and the hypervisor cannot adjust it once the sandbox is created. During container creation the PCIe devices are hot-plugged to the PCIe root-port or switch-port. If the number of pre-allocated pluggable ports is too low, the attachment will fail (container devices > pre-allocated hot-pluggable ports).
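The port-sizing arithmetic described above can be sketched as follows. This is an illustrative helper (not kata-runtime code), assuming the runtime could learn the per-container device counts at sandbox creation time:

```go
package main

import "fmt"

// requiredPluggablePorts returns how many hot-pluggable PCIe ports the
// hypervisor would need to pre-allocate at sandbox creation so that every
// device of every container can later be hot-plugged. If the pre-allocated
// count is lower than this sum, device attachment fails.
func requiredPluggablePorts(devicesPerContainer []int) int {
	total := 0
	for _, n := range devicesPerContainer {
		total += n
	}
	return total
}

func main() {
	// Two containers requesting 2 and 1 PCIe devices need at least 3 ports.
	fmt.Println(requiredPluggablePorts([]int{2, 1}))
}
```

Without pod-level resource information, the runtime cannot compute this sum at the only point where the port count can still be set.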

In the case of Confidential Containers (which uses Kata under the hood, with additional software components for attestation), the CRI needs to consider the cold-plug, aka direct attachment, use-case. At sandbox creation time the hypervisor needs to know the exact number of pass-through devices and their properties: the VFIO IOMMU group, the actual VFIO device (there can be several devices in an IOMMU group), and whether to attach to a PCIe root-port or a PCIe switch-port (PCI-Bridge). In a confidential setting a user does not want to reconfigure the VM on every create container request, as this creates an attack vector. The hypervisor needs a fully static view of the resources needed for VM sizing.

Independent of hot- or cold-plug, the hypervisor needs to know what the PCI(e) topology will look like at sandbox creation time.

Updating the resources of a container also means resizing the VM; hence the hypervisor needs the complete list of resources to be available in an update container request.

Another visibility issue is related to the native and extended resources. Kubelet manages the native resources (CPU and memory) and communicates resource parameters over the CRI API to the runtime. The following snippet shows the currently supported CRI annotations that are provided by the Kubelet to e.g. containerd:

pkg/cri/annotations/annotations.go

  // SandboxCPU annotations are based on the initial CPU configuration for the sandbox. This is calculated as the
  // sum of container CPU resources, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig
  SandboxCPUPeriod = "io.kubernetes.cri.sandbox-cpu-period"
  SandboxCPUQuota  = "io.kubernetes.cri.sandbox-cpu-quota"
  SandboxCPUShares = "io.kubernetes.cri.sandbox-cpu-shares"

  // SandboxMemory is the initial amount of memory associated with this sandbox. This is calculated as the sum
  // of container memory, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig.
  SandboxMem = "io.kubernetes.cri.sandbox-memory"
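For illustration, a VM-based runtime can derive a vCPU count from these sandbox-level annotations. The annotation keys are the real ones listed above, but the helper itself is a hypothetical sketch, not containerd or Kata code:

```go
package main

import (
	"fmt"
	"strconv"
)

// sandboxCPUCount derives a whole-vCPU estimate from the sandbox-level
// CPU quota/period annotations. This is the kind of arithmetic a VM-based
// runtime has to do today, because the original CPU requests/limits from
// the PodSpec are not visible to it.
func sandboxCPUCount(annotations map[string]string) (int64, error) {
	quota, err := strconv.ParseInt(annotations["io.kubernetes.cri.sandbox-cpu-quota"], 10, 64)
	if err != nil {
		return 0, err
	}
	period, err := strconv.ParseInt(annotations["io.kubernetes.cri.sandbox-cpu-period"], 10, 64)
	if err != nil {
		return 0, err
	}
	if period == 0 {
		return 0, fmt.Errorf("zero CPU period")
	}
	// Round up: a quota of 150ms per 100ms period needs 2 vCPUs.
	return (quota + period - 1) / period, nil
}

func main() {
	n, _ := sandboxCPUCount(map[string]string{
		"io.kubernetes.cri.sandbox-cpu-quota":  "150000",
		"io.kubernetes.cri.sandbox-cpu-period": "100000",
	})
	fmt.Println(n)
}
```

Note how the round-trip through cgroup-style parameters loses the distinction between, say, a 1.5 CPU limit and a 2 CPU request, which is exactly the information loss this KEP addresses.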

However, the original details of the resource spec are lost as they get translated (within the kubelet) to platform-specific (i.e. Linux or Windows) resource controller parameters like CPU shares, memory limits, etc. Non-native resources, such as extended resources and device plugin resources, are completely invisible to the CRI runtime. At the same time, OCI hooks, runC wrappers, NRI plugins, or in some cases even applications themselves would benefit from seeing the original resource requests and limits, e.g. for doing customized resource optimization.

Extending the CRI API to communicate all resources already at sandbox creation, and to pass down the resource requests and limits (of native and extended resources), would provide a comprehensive and early-enough view of the resource usage of all containers in the pod, allowing improved resource allocation without breaking any existing use cases.

Goals

  • make the information about all required resources (e.g. native and extended resources, devices, mounts, CDI devices) of a Pod available to the CRI at sandbox creation time
  • make container resource spec transparently visible to CRI (the container runtime)

Non-Goals

  • change kubelet resource management
  • change existing behavior of CRI
  • add UpdatePodSandboxResources CRI rpc (this is covered by KEP-1287, PR)
  • add pod-level resource requirements (this is covered by KEP-2837, PR)

Proposal

User Stories

Story 1

As a VM-based container runtime developer, I want to allocate/expose enough RAM, hugepages, hot- or cold-pluggable PCI(e) ports, protected memory sections and other resources for the VM to ensure that all containers in the pod are guaranteed to get the resources they require.

Story 2

As a developer of a non-runc / non-Linux CRI runtime, I want to know the detailed container resource requests so that I can make correct resource allocations for the applications. I cannot rely on cgroup parameters for this; I need to know what the user requested in order to fairly allocate resources between applications.

Story 3

As a cluster administrator, I want to install an NRI plugin that does customized resource handling. I run kubelet with CPU manager and memory manager disabled (CPU manager policy set to none). Instead I use my NRI plugin to do customized resource allocation (e.g. cpu and memory pinning). To do that properly I need the actual resource requests and limits requested by the user.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

The proposal only adds new informational data to the CRI API between the kubelet and the container runtime, with no user-visible changes, which considerably mitigates possible risks.

Data duplication/inconsistency with native resources could be considered a risk, as those are passed down to the CRI both as "raw" requests and limits and as "translated" resource control parameters (like CPU shares, OOM score adjustments, etc.). But this should be largely mitigated by code reviews and unit tests.

Design Details

The proposal is that kubelet discloses full resources information from the PodSpec to the container runtime. This is accomplished by extending the ContainerConfig, UpdateContainerResourcesRequest and PodSandboxConfig messages of the CRI API.

With this information, the runtime can for example do detailed resource allocation so that CPU, memory and other resources for each container are optimally aligned. This applies to scenarios where the kubelet CPU manager is disabled (by using the none CPU manager policy).

The resource information is included in PodSandboxConfig so that the runtime can see the full picture of the Pod's resource usage at Pod creation time, for example enabling more holistic resource allocation and thus better interoperability between containers inside the Pod.

Also, the CreateContainer request is extended to include the unmodified resource requirements. This makes it possible for the CRI runtime to detect any changes in the pod resources that happen between Pod creation and container creation, e.g. in scenarios where in-place pod updates are involved.

KEP-1287 Beta (PR) proposes to add a new UpdatePodSandboxResources rpc to the CRI API. If/when KEP-1287 is implemented as proposed, the UpdatePodSandboxResources CRI message will be updated to include the resource information of all containers (aligning with UpdateContainerResourcesRequest).

KEP-2837 Alpha (PR) proposes to add a new Pod-level resource requirements field to the PodSpec. This information will be added to the PodResourceConfig message, similar to the container resource information, if/when KEP-2837 is implemented as proposed.

CRI API

PodSandboxConfig

The PodSandboxConfig message (part of the RunPodSandbox request) will be extended to contain information about the resources of all containers known at pod creation time. The container runtime may use this information to make preparations for all upcoming containers of the pod, e.g. to set up all needed resources for a VM-based pod, or to prepare for optimal allocation of resources across all the containers of the Pod. However, the container runtime may also continue to operate as it did before this enhancement: it can ignore the resource information presented here and allocate resources for each container separately, at container creation time, with the CreateContainer request.

 message PodSandboxConfig {
 
 ...
 
     // Optional configurations specific to Linux hosts.
     LinuxPodSandboxConfig linux = 8;
     // Optional configurations specific to Windows hosts.
     WindowsPodSandboxConfig windows = 9;
+
+    // Kubernetes resource spec of the containers in the pod.
+    PodResourceConfig pod_resources = 10;
 }
 
+// PodResourceConfig contains information of all resources requirements of
+// the containers of a pod.
+message PodResourceConfig {
+    repeated ContainerResourceConfig containers = 1;
+}
 
+// ContainerResourceConfig contains information of all resource requirements of
+// one container.
+message ContainerResourceConfig {
+    // Name of the container
+    string name = 1;
+
+    // Type of the container
+    ContainerType type = 2;
+
+    // Kubernetes resource spec of the container
+    KubernetesResources kubernetes_resources = 3;
+
+    // Mounts for the container.
+    repeated Mount mounts = 4;
+
+    // Devices for the container.
+    repeated Device devices = 5;
+
+    // CDI devices for the container.
+    repeated CDIDevice CDI_devices = 6;
+}

+enum ContainerType {
+    INIT_CONTAINER    = 0;
+    SIDECAR_CONTAINER = 1;
+    CONTAINER = 2;
+}

The Pod-level resources enhancement KEP-2837 (alpha PR) proposes to add new Pod-level resource requirements fields to the PodSpec. This information will be added to the PodResourceConfig message, similar to the container resource information.

 message PodResourceConfig {
     repeated ContainerResourceConfig containers = 1;
+
+    // Kubernetes resource spec of the pod-level resource requirements.
+    KubernetesResources kubernetes_resources = 2;
 }

The implementation of adding the KubernetesResources field to PodResourceConfig will be synced with KEP-2837.

CreateContainer

The ContainerConfig message (used in CreateContainer request) is extended to contain unmodified resource requests from the PodSpec.

+import "k8s.io/apimachinery/pkg/api/resource/generated.proto";

 message ContainerConfig {
 
 ...
 
     // Configuration specific to Windows containers.
     WindowsContainerConfig windows = 16;
 
     // CDI devices for the container.
     repeated CDIDevice CDI_devices = 17;
+
+    // Kubernetes resource spec of the container
+    KubernetesResources kubernetes_resources = 18;
 }
 
+// KubernetesResources contains the resource requests and limits as specified
+// in the Kubernetes core API ResourceRequirements.
+message KubernetesResources {
+    // Requests and limits from the Kubernetes container config.
+    map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 1;
+    map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 2;
+}

Note that mounts, devices, and CDI devices are part of the ContainerConfig message but are left out of the diff snippet above.

Including the KubernetesResources in the ContainerConfig message serves multiple purposes:

  1. Catch changes that happen between pod sandbox creation and container creation. For example, an in-place pod update might change the container's resources before the container is created.
  2. Catch changes that happen over container restarts in in-place pod update scenarios.
  3. Consistency/completeness: have enough information to act consistently based only on the information present in this rpc call.

The resources (mounts, devices, CDI devices, Kubernetes resources) in the CreateContainer request should be identical to what was (pre-)informed in the RunPodSandbox request. If they are different, the CRI runtime may fail the container creation, for example because changes cannot be applied after a VM-based Pod has been created.
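A runtime-side consistency check of this rule might look like the sketch below. The map-of-quantity-strings type is a simplified stand-in (an assumption) for the requests map of the KubernetesResources message:

```go
package main

import "fmt"

// validateAgainstSandbox illustrates how a VM-based runtime might reject a
// CreateContainer request whose resource requests differ from what was
// pre-informed in the RunPodSandbox request, since a VM-based Pod may not be
// able to apply changes after the sandbox has been created.
func validateAgainstSandbox(sandbox, create map[string]string) error {
	if len(sandbox) != len(create) {
		return fmt.Errorf("resource count changed after sandbox creation")
	}
	for name, want := range sandbox {
		if got, ok := create[name]; !ok || got != want {
			return fmt.Errorf("resource %q changed after sandbox creation: %q -> %q",
				name, want, got)
		}
	}
	return nil
}

func main() {
	sandbox := map[string]string{"cpu": "1", "memory": "1G"}
	fmt.Println(validateAgainstSandbox(sandbox, map[string]string{"cpu": "1", "memory": "1G"}))
	fmt.Println(validateAgainstSandbox(sandbox, map[string]string{"cpu": "2", "memory": "1G"}))
}
```

Whether to fail container creation on a mismatch remains a runtime-specific policy decision, as the text above notes ("may fail").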

UpdateContainerResourcesRequest

The UpdateContainerResourcesRequest message is extended to pass down unmodified resource requests from the PodSpec.

 message UpdateContainerResourcesRequest {
     // ID of the container to update.
     string container_id = 1;
     // Resource configuration specific to Linux containers.
     LinuxContainerResources linux = 2;
     // Resource configuration specific to Windows containers.
     WindowsContainerResources windows = 3;
     // Unstructured key-value map holding arbitrary additional information for
     // container resources updating. This can be used for specifying experimental
     // resources to update or other options to use when updating the container.
     map<string, string> annotations = 4;
+
+    // Kubernetes resource spec of the container
+    KubernetesResources kubernetes_resources = 5;
 }

Note that mounts, devices, and CDI devices are not part of the UpdateContainerResourcesRequest message, and this proposal does not suggest adding them.

UpdatePodSandboxResources

The In-Place Update of Pod Resources KEP (KEP-1287) Beta (PR) proposes to add a new UpdatePodSandboxResources rpc to inform the CRI runtime about changes in the pod resources.

The UpdatePodSandboxResourcesRequest message is extended similarly to the PodSandboxConfig message to contain information about resources of all its containers. In UpdatePodSandboxResourcesRequest this will reflect the updated resource requirements of the containers.

 message UpdatePodSandboxResourcesRequest {
     // ID of the PodSandbox to update.
     string pod_sandbox_id = 1;
 
     // Optional overhead represents the overheads associated with this sandbox
     LinuxContainerResources overhead = 2;
     // Optional resources represents the sum of container resources for this sandbox
     LinuxContainerResources resources = 3;
 
     // Unstructured key-value map holding arbitrary additional information for
     // sandbox resources updating. This can be used for specifying experimental
     // resources to update or other options to use when updating the sandbox.
     map<string, string> annotations = 4;
+
+    // Kubernetes resource spec of the containers in the pod.
+    PodResourceConfig pod_resources = 5;
 }

The implementation will be synced with KEP-1287.

kubelet

Kubelet code is refactored/modified so that all container resources are known before sandbox creation. This mainly consists of preparing all mounts (of all containers) early.

Kubelet will be extended to pass down all resources of containers in all related CRI requests (as described in the CRI API section). That is:

  • adding mounts, devices, CDI devices and the unmodified resource requests and limits of all containers into RunPodSandbox request
  • adding unmodified resource requests and limits into CreateContainer and UpdateContainerResources requests

For example, take a PodSpec:

apiVersion: v1
kind: Pod
...
spec:
  containers:
  - name: cnt-1
    image: k8s.gcr.io/pause
    resources:
      requests:
        cpu: 1
        memory: 1G
        example.com/resource: 1
      limits:
        cpu: 2
        memory: 2G
        example.com/resource: 1
    volumeMounts:
    - mountPath: /my-volume
      name: my-volume
    - mountPath: /image-volume
      name: image-volume
  volumes:
  - name: my-volume
    emptyDir:
  - name: image-volume
    image:
      reference: example.com/registry/artifact:tag

Then kubelet will send the following RunPodSandboxRequest when creating the Pod (represented here in yaml format):

RunPodSandboxRequest:
  config:
  ...
    podResources:
      containers:
      - name: cnt-1
        kubernetes_resources:
          requests:
            cpu: "1"
            memory: 1G
            example.com/resource: "1"
          limits:
            cpu: "2"
            memory: 2G
            example.com/resource: "1"
        CDI_devices:
        - name: example.com/resource=CDI-Dev-1
        mounts:
        - container_path: /my-volume
          host_path: /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/my-volume
        - container_path: /image-volume
          image:
            image: example.com/registry/artifact:tag
          ...
        - container_path: /var/run/secrets/kubernetes.io/serviceaccount
          host_path: /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~projected/kube-api-access-4srqm
          readonly: true
        - container_path: /dev/termination-log
          host_path: /var/lib/kubelet/pods/<pod-uid>/containers/cnt-1/<uuid>

Note that all device plugin resources are passed down in the kubernetes_resources field, but this field does not contain any properties of the device that was actually allocated for the container. These properties are instead exposed through the CDI_devices, mounts and devices fields.
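Given such a RunPodSandboxRequest, a VM-based runtime could aggregate the per-container requests into pod-level totals for VM sizing. The sketch below is illustrative; it simplifies quantities to plain integers (e.g. millicores and bytes) instead of the apimachinery resource.Quantity type used by the actual kubernetes_resources field:

```go
package main

import "fmt"

// podTotals sums the per-container resource requests from a pod's
// sandbox-time resource view into pod-level totals, the kind of figure a
// VM-based runtime needs when sizing the VM at sandbox creation.
func podTotals(containers []map[string]int64) map[string]int64 {
	totals := map[string]int64{}
	for _, requests := range containers {
		for name, qty := range requests {
			totals[name] += qty
		}
	}
	return totals
}

func main() {
	totals := podTotals([]map[string]int64{
		{"cpu": 1000, "memory": 1_000_000_000, "example.com/resource": 1},
		{"cpu": 500, "memory": 500_000_000},
	})
	fmt.Println(totals["cpu"], totals["memory"], totals["example.com/resource"])
}
```

A real implementation would also have to account for init and sidecar containers differently from regular containers (hence the ContainerType field in ContainerResourceConfig), since init containers do not run concurrently with the rest.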

Test Plan

[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.

Prerequisite testing updates

No prerequisite testing updates have been identified.

Unit tests
  • k8s.io/kubernetes/pkg/kubelet/kuberuntime: 2024-02-02 - 68.3%

The fake_runtime will be used in unit tests to verify that the Kubelet correctly passes down the resource information to the CRI runtime.

Integration tests

For alpha, no new integration tests are planned.

e2e tests

For alpha, no new e2e tests are planned.

For Beta: a suite of NRI tests will be added to verify that the runtime receives the resource information correctly and passes it down to the NRI plugins.

Graduation Criteria

Alpha

  • Feature implemented behind a feature flag
  • Initial unit tests completed and enabled

Beta

  • Gather feedback from developers and surveys
  • Feature gate enabled by default
  • containerd and CRI-O runtimes have released versions that have adopted the new CRI API changes

GA

  • No bugs reported in the previous cycle

Upgrade / Downgrade Strategy

The feature gate (in kubelet) controls the feature enablement. Existing runtime implementations will continue to work as previously, even if the feature is enabled.

Version Skew Strategy

The feature is node-local (kubelet-only) so there is no dependencies or effects to other Kubernetes components.

The behavior is unchanged if either kubelet or the CRI runtime running on a node does not support the feature. If kubelet has the feature enabled but the CRI runtime does not support it, the CRI runtime will ignore the new fields in the CRI API and function as previously. Similarly, if the CRI runtime supports the feature but the kubelet does not, the runtime will resort to the previous behavior.

Production Readiness Review Questionnaire

Feature Enablement and Rollback

How can this feature be enabled / disabled in a live cluster?
  • Feature gate
    • Feature gate name: KubeletContainerResourcesInPodSandbox
    • Components depending on the feature gate:
      • kubelet
Does enabling the feature change any default behavior?

Yes. The kubelet will start passing the extra information to the CRI runtime for every container it creates. Whether this has any effect depends on whether the underlying CRI runtime supports this feature. For example, an NRI plugin relying on the feature may cause the application to behave differently.

Long-running pods that persist (without restart) over a kubelet and CRI runtime update that enables the feature may experience a version skew of the metadata. After enabling the feature, the CRI runtime does not have the aggregated information about all resources of the pod that this feature provides, because the kubelet did not restart these pods (and hence did not send a RunPodSandbox CRI request for them). This may affect some scenarios, e.g. NRI plugins. This "metadata skew" can be avoided by draining the node before updating the kubelet and the CRI runtime.

Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?

Yes, disabling the KubeletContainerResourcesInPodSandbox feature gate will disable the feature. Restarting pods may be needed to reset the information that was passed down to the CRI.

What happens if we reenable the feature if it was previously rolled back?

New pods will have the feature enabled. Existing pods will continue to operate as before until restarted.

Are there any tests for feature enablement/disablement?

Unit tests for the feature gate will be added.

Rollout, Upgrade and Rollback Planning

How can a rollout or rollback fail? Can it impact already running workloads?

A rollout or rollback in the kubelet should not fail - it only enables/disables the extra information (fields in the CRI messages) passed down to the CRI runtime.

However, if the CRI runtime depends on the feature, a rollout or rollback may cause failures of applications on pod restarts. Running pods are not affected.

What specific metrics should inform a rollback?

Alpha: No new metrics are planned. Increase in the existing kubelet_started_pods_errors_total metric can indicate a problem caused by this feature.

Generally, non-ready pods with CreatePodSandboxError status (reflected by the kubelet_started_pods_errors_total metric) is a possible indicator. The error message will contain details if the CRI failure is related to the feature.

Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?

Alpha: Manual testing of the feature gate is performed.

Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?

No.

Monitoring Requirements

How can an operator determine if the feature is in use by workloads?

By examining the kubelet feature gate and the version of the CRI runtime. The enablement of the kubelet feature gate can be determined from the kubernetes_feature_enabled metric.

How can someone using this feature know that it is working for their instance?

The end users do not see the status of the feature directly.

The cluster operator can verify that the feature is working by examining the kubelet and CRI runtime logs.

The CRI runtime or NRI plugin developers depending on the feature can ensure that it is working by verifying that all the required information is available at pod sandbox creation time.

What are the reasonable SLOs (Service Level Objectives) for the enhancement?

No increase in the kubelet_started_pods_errors_total rate.

What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
  • Metrics
    • Metric name: kubelet_started_pods_errors_total
    • Components exposing the metric: kubelet

NOTE: The kubelet_started_pods_errors_total metric is a general metric for any errors that occur when starting pods. The error message (Pod events, kubelet logs) will contain details if the CRI failure is related to the feature.

Are there any missing metrics that would be useful to have to improve observability of this feature?

N/A.

Dependencies

Does this feature depend on any specific services running in the cluster?

No.

However, the practical usability of this feature requires that the CRI runtime also supports it. The feature is effectively a no-op if the CRI runtime does not support it.

Scalability

Will enabling / using this feature result in any new API calls?

No.

Will enabling / using this feature result in introducing new API types?

No.

Will enabling / using this feature result in any new calls to the cloud provider?

No.

Will enabling / using this feature result in increasing size or count of the existing API objects?

No.

Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?

Not noticeably.

Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?

No. The new data fields in the CRI API do not amount to a significant increase.

Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?

No.

Troubleshooting

How does this feature react if the API server and/or etcd is unavailable?

N/A. The feature is node-local.

What are other known failure modes?

The feature in Kubernetes is relatively straightforward - passing extra information to the CRI runtime. The failure scenarios arise at the CRI runtime level, e.g.:

  • misbehaving CRI runtime or NRI plugin
  • the CRI runtime or an NRI plugin depends on the feature but it is not enabled in the kubelet
  • configuration skew in the cluster where some nodes have the feature enabled and some do not

Pod events and CRI runtime logs are the primary sources of information for these failure scenarios.

What steps should be taken if SLOs are not being met to determine the problem?

N/A.

Implementation History

Drawbacks

Alternatives

Container annotations

Container annotations could be used as an alternative way to pass down the resource requests and limits to the container runtime.

Infrastructure Needed (Optional)