KEP-4112: Pass down resources to CRI
- Release Signoff Checklist
- Summary
- Motivation
- Proposal
- Design Details
- Production Readiness Review Questionnaire
- Implementation History
- Drawbacks
- Alternatives
- Infrastructure Needed (Optional)
Items marked with (R) are required prior to targeting to a milestone / release.
- (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
- (R) KEP approvers have approved the KEP status as `implementable`
- (R) Design details are appropriately documented
- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors)
- e2e Tests for all Beta API Operations (endpoints)
- (R) Ensure GA e2e tests meet requirements for Conformance Tests
- (R) Minimum Two Week Window for GA e2e tests to prove flake free
- (R) Graduation criteria is in place
- (R) all GA Endpoints must be hit by Conformance Tests
- (R) Production readiness review completed
- (R) Production readiness review approved
- "Implementation History" section is up-to-date for milestone
- User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
The CRI runtime lacks visibility into the resource requirements of applications.
First, the resources required by the containers of a pod are not visible at pod sandbox creation time. This can be problematic, for example, for VM-based runtimes where all resources need to be reserved/prepared when the VM (i.e. the sandbox) is created.
Second, the kubelet does not provide the CRI runtime with complete information about the container resource specification, i.e. the requests and limits of native and extended resources. However, various use cases have been identified where detailed knowledge of all the resources can be utilized in container runtimes for more optimal resource allocation, improving application performance and reducing cross-application interference.
This KEP proposes CRI API extensions for providing a complete view of the pod's resources at sandbox creation, and for providing unobfuscated information about resource requests and limits to container runtimes.
When the pod sandbox is created, the kubelet does not provide the CRI runtime with any information about the resources (such as native resources, host devices, mounts, CDI devices, etc.) that will be required by the application. The CRI runtime only becomes aware of the resources piece by piece, as the containers of the pod are created one by one.
This can cause issues with VM-based runtimes (e.g. Kata Containers and Confidential Containers) that need to prepare the VM before containers are created.
For Kata to handle PCIe devices properly, the CRI needs to tell the kata-runtime at sandbox creation how many PCIe root ports or PCIe switch ports the hypervisor must create, depending on the number of devices allocated by the containers. The PCIe root port configuration is static and the hypervisor cannot adjust it once the sandbox is created. During container creation, the PCIe devices are hot-plugged into the PCIe root ports or switch ports. If the number of pre-allocated pluggable ports is too low (container devices > pre-allocated hot-pluggable ports), the attachment will fail.
In the case of Confidential Containers (which uses Kata under the hood, with additional software components for attestation), the CRI needs to consider the cold-plug (a.k.a. direct attachment) use case. At sandbox creation time the hypervisor needs to know the exact number of pass-through devices and their properties: the VFIO IOMMU group, the actual VFIO device (there can be several devices in an IOMMU group), and whether to attach to a PCIe root port or a PCIe switch port (PCI bridge). In a confidential setting, a user does not want to reconfigure the VM on every CreateContainer request, as that creates an attack vector. The hypervisor needs a fully static view of the resources needed for VM sizing.
Independent of hot- or cold-plug, the hypervisor needs to know at sandbox creation time what the PCI(e) topology needs to look like.
Updating the resources of a container also means resizing the VM, hence the hypervisor needs the complete list of resources to be available at each UpdateContainerResources request.
Another visibility issue is related to native and extended resources. Kubelet manages the native resources (CPU and memory) and communicates resource parameters over the CRI API to the runtime. The following snippet shows the currently supported CRI annotations that are provided by the kubelet to e.g. containerd:
pkg/cri/annotations/annotations.go:

```go
// SandboxCPU annotations are based on the initial CPU configuration for the sandbox. This is calculated as the
// sum of container CPU resources, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig
SandboxCPUPeriod = "io.kubernetes.cri.sandbox-cpu-period"
SandboxCPUQuota  = "io.kubernetes.cri.sandbox-cpu-quota"
SandboxCPUShares = "io.kubernetes.cri.sandbox-cpu-shares"

// SandboxMem is the initial amount of memory associated with this sandbox. This is calculated as the sum
// of container memory, optionally provided by Kubelet (introduced in 1.23) as part of the PodSandboxConfig.
SandboxMem = "io.kubernetes.cri.sandbox-memory"
```
However, the original details of the resource spec are lost as they are translated (within kubelet) to platform-specific (i.e. Linux or Windows) resource controller parameters such as CPU shares and memory limits. Non-native resources, such as extended resources and device plugin resources, are completely invisible to the CRI runtime. Yet OCI hooks, runc wrappers, NRI plugins, and in some cases even applications themselves would benefit from seeing the original resource requests and limits, e.g. for doing customized resource optimization.
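As a concrete illustration of this information loss, the following minimal Go sketch mirrors the logic of kubelet's MilliCPUToShares helper (constants re-declared locally here for illustration): the CPU request is flattened into cgroup shares, small requests collapse to the kernel minimum, and extended resources are dropped entirely.

```go
// Minimal sketch of the lossy translation kubelet performs today,
// mirroring the logic of its MilliCPUToShares helper.
package main

import "fmt"

const (
	minShares     = 2    // kernel minimum for cpu.shares
	sharesPerCPU  = 1024 // shares representing one full CPU
	milliCPUToCPU = 1000 // milliCPU units per CPU
)

func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		// No request: fall back to the kernel minimum.
		return minShares
	}
	shares := milliCPU * sharesPerCPU / milliCPUToCPU
	if shares < minShares {
		shares = minShares
	}
	return shares
}

func main() {
	// "cpu: 1" (1000m) becomes 1024 shares; requests of 1m and 2m both
	// end up as the kernel-minimum 2 shares, and an extended resource
	// such as example.com/resource has no cgroup representation at all.
	fmt.Println(milliCPUToShares(1000))                   // 1024
	fmt.Println(milliCPUToShares(1), milliCPUToShares(2)) // 2 2
}
```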
Extending the CRI API to communicate all resources already at sandbox creation, and to pass down the resource requests and limits of native and extended resources, would provide a comprehensive and early-enough view of the resource usage of all containers in the pod, allowing improved resource allocation without breaking any existing use cases.
Goals:
- make the information about all required resources (e.g. native and extended resources, devices, mounts, CDI devices) of a pod available to the CRI runtime at sandbox creation time
- make the container resource spec transparently visible to the CRI runtime
Non-goals:
- change kubelet resource management
- change existing behavior of the CRI
As a VM-based container runtime developer, I want to allocate/expose enough RAM, hugepages, hot- or cold-pluggable PCI(e) ports, protected memory sections and other resources for the VM to ensure that all containers in the pod are guaranteed to get the resources they require.
As a developer of a non-runc / non-Linux CRI runtime, I want to know the detailed container resource requests so that I can make correct resource allocations for applications. I cannot rely on cgroup parameters for this; I need to know what the user requested in order to fairly allocate resources between applications.
As a cluster administrator, I want to install an NRI plugin that does customized resource handling. I run kubelet with the CPU manager and memory manager disabled, and instead use my NRI plugin to do customized resource allocation (e.g. CPU and memory pinning). To do that properly, I need the actual resource requests and limits specified by the user.
The proposal only adds new informational data to the CRI API between kubelet and the container runtime, with no user-visible changes, which considerably mitigates possible risks.
Data duplication/inconsistency with native resources could be considered a risk, as those are passed down to the CRI both as "raw" requests and limits and as "translated" resource control parameters (like CPU shares and OOM scoring). But this should be largely mitigated by code reviews and unit tests.
The proposal is that kubelet discloses the full resource information from the PodSpec to the container runtime. This is accomplished by extending the ContainerConfig, UpdateContainerResourcesRequest and PodSandboxConfig messages of the CRI API.
With this information, the runtime can for example do detailed resource allocation so that CPU, memory and other resources for each container are optimally aligned.
The resource information is included in PodSandboxConfig so that the runtime can see the full picture of the pod's resource usage at pod creation time, for example enabling more holistic resource allocation and thus better interoperability between containers inside the pod.
The PodSandboxConfig message (part of the RunPodSandbox request) will be extended to contain information about the resources of all its containers known at pod creation time. The container runtime may use this information to make preparations for all upcoming containers of the pod, e.g. set up all needed resources for a VM-based pod or prepare for optimal allocation of resources across all the containers of the pod. However, the container runtime may also continue to operate as it did before this enhancement; that is, it can safely ignore the per-container resource information and allocate resources for each container separately, one at a time, with the CreateContainer request.
```diff
 message PodSandboxConfig {
     ...
     // Optional configurations specific to Linux hosts.
     LinuxPodSandboxConfig linux = 8;
     // Optional configurations specific to Windows hosts.
     WindowsPodSandboxConfig windows = 9;
+
+    // Kubernetes resource spec of the containers in the pod.
+    PodResourceConfig pod_resources = 10;
 }

+// PodResourceConfig contains information of all resource requirements of
+// the containers of a pod.
+message PodResourceConfig {
+    repeated ContainerResourceConfig init_containers = 1;
+    repeated ContainerResourceConfig containers = 2;
+}

+// ContainerResourceConfig contains information of all resource requirements of
+// one container.
+message ContainerResourceConfig {
+    // Name of the container
+    string name = 1;
+
+    // Kubernetes resource spec of the container
+    KubernetesResources kubernetes_resources = 2;
+
+    // Mounts for the container.
+    repeated Mount mounts = 3;
+
+    // Devices for the container.
+    repeated Device devices = 4;
+
+    // CDI devices for the container.
+    repeated CDIDevice CDI_devices = 5;
+}
```
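To illustrate one way a runtime could consume this information, here is a hypothetical sketch of a VM-based runtime pre-computing its sandbox size from the proposed pod-level resource data. The containerResources type is a hand-written stand-in for the generated ContainerResourceConfig binding; only the Quantity arithmetic (from k8s.io/apimachinery/pkg/api/resource) is existing API.

```go
// Hypothetical sketch: sizing a VM-based sandbox from the proposed
// pod_resources field at RunPodSandbox time.
package main

import (
	"fmt"

	"k8s.io/apimachinery/pkg/api/resource"
)

// containerResources is an illustrative stand-in for the generated
// ContainerResourceConfig message.
type containerResources struct {
	Name   string
	Limits map[string]*resource.Quantity
}

// vmSize sums the CPU and memory limits of all containers so the
// hypervisor can be configured once, before any container exists.
func vmSize(containers []containerResources) (milliCPU, memBytes int64) {
	for _, c := range containers {
		if q, ok := c.Limits["cpu"]; ok {
			milliCPU += q.MilliValue()
		}
		if q, ok := c.Limits["memory"]; ok {
			memBytes += q.Value()
		}
	}
	return milliCPU, memBytes
}

func main() {
	cpu := resource.MustParse("2")
	mem := resource.MustParse("2G")
	pod := []containerResources{{
		Name:   "cnt-1",
		Limits: map[string]*resource.Quantity{"cpu": &cpu, "memory": &mem},
	}}
	mc, mb := vmSize(pod)
	fmt.Printf("VM sizing: %dm CPU, %d bytes memory\n", mc, mb) // 2000m, 2000000000
}
```

Because the sum is available already at RunPodSandbox time, the hypervisor can be configured once instead of being resized as each container arrives.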
The ContainerConfig message (used in the CreateContainer request) is extended to contain the unmodified resource requests from the PodSpec.
+import "k8s.io/apimachinery/pkg/api/resource/generated.proto";
message ContainerConfig {
...
// Configuration specific to Windows containers.
WindowsContainerConfig windows = 16;
// CDI devices for the container.
repeated CDIDevice CDI_devices = 17;
+
+ // Kubernetes resource spec of the container
+ KubernetesResources kubernetes_resources = 18;
}
+// KubernetesResources contains the resource requests and limits as specified
+// in the Kubernetes core API ResourceRequirements.
+message KubernetesResources {
+ // Requests and limits from the Kubernetes container config.
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> requests = 1;
+ map<string, k8s.io.apimachinery.pkg.api.resource.Quantity> limits = 2;
+}
The resources (mounts, devices, CDI devices, Kubernetes resources) in the CreateContainer request should be identical to what was (pre-)informed in the RunPodSandbox request. If they differ, the CRI runtime may fail the container creation, for example because the changes cannot be applied after a VM-based pod has been created.
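The exact policy is runtime-specific; as a sketch, a runtime might compare the resources requested at CreateContainer against the pre-informed set and reject any divergence (plain string maps stand in for the generated CRI types here):

```go
// Illustrative consistency check a CRI runtime might apply at
// CreateContainer time against what RunPodSandbox pre-informed.
package main

import "fmt"

func checkPreInformed(preInformed, requested map[string]string) error {
	for name, want := range preInformed {
		if got, ok := requested[name]; !ok || got != want {
			return fmt.Errorf("resource %q changed after sandbox creation: pre-informed %q, requested %q",
				name, want, got)
		}
	}
	for name := range requested {
		if _, ok := preInformed[name]; !ok {
			return fmt.Errorf("resource %q was not pre-informed at sandbox creation", name)
		}
	}
	return nil
}

func main() {
	sandbox := map[string]string{"cpu": "2", "memory": "2G"}
	create := map[string]string{"cpu": "2", "memory": "4G"} // diverged
	fmt.Println(checkPreInformed(sandbox, create))
}
```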
The UpdateContainerResourcesRequest message is extended to pass down unmodified resource requests from the PodSpec.
```diff
 message UpdateContainerResourcesRequest {
     // ID of the container to update.
     string container_id = 1;
     // Resource configuration specific to Linux containers.
     LinuxContainerResources linux = 2;
     // Resource configuration specific to Windows containers.
     WindowsContainerResources windows = 3;
     // Unstructured key-value map holding arbitrary additional information for
     // container resources updating. This can be used for specifying experimental
     // resources to update or other options to use when updating the container.
     map<string, string> annotations = 4;
+
+    // Kubernetes resource spec of the container
+    KubernetesResources kubernetes_resources = 5;
 }
```
Kubelet code is refactored/modified so that all container resources are known before sandbox creation. This mainly consists of preparing all mounts (of all containers) early.
Kubelet will be extended to pass down all mounts, devices, CDI devices, and the unmodified resource requests and limits to the container runtime in all related CRI requests, i.e. RunPodSandbox, CreateContainer and UpdateContainerResources.
For example, take a PodSpec:
```yaml
apiVersion: v1
kind: Pod
...
spec:
  containers:
  - name: cnt-1
    image: k8s.gcr.io/pause
    resources:
      requests:
        cpu: 1
        memory: 1G
        example.com/resource: 1
      limits:
        cpu: 2
        memory: 2G
        example.com/resource: 1
    volumeMounts:
    - mountPath: /my-volume
      name: my-volume
  volumes:
  - name: my-volume
    emptyDir:
```
Then kubelet will send the following RunPodSandboxRequest when creating the pod (represented here in YAML format):
```yaml
RunPodSandboxRequest:
  config:
    ...
    podResources:
      containers:
      - name: cnt-1
        kubernetes_resources:
          requests:
            cpu: "1"
            memory: 1G
            example.com/resource: "1"
          limits:
            cpu: "2"
            memory: 2G
            example.com/resource: "1"
        CDI_devices:
        - name: example.com/resource=CDI-Dev-1
        mounts:
        - container_path: /my-volume
          host_path: /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~empty-dir/my-volume
        - container_path: /var/run/secrets/kubernetes.io/serviceaccount
          host_path: /var/lib/kubelet/pods/<pod-uid>/volumes/kubernetes.io~projected/kube-api-access-4srqm
          readonly: true
        - container_path: /dev/termination-log
          host_path: /var/lib/kubelet/pods/<pod-uid>/containers/cnt-1/<uuid>
```
Note that all device plugin resources are passed down in the `kubernetes_resources` field, but this does not contain any properties of the device that was actually allocated for the container. However, these properties are exposed through the `CDI_devices`, `mounts` and `devices` fields.
[x] I/we understand the owners of the involved components may require updates to existing tests to make this code solid enough prior to committing the changes necessary to implement this enhancement.
No prerequisite testing updates have been identified.
- `k8s.io/kubernetes/pkg/kubelet/kuberuntime`: 2024-02-02 - 68.3%
The `fake_runtime` will be used in unit tests to verify that the kubelet correctly passes down the resource information to the CRI runtime.
For alpha, no new integration tests are planned.
For alpha, no new e2e tests are planned.
Alpha:
- Feature implemented behind a feature flag
- Initial unit tests completed and enabled
Beta:
- Gather feedback from developers and surveys
- Feature gate enabled by default
- containerd and CRI-O runtimes have released versions that have adopted the new CRI API changes
- The NRI API has adopted the feature
GA:
- No bugs reported in the previous cycle
- N examples of real-world usage
- N installs
The feature gate (in kubelet) controls the feature enablement. Existing runtime implementations will continue to work as previously, even if the feature is enabled.
The feature is node-local (kubelet-only), so there are no dependencies on, or effects to, other Kubernetes components.
The behavior is unchanged if either kubelet or the CRI runtime running on a node does not support the feature. If kubelet has the feature enabled but the CRI runtime does not support it, the CRI runtime will ignore the new fields in the CRI API and function as previously. Similarly, if the CRI runtime supports the feature but the kubelet does not, the runtime will resort to the previous behavior.
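On the runtime side this tolerance can be as simple as a nil check on the new field; the sketch below uses illustrative stand-in types for the generated CRI bindings.

```go
// Sketch of version-skew tolerance in a CRI runtime: if pod_resources
// is unset (old kubelet, or feature gate off), keep the pre-KEP,
// per-container allocation path.
package main

import "fmt"

// Illustrative stand-ins for the generated CRI types.
type PodResourceConfig struct{ /* init_containers, containers, ... */ }
type PodSandboxConfig struct{ PodResources *PodResourceConfig }

func runPodSandbox(cfg *PodSandboxConfig) {
	if cfg == nil || cfg.PodResources == nil {
		// Field absent: allocate per container at CreateContainer time,
		// exactly as before this enhancement.
		fmt.Println("no pod-level resource view; using legacy path")
		return
	}
	fmt.Println("pre-computing sandbox resources from pod_resources")
}

func main() {
	runPodSandbox(&PodSandboxConfig{})                                   // old kubelet
	runPodSandbox(&PodSandboxConfig{PodResources: &PodResourceConfig{}}) // new kubelet
}
```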
- Feature gate (also fill in values in kep.yaml)
  - Feature gate name:
  - Components depending on the feature gate:
- Other
  - Describe the mechanism:
  - Will enabling / disabling the feature require downtime of the control plane?
  - Will enabling / disabling the feature require downtime or reprovisioning of a node?
Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
- Events
  - Event Reason:
- API .status
  - Condition name:
  - Other field:
- Other (treat as last resort)
  - Details:
What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
- Metrics
  - Metric name:
  - [Optional] Aggregation method:
  - Components exposing the metric:
- Other (treat as last resort)
  - Details:
Are there any missing metrics that would be useful to have to improve observability of this feature?
Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
Container annotations could be used as an alternative way to pass down the resource requests and limits to the container runtime.