# In-Place Vertical Pod Scaling KEP to implementable, and mini-KEP for CRI extensions #1342

Merged Jan 28, 2020 · 16 commits (changes shown from all commits)
@@ -5,9 +5,9 @@ authors:
- "@bskiba"
- "@schylek"
- "@vinaykul"
owning-sig: sig-autoscaling
owning-sig: sig-node
participating-sigs:
- sig-node
- sig-autoscaling
- sig-scheduling
reviewers:
- "@bsalamat"
@@ -23,9 +23,10 @@ approvers:
- "@mwielgus"
editor: TBD
creation-date: 2018-11-06
last-updated: 2018-11-06
status: provisional
last-updated: 2020-01-14
status: implementable
see-also:
- "/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md"
replaces:
superseded-by:
---
@@ -48,11 +49,21 @@ superseded-by:
- [Scheduler and API Server Interaction](#scheduler-and-api-server-interaction)
- [Flow Control](#flow-control)
- [Container resource limit update ordering](#container-resource-limit-update-ordering)
- [Container resource limit update failure handling](#container-resource-limit-update-failure-handling)
- [Notes](#notes)
- [Affected Components](#affected-components)
- [Future Enhancements](#future-enhancements)
- [Risks and Mitigations](#risks-and-mitigations)
- [Test Plan](#test-plan)
- [Unit Tests](#unit-tests)
- [Pod Resize E2E Tests](#pod-resize-e2e-tests)
- [Resource Quota and Limit Ranges](#resource-quota-and-limit-ranges)
- [Resize Policy Tests](#resize-policy-tests)
- [Backward Compatibility and Negative Tests](#backward-compatibility-and-negative-tests)
- [Graduation Criteria](#graduation-criteria)
- [Alpha](#alpha)
- [Beta](#beta)
- [Stable](#stable)
- [Implementation History](#implementation-history)
<!-- /toc -->

@@ -134,15 +145,19 @@ Thanks to the above:
v1.ResourceRequirements) shows the **actual** resources held by the Pod and
its Containers.

A new Pod subresource named 'resourceallocation' is introduced to allow
fine-grained access control that enables Kubelet to set or update resources
allocated to a Pod, and prevents the user or any other component from changing
the allocated resources.
A new admission controller named 'PodResourceAllocation' is introduced in order
to limit access to the ResourcesAllocated field such that only Kubelet can update
this field.

Additionally, Kubelet is authorized to update PodSpec, and NodeRestriction
admission plugin is extended to limit Kubelet's update access only to Pod's
ResourcesAllocated field for CPU and memory resources.
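
A minimal sketch of the intended restriction, using simplified stand-in types
rather than the real plugin code (the field and helper names here are
assumptions, not the actual implementation):

```go
package podresizesketch

import (
	"fmt"
	"reflect"
)

// Minimal local stand-ins for the proposed fields; the real definitions would
// live in the Pod API once the KEP is implemented.
type containerSpec struct {
	Name               string
	Image              string
	ResourcesAllocated map[string]string // proposed field: resource name -> quantity
}

// admitNodePodUpdate sketches the NodeRestriction extension described above:
// a pod update coming from a node identity is admitted only if clearing the
// ResourcesAllocated fields makes the old and new container specs identical,
// i.e. the node changed nothing else.
func admitNodePodUpdate(oldContainers, newContainers []containerSpec) error {
	strip := func(cs []containerSpec) []containerSpec {
		out := make([]containerSpec, len(cs))
		copy(out, cs)
		for i := range out {
			out[i].ResourcesAllocated = nil
		}
		return out
	}
	if !reflect.DeepEqual(strip(oldContainers), strip(newContainers)) {
		return fmt.Errorf("node may only update the ResourcesAllocated field")
	}
	return nil
}
```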

#### Container Resize Policy

To provide fine-grained user control, PodSpec.Containers is extended with
ResizePolicy map (new object) for each resource type (CPU, memory):
ResizePolicy - a list of named subobjects (new object) that supports 'cpu'
and 'memory' as names. It supports the following policy values:
* NoRestart - the default value; resize Container without restarting it,
* RestartContainer - restart the Container in-place to apply new resource
values. (e.g. Java process needs to change its Xmx flag)
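
For illustration, the ResizePolicy shape described above might look roughly
like the following in Go; the exact type and field names are assumptions, not
the final API:

```go
package podresizesketch

// Hypothetical Go shape of the ResizePolicy API described above.
type ResizePolicy string

const (
	NoRestart        ResizePolicy = "NoRestart"
	RestartContainer ResizePolicy = "RestartContainer"
)

// One entry per resource name; only "cpu" and "memory" are supported initially.
type ContainerResizePolicy struct {
	ResourceName string       `json:"resourceName"`
	Policy       ResizePolicy `json:"policy"`
}

// The Container spec would carry a list of these, e.g.:
//   ResizePolicy []ContainerResizePolicy `json:"resizePolicy,omitempty"`
```
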
@@ -167,6 +182,13 @@ Kubelet calls UpdateContainerResources CRI API which currently takes
but not for Windows. This parameter changes to *runtimeapi.ContainerResources*,
that is runtime agnostic, and will contain platform-specific information.

Additionally, the ContainerStatus CRI API is extended to hold
*runtimeapi.ContainerResources* so that Kubelet can query the Container's
CPU and memory limit configurations from the runtime.

These CRI changes are a separate effort that does not affect the design
proposed in this KEP.
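
For illustration only, the runtime-agnostic shape could look roughly like the
following (written as Go structs rather than protobuf, with abbreviated field
sets); the authoritative message definitions belong to the CRI mini-KEP listed
in see-also and may differ:

```go
package crisketch

// Abbreviated, illustrative sketch of platform-specific resource settings.
type LinuxContainerResources struct {
	CpuPeriod          int64 // CFS period in microseconds
	CpuQuota           int64 // CFS quota in microseconds
	CpuShares          int64 // relative CPU weight
	MemoryLimitInBytes int64
}

type WindowsContainerResources struct {
	CpuCount           int64
	CpuMaximum         int64
	MemoryLimitInBytes int64
}

// ContainerResources wraps the platform-specific variants so that
// UpdateContainerResources and ContainerStatus can stay platform-neutral.
type ContainerResources struct {
	Linux   *LinuxContainerResources
	Windows *WindowsContainerResources
}
```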

### Kubelet and API Server Interaction

When a new Pod is created, Scheduler is responsible for selecting a suitable
@@ -185,11 +207,10 @@ resources allocated (Pod.Spec.Containers[i].ResourcesAllocated) for all Pods in
the Node, except the Pod being resized. For the Pod being resized, it adds the
new desired resources (i.e. Spec.Containers[i].Resources.Requests) to the sum.
* If new desired resources fit, Kubelet accepts the resize by updating
Pod.Spec.Containers[i].ResourcesAllocated via pods/resourceallocation
subresource, and then proceeds to invoke UpdateContainerResources CRI API
to update the Container resource limits. Once all Containers are successfully
updated, it updates Pod.Status.ContainerStatuses[i].Resources to reflect the
new resource values.
Pod.Spec.Containers[i].ResourcesAllocated, and then proceeds to invoke
UpdateContainerResources CRI API to update Container resource limits. Once
all Containers are successfully updated, it updates
Pod.Status.ContainerStatuses[i].Resources to reflect new resource values.
* If new desired resources don't fit, Kubelet rejects the resize, and no
further action is taken.
- Kubelet retries the Pod resize at a later time.
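
The node-level fit check described above can be sketched as follows
(illustrative only, evaluated per resource; not kubelet source):

```go
package kubeletsketch

import "k8s.io/apimachinery/pkg/api/resource"

// resizeFits sums the allocations of all other pods on the node, adds the
// resized pod's new desired requests, and checks that the total stays within
// the node's allocatable capacity for that resource (cpu or memory).
func resizeFits(nodeAllocatable, otherPodsAllocated, newDesiredRequests resource.Quantity) bool {
	total := otherPodsAllocated.DeepCopy()
	total.Add(newDesiredRequests)
	return total.Cmp(nodeAllocatable) <= 0
}
```
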
@@ -234,10 +255,9 @@ Pod with ResizePolicy set to NoRestart for all its Containers.
resources to determine if the new desired Resources fit the Node.
* _Case 1_: Kubelet finds new desired Resources fit. It accepts the resize
and sets Spec.Containers[i].ResourcesAllocated equal to the values of
Spec.Containers[i].Resources.Requests by invoking resourceallocation
subresource. It then applies the new cgroup limits to the Pod and its
Containers, and once successfully done, sets Pod's
Status.ContainerStatuses[i].Resources to reflect the desired resources.
Spec.Containers[i].Resources.Requests. It then applies the new cgroup
limits to the Pod and its Containers, and once successfully done, sets
Pod's Status.ContainerStatuses[i].Resources to reflect desired resources.
- If at the same time, a new Pod was assigned to this Node against the
capacity taken up by this resource resize, that new Pod is rejected by
Kubelet during admission if Node has no more room.
Expand Down Expand Up @@ -283,6 +303,16 @@ updates resource limit for the Pod and its Containers in the following manner:
In all the above cases, Kubelet applies Container resource limit decreases
before applying limit increases.

#### Container resource limit update failure handling

If multiple Containers in a Pod are being updated, and UpdateContainerResources
CRI API fails for any of the containers, Kubelet will backoff and retry at a
later time. Kubelet does not attempt to update limits for containers that are
lined up for update after the failing container. This ensures that the sum of
the container limits does not exceed the Pod-level cgroup limit at any point.
Once all the container limits have been successfully updated, Kubelet updates
the Pod's
Status.ContainerStatuses[i].Resources to match the desired limit values.
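
A minimal sketch of this ordering and failure handling (illustrative only;
function names and signatures are assumptions, not kubelet source):

```go
package kubeletsketch

// updateContainerLimits applies limit decreases before increases, and stops at
// the first UpdateContainerResources failure so the sum of container limits
// never exceeds the pod-level cgroup limit. Kubelet would then back off and
// retry the remaining containers at a later time.
func updateContainerLimits(containers []string, isDecrease func(name string) bool,
	updateCRI func(name string) error) error {
	ordered := make([]string, 0, len(containers))
	for _, c := range containers { // decreases first ...
		if isDecrease(c) {
			ordered = append(ordered, c)
		}
	}
	for _, c := range containers { // ... then increases
		if !isDecrease(c) {
			ordered = append(ordered, c)
		}
	}
	for _, c := range ordered {
		if err := updateCRI(c); err != nil {
			return err // stop; containers after this one are left untouched
		}
	}
	return nil
}
```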

#### Notes

* If CPU Manager policy for a Node is set to 'static', then only integral
@@ -309,13 +339,20 @@ before applying limit increases.

Pod v1 core API:
* extended model,
* new subresource,
* added validation.
* modify RBAC bootstrap policy authorizing Node to update PodSpec,
* extend NodeRestriction plugin limiting Node's update access to PodSpec only
to the ResourcesAllocated field,
* new admission controller to limit update access to ResourcesAllocated field
only to Node, and mutates any updates to ResourcesAllocated & ResizePolicy
fields to maintain compatibility with older versions of clients,
* added validation allowing only CPU and memory resource changes,
* setting defaults for ResourcesAllocated and ResizePolicy fields.

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
**Reviewer comment:**

The flow described above requires kubelets to update pod spec, which they do not have permission to do today.

That involves changing:

- the node authorizer to permit kubelets to patch/update the pods resource (not just the pods/status subresource)
- the NodeRestriction admission plugin to understand what types of updates a kubelet is allowed to make to a pod (we would not want to allow arbitrary label/image updates, for example)

cc @tallclair

**Author reply:**

Good point.

From what I can see, the simplest way is to introduce an admitPodUpdate method to the NodeRestriction plugin that verifies only the ResourcesAllocated field is being touched, and that the node updating the pod owns the pod. I'll try it out and see if that covers it without leaving any holes.

For authorization, I have modified NodeRules() in plugin/pkg/auth/authorizer/rbac/bootstrappolicy/policy.go to allow nodes to update the pod resource. (They are allowed to create and delete pods at this time).

The above approach is consistent with how pod creates and deletes by node are handled.

**Reviewer comment:**

Since the proposal is scoped to support only cpu and memory resources, is kubelet only authorized to change those values? I am assuming that we would want the kubelet to report all resources allocated and enforced (not just cpu and memory), but we would not want to let a user change the pod spec in validation for anything other than cpu and memory? Is that an accurate understanding?

**Reviewer comment:**

the alternative is that the pod admission plugin sets allocated for all resources other than cpu/memory, but that would make extending this support to other future resource types challenging.

**Author reply:**

Kubelet can continue to report status on all resources. I don't see a need to restrict status changes for the new Resources field. However, for ResourcesAllocated field, it is best to start with allowing Node to change what's actually supported now. As we add support for other resource types, we can just add to the list of supported resource types in the admission plugin.

And yes, for the user, we definitely want to lock it down to just what's supported - cpu and memory

* for ResourceQuota, podEvaluator.Handler implementation is modified to allow
Pod updates, and verify that the sum of Pod.Spec.Containers[i].Resources for
all Pods in the Namespace doesn't exceed quota (see the sketch after this list),
* PodResourceAllocation admission plugin is ordered before ResourceQuota.
* for LimitRanger we check that a resize request does not violate the min and
max limits specified in LimitRange for the Pod's namespace.
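
A rough sketch of the quota check on resize (illustrative only; not the actual
podEvaluator code, and evaluated per resource). The LimitRanger check is
analogous, comparing the new values against the namespace's min/max bounds:

```go
package quotasketch

import "k8s.io/apimachinery/pkg/api/resource"

// resizeWithinQuota recomputes namespace usage by swapping the pod's old
// requests for its new desired requests, and verifies the result stays within
// the quota's hard limit for that resource.
func resizeWithinQuota(hard, used, oldPodRequests, newPodRequests resource.Quantity) bool {
	proposed := used.DeepCopy()
	proposed.Sub(oldPodRequests)
	proposed.Add(newPodRequests)
	return proposed.Cmp(hard) <= 0
}
```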

@@ -328,9 +365,6 @@ Kubelet:
Scheduler:
* compute resource allocations using Pod.Spec.Containers[i].ResourcesAllocated.

Controllers:
* propagate Template resources update to running Pod instances.

Other components:
* check how the change of meaning of resource requests influence other
Kubernetes components.
@@ -347,6 +381,8 @@ Other components:
1. Extend Node Information API to report the CPU Manager policy for the Node,
and enable validation of integral CPU resize for nodes with 'static' CPU
Manager policy.
1. Extend controllers (Job, Deployment, etc) to propagate Template resources
update to running Pods.
1. Allow resizing local ephemeral storage.
1. Allow resource limits to be updated (VPA feature).

@@ -362,14 +398,148 @@ Other components:
1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
could be in use, and approaches such as setting limit near current usage may
be required. This issue needs further investigation.
1. Older client versions: Previous versions of clients that are unaware of the
new ResourcesAllocated and ResizePolicy fields would set them to nil. To
keep compatibility, PodResourceAllocation admission controller mutates such
an update by copying non-nil values from the old Pod to current Pod.
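
A minimal sketch of this compatibility mutation, using simplified stand-in
types (the real logic would operate on the Pod API objects in the admission
plugin's mutating phase):

```go
package compatsketch

// Simplified stand-ins for the proposed fields.
type container struct {
	ResourcesAllocated map[string]string // proposed field
	ResizePolicy       []string          // proposed field (simplified)
}

// preserveDroppedFields copies the new fields forward from the existing object
// when an older client dropped them (sent nil), rather than treating the nil
// as a deliberate attempt to clear the fields.
func preserveDroppedFields(oldC, newC *container) {
	if newC.ResourcesAllocated == nil {
		newC.ResourcesAllocated = oldC.ResourcesAllocated
	}
	if newC.ResizePolicy == nil {
		newC.ResizePolicy = oldC.ResizePolicy
	}
}
```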

**Reviewer comment:**

Since this proposes adding a new field to pod spec, we need to consider the following cases:

- updates by clients unaware of the new field, which preserve it and send back the existing value (e.g. dynamic clients using unstructured json requests/responses, or clients using patch)
  - since those clients would not currently be successfully changing resources, there's probably nothing special that needs to be done for these clients
- updates by clients unaware of the new field, which drop it on update (e.g. old versions of client-go)
  - an update request from a client like this would set the new field to nil. The server must not treat that as an attempt by the client to clear the field (and forbid based on an authorization check, etc.), but must maintain compatibility with existing clients by copying the value from the existing pod

Since the ResourcesAllocated field is in pod spec, and pod spec is also used inside pod templates, are we intending to allow/disallow this field to be set inside workload API types (e.g. daemonset, deployment)? Unless we actively prevent it, values can be set for that field in those types, and we have to think through how to handle updates from old clients for those types as well.

**Reviewer comment:**

Regarding "Controllers: propagate Template resources update to running Pod instances" - has that been investigated and proven feasible? There are multiple mechanisms controllers use to match up particular child resources with particular generations of the parent resource, and it would be good to know if some (like hashing of the pod template to determine a label for the child resource's selector) are incompatible with in-place update of the pod template without rolling out a new instance of the child.

**Author reply:**

Yes. Pardon my ignorance with admission controllers - I've just started playing with them a few weeks ago. But I believe I should be able to mutate it with the new PodResourceAllocation controller - I'll look deeper into this. Is there a wiki that I can use to experiment with upgrade?

About controllers, we had the propagation working with Job and Deployment controllers in our old design prototype code. But I'll remove this from the scope of the current KEP - VPA cares about updating running pods, and I don't want to commit to it as I need to budget for a few surprises as I do a thorough implementation of the admission control changes and handle upgrade scenario. So we will disallow updating template nested pods. This can always be added as a subsequent enhancement.

**Author reply:**

@liggitt I dug a bit more into updating controller templates. Currently, we cannot update the Resources field for Job controllers, but we are allowed to do so for Deployment controllers - it results in Pods being recreated with the new desired resources.

I want to keep the same behavior - if we attempted to disallow it because of this feature, it would be a breaking change.

In 1.19 or another future release, we can perhaps consider propagating the template resource change to running pods (as we had done in our old design PoC). So I'll clarify the KEP to state that current behavior will be maintained for template Pod Resources updates.

**Reviewer comment (@liggitt, Jan 27, 2020):**

> In 1.19 or another future release, we can perhaps consider propagating the template resource change to running pods (as we had done in our old design PoC). So I'll clarify the KEP to state that current behavior will be maintained for template Pod Resources updates.

If vertical scaling is only done on individual pod instances, that means a new rollout of a deployment will reset all resource use back to the original levels? Is that acceptable? That seems likely to cause problems if current pods were scaled up in response to load, then a rollout drops capacity back down significantly.

**Reviewer comment:**

Or is the idea that a separate process would determine the average required resources and propagate that back into the workload template at some interval?

**Author reply:**

> Or is the idea that a separate process would determine the average required resources and propagate that back into the workload template at some interval?

Yes. Current VPA behavior is to make resource recommendations based on historical measurements and current usage, and optionally apply those recommendations during admission control if the user chooses to allow VPA to control the resources. New recommendations are currently applied by evicting the current pod so that it hits the admission controller.

At this time, we want to keep the current behavior aside from the added ability for VPA to request a pod to be resized without restart.

**Reviewer comment:**

I think the pod admission mutation makes sense as long as that happens prior to quota evaluation.

btw, i appreciate this additional detail.

**Author reply:**

@liggitt I'm able to take care of updates from older client-go versions by setting default values on create, and copying old object values on update by handling it in the admission controller mutating phase rather than defaults.go. Doing this in defaults.go would attempt to set the values that were dropped by older client-go to default values, and thus we would lose data.

I was able to test this out by writing a little tool similar to staging/src/k8s.io/client-go/examples/create-update-delete-deployment, but one that calls Pods(ns).Update()

Validation allows the Resources and ResourcesAllocated fields to be mutable only for PodSpec, and the podresourceallocation and noderestriction plugins handle what the user can do and what the node can update.

Please review PR vinaykul/kubernetes#1

## Test Plan

### Unit Tests

Unit tests will cover the sanity of code changes that implement the feature,
and the policy controls that are introduced as part of this feature.

### Pod Resize E2E Tests

End-to-End tests resize a Pod via PATCH to Pod's Spec.Containers[i].Resources.
The e2e tests use docker as the container runtime.
- Resizing of Requests is verified by querying the values in the Pod's
Spec.Containers[i].ResourcesAllocated field.
- Resizing of Limits is verified by querying the cgroup limits of the Pod's
containers.
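
For illustration, a resize request of this form could be issued with client-go
roughly as follows; the pod name, namespace, container name, and resource
values are hypothetical, and the cluster must have the feature enabled for the
patch to take effect as an in-place resize:

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Strategic-merge patch touching only the resources of container "app".
	patch := []byte(`{"spec":{"containers":[{"name":"app","resources":{"requests":{"cpu":"500m"},"limits":{"cpu":"1"}}}]}}`)

	pod, err := cs.CoreV1().Pods("default").Patch(context.TODO(), "test-pod",
		types.StrategicMergePatchType, patch, metav1.PatchOptions{})
	if err != nil {
		panic(err)
	}
	fmt.Printf("resize requested for pod %s\n", pod.Name)
}
```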

E2E test cases for Guaranteed class Pod with one container:
1. Increase, decrease Requests & Limits for CPU only.
1. Increase, decrease Requests & Limits for memory only.
1. Increase, decrease Requests & Limits for CPU and memory.
1. Increase CPU and decrease memory.
1. Decrease CPU and increase memory.

E2E test cases for Burstable class single container Pod that specifies
both CPU & memory:
1. Increase, decrease Requests - CPU only.
1. Increase, decrease Requests - memory only.
1. Increase, decrease Requests - both CPU & memory.
1. Increase, decrease Limits - CPU only.
1. Increase, decrease Limits - memory only.
1. Increase, decrease Limits - both CPU & memory.
1. Increase, decrease Requests & Limits - CPU only.
1. Increase, decrease Requests & Limits - memory only.
1. Increase, decrease Requests & Limits - both CPU and memory.
1. Increase CPU (Requests+Limits) & decrease memory(Requests+Limits).
1. Decrease CPU (Requests+Limits) & increase memory(Requests+Limits).
1. Increase CPU Requests while decreasing CPU Limits.
1. Decrease CPU Requests while increasing CPU Limits.
1. Increase memory Requests while decreasing memory Limits.
1. Decrease memory Requests while increasing memory Limits.
1. CPU: increase Requests, decrease Limits, Memory: increase Requests, decrease Limits.
1. CPU: decrease Requests, increase Limits, Memory: decrease Requests, increase Limits.

E2E tests for Burstable class single container Pod that specifies CPU only:
1. Increase, decrease CPU - Requests only.
1. Increase, decrease CPU - Limits only.
1. Increase, decrease CPU - both Requests & Limits.

E2E tests for Burstable class single container Pod that specifies memory only:
1. Increase, decrease memory - Requests only.
1. Increase, decrease memory - Limits only.
1. Increase, decrease memory - both Requests & Limits.

E2E tests for Guaranteed class Pod with three containers (c1, c2, c3):
1. Increase CPU & memory for all three containers.
1. Decrease CPU & memory for all three containers.
1. Increase CPU, decrease memory for all three containers.
1. Decrease CPU, increase memory for all three containers.
1. Increase CPU for c1, decrease c2, c3 unchanged - no net CPU change.
1. Increase memory for c1, decrease c2, c3 unchanged - no net memory change.
1. Increase CPU for c1, decrease c2 & c3 - net CPU decrease for Pod.
1. Increase memory for c1, decrease c2 & c3 - net memory decrease for Pod.
1. Increase CPU for c1 & c3, decrease c2 - net CPU increase for Pod.
1. Increase memory for c1 & c3, decrease c2 - net memory increase for Pod.

### Resource Quota and Limit Ranges

Set up a namespace with ResourceQuota and a single, valid Pod.
1. Resize the Pod within resource quota - CPU only.
1. Resize the Pod within resource quota - memory only.
1. Resize the Pod within resource quota - both CPU and memory.
1. Resize the Pod to exceed resource quota - CPU only.
1. Resize the Pod to exceed resource quota - memory only.
1. Resize the Pod to exceed resource quota - both CPU and memory.

Set up a namespace with a min and max LimitRange and create a single, valid Pod.
1. Increase, decrease CPU within min/max bounds.
1. Increase CPU to exceed max value.
1. Decrease CPU to go below min value.
1. Increase memory to exceed max value.
1. Decrease memory to go below min value.

### Resize Policy Tests

Set up a guaranteed class Pod with two containers (c1 & c2).
1. No resize policy specified, defaults to NoRestart. Verify that CPU and
memory are resized without restarting containers.
1. NoRestart (cpu, memory) policy for c1, RestartContainer (cpu, memory) for c2.
Verify that c1 is resized without restart, c2 is restarted on resize.
1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU only,
verify container is resized without restart.
1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 memory only,
verify container is resized with restart.
1. NoRestart cpu, RestartContainer memory policy for c1. Resize c1 CPU & memory,
verify container is resized with restart.

### Backward Compatibility and Negative Tests

1. Verify that Node is allowed to update only a Pod's ResourcesAllocated field.
1. Verify that only the Node account is allowed to update the ResourcesAllocated field.
1. Verify that updating Pod Resources in workload template spec retains current
behavior:
- Updating Pod Resources in Job template is not allowed.
- Updating Pod Resources in Deployment template continues to result in Pod
being restarted with updated resources.
1. Verify that Pod updates by an older version of client-go don't result in current
values of ResourcesAllocated and ResizePolicy fields being dropped.
1. Verify that only CPU and memory resources are mutable by user.

TODO: Identify more cases

**Reviewer comment:**

Given this touches a field involved in pod spec, pod template spec, and workload controllers, we need tests to make sure introduction of this does not cause workloads to redeploy on API server upgrade (e.g. kubernetes/kubernetes#78633); tests that look something like what is described in kubernetes/kubernetes#78904, and which are actually run

## Graduation Criteria

TODO
### Alpha
- In-Place Pod Resources Update functionality is implemented for running Pods,
- LimitRanger and ResourceQuota handling are added,
- Resize Policies functionality is implemented,
- Unit tests and E2E tests covering basic functionality are added,
- E2E tests covering multiple containers are added.

### Beta
- VPA alpha integration of feature completed and any bugs addressed,
- E2E tests covering Resize Policy, LimitRanger, and ResourceQuota are added,
- Negative tests are identified and added.

### Stable
- VPA integration of feature moved to beta,
- User feedback (ideally from at least two distinct users) is green,
- No major bugs reported for three months.

## Implementation History

- 2018-11-06 - initial KEP draft created
- 2019-01-18 - implementation proposal extended
- 2019-03-07 - changes to flow control, updates per review feedback
- 2019-08-29 - updated design proposal
- 2019-10-25 - update key open items and move KEP to implementable
- 2020-01-06 - API review suggested changes incorporated
- 2020-01-13 - Test plan and graduation criteria added
- 2020-01-21 - Graduation criteria updated per review feedback