
KEP: in-place update of pod resources #686

Merged Oct 3, 2019 (22 commits)

Commits:
- cd94808 Move Karol Golab's draft KEP for In-place update of pod resources fro… (Jan 12, 2019)
- 7fb66f1 Update owning-sig to sig-autoscaling, add initial set of reviewers. (Jan 12, 2019)
- b8c1f4e Flow Control and few other sections added (kgolab, Jan 18, 2019)
- 5d00f9f Merge pull request #1 from kgolab/master (vinaykul, Jan 18, 2019)
- 9580642 Update KEP filename per latest template guidelines, add non-goal item. (Jan 22, 2019)
- b8d814e Merge remote-tracking branch 'upstream/master' (Mar 7, 2019)
- df1c8f8 Update flow control, clarify items per review, identify risks. (Mar 7, 2019)
- 17923eb Update policy name, clarify scheduler actions and policy precedence (Mar 11, 2019)
- e5052fc Add RetryPolicy API change, clarify transition of PodCondition fields… (Mar 12, 2019)
- 1194243 Update control flow per review, add notes on Pod Overhead, emptyDir (Mar 26, 2019)
- bfab6a3 Update API and flow control to avoid storing state in PodCondition (May 7, 2019)
- 69f9190 Rename PodSpec scheduler resource allocations & PodCondition, and cla… (May 14, 2019)
- 199a008 Key changes: (vinaykul, Jun 18, 2019)
- 574737c Update design so that Kubelet, instead of Scheduler, evicts lower pri… (vinaykul, Jun 19, 2019)
- 5bdcd57 1. Remove PreEmpting PodCondition. (vinaykul, Jul 9, 2019)
- bc9dc2b Extend PodSpec to hold accepted resource resize values, add resourcea… (vinaykul, Aug 26, 2019)
- 533c3c6 Update ResourceAllocated as ResourceList, clarify details of Kubelet … (vinaykul, Sep 3, 2019)
- 29a22b6 Restate Kubelet fault handling to minimum guarantees, clarify Schedul… (vinaykul, Sep 8, 2019)
- 20cbea6 Details of LimitRanger, ResourceQuota enforcement during Pod resize. (vinaykul, Sep 14, 2019)
- 0ed9505 ResourceQuota with resize uses Containers[i].Resources (vinaykul, Sep 17, 2019)
- c745563 Add note on VPA+HPA limitation for CPU, memory (vinaykul, Sep 17, 2019)
- 55c8e56 Add KEP approvers, minor clarifications (vinaykul, Sep 24, 2019)

File: keps/sig-autoscaling/20181106-in-place-update-of-pod-resources.md (375 additions, 0 deletions)

---
title: In-place Update of Pod Resources
authors:
- "@kgolab"
- "@bskiba"
- "@schylek"
- "@vinaykul"
owning-sig: sig-autoscaling
participating-sigs:
- sig-node
- sig-scheduling
reviewers:
- "@bsalamat"
- "@dashpole"
- "@derekwaynecarr"
- "@dchen1107"
- "@ahg-g"
- "@k82cn"
approvers:
- "@dchen1107"
- "@derekwaynecarr"
- "@ahg-g"
- "@mwielgus"
editor: TBD
creation-date: 2018-11-06
last-updated: 2018-11-06
status: provisional
see-also:
replaces:
superseded-by:
---

# In-place Update of Pod Resources

## Table of Contents

* [In-place Update of Pod Resources](#in-place-update-of-pod-resources)
  * [Table of Contents](#table-of-contents)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
    * [Non-Goals](#non-goals)
  * [Proposal](#proposal)
    * [API Changes](#api-changes)
      * [Container Resize Policy](#container-resize-policy)
      * [CRI Changes](#cri-changes)
    * [Kubelet and API Server interaction](#kubelet-and-api-server-interaction)
      * [Kubelet Restart Tolerance](#kubelet-restart-tolerance)
    * [Scheduler and API Server interaction](#scheduler-and-api-server-interaction)
    * [Flow Control](#flow-control)
      * [Container resource limit update ordering](#container-resource-limit-update-ordering)
      * [Notes](#notes)
    * [Affected Components](#affected-components)
    * [Future Enhancements](#future-enhancements)
    * [Risks and Mitigations](#risks-and-mitigations)
  * [Graduation Criteria](#graduation-criteria)
  * [Implementation History](#implementation-history)

## Summary

This proposal aims to allow Pod resource requests & limits to be updated
in-place, without restarting the Pod or its Containers.

The **core idea** behind the proposal is to make PodSpec mutable with regard to
Resources, denoting **desired** resources. Additionally, PodSpec is extended to
reflect resources **allocated** to a Pod, and PodStatus is extended to provide
information about **actual** resources applied to the Pod and its Containers.

This document builds upon [proposal for live and in-place vertical scaling][]
and [Vertical Resources Scaling in Kubernetes][].

[proposal for live and in-place vertical scaling]:
https://github.com/kubernetes/community/pull/1719
[Vertical Resources Scaling in Kubernetes]:
https://docs.google.com/document/d/18K-bl1EVsmJ04xeRq9o_vfY2GDgek6B6wmLjXw-kos4

## Motivation

Resources allocated to a Pod's Container(s) can require a change for various
reasons:
* load handled by the Pod has increased significantly, and current resources
are not sufficient,
* load has decreased significantly, and allocated resources are unused,
* resources have simply been set improperly.

Currently, changing resource allocation requires the Pod to be recreated since
the PodSpec's Container Resources field is immutable.

While many stateless workloads are designed to withstand such a disruption,
some are more sensitive, especially when running a low number of Pod replicas.

Moreover, for stateful or batch workloads, Pod restart is a serious disruption,
resulting in lower availability or higher cost of running.

Allowing Resources to be changed without recreating the Pod or restarting the
Containers addresses this issue directly.

### Goals

* Primary: allow changing Pod resource requests & limits without restarting
its Containers.
* Secondary: allow actors (users, VPA, StatefulSet, JobController) to decide
how to proceed if in-place resource resize is not possible.
* Secondary: allow users to specify which Pods and Containers can be resized
without a restart.

### Non-Goals

An explicit non-goal of this KEP is controlling the full lifecycle of a Pod
whose in-place resource resizing failed. This is left to the actors which
initiated the resizing.

Other identified non-goals are:
* changing Pod QoS class without a restart,
* changing resources of Init Containers without a restart,
* evicting lower-priority Pods to facilitate a Pod resize,
* updating extended resources or any other resource types besides CPU and
memory.

## Proposal

### API Changes

PodSpec becomes mutable with regard to Container resource requests and
limits. PodSpec is extended with information about the resources allocated on
the Node for the Pod. PodStatus is extended to show the actual resources
applied to the Pod and its Containers.

Thanks to the above:
* Pod.Spec.Containers[i].Resources becomes purely a declaration, denoting the
**desired** state of Pod resources,
* Pod.Spec.Containers[i].ResourcesAllocated (new object, type v1.ResourceList)
denotes the Node resources **allocated** to the Pod and its Containers,
* Pod.Status.ContainerStatuses[i].Resources (new object, type
v1.ResourceRequirements) shows the **actual** resources held by the Pod and
its Containers.

A new Pod subresource named 'resourceallocation' is introduced to allow
fine-grained access control that enables Kubelet to set or update resources
allocated to a Pod, and prevents the user or any other component from changing
the allocated resources.
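
To make the shape of these changes concrete, here is a rough sketch of the
proposed field additions as Go types (illustrative only; exact names,
placement, and types are subject to API review):

```go
// Illustrative sketch of the proposed v1 API additions; not the final
// definitions. ResourceRequirements, ResourceList and ResourceName are the
// existing core/v1 types; only new or re-interpreted fields are shown.
type Container struct {
	// ...existing fields...

	// Resources continues to hold the desired resources.
	Resources ResourceRequirements

	// ResourcesAllocated holds the Node resources allocated to this
	// Container. It is written only by Kubelet, via the new
	// pods/resourceallocation subresource.
	ResourcesAllocated ResourceList

	// ResizePolicy maps a resource type (cpu, memory) to the action taken
	// when that resource is resized (see Container Resize Policy below).
	ResizePolicy map[ResourceName]ResizePolicy
}

type ContainerStatus struct {
	// ...existing fields...

	// Resources reports the actual resources currently applied to the
	// running Container.
	Resources ResourceRequirements
}

// ResizePolicy names the per-resource resize behavior.
type ResizePolicy string

const (
	NoRestart        ResizePolicy = "NoRestart"
	RestartContainer ResizePolicy = "RestartContainer"
)
```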

#### Container Resize Policy

To provide fine-grained user control, PodSpec.Containers is extended with
ResizePolicy map (new object) for each resource type (CPU, memory):
> **Review thread:**
>
> **Contributor:** I don't think we want to use a map here. See the "Lists of
> named subobjects preferred over maps" API convention. This also still feels
> like quite a large API change just to support legacy Xmx applications, but
> I'll defer to approvers on whether this is justified.
>
> **@vinaykul (Member, Sep 2, 2019):** Using a map here is consistent with
> ResourcesAllocated (a v1.ResourceList map) or Resources - a map of Requests
> and Limits. I respect the API conventions, but if you visualize the two (see
> below), the map looks simpler considering the LHS strings are fixed, known
> system-defined keys (not user data, and not something that helps the user if
> they could name/alias it). Is that convention applicable to this case? If you
> look at the intent of @jbeda in those discussions, he is looking to keep
> things simple.
>
> ```yaml
>       containers:
>       - name: foo
>         resources:
>           limits:
>             cpu: "1"
>             memory: "1Gi"
>           requests:
>             cpu: "1"
>             memory: "1Gi"
>         resizePolicy:
>           cpu: NoRestart
>           memory: RestartContainer
> ```
>
> vs.
>
> ```yaml
>       containers:
>       - name: foo
>         resources:
>           limits:
>             cpu: "1"
>             memory: "1Gi"
>           requests:
>             cpu: "1"
>             memory: "1Gi"
>         resizePolicy:
>         - name: cpu
>           policy: NoRestart
>         - name: memory
>           policy: RestartContainer
> ```
>
> **Contributor:** Ack. It is fine to leave it as is for now. This is worth
> pointing out to API reviewers later to get their opinion on.

* NoRestart - the default value; resize Container without restarting it,
> **Review thread:**
>
> **@tedyu (Sep 18, 2019):** I wonder if there is a better name than
> ResizePolicy. The two options here are about restart. Would adding another
> value for RestartPolicy (RestartPolicyContainer) be better?
>
> **@vinaykul (Member):** We need to have this policy per container. I had
> thought about naming it RestartPolicy at first, but felt it would cause
> confusion with PodSpec.RestartPolicy. Besides, ResizePolicy sounded somewhat
> better as it relates to resize.

* RestartContainer - restart the Container in-place to apply new resource
values. (e.g. Java process needs to change its Xmx flag)
> **Review thread:**
>
> **Contributor:** How can the user forbid in-place updates for a given
> Container, e.g. because any resource change would require re-running Init
> Containers? I think there used to be an option which said "restart the whole
> Pod".
>
> **@vinaykul (Member):** Is there a concrete use case or scenario that
> requires a RestartPod policy? I removed it after the above discussion - I
> could not trace a use case for it and couldn't justify its need with
> sig-node.
>
> **@vinaykul (Sep 23, 2019):** Also, I'm wondering if this can be folded into
> RetryPolicy in VPA, where it lets the user specify this - similar to
> updateMode 'Recreate': if the user needs to re-run init for a resize, we
> evict the Pod and resize the replacement during admission. This is less ideal
> than in-place resize with restart, but the push has been to keep things as
> simple as possible for the Kubelet, and this is something that can be added
> later if there is a strong use case.
>
> **Reviewer:** I can see a case for a third policy here, which as a strawman I
> will call SignalContainerWINCH. This would allow the container to attempt to
> adjust its language runtime to conform to the new limits - e.g. a programmer
> determines that calling runtime.GOMAXPROCS(math.Ceil(numCPUs) + 1) results in
> less scheduler thrashing. However, such a signal would only be useful if pods
> are able to interrogate the system for their own resource limits. This is
> perhaps best left to future enhancements to in-place update and should not
> block the 1.17 implementation.
>
> **Reply:** On Linux you can always read /sys/fs/cgroup, can't you?

By using ResizePolicy, a user can mark Containers as safe (or unsafe) for
in-place resource updates. Kubelet uses it to determine the required action.

The flag is set separately for CPU and memory due to the observation that CPU
can usually be added or removed without much problem, whereas changes to
available memory are more likely to require a restart.

If more than one resource type with different policies is updated, the
RestartContainer policy takes precedence over the NoRestart policy.

Additionally, if RestartPolicy is 'Never', ResizePolicy should be set to
NoRestart in order to pass validation.

#### CRI Changes

Kubelet calls UpdateContainerResources CRI API which currently takes
*runtimeapi.LinuxContainerResources* parameter that works for Docker and Kata,
but not for Windows. This parameter changes to *runtimeapi.ContainerResources*,
which is runtime agnostic and will contain platform-specific information.
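
As a rough illustration, the runtime-agnostic parameter could be shaped as
below, expressed as generated Go types. The Windows counterpart and the exact
message layout are assumptions here, to be settled during CRI review:

```go
// Illustrative sketch only; not the final runtimeapi definition.
type ContainerResources struct {
	// Platform-specific resource settings; at most one is expected to be set.
	Linux   *LinuxContainerResources   // existing CRI message
	Windows *WindowsContainerResources // assumed Windows counterpart
}

// UpdateContainerResources would then carry ContainerResources instead of
// LinuxContainerResources directly, e.g.:
type UpdateContainerResourcesRequest struct {
	ContainerId string
	Resources   *ContainerResources
}
```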

### Kubelet and API Server Interaction

When a new Pod is created, Scheduler is responsible for selecting a suitable
Node that accommodates the Pod.

For a newly created Pod, Spec.Containers[i].ResourcesAllocated must match
Spec.Containers[i].Resources.Requests. When Kubelet admits a new Pod, values in
Spec.Containers[i].ResourcesAllocated are used to determine if there is enough
room to admit the Pod. Kubelet does not set Pod's ResourcesAllocated after
admitting a new Pod.

When a Pod resize is requested, Kubelet attempts to update the resources
allocated to the Pod and its Containers. Kubelet first checks if the new
desired resources can fit the Node's allocatable resources by computing the sum of
resources allocated (Pod.Spec.Containers[i].ResourcesAllocated) for all Pods in
the Node, except the Pod being resized. For the Pod being resized, it adds the
new desired resources (i.e. Spec.Containers[i].Resources.Requests) to the sum.
* If new desired resources fit, Kubelet accepts the resize by updating
Pod.Spec.Containers[i].ResourcesAllocated via pods/resourceallocation
subresource, and then proceeds to invoke UpdateContainerResources CRI API
to update the Container resource limits. Once all Containers are successfully
updated, it updates Pod.Status.ContainerStatuses[i].Resources to reflect the
new resource values.
* If new desired resources don't fit, Kubelet rejects the resize, and no
further action is taken.
- Kubelet retries the Pod resize at a later time.
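
A simplified sketch of the fit check described above (types and helpers here
are stand-ins for illustration, not actual Kubelet code):

```go
// resources is a simplified stand-in for v1.ResourceList,
// e.g. {"cpu": millicores, "memory": bytes}.
type resources map[string]int64

type container struct {
	DesiredRequests    resources // Spec.Containers[i].Resources.Requests
	ResourcesAllocated resources // Spec.Containers[i].ResourcesAllocated
}

type pod struct {
	UID        string
	Containers []container
}

func add(into, from resources) {
	for name, qty := range from {
		into[name] += qty
	}
}

func fits(used, allocatable resources) bool {
	for name, qty := range used {
		if qty > allocatable[name] {
			return false
		}
	}
	return true
}

// canAdmitResize sums ResourcesAllocated for every other Pod on the Node, adds
// the new desired requests of the Pod being resized, and compares the total
// against the Node's allocatable resources.
func canAdmitResize(allocatable resources, podsOnNode []pod, resized pod) bool {
	used := resources{}
	for _, p := range podsOnNode {
		if p.UID == resized.UID {
			continue // exclude the Pod being resized
		}
		for _, c := range p.Containers {
			add(used, c.ResourcesAllocated)
		}
	}
	for _, c := range resized.Containers {
		add(used, c.DesiredRequests)
	}
	return fits(used, allocatable)
}
```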

If multiple Pods need resizing, they are handled sequentially in the order in
which Pod additions and updates arrive at Kubelet.

Scheduler may, in parallel, assign a new Pod to the Node because it uses cached
Pods to compute Node allocatable values. If this race condition occurs, Kubelet
resolves it by rejecting that new Pod if the Node has no room after the Pod
resize.

#### Kubelet Restart Tolerance

If Kubelet were to restart amidst handling a Pod resize, then upon restart, all
Pods are admitted at their current Pod.Spec.Containers[i].ResourcesAllocated
values, and resizes are handled after all existing Pods have been added. This
ensures that resizes don't affect previously admitted existing Pods.

### Scheduler and API Server Interaction

Scheduler continues to use Pod's Spec.Containers[i].Resources.Requests for
scheduling new Pods, and continues to watch Pod updates and keep its cache up
to date.
It uses the cached Pod's Spec.Containers[i].ResourcesAllocated values to
compute the Node resources allocated to Pods. This ensures that it always uses
the most recently available resource allocations in making new Pod scheduling
decisions.

### Flow Control

The following steps denote a typical flow of an in-place resize operation for a
Pod with ResizePolicy set to NoRestart for all its Containers.

1. Initiating actor updates Pod's Spec.Containers[i].Resources via PATCH verb.
1. API Server validates the new Resources. (e.g. Limits are not below
Requests, QoS class doesn't change, ResourceQuota not exceeded...)
1. API Server calls all Admission Controllers to verify the Pod Update.
* If any of the Controllers reject the update, API Server responds with an
appropriate error message.
1. API Server updates PodSpec object with the new desired Resources.
1. Kubelet observes that Pod's Spec.Containers[i].Resources.Requests and
Spec.Containers[i].ResourcesAllocated differ. It checks its Node allocatable
resources to determine if the new desired Resources fit the Node.
* _Case 1_: Kubelet finds new desired Resources fit. It accepts the resize
and sets Spec.Containers[i].ResourcesAllocated equal to the values of
Spec.Containers[i].Resources.Requests by invoking resourceallocation
subresource. It then applies the new cgroup limits to the Pod and its
Containers, and once successfully done, sets Pod's
Status.ContainerStatuses[i].Resources to reflect the desired resources.
- If at the same time, a new Pod was assigned to this Node against the
capacity taken up by this resource resize, that new Pod is rejected by
Kubelet during admission if Node has no more room.
* _Case 2_: Kubelet finds that the new desired Resources do not fit.
- If Kubelet determines there isn't enough room, it simply retries the Pod
resize at a later time.
1. Scheduler uses cached Pod's Spec.Containers[i].ResourcesAllocated to compute
resources available on the Node while a Pod resize may be in progress.
* If a new Pod is assigned to that Node in parallel, it can temporarily
result in the actual sum of Pod resources for the Node exceeding the Node's
allocatable resources. This is resolved when Kubelet rejects that new Pod
during admission due to lack of room.
* Once the Kubelet that accepted a parallel Pod resize updates that Pod's
Spec.Containers[i].ResourcesAllocated, and the Scheduler subsequently updates
its cache, accounting will reflect the updated Pod resources in future
computations and scheduling decisions.
1. The initiating actor (e.g. VPA) observes the following:
* _Case 1_: Pod's Spec.Containers[i].ResourcesAllocated values have changed
and matches Spec.Containers[i].Resources.Requests, signifying that desired
resize has been accepted, and Pod is being resized. The resize operation
is complete when Pod's Status.ContainerStatuses[i].Resources and
Spec.Containers[i].Resources match.
* _Case 2_: Pod's Spec.Containers[i].ResourcesAllocated remains unchanged,
and continues to differ from desired Spec.Containers[i].Resources.Requests.
After a certain (user defined) timeout, initiating actor may take alternate
action. For example, based on Retry policy, initiating actor may:
- Evict the Pod to trigger a replacement Pod with new desired resources,
- Do nothing and let Kubelet back off and later retry the in-place resize.
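
As an illustration of step 1 above, the initiating actor's request could carry
a patch like the following (container name and resource values are
hypothetical):

```go
// Hypothetical strategic-merge-patch body for resizing a container named
// "app". Once Resources are mutable, this would be sent as an HTTP PATCH to
// /api/v1/namespaces/<namespace>/pods/<name>, for example through client-go's
// Pods(namespace).Patch(...) with types.StrategicMergePatchType.
const resizePatch = `{
  "spec": {
    "containers": [
      {
        "name": "app",
        "resources": {
          "requests": {"cpu": "1500m", "memory": "2Gi"},
          "limits":   {"cpu": "1500m", "memory": "2Gi"}
        }
      }
    ]
  }
}`
```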

#### Container resource limit update ordering

When in-place resize is requested for multiple Containers in a Pod, Kubelet
updates resource limit for the Pod and its Containers in the following manner:
1. If resource resizing results in net-increase of a resource type (CPU or
Memory), Kubelet first updates Pod-level cgroup limit for the resource
type, and then updates the Container resource limit.
1. If resource resizing results in net-decrease of a resource type, Kubelet
first updates the Container resource limit, and then updates Pod-level
cgroup limit.
1. If resource update results in no net change of a resource type, only the
Container resource limits are updated.

In all the above cases, Kubelet applies Container resource limit decreases
before applying limit increases.
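
This ordering can be summarized in a short sketch (illustrative pseudocode in
Go, not actual Kubelet code):

```go
// applyResize applies new cgroup limits for one resource type. netChange is
// the pod-level delta for that resource (new total minus old total).
// updatePodCgroup and updateContainers are stand-ins for the real operations;
// updateContainers is assumed to apply per-Container decreases before
// increases, as described above.
func applyResize(netChange int64, updatePodCgroup, updateContainers func() error) error {
	switch {
	case netChange > 0:
		// Net increase: raise the Pod-level limit first so Container limits
		// never exceed the Pod sandbox limit.
		if err := updatePodCgroup(); err != nil {
			return err
		}
		return updateContainers()
	case netChange < 0:
		// Net decrease: lower Container limits first, then the Pod-level limit.
		if err := updateContainers(); err != nil {
			return err
		}
		return updatePodCgroup()
	default:
		// No net change: only Container limits are updated.
		return updateContainers()
	}
}
```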

#### Notes

* If CPU Manager policy for a Node is set to 'static', then only integral
values of CPU resize are allowed. If non-integral CPU resize is requested
for a Node with 'static' CPU Manager policy, that resize is rejected, and
an error message is logged to the event stream.
* To avoid races and possible gamification, all components will use Pod's
Spec.Containers[i].ResourcesAllocated when computing resources used by Pods.
* If additional resize requests arrive when a Pod is being resized, those
requests are handled after completion of the resize that is in progress, and
the resize is driven towards the latest desired state.
* Lowering memory limits may not always take effect quickly if the application
is holding on to pages. Kubelet will use a control loop to set the memory
limits near usage in order to force a reclaim, and will update the Pod's
Status.ContainerStatuses[i].Resources only when the limit is at the desired
value (a sketch of this step follows this list).
* Impact of Pod Overhead: Kubelet adds Pod Overhead to the resize request to
determine if in-place resize is possible.
* Impact of memory-backed emptyDir volumes: If memory-backed emptyDir is in
use, Kubelet will clear out any files in emptyDir upon Container restart.
* At this time, Vertical Pod Autoscaler should not be used with Horizontal Pod
Autoscaler on CPU, memory. This enhancement does not change that limitation.
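
For the note on lowering memory limits, a minimal sketch of the per-sync step,
assuming it is driven by Kubelet's periodic sync (the exact mechanism is an
open design detail, not a settled implementation):

```go
// syncLowerMemoryLimit is called on each sync for a Container whose memory
// limit is being lowered. Illustrative only; thresholds and back-off are
// assumptions.
func syncLowerMemoryLimit(desired, usage int64, setLimit func(int64) error) (done bool, err error) {
	if desired >= usage {
		// Usage has dropped to (or below) the target: apply the desired limit;
		// Status.ContainerStatuses[i].Resources is updated only at this point.
		return true, setLimit(desired)
	}
	// Otherwise set the limit near current usage to pressure reclaim, and try
	// again on the next sync.
	return false, setLimit(usage)
}
```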

### Affected Components

Pod v1 core API:
* extended model,
* new subresource,
* added validation.

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
* for ResourceQuota, the podEvaluator.Handler implementation is modified to
allow Pod updates, and to verify that the sum of
Pod.Spec.Containers[i].Resources for all Pods in the Namespace does not exceed
quota,
* for LimitRanger we check that a resize request does not violate the min and
max limits specified in LimitRange for the Pod's namespace.

Kubelet:
* set Pod's Status.ContainerStatuses[i].Resources for Containers upon placing
a new Pod on the Node,
* update Pod's Spec.Containers[i].ResourcesAllocated upon resize,
* change UpdateContainerResources CRI API to work for both Linux & Windows.

Scheduler:
* compute resource allocations using Pod.Spec.Containers[i].ResourcesAllocated.

Controllers:
* propagate Template resources update to running Pod instances.

Other components:
* check how the change of meaning of resource requests influence other
Kubernetes components.

### Future Enhancements

1. Kubelet (or Scheduler) evicts lower-priority Pods from the Node to make room
for a resize. Preemption by Kubelet may be simpler and offer lower latencies.
1. Allow ResizePolicy to be set on Pod level, acting as default if (some of)
the Containers do not have it set on their own.
1. Extend ResizePolicy to separately control resource increase and decrease
(e.g. a Container can be given more memory in-place but decreasing memory
requires Container restart).
1. Extend Node Information API to report the CPU Manager policy for the Node,
and enable validation of integral CPU resize for nodes with 'static' CPU
Manager policy.
1. Allow resizing local ephemeral storage.
1. Allow resource limits to be updated (VPA feature).

### Risks and Mitigations

1. Backward compatibility: When Pod.Spec.Containers[i].Resources becomes
representative of desired state, and Pod's true resource allocations are
tracked in Pod.Spec.Containers[i].ResourcesAllocated, applications that
query PodSpec and rely on Resources in PodSpec to determine resource
allocations will see values that may not represent actual allocations. As
a mitigation, this change needs to be documented and highlighted in the
release notes, and in top-level Kubernetes documents.
1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
could be in use, and approaches such as setting limit near current usage may
be required. This issue needs further investigation.

## Graduation Criteria

TODO

## Implementation History

- 2018-11-06 - initial KEP draft created
- 2019-01-18 - implementation proposal extended
- 2019-03-07 - changes to flow control, updates per review feedback
- 2019-08-29 - updated design proposal