---
title: In-place Update of Pod Resources
authors:
  - "@kgolab"
  - "@bskiba"
  - "@schylek"
  - "@vinaykul"
owning-sig: sig-autoscaling
participating-sigs:
  - sig-node
  - sig-scheduling
reviewers:
  - "@bsalamat"
  - "@derekwaynecarr"
  - "@dchen1107"
approvers:
  - TBD
editor: TBD
creation-date: 2018-11-06
last-updated: 2018-11-06
status: provisional
see-also:
replaces:
superseded-by:
---

# In-place Update of Pod Resources

## Table of Contents

* [In-place Update of Pod Resources](#in-place-update-of-pod-resources)
  * [Table of Contents](#table-of-contents)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
    * [Non-Goals](#non-goals)
  * [Proposal](#proposal)
    * [API Changes](#api-changes)
      * [CRI Changes](#cri-changes)
    * [Flow Control](#flow-control)
      * [Transitions of ResourceResizeRequired condition](#transitions-of-resourceresizerequired-condition)
      * [Container resource limit update ordering](#container-resource-limit-update-ordering)
      * [Container resource limit update failure handling](#container-resource-limit-update-failure-handling)
      * [Notes](#notes)
    * [Affected Components](#affected-components)
    * [Possible Extensions](#possible-extensions)
    * [Risks and Mitigations](#risks-and-mitigations)
  * [Graduation Criteria](#graduation-criteria)
  * [Implementation History](#implementation-history)
  * [Alternatives](#alternatives)

## Summary

This proposal aims at allowing Pod resource requests & limits to be updated
in-place, without a need to restart the Pod or its Containers.

The **core idea** behind the proposal is to make PodSpec mutable with regards to
Resources, denoting **desired** resources. Additionally, PodStatus is extended
to provide information about **actual** resource allocation.

This document builds upon [proposal for live and in-place vertical scaling][]
and [Vertical Resources Scaling in Kubernetes][].

[proposal for live and in-place vertical scaling]: https://github.com/kubernetes/community/pull/1719
[Vertical Resources Scaling in Kubernetes]: https://docs.google.com/document/d/18K-bl1EVsmJ04xeRq9o_vfY2GDgek6B6wmLjXw-kos4/edit?ts=5b96bf40

## Motivation

Resources allocated to a Pod's Container can require a change for various reasons:
* load handled by the Pod has increased significantly, and current resources are
  not sufficient,
* load has decreased significantly, and allocated resources are unused and wasted,
* resources have simply been set improperly.

Currently, changing resource allocation requires the Pod to be recreated, since
the PodSpec's Container Resources field is immutable.

While many stateless workloads are designed to withstand such a disruption, some
are more sensitive, especially when using a low number of Pod replicas.

Moreover, for stateful or batch workloads, a Pod restart is a serious disruption,
resulting in lower availability or a higher cost of running.

Allowing Resources to be changed without recreating the Pod or restarting the
Containers addresses this issue directly.

### Goals

* Primary: allow changing Pod resource requests & limits without restarting its
  Containers.
* Secondary: allow actors (users, VPA, StatefulSet, JobController) to decide
  how to proceed if in-place resource resize is not possible.
* Secondary: allow users to specify which Pods and Containers can be resized
  without a restart.

### Non-Goals

This KEP explicitly does not aim to control the full life-cycle of a Pod which
failed in-place resource resizing. This should be handled by the actors which
initiated the resizing.

Other identified non-goals are:
* changing Pod QoS class without a restart,
* changing resources of Init Containers without a restart,
* updating extended resources or any other resource types besides CPU and memory.

## Proposal

### API Changes

PodSpec becomes mutable with regards to Container resources requests and limits.
Additionally, PodSpec becomes a Pod subresource to allow fine-grained access control.

PodStatus is extended with information about actually allocated Container resources.

Thanks to the above:
* PodSpec.Container.ResourceRequirements becomes purely a declaration,
  denoting the **desired** state of the Pod,
* PodStatus.ContainerStatus.ResourceAllocated (new object) denotes the **actual**
  state of the Pod resources.

To distinguish between possible states of a Pod resource update,
a new PodCondition named ResourceResizeRequired is added, with the following states:

* (empty) - the default value; a resource update awaits reconciliation
  if ResourceRequirements differs from ResourceAllocated,
* Requested - the Scheduler determined that in-place resource resizing is possible,
  and requested the Kubelet to update the Pod's resource allocations and limits,
* Awaiting - awaiting resources to be freed (e.g. via pre-emption),
* Failed - resource resizing could not be performed in-place,
  but might be possible if some conditions change,
* Rejected - the resource update was rejected by any of the components involved.

To provide some fine-grained control to the user,
PodSpec.Container.ResourceRequirements is extended with a ResizePolicy flag
for each resource type (CPU, memory):
* NoRestart - the default value; resize the Container without restarting it,
* RestartContainer - restart the Container in-place to apply new resource
  values (e.g. a Java process needs to change its Xmx flag),
* RestartPod - restart the whole Pod in-place to apply new resource values
  (e.g. the Pod requires its Init Containers to re-run).

> Review comments:
> * How can the user forbid in-place updates for a given Container, e.g. because any resource change would require re-running Init Containers?
> * Is there a concrete use case or scenario that requires the RestartPod policy? I removed it after the above discussion; I could not trace a use case for it and couldn't justify its need with sig-node.
> * Also, I'm wondering if this can be folded into RetryPolicy in VPA, where it lets the user specify this, similar to updateMode 'Recreate': if the user needs to re-run init for resize, we evict the Pod and resize the replacement during admission. This is less ideal than in-place resize with restart, but the push has been to keep things as simple as possible for the Kubelet, and this is something that can be added later if there is a strong use case.
> * I can see a case for a third policy here, which as a strawman I will call ... . However, such a signal would only be useful if pods are able to interrogate the system for their own resource limits. This is perhaps best left to future enhancements to in-place update and should not block 1.17 implementation.
> * On Linux you can always read /sys/fs/cgroup, can't you?

By using the ResizePolicy flag, the user can mark Containers or Pods as safe
(or unsafe) for in-place resource update.

This flag is used by the Kubelet to determine the actions needed. This flag **may**
be used by the actors starting the update to decide if the process should be started
at all (for example, VPA might decide to evict a Pod with the RestartPod policy).

Setting the flag to separately control CPU & memory is due to an observation
that usually CPU can be added or removed without much problem, whereas
changes to available memory are more likely to require restarts.

If more than one resource type with different policies is updated, the
RestartPod policy takes precedence over RestartContainer, which in turn takes
precedence over NoRestart.

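The Go sketch below illustrates how these additions could look as API types. It is
illustrative only: the field names and their placement mirror the terms used above
(ResourceAllocated, ResizePolicy, ResourceResizeRequired) and would be finalized
during API review, and placeholder types stand in for the existing core v1 types.

```go
// Illustrative sketch of the proposed API additions; names follow the KEP
// text and are subject to API review. Placeholder types stand in for the
// existing v1 core API types.
package v1sketch

type ResourceName string
type Quantity string // stand-in for resource.Quantity
type ResourceList map[ResourceName]Quantity

// ResizePolicy selects, per resource type, how a resize is applied.
type ResizePolicy string

const (
	NoRestart        ResizePolicy = "NoRestart"        // default: apply without restart
	RestartContainer ResizePolicy = "RestartContainer" // restart the Container to apply
	RestartPod       ResizePolicy = "RestartPod"       // restart the whole Pod to apply
)

// ResourceRequirements (in PodSpec.Container) remains the *desired* state and
// gains a per-resource resize policy.
type ResourceRequirements struct {
	Limits       ResourceList
	Requests     ResourceList
	ResizePolicy map[ResourceName]ResizePolicy // keyed by "cpu", "memory"
}

// ContainerStatus (in PodStatus) reports the *actual* allocation.
type ContainerStatus struct {
	// ...existing fields elided...
	ResourceAllocated ResourceList
}

// PodConditionType gains a new condition tracking the resize process; its
// value is one of "" (empty), Requested, Awaiting, Failed, Rejected.
type PodConditionType string

const ResourceResizeRequired PodConditionType = "ResourceResizeRequired"
```
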
#### CRI Changes

The Kubelet calls the UpdateContainerResources CRI API, which currently takes a
*runtimeapi.LinuxContainerResources* parameter that works for Docker and Kata,
but not for Windows. This parameter is changed to *runtimeapi.ContainerResources*,
which is runtime agnostic.

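A rough Go sketch of what the runtime-agnostic message could look like is shown
below. The existing LinuxContainerResources fields are abbreviated, the Windows
counterpart is a hypothetical illustration, and the authoritative definition
would live in the CRI protobuf.

```go
// Illustrative sketch of the proposed CRI change: wrap the Linux-specific
// resources message in a runtime-agnostic ContainerResources message so that
// Windows runtimes can be supported too. Field names mirror the KEP text and
// the existing LinuxContainerResources fields; the real definition may differ.
package crisketch

// LinuxContainerResources mirrors the existing Linux-specific message.
type LinuxContainerResources struct {
	CpuPeriod          int64
	CpuQuota           int64
	CpuShares          int64
	MemoryLimitInBytes int64
	// ...other existing fields elided...
}

// WindowsContainerResources is a hypothetical counterpart for Windows runtimes.
type WindowsContainerResources struct {
	CpuCount           int64
	CpuMaximum         int64
	MemoryLimitInBytes int64
}

// ContainerResources is the proposed runtime-agnostic wrapper.
type ContainerResources struct {
	Linux   *LinuxContainerResources
	Windows *WindowsContainerResources
}

// UpdateContainerResourcesRequest would carry ContainerResources instead of
// LinuxContainerResources.
type UpdateContainerResourcesRequest struct {
	ContainerId string
	Resources   *ContainerResources
}
```
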
### Flow Control

The following steps denote a typical flow of an in-place resize process for a Pod
with ResizePolicy set to NoRestart for all its Containers.

1. The initiating actor updates ResourceRequirements using the PATCH verb.
1. API Server validates the new ResourceRequirements
   (e.g. limits are not below requested resources, QoS class does not change).
1. API Server calls all Admission Controllers to verify the Pod Update.
   * If any of the Controllers rejects the update, the
     ResourceResizeRequired PodCondition is set to Rejected.
1. API Server updates the PodSpec object and clears the ResourceResizeRequired condition.
1. Scheduler observes that ResourceRequirements and ResourceAllocated differ.
   It checks its cache to determine if in-place resource resizing is possible.
   * If the Node has capacity to accommodate the new resource values, it updates
     its resource cache to use max(ResourceRequirements, ResourceAllocated),
     and sets the ResourceResizeRequired PodCondition to Requested.
   * If required, it pre-empts lower-priority Pods, setting the
     ResourceResizeRequired PodCondition to Awaiting. Once the
     lower-priority Pods are evicted, Scheduler clears the
     ResourceResizeRequired PodCondition and the flow continues.
   * If the Node does not have capacity to accommodate the new resource values,
     it sets the ResourceResizeRequired PodCondition to Failed.
1. Kubelet observes that the ResourceResizeRequired PodCondition has been set to
   Requested, and checks its Node allocatable resources against the new
   ResourceRequirements for fit.
   * Kubelet sees that the new ResourceRequirements fits, updates the PodStatus
     ResourceAllocated to match ResourceRequirements, clears the
     ResourceResizeRequired PodCondition, and then applies the new
     cgroup limits to the Pod and its running Containers.
   * Kubelet sees that the new ResourceRequirements does not fit the Node's
     allocatable resources and sets the ResourceResizeRequired PodCondition to
     Failed. This can happen due to a race condition with multiple schedulers.
1. Scheduler observes that the PodCondition has changed.
   * Case 1: the ResourceResizeRequired PodCondition is clear, and
     ResourceRequirements matches ResourceAllocated. Scheduler updates its cache
     to use the updated ResourceAllocated values.
   * Case 2: the ResourceResizeRequired PodCondition is Failed. Scheduler updates
     its cache to use the unchanged ResourceAllocated values for accounting.
1. The initiating actor observes that ResourceAllocated has changed.
   * Case 1: ResourceRequirements and ResourceAllocated match again, signifying
     a successful completion of in-place Pod resource resizing.
   * Case 2: the ResourceResizeRequired PodCondition shows Failed, and the
     initiating actor may take action. A few possible examples (perhaps
     controlled by a Retry policy):
     * The initiating actor (user/VPA) handles it, for example by deleting the
       Pod to trigger a replacement Pod with new resources for scheduling.
     * The initiating actor is a Controller (Job, Deployment, ...), and it clears
       the ResourceResizeRequired PodCondition (based on other Pods departing,
       thus freeing resources), and Scheduler retries in-place resource resizing.

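To make the first and last steps above concrete, here is a hedged client-go
sketch of an initiating actor: it patches the desired resources and then polls
for the outcome. The ResourceResizeRequired condition and the ResourceAllocated
field are proposed by this KEP and are not present in released client-go types,
so the outcome check below only inspects the generic condition list; the pod
name, namespace, and container name are illustrative.

```go
// Hedged sketch of an initiating actor: patch desired resources, then poll for
// the outcome of the proposed resize flow.
package main

import (
	"context"
	"fmt"
	"time"

	corev1 "k8s.io/api/core/v1"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/types"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	cfg, err := clientcmd.BuildConfigFromFlags("", clientcmd.RecommendedHomeFile)
	if err != nil {
		panic(err)
	}
	client := kubernetes.NewForConfigOrDie(cfg)
	ctx := context.Background()

	// Step 1: the initiating actor updates desired resources with a PATCH.
	patch := []byte(`{"spec":{"containers":[{"name":"app",
	  "resources":{"requests":{"cpu":"1","memory":"2Gi"},
	               "limits":{"cpu":"2","memory":"2Gi"}}}]}}`)
	if _, err := client.CoreV1().Pods("default").Patch(
		ctx, "example-pod", types.StrategicMergePatchType, patch, metav1.PatchOptions{}); err != nil {
		panic(err)
	}

	// Last step: poll until the proposed ResourceResizeRequired condition
	// reports an outcome (cleared on success, Failed/Rejected otherwise).
	for i := 0; i < 30; i++ {
		pod, err := client.CoreV1().Pods("default").Get(ctx, "example-pod", metav1.GetOptions{})
		if err != nil {
			panic(err)
		}
		for _, c := range pod.Status.Conditions {
			if c.Type == corev1.PodConditionType("ResourceResizeRequired") {
				fmt.Println("resize condition:", c.Status, c.Reason)
			}
		}
		// A real actor would also compare the proposed PodStatus
		// ResourceAllocated field against the desired ResourceRequirements here.
		time.Sleep(2 * time.Second)
	}
}
```
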
#### Transitions of ResourceResizeRequired condition

The following lists the possible transitions of the ResourceResizeRequired
condition; each transition is labeled with the number of its trigger, listed below.

```text
  (empty)   --1--> Awaiting
  (empty)   --2--> Requested
  Awaiting  --2--> Requested
  Requested --3--> (empty)
  (empty)   --4--> Failed
  Requested --4--> Failed
  (empty)   --5--> Rejected
  Failed    --6--> (empty)
```

1. Scheduler, on starting pre-emption.
1. Scheduler, after pre-emption or when no pre-emption is needed.
1. Kubelet, on successful resizing.
1. Scheduler or Kubelet, if there is not enough space on the Node.
1. Any Controller, on a permanent issue.
1. Initiating actor, on retry.

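For reference, the same transition set expressed as a small Go table
(documentation-only pseudocode in Go form, not a proposed API; state names
follow the condition values defined above, with "" standing for the empty default):

```go
// A compact restatement of the transitions above as a lookup table.
package transitions

type ResizeState string

const (
	Empty     ResizeState = ""
	Requested ResizeState = "Requested"
	Awaiting  ResizeState = "Awaiting"
	Failed    ResizeState = "Failed"
	Rejected  ResizeState = "Rejected"
)

// allowedTransitions[from] lists the states reachable from "from", matching
// the numbered triggers 1-6 above.
var allowedTransitions = map[ResizeState][]ResizeState{
	Empty:     {Awaiting /*1*/, Requested /*2*/, Failed /*4*/, Rejected /*5*/},
	Awaiting:  {Requested /*2*/},
	Requested: {Empty /*3*/, Failed /*4*/},
	Failed:    {Empty /*6*/},
}
```
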
#### Container resource limit update ordering

When in-place resize is desired for multiple Containers in a Pod, Kubelet updates
the resource limits of the Containers as detailed below (see the sketch at the
end of this section):
1. If resource resizing results in a net increase of a resource type (CPU or memory),
   Kubelet first updates the Pod-level cgroup limit for the resource type, and then
   updates the Container resource limit.
1. If resource resizing results in a net decrease of a resource type, Kubelet first
   updates the Container resource limit, and then updates the Pod-level cgroup limit.
1. If the resource update results in no net change of a resource type, only the
   Container resource limits are updated.

In all the above cases, Kubelet applies Container resource limit decreases before
applying limit increases.

> Review comments:
> * As noted earlier, for CPU this is immediate. For memory, the kubelet will need to induce pressure on the cgroup by setting a value based on its current usage.
> * This would be handled by the runtime, no? What's the expected behavior for the runtime, and what's the timeout duration for the CRI call?
> * Yes, this should be handled by the runtime. The main intent of this section is to clarify how the limit update (UpdateContainerResources CRI call) should be ordered when multiple containers are being resized in a request. For example, if Pod sum(memory) = 5G with containers c1 (2G) and c2 (3G), and a resize requests c1 (4G) and c2 (2G), we should set the pod limit to 6G first, then update the c2 limit before c1.
> * If updating the container resource limit (through CRI, I assume) fails, is the cgroup reverted? From a Windows standpoint there's no cgroup managed by the kubelet; 100% of this will need to be done in CRI.
> * I feel the best course here is to restart the container; imho this is a last-resort action. The rationale being that if a limit update fails, there is a possibility of a revert action failing as well if we were to try to roll back things. Is there a better alternative?
> * @PatrickLang what do you mean, no cgroup managed by kubelet? All cgroup (or whatever Windows equivalent) settings should be modified by CRI, is that not the case?
> * It's probably worth evaluating how the CRI implementor should handle this for non-burstable resources like memory. Is the kubelet going to be responsible for inducing reclaim by setting limits closer to current usage, but not actually at the resource allocated value? If we just pass the resource allocated value to the CRI, it will not be sufficient for us to use this API to induce reclaim, so I think I prefer that the kubelet keeps a heuristic for how it handles scale-down of burstable resources.
> * "Restarting the container" can violate the ...
> * I'm also concerned about hard-to-define behavior when scaling down resources for CRI implementations.
> * The RestartContainer policy seems to have a use case where Java apps using the Xmx flag won't be able to use the increased memory unless the app is restarted. Could you please elaborate on what might turn out unexpected on scale-down?

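A minimal Go sketch of this ordering for a single resource type is shown below.
The helpers setPodCgroupLimit and updateContainerLimit are hypothetical stand-ins
for the Kubelet's cgroup manager and the UpdateContainerResources CRI call; the
real Kubelet code paths differ.

```go
// Sketch of the ordering rule: pod-level cgroup is raised before container
// limits on a net increase, lowered after them on a net decrease, and container
// decreases are applied before container increases.
package ordering

import "sort"

type containerUpdate struct {
	name     string
	oldLimit int64
	newLimit int64
}

// applyResize orders the limit updates for one resource type (e.g. memory,
// limits in bytes).
func applyResize(updates []containerUpdate, oldPodLimit int64,
	setPodCgroupLimit func(int64), updateContainerLimit func(name string, limit int64)) {

	newPodLimit := oldPodLimit
	for _, u := range updates {
		newPodLimit += u.newLimit - u.oldLimit
	}

	if newPodLimit > oldPodLimit {
		setPodCgroupLimit(newPodLimit) // net increase: grow the pod cgroup first
	}

	// Apply container decreases before increases.
	sort.Slice(updates, func(i, j int) bool {
		return (updates[i].newLimit - updates[i].oldLimit) < (updates[j].newLimit - updates[j].oldLimit)
	})
	for _, u := range updates {
		updateContainerLimit(u.name, u.newLimit)
	}

	if newPodLimit < oldPodLimit {
		setPodCgroupLimit(newPodLimit) // net decrease: shrink the pod cgroup last
	}
}
```

With the example from the review discussion (c1: 2G to 4G, c2: 3G to 2G, pod
5G to 6G), this ordering raises the pod limit to 6G first, then shrinks c2
before growing c1.
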
#### Container resource limit update failure handling

For simplicity, if a Container resource limit update fails, Kubelet restarts the
Container in-place to allow the new limits to take effect, and the action is logged.

#### Notes

* To avoid races and possible gamification, all components should use
  max(ResourceRequirements, ResourceAllocated) when computing resources
  used by a Pod. TBD whether this can be weakened when the ResourceResizeRequired
  condition is set to Rejected, or whether the initiating actor should update
  ResourceRequirements back to reclaim resources.
* If another resource update arrives when a previous update is being handled,
  that update and all subsequent updates should be buffered at the Controller,
  and applied upon successful or failed completion of the update that is in progress.
* Impact of memory-backed emptyDir volumes: TBD - investigation needed.

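As an illustration of the max(ResourceRequirements, ResourceAllocated) rule, here
is a hedged Go sketch of a per-resource maximum over two resource lists; the
function itself is hypothetical, while corev1.ResourceList and Quantity.Cmp are
the real types it operates on.

```go
// Hedged sketch: maxResourceList is a hypothetical helper, not an existing
// Kubernetes function.
package accounting

import (
	corev1 "k8s.io/api/core/v1"
)

// maxResourceList returns, per resource name, the larger of the desired
// (ResourceRequirements) and actual (ResourceAllocated) quantities, which is
// the value components should account for while a resize is in flight.
func maxResourceList(desired, allocated corev1.ResourceList) corev1.ResourceList {
	out := corev1.ResourceList{}
	for name, q := range desired {
		out[name] = q
	}
	for name, q := range allocated {
		if cur, ok := out[name]; !ok || q.Cmp(cur) > 0 {
			out[name] = q
		}
	}
	return out
}
```
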
### Affected Components

Pod v1 core API:
* extended model,
* added validation.

Admission Controllers: LimitRanger and ResourceQuota need to support Pod Updates:
* for ResourceQuota, it should be enough to change the podEvaluator.Handler
  implementation to allow Pod updates; max(ResourceRequirements, ResourceAllocated)
  should be used to be in line with current ResourceQuota behavior,
  which blocks resources before they are used (e.g. for Pending Pods),
* for LimitRanger: TBD.

Kubelet:
* support in-place resource management,
* set PodStatus ResourceAllocated for Containers on placing the Pod on a Node,
* change the UpdateContainerResources CRI API so that it works for both Linux and Windows.

Scheduler:
* determine if in-place resize is possible, and update its cache depending on the resizing outcome.

Controllers:
* propagate Template resources updates to running Pod instances,
* initiate resource update retries (controlled by a retry policy) for Pods that failed resizing.

Other components:
* check how the change of meaning of resource requests influences other Kubernetes components.

### Possible Extensions

1. Allow resource limits to be updated too (VPA feature).
1. Allow ResizePolicy to be set at the Pod level, acting as the default if
   (some of) the Containers do not have it set on their own.
1. Extend the ResizePolicy flag to separately control resource increase and
   decrease (e.g. a Container can be given more memory in-place, but decreasing
   memory requires a Container restart).

### Risks and Mitigations

1. Backward compatibility: when Resources in PodSpec becomes representative of
   the desired state, and the Pod's true resource allocations are tracked in
   PodStatus, applications that query PodSpec and rely on Resources in PodSpec
   to determine resource usage will see values that may not represent actual
   allocations at the time of query. To mitigate, this change needs to be
   documented and highlighted in the release notes and in top-level Kubernetes
   documents.
1. Resizing memory lower: lowering cgroup memory limits may not work, as pages
   could be in use, and approaches such as setting the limit near current usage
   may be required. This issue needs further investigation.

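As an illustration of the "set the limit near current usage" approach mentioned
above, here is a hedged sketch; the helper names and the step-down strategy are
assumptions for illustration, not a settled design.

```go
// Hedged sketch of one possible strategy for lowering a memory limit: step the
// cgroup limit down toward the target, never below current usage, letting
// reclaim make progress between steps. All helpers here are hypothetical.
package memresize

import (
	"errors"
	"time"
)

// lowerMemoryLimit tries to reduce the memory limit to targetBytes.
// readUsage returns current cgroup memory usage; setLimit applies a new limit.
func lowerMemoryLimit(targetBytes int64,
	readUsage func() (int64, error), setLimit func(int64) error) error {

	const (
		maxAttempts = 10
		headroom    = 1 << 20 // keep ~1MiB above current usage per step
	)
	for i := 0; i < maxAttempts; i++ {
		usage, err := readUsage()
		if err != nil {
			return err
		}
		if usage+headroom <= targetBytes {
			return setLimit(targetBytes) // safe to go straight to the target
		}
		// Squeeze the limit down to just above current usage to induce reclaim.
		if err := setLimit(usage + headroom); err != nil {
			return err
		}
		time.Sleep(time.Second)
	}
	return errors.New("memory usage did not drop to the requested limit")
}
```
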
## Graduation Criteria

TODO

## Implementation History

- 2018-11-06 - initial KEP draft created
- 2019-01-18 - implementation proposal extended
- 2019-03-07 - changes to flow control, updates per review feedback

## Alternatives

TODO

---

Review discussion:

> Please identify the component owners (for the autoscaling/node/scheduling areas) that will approve this KEP (and get approvals from them). That helps ensure there's agreement on the goals and overall approach before entering the API review process.

> @liggitt Thanks for pointing this out. I've identified the approvers for the stakeholder SIGs, and SIG-node and SIG-scheduling have approved the KEP.
> @mwielgus is going to follow up with @kgolab to see if there are any concerns, and if not we should get lgtm and approval from SIG-autoscaling.
> Please let us know what our next steps are for the API review.
> Thanks,

> Thanks. I'd suggest:

> @liggitt Thanks for the guidance. I've resolved many of the issues and comments that were either addressed or have become stale.
> I'm tracking the remaining outstanding questions in #1287.
> I'll give folks a few days to re-open any that they may feel is not resolved, or was resolved in error.
> Then @dashpole and I will ping @thockin to set up a time for the API review.