KEP: in-place update of pod resources #686

Merged Oct 3, 2019 · 22 commits · changes shown from 8 commits

Commits:

* cd94808 (Jan 12, 2019): Move Karol Golab's draft KEP for In-place update of pod resources fro…
* 7fb66f1 (Jan 12, 2019): Update owning-sig to sig-autoscaling, add initial set of reviewers.
* b8c1f4e (kgolab, Jan 18, 2019): Flow Control and few other sections added
* 5d00f9f (vinaykul, Jan 18, 2019): Merge pull request #1 from kgolab/master
* 9580642 (Jan 22, 2019): Update KEP filename per latest template guidelines, add non-goal item.
* b8d814e (Mar 7, 2019): Merge remote-tracking branch 'upstream/master'
* df1c8f8 (Mar 7, 2019): Update flow control, clarify items per review, identify risks.
* 17923eb (Mar 11, 2019): Update policy name, clarify scheduler actions and policy precedence
* e5052fc (Mar 12, 2019): Add RetryPolicy API change, clarify transition of PodCondition fields…
* 1194243 (Mar 26, 2019): Update control flow per review, add notes on Pod Overhead, emptyDir
* bfab6a3 (May 7, 2019): Update API and flow control to avoid storing state in PodCondition
* 69f9190 (May 14, 2019): Rename PodSpec scheduler resource allocations & PodCondition, and cla…
* 199a008 (vinaykul, Jun 18, 2019): Key changes:
* 574737c (vinaykul, Jun 19, 2019): Update design so that Kubelet, instead of Scheduler, evicts lower pri…
* 5bdcd57 (vinaykul, Jul 9, 2019): 1. Remove PreEmpting PodCondition.
* bc9dc2b (vinaykul, Aug 26, 2019): Extend PodSpec to hold accepted resource resize values, add resourcea…
* 533c3c6 (vinaykul, Sep 3, 2019): Update ResourceAllocated as ResourceList, clarify details of Kubelet …
* 29a22b6 (vinaykul, Sep 8, 2019): Restate Kubelet fault handling to minimum guarantees, clarify Schedul…
* 20cbea6 (vinaykul, Sep 14, 2019): Details of LimitRanger, ResourceQuota enforcement during Pod resize.
* 0ed9505 (vinaykul, Sep 17, 2019): ResourceQuota with resize uses Containers[i].Resources
* c745563 (vinaykul, Sep 17, 2019): Add note on VPA+HPA limitation for CPU, memory
* 55c8e56 (vinaykul, Sep 24, 2019): Add KEP approvers, minor clarifications
keps/sig-autoscaling/20181106-in-place-update-of-pod-resources.md (new file, 350 additions, 0 deletions)

---
title: In-place Update of Pod Resources
authors:
- "@kgolab"
- "@bskiba"
- "@schylek"
- "@vinaykul"
owning-sig: sig-autoscaling
participating-sigs:
- sig-node
- sig-scheduling
reviewers:
- "@bsalamat"
- "@derekwaynecarr"
- "@dchen1107"
approvers:
- TBD

> **@liggitt (Member), Sep 17, 2019:** Please identify the component owners (for the autoscaling/node/scheduling areas) that will approve this KEP (and get approvals from them). That helps ensure there's agreement on the goals and overall approach before entering the API review process.

> **@vinaykul (Member, Author):** @liggitt Thanks for pointing this out. I've identified the approvers for the stakeholder SIGs, and SIG-node and SIG-scheduling have approved the KEP. @mwielgus is going to follow up with @kgolab to see if there are any concerns, and if not we should get lgtm and approval from SIG-autoscaling. Please let us know what our next steps are for API review. Thanks.

> **Member:** Thanks. I'd suggest:
> 1. merging this in provisional state,
> 2. capturing links to outstanding comments/threads in an issue for resolution in a follow-up PR,
> 3. reaching out to the API approver (it looks like @thockin self-assigned this one) to schedule a time for review.

> **@vinaykul (Member, Author):** @liggitt Thanks for the guidance. I've resolved many of the issues and comments that were either addressed or have become stale. I'm tracking the remaining outstanding questions in #1287. I'll give folks a few days to re-open any that they feel are not resolved or were resolved in error. Then @dashpole and I will ping @thockin to set up a time for API review.

editor: TBD
creation-date: 2018-11-06
last-updated: 2018-11-06
status: provisional
see-also:
replaces:
superseded-by:
---

# In-place Update of Pod Resources

## Table of Contents

* [In-place Update of Pod Resources](#in-place-update-of-pod-resources)
* [Table of Contents](#table-of-contents)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [API Changes](#api-changes)
* [CRI Changes](#cri-changes)
* [Flow Control](#flow-control)
* [Transitions of ResourceResizeRequired condition](#transitions-of-resourceresizerequired-condition)
* [Container resource limit update ordering](#container-resource-limit-update-ordering)
* [Container resource limit update failure handling](#container-resource-limit-update-failure-handling)
* [Notes](#notes)
* [Affected Components](#affected-components)
* [Possible Extensions](#possible-extensions)
* [Risks and Mitigations](#risks-and-mitigations)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)
* [Alternatives](#alternatives)

## Summary

This proposal aims to allow Pod resource requests & limits to be updated
in-place, without restarting the Pod or its Containers.

The **core idea** behind the proposal is to make PodSpec mutable with regards to
Resources, denoting **desired** resources.
Additionally, PodStatus is extended to provide information about **actual**
resource allocation.

This document builds upon [proposal for live and in-place vertical scaling][] and
[Vertical Resources Scaling in Kubernetes][].

[proposal for live and in-place vertical scaling]: https://github.com/kubernetes/community/pull/1719
[Vertical Resources Scaling in Kubernetes]: https://docs.google.com/document/d/18K-bl1EVsmJ04xeRq9o_vfY2GDgek6B6wmLjXw-kos4/edit?ts=5b96bf40

## Motivation

Resources allocated to a Pod's Container can require a change for various reasons:
* load handled by the Pod has increased significantly and current resources are
not sufficient,
* load has decreased significantly and allocated resources are unused and wasted,
* resources have simply been set improperly.

Currently, changing resource allocation requires the Pod to be recreated, since
the PodSpec's Container Resources field is immutable.

While many stateless workloads are designed to withstand such a disruption, some
are more sensitive, especially when running a low number of Pod replicas.

Moreover, for stateful or batch workloads, a Pod restart is a serious
disruption, resulting in lower availability or higher cost of running.

Allowing Resources to be changed without recreating the Pod or restarting the
Containers addresses this issue directly.

### Goals

* Primary: allow changing Pod resource requests & limits without restarting its
Containers.
* Secondary: allow actors (users, VPA, StatefulSet, JobController) to decide
how to proceed if in-place resource resize is not possible.
* Secondary: allow users to specify which Pods and Containers can be resized
without a restart.

### Non-Goals

An explicit non-goal of this KEP is controlling the full life-cycle of a Pod
whose in-place resource resizing has failed. This should be handled by the
actors which initiated the resizing.

Other identified non-goals are:
* changing Pod QoS class without a restart,
* changing resources of Init Containers without a restart,
* updating extended resources or any other resource types besides CPU and memory.

## Proposal

### API Changes

PodSpec becomes mutable with regards to Container resources requests and limits.
Additionally, PodSpec becomes a Pod subresource to allow fine-grained access control.

PodStatus is extended with information about actually allocated Container resources.

Thanks to the above:
* PodSpec.Container.ResourceRequirements becomes purely a declaration,
denoting **desired** state of the Pod,
* PodStatus.ContainerStatus.ResourceAllocated (new object) denotes **actual**
state of the Pod resources.

To distinguish between possible states of a Pod resource update,
a new PodCondition named ResourceResizeRequired is added, with the following states:
* (empty) - the default value; resource update awaits reconciliation
if ResourceRequirements differs from ResourceAllocated,
* Requested - Scheduler determined in-place resource resizing is possible, and
requested Kubelet to update Pod's resource allocations and limits,
* Awaiting - awaiting resources to be freed (e.g. via pre-emption),
* Failed - resource resizing could not be performed in-place, but might become
possible if some conditions change,
* Rejected - resource update was rejected by any of the components involved.
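
For illustration only, these states could be written down as Go constants along
the following lines; the names and exact representation are placeholders, not
the proposed API:

```go
package api

// PodConditionResourceResizeRequired is the proposed PodCondition type. An
// unset (empty) value is the default and means a resize awaits reconciliation
// whenever ResourceRequirements differs from ResourceAllocated.
const (
	PodConditionResourceResizeRequired = "ResourceResizeRequired"

	ResizeStateRequested = "Requested" // Scheduler found in-place resize feasible, asked Kubelet to act
	ResizeStateAwaiting  = "Awaiting"  // waiting for resources to be freed, e.g. via pre-emption
	ResizeStateFailed    = "Failed"    // not possible in-place now, may succeed if conditions change
	ResizeStateRejected  = "Rejected"  // rejected by one of the components involved
)
```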

To provide some fine-grained control to the user,
PodSpec.Container.ResourceRequirements is extended with a ResizePolicy flag
for each resource type (CPU, memory):
* NoRestart - the default value; resize the Container without restarting it,
* RestartContainer - restart the Container in-place to apply new resource
values (e.g. a Java process needs to change its Xmx flag),
* RestartPod - restart the whole Pod in-place to apply new resource values
(e.g. the Pod requires its Init Containers to re-run).

> **Contributor:** How can the user forbid in-place updates for a given Container, e.g. because any resource change would require re-running Init Containers? I think there used to be an option which said "restart the whole Pod".

> **@vinaykul (Member, Author):** Is there a concrete use case or scenario that requires the RestartPod policy? I removed it after the above discussion - I could not trace a use-case for it and couldn't justify its need with sig-node.

> **@vinaykul (Member, Author), Sep 23, 2019:** Also, I'm wondering if this can be folded into RetryPolicy in VPA, where it lets the user specify this - similar to updateMode 'Recreate': if the user needs to re-run init for a resize, we evict the Pod and resize the replacement during admission. This is less ideal than in-place resize with restart, but the push has been to keep things as simple as possible for the Kubelet, and this is something that can be added later if there is a strong use case.

> **Commenter:** I can see a case for a third policy here, which as a strawman I will call SignalContainerWINCH. This would allow the container to attempt to adjust its language runtime to conform to the new limits - e.g. a programmer determines that calling runtime.GOMAXPROCS(math.Ceil(numCPUs) + 1) results in less scheduler thrashing. However, such a signal would only be useful if pods are able to interrogate the system for their own resource limits. This is perhaps best left to future enhancements to in-place update and should not block the 1.17 implementation.

> **Commenter:** On Linux you can always read /sys/fs/cgroup, can't you?

By using the ResizePolicy flag, the user can mark Containers or Pods as safe
(or unsafe) for in-place resource update.

This flag is used by Kubelet to determine the actions needed. This flag **may** be
used by the actors starting the update to decide if the process should be started
at all (for example, VPA might decide to evict a Pod with the RestartPod policy).

Setting the flag to separately control CPU & memory is due to an observation
that usually CPU can be added or removed without much problem, whereas
changes to available memory are more likely to require restarts.

If more than one resource type with different policies is updated, then
RestartPod policy takes precedence over RestartContainer, which in turn takes
precedence over NoRestart policy.
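
To make the shape of these additions concrete, the following Go sketch shows
roughly how the new fields could hang together. It deliberately simplifies the
existing types (quantities are plain strings here), and every name is a
placeholder rather than the final API:

```go
package api

// ResizePolicy tells Kubelet how a resize of one resource type must be applied.
type ResizePolicy string

const (
	NoRestart        ResizePolicy = "NoRestart"        // default: resize without restarting
	RestartContainer ResizePolicy = "RestartContainer" // restart the Container in-place
	RestartPod       ResizePolicy = "RestartPod"       // restart the whole Pod in-place
)

// ResourceName is "cpu" or "memory" for the purposes of this proposal.
type ResourceName string

// ResourceList maps a resource name to a quantity (simplified to a string here).
type ResourceList map[ResourceName]string

// ResourceRequirements is the existing PodSpec field, extended with a
// per-resource-type resize policy.
type ResourceRequirements struct {
	Limits   ResourceList
	Requests ResourceList
	// Proposed addition: how a change to each resource type should be applied.
	ResizePolicy map[ResourceName]ResizePolicy
}

// ContainerStatus gains the actually allocated resources, so that
// Spec (desired) and Status (actual) can diverge during a resize.
type ContainerStatus struct {
	// ...existing fields elided...
	ResourceAllocated ResourceList
}
```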

#### CRI Changes

Kubelet calls the UpdateContainerResources CRI API, which currently takes a
*runtimeapi.LinuxContainerResources* parameter that works for Docker and Kata
but not for Windows. This parameter is changed to *runtimeapi.ContainerResources*,
that is runtime agnostic.
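
A rough sketch of what such a runtime-agnostic parameter could look like,
written as plain Go structs rather than the actual protobuf definitions; field
sets are abridged and illustrative:

```go
package cri

// LinuxContainerResources mirrors the existing Linux-specific message (abridged).
type LinuxContainerResources struct {
	CpuShares          int64
	CpuQuota           int64
	CpuPeriod          int64
	MemoryLimitInBytes int64
}

// WindowsContainerResources carries the Windows-specific settings (abridged).
type WindowsContainerResources struct {
	CpuMaximum         int64
	CpuCount           int64
	MemoryLimitInBytes int64
}

// ContainerResources is the proposed runtime-agnostic wrapper passed to
// UpdateContainerResources; only the field matching the node OS is set.
type ContainerResources struct {
	Linux   *LinuxContainerResources
	Windows *WindowsContainerResources
}
```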

### Flow Control

The following steps denote a typical flow of an in-place resize process for a Pod
with ResizePolicy set to NoRestart for all its Containers.

1. The initiating actor updates ResourceRequirements using the PATCH verb.
1. API Server validates the new ResourceRequirements
(e.g. limits are not below requested resources, QoS class does not change).
1. API Server calls all Admission Controllers to verify the Pod Update.
* If any of the Controllers rejects the update, the
ResourceResizeRequired PodCondition is set to Rejected.
1. API Server updates the PodSpec object and clears ResourceResizeRequired condition.
1. Scheduler observes that ResourceRequirements and ResourceAllocated differ.
It checks its cache to determine if in-place resource resizing is possible.
* If Node has capacity to accommodate new resource values, it updates
its resource cache to use max(ResourceRequirements, ResourceAllocated),
and sets ResourceResizeRequired PodCondition to Requested.
* If required it pre-empts lower-priority Pods, setting the
ResourceResizeRequired PodCondition to Awaiting. Once the
lower-priority Pods are evicted, Scheduler clears the
ResourceResizeRequired PodCondition and the flow continues.
* If Node does not have capacity to accommodate new resource values, it
sets ResourceResizeRequired PodCondition to Failed.
1. Kubelet observes that ResourceResizeRequired PodCondition has been set to
Requested, and checks its Node allocatable resources against the new
ResourceRequirements for fit (see the sketch after this list).
* Kubelet sees that new ResourceRequirements fits, updates the PodStatus
ResourceAllocated to match ResourceRequirements, clears the
ResourceResizeRequired PodCondition, and then applies the new
cgroup limits to the Pod and its running Containers.
* Kubelet sees that new ResourceRequirements does not fit Node’s allocatable
resources and sets the ResourceResizeRequired PodCondition to Failed. This
can happen due to race-condition with multiple schedulers.
1. Scheduler observes that PodCondition has changed.
* Case 1: ResourceResizeRequired PodCondition is clear, ResourceRequirements
matches ResourceAllocated. Scheduler updates cache to use the updated
ResourceAllocated values.
* Case 2: ResourceResizeRequired PodCondition is Failed. Scheduler updates
its cache to use the unchanged ResourceAllocated values for accounting.
1. The initiating actor observes that ResourceAllocated has changed.
* Case 1: ResourceRequirements and ResourceAllocated match again, signifying
a successful completion of Pod resources in-place resizing.
* Case 2: ResourceResizeRequired PodCondition shows Failed, and initiating
actor may take action.
A few possible examples (perhaps controlled by a Retry policy):
* Initiating actor (user/VPA) handles it, for example by deleting the Pod to
trigger a replacement Pod with new resources for scheduling.
* Initiating actor is a Controller (Job,Deployment,..), and it clears the
ResourceResizeRequired PodCondition (based on other Pods departing, thus
freeing resources), and Scheduler retries in-place resource resizing.
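
As a toy illustration of the Kubelet-side fit check in step 6 above (real code
works on per-resource quantities and the Kubelet's actual accounting, not a
single integer):

```go
package main

import "fmt"

// admitResize models the Kubelet check in step 6: accept the resize only if the
// Node's allocatable capacity covers what is already allocated to other Pods
// plus the new request for this Pod. On success the Kubelet would update
// ResourceAllocated, clear the condition, and apply the new cgroup limits; on
// failure it would set the ResourceResizeRequired condition to Failed.
func admitResize(nodeAllocatable, allocatedToOtherPods, newPodRequest int64) bool {
	return allocatedToOtherPods+newPodRequest <= nodeAllocatable
}

func main() {
	// Node with 4000m CPU allocatable; other Pods already account for 3000m.
	fmt.Println(admitResize(4000, 3000, 800))  // true: resize accepted
	fmt.Println(admitResize(4000, 3000, 1500)) // false: condition set to Failed
}
```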

#### Transitions of ResourceResizeRequired condition

The following diagram shows possible transitions of ResourceResizeRequired condition.

```text

+----------+
| |
| Rejected |
| |
+----^-----+
|
|
5|
+----+----+
| <-----------+
+-----------+ (empty) | |
| | +---------+ |
| +--+---^--+ | |
1| 2| | 4| |6
+-----v----+ | | +---v-+--+
| | | | | |
| Awaiting | | | | Failed |
| | | | | |
+-------+--+ | | +---^----+
2| | |3 |4
| +---v---+---+ |
| | | |
+--------> Requested +--------+
| |
+-----------+

```

1. Scheduler, on starting pre-emption.
1. Scheduler, after pre-emption or no pre-emption needed.
1. Kubelet, on successful resizing.
1. Scheduler or Kubelet, if not enough space on Node.
1. Any Controller, on permanent issue.
1. Initiating actor, on retry.

#### Container resource limit update ordering

When in-place resize is desired for multiple Containers in a Pod, Kubelet updates
resource limits for the Containers as detailed below:
1. If resource resizing results in a net increase of a resource type (CPU or Memory),
Kubelet first updates the Pod-level cgroup limit for the resource type, and then
updates the Container resource limit.
1. If resource resizing results in a net decrease of a resource type, Kubelet first
updates the Container resource limit, and then updates the Pod-level cgroup limit.
1. If resource update results in no net change of a resource type, only the Container
resource limits are updated.

In all the above cases, Kubelet applies Container resource limit decreases before
applying limit increases; a sketch follows the review discussion below.

> **Member:** As noted earlier, for CPU this is immediate. For memory, the kubelet will need to induce pressure on the cgroup by setting a value based on its current usage.

> **Contributor:** This would be handled by the runtime, no? What's the expected behavior for the runtime and what's the timeout duration for the CRI call?

> **@vinaykul (Member, Author):** Yes, this should be handled by the runtime. The main intent of this section is to clarify how the limit update (UpdateContainerResources CRI call) should be ordered when multiple containers are being resized in a request. For example, if Pod sum(memory) = 5G with containers c1 (2G), c2 (3G), and a resize requests c1 (4G), c2 (2G), we should set the pod limit to 6G first, then update the c2 limit before c1.
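
The ordering rules above, applied to the c1/c2 example from the discussion, can
be sketched as follows; this is a toy model for a single resource type, and all
names are illustrative:

```go
package main

import "fmt"

// applyResize prints the order in which limits would be changed for one
// resource type: raise the Pod-level cgroup limit first on a net increase,
// lower it last on a net decrease, and always apply Container decreases
// before Container increases.
func applyResize(podLimit int64, current map[string]int64,
	newPodLimit int64, desired map[string]int64) {

	if newPodLimit > podLimit {
		fmt.Println("raise pod cgroup limit to", newPodLimit)
	}
	for name, want := range desired { // decreases first
		if want < current[name] {
			fmt.Printf("decrease %s limit to %d\n", name, want)
		}
	}
	for name, want := range desired { // then increases
		if want > current[name] {
			fmt.Printf("increase %s limit to %d\n", name, want)
		}
	}
	if newPodLimit < podLimit {
		fmt.Println("lower pod cgroup limit to", newPodLimit)
	}
}

func main() {
	// Example from the discussion: pod 5G -> 6G, c1 2G -> 4G, c2 3G -> 2G.
	applyResize(5, map[string]int64{"c1": 2, "c2": 3},
		6, map[string]int64{"c1": 4, "c2": 2})
	// Output: raise the pod limit to 6, decrease c2 to 2, then increase c1 to 4.
}
```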
> **Contributor:** If updating the container resource limit (through CRI, I assume) fails, then is the cgroup reverted? From a Windows standpoint there's no cgroup managed by the kubelet; 100% of this will need to be done in CRI.

> **@vinaykul (Member, Author):** I feel the best course here is to restart the container - imho this is a last-resort action. The rationale being that if a limit update fails, there is a possibility of a revert action failing as well if we were to try to roll things back. Is there a better alternative?

> **Commenter:** @PatrickLang what do you mean, no cgroup managed by kubelet? All cgroup (or whatever the Windows equivalent is) settings should be modified by CRI, is that not the case?

> **Member:** It's probably worth evaluating how the CRI implementor should handle this for non-burstable resources like memory. Is the kubelet going to be responsible for inducing reclaim by setting limits closer to current usage but not actually at the resource-allocated value? If we just pass the resource-allocated value to the CRI, it will not be sufficient for us to use this API to induce reclaim, so I think I prefer that the kubelet keeps a heuristic for how it handles scale-down of burstable resources.

> **Contributor:** "Restarting the container" can violate the ResizePolicy you set for the container. I'd prefer simply failing the resizing.

> **Contributor:** I'm also concerned about hard-to-define behavior when scaling down resources for CRI implementations.

> **@vinaykul (Member, Author):** The restart-container policy seems to have a use-case where Java apps using the Xmx flag won't be able to use the increased memory unless the app is restarted. Could you please elaborate on what might turn out unexpected on scale-down?


#### Container resource limit update failure handling

For simplicity, if the Container resource limits update fails, Kubelet restarts the
Container in-place to allow the new limits to take effect, and the action is logged.

#### Notes

* To avoid races and possible gamification, all components should use
max(ResourceRequirements, ResourceAllocated) when computing resources
used by a Pod (see the sketch after these notes). TBD if this can be weakened
when the ResourceResizeRequired condition is set to Rejected, or whether the
initiating actor should update ResourceRequirements back to reclaim resources.
* If another resource update arrives while a previous update is being handled,
that and all subsequent updates should be buffered at the Controller, and
applied upon successful or failed completion of the update in progress.
* Impact of memory backed emptyDir volumes: TBD - investigation needed.
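
A minimal sketch of the max() accounting rule from the first note, using the
existing v1.ResourceList and resource.Quantity types; this is illustrative, not
the actual implementation:

```go
package main

import (
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// maxResources returns, for every resource name present in either list, the
// larger of the two quantities; this is the max(ResourceRequirements,
// ResourceAllocated) value components should account with during a resize.
func maxResources(a, b v1.ResourceList) v1.ResourceList {
	out := v1.ResourceList{}
	for name, qty := range a {
		out[name] = qty
	}
	for name, qty := range b {
		if cur, ok := out[name]; !ok || qty.Cmp(cur) > 0 {
			out[name] = qty
		}
	}
	return out
}

func main() {
	desired := v1.ResourceList{v1.ResourceCPU: resource.MustParse("2"), v1.ResourceMemory: resource.MustParse("1Gi")}
	allocated := v1.ResourceList{v1.ResourceCPU: resource.MustParse("1"), v1.ResourceMemory: resource.MustParse("2Gi")}
	used := maxResources(desired, allocated)
	cpu, mem := used[v1.ResourceCPU], used[v1.ResourceMemory]
	fmt.Println(cpu.String(), mem.String()) // 2 2Gi
}
```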

### Affected Components

Pod v1 core API:
* extended model,
* added validation.

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
* for ResourceQuota it should be enough to change podEvaluator.Handler
implementation to allow Pod updates; max(ResourceRequirements, ResourceAllocated)
should be used to be in line with current ResourceQuota behavior
which blocks resources before they are used (e.g. for Pending Pods),
* for LimitRanger TBD.

Kubelet:
* support in-place resource management,
* set PodStatus ResourceAllocated for Containers on placing the Pod on Node.
* change UpdateContainerResources CRI API so that it works for both Linux and Windows.

Scheduler:
* determine if in-place resize is possible and update its cache depending on the resizing outcome.

Controllers:
* propagate Template resources update to running Pod instances.
* initiate resource update retries (controlled by retry policy) for Pods that failed resizing.

Other components:
* check how the change of meaning of resource requests influences other Kubernetes components.

### Possible Extensions

1. Allow resource limits to be updated too (VPA feature).
1. Allow ResizePolicy to be set on Pod level, acting as default if
(some of) the Containers do not have it set on their own.
1. Extend ResizePolicy flag to separately control resource increase and decrease
(e.g. a container can be given more memory in-place but
decreasing memory requires container restart).

### Risks and Mitigations

1. Backward compatibility: When Resources in PodSpec becomes representative of
desired state, and the Pod's true resource allocations are tracked in PodStatus,
applications that query PodSpec and rely on Resources in PodSpec to determine
resource usage will see values that may not represent actual allocations at
the time of the query. To mitigate this, the change needs to be documented and
highlighted in the release notes and in top-level Kubernetes documents.
1. Resizing memory lower: Lowering cgroup memory limits may not work as pages
could be in use, and approaches such as setting limit near current usage may
be required. This issue needs further investigation.
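
One possible shape of such a limit-stepping approach, purely as an illustration
(whether and how the Kubelet would do this is still open):

```go
package main

import "fmt"

// nextMemoryLimit steps a memory limit down only to slightly above current
// usage when the desired value is below usage, so the limit can converge over
// repeated attempts instead of being forced under live pages. The 10% headroom
// is an arbitrary value chosen for this example.
func nextMemoryLimit(desiredBytes, currentUsageBytes int64) int64 {
	if desiredBytes >= currentUsageBytes {
		return desiredBytes // safe to apply directly
	}
	return currentUsageBytes + currentUsageBytes/10
}

func main() {
	fmt.Println(nextMemoryLimit(1<<30, 512<<20))   // desired 1Gi, usage 512Mi -> apply 1Gi now
	fmt.Println(nextMemoryLimit(256<<20, 512<<20)) // desired 256Mi, usage 512Mi -> ~563Mi for now
}
```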

## Graduation Criteria

TODO

## Implementation History

- 2018-11-06 - initial KEP draft created
- 2019-01-18 - implementation proposal extended
- 2019-03-07 - changes to flow control, updates per review feedback

## Alternatives

TODO