KEP: in-place update of pod resources #686

Merged: 22 commits (Oct 3, 2019).
This view shows changes from 4 commits.

Commits:
* cd94808 - Move Karol Golab's draft KEP for In-place update of pod resources fro… (Jan 12, 2019)
* 7fb66f1 - Update owning-sig to sig-autoscaling, add initial set of reviewers. (Jan 12, 2019)
* b8c1f4e - Flow Control and few other sections added (kgolab, Jan 18, 2019)
* 5d00f9f - Merge pull request #1 from kgolab/master (vinaykul, Jan 18, 2019)
* 9580642 - Update KEP filename per latest template guidelines, add non-goal item. (Jan 22, 2019)
* b8d814e - Merge remote-tracking branch 'upstream/master' (Mar 7, 2019)
* df1c8f8 - Update flow control, clarify items per review, identify risks. (Mar 7, 2019)
* 17923eb - Update policy name, clarify scheduler actions and policy precedence (Mar 11, 2019)
* e5052fc - Add RetryPolicy API change, clarify transition of PodCondition fields… (Mar 12, 2019)
* 1194243 - Update control flow per review, add notes on Pod Overhead, emptyDir (Mar 26, 2019)
* bfab6a3 - Update API and flow control to avoid storing state in PodCondition (May 7, 2019)
* 69f9190 - Rename PodSpec scheduler resource allocations & PodCondition, and cla… (May 14, 2019)
* 199a008 - Key changes: (vinaykul, Jun 18, 2019)
* 574737c - Update design so that Kubelet, instead of Scheduler, evicts lower pri… (vinaykul, Jun 19, 2019)
* 5bdcd57 - 1. Remove PreEmpting PodCondition. (vinaykul, Jul 9, 2019)
* bc9dc2b - Extend PodSpec to hold accepted resource resize values, add resourcea… (vinaykul, Aug 26, 2019)
* 533c3c6 - Update ResourceAllocated as ResourceList, clarify details of Kubelet … (vinaykul, Sep 3, 2019)
* 29a22b6 - Restate Kubelet fault handling to minimum guarantees, clarify Schedul… (vinaykul, Sep 8, 2019)
* 20cbea6 - Details of LimitRanger, ResourceQuota enforcement during Pod resize. (vinaykul, Sep 14, 2019)
* 0ed9505 - ResourceQuota with resize uses Containers[i].Resources (vinaykul, Sep 17, 2019)
* c745563 - Add note on VPA+HPA limitation for CPU, memory (vinaykul, Sep 17, 2019)
* 55c8e56 - Add KEP approvers, minor clarifications (vinaykul, Sep 24, 2019)
---
kep-number: draft-20181106
title: In-place Update of Pod Resources
authors:
- "@kgolab"
- "@bskiba"
- "@schylek"
owning-sig: sig-autoscaling
participating-sigs:
- sig-node
- sig-scheduling
reviewers:
- "@bsalamat"
- "@derekwaynecarr"
- "@dchen1107"
approvers:
- TBD
editor: TBD
creation-date: 2018-11-06
last-updated: 2018-11-06
status: provisional
see-also:
replaces:
superseded-by:
---

# In-place Update of Pod Resources

## Table of Contents

* [In-place Update of Pod Resources](#in-place-update-of-pod-resources)
  * [Table of Contents](#table-of-contents)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
    * [Non-Goals](#non-goals)
  * [Proposal](#proposal)
    * [API Changes](#api-changes)
    * [Flow Control](#flow-control)
      * [Transitions of InPlaceResize condition](#transitions-of-inplaceresize-condition)
      * [Notes](#notes)
    * [Affected Components](#affected-components)
    * [Possible Extensions](#possible-extensions)
    * [Risks and Mitigations](#risks-and-mitigations)
  * [Graduation Criteria](#graduation-criteria)
  * [Implementation History](#implementation-history)
  * [Alternatives](#alternatives)

## Summary

This proposal aims to allow Pod resource requests & limits to be updated
in-place, without a need to restart the Pod or its Containers.

The **core idea** behind the proposal is to make PodSpec mutable with regard to
Resources, denoting **desired** resources.
Additionally, PodStatus is extended to provide information about **actual**
resource allocation.

This document builds upon [proposal for live and in-place vertical scaling][] and
[Vertical Resources Scaling in Kubernetes][].

[proposal for live and in-place vertical scaling]: https://github.com/kubernetes/community/pull/1719
[Vertical Resources Scaling in Kubernetes]: https://docs.google.com/document/d/18K-bl1EVsmJ04xeRq9o_vfY2GDgek6B6wmLjXw-kos4/edit?ts=5b96bf40

## Motivation

Resources allocated to a Pod's Container can require a change for various reasons:
* load handled by the Pod has increased significantly and current resources are
not enough to handle it,
* load has decreased significantly and currently allocated resources are unused
and thus wasted,
* Resources have simply been set improperly.

Currently, changing the Resources allocation requires the Pod to be recreated,
since PodSpec is immutable.

While many stateless workloads are designed to withstand such a disruption,
some are more sensitive, especially when using a low number of Pod replicas.

Moreover, for stateful or batch workloads, a Pod restart is a serious
disruption, resulting in lower availability or higher cost of running.

Allowing Resources to be changed without recreating a Pod or restarting its
Containers addresses this issue directly.

### Goals

* Primary: allow changing Pod resource requests & limits without restarting
  its Containers.
* Secondary: allow actors (users, VPA, StatefulSet, JobController) to decide
  how to proceed if an in-place resource update is not available.
* Secondary: allow users to specify which Pods and Containers can be updated
  without a restart.

### Non-Goals

An explicit non-goal of this KEP is controlling the full life-cycle of a
Pod which failed an in-place resource update. These cases should be handled by
the actors which initiated the update.

Other identified non-goals are:
* changing the Pod QoS class without a restart,
* changing resources of Init Containers without a restart.

## Proposal

### API Changes

PodSpec becomes mutable with regard to resources and limits.
Additionally, PodSpec becomes a Pod subresource to allow fine-grained access control.

PodStatus is extended with information about actually allocated resources.

Thanks to the above:
* PodSpec.Container.ResourceRequirements becomes purely a declaration,
  denoting the **desired** state of the Pod,
* PodStatus.ContainerStatus.ResourceAllocated (new object) denotes the **actual**
  state of the Pod resources.

---

**Review thread on ResourceAllocated and the CRI:**

**@PatrickLang (Contributor):** Is there a corresponding CRI change for review?
That shouldn't block merging this KEP draft, but it is going to be important for
implementers, as it would need to be reviewed for Linux+Windows compatibility and
runtime compatibility (dockershim/kata/hyper-v). cc @feiskyer

**Reviewer:** I'm not very confident we have reached agreement that this is the
direction we will go. If yes, a CRI change should be included here.

**@vinaykul (Member, Author):** @PatrickLang In our implementation, it was
sufficient to make changes in kubelet to detect a resources-only container spec
update and call the UpdateContainerResources CRI API, without any changes to the
CRI itself. We have tested it with docker; we are yet to try kata.

**Reviewer:** @vinaykul Kata does not update "container" resources for now :-)
Also, it's related to how the CRI shim is implemented; in the containerd shimv2
work, I remember we didn't handle this update at least a month ago.

**@resouer:** If we decide to go with the current narrative in this KEP, the CRI
does need to be updated (new field: ResourceAllocated), and CRI shim & crictl
maintainers should be notified about the incompatible change in the meaning of
LinuxContainerResources.

**@vinaykul:**

> Kata does not update "container" resources for now :-) …

Ah, that's good to know. I last toyed with Kata at GA, and they were working on
getting cpu/mem update working. I tried out krt1.4.1 earlier today and found OCI
mostly works: a CPU / memory increase or decrease is reflected in the cgroup
inside the kata container and enforced, but VSZ/RSS isn't lowered when memory is
lowered, and Get doesn't reflect the actual usage. I'll try k8s-crio-kata
tomorrow or Friday to see how well the crio-oci translation works and identify
gaps. It probably won't work if the containerd shim doesn't handle it.

**@vinaykul:**

> While if we decide to go with the current narrative in this KEP, CRI do need
> to be updated …

@resouer kata-runtime 1.4.1 seems to handle updating cpu/memory via CRI-O (see
example below).

[screenshot]

Regarding CRI, kubelet would merely switch from using
PodSpec.Container.ResourceRequirements to PodStatus.Container.ResourceAllocated
to get the limits when invoking the CRI API, e.g. in this function:
https://github.com/Huawei-PaaS/kubernetes/blob/vert-scaling-cp-review/pkg/kubelet/kuberuntime/kuberuntime_container.go#L619

Did I miss something in thinking that a CRI update is not necessary?

**Reviewer:** Thanks, I just checked the kata-runtime's API, and now every CRI
shim could support container-level resource adjustment. In that case, no CRI
change is required; we can simply use ResourceAllocated to generate
containerResources.

**@jellonek:** Are you sure about "every CRI"? I'll try to look at that more
deeply during the next days, but almost for sure that will break
https://github.com/Mirantis/virtlet/

**@vinaykul:** @PatrickLang Over the past week, I experimented with WinServer
2019 (for another project planning effort that I'm working on), got the chance
to try a Windows cross-compiled kubelet of my implementation, and took a closer
look at how to set the updated limits. Windows does create the container with
the specified limits (perhaps using information from the
ContainerConfig.WindowsContainerConfig.WindowsContainerResources struct).

For cleanliness, I do see that we should update the CRI API to specify
ContainerResources instead of LinuxContainerResources (which would have pointers
to LinuxContainerResources or WindowsContainerResources, similar to
ContainerConfig). Do you think containerID + WindowsContainerResources is
sufficient for Windows to successfully update the limits?

@jellonek I've not looked at virtlet. If you have had the chance, can you please
check if its CRI shim is able to use LinuxContainerResources?

---
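
For concreteness, here is a minimal Go sketch of the proposed status extension
(a sketch only, assuming a ResourceRequirements-shaped object; the exact field
layout and naming are not settled by this KEP):

```go
package v1 // sketch: illustrative addition to the core API, not a final design

// ContainerStatus gains a ResourceAllocated object that reports the
// resources actually applied to the running Container. While a resize
// is in flight it may differ from the desired Spec.Containers[i].Resources.
type ContainerStatus struct {
	// ... existing ContainerStatus fields ...

	// ResourceAllocated mirrors the shape of ResourceRequirements but
	// denotes the actual, not the desired, state of the Pod resources.
	ResourceAllocated *ResourceRequirements `json:"resourceAllocated,omitempty"`
}
```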

To distinguish between possible states of the Pod resources,
a new PodCondition InPlaceResize is added, with the following states:
* (empty) - the default value; resource update awaits reconciliation
  (if ResourceRequirements differs from ResourceAllocated),
* Awaiting - awaiting resources to be freed (e.g. via pre-emption),
* Failed - resource update could not be performed in-place
  but might be possible if some conditions change,
* Rejected - resource update was rejected by any of the components involved.
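
Expressed as Go constants, the four states might look as follows (a sketch;
the type and constant names are illustrative, only the state values come from
the list above):

```go
// Sketch: possible values of the proposed InPlaceResize PodCondition.
type InPlaceResizeState string

const (
	// Default; a resize awaits reconciliation whenever
	// ResourceRequirements differs from ResourceAllocated.
	InPlaceResizeEmpty InPlaceResizeState = ""
	// Awaiting resources to be freed, e.g. via pre-emption.
	InPlaceResizeAwaiting InPlaceResizeState = "Awaiting"
	// Could not resize in-place now; might succeed if conditions change.
	InPlaceResizeFailed InPlaceResizeState = "Failed"
	// Rejected by one of the components involved.
	InPlaceResizeRejected InPlaceResizeState = "Rejected"
)
```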

To provide some fine-grained control to the user,
PodSpec.Container.ResourceRequirements is extended with a ResizingPolicy flag,
available per resource request (CPU, memory):
* InPlace - the default value; allow in-place resize of the Container,
* RestartContainer - restart the Container to apply new resource values
  (e.g. a Java process needs to change its Xmx flag),
* RestartPod - restart the whole Pod to apply new resource values
  (e.g. the Pod requires its Init Containers to re-run).

By using the ResizingPolicy flag the user can mark Containers or Pods as safe
(or unsafe) for in-place resource updates.

This flag **may** be used by the actors starting the process to decide if
the process should be started at all (for example VPA might decide to
evict a Pod with the RestartPod policy).
This flag **must** be used by Kubelet to verify the actions needed.

Providing the flag separately for CPU & memory follows the observation
that CPU can usually be added or removed without much problem, whereas
changes to available memory are more likely to require restarts.
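
Sketched in Go, the per-resource policy could be modelled roughly as below
(the map-per-resource layout is an assumption made for illustration; the KEP
does not fix the exact representation):

```go
package v1 // sketch only: one plausible shape for the ResizingPolicy field

// ResizingPolicy says how new resource values may be applied to a Container.
type ResizingPolicy string

const (
	ResizeInPlace          ResizingPolicy = "InPlace"          // default
	ResizeRestartContainer ResizingPolicy = "RestartContainer" // e.g. Java Xmx flag
	ResizeRestartPod       ResizingPolicy = "RestartPod"       // e.g. Init Containers re-run
)

// ResourceRequirements is extended so the policy can be set independently
// per resource: CPU is often safe to resize in-place, while memory changes
// are more likely to need a restart.
type ResourceRequirements struct {
	// ... existing Limits and Requests fields ...

	ResizingPolicy map[ResourceName]ResizingPolicy `json:"resizingPolicy,omitempty"`
}
```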

### Flow Control

The following steps denote a positive flow of an in-place update,
for a Pod having ResizingPolicy set to InPlace for all its Containers.
Some alternative flows are given as indented sub-steps;
unless noted otherwise, they abort the flow.

1. The initiating actor updates ResourceRequirements using the PATCH verb.
1. API Server validates the new ResourceRequirements
   (e.g. limits are not below requested resources, QoS class does not change).
1. API Server calls all Admission Controllers to verify the Pod update.
   1. If any of the controllers rejects the update,
      the InPlaceResize PodCondition is set to Rejected.
1. API Server updates the PodSpec object and clears the InPlaceResize condition.
1. Scheduler observes that ResourceRequirements and ResourceAllocated differ.
   It updates its resource cache to use max(ResourceRequirements, ResourceAllocated).
   1. If required, it pre-empts lower-priority Pods, setting
      the InPlaceResize PodCondition to Awaiting.
      Once the lower-priority Pods are evicted, Scheduler clears
      the InPlaceResize PodCondition and the flow continues.
1. Kubelet observes that ResourceRequirements and ResourceAllocated differ
   and the InPlaceResize condition is clear.
   This happens potentially prior to Scheduler pre-empting lower-priority Pods.
   1. If the new ResourceRequirements do not fit the Node's allocatable
      resources, Kubelet sets the InPlaceResize condition to Failed.
1. Kubelet applies the new resource values to cgroups, updates the values
   in ResourceAllocated to match ResourceRequirements,
   and clears the InPlaceResize condition.
1. Scheduler observes that ResourceAllocated has changed.
   It updates its resource cache to use the new value of ResourceAllocated
   for the given Pod.
1. The initiating actor observes that ResourceRequirements and
   ResourceAllocated match again, which signifies the completion of the update.
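
The Kubelet's part of this flow amounts to a small reconciliation check; the
following Go-style sketch is schematic only, and every helper in it
(resourcesDiffer, conditionClear, fitsNodeAllocatable, applyCgroupValues,
copyRequirementsToAllocated, setCondition, clearCondition) is hypothetical:

```go
package kubelet // schematic sketch; not the real kubelet package layout

import v1 "k8s.io/api/core/v1"

// reconcileInPlaceResize sketches Kubelet's part of the flow above.
// Every helper called here is hypothetical.
func reconcileInPlaceResize(pod *v1.Pod) {
	// Act only when desired and actual resources differ and no other
	// component currently owns the InPlaceResize condition.
	if !resourcesDiffer(pod) || !conditionClear(pod, "InPlaceResize") {
		return
	}
	// Alternative flow: the new requirements do not fit the Node's
	// allocatable resources; mark Failed (a later retry may succeed).
	if !fitsNodeAllocatable(pod) {
		setCondition(pod, "InPlaceResize", "Failed")
		return
	}
	// Positive flow: apply the new values to cgroups, record them in
	// ResourceAllocated, and clear the condition so Scheduler and the
	// initiating actor can observe completion.
	applyCgroupValues(pod)
	copyRequirementsToAllocated(pod)
	clearCondition(pod, "InPlaceResize")
}
```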

#### Transitions of InPlaceResize condition

The following diagram shows possible transitions of InPlaceResize condition.

```text
                  +---------+
      +-----------+         +-----------+
      |           | (empty) |           |
      | +--------->         <---------+ |
      | |         +----+----+         | |
     1| |2            3|            4| |5
+-----v-+--+           |         +---+-v--+
|          |           |         |        |
| Awaiting |           |         | Failed |
|          |           |         |        |
+-------+--+           |         +---+----+
       3|              |             |3
        |         +----v-----+       |
        |         |          |       |
        +---------> Rejected <-------+
                  |          |
                  +----------+
```

1. Scheduler, on pre-emption.
1. Scheduler, after pre-emption finishes.
1. Any Controller, on permanent issue.
1. Kubelet, on successful retry.
1. Kubelet, if not enough space on Node.

#### Notes

* In case when there is no pre-emption required, Kubelet and Scheduler
will pick up the ResourceRequirements change in parallel.
* In case when there is pre-emption required Kubelet and Scheduler might
pick up the ResourceRequirements change in parallel,
Kubelet will then set the InPlaceResize condition to Failed
and Scheduler will clear it once pre-emption is done.
* Kubelet might try to apply new resources also if InPlaceResize
condition is set to Failed, as a normal retry mechanism.
* To avoid races and possible gamification, all components should use
vinaykul marked this conversation as resolved.
Show resolved Hide resolved
max(ResourceRequirements, ResourceAllocated) when computing resources
used by a Pod. TBD if this can be weakened when InPlaceResize condition
is set to Rejected, or should the initiating actor update
ResourceRequirements back to reclaim resources.
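
The max rule can be made precise with a small helper computing the per-resource
maximum of two ResourceLists; a minimal sketch, assuming the standard
resource.Quantity comparison (the helper name is illustrative):

```go
package resources // sketch: charging rule for a Pod while a resize is in flight

import (
	v1 "k8s.io/api/core/v1"
)

// maxResourceList returns, for every resource named in either list, the
// larger of the two quantities. Charging a Pod with
// maxResourceList(requirements, allocated) ensures that neither the old
// nor the new size is under-counted during a resize.
func maxResourceList(a, b v1.ResourceList) v1.ResourceList {
	out := v1.ResourceList{}
	for name, q := range a {
		out[name] = q
	}
	for name, q := range b {
		if cur, ok := out[name]; !ok || q.Cmp(cur) > 0 {
			out[name] = q
		}
	}
	return out
}
```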

### Affected Components

Pod v1 core API:
* extended model,
* added validation.

Admission Controllers: LimitRanger and ResourceQuota need to support Pod updates:
* for ResourceQuota it should be enough to change the podEvaluator.Handler
  implementation to allow Pod updates; max(ResourceRequirements, ResourceAllocated)
  should be used to be in line with the current ResourceQuota behaviour,
  which blocks resources before they are used (e.g. for Pending Pods),
* for LimitRanger TBD.

Kubelet:
* support in-place resource management,
* set ResourceAllocated when placing the Pod on the Node.

Scheduler:
* update its caches with the proper resources, depending on the InPlaceResize condition.

Other components:
* check how the change in meaning of resource requests influences other Kubernetes components.

### Possible Extensions

1. Allow resource limits to be updated too.
1. Allow ResizingPolicy to be set at Pod level, acting as a default if
   (some of) the Containers do not have it set on their own.
1. Extend the ResizingPolicy flag to separately control resource increase and decrease
   (e.g. a container can be given more memory in-place, but
   decreasing memory requires a container restart).

### Risks and Mitigations

TODO

## Graduation Criteria

TODO

## Implementation History

- 2018-11-06 - initial KEP draft created
- 2019-01-18 - implementation proposal extended

## Alternatives

TODO