KEP: in-place update of pod resources #686

Merged: 22 commits, merged Oct 3, 2019

Changes from 1 commit

Commits
cd94808
Move Karol Golab's draft KEP for In-place update of pod resources fro…
Jan 12, 2019
7fb66f1
Update owning-sig to sig-autoscaling, add initial set of reviewers.
Jan 12, 2019
b8c1f4e
Flow Control and few other sections added
kgolab Jan 18, 2019
5d00f9f
Merge pull request #1 from kgolab/master
vinaykul Jan 18, 2019
9580642
Update KEP filename per latest template guidelines, add non-goal item.
Jan 22, 2019
b8d814e
Merge remote-tracking branch 'upstream/master'
Mar 7, 2019
df1c8f8
Update flow control, clarify items per review, identify risks.
Mar 7, 2019
17923eb
Update policy name, clarify scheduler actions and policy precedence
Mar 11, 2019
e5052fc
Add RetryPolicy API change, clarify transition of PodCondition fields…
Mar 12, 2019
1194243
Update control flow per review, add notes on Pod Overhead, emptyDir
Mar 26, 2019
bfab6a3
Update API and flow control to avoid storing state in PodCondition
May 7, 2019
69f9190
Rename PodSpec scheduler resource allocations & PodCondition, and cla…
May 14, 2019
199a008
Key changes:
vinaykul Jun 18, 2019
574737c
Update design so that Kubelet, instead of Scheduler, evicts lower pri…
vinaykul Jun 19, 2019
5bdcd57
1. Remove PreEmpting PodCondition.
vinaykul Jul 9, 2019
bc9dc2b
Extend PodSpec to hold accepted resource resize values, add resourcea…
vinaykul Aug 26, 2019
533c3c6
Update ResourceAllocated as ResourceList, clarify details of Kubelet …
vinaykul Sep 3, 2019
29a22b6
Restate Kubelet fault handling to minimum guarantees, clarify Schedul…
vinaykul Sep 8, 2019
20cbea6
Details of LimitRanger, ResourceQuota enforcement during Pod resize.
vinaykul Sep 14, 2019
0ed9505
ResourceQuota with resize uses Containers[i].Resources
vinaykul Sep 17, 2019
c745563
Add note on VPA+HPA limitation for CPU, memory
vinaykul Sep 17, 2019
55c8e56
Add KEP approvers, minor clarifications
vinaykul Sep 24, 2019
Extend PodSpec to hold accepted resource resize values, add resourceallocation subresource

* Extend PodSpec to hold accepted resource resize values

vinaykul authored Aug 26, 2019
commit bc9dc2beb57d6d3a3c39fd2987f7714658562dde
239 changes: 107 additions & 132 deletions keps/sig-autoscaling/20181106-in-place-update-of-pod-resources.md
@@ -36,13 +36,13 @@ superseded-by:
* [Non-Goals](#non-goals)
* [Proposal](#proposal)
* [API Changes](#api-changes)
* [Container Restart Policy](#container-restart-policy)
* [Container Resize Policy](#container-resize-policy)
* [CRI Changes](#cri-changes)
* [Kubelet and API Server interaction](#kubelet-and-api-server-interaction)
* [Scheduler and API Server interaction](#scheduler-and-api-server-interaction)
* [Flow Control](#flow-control)
* [Transitions of the ResizingPod PodCondition](#transitions-of-the-resizingpod-podcondition)
* [Container resource limit update ordering](#container-resource-limit-update-ordering)
* [Kubelet Restart Fault Tolerance](#kubelet-restart-fault-tolerance)
* [Notes](#notes)
* [Affected Components](#affected-components)
* [Possible Extensions](#possible-extensions)
@@ -57,9 +57,9 @@ This proposal aims at allowing Pod resource requests & limits to be updated
in-place, without a need to restart the Pod or its Containers.

The **core idea** behind the proposal is to make PodSpec mutable with regards to
Resources, denoting **desired** resources.
Additionally, PodStatus is extended to provide information about **actual**
resource allocation.
Resources, denoting **desired** resources. Additionally, PodSpec is extended to
reflect resources **allocated** to a Pod, and PodStatus is extended to provide
information about **actual** resources applied to the Pod and its Containers.

This document builds upon [proposal for live and in-place vertical scaling][]
and [Vertical Resources Scaling in Kubernetes][].
@@ -115,31 +115,26 @@ Other identified non-goals are:
### API Changes

PodSpec becomes mutable with regards to Container resources requests and
limits. PodStatus is extended with information about actually allocated
Container resources.
limits. PodSpec is extended with information of resources allocated on the
Node for the Pod. PodStatus is extended to show the actual resources applied
to the Pod and its Containers.

Thanks to the above:
* PodSpec.Container.ResourceRequirements becomes purely a declaration, denoting
**desired** state of the Pod resources,
* PodStatus.ContainerStatus.ResourcesAllocated (new object) shows the resources
held by the Pod and its Containers.

In order to determine the state of a Pod resource resize, we add a new
PodCondition named ResizingPod, which describes the status of the last resize
request.

This PodCondition can have the following values:
* Status: false - Pod resize operation completed,
- Reason: (empty) - Initial state, no resize requested since Pod creation,
- Reason: Success - Pod and its Containers were successfully resized,
- Reason: FailedNodeCapacity - Node does not have room to resize the Pod.
* Status: true - Pod is in the process of being resized,
- Reason: InProgress - Kubelet is performing Pod resize.

#### Container Restart Policy

To provide fine-grained user control, PodSpec.Container is extended with
ResizePolicy map for each resource type (CPU, memory):
* Pod.Spec.Containers[i].Resources becomes purely a declaration, denoting the
**desired** state of Pod resources,
* Pod.Spec.Containers[i].ResourcesAllocated (new object) denotes the Node
resources **allocated** to the Pod and its Containers,
* Pod.Status.ContainerStatuses[i].Resources (new object) shows the **actual**
resources held by the Pod and its Containers.

A new Pod subresource named 'resourceallocation' is introduced to allow
fine-grained access control that enables Kubelet to set or update resources
allocated to a Pod.
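
To make the field relationships concrete, here is a minimal Go sketch of the additions described above. It is illustrative only: the field names follow the KEP text, but the exact types, JSON tags, and placement are assumptions rather than the final API.

```go
// Package sketch illustrates the proposed additions; it is not the Kubernetes API.
package sketch

import v1 "k8s.io/api/core/v1"

// Container: Spec-side view of resources.
type Container struct {
	// ... existing fields elided ...

	// Resources holds the desired requests and limits (this field exists today).
	Resources v1.ResourceRequirements `json:"resources,omitempty"`

	// ResourcesAllocated (new) records the Node resources accepted for this
	// container. Only Kubelet sets it, via the pods/resourceallocation
	// subresource introduced by this KEP.
	ResourcesAllocated v1.ResourceList `json:"resourcesAllocated,omitempty"`
}

// ContainerStatus: Status-side view of resources.
type ContainerStatus struct {
	// ... existing fields elided ...

	// Resources (new) reflects the actual requests and limits applied to the
	// running container, e.g. the cgroup configuration currently in effect.
	Resources v1.ResourceRequirements `json:"resources,omitempty"`
}
```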

#### Container Resize Policy

To provide fine-grained user control, PodSpec.Containers is extended with
ResizePolicy map (new object) for each resource type (CPU, memory):
Contributor commented:

I don't think we want to use a map here. See the Lists of named subobjects preferred over maps API convention.

This also still feels like quite a large API change just to support legacy Xmx applications. But i'll defer to approvers on whether this is justified.

@vinaykul (Member, Author) commented on Sep 2, 2019, quoting the above:

> I don't think we want to use a map here. See the Lists of named subobjects preferred over maps API convention.
>
> This also still feels like quite a large API change just to support legacy Xmx applications. But i'll defer to approvers on whether this is justified.

Using map here is consistent with ResourceAllocations (v1.ResourceList map) or Resources - map of Requests and Limits.

I respect the API conventions, but if you visualize the two (see below) - map looks simpler considering LHS strings are fixed/known system-defined keys (and not user data, not something that helps the user if they could name/alias it). Is that convention applicable to this case? If you look at the intent of @jbeda in those discussions, he is looking to keep things simple.

...
      containers:
      - name: foo
        resources:
          limits:
            cpu: "1"
            memory: "1Gi"
          requests:
            cpu: "1"
            memory: "1Gi"
        resizePolicy:
          cpu: NoRestart
          memory: RestartContainer

vs.

      containers:
      - name: foo
        resources:
          limits:
            cpu: "1"
            memory: "1Gi"
          requests:
            cpu: "1"
            memory: "1Gi"
        resizePolicy:
        - name: cpu
          policy: NoRestart
        - name: memory
          policy: RestartContainer

Contributor replied:

Ack. It is fine to leave it as is for now. This is worth pointing out to API reviewers later to get their opinion on.

* NoRestart - the default value; resize Container without restarting it,
@tedyu commented on Sep 18, 2019:

I wonder if there is a better name than ResizePolicy.
The two options here are about restart.

Would adding another value for RestartPolicy (RestartPolicyContainer) be better ?

Member Author replied:

We need to have this policy per container.
I had thought about naming it RestartPolicy at first, but felt it would cause confusion with PodSpec.RestartPolicy. Besides, ResizePolicy sounded somewhat better as it relates to resize.

* RestartContainer - restart the Container in-place to apply new resource
Contributor commented:

How can the user forbid in-place updates for a given Container, e.g. because any resource change would require to re-run Init Containers?
I think there used to be an option which said "restart the whole Pod".

Member Author replied:

Is there a concrete use case or scenario that requires RestartPod policy? I removed it after the above discussion - I could not trace a use-case for it and couldn't justify its need with sig-node.

@vinaykul (Member, Author) added on Sep 23, 2019:

Also, I'm wondering if this can be folded into RetryPolicy in VPA where it lets the user specify this - similar to updateMode 'Recreate', if user needs to re-run init for resize we evict the Pod and resize the replacement during admission . This is less ideal than in-place resize with restart, but the push has been to keep things as simple as possible for the Kubelet, and this is something that can be added later if there is a strong use case.

Another commenter:

I can see a case for a third policy here, which as a strawman I will call SignalContainerWINCH. This would allow the container to attempt to adjust its language runtime to conform to the new limits - e.g. a programmer determines that calling runtime.GOMAXPROCS(math.Ceil(numCPUs) + 1) results in less scheduler thrashing.

However, such a signal would only be useful if pods are able to interrogate the system for their own resource limits. This is perhaps best left to future enhancements to in-place update and should not block 1.17 implementation.

Reply:

On linux you can always read /sys/fs/cgroup can't you ?

values. (e.g. Java process needs to change its Xmx flag)
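
For illustration, the policy could be represented in Go roughly as below. This is a sketch based on the KEP text and the YAML example in the review thread above; whether the field ends up as a map or a list of named subobjects was still an open review question at this point.

```go
package sketch

import v1 "k8s.io/api/core/v1"

// ResizePolicy names the action Kubelet takes when resizing a given resource.
type ResizePolicy string

const (
	// NoRestart resizes the container in place without restarting it (default).
	NoRestart ResizePolicy = "NoRestart"
	// RestartContainer restarts the container so the new limits take effect at
	// process start (e.g. a JVM that reads -Xmx only once).
	RestartContainer ResizePolicy = "RestartContainer"
)

// ContainerResizePolicies is the per-container field as sketched here:
// keyed by resource name (cpu, memory).
type ContainerResizePolicies map[v1.ResourceName]ResizePolicy
```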
@@ -162,139 +157,108 @@ NoRestart in order to pass validation.
Kubelet calls UpdateContainerResources CRI API which currently takes
*runtimeapi.LinuxContainerResources* parameter that works for Docker and Kata,
but not for Windows. This parameter changes to *runtimeapi.ContainerResources*,
that is runtime agnostic.
that is runtime agnostic, and will contain platform-specific information.
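
For orientation, the intended shape of the new parameter is sketched below in Go. The real change is a protobuf message in the CRI API; the field names here are abridged assumptions, not the actual definition.

```go
package sketch

// ContainerResources is a runtime-agnostic wrapper; exactly one platform block
// is expected to be populated. Sketch only: the real message lives in the CRI
// protobuf and may differ.
type ContainerResources struct {
	Linux   *LinuxContainerResources
	Windows *WindowsContainerResources
}

// LinuxContainerResources mirrors (abridged) the existing CRI message.
type LinuxContainerResources struct {
	CpuPeriod          int64
	CpuQuota           int64
	CpuShares          int64
	MemoryLimitInBytes int64
	CpusetCpus         string
	CpusetMems         string
}

// WindowsContainerResources is a placeholder for the Windows-specific fields.
type WindowsContainerResources struct {
	CpuShares          int64
	CpuCount           int64
	CpuMaximum         int64
	MemoryLimitInBytes int64
}
```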

### Kubelet and API Server interaction
### Kubelet and API Server Interaction

When a new Pod is created, Scheduler is responsible for selecting a suitable
Node that accommodates the Pod.

When a Pod resize is requested, Kubelet attempts to update the resources
allocated for the Pod and its Containers. Kubelet first checks if the new
desired resources can fit the Node allocatable resources. It computes the sum
desired resources can fit the Node allocatable resources by computing the sum
of resources requested by all Pods on the Node with the new desired resources
for the Pod being resized.
* If new desired resources fit, Kubelet accepts the resize by marking
ResizingPod.Reason to 'InProgress', and proceeds to invoke
UpdateContainerResources CRI API to update the Container resource limits.
Once all Containers are successfully updated, it sets ResizingPod.Reason
to 'Success', and updates the Pod's ContainerStatus.ResourcesAllocated to
reflect the new resource values.
* If new desired resources don't fit, Kubelet will fail the resize and mark
the ResizingPod.Reason to 'FailedNodeCapacity'.
* If new desired resources fit, Kubelet accepts the resize by updating
Pod.Spec.Containers[i].ResourcesAllocated via pods/resourceallocation
subresource, and then proceeds to invoke UpdateContainerResources CRI API
to update the Container resource limits. Once all Containers are successfully
updated, it updates Pod.Status.ContainerStatuses[i].Resources to reflect the
new resource values.
* If new desired resources don't fit, Kubelet will reject the resize, and no
further changes are made.
- Kubelet retries Pod resize at a later time, or when other Pods depart and
free up resources.

Kubelet uses max(ResourceRequirements, ResourcesAllocated) for computing Node
resource usage to avoid race between competing Pod resize requests.
Kubelet uses max(Pod.Spec.Containers[i].Resources,
Contributor commented:

We discussed this at sig-node, and i'll post the rationale here in-case it is useful for others who are watching this proposal:

We can use Pod.Spec.Containers[i].ResourcesAllocated for computing the pod's requested resources, rather than max(Pod.Spec.Containers[i].Resources, Pod.Status.ContainerStatuses[i].Resources). Kubernetes pod placement is Request driven, rather than Usage driven. We accept that pods can be placed on nodes which have few additional resources available at the moment, but have implemented mechanisms (e.g. eviction, cgroup CPU throttling) to ensure pods are able to access resources they have requested even when few are available. With respect to this KEP, we only need to guarantee that the ResourcesAllocated of pods is less than Node Allocatable. This is true even if decreasing container memory limits lags behind the desired resource limits.

Pod.Status.ContainerStatuses[i].Resources) for computing Node resource usage
to avoid race between competing Pod resize requests.

While the Scheduler may assign a new Pod to the Node in parallel because it
uses cached Node resource values, by using max(ResourceRequirements,
ResourcesAllocated) Kubelet also prevents new Pods competing with Pod resize,
and rejects a new Pod if Node does not have enough room.
Scheduler may, in parallel, assign a new Pod to the Node because it uses
cached Node resources values. By using max(Pod.Spec.Containers[i].Resources,
Pod.Status.ContainerStatuses[i].Resources) Kubelet also prevents new Pods
from competing with Pod resize, and rejects a new Pod if Node does not have
enough room.
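
A minimal sketch of that accounting rule, assuming the new Status.ContainerStatuses[i].Resources field proposed above exists; the helper names are hypothetical and containers are matched to statuses by index for brevity.

```go
package sketch

import (
	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
)

// podRequestsForAccounting sums, per resource, the larger of the desired request
// (Spec.Containers[i].Resources) and the applied request
// (Status.ContainerStatuses[i].Resources) across all containers in the Pod.
func podRequestsForAccounting(pod *v1.Pod) v1.ResourceList {
	total := v1.ResourceList{}
	for i, c := range pod.Spec.Containers {
		applied := v1.ResourceList{}
		if i < len(pod.Status.ContainerStatuses) {
			// Assumes the ContainerStatus.Resources field proposed by this KEP.
			applied = pod.Status.ContainerStatuses[i].Resources.Requests
		}
		for _, name := range []v1.ResourceName{v1.ResourceCPU, v1.ResourceMemory} {
			q := maxQuantity(c.Resources.Requests[name], applied[name])
			cur := total[name]
			cur.Add(q)
			total[name] = cur
		}
	}
	return total
}

func maxQuantity(a, b resource.Quantity) resource.Quantity {
	if a.Cmp(b) >= 0 {
		return a
	}
	return b
}
```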

Additionally, Kubelet may evict lower priority Pods from the Node in order to
make room for the resize. Eviction of lower priority Pods can be done in
second phase of the implementation of this feature.
second phase of the implementation of this feature. (not scoped for this KEP)

### Scheduler and API Server interaction
### Scheduler and API Server Interaction

Scheduler observes the resize request posted to API Server, and updates the
Node available resources accounting in its cache by using
max(ResourceRequirements, ResourcesAllocated) when computing Node
resources used by the Pods. This ensures that new Pods are not assigned to a
Node against resources that are being allocated by Kubelet to resize an
existing Pod.
Node available resources accounting in its cache by using Pod's
max(Spec.Containers[i].Resources, Status.ContainerStatuses[i].Resources) when
computing Node resources used by the Pods. This ensures that, in the case of
resource decrease for existing Pod, new Pod is not prematurely assigned to the
Node that is still in the process of deallocating the resized Pod's resources.

### Flow Control

The following steps denote a typical flow of an in-place resize operation for a
Pod with ResizePolicy set to NoRestart for all its Containers (a sketch of
Kubelet's decision logic follows the list).

1. Initiating actor updates Pod's Container.ResourceRequirements using PATCH
verb.
1. API Server validates the new ResourceRequirements (e.g. Limits are not below
1. Initiating actor updates Pod's Spec.Containers[i].Resources via PATCH verb.
1. API Server validates the new Resources (e.g. Limits are not below
Requests, QoS class doesn't change, ResourceQuota not exceeded..).
1. API Server calls all Admission Controllers to verify the Pod Update.
* If any of the Controllers reject the update, API Server responds with an
appropriate error message.
1. API Server updates PodSpec object with the new desired ResourceRequirements.
1. Kubelet observes that Pod's Container.ResourceRequirements and
ContainerStatus.ResourcesAllocated differ. It checks its Node allocatable
resources to determine if the new desired ResourceRequirements fit the Node.
* _Case 1_: Kubelet finds new ResourceRequirements fit. It sets
ResizingPod.Status to true, and ResizingPod.Reason to InProgress, then
applies resized cgroup limits to the Pod and its Containers, and once
successfully done, updates Pod's ContainerStatus.ResourcesAllocated to
reflect the desired ResourceRequirements. It then sets ResizingPod.Status
to false, and ResizingPod.Reason to Success.
1. API Server updates PodSpec object with the new desired Resources.
1. Kubelet observes that Pod's Spec.Containers[i].Resources and
Spec.Containers[i].ResourcesAllocated differ. It checks its Node allocatable
resources to determine if the new desired Resources fit the Node.
* _Case 1_: Kubelet finds new desired Resources fit. It accepts the resize
and sets Spec.Containers[i].ResourcesAllocated equal to the values of
Containers[i].Resources by invoking resourceallocation subresource. It
then applies the new cgroup limits to the Pod and its Containers, and
once successfully done, sets Pod's Status.ContainerStatuses[i].Resources
to reflect the new ResourcesAllocated values.
- If at the same time, a new Pod was assigned to this Node against the
capacity taken up by this resource resize, that new Pod is rejected by
Kubelet during admission if Node has no more room.
* _Case 2_: Kubelet finds that desired ResourceRequirements does not fit.
- Kubelet checks to see if evicting lower priority Pods on the Node can
successfully resize the Pod. If yes, it sets ResizingPod.Status to true,
ResizingPod.Reason to InProgress, and initiates pre-emption of lower
priority Pods via Eviction API. Once lower priority Pods have been
evicted, the flow continues as above.
- If Kubelet is unable to create enough room by evicting lower priority
Pods, it sets ResizingPod.Reason to FailedNodeCapacity, and
ResizingPod.Status to false.
1. Scheduler, in parallel, observes that Container.ResourceRequirements and
ContainerStatus.ResourcesAllocated differ, updates its cache, and uses
max(ResourceRequirements, ResourcesAllocated) when computing resources
available on the Node.
* _Case 2_: Kubelet finds that the new desired Resources do not fit.
- Kubelet checks to see if evicting lower priority Pods can successfully
resize the Pod. If yes, it sets Containers[i].ResourcesAllocated equal
to Containers[i].Resources by invoking resourceallocation subresource,
and initiates pre-emption of lower priority Pods via Eviction API.
Once lower priority Pods have been evicted, the flow continues as above.
- If Kubelet determines that it is unable to make enough room by evicting
lower priority Pods, it simply retries the resize at a later time.
1. Scheduler, in parallel, observes that Pod's Spec.Containers[i].Resources and
Status.ContainerStatuses[i].Resources differ, updates its cache, and uses
max(Spec.Containers[i].Resources, Status.ContainerStatuses[i].Resources)
when computing resources available on the Node.
* This can temporarily result in sum of Pod resources for the Node
exceeding Node's allocatable resources if a new Pod was assigned to that
Node in parallel, exceeding Node capacity. This is resolved when Kubelet
rejects that new Pod during admission due to lack of room.
* After Kubelet has successfully resized the Pod, and updated
ContainerStatus.ResourcesAllocated, Scheduler updates its cache, and
accounting reflects the updated Pod resources.
1. The initiating actor observes that ResizingPod.Reason and/or
ContainerStatus.ResourcesAllocated fields have changed.
* _Case 1_: ResizingPod.Status is false, ResizingPod.Reason is Success,
and Pod's Container.ResourceRequirements matches
ContainerStatus.ResourcesAllocated, signifying a successful completion of
in-place Pod resources resizing.
* _Case 2_: ResizingPod.Status is false, and ResizingPod.Reason is
FailedNodeCapacity. The initiating actor may take alternative action.
For example, based on Retry policy, initiating actor (e.g VPA) may:
- Evict the Pod to trigger a replacement Pod with updated resources,
* After Kubelet has successfully resized the Pod and updated Pod's
Status.ContainerStatuses[i].Resources, Scheduler updates its cache, and
the accounting reflects updated Pod resources.
1. The initiating actor (e.g. VPA) observes the following:
* _Case 1_: Pod's Spec.Containers[i].ResourcesAllocated values have changed
and matches Spec.Containers[i].Resources, signifying that desired resize
has been accepted, and Pod's resources are being resized. The resize
operation is complete when Pod's Spec.Containers[i].Resources and
Status.ContainerStatuses[i].Resources match.
* _Case 2_: Pod's Spec.Containers[i].ResourcesAllocated values remain
unchanged, and continue to differ from Spec.Containers[i].Resources.
After a certain (user defined) timeout, initiating actor may take alternate
action. For example, based on Retry policy, initiating actor may:
- Evict the Pod to trigger a replacement Pod with new desired resources,
- Do nothing, and let Kubelet backoff and retry in-place resize.
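
The Kubelet decision in the flow above (Case 1 and Case 2) can be summarized with the following Go sketch. Every identifier here is hypothetical; the interface methods simply stand in for the behaviors described in the text, under the assumption that accept, apply, and publish happen in that order.

```go
package sketch

// nodeActions abstracts the node-level operations described in the flow above.
// All methods are hypothetical placeholders, not real Kubelet APIs.
type nodeActions interface {
	Fits() bool                      // do the desired resources fit Node allocatable?
	UpdateResourcesAllocated() error // write via the pods/resourceallocation subresource
	ApplyCgroupLimits() error        // decreases first, then increases
	PublishActualResources() error   // set Status.ContainerStatuses[i].Resources
	CanMakeRoomByEvicting() bool     // lower-priority Pods could be evicted (phase 2)
	EvictLowerPriorityPods() error   // via the Eviction API
	RetryLater()                     // back off; the desired Spec values remain
}

// handleResize sketches Kubelet's handling of one pending resize.
func handleResize(n nodeActions) error {
	if n.Fits() {
		// Case 1: accept, apply, then report actual values.
		if err := n.UpdateResourcesAllocated(); err != nil {
			return err
		}
		if err := n.ApplyCgroupLimits(); err != nil {
			return err
		}
		return n.PublishActualResources()
	}
	if n.CanMakeRoomByEvicting() {
		// Case 2a: accept, evict, then continue as in Case 1.
		if err := n.UpdateResourcesAllocated(); err != nil {
			return err
		}
		if err := n.EvictLowerPriorityPods(); err != nil {
			return err
		}
		if err := n.ApplyCgroupLimits(); err != nil {
			return err
		}
		return n.PublishActualResources()
	}
	// Case 2b: cannot make room; retry the resize later.
	n.RetryLater()
	return nil
}
```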

#### Transitions of the ResizingPod PodCondition

The following diagram shows possible transitions of ResizingPod.Status and
ResizingPod.Reason fields respectively.

```text

+-------------------------------------+
| |
| 2|
| +----v----+
| | |
| +---------------------+ false |
| | | Success |
| | | |
| | +----+----+
| | |
| 1| 3|
| +------v-----+ +----------v---------+
| | | | |
| | true |4 | false |
+---+ InProgress <---------+ FailedNodeCapacity |
| | | |
+------------+ +--------------------+

```

1. Kubelet, on initiating in-place resize.
1. Kubelet, on successful completion of in-place resize.
1. Kubelet, on Node not having capacity to resize Pod.
1. Kubelet, on initiating in-place resize retry.

#### Container resource limit update ordering

When in-place resize is requested for multiple Containers in a Pod, Kubelet
@@ -311,21 +275,30 @@ updates resource limit for the Pod and its Containers in the following manner:
In all the above cases, Kubelet applies Container resource limit decreases
before applying limit increases.
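
A tiny sketch of that ordering rule; the types and units here are hypothetical, and it only shows the visible rule that all decreases come before any increases.

```go
package sketch

// limitChange describes one container-level limit update for a single resource.
type limitChange struct {
	container string
	resource  string
	from, to  int64 // current and desired limit, in the resource's base units
}

// orderLimitChanges returns the changes with every decrease ahead of every increase.
func orderLimitChanges(changes []limitChange) []limitChange {
	var ordered []limitChange
	for _, c := range changes {
		if c.to < c.from {
			ordered = append(ordered, c)
		}
	}
	for _, c := range changes {
		if c.to >= c.from {
			ordered = append(ordered, c)
		}
	}
	return ordered
}
```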

#### Kubelet Restart Fault Tolerance

If Kubelet were to restart amidst handling Pod resize, then upon start up, all
existing (and new Pods, if any) are handled by Kubelet as new Pod additions. If
a Pod resize was being handled at time of restart, or other Pod resize requests
arrive during the time Kubelet is offline, then the Pods needing resize (i.e.
Spec.Containers[i].Resources and Spec.Containers[i].ResourcesAllocated differ)
are ordered by the Pod's ResourceVersion to ensure first-come-first-serve.
Contributor commented:

This isn't necessary. We already sort the initial batch of pods we get after restart by creation time.

Member Author replied:

That's fine for PodAdditions if there's no resize requested. We complete pod admission sorted by creation time at ResourcesAllocated values for existing Pods.

Once that is done, if two or more Pods requested resize while KL was restarting, then we want to order them FCFS, don't we? If yes, then for Pods needing resize (and any brand new Pods), ascending ResourceVersion seems to be the way to do it per my experiment.

Contributor replied:

Resizes must be serialized with admissions. This means the first HandlePodAdmission (with all pods that previously existed) will complete before any resizes are done.

@riking commented on Aug 29, 2019:

Also, are you actually allowed to compare ResourceVersion values across resources like that? I thought they were reserving the possibility of changing the implementation in the future and required that clients treat them as opaque numbers.

Member Author replied, quoting the above:

> Also, are you actually allowed to compare ResourceVersion values across resources like that? I thought they were reserving the possibility of changing the implementation in the future and required that clients treat them as opaque numbers.

ResourceVersion, as currently documented, led me to believe this is possible - see example 2. The documentation for the CompareResourceVersion code does not call out whether this is applicable only to individual objects and cannot cross objects. If this is not possible, the other alternative is to add LastUpdateTimeStamp to meta.


#### Notes

* If CPU Manager policy for a Node is set to 'static', then only integral
values of CPU resize are allowed.
* To avoid races and possible gamification, all components will use
max(ResourceRequirements, ResourcesAllocated) when computing resources used
by a Pod.
max(Spec.Containers[i].Resources, Status.ContainerStatuses[i].Resources)
when computing resources used by a Pod.
* If additional resize requests arrive when a Pod is being resized, those
requests are handled after completion of the resize that is in progress. And
resize is driven towards the latest desired state.
* We explored the option of Scheduler, instead of Kubelet, pre-empting lower
priority Pods. Pre-emption by Kubelet is simpler, and has lower latencies.
* Lowering memory limits may not always work if the application is holding on
to pages. Kubelet will use a control loop to set the memory limits near usage
in order to force a reclaim, and update ContainerStatus.ResourcesAllocated
in order to force a reclaim, and update Status.ContainerStatuses[i].Resources
only when limit is at desired value (see the sketch after this list).
* Impact of Pod Overhead: Kubelet adds Pod Overhead to the resize request to
determine if in-place resize is possible.
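
The memory-limit note above could be realized with a loop along these lines. This is a sketch under assumed helpers (readUsage, setLimit) for reading cgroup memory usage and writing the cgroup limit; the headroom value and retry bound are arbitrary.

```go
package sketch

import (
	"errors"
	"time"
)

// lowerMemoryLimit steps the cgroup memory limit down toward desired, keeping it
// just above current usage to pressure the kernel into reclaiming pages. Only
// when the desired limit is actually in effect would Kubelet update
// Status.ContainerStatuses[i].Resources.
func lowerMemoryLimit(desired int64, readUsage func() int64, setLimit func(int64) error) error {
	const headroom = 4 << 20 // keep roughly 4 MiB above usage while squeezing (arbitrary)
	for attempt := 0; attempt < 30; attempt++ {
		next := readUsage() + headroom
		if next < desired {
			next = desired // usage already low enough: go straight to the target
		}
		if err := setLimit(next); err != nil {
			return err
		}
		if next == desired {
			return nil
		}
		time.Sleep(time.Second) // give the application and kernel time to release pages
	}
	return errors.New("memory limit could not be lowered to the desired value")
}
```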
@@ -340,20 +313,22 @@ Pod v1 core API:

Admission Controllers: LimitRanger, ResourceQuota need to support Pod Updates:
* for ResourceQuota it should be enough to change podEvaluator.Handler
implementation to allow Pod updates; max(ResourceRequirements,
ResourcesAllocated) should be used to be in line with current ResourceQuota
behavior which blocks resources before they are used (e.g. for Pending Pods),
implementation to allow Pod updates; max(Spec.Containers[i].Resources,
Status.ContainerStatuses[i].Resources) should be used to be in line with
current ResourceQuota behavior which blocks resources before they are used
(e.g. for Pending Pods),
* for LimitRanger TBD.

Kubelet:
* support in-place resource resize,
* set Pod's ContainerStatus.ResourcesAllocated for Containers on placing the
Pod on Node,
* set Pod's Status.ContainerStatuses[i].Resources for Containers on placing
the Pod on Node,
* change UpdateContainerResources CRI API to work for both Linux & Windows,
* invoke eviction API for lower priority Pods. (Implemented in phase 2)

Scheduler:
* update cache using max(ResourceRequirements, ResourcesAllocated).
* update cache using Pod's max(Spec.Containers[i].Resources,
Status.ContainerStatuses[i].Resources).

Controllers:
* propagate Template resources update to running Pod instances.