In-Place Pod Vertical Scaling feature #1
Conversation
RestartContainer ContainerResizePolicy = "RestartContainer"
)

// ResizePolicy represents the resource resize policy for a single container.
Please be more precise about the meaning of this - is it a requirement that this container not be restarted or is it saying that a restart is not required (but may still happen if the kubelet needs to)? Or something else. I think this distinction may influence the constant strings above.
E.g. I assume this is a statement of capability. "I can be resized in-place without restart". That would lead me to look for something like:
resizePolicy:
- resource: "cpu"
  restart: "NotRequired"
- resource: "memory"
  restart: "Required"
What other policies do you think might one day be needed here?
NoRestart tells Kubelet to call the UpdateContainerResources CRI API to resize the resource - doing so does not restart the container. The RestartContainer policy, on the other hand, tells Kubelet to stop and start the container with the new resources; the use-case for this was some legacy Java apps using the -XmxN flag, which are unable to use the increased memory without restarting.
We had RestartPod as a third option at one point, but removed it because we couldn't find a real-world use case for it other than the notion that someone might want to restart the entire pod in-place (to rerun the init containers).
I'll add more details to the comments.
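As a minimal, self-contained Go sketch of the two policies described above (the type and constant names follow the diff in this thread; the comments encode the semantics from the discussion, and this is illustrative rather than the final merged API):

```go
package main

import "fmt"

// ContainerResizePolicy mirrors the type shown in the diff above.
type ContainerResizePolicy string

const (
	// NoRestart: kubelet calls the UpdateContainerResources CRI API to
	// resize in place; kubelet itself does not restart the container.
	NoRestart ContainerResizePolicy = "NoRestart"
	// RestartContainer: kubelet stops and starts the container with the
	// new resources (e.g. for legacy Java apps sized with -XmxN).
	RestartContainer ContainerResizePolicy = "RestartContainer"
)

func main() {
	fmt.Println(NoRestart, RestartContainer)
}
```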
Are you thinking this should just be a binary/bool value, considering we have only two values for the foreseeable future?
Hi, I found this half-completed, I don't know if it is still alive.
My point was to clarify whether "no restart" is contract or guidance. If the kubelet, for some reason, needed to restart the pod, would that be a violation of NoRestart or just an unfortunate, but legal, edge case?
I've updated the documentation in the code to clarify that container resize policy is guidance, by stating that kubelet should resize the container without restarting it if possible. I hope this is sufficient language.
/cc

lgtm, this matches what I expect from the KEP.
@@ -2053,6 +2080,12 @@ type Container struct {
	// Compute resource requirements.
	// +optional
	Resources ResourceRequirements
	// Node compute resources allocated to the container.
	// +optional
	ResourcesAllocated ResourceList
I am still confused why this is spec?
This is very much alive - I'm a couple of weeks away from having a mostly complete implementation out for review, and I have also updated the API code to move some of the validation earlier, into the admission stage.
NoRestart/RestartContainer above is guidance, and lets us decide whether to stop/start the container or just call the UpdateContainerResources API. The runtime may still choose to restart - we don't have control over that, and there is no explicit documentation telling the runtime not to restart.
ResourcesAllocated was added to Spec to track the resources that the node has agreed to reserve for the pod, and to make this info visible to the user.
We had discussed the alternative of storing this information locally on the node, but I was told that Borg had encountered issues with such an approach; they want to avoid checkpointing on the node and thought it best to have Spec as the source of truth. We had also looked at relying on PodStatus, but Status, per API conventions, isn't allowed to store any state that can't be reconstructed through observations.
pkg/kubelet/kubelet.go (outdated)
}
containersPatchData = strings.TrimRight(containersPatchData, ",")
patchData := fmt.Sprintf(`{"spec":{"containers":[%s]}}`, containersPatchData)
return true, patchData
We have a race condition here, but I don't remember how we agreed to solve it. For example, if we called canResizePod on pods A and B, such that the ordering was:
- canResizePod(A) => true
- canResizePod(B) => true
- Patch(A)
- Patch(B)
This could pass even if there was only room for 1 of the pods, since A is not included in kl.GetActivePods() during canResizePod(B).
At minimum, I think we need to update GetActivePods here. Or am I misremembering our conclusion?
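The interleaving above can be sketched in isolation. In this illustrative Go snippet (all names are hypothetical, not kubelet code), the admission check and the allocation update happen under one mutex, so the second resize observes the first one's allocation instead of racing past it:

```go
package main

import (
	"fmt"
	"sync"
)

// node is a stand-in for the kubelet's view of node capacity.
type node struct {
	mu          sync.Mutex
	allocatable int64 // total node capacity
	allocated   int64 // sum over active pods, analogous to ResourcesAllocated
}

// tryResize admits and records a resize atomically. Without the lock (or
// with the record step deferred, as in the canResizePod/Patch split above),
// two callers could both be admitted against the same free capacity.
func (n *node) tryResize(delta int64) bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.allocated+delta > n.allocatable {
		return false // would exceed allocatable
	}
	n.allocated += delta // visible to the next admission check
	return true
}

func main() {
	n := &node{allocatable: 1000, allocated: 900}
	fmt.Println(n.tryResize(100)) // true: fits exactly
	fmt.Println(n.tryResize(100)) // false: first resize consumed the headroom
}
```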
I've tested this for add vs resize race after coding up the switch-over from Requests to ResourcesAllocated (next change - hopefully by eod Monday). I think the mutex should serialize race between two resizes as well.
So in the above case, the next GetActivePods (for pod B) should use the updated ResourcesAllocated values of pod A if the resize was accepted. But let me specifically test that case and verify.
I tested the case of concurrent updates, and the mutex mechanism holds up well.
Where do we update the active pods?
The other thing I was trying to reason through at sig-node is whether the update needs to be written back to the API Server while holding the lock. Take the following example:
- canResizePod(A) decrease by 100m => true
- updateActivePods(A) decrease by 100m
- canResizePod(B) increase by 100m => true, since A's requests are lowered
- updateActivePods(B) increase by 100m
- Patch(B)
- Update (B) in the CRI
- ... Patch(A) hangs for a while ... Update(A) in the CRI doesn't happen.
In this case, B would have a higher allocated even though A has not been patched to be lower. An observer that did kubectl describe no would see resources allocated > allocatable. We could decide that that is OK, since we really just care about changes sent to the CRI, and the Patch() doesn't change that.
Does that make sense?
If we can end up with allocated > allocatable, that means a kubelet restart will result in one of the pods being rejected, as it reads pods from the source to admit them.
Also, WRT container.ResourcesAllocated[rName] = rQuantity updating active pods, I'd prefer explicitly calling podManager.UpdatePod() (or whatever the function name is) to make this explicit. The podManager GetActivePods() function is meant to be read-only, but we don't do deep copies to enforce that.
> If we can end up with allocated > allocatable, that means a kubelet restart will result in one of the pods being rejected, as it reads pods from the source to admit them.
Good point. We have to patch holding the lock, ugh. I'll look into it after I'm done with resource quota and limit range, and will also fix the explicit update.
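The fix agreed on here (doing the apiserver patch while still holding the allocation lock) can be sketched as follows. This is a hedged illustration with hypothetical names, not the kubelet implementation; the key property is that the admission check, the patch, and the local allocation update form one critical section, so a kubelet restart can never read back allocated > allocatable:

```go
package main

import (
	"fmt"
	"sync"
)

// resizer illustrates holding one lock across admit + persist + record.
type resizer struct {
	mu        sync.Mutex
	allocated int64
	limit     int64
	patch     func(delta int64) error // stand-in for the apiserver Patch call
}

func (r *resizer) resize(delta int64) error {
	r.mu.Lock()
	defer r.mu.Unlock() // lock is held across the patch, per the discussion
	if r.allocated+delta > r.limit {
		return fmt.Errorf("resize of %d denied: would exceed allocatable", delta)
	}
	if err := r.patch(delta); err != nil {
		return err // allocation stays unchanged if persistence fails
	}
	r.allocated += delta
	return nil
}

func main() {
	r := &resizer{limit: 1000, allocated: 950,
		patch: func(int64) error { return nil }}
	fmt.Println(r.resize(50)) // admitted and persisted
	fmt.Println(r.resize(1))  // denied: node is now full
}
```

The cost, as noted below, is that pod-startup throughput may suffer because the apiserver round-trip happens inside the lock.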
Sounds good. We have tests that test the throughput of pod startup. If it has a significant negative impact, we can consider the approach you currently have, and document the potential issues.
Commit 92ffa91 - Modified resource quota and limitranger plugins to handle updates to Requests and Limits, and modified kubelet to hold the lock across patch - it addresses the kubelet restart issue that you identified. Thanks!
1. Add ResourcesAllocated and ResizePolicy fields to Container struct.
2. Add Resources field to ContainerStatus struct.
3. Add a new admission controller named PodResourceAllocation for setting defaults and validation.
4. Address code-review items:
   a. Clarify that container resize policy is guidance from the user, not a guarantee by k8s.
   b. Handle updates from clients unaware of the new fields in pod spec.
   c. Use DropDisabledPodFields to avoid feature-gate checks in validation where possible.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Add ContainerResources message that supports both Linux and Windows resources.
2. Add ContainerResources to ContainerStatus CRI API.
3. Modify UpdateContainerResources CRI API to use ContainerResources.
4. Implement handling of the ContainerStatus response for runtimes that don't support resources query.
KEP: /enhancements/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md

1. Handle patching of ResourcesAllocated fields of a resized pod in Kubelet's syncPod routine.
2. Handle container limits updates in the runtime manager's SyncPod routine.
3. Add new pods after testing for admissibility using ResourcesAllocated field values.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

…ation
1. Generate CPU Request API status information from the runtime's cpu.shares report.
2. Rename pod cgroup config get/set functions for functional clarity, and simplify the update container resources function.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Switch to using container.ResourcesAllocated instead of container.Resources.Requests for resource allocation/usage calculations.
2. Add support for ResourcesAllocated to kubectl describe with pod and node resources.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Enable resource quota and limit ranger for pod resource changes.
2. In kubelet, do the apiserver ResourcesAllocated patch under mutex.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Fix issues found by verify, typecheck, dependencies tests.
2. Fix error message during validation when the InPlacePodVerticalScaling feature-gate is off.
3. Back out feature-gate change in kubeadm InitFeatureGates list.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Use strategicpatch to generate patch data for ResourcesAllocated when handling pod resize.
2. Use ContainersToStart instead of ContainersToRestart in SyncPod.
3. Simplify use of ContainersToUpdate in SyncPod.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
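The commit above replaces hand-built patch strings with generated ones. As a simple stdlib-only sketch of why that matters (this uses encoding/json with an illustrative containerPatch type, not the strategicpatch package the commit actually adopts): marshaling guarantees correct quoting and escaping that the earlier Sprintf/TrimRight approach did not.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// containerPatch is a hypothetical shape for one container's patch entry.
type containerPatch struct {
	Name               string            `json:"name"`
	ResourcesAllocated map[string]string `json:"resourcesAllocated"`
}

// buildPatch marshals the patch body instead of concatenating strings,
// so container names and quantities are always valid JSON.
func buildPatch(containers []containerPatch) (string, error) {
	b, err := json.Marshal(map[string]interface{}{
		"spec": map[string]interface{}{"containers": containers},
	})
	return string(b), err
}

func main() {
	p, err := buildPatch([]containerPatch{
		{Name: "app", ResourcesAllocated: map[string]string{"cpu": "500m"}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
}
```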
1. Add framework code for e2e testing, and example e2e tests.
2. Do container resource update based on the difference between requests as well as limits.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
This is stale. Working on an implementation of the updated design. Will be sending PRs to k/k in a couple of weeks once well tested.
What type of PR is this?
/kind api-change
What this PR does / why we need it:
This PR implements the API change for the In-Place Pod Vertical Scaling KEP, the Kubelet CRI changes to support the in-place pod vertical scaling feature, and the core implementation of the feature.
It adds new fields named ResourcesAllocated and ResizePolicy to Pod's Container type, and a Resources field to the ContainerStatus type. It also adds a new admission controller that limits the ability to change the ResourcesAllocated field to the system-node account. This API change enables users to scale a Pod's CPU and memory resources up or down.
The CRI change enables runtimes to report the currently applied resource limits for a container, and adds support for Windows containers.
The core implementation commit implements the In-Place Pod Vertical Scaling KEP.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Switching scheduler, kubelet, and apiserver code (limit range, resource quota) is coming in a future commit.
Does this PR introduce a user-facing change?: Yes
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: