
In-Place Pod Vertical Scaling feature #1

Closed
wants to merge 10 commits from the inplace-pod-vertical-scaling branch

Conversation


@vinaykul vinaykul commented Jan 24, 2020

What type of PR is this?


/kind api-change


What this PR does / why we need it:
This PR implements the API change for the In-Place Pod Vertical Scaling KEP, the Kubelet CRI changes (from the CRI API changes KEP) that support in-place pod vertical scaling, and the core implementation of the in-place pod vertical scaling feature.

It adds new fields named ResourcesAllocated and ResizePolicy to Pod's Container type, and a Resources field to the ContainerStatus type. It also adds a new admission controller that restricts changes to the ResourcesAllocated field to the system:node account. This API change enables users to scale a Pod's CPU and memory resources up or down.

The CRI change enables runtimes to report currently applied resource limits for a container, and adds support for Windows containers.

The core implementation commit implements the resizing behavior described in the In-Place Pod Vertical Scaling KEP.

Which issue(s) this PR fixes:

Fixes #

Special notes for your reviewer:

  1. The commit named 'InPlace Pod Vertical Scaling feature - API change' implements the API change for In-Place Pod Vertical Scaling.
  2. The commit named 'InPlace Pod Vertical Scaling feature - Kubelet CRI change' implements the CRI changes that allow runtimes to report resource limits.
  3. The commit named 'InPlace Pod Vertical Scaling feature - core implementation' is the basic implementation of the In-Place Pod Vertical Scaling feature; it performs the core resizing and resize-policy handling.

Switching the scheduler, kubelet, and apiserver code (limit range, resource quota) over to using ResourcesAllocated is coming in a future commit.

Does this PR introduce a user-facing change?: Yes

1. A new field named ResourcesAllocated (type v1.ResourceList) is added to the Container struct.
2. A new field named ResizePolicy is added to the Container struct.
3. A new field named Resources (type v1.ResourceRequirements) is added to the ContainerStatus struct.
4. A new admission controller named 'PodResourceAllocation' is introduced to limit who may modify the ResourcesAllocated field (see the sketch after the example below).

e.g.:
root@skibum:~/vInPlacePodVerticalScaling# cat ~/YML/1pod.yaml 
apiVersion: v1
kind: Pod
metadata:
  name: 1pod
spec:
  containers:
  - name: stress
    image: skiibum/ubuntu-stress:18.10
    resources:
      limits:
        cpu: "1"
        memory: "1Gi"
      requests:
        cpu: "1"
        memory: "1Gi"
    resizePolicy:
    - resourceName: cpu
      policy: NoRestart
    - resourceName: memory
      policy: RestartContainer
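
For reference, a minimal sketch of the rule the PodResourceAllocation admission controller enforces, per item 4 above. This is illustrative only; the types, the system:node prefix check, and the index-based container pairing are simplifications, not the PR's actual code.

package main

import (
	"fmt"
	"reflect"
	"strings"
)

// Simplified stand-in for the Container type with the new field.
type Container struct {
	Name               string
	ResourcesAllocated map[string]string
}

// validateResourcesAllocatedUpdate rejects any change to ResourcesAllocated
// unless the request comes from a node identity (system:node:<name>).
func validateResourcesAllocatedUpdate(user string, oldC, newC []Container) error {
	if strings.HasPrefix(user, "system:node:") {
		return nil // the node account may update the allocation it owns
	}
	for i := range newC {
		if i >= len(oldC) {
			break // newly added containers are out of scope for this sketch
		}
		if !reflect.DeepEqual(newC[i].ResourcesAllocated, oldC[i].ResourcesAllocated) {
			return fmt.Errorf("user %q may not change resourcesAllocated of container %q", user, newC[i].Name)
		}
	}
	return nil
}

func main() {
	oldC := []Container{{Name: "stress", ResourcesAllocated: map[string]string{"cpu": "1"}}}
	newC := []Container{{Name: "stress", ResourcesAllocated: map[string]string{"cpu": "2"}}}
	fmt.Println(validateResourcesAllocatedUpdate("alice", oldC, newC))              // forbidden
	fmt.Println(validateResourcesAllocatedUpdate("system:node:skibum", oldC, newC)) // nil (allowed)
}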


Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

In-Place Update of Pod Resources KEP:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20181106-in-place-update-of-pod-resources.md

Container Resources CRI API Changes for Pod Vertical Scaling KEP:
https://github.com/kubernetes/enhancements/blob/master/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md

RestartContainer ContainerResizePolicy = "RestartContainer"
)

// ResizePolicy represents the resource resize policy for a single container.

Please be more precise about the meaning of this - is it a requirement that this container not be restarted or is it saying that a restart is not required (but may still happen if the kubelet needs to)? Or something else. I think this distinction may influence the constant strings above.

E.g. I assume this is a statement of capability. "I can be resized in-place without restart". That would lead me to look for something like:

resizePolicy:
  - resource: "cpu"
    restart: "NotRequired"
  - resource: "memory"
    restart: Required

What other policies do you think might one day be needed here?

Owner Author

NoRestart tells Kubelet to call the UpdateContainerResources CRI API to resize the resource - doing so does not restart the container. The RestartContainer policy, on the other hand, tells Kubelet to stop and start the container with the new resources; the use-case for this was some legacy Java apps using the -XmxN flag, which are unable to use the increased memory without restarting.

We had RestartPod as a third option at one point, but removed it because we couldn't find a real-world use case for it other than the notion that someone might want to restart the entire pod in-place (to rerun the init containers).

I'll add more details to the comments.
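
To make the distinction concrete, here is a toy sketch of that decision (illustrative only, not the kubelet's actual code; the constants are copied from this PR's API, everything else is a stand-in):

package main

import "fmt"

// ContainerResizePolicy mirrors the constants from this PR's API (illustrative copy).
type ContainerResizePolicy string

const (
	NoRestart        ContainerResizePolicy = "NoRestart"
	RestartContainer ContainerResizePolicy = "RestartContainer"
)

// actionForResize sketches the choice described above: NoRestart means resize the
// running container via the UpdateContainerResources CRI call; RestartContainer
// means stop and start the container so the new resources take effect (e.g. legacy
// Java apps started with -Xmx that cannot grow their heap without a restart).
func actionForResize(p ContainerResizePolicy) string {
	if p == RestartContainer {
		return "stop/start the container with the new resources"
	}
	return "call UpdateContainerResources; no restart"
}

func main() {
	fmt.Println("cpu:", actionForResize(NoRestart))
	fmt.Println("memory:", actionForResize(RestartContainer))
}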

@vinaykul vinaykul Jan 28, 2020

Are you thinking this should just be a binary/bool value considering we have just two values for the foreseeable future?


Hi, I found this half-completed, I don't know if it is still alive.

My point was to clarify whether "no restart" is contract or guidance. If the kubelet, for some reason, needed to restart the pod, would that be a violation of NoRestart or just an unfortunate, but legal, edge case?

Owner Author

I've updated the documentation in the code to clarify that the container resize policy is guidance, by stating that kubelet should resize the container without restarting it if possible. I hope this is sufficient language.

pkg/apis/core/v1/defaults.go (outdated, resolved)
pkg/apis/core/v1/defaults.go (outdated, resolved)
pkg/apis/core/validation/validation.go (outdated, resolved)
pkg/apis/core/validation/validation.go (outdated, resolved)
pkg/apis/core/validation/validation.go (outdated, resolved)
pkg/apis/core/validation/validation.go (outdated, resolved)
@dashpole
Collaborator

/cc

1 similar comment
@dchen1107

/cc

@dashpole
Collaborator

lgtm, this matches what I expect from the KEP.

@vinaykul
Owner Author

@thockin @liggitt

Apologies for the delay; I was pulled off to work on another priority item because a couple of our team members were delayed coming back from China after traveling there for the new year holidays.

Please review update a419856; it should address your comments.

@@ -2053,6 +2080,12 @@ type Container struct {
// Compute resource requirements.
// +optional
Resources ResourceRequirements
// Node compute resources allocated to the container.
// +optional
ResourcesAllocated ResourceList

I am still confused why this is spec?

Owner Author

This is very much alive - I'm a couple of weeks away from having a mostly complete implementation out for review, and I've also updated the API code to move some of the validation earlier, into the admission stage.

NoRestart/RestartContainer above is guidance, and lets us decide whether to stop/start the container or just call the UpdateContainerResources API. The runtime may still choose to restart - we don't have control over that, and there is no explicit documentation telling the runtime not to restart.

ResourcesAllocated was added to Spec to track the resources that the node has agreed to reserve for the pod, and to make this information visible to the user.

We had discussed the alternative of storing this information locally on the node, but I was told that Borg had encountered issues with such an approach; they want to avoid checkpointing on the node and thought it best to have Spec as the source of truth. We had also looked at relying on PodStatus, but Status, per API conventions, isn't allowed to store any state that can't be reconstructed through observation.

@vinaykul vinaykul force-pushed the inplace-pod-vertical-scaling branch from a419856 to 3c93b83 on May 15, 2020 02:35
@vinaykul vinaykul force-pushed the inplace-pod-vertical-scaling branch from 0dd6c02 to 26b6385 on June 2, 2020 16:44
@vinaykul vinaykul requested a review from dashpole on June 3, 2020 17:29
@vinaykul vinaykul force-pushed the inplace-pod-vertical-scaling branch from 54102dd to 6f65aff on June 5, 2020 10:05
@vinaykul vinaykul changed the title from "In-Place Pod Vertical Scaling feature - API code change" to "In-Place Pod Vertical Scaling feature" on Jun 5, 2020
pkg/kubelet/kubelet_pods.go (outdated, resolved)
pkg/kubelet/cm/cgroup_manager_linux.go (outdated, resolved)
}
containersPatchData = strings.TrimRight(containersPatchData, ",")
patchData := fmt.Sprintf(`{"spec":{"containers":[%s]}}`, containersPatchData)
return true, patchData
Collaborator

We have a race condition here, but I don't remember how we agreed to solve it. For example, if we called canResizePod on pod A and B, such that the ordering was:

  1. canResizePod(A) => true
  2. canResizePod(B) => true
  3. Patch(A)
  4. Patch(B)

This could pass even if there was only room for 1 of the pods, since A is not included in kl.GetActivePods() during canResizePod(B).

At minimum, I think we need to update GetActivePods here. Or am I misremembering our conclusion?

Owner Author

I've tested this for the add-vs-resize race after coding up the switch-over from Requests to ResourcesAllocated (next change - hopefully by EOD Monday). I think the mutex should serialize the race between two resizes as well.

So in the above case, the next GetActivePods (for pod B) should use the updated ResourcesAllocated values of Pod A if the resize was accepted. But let me specifically test that case and verify.

Owner Author

I tested the case of concurrent updates, and the mutex mechanism holds up well.

Collaborator

Where do we update the active pods?

Collaborator

The other thing I was trying to reason through at sig-node is whether the update needs to be written back to the API Server while holding the lock. Take the following example:

  1. canResizePod(A) decrease by 100m => true
  2. updateActivePods(A) decrease by 100m
  3. canResizePod(B) increase by 100m => true, since A's requests are lowered
  4. updateActivePods(B) increase by 100m
  5. Patch(B)
  6. Update (B) in the CRI
  7. ... Patch(A) hangs for a while ... Update(A) in the CRI doesn't happen.

In this case, B would have a higher allocation even though A has not been patched to be lower. An observer that ran kubectl describe node would see resources allocated > allocatable. We could decide that that is OK, since we really just care about changes sent to the CRI, and the Patch() doesn't change that.

Does that make sense?

Collaborator

If we can end up with allocated > allocatable, that means a kubelet restart will result in one of the pods being rejected, as it reads pods from the source to admit them.

Collaborator

Also, WRT container.ResourcesAllocated[rName] = rQuantity updating active pods, I'd prefer explicitly calling podManager.UpdatePod() (or whatever the function name is) to make the mutation explicit. The podManager GetActivePods() function is meant to be read-only, but we don't do deep copies to enforce that.

Owner Author

If we can end up with allocated > allocatable, that means a kubelet restart will result in one of the pods being rejected, as it reads pods from the source to admit them.

Good point. We have to patch while holding the lock, ugh. I'll look into it after I'm done with resource quota and limit range, and will also fix the explicit update.

Collaborator

Sounds good. We have tests that test the throughput of pod startup. If it has a significant negative impact, we can consider the approach you currently have, and document the potential issues.

@vinaykul vinaykul Jun 12, 2020

Commit 92ffa91 - Modified the resource quota and limitranger plugins to handle updates to Requests and Limits, and modified kubelet to hold the lock across the patch - this addresses the kubelet restart issue that you identified. Thanks!
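
For anyone following along, here is a toy sketch of the ordering we converged on: admit, update the tracked allocation, and patch the apiserver all under one lock. The types and helpers are stand-ins, not the real kubelet code.

package main

import (
	"fmt"
	"sync"
)

// resizeManager is a toy model: one mutex guards admission, the locally tracked
// allocation, and the apiserver patch, so two concurrent resizes (or a kubelet
// restart replaying the pod source) never observe allocated > allocatable.
type resizeManager struct {
	mu          sync.Mutex
	allocatable int64 // node allocatable CPU in millicores
	allocated   int64 // sum of ResourcesAllocated across active pods
}

func (m *resizeManager) handlePodResize(pod string, deltaMilliCPU int64) error {
	m.mu.Lock()
	defer m.mu.Unlock()

	// 1. canResizePod: admit against allocatable minus what is already allocated.
	if m.allocated+deltaMilliCPU > m.allocatable {
		return fmt.Errorf("resize of %s rejected: exceeds allocatable", pod)
	}
	// 2. Update the tracked allocation explicitly (not by writing through an
	//    aliased GetActivePods slice).
	m.allocated += deltaMilliCPU
	// 3. Patch ResourcesAllocated back to the apiserver before releasing the lock.
	return m.patchAllocated(pod)
}

// patchAllocated stands in for the strategic-merge patch sent to the apiserver.
func (m *resizeManager) patchAllocated(pod string) error {
	fmt.Printf("PATCH pod %s: resourcesAllocated now reflects the accepted resize\n", pod)
	return nil
}

func main() {
	m := &resizeManager{allocatable: 2000, allocated: 1800}
	fmt.Println(m.handlePodResize("A", -100)) // decrease pod A by 100m
	fmt.Println(m.handlePodResize("B", 100))  // B's increase only fits because A's patch already landed
}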

pkg/kubelet/kubelet.go (outdated, resolved)
pkg/kubelet/kuberuntime/kuberuntime_manager.go (outdated, resolved)
pkg/kubelet/kuberuntime/kuberuntime_manager.go (outdated, resolved)
pkg/kubelet/kuberuntime/kuberuntime_manager.go (outdated, resolved)
@vinaykul vinaykul force-pushed the inplace-pod-vertical-scaling branch 2 times, most recently from 92ffa91 to 3f40b72 on June 14, 2020 21:58
@vinaykul vinaykul force-pushed the inplace-pod-vertical-scaling branch from 113cf1c to f51a9a0 on June 23, 2020 08:26
vinaykul added 9 commits June 29, 2020 17:10
1. Add ResourcesAllocated and ResizePolicy fields to Container struct.
2. Add Resources field to ContainerStatus struct.
3. Add a new admission controller named PodResourceAllocation for setting defaults and validation.
4. Address code-review items:
   a. Clarify that the container resize policy is guidance from the user, not a guarantee by k8s.
   b. Handle updates from clients unaware of the new fields in pod spec.
   c. Use DropDisabledPodFields to avoid feature-gate checks in validation where possible.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
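
A minimal sketch of the DropDisabledPodFields-style handling referenced in item 4c above (stand-in types; the real code operates on the internal Pod API types and the InPlacePodVerticalScaling feature gate):

package main

import "fmt"

// Stand-in types; the real code operates on the internal Pod API types.
type Container struct {
	Name               string
	ResourcesAllocated map[string]string
	ResizePolicy       []string
}

type PodSpec struct{ Containers []Container }

// resizeFieldsInUse reports whether the old object already uses the new fields,
// in which case they must be preserved even with the feature gate off.
func resizeFieldsInUse(spec *PodSpec) bool {
	if spec == nil {
		return false
	}
	for i := range spec.Containers {
		if spec.Containers[i].ResourcesAllocated != nil || len(spec.Containers[i].ResizePolicy) > 0 {
			return true
		}
	}
	return false
}

// dropDisabledResizeFields mirrors the DropDisabledPodFields pattern: with the
// feature gate off and no prior use, silently clear the new fields instead of
// failing validation for clients unaware of them.
func dropDisabledResizeFields(spec, oldSpec *PodSpec, gateEnabled bool) {
	if gateEnabled || resizeFieldsInUse(oldSpec) {
		return
	}
	for i := range spec.Containers {
		spec.Containers[i].ResourcesAllocated = nil
		spec.Containers[i].ResizePolicy = nil
	}
}

func main() {
	spec := &PodSpec{Containers: []Container{{Name: "stress", ResourcesAllocated: map[string]string{"cpu": "1"}}}}
	dropDisabledResizeFields(spec, nil /* create: no old object */, false /* gate off */)
	fmt.Println(spec.Containers[0].ResourcesAllocated == nil) // true: field dropped
}
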
1. Add ContainerResources message that supports both Linux and Windows resources.
2. Add ContainerResources to ContainerStatus CRI API.
3. Modify UpdateContainerResources CRI API to use ContainerResources.
4. Implement handling ContainerStatus response for runtimes that don't support resources query.
KEP: /enhancements/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md
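
Item 4 above, sketched as a toy fallback (an assumed reading of that item, with stand-in types rather than the CRI-generated ones):

package main

import "fmt"

// Stand-in types, not the CRI-generated ones.
type ContainerResources struct {
	CPUShares   int64
	MemoryLimit int64
}

type ContainerStatus struct {
	Resources *ContainerResources // nil when the runtime cannot report applied resources
}

// resourcesFor prefers what the runtime reports and falls back to what the
// kubelet last applied when the runtime does not support the resources query.
func resourcesFor(status ContainerStatus, lastApplied ContainerResources) ContainerResources {
	if status.Resources != nil {
		return *status.Resources
	}
	return lastApplied
}

func main() {
	applied := ContainerResources{CPUShares: 1024, MemoryLimit: 1 << 30}
	fmt.Printf("%+v\n", resourcesFor(ContainerStatus{}, applied)) // old runtime: fall back to applied values
}
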
1. Handle patching of ResourcesAllocated fields of a resized pod in Kubelet's syncPod routine.
2. Handle container limits updates in the runtime manager's SyncPod routine.
3. Add new pods after testing for admissibility using ResourcesAllocated field values.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
…ation

1. Generate CPU Request API status information from runtime's cpu.shares report.
2. Rename pod cgroup config get/set functions for functional clarity, and simplify the update container resources function.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
1. Switch to using container.ResourcesAllocated instead of container.Resources.Requests for resource allocation/usage calculations.
2. Add support for ResourcesAllocated to kubectl describe with pod and node resources.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
1. Enable resource quota and limit ranger for pod resource changes.
2. In kubelet, do apiserver ResourcesAllocated patch under mutex.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
1. Fix issues found by verify, typecheck, dependencies tests.
2. Fix error message during validation when InPlacePodVerticalScaling feature-gate is off.
3. Backout feature-gate change in kubeadm InitFeatureGates list.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
1. Use strategicpatch to generate patch data for ResourcesAllocated when handling pod resize.
2. Use ContainersToStart instead of ContainersToRestart in SyncPod.
3. Simplify use of ContainersToUpdate in SyncPod.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
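
A sketch of the strategicpatch usage described in item 1 above (runnable against the upstream client libraries; it patches the standard resources field as a stand-in, because the resourcesAllocated field added by this PR is not present in the released v1 types):

package main

import (
	"encoding/json"
	"fmt"

	v1 "k8s.io/api/core/v1"
	"k8s.io/apimachinery/pkg/api/resource"
	"k8s.io/apimachinery/pkg/util/strategicpatch"
)

func main() {
	oldPod := v1.Pod{Spec: v1.PodSpec{Containers: []v1.Container{{
		Name: "stress",
		Resources: v1.ResourceRequirements{Requests: v1.ResourceList{
			v1.ResourceCPU: resource.MustParse("1"),
		}},
	}}}}
	newPod := oldPod.DeepCopy()
	newPod.Spec.Containers[0].Resources.Requests[v1.ResourceCPU] = resource.MustParse("1500m")

	oldJSON, _ := json.Marshal(oldPod)
	newJSON, _ := json.Marshal(newPod)
	// CreateTwoWayMergePatch uses the Pod type's patch strategy, so the containers
	// entry is keyed by name rather than by array index.
	patch, err := strategicpatch.CreateTwoWayMergePatch(oldJSON, newJSON, v1.Pod{})
	if err != nil {
		panic(err)
	}
	fmt.Println(string(patch))
}
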
@vinaykul vinaykul force-pushed the inplace-pod-vertical-scaling branch from 942465d to fef6f86 on June 30, 2020 00:11
1. Add framework code for e2e testing, and example e2e tests.
2. Do container resource update based on difference between requests as well as limits.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
@vinaykul vinaykul force-pushed the inplace-pod-vertical-scaling branch from fef6f86 to 426d0bb on June 30, 2020 01:28
@vinaykul
Owner Author

This is stale. Working on implementation of updated design. Will be sending PRs to k/k in a couple of weeks once well tested.

@vinaykul vinaykul closed this Jun 14, 2021
vinaykul pushed a commit that referenced this pull request Aug 24, 2022
* Add APF concurrency utilization test
vinaykul pushed a commit that referenced this pull request Nov 1, 2023
These were found with a modified klog that enables "go vet" to check klog call
parameters:

    cmd/kubeadm/app/features/features.go:149:4: printf: k8s.io/klog/v2.Warningf format %t has arg v of wrong type string (govet)
    			klog.Warningf("Setting deprecated feature gate %s=%t. It will be removed in a future release.", k, v)
    test/images/sample-device-plugin/sampledeviceplugin.go:147:5: printf: k8s.io/klog/v2.Errorf does not support error-wrapping directive %w (govet)
    				klog.Errorf("error: %w", err)
    test/images/sample-device-plugin/sampledeviceplugin.go:155:3: printf: k8s.io/klog/v2.Errorf does not support error-wrapping directive %w (govet)
    		klog.Errorf("Failed to add watch to %q: %w", triggerPath, err)
    staging/src/k8s.io/code-generator/cmd/prerelease-lifecycle-gen/prerelease-lifecycle-generators/status.go:207:5: printf: k8s.io/klog/v2.Fatalf does not support error-wrapping directive %w (govet)
    				klog.Fatalf("Package %v: unsupported %s value: %q :%w", i, tagEnabledName, ptag.value, err)
    staging/src/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:286:3: printf: (k8s.io/klog/v2.Verbose).Infof format %s reads arg #1, but call has 0 args (govet)
    		klog.V(4).Infof("Node %s missing in vSphere cloud provider cache, trying node informer")
    staging/src/k8s.io/legacy-cloud-providers/vsphere/nodemanager.go:302:3: printf: (k8s.io/klog/v2.Verbose).Infof format %s reads arg #1, but call has 0 args (govet)
    		klog.V(4).Infof("Node %s missing in vSphere cloud provider caches, trying the API server")
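
The corresponding fixes are mechanical: klog's printf-style helpers do not support %w, and every verb needs a matching argument of the right type. A sketch of the corrected calls (placeholder variables; the %s=%s choice for the kubeadm warning is an assumption about how it was fixed):

package main

import (
	"errors"

	"k8s.io/klog/v2"
)

func main() {
	k, v := "SomeDeprecatedGate", "true" // placeholders for the kubeadm loop variables
	err := errors.New("example failure")
	triggerPath, nodeName := "/some/trigger/path", "node-1"

	// %w is only understood by fmt.Errorf-style wrapping; with klog use %v.
	klog.Warningf("Setting deprecated feature gate %s=%s. It will be removed in a future release.", k, v)
	klog.Errorf("error: %v", err)
	klog.Errorf("Failed to add watch to %q: %v", triggerPath, err)
	// The original call had a %s verb but no argument; pass the node name.
	klog.V(4).Infof("Node %s missing in vSphere cloud provider cache, trying node informer", nodeName)
}
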
Labels: none yet
Projects: none yet
4 participants