In-Place Pod Vertical Scaling feature #1
Conversation
RestartContainer ContainerResizePolicy = "RestartContainer"
)

// ResizePolicy represents the resource resize policy for a single container.
Please be more precise about the meaning of this - is it a requirement that this container not be restarted or is it saying that a restart is not required (but may still happen if the kubelet needs to)? Or something else. I think this distinction may influence the constant strings above.
E.g. I assume this is a statement of capability. "I can be resized in-place without restart". That would lead me to look for something like:
resizePolicy:
- resource: "cpu"
  restart: "NotRequired"
- resource: "memory"
  restart: "Required"
What other policies do you think might one day be needed here?
NoRestart tells Kubelet to call the UpdateContainerResources CRI API to resize the resource - doing so does not restart the container. The RestartContainer policy, on the other hand, tells Kubelet to stop and start the container with the new resources; the use-case for this was some legacy Java apps using the -XmxN flag, which are unable to use the increased memory without restarting.
We had RestartPod as a third option at one point, but removed it because we couldn't find a real-world use case for it other than the notion that someone might want to restart the entire pod in-place (to rerun the init containers).
I'll add more details to the comments.
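As a minimal, self-contained Go sketch of the two policies described above (the type and constant names follow the diff in this thread; the comments encode the semantics from the discussion, and this is illustrative rather than the final merged API):

```go
package main

import "fmt"

// ContainerResizePolicy mirrors the type shown in the diff above.
type ContainerResizePolicy string

const (
	// NoRestart: kubelet calls the UpdateContainerResources CRI API to
	// resize in place; kubelet itself does not restart the container.
	NoRestart ContainerResizePolicy = "NoRestart"
	// RestartContainer: kubelet stops and starts the container with the
	// new resources (e.g. for legacy Java apps sized with -XmxN).
	RestartContainer ContainerResizePolicy = "RestartContainer"
)

func main() {
	fmt.Println(NoRestart, RestartContainer)
}
```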
Are you thinking this should just be a binary/bool value, considering we have only two values for the foreseeable future?
Hi, I found this half-completed, I don't know if it is still alive.
My point was to clarify whether "no restart" is contract or guidance. If the kubelet, for some reason, needed to restart the pod, would that be a violation of NoRestart or just an unfortunate, but legal, edge case?
I've updated the documentation in the code to clarify that container resize policy is guidance, by stating that kubelet should resize the container without restarting it if possible. I hope this is sufficient language.
/cc

lgtm, this matches what I expect from the KEP.
@@ -2053,6 +2080,12 @@ type Container struct {
	// Compute resource requirements.
	// +optional
	Resources ResourceRequirements
	// Node compute resources allocated to the container.
	// +optional
	ResourcesAllocated ResourceList
I am still confused why this is spec?
This is very much alive - I'm a couple of weeks away from having a mostly complete implementation out for review, and I have also updated the API code to move some of the validation earlier, into the admission stage.
NoRestart/RestartContainer above is guidance, and lets us decide whether to stop/start the container or just call the UpdateContainerResources API. The runtime may still choose to restart - we don't have control over that, and there is no explicit documentation telling the runtime not to restart.
ResourcesAllocated was added to Spec to track the resources that the node has agreed to reserve for the pod, and to make this info visible to the user.
We had discussed the alternative of storing this information locally on the node, but I was told that Borg had encountered issues with such an approach; they want to avoid checkpointing on the node and thought it best to have Spec as the source of truth. We had also looked at relying on PodStatus, but Status, per API conventions, isn't allowed to store any state that can't be reconstructed through observations.
pkg/kubelet/kubelet.go (outdated)
}
containersPatchData = strings.TrimRight(containersPatchData, ",")
patchData := fmt.Sprintf(`{"spec":{"containers":[%s]}}`, containersPatchData)
return true, patchData
We have a race condition here, but I don't remember how we agreed to solve it. For example, if we called canResizePod on pods A and B, such that the ordering was:
- canResizePod(A) => true
- canResizePod(B) => true
- Patch(A)
- Patch(B)
This could pass even if there was only room for 1 of the pods, since A is not included in kl.GetActivePods() during canResizePod(B).
At minimum, I think we need to update GetActivePods here. Or am I misremembering our conclusion?
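The interleaving above can be sketched in isolation. In this illustrative Go snippet (all names are hypothetical, not kubelet code), the admission check and the allocation update happen under one mutex, so the second resize observes the first one's allocation instead of racing past it:

```go
package main

import (
	"fmt"
	"sync"
)

// node is a stand-in for the kubelet's view of node capacity.
type node struct {
	mu          sync.Mutex
	allocatable int64 // total node capacity
	allocated   int64 // sum over active pods, analogous to ResourcesAllocated
}

// tryResize admits and records a resize atomically. Without the lock (or
// with the record step deferred, as in the canResizePod/Patch split above),
// two callers could both be admitted against the same free capacity.
func (n *node) tryResize(delta int64) bool {
	n.mu.Lock()
	defer n.mu.Unlock()
	if n.allocated+delta > n.allocatable {
		return false // would exceed allocatable
	}
	n.allocated += delta // visible to the next admission check
	return true
}

func main() {
	n := &node{allocatable: 1000, allocated: 900}
	fmt.Println(n.tryResize(100)) // true: fits exactly
	fmt.Println(n.tryResize(100)) // false: first resize consumed the headroom
}
```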
I've tested this for add vs resize race after coding up the switch-over from Requests to ResourcesAllocated (next change - hopefully by eod Monday). I think the mutex should serialize race between two resizes as well.
So in the above case, the next GetActivePods (for pod B) should use the updated ResourcesAllocated values of pod A if the resize was accepted. But let me specifically test that case and verify.
I tested the case of concurrent updates, and the mutex mechanism holds up well.
Where do we update the active pods?
The other thing I was trying to reason through at sig-node is whether the update needs to be written back to the API Server while holding the lock. Take the following example:
- canResizePod(A) decrease by 100m => true
- updateActivePods(A) decrease by 100m
- canResizePod(B) increase by 100m => true, since A's requests are lowered
- updateActivePods(B) increase by 100m
- Patch(B)
- Update (B) in the CRI
- ... Patch(A) hangs for a while ... Update(A) in the CRI doesn't happen.
In this case, B would have a higher allocated even though A has not been patched to be lower. An observer that did kubectl describe no would see resources allocated > allocatable. We could decide that that is OK, since we really just care about changes sent to the CRI, and the Patch() doesn't change that.
Does that make sense?
If we can end up with allocated > allocatable, that means a kubelet restart will result in one of the pods being rejected, as it reads pods from the source to admit them.
Also, WRT container.ResourcesAllocated[rName] = rQuantity updating active pods, I'd prefer explicitly calling podManager.UpdatePod() (or whatever the function name is) to make this explicit. The podManager GetActivePods() function is meant to be read-only, but we don't do deep copies to enforce that.
> If we can end up with allocated > allocatable, that means a kubelet restart will result in one of the pods being rejected, as it reads pods from the source to admit them.
Good point. We have to patch holding the lock, ugh. I'll look into it after I'm done with resource quota and limit range, and will also fix the explicit update.
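The fix agreed on here (doing the apiserver patch while still holding the allocation lock) can be sketched as follows. This is a hedged illustration with hypothetical names, not the kubelet implementation; the key property is that the admission check, the patch, and the local allocation update form one critical section, so a kubelet restart can never read back allocated > allocatable:

```go
package main

import (
	"fmt"
	"sync"
)

// resizer illustrates holding one lock across admit + persist + record.
type resizer struct {
	mu        sync.Mutex
	allocated int64
	limit     int64
	patch     func(delta int64) error // stand-in for the apiserver Patch call
}

func (r *resizer) resize(delta int64) error {
	r.mu.Lock()
	defer r.mu.Unlock() // lock is held across the patch, per the discussion
	if r.allocated+delta > r.limit {
		return fmt.Errorf("resize of %d denied: would exceed allocatable", delta)
	}
	if err := r.patch(delta); err != nil {
		return err // allocation stays unchanged if persistence fails
	}
	r.allocated += delta
	return nil
}

func main() {
	r := &resizer{limit: 1000, allocated: 950,
		patch: func(int64) error { return nil }}
	fmt.Println(r.resize(50)) // admitted and persisted
	fmt.Println(r.resize(1))  // denied: node is now full
}
```

The cost, as noted below, is that pod-startup throughput may suffer because the apiserver round-trip happens inside the lock.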
Sounds good. We have tests that test the throughput of pod startup. If it has a significant negative impact, we can consider the approach you currently have, and document the potential issues.
Commit 92ffa91 - Modified resource quota and limitranger plugins to handle updates to Requests and Limits, and modified kubelet to hold the lock across patch - it addresses the kubelet restart issue that you identified. Thanks!
1. Add ResourcesAllocated and ResizePolicy fields to Container struct.
2. Add Resources field to ContainerStatus struct.
3. Add a new admission controller named PodResourceAllocation for setting defaults and validation.
4. Address code-review items:
   a. Clarify that container resize policy is guidance from the user, not a guarantee by k8s.
   b. Handle updates from clients unaware of the new fields in pod spec.
   c. Use DropDisabledPodFields to avoid feature-gate checks in validation where possible.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Add ContainerResources message that supports both Linux and Windows resources.
2. Add ContainerResources to ContainerStatus CRI API.
3. Modify UpdateContainerResources CRI API to use ContainerResources.
4. Implement handling of the ContainerStatus response for runtimes that don't support resources query.
KEP: /enhancements/keps/sig-node/20191025-kubelet-container-resources-cri-api-changes.md

1. Handle patching of ResourcesAllocated fields of a resized pod in Kubelet's syncPod routine.
2. Handle container limits updates in the runtime manager's SyncPod routine.
3. Add new pods after testing for admissibility using ResourcesAllocated field values.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

…ation
1. Generate CPU Request API status information from the runtime's cpu.shares report.
2. Rename pod cgroup config get/set functions for functional clarity, and simplify the update container resources function.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Switch to using container.ResourcesAllocated instead of container.Resources.Requests for resource allocation/usage calculations.
2. Add support for ResourcesAllocated to kubectl describe with pod and node resources.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Enable resource quota and limit ranger for pod resource changes.
2. In kubelet, do the apiserver ResourcesAllocated patch under mutex.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Fix issues found by verify, typecheck, dependencies tests.
2. Fix error message during validation when the InPlacePodVerticalScaling feature-gate is off.
3. Back out feature-gate change in kubeadm InitFeatureGates list.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md

1. Use strategicpatch to generate patch data for ResourcesAllocated when handling pod resize.
2. Use ContainersToStart instead of ContainersToRestart in SyncPod.
3. Simplify use of ContainersToUpdate in SyncPod.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
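The commit above replaces hand-built patch strings with generated ones. As a simple stdlib-only sketch of why that matters (this uses encoding/json with an illustrative containerPatch type, not the strategicpatch package the commit actually adopts): marshaling guarantees correct quoting and escaping that the earlier Sprintf/TrimRight approach did not.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// containerPatch is a hypothetical shape for one container's patch entry.
type containerPatch struct {
	Name               string            `json:"name"`
	ResourcesAllocated map[string]string `json:"resourcesAllocated"`
}

// buildPatch marshals the patch body instead of concatenating strings,
// so container names and quantities are always valid JSON.
func buildPatch(containers []containerPatch) (string, error) {
	b, err := json.Marshal(map[string]interface{}{
		"spec": map[string]interface{}{"containers": containers},
	})
	return string(b), err
}

func main() {
	p, err := buildPatch([]containerPatch{
		{Name: "app", ResourcesAllocated: map[string]string{"cpu": "500m"}},
	})
	if err != nil {
		panic(err)
	}
	fmt.Println(p)
}
```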
1. Add framework code for e2e testing, and example e2e tests.
2. Do container resource update based on the difference between requests as well as limits.
KEP: /enhancements/keps/sig-node/20181106-in-place-update-of-pod-resources.md
This is stale. Working on an implementation of the updated design. Will be sending PRs to k/k in a couple of weeks once well tested.
What type of PR is this?
/kind api-change
What this PR does / why we need it:
This PR implements the API change for the In-Place Pod Vertical Scaling KEP, the Kubelet CRI changes to support the in-place pod vertical scaling feature, and the core implementation of the feature.
It adds new fields named ResourcesAllocated and ResizePolicy to Pod's Container type, and a Resources field to the ContainerStatus type. It also adds a new admission controller that limits the ability to change the ResourcesAllocated field to the system-node account. This API change enables users to scale a Pod's CPU and memory resources up or down.
The CRI change enables runtimes to report the currently applied resource limits for a container, and adds support for Windows containers.
The core implementation commit implements the In-Place Pod Vertical Scaling KEP.
Which issue(s) this PR fixes:
Fixes #
Special notes for your reviewer:
Switching scheduler, kubelet, and apiserver code (limit range, resource quota) is coming in a future commit.
Does this PR introduce a user-facing change?: Yes
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: