Add this suggestion to a batch that can be applied as a single commit.
This suggestion is invalid because no changes were made to the code.
Suggestions cannot be applied while the pull request is closed.
Suggestions cannot be applied while viewing a subset of changes.
Only one suggestion per line can be applied in a batch.
Add this suggestion to a batch that can be applied as a single commit.
Applying suggestions on deleted lines is not supported.
You must change the existing code in this line in order to create a valid suggestion.
Outdated suggestions cannot be applied.
This suggestion has been applied or marked resolved.
Suggestions cannot be applied from pending reviews.
Suggestions cannot be applied on multi-line comments.
Suggestions cannot be applied while the pull request is queued to merge.
Suggestion cannot be applied right now. Please check back later.
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
KEP: in-place update of pod resources #686
KEP: in-place update of pod resources #686
Changes from 1 commit
cd94808
7fb66f1
b8c1f4e
5d00f9f
9580642
b8d814e
df1c8f8
17923eb
e5052fc
1194243
bfab6a3
69f9190
199a008
574737c
5bdcd57
bc9dc2b
533c3c6
29a22b6
20cbea6
0ed9505
c745563
55c8e56
File filter
Filter by extension
Conversations
Jump to
There are no files selected for viewing
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we want to use a map here. See the Lists of named subobjects preferred over maps API convention.
This also still feels like quite a large API change just to support legacy Xmx applications. But i'll defer to approvers on whether this is justified.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using map here is consistent with ResourceAllocations (v1.ResourceList map) or Resources - map of Requests and Limits.
I respect the API conventions, but if you visualize the two (see below) - map looks simpler considering LHS strings are fixed/known system-defined keys (and not user data, not something that helps the user if they could name/alias it). Is that convention applicable to this case? If you look at the intent of @jbeda in those discussions, he is looking to keep things simple.
vs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ack. It is fine to leave it as is for now. This is worth pointing out to API reviewers later to get their opinion on.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if there is a better name than ResizePolicy.
The two options here are about restart.
Would adding another value for RestartPolicy (RestartPolicyContainer) be better ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We need to have this policy per container.
I had thought about naming it RestartPolicy at first, but felt it would cause confusion with PodSpec.RestartPolicy. Besides, ResizePolicy sounded somewhat better as it relates to resize.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
How can the user forbid in-place updates for a given Container, e.g. because any resource change would require to re-run Init Containers?
I think there used to be an option which said "restart the whole Pod".
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there a concrete use case or scenario that requires RestartPod policy? I removed it after the above discussion - I could not trace a use-case for it and couldn't justify its need with sig-node.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, I'm wondering if this can be folded into RetryPolicy in VPA where it lets the user specify this - similar to updateMode 'Recreate', if user needs to re-run init for resize we evict the Pod and resize the replacement during admission . This is less ideal than in-place resize with restart, but the push has been to keep things as simple as possible for the Kubelet, and this is something that can be added later if there is a strong use case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can see a case for a third policy here, which as a strawman I will call
SignalContainerWINCH
. This would allow the container to attempt to adjust its language runtime to conform to the new limits - e.g. a programmer determines that callingruntime.GOMAXPROCS(math.Ceil(numCPUs) + 1)
results in less scheduler thrashing.However, such a signal would only be useful if pods are able to interrogate the system for their own resource limits. This is perhaps best left to future enhancements to in-place update and should not block 1.17 implementation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On linux you can always read /sys/fs/cgroup can't you ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We discussed this at sig-node, and i'll post the rationale here in-case it is useful for others who are watching this proposal:
We can use
Pod.Spec.Containers[i].ResourcesAllocated
for computing the pod's requested resources, rather thanmax(Pod.Spec.Containers[i].Resources, Pod.Status.ContainerStatuses[i].Resources)
. Kubernetes pod placement is Request driven, rather than Usage driven. We accept that pods can be placed on nodes which have few additional resources available at the moment, but have implemented mechanisms (e.g. eviction, cgroup CPU throttling) to ensure pods are able to access resources they have requested even when few are available. With respect to this KEP, we only need to guarantee that theResourcesAllocated
of pods is less than Node Allocatable. This is true even if decreasing container memory limits lags behind the desired resource limits.There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't necessary. We already sort the initial batch of pods we get after restart by creation time.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's fine for PodAdditions if there's no resize requested. We complete pod admission sorted by creation time at ResourcesAllocated values for existing Pods.
Once that is done, if two or more Pods requested resize while KL was restarting, then we want to order them FCFS, don't we? If yes, then for Pods needing resize (and any brand new Pods), ascending ResourceVersion seems to be the way to do it per my experiment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Resizes must be serialized with admissions. This means the first HandlePodAdmission (with all pods that previously existed) will complete before any resizes are done.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also, are you actually allowed to compare ResourceVersion values across resources like that? I thought they were reserving the possibility of changing the implementation in the future and required that clients treat them as opaque numbers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ResourceVersion, as currently documented led me to believe this is possible - see example 2. The documentation for the CompareResourceVersion code does not call out if this applicable to individual objects and cannot cross objects. If this is not possible, the other alternative is to add LastUpdateTimeStamp to meta.