Update pod resource management design and rollout plan #314
Conversation
c13e7b0 → 6d3a603 (Compare)
> High level requirements for the design are as follows:
> - Do not break existing users. Ideally, there should be no changes to the Kubernetes API semantics.
> - Support multiple cgroup managers - systemd, cgroupfs, etc.
> ### Memory overcommitment
@sjenning -- this section captures the work you are implementing.
> * A value of 0 will instruct the `kubelet` to adjust the Burstable and
> BestEffort cgroup to restrict memory overcommit by inducing memory
> pressure and reclaim if needed at the QoS level cgroup tiers before
> inducing pressure at the `ROOT` cgroup.
I didn't realize this flag would take a value. I was thinking, at least at first, this would do a hard reserve of G and Bu limits against lower tiers. Basically hardcoding this value to 0.
I think a range is a valuable heuristic, any objections?
> ROOT/Burstable/memory.limit_in_bytes =
> Node.Allocatable - {(summation of memory requests of `Guaranteed` pods)*(1-qmo/100)}
I don't think this formula is correct. I think it should be Node.Allocatable - {(summation of memory requests of Guaranteed pods)*(100-qmo/100)}
It's possible I have a math error, will double check.
Shouldn't this be ROOT/Burstable/memory.limit_in_bytes = Node.Allocatable - { (sum(Guaranteed) + sum(Burstable)) * (1-(qmo/100)) }
> Node.Allocatable - {(summation of memory requests of all `Guaranteed` and `Burstable` pods)*(1-qmo/100)}
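To make the arithmetic being debated above concrete, here is a minimal Go sketch (the function and variable names are illustrative, not kubelet code) that computes the Burstable and BestEffort memory limits using the reserve-percentage form the thread later converges on; the `reservePercent` parameter corresponds to `100 - qmo` in the quoted formulas.

```go
package main

import "fmt"

// qosMemoryLimits computes memory.limit_in_bytes for the Burstable and
// BestEffort QoS cgroups: each lower tier gives up the reserved portion of
// the requests made by higher tiers.
func qosMemoryLimits(allocatable, guaranteedRequests, burstableRequests, reservePercent int64) (burstableLimit, bestEffortLimit int64) {
	guaranteedReserve := guaranteedRequests * reservePercent / 100
	burstableReserve := burstableRequests * reservePercent / 100

	burstableLimit = allocatable - guaranteedReserve
	bestEffortLimit = allocatable - guaranteedReserve - burstableReserve
	return burstableLimit, bestEffortLimit
}

func main() {
	const gi int64 = 1 << 30
	// 16Gi allocatable, 2Gi of Guaranteed requests, 4Gi of Burstable requests,
	// and a 100% reservation (i.e. qmo = 0).
	bu, be := qosMemoryLimits(16*gi, 2*gi, 4*gi, 100)
	fmt.Printf("Burstable limit: %d, BestEffort limit: %d\n", bu, be)
}
```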
> Each time a pod is admitted to the `kubelet`, the `kubelet` will increment
s/increment/adjust. Increment makes me think "increase" when we are actually decreasing the limit on lower QoS tier with each added pod.
agreed on the word change.
cc: @ethernetdan, change will require users to drain nodes. You likely want to track the progress of this.
@calebamiles looking like 1.6 will require node drains due to this + CRI enablement, we should talk about how to minimize impact.
> __Note__: The cgroup-root flag would allow the user to configure the root of the QoS cgroup hierarchy. Hence cgroup-root would be redefined as the root of QoS cgroup hierarchy and not containers.
> * 01/31/2017 - Discuss the rollout plan in sig-node meeting
> * 02/14/2017 - Flip the switch to enable pod level cgroups by default
Made this comment in SIG-node but if we can get this in by the 13th (I'd try to get it in the Friday before to be safe) we can target v1.6.0-alpha.2
> container and a BestEffort container is classified as a Burstable pod. The
> BestEffort container is not able to consume slack resources from the sibling
> Burstable container. It must instead compete for scarce resources at the node
> level across all containers in all QoS classes which violates the design.
Even with pod level cgroups, the BestEffort container in this example will compete with other Burstable pods. Remember there are no pod cgroup limits for Burstable pods unless all their containers specify limits, in which case there would be no BestEffort containers in the first place.
that is correct. i had taken this verbiage from the original document. will update that it will allow containers in a pod to share slack resources within its qos tier.
the best effort container gets cpu share time relative to the burstable pods request is the scenario i was most targeting here.
> 1. Ability to charge any memory usage of memory-backed volumes to the pod when
> an individual container exits instead of the node.
>
> ## Enabling the unified cgroup hierarchy
`unified cgroup hierarchy` could mean multiple things. What are you referring to here? If it's just about QoS and Pod cgroups, I'd rather refer to it as `QoS cgroups`.
sure
> ## Enabling the unified cgroup hierarchy
>
> To enable the unified cgroup hierarchy, the operator must enable the
> `--cgroups-per-qos` flag. Once enabled, the `kubelet` will start managing
Kindly note that we require `--cgroup-root` to not be `/`. This may change once Node Allocatable changes are made.
i would prefer we do not require the cgroup-root to previously exist, so ideally as part of node allocatable, the kubelet can create that step in the taxonomy. maybe we can just state that we anticipate that the root will not be /.
per node allocatable design, i am keeping the cgroup-root flag defaulted to / since we will create the kubepods cgroup relative to that.
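A small sketch of how the hierarchy discussed in this thread could be laid out under `--cgroup-root` (the `kubepods`, `burstable`, and `besteffort` names follow the convention referenced here; the helper itself is illustrative, not the kubelet's code):

```go
package main

import (
	"fmt"
	"path"
)

// qosHierarchy returns the abstract (cgroupfs-style) names of the node-level
// cgroup and the QoS tiers the kubelet creates beneath --cgroup-root.
// Guaranteed pod cgroups live directly under the node-level cgroup.
func qosHierarchy(cgroupRoot string) (node, burstable, bestEffort string) {
	node = path.Join(cgroupRoot, "kubepods")
	burstable = path.Join(node, "burstable")
	bestEffort = path.Join(node, "besteffort")
	return node, burstable, bestEffort
}

func main() {
	n, bu, be := qosHierarchy("/")
	fmt.Println(n, bu, be) // /kubepods /kubepods/burstable /kubepods/besteffort
}
```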
> operators may have to choose a particular cgroup driver to ensure
> proper system behavior. For example, if operators use the `systemd`
> cgroup driver provided by the `docker` runtime, the `kubelet` must
> be configured to use the `systemd` cgroup driver.
It is worth mentioning that with systemd, ownership of a slice must be delegated to the kubelet in order to use the `cgroupfs` driver.
ack
actually, i think Delegate=true relative to the cgroup root for the hierarchy should go in node allocatable follow-on or that PR that has that design.
Given that we introduce cgroup drivers here, I'd appreciate if you can cross reference this doc from node-allocatable.md or move this section to that doc.
i had hoped my doc could merge first, but ok ;-)
linked to node_allocatable design in "## Enabling QoS and Pod level cgroups" section (so even earlier in the doc).
> ## Integration with container runtimes
>
> The `kubelet` when integrating with container runtimes always provides the
> concrete cgroup filesystem name for the pod sandbox.
Concrete or abstract? IIRC, we intended to keep the wire format for CRI to be `Abstract` and have the runtime shims choose between `Abstract` and `Concrete`?
we settled on concrete. the shims are always provided the cgroupfs syntax as found on the host.
> For the initial implementation we will only support limits for cpu and memory resources.
> The `cgroups-per-qos` flag will be enabled by default, but user's
I'm wondering if we can use FeatureGate instead?
> #### Rkt runtime
>
> We want to have rkt create pods under a root QoS class that kubelet specifies, and set pod level cgroup parameters mentioned in this proposal by itself.
Does `rkt` support pod level cgroups?
I think `rkt` needs to support it. I think individual container runtime support should be separate from this proposal generally. Looking in the current code in head, I do not see it being used.
> We want to have rkt create pods under a root QoS class that kubelet specifies, and set pod level cgroup parameters mentioned in this proposal by itself.
>
> #### Add Pod level metrics to Kubelet's metrics provider
Why are you deleting this section? We do intend to get pod level metrics to help with evictions.
i thought that was covered in the core metrics proposal, i can add back some of these items.
i added this to a future enhancements section.
> - [ ] Check if parent cgroup exist and error out if they don't.
> - [ ] Set top level cgroup limit to resource allocatable until we support QoS level cgroup updates. If cgroup root is not `/` then set node resource allocatable as the cgroup resource limits on cgroup root.
> - [ ] Add a NodeResourceAllocatableProvider which returns the amount of allocatable resources on the nodes. This interface would be used both by the Kubelet and ContainerManager.
> - [ ] Add top level feasibility check to ensure that pod can be admitted on the node by estimating left over resources on the node.
I believe this is still necessary. We cannot admit a pod until (or unless) the QoS cgroups can accommodate it.
i wanted to avoid enumerating must have work items as part of the design doc. i will update the verbiage in the Memory allocation section to denote that it will do a check on Admit.
i describe the heuristic we will apply in the memory allocation section.
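For illustration, a hedged sketch of the kind of admission-time feasibility check referenced in the checklist above: sum the requests of already-admitted pods and reject the incoming pod if it would exceed Node Allocatable. The types and function are hypothetical, not the kubelet's actual Admit handler.

```go
package main

import (
	"errors"
	"fmt"
)

type resources struct {
	milliCPU    int64
	memoryBytes int64
}

// canAdmit returns nil if the incoming pod's requests fit within what is left
// of Node Allocatable after accounting for already-admitted pods.
func canAdmit(allocatable resources, admitted []resources, incoming resources) error {
	var used resources
	for _, r := range admitted {
		used.milliCPU += r.milliCPU
		used.memoryBytes += r.memoryBytes
	}
	if used.milliCPU+incoming.milliCPU > allocatable.milliCPU {
		return errors.New("insufficient cpu")
	}
	if used.memoryBytes+incoming.memoryBytes > allocatable.memoryBytes {
		return errors.New("insufficient memory")
	}
	return nil
}

func main() {
	alloc := resources{milliCPU: 4000, memoryBytes: 16 << 30}
	admitted := []resources{{milliCPU: 3500, memoryBytes: 12 << 30}}
	// A 600m / 1Gi pod does not fit on the remaining 500m of cpu.
	fmt.Println(canAdmit(alloc, admitted, resources{milliCPU: 600, memoryBytes: 1 << 30}))
}
```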
> - [ ] Set top level cgroup limit to resource allocatable until we support QoS level cgroup updates. If cgroup root is not `/` then set node resource allocatable as the cgroup resource limits on cgroup root.
> - [ ] Add a NodeResourceAllocatableProvider which returns the amount of allocatable resources on the nodes. This interface would be used both by the Kubelet and ContainerManager.
> - [ ] Add top level feasibility check to ensure that pod can be admitted on the node by estimating left over resources on the node.
> - [ ] Log basic cgroup management ie. creation/deletion metrics
We need this feature. Internally, we have observed high tail latencies for cgroupfs operations. I'd prefer having metrics to help us narrow down such issues.
ok, i will add a section for this in the document.
i added a section for logging.
> * opt-in behavior surrounding the feature (`qos-memory-overcommit` support) completed.
> * 03/01/2017 - Send an announcement to kubernetes-dev@ about the rollout and potential impact
> * 03/22/2017 - Kubernetes 1.6 release
> * TBD (1.7?) - Deprecate the old implementation
Is there an old implementation?
i meant this to read look at deprecating the ability to have pod level cgroups off. will clarify text.
68973b7 → 5e9be36 (Compare)
> accounting on the node, and introduces a number of code complexities when
> trying to build features around QoS.
>
> This design introduces a unified cgroup hierarchy to enable the following:
s/unified/new/ since unified means something else in the context of cgroups
> This design introduces a unified cgroup hierarchy to enable the following:
>
> 1. Improve enforcement of QoS class on the node.
nit: s/Improve/Enable s/class/classes/
nit: also do we want to say "QoS cgroup" instead of "QoS class" throughout since it is "Pod cgroups" and "Pod Class" doesn't make sense?
> 1. Improve enforcement of QoS class on the node.
> 1. Simplify resource accounting at the pod level.
> 1. Allow containers in a pod to share slack resources within its QoS class.
if this is talking about pod level cgroups wouldn't it be "share slack resources within its pod cgroup"?
> The `--cgroup-root` flag must have a value specified to use this feature.
> The `kubelet` will parent any cgroups it creates below that specified value.
> The `--cgroup-root` flag will default to `/` if not specified.
This conflicts with L57. Does the flag have a default or must it be specified?
the flag must have a value, and it defaults to /
> A pod can belong to one of the following 3 QoS classes: Guaranteed, Burstable, and BestEffort, in decreasing order of priority.
> The `kubelet` will support a flag `--qos-reserve-limits` that takes a
> set of percentages per compressible resource that controls how the QoS
s/compressible/incompressible
> in a range from 0-100%, where a value of 0 instructs the `kubelet` to
> attempt no reservation, and a value of 100 will instruct the `kubelet`
> to attempt to reserve the sum of requested resource across all pods
> on the node. How the `kubelet` achieves this desired state is resource
nit: maybe something along the lines of "exclude pods from lower QoS classes from using resources requested by higher QoS classes" would be clearer.
> attempt no reservation, and a value of 100 will instruct the `kubelet`
> to attempt to reserve the sum of requested resource across all pods
> on the node. How the `kubelet` achieves this desired state is resource
> specific. The default value per compressible resource if not specified
s/compressible/incompressible
> to attempt to reserve the sum of requested resource across all pods
> on the node. How the `kubelet` achieves this desired state is resource
> specific. The default value per compressible resource if not specified
> is for no reservation to occur.
Should the default be 100% reservation? I think this is what Vish was thinking. Since 0% is the effective value if you don't use the flag. The flag without a value should do something different than not having the flag at all.
i would assume that if a resource is not enumerated in the flag value, we would do no reservation.
> By default, no memory limits are applied to the BestEffort
> and Burstable QoS level cgroups unless a `--qos-reserve-limits` value
> is specified for memory.
Needs changing if you agree with my previous comment about default reservations
i dont think i agree w/ previous comment.
my opinion is that if no value is provided, we should do no harm, and therefore do nothing. it will have the same literal effect of qos-reserve-limits=memory=0%
5e9be36 → 8c71cc4 (Compare)
@vishh - please take a look since node allocatable basically requires this as well.
67e32e5 → 69f95c0 (Compare)
@vishh @sjenning @dchen1107 -- all updates have been made. i believe this accurately represents the plan we have PRs in flight to pursue for 1.6. i updated text around how qos level cgroup sandboxes are updated to try to reduce confusion, and minimize risk. ptal.
Automatic merge from submit-queue (batch tested with PRs 41349, 41532, 41256, 41587, 41657)

Enable pod level cgroups by default

**What this PR does / why we need it**: It enables pod level cgroups by default.

**Special notes for your reviewer**: This is intended to be enabled by default on 2/14/2017 per the plan outlined here: kubernetes/community#314

**Release note**:
```release-note
Each pod has its own associated cgroup by default.
```
> We use the following denotations in the sections below:
> Internally, the `kubelet` maintains both an abstract and a concrete name
> for its associated cgroup sandboxes. The abstract name follows the traditional
> `cgroupfs` style syntax. The concrete name is the name for how the cgroup
In the case of `systemd` the `concrete` name is also `abstract` :)
i am not following, unless you mean the concrete name is abstract by virtue that it encodes the hierarchy at each step. if so, i agree, that is an abstraction in and of itself. that said, it is current reality ;-)
> pod<UID>/memory.limit_in_bytes = sum(pod.spec.containers.resources.limits[memory])
>
> Note: This design enables containers in a pod to optionally share slack compute resources.
Well, unless all containers specify limits, there will no pod level limits. Once they specify limits, there is no sharing.
removed
> pod<UID>/cpu.shares = 2
Once we set `cpu.shares` to `2` at the QoS level cgroup, is this setting at the pod level still relevant?
it is less relevant, but it is consistent with what we end up setting for the container cgroup sandbox that it contains.
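A small sketch of the pod-level values quoted above, under the usual request-to-shares conversion (shares = milliCPU * 1024 / 1000 with a floor of 2, which is why a pod with no cpu request, such as a BestEffort pod, gets `cpu.shares = 2`); the helpers are illustrative, not the kubelet's code:

```go
package main

import "fmt"

const (
	minShares      = 2
	sharesPerCPU   = 1024
	milliCPUPerCPU = 1000
)

// milliCPUToShares mirrors the common request-to-cpu.shares conversion.
func milliCPUToShares(milliCPU int64) int64 {
	if milliCPU == 0 {
		return minShares
	}
	shares := milliCPU * sharesPerCPU / milliCPUPerCPU
	if shares < minShares {
		return minShares
	}
	return shares
}

// podMemoryLimit sums container memory limits; it is only meaningful when
// every container in the pod specifies a limit.
func podMemoryLimit(containerLimits []int64) int64 {
	var sum int64
	for _, l := range containerLimits {
		sum += l
	}
	return sum
}

func main() {
	fmt.Println(milliCPUToShares(0))   // 2 (no cpu request)
	fmt.Println(milliCPUToShares(500)) // 512
	fmt.Println(podMemoryLimit([]int64{100 << 20, 200 << 20}))
}
```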
> ## QoS level cgroups
>
> The `kubelet` defines a `--cgroup-root` flag that is used to specify the `ROOT`
I'd mention that the recommended configuration is to set `--cgroup-root` to `/` to avoid having a deep cgroup hierarchy.
ack
text was added re: deep hierarchies.
> allows BestEffort and Burstable pods to potentially consume as many
> resources that are presently available on the node.
>
> For compressible resources, this prioritization scheme has little impact.
This is not entirely true since power is limited across CPU sockets and so Burstable or BestEffort can cause performance issues for Guaranteed pods.
Hi @vishh --
You're talking about impact of TDP/HWP and turbo-boost frequency scaling interfering with G pods? Isn't the millicores concept intended to gloss over that level of detail, or is it that we are now concerned with it again in scenarios such as you described?
Now you have got me curious ... Any chance you could provide the content of /proc/cmdline and the output of "turbostat sleep 10" from a hypervisor node along with "numastat -p kvm"?
i rephrased this, but avoided discussing why a pod cpu request should ultimately being measured in watts ;-)
@jeremyeder The discussion here is about providing consistent performance. I agree with @derekwaynecarr that we should be discussing watts per core in a different issue.
> pods memory request may not be satisfied if there are active BestEffort
> pods consuming all available memory.
>
> The `kubelet` will support a flag `experimental-qos-reserve-requests` that
`reserve-requests` is a bit confusing. Are we reserving for Guaranteed pods? One could also interpret it as node level reservation.
That's why I preferred `overcommit-percentage` where the default `""` would be `100%` of Allocatable.
I thought we had agreed on this.
You can't "overcommit" a node in the way I think of overcommitting. The scheduler looks at pod resources `requests` and packs them in up to node `Allocatable`. In my mind, overcommitment is a scheduler thing i.e. pack more pods on the node assuming not all of them will use their `requests` all at once.
What we are doing is reserving the `requests` made by G and Bu pods against use in the lower QoS tiers. Hence `qos-reserve-requests`.
by definition, we are reserving for G pods in the same manner we are reserving for node allocatable top level cgroup. both have a memory limit set equal to their request. i agree with @sjenning . i think overcommit-percentage means something like a physical cpu/memory to virtual cpu/memory scalar that would skew what the node reports as allocatable back to the scheduler. this is precisely not that. this is attempting to reserve for the qos tier in the same manner we are reserving for the kubepods cgroup sandbox one level up in the hierarchy. it's also experimental in 1.6, so i am not sure how much we want to bikeshed on the name.
> You can't "overcommit" a node in the way I think of overcommitting.

Our current QoS doc talks about `oversubscribing`. Maybe that's a better term?

> What we are doing is reserving the requests made by G and Bu pods against use in the lower QoS tiers. Hence qos-reserve-requests

This is not obvious from the flag name. We use the keyword `reservation` in other places to indicate `System` and `Kube` reservations, for example, where it's a static reservation. Whereas here, we are referring to a dynamic quantity.
Because the kubernetes scheduler only looks at `requests`, it inherently oversubscribes (trying a different term) the nodes. This flag is merely controlling the amount of oversubscription.

> it's also experimental in 1.6, so i am not sure how much we want to bikeshed on the name.

Having spent so much time on it, we might as well resolve it now among the three of us. I feel this conversation will be helpful in agreeing on the semantic meaning of some of the terms we use all over the code too - `reservation`, `commitment`, `limits`, etc.
> converges to a desired state. Failure to set `cpu.shares` at the QoS level
> cgroup would result in `500m` of cpu for a Guaranteed pod to have different
> meaning than `500m` of cpu for a Burstable pod in the current hierarchy.
> For this reason, we will always set `cpu.shares` for the QoS level sandboxes
I wasn't aware of this change. I was under the assumption that all QoS level settings would be opt-in.
Also, given that we have pod level limits, I don't see QoS level shares limits to have that much of an impact unless a user explicitly opts in to restricting overcommit.
I am not following, absent setting a value:
$ cat /Burstable/cpu.shares
1024
If you have allocatable=4cpu and you run stress to do the following:
kubectl run burstable --image=sjenning/stress --requests=cpu=500m,memory=100Mi --replicas=1 --command -- /usr/bin/stress -c 4 -t 3600
by default, this bursts to consume all 4 cpus, which is fine, that is the point.
but if you do this:
kubectl run guaranteed --image=sjenning/stress --requests=cpu=500m,memory=100Mi --limits=cpu=500m,memory=100Mi --replicas=7 --command -- /usr/bin/stress -c 4 -t 3600
the result is that the burstable pod skews closer to 1 core of usage since the 1024 share evaluation is made relative to the guaranteed pods, and we really want the Burstable cpu share to be 500 so it was given equal time relative to the other pods.
i tested this locally and confirmed this was the case, and manually assigning
echo 500 > /Burstable/cpu.shares
fixes usage to closer to the actual value we requested as expected...
do you disagree that we need to set shares on burstable tier?
> do you disagree that we need to set shares on burstable tier?
Nope. All I'm saying is that I'd prefer not setting QoS limits by default in v1.6.
As per offline discussion, given that this will lead to a regression, I agree that its necessary to have it turned on by default. Apologies for the confusion.
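To illustrate the regression discussed in this thread, a rough sketch of the relative CPU weighting with and without request-derived shares on the Burstable tier, using the numbers from the example above (one 500m Burstable pod contending with seven 500m Guaranteed pods); the arithmetic is illustrative only:

```go
package main

import "fmt"

// fraction returns the share of contended CPU a cgroup receives given its
// cpu.shares and the total cpu.shares of its sibling cgroups.
func fraction(shares, siblingShares int64) float64 {
	return float64(shares) / float64(shares+siblingShares)
}

func main() {
	// Seven Guaranteed pods requesting 500m each sit beside the Burstable
	// tier cgroup, contributing roughly 7 * 512 sibling shares.
	siblings := int64(7 * 512)

	// Kernel default of 1024 shares on the Burstable tier vs. shares derived
	// from the single 500m Burstable request (~512).
	fmt.Printf("default 1024 shares: %.1f%% of contended CPU\n", 100*fraction(1024, siblings))
	fmt.Printf("request-derived 512 shares: %.1f%% of contended CPU\n", 100*fraction(512, siblings))
}
```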
> A value of `--experimental-qos-reserve-requests=memory=100%` will cause the
> `kubelet` to adjust the Burstable and BestEffort cgroups from consuming memory
> that was requested by a higher QoS class. This increases the risk
> of inducing OOM on BestEffort workloads in favor of increasing memory
and on Burstable too since they cannot consume memory that is otherwise free
> that was requested by a higher QoS class. This increases the risk
> of inducing OOM on BestEffort workloads in favor of increasing memory
> resource guarantees for Burstable and Guaranteed workloads. A value of
> `--experimental-qos-reserve-requests=memory=0%` will allow a Burstable
As I mentioned earlier, I find reservation misleading since it is not clear what the reservation is meant for.
> Since memory is an incompressible resource, it is possible that a QoS
> level cgroup sandbox may not be able to reduce memory usage below the
> value specified in the heuristic during pod admission and pod termination.
We don't use any `heuristic`, do we? Limits are based on the policy mentioned above, right?
> As a result, the `kubelet` runs a periodic thread to attempt to converge
> to this desired state from the above heuristic. If unreclaimable memory
> usage has exceeded the desired limit for the sandbox, the `kubelet` will
> attempt to set the effective limit near the current usage to put pressure
It is not safe to create a pod level cgroup unless updates on QoS level memory cgroup has been successful.
It's not obvious what the general design principle is from this text.
@vishh -- does something like the following help? I disagree on the safety argument. Operators most concerned about safety may deploy G workloads first to a node before opening up other workloads for scheduling. For nodes with fluid set of pods being scheduled, I want to attempt to provide safety, but do no worse than if the reservation feature is disabled.
Use cases:
- I want to prioritize access to compressible resources for my system/cluster daemons over end-user pods.
- I want to prioritize access to compressible resources for my G workloads over my Bu workloads.
- I want to prioritize access to compressible resources for my Bu workloads over my Be workloads.
An operator may choose to prioritize access to a compressible resource over utilization for each step in the hierarchy of workloads described above dependent upon their intended workload. Almost all cluster operators will use node allocatable to enforce the first use case in order to provide reliable operation of the node. It is understood that not all operators may feel the same for G/Bu workloads across the user community. For the users I represent, I know they may opt to deploy important cluster services as G workloads via a `DaemonSet` and would like a similar resource reservation model as is provided via node allocatable for system/cluster daemons and end-user pods. In addition, I know users I represent would like to run Be workloads and minimize their ability to impact Bu pods. Reliance on eviction, oom_killer, and critical pods is not always sufficient. Eviction still has latency, oom_killer is disruptive, and critical pods as I view them are intended to provide G style guarantees but run with Bu resource requirements.
Design Principle
kubelet with opt-in configuration will attempt to limit the ability for a pod in a lower QoS tier to burst utilization of a compressible resource that was requested by a pod in a higher QoS tier.
Mechanism
Prior to starting a G pod, we will always attempt to update the Bu and Be tiers to have their memory limits reduced based on the incoming G pod memory request. It is possible that we are unable to reduce the Bu and Be tier to their new desired limit if usage already has exceeded that value prior to execution of the G pod. If they have exceeded their usage, we will set the limit at or near their usage to put pressure on the cgroup to prevent further growth. The kubelet will not wait for the QoS cgroup memory limit to converge to the desired state prior to execution of the G pod. This does mean that the G pod could induce an OOM for the kubepods cgroup, but per our QoS design, we would prefer the oom_killer targets the Bu/Be pods first, or ideally, we get the eviction code path to kick in response to kernel notification, and evict the Bu/Be pod. The periodic task that attempts to converge the QoS tier memory limits would then have a better opportunity to converge on the desired state so any future Bu/Be pods that land do not impact the G pod since they would be working under a reduced memory limit. The same logic applies for the Bu pods that get scheduled, but reduces the pertinent QoS cgroup sandbox update to just Be tier.
Best Practices
Operators that want to provide a similar resource reservation model for G pods as we offer via enforcement of node allocatable are encouraged to schedule their G pods via a DaemonSet prior to opening up the node for scheduling of Bu/Be workloads. This ensures that the Bu/Be tiers have had their QoS memory limits appropriately adjusted before taking unbounded workloads on the node.
Future considerations
It's possible some users may want to stall execution of a pod in a higher QoS tier if resource reservations for compressible resources have not converged on the desired state. It can be evaluated based on more experience in the community.
We have discussed potentially allow end-user pods to schedule in the kube/system cgroup spaces, but the same hierarchy of needs may still be extended. For example, hosted multi-tenant shared-node offerings may want to allow the operators to deploy pods to kube/system cgroup spaces, but not the tenants on a shared node. For those users, offering the same opportunity to reserve requests at each level in the hierarchy is important.
Firstly, I'd prefer all this content to exist in the doc. This is kind of what I have been asking for in @sjenning's PRs. Thanks for posting this.
Your design choice is not geared towards predictable tail latencies. May be the customers you represent aren't expecting that yet. I do know some customers who care about that. Especially ones who run user facing web services that cannot have unpredictable latencies.
Relying on OOM scores and user space evictions isn't reliable enough yet. What this proposal is attempting to provide is definitely better than the current state, but we should clarify that it is `besteffort` Quality of Service. Users should know that we do not optimize for predictable tail latencies (yet).
Given that the kernel does not prevent over commitment from a memcg perspective this is not a correctness issue though.
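A minimal sketch of the convergence step described in the mechanism above: each pass moves a QoS cgroup's memory limit toward the desired value, but never below current usage, so the write does not fail and pressure is still applied until a later pass can finish the job. The function is illustrative, not the kubelet's implementation.

```go
package main

import "fmt"

// nextMemoryLimit picks the memory limit to write on this convergence pass.
func nextMemoryLimit(desired, usage int64) int64 {
	if usage <= desired {
		// Usage is already below the target; enforce the desired limit.
		return desired
	}
	// Cannot drop below current usage without an immediate failure; cap at
	// current usage to stop further growth and retry on the next pass.
	return usage
}

func main() {
	fmt.Println(nextMemoryLimit(8<<30, 6<<30)) // desired limit applied
	fmt.Println(nextMemoryLimit(8<<30, 9<<30)) // capped at usage for now
}
```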
> All memory backed volumes are removed when a pod reaches a terminal state.
>
> The `kubelet` verifies that a pod's cgroup is deleted from the
> host before deleting a pod from the API server as part of the
part of what?
graceful deletion ;-)
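For illustration, a hedged sketch of the check quoted above (delete the pod from the API server only once its cgroup is gone from the host); the cgroupfs path probe below is an assumption for the example, not the kubelet's actual cgroup manager:

```go
package main

import (
	"fmt"
	"os"
	"path/filepath"
)

// podCgroupExists reports whether the pod's memory cgroup directory is still
// present under the given cgroupfs mount.
func podCgroupExists(cgroupFSRoot, podCgroupPath string) bool {
	_, err := os.Stat(filepath.Join(cgroupFSRoot, "memory", podCgroupPath))
	return err == nil
}

func main() {
	if podCgroupExists("/sys/fs/cgroup", "kubepods/burstable/pod123") {
		fmt.Println("cgroup still present; defer API deletion")
	} else {
		fmt.Println("cgroup removed; safe to delete pod from the API server")
	}
}
```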
> ## Log basic cgroup management
>
> The `kubelet` will log and collect metrics associated with cgroup manipulation.
Are the metrics already in place or is this a TODO?
this is a TODO, i am hoping between @sjenning and me, we can bang a PR out by monday.
we basically will need an instrumented CgroupManager
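A possible shape for that instrumented CgroupManager, sketched with a Prometheus latency histogram wrapping create/destroy calls; the interface, metric name, and operations here are illustrative assumptions, not the eventual kubelet implementation:

```go
package main

import (
	"time"

	"github.com/prometheus/client_golang/prometheus"
)

// cgroupOpDuration records the latency of cgroup manipulations by operation.
var cgroupOpDuration = prometheus.NewHistogramVec(
	prometheus.HistogramOpts{
		Name:    "cgroup_manager_duration_seconds",
		Help:    "Latency of cgroup manager operations, keyed by operation type.",
		Buckets: prometheus.DefBuckets,
	},
	[]string{"operation"},
)

func init() { prometheus.MustRegister(cgroupOpDuration) }

// cgroupManager is a stand-in for the kubelet's cgroup manager interface.
type cgroupManager interface {
	Create(name string) error
	Destroy(name string) error
}

// instrumented decorates another cgroupManager with timing metrics.
type instrumented struct{ delegate cgroupManager }

func (i instrumented) observe(op string, start time.Time) {
	cgroupOpDuration.WithLabelValues(op).Observe(time.Since(start).Seconds())
}

func (i instrumented) Create(name string) error {
	defer i.observe("create", time.Now())
	return i.delegate.Create(name)
}

func (i instrumented) Destroy(name string) error {
	defer i.observe("destroy", time.Now())
	return i.delegate.Destroy(name)
}

// noop is a do-nothing delegate so the example runs standalone.
type noop struct{}

func (noop) Create(string) error  { return nil }
func (noop) Destroy(string) error { return nil }

func main() {
	m := instrumented{delegate: noop{}}
	_ = m.Create("/kubepods/burstable/pod123")
	_ = m.Destroy("/kubepods/burstable/pod123")
}
```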
13ee005 → 648809e (Compare)
@vishh -- ready for another round....
Just a couple more comment threads still open.
@dchen1107 @davidopp @erictune @thockin There has been a debate on naming the QoS over subscription policy introduced by this proposal. I will try to capture the discussion thus far.
@derekwaynecarr @sjenning did I miss anything? EDIT:
> will attempt to limit the abilty for a pod in a lower QoS tier to burst utilization
> of a compressible resource that was requested by a pod in a higher QoS tier.
>
> The `kubelet` will support a flag `experimental-qos-reserve-requests` that
to capture what was discussed via slack:
i want this flag to be called --experimental-qos-reserved
flag name updated.
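For illustration, a sketch of parsing the renamed `--experimental-qos-reserved` value (e.g. `memory=100%`) into per-resource percentages, with unspecified resources defaulting to no reservation as agreed above; the exact parsing rules are an assumption, not the kubelet's code:

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// parseQOSReserved turns a value like "memory=50%" (optionally comma-separated
// for multiple resources) into a map of resource name to percentage.
func parseQOSReserved(value string) (map[string]int64, error) {
	reserved := map[string]int64{}
	if value == "" {
		return reserved, nil // unspecified resources get no reservation
	}
	for _, pair := range strings.Split(value, ",") {
		parts := strings.SplitN(pair, "=", 2)
		if len(parts) != 2 {
			return nil, fmt.Errorf("malformed entry %q", pair)
		}
		pct, err := strconv.ParseInt(strings.TrimSuffix(parts[1], "%"), 10, 64)
		if err != nil || pct < 0 || pct > 100 {
			return nil, fmt.Errorf("percentage must be 0-100 in %q", pair)
		}
		reserved[parts[0]] = pct
	}
	return reserved, nil
}

func main() {
	fmt.Println(parseQOSReserved("memory=100%"))
}
```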
@vishh -- what is missed is how i view this is doing the exact same thing for the qos tiers as we are doing with the system and kube tiers via system-reserved and kube-reserved, both of which statically reserve resources available away from workloads with a lower precedent in the hierarchy. the distinction here is that the reservation is not static (because its based on pods scheduled). the action is still the same. see use case discussion here for context:
I see the word "reserved" has multiple meanings, but I am OK with it. At least, I don't immediately see a better word - reservation feels right here. This doesn't allow me to express "under-provision Burstable but fully fund Guaranteed", which seems like it might be a useful policy (and sort of maps to Borg tiers)? This representation (qos-reserved) is hard to wrap one's head around, though. It doesn't mean to overcommit, if I understand. It simply modulates the availability of resources for lower tiers. Do we need a way for the scheduler to understand over-commit, too?
648809e → 3b88cc4 (Compare)
> ## Design
> For example, the cgroup name `/Burstable/pod_123-456` is translated to a
Didn't you want to make all cgroup names lower case?
yes, per our discussion, i think the existing code should follow cgroup v2 naming conventions now rather than later.
see: https://www.kernel.org/doc/Documentation/cgroup-v2.txt
section: 2-6-2. Avoid Name Collisions
i will update the text, and open a pr.
opened kubernetes/kubernetes#42497 to update
> node in the cgroup hierarchy below which the `kubelet` should manange individual
> cgroup sandboxes. It is strongly recommended that users keep the default
> value for `--cgroup-root` as `/` in order to avoid deep cgroup hierarchies. If
> `--enforce-node-alloctable` is enabled per [node allocatable](node-allocatable.md),
This is not correct. `kubepods` is controlled by the `--cgroups-per-qos` flag.
fixed.
> cluster services as Guaranteed workloads via a `DaemonSet` and would like a similar
> resource reservation model as is provided via [node allocatable](node-allocatable)
> for system and kubernetes daemons. In addition, depending how operators choose
> to leverage BestEffort workloads, they may choose to make a similar decision
This rationale calls for having separate configuration for `G` and `Bu` QoS levels. Something to consider before going to GA with this feature.
the rationale was not intended to be read in that manner. as you know, many may choose to deploy cluster daemons in burstable tier as well (GKE does with kube-proxy).
Got it. When I read your description, you try to point out that each use case can exist independent of the other. That's why I felt it is useful to have per QoS configuration.
clarified text to avoid confusion.
> to the desired state prior to execution of the pod, but it will always
> attempt to cap the existing usage of QoS cgroup sandboxes in lower tiers.
> This does mean that the new pod could induce an OOM event at the `ROOT`
> cgroup, but ideally per our QoS design, the oom_killer targets a pod
As a next step, we need to enable evictions at the QoS levels too to facilitate QoS enforcement.
added a note to Future enhancements.
> reservation model for Guaranteed pods as we offer via enforcement of
> node allocatable are encouraged to schedule their Guaranteed pods first
> as it will ensure the Burstable and BestEffort tiers have had their QoS
> memory limits appropriately ajdusted before taking unbounded workload on
nit: typo: adjusted
fixed
> host before deleting a pod from the API server as part of the graceful
> deletion process.
>
> This ensures resource consumption associated with those volumes are not
This is not totally true because we do not wait for the pod level cgroup usage to drop to `0` before deleting it.
I feel this is a TODO.
This feature does prevent crash looping containers from taking up all the memory on the node and causing system OOMs, if node allocatable and evictions are enabled.
removed confusing text.
3b88cc4 → 5f8f613 (Compare)
> style syntax into transient slices, and as a result, it must follow `systemd`
> conventions for path encoding.
>
> For example, the cgroup name `/Burstable/pod_123-456` is translated to a
nit: I'd appreciate a future PR to rename all names to lowercase.
i thought i caught them all, i will take another pass.
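A rough illustration of the cgroupfs-to-systemd translation discussed above: systemd encodes the hierarchy into the unit name (with `-` as the separator and a `.slice` suffix), and each dash-delimited prefix is itself a parent slice on disk. The exact escaping the kubelet applies (for example to characters in a pod UID such as `pod_123-456`) is defined by the kubelet and libcontainer, not by this sketch.

```go
package main

import (
	"fmt"
	"path/filepath"
	"strings"
)

// toSystemdUnit converts an abstract cgroupfs-style name such as
// "/kubepods/burstable/pod123" into a systemd slice unit name.
func toSystemdUnit(cgroupfsName string) string {
	parts := strings.Split(strings.Trim(cgroupfsName, "/"), "/")
	return strings.Join(parts, "-") + ".slice"
}

// expandSlice returns the concrete on-host path of a slice unit, nesting each
// dash-delimited prefix as its own parent slice.
func expandSlice(unit string) string {
	name := strings.TrimSuffix(unit, ".slice")
	var prefix, path string
	for _, piece := range strings.Split(name, "-") {
		prefix += piece
		path = filepath.Join(path, prefix+".slice")
		prefix += "-"
	}
	return "/" + path
}

func main() {
	unit := toSystemdUnit("/kubepods/burstable/pod123")
	fmt.Println(unit)              // kubepods-burstable-pod123.slice
	fmt.Println(expandSlice(unit)) // /kubepods.slice/kubepods-burstable.slice/kubepods-burstable-pod123.slice
}
```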
LGTM. Merging this PR because it's the design doc for a feature that is already merged into v1.6.
Automatic merge from submit-queue (batch tested with PRs 41919, 41149, 42350, 42351, 42285)

kubelet: enable qos-level memory limits

```release-note
Experimental support to reserve a pod's memory request from being utilized by pods in lower QoS tiers.
```

Enables the QoS-level memory cgroup limits described in kubernetes/community#314

**Note: QoS level cgroups have to be enabled for any of this to take effect.**

Adds a new `--experimental-qos-reserved` flag that can be used to set the percentage of a resource to be reserved at the QoS level for pod resource requests. For example, `--experimental-qos-reserved="memory=50%"` means that if a Guaranteed pod sets a memory request of 2Gi, the Burstable and BestEffort QoS memory cgroups will have their `memory.limit_in_bytes` set to `NodeAllocatable - (2Gi*50%)` to reserve 50% of the guaranteed pod's request from being used by the lower QoS tiers. If a Burstable pod sets a request, its reserve will be deducted from the BestEffort memory limit. The result is that:

- Guaranteed limit matches root cgroup and is not set by this code
- Burstable limit is `NodeAllocatable - Guaranteed reserve`
- BestEffort limit is `NodeAllocatable - Guaranteed reserve - Burstable reserve`

The only resource currently supported is `memory`; however, the code is generic enough that other resources can be added in the future.

@derekwaynecarr @vishh
Update pod resource management design and rollout plan
Update the documentation for pod resource management based on current plan.
Document a rollout plan and tentative timeline for enabling the feature.