
Update pod resource management design and rollout plan #314

Merged (1 commit) on Mar 3, 2017

Conversation

derekwaynecarr (Member):

Update the documentation for pod resource management based on current plan.

Document a rollout plan and tentative timeline for enabling the feature.

@k8s-ci-robot added the `cncf-cla: yes` label (indicates the PR's author has signed the CNCF CLA) on Jan 31, 2017
High level requirements for the design are as follows:
- Do not break existing users. Ideally, there should be no changes to the Kubernetes API semantics.
- Support multiple cgroup managers - systemd, cgroupfs, etc.
### Memory overcommitment
derekwaynecarr (Member Author):

@sjenning -- this section captures the work you are implementing.

* A value of 0 will instruct the `kubelet` to adjust the Burstable and
BestEffort cgroup to restrict memory overcommit by inducing memory
pressure and reclaim if needed at the QoS level cgroup tiers before
inducing pressure at the `ROOT` cgroup.
@sjenning (Contributor), Jan 31, 2017:

I didn't realize this flag would take a value. I was thinking, at least at first, this would do a hard reserve of G and Bu limits against lower tiers. Basically hardcoding this value to 0.

derekwaynecarr (Member Author):

I think a range is a valuable heuristic, any objections?


```
ROOT/Burstable/memory.limit_in_bytes =
Node.Allocatable - {(summation of memory requests of `Guaranteed` pods)*(1-qmo/100)}
```
Contributor:

I don't think this formula is correct. I think it should be

Node.Allocatable - {(summation of memory requests of Guaranteed pods)*(100-qmo/100)}

derekwaynecarr (Member Author), Jan 31, 2017:

It's possible I have a math error, will double check.

Contributor:

Shouldn't this be
ROOT/Burstable/memory.limit_in_bytes = Node.Allocatable - { (sum (Guaranteed) + sum (Burstable) ) * (1-(qmo/100)) }


```
Node.Allocatable - {(summation of memory requests of all `Guaranteed` and `Burstable` pods)*(1-qmo/100)}
```
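For illustration only, here is a worked example of the formula quoted above, using hypothetical numbers that are not part of the design doc (8Gi of Node.Allocatable and Guaranteed pods requesting a total of 2Gi of memory), where qmo is the proposed qos-memory-overcommit percentage:

```
qmo=0   -> Burstable limit = 8Gi - 2Gi*(1 - 0/100)   = 6Gi   (Guaranteed requests fully reserved)
qmo=50  -> Burstable limit = 8Gi - 2Gi*(1 - 50/100)  = 7Gi   (half of the Guaranteed requests reserved)
qmo=100 -> Burstable limit = 8Gi - 2Gi*(1 - 100/100) = 8Gi   (no reservation, full overcommit)
```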

Each time a pod is admitted to the `kubelet`, the `kubelet` will increment
Contributor:

s/increment/adjust/. Increment makes me think "increase" when we are actually decreasing the limit on the lower QoS tier with each added pod.

derekwaynecarr (Member Author):

agreed on the word change.

@vishh self-assigned this on Jan 31, 2017
@calebamiles (Contributor):

cc: @ethernetdan, this change will require users to drain nodes. You likely want to track the progress of this.

@ethernetdan:

@calebamiles looking like 1.6 will require node drains due to this + CRI enablement, we should talk about how to minimize impact


__Note__: The cgroup-root flag would allow the user to configure the root of the QoS cgroup hierarchy. Hence cgroup-root would be redefined as the root of the QoS cgroup hierarchy rather than the root for containers.
* 01/31/2017 - Discuss the rollout plan in sig-node meeting
* 02/14/2017 - Flip the switch to enable pod level cgroups by default


Made this comment in SIG-node but if we can get this in by the 13th (I'd try to get it in the Friday before to be safe) we can target v1.6.0-alpha.2

container and a BestEffort container is classified as a Burstable pod. The
BestEffort container is not able to consume slack resources from the sibling
Burstable container. It must instead compete for scarce resources at the node
level across all containers in all QoS classes, which violates the design.
Contributor:

Even with pod level cgroups, the BestEffort container in this example will compete with other Burstable pods. Remember there are no pod cgroup limits for Burstable pods unless all their containers specify limits, in which case there would be no BestEffort containers in the first place.

derekwaynecarr (Member Author):

that is correct. i had taken this verbiage from the original document. will update that it will allow containers in a pod to share slack resources within its qos tier.

derekwaynecarr (Member Author):

the scenario i was most targeting here is that the best effort container gets cpu share time relative to the burstable pod's request.

1. Ability to charge any memory usage of memory-backed volumes to the pod, rather than
to the node, when an individual container exits.

## Enabling the unified cgroup hierarchy
Contributor:

unified cgroup hierarchy could mean multiple things. What are you referring to here? If it's just about QoS and Pod cgroups, I'd rather refer to it as QoS cgroups.

derekwaynecarr (Member Author):

sure


To enable the unified cgroup hierarchy, the operator must enable the
`--cgroups-per-qos` flag. Once enabled, the `kubelet` will start managing
Contributor:

Kindly note that we require --cgroup-root to not be /. This may change once Node Allocatable changes are made.

derekwaynecarr (Member Author):

i would prefer we do not require the cgroup-root to previously exist, so ideally as part of node allocatable, the kubelet can create that step in the taxonomy. maybe we can just state that we anticipate that the root will not be /.

derekwaynecarr (Member Author):

per node allocatable design, i am keeping the cgroup-root flag defaulted to / since we will create the kubepods cgroup relative to that.

operators may have to choose a particular cgroup driver to ensure
proper system behavior. For example, if operators use the `systemd`
cgroup driver provided by the `docker` runtime, the `kubelet` must
be configured to use the `systemd` cgroup driver.
Contributor:

It is worth mentioning that with systemd, ownership of a slice must be delegated to the kubelet in order to use the cgroupfs driver.

derekwaynecarr (Member Author):

ack

derekwaynecarr (Member Author):

actually, i think Delegate=true relative to the cgroup root for the hierarchy should go in node allocatable follow-on or that PR that has that design.

Contributor:

Given that we introduce cgroup drivers here, I'd appreciate if you can cross reference this doc from node-allocatable.md or move this section to that doc.

derekwaynecarr (Member Author):

i had hoped my doc could merge first, but ok ;-)

derekwaynecarr (Member Author):

linked to node_allocatable design in "## Enabling QoS and Pod level cgroups" section (so even earlier in the doc).
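To make the driver-matching requirement above concrete, a minimal sketch (flag values illustrative only): if the `docker` daemon runs with the `systemd` cgroup driver, the `kubelet` should be configured with the matching driver when QoS cgroups are enabled.

```
# docker daemon using the systemd cgroup driver (illustrative):
#   dockerd --exec-opt native.cgroupdriver=systemd
# kubelet configured to match:
kubelet --cgroup-driver=systemd --cgroups-per-qos=true ...
```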

## Integration with container runtimes

When integrating with container runtimes, the `kubelet` always provides the
concrete cgroup filesystem name for the pod sandbox.
Contributor:

Concrete or abstract? IIRC, we intended to keep the wire format for CRI to be Abstract and have the runtime shims choose between Abstract and Concrete?

derekwaynecarr (Member Author):

we settled on concrete. the shims are always provided the cgroupfs syntax as found on the host.


For the initial implementation we will only support limits for cpu and memory resources.
The `cgroups-per-qos` flag will be enabled by default, but user's
Contributor:

I'm wondering if we can use FeatureGate instead?


#### Rkt runtime

We want to have rkt create pods under a root QoS class that kubelet specifies, and set pod level cgroup parameters mentioned in this proposal by itself.
Contributor:

Does rkt support pod level cgroups?

derekwaynecarr (Member Author):

I think rkt needs to support it. I think individual container runtime support should be separate from this proposal generally. Looking in the current code in head, I do not see it being used.



#### Add Pod level metrics to Kubelet's metrics provider
Contributor:

Why are you deleting this section? We do intend to get pod level metrics to help with evictions.

derekwaynecarr (Member Author):

i thought that was covered in the core metrics proposal, i can add back some of these items.

derekwaynecarr (Member Author):

i added this to a future enhancements section.

- [ ] Check if parent cgroups exist and error out if they don't.
- [ ] Set top level cgroup limit to resource allocatable until we support QoS level cgroup updates. If cgroup root is not `/` then set node resource allocatable as the cgroup resource limits on cgroup root.
- [ ] Add a NodeResourceAllocatableProvider which returns the amount of allocatable resources on the nodes. This interface would be used both by the Kubelet and ContainerManager.
- [ ] Add top level feasibility check to ensure that pod can be admitted on the node by estimating left over resources on the node.
Contributor:

I believe this is still necessary. We cannot admit a pod until (or unless) the QoS cgroups can accommodate it.

derekwaynecarr (Member Author):

i wanted to avoid enumerating must have work items as part of the design doc. i will update the verbiage in the Memory allocation section to denote that it will do a check on Admit.

derekwaynecarr (Member Author):

i describe the heuristic we will apply in the memory allocation section.

- [ ] Log basic cgroup management, i.e. creation/deletion metrics
Contributor:

We need this feature. Internally, we have observed high tail latencies for cgroupfs operations. I'd prefer having metrics to help us narrow down such issues.

derekwaynecarr (Member Author):

ok, i will add a section for this in the document.

derekwaynecarr (Member Author):

i added a section for logging.

* opt-in behavior surrounding the feature (`qos-memory-overcommit` support) completed.
* 03/01/2017 - Send an announcement to kubernetes-dev@ about the rollout and potential impact
* 03/22/2017 - Kubernetes 1.6 release
* TBD (1.7?) - Deprecate the old implementation
Contributor:

Is there an old implementation?

derekwaynecarr (Member Author):

i meant this to read: look at deprecating the ability to turn pod level cgroups off. will clarify text.

accounting on the node, and introduces a number of code complexities when
trying to build features around QoS.

This design introduces a unified cgroup hierarchy to enable the following:
Contributor:

s/unified/new/ since unified means something else in the context of cgroups



1. Improve enforcement of QoS class on the node.
Contributor:

nit: s/Improve/Enable/ s/class/classes/

Contributor:

nit: also do we want to say "QoS cgroup" instead of "QoS class" throughout since it is "Pod cgroups" and "Pod Class" doesn't make sense?


1. Simplify resource accounting at the pod level.
1. Allow containers in a pod to share slack resources within its QoS class.
Contributor:

if this is talking about pod level cgroups wouldn't it be "share slack resources within its pod cgroup"?


The `--cgroup-root` flag must have a value specified to use this feature.
The `kubelet` will parent any cgroups it creates below that specified value.
The `--cgroup-root` flag will default to `/` if not specified.
Contributor:

This conflicts with L57. Does the flag have a default or must it be specified?

derekwaynecarr (Member Author):

the flag must have a value, and it defaults to /


A pod can belong to one of the following 3 QoS classes: Guaranteed, Burstable, and BestEffort, in decreasing order of priority.
The `kubelet` will support a flag `--qos-reserve-limits` that takes a
set of percentages per compressible resource that controls how the QoS
Contributor:

s/compressible/incompressible

in a range from 0-100%, where a value of 0 instructs the `kubelet` to
attempt no reservation, and a value of 100 will instruct the `kubelet`
to attempt to reserve the sum of requested resource across all pods
on the node. How the `kubelet` achieves this desired state is resource
Contributor:

nit: maybe something along the lines of "exclude pods from lower QoS classes from using resources requested by higher QoS classes" would be clearer.

attempt no reservation, and a value of 100 will instruct the `kubelet`
to attempt to reserve the sum of requested resource across all pods
on the node. How the `kubelet` achieves this desired state is resource
specific. The default value per compressible resource if not specified
Contributor:

s/compressible/incompressible

to attempt to reserve the sum of requested resource across all pods
on the node. How the `kubelet` achieves this desired state is resource
specific. The default value per compressible resource if not specified
is for no reservation to occur.
Contributor:

Should the default be 100% reservation? I think this is what Vish was thinking, since 0% is the effective value if you don't use the flag. The flag without a value should do something different than not having the flag at all.

derekwaynecarr (Member Author):

i would assume that if a resource is not enumerated in the flag value, we would do no reservation.


By default, no memory limits are applied to the BestEffort
and Burstable QoS level cgroups unless a `--qos-reserve-limits` value
is specified for memory.
Contributor:

Needs changing if you agree with my previous comment about default reservations

derekwaynecarr (Member Author):

i don't think i agree w/ previous comment.

derekwaynecarr (Member Author):

my opinion is that if no value is provided, we should do no harm, and therefore do nothing. it will have the same literal effect of qos-reserve-limits=memory=0%
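For clarity, a hypothetical pair of invocations under the "do no harm" default described above (flag name as proposed at this point in the review); both have the same effect:

```
kubelet --cgroups-per-qos=true                                  # memory not enumerated: no reservation
kubelet --cgroups-per-qos=true --qos-reserve-limits=memory=0%   # explicit: no memory reservation
```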

@derekwaynecarr (Member Author):

@vishh - please take a look since node allocatable basically requires this as well.

@derekwaynecarr force-pushed the cgroup_rollout branch 3 times, most recently from 67e32e5 to 69f95c0 on February 21, 2017 21:59
@derekwaynecarr (Member Author):

@vishh @sjenning @dchen1107 -- all updates have been made. i believe this accurately represents the plan we have PRs in flight to pursue for 1.6. i updated text around how qos level cgroup sandboxes are updated to try to reduce confusion, and minimize risk. ptal.

k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Feb 22, 2017
Automatic merge from submit-queue (batch tested with PRs 41349, 41532, 41256, 41587, 41657)

Enable pod level cgroups by default

**What this PR does / why we need it**:
It enables pod level cgroups by default.

**Special notes for your reviewer**:
This is intended to be enabled by default on 2/14/2017 per the plan outlined here:
kubernetes/community#314

**Release note**:
```release-note
Each pod has its own associated cgroup by default.
```
We use the following denotations in the sections below:
Internally, the `kubelet` maintains both an abstract and a concrete name
for its associated cgroup sandboxes. The abstract name follows the traditional
`cgroupfs` style syntax. The concrete name is the name for how the cgroup
Contributor:

In the case of systemd the concrete name is also abstract :)

derekwaynecarr (Member Author):

i am not following, unless you mean the concrete name is abstract by virtue that it encodes the hierarchy at each step. if so, i agree, that is an abstraction in and of itself. that said, it is current reality ;-)

```
pod<UID>/memory.limit_in_bytes = sum(pod.spec.containers.resources.limits[memory])
```

Note: This design enables containers in a pod to optionally share slack compute resources.
Contributor:

Well, unless all containers specify limits, there will be no pod level limits. Once they specify limits, there is no sharing.

derekwaynecarr (Member Author):

removed

```
pod<UID>/cpu.shares = 2
```
Contributor:

Once we set cpu.shares to 2 at the QoS level cgroup, is this setting at the pod level still relevant?

derekwaynecarr (Member Author):

it is less relevant, but it is consistent with what we end up setting for the container cgroup sandbox that it contains.
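To illustrate the pod level values quoted above with hypothetical numbers (not taken from the doc): a BestEffort pod receives the minimal `cpu.shares` value of 2, while a pod whose containers all specify memory limits receives a pod level memory limit equal to their sum.

```
# BestEffort pod (no requests or limits set):
pod<UID>/cpu.shares            = 2
# Pod with two containers limited to 100Mi and 200Mi of memory (hypothetical):
pod<UID>/memory.limit_in_bytes = 100Mi + 200Mi = 314572800
```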


## QoS level cgroups

The `kubelet` defines a `--cgroup-root` flag that is used to specify the `ROOT`
Contributor:

I'd mention that the recommended configuration is to set --cgroup-root to / to avoid having a deep cgroup hierarchy.

derekwaynecarr (Member Author):

ack

derekwaynecarr (Member Author):

text was added re: deep hierarchies.

allows BestEffort and Burstable pods to potentially consume as many
resources as are presently available on the node.

For compressible resources, this prioritization scheme has little impact.
Contributor:

This is not entirely true since power is limited across CPU sockets and so Burstable or BestEffort can cause performance issues for Guaranteed pods.

Contributor:

Hi @vishh --

You're talking about impact of TDP/HWP and turbo-boost frequency scaling interfering with G pods? Isn't the millicores concept intended to gloss over that level of detail, or is it that we are now concerned with it again in scenarios such as you described?

Now you have got me curious ... Any chance you could provide the content of /proc/cmdline and the output of "turbostat sleep 10" from a hypervisor node along with "numastat -p kvm"?

derekwaynecarr (Member Author):

i rephrased this, but avoided discussing why a pod cpu request should ultimately be measured in watts ;-)

Contributor:

@jeremyeder The discussion here is about providing consistent performance. I agree with @derekwaynecarr that we should be discussing watts per core in a different issue.

pods memory request may not be satisfied if there are active BestEffort
pods consuming all available memory.

The `kubelet` will support a flag `experimental-qos-reserve-requests` that
Contributor:

reserve-requests is a bit confusing. Are we reserving for Guaranteed pods? One could also interpret it as node level reservation.

That's why I preferred overcommit-percentage where the default "" would be 100% of Allocatable.

@sjenning (Contributor), Feb 22, 2017:

I thought we had agreed on this.

You can't "overcommit" a node in the way I think of overcommitting. The scheduler looks at pod resource requests and packs them in up to node Allocatable. In my mind, overcommitment is a scheduler thing, i.e. packing more pods on the node assuming not all of them will use their requests all at once.

What we are doing is reserving the requests made by G and Bu pods against use in the lower QoS tiers. Hence qos-reserve-requests.

derekwaynecarr (Member Author):

by definition, we are reserving for G pods in the same manner we are reserving for node allocatable top level cgroup. both have a memory limit set equal to their request. i agree with @sjenning . i think overcommit-percentage means something like a physical cpu/memory to virtual cpu/memory scalar that would skew what the node reports as allocatable back to the scheduler. this is precisely not that. this is attempting to reserve for the qos tier in the same manner we are reserving for the kubepods cgroup sandbox one level up in the hierarchy. it's also experimental in 1.6, so i am not sure how much we want to bikeshed on the name.

Contributor:

You can't "overcommit" a node in the way I think of overcommitting.

Our current QoS doc talks about oversubscribing. Maybe that's a better term?

What we are doing is reserving the requests made by G and Bu pods against use in the lower QoS tiers. Hence qos-reserve-requests

This is not obvious from the flag name. We use the keyword reservation in other places to indicate System and Kube reservations, for example, where it's a static reservation.
Whereas here, we are referring to a dynamic quantity.

Because the kubernetes scheduler only looks at requests, it inherently oversubscribes (trying a different term) the nodes. This flag is merely controlling the amount of oversubscription.

it's also experimental in 1.6, so i am not sure how much we want to bikeshed on the name.

Having spent so much time on it, we might as well resolve it now among the three of us. I feel this conversation will be helpful in agreeing on the semantic meaning of some of the terms we use all over the code too - reservation, commitment, limits, etc.

converges to a desired state. Failure to set `cpu.shares` at the QoS level
cgroup would result in `500m` of cpu for a Guaranteed pod to have different
meaning than `500m` of cpu for a Burstable pod in the current hierarchy.
For this reason, we will always set `cpu.shares` for the QoS level sandboxes
Contributor:

I wasn't aware of this change. I was under the assumption that all QoS level settings would be opt-in.

Contributor:

Also, given that we have pod level limits, I don't see QoS level shares limits to have that much of an impact unless a user explicitly opts in to restricting overcommit.

derekwaynecarr (Member Author):

I am not following, absent setting a value:

$ cat /Burstable/cpu.shares
1024

If you have allocatable=4cpu and you run stress to do the following:

kubectl run burstable --image=sjenning/stress --requests=cpu=500m,memory=100Mi --replicas=1 --command -- /usr/bin/stress -c 4 -t 3600

by default, this bursts to consume all 4 cpus, which is fine, that is the point.

but if you do this:

kubectl run guaranteed --image=sjenning/stress --requests=cpu=500m,memory=100Mi --limits=cpu=500m,memory=100Mi --replicas=7 --command -- /usr/bin/stress -c 4 -t 3600

the result is that the burstable pod skews closer to 1 core of usage since the 1024 share evaluation is made relative to the guaranteed pods, and we really want the Burstable cpu share to be 500 so it was given equal time relative to the other pods.

i tested this locally and confirmed this was the case, and manually assigning

echo 500 > /Burstable/cpu.shares

brings usage closer to the actual value we requested, as expected...

do you disagree that we need to set shares on burstable tier?

Contributor:

do you disagree that we need to set shares on burstable tier?

Nope. All I'm saying is that I'd prefer not setting QoS limits by default in v1.6.

Contributor:

As per offline discussion, given that this will lead to a regression, I agree that it's necessary to have it turned on by default. Apologies for the confusion.
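A small sketch of the Burstable `cpu.shares` behavior measured above; the paths are hypothetical, and the 512 value assumes the common millicores-to-shares conversion (1000m is roughly 1024 shares) rather than the literal 500 used in the manual test:

```
# default: the Burstable QoS cgroup inherits cpu.shares = 1024, so a single Burstable pod
# requesting 500m competes as if it had requested a full core relative to Guaranteed pods
cat /sys/fs/cgroup/cpu/Burstable/cpu.shares      # -> 1024
# desired: shares proportional to the sum of Burstable cpu requests (500m -> ~512 shares)
echo 512 > /sys/fs/cgroup/cpu/Burstable/cpu.shares
```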

A value of `--experimental-qos-reserve-requests=memory=100%` will cause the
`kubelet` to prevent the Burstable and BestEffort cgroups from consuming memory
that was requested by a higher QoS class. This increases the risk
of inducing OOM on BestEffort workloads in favor of increasing memory
Contributor:

and on Burstable too since they cannot consume memory that is otherwise free

that was requested by a higher QoS class. This increases the risk
of inducing OOM on BestEffort workloads in favor of increasing memory
resource guarantees for Burstable and Guaranteed workloads. A value of
`--experimental-qos-reserve-requests=memory=0%` will allow a Burstable
Contributor:

As I mentioned earlier, I find reservation misleading since it is not clear what the reservation is meant for.


Since memory is an incompressible resource, it is possible that a QoS
level cgroup sandbox may not be able to reduce memory usage below the
value specified in the heuristic during pod admission and pod termination.
Contributor:

We don't use any heuristic, do we? Limits are based on the policy mentioned above, right?

As a result, the `kubelet` runs a periodic thread to attempt to converge
to this desired state from the above heuristic. If unreclaimable memory
usage has exceeded the desired limit for the sandbox, the `kubelet` will
attempt to set the effective limit near the current usage to put pressure
Contributor:

It is not safe to create a pod level cgroup unless updates on QoS level memory cgroup has been successful.
It's not obvious what the general design principle is from this text.

derekwaynecarr (Member Author):

@vishh -- does something like the following help? I disagree on the safety argument. Operators most concerned about safety may deploy G workloads first to a node before opening up other workloads for scheduling. For nodes with a fluid set of pods being scheduled, I want to attempt to provide safety, but do no worse than if the reservation feature is disabled.

Use cases:

  1. I want to prioritize access to compressible resources for my system/cluster daemons over end-user pods.
  2. I want to prioritize access to compressible resources for my G workloads over my Bu workloads.
  3. I want to prioritize access to compressible resources for my Bu workloads over my Be workloads.

An operator may choose to prioritize access to a compressible resource over utilization for each step in the hierarchy of workloads described above dependent upon their intended workload. Almost all cluster operators will use node allocatable to enforce the first use case in order to provide reliable operation of the node. It is understood that not all operators may feel the same for G/Bu workloads across the user community. For the users I represent, I know they may opt to deploy important cluster services as G workloads via a DaemonSet and would like a similar resource reservation model as is provided via node allocatable for system/cluster daemons and end-user pods. In addition, I know users I represent would like to run Be workloads and minimize their ability to impact Bu pods. Reliance on eviction, oom_killer, and critical pods is not always sufficient. Eviction still has latency, oom_killer is disruptive, and critical pods as I view them are intended to provide G style guarantees but run with Bu resource requirements.

Design Principle
kubelet with opt-in configuration will attempt to limit the ability for a pod in a lower QoS tier to burst utilization of a compressible resource that was requested by a pod in a higher QoS tier.

Mechanism
Prior to starting a G pod, we will always attempt to update the Bu and Be tiers to have their memory limits reduced based on the incoming G pod memory request. It is possible that we are unable to reduce the Bu and Be tier to their new desired limit if usage already has exceeded that value prior to execution of the G pod. If their usage has exceeded the new limit, we will set the limit at or near their usage to put pressure on the cgroup to prevent further growth. The kubelet will not wait for the QoS cgroup memory limit to converge to the desired state prior to execution of the G pod. This does mean that the G pod could induce an OOM for the kubepods cgroup, but per our QoS design, we would prefer the oom_killer targets the Bu/Be pods first, or ideally, we get the eviction code path to kick in in response to kernel notification, and evict the Bu/Be pod. The periodic task that attempts to converge the QoS tier memory limits would then have a better opportunity to converge on the desired state so any future Bu/Be pods that land do not impact the G pod since they would be working under a reduced memory limit. The same logic applies for the Bu pods that get scheduled, but reduces the pertinent QoS cgroup sandbox update to just the Be tier.

Best Practices
Operators that want to provide a similar resource reservation model for G pods as we offer via enforcement of node allocatable are encouraged to schedule their G pods via a DaemonSet prior to opening up the node for scheduling of Bu/Be workloads. This ensures that the Bu/Be tiers have had their QoS memory limits appropriately adjusted before taking unbounded workloads on the node.

Future considerations
It's possible some users may want to stall execution of a pod in a higher QoS tier if resource reservations for compressible resources have not converged on the desired state. It can be evaluated based on more experience in the community.

We have discussed potentially allowing end-user pods to schedule in the kube/system cgroup spaces, but the same hierarchy of needs may still be extended. For example, hosted multi-tenant shared-node offerings may want to allow the operators to deploy pods to kube/system cgroup spaces, but not the tenants on a shared node. For those users, offering the same opportunity to reserve requests at each level in the hierarchy is important.

Contributor:

Firstly, I'd prefer all this content to exist in the doc. This is kind of what I have been asking for in @sjenning's PRs. Thanks for posting this.

Your design choice is not geared towards predictable tail latencies. Maybe the customers you represent aren't expecting that yet. I do know some customers who care about that. Especially ones who run user facing web services that cannot have unpredictable latencies.

Relying on OOM scores and user space evictions isn't reliable enough yet. What this proposal is attempting to provide is definitely better than the current state, but we should clarify that it is besteffort Quality of Service. Users should know that we do not optimize for predictable tail latencies (yet).

Given that the kernel does not prevent over commitment from a memcg perspective this is not a correctness issue though.
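As an illustration of the "cap near current usage" convergence step in the mechanism above, a minimal shell sketch; the cgroup paths and numbers are hypothetical and assume cgroup v1 with the cgroupfs driver:

```
# desired Burstable limit computed from the reservation heuristic (hypothetical value)
desired=$((6 * 1024 * 1024 * 1024))
usage=$(cat /sys/fs/cgroup/memory/kubepods/burstable/memory.usage_in_bytes)
if [ "$usage" -gt "$desired" ]; then
  # usage already exceeds the target: cap at current usage to stop further growth
  # and let the periodic task converge toward the desired limit later
  echo "$usage" > /sys/fs/cgroup/memory/kubepods/burstable/memory.limit_in_bytes
else
  echo "$desired" > /sys/fs/cgroup/memory/kubepods/burstable/memory.limit_in_bytes
fi
```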

All memory backed volumes are removed when a pod reaches a terminal state.

The `kubelet` verifies that a pod's cgroup is deleted from the
host before deleting a pod from the API server as part of the
Contributor:

part of what?

derekwaynecarr (Member Author):

graceful deletion ;-)


## Log basic cgroup management

The `kubelet` will log and collect metrics associated with cgroup manipulation.
Contributor:

Are the metrics already in place or is this a TODO?

derekwaynecarr (Member Author):

this is a TODO, i am hoping between @sjenning and me, we can bang a PR out by monday.

we basically will need an instrumented CgroupManager


@derekwaynecarr force-pushed the cgroup_rollout branch 2 times, most recently from 13ee005 to 648809e on February 23, 2017 21:24
@derekwaynecarr (Member Author):

@vishh -- ready for another round....

@vishh (Contributor) commented Feb 24, 2017:

Just a couple more comment threads still open.

@vishh (Contributor) commented Feb 27, 2017:

@dchen1107 @davidopp @erictune @thockin

There has been a debate on naming the QoS over subscription policy introduced by this proposal. I will try to capture the discussion thus far.

  1. Requests are guaranteed. QoS policies let k8s oversubscribe requests in a safe manner.
  2. This proposal is introducing a new QoS level cgroup limits policy that will provide a means to control over subscription of requests / Guaranteed resources at the node level.
  3. The proposed flag name and hence user facing name (via docs) for this feature is --experimental-qos-reserve-requests.
  4. I find the proposed name to not be capturing the over subscription and policy aspects of this feature.
  5. I prefer to name this feature something along the lines of --experimental-qos-over-subscription-policy.
  6. There is a thought that the scheduler does not over subscribe in kubernetes by default because it guarantees requests. I feel that by not scheduling against limits the scheduler is always over subscribing. It is not over subscribing requests, but it is over subscribing capacity.
  7. In the case of the QoS policy feature in this proposal though, we are attempting to control over subscription of requests, which leads to some confusion.
    Maybe an alternative name would be --experimental-qos-request-over-subscription-policy?

@derekwaynecarr @sjenning did I miss anything?

EDIT:
It will be helpful to clarify once and for all what certain keywords mean, to be consistent across the project. The ones specific to this proposal are reservations, requests, limits, over-subscription, over commitment, QoS.

will attempt to limit the ability for a pod in a lower QoS tier to burst utilization
of a compressible resource that was requested by a pod in a higher QoS tier.

The `kubelet` will support a flag `experimental-qos-reserve-requests` that
derekwaynecarr (Member Author):

to capture what was discussed via slack:

i want this flag to be called --experimental-qos-reserved

derekwaynecarr (Member Author):

flag name updated.

@derekwaynecarr (Member Author):

@vishh -- what is missed is that, in my view, this is doing the exact same thing for the qos tiers as we are doing with the system and kube tiers via system-reserved and kube-reserved, both of which statically reserve resources away from workloads with a lower precedence in the hierarchy. the distinction here is that the reservation is not static (because it's based on pods scheduled). the action is still the same.

see use case discussion here for context:
https://github.com/kubernetes/community/pull/314/files#diff-adf1bc7c9f94f4ebefc6894c6ec50fd8R206

@thockin (Member) commented Feb 27, 2017:

I see the word "reserved" has multiple meanings, but I am OK with it. At least, I don't immediately see a better word - reservation feels right here.

This doesn't allow me to express "under-provision Burstable but fully fund Guaranteed", which seems like it might be a useful policy (and sort of maps to Borg tiers)?

This representation (qos-reserved) is hard to wrap one's head around, though. It doesn't mean to overcommit, if I understand. It simply modulates the availability of resources for lower tiers. Do we need a way for the scheduler to understand over-commit, too?


## Design
For example, the cgroup name `/Burstable/pod_123-456` is translated to a
Contributor:

Didn't you want to make all cgroup names lower case?

derekwaynecarr (Member Author):

yes, per our discussion, i think the existing code should follow cgroup v2 naming conventions now rather than later.

see: https://www.kernel.org/doc/Documentation/cgroup-v2.txt
section: 2-6-2. Avoid Name Collisions

i will update the text, and open a pr.

derekwaynecarr (Member Author):

opened kubernetes/kubernetes#42497 to update

node in the cgroup hierarchy below which the `kubelet` should manage individual
cgroup sandboxes. It is strongly recommended that users keep the default
value for `--cgroup-root` as `/` in order to avoid deep cgroup hierarchies. If
`--enforce-node-allocatable` is enabled per [node allocatable](node-allocatable.md),
Contributor:

This is not correct. kubepods is controlled by --cgroups-per-qos flag

derekwaynecarr (Member Author):

fixed.

cluster services as Guaranteed workloads via a `DaemonSet` and would like a similar
resource reservation model as is provided via [node allocatable](node-allocatable)
for system and kubernetes daemons. In addition, depending on how operators choose
to leverage BestEffort workloads, they may choose to make a similar decision
@vishh (Contributor), Mar 3, 2017:

This rationale calls for having separate configuration for G and Bu QoS levels. Something to consider before going to GA with this feature.

derekwaynecarr (Member Author):

the rationale was not intended to be read in that manner. as you know, many may choose to deploy cluster daemons in burstable tier as well (GKE does with kube-proxy).

Contributor:

Got it. When I read your description, you try to point out that each use case can exist independent of the other. That's why I felt it is useful to have per QoS configuration.

derekwaynecarr (Member Author):

clarified text to avoid confusion.

to the desired state prior to execution of the pod, but it will always
attempt to cap the existing usage of QoS cgroup sandboxes in lower tiers.
This does mean that the new pod could induce an OOM event at the `ROOT`
cgroup, but ideally per our QoS design, the oom_killer targets a pod
Contributor:

As a next step, we need to enable evictions at the QoS levels too to facilitate QoS enforcement.

derekwaynecarr (Member Author):

added a note to Future enhancements.

reservation model for Guaranteed pods as we offer via enforcement of
node allocatable are encouraged to schedule their Guaranteed pods first
as it will ensure the Burstable and BestEffort tiers have had their QoS
memory limits appropriately ajdusted before taking unbounded workload on
Contributor:

nit: typo: adjusted

derekwaynecarr (Member Author):

fixed

host before deleting a pod from the API server as part of the graceful
deletion process.

This ensures resource consumption associated with those volumes are not
@vishh (Contributor), Mar 3, 2017:

This is not totally true because we do not wait for the pod level cgroup usage to drop to 0 before deleting it.
I feel this is a TODO.

This feature does prevent crash-looping containers from taking up all the memory on the node and causing system OOMs if node allocatable and evictions are enabled.

derekwaynecarr (Member Author):

removed confusing text.

style syntax into transient slices, and as a result, it must follow `systemd`
conventions for path encoding.

For example, the cgroup name `/Burstable/pod_123-456` is translated to a
Contributor:

nit: I'd appreciate a future PR to rename all names to lowercase.

derekwaynecarr (Member Author):

i thought i caught them all, i will take another pass.
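For readers unfamiliar with the translation being discussed, a sketch of what the systemd-driver encoding typically looks like; the exact escaping is a systemd slice-naming detail and the concrete name below is an assumption, not quoted from the doc:

```
# abstract cgroupfs name        concrete systemd name (nested transient slices)
/Burstable/pod_123-456    ->    Burstable.slice/Burstable-pod_123_456.slice
```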

@vishh added the `lgtm` label ("Looks good to me", indicates that a PR is ready to be merged) on Mar 3, 2017
@vishh (Contributor) commented Mar 3, 2017:

LGTM.

Merging this PR because it's the design doc for a feature that is already merged into v1.6.

@vishh merged commit 5390ec4 into kubernetes:master on Mar 3, 2017
k8s-github-robot pushed a commit to kubernetes/kubernetes that referenced this pull request Mar 4, 2017
Automatic merge from submit-queue (batch tested with PRs 41919, 41149, 42350, 42351, 42285)

kubelet: enable qos-level memory limits

```release-note
Experimental support to reserve a pod's memory request from being utilized by pods in lower QoS tiers.
```

Enables the QoS-level memory cgroup limits described in kubernetes/community#314

**Note: QoS level cgroups have to be enabled for any of this to take effect.**

Adds a new `--experimental-qos-reserved` flag that can be used to set the percentage of a resource to be reserved at the QoS level for pod resource requests.

For example, `--experimental-qos-reserved="memory=50%"` means that if a Guaranteed pod sets a memory request of 2Gi, the Burstable and BestEffort QoS memory cgroups will have their `memory.limit_in_bytes` set to `NodeAllocatable - (2Gi*50%)` to reserve 50% of the guaranteed pod's request from being used by the lower QoS tiers.

If a Burstable pod sets a request, its reserve will be deducted from the BestEffort memory limit.

The result is that:
- Guaranteed limit matches the root cgroup and is not set by this code
- Burstable limit is `NodeAllocatable - Guaranteed reserve`
- BestEffort limit is `NodeAllocatable - Guaranteed reserve - Burstable reserve`

The only resource currently supported is `memory`; however, the code is generic enough that other resources can be added in the future.
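Worked numbers for the formula in this release note, using hypothetical values (10Gi of NodeAllocatable, one Guaranteed pod requesting 2Gi, one Burstable pod requesting 1Gi, and `--experimental-qos-reserved=memory=50%`):

```
Burstable/memory.limit_in_bytes  = 10Gi - 2Gi*50%            = 9Gi
BestEffort/memory.limit_in_bytes = 10Gi - 2Gi*50% - 1Gi*50%  = 8.5Gi
```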

@derekwaynecarr @vishh
ruebenramirez pushed a commit to ruebenramirez/community that referenced this pull request Apr 22, 2017
Update pod resource management design and rollout plan
shyamjvs pushed a commit to shyamjvs/community that referenced this pull request Sep 22, 2017
MadhavJivrajani pushed a commit to MadhavJivrajani/community that referenced this pull request Nov 30, 2021
Update pod resource management design and rollout plan
danehans pushed a commit to danehans/community that referenced this pull request Jul 18, 2023