Merge pull request #2992 from itskingori/node_resource_handling
Automatic merge from submit-queue

Add documentation on handling node resources

At a minimum, this is meant to give more context on why the feature in #2982 was added and attempts to give some recommendations of what to consider when evaluating node system resources.

I hope this spurs some discussion and that the recommendations I make may be assessed further. For example ... in one of the links I referenced, we're advised to set `system-reserved` **only if we know what we are doing** (which I can't say I do 💯% ... 🤷‍♂️) and we're even warned to only set it if we really need to.
Kubernetes Submit Queue authored Aug 14, 2017
2 parents 0620cce + 1bd329a commit b7331ac
Showing 1 changed file with 131 additions and 0 deletions.
131 changes: 131 additions & 0 deletions docs/node_resource_handling.md
@@ -0,0 +1,131 @@
## Node Resource Handling In Kubernetes

An aspect of Kubernetes clusters that is often overlooked is the resources that
non-pod components require to run, such as:

* Operating system components, e.g. `sshd`, `udev` etc.
* Kubernetes system components, e.g. `kubelet`, the container runtime (such as
Docker), `node problem detector`, `journald` etc.

As you manage your cluster, it's important that you are cognisant of these
components because if your critical non-pod components don't have enough
resources, you might end up with a very unstable cluster.

### Understanding Node Resources

Each node in a cluster has resources available to it and pods scheduled to run
on the node may or may not have resource requests or limits set on them.
Kubernetes schedules pods on nodes that have resources that satisfy the pod's
specified requirements. Broadly, pods are [bin-packed][4] onto the nodes in a
best effort attempt to utilize as much of the resources available with as few
nodes as possible.

```
      Node Capacity
---------------------------
|     kube-reserved       |
|-------------------------|
|     system-reserved     |
|-------------------------|
|    eviction-threshold   |
|-------------------------|
|                         |
|      allocatable        |
|   (available for pods)  |
|                         |
|                         |
---------------------------
```

Node resources can be categorised into four groups (as shown above):

* `kube-reserved` – reserves resources for Kubernetes system daemons.
* `system-reserved` – reserves resources for operating system components.
* `eviction-threshold` – specifies limits that trigger evictions when node
resources drop below the reserved value.
* `allocatable` – the remaining node resources available for scheduling of pods
when `kube-reserved`, `system-reserved` and `eviction-threshold` resources
have been accounted for.

For example, on a machine with 30.5 GB of memory and 4 vCPUs, with only the
eviction threshold set via `--eviction-hard=memory.available<100Mi`, we'd get
the following `Capacity` and `Allocatable` resources:

```
$ kubectl describe node/ip-xx-xx-xx-xxx.internal
...
Capacity:
 cpu:     4
 memory:  31402412Ki
...
Allocatable:
 cpu:     4
 memory:  31300012Ki
...
```
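
The gap between the two memory figures is exactly the 100Mi hard eviction threshold. A minimal Python sketch of that arithmetic follows; the function name `allocatable_ki` is illustrative and not part of any Kubernetes API.

```python
KI_PER_MI = 1024  # 1 Mi = 1024 Ki


def allocatable_ki(capacity_ki, kube_reserved_ki=0, system_reserved_ki=0,
                   eviction_hard_ki=0):
    """Allocatable = capacity - kube-reserved - system-reserved - eviction threshold."""
    return capacity_ki - kube_reserved_ki - system_reserved_ki - eviction_hard_ki


# Values taken from the `kubectl describe node` output above: only the
# `memory.available<100Mi` hard eviction threshold is set.
capacity = 31402412  # Ki
result = allocatable_ki(capacity, eviction_hard_ki=100 * KI_PER_MI)
print(f"{result}Ki")  # 31300012Ki
```

Setting `kube-reserved` or `system-reserved` would shrink `Allocatable` further by the same subtraction.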

### So, What Could Possibly Go Wrong?

The scheduler ensures that, for each resource type, the sum of the resource
requests of the pods scheduled on a node does not exceed the node's allocatable
resources. But suppose you have a couple of applications deployed in your
cluster that constantly use far more resources than their requests specify
(bursting above requests but staying below limits under load). You end up with
a node whose pods together attempt to use more resources than the node has!
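
To make the gap between requests and actual usage concrete, here is a small sketch with made-up numbers. The `Pod` class, the pod names and every value are hypothetical; the point is only that a node can pass the scheduler's request check while real usage exceeds its capacity.

```python
from dataclasses import dataclass


@dataclass
class Pod:
    name: str
    request_mi: int  # memory request, the value the scheduler considers
    limit_mi: int    # memory limit, the ceiling the pod may burst to
    usage_mi: int    # hypothetical actual usage during a spike


def fits_by_requests(pods, allocatable_mi):
    """The scheduler's view: do the summed requests fit in allocatable?"""
    return sum(p.request_mi for p in pods) <= allocatable_mi


def overcommitted_by_usage(pods, capacity_mi):
    """The node's view: does the summed actual usage exceed capacity?"""
    return sum(p.usage_mi for p in pods) > capacity_mi


# Hypothetical node: 16384Mi capacity, 100Mi hard eviction threshold.
capacity_mi = 16384
allocatable_mi = capacity_mi - 100

pods = [
    Pod("api", request_mi=2048, limit_mi=6144, usage_mi=5120),
    Pod("worker", request_mi=2048, limit_mi=6144, usage_mi=5632),
    Pod("cache", request_mi=3072, limit_mi=8192, usage_mi=6144),
]

print(fits_by_requests(pods, allocatable_mi))     # True: 7168Mi requested
print(overcommitted_by_usage(pods, capacity_mi))  # True: 16896Mi in use
```

Each pod stays below its own limit, yet together they ask for more memory than the node has.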

This is particularly an issue with non-compressible resources like memory. For
example, in the aforementioned case, with an eviction threshold of only
`memory.available<100Mi` and no `kube-reserved` or `system-reserved`
reservations set, it is possible for a node to run out of memory (OOM) before
`kubelet` is able to reclaim any. `kubelet` may not observe memory pressure
right away, since it polls `cAdvisor` for memory usage stats at a regular
interval.
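
A rough back-of-the-envelope model of that race is sketched below. It is a deliberate simplification: the 10-second polling interval and the growth rate are assumptions for illustration, not values taken from this document, and real memory growth is rarely linear.

```python
def can_miss_threshold(growth_mi_per_s, poll_interval_s, threshold_mi):
    """Worst case: available memory sits just above the eviction threshold at
    one poll. If usage grows by more than the threshold before the next poll,
    the node can run out of memory before kubelet observes any pressure."""
    return growth_mi_per_s * poll_interval_s > threshold_mi


# Assumed values: memory usage grows by 20Mi/s and stats are polled every 10s.
print(can_miss_threshold(20, 10, threshold_mi=100))  # True: 200Mi > 100Mi
print(can_miss_threshold(20, 10, threshold_mi=500))  # False: 200Mi <= 500Mi
```

Under these assumptions a 100Mi threshold leaves no margin, while a larger threshold gives `kubelet` at least one chance to observe the pressure and react.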

All the while, keep in mind that without `kube-reserved` or `system-reserved`
reservations set (which is the case for most clusters, e.g. [GKE][5],
[Kops][6]), the scheduler doesn't account for the resources that non-pod
components require to function properly, because `Capacity` and `Allocatable`
resources are more or less equal.

### Where Do We Go From Here?

It's difficult to give a one-size-fits-all answer to node resource allocation.
The behaviour of your cluster depends on the resource requirements of the apps
running on it, the pod density and the cluster size. However, there's a
[node performance dashboard][7] that exposes `cpu` and `memory` usage profiles
of `kubelet` and the `docker` engine at multiple levels of pod density, which
may serve as a guide to what values would be appropriate for your cluster.

That said, it seems fitting to recommend the following:

1. Always set requests with some breathing room – do not set requests to match
your application's resource profile during idle time too closely.
2. Always set limits – so that your application doesn't hog all the memory on a
node during a spike.
3. Don't set your limits for incompressible resources too high - at the end of
the day, the Kubernetes scheduler places pods based on resource requests, not
limits. During a spike, your pod will try to use resources beyond what it is
guaranteed to have access to. As explained before, this can be an issue if a
bunch of your pods are all bursting at the same time.
4. Increase eviction thresholds if they are too low - while high utilization
is ideal, running too close to the edge may not leave the system enough time
to reclaim resources via evictions if usage increases rapidly within that
window.
5. Reserve resources for system components once you've been able to profile your
nodes i.e. `kube-reserved` and `system-reserved`.
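
As a sketch of how recommendations 4 and 5 translate into kubelet settings, the helper below renders reservations into the `--kube-reserved`, `--system-reserved` and `--eviction-hard` kubelet flags. The helper `kubelet_flags` is hypothetical and every value is a placeholder, not a recommendation; profile your own nodes before picking numbers.

```python
def kubelet_flags(kube_reserved, system_reserved, eviction_hard):
    """Render resource reservations as kubelet command-line flags."""
    def fmt(resources, sep):
        # Join entries as key<sep>value pairs, e.g. "cpu=100m,memory=256Mi".
        return ",".join(f"{key}{sep}{value}" for key, value in resources.items())

    return [
        f"--kube-reserved={fmt(kube_reserved, '=')}",
        f"--system-reserved={fmt(system_reserved, '=')}",
        f"--eviction-hard={fmt(eviction_hard, '<')}",
    ]


# Placeholder values for illustration only.
flags = kubelet_flags(
    kube_reserved={"cpu": "100m", "memory": "256Mi"},
    system_reserved={"cpu": "100m", "memory": "512Mi"},
    eviction_hard={"memory.available": "500Mi"},
)
for flag in flags:
    print(flag)
```

How these flags reach `kubelet` depends on how your cluster is provisioned, so treat the output as a starting point for your tooling's own configuration format.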

**Further Reading:**

* [Configure Out Of Resource Handling][2]
* [Reserve Compute Resources for System Daemons][1]
* [Managing Compute Resources for Containers][3]
* [Visualize Kubelet Performance with Node Dashboard][8]

[1]: https://kubernetes.io/docs/tasks/administer-cluster/reserve-compute-resources/
[2]: https://kubernetes.io/docs/tasks/administer-cluster/out-of-resource/
[3]: https://kubernetes.io/docs/concepts/configuration/manage-compute-resources-container/
[4]: https://en.wikipedia.org/wiki/Bin_packing_problem
[5]: https://cloud.google.com/container-engine/
[6]: https://github.com/kubernetes/kops
[7]: http://node-perf-dash.k8s.io/#/builds
[8]: http://blog.kubernetes.io/2016/11/visualize-kubelet-performance-with-node-dashboard.html
