diff --git a/content/en/docs/concepts/scheduling-eviction/_index.md b/content/en/docs/concepts/scheduling-eviction/_index.md index 79fca8e5975f4..21e9371f0378b 100644 --- a/content/en/docs/concepts/scheduling-eviction/_index.md +++ b/content/en/docs/concepts/scheduling-eviction/_index.md @@ -1,11 +1,37 @@ --- title: "Scheduling, Preemption and Eviction" weight: 90 +content_type: concept description: > In Kubernetes, scheduling refers to making sure that Pods are matched to Nodes so that the kubelet can run them. Preemption is the process of terminating Pods with lower Priority so that Pods with higher Priority can schedule on Nodes. Eviction is the process of proactively terminating one or more Pods on resource-starved Nodes. +no_list: true --- +In Kubernetes, scheduling refers to making sure that {{}} +are matched to {{}} so that the +{{}} can run them. Preemption +is the process of terminating Pods with lower {{}} +so that Pods with higher Priority can schedule on Nodes. Eviction is the process +of terminating one or more Pods on Nodes. 
+ +## Scheduling + +* [Kubernetes Scheduler](/docs/concepts/scheduling-eviction/kube-scheduler/) +* [Assigning Pods to Nodes](/docs/concepts/scheduling-eviction/assign-pod-node/) +* [Pod Overhead](/docs/concepts/scheduling-eviction/pod-overhead/) +* [Taints and Tolerations](/docs/concepts/scheduling-eviction/taint-and-toleration/) +* [Scheduling Framework](/docs/concepts/scheduling-eviction/scheduling-framework) +* [Scheduler Performance Tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/) +* [Resource Bin Packing for Extended Resources](/docs/concepts/scheduling-eviction/resource-bin-packing/) + +## Pod Disruption + +{{}} + +* [Pod Priority and Preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/) +* [Node-pressure Eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/) +* [API-initiated Eviction](/docs/concepts/scheduling-eviction/api-eviction/) diff --git a/content/en/docs/concepts/scheduling-eviction/api-eviction.md b/content/en/docs/concepts/scheduling-eviction/api-eviction.md new file mode 100644 index 0000000000000..e7f1942df2cfd --- /dev/null +++ b/content/en/docs/concepts/scheduling-eviction/api-eviction.md @@ -0,0 +1,19 @@ +--- +title: API-initiated Eviction +content_type: concept +weight: 70 +--- + +{{< glossary_definition term_id="api-eviction" length="short" >}}
+ +You can request eviction by directly calling the Eviction API +using a client of the kube-apiserver, like the `kubectl drain` command. +This creates an `Eviction` object, which causes the API server to terminate the Pod. + +API-initiated evictions respect your configured [`PodDisruptionBudgets`](/docs/tasks/run-application/configure-pdb/) +and [`terminationGracePeriodSeconds`](/docs/concepts/workloads/pods/pod-lifecycle#pod-termination). + +## {{% heading "whatsnext" %}} + +* Learn about [Node-pressure Eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/) +* Learn about [Pod Priority and Preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/) diff --git a/content/en/docs/concepts/scheduling-eviction/eviction-policy.md b/content/en/docs/concepts/scheduling-eviction/eviction-policy.md deleted file mode 100644 index b63c729696e9c..0000000000000 --- a/content/en/docs/concepts/scheduling-eviction/eviction-policy.md +++ /dev/null @@ -1,24 +0,0 @@ ---- -title: Eviction Policy -content_type: concept -weight: 60 ---- - - - -This page is an overview of Kubernetes' policy for eviction. - - - -## Eviction Policy - -The {{< glossary_tooltip text="kubelet" term_id="kubelet" >}} proactively monitors for -and prevents total starvation of a compute resource. In those cases, the `kubelet` can reclaim -the starved resource by failing one or more Pods. When the `kubelet` fails -a Pod, it terminates all of its containers and transitions its `PodPhase` to `Failed`. -If the evicted Pod is managed by a Deployment, the Deployment creates another Pod -to be scheduled by Kubernetes. - -## {{% heading "whatsnext" %}} - -- Learn how to [configure out of resource handling](/docs/tasks/administer-cluster/out-of-resource/) with eviction signals and thresholds. 
diff --git a/content/en/docs/concepts/scheduling-eviction/kube-scheduler.md b/content/en/docs/concepts/scheduling-eviction/kube-scheduler.md index 0944ecc768c5a..52c8fd417e93e 100644 --- a/content/en/docs/concepts/scheduling-eviction/kube-scheduler.md +++ b/content/en/docs/concepts/scheduling-eviction/kube-scheduler.md @@ -77,11 +77,9 @@ one of these at random. There are two supported ways to configure the filtering and scoring behavior of the scheduler: - 1. [Scheduling Policies](/docs/reference/scheduling/policies) allow you to configure _Predicates_ for filtering and _Priorities_ for scoring. 1. [Scheduling Profiles](/docs/reference/scheduling/config/#profiles) allow you to configure Plugins that implement different scheduling stages, including: `QueueSort`, `Filter`, `Score`, `Bind`, `Reserve`, `Permit`, and others. You can also configure the kube-scheduler to run different profiles. - ## {{% heading "whatsnext" %}} * Read about [scheduler performance tuning](/docs/concepts/scheduling-eviction/scheduler-perf-tuning/) diff --git a/content/en/docs/concepts/scheduling-eviction/node-pressure-eviction.md b/content/en/docs/concepts/scheduling-eviction/node-pressure-eviction.md new file mode 100644 index 0000000000000..1f1fcd9991dab --- /dev/null +++ b/content/en/docs/concepts/scheduling-eviction/node-pressure-eviction.md @@ -0,0 +1,411 @@ +--- +title: Node-pressure Eviction +content_type: concept +weight: 60 +--- + +{{}}
+ +The {{}} monitors resources +like CPU, memory, disk space, and filesystem inodes on your cluster's nodes. +When one or more of these resources reach specific consumption levels, the +kubelet can proactively fail one or more pods on the node to reclaim resources +and prevent starvation. + +During a node-pressure eviction, the kubelet sets the `PodPhase` for the +selected pods to `Failed`. This terminates the pods. + +Node-pressure eviction is not the same as +[API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/). + +The kubelet does not respect your configured `PodDisruptionBudget` or the pod's +`terminationGracePeriodSeconds`. If you use [soft eviction thresholds](#soft-eviction-thresholds), +the kubelet respects your configured `eviction-max-pod-grace-period`. If you use +[hard eviction thresholds](#hard-eviction-thresholds), it uses a `0s` grace period for termination. + +If the pods are managed by a {{< glossary_tooltip text="workload" term_id="workload" >}} +resource (such as {{< glossary_tooltip text="StatefulSet" term_id="statefulset" >}} +or {{< glossary_tooltip text="Deployment" term_id="deployment" >}}) that +replaces failed pods, the control plane or `kube-controller-manager` creates new +pods in place of the evicted pods. + +{{}} +The kubelet attempts to [reclaim node-level resources](#reclaim-node-resources) +before it terminates end-user pods. For example, it removes unused container +images when disk resources are starved. +{{}} + +The kubelet uses various parameters to make eviction decisions, like the following: + + * Eviction signals + * Eviction thresholds + * Monitoring intervals + +### Eviction signals {#eviction-signals} + +Eviction signals are the current state of a particular resource at a specific +point in time. Kubelet uses eviction signals to make eviction decisions by +comparing the signals to eviction thresholds, which are the minimum amount of +the resource that should be available on the node. 
+ +Kubelet uses the following eviction signals: + +| Eviction Signal | Description | +|----------------------|---------------------------------------------------------------------------------------| +| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` | +| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` | +| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` | +| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` | +| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` | +| `pid.available` | `pid.available` := `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc` | + +In this table, the `Description` column shows how kubelet gets the value of the +signal. Each signal supports either a percentage or a literal value. Kubelet +calculates the percentage value relative to the total capacity associated with +the signal. + +The value for `memory.available` is derived from the cgroupfs instead of tools +like `free -m`. This is important because `free -m` does not work in a +container, and if users use the [node +allocatable](/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable) feature, out of resource decisions +are made local to the end user Pod part of the cgroup hierarchy as well as the +root node. This [script](/examples/admin/resource/memory-available.sh) +reproduces the same set of steps that the kubelet performs to calculate +`memory.available`. The kubelet excludes inactive_file (i.e. # of bytes of +file-backed memory on inactive LRU list) from its calculation as it assumes that +memory is reclaimable under pressure. + +The kubelet supports the following filesystem partitions: + +1. `nodefs`: The node's main filesystem, used for local disk volumes, emptyDir, + log storage, and more. For example, `nodefs` contains `/var/lib/kubelet/`. +1. 
`imagefs`: An optional filesystem that container runtimes use to store container + images and container writable layers. + +Kubelet auto-discovers these filesystems and ignores other filesystems. Kubelet +does not support other configurations. + +{{}} +Some kubelet garbage collection features are deprecated in favor of eviction. +For a list of the deprecated features, see [kubelet garbage collection deprecation](/docs/concepts/cluster-administration/kubelet-garbage-collection/#deprecation). +{{}} + +### Eviction thresholds + +You can specify custom eviction thresholds for the kubelet to use when it makes +eviction decisions. + +Eviction thresholds have the form `[eviction-signal][operator][quantity]`, where: + +* `eviction-signal` is the [eviction signal](#eviction-signals) to use. +* `operator` is the [relational operator](https://en.wikipedia.org/wiki/Relational_operator#Standard_relational_operators) + you want, such as `<` (less than). +* `quantity` is the eviction threshold amount, such as `1Gi`. The value of `quantity` + must match the quantity representation used by Kubernetes. You can use either + literal values or percentages (`%`). + +For example, if a node has `10Gi` of total memory and you want to trigger eviction if +the available memory falls below `1Gi`, you can define the eviction threshold as +either `memory.available<10%` or `memory.available<1Gi`. You cannot use both. + +You can configure soft and hard eviction thresholds. + +#### Soft eviction thresholds {#soft-eviction-thresholds} + +A soft eviction threshold pairs an eviction threshold with a required +administrator-specified grace period. The kubelet does not evict pods until the +grace period is exceeded. The kubelet returns an error on startup if there is no +specified grace period. + +You can specify both a soft eviction threshold grace period and a maximum +allowed pod termination grace period for kubelet to use during evictions. 
If you +specify a maximum allowed grace period and the soft eviction threshold is met, +the kubelet uses the lesser of the two grace periods. If you do not specify a +maximum allowed grace period, the kubelet kills evicted pods immediately without +graceful termination. + +You can use the following flags to configure soft eviction thresholds: + +* `eviction-soft`: A set of eviction thresholds like `memory.available<1.5Gi` + that can trigger pod eviction if held over the specified grace period. +* `eviction-soft-grace-period`: A set of eviction grace periods like `memory.available=1m30s` + that define how long a soft eviction threshold must hold before triggering a Pod eviction. +* `eviction-max-pod-grace-period`: The maximum allowed grace period (in seconds) + to use when terminating pods in response to a soft eviction threshold being met. + +#### Hard eviction thresholds {#hard-eviction-thresholds} + +A hard eviction threshold has no grace period. When a hard eviction threshold is +met, the kubelet kills pods immediately without graceful termination to reclaim +the starved resource. + +You can use the `eviction-hard` flag to configure a set of hard eviction +thresholds like `memory.available<1Gi`. + +The kubelet has the following default hard eviction thresholds: + +* `memory.available<100Mi` +* `nodefs.available<10%` +* `imagefs.available<15%` +* `nodefs.inodesFree<5%` (Linux nodes) + +### Eviction monitoring interval + +The kubelet evaluates eviction thresholds based on its configured `housekeeping-interval` +which defaults to `10s`. + +### Node conditions {#node-conditions} + +The kubelet reports node conditions to reflect that the node is under pressure +because hard or soft eviction threshold is met, independent of configured grace +periods. 
+ +The kubelet maps eviction signals to node conditions as follows: + +| Node Condition | Eviction Signal | Description | +|-------------------|---------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------| +| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold | +| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold | +| `PIDPressure` | `pid.available` | Available processes identifiers on the (Linux) node has fallen below an eviction threshold | + +The kubelet updates the node conditions based on the configured +`--node-status-update-frequency`, which defaults to `10s`. + +#### Node condition oscillation + +In some cases, nodes oscillate above and below soft eviction thresholds without +holding for the defined grace periods. This causes the reported node condition +to constantly switch between `true` and `false`, leading to bad eviction decisions. + +To protect against oscillation, you can use the `eviction-pressure-transition-period` +flag, which controls how long the kubelet must wait before transitioning a node +condition to a different state. The transition period has a default value of `5m`. + +### Reclaiming node level resources {#reclaim-node-resources} + +The kubelet tries to reclaim node-level resources before it evicts end-user pods. + +When a `DiskPressure` node condition is reported, the kubelet reclaims node-level +resources based on the filesystems on the node. 
+ +#### With `imagefs` + +If the node has a dedicated `imagefs` filesystem for container runtimes to use, +the kubelet does the following: + + * If the `nodefs` filesystem meets the eviction thresholds, the kubelet garbage collects + dead pods and containers. + * If the `imagefs` filesystem meets the eviction thresholds, the kubelet + deletes all unused images. + +#### Without `imagefs` + +If the node only has a `nodefs` filesystem that meets eviction thresholds, +the kubelet frees up disk space in the following order: + +1. Garbage collect dead pods and containers +1. Delete unused images + +### Pod selection for kubelet eviction + +If the kubelet's attempts to reclaim node-level resources don't bring the eviction +signal below the threshold, the kubelet begins to evict end-user pods. + +The kubelet uses the following parameters to determine pod eviction order: + +1. Whether the pod's resource usage exceeds requests +1. [Pod Priority](/docs/concepts/configuration/pod-priority-preemption/) +1. The pod's resource usage relative to requests + +As a result, kubelet ranks and evicts pods in the following order: + +1. `BestEffort` or `Burstable` pods where the usage exceeds requests. These pods + are evicted based on their Priority and then by how much their usage level + exceeds the request. +1. `Guaranteed` pods and `Burstable` pods where the usage is less than requests + are evicted last, based on their Priority. + +{{}} +The kubelet does not use the pod's QoS class to determine the eviction order. +You can use the QoS class to estimate the most likely pod eviction order when +reclaiming resources like memory. QoS does not apply to EphemeralStorage requests, +so the above scenario will not apply if the node is, for example, under `DiskPressure`. +{{}} + +`Guaranteed` pods are guaranteed only when requests and limits are specified for +all the containers and they are equal. These pods will never be evicted because +of another pod's resource consumption. 
If a system daemon (such as `kubelet`, +`docker`, and `journald`) is consuming more resources than were reserved via +`system-reserved` or `kube-reserved` allocations, and the node only has +`Guaranteed` or `Burstable` pods using less resources than requests left on it, +then the kubelet must choose to evict one of these pods to preserve node stability +and to limit the impact of resource starvation on other pods. In this case, it +will choose to evict pods of lowest Priority first. + +When the kubelet evicts pods in response to `inode` or `PID` starvation, it uses +the Priority to determine the eviction order, because `inodes` and `PIDs` have no +requests. + +The kubelet sorts pods differently based on whether the node has a dedicated +`imagefs` filesystem: + +#### With `imagefs` + +If `nodefs` is triggering evictions, the kubelet sorts pods based on `nodefs` +usage (`local volumes + logs of all containers`). + +If `imagefs` is triggering evictions, the kubelet sorts pods based on the +writable layer usage of all containers. + +#### Without `imagefs` + +If `nodefs` is triggering evictions, the kubelet sorts pods based on their total +disk usage (`local volumes + logs & writable layer of all containers`) + +### Minimum eviction reclaim + +In some cases, pod eviction only reclaims a small amount of the starved resource. +This can lead to the kubelet repeatedly hitting the configured eviction thresholds +and triggering multiple evictions. + +You can use the `--eviction-minimum-reclaim` flag or a [kubelet config file](/docs/tasks/administer-cluster/kubelet-config-file/) +to configure a minimum reclaim amount for each resource. When the kubelet notices +that a resource is starved, it continues to reclaim that resource until it +reclaims the quantity you specify. 
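The reclaim target described above comes down to simple arithmetic: reclaiming continues until the signal sits at the eviction threshold plus the configured minimum reclaim. The following Python sketch is only an illustration under stated assumptions. `parse_quantity` and `reclaim_target` are hypothetical helpers, not kubelet code, and only the binary suffixes `Ki`, `Mi`, and `Gi` are handled.

```python
# Illustrative sketch only: these helpers are hypothetical, not kubelet code.
_SUFFIXES = {"Ki": 1024, "Mi": 1024**2, "Gi": 1024**3}


def parse_quantity(quantity: str) -> int:
    """Convert a quantity such as "500Mi" into bytes (binary suffixes only)."""
    for suffix, factor in _SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(float(quantity[: -len(suffix)]) * factor)
    # No recognized suffix: assume the value is already a byte count.
    return int(quantity)


def reclaim_target(threshold: str, minimum_reclaim: str) -> int:
    """Bytes that must be available before reclaiming stops."""
    return parse_quantity(threshold) + parse_quantity(minimum_reclaim)
```

For instance, `reclaim_target("1Gi", "500Mi")` adds the two quantities in bytes, so a minimum reclaim of `0Mi` leaves the target equal to the threshold itself.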
+ +For example, the following configuration sets minimum reclaim amounts: + +```yaml +apiVersion: kubelet.config.k8s.io/v1beta1 +kind: KubeletConfiguration +evictionHard: + memory.available: "500Mi" + nodefs.available: "1Gi" + imagefs.available: "100Gi" +evictionMinimumReclaim: + memory.available: "0Mi" + nodefs.available: "500Mi" + imagefs.available: "2Gi" +``` + +In this example, if the `nodefs.available` signal meets the eviction threshold, +the kubelet reclaims the resource until the signal reaches the threshold of `1Gi`, +and then continues to reclaim the minimum amount of `500Mi` until the signal +reaches `1.5Gi`. + +Similarly, the kubelet reclaims the `imagefs` resource until the `imagefs.available` +signal reaches `102Gi`. + +The default `eviction-minimum-reclaim` is `0` for all resources. + +### Node out of memory behavior + +If the node experiences an out of memory (OOM) event prior to the kubelet +being able to reclaim memory, the node depends on the [oom_killer](https://lwn.net/Articles/391222/) +to respond. + +The kubelet sets an `oom_score_adj` value for each container based on the QoS for the pod. + +| Quality of Service | oom_score_adj | +|--------------------|-----------------------------------------------------------------------------------| +| `Guaranteed` | -997 | +| `BestEffort` | 1000 | +| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) | + +{{}} +The kubelet also sets an `oom_score_adj` value of `-997` for containers in Pods that have +`system-node-critical` {{}} +{{}} + +If the kubelet can't reclaim memory before a node experiences OOM, the +`oom_killer` calculates an `oom_score` based on the percentage of memory it's +using on the node, and then adds the `oom_score_adj` to get an effective `oom_score` +for each container. It then kills the container with the highest score. 
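The `oom_score_adj` table above can be written out as a small function to show how the `Burstable` formula behaves. This is a minimal Python sketch that mirrors the table and is not taken from the kubelet source. The function name and parameters are hypothetical, and the use of integer division is an assumption.

```python
# Illustrative sketch only: mirrors the table above, not the kubelet source.
def oom_score_adj(qos_class: str,
                  memory_request_bytes: int = 0,
                  machine_memory_capacity_bytes: int = 1) -> int:
    """Return the oom_score_adj for a container in a pod of the given QoS class."""
    if qos_class == "Guaranteed":
        return -997
    if qos_class == "BestEffort":
        return 1000
    if qos_class == "Burstable":
        # Larger requests relative to node memory yield lower (safer) scores.
        # Integer division is an assumption made for this sketch.
        raw = 1000 - (1000 * memory_request_bytes) // machine_memory_capacity_bytes
        return min(max(2, raw), 999)
    raise ValueError(f"unknown QoS class: {qos_class}")
```

Under these assumptions, a `Burstable` container always lands between `2` and `999`, so it ranks between `Guaranteed` and `BestEffort` containers before the memory-usage share is added.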
+ +This means that containers in low QoS pods that consume a large amount of memory +relative to their scheduling requests are killed first. + +Unlike pod eviction, if a container is OOM killed, the `kubelet` can restart it +based on its `RestartPolicy`. + +### Best practices {#node-pressure-eviction-good-practices} + +The following sections describe best practices for eviction configuration. + +#### Schedulable resources and eviction policies + +When you configure the kubelet with an eviction policy, you should make sure that +the scheduler will not schedule pods if they will trigger eviction because they +immediately induce memory pressure. + +Consider the following scenario: + +* Node memory capacity: `10Gi` +* Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.) +* Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM. + +For this to work, the kubelet is launched as follows: + +``` +--eviction-hard=memory.available<500Mi +--system-reserved=memory=1.5Gi +``` + +In this configuration, the `--system-reserved` flag reserves `1.5Gi` of memory +for the system, which is `10% of the total memory + the eviction threshold amount`. + +The node can reach the eviction threshold if a pod is using more than its request, +or if the system is using more than `1Gi` of memory, which makes the `memory.available` +signal fall below `500Mi` and triggers the threshold. + +#### DaemonSet + +Pod Priority is a major factor in making eviction decisions. If you do not want +the kubelet to evict pods that belong to a `DaemonSet`, give those pods a high +enough `priorityClass` in the pod spec. You can also use a lower `priorityClass` +or the default to only allow `DaemonSet` pods to run when there are enough +resources. + +### Known issues + +The following sections describe known issues related to out of resource handling. 
+ +#### kubelet may not observe memory pressure right away + +By default, the kubelet polls `cAdvisor` to collect memory usage stats at a +regular interval. If memory usage increases within that window rapidly, the +kubelet may not observe `MemoryPressure` fast enough, and the `OOMKiller` +will still be invoked. + +You can use the `--kernel-memcg-notification` flag to enable the `memcg` +notification API on the kubelet to get notified immediately when a threshold +is crossed. + +If you are not trying to achieve extreme utilization, but a sensible measure of +overcommit, a viable workaround for this issue is to use the `--kube-reserved` +and `--system-reserved` flags to allocate memory for the system. + +#### active_file memory is not considered as available memory + +On Linux, the kernel tracks the number of bytes of file-backed memory on active +LRU list as the `active_file` statistic. The kubelet treats `active_file` memory +areas as not reclaimable. For workloads that make intensive use of block-backed +local storage, including ephemeral local storage, kernel-level caches of file +and block data mean that many recently accessed cache pages are likely to be +counted as `active_file`. If enough of these kernel block buffers are on the +active LRU list, the kubelet is liable to observe this as high resource use and +taint the node as experiencing memory pressure - triggering pod eviction. + +For more details, see [https://github.com/kubernetes/kubernetes/issues/43916](https://github.com/kubernetes/kubernetes/issues/43916) + +You can work around that behavior by setting the memory limit and memory request +the same for containers likely to perform intensive I/O activity. You will need +to estimate or measure an optimal memory limit value for that container. 
+ +## {{% heading "whatsnext" %}} + +* Learn about [API-initiated Eviction](/docs/concepts/scheduling-eviction/api-eviction/) +* Learn about [Pod Priority and Preemption](/docs/concepts/scheduling-eviction/pod-priority-preemption/) +* Learn about [PodDisruptionBudgets](/docs/tasks/run-application/configure-pdb/) +* Learn about [Quality of Service](/docs/tasks/configure-pod-container/quality-service-pod/) (QoS) +* Check out the [Eviction API](/docs/reference/generated/kubernetes-api/{{}}/#create-eviction-pod-v1-core) \ No newline at end of file diff --git a/content/en/docs/concepts/scheduling-eviction/pod-overhead.md b/content/en/docs/concepts/scheduling-eviction/pod-overhead.md index 15992126f9bb4..eebc235084ec4 100644 --- a/content/en/docs/concepts/scheduling-eviction/pod-overhead.md +++ b/content/en/docs/concepts/scheduling-eviction/pod-overhead.md @@ -5,7 +5,7 @@ reviewers: - tallclair title: Pod Overhead content_type: concept -weight: 50 +weight: 30 --- diff --git a/content/en/docs/concepts/scheduling-eviction/pod-priority-preemption.md b/content/en/docs/concepts/scheduling-eviction/pod-priority-preemption.md index 5e75674d73a3c..112e244f467a4 100644 --- a/content/en/docs/concepts/scheduling-eviction/pod-priority-preemption.md +++ b/content/en/docs/concepts/scheduling-eviction/pod-priority-preemption.md @@ -4,7 +4,7 @@ reviewers: - wojtek-t title: Pod Priority and Preemption content_type: concept -weight: 70 +weight: 50 --- @@ -372,4 +372,6 @@ that exceeds its requests may be evicted. 
## {{% heading "whatsnext" %}} * Read about using ResourceQuotas in connection with PriorityClasses: [limit Priority Class consumption by default](/docs/concepts/policy/resource-quotas/#limit-priority-class-consumption-by-default) - +* Learn about [Pod Disruption](/docs/concepts/workloads/pods/disruptions/) +* Learn about [API-initiated Eviction](/docs/concepts/scheduling-eviction/api-eviction/) +* Learn about [Node-pressure Eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/) diff --git a/content/en/docs/concepts/scheduling-eviction/resource-bin-packing.md b/content/en/docs/concepts/scheduling-eviction/resource-bin-packing.md index 94bfaa1280625..a7b36393669b2 100644 --- a/content/en/docs/concepts/scheduling-eviction/resource-bin-packing.md +++ b/content/en/docs/concepts/scheduling-eviction/resource-bin-packing.md @@ -5,7 +5,7 @@ reviewers: - ahg-g title: Resource Bin Packing for Extended Resources content_type: concept -weight: 30 +weight: 80 --- diff --git a/content/en/docs/concepts/scheduling-eviction/scheduler-perf-tuning.md b/content/en/docs/concepts/scheduling-eviction/scheduler-perf-tuning.md index 24283f2efaf9e..b110dc63e54bf 100644 --- a/content/en/docs/concepts/scheduling-eviction/scheduler-perf-tuning.md +++ b/content/en/docs/concepts/scheduling-eviction/scheduler-perf-tuning.md @@ -3,7 +3,7 @@ reviewers: - bsalamat title: Scheduler Performance Tuning content_type: concept -weight: 80 +weight: 100 --- diff --git a/content/en/docs/concepts/scheduling-eviction/scheduling-framework.md b/content/en/docs/concepts/scheduling-eviction/scheduling-framework.md index 06ed901c2a8bb..3be7adf43067c 100644 --- a/content/en/docs/concepts/scheduling-eviction/scheduling-framework.md +++ b/content/en/docs/concepts/scheduling-eviction/scheduling-framework.md @@ -3,7 +3,7 @@ reviewers: - ahg-g title: Scheduling Framework content_type: concept -weight: 70 +weight: 90 --- diff --git a/content/en/docs/reference/glossary/api-eviction.md 
b/content/en/docs/reference/glossary/api-eviction.md new file mode 100644 index 0000000000000..b13238c955c5f --- /dev/null +++ b/content/en/docs/reference/glossary/api-eviction.md @@ -0,0 +1,22 @@ +--- +title: API-initiated eviction +id: api-eviction +date: 2021-04-27 +full_link: /docs/concepts/scheduling-eviction/pod-eviction/#api-eviction +short_description: > + API-initiated eviction is the process by which you use the Eviction API to create an + Eviction object that triggers graceful pod termination. +aka: +tags: +- operation +--- +API-initiated eviction is the process by which you use the [Eviction API](/docs/reference/generated/kubernetes-api/{{}}/#create-eviction-pod-v1-core) +to create an `Eviction` object that triggers graceful pod termination. + + + +You can request eviction by directly calling the Eviction API +using a client of the kube-apiserver, like the `kubectl drain` command. +When an `Eviction` object is created, the API server terminates the Pod. + +API-initiated eviction is not the same as [node-pressure eviction](/docs/concepts/scheduling-eviction/node-pressure-eviction/). diff --git a/content/en/docs/reference/glossary/node-pressure-eviction.md b/content/en/docs/reference/glossary/node-pressure-eviction.md new file mode 100644 index 0000000000000..742ee3fe0c7bf --- /dev/null +++ b/content/en/docs/reference/glossary/node-pressure-eviction.md @@ -0,0 +1,23 @@ +--- +title: Node-pressure eviction +id: node-pressure-eviction +date: 2021-05-13 +full_link: /docs/concepts/scheduling-eviction/node-pressure-eviction/ +short_description: > + Node-pressure eviction is the process by which the kubelet proactively fails + pods to reclaim resources on nodes. +aka: kubelet eviction +tags: +- operation +--- +Node-pressure eviction is the process by which the {{}} proactively terminates +pods to reclaim resources on nodes. + + + +The kubelet monitors resources like CPU, memory, disk space, and filesystem +inodes on your cluster's nodes. 
When one or more of these resources reach +specific consumption levels, the kubelet can proactively fail one or more pods +on the node to reclaim resources and prevent starvation. + +Node-pressure eviction is not the same as [API-initiated eviction](/docs/concepts/scheduling-eviction/api-eviction/). diff --git a/content/en/docs/reference/glossary/pod-disruption.md b/content/en/docs/reference/glossary/pod-disruption.md new file mode 100644 index 0000000000000..1efd69dd4cfb7 --- /dev/null +++ b/content/en/docs/reference/glossary/pod-disruption.md @@ -0,0 +1,19 @@ +--- +id: pod-disruption +title: Pod Disruption +full_link: /docs/concepts/workloads/pods/disruptions/ +date: 2021-05-12 +short_description: > + The process by which Pods on Nodes are terminated either voluntarily or involuntarily. + +aka: +related: + - pod + - container +tags: + - operation +--- + +[Pod disruption](/docs/concepts/workloads/pods/disruptions/) is the process by which Pods on Nodes are terminated either voluntarily or involuntarily. + +Voluntary disruptions are started intentionally by application owners or cluster administrators. Involuntary disruptions are unintentional and can be triggered by unavoidable issues like Nodes running out of resources, or by accidental deletions. diff --git a/content/en/docs/tasks/administer-cluster/out-of-resource.md b/content/en/docs/tasks/administer-cluster/out-of-resource.md deleted file mode 100644 index f750dd2585397..0000000000000 --- a/content/en/docs/tasks/administer-cluster/out-of-resource.md +++ /dev/null @@ -1,354 +0,0 @@ ---- -reviewers: -- derekwaynecarr -- vishh -- timstclair -title: Configure Out of Resource Handling -content_type: concept ---- - - - -This page explains how to configure out of resource handling with `kubelet`. - -The `kubelet` needs to preserve node stability when available compute resources -are low. This is especially important when dealing with incompressible -compute resources, such as memory or disk space. 
If such resources are exhausted, -nodes become unstable. - - - -### Eviction Signals - -The `kubelet` supports eviction decisions based on the signals described in the following -table. The value of each signal is described in the Description column, which is based on -the `kubelet` summary API. - -| Eviction Signal | Description | -|----------------------|---------------------------------------------------------------------------------------| -| `memory.available` | `memory.available` := `node.status.capacity[memory]` - `node.stats.memory.workingSet` | -| `nodefs.available` | `nodefs.available` := `node.stats.fs.available` | -| `nodefs.inodesFree` | `nodefs.inodesFree` := `node.stats.fs.inodesFree` | -| `imagefs.available` | `imagefs.available` := `node.stats.runtime.imagefs.available` | -| `imagefs.inodesFree` | `imagefs.inodesFree` := `node.stats.runtime.imagefs.inodesFree` | -| `pid.available` | `pid.available` := `node.stats.rlimit.maxpid` - `node.stats.rlimit.curproc` | - -Each of the above signals supports either a literal or percentage based value. -The percentage based value is calculated relative to the total capacity -associated with each signal. - -The value for `memory.available` is derived from the cgroupfs instead of tools -like `free -m`. This is important because `free -m` does not work in a -container, and if users use the [node -allocatable](/docs/tasks/administer-cluster/reserve-compute-resources/#node-allocatable) feature, out of resource decisions -are made local to the end user Pod part of the cgroup hierarchy as well as the -root node. This [script](/examples/admin/resource/memory-available.sh) -reproduces the same set of steps that the `kubelet` performs to calculate -`memory.available`. The `kubelet` excludes inactive_file (i.e. # of bytes of -file-backed memory on inactive LRU list) from its calculation as it assumes that -memory is reclaimable under pressure. - -`kubelet` supports only two filesystem partitions. - -1. 
The `nodefs` filesystem that kubelet uses for volumes, daemon logs, etc.
-1. The `imagefs` filesystem that container runtimes uses for storing images and
- container writable layers.
-
-`imagefs` is optional. `kubelet` auto-discovers these filesystems using
-cAdvisor. `kubelet` does not care about any other filesystems. Any other types
-of configurations are not currently supported by the kubelet. For example, it is
-_not OK_ to store volumes and logs in a dedicated `filesystem`.
-
-In future releases, the `kubelet` will deprecate the existing [garbage
-collection](/docs/concepts/cluster-administration/kubelet-garbage-collection/)
-support in favor of eviction in response to disk pressure.
-
-### Eviction Thresholds
-
-The `kubelet` supports the ability to specify eviction thresholds that trigger the `kubelet` to reclaim resources.
-
-Each threshold has the following form:
-
-`[eviction-signal][operator][quantity]`
-
-where:
-
-* `eviction-signal` is an eviction signal token as defined in the previous table.
-* `operator` is the desired relational operator, such as `<` (less than).
-* `quantity` is the eviction threshold quantity, such as `1Gi`. These tokens must match the quantity representation used by Kubernetes. An eviction threshold can also be expressed as a percentage using the `%` token.
-
-For example, if a node has `10Gi` of total memory and you want trigger eviction if
-the available memory falls below `1Gi`, you can define the eviction threshold as
-either `memory.available<10%` or `memory.available<1Gi`. You cannot use both.
-
-#### Soft Eviction Thresholds
-
-A soft eviction threshold pairs an eviction threshold with a required
-administrator-specified grace period. No action is taken by the `kubelet`
-to reclaim resources associated with the eviction signal until that grace
-period has been exceeded. If no grace period is provided, the `kubelet`
-returns an error on startup.
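
The threshold form `[eviction-signal][operator][quantity]` and the grace-period rule for soft thresholds can be illustrated with a short Python sketch. It is not kubelet code: the function names, the simplified quantity parser (binary suffixes and `%` only), and the sample values are assumptions made for illustration, and only the `<` operator is handled.

```python
import re

# Binary suffixes only; a simplified stand-in for Kubernetes quantity parsing.
_UNITS = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30}


def parse_threshold(text):
    """Split '[eviction-signal][operator][quantity]' at the '<' operator."""
    signal, sep, quantity = text.partition("<")
    if not sep or not signal or not quantity:
        raise ValueError(f"not a valid threshold: {text!r}")
    return signal, quantity


def quantity_to_bytes(quantity, capacity_bytes):
    """Resolve a literal ('1Gi') or percentage ('10%') quantity to bytes."""
    if quantity.endswith("%"):
        return capacity_bytes * float(quantity[:-1]) / 100
    match = re.fullmatch(r"(\d+(?:\.\d+)?)(Ki|Mi|Gi)", quantity)
    if match is None:
        raise ValueError(f"unsupported quantity: {quantity!r}")
    return float(match.group(1)) * _UNITS[match.group(2)]


def soft_threshold_met(samples, limit_bytes, grace_seconds):
    """samples: list of (timestamp_seconds, available_bytes), oldest first.

    Returns True once the signal has stayed below the limit for at least
    grace_seconds without interruption (an assumed reading of "grace period").
    """
    below_since = None
    for timestamp, available in samples:
        if available < limit_bytes:
            if below_since is None:
                below_since = timestamp
            if timestamp - below_since >= grace_seconds:
                return True
        else:
            below_since = None  # signal recovered, so the clock restarts
    return False
```

With a `10Gi` node, `memory.available<10%` and `memory.available<1Gi` resolve to the same limit in this sketch, which matches the equivalence stated above.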
-
-In addition, if a soft eviction threshold has been met, an operator can
-specify a maximum allowed Pod termination grace period to use when evicting
-pods from the node. If specified, the `kubelet` uses the lesser value among
-the `pod.Spec.TerminationGracePeriodSeconds` and the max allowed grace period.
-If not specified, the `kubelet` kills Pods immediately with no graceful
-termination.
-
-To configure soft eviction thresholds, the following flags are supported:
-
-* `eviction-soft` describes a set of eviction thresholds (e.g. `memory.available<1.5Gi`) that if met over a corresponding grace period would trigger a Pod eviction.
-* `eviction-soft-grace-period` describes a set of eviction grace periods (e.g. `memory.available=1m30s`) that correspond to how long a soft eviction threshold must hold before triggering a Pod eviction.
-* `eviction-max-pod-grace-period` describes the maximum allowed grace period (in seconds) to use when terminating pods in response to a soft eviction threshold being met.
-
-#### Hard Eviction Thresholds
-
-A hard eviction threshold has no grace period, and if observed, the `kubelet`
-will take immediate action to reclaim the associated starved resource. If a
-hard eviction threshold is met, the `kubelet` kills the Pod immediately
-with no graceful termination.
-
-To configure hard eviction thresholds, the following flag is supported:
-
-* `eviction-hard` describes a set of eviction thresholds (e.g. `memory.available<1Gi`) that if met would trigger a Pod eviction.
-
-The `kubelet` has the following default hard eviction threshold:
-
-* `memory.available<100Mi`
-* `nodefs.available<10%`
-* `imagefs.available<15%`
-
-On a Linux node, the default value also includes `nodefs.inodesFree<5%`.
-
-### Eviction Monitoring Interval
-
-The `kubelet` evaluates eviction thresholds per its configured housekeeping interval.
-
-* `housekeeping-interval` is the interval between container housekeepings which defaults to `10s`.
-
-### Node Conditions
-
-The `kubelet` maps one or more eviction signals to a corresponding node condition.
-
-If a hard eviction threshold has been met, or a soft eviction threshold has been met
-independent of its associated grace period, the `kubelet` reports a condition that
-reflects the node is under pressure.
-
-The following node conditions are defined that correspond to the specified eviction signal.
-
-| Node Condition | Eviction Signal | Description |
-|-------------------|---------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------------------------------------|
-| `MemoryPressure` | `memory.available` | Available memory on the node has satisfied an eviction threshold |
-| `DiskPressure` | `nodefs.available`, `nodefs.inodesFree`, `imagefs.available`, or `imagefs.inodesFree` | Available disk space and inodes on either the node's root filesystem or image filesystem has satisfied an eviction threshold |
-| `PIDPressure` | `pid.available` | Available processes identifiers on the (Linux) node has fallen below an eviction threshold | |
-
-The `kubelet` continues to report node status updates at the frequency specified by
-`--node-status-update-frequency` which defaults to `10s`.
-
-### Oscillation of node conditions
-
-If a node is oscillating above and below a soft eviction threshold, but not exceeding
-its associated grace period, it would cause the corresponding node condition to
-constantly oscillate between true and false, and could cause poor scheduling decisions
-as a consequence.
-
-To protect against this oscillation, the following flag is defined to control how
-long the `kubelet` must wait before transitioning out of a pressure condition.
-
-* `eviction-pressure-transition-period` is the duration for which the `kubelet` has to wait before transitioning out of an eviction pressure condition.
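
A minimal sketch of this oscillation guard, assuming the condition is recomputed on each observation and that timestamps are plain seconds. The function name `pressure_condition` and the sample period are hypothetical and do not come from the kubelet source.

```python
def pressure_condition(observations, transition_period):
    """observations: list of (timestamp_seconds, threshold_met), oldest first.

    Returns the condition (True = under pressure) reported after each
    observation. The condition becomes True as soon as a threshold is met
    and returns to False only once the threshold has gone unmet for at
    least transition_period seconds.
    """
    reported = []
    last_met = None  # time of the most recent observation with the threshold met
    for timestamp, threshold_met in observations:
        if threshold_met:
            last_met = timestamp
        recently_met = (
            last_met is not None and timestamp - last_met < transition_period
        )
        reported.append(threshold_met or recently_met)
    return reported


if __name__ == "__main__":
    # A signal that flaps around its threshold, then stays healthy.
    flapping = [(0, False), (10, True), (20, False), (30, True),
                (40, False), (320, False), (340, False)]
    print(pressure_condition(flapping, 0))    # follows every flap
    print(pressure_condition(flapping, 300))  # held until 300s without pressure
```

With a transition period of `0` the reported condition flips with every observation; a longer period holds it at true until the threshold has stayed unmet for the whole period.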
-
-The `kubelet` would ensure that it has not observed an eviction threshold being met
-for the specified pressure condition for the period specified before toggling the
-condition back to `false`.
-
-### Reclaiming node level resources
-
-If an eviction threshold has been met and the grace period has passed,
-the `kubelet` initiates the process of reclaiming the pressured resource
-until it has observed the signal has gone below its defined threshold.
-
-The `kubelet` attempts to reclaim node level resources prior to evicting end-user Pods. If
-disk pressure is observed, the `kubelet` reclaims node level resources differently if the
-machine has a dedicated `imagefs` configured for the container runtime.
-
-#### With `imagefs`
-
-If `nodefs` filesystem has met eviction thresholds, `kubelet` frees up disk space by deleting the dead Pods and their containers.
-
-If `imagefs` filesystem has met eviction thresholds, `kubelet` frees up disk space by deleting all unused images.
-
-#### Without `imagefs`
-
-If `nodefs` filesystem has met eviction thresholds, `kubelet` frees up disk space in the following order:
-
-1. Delete dead Pods and their containers
-1. Delete all unused images
-
-### Evicting end-user Pods
-
-If the `kubelet` is unable to reclaim sufficient resource on the node, `kubelet` begins evicting Pods.
-
-The `kubelet` ranks Pods for eviction first by whether or not their usage of the starved resource exceeds requests,
-then by [Priority](/docs/concepts/configuration/pod-priority-preemption/), and then by the consumption of the starved compute resource relative to the Pods' scheduling requests.
-
-As a result, `kubelet` ranks and evicts Pods in the following order:
-
-* `BestEffort` or `Burstable` Pods whose usage of a starved resource exceeds its request. Such pods are ranked by Priority, and then usage above request.
-* `Guaranteed` pods and `Burstable` pods whose usage is beneath requests are evicted last.
`Guaranteed` Pods are guaranteed only when requests and limits are specified for all the containers and they are equal. Such pods are guaranteed to never be evicted because of another Pod's resource consumption. If a system daemon (such as `kubelet`, `docker`, and `journald`) is consuming more resources than were reserved via `system-reserved` or `kube-reserved` allocations, and the node only has `Guaranteed` or `Burstable` Pods using less than requests remaining, then the node must choose to evict such a Pod in order to preserve node stability and to limit the impact of the unexpected consumption to other Pods. In this case, it will choose to evict pods of Lowest Priority first.
-
-If necessary, `kubelet` evicts Pods one at a time to reclaim disk when `DiskPressure`
-is encountered. If the `kubelet` is responding to `inode` starvation, it reclaims
-`inodes` by evicting Pods with the lowest quality of service first. If the `kubelet`
-is responding to lack of available disk, it ranks Pods within a quality of service
-that consumes the largest amount of disk and kills those first.
-
-#### With `imagefs`
-
-If `nodefs` is triggering evictions, `kubelet` sorts Pods based on the usage on `nodefs`
-
-- local volumes + logs of all its containers.
-
-If `imagefs` is triggering evictions, `kubelet` sorts Pods based on the writable layer usage of all its containers.
-
-#### Without `imagefs`
-
-If `nodefs` is triggering evictions, `kubelet` sorts Pods based on their total disk usage
-
-- local volumes + logs & writable layer of all its containers.
-
-### Minimum eviction reclaim
-
-In certain scenarios, eviction of Pods could result in reclamation of small amount of resources. This can result in
-`kubelet` hitting eviction thresholds in repeated successions. In addition to that, eviction of resources like `disk`, is time consuming.
-
-To mitigate these issues, `kubelet` can have a per-resource `minimum-reclaim`.
Whenever `kubelet` observes
-resource pressure, `kubelet` attempts to reclaim at least `minimum-reclaim` amount of resource below
-the configured eviction threshold.
-
-For example, with the following configuration:
-
-```
---eviction-hard=memory.available<500Mi,nodefs.available<1Gi,imagefs.available<100Gi
---eviction-minimum-reclaim="memory.available=0Mi,nodefs.available=500Mi,imagefs.available=2Gi"`
-```
-
-If an eviction threshold is triggered for `memory.available`, the `kubelet` works to ensure
-that `memory.available` is at least `500Mi`. For `nodefs.available`, the `kubelet` works
-to ensure that `nodefs.available` is at least `1.5Gi`, and for `imagefs.available` it
-works to ensure that `imagefs.available` is at least `102Gi` before no longer reporting pressure
-on their associated resources.
-
-The default `eviction-minimum-reclaim` is `0` for all resources.
-
-### Scheduler
-
-The node reports a condition when a compute resource is under pressure. The
-scheduler views that condition as a signal to dissuade placing additional
-pods on the node.
-
-| Node Condition | Scheduler Behavior |
-| ------------------| ----------------------------------------------------|
-| `MemoryPressure` | No new `BestEffort` Pods are scheduled to the node. |
-| `DiskPressure` | No new Pods are scheduled to the node. |
-
-## Node OOM Behavior
-
-If the node experiences a system OOM (out of memory) event prior to the `kubelet` being able to reclaim memory,
-the node depends on the [oom_killer](https://lwn.net/Articles/391222/) to respond.
-
-The `kubelet` sets a `oom_score_adj` value for each container based on the quality of service for the Pod.
-
-| Quality of Service | oom_score_adj |
-|--------------------|-----------------------------------------------------------------------------------|
-| `Guaranteed` | -998 |
-| `BestEffort` | 1000 |
-| `Burstable` | min(max(2, 1000 - (1000 * memoryRequestBytes) / machineMemoryCapacityBytes), 999) |
-
-If the `kubelet` is unable to reclaim memory prior to a node experiencing system OOM, the `oom_killer` calculates
-an `oom_score` based on the percentage of memory it's using on the node, and then add the `oom_score_adj` to get an
-effective `oom_score` for the container, and then kills the container with the highest score.
-
-The intended behavior should be that containers with the lowest quality of service that
-are consuming the largest amount of memory relative to the scheduling request should be killed first in order
-to reclaim memory.
-
-Unlike Pod eviction, if a Pod container is OOM killed, it may be restarted by the `kubelet` based on its `RestartPolicy`.
-
-## Best Practices
-
-The following sections describe best practices for out of resource handling.
-
-### Schedulable resources and eviction policies
-
-Consider the following scenario:
-
-* Node memory capacity: `10Gi`
-* Operator wants to reserve 10% of memory capacity for system daemons (kernel, `kubelet`, etc.)
-* Operator wants to evict Pods at 95% memory utilization to reduce incidence of system OOM.
-
-To facilitate this scenario, the `kubelet` would be launched as follows:
-
-```
---eviction-hard=memory.available<500Mi
---system-reserved=memory=1.5Gi
-```
-
-Implicit in this configuration is the understanding that "System reserved" should include the amount of memory
-covered by the eviction threshold.
-
-To reach that capacity, either some Pod is using more than its request, or the system is using more than `1.5Gi - 500Mi = 1Gi`.
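
The arithmetic behind this scenario can be checked with a few lines of Python. This is a sketch of the reasoning in the text, not of how the kubelet computes allocatable resources; `memory_budget` and its keys are hypothetical names, and all values are in Mi.

```python
def memory_budget(capacity_mi, system_reserved_mi, eviction_hard_mi):
    """Derive the figures discussed in the scenario (all values in Mi)."""
    return {
        # Memory left for Pod requests once the system reservation is set aside.
        "schedulable_mi": capacity_mi - system_reserved_mi,
        # Usage level at which memory.available falls below the hard threshold.
        "eviction_point_mi": capacity_mi - eviction_hard_mi,
        # What system daemons can use before eviction starts, assuming Pods
        # use exactly what they requested.
        "system_headroom_mi": system_reserved_mi - eviction_hard_mi,
    }


if __name__ == "__main__":
    capacity = 10 * 1024  # 10Gi node
    budget = memory_budget(capacity, 1536, 500)  # 1.5Gi reserved, 500Mi threshold
    print(budget)
    print(f"eviction at {budget['eviction_point_mi'] / capacity:.1%} utilization")
```

Working in exact units, the system headroom comes out at `1036Mi`, which the text rounds to `1Gi`, and the eviction point sits at roughly 95% utilization.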
-
-This configuration ensures that the scheduler does not place Pods on a node that immediately induce memory pressure
-and trigger eviction assuming those Pods use less than their configured request.
-
-### DaemonSet
-
-As `Priority` is a key factor in the eviction strategy, if you do not want pods belonging to a `DaemonSet` to be evicted, specify a sufficiently high priorityClass in the pod spec template. If you want pods belonging to a `DaemonSet` to run only if there are sufficient resources, specify a lower or default priorityClass.
-
-
-## Deprecation of existing feature flags to reclaim disk
-
-`kubelet` has been freeing up disk space on demand to keep the node stable.
-
-As disk based eviction matures, the following `kubelet` flags are marked for deprecation
-in favor of the simpler configuration supported around eviction.
-
-| Existing Flag | New Flag |
-| ------------------------------------------ | ----------------------------------------|
-| `--image-gc-high-threshold` | `--eviction-hard` or `eviction-soft` |
-| `--image-gc-low-threshold` | `--eviction-minimum-reclaim` |
-| `--maximum-dead-containers` | deprecated |
-| `--maximum-dead-containers-per-container` | deprecated |
-| `--minimum-container-ttl-duration` | deprecated |
-| `--low-diskspace-threshold-mb` | `--eviction-hard` or `eviction-soft` |
-| `--outofdisk-transition-frequency` | `--eviction-pressure-transition-period` |
-
-## Known issues
-
-The following sections describe known issues related to out of resource handling.
-
-### kubelet may not observe memory pressure right away
-
-The `kubelet` currently polls `cAdvisor` to collect memory usage stats at a regular interval. If memory usage
-increases within that window rapidly, the `kubelet` may not observe `MemoryPressure` fast enough, and the `OOMKiller`
-will still be invoked.
We intend to integrate with the `memcg` notification API in a future release to reduce this
-latency, and instead have the kernel tell us when a threshold has been crossed immediately.
-
-If you are not trying to achieve extreme utilization, but a sensible measure of overcommit, a viable workaround for
-this issue is to set eviction thresholds at approximately 75% capacity. This increases the ability of this feature
-to prevent system OOMs, and promote eviction of workloads so cluster state can rebalance.
-
-### kubelet may evict more Pods than needed
-
-The Pod eviction may evict more Pods than needed due to stats collection timing gap. This can be mitigated by adding
-the ability to get root container stats on an on-demand basis [(https://github.com/google/cadvisor/issues/1247)](https://github.com/google/cadvisor/issues/1247) in the future.
-
-### active_file memory is not considered as available memory
-
-On Linux, the kernel tracks the number of bytes of file-backed memory on active LRU list as the `active_file` statistic. The kubelet treats `active_file` memory areas as not reclaimable. For workloads that make intensive use of block-backed local storage, including ephemeral local storage, kernel-level caches of file and block data means that many recently accessed cache pages are likely to be counted as `active_file`. If enough of these kernel block buffers are on the active LRU list, the kubelet is liable to observe this as high resource use and taint the node as experiencing memory pressure - triggering Pod eviction.
-
-For more more details, see [https://github.com/kubernetes/kubernetes/issues/43916](https://github.com/kubernetes/kubernetes/issues/43916)
-
-You can work around that behavior by setting the memory limit and memory request the same for containers likely to perform intensive I/O activity. You will need to estimate or measure an optimal memory limit value for that container.
-
diff --git a/static/_redirects b/static/_redirects
index 0dfa334de2d4d..3ebe2f337ede4 100644
--- a/static/_redirects
+++ b/static/_redirects
@@ -91,7 +91,7 @@
 /docs/concepts/cluster-administration/guaranteed-scheduling-critical-addon-pods/ /docs/tasks/administer-cluster/guaranteed-scheduling-critical-addon-pods/ 301
 /docs/concepts/cluster-administration/master-node-communication/ /docs/concepts/architecture/master-node-communication/ 301
 /docs/concepts/cluster-administration/network-plugins/ /docs/concepts/extend-kubernetes/compute-storage-net/network-plugins/ 301
-/docs/concepts/cluster-administration/out-of-resource/ /docs/tasks/administer-cluster/out-of-resource/ 301
+/docs/concepts/cluster-administration/out-of-resource/ /docs/concepts/scheduling-eviction/node-pressure-eviction/ 301
 /docs/concepts/cluster-administration/resource-usage-monitoring /docs/tasks/debug-application-cluster/resource-usage-monitoring/ 301
 /docs/concepts/cluster-administration/monitoring/ /docs/concepts/cluster-administration/system-metrics/ 301
 /docs/concepts/cluster-administration/controller-metrics/ /docs/concepts/cluster-administration/system-metrics/ 301
@@ -127,6 +127,7 @@
 /id/docs/concepts/scheduling/scheduler-perf-tuning/ /id/docs/concepts/scheduling-eviction/scheduler-perf-tuning/ 301
 /docs/concepts/scheduling/scheduling-framework/ /docs/concepts/scheduling-eviction/scheduling-framework/ 301
 /id/docs/concepts/scheduling/scheduling-framework/ /id/docs/concepts/scheduling-eviction/scheduling-framework/ 301
+/docs/concepts/scheduling-eviction/eviction-policy/ /docs/concepts/scheduling-eviction/node-pressure-eviction/ 301
 /docs/concepts/service-catalog/ /docs/concepts/extend-kubernetes/service-catalog/ 301
 /docs/concepts/services-networking/networkpolicies/ /docs/concepts/services-networking/network-policies/ 301
 /docs/concepts/storage/etcd-store-api-object/ /docs/tasks/administer-cluster/configure-upgrade-etcd/ 301
@@ -261,6 +262,7 @@
 /docs/tasks/administer-cluster/quota-memory-cpu-namespace/ /docs/tasks/administer-cluster/manage-resources/quota-memory-cpu-namespace/ 301
 /docs/tasks/administer-cluster/quota-pod-namespace/ /docs/tasks/administer-cluster/manage-resources/quota-pod-namespace/ 301
 /docs/tasks/administer-cluster/reserve-compute-resources/out-of-resource.md /docs/tasks/administer-cluster/out-of-resource/ 301
+/docs/tasks/administer-cluster/out-of-resource/ /docs/concepts/scheduling-eviction/pod-eviction/ 301
 /docs/tasks/administer-cluster/romana-network-policy/ /docs/tasks/administer-cluster/network-policy-provider/romana-network-policy/ 301
 /docs/tasks/administer-cluster/running-cloud-controller.md /docs/tasks/administer-cluster/running-cloud-controller/ 301
 /docs/tasks/administer-cluster/share-configuration/ /docs/tasks/access-application-cluster/configure-access-multiple-clusters/ 301