KEP-2000: Graceful Node Shutdown

Release Signoff Checklist

Items marked with (R) are required prior to targeting to a milestone / release.

  • (R) Enhancement issue in release milestone, which links to KEP dir in kubernetes/enhancements (not the initial KEP PR)
  • (R) KEP approvers have approved the KEP status as implementable
  • (R) Design details are appropriately documented
  • (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input
  • (R) Graduation criteria is in place
  • (R) Production readiness review completed
  • Production readiness review approved
  • "Implementation History" section is up-to-date for milestone
  • User-facing documentation has been created in kubernetes/website, for publication to kubernetes.io
  • Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes

Summary

Kubelet should be aware of node shutdown and trigger graceful shutdown of pods during a machine shutdown.

Motivation

Users and cluster administrators expect that pods will adhere to expected pod lifecycle including pod termination. Currently, when a node shuts down, pods do not follow the expected pod termination lifecycle and are not terminated gracefully which can cause issues for some workloads. This KEP aims to address this problem by making the kubelet aware of the underlying node shutdown. Kubelet will propagate this signal to pods ensuring they can shutdown as gracefully as possible.

Goals

  • Make kubelet aware of underlying node shutdown event and trigger pod termination with sufficient grace period to shutdown properly
  • Handle node shutdown in cloud-provider agnostic way
  • Introduce a minimal shutdown delay so that the node shuts down as soon as possible (but not sooner)
  • Focus on handling shutdown on systemd based machines

Non-Goals

  • Let users modify or change existing pod lifecycle or introduce new inner pod dependencies / shutdown ordering
  • Support every linux init and ACPI event handling mechanism (focus on widely used logind from systemd)
  • Provide guarantee to handle all cases of graceful node shutdown, for example abrupt shutdown or sudden power cable pull can’t result in graceful shutdown

Proposal

User Stories (Optional)

Story 1

  • As a cluster administrator, I can configure the nodes in my cluster to allocate X seconds for my pods to terminate gracefully during a node shutdown

Story 2

  • As a developer I can expect that my pods will terminate gracefully during node shutdowns

Background on Linux Shutdown

In the context of this KEP, shutdown refers to shutdown of the underlying machine. On most Linux distros, shutdown can be initiated via a variety of methods, for example:

  1. shutdown -h now
  2. shutdown -h +30 #schedule a delayed shutdown in 30mins
  3. systemctl poweroff
  4. Physically pressing the power button on the machine
  5. If a machine is a VM, the underlying hypervisor can press the “virtual” power button
  6. For a cloud instance, stopping the instance via Cloud API, e.g. via gcloud compute instances stop. Depending on the cloud provider, this may result in virtual power button press by the underlying hypervisor.

Note: The use of shutdown -h now is dependent on systemd version. This is explored in Github issue #124039

Some of these cases will involve the machine receiving an ACPI event to change the power state. The machine can go from G0 (working state) to G2 (Soft Off) and finally to G3 (Off); see more info on ACPI states. On Linux, a system daemon usually listens for these events and performs a series of actions before userspace calls the reboot(2) system call with LINUX_REBOOT_CMD_POWER_OFF or LINUX_REBOOT_CMD_HALT to actually shut down the machine.

Historically, ACPI events were often handled by the acpid daemon, which uses a variety of mechanisms to watch ACPI events (e.g. reading /proc/acpi/event or /dev/input/eventX to react to power button presses). However, in most modern Linux distros today, systemd-logind has taken over as the main component reacting to ACPI events and initiating shutdown of the machine. On a system with systemd-logind, for example, a press of the power button will result in the systemd target poweroff being run (see HandlePowerKey), which will terminate all the systemd services running on the machine and eventually shut it down. However, in the context of Kubernetes, systemd is not aware of the pods and containers running on the machine, so it will simply kill them as regular Linux processes.

Background on Inhibitors

systemd-logind lets applications delay shutdown and perform a series of actions before the shutdown completes, through a mechanism called "Inhibitor Locks". An application requests a delay by taking an inhibitor lock, which it does by sending messages to logind over dbus. Delay-based locks allow applications to receive sleep and shutdown events and to hold off the shutdown for up to InhibitDelayMaxSec (a setting configured in logind.conf) while they execute critical work prior to shutdown/sleep. Inhibitor Locks were introduced in systemd 183 (released in 2012).

We believe that making use of systemd is a reasonable approach considering almost all new popular linux distros are systemd based (RHEL, Google COS, Ubuntu, CentOS, Debian, Fedora, Flatcar Linux, see widespread adoption) and systemd 183 (released in 2012) features support for inhibitors.

Thanks to @giuseppe for helping with getting systemd inhibitors working!

Implementation

Introduce a new Kubelet Config setting, kubeletConfig.ShutdownGracePeriod, defaulting to 0 seconds. Upon kubelet startup,

  • if the setting is greater than 0 seconds
    • kubelet will query logind over dbus for the current InhibitDelayMaxSec and check whether kubeletConfig.ShutdownGracePeriod <= InhibitDelayMaxSec.
  • if kubeletConfig.ShutdownGracePeriod > InhibitDelayMaxSec
    • Kubelet will attempt to update the InhibitDelayMaxSec setting, by writing a config file to /etc/systemd/logind.conf.d/kubelet.conf, and sending a SIGHUP to logind to update the config setting to ensure that the ShutdownGracePeriod from kubelet config is equal to InhibitDelayMaxSec.

After updating the InhibitDelayMaxSec on the node if needed, Kubelet will query the dbus for the final value of InhibitDelayMaxSec set on the node and treat min(InhibitDelayMaxSec, kubeletConfig.ShutdownGracePeriod) as the allocatable shutdown grace period, which will be referred to in this KEP as ShutdownGracePeriod.

Kubelet will take a delay-type systemd inhibitor lock over dbus for the ShutdownGracePeriod, covering the shutdown event. Kubelet will also subscribe to the PrepareForShutdown signal, which is emitted prior to the shutdown. Upon receiving the signal, Kubelet will have up to ShutdownGracePeriod of additional time before the node actually initiates the shutdown.

Handling the shutdown

Upon a shutdown occurring, Kubelet will gracefully terminate all the pods running on the node and update the Ready condition of the node to false with a message Node Shutting Down, thereby ensuring new workloads will not get scheduled to the node.

Since some of the pods running on the node are often critical for the workloads running on that node (e.g. logging pod daemonset, kubeproxy, kubedns), we choose to split the pods running on the node into two categories: “critical system pods” and regular pods. Critical system pods should be terminated last because, for example, if the logging pod is terminated first, logs from the other workloads will not be captured. Critical system pods are identified as those that are in the system-cluster-critical or system-node-critical priority classes.

Upon shutdown Kubelet will:

  1. Update the Node's Ready condition to false, with the reason Node is shutting down
  2. Gracefully terminate all non critical system pods with a gracePeriodOverride computed as min(podSpec.terminationGracePeriodSeconds, ShutdownGracePeriod-ShutdownGracePeriodCriticalPods)
  3. Gracefully terminate all critical system pods with gracePeriodOverride of ShutdownGracePeriodCriticalPods seconds

Kubelet will use the same existing killPod function to perform the termination of pods, using gracePeriodOverride to set the appropriate grace period. During the termination process, normal pod termination processes will apply, e.g. preStop Hooks will be called, SIGTERM to containers delivered, etc.

To ensure gracePeriodOverride is respected, Github issue #92432 should also be addressed to ensure that gracePeriod override will be respected for preStop hooks.

POC: I’ve prototyped an initial POC here of the proposed implementation on the shutdown branch.

Notes/Constraints/Caveats (Optional)

Risks and Mitigations

  • Kubelet does not receive the shutdown event or is unable to create the inhibitor lock
    • Mitigation: Kubelet does not provide graceful shutdown to pods (same as today’s existing behavior). For alpha stage, to track shutdown behavior and if it was successful, we plan to add a debugging log statement just prior to kubelet's shutdown process being completed, so it's possible to verify if kubelet shutdown the node gracefully.
  • Kubelet is unable to update InhibitDelayMaxSec in logind to match that of kubeletConfig.ShutdownGracePeriod
    • If there are multiple logind configuration file overrides in /etc/systemd/logind.conf.d/, logind will use the config file with the lexicographically latest name. As a result in rare cases, the kubelet’s InhibitDelayMaxSec conf file override may be overwritten by another config file (possibly placed by another service on the machine).
    • Mitigation: Kubelet will use current value of InhibitDelayMaxSec from logind as the shutdown period which may be less than kubeletConfig.ShutdownGracePeriod.
  • OS / Distro does not use systemd or systemd version < 183
    • Mitigation: Kubelet will not provide graceful shutdown to pods (same as today’s existing behavior).
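
To illustrate the config file precedence risk above, the sketch below picks the drop-in whose name sorts last. It assumes, for simplicity, that every listed file sets InhibitDelayMaxSec; the file names other than kubelet.conf, the values, and the function name winningDropIn are all hypothetical.

```go
package main

import (
	"fmt"
	"sort"
)

// winningDropIn takes a map of drop-in file name to the InhibitDelayMaxSec
// value (in seconds) that file sets, and returns the file whose name sorts
// last lexicographically together with its value. ok is false if the map is
// empty. Hypothetical helper, named for this sketch only.
func winningDropIn(dropIns map[string]int) (name string, seconds int, ok bool) {
	if len(dropIns) == 0 {
		return "", 0, false
	}
	names := make([]string, 0, len(dropIns))
	for n := range dropIns {
		names = append(names, n)
	}
	sort.Strings(names)
	last := names[len(names)-1]
	return last, dropIns[last], true
}

func main() {
	dropIns := map[string]int{
		"10-base.conf":  5,
		"kubelet.conf":  30, // written by the kubelet
		"zz-other.conf": 5,  // placed by another service (made up)
	}
	name, secs, _ := winningDropIn(dropIns)
	fmt.Printf("%s wins with InhibitDelayMaxSec=%d\n", name, secs)
}
```

In this made-up layout the kubelet's 30 second override loses to zz-other.conf, which is the case where the kubelet falls back to the value it reads from logind.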

Design Details

The design proposes adding a new KubeletConfig field, ShutdownGracePeriod, which specifies the total time period the kubelet should delay shutdown by, and thus the total time allocated to the graceful termination process.

In addition to ShutdownGracePeriod, another KubeletConfig field, ShutdownGracePeriodCriticalPods, will be added. During the shutdown, the ShutdownGracePeriod-ShutdownGracePeriodCriticalPods duration will be the grace period for non critical system pods like user workloads, while the remaining time of ShutdownGracePeriodCriticalPods will be the grace period for critical pods like node logging daemonsets.

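type KubeletConfiguration struct {
    // ... existing fields elided ...
    // ShutdownGracePeriod is the total duration the kubelet delays node
    // shutdown by, and thus the total time allocated to graceful termination.
    ShutdownGracePeriod metav1.Duration
    // ShutdownGracePeriodCriticalPods is the part of ShutdownGracePeriod
    // reserved for terminating critical system pods.
    ShutdownGracePeriodCriticalPods metav1.Duration
}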

Communication with systemd over dbus (for creating the inhibitor lock, receiving the PrepareForShutdown callback, etc.) will make use of the github.com/godbus/dbus/v5 package, which is already included in vendor/.

Termination of pods will make use of the existing killPod function from the kubelet package and specify the appropriate gracePeriodOverride as necessary.

Test Plan

  • Unit tests for kubelet of handling shutdown event
  • New E2E tests to validate node graceful shutdown (note limitation that K8S E2E tests currently only run on GCE).
    • Shutdown grace period unspecified, feature is not active
    • Pod’s ExecStop and SIGTERM handlers are given gracePeriodSeconds for case when gracePeriodSeconds <= kubeletConfig.ShutdownGracePeriod
    • Pod’s ExecStop and SIGTERM handlers are given kubeletConfig.ShutdownGracePeriod for case when gracePeriodSeconds > kubeletConfig.ShutdownGracePeriod

Graduation Criteria

Alpha Graduation

  • Implemented the feature for Linux (systemd) only
  • Unit tests
    • Unit tests will mock out system components (i.e. systemd, inhibitors) for alpha
  • Investigate how e2e tests can be implemented (e.g. may need to create fake shutdown event)

Alpha -> Beta Graduation

  • Addresses feedback from alpha testers
  • Sufficient E2E and unit testing

Beta -> GA Graduation

  • Addresses feedback from beta
  • Sufficient number of users using the feature
  • Confident that no further API / kubelet config configuration options changes are needed
  • Close on any remaining open issues & bugs

Upgrade / Downgrade Strategy

n/a

Version Skew Strategy

n/a

Production Readiness Review Questionnaire

Feature Enablement and Rollback

This section must be completed when targeting alpha to a release.

  • How can this feature be enabled / disabled in a live cluster?

    • Feature gate (also fill in values in kep.yaml)
      • Feature gate name: GracefulNodeShutdown
      • Components depending on the feature gate:
        • kubelet
    • Other
      • Describe the mechanism:
      • Will enabling / disabling the feature require downtime of the control plane?
        • no
      • Will enabling / disabling the feature require downtime or reprovisioning of a node? (Do not assume Dynamic Kubelet Config feature is enabled).
        • yes (will require restart of kubelet)
  • Does enabling the feature change any default behavior? Any change of default behavior may be surprising to users or break existing automations, so be extremely careful here.

    • The main behavior change is that during a node shutdown, pods running on the node will be terminated gracefully.
  • Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? Also set disable-supported to true or false in kep.yaml. Describe the consequences on existing workloads (e.g., if this is a runtime feature, can it break the existing applications?).

    • Yes, the feature can be disabled by either disabling the feature gate, or setting kubeletConfig.ShutdownGracePeriod to 0 seconds.
  • What happens if we reenable the feature if it was previously rolled back?

    • Kubelet will attempt to perform graceful termination of pods during a node shutdown.
  • Are there any tests for feature enablement/disablement? The e2e framework does not currently support enabling or disabling feature gates. However, unit tests in each component dealing with managing data, created with and without the feature, are necessary. At the very least, think about conversion tests if API types are being modified.

    • n/a

Rollout, Upgrade and Rollback Planning

This section must be completed when targeting beta graduation to a release.

  • How can a rollout fail? Can it impact already running workloads? Try to be as paranoid as possible - e.g., what if some components will restart mid-rollout?

This feature should not impact rollouts.

  • What specific metrics should inform a rollback?

N/A.

  • Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? Describe manual testing that was done and the outcomes. Longer term, we may want to require automated upgrade/rollback tests, but we are missing a bunch of machinery and tooling and can't do that now.

The feature is part of kubelet config so updating kubelet config should enable/disable the feature; upgrade/downgrade is N/A.

  • Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? Even if applying deprecation policies, they may still surprise some users.

No.

Monitoring Requirements

This section must be completed when targeting beta graduation to a release.

  • How can an operator determine if the feature is in use by workloads? Ideally, this should be a metric. Operations against the Kubernetes API (e.g., checking if there are objects with field X set) may be a last resort. Avoid logs or events for this purpose.

Check if the feature gate and kubelet config settings are enabled on a node.

  • What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
    • Metrics
      • Metric name:
      • [Optional] Aggregation method:
      • Components exposing the metric:
    • Other (treat as last resort)
      • Details:

N/A

  • What are the reasonable SLOs (Service Level Objectives) for the above SLIs? At a high level, this usually will be in the form of "high percentile of SLI per day <= X". It's impossible to provide comprehensive guidance, but at the very high level (needs more precise definitions) those may be things like:
    • per-day percentage of API calls finishing with 5XX errors <= 1%
    • 99% percentile over day of absolute value from (job creation time minus expected job creation time) for cron job <= 10%
    • 99,9% of /health requests per day finish with 200 code

N/A.

  • Are there any missing metrics that would be useful to have to improve observability of this feature? Describe the metrics themselves and the reasons why they weren't added (e.g., cost, implementation difficulties, etc.).

N/A.

Dependencies

This section must be completed when targeting beta graduation to a release.

  • Does this feature depend on any specific services running in the cluster? Think about both cluster-level services (e.g. metrics-server) as well as node-level agents (e.g. specific version of CRI). Focus on external or optional services that are needed. For example, if this feature depends on a cloud provider API, or upon an external software-defined storage or network control plane.

    For each of these, fill in the following—thinking about running existing user workloads and creating new ones, as well as about cluster-level services (e.g. DNS):

    • [Dependency name]
      • Usage description:
        • Impact of its outage on the feature:
        • Impact of its degraded performance or high-error rates on the feature:

No, this feature doesn't depend on any specific services running the cluster. It only depends on systemd running on the node itself.

Scalability

For alpha, this section is encouraged: reviewers should consider these questions and attempt to answer them.

For beta, this section is required: reviewers must answer these questions.

For GA, this section is required: approvers should be able to confirm the previous answers based on experience in the field.

  • Will enabling / using this feature result in any new API calls? Describe them, providing:
    • API call type (e.g. PATCH pods)
    • estimated throughput
    • originating component(s) (e.g. Kubelet, Feature-X-controller) focusing mostly on:
    • components listing and/or watching resources they didn't before
    • API calls that may be triggered by changes of some Kubernetes resources (e.g. update of object X triggers new updates of object Y)
    • periodic API calls to reconcile state (e.g. periodic fetching state, heartbeats, leader election, etc.)

No.

  • Will enabling / using this feature result in introducing new API types? Describe them, providing:
    • API type
    • Supported number of objects per cluster
    • Supported number of objects per namespace (for namespace-scoped objects)

No.

  • Will enabling / using this feature result in any new calls to the cloud provider?

No.

  • Will enabling / using this feature result in increasing size or count of the existing API objects? Describe them, providing:
    • API type(s):
    • Estimated increase in size: (e.g., new annotation of size 32B)
    • Estimated amount of new objects: (e.g., new Object X for every existing Pod)

No.

  • Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs? Think about adding additional work or introducing new steps in between (e.g. need to do X to start a container), etc. Please describe the details.

No.

  • Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components? Things to keep in mind include: additional in-memory state, additional non-trivial computations, excessive access to disks (including increased log volume), significant amount of data sent and/or received over network, etc. Think through this both in small and large cases, again with respect to the supported limits.

No.

Troubleshooting

The Troubleshooting section currently serves the Playbook role. We may consider splitting it into a dedicated Playbook document (potentially with some monitoring details). For now, we leave it here.

This section must be completed when targeting beta graduation to a release.

  • How does this feature react if the API server and/or etcd is unavailable?

The feature does not depend on the API server / etcd.

  • What are other known failure modes? For each of them, fill in the following information by copying the below template:

    • [Failure mode brief description]
      • Detection: How can it be detected via metrics? Stated another way: how can an operator troubleshoot without logging into a master or worker node?
      • Mitigations: What can be done to stop the bleeding, especially for already running user workloads?
      • Diagnostics: What are the useful log messages and their required logging levels that could help debug the issue? Not required until feature graduated to beta.
      • Testing: Are there any tests for failure mode? If not, describe why.
  • What steps should be taken if SLOs are not being met to determine the problem?

N/A.

Implementation History

Drawbacks

Alternatives

  • Use systemd cgroup driver to set TimeoutStopSec= on scopes underlying containers
    • Set TimeoutStopSec= for the container scopes using the value set in the pod for termination grace period. The problem with this approach is that systemd doesn’t understand the prestop hooks.
  • Use systemd cgroup driver to set Before=kubelet.service on scopes underlying containers
    • Set Before=kubelet.service and container runtime service for the container scopes. Systemd would then stop the containers after the kubelet giving the kubelet a chance to stop the containers itself. This depends upon using the systemd cgroups driver and is coupled to systemd.
  • Use systemd cgroup driver to set controller property on scope to delegate control to kubelet
    • Set Controller dbus property for the container scopes and set After=kubelet.service for the containers. Systemd would then signal the kubelet over dbus to delegate the container scope termination. This requires more work in the kubelet and is also coupled to systemd and the systemd cgroup driver.
  • Don’t handle node shutdown events at all, and have users drain nodes before shutting them down.
    • This is not always possible, for example if the shutdown is controlled by some external system (e.g. Preemptible VMs).
  • Avoid relying on systemd and logind and directly hook into ACPI events on the node.
    • Unfortunately, this can create conflicts because only one system daemon should be monitoring ACPI events. Additionally, if the system is using systemd but kubelet did not integrate with it, systemd by default would terminate kubelet and other processes during a shutdown event.
  • Provide more configuration options on how to split time during shutdown (e.g. split between critical pods and user workloads). Need more feedback from the community here.

Infrastructure Needed (Optional)