Add KEP for etcdadm

kubernetes · Oct 22, 2018 · dee32b2
1 parent 3005cd4
Showing 2 changed files with 199 additions and 1 deletion.
diff --git a/keps/NEXT_KEP_NUMBER b/keps/NEXT_KEP_NUMBER
@@ -1 +1 @@
-31
+32
diff --git a/keps/sig-cluster-lifecycle/0031-20181022-etcdadm.md b/keps/sig-cluster-lifecycle/0031-20181022-etcdadm.md
@@ -0,0 +1,198 @@
+---
+kep-number: 31
+title: etcdadm
+authors:
+  - "@justinsb"
+owning-sig: sig-cluster-lifecycle
+#participating-sigs:
+#- sig-apimachinery
+reviewers:
+  - "@roberthbailey"
+  - "@timothysc"
+approvers:
+  - "@roberthbailey"
+  - "@timothysc"
+editor: TBD
+creation-date: 2018-10-22
+last-updated: 2018-10-22
+status: provisional
+#see-also:
+#  - KEP-1
+#  - KEP-2
+#replaces:
+#  - KEP-3
+#superseded-by:
+#  - KEP-100
+---
+
+# etcdadm - automation for etcd clusters
+
+1. **Fill out the "overview" sections.**
+  This includes the Summary and Motivation sections.
+  These should be easy if you've preflighted the idea of the KEP with the appropriate SIG.
+1. **Create a PR.**
+  Assign it to folks in the SIG that are sponsoring this process.
+1. **Merge early.**
+  Avoid getting hung up on specific details and instead aim to get the goal of the KEP merged quickly.
+  The best way to do this is to just start with the "Overview" sections and fill out details incrementally in follow-on PRs.
+  View anything marked as `provisional` as a working document and subject to change.
+  Aim for single topic PRs to keep discussions focused.
+  If you disagree with what is already in a document, open a new PR with suggested changes.
+
+The canonical place for the latest set of instructions (and the likely source of this file) is [here](/keps/0000-kep-template.md).
+
+The `Metadata` section above is intended to support the creation of tooling around the KEP process.
+This will be a YAML section that is fenced as a code block.
+See the KEP process for details on each of these items.
+
+## Table of Contents
+
+A table of contents is helpful for quickly jumping to sections of a KEP and for highlighting any additional information provided beyond the standard KEP template.
+Tools for generating a table of contents from markdown are available.
+
+* [Table of Contents](#table-of-contents)
+* [Summary](#summary)
+* [Motivation](#motivation)
+    * [Goals](#goals)
+    * [Non-Goals](#non-goals)
+* [Proposal](#proposal)
+    * [User Stories](#user-stories)
+      * [Manual Cluster Creation](#manual-cluster-creation)
+      * [Automatic Cluster Creation](#automatic-cluster-creation)
+    * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
+    * [Risks and Mitigations](#risks-and-mitigations)
+* [Graduation Criteria](#graduation-criteria)
+* [Implementation History](#implementation-history)
+* [Infrastructure Needed](#infrastructure-needed)
+
+## Summary
+
+etcdadm makes operation of etcd for the Kubernetes control plane easy, on clouds
+and on bare-metal, including both single-node and HA configurations.
+
+It is able to perform cluster reconfigurations, upgrades / downgrades, and
+backups / restores.
+
+## Motivation
+
+Today each installation tool must reimplement etcd operation, and this is
+difficult.  It also leads to ecosystem fragmentation - e.g. etcd backups from
+one tool are not necessarily compatible with the backups from other tools.  The
+failure modes are subtle and rare, and thus the Kubernetes project benefits from
+having more collaboration.
+
+
+### Goals
+
+The following key tasks are in scope:
+
+* Cluster creation
+* Cluster teardown
+* Cluster resizing / membership changes
+* Cluster backups
+* Disaster recovery or restore from backup
+* Cluster upgrades
+* Cluster downgrades
+* PKI management
+
+We will implement this functionality both as a base layer of imperative (manual
+CLI) operation, and a self-management layer which should enable automated
+operation in "safe" scenarios (with fallback to manual operation).
+
+### Non-Goals
+
+* The project is not targeted at operation of an etcd cluster for use other than
+  by the Kubernetes apiserver.  We are not building a general-purpose etcd operation
+  toolkit.  Likely it will work well for other use-cases, but other tools may be
+  more suitable.
+
+## Proposal
+
+We will combine the [etcdadm](https://github.com/platform9/etcdadm) from
+Platform9 with the [etcd-manager](https://github.com/kopeio/etcd-manager)
+project from kopeio / @justinsb.
+
+etcdadm gives us easy-to-use CLI commands, which will form the base layer of
+operation.  Automation should ideally describe what it is doing in terms of
+etcdadm commands, though we will also expose etcdadm as a Go library for easier
+consumption, following the kubectl pattern of a `cmd/` layer calling into a
+`pkg/` layer.  This means the end-user can understand the operation of the
+tooling, and advanced users can feel confident that they can use the CLI tooling
+for advanced operations.
+
+etcd-manager provides automation of the common scenarios, particularly when
+running on a cloud.  It will be rebased to work in terms of etcdadm CLI
+operations (which will likely require some functionality to be added to etcdadm
+itself).  Where automation is not known to be safe, etcd-manager can stop and
+allow for manual intervention using the CLI.
+
+kops is currently using etcd-manager, and we expect other tooling
+(e.g. cluster-api implementations) to adopt this project for etcd management
+going forward.
+
+### User Stories
+
+#### Manual Cluster Creation
+
+A cluster operator setting up a cluster manually will be able to do so using etcdadm and kubeadm.
+
+The basic flow looks like:
+
+* On a master machine, run `etcdadm init`, making note of the `etcdadm join
+  <endpoint>` command
+* On each other master machine, copy the CA certificate and key from one of the
+  other masters, then run the `etcdadm join <endpoint>` command.
+* Run kubeadm following the [external etcd procedure](https://kubernetes.io/docs/setup/independent/high-availability/#external-etcd)
+
+This results in a multi-node ("HA") etcd cluster.
+
+#### Automatic Cluster Creation
+
+etcd-manager works by coordinating via a shared filesystem-like store (e.g. S3
+or GCS) and/or via cloud APIs (e.g. EC2 or GCE).  In doing so it is able to
+automate the manual commands, which is very handy for running in a cloud
+environment like AWS or GCE.
+
+The basic flow would look like:
+
+* The user writes a configuration file to GCS using `etcdadm seed
+  gs://mybucket/cluster1/etcd1 version=3.2.12 nodes=3`
+* On each master machine, run `etcdadm auto gs://mybucket/cluster1/etcd1`.
+  (Likely the user will have to run that persistently, either as a systemd
+  service or a static pod.)
+
+`etcdadm auto` downloads the target configuration from GCS, discovers other
+peers also running etcdadm, and gossips with them to do basic leader election.  When
+sufficient nodes are available to form a quorum, it starts etcd.
+
+### Implementation Details/Notes/Constraints
+
+* There will be some changes needed to both platform9/etcdadm (e.g. etcd2
+  support) and kopeio/etcd-manager (to rebase on top of etcdadm).
+* It is unlikely that e.g. GKE / EKS will use etcdadm (at least initially),
+  which limits the pool of contributors.
+
+### Risks and Mitigations
+
+* Automatic mode may make incorrect decisions and break a cluster.  Mitigation:
+  automated backups, and a willingness to stop and wait for a fix / operator
+  intervention (CLI mode).
+* Automatic mode relies on peer-to-peer discovery and gossiping, which is less
+  reliable than Raft.  Mitigation: rely on Raft as much as possible, be very
+  conservative in automated operations (favor correctness over availability or
+  speed).  etcd non-voting members will make this much more reliable.
+
+## Graduation Criteria
+
+etcdadm will be considered successful when it is used by the majority of OSS
+cluster installations.
+
+## Implementation History
+
+* Much SIG discussion
+* Initial proposal to SIG 2018-10-09
+* Initial KEP draft 2018-10-22
+
+## Infrastructure Needed
+
+* etcdadm will be a subproject under sig-cluster-lifecycle
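
The Automatic Cluster Creation story in the KEP says that `etcdadm auto` starts etcd "when sufficient nodes are available to form a quorum", given a seed such as `version=3.2.12 nodes=3`. A minimal sketch of that check follows, written in Python purely for illustration. It assumes the seed arguments are plain `key=value` tokens (the KEP does not define the format) and uses the usual Raft majority rule of `n // 2 + 1`. The function names `parse_seed_spec`, `quorum_size`, and `can_start_etcd` are hypothetical and are not part of etcdadm or etcd-manager.

```python
def parse_seed_spec(tokens):
    """Parse hypothetical key=value seed tokens, e.g. ["version=3.2.12", "nodes=3"].

    The token format is an assumption; the KEP only shows an example command line.
    """
    spec = {}
    for token in tokens:
        key, sep, value = token.partition("=")
        if not sep or not key or not value:
            raise ValueError(f"expected key=value, got {token!r}")
        spec[key] = value
    if "nodes" in spec:
        spec["nodes"] = int(spec["nodes"])
    return spec


def quorum_size(configured_nodes):
    """Return the Raft majority for a cluster of the given size."""
    if configured_nodes < 1:
        raise ValueError("a cluster needs at least one node")
    return configured_nodes // 2 + 1


def can_start_etcd(configured_nodes, discovered_peers):
    """Return True once enough distinct peers (including this node) are visible.

    `discovered_peers` is assumed to be the set of peer identifiers learned
    through discovery and gossip; duplicates are ignored.
    """
    return len(set(discovered_peers)) >= quorum_size(configured_nodes)


if __name__ == "__main__":
    spec = parse_seed_spec("version=3.2.12 nodes=3".split())
    print(spec["version"], quorum_size(spec["nodes"]))
    print(can_start_etcd(spec["nodes"], {"master-a"}))
    print(can_start_etcd(spec["nodes"], {"master-a", "master-b"}))
```

With `nodes=3` the majority is two, so a single master waits and two masters may proceed. A real implementation would also need to handle the conservative "stop and wait for an operator" path that the Risks and Mitigations section describes.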