# Proposal: Self-hosted Control Plane

Author: Brandon Philips <[email protected]>

Last Updated: 2016-12-20

## Motivations

> Running our components in pods would solve many problems, which we'll otherwise need to implement other, less portable, more brittle solutions to, and doesn't require much that we don't need to do for other reasons. Full self-hosting is the eventual goal.
>
> - Brian Grant ([ref](https://github.com/kubernetes/kubernetes/issues/4090#issuecomment-74890508))

### What is self-hosted?

Self-hosted Kubernetes runs all required and optional components of a Kubernetes cluster on top of Kubernetes itself.

The advantages of a self-hosted Kubernetes cluster are:

1. **Small Dependencies:** self-hosted should reduce the number of components required, on host, for a Kubernetes cluster to be deployed to a Kubelet (ideally running in a container). This should greatly simplify the perceived complexity of Kubernetes installation.
2. **Deployment consistency:** self-hosted reduces the number of files that are written to disk or managed via configuration management or manual installation via SSH. Our hope is to reduce the number of moving parts relying on the host OS to make deployments consistent in all environments.
3. **Introspection:** internal components can be debugged and inspected by users using existing Kubernetes APIs like `kubectl logs`.
4. **Cluster Upgrades:** related to introspection, the components of a Kubernetes cluster are now subject to control via the Kubernetes APIs. Upgrades of Kubelets are possible via new daemon sets, API servers can be upgraded using daemon sets (and potentially deployments in the future), flags of add-ons can be changed by updating deployments, and so on; see the sketch after this list. (An example script is in progress.)

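As a rough illustration of points 3 and 4, both introspection and upgrades go through the ordinary Kubernetes API. The namespace, component names, and image below are illustrative assumptions, not the exact objects a self-hosted cluster will contain:

```sh
# Introspection: read a control plane component's logs with the normal API.
kubectl --namespace=kube-system logs kube-scheduler-<pod-name>

# Cluster upgrade: roll a (hypothetical) API server daemon set to a new image
# version instead of editing files on the control nodes.
kubectl --namespace=kube-system set image daemonset/kube-apiserver \
  kube-apiserver=quay.io/coreos/hyperkube:v1.5.1_coreos.0
```
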
However, there is a spectrum of ways that a cluster can be self-hosted. To describe this we divide the Kubernetes cluster into a series of layers, beginning with the Kubelet (level 0) and going up to the add-ons (level 4). A cluster can self-host all of these levels, 0-4, or only partially self-host.

![](self-hosted-layers.png)

For example, a 0-4 self-hosted cluster means that the kubelet is a daemon set, the API server runs as a pod and is exposed as a service, and so on, while a 1-4 self-hosted cluster would have a system-installed Kubelet.

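As a minimal sketch of "the API server is exposed as a service", the snippet below creates a plain service in front of API server pods. The namespace, name, label selector, and port are assumptions for illustration, not the exact manifests bootkube generates:

```sh
# Minimal sketch only; real manifests also carry TLS, host networking, etc.
cat <<EOF | kubectl --namespace=kube-system apply -f -
apiVersion: v1
kind: Service
metadata:
  name: kube-apiserver
spec:
  selector:
    k8s-app: kube-apiserver
  ports:
  - name: https
    port: 443
    targetPort: 443
EOF
```
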
## Practical Implementation Overview

This document outlines the current implementation of "self-hosted Kubernetes" installation and upgrade of Kubernetes clusters, based on the work that the teams at CoreOS and Google have been doing. The work is motivated by the upstream "Self-hosted Proposal".

The entire system is working today and is used by Bootkube, a Kubernetes Incubator project, and by all Tectonic clusters created since July 2016. This document outlines the implementation, not the experience. The experience goal is that users need not know all of these details and instead get a working Kubernetes cluster out the other end that can be upgraded using the Kubernetes APIs.

The target audience of this document is others, like [kubeadm](https://github.com/kubernetes/kubernetes/pull/38407), thinking about and building the way forward for install and upgrade of Kubernetes. If you want a higher-level demonstration of "self-hosted" and its value, see this [video and blog](https://coreos.com/blog/self-hosted-kubernetes.html).

### Bootkube

Today, the first component of the installation of a self-hosted cluster is [`bootkube`](https://github.com/kubernetes-incubator/bootkube). A kubelet connects to the temporary Kubernetes API server provided by bootkube and is told to deploy the required Kubernetes components as pods. This diagram shows all of the moving parts:

![](self-hosted-moving-parts.png)

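For orientation, the bootstrap flow is driven by two bootkube subcommands, roughly as below. The exact flags differ between bootkube versions, so treat this as an illustrative sketch rather than a reference:

```sh
# Render control plane manifests and TLS assets into a local directory
# (illustrative; bootkube render takes additional flags in practice).
bootkube render --asset-dir=cluster-assets

# Run the temporary control plane on the first control node. It serves a real
# Kubernetes API just long enough for the kubelet to launch the self-hosted
# control plane pods, after which it can be shut down.
bootkube start --asset-dir=cluster-assets
```
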
At the end of this process bootkube can be shut down, and the system kubelet will coordinate, through a POSIX lock, to let the self-hosted kubelet take over lifecycle and management of the control plane components. The final cluster state looks like this:

![](self-hosted-final-cluster.png)

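The POSIX-lock handoff can be sketched with the kubelet's lock-file flags; the lock path is an assumption and the other kubelet flags are omitted for brevity:

```sh
# System ("on host") kubelet: hold a lock file and exit if another kubelet
# contends for it, letting the self-hosted kubelet take over.
kubelet \
  --lock-file=/var/run/lock/kubelet.lock \
  --exit-on-lock-contention
  # ...plus the usual kubelet flags, omitted here

# Self-hosted kubelet (launched as a pod): grabs the same lock file. If it
# ever dies, the lock is released and the init system brings the system
# kubelet back to restart it.
kubelet \
  --lock-file=/var/run/lock/kubelet.lock
  # ...plus the usual kubelet flags, omitted here
```
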
There are a few things to note. First, the control components, like the API server, will generally be pinned to a set of dedicated control nodes. For security policy, service discovery, and scaling reasons it is easiest to assume that these control components will always exist on N known nodes.

Another challenge is load balancing the API server. The bedrock for the API server will be DNS, TLS, and a load balancer that live off-cluster, and that load balancer will want to health check only a handful of servers for API server port liveness.

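As an aside, such an off-cluster load balancer typically probes each control node directly. A minimal sketch of such a probe, where the hostname, port, and CA path are assumptions:

```sh
# Probe one control node's API server the way an external load balancer's
# health check might, using the cluster CA to verify the serving certificate.
curl --cacert /etc/kubernetes/ca.crt https://control-node-1.example.com:443/healthz
```
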
### Bootkube Challenges

This process has a number of moving parts, most notably the hand-off of control from the "host system" to the self-hosted Kubernetes system, and things can go wrong:

1) The self-hosted Kubelet is in a precarious position, as there is no one around to restart the process if it crashes. At a high level, the system init system watches for the Kubelet POSIX lock and starts the system Kubelet if the lock is missing; once the system Kubelet starts, it launches the self-hosted Kubelet. (A conceptual sketch follows this list.)

2) Recovering from reboots of single-master installations is a challenge, as the Kubelet won't have an API server to talk to in order to restart the self-hosted components. We are solving this today with a "[user space checkpointing](https://github.com/kubernetes-incubator/bootkube/tree/master/cmd/checkpoint#checkpoint)" container in the Kubelet pod that periodically checks the pod manifests and persists them to the static pod manifest directory. Longer term we would like the kubelet to be able to checkpoint itself without external code.

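For challenge (1), the watchdog behaviour can be sketched as below. Real deployments express this as init-system configuration rather than a loop, and the lock path and unit name are assumptions:

```sh
# Conceptual sketch of "start the system kubelet when nobody holds the lock";
# not the real implementation, which lives in the node's init system config.
while true; do
  if flock --nonblock /var/run/lock/kubelet.lock true; then
    # The lock is free, so no kubelet (system or self-hosted) is running.
    systemctl start kubelet.service
  fi
  sleep 10
done
```
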
## Long Term Goals

Ideally bootkube disappears over time and is replaced by a [Kubelet pod API](https://github.com/kubernetes/kubernetes/issues/28138). The write API would enable an external installation program to set up the control plane of a self-hosted Kubernetes cluster without requiring an existing API server.

[Checkpointing](https://github.com/kubernetes/kubernetes/issues/489) is also required to make a reliable system that can survive a number of normal operations, like full-down scenarios of the control plane. Today we can sufficiently do checkpointing external to the Kubelet process, but checkpointing inside the Kubelet would be ideal.

A simple updater can take care of helping users update from v1.3.0 to v1.3.1, and so on, over time.

### Self-hosted Cluster Upgrades

#### Kubelet upgrades

The kubelet could be upgraded in a process very similar to that outlined in the self-hosted proposal.

However, because of the challenges around the self-hosted Kubelet (see above), Tectonic currently runs a 1-4 self-hosted cluster with an alternative Kubelet update scheme that side-steps the self-hosted Kubelet issues. First, a kubelet system service is launched that uses the [chrooted kubelet](https://github.com/kubernetes/community/pull/131) implemented by the [kubelet-wrapper](https://coreos.com/kubernetes/docs/latest/kubelet-wrapper.html). Then, when an update is required, a node annotation is made, which is read by a long-running daemonset that updates the kubelet-wrapper configuration. This makes Kubelet versions updatable from the cluster API.

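The annotation step might look roughly like this; the annotation key and value are hypothetical placeholders, not the exact keys the Tectonic updater uses:

```sh
# Request a new kubelet version for a node; a long-running daemonset watches
# for this (hypothetical) annotation and rewrites the kubelet-wrapper config.
kubectl annotate node node-1.example.com \
  example.com/desired-kubelet-version=v1.5.1_coreos.0 --overwrite
```
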
#### API Server, Scheduler, and Controller Manager

Upgrading these components is fairly straightforward. They are stateless, easily run in containers, and can be modeled as pods and services. Upgrades are simply a matter of deploying new versions, health checking them, and changing the service label selectors.

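A sketch of the final cut-over step, assuming the new API server pods carry a version label; the names, labels, and namespace are illustrative:

```sh
# Once the v1.5.1 API server pods are healthy, point the existing service at
# them by changing its label selector; old pods can then be torn down.
kubectl --namespace=kube-system patch service kube-apiserver \
  --patch '{"spec":{"selector":{"k8s-app":"kube-apiserver","version":"v1.5.1"}}}'
```
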
There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. What about upgrades in multi-master environment? Are we going to upgrade each master node separately or just upgrade all nodes at the same time? If we're going with the second option - how can we solve a situation when upgrading API/schedules/controller leads to failure of the whole cluster and no API service is available - will the kubelet Pod API be used to rollback the upgrade and recover the old services? There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. Upgrade each node separately. If you have multi-master you still have a load balancing tier or something in front, presumably. There was a problem hiding this comment. Choose a reason for hiding this commentThe reason will be displayed to describe this comment to others. Learn more. With deployments, one can enforce that only one (or a percentage of total) pod is to be updated at a time. My concern here is with flags passed to each component and how that may affect the cluster, e.g. API server adds flag that requires kubelet to also specify a pairing flag or run a newly needed feature. But this could be resolved with a simple updater, as mentioned in Long Term Goals section. |
||
|
||
#### etcd self-hosted | ||
|
||
As the primary data store of Kubernetes etcd plays an important role. Today, etcd does not run on top of the self-hosted cluster. However, progress is being made with the introduction of the [etcd Operator](https://coreos.com/blog/introducing-the-etcd-operator.html) and integration into [bootkube](https://github.com/kubernetes-incubator/bootkube/blob/848cf581451425293031647b5754b528ec5bf2a0/cmd/bootkube/start.go#L37). | ||
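For flavor, the etcd Operator manages clusters declaratively through a third-party resource roughly like the one below. The API group, kind, and fields have changed between operator releases, so this is only an assumed illustration of the shape, not the exact schema:

```sh
# Declare a three-member etcd cluster and let the operator create, resize,
# and upgrade it (illustrative resource; check the operator docs for the
# exact schema of the version in use).
cat <<EOF | kubectl --namespace=kube-system apply -f -
apiVersion: etcd.database.coreos.com/v1beta2
kind: EtcdCluster
metadata:
  name: kube-etcd
spec:
  size: 3
  version: "3.1.8"
EOF
```
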
### Conclusions

Kubernetes self-hosted is working today. Bootkube is an implementation of the "temporary control plane", and this entire process has been used by [`bootkube`](https://github.com/kubernetes-incubator/bootkube) users and Tectonic since the Kubernetes v1.4 release. We are excited to give users a simpler installation flow and sustainable cluster lifecycle upgrade/management.

## Known Issues

- [Health check endpoints for components don't work correctly](https://github.com/kubernetes-incubator/bootkube/issues/64#issuecomment-228144345)
- [kubeadm doesn't do self-hosted yet](https://github.com/kubernetes/kubernetes/pull/38407)