Design for automated HA master deployment #29649
Conversation
## Components

### etcd
```Note: this paragraphs assumes we are using etcd v2.2; it will have to be revisited during …
Please put the ``` on a separate line from the text; putting it on the same line screws up the syntax highlighting on GitHub. (Same comment for the next line.)
Done
If you haven't seen it, I should show you how kops does it, because I think it would be great if we treated the masters as cattle, not pets. If we treat the persistent disks as our pets, and use them as an election mechanism, we are able to move them dynamically around a pool of masters. This works particularly nicely because then you can think in terms of auto-scaling groups or managed instance groups instead of master instances, and you can just kill a master on a whim. (Although I haven't yet dealt with cluster resizing or having more masters than etcd nodes - perhaps we can tackle those together :-) )
Needs a …
This document describes technical design of this feature. It assumes that we are using aforementioned scripts for cluster deployment. It focuses on GCE use-case but all of the ideas described in the following sections should be easy to port to AWS, other cloud providers and bare-metal environment.
Wait.. you're dictating ease of use based on script wrapping?
I don't understand. Can you explain?
"the ideas" are portable, not necessarily the implementation. The implementation should be easily portable to aws (and maybe other salt-based deployment) but I doubt it'd be easily portable to bare metal (even using salt).
Given that this is a GCE-centric doc, I think we should remove the clause re: porting.
I would almost put a clause that this does not serve as the recommended pattern for all use cases.
Rephrased.
IMHO this needs to be vetted by the @kubernetes/sig-cluster-lifecycle sig. There are far too many suppositions on deployment that don't hold for others. /cc @dgoodwin
@justinsb What do you mean by cattle vs. pets here? I don't follow the analogy. Can you explain?

@timothysc This design assumes some simplifications, such as colocating apiservers and etcd, or having an equal number of them. I understand that in some cases this might not be enough, but I don't expect our deployment scripts to work in all scenarios. I think this is a good step forward that doesn't limit options for advanced users, but allows us to actually start testing an HA master setup, which we currently don't do.
Kubernetes maintains a special service called `kubernetes`. Currently it keeps a list of IP addresses for all apiservers. As it uses a command line flag `--apiserver-count`, it is not very dynamic and would require restarting all masters to change the number of master replicas.
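For illustration only, a minimal Go sketch of why a hard-coded count is not dynamic. This is not the actual Kubernetes reconciler code; `reconcileEndpoints` and the sample IPs are hypothetical, and the real logic is more involved.

```go
// Sketch: the endpoints list for the `kubernetes` service is trimmed to a
// fixed count, so changing the number of master replicas means restarting
// every apiserver with a new --apiserver-count value, and dead apiservers
// are never noticed.
package main

import "fmt"

// reconcileEndpoints keeps this apiserver's IP in the list and truncates the
// list to apiserverCount entries; it has no liveness information at all.
func reconcileEndpoints(current []string, selfIP string, apiserverCount int) []string {
	found := false
	for _, ip := range current {
		if ip == selfIP {
			found = true
			break
		}
	}
	if !found {
		current = append(current, selfIP)
	}
	if len(current) > apiserverCount {
		current = current[len(current)-apiserverCount:] // keep only the last apiserverCount entries
	}
	return current
}

func main() {
	eps := []string{"10.0.0.1", "10.0.0.2"}
	// A third apiserver joins, but the cluster was started with --apiserver-count=2.
	fmt.Println(reconcileEndpoints(eps, "10.0.0.3", 2))
}
```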
The current `--apiserver-count` is broken anyway, as it doesn't take into account failures of apiservers (#22609). Maybe some kind of liveness check/TTL for these IPs is also needed?
That's a very good point. This made me think that maybe there's a better solution. As you say, we should be using a TTL for each IP. What we can do is:
- In the Endpoints object annotations we'd keep a TTL for each IP. Each annotation would hold a pair: the IP it corresponds to and a TTL.
- Each apiserver, when updating the `kubernetes` service, will do two things:
  - add its own IP if it's not there and add/update its TTL
  - remove all the IPs whose TTL is too old

I think it'd be much easier than a ConfigMap and would solve the problem of unavailable apiservers. I'll also update the issue you mentioned.
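A minimal sketch of the two steps above, assuming plain Go maps in place of the real Endpoints annotations; `refreshAndPrune` and `ttl` are hypothetical names, not part of any Kubernetes API.

```go
// Sketch: each apiserver refreshes its own entry (heartbeat) and prunes
// entries whose TTL has expired, every time it updates the `kubernetes`
// Endpoints.
package main

import (
	"fmt"
	"time"
)

const ttl = 30 * time.Second // hypothetical per-IP TTL

// refreshAndPrune returns the set of IPs that should appear in the Endpoints.
func refreshAndPrune(expirations map[string]time.Time, selfIP string, now time.Time) []string {
	expirations[selfIP] = now.Add(ttl) // step 1: add/refresh our own IP
	var active []string
	for ip, expiry := range expirations {
		if now.After(expiry) {
			delete(expirations, ip) // step 2: remove IPs whose TTL is too old
			continue
		}
		active = append(active, ip)
	}
	return active
}

func main() {
	now := time.Now()
	expirations := map[string]time.Time{
		"10.0.0.1": now.Add(-time.Minute),      // stale apiserver, will be pruned
		"10.0.0.2": now.Add(20 * time.Second), // still alive
	}
	fmt.Println(refreshAndPrune(expirations, "10.0.0.3", now))
}
```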
We talked about this with @thockin and @jszczepkowski and we believe that a reasonable approach would be to:
- add a ConfigMap that would keep the list of active apiservers, with their expiration times; those would be updated by each apiserver separately
- change the EndpointsReconciler in the apiserver to update the Endpoints list to match the active apiservers from the ConfigMap.

That way we will have a dynamic configuration and at the same time we will not be updating Endpoints too often, as the expiration times will be stored in a dedicated ConfigMap.
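A minimal sketch of the filtering such a reconciler would do, assuming the ConfigMap's data maps each apiserver IP to an RFC 3339 expiration timestamp; `activeAPIServers` is a hypothetical helper, and the real change would go through the Kubernetes API rather than plain maps.

```go
// Sketch: read the (hypothetical) ConfigMap data, keep only apiservers whose
// expiration time has not passed, and hand that list to the Endpoints update.
package main

import (
	"fmt"
	"sort"
	"time"
)

// activeAPIServers returns the IPs that should be written into the
// `kubernetes` Endpoints object.
func activeAPIServers(configMapData map[string]string, now time.Time) ([]string, error) {
	var active []string
	for ip, raw := range configMapData {
		expiry, err := time.Parse(time.RFC3339, raw)
		if err != nil {
			return nil, fmt.Errorf("bad expiration for %s: %v", ip, err)
		}
		if now.Before(expiry) {
			active = append(active, ip)
		}
	}
	sort.Strings(active) // keep Endpoints updates deterministic
	return active, nil
}

func main() {
	data := map[string]string{
		"10.0.0.1": time.Now().Add(30 * time.Second).Format(time.RFC3339),
		"10.0.0.2": time.Now().Add(-time.Minute).Format(time.RFC3339), // expired
	}
	ips, _ := activeAPIServers(data, time.Now())
	fmt.Println(ips) // [10.0.0.1]
}
```

Because the heartbeats only touch the dedicated ConfigMap, the Endpoints object itself changes only when the set of active apiservers actually changes.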
Where will such a ConfigMap live? Which namespace?
I would imagine the "lock server" ConfigMap for the apiservers would live in `kube-system`. For communicating the set of IPs that clients should try to connect to, we can lean on the work I'm doing in #30707, along with enhancing kubeconfig to be multi-endpoint aware. That kubeconfig ConfigMap is proposed to live in a new `kube-public` namespace.
This ConfigMap would live in `kube-system` and would be used only by the apiserver (here) to properly set the list of endpoints in the `kubernetes` service, which lives in the `default` namespace.
Force-pushed from 4f53f66 to 038418f.
Please also keep up on the discussion about Endpoints churn being …
Force-pushed from 038418f to f51b8c3.
@roberthbailey I've updated the section with …
To allow dynamic changes to the number of apiservers in the cluster, we will introduce a `ConfigMap` that will keep an expiration time for each apiserver (keyed by IP). Each apiserver will do three things:
What IP do you key by? In the GCE case where you promote an external IP into a load balancer, won't the external IP of the apiserver change (which will break the mapping)?
In the GCE case we are using the internal IP, which will not change regardless of mode, so this will work without any problems.

In the GKE case we are using the external IP by passing `--advertise-address=...`. If we keep it as it is, it will work, because every apiserver will be updating the same IP, which will be promoted to point to the LB. If we want to keep it consistent with GCE (which is not required IMO), then we could use the external IP of each VM, but that would require:
- restarting the first master VM to update the `--advertise-address` flag (once we add the LB in front of it and change its external IP address)
- fixing the certs issue that makes it impossible to talk directly to the apiserver.

I think that for GKE it'd be OK if we just use the external IP that points to the LB.
> In GCE case we are using internal IP that will not change regardless of mode.

I thought that the design was to change GCE to use the external IP (which can be promoted to a LB IP) instead of using the internal IP.
We planned to change it only for kubelet->master communication and not for this `kubernetes` service. But I'm open to suggestions. If we decide to change it here as well for consistency (are there other reasons?), then it will also work (it will be similar to the GKE case).
Ah, I didn't catch that distinction. That sgtm (although it might be worth making it a bit clearer in the doc).
Force-pushed from f51b8c3 to 3c81ae6.
docs/design/ha_master.md, line 55 at r6 (raw file):
Did you consider putting etcd proxies listening on each host's 127.0.0.1 and then passing requests on to the live etcd backends, which the proxy seems capable of autodetecting? In Kargo we use exactly that layout; see https://github.com/kubespray/kargo/blob/master/docs/ha-mode.md for details.
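For illustration, a toy Go forwarder showing the "clients always talk to 127.0.0.1" layout described above. This is not etcd's actual proxy mode (which handles member discovery and health detection itself), and the backend addresses are made up.

```go
// Toy sketch: listen on localhost and forward each connection to the first
// reachable backend in a static list. Real deployments would use etcd's
// built-in proxy (or a load balancer) instead.
package main

import (
	"io"
	"log"
	"net"
)

var backends = []string{"10.0.0.1:2379", "10.0.0.2:2379", "10.0.0.3:2379"} // hypothetical etcd members

func forward(client net.Conn) {
	defer client.Close()
	for _, addr := range backends {
		backend, err := net.Dial("tcp", addr)
		if err != nil {
			continue // backend looks dead, try the next one
		}
		defer backend.Close()
		go io.Copy(backend, client) // client -> backend
		io.Copy(client, backend)    // backend -> client
		return
	}
	log.Println("no live etcd backend found")
}

func main() {
	ln, err := net.Listen("tcp", "127.0.0.1:2379")
	if err != nil {
		log.Fatal(err)
	}
	for {
		conn, err := ln.Accept()
		if err != nil {
			log.Fatal(err)
		}
		go forward(conn)
	}
}
```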
docs/design/ha_master.md, line 149 at r4 (raw file):
Let me cross-post my comment here as well: apiservers could do only step 1; this would act as the heartbeat. If one is dead, the entry expires and auto-removes, so there is no need to do step 2. This simplifies the design and makes "collisions" impossible. Makes sense? Would that work for the ConfigMap as well?
Force-pushed from 3c81ae6 to 2e5f27d.
@roberthbailey I think all the comments are addressed. PTAL.
Reviewed 1 of 1 files at r7.
@bogdando I think we are pretty much in agreement on the proposal so I've marked it with lgtm. Please feel free to send a follow-up PR if you want to tweak/clarify any of the language in the doc.
Automatic merge from submit-queue |
Part of kubernetes/enhancements#48
#21124
@jszczepkowski @davidopp @roberthbailey @xiang90 @mikedanese