Design for automated HA master deployment #29649

fgrzadkowski · 2016-07-27T01:33:16Z

@jszczepkowski @davidopp @roberthbailey @xiang90 @mikedanese


fgrzadkowski · 2016-07-27T01:36:16Z

davidopp · 2016-07-27T04:21:48Z

docs/design/ha_master.md

+##Components
+
+###etcd
+```Note: this paragraphs assumes we are using etcd v2.2; it will have to be revisited during


Please put the ``` on a separate line from the text; putting it on the same line breaks the syntax highlighting on GitHub. (Same comment for the next line.)

justinsb · 2016-07-27T04:50:36Z

If you haven't seen it, I should show you how kops does it, because I think it would be great if we treated the masters as cattle not pets. If we treat the persistent disks as our pets, and use them as an election mechanism, we are able to move them dynamically around a pool of masters. This works particularly nicely because then you can think in terms of auto-scaling groups or managed-instance-groups instead of master instances, and you can just kill a master on a whim.

(Although I haven't yet dealt with cluster resizing or having more masters than etcd nodes - perhaps we can tackle those together :-) )

luxas · 2016-07-27T06:47:59Z

Needs a hack/update-munge-docs.sh run

timothysc · 2016-07-27T13:30:39Z

docs/design/ha_master.md

+
+This document describes technical design of this feature. It assumes that we are using aforementioned
+scripts for cluster deployment. It focuses on GCE use-case but all of the ideas described in
+the following sections should be easy to port to AWS, other cloud providers and bare-metal environment.


Wait.. you're dictating ease of use based on script wrapping?

I don't understand. Can you explain?

"the ideas" are portable, not necessarily the implementation. The implementation should be easily portable to aws (and maybe other salt-based deployment) but I doubt it'd be easily portable to bare metal (even using salt).

Given that this is a GCE-centric doc, I think we should remove the clause re: porting.

I would almost put a clause that this does not serve as the recommended pattern for all use cases.

timothysc · 2016-07-27T13:55:18Z

IMHO this needs to be vetted by @kubernetes/sig-cluster-lifecycle. There are far too many suppositions about deployment that don't hold for others.

/cc @dgoodwin

fgrzadkowski · 2016-07-27T14:17:51Z

@justinsb What do you mean by cattle vs pets here? I don't follow the analogy. Can you explain?

@timothysc This design assumes some simplifications, such as colocation of apiservers and etcd or equal number of them. I understand that in some cases it might not be enough, but I don't expect our deployment scripts to work in all scenarios. I think this is a good step forward which doesn't limit options for advanced users, but allows us to actually start testing HA master setup, which we currently don't do.
If you believe there are parts of this design that will just not work, I'm happy to fix them. However, if your concern is that it doesn't cover all possible use cases, then I think we should proceed, because I don't want to support all of them.

hanikesn · 2016-10-20T00:19:50Z

docs/design/ha_master.md

+Kubernetes maintains a special service called `kubernetes`. Currently it keeps a
+list of IP addresses for all apiservers. As it uses a command line flag
+`--apiserver-count` it is not very dynamic and would require restarting all
+masters to change number of master replicas.


The current --apiserver-count is broken anyway, as it doesn't take into account failures of apiservers (#22609). Maybe some kind of liveness check/TTL is also needed for these IPs?

That's a very good point. This made me think that maybe there's a better solution. As you say we should be using a TTL for each IP. What we can do is:

1. In the Endpoints object annotations we'd keep a TTL for each IP. Each annotation would hold a pair: the IP it corresponds to and a TTL.
2. Each apiserver, when updating the `kubernetes` service, will do two things:
   1. Add its own IP if it's not there, and add/update the TTL for it.
   2. Remove all IPs whose TTL is too old.

I think it'd be much easier than a ConfigMap and would solve the problem of unavailable apiservers. I'll also update the issue you mentioned.
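
To make the proposal above concrete, here is a rough Python sketch of the add/refresh and prune steps. It is an illustration only, not code from this PR: the annotation layout (a dict mapping each IP to an expiry timestamp), the name `reconcile_endpoints`, and the 10-second TTL are all assumptions made for the example.

```python
import time

# Hypothetical TTL; the thread does not specify a value.
TTL_SECONDS = 10.0


def reconcile_endpoints(annotations, own_ip, now=None, ttl=TTL_SECONDS):
    """Return (updated annotations, sorted list of live apiserver IPs).

    `annotations` maps an apiserver IP to the time (seconds since the epoch)
    at which its entry expires. This layout is assumed for illustration.
    """
    if now is None:
        now = time.time()
    updated = dict(annotations)
    # Add our own IP if it is missing and refresh its TTL.
    updated[own_ip] = now + ttl
    # Drop every IP whose TTL has run out.
    updated = {ip: expiry for ip, expiry in updated.items() if expiry > now}
    return updated, sorted(updated)
```

With `now=100.0`, calling it on `{"10.0.0.1": 105.0, "10.0.0.2": 95.0}` for `10.0.0.3` keeps `10.0.0.1`, drops the expired `10.0.0.2`, and adds `10.0.0.3` with expiry `110.0`. Note that in this variant every refresh is a write to the Endpoints object.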

@roberthbailey

We talked about this with @thockin and @jszczepkowski and we believe that a reasonable approach would be to:

* Add a ConfigMap that would keep the list of active apiservers with their expiration times; those would be updated by each apiserver separately.
* Change EndpointsReconciler in apiserver to update the Endpoints list to match the active apiservers from the ConfigMap.

That way we will have a dynamic configuration and at the same time we will not be updating Endpoints too often, as expiration times will be stored in a dedicated ConfigMap.
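
A minimal Python sketch of this split, under stated assumptions: the ConfigMap is modelled as a plain dict from IP to expiration time, Endpoints as a sorted list of IPs, and the names `heartbeat`, `active_apiservers`, and `reconcile` are hypothetical. The point it tries to show is that heartbeats touch only the ConfigMap, while Endpoints is rewritten only when the set of live apiservers actually changes.

```python
def heartbeat(lease_map, own_ip, now, ttl):
    """Record this apiserver's expiration time in the ConfigMap-like dict."""
    lease_map[own_ip] = now + ttl


def active_apiservers(lease_map, now):
    """Return the sorted IPs whose expiration time is still in the future."""
    return sorted(ip for ip, expiry in lease_map.items() if expiry > now)


def reconcile(lease_map, endpoints, now):
    """Return (endpoints, changed).

    `endpoints` is the current sorted list of IPs. A new list is returned
    only when the set of active apiservers differs from it, so a heartbeat
    that merely extends an expiration time causes no Endpoints write.
    """
    desired = active_apiservers(lease_map, now)
    if desired == endpoints:
        return endpoints, False
    return desired, True
```

Because a dict entry does not expire on its own, the filtering in `active_apiservers` is what removes a dead apiserver from the result; the sketch leaves stale entries in the map itself.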

Where will such a ConfigMap live? Which namespace?

I would imagine the "lock server" configmap for the api servers would live in kube-system. For communicating the set of IPs that clients should try to connect to, we can lean on the work I'm doing in #30707 along with enhancing kubeconfig to be multi-endpoint aware. That kubeconfig ConfigMap is proposed to live in a new kube-public namespace.

This ConfigMap would live in kube-system and would be used only by the apiserver (here) to properly set the list of endpoints in the `kubernetes` service, which lives in the default namespace.

thockin · 2016-10-20T15:39:22Z

Please also keep up on the discussion about Endpoints churn being
expensive.

On Thu, Oct 20, 2016 at 8:20 AM, Filip Grzadkowski wrote:

@fgrzadkowski commented on this pull request, in docs/design/ha_master.md (#29649):

+2. Unmanaged DNS - this is very similar to Managed DNS, with the exception that DNS entries
+will be manually managed by the user. We will provide detailed documentation for the entries we
+expect.
+3. [GCP only] Promote master IP - in GCP, when we create the first master replica, we generate a static
+external IP address that is later assigned to the master VM. When creating additional replicas we
+will create a loadbalancer infront of them and reassign aforementioned IP to point to the load balancer
+instead of a single master. When removing second to last replica we will reverse this operation (assign
+IP address to the remaining master VM and delete load balancer). That way user will not have to provide
+a domain name and all client configurations will keep working.
+
+#### kubernetes service
+
+Kubernetes maintains a special service called kubernetes. Currently it keeps a
+list of IP addresses for all apiservers. As it uses a command line flag
+--apiserver-count it is not very dynamic and would require restarting all
+masters to change number of master replicas.


fgrzadkowski · 2016-10-20T15:43:22Z

@thockin Let's continue the discussion about endpoints in #22609

mikedanese · 2016-10-20T16:05:01Z

Also #34627 and #26637

fgrzadkowski · 2016-10-24T14:22:27Z

@roberthbailey I've updated the section with kubernetes service. PTAL.

roberthbailey · 2016-10-24T17:10:08Z

docs/design/ha_master.md

+
+To allow dynamic changes to the number of apiservers in the cluster, we will
+introduce a `ConfigMap` that will keep an expiration time for each apiserver
+(keyed by IP). Each apiserver will do three things:


What IP do you key by? In the GCE case where you promote an external IP into a load balancer, won't the external IP of the apiserver change (which will break the mapping)?

In GCE case we are using internal IP that will not change regardless of mode. This will work without any problems.

In the GKE case we are using the external IP by passing --advertise-address=.... If we keep it as it is, it will work because every apiserver will be updating the same IP that will be promoted to point to the LB. If we want to keep it consistent with GCE (which is not required IMO), then we can use the external IP of each VM, but it'd require:

- restarting the first master VM to update this --advertise-address flag (once we add an LB in front of it and change its external IP address)
- fixing the certs issue that makes it impossible to talk directly to apiserver.

I think that for GKE it'd be ok if we just use the external IP that points to the LB.

In GCE case we are using internal IP that will not change regardless of mode.

I thought that the design was to change GCE to use the external IP (which can be promoted to a LB IP) instead of using the internal IP.

We planned to change it only for kubelet->master communication and not for this kubernetes service. But I'm open to suggestions. If we decide to change it here as well for consistency (are there other reasons?), then it will also work (it will be similar to the GKE case).

Ah, I didn't catch that distinction. That sgtm (although it might be worth making it a bit clearer in the doc).

bogdando · 2016-10-27T15:14:40Z

docs/design/ha_master.md, line 55 at r6 (raw file):

  parts of etcd and we will only have to configure them properly.
* All apiserver replicas will be working independently talking to an etcd on
  127.0.0.1 (i.e. local etcd replica), which if needed will forward requests to the current etcd master

Did you consider running etcd proxies that listen on each host's 127.0.0.1 and pass requests on to live etcd backends, which the proxy seems able to autodetect? In Kargo we use exactly that layout; see https://github.com/kubespray/kargo/blob/master/docs/ha-mode.md for details.

Comments from Reviewable

fgrzadkowski · 2016-10-27T16:38:17Z

Review status: 0 of 1 files reviewed at latest revision, 28 unresolved discussions.

docs/design/ha_master.md, line 246 at r1 (raw file):

Previously, fgrzadkowski (Filip Grzadkowski) wrote…

Correct. It's explained in the "add-on manager" section above together with the explanation why I suggest to ignore this problem for now.

Done.

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Did you consider running etcd proxies that listen on each host's 127.0.0.1 and pass requests on to live etcd backends, which the proxy seems able to autodetect? In Kargo we use exactly that layout; see https://github.com/kubespray/kargo/blob/master/docs/ha-mode.md for details.

No, we haven't. What would be the advantage of such a setup?

bogdando · 2016-10-28T09:22:52Z

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, fgrzadkowski (Filip Grzadkowski) wrote…

No, we haven't. What would be the advantage of such a setup?

Native LB that comes from etcd-proxy instead of introducing "provider specific solutions to load balance traffic between master replicas"

bogdando · 2016-10-28T09:27:35Z

docs/design/ha_master.md, line 149 at r4 (raw file):

Each apiserver, when updating the `kubernetes` service, will do two things:

1. Add its own IP if it's not there, and add/update the TTL for it.
2. Remove all IPs whose TTL is too old.

Let me cross-post my comment here as well: apiservers could do only step 1, which would act as the heartbeat. If one is dead, its entry expires and is removed automatically, so there is no need for step 2. This simplifies the design and makes "collisions" impossible. Makes sense? Would that work for the ConfigMap as well?

bogdando · 2016-10-28T09:44:43Z

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Native LB that comes from etcd-proxy instead of introducing "provider specific solutions to load balance traffic between master replicas"

Although that would impact upgrades: "Apiserver talks only to a local etcd replica which will be in a compatible version" would no longer hold with etcd-proxy, because nothing constrains it to terminate localhost connections only at the local etcd replica.

fgrzadkowski · 2016-10-28T10:16:16Z

Review status: 0 of 1 files reviewed at latest revision, 28 unresolved discussions.

docs/design/ha_master.md, line 94 at r4 (raw file):

Previously, mattymo (Matthew Mosesohn) wrote…

This solution excludes deployments using Calico where etcd is shared between Kubernetes and Calico. Localhost listening etcd for client connections on non-master nodes creates a huge security vulnerability. I don't see any reason why we wouldn't want SSL enabled for client connections in such a scenario.

In such a scenario (etcd running on a different machine) we would probably use SSL. This document does not discuss that scenario.

docs/design/ha_master.md, line 149 at r4 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Each apiserver, when updating the `kubernetes` service, will do two things:

1. Add its own IP if it's not there, and add/update the TTL for it.
2. Remove all IPs whose TTL is too old.

Let me cross-post my comment here as well: apiservers could do only step 1, which would act as the heartbeat. If one is dead, its entry expires and is removed automatically, so there is no need for step 2. This simplifies the design and makes "collisions" impossible. Makes sense? Would that work for the ConfigMap as well?

No. We are not using etcd TTLs here, but rather ConfigMap entries that store expiration times. This requires custom logic in EndpointsReconciler in the apiserver.

docs/design/ha_master.md, line 153 at r5 (raw file):

Previously, roberthbailey (Robert Bailey) wrote…

Ah, I didn't catch that distinction. That sgtm (although it might be worth making it a bit clearer in the doc).

Done.

I added a paragraph in load-balancing section.

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Although that would impact upgrades: "Apiserver talks only to a local etcd replica which will be in a compatible version" would no longer hold with etcd-proxy, because nothing constrains it to terminate localhost connections only at the local etcd replica.

We would still need a load balancer for apiserver, right? It would only change communication between apiserver and etcd, which does not use a load balancer.

bogdando · 2016-10-28T11:56:00Z

Review status: 0 of 1 files reviewed at latest revision, 28 unresolved discussions.

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, fgrzadkowski (Filip Grzadkowski) wrote…

We would still need a load balancer for apiserver, right? It would only change communication between apiserver and etcd, which does not use load balancer.

Yes, unless client-side LB with multiple apiserver endpoints (#18174) is implemented.

fgrzadkowski · 2016-11-02T12:29:42Z

Review status: 0 of 1 files reviewed at latest revision, 28 unresolved discussions.

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Yes, unless client-side LB with multiple apiserver endpoints (#18174) is implemented.

I don't see how it's related. How clients talk to apiserver is orthogonal to how apiserver talks to etcd, right?

fgrzadkowski · 2016-11-02T12:30:07Z

@roberthbailey I think all the comments are addressed. PTAL.

bogdando · 2016-11-02T12:40:52Z

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, fgrzadkowski (Filip Grzadkowski) wrote…

I don't see how it's related. How clients talk to apiserver is orthogonal to how apiserver talks to etcd, right?

This relates to "We would still need a load balancer for apiserver, right?" in the sense that an external LB may _not be needed_ for apiservers once #18174 is implemented, so we may also want to avoid adding an external LB for etcd and use its native proxy solution instead. As for clients, those are anything that contacts apiservers or etcd instances, including internal k8s components.

roberthbailey · 2016-11-04T07:56:14Z

Reviewed 1 of 1 files at r7.
Review status: all files reviewed at latest revision, 28 unresolved discussions.


roberthbailey · 2016-11-04T07:57:20Z

@bogdando I think we are pretty much in agreement on the proposal so I've marked it with lgtm. Please feel free to send a follow-up PR if you want to tweak/clarify any of the language in the doc.

k8s-github-robot · 2016-11-04T08:17:00Z

Automatic merge from submit-queue

jszczepkowski · 2016-11-08T10:49:37Z

Part of kubernetes/enhancements#48

@jszczepkowski

Automatic merge from submit-queue Design for automated HA master deployment kubernetes#21124 @jszczepkowski @davidopp @roberthbailey @xiang90 @mikedanese

fgrzadkowski added the sig/cluster-lifecycle label Jul 27, 2016

fgrzadkowski assigned roberthbailey and davidopp Jul 27, 2016

googlebot added the cla: yes label Jul 27, 2016

fgrzadkowski added e2e-not-required [deprecated use retest-not-required] and removed e2e-not-required [deprecated use retest-not-required] labels Jul 27, 2016

fgrzadkowski mentioned this pull request Jul 27, 2016

Simplify HA Setup for Master kubernetes/enhancements#48

Closed

22 tasks

k8s-github-robot added kind/design, size/L, and release-note-label-needed labels Jul 27, 2016

davidopp reviewed Jul 27, 2016

luxas added release-note-none and removed release-note-label-needed labels Jul 27, 2016

timothysc reviewed Jul 27, 2016

hanikesn reviewed Oct 20, 2016

fgrzadkowski force-pushed the ha_design_doc branch from 4f53f66 to 038418f Compare October 20, 2016 14:57

fgrzadkowski force-pushed the ha_design_doc branch from 038418f to f51b8c3 Compare October 24, 2016 14:22

roberthbailey reviewed Oct 24, 2016

fgrzadkowski force-pushed the ha_design_doc branch from f51b8c3 to 3c81ae6 Compare October 25, 2016 13:14

bogdando mentioned this pull request Oct 28, 2016

Configure etcd cluster with ssl kubernetes-sigs/kubespray#31

Closed

Design for automated HA master deployment

2e5f27d

fgrzadkowski force-pushed the ha_design_doc branch from 3c81ae6 to 2e5f27d Compare October 28, 2016 10:16

roberthbailey added the lgtm label Nov 4, 2016

k8s-github-robot merged commit 4280eed into kubernetes:master Nov 4, 2016

foxish mentioned this pull request Jan 21, 2017

[mungegithub] Bot mis-labels PRs when Github fails kubernetes/test-infra#1637

Closed
