Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Design for automated HA master deployment #29649

Merged
merged 1 commit into from
Nov 4, 2016

Conversation

fgrzadkowski
Copy link
Contributor

@fgrzadkowski fgrzadkowski commented Jul 27, 2016

@fgrzadkowski
Copy link
Contributor Author

@k8s-github-robot k8s-github-robot added kind/design Categorizes issue or PR as related to design. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. release-note-label-needed labels Jul 27, 2016
##Components

###etcd
```Note: this paragraphs assumes we are using etcd v2.2; it will have to be revisited during
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please put the ``` on a separate line from the text; putting it on the same line screws up the syntax highlighting in Github. (same comment for next line)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done

@justinsb
Copy link
Member

If you haven't seen it, I should show you how kops does it, because I think it would be great if we treated the masters as cattle not pets. If we treat the persistent disks as our pets, and use them as an election mechanism, we are able to move them dynamically around a pool of masters. This works particularly nicely because then you can think in terms of auto-scaling groups or managed-instance-groups instead of master instances, and you can just kill a master on a whim.

(Although I haven't yet dealt with cluster resizing or having more masters than etcd nodes - perhaps we can tackle those together :-) )

@luxas luxas added release-note-none Denotes a PR that doesn't merit a release note. and removed release-note-label-needed labels Jul 27, 2016
@luxas
Copy link
Member

luxas commented Jul 27, 2016

Needs a hack/update-munge-docs.sh run


This document describes technical design of this feature. It assumes that we are using aforementioned
scripts for cluster deployment. It focuses on GCE use-case but all of the ideas described in
the following sections should be easy to port to AWS, other cloud providers and bare-metal environment.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Wait.. you're dictating ease of use based on script wrapping?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't understand. Can you explain?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"the ideas" are portable, not necessarily the implementation. The implementation should be easily portable to aws (and maybe other salt-based deployment) but I doubt it'd be easily portable to bare metal (even using salt).

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given that this is a GCE-centric doc, I think we should remove the clause re: porting.

I would almost put a clause that this does not serve as the recommended pattern for all use cases.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Rephrased.

@timothysc
Copy link
Member

timothysc commented Jul 27, 2016

IMHO this needs to be vetted by the @kubernetes/sig-cluster-lifecycle sig. There are far to many suppositions on deployment that don't hold for others.

/cc @dgoodwin

@fgrzadkowski
Copy link
Contributor Author

@justinsb What do you mean by cattle vs pets here? I don't follow this analogy here. Can you explain?

@timothysc This design assumes some simplifications, such as colocation of apiservers and etcd or equal number of them. I understand that in some cases it might not be enough, but I don't expect our deployment scripts to work in all scenarios. I think this is a good step forward which doesn't limit options for advanced users, but allows us to actually start testing HA master setup, which we currently don't do.
If you believe there are parts of this design that will just not work I'm happy to fix it. However if your concern is that it doesn't cover all possible use-cases, then I think we should proceed, because I don't want to support all of them.

Kubernetes maintains a special service called `kubernetes`. Currently it keeps a
list of IP addresses for all apiservers. As it uses a command line flag
`--apiserver-count` it is not very dynamic and would require restarting all
masters to change number of master replicas.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The current --apiserver-count is broken anyways as it doesn't take into account failures of apiservers ( #22609). Maybe there's also some kind of liveness check/TTL for these IPs needed?

Copy link
Contributor Author

@fgrzadkowski fgrzadkowski Oct 20, 2016

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a very good point. This made me think that maybe there's a better solution. As you say we should be using a TTL for each IP. What we can do is:

  1. In the Endpoints object annotations we'd keep a TTL for each IP. Each annotation would keep a pair with an IP, that it corresponds to, and a TTL
  2. Each apiserver when updating service kubernetes will do two things:
    1. Add it's own IP if it's not there and add/update TTL for it
    2. Remove all the IPs with too old TTL

I think it'd be much easier than a ConfigMap and would solve the problem of unavailable apiservers. I'll update also the issue you mentioned.

@roberthbailey

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We talked about this with @thockin and @jszczepkowski and we believe that a reasonable approach would be to:

  1. Add a ConfigMap that would keep the list of active apiservers, with their expiration times; those would be updated by each apiserver separately
  2. Change EndpointsReconsiler in apiserver to update Endpoints list to match active apiservers from the ConfigMap.

That way we will have a dynamic configuration and at the same time we will not be updating Endpoints too often, as expiration times will be stored in a dedicated ConfigMap.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Where will such a ConfigMap live? Which namespace?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would imagine the "lock server" configmap for the api servers would live in kube-system. For communicating the set of IPs that clients should try to connect to, we can lean on the work I'm doing in #30707 along with enhancing kubeconfig to be multi-endpoint aware. That kubeconfig ConfigMap is proposed to live in a new kube-public namespace.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This ConfigMap would live in kube-system and would be used only by apiserver (here) to properly set list of endpoints in kubernetes service which lives in default namespace.

@thockin
Copy link
Member

thockin commented Oct 20, 2016

Please also keep up on the discussion about Endpoints churn being
expensive.

On Thu, Oct 20, 2016 at 8:20 AM, Filip Grzadkowski <[email protected]

wrote:

@fgrzadkowski commented on this pull request.

In docs/design/ha_master.md
#29649:

+2. Unmanaged DNS - this is very similar to Managed DNS, with the exception that DNS entries
+will be manually managed by the user. We will provide detailed documentation for the entries we
+expect.
+3. [GCP only] Promote master IP - in GCP, when we create the first master replica, we generate a static
+external IP address that is later assigned to the master VM. When creating additional replicas we
+will create a loadbalancer infront of them and reassign aforementioned IP to point to the load balancer
+instead of a single master. When removing second to last replica we will reverse this operation (assign
+IP address to the remaining master VM and delete load balancer). That way user will not have to provide
+a domain name and all client configurations will keep working.
+
+#### kubernetes service
+
+Kubernetes maintains a special service called kubernetes. Currently it keeps a
+list of IP addresses for all apiservers. As it uses a command line flag
+--apiserver-count it is not very dynamic and would require restarting all
+masters to change number of master replicas.

That's a very good point. This made me think that maybe there's a better
solution. As you say we should be using a TTL for each IP. What we can do
is:

  1. In the Endpoints object annotations we'd keep a TTL for each IP.
    Each annotation would keep a pair with an IP, that it corresponds to, and a
    TTL
  2. Each apiserver when updating service kubernetes will do two things:
    2.1. Add it's own IP if it's not there and add/update TTL for it 2.2.
    Remove all the IPs with too old TTL

I think it'd be much easier than a ConfigMap and would solve the problem
of unavailable apiservers. I'll update also the issue you mentioned.

@roberthbailey https://github.com/roberthbailey


You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
#29649, or mute the thread
https://github.com/notifications/unsubscribe-auth/AFVgVJjbWwm2lhiIMdU_hzqb90kpPkX9ks5q14a2gaJpZM4JVwHV
.

@fgrzadkowski
Copy link
Contributor Author

@thockin Let's continue the discussion about endpoints in #22609

@mikedanese
Copy link
Member

Also #34627 and #26637

@fgrzadkowski
Copy link
Contributor Author

fgrzadkowski commented Oct 24, 2016

@roberthbailey I've updated the section with kubernetes service. PTAL.


To allow dynamic changes to the number of apiservers in the cluster, we will
introduce a `ConfigMap` that will keep an expiration time for each apiserver
(keyed by IP). Each apiserver will do three things:
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What IP do you key by? In the GCE case where you promote an external IP into a load balancer, won't the external IP of the apiserver change (which will break the mapping)?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In GCE case we are using internal IP that will not change regardless of mode. This will work without any problems.

In GKE case we are using external IP by passing --advertise-address=.... If keep it as it is, it will work because every apiserver will be updating the same IP that will be promoted to point to LB. If we want to keep it consistent with GCE (which is not required IMO), then we can use external IP of each VM but it'd require:

  • restarting first master VM to update this --advertise-address flag (once we add LB in front if it and change it's external IP address)
  • fixing the certs issue that makes it impossible to talk directly to apiserver.

I think that for GKE it'd be ok if we just use external IP that points to LB.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In GCE case we are using internal IP that will not change regardless of mode.

I thought that the design was to change GCE to use the external IP (which can be promoted to a LB IP) instead of using the internal IP.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We planned to change it only for kubelet->master communication and not for this kubernetes service. But I'm open for suggestions. If we decide to change it also here for consistency (are there other reasons?) then it will also work (it will be similar to GKE case).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, I didn't catch that distinction. That sgtm (although it might be worth making it a bit clearer in the doc).

@bogdando
Copy link

docs/design/ha_master.md, line 55 at r6 (raw file):

  parts of etcd and we will only have to configure them properly.
* All apiserver replicas will be working independently talking to an etcd on
  127.0.0.1 (i.e. local etcd replica), which if needed will forward requests to the current etcd master

Did you consider putting etcd-proxies to listen each host's 127.0.0.1 and then pass requests further to alive etcd backends, which proxy seems capable to autodetect? In Kargo, we use exactly that layout, see https://github.com/kubespray/kargo/blob/master/docs/ha-mode.md for details


Comments from Reviewable

@fgrzadkowski
Copy link
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 28 unresolved discussions.


docs/design/ha_master.md, line 246 at r1 (raw file):

Previously, fgrzadkowski (Filip Grzadkowski) wrote…

Correct. It's explained in the "add-on manager" section above together with the explanation why I suggest to ignore this problem for now.

Done.

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Did you consider putting etcd-proxies to listen each host's 127.0.0.1 and then pass requests further to alive etcd backends, which proxy seems capable to autodetect? In Kargo, we use exactly that layout, see https://github.com/kubespray/kargo/blob/master/docs/ha-mode.md for details

No we haven't. What would be the advantage of such setup?

Comments from Reviewable

@bogdando
Copy link

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, fgrzadkowski (Filip Grzadkowski) wrote…

No we haven't. What would be the advantage of such setup?

Native LB that comes from etcd-proxy instead of introducing "provider specific solutions to load balance traffic between master replicas"

Comments from Reviewable

@bogdando
Copy link

docs/design/ha_master.md, line 149 at r4 (raw file):

Each apiserver when updating service kubernetes will do two things:

Add it's own IP if it's not there and add/update TTL for it
Remove all the IPs with too old TTL

Let me cross post my comment here as well: apiservers could do only the step 1, this would act as the heartbeat. If some is dead, the entry expires and autoremoves, so there is no need to do the step 2. This simplifies design and makes "collisions" impossible. Makes sens? Would that work as well for the configmap?..


Comments from Reviewable

@bogdando
Copy link

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Native LB that comes from etcd-proxy instead of introducing "provider specific solutions to load balance traffic between master replicas"

Although, that would impact upgrades as "Apiserver talks only to a local etcd replica which will be in a compatible version" would not be the case with etcd-proxy, because it has no constraints for a localhost-to-local-only etcd connection termination.

Comments from Reviewable

@fgrzadkowski
Copy link
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 28 unresolved discussions.


docs/design/ha_master.md, line 94 at r4 (raw file):

Previously, mattymo (Matthew Mosesohn) wrote…

This solution excludes deployments using Calico where etcd is shared between Kubernetes and Calico. Localhost listening etcd for client connections on non-master nodes creates a huge security vulnerability. I don't see any reason why we wouldn't want SSL enabled for client connections in such a scenario.

In such scenario (etcd running on a different machine) we would probably use SSL. This document does not discuss such scenario.

docs/design/ha_master.md, line 149 at r4 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Each apiserver when updating service kubernetes will do two things:

Add it's own IP if it's not there and add/update TTL for it
Remove all the IPs with too old TTL

Let me cross post my comment here as well: apiservers could do only the step 1, this would act as the heartbeat. If some is dead, the entry expires and autoremoves, so there is no need to do the step 2. This simplifies design and makes "collisions" impossible. Makes sens? Would that work as well for the configmap?..

No. We are not using etcd TTL here, rather a ConfigMap entries to store expiration times. This requires custom logic in EndpointsReconsiler in apiserver.

docs/design/ha_master.md, line 153 at r5 (raw file):

Previously, roberthbailey (Robert Bailey) wrote…

Ah, I didn't catch that distinction. That sgtm (although it might be worth making it a bit clearer in the doc).

Done.

I added a paragraph in load-balancing section.


docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Although, that would impact upgrades as "Apiserver talks only to a local etcd replica which will be in a compatible version" would not be the case with etcd-proxy, because it has no constraints for a localhost-to-local-only etcd connection termination.

We would still need a load balancer for apiserver, right? It would only change communication between apiserver and etcd, which does not use load balancer.

Comments from Reviewable

@bogdando
Copy link

Review status: 0 of 1 files reviewed at latest revision, 28 unresolved discussions.


docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, fgrzadkowski (Filip Grzadkowski) wrote…

We would still need a load balancer for apiserver, right? It would only change communication between apiserver and etcd, which does not use load balancer.

Yes, unless clients sided LB with multiple apiserver endpoints https://github.com//issues/18174 implemented.

Comments from Reviewable

@fgrzadkowski
Copy link
Contributor Author

Review status: 0 of 1 files reviewed at latest revision, 28 unresolved discussions.


docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, bogdando (Bogdan Dobrelya) wrote…

Yes, unless clients sided LB with multiple apiserver endpoints #18174 implemented.

I don't see how it's related. How clients talk to apiserver is orthogonal to how apiserver talks to etcd, right?

Comments from Reviewable

@fgrzadkowski
Copy link
Contributor Author

@roberthbailey I think all the comments are addressed. PTAL.

@bogdando
Copy link

bogdando commented Nov 2, 2016

docs/design/ha_master.md, line 55 at r6 (raw file):

Previously, fgrzadkowski (Filip Grzadkowski) wrote…

I don't see how it's related. How clients talk to apiserver is orthogonal to how apiserver talks to etcd, right?

This is related to "We would still need a load balancer for apiserver, right?" in the meaning of external LB may be _not needed_ for apiservers, once https://github.com//issues/18174 implemented, thus we may want to do not add external LB for etcd as well and use its native proxy solution. As for clients, those are anything what contacts apiservers or etcd instances, including internal k8s components.

Comments from Reviewable

@roberthbailey roberthbailey added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Nov 4, 2016
@roberthbailey
Copy link
Contributor

Reviewed 1 of 1 files at r7.
Review status: all files reviewed at latest revision, 28 unresolved discussions.


Comments from Reviewable

@roberthbailey
Copy link
Contributor

@bogdando I think we are pretty much in agreement on the proposal so I've marked it with lgtm. Please feel free to send a follow-up PR if you want to tweak/clarify any of the language in the doc.

@k8s-github-robot
Copy link

Automatic merge from submit-queue

@k8s-github-robot k8s-github-robot merged commit 4280eed into kubernetes:master Nov 4, 2016
@jszczepkowski
Copy link
Contributor

Part of kubernetes/enhancements#48

xingzhou pushed a commit to xingzhou/kubernetes that referenced this pull request Dec 15, 2016
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
kind/design Categorizes issue or PR as related to design. lgtm "Looks good to me", indicates that a PR is ready to be merged. release-note-none Denotes a PR that doesn't merit a release note. sig/cluster-lifecycle Categorizes an issue or PR as relevant to SIG Cluster Lifecycle. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.