As etcd is the external storage backend for Kubernetes, it needs to be in a clean and consistent state. etcd uses the Raft consensus algorithm to re-elect a leader once the current one is lost. Raft relies on a quorum, so it is advisable to use an odd number of etcd nodes: an even number does not improve the reliability of the cluster, but consumes more resources and introduces another node that can fail.
To make this visually more clear, please look at the following table:
Cluster Size | Majority | Failure Tolerance |
---|---|---|
1 | 1 | 0 |
2 | 2 | 0 |
3 | 2 | 1 |
4 | 3 | 1 |
5 | 3 | 2 |
6 | 4 | 2 |
7 | 4 | 3 |
8 | 5 | 3 |
9 | 5 | 4 |
In most cases it is advisable to use an etcd cluster consisting of five nodes. This setup ensures that up to two nodes can become unhealthy without losing data. Bigger clusters also have a bigger failure tolerance, but suffer from decreasing write performance as the data has to be synchronized to more machines.
To get even more insights into etcd and best practices, have a look at the official etcd FAQ.
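As an illustration of such a setup, here is a minimal sketch of an etcd configuration file for the first member of a hypothetical five-node cluster - all names, IPs and paths are placeholders, and TLS settings are omitted for brevity:

```yaml
# Sketch of an etcd configuration file (e.g. /etc/etcd/etcd.conf.yml) for the
# first member of a five-node cluster. Names, IPs and paths are placeholders;
# TLS settings are left out for brevity.
name: etcd-1
data-dir: /var/lib/etcd
listen-peer-urls: http://10.0.0.1:2380
listen-client-urls: http://10.0.0.1:2379,http://127.0.0.1:2379
initial-advertise-peer-urls: http://10.0.0.1:2380
advertise-client-urls: http://10.0.0.1:2379
initial-cluster-token: etcd-cluster-1
# All five members listed here, which is what gives the cluster a quorum of 3.
initial-cluster: etcd-1=http://10.0.0.1:2380,etcd-2=http://10.0.0.2:2380,etcd-3=http://10.0.0.3:2380,etcd-4=http://10.0.0.4:2380,etcd-5=http://10.0.0.5:2380
initial-cluster-state: new
```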
In order to operate Kubernetes clusters in a highly available way, it is advised to run multiple instances of the `apiserver` behind a load balancer.
With an increasing number of worker nodes, the load on the apiservers also increases, as each additional node needs to request/mount secrets and volumes and synchronize the state of its Pods.
Running multiple apiservers therefore not only increases failure tolerance but also reduces the load per apiserver instance.
Warning
Some customers started without a dedicated load balancer and instead used DNS round-robin to expose multiple apiservers to users and worker nodes. This caused very frequent problems, because worker nodes cache the IP of a master, so the load balancing effectively stops working. Additionally, DNS caching caused connection timeouts once masters were no longer available due to maintenance, an IP change, etc.
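To sketch what this looks like from the client side, the kubeconfig used by operators and worker nodes should point at the load balancer's address rather than at an individual master - the DNS name, certificate paths and names below are placeholders:

```yaml
# Sketch: kubeconfig cluster entry pointing at the load balancer in front of
# all apiservers instead of at a single master. Names and paths are placeholders.
apiVersion: v1
kind: Config
clusters:
- name: production
  cluster:
    server: https://api.k8s.example.com:6443    # DNS name of the load balancer
    certificate-authority: /etc/kubernetes/ca.crt
users:
- name: admin
  user:
    client-certificate: /etc/kubernetes/admin.crt
    client-key: /etc/kubernetes/admin.key
contexts:
- name: production
  context:
    cluster: production
    user: admin
current-context: production
```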
To increase cluster availability it is advised to spread Kubernetes cluster components over multiple availability zones in the same VPC.
Important
One important thing to note is that EBS volumes are only available within a single AZ. This has caused frequent problems for customers once a Pod was scheduled to a node in a different AZ, because Kubernetes was no longer able to attach the volume.
One solution is to enrich the nodes with metadata about their region and AZ and use this information in the node selector - additionally, one could then build HA systems per AZ and replicate data between AZs. A sketch of this approach follows below.
Another solution is to switch to EFS volumes, which are available across multiple AZs. Kubernetes and EFS integration is still not available upstream, but it can easily be deployed using the efs-provisioner from the official kubernetes-incubator project: https://github.com/kubernetes-incubator/external-storage/tree/master/aws/efs
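Here is a minimal sketch of the node-selector approach, assuming the nodes are labeled with their AZ (the well-known failure-domain.beta.kubernetes.io/zone label is used as an example; the image, zone and volume ID are placeholders):

```yaml
# Sketch: pin a Pod that uses an EBS-backed volume to the AZ the volume lives in,
# assuming the nodes carry a zone label. Image, zone and volume ID are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: database
spec:
  nodeSelector:
    failure-domain.beta.kubernetes.io/zone: eu-west-1a    # AZ of the EBS volume
  containers:
  - name: database
    image: postgres:9.6
    volumeMounts:
    - name: data
      mountPath: /var/lib/postgresql/data
  volumes:
  - name: data
    awsElasticBlockStore:
      volumeID: vol-0123456789abcdef0    # placeholder EBS volume ID
      fsType: ext4
```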
When creating Kubernetes Pods, it is for several reasons not advisable to use `latest` tags on container images.
The first reason for not using them is the fact that you can't be 100% sure which exact version of your software you are running - let's dive a bit deeper. Once Kubernetes creates a Pod for you, it assigns an `imagePullPolicy` to it. By default this is `IfNotPresent`, which means that the container runtime will only pull the image if it is not already present on the node the Pod was assigned to. Once you use `latest` as an image tag, this default behavior switches to `Always`, resulting in the runtime pulling the image every time it starts up a container using that image in a Pod.
There are two really important reasons why this is a really bad thing to do:
- You lose control over which exact code is running in your system
- Rolling updates/rollbacks are no longer possible
Let's dive deeper into this:
Imagine you have version A of your software, tag it with `latest`, test version A in a CI system and start a Pod on Node 1. Then you tag version B of your software with `latest` again, Node 1 goes down and your Pod is moved to Node 2 before version B was tested in CI.
- Which version of your software will be running? Version B
- Which version should be running? Version A
- Can you immediately switch back to the previous version? No!
This simple scenario already shows quite well which problems can arise, and this is only the tip of the iceberg! You should always be able to see which tested version of your software is currently running, when it was deployed and which source code (e.g. commit IDs) your image was built from, and you should know how to switch to a previous version easily - ideally with the push of a button.
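As a small sketch of the alternative, reference an exact, CI-tested tag and make the pull policy explicit - the registry, image name and tag below are placeholders:

```yaml
# Sketch: reference an exact, tested image version instead of "latest".
# Registry, image name and tag are placeholders.
apiVersion: v1
kind: Pod
metadata:
  name: myapp
spec:
  containers:
  - name: myapp
    image: registry.example.com/myapp:1.4.2    # immutable, CI-tested tag
    imagePullPolicy: IfNotPresent              # the default for non-latest tags, made explicit here
```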
You were told earlier that Pods are the fundamental Kubernetes building block for your containers, and now you hear that you shouldn't use Pods directly but through an abstraction such as a `Deployment`. Why is that and what makes the difference?
If you deploy a Pod directly to your Kubernetes cluster, your container(s) will run, but nothing takes care of the Pod's lifecycle: once the node goes down or its capacity is needed for something else, the Pod is lost forever.
That's the point where building blocks such as `ReplicaSet` and `Deployment` come into play. A `ReplicaSet` acts as a supervisor to the Pods it watches and recreates Pods that no longer exist.
`Deployments` are an even higher abstraction and create and manage `ReplicaSets` to enable the developer to use a declarative state instead of imperative commands (e.g. `kubectl rolling-update`). The real advantage is that Deployments will automatically do rolling updates and always ensure a given target state, instead of leaving you to deal with imperative changes.
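To make this concrete, here is a minimal sketch of a Deployment that keeps three replicas of a placeholder application running (all names, the image and the port are made up):

```yaml
# Sketch: a minimal Deployment managing three replicas of a placeholder application.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
      - name: myapp
        image: registry.example.com/myapp:1.4.2    # pinned, tested tag
        ports:
        - containerPort: 8080
```

Changing the image tag in this spec triggers a rolling update to a new `ReplicaSet`, and `kubectl rollout undo` switches back to the previous one.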
Coming soon
You are now ready to continue on with the workshop!