When implementing a distributed and clustered architecture that is also as automated as possible, a service discovery technology is an important part of the design. It enables us to scale the number of instances of a microservice across cluster nodes, as well as the number running on each node, dynamically and without manual configuration of new instances.
Two pieces of software must be installed for this to work:
- Agents
- Central registry
The agents report the status of the container they are installed on, and if any services are defined on the node, they also report the status of those services back to the central registry.
The central registry contains the status of all nodes and services that are defined. Several products exist that provide this functionality:
- ZooKeeper, used by Hadoop
- etcd, used by Kubernetes, together with Registrator and confd
- Consul
To enable service discovery in my home lab I'm using a product called Consul by HashiCorp. One of the reasons I chose Consul is that it comes from HashiCorp, and I have a feeling that I'm going to explore more of their products in the future, like Vault, as well as related tools like Rancher.
As quoted from their GitHub repo:
Consul is a tool for service discovery and configuration. Consul is distributed, highly available, and extremely scalable. Consul provides several key features:
- Service Discovery - Consul makes it simple for services to register themselves and to discover other services via a DNS or HTTP interface. External services such as SaaS providers can be registered as well.
- Health Checking - Health Checking enables Consul to quickly alert operators about any issues in a cluster. The integration with service discovery prevents routing traffic to unhealthy hosts and enables service level circuit breakers.
- Key/Value Storage - A flexible key/value store enables storing dynamic configuration, feature flagging, coordination, leader election and more. The simple HTTP API makes it easy to use anywhere.
- Multi-Datacenter - Consul is built to be datacenter aware, and can support any number of regions without complex configuration.
I'm planning to use the monitoring-plugins from Nagios in combination with Consul health checks to provide feedback when nodes and services have problems.
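As an illustration, here is a minimal sketch of what such a node-level check could look like, assuming the plugins live under /usr/lib/nagios/plugins; the file name and thresholds are made up, and newer Consul versions use `args` instead of `script`:

```sh
# Hypothetical node-level check using check_disk from monitoring-plugins.
cat > /etc/consul.d/check_disk.json <<'EOF'
{
  "check": {
    "name": "disk usage",
    "script": "/usr/lib/nagios/plugins/check_disk -w 20% -c 10% -p /",
    "interval": "60s"
  }
}
EOF
consul reload
```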
The Jenkins Pipeline that does scheduled runs of Puppet Apply on all hosts uses the Consul HTTP API to find all online hosts to apply to.
The Jenkins Pipeline for the sample application uses the same HTTP API to find all hosts that run the sample application and update them when running a deploy.
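Roughly, the queries behind those pipelines could look like this; the jq filters are my own sketch, and the `app` service name and `production` tag match the examples further down:

```sh
# All nodes currently registered in Consul (scheduled Puppet Apply pipeline).
curl -s http://127.0.0.1:8500/v1/catalog/nodes | jq -r '.[].Address'

# All healthy instances of the sample application in one environment (deploy pipeline).
curl -s 'http://127.0.0.1:8500/v1/health/service/app?tag=production&passing' \
  | jq -r '.[].Node.Address'
```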
When running the puppet scripts, Puppet Apply selects and sets the puppet environment to production, test or development depending on the hostname of the container. The same environment is also used by the puppet scripts to tag the nodes and the services.
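As a sketch, the environment selection could be done like this in apply.sh; the hostname convention below is an assumption, the real script may derive the environment differently:

```sh
# Hypothetical mapping from container hostname to puppet environment,
# e.g. app-prod-01 -> production, app-test-01 -> test, anything else -> development.
case "$(hostname -s)" in
  *prod*) environment=production ;;
  *test*) environment=test ;;
  *)      environment=development ;;
esac

puppet apply --environment "$environment" \
  /etc/puppetlabs/code/environments/"$environment"/manifests/site.pp
```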
Example service names in Consul, which also work as hostnames when querying the Consul DNS interface:
- test.postgres.service.consul
- production.postgres.service.consul
- development.app.service.consul
All server roles include the profile class pve::profiles::discovery::agent, which means that every container provisioned by puppet self-registers with Consul during provisioning.
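I won't reproduce the profile here, but the agent configuration it ends up rendering would be along these lines; the datacenter name, join address and node name are assumptions:

```sh
# Hypothetical Consul agent configuration written by pve::profiles::discovery::agent.
cat > /etc/consul.d/agent.json <<'EOF'
{
  "datacenter": "homelab",
  "data_dir": "/var/lib/consul",
  "retry_join": ["10.0.20.10"],
  "node_name": "app-prod-01"
}
EOF
consul agent -config-dir=/etc/consul.d
```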
In the Consul web UI they will automatically appear like this:
And in the puppet script for each service, for example the REST API of the sample app, the registration with Consul is declared; a sketch of what the resulting service definition could look like is shown below.
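The port, the health endpoint and the environment tag value here are assumptions on my part:

```sh
# Hypothetical service definition for the sample app's REST API.
# The environment tag makes it resolvable as e.g. production.app.service.consul.
cat > /etc/consul.d/app.json <<'EOF'
{
  "service": {
    "name": "app",
    "port": 8080,
    "tags": ["production"],
    "check": {
      "http": "http://127.0.0.1:8080/health",
      "interval": "10s"
    }
  }
}
EOF
consul reload
```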
When doing a DNS lookup for the app service in the different environments, for example, I get the following answers:
```
$ dig @127.0.0.1 +short -p 8600 production.app.service.consul. ANY
10.0.20.104
10.0.20.103
10.0.20.102

$ dig @127.0.0.1 +short -p 8600 test.app.service.consul. ANY
10.0.30.102
10.0.30.101

$ dig @127.0.0.1 +short -p 8600 development.app.service.consul. ANY
10.0.30.104
```
The first version of my home lab used Nginx, which was then replaced by HAProxy, and recently by Traefik because it has native support for Consul.
Both HAProxy and Nginx could have been used in combination with consul-template, but they would have had trouble reloading the load balancer members without breaking existing connections and returning connection errors to clients. Traefik is a young project but seems promising. I still have to verify this by running Apache benchmarks to get real evidence of how the system behaves, but in theory it should work.
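For reference, a minimal sketch of what enabling the Consul Catalog backend looks like in a Traefik 1.x configuration; the endpoint and domain values are assumptions for my setup:

```sh
# Hypothetical traefik.toml enabling the Consul Catalog backend (Traefik 1.x syntax).
cat > /etc/traefik/traefik.toml <<'EOF'
[entryPoints]
  [entryPoints.http]
  address = ":80"

[consulCatalog]
endpoint = "127.0.0.1:8500"
domain = "consul.localhost"
exposedByDefault = true
EOF
```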
This gives me the possibility to add a new LXC container, bootstrap it and run the puppet apply.sh script; if the role of the new container matches a service, it will be registered in Consul, added to the load balancer, and found both by the pipeline that runs scheduled Puppet Apply and by the pipeline that deploys the sample application. All of this without any manual input after the initial bootstrap.