
Monitoring overview

Deepak Narayana Rao edited this page Oct 17, 2017 · 6 revisions

Tech Stack

  • Prometheus: Collects and stores metrics as time series data, evaluates alerting rules and generates alerts
  • Prometheus Alertmanager: Groups and de-duplicates alerts, then dispatches them to alert receivers such as Slack, email and PagerDuty
  • Grafana: Visualises time series metrics in dashboards

For hands-on experience, try this simple walkthrough of a Prometheus setup: https://github.com/stefanprodan/dockprom

See the Prometheus overview for a high-level view.

Sunbird monitoring setup

[Image: Sunbird monitoring setup]

Deployment strategy

  • The monitoring components listed above are deployed as services in Docker Swarm.
  • These services store data on disk. A directory on the swarm agent node is mounted as a volume inside the container so that data is not lost when the container restarts.
  • The swarm agent nodes are labelled so that these services always start on the same worker node, where the mounted data directory lives.
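
The deployment points above can be sketched as a small Python helper that assembles a `docker service create` command with a placement constraint and a bind mount. This is a minimal sketch, not the actual Sunbird deployment script: the label `monitoring=true`, the host path `/data/prometheus`, the container path `/prometheus`, and the helper name are all assumptions made for illustration.

```python
import shlex
from typing import List


def pinned_service_command(
    name: str,
    image: str,
    host_dir: str,
    container_dir: str,
    label_key: str,
    label_value: str,
) -> List[str]:
    """Build a `docker service create` command for a stateful service.

    The service is pinned to nodes carrying the given label and keeps its
    data in a directory bind-mounted from that node.
    """
    return [
        "docker", "service", "create",
        "--name", name,
        # Only schedule on nodes labelled <label_key>=<label_value>.
        "--constraint", f"node.labels.{label_key}=={label_value}",
        # Persist data on the host so a container restart does not lose it.
        "--mount", f"type=bind,source={host_dir},destination={container_dir}",
        image,
    ]


if __name__ == "__main__":
    # Hypothetical values; adjust to the real label and paths in use.
    cmd = pinned_service_command(
        "prometheus", "prom/prometheus",
        "/data/prometheus", "/prometheus",
        "monitoring", "true",
    )
    print(shlex.join(cmd))
```

The node label itself would be set beforehand with `docker node update --label-add monitoring=true <node-name>`, again using the assumed label name.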

Monitoring servers

Node exporter is used to expose host metrics such as CPU, memory, disk and network usage.

Monitoring containers

cAdvisor is used to expose container metrics such as CPU, memory, disk and network usage. cAdvisor runs on all worker nodes because it is deployed as a global mode service (https://docs.docker.com/engine/swarm/how-swarm-mode-works/services/#replicated-and-global-services) in Docker Swarm.
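
A global mode deployment like the one described above could be assembled as follows. This is a hedged sketch: the helper name, the image name passed in, and the read-only host mounts are illustrative assumptions, not the mounts Sunbird actually uses, and the mount list is not claimed to be complete for cAdvisor.

```python
from typing import List, Sequence, Tuple


def global_service_command(
    name: str,
    image: str,
    readonly_mounts: Sequence[Tuple[str, str]] = (),
) -> List[str]:
    """Build a `docker service create` command that runs one task per node."""
    cmd = ["docker", "service", "create", "--name", name, "--mode", "global"]
    for source, destination in readonly_mounts:
        # Expose a host directory to the container without write access.
        cmd += [
            "--mount",
            f"type=bind,source={source},destination={destination},readonly",
        ]
    cmd.append(image)
    return cmd


if __name__ == "__main__":
    # Hypothetical image name and mounts, for illustration only.
    print(" ".join(global_service_command(
        "cadvisor",
        "google/cadvisor",
        [("/", "/rootfs"), ("/sys", "/sys")],
    )))
```

With `--mode global`, swarm schedules one task on every eligible node, so new nodes joining the swarm get a cAdvisor container without a manual scale-up.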

View of server and container monitoring

Image Source: https://stefanprodan.com/2016/a-monitoring-solution-for-docker-hosts-containers-and-containerized-services/

Monitoring services

The Prometheus community maintains many open source exporters for well-known use cases. Sunbird also has a few homegrown exporters for more advanced cases.

Discovering services inside docker swarm

Each service deployed in Docker Swarm has DNS entries for:

  • <service-name>: Resolves to the single virtual IP (VIP) of the service inside the network. Traffic to this VIP is load balanced across all the containers of the service using IPVS.
  • tasks.<service-name>: Resolves to multiple IP addresses, one for each container running for this service. To collect metrics from each container instance, use Prometheus dns_sd_configs with tasks.<service-name>.
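
To see what the tasks.<service-name> lookup returns, the following Python sketch resolves the name and turns each address into a scrape target. Prometheus does this lookup itself through its DNS service discovery, so the code only illustrates the idea. The function names, the service name `cadvisor` and port `8080` are assumptions for the example, and the lookup only succeeds when run from a container attached to the same swarm network.

```python
import socket
from typing import Callable, List


def resolve_task_ips(
    service_name: str,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Return the unique IPv4 addresses behind tasks.<service-name>."""
    infos = resolver(
        f"tasks.{service_name}", None, socket.AF_INET, socket.SOCK_STREAM
    )
    # Each entry is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] holds the IP address of one container.
    return sorted({info[4][0] for info in infos})


def scrape_targets(
    service_name: str,
    port: int,
    resolver: Callable = socket.getaddrinfo,
) -> List[str]:
    """Build one host:port scrape target per running container."""
    return [f"{ip}:{port}" for ip in resolve_task_ips(service_name, resolver)]


if __name__ == "__main__":
    # Hypothetical service name and port; works only inside the swarm network.
    try:
        print(scrape_targets("cadvisor", 8080))
    except socket.gaierror as err:
        print(f"tasks.cadvisor did not resolve here: {err}")
```

Resolving the plain <service-name> instead would return only the VIP, so metrics scraped through it would come from whichever container IPVS happened to pick for that request.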

Image Source: https://www.slideshare.net/Docker/monitoring-the-prometheus-way-julius-voltz-prometheus
