Merge pull request #56 from bird-house/add-simple-monitoring

Monitoring for the host and each Docker container.

![Screenshot_2020-06-19 Docker and system monitoring - Grafana](https://user-images.githubusercontent.com/11966697/85206384-c7f6f580-b2ef-11ea-848d-46490eb95886.png)

For the host, Node-exporter collects these metrics:

* uptime
* number of containers
* used disk space
* used memory, available memory, used swap memory
* load
* CPU usage
* inbound and outbound network traffic
* disk I/O

For each container, cAdvisor collects these metrics:

* inbound and outbound network traffic
* CPU usage
* memory and swap memory usage
* disk usage

Useful visualisation features:

* zooming in on one graph updates all other graphs to the same time range, so events can be correlated
* each graph can be viewed independently for more details
* hovering over a data point shows the value at that moment

Prometheus is used as the time series DB and Grafana as the visualization dashboard.

Node-exporter, cAdvisor and Prometheus are exposed so that another Prometheus on the network can also scrape the same metrics and perform other analysis if required.

The whole monitoring stack is a separate component, so users are not forced to enable it if another monitoring system is already in place. Enabling this monitoring stack is done via the `env.local` file, like all other components.

The Grafana dashboard is taken from https://grafana.com/grafana/dashboards/893 with many fixes (see commits), since most of the metric names have changed over time. Still, starting from it was much quicker than learning the Prometheus query language and Grafana visualization options from scratch. On top of that, lots of metrics are exposed, and I had to filter out which ones are relevant to graph. So starting from a broken dashboard was still a big win. Grafana has a big collection of existing, but probably unmaintained, dashboards we can leverage.

So this is a first draft for monitoring. Many things I am not sure about, will need tweaking, or are missing:

* Probably have to add more metrics or remove some that might be irrelevant; with time we will see.
* Probably will have to tweak the scrape interval and the retention time to keep the disk storage requirement reasonable; again, we'll see with time.
* Missing alerting. With all the pretty graphs, we are not going to look at them all day; we need some kind of alerting mechanism.

Test system: http://lvupavicsmaster.ouranos.ca:3001/d/pf6xQMWGz/docker-and-system-monitoring?orgId=1&refresh=5m, user: admin, passwd: the default passwd

Also tested on Medus: http://medus.ouranos.ca:3001/d/pf6xQMWGz/docker-and-system-monitoring?orgId=1&refresh=5m (on Medus, a full yum update was needed to get a new kernel and a new Docker engine for cAdvisor to work properly).

Part of issue #12
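
Since Node-exporter, cAdvisor, Prometheus and Grafana are all published on host ports, the stack can be smoke-tested from the host once it is up. The following is only a minimal sketch, not part of this PR: the helper names `monitoring_endpoints` and `check_monitoring` are hypothetical, the default host `localhost` is an assumption, and the ports are the ones published in this PR's `docker-compose-extra.yml` (9100, 9999, 9090, 3001). The paths used are the exporters' `/metrics` pages, Prometheus' `/-/healthy` and Grafana's `/api/health`.

```shell
#!/bin/sh
# Hypothetical helper (not in the repo): print "name url" pairs for the
# monitoring services. The host defaults to localhost (an assumption); the
# ports are those published in components/monitoring/docker-compose-extra.yml.
monitoring_endpoints() {
  host="${1:-localhost}"
  echo "node-exporter http://${host}:9100/metrics"
  echo "cadvisor http://${host}:9999/metrics"
  echo "prometheus http://${host}:9090/-/healthy"
  echo "grafana http://${host}:3001/api/health"
}

# Hypothetical helper: probe every endpoint with curl and report OK or FAIL
# for each one, without stopping at the first failure.
check_monitoring() {
  monitoring_endpoints "$1" | while read -r name url; do
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      echo "OK   $name $url"
    else
      echo "FAIL $name $url"
    fi
  done
}
```

After sourcing the snippet, running `check_monitoring` (or `check_monitoring some-host` for a remote deployment) prints one line per service. A `FAIL` for cAdvisor on an older host may point to the kernel and Docker engine issue mentioned above for Medus.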
bird-house · Jul 2, 2020 · 775c3b3
2 parents 2867220 + 39c577b
commit 775c3b3
Showing 9 changed files with 2,308 additions and 0 deletions.
diff --git a/birdhouse/README.md b/birdhouse/README.md
@@ -44,6 +44,12 @@ enabled and configured in the `env.local` file (a copy from
   desired, full documentation in [`env.local.example`](env.local.example).
 * Run once [`fix-write-perm`](deployment/fix-write-perm), see doc in script.
 
+Resource usage monitoring (CPU, memory, ..) for the host and each of the containers
+can be enabled by enabling the `./components/monitoring` in `env.local` file.
+
+* Add `./components/monitoring` to `EXTRA_CONF_DIRS`.
+* Change `GRAFANA_ADMIN_PASSWORD` value.
+
 To launch all the containers, use the following command:
 ```
 ./pavics-compose.sh up -d

diff --git a/birdhouse/components/monitoring/.gitignore b/birdhouse/components/monitoring/.gitignore
@@ -0,0 +1,3 @@
+prometheus.yml
+grafana_datasources.yml
+grafana_dashboards.yml
diff --git a/birdhouse/components/monitoring/docker-compose-extra.yml b/birdhouse/components/monitoring/docker-compose-extra.yml
@@ -0,0 +1,79 @@
+version: '2.1'
+
+services:
+  # https://github.com/google/cadvisor/blob/master/docs/running.md
+  # Collect per container metrics.
+  cadvisor:
+    image: gcr.io/google-containers/cadvisor:v0.36.0
+    container_name: cadvisor
+    volumes:
+      - /:/rootfs:ro
+      - /var/run:/var/run:ro
+      - /sys:/sys:ro
+      - /var/lib/docker:/var/lib/docker:ro
+    ports:
+      - 9999:8080
+    devices:
+      - /dev/kmsg
+    restart: unless-stopped
+
+  # https://github.com/prometheus/node_exporter
+  # Collect system-wide metrics.
+  node-exporter:
+    image: quay.io/prometheus/node-exporter:v1.0.0
+    container_name: node-exporter
+    volumes:
+      - /:/host:ro,rslave
+    ports:
+      - 9100:9100
+    network_mode: "host"
+    pid: "host"
+    command: --path.rootfs=/host
+    restart: unless-stopped
+
+  # https://prometheus.io/docs/prometheus/latest/installation
+  # Monitor and store collected metrics.
+  prometheus:
+    image: prom/prometheus:v2.19.0
+    container_name: prometheus
+    volumes:
+      - ./components/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
+      - prometheus_persistence:/prometheus:rw
+    ports:
+      - 9090:9090
+    command:
+      # restore original CMD from image
+      - --config.file=/etc/prometheus/prometheus.yml
+      - --storage.tsdb.path=/prometheus
+      - --web.console.libraries=/usr/share/prometheus/console_libraries
+      - --web.console.templates=/usr/share/prometheus/consoles
+      # https://prometheus.io/docs/prometheus/latest/storage/
+      - --storage.tsdb.retention.time=90d
+    restart: unless-stopped
+
+  # https://grafana.com/docs/grafana/latest/installation/docker/
+  # https://grafana.com/docs/grafana/latest/installation/configure-docker/
+  # Visualize metrics from Prometheus
+  grafana:
+    image: grafana/grafana:7.0.3
+    container_name: grafana
+    volumes:
+      - ./components/monitoring/grafana_datasources.yml:/etc/grafana/provisioning/datasources/grafana_datasources.yml:ro
+      - ./components/monitoring/grafana_dashboards.yml:/etc/grafana/provisioning/dashboards/grafana_dashboards.yml:ro
+      - ./components/monitoring/grafana_dashboards:/etc/grafana/dashboards:ro
+      - grafana_persistence:/var/lib/grafana:rw
+    environment:
+      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
+    ports:
+      - 3001:3000
+    restart: unless-stopped
+
+volumes:
+  prometheus_persistence:
+    external:
+      name: prometheus_persistence
+  grafana_persistence:
+    external:
+      name: grafana_persistence
+
+# vi: tabstop=8 expandtab shiftwidth=2 softtabstop=2
diff --git a/birdhouse/components/monitoring/grafana_dashboards.yml.template b/birdhouse/components/monitoring/grafana_dashboards.yml.template
@@ -0,0 +1,13 @@
+# https://grafana.com/docs/grafana/latest/administration/provisioning/#dashboards
+apiVersion: 1
+
+providers:
+ - name: 'default'
+   folder: 'Local-PAVICS'
+   folderUid: 'local-pavics'
+   disableDeletion: false
+   type: file
+   editable: false
+   allowUiUpdates: false
+   options:
+     path: "/etc/grafana/dashboards"