Merge pull request #56 from bird-house/add-simple-monitoring
Monitoring for the host and each Docker container.

![Screenshot_2020-06-19 Docker and system monitoring - Grafana](https://user-images.githubusercontent.com/11966697/85206384-c7f6f580-b2ef-11ea-848d-46490eb95886.png)

For host, using Node-exporter to collect metrics:
* uptime
* number of containers
* used disk space
* used memory, available memory, used swap memory
* load
* cpu usage
* in and out network traffic 
* disk I/O

For each container, using cAdvisor to collect metrics:
* in and out network traffic
* cpu usage
* memory and swap memory usage
* disk usage

Useful visualisation features:
* zooming in on one graph updates all other graphs to the same time range, so events can be correlated
* each graph can be viewed independently for more details
* hovering over a data point shows its value at that moment

Prometheus is used as the time series DB and Grafana is used as the visualization dashboard.

Node-exporter, cAdvisor and Prometheus are exposed so another Prometheus on the network can also scrape those same metrics and perform other analysis if required.
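
As a rough sanity check that the three endpoints are reachable from another machine, something like the following could be used. This is only a sketch: the host ports (9100, 9999, 9090) are taken from `docker-compose-extra.yml` in this commit, while `MONITORING_HOST`, `metrics_url` and `check_endpoints` are made-up names for illustration and are not part of the commit.

```shell
#!/bin/sh
# Sketch only: MONITORING_HOST, metrics_url and check_endpoints are
# hypothetical names, not part of this commit.
MONITORING_HOST="${MONITORING_HOST:-localhost}"

# Print the URL for a given host port and path.
metrics_url() {
  printf 'http://%s:%s%s\n' "$MONITORING_HOST" "$1" "$2"
}

# Probe each exposed endpoint and report whether it answered.
# Host ports come from docker-compose-extra.yml:
#   9100 node-exporter, 9999 cAdvisor (container port 8080), 9090 Prometheus.
check_endpoints() {
  for url in \
    "$(metrics_url 9100 /metrics)" \
    "$(metrics_url 9999 /metrics)" \
    "$(metrics_url 9090 '/api/v1/query?query=up')"
  do
    if curl --silent --fail --max-time 5 --output /dev/null "$url"; then
      echo "OK   $url"
    else
      echo "FAIL $url"
    fi
  done
}
```

After sourcing the snippet, setting `MONITORING_HOST` to the monitored machine and calling `check_endpoints` prints one `OK` or `FAIL` line per endpoint.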

The whole monitoring stack is a separate component, so users are not forced to enable it if another monitoring system is already in place.  Enabling this monitoring stack is done via the `env.local` file, like all other components.
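
For illustration, the relevant part of `env.local` might look like the sketch below. The variable names `EXTRA_CONF_DIRS` and `GRAFANA_ADMIN_PASSWORD` come from the README change in this commit; the `export` syntax and the placeholder password are assumptions, so `env.local.example` remains the reference.

```shell
# env.local (sketch): enable the optional monitoring component.
# Assumption: env.local is a shell file using `export`; keep any component
# directories that are already listed in EXTRA_CONF_DIRS.
export EXTRA_CONF_DIRS="./components/monitoring"

# Placeholder value (hypothetical): replace it with a real secret.
export GRAFANA_ADMIN_PASSWORD="change-me"
```

The password is passed to the Grafana container as `GF_SECURITY_ADMIN_PASSWORD`, as shown in `docker-compose-extra.yml`.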

The Grafana dashboard is taken from https://grafana.com/grafana/dashboards/893 with many fixes (see commits), since most of the metric names have changed over time.  Even so, starting from it was much quicker than learning the Prometheus query language and Grafana visualization options from scratch.  On top of that, lots of metrics are exposed, and I had to filter out which ones are relevant to graph.  So starting from a broken dashboard was still a big win.  Grafana has a big collection of existing, though probably unmaintained, dashboards we can leverage.

So this is a first draft for monitoring.  There are many things I am unsure about, that will need tweaking, or that are missing:
* We will probably have to add more metrics or remove some that turn out to be irrelevant; we will see with time.
* We will probably have to tweak the scrape interval and the retention time to keep the disk storage requirement reasonable; again, we will see with time.
* Alerting is missing.  We are not going to look at all the pretty graphs all day, so we need some kind of alerting mechanism.

Test system: http://lvupavicsmaster.ouranos.ca:3001/d/pf6xQMWGz/docker-and-system-monitoring?orgId=1&refresh=5m, user: admin, passwd: the default passwd

Also tested on Medus: http://medus.ouranos.ca:3001/d/pf6xQMWGz/docker-and-system-monitoring?orgId=1&refresh=5m (on Medus, a full `yum update` was needed to get a new kernel and a new Docker engine for cAdvisor to work properly).

Part of issue #12
tlvu authored Jul 2, 2020
2 parents 2867220 + 39c577b commit 775c3b3
Showing 9 changed files with 2,308 additions and 0 deletions.
6 changes: 6 additions & 0 deletions birdhouse/README.md
@@ -44,6 +44,12 @@ enabled and configured in the `env.local` file (a copy from
desired, full documentation in [`env.local.example`](env.local.example).
* Run once [`fix-write-perm`](deployment/fix-write-perm), see doc in script.

Resource usage monitoring (CPU, memory, ..) for the host and each of the containers
can be enabled by enabling the `./components/monitoring` in `env.local` file.

* Add `./components/monitoring` to `EXTRA_CONF_DIRS`.
* Change `GRAFANA_ADMIN_PASSWORD` value.

To launch all the containers, use the following command:
```
./pavics-compose.sh up -d
3 changes: 3 additions & 0 deletions birdhouse/components/monitoring/.gitignore
@@ -0,0 +1,3 @@
prometheus.yml
grafana_datasources.yml
grafana_dashboards.yml
79 changes: 79 additions & 0 deletions birdhouse/components/monitoring/docker-compose-extra.yml
@@ -0,0 +1,79 @@
version: '2.1'

services:
  # https://github.com/google/cadvisor/blob/master/docs/running.md
  # Collect per container metrics.
  cadvisor:
    image: gcr.io/google-containers/cadvisor:v0.36.0
    container_name: cadvisor
    volumes:
      - /:/rootfs:ro
      - /var/run:/var/run:ro
      - /sys:/sys:ro
      - /var/lib/docker:/var/lib/docker:ro
    ports:
      - 9999:8080
    devices:
      - /dev/kmsg
    restart: unless-stopped

  # https://github.com/prometheus/node_exporter
  # Collect system-wide metrics.
  node-exporter:
    image: quay.io/prometheus/node-exporter:v1.0.0
    container_name: node-exporter
    volumes:
      - /:/host:ro,rslave
    ports:
      - 9100:9100
    network_mode: "host"
    pid: "host"
    command: --path.rootfs=/host
    restart: unless-stopped

  # https://prometheus.io/docs/prometheus/latest/installation
  # Monitor and store collected metrics.
  prometheus:
    image: prom/prometheus:v2.19.0
    container_name: prometheus
    volumes:
      - ./components/monitoring/prometheus.yml:/etc/prometheus/prometheus.yml:ro
      - prometheus_persistence:/prometheus:rw
    ports:
      - 9090:9090
    command:
      # restore original CMD from image
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.path=/prometheus
      - --web.console.libraries=/usr/share/prometheus/console_libraries
      - --web.console.templates=/usr/share/prometheus/consoles
      # https://prometheus.io/docs/prometheus/latest/storage/
      - --storage.tsdb.retention.time=90d
    restart: unless-stopped

  # https://grafana.com/docs/grafana/latest/installation/docker/
  # https://grafana.com/docs/grafana/latest/installation/configure-docker/
  # Visualize metrics from Prometheus
  grafana:
    image: grafana/grafana:7.0.3
    container_name: grafana
    volumes:
      - ./components/monitoring/grafana_datasources.yml:/etc/grafana/provisioning/datasources/grafana_datasources.yml:ro
      - ./components/monitoring/grafana_dashboards.yml:/etc/grafana/provisioning/dashboards/grafana_dashboards.yml:ro
      - ./components/monitoring/grafana_dashboards:/etc/grafana/dashboards:ro
      - grafana_persistence:/var/lib/grafana:rw
    environment:
      GF_SECURITY_ADMIN_PASSWORD: ${GRAFANA_ADMIN_PASSWORD}
    ports:
      - 3001:3000
    restart: unless-stopped

volumes:
  prometheus_persistence:
    external:
      name: prometheus_persistence
  grafana_persistence:
    external:
      name: grafana_persistence

# vi: tabstop=8 expandtab shiftwidth=2 softtabstop=2
13 changes: 13 additions & 0 deletions birdhouse/components/monitoring/grafana_dashboards.yml.template
@@ -0,0 +1,13 @@
# https://grafana.com/docs/grafana/latest/administration/provisioning/#dashboards
apiVersion: 1

providers:
  - name: 'default'
    folder: 'Local-PAVICS'
    folderUid: 'local-pavics'
    disableDeletion: false
    type: file
    editable: false
    allowUiUpdates: false
    options:
      path: "/etc/grafana/dashboards"