
Monitoring for host and each docker container #56

Merged: 29 commits from add-simple-monitoring into master, Jul 2, 2020

Conversation

@tlvu (Collaborator) commented Jun 20, 2020

[Screenshot: "Docker and system monitoring" Grafana dashboard]

For the host, Node-exporter collects these metrics:

  • uptime
  • number of containers
  • used disk space
  • used memory, available memory, used swap memory
  • load
  • CPU usage
  • inbound and outbound network traffic
  • disk I/O

For each container, cAdvisor collects these metrics:

  • inbound and outbound network traffic
  • CPU usage
  • memory and swap memory usage
  • disk usage
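
Below is a minimal sketch of what the Prometheus scrape configuration for these two exporters can look like; the job names and ports are assumptions (the exporters' defaults), not taken verbatim from this repo.

```yaml
# prometheus.yml (sketch): scrape the host exporter and the container exporter
scrape_configs:
  - job_name: node-exporter          # host-level metrics
    static_configs:
      - targets: ['node-exporter:9100']
  - job_name: cadvisor               # per-container metrics
    static_configs:
      - targets: ['cadvisor:8080']
```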

Useful visualisation features:

  • zooming in on one graph updates all other graphs to the same time range, so events can be correlated
  • each graph can be viewed independently for more detail
  • hovering over a data point shows its value at that moment

Prometheus is used as the time series DB and Grafana is used as the visualization dashboard.
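
As a sketch of how the two pieces connect, a Grafana provisioning file along these lines points Grafana at the Prometheus container; the datasource name, URL and file path are assumptions, not the exact files in this repo.

```yaml
# grafana/provisioning/datasources/prometheus.yml (hypothetical path)
apiVersion: 1
datasources:
  - name: Prometheus
    type: prometheus
    access: proxy                    # Grafana proxies queries server-side
    url: http://prometheus:9090      # assumed service name on the compose network
    isDefault: true
```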

Node-exporter, cAdvisor and Prometheus are exposed so another Prometheus on the network can also scrape those same metrics and perform other analysis if required.
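
For example, an external Prometheus could either scrape the exposed exporter ports directly or pull already-collected series through this instance's /federate endpoint; the sketch below assumes a hypothetical hostname and the default Prometheus port.

```yaml
# scrape config on the *other* Prometheus (sketch)
scrape_configs:
  - job_name: pavics-federate
    honor_labels: true               # keep the original job/instance labels
    metrics_path: /federate
    params:
      'match[]':
        - '{job=~"node-exporter|cadvisor"}'
    static_configs:
      - targets: ['pavics-host.example.org:9090']   # hypothetical hostname
```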

The whole monitoring stack is a separate component, so users are not forced to enable it if another monitoring system is already in place. Enabling this monitoring stack is done via the env.local file, like all other optional components.

The Grafana dashboard is taken from https://grafana.com/grafana/dashboards/893 with many fixes (see commits), since most of the metric names have changed over time. Still, it was much quicker to hit the ground running this way than to learn the Prometheus query language and Grafana visualization options from scratch, especially since there are lots of exposed metrics and they had to be filtered down to the ones worth graphing. So starting from a broken dashboard was still a big win. Grafana has a big collection of existing, though probably unmaintained, dashboards we can leverage.
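
Most of the breakage comes from node_exporter renaming its metrics (for example, node_memory_MemAvailable became node_memory_MemAvailable_bytes in 0.16). The panels here were fixed by updating their queries to the new names, but as an illustration only, a recording rule like the following could alias an old name instead; this is a sketch, not what the commits actually do.

```yaml
# legacy-aliases.rules (illustrative only)
groups:
  - name: legacy_metric_aliases
    rules:
      - record: node_memory_MemAvailable          # old pre-0.16 name
        expr: node_memory_MemAvailable_bytes      # current metric name
```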

So this is a first draft for monitoring. Several things are uncertain, will need tweaking, or are missing:

  • Probably have to add more metrics, or remove some that turn out to be irrelevant; we will see with time.
  • Probably have to tweak the scrape interval and the retention time to keep the disk storage requirement reasonable (see the sketch after this list); again, we'll see with time.
  • Missing alerting. With all these pretty graphs, we are not going to look at them all day; we need some kind of alerting mechanism.
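
For the scrape interval and retention point above, here is a sketch of the kind of settings involved; the service name, volume name and values are assumptions, not this repo's actual compose file. The named volume also reflects the permission work-around mentioned in the commit list below.

```yaml
# docker-compose override for the prometheus service (sketch)
services:
  prometheus:
    image: prom/prometheus
    command:
      - --config.file=/etc/prometheus/prometheus.yml
      - --storage.tsdb.retention.time=90d    # cap on-disk history; scrape_interval is tuned in prometheus.yml
    volumes:
      - prometheus_data:/prometheus          # named volume instead of a bind-mount (avoids permission issues)
volumes:
  prometheus_data: {}
```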

Test system: http://lvupavicsmaster.ouranos.ca:3001/d/pf6xQMWGz/docker-and-system-monitoring?orgId=1&refresh=5m, user: admin, passwd: the default passwd

Part of issue #12

tlvu added 28 commits June 16, 2020 04:30
Used a data volume instead of bind-mounting /data/prometheus to work around the following permission issue:

level=error ts=2020-06-16T09:19:22.327Z caller=query_logger.go:87 component=activeQueryTracker msg="Error opening query log file" file=/prometheus/queries.active err="open /prometheus/queries.active: permission denied"
panic: Unable to create mmap-ed active query log
…ners

1) Tooltip only shows the current item's value instead of all items, because the tooltip cannot scroll.

2) Show the current value in the legend and move the legend to the bottom instead of the right side, to avoid excessive scrolling on the right side.
…fana

The "Usage Memory" panel is more or less replaced by showing current
value in the legend of the other memory graphs.

Remaining and Limit Memory was always zero so no point in keeping them.
Was not showing any data before because variable $server was not
defined correctly.

The previous query used node_cpu, which does not exist anymore, so the query had to be simplified to no longer need it.

The graph now shows raw load values instead of converting to a percentage of the number of CPUs; I did not know how to get the number of CPUs (see the sketch after this commit list).
It was not working before as the "Load 3 values" graph.
Wrong metric key name; changed to the new Gauge panel.
Turned Swap into a percentage and updated to the new Gauge panel.

Wrong server variable definition.
Show all disks instead of just some.

Changed to percent full to be more useful.
Before, it was not considering all possible partitions; now it displays the used percentage of the fullest partitions.

Also migrated to the newer Gauge panel.
Wrong metric name used and wrong calculation formula (it was computing Unavailable Memory).

Removed alert: unused, and its threshold value is probably not appropriate.
Wrong metric id used.

Show time on X-axis.
So we can compare the available vs total memory.
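
Regarding the load-average commit above that could not convert load to a percentage for lack of a CPU count: one common way to derive it from Node-exporter metrics is a recording rule like this sketch (the rule name and grouping are assumptions, not part of this PR).

```yaml
# cpu-count.rules (sketch): expose the number of CPUs per host
groups:
  - name: cpu_count
    rules:
      - record: instance:node_cpus:count
        expr: count by (instance) (node_cpu_seconds_total{mode="idle"})
```
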
@tlvu tlvu requested review from tlogan2000 and moulab88 June 20, 2020 16:56
@tlvu (Collaborator, Author) commented Jun 20, 2020

http://medus.ouranos.ca:3001/d/pf6xQMWGz/docker-and-system-monitoring?orgId=1&from=now-24h&to=now&refresh=5m (cAdvisor not working on Medus for now, so all stats related to containers are missing).

@moulab88 (Collaborator) left a comment:

Nice tools and good job!!

@tlogan2000 (Collaborator) left a comment:

Looks nice.

@tlvu tlvu merged commit 775c3b3 into master Jul 2, 2020
@tlvu tlvu deleted the add-simple-monitoring branch July 2, 2020 18:35
@tlvu (Collaborator, Author) commented Jul 2, 2020

Medus is now working properly after 2 days of uptime: http://medus.ouranos.ca:3001/d/pf6xQMWGz/docker-and-system-monitoring?orgId=1&refresh=5m (on Medus, a full yum update was needed to get a new kernel and a new Docker engine for cAdvisor to work properly).

Tagged 1.10.0, bumping the minor version since this adds a new component.

@tlvu (Collaborator, Author) commented Jul 3, 2020

Autodeployed to prod:

triggerdeploy finished START_TIME=2020-07-03T05:07:02+0000
triggerdeploy finished   END_TIME=2020-07-03T05:10:35+0000

tlvu added a commit that referenced this pull request Jul 11, 2020
Monitoring: add alert rules and alert handling (deduplicate, group, route, silence, inhibit).

This is a follow-up to the previous PR #56, which added the monitoring itself.

Added the cAdvisor and Node-exporter collections of alert rules found at https://awesome-prometheus-alerts.grep.to/rules, with a few fixes for errors in the rules and some tweaking to reduce false-positive alarms (see the list of commits). It is a great collection of ready-made sample rules to hit the ground running with, and to learn the PromQL query language along the way.
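
As a sketch of the kind of rule file in that collection (the rule name and threshold here are illustrative, not the exact rules added in this commit):

```yaml
# host-alerts.rules (sketch)
groups:
  - name: host-alerts
    rules:
      - alert: HostOutOfDiskSpace
        expr: (node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100 < 10
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Host out of disk space (instance {{ $labels.instance }})"
          description: "Filesystem has less than 10% space left"
```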

![2020-07-08-090953_474x1490_scrot](https://user-images.githubusercontent.com/11966697/86926000-8b086c80-c0ff-11ea-92d0-6f5ccfe2b8e1.png)

Added Alertmanager to handle the alerts (deduplicate, group, route, silence, inhibit). Currently the only notification route configured is email, but Alertmanager can also route alerts to Slack and to any generic service accepting webhooks.
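
A minimal sketch of an Alertmanager configuration with an email receiver is below; the SMTP host, addresses and receiver names are placeholders, and the commented-out Slack receiver only illustrates that other routes are possible.

```yaml
# alertmanager.yml (sketch)
global:
  smtp_smarthost: 'smtp.example.org:25'      # hypothetical SMTP relay
  smtp_from: 'alertmanager@example.org'
route:
  receiver: email-admins
  group_by: ['alertname', 'instance']        # group related alerts into one notification
receivers:
  - name: email-admins
    email_configs:
      - to: 'admins@example.org'
  # - name: slack-ops
  #   slack_configs:
  #     - api_url: 'https://hooks.slack.com/services/XXX'
  #       channel: '#ops'
```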

![2020-07-08-091150_1099x669_scrot](https://user-images.githubusercontent.com/11966697/86926213-cd31ae00-c0ff-11ea-8b2a-d33803ad3d5d.png)

![2020-07-08-091302_1102x1122_scrot](https://user-images.githubusercontent.com/11966697/86926276-dc186080-c0ff-11ea-9377-bda03b69640e.png)

This is an initial attempt at alerting.  There are several ways to tweak the system without changing the code:

* To add more Prometheus alert rules, volume-mount more *.rules files into the prometheus container.
* To disable existing Prometheus alert rules, add more Alertmanager inhibition rules using `ALERTMANAGER_EXTRA_INHIBITION` via the `env.local` file (see the sketch after this list).
* Other possible Alertmanager configs via `env.local`: `ALERTMANAGER_EXTRA_GLOBAL`, `ALERTMANAGER_EXTRA_ROUTES`, `ALERTMANAGER_EXTRA_RECEIVERS`.
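
As referenced in the list above, here is a sketch of the kind of Alertmanager fragment such a variable could carry; the label values are illustrative, and how the fragment gets injected depends on the repo's config template.

```yaml
# extra inhibition rule (sketch): silence warnings on a host that is already down
inhibit_rules:
  - source_match:
      alertname: HostDown
    target_match:
      severity: warning
    equal: ['instance']
```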

What more could be done after this initial attempt:

* Possibly add more graphs to the Grafana dashboard, since we now have alerts on metrics that have no matching Grafana graph. Graphs are useful for historical trends and for correlation with other metrics, so they are not required if we do not need trends and correlation.

* Only basic metrics are being collected currently. We could collect more useful metrics, such as SMART status, and alert when a disk is failing.

* The autodeploy mechanism could hook into this monitoring system to report pass/fail status and execution duration, with alerting on problems. Then we could also correlate any CPU, memory, or disk I/O spikes with autodeploy runs, and have a trace of previous autodeploy executions.

I had to test these alerts directly in prod to tweak for fewer false-positive alerts and to debug non-working rules, so these changes are already in prod! This also tests the SMTP server on the network.

See rules on Prometheus side: http://pavics.ouranos.ca:9090/rules, http://medus.ouranos.ca:9090/rules

Manage alerts on Alertmanager side: http://pavics.ouranos.ca:9093/#/alerts, http://medus.ouranos.ca:9093/#/alerts

Part of issue #12