This repository has been archived by the owner on Nov 9, 2023. It is now read-only.

Metrics through Prometheus

Hessam Shahriari edited this page Feb 14, 2020 · 1 revision

We collect metrics through various exporters directly into the Prometheus server, which runs as a Docker container scheduled by Nomad. This approach offers:

  • Inclusive monitoring where all aspects of systems are monitored and available to provide insight and drive decisions.
  • Multi-dimensional data model with time series data identified by metric name and key/value pairs
  • Time series data: 1 name, multiple labels, a numeric value, a timestamp
  • A flexible query language (PromQL)
  • Metrics have metadata: container_cpu_user_seconds_total{group="monitoring",id="/",instance="10.146.1.84:8080",job="cadvisor"}
  • Reliable, efficient and scalable
  • Alerting
  • Pull based monitoring, so it’s easy to run locally for testing
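Because monitoring is pull-based, any HTTP client can read the same `/metrics` text that Prometheus scrapes, which is what makes local testing easy. As a rough illustration (a simplified sketch, not the official parser — the real exposition format also allows timestamps, escaped characters, and HELP/TYPE comments), a sample line like the `container_cpu_user_seconds_total` example above can be split into its metric name, labels, and value:

```python
import re

# Parse one line of the Prometheus text exposition format into
# (metric_name, labels_dict, value). Simplified: assumes no commas or
# escapes inside label values, and no trailing timestamp.
SAMPLE_RE = re.compile(
    r'^(?P<name>[a-zA-Z_:][a-zA-Z0-9_:]*)'   # metric name
    r'(?:\{(?P<labels>[^}]*)\})?'             # optional {key="value",...}
    r'\s+(?P<value>\S+)$'                     # numeric sample value
)

def parse_sample(line):
    m = SAMPLE_RE.match(line.strip())
    if m is None:
        raise ValueError(f"not a sample line: {line!r}")
    labels = {}
    if m.group('labels'):
        for pair in m.group('labels').split(','):
            key, _, val = pair.partition('=')
            labels[key.strip()] = val.strip().strip('"')
    return m.group('name'), labels, float(m.group('value'))

name, labels, value = parse_sample(
    'container_cpu_user_seconds_total{group="monitoring",job="cadvisor"} 1234.5')
# name   -> 'container_cpu_user_seconds_total'
# labels -> {'group': 'monitoring', 'job': 'cadvisor'}
# value  -> 1234.5
```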

Exporters:

Exporters are tools and applications that expose a metrics endpoint for Prometheus to scrape. For our environment, we use the following exporters:

  • CAdvisor: Provides container users an understanding of the resource usage and performance characteristics of their running containers.
  • Node Exporter: Measures various machine resources such as memory, disk, and CPU utilization

Prometheus Metric Types:

Metrics are stored in Prometheus' time-series database and can easily be queried to understand how these systems behave over time. The following metric types are available:

  • Counters are monotonically increasing values, used to measure things like the number of requests or errors
  • Gauges are values that can go up or down over time, used to measure things like CPU usage, memory usage, or the number of items being processed
  • Histograms observe a value such as a time or duration and count observations into buckets. For instance, you can tell what percentage of requests took less than x amount of time
  • Summaries also sample observations, while providing a total count of observations and a sum of all observed values
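To make the distinction concrete, here is a toy sketch of the semantics each type guarantees. This is pure illustrative Python, not the `prometheus_client` library (which provides real `Counter`, `Gauge`, `Histogram`, and `Summary` classes):

```python
# Toy sketch of Prometheus metric-type semantics (illustration only,
# not the prometheus_client library).

class Counter:
    """Monotonically increasing value, e.g. total requests or errors."""
    def __init__(self):
        self.value = 0.0
    def inc(self, amount=1.0):
        if amount < 0:
            raise ValueError("counters can only go up")
        self.value += amount

class Gauge:
    """Value that can go up or down, e.g. current memory usage."""
    def __init__(self):
        self.value = 0.0
    def set(self, value):
        self.value = value
    def inc(self, amount=1.0):
        self.value += amount
    def dec(self, amount=1.0):
        self.value -= amount

class Histogram:
    """Counts observations into cumulative buckets, e.g. request durations.
    A Summary is similar but tracks quantiles instead of fixed buckets;
    both keep a running count and sum of all observations."""
    def __init__(self, buckets=(0.1, 0.5, 1.0, float('inf'))):
        self.buckets = {b: 0 for b in buckets}
        self.count = 0
        self.sum = 0.0
    def observe(self, value):
        self.count += 1
        self.sum += value
        for upper in self.buckets:  # cumulative: increment every bucket >= value
            if value <= upper:
                self.buckets[upper] += 1

requests = Counter()
requests.inc()            # one request served
latency = Histogram()
latency.observe(0.3)      # falls into the 0.5, 1.0, and +Inf buckets
```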

Grafana:

Our Grafana instance is connected to Prometheus, and can be accessed at: http://grafana.pennsignals.uphs.upenn.edu/

As of now, we have two dashboards within Grafana:

  1. Minion Nodes that provides information about resources such as memory, disk, and CPU utilization.
  2. Container Performance that provides stats about each container.

Both of these dashboards are available to re-import into Grafana in JSON format at: https://github.com/pennsignals/metrics-and-logs/tree/master/grafana_dashboard

PromQL:

This is the query language available to Prometheus. In order to use PromQL more effectively, you need to look up the available metrics in the Prometheus dashboard located at: http://prometheus.pennsignals.uphs.upenn.edu/graph

Browse the list of metrics and then apply various expressions as demonstrated below. As a rule of thumb, any metric from cAdvisor starts with container_* and any metric from Node Exporter starts with node_*.

Examples:

  • Get node's capacity in GB:
node_filesystem_size_bytes{mountpoint="/"} / 1e9
  • Get CPU usage of a container filtered by its image name:
sum(rate(container_cpu_usage_seconds_total{image="quay.io/pennsignals/cdi-ui:v0.4.1"}[5m])) by (image) /
sum(container_spec_cpu_shares{image="quay.io/pennsignals/cdi-ui:v0.4.1"}/container_spec_cpu_period{image="quay.io/pennsignals/cdi-ui:v0.4.1"}) by (image) * 100
  • Get nodes' memory usage, and extract the host name from the instance label using label_replace:
label_replace(((node_memory_MemTotal_bytes - node_memory_MemFree_bytes - node_memory_Buffers_bytes - node_memory_Cached_bytes)/1e9), "Host", "$1", "instance", "(.*):.*")

In this example we are using a regex to get the IP address of the host without the port number.
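The same regex used by label_replace above can be tried out in plain Python, which makes it easy to sanity-check before putting it in a PromQL expression. The greedy `(.*)` captures everything before the last colon:

```python
import re

# Mirror of the label_replace regex "(.*):.*" applied to an instance
# label such as "10.146.1.84:8080" -- capture the host, drop the port.
def host_from_instance(instance):
    m = re.match(r'(.*):.*', instance)
    return m.group(1) if m else instance

host_from_instance("10.146.1.84:8080")  # -> "10.146.1.84"
```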

Add custom metrics:

You can add custom metrics from your application. For Python, you can install the official Prometheus client:

pip install prometheus-client

The best way to achieve this is to add metrics as a context manager or decorator. You then need to push your metrics to the Prometheus Pushgateway server, from where Prometheus scrapes the data. Pushgateway is another Prometheus component that runs as a Docker container. We rely on Consul service discovery to resolve the address of Pushgateway, as seen below:

In your nomad file:

env {
       PUSH_GATEWAY = "pushgateway.service.consul:9091"
     }

In Python code:

pushgateway = getenv('PUSH_GATEWAY')
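Putting the two together, a push boils down to an HTTP PUT of exposition-format text to the gateway's /metrics/job/&lt;job&gt; endpoint. In practice you would use `prometheus_client.push_to_gateway`, which handles this for you; the stdlib sketch below (the `push_metric` helper and `demo_searches_total` metric are hypothetical names for illustration) just shows what gets sent where:

```python
from os import getenv
from urllib.request import Request

def push_metric(name, value, job, gateway=None):
    """Build the Pushgateway request for a single metric sample.
    Stdlib sketch for illustration; prometheus_client.push_to_gateway
    is the real way to do this. Returns the prepared Request -- a
    caller would send it with urllib.request.urlopen(req)."""
    gateway = gateway or getenv('PUSH_GATEWAY', 'pushgateway.service.consul:9091')
    url = f'http://{gateway}/metrics/job/{job}'
    body = f'{name} {value}\n'.encode()  # text exposition format
    return Request(url, data=body, method='PUT')

req = push_metric('demo_searches_total', 42, job='demo_app')
# req.full_url -> 'http://pushgateway.service.consul:9091/metrics/job/demo_app'
```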

An example implementation of this is available in the demo app below:

https://github.com/pennsignals/prometheus-demo/tree/master/app/search

Once you register your metrics with the registry and run the app, make sure that your metrics are showing up in the Prometheus graph browser. Additionally, you can view your metrics and their live data in the Pushgateway dashboard located at: http://pushgateway.pennsignals.uphs.upenn.edu/
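Checking that a metric arrived can also be scripted against the Prometheus HTTP API, which exposes instant queries at /api/v1/query. A minimal sketch of building such a query URL against our Prometheus host (the `query_url` helper is a hypothetical name for illustration):

```python
from urllib.parse import urlencode

# Build an instant-query URL for the Prometheus HTTP API (/api/v1/query).
# Any PromQL expression can go in the query parameter; a caller would
# fetch the URL and inspect the JSON response's "result" field.
def query_url(promql, base='http://prometheus.pennsignals.uphs.upenn.edu'):
    return f'{base}/api/v1/query?' + urlencode({'query': promql})

url = query_url('up{job="cadvisor"}')
```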
