Operating a validator can be stressful: when it rains, it pours! ⛈️
Protect your Penumbra validator from the weather by monitoring its uptime with `umbrella`. ☂️
Umbrella is a Prometheus exporter which monitors on-chain uptime for one or several Penumbra validators. It connects on-demand to one or more specified Penumbra nodes, picking data from the node with the highest height, and exports Prometheus metrics via HTTP at `/metrics`, so that you can set alerts for validator signing downtime.
In other words, Umbrella is a caching proxy that translates Prometheus scrape requests into RPC requests to Penumbra fullnodes, and translates their responses into Prometheus metrics reporting validator uptime.
Umbrella is a simple tool meant to do one thing well: correctly monitor on-chain uptime for a node according to the semantics of Penumbra.
It is not a full-stack monitoring solution for validators. It does not export metrics about the health of individual nodes. It should be used in conjunction with other monitoring solutions to provide a comprehensive picture of a validator operation's wellbeing.
Clone this repository and build `umbrella` with the command:

```bash
cargo build --release
```
This builds the executable `target/release/umbrella`, which you can then run as you desire. Depending on your deployment, you may consider using a systemd service, Docker container, or something else. For Nix configuration, see the Nix quickstart below.
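If you go the systemd route, a minimal unit file might look something like the sketch below. This is only an illustration, not a unit shipped with the project: the install path, hardening options, and placeholder flag values are assumptions to adapt to your own setup (the flags themselves are explained below).

```ini
# /etc/systemd/system/umbrella.service -- illustrative sketch; adjust paths and flags
[Unit]
Description=umbrella Penumbra validator uptime exporter
After=network-online.target
Wants=network-online.target

[Service]
# Assumes you copied target/release/umbrella to /usr/local/bin
ExecStart=/usr/local/bin/umbrella \
  --validator penumbravalid1... \
  --node https://...
Restart=on-failure
DynamicUser=yes

[Install]
WantedBy=multi-user.target
```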
Once you've installed `umbrella`, you can start the metrics server like this:

```bash
umbrella --validator $VALIDATOR_IDENTITY_KEY --node $PD_GRPC_BIND_ADDRESS --fallback $FALLBACK_PD_GRPC_BIND_ADDRESS
```
In the above, `$VALIDATOR_IDENTITY_KEY` is the identity key of the validator you wish to monitor, `$PD_GRPC_BIND_ADDRESS` is the URI of the RPC endpoint you want to get the information from, and `$FALLBACK_PD_GRPC_BIND_ADDRESS` is a fallback RPC which will only be used if no `--node` endpoint is reachable. All of these options can be repeated any number of times to specify multiple validators, multiple fullnodes, and multiple fallbacks, respectively.
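For instance, to monitor two validators using two of your own fullnodes plus a public fallback, the invocation might look like this (the identity keys and URLs here are purely illustrative placeholders):

```bash
umbrella \
  --validator penumbravalid1... \
  --validator penumbravalid1... \
  --node http://my-node-1:8080 \
  --node http://my-node-2:8080 \
  --fallback https://public-rpc.example.com
```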
Please be nice to public RPC endpoints: all nodes specified with `--node` are polled concurrently, and the information from the node with the highest block height is returned to Prometheus. Only if none of them respond is each `--fallback` tried sequentially, in the order specified on the command line. If you're connecting to a public RPC, it's courteous to set it as a `--fallback` node so that you only use its resources if your own fullnodes are all unreachable.
To check if `umbrella` is working, visit `localhost:1984/metrics`, and you should see Prometheus metrics for your selected validator's uptime.
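If you prefer the command line, fetching the same endpoint with `curl` works too:

```bash
curl -s http://localhost:1984/metrics
```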
At present, the metrics reported are:
- `state{validator=...}`: gauge per validator measuring the validator's state by numeric label, with the meanings: `0`=Defined, `1`=Disabled, `2`=Inactive, `3`=Active, `4`=Jailed, `5`=Tombstoned
- `uptime{validator=...}`: gauge per validator measuring the validator's uptime as a percentage in the numeric range [0, 100]
- `consecutive_missed_blocks{validator=...}`: gauge per validator measuring the length in blocks of the most recent string of consecutive downtime (reset to zero every time a block is signed)
- `update_success`: gauge reading `1` if the most recent update was successful, `0` if data could not be refreshed from any source
- `update_staleness`: gauge measuring the number of seconds since `umbrella` refreshed its cache of information (reset on every attempted update, regardless of success)
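Put together, the scrape output for a single healthy, active validator might look roughly like the sketch below; the values and the exact exposition formatting (label sets, `HELP`/`TYPE` lines, which are omitted here) are illustrative rather than copied from a real deployment:

```text
state{validator="penumbravalid1..."} 3
uptime{validator="penumbravalid1..."} 99.7
consecutive_missed_blocks{validator="penumbravalid1..."} 0
update_success 1
update_staleness 4
```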
Once you have `umbrella` running (perhaps as a systemd service or some such), you can configure Prometheus to scrape it, and Grafana to display its metrics and set alerts for when they are problematic. For a quick-start Grafana dashboard, see the example Grafana dashboard.
A possible starting configuration for alerting on an active validator could be something like the following (a sketch of corresponding Prometheus alerting rules appears after the list):

- P0 critical alert if `state > 3` (validator has been slashed and is jailed or tombstoned)
- P1 high alert if `state < 3` (validator is not active, but not due to downtime or misbehavior)
- P1 high alert if `update_success = 0` for longer than 10 minutes (`umbrella` is not managing to update itself, so you are flying blind)
- P1 high alert if `consecutive_missed_blocks > 120` (~10 minutes of consecutive downtime means something is wrong and it's not just ephemeral)
- P1 high alert if `uptime < 95` (cumulative downtime has exceeded ~40 minutes; something is interfering with availability in a significant way)
- P2 moderate alert if `consecutive_missed_blocks > 12` (~1 minute of consecutive downtime would be unusual for a well-configured, functioning validator)
- P2 moderate alert if `uptime < 99` (normal operating conditions should show > 99% uptime, so a dip beneath this threshold may indicate an issue)
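As a sketch of how a few of these could be expressed as Prometheus alerting rules (the rule names and severity labels are illustrative, and you may want to narrow the expressions to specific `validator` labels):

```yaml
groups:
  - name: umbrella
    rules:
      # P0: validator has been slashed and is jailed or tombstoned
      - alert: ValidatorSlashed
        expr: state > 3
        labels:
          severity: critical
      # P1: umbrella has not refreshed successfully for 10 minutes
      - alert: UmbrellaUpdateFailing
        expr: update_success == 0
        for: 10m
        labels:
          severity: high
      # P1: roughly 10 minutes of consecutive missed blocks
      - alert: ValidatorConsecutiveDowntime
        expr: consecutive_missed_blocks > 120
        labels:
          severity: high
```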
Umbrella is tested and used on NixOS internally at Starling Cybernetics. If you use Nix, you can borrow from this Nix quickstart:
First, add this input to your flake:
```nix
inputs.umbrella.url = "github:starlingcyber/umbrella";
```
Then, you can import the Umbrella module into a NixOS configuration to enable and configure its systemd service:
```nix
imports = [ umbrella.outputs.nixosModules.default ];

services.umbrella = {
  enable = true;
  nodes = [ "https://..." ];
  fallbacks = [];
  validators = [ "penumbravalid1..." ];
};
```
On the same host, you can then configure a Prometheus scrape configuration for Umbrella:
```nix
services.prometheus = {
  enable = true;
  scrapeConfigs = [
    {
      job_name = "umbrella";
      scrape_interval = "5s";
      scrape_timeout = "5s";
      metrics_path = "/metrics";
      scheme = "http";
      static_configs = [
        {
          targets = [ "localhost:1984" ];
          labels.instance = "umbrella";
        }
      ];
      relabel_configs = [
        {
          source_labels = [ "__address__" ];
          target_label = "address";
        }
      ];
    }
  ];
};
```
Now Prometheus should be able to see your Umbrella metrics.
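To spot-check the end-to-end path, you can ask Prometheus for one of the metrics directly over its HTTP API; assuming the NixOS default listen port of 9090, something like:

```bash
curl -s 'http://localhost:9090/api/v1/query?query=update_success'
```

If that returns a non-empty result vector, scraping is working.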