Redalert monitors your infrastructure and sends notifications when something goes wrong (e.g. pinging your websites/APIs via HTTP GET at specified intervals, and alerting you if there is downtime).
- Website monitoring & latency measurement (check type: `web-ping`)
- Server metrics from local machine (check type: `scollector`)
- Docker container metrics (check type: `docker-stats`)
- Docker container metrics from remote host via SSH (check type: `remote-docker`)
- Postgres counts/stats via SQL queries (check type: `postgres`)
- TCP connectivity monitoring & latency measurement (check type: `tcp`)
- Execute local commands & capture output (check type: `command`)
- Execute remote commands via SSH & capture output (check type: `remote-command`)
- Run test suite and capture report metrics via JUnit XML format (check type: `test-report`)
Checks run at specified intervals or on an explicit trigger (i.e. via the trigger check API endpoint).
- Alert notifications available on several channels:
  - sending email (`gmail`)
  - sending SMS (`twilio`)
  - posting a message to Slack (`slack`)
  - unix stream (`stderr`)
- Provides ping status & latency info to `stdout`.
- Adjustable back-off after a check fails (constant, linear, exponential - see notes below).
- Includes a web UI as indicated by the screenshot above. (visit localhost:8888/, configure port via cli flag)
- Triggers a failure alert (`redalert`) when a check is failing, and a recovery alert (`greenalert`) when the check has recovered (e.g. a successful ping, following a failing ping).
- Triggers an alert when specified metric is above/below threshold.
- Assertions are used to define criteria for checks to pass or fail:
  - Assert on metrics
    - source: `metric`
    - `>` or `greater than`
    - `>=` or `greater than or equal`
    - `<` or `less than`
    - `<=` or `less than or equal`
    - `==` or `=` or `equals`
  - Assert on metadata
    - source: `metadata`
    - `web-ping` returns `status_code`
  - Assert on response
    - source: `text`
    - source: `json`
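The comparison names listed above can be read as a small dispatch table. The following is a minimal sketch of that idea, not redalert's own code: `compare` and `assertMetric` are hypothetical helpers, and parsing the string `target` as a float is an assumption based on the quoted targets in the sample configs.

```go
package main

import (
	"fmt"
	"strconv"
)

// compare is a hypothetical helper, not part of redalert. It applies one of
// the documented comparison names to a measured value and a target.
func compare(comparison string, value, target float64) (bool, error) {
	switch comparison {
	case ">", "greater than":
		return value > target, nil
	case ">=", "greater than or equal":
		return value >= target, nil
	case "<", "less than":
		return value < target, nil
	case "<=", "less than or equal":
		return value <= target, nil
	case "==", "=", "equals":
		return value == target, nil
	default:
		return false, fmt.Errorf("unknown comparison %q", comparison)
	}
}

// assertMetric parses the string target (assumed numeric for metrics) and
// evaluates the comparison against the measured value.
func assertMetric(comparison string, value float64, target string) (bool, error) {
	t, err := strconv.ParseFloat(target, 64)
	if err != nil {
		return false, fmt.Errorf("target %q is not numeric: %w", target, err)
	}
	return compare(comparison, value, t)
}

func main() {
	// A latency of 250 asserted to be "less than" 1100.
	ok, err := assertMetric("less than", 250, "1100")
	fmt.Println(ok, err)
}
```

Both the symbolic and the spelled-out forms map to the same branch, so a config can use whichever reads better.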
Endpoint | Description |
---|---|
GET /v1/stats | Retrieve stats for all checks |
POST /v1/checks/{check_id}/disable | Disable check |
POST /v1/checks/{check_id}/enable | Enable check |
POST /v1/checks/{check_id}/trigger | Trigger check |
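As an illustration of calling the trigger endpoint from Go, here is a hedged sketch. It assumes the API is served on the same port as the web UI (8888 by default), which the README does not state explicitly; `newTriggerRequest`, `triggerCheck` and the check ID are hypothetical, and treating any non-2xx status as an error is a choice made for this sketch.

```go
package main

import (
	"fmt"
	"net/http"
	"net/url"
)

// newTriggerRequest builds the POST request for the trigger endpoint.
// baseURL and checkID are supplied by the caller.
func newTriggerRequest(baseURL, checkID string) (*http.Request, error) {
	endpoint := baseURL + "/v1/checks/" + url.PathEscape(checkID) + "/trigger"
	return http.NewRequest(http.MethodPost, endpoint, nil)
}

// triggerCheck sends the request and reports any non-2xx status as an error.
func triggerCheck(client *http.Client, baseURL, checkID string) error {
	req, err := newTriggerRequest(baseURL, checkID)
	if err != nil {
		return err
	}
	resp, err := client.Do(req)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode < 200 || resp.StatusCode > 299 {
		return fmt.Errorf("trigger failed: %s", resp.Status)
	}
	return nil
}

func main() {
	// "example-check-id" is a placeholder, not a real ID.
	req, err := newTriggerRequest("http://localhost:8888", "example-check-id")
	if err != nil {
		fmt.Println("error:", err)
		return
	}
	// Only print the request here; call triggerCheck(http.DefaultClient, ...)
	// against a running instance to actually send it.
	fmt.Println(req.Method, req.URL)
}
```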
      ┌──────────────────────────────┐
      │                              │
┌────▶│     Redalert Check Flow      │
│     │                              │
│     └──────────────────────────────┘
│                    │
│          @interval or ->trigger          ┌──────────────────────┐
│                    │                  ┌─▶│  error during check  │
│                    ▼                  │  └──────────────────────┘
│         ┌──────────────────────┐      │  ┌──────────────────────┐
│         │  is check failing?   │◀─────┤  │  failing assertions  │
│         └──────────────────────┘      │  │     * metrics *      │
│                    │                  └─▶│     * metadata *     │
│          ┌───YES───┴───NO────┐           │     * response *     │
│          │                   │           └──────────────────────┘
│          ▼                   ▼
│  ┌───────────────┐   ┌───────────────┐
│  │send alerts via│   │   is check    │
│  │   notifiers   │   │  recovering?  │
│  └───────────────┘   └───────────────┘
│  ┌───────────────┐          YES
│  │adjust backoff │           │
│  └───────────────┘           ▼
│          │           ┌───────────────┐
│          │           │send alerts via│
│          │           │   notifiers   │
│          │           └───────────────┘
│          │           ┌───────────────┐
│          │           │ reset backoff │
│          │           └───────────────┘
│          │                   │
│          ▼                   ▼
│         ┌──────────────────────┐
└─────────│    Event Storage     │
          └──────────────────────┘
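The alerting branch of the check flow can be sketched as a small state transition. This is an illustrative reading of the flow, not redalert's implementation: `nextAlert` is a hypothetical helper, and it deliberately ignores the back-off and notification preferences described elsewhere in this README.

```go
package main

import "fmt"

// nextAlert is a hypothetical helper: given whether the previous and the
// current check runs failed, it names the alert to send, or "" for none.
func nextAlert(wasFailing, isFailing bool) string {
	switch {
	case isFailing:
		return "redalert" // failure alert
	case wasFailing:
		return "greenalert" // recovery alert: a success following a failure
	default:
		return "" // healthy before and now: nothing to send
	}
}

func main() {
	// Simulated sequence of check results; true means the check failed.
	results := []bool{false, true, true, false, false}
	wasFailing := false
	for i, failing := range results {
		fmt.Printf("run %d: failing=%v alert=%q\n", i+1, failing, nextAlert(wasFailing, failing))
		wasFailing = failing
	}
}
```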
Run via Docker:
docker run -d -P -v /path/to/config.json:/config.json jonog/redalert
Quick bootstrap example:
curl https://gist.githubusercontent.com/jonog/32c953aedf03edf71acaef53d89ce785/raw/e87f7e933165574e1d441781465223bfe6c3f1aa/sample_redalert_config.json > /tmp/sample_redalert_config.json && \
docker run -d -P -v /tmp/sample_redalert_config.json:/config.json --name test_redalert jonog/redalert && \
open "http://$(docker port test_redalert 8888)"
Get started with the `redalert` command:
Usage:
redalert [command]
Available Commands:
checks List checks
config-sync Sync file and database configurations
server Run checks and server stats
version Print the version number of Redalert
Flags:
-d, --config-db string config database url
-f, --config-file string config file (default "config.json")
-s, --config-s3 string config S3
-u, --config-url string config url
-h, --help help for redalert
-p, --port int port to run web server (default 8888)
-r, --rpc-port int port to run RPC server (default 8889)
Use "redalert [command] --help" for more information about a command.
Configure servers to monitor & alert settings via a configuration file:
- a local file (specified by `-f` or `--config-file`) - defaults to `config.json`
- a file remotely accessible via HTTP (specified by `-u` or `--config-url`)
- a file hosted in an AWS S3 bucket (specified by `-s` or `--config-s3`)
TODO: document Postgres configuration option
{
"checks":[
{
"name":"Google",
"type": "web-ping",
"config": {
"address":"http://google.com"
},
"send_alerts": ["stderr"],
"backoff": {
"type": "constant",
"interval": 10
},
"assertions": [
{
"comparison": "==",
"identifier": "status_code",
"source": "metadata",
"target": "200"
}
]
}
],
"notifications": []
}
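To show how the minimal configuration above maps onto typed data, here is a hedged Go sketch that decodes it with `encoding/json`. The struct types are illustrative and cover only the fields visible in that sample; redalert's own types may differ, and `parseConfig` is a hypothetical helper.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// Illustrative types mirroring the fields in the sample config only.
type Assertion struct {
	Comparison string `json:"comparison"`
	Identifier string `json:"identifier"`
	Source     string `json:"source"`
	Target     string `json:"target"`
}

type Backoff struct {
	Type     string `json:"type"`
	Interval int    `json:"interval"`
}

type Check struct {
	Name       string          `json:"name"`
	Type       string          `json:"type"`
	Config     json.RawMessage `json:"config"` // shape depends on the check type
	SendAlerts []string        `json:"send_alerts"`
	Backoff    Backoff         `json:"backoff"`
	Assertions []Assertion     `json:"assertions"`
}

type Config struct {
	Checks []Check `json:"checks"`
}

// sample is the minimal configuration from this README.
const sample = `{
  "checks": [{
    "name": "Google",
    "type": "web-ping",
    "config": {"address": "http://google.com"},
    "send_alerts": ["stderr"],
    "backoff": {"type": "constant", "interval": 10},
    "assertions": [{
      "comparison": "==",
      "identifier": "status_code",
      "source": "metadata",
      "target": "200"
    }]
  }],
  "notifications": []
}`

// parseConfig decodes a config document; unknown fields are ignored.
func parseConfig(data []byte) (Config, error) {
	var cfg Config
	err := json.Unmarshal(data, &cfg)
	return cfg, err
}

func main() {
	cfg, err := parseConfig([]byte(sample))
	if err != nil {
		fmt.Println("invalid config:", err)
		return
	}
	for _, c := range cfg.Checks {
		fmt.Printf("%s (%s): %d assertion(s), %s back-off, interval %d\n",
			c.Name, c.Type, len(c.Assertions), c.Backoff.Type, c.Backoff.Interval)
	}
}
```

Keeping `config` as `json.RawMessage` defers decoding until the check type is known, since each check type takes different settings.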
{
"checks": [
{
"name": "Demo HTTP Status Check",
"type": "web-ping",
"config": {
"address": "http://httpstat.us/200",
"headers": {
"X-Api-Key": "ABCD1234"
}
},
"send_alerts": [
"stderr"
],
"backoff": {
"interval": 10,
"type": "constant"
},
"assertions": [
{
"comparison": "==",
"identifier": "status_code",
"source": "metadata",
"target": "200"
}
]
},
{
"name": "Demo Response Check",
"type": "web-ping",
"config": {
"address": "http://httpstat.us/400"
},
"send_alerts": [
"stderr",
"email",
"chat",
"sms"
],
"backoff": {
"interval": 10,
"type": "linear"
},
"assertions": [
{
"comparison": "less than",
"identifier": "latency",
"source": "metric",
"target": "1100"
},
{
"comparison": "==",
"identifier": "status_code",
"source": "metadata",
"target": "400"
},
{
"comparison": "==",
"source": "text",
"target": "400 Bad Request"
}
],
"verbose_logging": true
},
{
"name": "Demo Exponential Backoff",
"type": "web-ping",
"config": {
"address": "http://httpstat.us/200"
},
"send_alerts": [
"stderr"
],
"backoff": {
"interval": 10,
"multiplier": 2,
"type": "exponential"
},
"assertions": [
{
"comparison": "==",
"identifier": "status_code",
"source": "metadata",
"target": "500"
}
]
},
{
"name": "Docker Redis",
"type": "tcp",
"config": {
"host": "192.168.99.100",
"port": 1001
},
"send_alerts": [
"stderr"
],
"backoff": {
"interval": 10,
"type": "constant"
}
},
{
"name": "Docker stats",
"type": "docker-stats",
"config": {},
"send_alerts": [
"stderr"
],
"backoff": {
"interval": 30,
"type": "linear"
}
},
{
"name": "production-docker-host",
"type": "remote-docker",
"config": {
"host": "ec2-xx-xxx-xx-xxx.ap-southeast-1.compute.amazonaws.com",
"user": "ubuntu"
},
"send_alerts": [
"stderr"
],
"backoff": {
"interval": 5,
"type": "linear"
}
},
{
"name": "scollector-metrics",
"type": "scollector",
"config": {
"host": "hostname"
},
"send_alerts": [
"stderr"
],
"backoff": {
"interval": 15,
"type": "constant"
}
},
{
"name": "production-db",
"type": "postgres",
"config": {
"connection_url": "postgres://user:pass@localhost:5432/dbname?sslmode=disable",
"metric_queries": [
{
"metric": "client_count",
"query": "select count(*) from clients"
}
]
},
"send_alerts": [
"stderr"
],
"backoff": {
"interval": 120,
"type": "linear"
}
},
{
"name": "README size",
"type": "command",
"config": {
"command": "cat README.md | wc -l",
"output_type": "number"
},
"send_alerts": [
"stderr"
],
"backoff": {
"interval": 10,
"type": "constant"
}
},
{
"name": "List files",
"type": "command",
"config": {
"command": "ls"
},
"send_alerts": [
"stderr"
],
"backoff": {
"interval": 10,
"type": "constant"
}
},
{
"name": "SSH into docker-alpine-sshd",
"type": "remote-command",
"config": {
"command": "uptime",
"ssh_auth_options": {
"user": "root",
"password": "root",
"host": "localhost",
"port": 2222
}
},
"send_alerts": [
"stderr"
],
"assertions": [
{
"comparison": "==",
"identifier": "exit_status",
"source": "metadata",
"target": "0"
}
]
},
{
"name": "Run Smoke Tests",
"type": "test-report",
"config": {
"command": "./run-smoke-tests.sh"
},
"send_alerts": [
"stderr"
],
"assertions": [
{
"comparison": "==",
"identifier": "status",
"source": "metadata",
"target": "PASSING"
}
]
}
],
"notifications": [
{
"name": "email",
"type": "gmail",
"config": {
"notification_addresses": "",
"pass": "",
"user": ""
}
},
{
"name": "chat",
"type": "slack",
"config": {
"channel": "#general",
"icon_emoji": ":rocket:",
"username": "redalert",
"webhook_url": ""
}
},
{
"name": "sms",
"type": "twilio",
"config": {
"account_sid": "",
"auth_token": "",
"notification_numbers": "",
"twilio_number": ""
}
}
],
"preferences": {
"notifications": {
"fail_count_alert_threshold": 2,
"repeat_fail_alerts": false
}
}
}
Build and run (capture stderr).
go build
./redalert 2> errors.log
- `fail_count_alert_threshold` controls sending an alert only after N fails (defaults to 1)
- `repeat_fail_alerts` controls whether fail alerts are repeated on consecutive failing checks (defaults to false)
"preferences": {
"notifications": {
"fail_count_alert_threshold": 2,
"repeat_fail_alerts": false
}
}
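One way to read how the two preferences interact is sketched below. This is an interpretation, not redalert's code: `shouldSendFailAlert` is a hypothetical helper, and the assumption is that the alert fires when the consecutive failure count first reaches the threshold and fires again on later failures only if repeats are enabled.

```go
package main

import "fmt"

// shouldSendFailAlert is a hypothetical reading of the two preferences:
// stay quiet until failCount reaches threshold, alert when it does, and
// alert again on later consecutive failures only if repeat is true.
func shouldSendFailAlert(failCount, threshold int, repeat bool) bool {
	if threshold < 1 {
		threshold = 1 // documented default
	}
	switch {
	case failCount < threshold:
		return false
	case failCount == threshold:
		return true
	default:
		return repeat
	}
}

func main() {
	// Mirrors the preferences above: threshold 2, no repeated alerts.
	for n := 1; n <= 4; n++ {
		fmt.Printf("consecutive failures=%d alert=%v\n", n, shouldSendFailAlert(n, 2, false))
	}
}
```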
When a server check fails, the next check is delayed according to the back-off algorithm. By default, there is no added delay (i.e. `constant` back-off), with a default interval of 10 seconds between checks. When a failing server returns to normal, the check frequency returns to its original value.
- `constant`: the pinging interval remains constant, i.e. there is no back-off after failure.
- `linear`: the pinging interval is extended linearly upon failure, i.e. failure count x pinging interval.
- `exponential`: with each failure, the subsequent check is delayed by the last delay times a multiplier, so the time between checks increases exponentially. The `multiplier` is set to 2 by default.
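The three back-off behaviours can be sketched as a single function. This is a minimal sketch under stated assumptions, not redalert's implementation: `nextDelay` is a hypothetical helper, the exponential delay after n consecutive failures is taken to be interval x multiplier^n, and with no failures every type falls back to the base interval.

```go
package main

import (
	"fmt"
	"math"
	"time"
)

// nextDelay is a sketch, not redalert's code. failCount is the number of
// consecutive failures so far.
func nextDelay(kind string, interval time.Duration, multiplier float64, failCount int) time.Duration {
	if failCount <= 0 {
		return interval // healthy: original check frequency
	}
	switch kind {
	case "linear":
		// failure count x pinging interval
		return time.Duration(failCount) * interval
	case "exponential":
		if multiplier <= 0 {
			multiplier = 2 // documented default
		}
		// Each failure multiplies the previous delay by the multiplier.
		return time.Duration(float64(interval) * math.Pow(multiplier, float64(failCount)))
	default:
		// "constant": no back-off after failure
		return interval
	}
}

func main() {
	interval := 10 * time.Second
	for _, kind := range []string{"constant", "linear", "exponential"} {
		for fails := 1; fails <= 3; fails++ {
			fmt.Printf("%-11s failures=%d delay=%v\n", kind, fails, nextDelay(kind, interval, 2, fails))
		}
	}
}
```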
If there are errors sending email via gmail, enable "Access for less secure apps" under Account permissions @ https://www.google.com/settings/u/2/security
Dependencies:
- Go dependency manager - glide
- Embedding static assets into binary - go.rice
- `protoc` for gRPC code generation - gRPC
- Docker-machine for tests
Rocket emoji via https://github.com/twitter/twemoji
See GitHub Issues.