GitHub - raksha1/kuberhealthy: Easy synthetic testing for Kubernetes clusters. Works great with Prometheus.

An operator for synthetic monitoring on Kubernetes. Write your own tests in your own container and Kuberhealthy will manage everything else. It automatically creates metrics and sends them to Prometheus and InfluxDB, and it includes a simple JSON status page. Supplements other solutions like Prometheus very nicely!

You can reach out to us directly on the Kubernetes Slack in the #kuberhealthy channel.

What is Kuberhealthy?

Kuberhealthy is an operator for running synthetic checks. By creating a custom resource (a khcheck) in your cluster, you can easily enable various synthetic test containers. Kuberhealthy does all the work of scheduling your checks on an interval you specify (like a CronJob), ensuring they run properly within an allotted timeout, maintaining the current up/down state with durability, and producing metrics. There are lots of useful checks already available to ensure the core functionality of Kubernetes, but checks can be used to test anything you like. We encourage you to write your own check container in any language to test your own applications!
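The interval, timeout, and up/down bookkeeping described above can be pictured with a small Python sketch. This is only a conceptual illustration, not Kuberhealthy's implementation: Kuberhealthy runs checks as pods, whereas here a check is a plain callable that returns a list of error strings. The names `run_check` and `run_on_interval` are hypothetical.

```python
import time
from concurrent.futures import ThreadPoolExecutor, TimeoutError as FutureTimeout


def run_check(check, timeout_seconds):
    """Run one check callable and return a record with OK and Errors fields.

    `check` returns a list of error strings; an empty list means it passed.
    """
    executor = ThreadPoolExecutor(max_workers=1)
    future = executor.submit(check)
    try:
        errors = list(future.result(timeout=timeout_seconds))
    except FutureTimeout:
        # The check overran its allotted timeout, so it is reported as down.
        errors = [f"check did not finish within {timeout_seconds}s"]
    except Exception as exc:
        errors = [f"check raised: {exc}"]
    finally:
        # Do not block on a check that is still running in the background.
        executor.shutdown(wait=False)
    return {"OK": not errors, "Errors": errors}


def run_on_interval(check, interval_seconds, timeout_seconds, runs):
    """Run the check `runs` times, sleeping between runs; return the latest state."""
    state = None
    for i in range(runs):
        state = run_check(check, timeout_seconds)
        if i < runs - 1:
            time.sleep(interval_seconds)
    return state


if __name__ == "__main__":
    print(run_on_interval(lambda: [], interval_seconds=0.1, timeout_seconds=1.0, runs=2))
```

A check that overruns its timeout is treated the same as one that reports errors: the resulting state is down, with the reason recorded in `Errors`.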

Kuberhealthy serves a simple JSON status page, a Prometheus metrics endpoint, and supports InfluxDB metric forwarding for integration into your choice of alerting solution.

Here is an illustration of how Kuberhealthy provisions and operates checker pods. In this example, the checker pod both deploys a daemonset and tears it down while carefully watching for errors. The result of the check is then sent back to Kuberhealthy and channeled into upstream metrics and status pages to indicate basic Kubernetes cluster functionality across all nodes in a cluster.

Create Synthetic Checks for Your App

With Kuberhealthy, you can easily create synthetic tests to check your applications with real world use cases. Read more about how external checks are configured in the documentation here and learn how to create your own check container in any language here.

Installation

Helm installations are currently not available from helm/charts/kuberhealthy due to a slow PR process. For now, use the flat files below.

Helm 3 is required for all chart installations.

To install using Helm 3 without Prometheus metrics: helm install stable/kuberhealthy

To install using Helm 3 with Prometheus metrics: helm install stable/kuberhealthy --set prometheus.enabled=true --set prometheus.enableScraping=true --set prometheus.enableAlerting=true

To install using Helm 3 with a Prometheus ServiceMonitor: helm install stable/kuberhealthy --set prometheus.enabled=true --set prometheus.enableScraping=true --set prometheus.enableAlerting=true --set prometheus.serviceMonitor=true

You can also use a flat spec file if you don't want to use Helm: kubectl apply -f https://raw.githubusercontent.com/Comcast/kuberhealthy/deploy/kuberhealthy.yaml

To install using other flat yaml spec files, see the deploy directory.

After installation, Kuberhealthy will only be available from within the cluster (Type: ClusterIP) at the service URL kuberhealthy.kuberhealthy. To expose Kuberhealthy to an external checking service, you must edit the service kuberhealthy and set Type: LoadBalancer. This is done for security. Options are available in the Helm chart to bypass this and deploy with Type: LoadBalancer directly.

RBAC bindings and roles are included in all configurations.

Kuberhealthy is currently tested on Kubernetes 1.9.x to 1.15.x.

Prometheus Alerts

A ServiceMonitor configuration is available at deploy/servicemonitor.yaml.

Prometheus Grafana Dashboard

A Grafana dashboard is available at deploy/grafana/dashboard.json. To install this dashboard, follow the instructions here.

Why Are Synthetic Tests Important?

Instead of trying to identify all the things that could potentially go wrong in your application or cluster with never-ending metrics and alert configurations, synthetic tests replicate real workflow and carefully check for the expected behavior to occur. By default, Kuberhealthy monitors all basic Kubernetes cluster functionality including deployments, daemonsets, services, nodes, kube-system health and more.

Some examples of problems Kuberhealthy has detected in production with just the default checks enabled:

Nodes where new pods get stuck in Terminating due to CNI communication failures
Nodes where new pods get stuck in ContainerCreating due to disk provisioning errors
Nodes where new pods get stuck in Pending due to container runtime errors
Nodes where Docker or Kubelet is in a bad state but passing health checks
Nodes that are unable to properly communicate with the api server due to kube-api request limiting
Nodes that cannot provision or terminate pods quickly enough (15m) due to high I/O wait
A pod in the kube-system namespace that has begun restarting too quickly
An unexpected admission controller failure causing pod creation failure
Intermittent failures to access or create custom resources
kube-dns/CoreDNS DNS lookup failures (internal and external)
... more!

Status Page

You can directly access the current test statuses by accessing the kuberhealthy.kuberhealthy HTTP service on port 80. The status page displays server status in the format shown below. The boolean OK field can be used to indicate global up/down status, while the Errors array will contain a list of all check error descriptions. Granular, per-check information, including the last time a check was run and the Kuberhealthy pod that ran that specific check, is available under the CheckDetails object.

{
    "OK": true,
    "Errors": [],
    "CheckDetails": {
        "kuberhealthy/daemonset": {
            "OK": true,
            "Errors": [],
            "Namespace": "kuberhealthy",
            "LastRun": "2019-11-14T23:24:16.7718171Z",
            "AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
            "uuid": "9abd3ec0-b82f-44f0-b8a7-fa6709f759cd"
        },
        "kuberhealthy/deployment": {
            "OK": true,
            "Errors": [],
            "Namespace": "kuberhealthy",
            "LastRun": "2019-11-14T23:26:40.7444659Z",
            "AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
            "uuid": "5f0d2765-60c9-47e8-b2c9-8bc6e61727b2"
        },
        "kuberhealthy/dns-status-internal": {
            "OK": true,
            "Errors": [],
            "Namespace": "kuberhealthy",
            "LastRun": "2019-11-14T23:34:04.8927434Z",
            "AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
            "uuid": "c85f95cb-87e2-4ff5-b513-e02b3d25973a"
        },
        "kuberhealthy/pod-restarts": {
            "OK": true,
            "Errors": [],
            "Namespace": "kuberhealthy",
            "LastRun": "2019-11-14T23:34:06.1938491Z",
            "AuthoritativePod": "kuberhealthy-67bf8c4686-mbl2j",
            "uuid": "a718b969-421c-47a8-a379-106d234ad9d8"
        }
    },
    "CurrentMaster": "kuberhealthy-7cf79bdc86-m78qr"
}
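As a rough illustration of consuming this format, the Python sketch below pulls the failing checks out of a decoded status document. The helper names `failing_checks` and `fetch_status` are hypothetical, the URL assumes the status page is served at the root path of the in-cluster service, and the failing entry in the sample is invented for the example.

```python
import json
from urllib.request import urlopen

# Assumption: the in-cluster service name from this README, plain HTTP on port 80,
# with the status page at the root path.
STATUS_URL = "http://kuberhealthy.kuberhealthy/"


def failing_checks(status):
    """Return {check name: [error strings]} for every check whose OK field is false."""
    details = status.get("CheckDetails") or {}
    return {
        name: list(detail.get("Errors") or [])
        for name, detail in details.items()
        if not detail.get("OK", False)
    }


def fetch_status(url=STATUS_URL, timeout=5.0):
    """Fetch and decode the status page; reachable only from inside the cluster by default."""
    with urlopen(url, timeout=timeout) as response:
        return json.load(response)


# Trimmed sample in the format shown above, with one hypothetical failure added.
sample = json.loads("""
{
    "OK": false,
    "Errors": ["hypothetical error text"],
    "CheckDetails": {
        "kuberhealthy/daemonset": {"OK": true, "Errors": []},
        "kuberhealthy/deployment": {"OK": false, "Errors": ["hypothetical error text"]}
    }
}
""")

if __name__ == "__main__":
    for name, errors in failing_checks(sample).items():
        print(name, errors)
```

An external alerting script could call `fetch_status()` from a pod inside the cluster and alert whenever the top-level `OK` field is false, using `failing_checks` to name the affected checks.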

High Availability

Kuberhealthy scales horizontally in order to be fault tolerant. By default, two instances are used with a pod disruption budget and RollingUpdate strategy to ensure high availability.

Centralized Check State

The state of checks is centralized as custom resource records. This allows Kuberhealthy to always serve the same result, no matter which node in the pool you hit. The current master running checks is calculated by all nodes in the deployment by simply querying the Kubernetes API for 'Ready' Kuberhealthy pods of the correct label, and sorting them alphabetically by name. The node that comes first is master. These two strategies together enable Kuberhealthy to maintain state and scale horizontally without deploying an additional backing database.
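The master calculation described above is simple enough to sketch in a few lines of Python. This is an illustration of the stated rule only, not the project's code: the `Pod` type, the `elect_master` function, and the `app: kuberhealthy` label are assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Pod:
    name: str
    ready: bool
    labels: dict = field(default_factory=dict)


def elect_master(pods, selector):
    """Return the name of the first Ready pod matching the selector, sorted by name.

    Every instance can run this against the same pod list and reach the same
    answer without talking to the other instances. Returns None if no pod qualifies.
    """
    candidates = sorted(
        pod.name
        for pod in pods
        if pod.ready and all(pod.labels.get(k) == v for k, v in selector.items())
    )
    return candidates[0] if candidates else None


# Hypothetical label; the real deployment's labels may differ.
selector = {"app": "kuberhealthy"}
pods = [
    Pod("kuberhealthy-7cf79bdc86-x2k9d", True, {"app": "kuberhealthy"}),
    Pod("kuberhealthy-67bf8c4686-mbl2j", False, {"app": "kuberhealthy"}),  # not Ready
    Pod("kuberhealthy-7cf79bdc86-m78qr", True, {"app": "kuberhealthy"}),
]

if __name__ == "__main__":
    print(elect_master(pods, selector))
```

Because the result depends only on the pod list returned by the Kubernetes API, no leader-election lock or extra datastore is needed. When the current master stops being Ready, the next pod in sorted order takes over.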

Security Considerations

By default, Kuberhealthy exposes an insecure (non-HTTPS) JSON status endpoint without authentication. You should never expose this endpoint to the public internet: when checks fail, the error descriptions displayed on the status page can reveal private cluster information.

Vulnerabilities or other security related issues should be logged as git issues in this project and immediately reported to The Security Incident Response Team (SIRT) via email at [email protected]. Please do not post sensitive information in git issues.
