OSD-6646: Manage osd-cluster-ready #143

Closed
8 changes: 8 additions & 0 deletions Makefile
@@ -1,6 +1,14 @@
 # Include boilerplate's generated Makefile libraries
 include boilerplate/generated-includes.mk

+# ===> TODO: Remove this override once the boilerplate backing image has go-bindata
+.PHONY: go-generate
+go-generate:
+	go get github.com/go-bindata/go-bindata/...@v3.1.2
+	${GOENV} go generate $(TESTTARGETS)
+	# Don't forget to commit generated files
+# <=== TODO: Remove this override once the boilerplate backing image has go-bindata
+
 .PHONY: boilerplate-update
 boilerplate-update:
 	@boilerplate/update
84 changes: 84 additions & 0 deletions README.md
@@ -5,6 +5,17 @@
[![codecov](https://codecov.io/gh/openshift/configure-alertmanager-operator/branch/master/graph/badge.svg)](https://codecov.io/gh/openshift/configure-alertmanager-operator)
[![License](https://img.shields.io/:license-apache-blue.svg)](http://www.apache.org/licenses/LICENSE-2.0.html)

- [configure-alertmanager-operator](#configure-alertmanager-operator)
  - [Summary](#summary)
  - [Cluster Readiness](#cluster-readiness)
  - [Metrics](#metrics)
  - [Alerts](#alerts)
  - [Testing](#testing)
    - [Building](#building)
    - [Deploying](#deploying)
      - [Prevent Overwrites](#prevent-overwrites)
      - [Replace the Image](#replace-the-image)

## Summary
The Configure Alertmanager Operator was created for the OpenShift Dedicated platform to dynamically manage Alertmanager configurations based on the presence or absence of secrets containing a Pager Duty RoutingKey and [Dead Man's Snitch](https://deadmanssnitch.com) URL. When the secret is created/updated/deleted, the associated Receiver and Route will be created/updated/deleted within the Alertmanager config.
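
As a rough illustration of that mapping (not the operator's actual code), the sketch below derives the set of Alertmanager receivers from the secret values that are present. The function `desiredReceivers` and the receiver names `pagerduty` and `watchdog` are hypothetical, and an empty string stands in for a missing secret.

```go
package main

import "fmt"

// desiredReceivers returns the receiver names that should exist in the
// Alertmanager config. An empty value means the corresponding secret is
// absent, so its receiver (and route) is left out.
// The names "pagerduty" and "watchdog" are illustrative assumptions.
func desiredReceivers(pagerDutyRoutingKey, snitchURL string) []string {
	receivers := []string{}
	if pagerDutyRoutingKey != "" {
		receivers = append(receivers, "pagerduty")
	}
	if snitchURL != "" {
		receivers = append(receivers, "watchdog")
	}
	return receivers
}

func main() {
	// Placeholder values; real keys and URLs come from the secrets.
	fmt.Println(desiredReceivers("example-key", ""))
	fmt.Println(desiredReceivers("example-key", "https://example.invalid/snitch"))
	fmt.Println(desiredReceivers("", ""))
}
```

Deleting a secret then simply shows up as an empty value on the next reconcile, which removes the matching receiver from the desired set.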

@@ -13,6 +24,9 @@ The operator contains the following components:
* Secret controller: watches the `openshift-monitoring` namespace for any changes to Secrets named `alertmanager-main`, `pd-secret` or `dms-secret`.
* Types library: these types are imported from the Alertmanager [Config](https://github.com/prometheus/alertmanager/blob/master/config/config.go) library and pared down to suit our config needs, since their library is [intended for internal use only](https://github.com/prometheus/alertmanager/pull/1804#issuecomment-482038079).
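
A minimal sketch of the name filter the Secret controller applies, assuming only the namespace and Secret names listed above. The helper `isWatchedSecret` is hypothetical; in the operator, a check like this could live in the reconcile loop or in an event filter.

```go
package main

import "fmt"

// monitoringNamespace is the namespace the Secret controller watches.
const monitoringNamespace = "openshift-monitoring"

// watchedSecrets lists the Secret names the controller cares about,
// as given in the component description.
var watchedSecrets = map[string]bool{
	"alertmanager-main": true,
	"pd-secret":         true,
	"dms-secret":        true,
}

// isWatchedSecret reports whether a Secret event should trigger a reconcile.
func isWatchedSecret(namespace, name string) bool {
	return namespace == monitoringNamespace && watchedSecrets[name]
}

func main() {
	fmt.Println(isWatchedSecret("openshift-monitoring", "pd-secret"))
	fmt.Println(isWatchedSecret("openshift-monitoring", "some-other-secret"))
	fmt.Println(isWatchedSecret("default", "pd-secret"))
}
```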

## Cluster Readiness
To avoid alert noise while a cluster is in the early stages of being installed and configured, this operator waits to configure Pager Duty -- effectively silencing alerts -- until a predetermined set of health checks has succeeded.
The operator uses [osd-cluster-ready](https://github.com/openshift/osd-cluster-ready/) to perform these health checks.
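
The gating idea can be sketched as follows. This is a simplified, self-contained illustration rather than the operator's code: `readinessChecker`, `fixedReadiness`, and `effectiveRoutingKey` are hypothetical names, and the real readiness check (running osd-cluster-ready) is replaced by a fixed answer.

```go
package main

import "fmt"

// readinessChecker is a pared-down, hypothetical stand-in for the
// operator's readiness check; only an IsReady call is modeled here.
type readinessChecker interface {
	IsReady() (bool, error)
}

// fixedReadiness is a test double that always reports the same state.
type fixedReadiness struct{ ready bool }

func (f fixedReadiness) IsReady() (bool, error) { return f.ready, nil }

// effectiveRoutingKey returns the Pager Duty routing key to write into the
// Alertmanager config. Until the cluster is ready it returns "", which
// leaves Pager Duty unconfigured and so keeps alerts from paging anyone.
func effectiveRoutingKey(rc readinessChecker, keyFromSecret string) (string, error) {
	ready, err := rc.IsReady()
	if err != nil {
		return "", fmt.Errorf("determining cluster readiness: %w", err)
	}
	if !ready {
		return "", nil
	}
	return keyFromSecret, nil
}

func main() {
	for _, ready := range []bool{false, true} {
		key, err := effectiveRoutingKey(fixedReadiness{ready: ready}, "example-key")
		fmt.Printf("ready=%t key=%q err=%v\n", ready, key, err)
	}
}
```

Only the Pager Duty key is gated in this sketch; Dead Man's Snitch handling is unaffected by readiness.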

## Metrics
The Configure Alertmanager Operator exposes the following Prometheus metrics:
@@ -27,3 +41,73 @@ The following alerts are added to Prometheus as part of configure-alertmanager-o
* Mismatch between DMS secret and DMS Alertmanager config.
* Mismatch between PD secret and PD Alertmanager config.
* Alertmanager config secret does not exist.

## Testing
Tips for testing on a personal cluster:

### Building
You may build (`make docker-build`) and push (`make docker-push`) the operator image to a personal repository by overriding components of the image URI:
- `IMAGE_REGISTRY` overrides the *registry* (default: `quay.io`)
- `IMAGE_REPOSITORY` overrides the *organization* (default: `app-sre`)
- `IMAGE_NAME` overrides the *repository name* (default: `configure-alertmanager-operator`)
- `OPERATOR_IMAGE_TAG` overrides the *image tag*. By default the tag is generated from the current commit of your local clone of the git repository; `make docker-build` also always tags `latest`.

For example, to build, tag, and push `quay.io/my-user/configure-alertmanager-operator:latest`, you can run:

```
make IMAGE_REPOSITORY=my-user docker-build docker-push
```
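
To make the composition of the image URI explicit, here is a small sketch that assembles it from the same four variables. The build itself does this in make, not Go; `envOr` and `imageURI` are hypothetical helpers, the default image name is taken from the example URI above, and `latest` is used as the default tag purely for simplicity.

```go
package main

import (
	"fmt"
	"os"
)

// envOr returns the value of the environment variable key,
// or def if the variable is unset or empty.
func envOr(key, def string) string {
	if v := os.Getenv(key); v != "" {
		return v
	}
	return def
}

// imageURI assembles registry/organization/name:tag.
func imageURI(registry, repository, name, tag string) string {
	return fmt.Sprintf("%s/%s/%s:%s", registry, repository, name, tag)
}

func main() {
	uri := imageURI(
		envOr("IMAGE_REGISTRY", "quay.io"),
		envOr("IMAGE_REPOSITORY", "app-sre"),
		envOr("IMAGE_NAME", "configure-alertmanager-operator"),
		envOr("OPERATOR_IMAGE_TAG", "latest"), // simplification: the real default is commit-based
	)
	fmt.Println(uri)
}
```

Running it with `IMAGE_REPOSITORY=my-user` set in the environment prints the URI from the example above.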

### Deploying

#### Prevent Overwrites

Note: This step requires elevated permissions.

This operator is managed by OLM, so you must scale OLM down (along with the cluster-version-operator, which would otherwise scale OLM back up), or your changes to the operator's Deployment will be overwritten:

```
oc scale deploy/cluster-version-operator --replicas=0 -n openshift-cluster-version
oc scale deploy/olm-operator --replicas=0 -n openshift-operator-lifecycle-manager
```

**NOTE: Don't forget to revert these changes when you have finished testing:**

```
oc scale deploy/olm-operator --replicas=1 -n openshift-operator-lifecycle-manager
oc scale deploy/cluster-version-operator --replicas=1 -n openshift-cluster-version
```

#### Replace the Image
Edit the operator's deployment (`oc edit deployment configure-alertmanager-operator -n openshift-monitoring`), replacing the `image:` with the URI of the image you built [above](#building). The deployment will automatically delete and replace the running pod.

**NOTE:** If you are testing the osd-cluster-ready job, you may need to set the `MAX_CLUSTER_AGE_MINUTES` environment variable in the deployment's `configure-alertmanager-operator` container definition.
For example, to ensure osd-cluster-ready runs in a cluster less than 1048576 minutes (~two years) old:

```yaml
      containers:
      - command:
        - configure-alertmanager-operator
        env:
        - name: WATCH_NAMESPACE
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
        - name: POD_NAME
          valueFrom:
            fieldRef:
              apiVersion: v1
              fieldPath: metadata.name
        - name: OPERATOR_NAME
          value: configure-alertmanager-operator
        ### Add this entry ###
        - name: MAX_CLUSTER_AGE_MINUTES
          value: "1048576"
        image: quay.io/2uasimojo/configure-alertmanager-operator:latest
        imagePullPolicy: Always
        name: configure-alertmanager-operator
        resources: {}
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
```
4 changes: 3 additions & 1 deletion go.mod
@@ -5,14 +5,16 @@ go 1.13
require (
cloud.google.com/go v0.47.0 // indirect
github.com/coreos/prometheus-operator v0.34.0
github.com/go-bindata/go-bindata v3.1.2+incompatible // indirect
github.com/go-openapi/spec v0.19.5-0.20191022081736-744796356cda // indirect
github.com/golang/mock v1.3.1
github.com/json-iterator/go v1.1.8 // indirect
github.com/onsi/ginkgo v1.12.0 // indirect
github.com/onsi/gomega v1.9.0 // indirect
github.com/openshift/api v3.9.1-0.20190924102528-32369d4db2ad+incompatible
github.com/openshift/cluster-operator v0.0.0-20190529110107-668db5da8c20
github.com/operator-framework/operator-sdk v0.16.0
github.com/prometheus/client_golang v1.2.1
github.com/prometheus/common v0.7.0
github.com/spf13/pflag v1.0.5
go.uber.org/multierr v1.2.0 // indirect
go.uber.org/zap v1.11.0 // indirect
6 changes: 3 additions & 3 deletions go.sum
@@ -207,6 +207,8 @@ github.com/globalsign/mgo v0.0.0-20180905125535-1ca0a4f7cbcb/go.mod h1:xkRDCp4j0
github.com/globalsign/mgo v0.0.0-20181015135952-eeefdecb41b8/go.mod h1:xkRDCp4j0OGD1HRkm4kmhM+pmpv3AKq5SU7GMg4oO/Q=
github.com/go-acme/lego v2.5.0+incompatible/go.mod h1:yzMNe9CasVUhkquNvti5nAtPmG94USbYxYrZfTkIn0M=
github.com/go-bindata/go-bindata v3.1.1+incompatible/go.mod h1:xK8Dsgwmeed+BBsSy2XTopBn/8uK2HWuGSnA11C3Joo=
github.com/go-bindata/go-bindata v3.1.2+incompatible h1:5vjJMVhowQdPzjE1LdxyFF7YFTXg5IgGVW4gBr5IbvE=
github.com/go-bindata/go-bindata v3.1.2+incompatible/go.mod h1:xK8Dsgwmeed+BBsSy2XTopBn/8uK2HWuGSnA11C3Joo=
github.com/go-gl/glfw v0.0.0-20190409004039-e6da0acd62b1/go.mod h1:vR7hzQXu2zJy9AVAgeJqvqgH9Q5CA+iKCZ2gyEVpxRU=
github.com/go-kit/kit v0.8.0/go.mod h1:xBxKIO96dXMWWy0MnWVtmwkA9/13aqxPnvrjFYMA2as=
github.com/go-kit/kit v0.9.0/go.mod h1:xBxKIO96dXMWWy0MnWVtmwkA9/13aqxPnvrjFYMA2as=
@@ -295,6 +297,7 @@ github.com/golang/groupcache v0.0.0-20191027212112-611e8accdfc9/go.mod h1:cIg4er
github.com/golang/lint v0.0.0-20180702182130-06c8688daad7/go.mod h1:tluoj9z5200jBnyusfRPU2LqT6J+DAorxEvtC7LHB+E=
github.com/golang/mock v1.1.1/go.mod h1:oTYuIxOrZwtPieC+H1uAHpcLFnEyAGVDL/k47Jfbm0A=
github.com/golang/mock v1.2.0/go.mod h1:oTYuIxOrZwtPieC+H1uAHpcLFnEyAGVDL/k47Jfbm0A=
github.com/golang/mock v1.3.1 h1:qGJ6qTW+x6xX/my+8YUVl4WNpX9B7+/l2tRsHGZ7f2s=
github.com/golang/mock v1.3.1/go.mod h1:sBzyDLLjw3U8JLTeZvSv8jJB+tU5PVekmnlKIyFUx0Y=
github.com/golang/protobuf v0.0.0-20161109072736-4bd1920723d7/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
github.com/golang/protobuf v1.0.0/go.mod h1:6lQm79b+lXiMfvg/cZm0SGofjICqVBUtrP5yJMmIC1U=
@@ -517,8 +520,6 @@ github.com/opencontainers/selinux v1.2.2/go.mod h1:+BLncwf63G4dgOzykXAxcmnFlUaOl
github.com/openshift/api v3.9.1-0.20190924102528-32369d4db2ad+incompatible h1:6il8W875Oq9vycPkRV5TteLP9IfMEX3lyOl5yN+CtdI=
github.com/openshift/api v3.9.1-0.20190924102528-32369d4db2ad+incompatible/go.mod h1:dh9o4Fs58gpFXGSYfnVxGR9PnV53I8TW84pQaJDdGiY=
github.com/openshift/client-go v0.0.0-20190923180330-3b6373338c9b/go.mod h1:6rzn+JTr7+WYS2E1TExP4gByoABxMznR6y2SnUIkmxk=
github.com/openshift/cluster-operator v0.0.0-20190529110107-668db5da8c20 h1:FXE0nwGK3/MC0zUGqa2Wj+xhhawWo0U7q9s5vom+csA=
github.com/openshift/cluster-operator v0.0.0-20190529110107-668db5da8c20/go.mod h1:TOaKmt2XSw3ccak2GoVYvj8UtApTyLBf1vGl3u3+cV4=
github.com/openshift/origin v0.0.0-20160503220234-8f127d736703/go.mod h1:0Rox5r9C8aQn6j1oAOQ0c1uC86mYbUFObzjBRvUKHII=
github.com/openshift/prom-label-proxy v0.1.1-0.20191016113035-b8153a7f39f1/go.mod h1:p5MuxzsYP1JPsNGwtjtcgRHHlGziCJJfztff91nNixw=
github.com/opentracing/opentracing-go v1.1.0/go.mod h1:UkNAQd3GIcIGf0SeVgPpRdFStlNbqXla1AfSYxPUl2o=
@@ -974,7 +975,6 @@ modernc.org/strutil v1.0.0/go.mod h1:lstksw84oURvj9y3tn8lGvRxyRC1S2+g5uuIzNfIOBs
modernc.org/xc v1.0.0/go.mod h1:mRNCo0bvLjGhHO9WsyuKVU4q0ceiDDDoEeWDJHrNx8I=
rsc.io/binaryregexp v0.2.0/go.mod h1:qTv7/COck+e2FymRvadv62gMdZztPaShugOCi3I+8D8=
rsc.io/letsencrypt v0.0.1/go.mod h1:buyQKZ6IXrRnB7TdkHP0RyEybLx18HHyOSoTyoOLqNY=
sigs.k8s.io/cluster-api v0.3.9 h1:WongQFeW+vbII9Karc3nIarxMfuUuTr33QU9aSyiKfs=
sigs.k8s.io/controller-runtime v0.4.0 h1:wATM6/m+3w8lj8FXNaO6Fs/rq/vqoOjO1Q116Z9NPsg=
sigs.k8s.io/controller-runtime v0.4.0/go.mod h1:ApC79lpY3PHW9xj/w9pj+lYkLgwAAUZwfXkME1Lajns=
sigs.k8s.io/controller-tools v0.2.4/go.mod h1:m/ztfQNocGYBgTTCmFdnK94uVvgxeZeE3LtJvd/jIzA=
7 changes: 6 additions & 1 deletion manifests/01_role.yaml
@@ -80,4 +80,9 @@ rules:
   - watch
   - patch
   - update
-
+- apiGroups:
+  - batch
+  resources:
+  - jobs
+  verbs:
+  - "*"
14 changes: 14 additions & 0 deletions manifests/03_role_binding.yaml
@@ -38,3 +38,17 @@ subjects:
 - kind: ServiceAccount
   name: configure-alertmanager-operator
   namespace: openshift-monitoring
+---
+apiVersion: rbac.authorization.k8s.io/v1
+kind: ClusterRoleBinding
+metadata:
+  name: configure-alertmanager-operator.prom
+  namespace: openshift-monitoring
+roleRef:
+  apiGroup: rbac.authorization.k8s.io
+  kind: ClusterRole
+  name: cluster-monitoring-view
+subjects:
+- kind: ServiceAccount
+  name: configure-alertmanager-operator
+  namespace: openshift-monitoring
35 changes: 29 additions & 6 deletions pkg/controller/secret/secret_controller.go
@@ -19,6 +19,7 @@ import (

 	"github.com/openshift/configure-alertmanager-operator/config"
 	"github.com/openshift/configure-alertmanager-operator/pkg/metrics"
+	"github.com/openshift/configure-alertmanager-operator/pkg/readiness"
 	alertmanager "github.com/openshift/configure-alertmanager-operator/pkg/types"

 	configv1 "github.com/openshift/api/config/v1"
@@ -66,8 +67,9 @@ var _ reconcile.Reconciler = &ReconcileSecret{}
 type ReconcileSecret struct {
 	// This client, initialized using mgr.Client() above, is a split client
 	// that reads objects from the cache and writes to the apiserver
-	client client.Client
-	scheme *runtime.Scheme
+	client    client.Client
+	scheme    *runtime.Scheme
+	readiness readiness.Interface
 }

 // Add creates a new Secret Controller and adds it to the Manager. The Manager will set fields on the Controller
@@ -78,7 +80,12 @@ func Add(mgr manager.Manager) error {

 // newReconciler returns a new reconcile.Reconciler
 func newReconciler(mgr manager.Manager) reconcile.Reconciler {
-	return &ReconcileSecret{client: mgr.GetClient(), scheme: mgr.GetScheme()}
+	client := mgr.GetClient()
+	return &ReconcileSecret{
+		client:    client,
+		scheme:    mgr.GetScheme(),
+		readiness: &readiness.Impl{Client: client},
+	}
 }

 // add adds a new Controller to mgr with r as the reconcile.Reconciler
@@ -424,6 +431,7 @@ func (r *ReconcileSecret) Reconcile(request reconcile.Request) (reconcile.Result
 	reqLogger.Info("Reconciling Secret")

 	// This operator is only interested in the 3 secrets listed below. Skip reconciling for all other secrets.
+	// TODO: Filter these with a predicate instead
 	switch request.Name {
 	case secretNamePD:
 	case secretNameDMS:
@@ -434,6 +442,12 @@ func (r *ReconcileSecret) Reconcile(request reconcile.Request) (reconcile.Result
 	}
 	log.Info("DEBUG: Started reconcile loop")

+	clusterReady, err := r.readiness.IsReady()
+	if err != nil {
+		log.Error(err, "Error determining cluster readiness.")
+		return r.readiness.Result(), err
+	}
+
 	// Get a list of all Secrets in the `openshift-monitoring` namespace.
 	// This is used for determining which secrets are present so that the necessary
 	// Alertmanager config changes can happen later.
@@ -450,7 +464,7 @@ func (r *ReconcileSecret) Reconcile(request reconcile.Request) (reconcile.Result

 	// Get the secret from the request. If it's a secret we monitor, flag for reconcile.
 	instance := &corev1.Secret{}
-	err := r.client.Get(context.TODO(), request.NamespacedName, instance)
+	err = r.client.Get(context.TODO(), request.NamespacedName, instance)

 	// if there was an error other than "not found" requeue
 	if err != nil {
@@ -469,9 +483,16 @@ func (r *ReconcileSecret) Reconcile(request reconcile.Request) (reconcile.Result
 	pagerdutyRoutingKey := ""
 	watchdogURL := ""
 	// If a secret exists, add the necessary configs to Alertmanager.
+	// But don't activate PagerDuty unless the cluster is "ready".
+	// This is to avoid alert noise while the cluster is still being installed and configured.
 	if pagerDutySecretExists {
 		log.Info("INFO: Pager Duty secret exists")
-		pagerdutyRoutingKey = readSecretKey(r, &request, secretNamePD, secretKeyPD)
+		if clusterReady {
+			log.Info("INFO: Cluster is ready; configuring Pager Duty")
+			pagerdutyRoutingKey = readSecretKey(r, &request, secretNamePD, secretKeyPD)
+		} else {
+			log.Info("INFO: Cluster is not ready; skipping Pager Duty configuration")
+		}
 	}
 	if snitchSecretExists {
 		log.Info("INFO: Dead Man's Snitch secret exists")
@@ -496,7 +517,9 @@ func (r *ReconcileSecret) Reconcile(request reconcile.Request) (reconcile.Result
 	// Update metrics after all reconcile operations are complete.
 	metrics.UpdateSecretsMetrics(secretList, alertmanagerconfig)
 	reqLogger.Info("Finished reconcile for secret.")
-	return reconcile.Result{}, nil
+
+	// The readiness Result decides whether we should requeue, effectively "polling" the readiness logic.
+	return r.readiness.Result(), nil
 }

 func (r *ReconcileSecret) getClusterID() (string, error) {
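
For readers following the requeue comment in the last hunk, here is one way a readiness `Result()` could drive that polling. This is a sketch under stated assumptions, not the contents of `pkg/readiness`: the local `result` type only mirrors the `Requeue` and `RequeueAfter` fields of controller-runtime's `reconcile.Result` so the example runs without external modules, and `pollInterval` and `readinessResult` are hypothetical.

```go
package main

import (
	"fmt"
	"time"
)

// result mirrors the two fields of controller-runtime's reconcile.Result
// that matter here, so the sketch runs without external modules.
type result struct {
	Requeue      bool
	RequeueAfter time.Duration
}

// pollInterval is a hypothetical delay between readiness checks.
const pollInterval = 5 * time.Minute

// readinessResult asks for another reconcile while the cluster is not ready
// and returns the zero value (no requeue) once it is.
func readinessResult(ready bool) result {
	if ready {
		return result{}
	}
	return result{Requeue: true, RequeueAfter: pollInterval}
}

func main() {
	fmt.Printf("%+v\n", readinessResult(false))
	fmt.Printf("%+v\n", readinessResult(true))
}
```

Returning the zero value once the cluster is ready matches the behavior before this change, where `Reconcile` returned `reconcile.Result{}` unconditionally.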