WIP: work for prow monitoring #841

Open · wants to merge 1 commit into `main`
7 changes: 7 additions & 0 deletions prow/manifests/overlays/metal3/ingress.yaml
```diff
@@ -25,6 +25,13 @@ spec:
             name: hook
             port:
               number: 8888
+      - path: /monitoring
+        pathType: Prefix
+        backend:
+          service:
+            name: grafana
+            port:
+              number: 80
  tls:
  - hosts:
    - prow.apps.test.metal3.io
```
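Note that serving Grafana under the `/monitoring` sub-path typically also requires telling Grafana about its external URL. A sketch of the environment variables this would add to the Grafana deployment (the exact URL is an assumption based on the host above):

```yaml
env:
  - name: GF_SERVER_ROOT_URL
    value: "https://prow.apps.test.metal3.io/monitoring"  # assumed external URL
  - name: GF_SERVER_SERVE_FROM_SUB_PATH
    value: "true"
```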
156 changes: 156 additions & 0 deletions prow/monitoring/README.md
# Monitoring of K8s cluster and Prow resources

This is a work in progress that provides insight into how to monitor
Kubernetes cluster resources and Prow services.

The Kubernetes part is based on the kubernetes-mixin project, which can be
found here: [kubernetes-mixin](https://github.com/kubernetes-monitoring/kubernetes-mixin)

The following steps set this up and have been tested to work in minikube.
In our case we need to integrate this into the grafana.yaml here.

The main open point is the generation of the resources: deciding whether we
want to generate them dynamically or take a static snapshot of the YAML and
use that. Currently there is a static snapshot in grafana-dashboard-definitions.

The kustomization.yaml resource also needs to be created; it should be able
to automate the process of creating a ConfigMap out of
grafana-dashboard-definitions, as sketched below.
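A minimal sketch of such a kustomization.yaml, assuming the dashboard JSON
snapshots sit next to it in `grafana-dashboard-definitions/` (the file name
below is illustrative):

```yaml
apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
namespace: prow-monitoring
configMapGenerator:
  # Bundle the dashboard JSON snapshots into a single ConfigMap
  - name: grafana-dashboard-definitions
    files:
      - grafana-dashboard-definitions/k8s-resources-cluster.json  # illustrative
generatorOptions:
  # Keep a stable ConfigMap name so the Grafana deployment can reference it
  disableNameSuffixHash: true
```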

Further, for Alertmanager to automate the alerts, the Slack webhook needs to
be created as a secret. This can be done the same way as the secrets in
`project-infra/prow/manifests/overlays/metal3`.
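For illustration only (the secret name and key below are placeholders, not
the final names):

```sh
kubectl create secret generic alertmanager-slack-webhook \
  --from-literal=url='https://hooks.slack.com/services/...' \
  --namespace prow-monitoring
```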

## Deploying Grafana with Kubernetes-mixins

### Step 1: Install Prometheus and Grafana using Helm

NOTE: We will most likely not use Helm but kustomize; Helm was only used for a quick PoC.

First, add the Helm repositories for Prometheus and Grafana:

```sh
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo add grafana https://grafana.github.io/helm-charts
helm repo update
```

Now, install Prometheus and Grafana:

```sh
helm install prometheus prometheus-community/kube-prometheus-stack --namespace monitoring --create-namespace
```

This command installs the Prometheus stack, which includes Prometheus, Alertmanager, and Grafana.

### Step 2: Access Grafana

Expose the Grafana service using `kubectl port-forward`:

```sh
kubectl port-forward svc/prometheus-grafana 3000:80 -n monitoring
```

You can now access Grafana at `http://localhost:3000`. The default login is:

- **Username:** `admin`
- **Password:** `prom-operator`

### Step 3: Generate and Create a ConfigMap for Grafana Dashboards

You can manually generate the alerts, dashboards, and rules files from the
kubernetes-mixin repository, but first you must install some tools:

```sh
go install github.com/jsonnet-bundler/jsonnet-bundler/cmd/jb@latest
brew install jsonnet   # or install jsonnet via your platform's package manager
```

Then, grab the mixin and its dependencies:

```sh
git clone https://github.com/kubernetes-monitoring/kubernetes-mixin
cd kubernetes-mixin
jb install
```

Finally, build the mixin:

```sh
make prometheus_alerts.yaml
make prometheus_rules.yaml
make dashboards_out
```

1. To apply the rules and alerts, you need to wrap the generated content in a
   `PrometheusRule` resource. Add the following header and replace the
   `groups` section with whatever was generated:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: kubernetes-mixin-alerts
  namespace: monitoring
spec:
  groups:
  ...
```
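Assuming the wrapped output is saved as `prometheus_alerts.yaml` (the file
name is illustrative), it can then be applied with:

```sh
kubectl apply -f prometheus_alerts.yaml
```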

2. Create a ConfigMap with the Grafana dashboards:

```sh
kubectl create configmap grafana-dashboards --from-file=dashboards_out/ -n monitoring
```

This command creates a ConfigMap named `grafana-dashboards` in the
`monitoring` namespace, containing all the JSON files in the
`dashboards_out/` directory. This ConfigMap needs to be mounted into Grafana
in the following steps.

### Step 4: Configure Grafana to Load Dashboards from the ConfigMap

Patch the Grafana deployment:

1. **Edit the Grafana deployment:**

```sh
kubectl edit deployment prometheus-grafana -n monitoring
```

2. **Add the following under `spec` > `volumes`:**

```yaml
volumes:
  - name: grafana-dashboards
    configMap:
      name: grafana-dashboards
```

3. **Mount the volume under `containers` > `volumeMounts`:**

```yaml
volumeMounts:
  - name: grafana-dashboards
    mountPath: /var/lib/grafana/dashboards
```

4. **Ensure Grafana is configured to load dashboards:**

Grafana must be set up to load dashboards from the directory mounted above:

```yaml
env:
  - name: GF_DASHBOARDS_JSON_ENABLED
    value: "true"
  - name: GF_DASHBOARDS_JSON_PATH
    value: "/var/lib/grafana/dashboards"
```
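The same changes can also be applied non-interactively instead of via
`kubectl edit`; a rough sketch using a strategic merge patch (the container
name `grafana` is an assumption based on the kube-prometheus-stack chart):

```sh
kubectl patch deployment prometheus-grafana -n monitoring --type=strategic -p '
spec:
  template:
    spec:
      volumes:
        - name: grafana-dashboards
          configMap:
            name: grafana-dashboards
      containers:
        - name: grafana  # assumed container name
          volumeMounts:
            - name: grafana-dashboards
              mountPath: /var/lib/grafana/dashboards
'
```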

### Step 5: Verify the Dashboards in Grafana

After applying the changes, Grafana should automatically load the dashboards from the ConfigMap.

1. Access Grafana at `http://localhost:3000`.
2. Navigate to "Dashboards" > "Manage" and you should see the dashboards listed and ready to use.


24 changes: 24 additions & 0 deletions prow/monitoring/additional-scrape-configs_secret.yaml
```yaml
apiVersion: v1
kind: Secret
metadata:
  name: additional-scrape-configs
  namespace: prow-monitoring
stringData:
  prometheus-additional.yaml: |
    - job_name: blackbox
      metrics_path: /probe
      params:
        module: [http_2xx]
      static_configs:
      - targets:
        # ATTENTION: Keep this in sync with the list in mixins/prometheus/prober_alerts.libsonnet
        - https://prow.apps.test.metal3.io/
        # - https://monitoring.prow.apps.test.metal3.io/ Add this once we have the subdomain
      relabel_configs:
      - source_labels: [__address__]
        target_label: __param_target
      - source_labels: [__param_target]
        target_label: instance
      - target_label: __address__
        replacement: blackbox-prober
type: Opaque
```
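For this secret to take effect, the Prometheus custom resource has to
reference it via `additionalScrapeConfigs`. A minimal sketch, assuming a
Prometheus resource named `prow` (the resource itself is not part of this
change):

```yaml
apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prow  # assumed name
  namespace: prow-monitoring
spec:
  additionalScrapeConfigs:
    name: additional-scrape-configs
    key: prometheus-additional.yaml
```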
94 changes: 94 additions & 0 deletions prow/monitoring/alertmanager.yaml
```yaml
apiVersion: monitoring.coreos.com/v1
kind: Alertmanager
metadata:
  name: prow
  namespace: prow-monitoring
spec:
  replicas: 3
  image: docker.io/prom/alertmanager
  listenLocal: false
  nodeSelector: {}
  securityContext:
    fsGroup: 2000
    runAsNonRoot: true
    runAsUser: 1000
  serviceAccountName: alertmanager
  version: v0.27.0
  storage: # Note that this section is immutable so changes require deleting and recreating the resource.
    volumeClaimTemplate:
      metadata:
        name: prometheus
      spec:
        accessModes:
        - "ReadWriteOnce"
        storageClassName: "standard"
        resources:
          requests:
            storage: 10Gi
---
apiVersion: v1
kind: Service
metadata:
  labels:
    app: alertmanager
  name: alertmanager
  namespace: prow-monitoring
spec:
  ports:
  - name: http
    port: 9093
    protocol: TCP
    targetPort: 9093
  selector:
    alertmanager: prow
    app: alertmanager
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alertmanager
  namespace: prow-monitoring
---
# TODO: CHANGE THIS TO THE CORRECT SLACK SETTINGS
# Slack endpoint, or even different methods of alerting.
# Please replace '{{ api_url }}' below with the URL of the Slack incoming
# webhook before `kubectl apply -f`.
apiVersion: v1
kind: Secret
metadata:
  name: alertmanager-prow
  namespace: prow-monitoring
stringData:
  alertmanager.yaml: |
    global:
      resolve_timeout: 5m

    route:
      group_by: ['alertname', 'job']
      group_wait: 30s
      group_interval: 10m
      repeat_interval: 4h
      receiver: 'slack-warnings'
      routes:
      # NOTE: the 'cluster-api-aws-alerts' receiver is referenced here but not
      # defined under receivers; Alertmanager will reject the config until it
      # is added or this route is removed.
      - receiver: 'cluster-api-aws-alerts'
        group_interval: 5m
        repeat_interval: 2h
        match:
          boskos_type: aws-account

    receivers:
    - name: 'slack-warnings'
      slack_configs:
      - channel: '#prow-alerts'
        api_url: '{{ api_url }}'
        icon_url: https://avatars3.githubusercontent.com/u/3380462
        text: '{{ template "custom_slack_text" . }}'
        link_names: true

    templates:
    - '*.tmpl'
  msg.tmpl: |
    {{ define "custom_slack_text" }}{{ .CommonAnnotations.message }}{{ end }}
type: Opaque
```
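Once applied, the rendered Alertmanager configuration can be sanity-checked
through its UI, for example:

```sh
kubectl port-forward -n prow-monitoring svc/alertmanager 9093:9093
# then open http://localhost:9093/#/status to confirm the config loaded
```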
6 changes: 6 additions & 0 deletions prow/monitoring/alertmanager_rbac.yaml
```yaml
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: alertmanager
  namespace: prow-monitoring
```
66 changes: 66 additions & 0 deletions prow/monitoring/blackbox_prober.yaml
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: blackbox-prober
  namespace: prow-monitoring
  labels:
    app: blackbox-prober
spec:
  selector:
    matchLabels:
      app: blackbox-prober
  replicas: 1
  template:
    metadata:
      labels:
        app: blackbox-prober
    spec:
      containers:
      - name: blackbox-prober
        args:
        - --config.file=/etc/config/prober.yaml
        image: prom/blackbox-exporter:v0.15.1
        volumeMounts:
        - name: config
          mountPath: /etc/config/
      volumes:
      - name: config
        configMap:
          name: blackbox-prober-config
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: blackbox-prober-config
  namespace: prow-monitoring
  labels:
    app: blackbox-prober
data:
  prober.yaml: |-
    modules:
      http_2xx:
        prober: http
        timeout: 8s
        http:
          # valid_status_codes defaults to 2xx
          method: GET
          no_follow_redirects: false
          fail_if_ssl: false
          fail_if_not_ssl: true
          preferred_ip_protocol: "ip4" # Defaults to ip6
---
apiVersion: v1
kind: Service
metadata:
  name: blackbox-prober
  namespace: prow-monitoring
  labels:
    app: blackbox-prober
spec:
  type: ClusterIP
  ports:
  - name: blackbox-prober
    port: 80
    targetPort: 9115
  selector:
    app: blackbox-prober
```
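After deploying, the prober can be exercised directly to confirm it answers
probes, for example:

```sh
kubectl port-forward -n prow-monitoring svc/blackbox-prober 9115:80
curl 'http://localhost:9115/probe?module=http_2xx&target=https://prow.apps.test.metal3.io/'
```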