docs: Add Integrate Prometheus with Telemetry Manager using Alerting ADR #703

Merged: 33 commits on Jan 12, 2024
Commits
081f605 Add prometheus helm values (skhalash, Jan 9, 2024)
689e109 Set up alertmanager wbhook (skhalash, Jan 10, 2024)
bc8526c Query Prometheus alerts (skhalash, Jan 10, 2024)
f96121f WIP: add ADR (skhalash, Jan 11, 2024)
186ef04 Describe problems with direct queries (skhalash, Jan 11, 2024)
83305e7 Finalize the ADR (skhalash, Jan 11, 2024)
7dca036 WIP: describe the PoC (skhalash, Jan 11, 2024)
0fc8ce6 WIP: describe setup (skhalash, Jan 11, 2024)
bb20c15 Revert main.go changes (skhalash, Jan 11, 2024)
f4c52d1 Finalize the PoC doc (skhalash, Jan 11, 2024)
31f352d Revert kustomize (skhalash, Jan 11, 2024)
b8fbe21 Revert rest of code changes (skhalash, Jan 11, 2024)
806641a Minor addition (skhalash, Jan 11, 2024)
423951f Revert go.sum (skhalash, Jan 11, 2024)
c4be632 Merge branch 'main' of github.com:kyma-project/telemetry-manager into… (skhalash, Jan 11, 2024)
631671c Revert prometheus.go (skhalash, Jan 11, 2024)
052c18d Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
1f6648c Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
afa8593 Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
70de51d Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
28c0345 Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
4a29764 Update docs/contributor/pocs/integrate-prometheus-with-telemetry-mana… (Jan 12, 2024)
c5c35cc Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
cef7518 Update docs/contributor/pocs/integrate-prometheus-with-telemetry-mana… (Jan 12, 2024)
675b54e Update docs/contributor/pocs/integrate-prometheus-with-telemetry-mana… (Jan 12, 2024)
dded73e Fix list (skhalash, Jan 12, 2024)
d32436b Fix code blocks (skhalash, Jan 12, 2024)
58524fe Improve readability (skhalash, Jan 12, 2024)
9036181 Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
1ea8f95 Merge branch 'prometheus-integration-poc' of github.com:skhalash/tele… (skhalash, Jan 12, 2024)
69954b8 Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
bbe8c68 Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
55660d0 Update docs/contributor/arch/003-integrate-prometheus-with-telemetry-… (Jan 12, 2024)
@@ -1,4 +1,4 @@
-# 1. Fluent Bit Configuration and File-System Buffer Usage
+# 2. Fluent Bit Configuration and File-System Buffer Usage

Date: 2023-11-23

@@ -0,0 +1,51 @@
# 3. Integrate Prometheus With Telemetry Manager Using Alerting

Date: 2024-01-11

## Status

Accepted

## Context

As outlined in [ADR 001: Trace/Metric Pipeline status based on OTel Collector metrics](./001-otel-collector-metric-based-pipeline-status.md), our objective is to utilize a managed Prometheus instance to reflect specific telemetry flow issues (such as backpressure, data loss, backend unavailability) in the status of a telemetry pipeline custom resource (CR).
We have previously determined that both Prometheus and its configuration will be managed within the Telemetry Manager's code, aligning with our approach for managing Fluent Bit and OTel Collector.

To address the integration of Prometheus querying into the reconciliation loop, a Proof of Concept was executed.

## Decision

The results of the query tests affirm that invoking Prometheus APIs won't notably impact the overall reconciliation time. In theory, we could directly query Prometheus within the Reconcile routine. However, this straightforward approach presents a few challenges.

### Challenges

#### Timing of Invocation
Our current reconciliation strategy triggers either when a change occurs or every minute. While this is acceptable for periodic status updates, it may not be optimal when considering future plans to use Prometheus for autoscaling decisions.

#### Flakiness Mitigation
To ensure reliability and avoid false alerts, it's crucial to introduce a delay before signaling a problem. As suggested in [OTel Collector monitoring best practices](https://github.com/open-telemetry/opentelemetry-collector/blob/main/docs/monitoring.md):

> Use the rate of otelcol_processor_dropped_spans > 0 and otelcol_processor_dropped_metric_points > 0 to detect data loss. Depending on requirements, set up a minimal time window before alerting to avoid notifications for minor losses that fall within acceptable levels of reliability.

If we directly query Prometheus, we would need to implement such a mechanism to mitigate flakiness ourselves.

### Solution

Fortunately, we can leverage the Alerting feature of Prometheus to address the aforementioned challenges. The proposed workflow is as follows:

#### Rendering Alerting Rules
Telemetry Manager dynamically generates alerting rules based on the deployed pipeline configuration.
These alerting rules are then mounted into the Prometheus Pod, which is also deployed by the Telemetry Manager.
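
As an illustration, the rule rendering could look roughly like the following Go sketch. The `RenderRules` helper, the alert name, the `pipeline` label, and the example expression over the OTel Collector exporter metrics are assumptions made for this sketch, not the final rule set:

```go
package alertrules

import (
    "fmt"

    "gopkg.in/yaml.v3"
)

// ruleGroups mirrors the Prometheus rule-file format (groups -> rules).
type ruleGroups struct {
    Groups []ruleGroup `yaml:"groups"`
}

type ruleGroup struct {
    Name  string `yaml:"name"`
    Rules []rule `yaml:"rules"`
}

type rule struct {
    Alert  string            `yaml:"alert"`
    Expr   string            `yaml:"expr"`
    For    string            `yaml:"for,omitempty"`
    Labels map[string]string `yaml:"labels,omitempty"`
}

// RenderRules generates one alerting rule per deployed pipeline and returns the
// rule-file content to be written into the ConfigMap mounted by the Prometheus Pod.
// The expression and the 5m evaluation window are placeholders for this sketch.
func RenderRules(pipelineNames []string) (string, error) {
    var rules []rule
    for _, name := range pipelineNames {
        rules = append(rules, rule{
            Alert:  "MetricPipelineExporterSendFailed",
            Expr:   fmt.Sprintf(`rate(otelcol_exporter_send_failed_metric_points{exporter=%q}[5m]) > 0`, name),
            For:    "5m",
            Labels: map[string]string{"pipeline": name},
        })
    }

    out, err := yaml.Marshal(ruleGroups{Groups: []ruleGroup{{Name: "telemetry-pipelines", Rules: rules}}})
    if err != nil {
        return "", fmt.Errorf("failed to marshal alerting rules: %w", err)
    }
    return string(out), nil
}
```

The resulting string follows the same `alerting_rules.yml` format that the PoC mounts through the Prometheus Helm overrides.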

#### Alert Retrieval in Reconciliation
During each reconciliation iteration, the Telemetry Manager queries the [Prometheus Alerts API](https://prometheus.io/docs/prometheus/latest/querying/api/#alerts) using `github.com/prometheus/client_golang` to retrieve information about all fired alerts.
The obtained alerts are then translated into corresponding CR statuses.
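
A minimal sketch of this translation, assuming the rendered rules attach a `pipeline` label to every alert, could look as follows. The condition type `TelemetryFlowHealthy` and the reason values are illustrative assumptions, not the final status API:

```go
package status

import (
    "fmt"

    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    "github.com/prometheus/common/model"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// conditionFromAlerts derives a pipeline condition from the currently firing alerts.
// Alerts are matched to a pipeline via a "pipeline" label rendered into the rules (assumption).
func conditionFromAlerts(pipelineName string, result promv1.AlertsResult) metav1.Condition {
    for _, alert := range result.Alerts {
        if alert.State != promv1.AlertStateFiring {
            continue
        }
        if string(alert.Labels["pipeline"]) != pipelineName {
            continue
        }
        return metav1.Condition{
            Type:    "TelemetryFlowHealthy",
            Status:  metav1.ConditionFalse,
            Reason:  "AlertFiring",
            Message: fmt.Sprintf("Prometheus alert %s is firing for this pipeline", alert.Labels[model.AlertNameLabel]),
        }
    }
    return metav1.Condition{
        Type:   "TelemetryFlowHealthy",
        Status: metav1.ConditionTrue,
        Reason: "NoAlertsFiring",
    }
}
```

The returned condition could then be applied to the pipeline CR status during reconciliation, for example with `meta.SetStatusCondition`.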

#### Webhook for Immediate Reconciliation
The Telemetry Manager exposes an endpoint intended to be invoked by Prometheus whenever there is a change in the state of alerts. To facilitate this, we can configure Prometheus to treat our endpoint as an Alertmanager instance. Upon receiving a call, this endpoint initiates an immediate reconciliation of all affected resources using [Builder.WatchesRawSource](https://pkg.go.dev/sigs.k8s.io/controller-runtime/pkg/builder#Builder.WatchesRawSource) with a [source.Channel](https://pkg.go.dev/sigs.k8s.io/[email protected]/pkg/source#Channel).

By adopting this approach, we transfer the effort associated with expression evaluation and waiting to Prometheus.

## Consequences

The described setup involves frequent interaction between Telemetry Manager and Prometheus (rendering rules, querying the Alerts API, and receiving webhook calls), which must itself be sufficiently monitored.
@@ -0,0 +1,256 @@
# Integrate Prometheus With Telemetry Manager Using Alerting

## Goal

The goal of the Proof of Concept is to test integrating Prometheus into Telemetry Manager using Alerting.

## Setup

Follow these steps to set up the required environment:

1. Create a Kubernetes cluster (k3d or Gardener).
2. Create an overrides file specifically for the Prometheus Helm Chart. Save the file as `overrides.yaml`.
```yaml
alertmanager:
  enabled: false

prometheus-pushgateway:
  enabled: false

prometheus-node-exporter:
  enabled: false

server:
  alertmanagers:
    - static_configs:
        - targets:
            - telemetry-operator-alerts-webhook.kyma-system:9090

serverFiles:
  alerting_rules.yml:
    groups:
      - name: Instances
        rules:
          - alert: InstanceDown
            expr: up == 0
            for: 5m
            labels:
              severity: page
            annotations:
              description: '{{ $labels.instance }} of job {{ $labels.job }} has been down for more than 5 minutes.'
              summary: 'Instance {{ $labels.instance }} down'
  prometheus.yml:
    rule_files:
      - /etc/config/recording_rules.yml
      - /etc/config/alerting_rules.yml

    scrape_configs:
      - job_name: prometheus
        static_configs:
          - targets:
              - localhost:9090

      - job_name: 'kubernetes-service-endpoints'
        honor_labels: true
        kubernetes_sd_configs:
          - role: endpoints
        relabel_configs:
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
            action: keep
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape_slow]
            action: drop
            regex: true
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
            action: replace
            target_label: __scheme__
            regex: (https?)
          - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
            action: replace
            target_label: __metrics_path__
            regex: (.+)
          - source_labels: [__address__, __meta_kubernetes_service_annotation_prometheus_io_port]
            action: replace
            target_label: __address__
            regex: (.+?)(?::\d+)?;(\d+)
            replacement: $1:$2
          - action: labelmap
            regex: __meta_kubernetes_service_annotation_prometheus_io_param_(.+)
            replacement: __param_$1
          - action: labelmap
            regex: __meta_kubernetes_service_label_(.+)
          - source_labels: [__meta_kubernetes_namespace]
            action: replace
            target_label: namespace
          - source_labels: [__meta_kubernetes_service_name]
            action: replace
            target_label: service
          - source_labels: [__meta_kubernetes_pod_node_name]
            action: replace
            target_label: node
```
3. Deploy Prometheus.
```shell
# Add the prometheus-community Helm repository first, if it is not registered yet
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
kubectl create ns prometheus
helm install -f overrides.yaml prometheus prometheus-community/prometheus
```
4. Create an endpoint in Telemetry Manager to be invoked by Prometheus:
```go
reconcileTriggerChan := make(chan event.GenericEvent, 1024)
go func() {
    handler := func(w http.ResponseWriter, r *http.Request) {
        // The alert payload in the request body is not evaluated in this PoC.
        if _, readErr := io.ReadAll(r.Body); readErr != nil {
            http.Error(w, "Error reading request body", http.StatusInternalServerError)
            return
        }
        defer r.Body.Close()

        // TODO: add more context about which objects have to be reconciled
        reconcileTriggerChan <- event.GenericEvent{}
        w.WriteHeader(http.StatusOK)
    }

    mux := http.NewServeMux()
    mux.HandleFunc("/api/v2/alerts", handler)

    server := &http.Server{
        Addr:              ":9090",
        ReadHeaderTimeout: 10 * time.Second,
        Handler:           mux,
    }

    if serverErr := server.ListenAndServe(); serverErr != nil {
        mutex.Lock()
        setupLog.Error(serverErr, "Cannot start webhook server")
        mutex.Unlock()
    }
}()
```
5. Trigger reconciliation in MetricPipelineController whenever the endpoint is called by Prometheus:
```go
func NewMetricPipelineReconciler(client client.Client, reconcileTriggerChan chan event.GenericEvent, reconciler *metricpipeline.Reconciler) *MetricPipelineReconciler {
    return &MetricPipelineReconciler{
        Client:               client,
        reconciler:           reconciler,
        reconcileTriggerChan: reconcileTriggerChan,
    }
}

// SetupWithManager sets up the controller with the Manager.
func (r *MetricPipelineReconciler) SetupWithManager(mgr ctrl.Manager) error {
    // We use `Watches` instead of `Owns` to trigger a reconciliation also when owned objects without the controller flag are changed.
    return ctrl.NewControllerManagedBy(mgr).
        For(&telemetryv1alpha1.MetricPipeline{}).
        WatchesRawSource(&source.Channel{Source: r.reconcileTriggerChan},
            handler.EnqueueRequestsFromMapFunc(r.mapPrometheusAlertEvent)).
        ...
}

func (r *MetricPipelineReconciler) mapPrometheusAlertEvent(ctx context.Context, _ client.Object) []reconcile.Request {
    logf.FromContext(ctx).Info("Handling Prometheus alert event")
    requests, err := r.createRequestsForAllPipelines(ctx)
    if err != nil {
        logf.FromContext(ctx).Error(err, "Unable to create reconcile requests")
    }
    return requests
}
```
6. Query Prometheus alerts in the Reconcile function:
```go
import (
    "context"
    "fmt"
    "time"

    "github.com/prometheus/client_golang/api"
    promv1 "github.com/prometheus/client_golang/api/prometheus/v1"
    logf "sigs.k8s.io/controller-runtime/pkg/log"
)

const prometheusAPIURL = "http://prometheus-server.default:80"

func queryAlerts(ctx context.Context) error {
    client, err := api.NewClient(api.Config{
        Address: prometheusAPIURL,
    })
    if err != nil {
        return fmt.Errorf("failed to create Prometheus client: %w", err)
    }

    v1api := promv1.NewAPI(client)
    ctx, cancel := context.WithTimeout(ctx, 10*time.Second)
    defer cancel()

    start := time.Now()
    alerts, err := v1api.Alerts(ctx)
    if err != nil {
        return fmt.Errorf("failed to query Prometheus alerts: %w", err)
    }

    logf.FromContext(ctx).Info("Prometheus alert query succeeded!",
        "elapsed_ms", time.Since(start).Milliseconds(),
        "alerts", alerts)
    return nil
}
```

7. Add a Kubernetes service for the alerts endpoint to the kustomize file:
```yaml
apiVersion: v1
kind: Service
metadata:
  name: operator-alerts-webhook
  namespace: system
spec:
  ports:
    - name: webhook
      port: 9090
      targetPort: 9090
  selector:
    app.kubernetes.io/name: operator
    app.kubernetes.io/instance: telemetry
    kyma-project.io/component: controller
    control-plane: telemetry-operator
```
8. Whitelist the endpoint port (9090) in the operator network policy:
```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: operator-pprof-deny-ingress
spec:
  podSelector:
    matchLabels:
      app.kubernetes.io/name: operator
      app.kubernetes.io/instance: telemetry
      kyma-project.io/component: controller
      control-plane: telemetry-operator
  policyTypes:
    - Ingress
  ingress:
    - from:
        - ipBlock:
            cidr: 0.0.0.0/0
      ports:
        - protocol: TCP
          port: 8080
        - protocol: TCP
          port: 8081
        - protocol: TCP
          port: 9443
        - protocol: TCP
          port: 9090
```
9. Deploy the modified Telemetry Manager:
```shell
export IMG=$DEV_IMAGE_REPO
make docker-build
make docker-push
make install
make deploy
```
10. Intentionally break any scrape target to fire the `InstanceDown` alert. In the Telemetry Manager logs, you should see that Prometheus pushes alerts to the webhook endpoint, which triggers an immediate reconciliation.