Add InhibitionClusterWithoutWorkerNodes for CAPA #1397

Merged 2 commits on Oct 22, 2024
4 changes: 4 additions & 0 deletions CHANGELOG.md
@@ -7,6 +7,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0

## [Unreleased]

### Added

- Added InhibitionClusterWithoutWorkerNodes for CAPA

### Changed

- Modify `KyvernoWebhookHasNoAvailableReplicas` to check specifically for Kyverno resource webhook.
@@ -0,0 +1,34 @@
{{- if eq .Values.managementCluster.provider.kind "capa" }}
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    {{- include "labels.common" . | nindent 4 }}
    cluster_type: "management_cluster"
  name: capa.inhibitions.rules
  namespace: {{ .Values.namespace }}
spec:
  groups:
  - name: capa.inhibitions
    rules:
    - alert: InhibitionClusterWithoutWorkerNodes
      annotations:
        description: '{{`Cluster ({{ $labels.cluster_id }}) has no worker nodes.`}}'
      expr: |-
        label_replace(
Contributor

Why is this for CAPA only? Otherwise it looks good.

Contributor

I would think that with capi_machinepool_spec_replicas and capi_machinedeployment_spec_replicas we should be good?

Contributor Author

Contributor

Makes sense :)
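
A minimal sketch of the suggestion above, for illustration only: it is not part of the merged rule. It assumes `capi_machinedeployment_spec_replicas` carries a `cluster_id` label in the same way as the MachinePool metric, which this PR does not confirm.

```yaml
# Hypothetical provider-agnostic variant (an assumption, not the merged expression).
# A cluster counts as having workers if either its MachinePools or its
# MachineDeployments request more than zero replicas.
expr: |-
  label_replace(
    capi_cluster_status_condition{type="ControlPlaneReady", status="True"},
    "cluster_id",
    "$1",
    "name",
    "(.*)"
  ) == 1
  unless on (cluster_id) (
    sum(capi_machinepool_spec_replicas > 0) by (cluster_id)
    or
    sum(capi_machinedeployment_spec_replicas > 0) by (cluster_id)
  )
```

The `or` takes the union of the two aggregated vectors, so a cluster drops out of the inhibition as soon as either metric reports a positive replica count for its `cluster_id`.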

          capi_cluster_status_condition{type="ControlPlaneReady", status="True"},
          "cluster_id",
          "$1",
          "name",
          "(.*)"
        ) == 1
        unless on (cluster_id) (
          sum(capi_machinepool_spec_replicas{} > 0) by (cluster_id)
Contributor

Do we accept clusters with only 1 worker node?

Contributor Author

Prometheus-agent/Alloy could run on a 1-node WC, so that's potentially ok.

Member

Are we sure we want to use capi_machinepool_spec_replicas? That's the number of replicas defined in the MachinePool spec, but that doesn't necessarily represent the current number of nodes. For the inhibition, I thought we would prefer to use the current number of replicas. There are other metrics like

  • capi_machinepool_status_replicas: Replicas is the most recently observed number of replicas.
  • capi_machinepool_status_replicas_ready: The number of ready replicas for this MachinePool. A machine is considered ready when the node has been created and is "Ready".
  • capi_machinepool_status_replicas_available: The number of available replicas (ready for at least minReadySeconds) for this MachinePool.

Do you think it would make sense to use one of those instead?

Contributor Author

Current inhibition works when the cluster has been purposely scaled down.

If the cluster should have nodes but none are ready/available, I think we should get a page.
I don't think the current state of CAPI monitoring handles that, so I'd rather have a "prometheus-agent down" alert than no alert in this case.

It seems to me that there's quite a gap between vintage AWS and CAPA alerts, but I don't think it's Atlas's responsibility to fix it. So I took the quickest route to solving my actual issue 😅

Member

OK, I thought you wanted the inhibition to avoid paging when the cluster was having other issues, i.e. no ready nodes, meaning there was nothing wrong with your component.

Contributor Author

That would be ideal. But my first expectation is to not get paged when a cluster has no issues, for now 🤣
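
For comparison, a sketch of the alternative raised in this thread, keyed on ready replicas instead of spec replicas. It is not what was merged, and it assumes `capi_machinepool_status_replicas_ready` carries a `cluster_id` label.

```yaml
# Hypothetical variant (an assumption, not the merged expression).
# Inhibits whenever no MachinePool of the cluster reports a ready replica,
# whether the cluster was scaled down on purpose or its nodes are unhealthy.
expr: |-
  label_replace(
    capi_cluster_status_condition{type="ControlPlaneReady", status="True"},
    "cluster_id",
    "$1",
    "name",
    "(.*)"
  ) == 1
  unless on (cluster_id) (
    sum(capi_machinepool_status_replicas_ready > 0) by (cluster_id)
  )
```

The trade-off is the one discussed above: this version would also suppress pages when nodes are expected but none are ready, which the author wanted to keep paging on for now.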

        )
      labels:
        area: kaas
        has_worker_nodes: "false"
        team: phoenix
        topic: status
{{- end }}
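
The rule above only emits an alert labelled `has_worker_nodes: "false"`; the actual suppression of other alerts would be configured in Alertmanager. A minimal sketch of how such a label could be consumed, assuming a hypothetical target label `cancel_if_cluster_has_no_workers` on the alerts to be muted (the label name and the `equal` list are assumptions, not taken from this PR):

```yaml
# Hypothetical Alertmanager fragment (an assumption, not part of this PR).
inhibit_rules:
  - source_matchers:
      - has_worker_nodes = "false"                 # set by InhibitionClusterWithoutWorkerNodes
    target_matchers:
      - cancel_if_cluster_has_no_workers = "true"  # hypothetical opt-in label on target alerts
    equal:
      - cluster_id                                 # only mute alerts for the same cluster
```

With a rule of this shape, an alert is muted only while the inhibition alert is firing for the same `cluster_id`.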
@@ -0,0 +1,47 @@
---
rule_files:
  - capa.inhibition.rules.yml

tests:
  # Tests for `InhibitionClusterWithoutWorkerNodes` inhibition alert
  - interval: 1m
    input_series:
      - series: 'capi_cluster_status_condition{cluster_id="golem", cluster_type="management_cluster", name="golem", pipeline="testing", status="True", type="ControlPlaneReady"}'
        values: "1+0x300"
      - series: 'capi_machinepool_spec_replicas{cluster_id="golem", cluster_name="golem", cluster_type="management_cluster", customer="giantswarm", installation="golem", organization="giantswarm", pipeline="testing", provider="capa"}'
        values: "_x60 0x60 3x60"
    alert_rule_test:
      - alertname: InhibitionClusterWithoutWorkerNodes
        eval_time: 30m
        exp_alerts:
          - exp_labels:
              area: kaas
              cluster_id: "golem"
              cluster_type: "management_cluster"
              has_worker_nodes: "false"
              name: "golem"
              pipeline: "testing"
              status: "True"
              team: "phoenix"
              topic: "status"
              type: "ControlPlaneReady"
            exp_annotations:
              description: "Cluster (golem) has no worker nodes."
      - alertname: InhibitionClusterWithoutWorkerNodes
        eval_time: 90m
        exp_alerts:
          - exp_labels:
              area: kaas
              cluster_id: "golem"
              cluster_type: "management_cluster"
              has_worker_nodes: "false"
              name: "golem"
              pipeline: "testing"
              status: "True"
              team: "phoenix"
              topic: "status"
              type: "ControlPlaneReady"
            exp_annotations:
              description: "Cluster (golem) has no worker nodes."
      - alertname: InhibitionClusterWithoutWorkerNodes
        eval_time: 150m