Add alert for failing helmreleases deploying aws components #1432

Merged
merged 5 commits into from
Nov 21, 2024

Conversation

fiunchinho
Member

@fiunchinho fiunchinho commented Nov 19, 2024

Towards https://github.com/giantswarm/giantswarm/issues/32121

We want to get alerted whenever one of the HelmReleases for core components, like the aws cloud-controller or the aws-ebs-csi-driver, is in a failed state. Currently, on alba there are some in this state, and we'd get paged by these.

image

Checklist

@fiunchinho fiunchinho self-assigned this Nov 19, 2024
@fiunchinho fiunchinho marked this pull request as ready for review November 19, 2024 16:51
@fiunchinho fiunchinho requested a review from a team as a code owner November 19, 2024 16:51
@fiunchinho fiunchinho requested a review from a team November 19, 2024 16:51
@fiunchinho
Member Author

@giantswarm/team-atlas could you give me a bit of guidance regarding CI? It's failing.

description: |-
  {{`Flux HelmRelease {{ $labels.name }} in ns {{ $labels.exported_namespace }} on {{ $labels.installation }}/{{ $labels.cluster_id }} is stuck in Failed state.`}}
opsrecipe: fluxcd-failing-helmrelease/
expr: gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="management_cluster", exported_namespace!="flux-giantswarm", name=~".*(aws-ebs-csi-driver|cloud-provider-aws).*"} > 0
Contributor

Maybe extend this to all critical ones, i.e. also add the cilium, coredns, and network-policies suffixes? And do it for every CAPI-based provider, not just CAPA, so we have a simple list that we can extend for the other providers?
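
For illustration, the suggestion above could look roughly like the rule below. This is only a sketch: the alert name, the `for` duration, and the `severity` label are hypothetical placeholders that do not come from this PR, and the extra regex alternatives are simply the suffixes named in the comment.

```yaml
# Sketch only: alert name, `for`, and `labels` are hypothetical placeholders.
- alert: HypotheticalCoreComponentHelmReleaseFailed
  annotations:
    description: |-
      {{`Flux HelmRelease {{ $labels.name }} in ns {{ $labels.exported_namespace }} on {{ $labels.installation }}/{{ $labels.cluster_id }} is stuck in Failed state.`}}
    opsrecipe: fluxcd-failing-helmrelease/
  # Same selector as in the PR, with cilium, coredns and network-policies
  # appended to the list of name suffixes.
  expr: gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="management_cluster", exported_namespace!="flux-giantswarm", name=~".*(aws-ebs-csi-driver|cloud-provider-aws|cilium|coredns|network-policies).*"} > 0
  for: 20m          # assumed duration
  labels:
    severity: page  # assumed label
```

Keeping the component names in a single regex alternation means other providers can be covered by appending to one list.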

@hervenicol
Contributor

> @giantswarm/team-atlas could you give me a bit of guidance regarding CI? It's failing.

Oh, I guess that's an error happening earlier in the tests that isn't properly caught:

Templating chart for provider: capi/capa
Error: YAML parse error on prometheus-rules/templates/kaas/phoenix/alerting-rules/aws-cloud-components.rules.yml: error converting YAML to JSON: yaml: line 24: could not find expected ':'

I think it's due to the indentation of the YAML multiline blocks description and namespace. The content of each block needs one extra level of indentation relative to its key.
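
A minimal sketch of the indentation issue described above, assuming the description sits under an `annotations` key (the surrounding structure is an assumption, not copied from the PR diff):

```yaml
# Broken (shown as a comment so this fragment stays parseable): the content
# line sits at the same indentation as the `description` key, so it is not
# part of the block scalar and the parser tries to read `{{` as YAML.
#
#   annotations:
#     description: |-
#     {{`Flux HelmRelease ... is stuck in Failed state.`}}
#
# Fixed: the content is indented one level deeper than the key.
annotations:
  description: |-
    {{`Flux HelmRelease {{ $labels.name }} in ns {{ $labels.exported_namespace }} on {{ $labels.installation }}/{{ $labels.cluster_id }} is stuck in Failed state.`}}
  opsrecipe: fluxcd-failing-helmrelease/
```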

Also, the tests will probably fail later (hopefully with a more meaningful error) because you did not write any unit tests.
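
As a rough idea of what such a unit test could look like, here is a sketch in the promtool rule-test format. It assumes the tests run against the rendered rule file, and the alert name, eval time, rule file path, and label values are all hypothetical. `exp_labels` must match every label on the fired alert, so any labels the real rule adds (for example a severity) would need to be listed too.

```yaml
# Sketch only: file name, alert name, and label values are assumptions.
rule_files:
  - aws-cloud-components.rules.yml
evaluation_interval: 1m
tests:
  # A failing aws-ebs-csi-driver HelmRelease should fire the alert.
  - interval: 1m
    input_series:
      - series: 'gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="management_cluster", exported_namespace="org-example", name="example-aws-ebs-csi-driver", installation="alba", cluster_id="alba"}'
        values: "1+0x60"
    alert_rule_test:
      - alertname: HypotheticalCoreComponentHelmReleaseFailed
        eval_time: 30m
        exp_alerts:
          - exp_labels:
              type: Ready
              status: "False"
              kind: HelmRelease
              cluster_type: management_cluster
              exported_namespace: org-example
              name: example-aws-ebs-csi-driver
              installation: alba
              cluster_id: alba
            exp_annotations:
              description: "Flux HelmRelease example-aws-ebs-csi-driver in ns org-example on alba/alba is stuck in Failed state."
              opsrecipe: fluxcd-failing-helmrelease/
  # A failing HelmRelease outside the name list should not fire it.
  - interval: 1m
    input_series:
      - series: 'gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="management_cluster", exported_namespace="org-example", name="example-unrelated-app", installation="alba", cluster_id="alba"}'
        values: "1+0x60"
    alert_rule_test:
      - alertname: HypotheticalCoreComponentHelmReleaseFailed
        eval_time: 30m
        exp_alerts: []
```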

Contributor

@QuentinBisson QuentinBisson left a comment

LGTM. If you want to add rocket ones, you could do that without issues :)

@fiunchinho fiunchinho requested a review from a team November 20, 2024 14:00
@fiunchinho fiunchinho requested a review from AndiDog November 20, 2024 14:06
@fiunchinho fiunchinho merged commit e4a5df4 into main Nov 21, 2024
7 checks passed
@fiunchinho fiunchinho deleted the helmreleases-aws branch November 21, 2024 14:37
5 participants