-
Notifications
You must be signed in to change notification settings - Fork 3
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add alert for failing helmreleases deploying aws components #1432
Conversation
@giantswarm/team-atlas could you give me a bit of guidance regarding CI? It's failing. |
description: |- | ||
{{`Flux HelmRelease {{ $labels.name }} in ns {{ $labels.exported_namespace }} on {{ $labels.installation }}/{{ $labels.cluster_id }} is stuck in Failed state.`}} | ||
opsrecipe: fluxcd-failing-helmrelease/ | ||
expr: gotk_reconcile_condition{type="Ready", status="False", kind="HelmRelease", cluster_type="management_cluster", exported_namespace!="flux-giantswarm", name=~".*(aws-ebs-csi-driver|cloud-provider-aws).*"} > 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe extend these to all critical ones i.e. add cilium, coredns, network-policies suffixes as well? And do it for every CAPI-based provider, not just CAPA, so we have an easy list that we can extend for the other providers?
Oh, I guess that's an error that's happening earlier in the tests and that's not properly caught:
I think it's due to the indentation for the yaml multiline blocks Also, the tests will probably fail later (hopefully with a more meaningful error) because you did not write any unit tests. |
640f70c
to
0c2a896
Compare
helm/prometheus-rules/templates/kaas/phoenix/alerting-rules/aws-cloud-components.rules.yml
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. If you want to add rocket ones, you could do that without issues :)
5ea26c7
to
8dbc683
Compare
Towards https://github.com/giantswarm/giantswarm/issues/32121
We want to get alerted whenever one of the
HelmReleases
fore core components like the aws cloud-controller or the aws-ebs-csi-driver are in failed state. Currently, onalba
there are some in this state, and we'd get paged by theseChecklist
oncall-kaas-cloud
GitHub group).