Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add the receiver name to notification metrics #3045

Merged

Conversation

sinkingpoint
Copy link
Contributor

Solves #3012

This PR adds in a second label to all the Notification metrics, so that we not only have the type of notification, but all the receiver_name that the notification belongs to.

Because of the proliferation of webhook notifiers (as recommended by the docs), this is necessary to be able to properly diagnose failures from metrics - currently we get one metric for all web hooks, regardless of the fact that they might represent completely different notification pathways.

This is technically a breaking change, so this might have to wait for a bigger release, if its even worthwhile at all

@roidelapluie
Copy link
Member

Could we have this behind a cli flag?

Cli flag because I think we should issue it consistently over the lifetime of the app.

@roidelapluie roidelapluie reopened this Oct 4, 2022
@sinkingpoint
Copy link
Contributor Author

Sorry for the hiatus. Will push a CLI flag today

@sinkingpoint sinkingpoint force-pushed the sinkingpoint/receiver_name_metrics branch from 1ed8b42 to 35d0483 Compare January 31, 2023 00:09
@sinkingpoint sinkingpoint force-pushed the sinkingpoint/receiver_name_metrics branch 2 times, most recently from 3c6c530 to 769396f Compare June 22, 2023 06:46
@sinkingpoint
Copy link
Contributor Author

Fixed this up and it's ready for another review

@alimehrabikoshki
Copy link

@roidelapluie could you pls take another look?

sinkingpoint and others added 3 commits August 7, 2023 13:14
This commit adds in a second label to the notify family of metrics
(e.g. numTotalFailedNotifications) - the receiver name. This allows
disambiguating which receiver is failing when one has many receivers
with the same integration type

Signed-off-by: sinkingpoint <[email protected]>
Alertmanager doesn't actually have a feature flag system yet, so this
invents the universe, mostly by cribbing off the Prometheus model.

Here we introduce a new CLI flag `--enable-feature` that we parse into
a new `featureConfigs` struct. We also update the PipelineBuilder
to take that flag and pass it through the metrics infrastructure. This
also required a bit of deduplication to make finding the correct
labelset (with or without the receiver name) easier in every place that
it's used, and updating the tests to use that same mechanism.

Signed-off-by: sinkingpoint <[email protected]>
@gotjosh gotjosh force-pushed the sinkingpoint/receiver_name_metrics branch from 769396f to acb5eaa Compare August 7, 2023 14:49
@gotjosh
Copy link
Member

gotjosh commented Aug 7, 2023

Thank you very much for your contribution - you've been waiting for a long time on this so instead of dragging it back and forth, and went ahead and implemented a couple of suggestions I wanted on this PR.

The bulk of the change is around having the "feature flag management" on its package and applying a couple of patterns to keep them tidy for future additions.

The other big change is the fact that we can initialize ahead of time the metrics for the given receiver names.

I'd love if @simonpasquier can take a look as I've written a big chunk of this code.

Signed-off-by: gotjosh <[email protected]>
Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main remark is that old label values would be retained after a config reload in case some receivers are deleted/renamed. I guess that we need DeleteLabelValues() to prune them.

@gotjosh
Copy link
Member

gotjosh commented Sep 6, 2023

My main remark is that old label values would be retained after a config reload in case some receivers are deleted/renamed. I guess that we need DeleteLabelValues() to prune them.

I can't use https://pkg.go.dev/github.com/prometheus/client_golang/prometheus#MetricVec.DeleteLabelValues as I would need to know the old label values to prune them (and I can, but this would be a pretty invasive change). Instead, I chose to use Reset() to prune all of the metrics in the given vector and then recreate them again - given all of these are counters this shouldn't be a problem, as they would simply go back to 0 and would count as counter reset anyways.

Copy link
Member

@simonpasquier simonpasquier left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@gotjosh gotjosh merged commit cfe4411 into prometheus:main Sep 6, 2023
1 check passed
@gotjosh
Copy link
Member

gotjosh commented Sep 6, 2023

Thank you very much for your contribution @sinkingpoint.

@mmikitka-arcticwolf
Copy link

@gotjosh Do you have an estimate for when the next official Alertmanager container image will be available that includes this change?

@rtkwlf-nachiketrao
Copy link

@gotjosh Is there any update on when the new Alertmanager image will be available?

@gotjosh
Copy link
Member

gotjosh commented Feb 20, 2024

Have you tested out the current rc.0 https://github.com/prometheus/alertmanager/releases/tag/v0.27.0-rc.0? I'm planning on promoting the the RC to 27.0 by the end of the week if nothing major is reported.

SuperQ added a commit that referenced this pull request Oct 16, 2024
* [CHANGE] Deprecate and remove api/v1/ #2970
* [CHANGE] Remove unused feature flags #3676
* [CHANGE] Newlines in smtp password file are now ignored #3681
* [CHANGE] Change compat metrics to counters #3686
* [CHANGE] Do not register compat metrics in amtool #3713
* [CHANGE] Remove metrics from compat package #3714
* [CHANGE] Mark muted alerts #3793
* [FEATURE] Add metric for inhibit rules #3681
* [FEATURE] Support UTF-8 label matchers #3453, #3507, #3523, #3483, #3567, #3568, #3569, #3571, #3595, #3604, #3619, #3658, #3659, #3662, #3668, 3572
* [FEATURE] Add counter to track alerts dropped outside of time_intervals #3565
* [FEATURE] Add date and tz functions to templates #3812
* [FEATURE] Add limits for silences #3852
* [FEATURE] Add time helpers for templates #3863
* [FEATURE] Add auto GOMAXPROCS #3837
* [FEATURE] Add auto GOMEMLIMIT #3895
* [FEATURE] Add Jira receiver integration #3590
* [ENHANCEMENT] Add the receiver name to notification metrics #3045
* [ENHANCEMENT] Add the route ID to uuid #3372
* [ENHANCEMENT] Add duration to the notify success message #3559
* [ENHANCEMENT] Implement webhook_url_file for discord and msteams #3555
* [ENHANCEMENT] Add debug logs for muted alerts #3558
* [ENHANCEMENT] API: Allow the Silences API to use their own 400 response #3610
* [ENHANCEMENT] Add summary to msteams notification #3616
* [ENHANCEMENT] Add context reasons to notifications failed counter #3631
* [ENHANCEMENT] Add optional native histogram support to latency metrics #3737
* [ENHANCEMENT] Enable setting ThreadId for Telegram notifications #3638
* [ENHANCEMENT] Allow webex roomID from template #3801
* [BUGFIX] Add missing integrations to notify metrics #3480
* [BUGFIX] Add missing ttl in pushhover #3474
* [BUGFIX] Fix scheme required for webhook url in amtool #3409
* [BUGFIX] Remove duplicate integration from metrics #3516
* [BUGFIX] Reflect Discord's max length message limits #3597
* [BUGFIX] Fix nil error in warn logs about incompatible matchers #3683
* [BUGFIX] Fix a small number of inconsistencies in compat package logging #3718
* [BUGFIX] Fix log line in featurecontrol #3719
* [BUGFIX] Fix panic in acceptance tests #3592
* [BUGFIX] Fix flaky test TestClusterJoinAndReconnect/TestTLSConnection #3722
* [BUGFIX] Fix crash on errors when url_file is used #3800
* [BUGFIX] Fix race condition in dispatch.go #3826
* [BUGFIX] Fix race conditions in the memory alerts store #3648
* [BUGFIX] Hide config.SecretURL when the URL is incorrect. #3887
* [BUGFIX] Fix invalid silence causes incomplete updates #3898
* [BUGFIX] Fix leaking of Silences matcherCache entries #3930
* [BUGFIX] Close SMTP submission correctly to handle errors #4006

Signed-off-by: SuperQ <[email protected]>
@SuperQ SuperQ mentioned this pull request Oct 16, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants