How to fix non unique GroupKeys? #3817
Comments
This commit adds (optional) names to routes to fix prometheus#3817. In the case where a user has two routes with the same receiver, matchers and group_by, a name can be used to ensure their groups have unique group keys and avoid using the same nflog. Signed-off-by: George Robinson <[email protected]>
Another option is to add the receiver name to the route key. Here is an example of how this could look for the following configuration file:

receivers:
  - name: test1
  - name: test2
route:
  receiver: test1
  routes:
    - receiver: test1
      matchers:
        - foo=bar
      continue: true
    - receiver: test2
      matchers:
        - foo=bar
      mute_time_intervals:
        - name: weekends
      continue: true

Without the change:
With the change:
This could also be shortened to something like:
func (l *Log) Log(r *pb.Receiver, gkey string, firingAlerts, resolvedAlerts []uint64, expiry time.Duration) error {
This commit adds the receiver name to the route key to reduce the chances of having non-unique group keys (prometheus#3817). Like the previous version, it does not guarantee the group key is unique; however, it does make collisions less likely to occur. Signed-off-by: George Robinson <[email protected]>
Any chance the example needs to have a "continue" to match the second route?
Yes it does :) I missed that in the example!
What if we don't attempt to fix this without requiring extra configuration? What if a future version of AM refuses to accept a config where both GroupKey and Receiver are identical for two siblings? What is the user's intention in creating such a config? We could offer a user a way to differentiate these two routes (
This is something that I've considered too. I like it a lot because it means we don't need to add extra configuration, but I'm also concerned about breaking configurations.
If we choose this option, I propose adding a "name" field to the route, similar to what we have for receivers.
Can we add a deduplication key instead? The default PagerDuty receiver has it. Something similar seems easier to create than keeping track of the intricacies of how hashed values from other fields change over time as users change the configuration.
Yeah, that's one of the options (named routes).
Background
GroupKey is a string that is derived from the matchers in the route (including any parent routes) and the labels in the group_by of the matching route.

There are a number of cases where GroupKey is not unique and it's possible for two (or more) different groups to have the same GroupKey. The following YAML shows a configuration containing two routes that create groups with the same GroupKey:
The reason a user might have such a configuration is to mute notifications on the weekends, but still send webhooks to an issue tracker.
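To make the collision concrete, here is a minimal Go sketch of the derivation described above. It is a simplified model and not the Alertmanager source: the `route` type and the `key` and `groupKey` helpers are hypothetical names, and the exact string format is an assumption. The point is only that the receiver does not take part in the key, so two siblings with the same matchers and group labels end up with the same GroupKey.

```go
package main

import (
	"fmt"
	"strings"
)

// route is a hypothetical, simplified stand-in for an Alertmanager route.
type route struct {
	parent   *route
	receiver string   // not used when building the key
	matchers []string // e.g. `foo="bar"`
}

// key joins the matchers of the route and all of its parents, top-down.
func (r *route) key() string {
	var b strings.Builder
	if r.parent != nil {
		b.WriteString(r.parent.key())
		b.WriteByte('/')
	}
	b.WriteString("{" + strings.Join(r.matchers, ",") + "}")
	return b.String()
}

// groupKey combines the route key with the group_by labels of the group.
func groupKey(r *route, groupLabels string) string {
	return fmt.Sprintf("%s:%s", r.key(), groupLabels)
}

func main() {
	root := &route{receiver: "test1"}
	a := &route{parent: root, receiver: "test1", matchers: []string{`foo="bar"`}}
	b := &route{parent: root, receiver: "test2", matchers: []string{`foo="bar"`}}

	labels := `{alertname="Test"}`
	fmt.Println(groupKey(a, labels))
	fmt.Println(groupKey(b, labels))
	// The two keys are equal because the receiver is not part of the key.
	fmt.Println(groupKey(a, labels) == groupKey(b, labels))
}
```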
Problems
This creates a number of problems:
Solutions
I've been thinking about the issue of the GroupKey being non-unique for a little while, and whether it's possible to make the GroupKey both stable and unique without adding new fields to the configuration file.
What do I mean by stable? I mean that the GroupKey should not depend on the position of the route relative to its siblings. This is the reason why we cannot use the RouteID when calculating the GroupKey, because as soon as the route is moved higher or lower in the configuration file the group's notification log is invalidated, causing repeat notifications.
One option we have is to use a non-cryptographic hash function (for example FNV) to calculate a hash using extra metadata from the route that is not included in the GroupKey at present. For example:
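A minimal sketch of what such a hash could look like, using Go's standard `hash/fnv` package. The `routeMeta` struct and its fields are assumptions made for illustration and do not correspond to Alertmanager's configuration types. Sorting each list before hashing is also an assumption: it keeps the hash independent of the order in which values are written in the configuration.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"sort"
)

// routeMeta is a hypothetical bundle of the route metadata mentioned above.
type routeMeta struct {
	Receiver            string
	Matchers            []string
	GroupBy             []string
	MuteTimeIntervals   []string
	ActiveTimeIntervals []string
}

// hash returns a 64-bit FNV-1a hash over all fields of the route metadata.
func (m routeMeta) hash() uint64 {
	h := fnv.New64a()
	write := func(field string, values []string) {
		// Copy before sorting so the caller's slice is left untouched.
		sorted := append([]string(nil), values...)
		sort.Strings(sorted)
		h.Write([]byte(field))
		h.Write([]byte{0}) // separator so adjacent values cannot run together
		for _, v := range sorted {
			h.Write([]byte(v))
			h.Write([]byte{0})
		}
	}
	write("receiver", []string{m.Receiver})
	write("matchers", m.Matchers)
	write("group_by", m.GroupBy)
	write("mute_time_intervals", m.MuteTimeIntervals)
	write("active_time_intervals", m.ActiveTimeIntervals)
	return h.Sum64()
}

func main() {
	a := routeMeta{Receiver: "test1", Matchers: []string{"foo=bar"}}
	b := routeMeta{Receiver: "test2", Matchers: []string{"foo=bar"}, MuteTimeIntervals: []string{"weekends"}}
	fmt.Printf("%016x\n%016x\n", a.hash(), b.hash())
}
```

Because the hash depends only on the route's own metadata, moving the route up or down among its siblings does not change it.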
This would change how GroupKey is calculated from being a top-down path of routes from parent to child (i.e. route0/route1/route...N/groupLabels) to a bottom-up Merkle tree where the parents are derived from the hash of their children.

This means that when the receiver, matchers, group by labels, or active or mute time intervals are changed in a child route, the notification logs from the parent upwards are invalidated. This might be undesirable. The other issue I see with this solution is observability. Without adequate tools, it will be very hard to tell from the logs which group is flushed just from its hash.
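The invalidation effect can be seen in a small sketch of a bottom-up hash. This is a hypothetical model, not a proposed implementation: the `node` type is an assumption, and child hashes are sorted before being mixed in so that sibling order does not matter, in line with the stability requirement above.

```go
package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"sort"
)

// node is a hypothetical route in the routing tree.
type node struct {
	receiver string
	matchers string
	children []*node
}

// hash derives the node's hash from its own fields and its children's hashes.
func (n *node) hash() uint64 {
	childHashes := make([]uint64, 0, len(n.children))
	for _, c := range n.children {
		childHashes = append(childHashes, c.hash())
	}
	// Sort so that reordering siblings does not change the parent's hash.
	sort.Slice(childHashes, func(i, j int) bool { return childHashes[i] < childHashes[j] })

	h := fnv.New64a()
	h.Write([]byte(n.receiver))
	h.Write([]byte{0})
	h.Write([]byte(n.matchers))
	h.Write([]byte{0})
	var buf [8]byte
	for _, ch := range childHashes {
		binary.BigEndian.PutUint64(buf[:], ch)
		h.Write(buf[:])
	}
	return h.Sum64()
}

func main() {
	child := &node{receiver: "test1", matchers: `foo="bar"`}
	root := &node{receiver: "test1", children: []*node{child}}

	before := root.hash()
	child.receiver = "test2" // a change in the child...
	after := root.hash()     // ...propagates to the parent's hash
	fmt.Println(before != after)
}
```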
So perhaps it's better to just add an (optional) name to routes? If the name is absent, then Alertmanager uses the current mechanism (which can be non-unique in some cases). If a unique GroupKey is required, then a name can be set on the route to discriminate between them.
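As a rough sketch of the named-route idea, the key could include the name only when one is set and otherwise fall back to the current behaviour. The `route` type, the `name` field and the way the name is written into the key are all assumptions for illustration.

```go
package main

import (
	"fmt"
	"strings"
)

// route is a simplified, hypothetical model of a route. The name field
// mirrors the proposal in this issue and is an assumption for illustration.
type route struct {
	parent   *route
	name     string // optional; empty means "not set"
	matchers string
}

// key builds the top-down route key, adding the name only when it is set.
func (r *route) key() string {
	var b strings.Builder
	if r.parent != nil {
		b.WriteString(r.parent.key())
		b.WriteByte('/')
	}
	if r.name != "" {
		b.WriteString(r.name)
	}
	b.WriteString(r.matchers)
	return b.String()
}

func main() {
	root := &route{matchers: "{}"}
	a := &route{parent: root, matchers: `{foo="bar"}`}
	b := &route{parent: root, matchers: `{foo="bar"}`}
	fmt.Println(a.key() == b.key()) // unnamed siblings share a key

	b.name = "issue-tracker"
	fmt.Println(a.key() == b.key()) // the name discriminates between them
}
```

Existing configurations without names keep their current keys, so their notification logs would not be invalidated by the change.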