-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Alternative fix for #3805 by creating a snapshot and using it #3827
base: main
Are you sure you want to change the base?
Conversation
@grobinson-grafana A full mr and separating the snapshot-ing from the behavior change. The failing test seems unrelated. |
Awesome! Yes that test flakes a lot. It needs to be fixed 😞 |
bd37add
to
933c3af
Compare
Thanks once again for opening this @zecke! I'm going to review it this week! 👍 |
} | ||
} | ||
|
||
func (a *AlertSnapshot) Resolved() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice! This is what I was expecting 👍
for _, a := range alerts { | ||
v := a.Alert | ||
// If the end timestamp is not reached yet, do not expose it. | ||
func (as AlertsSnapshot) Less(i, j int) bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps we can share the Less
function between AlertSlice
and AlertsSnapshot
? Then we can avoid duplication of the same code! 🙂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I just realized we do not need both AlertSlice
and AlertsSnapshot
. We can either a.) rename AlertSlice
to AlertsSnapshot
and change its definition to []*AlertSnapshot
or b.) delete AlertSlice
and make the following additional changes:
diff --git a/api/v2/api.go b/api/v2/api.go
index 6f034d25..958d1db9 100644
--- a/api/v2/api.go
+++ b/api/v2/api.go
@@ -290,7 +290,7 @@ func (api *API) getAlertsHandler(params alert_ops.GetAlertsParams) middleware.Re
continue
}
- alert := AlertToOpenAPIAlert(a, api.getAlertStatus(a.Fingerprint()), receivers)
+ alert := AlertToOpenAPIAlert(types.NewAlertSnapshot(a, now), api.getAlertStatus(a.Fingerprint()), receivers)
res = append(res, alert)
}
diff --git a/api/v2/compat.go b/api/v2/compat.go
index 112bfc4f..4b17b368 100644
--- a/api/v2/compat.go
+++ b/api/v2/compat.go
@@ -117,7 +117,7 @@ func PostableSilenceToProto(s *open_api_models.PostableSilence) (*silencepb.Sile
}
// AlertToOpenAPIAlert converts internal alerts, alert types, and receivers to *open_api_models.GettableAlert.
-func AlertToOpenAPIAlert(alert *types.Alert, status types.AlertStatus, receivers []string) *open_api_models.GettableAlert {
+func AlertToOpenAPIAlert(alert *types.AlertSnapshot, status types.AlertStatus, receivers []string) *open_api_models.GettableAlert {
startsAt := strfmt.DateTime(alert.StartsAt)
updatedAt := strfmt.DateTime(alert.UpdatedAt)
endsAt := strfmt.DateTime(alert.EndsAt)
diff --git a/dispatch/dispatch.go b/dispatch/dispatch.go
index 8bf7ad6a..4860a64e 100644
--- a/dispatch/dispatch.go
+++ b/dispatch/dispatch.go
@@ -199,7 +199,7 @@ func (d *Dispatcher) run(it provider.AlertIterator) {
// AlertGroup represents how alerts exist within an aggrGroup.
type AlertGroup struct {
- Alerts types.AlertSlice
+ Alerts types.AlertsSnapshot
Labels model.LabelSet
Receiver string
}
@@ -241,7 +241,7 @@ func (d *Dispatcher) Groups(routeFilter func(*Route) bool, alertFilter func(*typ
}
alerts := ag.alerts.List()
- filteredAlerts := make([]*types.Alert, 0, len(alerts))
+ filteredAlerts := make([]*types.AlertSnapshot, 0, len(alerts))
for _, a := range alerts {
if !alertFilter(a, now) {
continue
@@ -258,7 +258,7 @@ func (d *Dispatcher) Groups(routeFilter func(*Route) bool, alertFilter func(*typ
receivers[fp] = []string{receiver}
}
- filteredAlerts = append(filteredAlerts, a)
+ filteredAlerts = append(filteredAlerts, types.NewAlertSnapshot(a, now))
}
if len(filteredAlerts) == 0 {
continue
@@ -501,10 +501,10 @@ func (ag *aggrGroup) flush(notify func(...*types.AlertSnapshot) bool) {
alertsSlice := types.SnapshotAlerts(ag.alerts.List(), time.Now())
sort.Stable(alertsSlice)
- resolvedSlice := make(types.AlertSlice, 0, len(alertsSlice))
+ resolvedSlice := make(types.AlertsSnapshot, 0, len(alertsSlice))
for _, alert := range alertsSlice {
if alert.Resolved() {
- resolvedSlice = append(resolvedSlice, &alert.Alert)
+ resolvedSlice = append(resolvedSlice, alert)
}
}
diff --git a/store/store.go b/store/store.go
index 5de0c3ee..0ae0e853 100644
--- a/store/store.go
+++ b/store/store.go
@@ -116,7 +116,7 @@ func (a *Alerts) Set(alert *types.Alert) error {
// DeleteIfNotModified deletes the slice of Alerts from the store if not
// modified.
-func (a *Alerts) DeleteIfNotModified(alerts types.AlertSlice) error {
+func (a *Alerts) DeleteIfNotModified(alerts types.AlertsSnapshot) error {
a.Lock()
defer a.Unlock()
for _, alert := range alerts {
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay. The implication is that all callers of Groups including the API will work on a snapshot.
351a58e
to
fb91517
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please see comments, otherwise LGTM. @gotjosh please review too!
@@ -28,6 +28,8 @@ func TestClusterDeduplication(t *testing.T) { | |||
t.Parallel() | |||
|
|||
conf := ` | |||
global: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this to fix flaking tests? If so, let's move it to a separate PR and I will approve it 🙂 It sounds pedantic, but it makes it much easier to use git blame
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am certainly old school in bringing change sets to a GitHub pull request. Let's split the PR into two pieces then. The first to introduce snapshots and then the second to change the EndsAt behavior.
The reason to introduce the resolve timeout is from here:
2f05568
Update the acceptance tests as the webhook now has a non-zero EndsAt
timestamp for firing alerts. The models.GettableAlert doesn't have a
`status` field which means the current test is unable to differentiate
between firing and resolved alerts.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The PR to introduce the snapshot is #3882. I will update this PR to include the behavior change for the webhook.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah OK! I didn't see that, my mistake! 😞 I thought it was to fix some flaking tests that were unrelated to this work, but I misunderstood!
@@ -29,6 +29,8 @@ func TestInhibiting(t *testing.T) { | |||
// gets resolved. | |||
|
|||
conf := ` | |||
global: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Same here 🙂
@@ -21,6 +21,10 @@ import ( | |||
. "github.com/prometheus/alertmanager/test/with_api_v2" | |||
) | |||
|
|||
const ( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here 🙂
@@ -84,6 +84,8 @@ func TestSilencing(t *testing.T) { | |||
t.Parallel() | |||
|
|||
conf := ` | |||
global: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here 🙂
@@ -274,6 +274,8 @@ func TestSendAlertsToUTF8Route(t *testing.T) { | |||
t.Parallel() | |||
|
|||
conf := ` | |||
global: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
And here 🙂
fb91517
to
7903fc8
Compare
How can we make progress on this change? I am happy to help reviewing other people changes to increase the bench of reviewers... |
Make sure that the alerts passed to notifyFunc are consistent regardless of retries and other delays. Signed-off-by: Holger Hans Peter Freyther <[email protected]>
Change the AlertGroup to use the AlertsSnapshot and the Group call to take the snapshot time. Make changes to accomodate this. Signed-off-by: Holger Hans Peter Freyther <[email protected]>
This commit fixes a bug where firing alerts had zeroed EndsAt timestamps in notifications. This bug was introduced in another PR (prometheus#1686) to fix issues where alerts are resolved during a flush. (Commit message from George Robinson) Update the acceptance tests as the webhook now has a non-zero EndsAt timestamp for firing alerts. The models.GettableAlert doesn't havea `status` field which means the current test is unable to differentiate between firing and resolved alerts. Signed-off-by: Holger Hans Peter Freyther <[email protected]>
7903fc8
to
20eb118
Compare
I rebased this PR with main to fix the conflict. |
Introduce types.AlertSnapshot and types.AlertsSnapshot that keeps track of the original dispatch time and uses it for the
Status()
andResolved()
method. This makes sure the notification is consistent across retries and/or other delays.Change the behavior to not zero EndsAt for firing alerts and update the acceptance tests.