Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix MaxSilences limit causes incomplete updates of existing silences #3877

Conversation

grobinson-grafana
Copy link
Contributor

@grobinson-grafana grobinson-grafana commented Jun 14, 2024

This commit fixes a bug where the MaxSilences limit can cause an incomplete update of existing silences, where the old silence can be expired but the new silence cannot be created because it would exceeded the maximum number of silences.

@grobinson-grafana grobinson-grafana force-pushed the grobinson/fix-limits-replacing-silences branch from de674ea to 41e148f Compare June 14, 2024 11:07
@grobinson-grafana grobinson-grafana force-pushed the grobinson/fix-limits-replacing-silences branch 3 times, most recently from 72bbc68 to 4e027b7 Compare June 21, 2024 15:06
This commit fixes a bug where the MaxSilences limit can cause an
incomplete update of existing silences, where the old silence can
be expired but the new silence cannot be created because it would
exceeded the maximum number of silences.

Signed-off-by: George Robinson <[email protected]>
@grobinson-grafana grobinson-grafana force-pushed the grobinson/fix-limits-replacing-silences branch from 4e027b7 to 8de387e Compare June 24, 2024 13:26
@grobinson-grafana grobinson-grafana changed the title Fix limits cause incomplete update for existing silences Fix MaxSilences limit causes incomplete updates of existing silences Jun 24, 2024
@@ -564,6 +564,13 @@ func (s *Silences) getSilence(id string) (*pb.Silence, bool) {
return msil.Silence, true
}

func (s *Silences) toMeshSilence(sil *pb.Silence) *pb.MeshSilence {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've moved this to a function for two reasons:

  1. It allows us to check the full error in unit tests. For example:

    require.EqualError(t, s.Set(sil3), fmt.Sprintf("silence exceeded maximum size: %d bytes (limit: 4096 bytes)", s.toMeshSilence(sil3).Size()))
  2. We will use it in the next PR where we fix the same bug but for maximum size limits and validation errors.

return fmt.Errorf("expire previous silence: %w", err)
}
}
if ok && canUpdate(prev, sil, now) {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This code should run before checking the MaxSilences() limit because we don't want to prevent updating an existing silence if we are at the limit and the update does not require expiring and creating a new silence.

@@ -631,6 +627,13 @@ func (s *Silences) Set(sil *pb.Silence) error {
}
}

if ok && getState(prev, s.nowUTC()) != types.SilenceStateExpired {
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The entire change here is moving the expiration of the existing silence to run after checking the MaxSilences() limit. This ensures that an existing silence is not expired if we are at the limit, as what would happen is the existing silence is expired and then the new silence cannot be created.

StartsAt: time.Now(),
EndsAt: time.Now().Add(5 * time.Minute),
}
require.EqualError(t, s.Set(sil2), "exceeded maximum number of silences: 1 (limit: 1)")
require.Equal(t, "", sil2.Id)
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have removed all checks for require.Equal(t, "", id) in this test because it's unrelated to limits.

StartsAt: time.Now(),
EndsAt: time.Now().Add(5 * time.Minute),
}
err = s.Set(sil3)
require.Error(t, err)
// Do not check the exact size as it can change between consecutive runs
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed this.

require.NotEqual(t, "", sil3.Id)
require.EqualError(t, s.Set(sil3), fmt.Sprintf("silence exceeded maximum size: %d bytes (limit: 4096 bytes)", s.toMeshSilence(sil3).Size()))

// Should be able to insert sil4.
Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added additional tests here to cover the following cases:

  1. Updating an existing silence without modifications.
  2. Adding a comment to an existing silence.
  3. Failing to update an existing silence because the maximum number of silences has been exceeded should not expire the existing silence.

Signed-off-by: George Robinson <[email protected]>
Copy link
Member

@gotjosh gotjosh left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

silence/silence.go Outdated Show resolved Hide resolved
silence/silence.go Outdated Show resolved Hide resolved
Signed-off-by: George Robinson <[email protected]>
@gotjosh gotjosh merged commit b444381 into prometheus:main Jun 24, 2024
11 checks passed
grobinson-grafana added a commit to grobinson-grafana/alertmanager that referenced this pull request Jun 24, 2024
…rometheus#3877)

* Fix MaxSilences limit causes incomplete updates of existing silences

This commit fixes a bug where the MaxSilences limit can cause an
incomplete update of existing silences, where the old silence can
be expired but the new silence cannot be created because it would
exceeded the maximum number of silences.

Signed-off-by: George Robinson <[email protected]>

---------

Signed-off-by: George Robinson <[email protected]>
@grobinson-grafana grobinson-grafana deleted the grobinson/fix-limits-replacing-silences branch June 25, 2024 15:59
grobinson-grafana added a commit to grafana/mimir that referenced this pull request Jun 26, 2024
This commit fixes the following bugs in silences:

- prometheus/alertmanager#3877
- prometheus/alertmanager#3898
- prometheus/alertmanager#3897

which could cause an existing silence to be deleted/expired
when updating the silence failed. This could be because
the replacing silence exceeded limits or was invalid.
grobinson-grafana added a commit to grafana/mimir that referenced this pull request Jun 26, 2024
This commit fixes the following bugs in silences:

- prometheus/alertmanager#3877
- prometheus/alertmanager#3898
- prometheus/alertmanager#3897

which could cause an existing silence to be deleted/expired
when updating the silence failed. This could be because
the replacing silence exceeded limits or was invalid.
grobinson-grafana added a commit to grafana/mimir that referenced this pull request Jun 26, 2024
This commit fixes the following bugs in silences:

- prometheus/alertmanager#3877
- prometheus/alertmanager#3898
- prometheus/alertmanager#3897

which could cause an existing silence to be deleted/expired
when updating the silence failed. This could be because
the replacing silence exceeded limits or was invalid.

additional tests in upstream.
grobinson-grafana added a commit to grafana/mimir that referenced this pull request Jun 26, 2024
This commit fixes the following bugs in silences:

- prometheus/alertmanager#3877
- prometheus/alertmanager#3898
- prometheus/alertmanager#3897

which could cause an existing silence to be deleted/expired
when updating the silence failed. This could be because
the replacing silence exceeded limits or was invalid.
dimitarvdimitrov pushed a commit to grafana/mimir that referenced this pull request Jul 2, 2024
This commit fixes the following bugs in silences:

- prometheus/alertmanager#3877
- prometheus/alertmanager#3898
- prometheus/alertmanager#3897

which could cause an existing silence to be deleted/expired
when updating the silence failed. This could be because
the replacing silence exceeded limits or was invalid.

(cherry picked from commit 1cfb657)
dimitarvdimitrov added a commit to grafana/mimir that referenced this pull request Jul 2, 2024
* Fixes a number of bugs in silences (#8525)

This commit fixes the following bugs in silences:

- prometheus/alertmanager#3877
- prometheus/alertmanager#3898
- prometheus/alertmanager#3897

which could cause an existing silence to be deleted/expired
when updating the silence failed. This could be because
the replacing silence exceeded limits or was invalid.

(cherry picked from commit 1cfb657)

* Update CHANGELOG.md (#8526)

(cherry picked from commit 36f7af3)

---------

Co-authored-by: George Robinson <[email protected]>
TheMeier pushed a commit to TheMeier/alertmanager that referenced this pull request Sep 29, 2024
…rometheus#3877)

* Fix MaxSilences limit causes incomplete updates of existing silences

This commit fixes a bug where the MaxSilences limit can cause an
incomplete update of existing silences, where the old silence can
be expired but the new silence cannot be created because it would
exceeded the maximum number of silences.

Signed-off-by: George Robinson <[email protected]>

---------

Signed-off-by: George Robinson <[email protected]>
gotjosh added a commit that referenced this pull request Oct 24, 2024
* Release v0.28.0-rc.0

* [CHANGE] Templating errors in the SNS integration now return an error. #3531 #3879
* [FEATURE] Add a new Microsoft Teams integration based on Flows #4024
* [FEATURE] Add a new Rocket.Chat integration #3600
* [FEATURE] Add a new Jira integration #3590 #3931
* [FEATURE] Add support for `GOMEMLIMIT`, enable it via the feature flag `--enable-feature=auto-gomemlimit`. #3895
* [FEATURE] Add support for `GOMAXPROCS`, enable it via the feature flag `--enable-feature=auto-gomaxprocs`. #3837
* [FEATURE] Add support for limits of silences including the maximum number of active and pending silences, and the maximum size per silence (in bytes). You can use the flags `--silences.max-silences` and `--silences.max-silence-size-bytes` to set them accordingly #3852 #3862 #3866 #3885 #3886 #3877
* [FEATURE] Muted alerts now show whether they are suppressed or not in both the `/api/v2/alerts` endpoint and the Alertmanager UI. #3793 #3797 #3792
* [ENHANCEMENT] Add support for `content`, `username` and `avatar_url` in the Discord integration. `content` and `username` also support templating. #4007
* [ENHANCEMENT] Only invalidate the silences cache if a new silence is created or an existing silence replaced - should improve latency on both `GET api/v2/alerts` and `POST api/v2/alerts` API endpoint. #3961
* [ENHANCEMENT] Add image source label to Dockerfile. To get changelogs shown when using Renovate #4062
* [ENHANCEMENT] Build using go 1.23 #4071
* [ENHANCEMENT] Support setting a global SMTP TLS configuration. #3732
* [ENHANCEMENT] The setting `room_id` in the WebEx integration can now be templated to allow for dynamic room IDs. #3801
* [ENHANCEMENT] Enable setting `message_thread_id` for the Telegram integration. #3638
* [ENHANCEMENT] Support the `since` and `humanizeDuration` functions to templates. This means users can now format time to more human-readable text. #3863
* [ENHANCEMENT] Support the `date` and `tz` functions to templates. This means users can now format time in a specified format and also change the timezone to their specific locale. #3812
* [ENHANCEMENT] Latency metrics now support native histograms. #3737
* [BUGFIX] Fix the SMTP integration not correctly closing an SMTP submission, which may lead to unsuccessful dispatches being marked as successful. #4006
* [BUGFIX]  The `ParseMode` option is now set explicitly in the Telegram integration. If we don't HTML tags had not been parsed by default. #4027
* [BUGFIX] Fix a memory leak that was caused by updates silences continuously. #3930
* [BUGFIX] Fix hiding secret URLs when the URL is incorrect. #3887
* [BUGFIX] Fix a race condition in the alerts - it was more of a hypothetical race condition that could have occurred in the alert reception pipeline. #3648
* [BUGFIX] Fix a race condition in the alert delivery pipeline that would cause a firing alert that was delivered earlier to be deleted from the aggregation group when instead it should have been delivered again. #3826
* [BUGFIX] Fix version in APIv1 deprecation notice. #3815
* [BUGFIX] Fix crash errors when using `url_file` in the Webhook integration. #3800
* [BUGFIX] fix `Route.ID()` returns conflicting IDs. #3803
* [BUGFIX] Fix deadlock on the alerts memory store. #3715
* [BUGFIX] Fix `amtool template render` when using the default values. #3725
* [BUGFIX] Fix `webhook_url_file` for both the Discord and Microsoft Teams integrations. #3728 #3745

---------

Signed-off-by: SuperQ <[email protected]>
Signed-off-by: gotjosh <[email protected]>
Co-authored-by: gotjosh <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants