Address flaky/confusing sourcegraph.com alerts sentiment #11966
I think the larger meta-issue here is that we need the team who owns a service to own the alerts for it. The Pavlovian response of acking and ignoring for the one week of on-call is pretty easy to live with. People should also feel more comfortable adjusting alerts so they don't false-fire, even if there is a real underlying issue. E.g. for repo-updater, update the alert to treat repo-updater differently and file an issue. I believe there may be a real (but minor) issue this is exposing.
I'm actually not familiar with how easy it is to adjust alerts these days. Back in the day it was super simple: we edited the alert file in the infra repo and that was that. I'm hoping it is mostly as easy, although I fear the alerts may be Prometheus YAML embedded inside of another Kubernetes YAML now :P
Thanks for filing this. I'll look at and comment on each of the linked solution issues.
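For context on the "YAML inside YAML" worry, here is a minimal sketch, assuming a standard Prometheus-in-Kubernetes setup rather than the actual sourcegraph.com config, of what such an embedded alert rule looks like and where a per-service tweak like the repo-updater one above would go. The alert name, metric, and threshold below are hypothetical.

```yaml
# Minimal sketch only: hypothetical alert name, metric, and threshold,
# not the actual sourcegraph.com rules.
apiVersion: v1
kind: ConfigMap
metadata:
  name: prometheus-alert-rules   # hypothetical name
data:
  alert_rules.yml: |
    groups:
      - name: repo-updater
        rules:
          - alert: RepoUpdaterHighErrorRate
            # Hypothetical query: the "treat repo-updater differently"
            # tweak would live here (e.g. a higher threshold or a longer
            # "for" duration), rather than silencing the alert entirely.
            expr: sum(rate(src_repoupdater_errors_total[5m])) > 5
            for: 15m
            labels:
              severity: warning
            annotations:
              summary: "repo-updater error rate is elevated (hypothetical threshold)"
```

The point of the sketch is only that the alert expression lives inside a string field of a larger Kubernetes object, so small per-service edits are still just a file change and a merge.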
Yep, 100% agree - this is already underway, and it's one reason the new monitoring generator breaks down monitoring by service instead of by arbitrary dashboards like "HTTP" :) See RFC 189: On-call rotation changes.
I would say we should feel comfortable silencing alerts in that case, because silencing actually indicates "we're aware of the problem and don't care enough to fix it right now," whereas minor tweaks made without investigation can lead to "why didn't this alert catch this?" later on.
Today on sourcegraph.com it's the same as it was in the past: you edit this file and merge it, and it auto-deploys. In the future, once https://github.com/sourcegraph/sourcegraph/issues/5370#issuecomment-629406540 is merged, you will edit the service's config file in the monitoring generator and merge it; the new Docker image is then auto-released and deployed to sourcegraph.com shortly after.
Thanks for collecting this, Stephen. Please check my comment here as well: https://github.com/sourcegraph/sourcegraph/issues/10742#issuecomment-654254352.
I usually consider a flaky alert to be one that causes a false positive, which was my interpretation of this issue, but a lot of these alerts seem to be related to real downtime, in most cases caused by a known event or deployment.
I agree that we need to ensure that alerts are actioned and not just snoozed. Are those alerts related to the deployment downtime, or are they other alerts?
Correct, that is the case today - the problem is there is some sentiment that these alerts are false positives, because we've had a lot of false positives in the past and because they sometimes auto-resolve themselves (even when they are real issues).
Agreed.
Correct, and we don't e.g. deploy gitserver continuously because of this (doing so would basically cause a brief site-wide outage).
Technically, we do have a mechanism in place for this.
Both. You can look at #opsgenie in Slack to get a sense of what people have faced historically.
Seems most of this is finished, so closing.
This is a tracking issue for @sourcegraph/distribution to address the general sentiment I have observed across the engineering org that sourcegraph.com alerting is flaky or confusing. Specifically, the goal of this issue is to record feedback and collect a list of solutions we intend to enact.
Quoting the amazing meme by @efritz:
I am the owner of keeping this issue up-to-date and ensuring we are focusing on the right underlying solutions to resolve this problem.
Feedback
Solution plan
To address these issues, we plan to: