
Address flaky/confusing sourcegraph.com alerts sentiment #11966

Closed
slimsag opened this issue Jul 6, 2020 · 5 comments

@slimsag
Member

slimsag commented Jul 6, 2020

This is a tracking issue for @sourcegraph/distribution to address the general sentiment I have observed across the engineering org that sourcegraph.com alerting is flaky or confusing. Specifically, the goal of this issue is to record feedback and collect a list of solutions we intend to enact.

Quoting the amazing meme by @efritz :

[meme image]

I am the owner of keeping this issue up-to-date and ensuring we are focusing on the right underlying solutions to resolve this problem.

Feedback

Solution plan

To address these issues, we plan to:

slimsag self-assigned this Jul 6, 2020
@keegancsmith
Member

I think the larger meta issue here is that we need the team who owns a service to own the alerts for it. The Pavlovian response of acking and ignoring for the 1 week of on-call is pretty easy to live with.

People should also feel more comfortable adjusting the alerts so they don't fire falsely, even if there is a real underlying issue. E.g., for repo-updater, update the alert to treat repo-updater differently and file an issue. I believe there may be a real (but minor) issue this is exposing.

I'm actually not familiar with how easy it is to adjust alerts these days. Back in the day it was super simple, we edited the alert file in the infra repo and that was that. I'm hoping it is mostly as easy, although I fear the alerts may be prometheus yaml embedded inside of another kubernetes yaml now :P

Thanks for filing this. I'll look and comment at each of the linked solution issues.

@slimsag
Member Author

slimsag commented Jul 6, 2020

I think the larger meta issue here is that we need the team who owns a service to own the alerts for it.

Yep, 100% agree - this is already under way, and it's one reason the new monitoring generator breaks down monitoring by service instead of arbitrary dashboards like "HTTP" :) See RFC 189: On-call rotation changes.

People should also feel more comfortable adjusting the alerts so they don't fire falsely, even if there is a real underlying issue. E.g., for repo-updater, update the alert to treat repo-updater differently and file an issue. I believe there may be a real (but minor) issue this is exposing.

I would say we should feel comfortable silencing alerts in that case, because that actually indicates "we're aware of the problem and don't care enough to fix it right now" whereas minor tweaks without investigation can lead to "why didn't this alert catch this?" in the future.

I'm actually not familiar with how easy it is to adjust alerts these days. Back in the day it was super simple, we edited the alert file in the infra repo and that was that. I'm hoping it is mostly as easy, although I fear the alerts may be prometheus yaml embedded inside of another kubernetes yaml now :P

Today on sourcegraph.com it's the same as it was in the past: you would edit this file and merge it, then it'd auto-deploy.

In the future, once https://github.com/sourcegraph/sourcegraph/issues/5370#issuecomment-629406540 is merged, you would edit the service config file in the monitoring generator and merge it. Then it'd auto-release the new Docker image and deploy on sourcegraph.com shortly after.
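
For concreteness, here is a rough sketch of the shape such a per-service definition could take in a generator like that. The type names, fields, metric name, and thresholds below are illustrative assumptions, not the actual monitoring generator API.

```go
// Hypothetical sketch of a per-service alert definition for a monitoring
// generator. Type names, fields, the metric, and the thresholds are
// illustrative assumptions, not the actual Sourcegraph generator API.
package main

import "fmt"

// Observable is one metric we watch for a service, plus the thresholds
// at which it should alert.
type Observable struct {
	Name        string  // stable identifier used in the alert name
	Description string  // human-readable summary shown with the alert
	Query       string  // Prometheus expression to evaluate
	Warning     float64 // threshold for a warning-level alert
	Critical    float64 // threshold for a critical-level alert
	Owner       string  // team that owns the service, and therefore the alert
}

// Service groups the observables for one service, so the owning team
// edits one definition instead of an arbitrary shared dashboard.
type Service struct {
	Name        string
	Observables []Observable
}

func main() {
	repoUpdater := Service{
		Name: "repo-updater",
		Observables: []Observable{{
			Name:        "sync_error_rate",
			Description: "repository sync errors over 5m",
			// The metric name below is made up for illustration.
			Query:    `sum(rate(src_repoupdater_sync_errors_total[5m]))`,
			Warning:  0.5,
			Critical: 5,
			Owner:    "core-services",
		}},
	}

	// A real generator would render each observable into a Prometheus
	// alert rule and a Grafana panel; here we just print a summary.
	for _, o := range repoUpdater.Observables {
		fmt.Printf("ALERT %s_%s: %s > %v (owner: %s)\n",
			repoUpdater.Name, o.Name, o.Query, o.Warning, o.Owner)
	}
}
```

Under that layout, adjusting a flaky threshold becomes a one-line change in the owning team's service definition, which would then be auto-released and deployed as described above.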

@pecigonzalo
Contributor

pecigonzalo commented Jul 7, 2020

Thanks for collecting this, Stephen - please check my comment here https://github.com/sourcegraph/sourcegraph/issues/10742#issuecomment-654254352 as well.

TLDR

  • Downtime should raise alerts in general, and deployments might need to mute them
  • Most external issues (DNS, internet, etc.) should notify us, even if in some cases they don't page on-call

I usually consider a flaky alert as an alert that causes a false positive, which was my interpretation of this issue, but a lot of the alerts seem to be related to real downtime, in most cases caused by a known event or deployment.
The linked solutions all seem to improve the general alerting and monitoring and how clear they are about the actual problem, but some underlying issues seem to remain.

  • Some of our deployments require downtime
  • We don't have a mechanism in place to silence certain notifications while we perform scheduled maintenance or deployments

I think the larger meta issue here is that we need the team who owns a service to own the alerts for it. The Pavlovian response of acking and ignoring for the 1 week of on-call is pretty easy to live with.

I agree that we need to ensure that alerts are actioned and not just snoozed. Are those alerts related to the deployment downtime or some other alerts?

@slimsag
Member Author

slimsag commented Jul 9, 2020

I usually consider a flaky alert as an alert that causes a false positive, which was my interpretation of this issue, but a lot of the alerts seem to be related to real downtime, in most cases caused by a known event or deployment.

Correct, that is the case today - the problem is we have some sentiment that these alerts are false-positives because we've had a lot of false-positives in the past and because they sometimes auto-resolve themselves (even if they are real issues).

The linked solutions all seem to improve the general alerting and monitoring and how clear they are about the actual problem, but some underlying issues seem to remain.

Agreed.

Some of our deployments require downtime

Correct, and this is why we don't, e.g., deploy gitserver continuously (it would basically cause a brief site-wide outage).

We don't have a mechanism in place to silence certain notifications while we perform scheduled maintenance or deployments

Technically, we do have a mechanism in place for this: /genie oncall me 1h is what people use to take all notifications while performing scheduled maintenance/infra changes. But this does not occur during our CI deployments (so in that sense, we do not have this).
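
As a sketch of what deploy-time muting could look like (not something decided here), a CI step could create a short-lived Alertmanager silence scoped to the service being rolled out. The snippet below assumes the standard Alertmanager v2 HTTP API (POST /api/v2/silences); the URL, the service_name label, and the duration are illustrative.

```go
// Rough sketch: create a temporary Alertmanager silence around a deploy,
// so known, expected alerts are muted only for a bounded window.
// Assumes the standard Alertmanager v2 API (POST /api/v2/silences); the
// URL, matcher label, and duration are illustrative, not a decided policy.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type silence struct {
	Matchers  []matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

// silenceForDeploy mutes alerts for one service for the given duration.
func silenceForDeploy(alertmanagerURL, service string, d time.Duration) error {
	s := silence{
		// Only silence alerts labeled with the service being rolled out;
		// "service_name" is an assumed label, not necessarily what we use.
		Matchers:  []matcher{{Name: "service_name", Value: service}},
		StartsAt:  time.Now(),
		EndsAt:    time.Now().Add(d),
		CreatedBy: "ci-deploy",
		Comment:   "expected restart during continuous deploy",
	}
	body, err := json.Marshal(s)
	if err != nil {
		return err
	}
	resp, err := http.Post(alertmanagerURL+"/api/v2/silences", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("creating silence: unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Example: mute repo-updater alerts for 15 minutes during its rollout.
	if err := silenceForDeploy("http://alertmanager:9093", "repo-updater", 15*time.Minute); err != nil {
		fmt.Println("could not create silence:", err)
	}
}
```

Scoping the silence to one service and a bounded window would keep the rest of the alerting live, unlike taking all notifications with /genie for the whole maintenance period.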

Are those alerts related to the deployment downtime or some other alerts?

Both. You can look at #opsgenie in Slack to get a sense of what people have faced historically.

@slimsag
Member Author

slimsag commented Oct 21, 2020

Seems most of this is finished, so closing.

slimsag closed this as completed Oct 21, 2020