
Address flaky/confusing sourcegraph.com alerts sentiment #11966

Closed
slimsag opened this issue Jul 6, 2020 · 5 comments

@slimsag
Member

slimsag commented Jul 6, 2020

This is a tracking issue for @sourcegraph/distribution to address the general sentiment I have observed across the engineering org that sourcegraph.com alerting is flaky or confusing. Specifically, the goal of this issue is to record feedback and collect a list of solutions we intend to enact.

Quoting the amazing meme by @efritz :

[meme image]

I am the owner of keeping this issue up-to-date and ensuring we are focusing on the right underlying solutions to resolve this problem.

Feedback

Solution plan

To address these issues, we plan to:

slimsag self-assigned this Jul 6, 2020
@keegancsmith
Member

I think the larger meta issue here is that we need the team who owns a service to own the alerts for it. The Pavlovian response of acking and ignoring for the 1 week of on-call is pretty easy to live with.

People should also feel more comfortable adjusting the alerts so they don't fire falsely, even if there is a real underlying issue. E.g., for repo-updater, update the alert to treat repo-updater differently and file an issue. I believe there may be a real (but minor) issue this is exposing.

I'm actually not familiar with how easy it is to adjust alerts these days. Back in the day it was super simple, we edited the alert file in the infra repo and that was that. I'm hoping it is mostly as easy, although I fear the alerts may be prometheus yaml embedded inside of another kubernetes yaml now :P

Thanks for filing this. I'll look and comment at each of the linked solution issues.

@slimsag
Member Author

slimsag commented Jul 6, 2020

I think the larger meta issue here is that we need the team who owns a service to own the alerts for it.

Yep, 100% agree - this is already under way, and it's one reason the new monitoring generator breaks down monitoring by service instead of arbitrary dashboards like "HTTP" :) See RFC 189: On-call rotation changes.

People should also feel more comfortable adjusting the alerts so they don't fire falsely, even if there is a real underlying issue. E.g., for repo-updater, update the alert to treat repo-updater differently and file an issue. I believe there may be a real (but minor) issue this is exposing.

I would say we should feel comfortable silencing alerts in that case, because that actually indicates "we're aware of the problem and don't care enough to fix it right now" whereas minor tweaks without investigation can lead to "why didn't this alert catch this?" in the future.

I'm actually not familiar with how easy it is to adjust alerts these days. Back in the day it was super simple, we edited the alert file in the infra repo and that was that. I'm hoping it is mostly as easy, although I fear the alerts may be prometheus yaml embedded inside of another kubernetes yaml now :P

Today on sourcegraph.com it's the same as it was in the past: you would edit this file and merge it, then it'd auto-deploy.

In the future, once https://github.com/sourcegraph/sourcegraph/issues/5370#issuecomment-629406540 is merged, you would edit the service config file in the monitoring generator and merge it. Then it'd auto-release the new Docker image and deploy on sourcegraph.com shortly after.
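
For concreteness, here is a rough sketch of the shape such a per-service definition could take in a generator like that. The type names, fields, metric name, and thresholds below are illustrative assumptions, not the actual monitoring generator API.

```go
// Hypothetical sketch of a per-service alert definition for a monitoring
// generator. Type names, fields, the metric, and the thresholds are
// illustrative assumptions, not the actual Sourcegraph generator API.
package main

import "fmt"

// Observable is one metric we watch for a service, plus the thresholds
// at which it should alert.
type Observable struct {
	Name        string  // stable identifier used in the alert name
	Description string  // human-readable summary shown with the alert
	Query       string  // Prometheus expression to evaluate
	Warning     float64 // threshold for a warning-level alert
	Critical    float64 // threshold for a critical-level alert
	Owner       string  // team that owns the service, and therefore the alert
}

// Service groups the observables for one service, so the owning team
// edits one definition instead of an arbitrary shared dashboard.
type Service struct {
	Name        string
	Observables []Observable
}

func main() {
	repoUpdater := Service{
		Name: "repo-updater",
		Observables: []Observable{{
			Name:        "sync_error_rate",
			Description: "repository sync errors over 5m",
			// The metric name below is made up for illustration.
			Query:    `sum(rate(src_repoupdater_sync_errors_total[5m]))`,
			Warning:  0.5,
			Critical: 5,
			Owner:    "core-services",
		}},
	}

	// A real generator would render each observable into a Prometheus
	// alert rule and a Grafana panel; here we just print a summary.
	for _, o := range repoUpdater.Observables {
		fmt.Printf("ALERT %s_%s: %s > %v (owner: %s)\n",
			repoUpdater.Name, o.Name, o.Query, o.Warning, o.Owner)
	}
}
```

Under that layout, adjusting a flaky threshold becomes a one-line change in the owning team's service definition, which would then be auto-released and deployed as described above.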

@pecigonzalo
Contributor

pecigonzalo commented Jul 7, 2020

Thanks for collecting this, Stephen - please check my comment here https://github.com/sourcegraph/sourcegraph/issues/10742#issuecomment-654254352 as well.

TLDR

  • Downtime should raise alerts in general, and deployments might need to mute them
  • Most external issues (DNS, internet, etc.) should notify us, even if in some cases they don't page on-call

I usually consider a flaky alert as an alert that causes a false positive, which was my interpretation of this issue, but a lot of the alerts seem to be related to real downtime, in most cases caused by a known event or deployment.
The linked solutions all seem to improve the general alerting and monitoring and how clear they are about the actual problem, but some underlying issues seem to remain.

  • Some of our deployments require downtime
  • We don't have a mechanism in place to silence certain notifications while we perform scheduled maintenance or deployments

I think the larger meta issue here is that we need the team who owns a service to own the alerts for it. The Pavlovian response of acking and ignoring for the 1 week of on-call is pretty easy to live with.

I agree that we need to ensure that alerts are actioned and not just snoozed. Are those alerts related to the deployment downtime or some other alerts?

@slimsag
Member Author

slimsag commented Jul 9, 2020

I usually consider a flaky alert as an alert that causes a false positive, which was my interpretation of this issue, but a lot of the alerts seem to be related to real downtime, in most cases caused by a known event or deployment.

Correct, that is the case today - the problem is we have some sentiment that these alerts are false-positives because we've had a lot of false-positives in the past and because they sometimes auto-resolve themselves (even if they are real issues).

The linked solutions all seem to improve the general alerting and monitoring and how clear they are about the actual problem, but some underlying issues seem to remain.

Agreed.

Some of our deployments require downtime

Correct, and this is why we don't, e.g., deploy gitserver continuously (it would basically cause a brief site-wide outage).

We don't have a mechanism in place to silence certain notifications while we perform scheduled maintenance or deployments

Technically, we do have a mechanism in place for this: /genie oncall me 1h is what people use to take all notifications while performing scheduled maintenance/infra changes. But this does not occur during our CI deployments (so in that sense, we do not have this).
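
As a sketch of what deploy-time muting could look like (not something decided here), a CI step could create a short-lived Alertmanager silence scoped to the service being rolled out. The snippet below assumes the standard Alertmanager v2 HTTP API (POST /api/v2/silences); the URL, the service_name label, and the duration are illustrative.

```go
// Rough sketch: create a temporary Alertmanager silence around a deploy,
// so known, expected alerts are muted only for a bounded window.
// Assumes the standard Alertmanager v2 API (POST /api/v2/silences); the
// URL, matcher label, and duration are illustrative, not a decided policy.
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
	"time"
)

type matcher struct {
	Name    string `json:"name"`
	Value   string `json:"value"`
	IsRegex bool   `json:"isRegex"`
}

type silence struct {
	Matchers  []matcher `json:"matchers"`
	StartsAt  time.Time `json:"startsAt"`
	EndsAt    time.Time `json:"endsAt"`
	CreatedBy string    `json:"createdBy"`
	Comment   string    `json:"comment"`
}

// silenceForDeploy mutes alerts for one service for the given duration.
func silenceForDeploy(alertmanagerURL, service string, d time.Duration) error {
	s := silence{
		// Only silence alerts labeled with the service being rolled out;
		// "service_name" is an assumed label, not necessarily what we use.
		Matchers:  []matcher{{Name: "service_name", Value: service}},
		StartsAt:  time.Now(),
		EndsAt:    time.Now().Add(d),
		CreatedBy: "ci-deploy",
		Comment:   "expected restart during continuous deploy",
	}
	body, err := json.Marshal(s)
	if err != nil {
		return err
	}
	resp, err := http.Post(alertmanagerURL+"/api/v2/silences", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("creating silence: unexpected status %s", resp.Status)
	}
	return nil
}

func main() {
	// Example: mute repo-updater alerts for 15 minutes during its rollout.
	if err := silenceForDeploy("http://alertmanager:9093", "repo-updater", 15*time.Minute); err != nil {
		fmt.Println("could not create silence:", err)
	}
}
```

Scoping the silence to one service and a bounded window would keep the rest of the alerting live, unlike taking all notifications with /genie for the whole maintenance period.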

Are those alerts related to the deployment downtime or some other alerts?

Both. You can look at #opsgenie in Slack to get a sense of what people have faced historically.

@slimsag
Member Author

slimsag commented Oct 21, 2020

Seems most of this is finished, so closing.

slimsag closed this as completed Oct 21, 2020