This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Dogfood the monitoring we ship with Sourcegraph #5370

Closed
slimsag opened this issue Aug 26, 2019 · 11 comments

@slimsag
Member

slimsag commented Aug 26, 2019

Sourcegraph.com is currently running our old alerting stack (see here), which is poor in several respects.

We should switch to the recently added monitoring stack that ships with Sourcegraph (see https://docs.sourcegraph.com/admin/observability/alerting).

For more details see https://github.com/sourcegraph/sourcegraph/issues/5370#issuecomment-629406540

@beyang
Member

beyang commented Apr 8, 2020

@slimsag I believe this issue has been replaced by others or is mostly done. Re-open if still applicable.

@beyang beyang closed this as completed Apr 8, 2020
@slimsag slimsag changed the title from "use Grafana for alerting, move away from Prometheus alertmanager" to "move away from Prometheus alertmanager on Sourcegraph.com specifically" Apr 9, 2020
@slimsag
Member Author

slimsag commented Apr 9, 2020

Reworded to match the current state of affairs

@slimsag slimsag reopened this Apr 9, 2020
@beyang beyang added this to the Backlog milestone Apr 9, 2020
@slimsag slimsag modified the milestones: Backlog, 3.16 Apr 20, 2020
@beyang beyang modified the milestones: Backlog, 3.16 Apr 21, 2020
@davejrt
Contributor

davejrt commented Apr 30, 2020

@slimsag is it possible to get some more context here, or an example of where this is currently configured and working?

@uwedeportivo
Contributor

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.16 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@slimsag
Member Author

slimsag commented May 15, 2020

Historically, Sourcegraph.com has had its own alerting, which was not shared with our customers or other Sourcegraph instances and was regularly outdated or broken because Sourcegraph.com was being neglected.

Very recently, we created a new set of monitoring dashboards and alerting rules, which resulted in this generator: https://github.com/sourcegraph/sourcegraph/tree/master/monitoring

For the most part, we have unified the Prometheus and Grafana rules on Sourcegraph.com with this monitoring generator, but a few outliers remain:

One outlier is Site24x7, but this is expected, see https://github.com/sourcegraph/sourcegraph/issues/10742

The major outlier (what this issue is about) is that on Sourcegraph.com we still use Prometheus alertmanager to report alerts from Prometheus to OpsGenie. This is in contrast to using Grafana to send alerts to OpsGenie, which is what we advise all our customers to do (see docs here: https://docs.sourcegraph.com/admin/observability/alerting )

What we need to do to solve this issue is:

  1. Make it possible to configure Grafana alerting via a file on disk (currently we use hacky UI editing which is tedious and easily gets undone): https://github.com/sourcegraph/sourcegraph/issues/10641
  2. Set up Critical alerts on Sourcegraph.com to go to OpsGenie (via Grafana config file or something we can store in VCS) in https://github.com/sourcegraph/deploy-sourcegraph-dot-com => https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/3073
  3. Set up Warning alerts on Sourcegraph.com to go to the #alerts Slack channel. => https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/2839
  4. Make sure our existing Alertmanager rules are fully encapsulated already by what the monitoring generator covers alerts-wise. This may require introducing a Sourcegraph.com-specific mode to the monitoring generator, but I'd like to avoid that if at all possible. => https://github.com/sourcegraph/sourcegraph/issues/5370#issuecomment-654725536 => https://github.com/sourcegraph/sourcegraph/issues/12117 => Marking as completed since there is a separate issue, and since these alerts can now be subscribed via observability.alerts, we'll essentially get the same alerting as customers after point 2
  5. Remove Prometheus Alertmanager from sourcegraph.com entirely (depends on point 2)
  6. Remove all customizations in our ConfigMap so we use the exact same ConfigMap our customers do
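
For illustration only, the routing in points 2 and 3 comes down to a mapping from alert severity to destination. Below is a minimal Python sketch of that idea. The destination strings, the `Alert` type, and the alert names are hypothetical and are not Sourcegraph, Grafana, or OpsGenie APIs:

```python
from dataclasses import dataclass

# Hypothetical destination labels. The only facts taken from this issue are
# that critical alerts go to OpsGenie and warning alerts go to #alerts in Slack.
ROUTES = {
    "critical": "opsgenie",
    "warning": "slack:#alerts",
}


@dataclass(frozen=True)
class Alert:
    name: str   # illustrative alert name, not one from the real generator
    level: str  # "critical" or "warning"


def route(alert: Alert) -> str:
    """Return the destination for an alert based only on its severity level."""
    try:
        return ROUTES[alert.level]
    except KeyError:
        # Fail loudly rather than silently dropping an alert of unknown level.
        raise ValueError(f"unknown alert level: {alert.level!r}") from None


if __name__ == "__main__":
    for a in (Alert("frontend_down", "critical"), Alert("high_latency", "warning")):
        print(f"{a.name} -> {route(a)}")
```

Keeping the mapping in a plain data structure is what makes it possible to store it in VCS, which is the point of avoiding UI-only configuration.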

If you want to get started on https://github.com/sourcegraph/sourcegraph/issues/10641 that would be great. This issue, however, has many prerequisites and dependencies on me as noted above so it was a mistake to assign this to this milestone. Kicking it back.

@bobheadxi
Member

Added a 2d estimate to try and leave room for potential difficulties migrating old alerts smoothly to things like https://github.com/sourcegraph/sourcegraph/pull/11832

@bobheadxi
Member

> Make sure our existing Alertmanager rules are fully encapsulated already by what the monitoring generator covers alerts-wise. This may require introducing a Sourcegraph.com-specific mode to the monitoring generator, but I'd like to avoid that if at all possible.

@slimsag is this required? it seems that these alerts are already included in deploy-sourcegraph, meaning they do get distributed already: https://github.com/sourcegraph/deploy-sourcegraph/pull/784/files#diff-dbdb75693e4866b4d8494acb29ef5c8bR238

@slimsag
Member Author

slimsag commented Jul 8, 2020

> @slimsag is this required? it seems that these alerts are already included in deploy-sourcegraph, meaning they do get distributed already:

Eventually these need to be removed, so all alerts/monitoring items are defined in https://github.com/sourcegraph/sourcegraph/tree/master/monitoring

In other words, for us to be considered fully using the new monitoring stack I think this is necessary. But if we can use both today and do that migration slowly, I am happy with that (just break that portion out of this issue by filing a new one)
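
One way to track a gradual migration like this is to diff the legacy alert names against the names the monitoring generator already defines. The sketch below is only an illustration of that comparison. The function name and the alert names are made up, and it assumes alerts can be matched by name, which may not hold for the real rules:

```python
from typing import Iterable, List


def uncovered_alerts(legacy: Iterable[str], generated: Iterable[str]) -> List[str]:
    """Return legacy alert names that have no counterpart in the generator."""
    # Set difference: everything in legacy that is absent from generated.
    return sorted(set(legacy) - set(generated))


if __name__ == "__main__":
    # Hypothetical alert names, for illustration only.
    legacy = ["frontend_5xx", "gitserver_disk_full"]
    generated = ["frontend_5xx"]
    print(uncovered_alerts(legacy, generated))  # prints ['gitserver_disk_full']
```

An empty result would mean every legacy rule has a same-named equivalent and the old rules could be removed.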

@daxmc99
Contributor

daxmc99 commented Jul 13, 2020

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.18 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@bobheadxi
Member

The last real piece in this pie is subscribing OpsGenie to critical alerts. I will coordinate a time with @slimsag this week or next when we will take on-call and enable this, so that one of us can handle any pages that happen in that time and determine if they should just be silenced instead.

With that in mind, I'm adding this to 3.19 alongside https://github.com/sourcegraph/sourcegraph/issues/12160, which tracks the other remaining TODO here separately. cc @pecigonzalo

@bobheadxi bobheadxi modified the milestones: 3.18, 3.19 Jul 14, 2020
@bobheadxi bobheadxi changed the title from "Not dogfooding the monitoring we ship with Sourcegraph" to "Dogfood the monitoring we ship with Sourcegraph" Jul 14, 2020
bobheadxi added a commit that referenced this issue Jul 14, 2020
This banner is currently a significant source of confusion, and since we're still working on dogfooding alerts in sourcegraph.com (#5370) it's hard to say we're confident about all our alerts enough to have it displayed so prominently (yet)

Co-authored-by: ᴜɴᴋɴᴡᴏɴ <[email protected]>

7 participants