Dogfood the monitoring we ship with Sourcegraph #5370

slimsag · 2019-08-26T20:06:43Z

Sourcegraph.com currently runs our old alerting stack (see here), which is poor in several respects.

We should switch to the recently added monitoring stack that ships with Sourcegraph (see https://docs.sourcegraph.com/admin/observability/alerting).

For more details see https://github.com/sourcegraph/sourcegraph/issues/5370#issuecomment-629406540

beyang · 2020-04-08T23:56:32Z

@slimsag I believe this issue has been replaced by others or is mostly done. Re-open if still applicable.

slimsag · 2020-04-09T00:28:20Z

Reworded to match the current state of affairs

davejrt · 2020-04-30T20:21:06Z

@slimsag is it possible to get some more context here, or an example of where this is currently configured and working?

uwedeportivo · 2020-05-13T17:51:17Z

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.16 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

slimsag · 2020-05-15T18:15:49Z

Historically, Sourcegraph.com has had alerting that was not shared with our customers or other Sourcegraph instances, and it was regularly outdated or broken because Sourcegraph.com was being neglected.

Very recently, we created a new set of monitoring dashboards and alerting rules, which resulted in this generator: https://github.com/sourcegraph/sourcegraph/tree/master/monitoring

For the most part, we have unified the Prometheus and Grafana rules on Sourcegraph.com with this monitoring generator, but a few outliers remain:

One outlier is Site24x7, but this is expected; see https://github.com/sourcegraph/sourcegraph/issues/10742

The major outlier (what this issue is about) is that on Sourcegraph.com we still use Prometheus Alertmanager to report alerts from Prometheus to OpsGenie. In contrast, we advise all our customers to use Grafana to send alerts to OpsGenie (see the docs: https://docs.sourcegraph.com/admin/observability/alerting).

What we need to do to solve this issue is:

1. Make it possible to configure Grafana alerting via a file on disk (currently we use hacky UI editing which is tedious and easily gets undone): https://github.com/sourcegraph/sourcegraph/issues/10641
2. Set up Critical alerts on Sourcegraph.com to go to OpsGenie (via Grafana config file or something we can store in VCS) in https://github.com/sourcegraph/deploy-sourcegraph-dot-com => https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/3073
3. Set up Warning alerts on Sourcegraph.com to go to the #alerts Slack channel. => https://github.com/sourcegraph/deploy-sourcegraph-dot-com/pull/2839
4. Make sure our existing Alertmanager rules are fully encapsulated already by what the monitoring generator covers alerts-wise. This may require introducing a Sourcegraph.com-specific mode to the monitoring generator, but I'd like to avoid that if at all possible. => https://github.com/sourcegraph/sourcegraph/issues/5370#issuecomment-654725536 => https://github.com/sourcegraph/sourcegraph/issues/12117 => Marking as completed since there is a separate issue, and since these alerts can now be subscribed via observability.alerts, we'll essentially get the same alerting as customers after point 2
5. Remove Prometheus Alertmanager from sourcegraph.com entirely (depends on point 2)
6. Remove all customizations in our ConfigMap so we use the exact same ConfigMap our customers do
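
The Critical and Warning steps above amount to routing alerts by severity level. A minimal Python sketch of that idea, assuming only two levels exist; the `Alert` class, the `ROUTES` table, and the destination strings are hypothetical names for illustration, not actual Sourcegraph, Grafana, or OpsGenie configuration:

```python
from dataclasses import dataclass

# Hypothetical destinations keyed by alert level (illustrative only).
ROUTES = {
    "critical": "opsgenie",
    "warning": "slack:#alerts",
}


@dataclass(frozen=True)
class Alert:
    name: str
    level: str  # expected to be "critical" or "warning"


def route(alert: Alert) -> str:
    """Return the destination for an alert based on its level."""
    try:
        return ROUTES[alert.level]
    except KeyError:
        # Fail loudly rather than silently dropping an alert with an unknown level.
        raise ValueError(f"unknown alert level: {alert.level!r}")
```

Keeping the mapping in one table that lives in version control is the point of the first step in the list: the routing is reviewable and cannot be undone by a UI edit.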

If you want to get started on https://github.com/sourcegraph/sourcegraph/issues/10641 that would be great. This issue, however, has many prerequisites and dependencies on me as noted above so it was a mistake to assign this to this milestone. Kicking it back.

bobheadxi · 2020-07-02T10:11:22Z

Added a 2d estimate to try and leave room for potential difficulties migrating old alerts smoothly to things like https://github.com/sourcegraph/sourcegraph/pull/11832

bobheadxi · 2020-07-07T09:30:36Z

> Make sure our existing Alertmanager rules are fully encapsulated already by what the monitoring generator covers alerts-wise. This may require introducing a Sourcegraph.com-specific mode to the monitoring generator, but I'd like to avoid that if at all possible.

@slimsag is this required? it seems that these alerts are already included in deploy-sourcegraph, meaning they do get distributed already: https://github.com/sourcegraph/deploy-sourcegraph/pull/784/files#diff-dbdb75693e4866b4d8494acb29ef5c8bR238

slimsag · 2020-07-08T04:00:02Z

> @slimsag is this required? it seems that these alerts are already included in deploy-sourcegraph, meaning they do get distributed already:

Eventually these need to be removed, so all alerts/monitoring items are defined in https://github.com/sourcegraph/sourcegraph/tree/master/monitoring

In other words, for us to be considered fully using the new monitoring stack I think this is necessary. But if we can use both today and do that migration slowly, I am happy with that (just break that portion out of this issue by filing a new one)
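
One rough way to track that migration is to diff the names of the legacy rules against the alerts the generator defines. A minimal Python sketch, assuming rules can be matched by name alone (a real comparison would also need to look at expressions and thresholds); the function and all rule names below are hypothetical:

```python
def uncovered_rules(legacy_rules, generator_alerts):
    """Return legacy rule names with no same-named generator alert, sorted."""
    return sorted(set(legacy_rules) - set(generator_alerts))


# Hypothetical rule names, for illustration only.
legacy = ["PodRestarts", "DiskSpaceLow", "FrontendErrorRate"]
generated = ["FrontendErrorRate", "DiskSpaceLow"]

# Prints the legacy rules that still need a generator equivalent.
print(uncovered_rules(legacy, generated))
```

An empty result would suggest the legacy rules can be removed, subject to the name-matching assumption above.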

daxmc99 · 2020-07-13T18:01:51Z

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.18 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

bobheadxi · 2020-07-14T14:53:00Z

The last real piece in this pie is subscribing OpsGenie to critical alerts. I will coordinate a time with @slimsag this week or next where we will take on-call and enable this, so that one of us can handle any pages that happen in that time and determine whether they should just be silenced instead.

With that in mind, I'm adding this to 3.19 alongside https://github.com/sourcegraph/sourcegraph/issues/12160, which tracks the other remaining TODO here separately. cc @pecigonzalo

This banner is currently a significant source of confusion, and since we're still working on dogfooding alerts on sourcegraph.com (#5370), it's hard to say we're confident enough in all our alerts to have it displayed so prominently (yet).

slimsag added the team/distribution 🚢📦💨 label Aug 26, 2019

slimsag mentioned this issue Aug 26, 2019

alert for indexed-search pod #5369

Closed

beyang closed this as completed Apr 8, 2020

slimsag changed the title ~~use Grafana for alerting, move away from Prometheus alertmanager~~ move away from Prometheus alertmanager on Sourcegraph.com specifically Apr 9, 2020

slimsag reopened this Apr 9, 2020

beyang added this to the Backlog milestone Apr 9, 2020

slimsag modified the milestones: Backlog, 3.16 Apr 20, 2020

beyang modified the milestones: Backlog, 3.16 Apr 21, 2020

beyang assigned davejrt Apr 21, 2020

beyang mentioned this issue Apr 21, 2020

distribution: 3.16 tracking issue #10069

Closed

47 tasks

slimsag self-assigned this May 15, 2020

slimsag modified the milestones: 3.16, 3.17 May 15, 2020

bobheadxi mentioned this issue May 29, 2020

observability: Alerting should be easily configured through a file or site configuration #10641

Closed

slimsag added the planned/3.17 label May 29, 2020

slimsag mentioned this issue May 29, 2020

Distribution: 3.17 Tracking issue #10788

Closed

37 tasks

slimsag removed their assignment Jun 3, 2020

slimsag mentioned this issue Jun 3, 2020

Determine missing alerts between alertmanager and grafana #7528

Closed

slimsag assigned slimsag and bobheadxi and unassigned slimsag Jul 1, 2020

slimsag added the okr/distribution/admin-experience label Jul 2, 2020

bobheadxi added the estimate/2d label Jul 2, 2020

pecigonzalo added the planned/3.18 label Jul 2, 2020

slimsag mentioned this issue Jul 6, 2020

Address flaky/confusing sourcegraph.com alerts sentiment #11966

Closed

bobheadxi mentioned this issue Jul 7, 2020

prometheus: add builtin alertmanager, labels.level for builtin alerts sourcegraph/deploy-sourcegraph#784

Merged

This was referenced Jul 7, 2020

monitoring: adjust thresholds for noisy critical alerts #11988

Merged

monitoring: reassess flakey/unactionable critical alerts #12011

Closed

This was referenced Jul 8, 2020

monitoring: alerting followups #12026

Closed

monitoring: migrate existing alert rules to generator #12117

Closed

This was referenced Jul 14, 2020

monitoring: disable critical alerts banner by default #12155

Merged

monitoring: remove custom alertmanager from cloud #12160

Closed

bobheadxi modified the milestones: 3.18, 3.19 Jul 14, 2020

bobheadxi added the planned/3.19 label Jul 14, 2020

pecigonzalo mentioned this issue Jul 14, 2020

Distribution: 3.19 Tracking issue #11954

Closed

55 tasks

bobheadxi changed the title ~~Not dogfooding the monitoring we ship with Sourcegraph~~ Dogfood the monitoring we ship with Sourcegraph Jul 14, 2020

chayim mentioned this issue Jul 14, 2020

RFC 196 Tracking Issue #12166

Closed

bobheadxi mentioned this issue Jul 15, 2020

Approved: Proposal: RFC-189: Support per-team alerts and on-call rotations #12010

Closed

bobheadxi mentioned this issue Jul 28, 2020

prometheus: remove custom alert rules and records, scrape prometheus sourcegraph/deploy-sourcegraph#805

Merged

bobheadxi closed this as completed Aug 4, 2020
