distribution: add monitoring architecture page #1221

slimsag · 2020-07-16T15:46:01Z

@bobheadxi noted to me over Slack:

I think this is a great idea and something that I have long intended to put together, but I think an RFC may be the wrong forum for it because many of these decisions have already been made. I am therefore proposing the following:

- The current monitoring architecture and the decisions we have already made are documented and live in this document. For any question @pecigonzalo has asked me or you, @bobheadxi, about "why are we doing this this way?", I would like this document to explain it.
- Changes to the current monitoring architecture and decisions we are considering should be RFCs (or, if they are small changes, lightweight Proposal: ... GitHub issues, like https://github.com/sourcegraph/sourcegraph/issues/12010).

@bobheadxi can you own filling this out from here with all the context I've transferred to you, any questions Gonza has asked you, etc.? I'll then follow up and add any context I may have as well.

Rendered: https://github.com/sourcegraph/about/blob/monitoring-architecture/handbook/engineering/distribution/observability/monitoring_architecture.md

slimsag · 2020-07-16T17:12:27Z

Added an architecture diagram:

bobheadxi · 2020-07-17T06:57:43Z

handbook/engineering/distribution/observability/monitoring_pillars.md

@@ -16,6 +17,20 @@ See [the monitoring developer guide](monitoring.md) for information on how to de
  - [Why can't I have all information expanded by default on the dashboard?](#faq-why-cant-i-have-all-information-expanded-by-default-on-the-dashboard)
 - [Next steps](#next-steps)

+## Long-term vision


paraphrased from https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648628953

bobheadxi · 2020-07-17T07:51:41Z

handbook/engineering/distribution/observability/monitoring_architecture.md

+
+Learn more about the `alert_count` metrics in the [metrics guide](https://docs.sourcegraph.com/admin/observability/metrics_guide#alert-count).
+
+*Rationale for `alert_count`*: TODO(@slimsag)


@slimsag

Also posted a related question on Slack: https://sourcegraph.slack.com/archives/CJX299FGE/p1594971987034300

I will follow up here soon. As with everything, there is some nuance and there are reasons why this made sense in the past (and may or may not make sense today). :)


pecigonzalo

This is looking really good!


pecigonzalo · 2020-07-17T13:09:11Z

handbook/engineering/distribution/observability/monitoring_architecture.md

+
+Alertmanager is bundled in `sourcegraph/prometheus`, and notifications are configured for Sourcegraph alerts [using site configuration](https://docs.sourcegraph.com/admin/observability/alerting). This functionality is provided by the [prom-wrapper](#prom-wrapper).
+
+*Rationale for notifiers in site configuration*: Due to the limitations of [admin reverse-proxies](#admin-reverse-proxy), alerts cannot be configured without port-forwarding or custom ConfigMaps, something we [want to avoid](monitoring_pillars.md#long-term-vision).


As I recall from our conversation, we could have shipped this via ConfigMaps; the limitation comes from the siteconfig <- prom-wrapper implementation we want to build.
I think we should mention somewhere that we want to ask admins to configure a default destination for notifications when they onboard a site through the frontend, which is what drives this implementation. Otherwise, I think this could also be a required variable of our deployments.

It's twofold - both your point, and this comment

(at least, from my understanding 😅 )

> we want to request admins when onboarding a site through the frontend to configure a default destination for notifications

https://github.com/sourcegraph/sourcegraph/issues/12332 and one of the points here, currently worded as:

> we want minimize the number of Sourcegraph instances without any alerting set up
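
To make the siteconfig <- prom-wrapper idea in this thread concrete, here is a minimal sketch of how notifier entries from site configuration could be translated into an Alertmanager-style configuration. This is not the actual prom-wrapper implementation: the input field names (`level`, `notifier`, `type`, `url`), the example URLs, and the `build_alertmanager_config` helper are all assumptions made for illustration. Only the output keys (`route`, `routes`, `match`, `continue`, `receivers`, `slack_configs`, `webhook_configs`) follow Alertmanager's configuration format.

```python
from typing import Any, Dict, List

# Hypothetical site-configuration fragment. The field names ("level",
# "notifier", "type", "url") and the URLs are assumptions for this sketch.
SITE_CONFIG_ALERTS: List[Dict[str, Any]] = [
    {"level": "critical", "notifier": {"type": "slack", "url": "https://hooks.slack.example/critical"}},
    {"level": "warning", "notifier": {"type": "webhook", "url": "https://alerts.example/hook"}},
]


def build_alertmanager_config(alerts: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Translate notifier entries into an Alertmanager-style config dict."""
    # Catch-all receiver with no integrations: unmatched alerts notify nobody.
    receivers: List[Dict[str, Any]] = [{"name": "default"}]
    routes: List[Dict[str, Any]] = []
    for index, alert in enumerate(alerts):
        notifier = alert["notifier"]
        name = f"{alert['level']}-{notifier['type']}-{index}"
        if notifier["type"] == "slack":
            receiver = {"name": name, "slack_configs": [{"api_url": notifier["url"]}]}
        elif notifier["type"] == "webhook":
            receiver = {"name": name, "webhook_configs": [{"url": notifier["url"]}]}
        else:
            raise ValueError(f"unsupported notifier type: {notifier['type']}")
        receivers.append(receiver)
        # Route alerts by their "level" label; "continue" lets later routes
        # for the same level fire as well.
        routes.append({"match": {"level": alert["level"]}, "receiver": name, "continue": True})
    return {"route": {"receiver": "default", "routes": routes}, "receivers": receivers}


config = build_alertmanager_config(SITE_CONFIG_ALERTS)
```

The point of the sketch is that admins only edit site configuration, and a wrapper process generates the Alertmanager configuration from it, so no port-forwarding or custom ConfigMaps are needed to set up notifications.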


bobheadxi · 2020-07-20T12:51:44Z

handbook/engineering/distribution/observability/monitoring_architecture.md

+
+## Custom additions
+
+TODO: how we handle out-of-band metrics, alerts (things we don't ship to customers)


@slimsag From what I understand, this is a pretty ad-hoc "add more Prometheus YAML" process at the moment. Is there anything I'm missing?

If there's no current process for this, I'd like to get an RFC started about a set of metric/alert labels and standards for out-of-band metrics and alerts that can integrate with the built-in alerting stack. This will somewhat depend on how https://github.com/sourcegraph/sourcegraph/issues/12010 comes out.

For example, oddity with blackbox alerts: https://sourcegraph.slack.com/archives/CJX299FGE/p1595254363107300

update: started sketching out RFC 208, but probably will keep this on hold given our new priorities


slimsag · 2020-08-06T01:06:57Z

Will merge before EOW.

bobheadxi · 2020-08-25T06:39:29Z

I just found out about RFC 131, but it seems incomplete - might be worth calling out on this page regardless?


slimsag · 2020-09-14T19:09:07Z

Deferring to 3.21. I still intend to follow up on this, but a higher-priority problem came up.

slimsag · 2020-10-19T23:11:08Z

So sorry this took me so long to follow up on. This is really great, @bobheadxi! I only made a few small tweaks; merging now.

distribution: add monitoring architecture page

dd48947

slimsag requested review from a team and removed request for a team July 16, 2020 15:46

slimsag assigned bobheadxi Jul 16, 2020

bobheadxi marked this pull request as draft July 16, 2020 15:48

slimsag added 3 commits July 16, 2020 10:08

add architecture diagrams

faf3b4d

Update monitoring_architecture.md

e9325a9

delete SVG variant (renders poorly)

7ea5dea

first pass writeup

37e6d62

bobheadxi reviewed Jul 17, 2020


pecigonzalo mentioned this pull request Jul 17, 2020

Approved: Proposal: RFC-189: Support per-team alerts and on-call rotations sourcegraph/sourcegraph-public-snapshot#12010

Closed

bobheadxi added 3 commits July 17, 2020 16:14

make custom additions a top-level thing

c5b3d03

Merge branch 'master' of github.com:sourcegraph/about into monitoring…

6853fb7

…-architecture

un-nest prom-wrapper

56dc80f

bobheadxi requested a review from pecigonzalo July 17, 2020 08:41

bobheadxi mentioned this pull request Jul 17, 2020

Distribution: 3.19 Tracking issue sourcegraph/sourcegraph-public-snapshot#11954

Closed

55 tasks

pecigonzalo reviewed Jul 17, 2020


rationale for prom-wrapper

fd20f42

bobheadxi force-pushed the monitoring-architecture branch from 8defdc0 to fd20f42 Compare July 20, 2020 03:10

bobheadxi added 2 commits July 20, 2020 20:43

improve wording about grafana editing situation

5bf3928

Merge branch 'master' of github.com:sourcegraph/about into monitoring…

cc2acaf

…-architecture

bobheadxi reviewed Jul 20, 2020


bobheadxi mentioned this pull request Jul 20, 2020

Alert if containers are entirely down/missing sourcegraph/sourcegraph-public-snapshot#9792

Closed

bobheadxi and others added 2 commits July 20, 2020 21:31

add note about alert ownership

fe8cddd

add description for blackbox exporter

1951df8

uwedeportivo approved these changes Jul 22, 2020


link to RFC about custom additions

56a5448

Merge branch 'master' of github.com:sourcegraph/about into monitoring…

a0521b4

…-architecture

bobheadxi marked this pull request as ready for review July 26, 2020 09:09

bobheadxi requested a review from a team July 26, 2020 09:10

bobheadxi added 2 commits July 30, 2020 09:35

Merge branch 'master' of github.com:sourcegraph/about into monitoring…

590e5f7

…-architecture

add per-team alerts details

eafe676

bobheadxi approved these changes Jul 30, 2020


slimsag assigned slimsag and unassigned bobheadxi Aug 6, 2020

slimsag added the team/distribution label Aug 6, 2020

slimsag added this to the 3.19 milestone Aug 6, 2020

slimsag modified the milestones: 3.19, 3.20 Aug 24, 2020

slimsag added planned/3.19 labels Aug 24, 2020

pecigonzalo mentioned this pull request Aug 24, 2020

Distribution 3.20 Tracking issue sourcegraph/sourcegraph-public-snapshot#12836

Closed

49 tasks

sqs changed the base branch from master to main September 5, 2020 04:37

Merge branch 'main' of github.com:sourcegraph/about into monitoring-a…

f8ea0a5

…rchitecture

slimsag requested a review from nicksnyder as a code owner September 11, 2020 23:29

slimsag modified the milestones: 3.20, 3.21 Sep 14, 2020

sourcegraph-bot mentioned this pull request Sep 14, 2020

Distribution 3.21 Tracking issue sourcegraph/sourcegraph-public-snapshot#13675

Closed

55 tasks

slimsag removed the planned/3.20 label Sep 14, 2020

Update monitoring_architecture.md

a21da70

slimsag merged commit 62dcabf into main Oct 19, 2020

slimsag deleted the monitoring-architecture branch October 19, 2020 23:11

rvantonder mentioned this pull request Oct 20, 2020

observability docs: move file to where it's referenced #1785

Merged
