Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

distribution: add monitoring architecture page #1221

Merged
merged 20 commits into from
Oct 19, 2020
Merged

Conversation

slimsag
Copy link
Member

@slimsag slimsag commented Jul 16, 2020

@bobheadxi noted to me over Slack:

image

I think this is a great idea and something that I have long intended to put together - but I think an RFC may be the wrong forum for it because many of these decisions have already been made. I am therefor proposing the following:

  1. The current monitoring architecture and decisions we have already made are documented and live in this document. For any question @pecigonzalo has asked me or you @bobheadxi about "why are we doing this this way?" I would like this document to explain it
  2. For changes to the current monitoring architecture and decisions we are considering, those should be RFCs (or, if small changes they can be lightweight Proposal: ... GitHub issues, like https://github.com/sourcegraph/sourcegraph/issues/12010).

@bobheadxi can you own filling this out from here with all the context I've transferred to you, any questions Gonza has asked you, etc.? And then I'll follow-up and add any context I may have as well after

Rendered: https://github.com/sourcegraph/about/blob/monitoring-architecture/handbook/engineering/distribution/observability/monitoring_architecture.md

@slimsag slimsag requested review from a team and removed request for a team July 16, 2020 15:46
@bobheadxi bobheadxi marked this pull request as draft July 16, 2020 15:48
@slimsag
Copy link
Member Author

slimsag commented Jul 16, 2020

Added an architecture diagram:

@@ -16,6 +17,20 @@ See [the monitoring developer guide](monitoring.md) for information on how to de
- [Why can't I have all information expanded by default on the dashboard?](#faq-why-cant-i-have-all-information-expanded-by-default-on-the-dashboard)
- [Next steps](#next-steps)

## Long-term vision
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.


Learn more about the `alert_count` metrics in the [metrics guide](https://docs.sourcegraph.com/admin/observability/metrics_guide#alert-count).

*Rationale for `alert_count`*: TODO(@slimsag)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I will follow-up here soon, as with everything there is some nuance and reasons why this made sense in the past (and may or may not make sense today). :)

Copy link
Contributor

@pecigonzalo pecigonzalo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is looking really good!


Alertmanager is bundled in `sourcegraph/prometheus`, and notifications are configured for Sourcegraph alerts [using site configuration](https://docs.sourcegraph.com/admin/observability/alerting). This functionality is provided by the [prom-wrapper](#prom-wrapper).

*Rationale for notifiers in site configuration*: Due to the limitations of [admin reverse-proxies](#admin-reverse-proxy), alerts cannot be configured without port-forwarding or custom ConfigMaps, something we [want to avoid](monitoring_pillars.md#long-term-vision).
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As I recall from our conversation, we could have shipped via ConfigMaps, this is a limitation of the implementation we want to do of siteconfig <- prom-wrapper.
I think we should add somewhere that we want to request admins when onboarding a site through the frontend to configure a default destination for notifications, which is driving this implementation, otherwise I think this could also be a required var of our deployments.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's twofold - both your point, and this comment

(at least, from my understanding 😅 )

Copy link
Member

@bobheadxi bobheadxi Jul 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we want to request admins when onboarding a site through the frontend to configure a default destination for notifications

https://github.com/sourcegraph/sourcegraph/issues/12332 and one of the points here, currently worded as:

we want minimize the number of Sourcegraph instances without any alerting set up


## Custom additions

TODO: how we handle out-of-band metrics, alerts (things we don't ship to customers)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@slimsag From what I understand, this is pretty ad-hoc "add more prometheus yaml" at the moment - is there anything I'm missing?

If there's no current process for this, I'd like to get an RFC started about a set of metric/alert labels/standards for out-of-band metrics and alerts that can integrate with the built-in alerting stack. This will somewhat depend on how https://github.com/sourcegraph/sourcegraph/issues/12010 comes out

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For example, oddity with blackbox alerts: https://sourcegraph.slack.com/archives/CJX299FGE/p1595254363107300

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

update: started sketching out RFC 208, but probably will keep this on hold given our new priorities

@bobheadxi bobheadxi marked this pull request as ready for review July 26, 2020 09:09
@bobheadxi bobheadxi requested a review from a team July 26, 2020 09:10
@slimsag
Copy link
Member Author

slimsag commented Aug 6, 2020

Will merge before EOW.

@bobheadxi
Copy link
Member

I just found out about RFC 131, but it seems incomplete - might be worth calling out on this page regardless?

@sqs sqs changed the base branch from master to main September 5, 2020 04:37
@slimsag slimsag modified the milestones: 3.20, 3.21 Sep 14, 2020
@slimsag
Copy link
Member Author

slimsag commented Sep 14, 2020

Deferring to 3.21. I still intend to follow-up on this, but a higher-priority problem came up.

@slimsag
Copy link
Member Author

slimsag commented Oct 19, 2020

So sorry this took me so long to follow-up on. This is really great, @bobheadxi ! I made a few small tweaks only, merging now.

@slimsag slimsag merged commit 62dcabf into main Oct 19, 2020
@slimsag slimsag deleted the monitoring-architecture branch October 19, 2020 23:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants