-
Notifications
You must be signed in to change notification settings - Fork 172
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
distribution: add monitoring architecture page #1221
Conversation
@@ -16,6 +17,20 @@ See [the monitoring developer guide](monitoring.md) for information on how to de | |||
- [Why can't I have all information expanded by default on the dashboard?](#faq-why-cant-i-have-all-information-expanded-by-default-on-the-dashboard) | |||
- [Next steps](#next-steps) | |||
|
|||
## Long-term vision |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
|
||
Learn more about the `alert_count` metrics in the [metrics guide](https://docs.sourcegraph.com/admin/observability/metrics_guide#alert-count). | ||
|
||
*Rationale for `alert_count`*: TODO(@slimsag) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also posted a related question on Slack: https://sourcegraph.slack.com/archives/CJX299FGE/p1594971987034300
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I will follow-up here soon, as with everything there is some nuance and reasons why this made sense in the past (and may or may not make sense today). :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is looking really good!
handbook/engineering/distribution/observability/monitoring_architecture.md
Outdated
Show resolved
Hide resolved
|
||
Alertmanager is bundled in `sourcegraph/prometheus`, and notifications are configured for Sourcegraph alerts [using site configuration](https://docs.sourcegraph.com/admin/observability/alerting). This functionality is provided by the [prom-wrapper](#prom-wrapper). | ||
|
||
*Rationale for notifiers in site configuration*: Due to the limitations of [admin reverse-proxies](#admin-reverse-proxy), alerts cannot be configured without port-forwarding or custom ConfigMaps, something we [want to avoid](monitoring_pillars.md#long-term-vision). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As I recall from our conversation, we could have shipped via ConfigMaps
, this is a limitation of the implementation we want to do of siteconfig <- prom-wrapper
.
I think we should add somewhere that we want to request admins when onboarding a site through the frontend to configure a default destination for notifications, which is driving this implementation, otherwise I think this could also be a required var
of our deployments.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's twofold - both your point, and this comment
(at least, from my understanding 😅 )
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
we want to request admins when onboarding a site through the frontend to configure a default destination for notifications
https://github.com/sourcegraph/sourcegraph/issues/12332 and one of the points here, currently worded as:
we want minimize the number of Sourcegraph instances without any alerting set up
handbook/engineering/distribution/observability/monitoring_architecture.md
Show resolved
Hide resolved
8defdc0
to
fd20f42
Compare
|
||
## Custom additions | ||
|
||
TODO: how we handle out-of-band metrics, alerts (things we don't ship to customers) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@slimsag From what I understand, this is pretty ad-hoc "add more prometheus yaml" at the moment - is there anything I'm missing?
If there's no current process for this, I'd like to get an RFC started about a set of metric/alert labels/standards for out-of-band metrics and alerts that can integrate with the built-in alerting stack. This will somewhat depend on how https://github.com/sourcegraph/sourcegraph/issues/12010 comes out
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For example, oddity with blackbox alerts: https://sourcegraph.slack.com/archives/CJX299FGE/p1595254363107300
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: started sketching out RFC 208, but probably will keep this on hold given our new priorities
handbook/engineering/distribution/observability/monitoring_architecture.md
Outdated
Show resolved
Hide resolved
Will merge before EOW. |
I just found out about RFC 131, but it seems incomplete - might be worth calling out on this page regardless? |
Deferring to 3.21. I still intend to follow-up on this, but a higher-priority problem came up. |
So sorry this took me so long to follow-up on. This is really great, @bobheadxi ! I made a few small tweaks only, merging now. |
@bobheadxi noted to me over Slack:
I think this is a great idea and something that I have long intended to put together - but I think an RFC may be the wrong forum for it because many of these decisions have already been made. I am therefor proposing the following:
Proposal: ...
GitHub issues, like https://github.com/sourcegraph/sourcegraph/issues/12010).@bobheadxi can you own filling this out from here with all the context I've transferred to you, any questions Gonza has asked you, etc.? And then I'll follow-up and add any context I may have as well after
Rendered: https://github.com/sourcegraph/about/blob/monitoring-architecture/handbook/engineering/distribution/observability/monitoring_architecture.md