monitoring: granular alerts notifications with Alertmanager #11452
From #10640:
The linked issue is 4 years old and has been locked, so I'll address this here by setting up alerts as part of the dashboard generation. Since each dashboard panel is attached to specific alerts, we might be able to tackle it that way
@slimsag unfortunately, I'm not sure even the approach we discussed today (place alerts on individual panels) is possible either :( Since some observables can have both warning and critical-level alerts, we would need multiple alerts per panel, which is not possible (grafana/grafana#7832)
I feel like the only way forward here is to split the per-service "alerts firing" panel into one for warning and one for critical, and place the Grafana alerts on those. The alternative would be to duplicate every panel with warning and critical, which will probably clutter things up a lot 😢
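For concreteness, here is a minimal sketch of what could back those split panels, assuming the generated alert rules carry `service_name` and `level` labels (these label names are assumptions, not the current schema). Each panel would graph one of these series and carry a single Grafana alert on "value > 0":

```yaml
# Hypothetical Prometheus recording rules backing per-service
# "warning alerts firing" / "critical alerts firing" panels.
# ALERTS is Prometheus' built-in series for active alerts.
groups:
  - name: frontend-alerts-firing
    rules:
      - record: frontend_warning_alerts_firing
        expr: sum(ALERTS{alertstate="firing", service_name="frontend", level="warning"})
      - record: frontend_critical_alerts_firing
        expr: sum(ALERTS{alertstate="firing", service_name="frontend", level="critical"})
```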
More thoughts: implementing alert setup as part of dashboard generation will effectively render https://github.com/sourcegraph/sourcegraph/issues/11210 impossible, since alerts are generated ahead of time, unless we:
I think the only way to make this work with https://github.com/sourcegraph/sourcegraph/issues/11210 is to change the query of the "alerts firing" panels to exclude silenced alerts. It might be mysterious to have alerts vanish off that dashboard, but we'll still have the data in e.g. bug-report and on the overview dashboard. tl;dr: I think the only way to reconcile the generated dashboards with the ability to silence alerts long-term is to remove generation from the build step and have
Grafana is full of fun little limitations, isn't it? 🙃 It's about time someone forked it and fixed this + templating in alerts.. honestly I wonder how hard that would be compared to working around these limitations.. but I digress
Correct me if I am wrong, but if we did do this, wouldn't we only be able to emit an alert saying "a frontend critical alert is firing" rather than the specific alert (like "critical: frontend disk space low" or whatever), because Grafana doesn't support templating?

One way to do https://github.com/sourcegraph/sourcegraph/issues/11210 - which may be better (I haven't thought it through extensively yet) - would be to have a

The two original motivations for stepping away from Alertmanager and towards Grafana's alerting were its better UI and its support for more pager/notification types. But, stepping back and looking at the problem from a glance now, I am wondering whether this wasn't the wrong choice entirely, given the limitations in Grafana's alerting. Alertmanager does support the same set of pager/notification types we have in place today, which does seem sufficient: https://prometheus.io/docs/alerting/latest/configuration/ - and since we aren't directing admins to the Grafana UI but instead sideloading our own config from the Sourcegraph site config.. maybe Alertmanager would be a better approach?

I haven't looked at this extensively, but given the limitations we're hitting here I think it may be worth stepping back and reconsidering. What do you think?
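For reference, the notifier types in use today map directly onto Alertmanager receivers. A minimal, hedged config sketch (placeholder URLs and keys, not our actual configuration; SMTP settings omitted):

```yaml
# Minimal Alertmanager config sketch showing the receiver types we'd need.
route:
  receiver: default
receivers:
  - name: default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
    pagerduty_configs:
      - routing_key: REPLACE_ME
    email_configs:
      - to: admin@example.com   # requires SMTP settings, omitted here
```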
From a quick 5m assessment, if we did go back to alertmanager:
seems like only upsides to me 🤦 I feel bad for not thinking of this earlier
Yep :(
This seems risky - I don't think muting an alert === never recording it 😅
I dug into it and it does seem to be mostly upsides 😅 The single downside is that it's a new service to deploy. I probably should have thought harder about it too, since the warning signs were there... oops :( Doing this switch back to Alertmanager will take some finagling:
A very unfortunate state of affairs haha
https://prometheus.io/docs/alerting/latest/alertmanager/#silences - Alertmanager also has native support for silencing.

As for it being another thing to deploy, I agree, but that also seems minor. I think exploring this more is worth the time - I'd say timebox this to a few days and post an update with whether you think we should continue down that path or not :)

For the config schema, I imagine that they will be pretty similar anyway, but if not we can always do something simpler, like handling backwards compatibility just for the absolute most basic thing, or not handling backwards compatibility at all and just advising site admins to update the configuration when they upgrade.
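For reference, a silence in Alertmanager is a first-class object with explicit matchers and an expiry. The sketch below shows its shape in YAML for readability - the v2 API takes JSON, and silences are normally created via the UI or amtool rather than a file; the alert name and times are examples:

```yaml
# Illustrative shape of an Alertmanager silence object.
matchers:
  - name: alertname
    value: frontend_disk_space_low   # example alert name, not a real rule
    isRegex: false
startsAt: "2020-06-25T00:00:00Z"
endsAt: "2020-06-26T00:00:00Z"
createdBy: "site-admin"
comment: "Known issue, upgrading disk this week"
```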
In all honesty, after looking into this today I feel like Grafana alerting is not the way to go, given the specific goals of:
However, an approach that could work with Grafana: let's say we forgo point 1, and set up notifications for when each service has critical alerts or warning alerts firing:
Since most of the points above are based on assumptions about how much granularity is useful to a Sourcegraph admin, do you have any thoughts on this / past feedback from admins I can refer to?
Summary on granular alerts notifications

Response to Grafana potentially being the wrong choice for alert notifications - https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-647938831

From @slimsag:
Goals, from the issue description:
Alertmanager

On the surface, Alertmanager gives us everything we want right now and more (especially features around grouping and silencing alerts, which would solve https://github.com/sourcegraph/sourcegraph/issues/11210). Since Grafana has a plugin for Alertmanager, the only downsides are that it's a new service to configure and deploy, and we'll have to be careful about how we adapt what went out in 3.17 to Alertmanager (ideas: https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648003317). Quick summary of what transitioning to Alertmanager might look like by @slimsag: https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-647983479.

Grafana alerting

At the moment, the biggest blocker to Grafana alerting is essentially that it does not let us set multiple alerts on one panel. Many of our observables have both warning-level and critical-level alerts, so they would each need more than one alert per panel.

We can achieve a sort of middle ground by expanding on setting alerts on the "alerts firing" panels at a per-service level rather than a Sourcegraph-wide level (ie what is available now in 3.17 with https://github.com/sourcegraph/sourcegraph/issues/10641). This approach could be sufficient if only we could act on the results of a triggered alert - this is currently unimplemented (and from the looks of it, won't be for quite some time: grafana/grafana#6041, grafana/grafana#7832, grafana/grafana#6553).

Leveraging Grafana alerting has the advantage of keeping most monitoring-related things in one place. I'm not sure if adding a new service to the mix is worth the featureset that Alertmanager offers, since our alerting is relatively minimal, and until recently setting up alerts was quite a process (docs), so I'm not sure how comprehensive our alerting solution has to be (as mentioned in my other comment, though, this is based on assumptions about how much granularity is useful to a Sourcegraph admin - past feedback from admins / data on what custom alerts admins set today would be helpful - Slack thread).

"All alerts should be actionable"

Grafana alerting is limited, but it seems intentionally limited (1, 2), which means either we accept that Grafana alerting is just not for us, or we take a look at how we should leverage our current alerts to align with how their alerts work. For example, one approach could be:
I personally am in favour of this approach. With some finessing, this can be made to support alert silencing as well (https://github.com/sourcegraph/sourcegraph/issues/11210). Another consideration:
We can address this by overlaying each panel with a query (

Wait for Grafana to implement things

None of the issues related to Grafana's alerting restrictions are closed, so they might not be a strict "we'll never implement this". There is a vague promise of reworked alerts coming: grafana/grafana#6041 (comment) (emphasis mine)
This would allow us to send granular alerts via the "alerts firing" panels, as described above. That was March 2019, though, without an update since, so waiting is probably out of the question.
Thank you for the detailed write-up / extensive communication here! And apologies that you weren't given all of the user-facing context & vision here earlier. First I want to provide some higher-level context about why we're working on these monitoring changes (not just this issue) at all, and where we're at - because this is all part of a larger vision.

Monitoring vision

Today, most Sourcegraph instances do not have alerting set up at all. There are a few notable exceptions:
All this is to say, I would estimate around 80-90% of our customers and Sourcegraph deployments do not have any alerting set up at all. Historically, they did not have monitoring either - but in the past ~6 months we began shipping preconfigured Grafana and Prometheus out-of-the-box with Sourcegraph deployments.

The product vision here is simple: we have a way to determine if something is definitely wrong with Sourcegraph (critical alerts), and a way to determine if something may be wrong with Sourcegraph (warning alerts), but site admins are for the most part entirely unaware of it until something goes wrong and e.g. a user has already had a bad experience. We want all (or most) site admins to configure alerting in the future.

What blocks us from getting there? A few things:
The choice to use Grafana alerts

I originally made the choice that we should adopt Grafana alerts over Alertmanager because it seemed to be more in line with point 3 above ("The process for setting up alerts needs to be super smooth and clean."). At the time I did not envision that:
These two new pieces of information mean we should reconsider the choice I made, for certain, and decide what gets us closer to the product goals here.

Responses to Robert's write-up above
I think we wouldn't strictly need to use Grafana's plugin for Alertmanager at all. Alertmanager could alert directly on the
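A minimal sketch of what that direct path might look like on the Prometheus side - Prometheus evaluates our generated alert rules and forwards firing alerts straight to Alertmanager, with no Grafana alerting involved. The rules path and Alertmanager address below are assumptions, not actual deployment values:

```yaml
# Sketch of the relevant prometheus.yml sections (illustrative paths/targets).
rule_files:
  - /prometheus/rules/*_alert_rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```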
In practice, I think we will find fewer than 5 customers actually start using the alert configuration that went out in 3.17 - because we have not advertised it to them heavily yet. This is not to say we shouldn't consider that aspect carefully; it is just to say that we can get away with doing simple things here (like my suggestion that merely saying "Hey site admin! You'll need to update your alerting configuration" - if they did configure it - would be OK).
I think this would be bad: it would make all of our dashboards have 2x as many panels on them, which would make it much harder for site admins (who do not necessarily have any idea what is going on) to find relevant information.
I don't think that introducing a new service for this is too big of a deal. I view not having a new service here as a "nice-to-have". One way we could avoid this would be by bundling alertmanager into the
This wouldn't be too hard to do, I think, because it is a very simple Go static binary deployment (e.g. download a prebuilt Linux binary here).
Site admins do not set or configure custom alerts in our Grafana instance (there may be a few minor exceptions to this, but the general operating assumption we should have is "we fully own and manage the Grafana/Prometheus instances, and all third-party configuration or changes are unsupported" - in practice we just need to say in our changelog something like "You should use the new site configuration alerting options to configure alerting; any third-party configuration in Grafana or Prometheus will not be supported in the future, to ensure Sourcegraph is easily backed up.")

In terms of granularity, site admins care about a few things, I believe:
I think there are definitely some incompatibilities between the way we must do alerting (and monitoring in general) and how a traditional Grafana user may do it. A traditional Grafana user is going to have a lot of context about what the dashboards are showing, what the alerts mean, and what Grafana is; they will either have set the thing up themselves OR have people who did. In stark contrast, we're putting together an alerting system for people with zero context about how Sourcegraph works or what the alerts mean, and in many cases it may be their first time interacting with Grafana at all. In many cases, Grafana and our alerts may not give the site admin any useful information that they themselves can act on other than "I should contact support and ask why this is happening".

In a traditional Grafana deployment, I don't think you would really want separate warning and critical thresholds in practice for most things. If something is wrong, you want to alert someone. If something isn't wrong, don't alert them. If the threshold is not appropriate, adjust it. That last part ("adjust it") is something we can't ask our site admins to do: "Is it the right threshold and something is actually wrong? Or do I need to adjust this? How do I know without contacting support?" In our case, we want to sort of remotely manage these alerts because we're defining them and shipping them to someone else - and in this context it really does make sense for us to want to say "this is definitely a problem" or "this could be a problem, but we're not positive".
I don't think this is a valid option; they don't seem interested in supporting these use cases anytime soon (which is fair, they have different goals than us).
Are my arguments above convincing? What other thoughts do you have here? And, based on what I've said above about what I think site admins will expect and the overall product vision here, what do you think the right tech choice / direction is for us?
Thanks for the super detailed writeup! I think I have a clearer picture of our goals now and am pretty convinced, especially:
I also didn't realize our monitoring stack was so new, so I made the wrong assumption that the barebones nature of it was intentional 😅 Thanks for clearing all that up! With all that said, here's how I am hoping to proceed:
This will change the scope/requirements of https://github.com/sourcegraph/sourcegraph/issues/11663, https://github.com/sourcegraph/sourcegraph/issues/11473, and https://github.com/sourcegraph/sourcegraph/issues/11454. If all that sounds good @slimsag, I'll probably start work on this on Monday (since I'll be away tomorrow and the day after) - thanks again!
Interesting notes @slimsag @bobheadxi. It seems as though you've come to a conclusion, but in support of a few points, and perhaps as an opinion from someone who has been responsible for systems like this in the past:
This is more or less my experience. I have found it more valuable as I learn more, but initially I didn't see a lot in there that made sense to me... or helped me diagnose a problem other than "the disk is full" or "CPU/MEM are high". I feel as though the process to generate, create, edit... whatever action someone would normally want to perform on this is really clunky and would lead people to simply throw it in the too-hard basket, wait for a problem to occur, and use other means to determine what it is or wait for support help.
What's the feeling about just running alertmanager as another container in the Prometheus pod, as opposed to introducing an entirely new service? For example, at the moment I have been toying with blackbox exporter to see if it can alleviate some pain on #10742
In practice, introducing a new container does mean introducing a new service. We can make this distinction in Kubernetes only. When it comes to Docker Compose, Pure-Docker, and Server deployments, in all practicality containers have a 1:1 relationship with services. Being able to bundle a container into a Kubernetes service is simply a nice thing Kubernetes offers, but not something we can use elsewhere (unfortunately).
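For the Kubernetes case specifically, a hedged sketch of what that bundling could look like - Alertmanager as a sidecar container in the existing Prometheus deployment. Container names, image tags, and flags are illustrative, not the actual manifests:

```yaml
# Sketch of a Prometheus pod spec with an Alertmanager sidecar (Kubernetes only).
spec:
  containers:
    - name: prometheus
      image: sourcegraph/prometheus:insiders   # illustrative tag
      ports:
        - containerPort: 9090
    - name: alertmanager
      image: prom/alertmanager:v0.21.0
      ports:
        - containerPort: 9093
      args:
        - --config.file=/alertmanager/alertmanager.yml
```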
Right now (as of https://github.com/sourcegraph/sourcegraph/pull/11483), we create a copy of the home dashboard (since the home dashboard is non-editable) to attach alerts to, and alerts basically only tell you if warning alerts are firing or if critical alerts are firing.
We want to expand on https://github.com/sourcegraph/sourcegraph/issues/10641 (where notifications only tell you "warning alerts are firing" and "critical alerts are firing") to get better granularity for alerts (ie provide what specific alert is firing), and be able to support per-alert silencing down the line (#11210)
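As an illustration of the granularity being asked for, a hedged Alertmanager routing sketch: group by the specific alert name so notifications say which alert fired, and route critical vs. warning alerts to different receivers. The `level` label, receiver names, and keys are assumptions/placeholders, not the shipped config:

```yaml
# Illustrative Alertmanager routing for per-alert granularity.
route:
  receiver: warnings           # default receiver
  group_by: ['alertname', 'level']
  routes:
    - match:
        level: critical
      receiver: critical-pager
receivers:
  - name: warnings
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
  - name: critical-pager
    pagerduty_configs:
      - routing_key: REPLACE_ME
```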
Discussions
https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648567565
https://github.com/sourcegraph/sourcegraph/pull/11427/files#r439128693
Updated plan
https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648652514