
monitoring: granular alerts notifications with Alertmanager #11452

Closed
bobheadxi opened this issue Jun 12, 2020 · 14 comments · Fixed by #11832

Comments

@bobheadxi
Member

bobheadxi commented Jun 12, 2020

Right now (as of https://github.com/sourcegraph/sourcegraph/pull/11483), we create a copy of the home dashboard (since the home dashboard is non-editable) to attach alerts to, and alerts basically only tell you if warning alerts are firing or if critical alerts are firing.

We want to expand on https://github.com/sourcegraph/sourcegraph/issues/10641 (where notifications only tell you "warning alerts are firing" or "critical alerts are firing") to get better granularity for alerts (i.e. identify which specific alert is firing), and to be able to support per-alert silencing down the line (#11210)

Discussions

https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648567565

https://github.com/sourcegraph/sourcegraph/pull/11427/files#r439128693

Updated plan

https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648652514

@bobheadxi bobheadxi self-assigned this Jun 12, 2020
@bobheadxi bobheadxi added this to the 3.18 milestone Jun 22, 2020
@bobheadxi bobheadxi changed the title monitoring: investigate bootstrapping alerts when generating dashboards monitoring: investigate bootstrapping alerts when generating dashboards for granular alerts Jun 22, 2020
@bobheadxi
Member Author

From #10640:

Today you can configure two alerts at the topmost level: one for critical alerts in Sourcegraph and one for warning alerts. When you get notified via Slack, pager, email, etc., you then go over to Sourcegraph's Grafana to identify the cause.

But it would be nicer if there was an easy way to get the actual alert name / description / service name as part of the alert Grafana sends. This may be difficult to implement because Grafana alerting does not currently support templating in alert definitions, so this issue may involve fixing that: grafana/grafana#6557

The linked issue is 4 years old and has been locked, so I'll address this here by setting up alerts as part of dashboard generation. Since each dashboard panel is attached to specific alerts, we might be able to tackle it that way.

@bobheadxi
Member Author

@slimsag unfortunately, I'm not sure even the approach we discussed today (place alerts on individual panels) is possible either :( Since some observables can have both warnings and critical-level alerts, we would need multiple alerts per panel, which is not possible (grafana/grafana#7832)

@bobheadxi
Member Author

I feel like the only way forward here is to split the per-service "alerts firing" panel into one for warning and one for critical, and place the Grafana alerts on those

The alternative would be to duplicate every panel with warning and critical, which will probably clutter things up a lot 😢

@bobheadxi
Member Author

bobheadxi commented Jun 23, 2020

More thoughts: setting up alerts as part of dashboard generation will effectively make https://github.com/sourcegraph/sourcegraph/issues/11210 impossible, since alerts are generated ahead of time, unless we:

  • give each service's alerts a special prefix, i.e. src-frontend-...
  • when applying notifiers, create ~14 copies of each notifier (one per service prefix, src-frontend-..., etc.)
  • selectively exclude notifiers when creating them

I feel like the only way forward here is to split the per-service "alerts firing" panel into one for warning and one for critical, and place the Grafana alerts on those

I think the only way to make this work with https://github.com/sourcegraph/sourcegraph/issues/11210 is to change the query of the "alerts firing" panels to exclude silenced alerts. It might be mysterious to have alerts vanish off that dashboard, but we'll still have the data in e.g. the bug report and on the overview dashboard.

tl;dr I think the only way to reconcile the generated dashboards with the ability to silence alerts long-term is to remove generation from the build step and have grafana-wrapper create them instead (disk-provisioned dashboards are not editable). I'm leaning towards doing that for this issue, since I think we're going to have to do it anyway and the approach will be different. It'll be a significant change, but https://github.com/sourcegraph/sourcegraph/pull/11554 contains some refactors that should help keep it from getting too messy (I hope)
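For context, a minimal sketch of what "have grafana-wrapper create them" could mean in practice: push the generated dashboards through Grafana's dashboard HTTP API instead of provisioning them from disk, so they stay editable at runtime. This is not the actual implementation; the port, dashboard fields, and error handling are assumptions.

```go
package main

import (
	"bytes"
	"encoding/json"
	"fmt"
	"net/http"
)

// createDashboard posts a generated dashboard to Grafana's HTTP API instead of
// provisioning it from disk, so that it remains editable at runtime.
func createDashboard(grafanaURL string, dashboard map[string]interface{}) error {
	payload, err := json.Marshal(map[string]interface{}{
		"dashboard": dashboard,
		"overwrite": true, // replace any previously generated copy
	})
	if err != nil {
		return err
	}
	resp, err := http.Post(grafanaURL+"/api/dashboards/db", "application/json", bytes.NewReader(payload))
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	if resp.StatusCode != http.StatusOK {
		return fmt.Errorf("grafana: %s", resp.Status)
	}
	return nil
}

func main() {
	// Hypothetical: generate the frontend dashboard and push it to the bundled Grafana.
	dashboard := map[string]interface{}{"uid": "frontend", "title": "Frontend"}
	if err := createDashboard("http://127.0.0.1:3370", dashboard); err != nil {
		fmt.Println("failed to create dashboard:", err)
	}
}
```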

@slimsag
Member

slimsag commented Jun 23, 2020

Since some observables can have both warnings and critical-level alerts, we would need multiple alerts per panel, which is not possible (grafana/grafana#7832)

Grafana is full of fun little limitations, isn't it? 🙃 It's about time someone forked it and fixed this + templating in alerts.. honestly I wonder how hard that would be compared to working around these limitations.. but I digress

I feel like the only way forward here is to split the per-service "alerts firing" panel into one for warning and one for critical, and place the Grafana alerts on those
I think the only way to make this work with #11210 is to change the query of the "alerts firing" panels to exclude silenced alerts.

Correct me if I am wrong, but if we did do this wouldn't we only be able to emit an alert saying "a frontend critical alert is firing" rather than the specific alert (like "critical: frontend disk space low" or whatever), because Grafana doesn't support templating?

One way to do https://github.com/sourcegraph/sourcegraph/issues/11210 - which may be better (I haven't thought it through extensively yet) - would be to have a prometheus-wrapper which simply regenerates the Prometheus rules to omit that alert definition entirely. Then it'd be like the alert was never defined in the first place. This would introduce questions for how we handle that on the Grafana side, though.


The two original motivations for stepping away from Alertmanager and towards Grafana's alerting were its better UI and its support for more pager/notification types. But stepping back and looking at the problem now, I am wondering if this was the wrong choice entirely due to the limitations in Grafana's alerting.

Alertmanager does support the same set of pager/notification types we have in place today, which seems sufficient: https://prometheus.io/docs/alerting/latest/configuration/ And since we aren't directing admins to the Grafana UI but are instead sideloading our own config from the Sourcegraph site config... maybe Alertmanager would be a better approach? I haven't looked at this extensively, but given the limitations we're hitting here I think it may be worth stepping back and reconsidering. What do you think?
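As a sketch of what "sideload our own config" could look like pointed at Alertmanager instead of Grafana: a wrapper could render alertmanager.yml from whatever the site configuration provides. The SiteAlertNotifier fields and the single Slack route below are hypothetical, just to show the shape of the mapping, not a worked-out design.

```go
package main

import (
	"fmt"
	"os"
	"text/template"
)

// SiteAlertNotifier is a stand-in for whatever shape the alerting options in the
// site configuration end up having (field names here are hypothetical).
type SiteAlertNotifier struct {
	Level           string // "critical" or "warning"
	SlackWebhookURL string
}

// A minimal Alertmanager configuration: route alerts matching the configured
// level to a Slack receiver. A real config would handle more receiver types
// (email, PagerDuty, etc.) and multiple notifiers.
var alertmanagerYML = template.Must(template.New("am").Parse(`route:
  receiver: default
  routes:
    - match:
        level: {{ .Level }}
      receiver: site-config-slack
receivers:
  - name: default
  - name: site-config-slack
    slack_configs:
      - api_url: {{ .SlackWebhookURL }}
`))

func main() {
	n := SiteAlertNotifier{Level: "critical", SlackWebhookURL: "https://hooks.slack.com/services/XXX"}
	if err := alertmanagerYML.Execute(os.Stdout, n); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}
```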

@slimsag
Member

slimsag commented Jun 23, 2020

From a quick 5m assessment, if we did go back to alertmanager:

  • We could use the same grafana-wrapper type setup, but instead have a prometheus-wrapper
  • We could map the same site config to the Alertmanager config format
  • Since it already relies on a config file on disk, solving https://github.com/sourcegraph/sourcegraph/issues/11663 could be as simple as "just do nothing if the frontend is down"
  • There wouldn't be any questions about the query execution interval between Grafana and Prometheus potentially not matching up
  • It supports <tmpl_string> so we could potentially have just 1 templated alert definition that handles sending them all
  • We could have some fun stylizing the Slack alerts to look more like Sourcegraph and less like Grafana: https://prometheus.io/docs/alerting/latest/configuration/#action_config
  • We could generate a link to the Grafana dashboard (just the service, no need to link to the panel itself) and include that in the alert message
  • We wouldn't need to copy any dashboards or pre-provision alerts to work around Grafana limitations/restrictions

seems like only upsides to me 🤦 I feel bad for not thinking of this earlier
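To make the <tmpl_string> and "link to the Grafana dashboard" points above concrete, here's a sketch of a single templated Slack receiver, held as a Go constant since the wrapper would generate it. The label names (service_name, name, level), the external URL, and the dashboard UID scheme are assumptions for illustration.

```go
package main

import "fmt"

// One Alertmanager Slack receiver whose title/text are templated over the
// alert's labels, so a single definition covers every alert and links back to
// the relevant service's Grafana dashboard.
const slackReceiverYML = `receivers:
  - name: slack
    slack_configs:
      - api_url: https://hooks.slack.com/services/XXX
        title: '[{{ .CommonLabels.level }}] {{ .CommonLabels.service_name }}: {{ .CommonLabels.name }}'
        text: 'Dashboard: https://sourcegraph.example.com/-/debug/grafana/d/{{ .CommonLabels.service_name }}'
`

func main() { fmt.Print(slackReceiverYML) }
```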

@bobheadxi
Member Author

bobheadxi commented Jun 23, 2020

Correct me if I am wrong, but if we did do this wouldn't we only be able to emit an alert saying "a frontend critical alert is firing" rather than the specific alert (like "critical: frontend disk space low" or whatever), because Grafana doesn't support templating?

Yep :(

would be to have a prometheus-wrapper which simply regenerates the Prometheus rules to omit that alert definition entirely

This seems risky - I don't think muting an alert === never recording it 😅

seems like only upsides to me 🤦 I feel bad for not thinking of this earlier

I dug into it and it does seem to be mostly upsides 😅 The single downside is that it's a new service to deploy. I probably should have thought harder about it too, since the warning signs were there... oops :(

Doing this switch back to Alertmanager will take some finagling:

A very unfortunate state of affairs haha

@slimsag
Member

slimsag commented Jun 23, 2020

Alertmanager also has native support for silencing: https://prometheus.io/docs/alerting/latest/alertmanager/#silences

As for it being another thing to deploy, I agree but that also seems minor.

I think exploring this more is worth the time; I'd say timebox it to a few days and post an update on whether you think we should continue down that path or not :)

For the config schema, I imagine they will be pretty similar anyway, but if not we can always do something simpler, like handling backwards compatibility just for the absolute most basic thing, or not handling backwards compatibility at all and just advising site admins to update the configuration when they upgrade
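Picking up on the native silencing support mentioned above, here's a hedged sketch of how per-alert silencing (#11210) could be driven programmatically through Alertmanager's v2 silences API. The alert name, address, and the "createdBy"/comment values are placeholders rather than anything we've designed.

```go
package main

import (
	"bytes"
	"encoding/json"
	"log"
	"net/http"
	"time"
)

// silenceAlert creates a silence via Alertmanager's v2 API, muting one specific
// alert by its "name" label for a given duration without touching anything else.
func silenceAlert(alertmanagerURL, alertName string, d time.Duration) error {
	body, err := json.Marshal(map[string]interface{}{
		"matchers": []map[string]interface{}{
			{"name": "name", "value": alertName, "isRegex": false},
		},
		"startsAt":  time.Now().Format(time.RFC3339),
		"endsAt":    time.Now().Add(d).Format(time.RFC3339),
		"createdBy": "sourcegraph",
		"comment":   "silenced via site configuration (hypothetical)",
	})
	if err != nil {
		return err
	}
	resp, err := http.Post(alertmanagerURL+"/api/v2/silences", "application/json", bytes.NewReader(body))
	if err != nil {
		return err
	}
	return resp.Body.Close()
}

func main() {
	// Hypothetical alert name and local Alertmanager address.
	if err := silenceAlert("http://127.0.0.1:9093", "frontend_disk_space_low", 24*time.Hour); err != nil {
		log.Println("failed to create silence:", err)
	}
}
```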

@bobheadxi
Member Author

bobheadxi commented Jun 23, 2020

In all honesty, after looking into this today, I feel like Grafana alerting is not the way to go given the specific goals of:

  1. every alert triggers a unique message
  2. individual alerts must be silence-able

However, an approach that could work with Grafana: let's say we forgo point 1, and set notifications for when each service has critical alerts or warning alerts firing:

  • I actually think this is relatively desirable: personally, as an admin I would rather be notified on a per-service level than get 40 Slack messages for every single alert, and then go triage it / ask for help myself (customer insight might be useful here). These notifications would take users to the "firing alerts" dashboard of the relevant service, which doesn't actually feel like the worst place to start triaging what's up
    • edit: alertmanager is super fancy and handles this with grouping
  • we would avoid adding another service to the deployment (though as you say that's probably comparatively minor), but I personally still think just having Grafana is a plus (my rough thoughts: observability in Sourcegraph is to help maintain Sourcegraph, and should probably be relatively unobtrusive - in between Prometheus, Grafana, Jaeger, cAdvisor, there's already a lot of stuff in my mind, but I could be a bit misaligned with what our high-level monitoring goals are though)
  • we can still include custom messages in the alerts providing some high-level guidance on what to do with the Grafana alert (we just won't be able to benefit from templating)

Since most of the points above are based on assumptions about how much granularity is useful to a Sourcegraph admin, do you have any thoughts on this / past feedback from admins I can refer to?

@bobheadxi bobheadxi changed the title monitoring: investigate bootstrapping alerts when generating dashboards for granular alerts monitoring: granular alerts Jun 24, 2020
@bobheadxi bobheadxi changed the title monitoring: granular alerts monitoring: granular alerts notifications Jun 24, 2020
@bobheadxi
Member Author

bobheadxi commented Jun 24, 2020

Summary on granular alerts notifications

Response to Grafana potentially being the wrong choice for alert notifications - https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-647938831

From @slimsag:

The two original motivations for stepping away from Alertmanager and towards Grafana's alerting were its better UI and its support for more pager/notification types. But stepping back and looking at the problem now, I am wondering if this was the wrong choice entirely due to the limitations in Grafana's alerting.

Goals, from the issue description:

We want to expand on https://github.com/sourcegraph/sourcegraph/issues/10641 (where notifications only tell you "warning alerts are firing" or "critical alerts are firing") to get better granularity for alerts (i.e. identify which specific alert is firing), and to be able to support per-alert silencing down the line (#11210)

Alertmanager

On the surface, Alertmanager gives us everything we want right now and more (especially features around grouping and silencing alerts, which would solve https://github.com/sourcegraph/sourcegraph/issues/11210). Since Grafana has a plugin for Alertmanager, the only downsides are that it's a new service to configure and deploy, and we'll have to be careful about how we adapt what went out in 3.17 to Alertmanager (ideas: https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648003317).

Quick summary of what transitioning to Alertmanager might look like by @slimsag: https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-647983479.

Grafana alerting

At the moment, the biggest blocker to Grafana alerting is essentially that it does not let us set multiple alerts on one panel. Many of our Observables have a Warning and Critical level alert for the same panel, which means we must duplicate panels, which might look confusing (but might actually not be that bad, if we put some thought into how they are grouped, ie "critical alert" panels together).

We can achieve a sort of middle ground by setting alerts on the "alerts firing" panels at a per-service level rather than a Sourcegraph-wide level (i.e. expanding on what is available now in 3.17 with https://github.com/sourcegraph/sourcegraph/issues/10641). This approach could be sufficient if only we could act on the results of a triggered alert - this is currently unimplemented (and from the looks of it, won't be for quite some time: grafana/grafana#6041, grafana/grafana#7832, grafana/grafana#6553)

Leveraging Grafana alerting has the advantage of keeping most monitoring-related things in one place. I'm not sure if adding a new service to the mix is worth the featureset that Alertmanager offers: our alerting is relatively minimal, and until recently setting up alerts was quite a process (docs), so I'm not sure how comprehensive our alerting solution has to be (as mentioned in my other comment, though, this is based on assumptions about how much granularity is useful to a Sourcegraph admin; past feedback from admins / data on what custom alerts admins set today would be helpful - Slack thread).

"All alerts should be actionable"

Grafana alerting is limited, but it seems intentionally limited (1, 2), which means either we accept that Grafana alerting is just not for us or maybe we take a look at how we should leverage our current alerts to align with how their alerts work. For example, one approach could be:

  • Warning alerts are just warnings - the "action" we want users to take is to just take a look and be aware, so they can be less granular (ie "frontend has warning alerts - please check grafana"). Users receiving warning alerts will be taken to an "alerts firing" panel. In other words, if a warning requires specific action, maybe it should be a critical alert instead, and if not, it should be fine to have lower-granularity notifications for warnings, since then it won't fall under the definition of being actionable. By our own definition, warning alerts don't require strict action, just awareness - in this case, having warning alerts go to an "alerts firing" panel sounds pretty closely aligned with what we want.

    Warning alerts are worth looking into, but may not be a real issue with Sourcegraph. We suggest checking in on these periodically, or using a notification channel that will not bother anyone if it is spammed.

  • Critical alerts mean something specific is definitely going wrong - we can set these granularly per-panel, which adheres to Grafana's one-alert-per-panel restriction. Users receiving critical alerts will be taken to the relevant panel for that metric - by our own definition, with at least one specific action to take:

    If you see one, it means something is definitely wrong. We suggest e.g. emailing the site admin when these occur.

I personally am in favour of this approach. With some finessing, this can be made to support alert silencing as well (https://github.com/sourcegraph/sourcegraph/issues/11210). Another consideration:

There wouldn't be any questions about the query execution interval between Grafana and Prometheus potentially not matching up

We can address this by overlaying each panel with a query (B) for the alert_count metric and setting the Grafana alert on that instead. This has the advantage of clearly visualizing whether the "actual alert" is triggering (though arguably a horizontal line indicating the threshold is clearer)
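For concreteness, the overlay query (B) for a single panel might look something like the PromQL below, shown as a Go constant since the dashboard generator would emit it. The label names and the specific alert are assumptions.

```go
package main

import "fmt"

// Hypothetical overlay query for a panel's second target (B): its value is >= 1
// only while this specific alert is firing, so the Grafana alert condition can
// simply be "B is above 0".
const alertOverlayQuery = `max(alert_count{service_name="frontend",name="disk_space_remaining",level="critical"})`

func main() { fmt.Println(alertOverlayQuery) }
```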

Wait for Grafana to implement things

None of the issues related to Grafana's alerting restrictions are closed, so they might not be a strict "we'll never implement this". There is a vague promise of reworked alerting coming: grafana/grafana#6041 (comment) (emphasis mine)

We are currently redesigning the alert system in Grafana and writing up a design doc about how we want to improve the alerting features in Grafana. Once the doc is in better shape, we will share it on github. In our current thinking it will be possible to route and group alert notifications based on data from the alert rule and the series returned in the query. So the user will have much more flexibility on when to send or not send alert notifications.

This would allow us to send granular alerts via the "alerts firing" panels, as described above. That was in March 2019, though, without an update since, so waiting is probably out of the question.

@slimsag
Member

slimsag commented Jun 24, 2020

Thank you for the detailed write-up / extensive communication here! And apologies that you weren't given all of the user-facing context & vision here earlier.

First I want to provide some higher level context about why we're working on these monitoring changes (not just this issue) at all, and where we're at - because this is all part of a larger vision.

Monitoring vision

Today, most Sourcegraph instances do not have alerting set up at all. There are a few notable exceptions:

All this is to say, I would estimate around 80-90% of our customers and Sourcegraph deployments do not have any alerting set up at all. Historically, they did not have monitoring either - but in the past ~6 months we began shipping preconfigured Grafana and Prometheus out-of-the-box with Sourcegraph deployments.

The product vision here is simple: We have a way to determine if something is definitely wrong with Sourcegraph (critical alerts), and a way to determine if something may be wrong with Sourcegraph (warning alerts), but site admins are for the most part entirely unaware of it until something goes wrong and e.g. a user has already had a bad experience. We want all (or most) site admins to configure alerting in the future.

What blocks us from getting there? A few things:

  1. Configuring alerting through the Grafana web UI is/was a huge pain (and maybe not even possible). Robert fixed this.
  2. If customers start to rely on these alerts and they e.g. page an engineer saying "Sourcegraph has major issues!" in the middle of the night, and it turns out not to be an immediate problem, they need a way to silence that.
    • We cannot say "just disable/silence all critical alerts" because they will then not have any monitoring for their deployment
    • We cannot say "just fix the problem" in all cases, because in some instances they will need to wait for the next Sourcegraph release which may be e.g. a month out
    • We need some way for a site admin to say "Yeah, I saw that specific alert and I don't care about it for now."
  3. The process for setting up alerts needs to be super smooth and clean. If we want this to be a step that site admins perform when setting up Sourcegraph, we can't ask them to really go out of their way to configure things.
    • For example, the following would be bad because it would make the onboarding experience (when customers are considering "should we buy Sourcegraph?") have much more friction and most admins simply wouldn't do it: "Stop this Kubernetes deployment, mount this configuration file and make these edits, redeploy and then navigate to Grafana and test alerts and see if they work. If they do, then configure alerting in the web UI."
    • The following for example would be good because it wouldn't hamper the onboarding process and most site admins would likely perform it: "Great, you set up Sourcegraph and you've connected your code host - now give us some email or Slack credentials and we'll let you know if it is unhealthy or might be unhealthy"

The choice to use Grafana alerts

I originally made the choice that we should adopt Grafana alerts over Alertmanager, because it seemed to be more in-line with point 3 above ("The process for setting up alerts needs to be super smooth and clean."). At the time I did not envision that:

  • We could configure alerts through the site configuration easily or in a reasonable way (turns out, we can)
  • Configuring alerts in Grafana would be such a tedious and painful/impossible process if the dashboards already existed

These two new pieces of information mean we should reconsider the choice I made, for certain, and decide what gets us closer to the product goals here.

Responses to Robert's write-up above

Since Grafana has a plugin for Alertmanager

I think we wouldn't strictly need to use Grafana's plugin for Alertmanager at all. Alertmanager could alert directly on the alert_count metric we have in Prometheus, and Grafana could be purely for viewing dashboards when an alert fires. I think this is the more traditional usage of Grafana+Alertmanager+Prometheus anyway.
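To illustrate the "alert directly on the alert_count metric" idea, a prometheus-wrapper might generate a rule file roughly like the one below (held in a Go constant). The grouping labels, threshold, and annotation text are assumptions, not the shipped rules.

```go
package main

import "fmt"

// One generated Prometheus alerting rule that fans out into one alert per
// (service, alert name, level) combination currently firing, carrying the labels
// Alertmanager needs for routing, grouping, and templated notifications.
const alertRulesYML = `groups:
  - name: sourcegraph-alerts
    rules:
      - alert: sourcegraph_alert
        expr: max by (service_name, name, level) (alert_count) >= 1
        annotations:
          description: '{{ $labels.level }}: {{ $labels.name }} is firing for {{ $labels.service_name }}'
`

func main() { fmt.Print(alertRulesYML) }
```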

the only downsides are that it's a new service to configure and deploy, and we'll have to be careful about how we adapt what went out in 3.17 to Alertmanager

In practice, I think we will find fewer than 5 customers actually start using the alert configuration that went out in 3.17 - because we have not advertised it to them heavily yet. This is not to say we shouldn't consider that aspect carefully; it is just to say that we can get away with doing simple things here (like my suggestion that merely saying "Hey site admin! You'll need to update your alerting configuration" if they did configure it, would be OK).

At the moment, the biggest blocker to Grafana alerting is essentially that it does not let us set multiple alerts on one panel. Many of our Observables have a Warning and Critical level alert for the same panel, which means we must duplicate panels, which might look confusing (but might actually not be that bad, if we put some thought into how they are grouped, ie "critical alert" panels together).

I think this would be bad: it would make all of our dashboards have 2x as many panels, which would make it much harder for site admins (who do not necessarily have any idea what is going on) to find relevant information.

I'm not sure if adding a new service to the mix is worth the featureset that Alertmanager offers since our alerting is relatively minimal [...]

I don't think that introducing a new service for this is too big of a deal. I view not having a new service here as a "nice-to-have". One way we could avoid this would be by bundling alertmanager into the sourcegraph/prometheus Docker container, since we would already have a wrapper process around it:

  • sourcegraph/prometheus Docker container
    • Go wrapper process, if either subprocess dies it dies.
      • Prometheus process
      • Alertmanager process

This wouldn't be too hard to do, I think, because it is a very simple Go static binary deployment (e.g. download a prebuilt Linux binary here).
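A rough sketch of the "if either subprocess dies it dies" wrapper, assuming both binaries are baked into the image at the (hypothetical) paths below:

```go
package main

import (
	"log"
	"os"
	"os/exec"
)

// run starts a subprocess with its output forwarded to the container logs, and
// reports on the channel when it exits for any reason.
func run(done chan<- error, path string, args ...string) {
	cmd := exec.Command(path, args...)
	cmd.Stdout = os.Stdout
	cmd.Stderr = os.Stderr
	if err := cmd.Start(); err != nil {
		done <- err
		return
	}
	done <- cmd.Wait()
}

func main() {
	done := make(chan error, 2)
	// Binary paths and flags are illustrative, not the real image layout.
	go run(done, "/bin/prometheus", "--config.file=/sg_config_prometheus/prometheus.yml")
	go run(done, "/bin/alertmanager", "--config.file=/sg_config_prometheus/alertmanager.yml")
	// Exit as soon as either subprocess exits, so the container is restarted with both.
	log.Fatalf("subprocess exited: %v", <-done)
}
```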

how much granularity is useful to a sourcegraph admin, past feedback from admins / data on what custom alerts admins set today would be helpful

Site admins do not set or configure custom alerts in our Grafana instance (there may be a few minor exceptions to this, but the general operating assumption we should have is "we fully own and manage the Grafana/Prometheus instances and all third-party configuration or changes are unsupported" - in practice we just need to say in our changelog something like "You should use the new site configuration alerting options to configure alerting; any third-party configuration in Grafana or Prometheus will not be supported in the future to ensure Sourcegraph is easily backed up.")

In terms of granularity, site admins care about a few things I believe:

  1. They do not want to get an opaque "critical alert firing" message that doesn't say what the actual problem is. I feel very confident this will frustrate admins ("You say there is a critical problem but the container just restarted and came back up on its own just fine, I get that's a problem but I wish you could've just said a container restarted")

  2. They will not want to get spammed with alerts. This happens to us today with our separate OpsGenie when Sourcegraph.com goes down (we get like ~25 alerts on our phones), and it is horrible and painful to acknowledge them all.

  3. Site admins will want to be aware of warning alerts, but they are inherently flaky/spammy.

    • Imagine a situation where we have like ~400 useful warning alerts, but ~40 of them are flaky on your instance.
    • If these get dumped into a Slack channel, for example, and each one says "Warning alerts firing - check Grafana" you may be very tempted to say "well, yeah, that happens a lot I'm gonna ignore it." but in contrast if it's not one you've seen before and it says "warning - search requests slow" you might be much more tempted to look into that. The same effect can be achieved with the ability to silence specific warning alerts - but we must also be realistic that many people will just shrug that off and not do it in hopes of the situation improving in an upcoming release of Sourcegraph, etc.
    • Similarly, if there is a particularly spammy/noisy alert and you cannot silence it - you're gonna be tempted to just ignore the channel entirely.

Grafana alerting is limited, but it seems intentionally limited (1, 2), which means either we accept that Grafana alerting is just not for us or maybe we take a look at how we should leverage our current alerts to align with how their alerts work.

I think there are definitely some incompatibilities between the way we must do alerting (and monitoring in general) and how a traditional Grafana user may do it.

A traditional Grafana user is going to have a lot of context about what the dashboards are showing, what the alerts mean, and what Grafana is; they will either have set the thing up themselves OR have people who did.

In stark contrast, we're putting together an alerting system for people with zero context about how Sourcegraph works, what the alerts mean, and in many cases it may be their first time interacting with Grafana at all. In many cases, Grafana and our alerts may not give the site admin any useful information that they themselves can act on other than "I should contact support and ask why this is happening"

In a traditional Grafana deployment, I don't think you would really want separate warning and critical thresholds in practice for most things. If something is wrong, you want to alert someone. If something isn't wrong, don't alert them. If the threshold is not appropriate, adjust it. That last part ("adjust it") is something we can't ask our site admins to do: "Is it the right threshold and something is actually wrong? Or do I need to adjust this? How do I know without contacting support?"

In our case, we want to sort of remotely manage these alerts because we're defining them and shipping them to someone else - and in this context it really does make sense for us to want to say "this is definitely a problem" or "this could be a problem, but we're not positive"

Wait for Grafana to implement things

I don't think this is a valid option; they don't seem interested in supporting these use cases anytime soon (which is fair, they have different goals than us).

Warning alerts are just warnings - the "action" we want users to take is to just take a look and be aware, so they can be less granular (ie "frontend has warning alerts - please check grafana")

Are my arguments above convincing? What other thoughts do you have here? And, based on what I've said above about what I think site admins will expect and the overall product vision, what do you think the right tech choice / direction is for us?

@bobheadxi
Member Author

bobheadxi commented Jun 24, 2020

Thanks for the super detailed writeup! I think I have a clearer picture of our goals now and am pretty convinced, especially by:

In stark contrast, we're putting together an alerting system for people with zero context about how Sourcegraph works, what the alerts mean, and in many cases it may be their first time interacting with Grafana at all. In many cases, Grafana and our alerts may not give the site admin any useful information that they themselves can act on other than "I should contact support and ask why this is happening"

I also didn't realize our monitoring stack was so new, so I made the wrong assumption that the barebones nature of it was intentional 😅 Thanks for clearing all that up!

With all that said, here's how I am hoping to proceed:

This will change the scope/requirements of https://github.com/sourcegraph/sourcegraph/issues/11663, https://github.com/sourcegraph/sourcegraph/issues/11473, https://github.com/sourcegraph/sourcegraph/issues/11454

If all that sounds good @slimsag I'll probably start work on this on Monday (since I'll be away tomorrow and the day after) - thanks again!

@bobheadxi bobheadxi changed the title monitoring: granular alerts notifications monitoring: granular alerts notifications with Alertmanager Jun 24, 2020
@davejrt
Contributor

davejrt commented Jun 24, 2020

Interesting notes @slimsag @bobheadxi

Seems as though you've come to a conclusion, but here's some support for a few points, and perhaps an opinion, as someone who has been responsible for systems like this in the past:

In stark contrast, we're putting together an alerting system for people with zero context about how Sourcegraph works, what the alerts mean, and in many cases it may be their first time interacting with Grafana at all. In many cases, Grafana and our alerts may not give the site admin any useful information that they themselves can act on other than "I should contact support and ask why this is happening"

This is more or less my experience. I have found it more valuable as I learn more, but initially I didn't see a lot in there that made sense to me... or helped me diagnose a problem other than "the disk is full" or "CPU/MEM are high".

I feel as though the process to generate, create, edit... whatever action someone would normally want to perform on this... is really clunky, and would lead people to simply throw it in the too-hard basket, wait for a problem to occur, and then use other means to determine what it is or wait for support help.

sourcegraph/prometheus Docker container
Go wrapper process, if either subprocess dies it dies.
Prometheus process
Alertmanager process

What's the feeling about just running alertmanager as another container in the prometheus pod as opposed to introducing an entirely new service? For example, at the moment I have been toying with the blackbox exporter to see if it can alleviate some pain on #10742

@slimsag
Member

slimsag commented Jun 24, 2020

What's the feeling about just running alertmanager as another container in the prometheus pod as opposed to introducing an entirely new service?

In practice, introducing a new container does mean introducing a new service. We can make this distinction in Kubernetes only. When it comes to Docker Compose, Pure-Docker, and Server deployments, in all practicality containers have a 1:1 relationship with services. Being able to bundle a container in a Kubernetes service is simply a nice thing Kubernetes offers, but not something we can use elsewhere (unfortunately).
