monitoring: granular alerts notifications with Alertmanager #11452
From #10640:
The linked issue is 4 years old and has been locked, so I'll address this here by setting up alerts as part of the dashboard generation. Since each dashboard panel is attached to specific alerts, we might be able to tackle it that way
@slimsag unfortunately, I'm not sure even the approach we discussed today (place alerts on individual panels) is possible either :( Since some observables can have both warning and critical-level alerts, we would need multiple alerts per panel, which is not possible (grafana/grafana#7832)
I feel like the only way forward here is to split the per-service "alerts firing" panel into one for warning and one for critical, and place the Grafana alerts on those. The alternative would be to duplicate every panel with warning and critical, which will probably clutter things up a lot 😢
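For concreteness, here is a minimal sketch of what could back those split panels, assuming the generated alert rules carry `service_name` and `level` labels (these label names are assumptions, not the current schema). Each panel would graph one of these series and carry a single Grafana alert on "value > 0":

```yaml
# Hypothetical Prometheus recording rules backing per-service
# "warning alerts firing" / "critical alerts firing" panels.
# ALERTS is Prometheus' built-in series for active alerts.
groups:
  - name: frontend-alerts-firing
    rules:
      - record: frontend_warning_alerts_firing
        expr: sum(ALERTS{alertstate="firing", service_name="frontend", level="warning"})
      - record: frontend_critical_alerts_firing
        expr: sum(ALERTS{alertstate="firing", service_name="frontend", level="critical"})
```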
More thoughts: implementing alert setup as part of dashboard generation will effectively render https://github.com/sourcegraph/sourcegraph/issues/11210 impossible, since alerts are generated ahead of time, unless we:
I think the only way to make this work with https://github.com/sourcegraph/sourcegraph/issues/11210 is to change the query of the "alerts firing" panels to exclude silenced alerts. It might be mysterious to have alerts vanish off that dashboard, but we'll still have the data in e.g. bug-report and on the overview dashboard. tl;dr: I think the only way to reconcile the generated dashboards with the ability to silence alerts long-term is to remove generation from the build step and have
Grafana is full of fun little limitations, isn't it? 🙃 It's about time someone forked it and fixed this + templating in alerts.. honestly I wonder how hard that would be compared to working around these limitations.. but I digress
Correct me if I am wrong, but if we did do this, wouldn't we only be able to emit an alert saying "a frontend critical alert is firing" rather than the specific alert (like "critical: frontend disk space low" or whatever), because Grafana doesn't support templating?

One way to do https://github.com/sourcegraph/sourcegraph/issues/11210 - which may be better (I haven't thought it through extensively yet) - would be to have a

The two original motivations for stepping away from Alertmanager and towards Grafana's alerting were its better UI and its support for more pager/notification types. But, stepping back and looking at the problem from a glance now, I am wondering whether this wasn't the wrong choice entirely, given the limitations in Grafana's alerting. Alertmanager does support the same set of pager/notification types we have in place today, which does seem sufficient: https://prometheus.io/docs/alerting/latest/configuration/ - and since we aren't directing admins to the Grafana UI but instead sideloading our own config from the Sourcegraph site config.. maybe Alertmanager would be a better approach?

I haven't looked at this extensively, but given the limitations we're hitting here I think it may be worth stepping back and reconsidering. What do you think?
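For reference, the notifier types in use today map directly onto Alertmanager receivers. A minimal, hedged config sketch (placeholder URLs and keys, not our actual configuration; SMTP settings omitted):

```yaml
# Minimal Alertmanager config sketch showing the receiver types we'd need.
route:
  receiver: default
receivers:
  - name: default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
    pagerduty_configs:
      - routing_key: REPLACE_ME
    email_configs:
      - to: admin@example.com   # requires SMTP settings, omitted here
```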
From a quick 5m assessment, if we did go back to alertmanager:
seems like only upsides to me 🤦 I feel bad for not thinking of this earlier
Yep :(
This seems risky - I don't think muting an alert === never recording it 😅
I dug into it and it does seem to be mostly upsides 😅 The single downside is that it's a new service to deploy. I probably should have thought harder about it too, since the warning signs were there... oops :( Doing this switch back to Alertmanager will take some finagling:
A very unfortunate state of affairs haha
https://prometheus.io/docs/alerting/latest/alertmanager/#silences - Alertmanager also has native support for silencing.

As for it being another thing to deploy, I agree, but that also seems minor. I think exploring this more is worth the time - I'd say timebox this to a few days and post an update with whether you think we should continue down that path or not :)

For the config schema, I imagine that they will be pretty similar anyway, but if not we can always do something simpler, like handling backwards compatibility just for the absolute most basic thing, or not handling backwards compatibility at all and just advising site admins to update the configuration when they upgrade.
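For reference, a silence in Alertmanager is a first-class object with explicit matchers and an expiry. The sketch below shows its shape in YAML for readability - the v2 API takes JSON, and silences are normally created via the UI or amtool rather than a file; the alert name and times are examples:

```yaml
# Illustrative shape of an Alertmanager silence object.
matchers:
  - name: alertname
    value: frontend_disk_space_low   # example alert name, not a real rule
    isRegex: false
startsAt: "2020-06-25T00:00:00Z"
endsAt: "2020-06-26T00:00:00Z"
createdBy: "site-admin"
comment: "Known issue, upgrading disk this week"
```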
In all honesty, after looking into this today I feel like Grafana alerting is not the way to go, given the specific goals of:
However, an approach that could work with Grafana: let's say we forgo point 1, and set up notifications for when each service has critical alerts or warning alerts firing:
Since most of the points above are based on assumptions about how much granularity is useful to a Sourcegraph admin, do you have any thoughts on this / past feedback from admins I can refer to?
Summary on granular alerts notifications

Response to Grafana potentially being the wrong choice for alert notifications - https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-647938831

From @slimsag:
Goals, from the issue description:
Alertmanager

On the surface, Alertmanager gives us everything we want right now and more (especially features around grouping and silencing alerts, which would solve https://github.com/sourcegraph/sourcegraph/issues/11210). Since Grafana has a plugin for Alertmanager, the only downsides are that it's a new service to configure and deploy, and we'll have to be careful about how we adapt what went out in 3.17 to Alertmanager (ideas: https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648003317). Quick summary of what transitioning to Alertmanager might look like by @slimsag: https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-647983479.

Grafana alerting

At the moment, the biggest blocker to Grafana alerting is essentially that it does not let us set multiple alerts on one panel. Many of our observables have both warning-level and critical-level alerts, so they would each need more than one alert per panel.

We can achieve a sort of middle ground by expanding on setting alerts on the "alerts firing" panels at a per-service level rather than a Sourcegraph-wide level (ie what is available now in 3.17 with https://github.com/sourcegraph/sourcegraph/issues/10641). This approach could be sufficient if only we could act on the results of a triggered alert - this is currently unimplemented (and from the looks of it, won't be for quite some time: grafana/grafana#6041, grafana/grafana#7832, grafana/grafana#6553).

Leveraging Grafana alerting has the advantage of keeping most monitoring-related things in one place. I'm not sure if adding a new service to the mix is worth the featureset that Alertmanager offers, since our alerting is relatively minimal, and until recently setting up alerts was quite a process (docs), so I'm not sure how comprehensive our alerting solution has to be (as mentioned in my other comment, though, this is based on assumptions about how much granularity is useful to a Sourcegraph admin - past feedback from admins / data on what custom alerts admins set today would be helpful - Slack thread).

"All alerts should be actionable"

Grafana alerting is limited, but it seems intentionally limited (1, 2), which means either we accept that Grafana alerting is just not for us, or we take a look at how we should leverage our current alerts to align with how their alerts work. For example, one approach could be:
I personally am in favour of this approach. With some finessing, this can be made to support alert silencing as well (https://github.com/sourcegraph/sourcegraph/issues/11210). Another consideration:
We can address this by overlaying each panel with a query (

Wait for Grafana to implement things

None of the issues related to Grafana's alerting restrictions are closed, so they might not be a strict "we'll never implement this". There is a vague promise of reworked alerts coming: grafana/grafana#6041 (comment) (emphasis mine)
This would allow us to send granular alerts via the "alerts firing" panels, as described above. That was March 2019, though, without an update since, so waiting is probably out of the question.
Thank you for the detailed write-up / extensive communication here! And apologies that you weren't given all of the user-facing context & vision here earlier. First I want to provide some higher-level context about why we're working on these monitoring changes (not just this issue) at all, and where we're at - because this is all part of a larger vision.

Monitoring vision

Today, most Sourcegraph instances do not have alerting set up at all. There are a few notable exceptions:
All this is to say, I would estimate around 80-90% of our customers and Sourcegraph deployments do not have any alerting set up at all. Historically, they did not have monitoring either - but in the past ~6 months we began shipping preconfigured Grafana and Prometheus out-of-the-box with Sourcegraph deployments.

The product vision here is simple: we have a way to determine if something is definitely wrong with Sourcegraph (critical alerts), and a way to determine if something may be wrong with Sourcegraph (warning alerts), but site admins are for the most part entirely unaware of it until something goes wrong and e.g. a user has already had a bad experience. We want all (or most) site admins to configure alerting in the future.

What blocks us from getting there? A few things:
The choice to use Grafana alerts

I originally made the choice that we should adopt Grafana alerts over Alertmanager because it seemed to be more in line with point 3 above ("The process for setting up alerts needs to be super smooth and clean."). At the time I did not envision that:
These two new pieces of information mean we should reconsider the choice I made, for certain, and decide what gets us closer to the product goals here.

Responses to Robert's write-up above
I think we wouldn't strictly need to use Grafana's plugin for Alertmanager at all. Alertmanager could alert directly on the
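A minimal sketch of what that direct path might look like on the Prometheus side - Prometheus evaluates our generated alert rules and forwards firing alerts straight to Alertmanager, with no Grafana alerting involved. The rules path and Alertmanager address below are assumptions, not actual deployment values:

```yaml
# Sketch of the relevant prometheus.yml sections (illustrative paths/targets).
rule_files:
  - /prometheus/rules/*_alert_rules.yml
alerting:
  alertmanagers:
    - static_configs:
        - targets: ['alertmanager:9093']
```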
In practice, I think we will find fewer than 5 customers actually start using the alert configuration that went out in 3.17 - because we have not advertised it to them heavily yet. This is not to say we shouldn't consider that aspect carefully; it is just to say that we can get away with doing simple things here (like my suggestion that merely saying "Hey site admin! You'll need to update your alerting configuration" - if they did configure it - would be OK).
I think this would be bad: it would make all of our dashboards have 2x as many panels on them, which would make it much harder for site admins (who do not necessarily have any idea what is going on) to find relevant information.
I don't think that introducing a new service for this is too big of a deal. I view not having a new service here as a "nice-to-have". One way we could avoid this would be by bundling alertmanager into the
This wouldn't be too hard to do, I think, because it is a very simple Go static binary deployment (e.g. download a prebuilt Linux binary here).
Site admins do not set or configure custom alerts in our Grafana instance (there may be a few minor exceptions to this, but the general operating assumption we should have is "we fully own and manage the Grafana/Prometheus instances, and all third-party configuration or changes are unsupported" - in practice we just need to say in our changelog something like "You should use the new site configuration alerting options to configure alerting; any third-party configuration in Grafana or Prometheus will not be supported in the future, to ensure Sourcegraph is easily backed up.")

In terms of granularity, site admins care about a few things, I believe:
I think there are definitely some incompatibilities between the way we must do alerting (and monitoring in general) and how a traditional Grafana user may do it. A traditional Grafana user is going to have a lot of context about what the dashboards are showing, what the alerts mean, and what Grafana is; they will either have set the thing up themselves OR have people who did. In stark contrast, we're putting together an alerting system for people with zero context about how Sourcegraph works or what the alerts mean, and in many cases it may be their first time interacting with Grafana at all. In many cases, Grafana and our alerts may not give the site admin any useful information that they themselves can act on other than "I should contact support and ask why this is happening".

In a traditional Grafana deployment, I don't think you would really want separate warning and critical thresholds in practice for most things. If something is wrong, you want to alert someone. If something isn't wrong, don't alert them. If the threshold is not appropriate, adjust it. That last part ("adjust it") is something we can't ask our site admins to do: "Is it the right threshold and something is actually wrong? Or do I need to adjust this? How do I know without contacting support?" In our case, we want to sort of remotely manage these alerts because we're defining them and shipping them to someone else - and in this context it really does make sense for us to want to say "this is definitely a problem" or "this could be a problem, but we're not positive".
I don't think this is a valid option; they don't seem interested in supporting these use cases anytime soon (which is fair, they have different goals than us).
Are my arguments above convincing? What other thoughts do you have here? And, based on what I've said above about what I think site admins will expect and the overall product vision here, what do you think the right tech choice / direction is for us?
Thanks for the super detailed writeup! I think I have a clearer picture of our goals now and am pretty convinced, especially:
I also didn't realize our monitoring stack was so new, so I made the wrong assumption that the barebones nature of it was intentional 😅 Thanks for clearing all that up! With all that said, here's how I am hoping to proceed:
This will change the scope/requirements of https://github.com/sourcegraph/sourcegraph/issues/11663, https://github.com/sourcegraph/sourcegraph/issues/11473, and https://github.com/sourcegraph/sourcegraph/issues/11454. If all that sounds good @slimsag, I'll probably start work on this on Monday (since I'll be away tomorrow and the day after) - thanks again!
Interesting notes @slimsag @bobheadxi. It seems as though you've come to a conclusion, but in support of a few points, and perhaps as an opinion from someone who has been responsible for systems like this in the past:
This is more or less my experience. I have found it more valuable as I learn more, but initially I didn't see a lot in there that made sense to me... or helped me diagnose a problem other than "the disk is full" or "CPU/MEM are high". I feel as though the process to generate, create, edit... whatever action someone would normally want to perform on this is really clunky and would lead people to simply throw it in the too-hard basket, wait for a problem to occur, and use other means to determine what it is or wait for support help.
What's the feeling about just running alertmanager as another container in the Prometheus pod, as opposed to introducing an entirely new service? For example, at the moment I have been toying with blackbox exporter to see if it can alleviate some pain on #10742
In practice, introducing a new container does mean introducing a new service. We can make this distinction in Kubernetes only. When it comes to Docker Compose, Pure-Docker, and Server deployments, in all practicality containers have a 1:1 relationship with services. Being able to bundle a container into a Kubernetes service is simply a nice thing Kubernetes offers, but not something we can use elsewhere (unfortunately).
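For the Kubernetes case specifically, a hedged sketch of what that bundling could look like - Alertmanager as a sidecar container in the existing Prometheus deployment. Container names, image tags, and flags are illustrative, not the actual manifests:

```yaml
# Sketch of a Prometheus pod spec with an Alertmanager sidecar (Kubernetes only).
spec:
  containers:
    - name: prometheus
      image: sourcegraph/prometheus:insiders   # illustrative tag
      ports:
        - containerPort: 9090
    - name: alertmanager
      image: prom/alertmanager:v0.21.0
      ports:
        - containerPort: 9093
      args:
        - --config.file=/alertmanager/alertmanager.yml
```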
Right now (as of https://github.com/sourcegraph/sourcegraph/pull/11483), we create a copy of the home dashboard (since the home dashboard is non-editable) to attach alerts to, and alerts basically only tell you if warning alerts are firing or if critical alerts are firing.
We want to expand on https://github.com/sourcegraph/sourcegraph/issues/10641 (where notifications only tell you "warning alerts are firing" and "critical alerts are firing") to get better granularity for alerts (ie provide what specific alert is firing), and be able to support per-alert silencing down the line (#11210)
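As an illustration of the granularity being asked for, a hedged Alertmanager routing sketch: group by the specific alert name so notifications say which alert fired, and route critical vs. warning alerts to different receivers. The `level` label, receiver names, and keys are assumptions/placeholders, not the shipped config:

```yaml
# Illustrative Alertmanager routing for per-alert granularity.
route:
  receiver: warnings           # default receiver
  group_by: ['alertname', 'level']
  routes:
    - match:
        level: critical
      receiver: critical-pager
receivers:
  - name: warnings
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE_ME
  - name: critical-pager
    pagerduty_configs:
      - routing_key: REPLACE_ME
```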
Discussions
https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648567565
https://github.com/sourcegraph/sourcegraph/pull/11427/files#r439128693
Updated plan
https://github.com/sourcegraph/sourcegraph/issues/11452#issuecomment-648652514