Improve reliability of Sourcegraph.com site24x7 ping alerts #10742
Given our goal of growing awareness of sourcegraph.com, it makes sense (until now)?
@unknwon I don't understand your message above. Are you saying you think we should or should not continue testing the worldwide internet connection? I would say we should not, because if, e.g., Sourcegraph.com is inaccessible from a specific country but is otherwise up, there is no action we can take to resolve the issue.
Sorry :) Let me rephrase. I think "testing the worldwide internet connection" is valuable given RFC 151 wants to grow awareness of sourcegraph.com. We would have developers rely on/use sourcegraph.com worldwide.
I agree testing that is valuable, but perhaps not actionable unless it goes on for longer periods of time. So I guess we need two things:
I've just set up an alternative monitor on Apex Ping to see if any of the events on site24x7 over the next few days line up.
My initial thoughts on this are that we need something closer to our stack to alert us when something is really broken, i.e. via Grafana/Alertmanager, when the frontend isn't returning 200s or has a lot of 500 errors. By doing this we take out the site24x7 and Cloudflare anomalies and stop people from potentially being woken up in the middle of the night. We do expose ourselves to the risk that if Grafana or Prometheus go down on the cluster, we don't get those alerts.
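The "frontend isn't returning 200s / has a lot of 500 errors" check described above could be expressed as a Prometheus alerting rule. This is only a sketch: the metric name `http_requests_total` and its labels are placeholders, not the actual sourcegraph-frontend metric names.

```yaml
# Sketch of a Prometheus alerting rule for the idea above: page when the
# frontend's 5xx rate is high. Metric and label names are placeholders,
# not the real sourcegraph-frontend metrics.
groups:
  - name: frontend-availability
    rules:
      - alert: FrontendHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="sourcegraph-frontend",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="sourcegraph-frontend"}[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "sourcegraph-frontend is returning >5% 5xx responses"
```

The `for: 10m` clause is what removes the one-off anomalies mentioned above: a transient blip never fires the alert.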
I think this is where we can rely on site24x7. Given our distributed nature, it's more than likely that one of our internal team would catch these types of errors too, at least Monday to Friday.
@davejrt we do (or will soon) have monitoring for e.g. 200-status response codes from the frontend, but I think this omits a few things:
I think those properties are worth testing and paging on in the middle of the night, just not from every location around the world. Thoughts on those two points?
Also, nice work setting up Apex Ping! Excited to hear how that goes, what your experience with it is, whether it has a better UX, etc. I don't think anyone here is particularly tied to site24x7 so long as we can configure whatever service we use via Terraform, so if Apex Ping seems better to you, that's a judgment call you can definitely make here.
@davejrt Is it possible to update Site24x7 to alert only when a number of locations report problems?
@pecigonzalo we do have multiple locations configured in site24x7, so it will only alert when at least 3 of the locations are failing. That being said, all the alerts we've seen in the past have had all locations failing. My approach thus far has been to monitor internally on the stack as well as to test externally, to determine where along the request chain there might be an issue, whether that be outside of our cluster (i.e. Cloudflare) or internally in the cluster.
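The "alert only when at least 3 locations are failing" behaviour described above is a simple quorum check. A minimal sketch of the idea (the function and location names below are hypothetical; Site24x7 implements this internally):

```python
# Sketch of location-quorum alerting: only page when at least `threshold`
# probe locations report a failure, so one flaky vantage point stays quiet.
# All names here are hypothetical, not Site24x7's API.

def should_alert(location_results: dict, threshold: int = 3) -> bool:
    """location_results maps probe location -> True if the check failed."""
    failing = sum(1 for failed in location_results.values() if failed)
    return failing >= threshold

# A single flaky location does not page:
print(should_alert({"us-east": True, "eu-west": False, "ap-south": False, "sa-east": False}))  # False
# A broad outage does:
print(should_alert({"us-east": True, "eu-west": True, "ap-south": True, "sa-east": False}))  # True
```

This is also why the distinction in the thread matters: if all configured locations fail at once, the quorum gives no extra signal about whether the fault is our cluster or something upstream.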
@davejrt That is great. I got the impression from one of the descriptions (or maybe the linked issue or a chat) that we were fully removing the external test.
@pecigonzalo with regard to your question, my understanding is we're not specifically ruling out having an external monitoring tool, nor saying that there is no value in external testing. The concern thus far has been that site24x7 has reported a failure, which has either been:
The comment around "testing the whole internet", as I understand it, is that when these alerts are triggered, the root cause is not always our service, but rather some upstream issue which we can't pinpoint. Hence, all we are doing is creating noise and saying "there is an issue somewhere on the internet... it isn't us, but we don't know where". My rationale was to improve or add to our existing checks by testing closer to our stack, to narrow down where the issue is. For example, if we are testing
This is clear and makes sense, as do the changes being implemented. What I'm trying to understand is the scope of the problem and if we have some other root problems.
In some cases, this should trigger an alert, but we might need to update our deployments to mute alerts for a period during planned outages.
If our DNS is failing, or routing/internet providers are failing, and some of our users can't use the service, I believe we should get a notification. I agree that we can't action it and it's mostly FYI (maybe we can notify Slack and not page), but I would expect that 3+ locations failing to reach us is not common and should not be creating a lot of noise.
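The "notify Slack and not page" idea above maps naturally onto Alertmanager routing by severity. A hedged sketch with placeholder receiver and channel names (not our actual Alertmanager config):

```yaml
# Sketch: send informational external-reachability alerts to Slack,
# and only page (OpsGenie) on critical alerts. Receiver names and the
# channel are placeholders, not real configuration.
route:
  receiver: opsgenie-page
  routes:
    - match:
        severity: info
      receiver: slack-notify
receivers:
  - name: slack-notify
    slack_configs:
      - channel: '#alerts'
  - name: opsgenie-page
    opsgenie_configs:
      - priority: P1
```

With a split like this, "3+ locations can't reach us" can land in Slack as FYI, while in-cluster failures still wake someone up.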
I agree there is value in external checks. My take is we probably have too many in site24x7, and should just rely on "can the internet reach sourcegraph-frontend on Kubernetes". Then we could rely on something like
Dear all, This is your release captain speaking. 🚂🚂🚂 Branch cut for the 3.18 release is scheduled for tomorrow. Is this issue / PR going to make it in time? Please change the milestone accordingly. Thank you |
Blackbox exporter is now running on sourcegraph.com and can be queried via Grafana. Endpoints and alerts can be configured by updating the prometheus.yaml in our configmap.
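For reference, wiring an endpoint into the blackbox exporter usually follows the standard relabelling pattern below. This is a sketch only: the exporter address, module name, and probed URL are assumptions, not the actual values in the deploy-sourcegraph-dot-com configmap.

```yaml
# Sketch of a prometheus.yaml scrape config that probes an endpoint through
# the blackbox exporter. Exporter address, module, and target are
# placeholders for whatever the dot-com configmap actually uses.
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://sourcegraph.com/search
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= parameter...
      - source_labels: [__address__]
        target_label: __param_target
      # ...keep it as the instance label for dashboards and alerts...
      - source_labels: [__param_target]
        target_label: instance
      # ...and actually scrape the exporter, not the target.
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

Alerts can then be written against the exporter's `probe_success` metric, giving the "can the internet reach sourcegraph-frontend" check discussed earlier in the thread.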
Closed by sourcegraph/deploy-sourcegraph-dot-com#2984
We use site24x7 on Sourcegraph.com to ensure it doesn't go down unnoticed (e.g. if both prometheus and grafana get taken down, we want something external to confirm the site is reachable).
This reports directly to OpsGenie (doesn't go through Prometheus or Grafana) and the configuration lives here.
It's OK/ideal that it bypasses Prometheus and Grafana, but site24x7 is a regular source of flaky alerts for us and we'd like to find something less flaky (it seems we're "testing the worldwide internet connection" currently).
This is the first major project I'd like for you to take on @davejrt !