Improve reliability of Sourcegraph.com site24x7 ping alerts #10742
Given our goal of growing awareness of sourcegraph.com, it makes sense (until now)?
@unknwon I don't understand your message above. Are you saying you think we should or should not continue testing the worldwide internet connection? I would say we should not, because if, e.g., Sourcegraph.com is inaccessible from a specific country but is otherwise up, there is no action we can take to resolve the issue.
Sorry :) Let me rephrase. I think "testing the worldwide internet connection" is valuable given RFC 151 wants to grow awareness of sourcegraph.com. We would have developers rely on/use sourcegraph.com worldwide.
I agree testing that is valuable, but perhaps not actionable unless it goes on for longer periods of time. So I guess we need two things:
I've just set up an alternative monitor on Apex Ping to see if any of the events on site24x7 over the next few days line up.
My initial thoughts on this are that we need something closer to our stack to alert us when something is really broken, i.e. via Grafana/Alertmanager, when the frontend isn't returning 200s or has a lot of 500 errors. By doing this we take out the site24x7 and Cloudflare anomalies and stop people from potentially being woken up in the middle of the night. We do expose ourselves to the risk that if Grafana or Prometheus go down on the cluster, we don't get those alerts.
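The "frontend isn't returning 200s / has a lot of 500 errors" check described above could be expressed as a Prometheus alerting rule. This is only a sketch: the metric name `http_requests_total` and its labels are placeholders, not the actual sourcegraph-frontend metric names.

```yaml
# Sketch of a Prometheus alerting rule for the idea above: page when the
# frontend's 5xx rate is high. Metric and label names are placeholders,
# not the real sourcegraph-frontend metrics.
groups:
  - name: frontend-availability
    rules:
      - alert: FrontendHighErrorRate
        expr: |
          sum(rate(http_requests_total{job="sourcegraph-frontend",code=~"5.."}[5m]))
            / sum(rate(http_requests_total{job="sourcegraph-frontend"}[5m])) > 0.05
        for: 10m
        labels:
          severity: critical
        annotations:
          summary: "sourcegraph-frontend is returning >5% 5xx responses"
```

The `for: 10m` clause is what removes the one-off anomalies mentioned above: a transient blip never fires the alert.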
I think this is where we can rely on site24x7. Given our distributed nature, it's more than likely that one of our internal team would catch these types of errors too, at least Monday to Friday.
@davejrt we do (or will soon) have monitoring for e.g. 200-status response codes from the frontend, but I think this omits a few things:
I think those properties are worth testing and paging on in the middle of the night, just not from every location around the world. Thoughts on those two points?
Also, nice work setting up Apex Ping! Excited to hear how that goes, what your experience with it is, whether it has a better UX, etc. I don't think anyone here is particularly tied to site24x7 so long as we can configure whatever service we use via Terraform, so if Apex Ping seems better to you, that's a judgment call you can definitely make here.
@davejrt Is it possible to update Site24x7 to alert only when a number of locations report problems?
@pecigonzalo we do have multiple locations configured in site24x7, so it will only alert when at least 3 of the locations are failing. That being said, all the alerts we've seen in the past have had all locations failing. My approach thus far has been to monitor internally on the stack as well as to test externally, to determine where along the request chain there might be an issue, whether that be outside of our cluster (i.e. Cloudflare) or internally in the cluster.
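The "alert only when at least 3 locations are failing" behaviour described above is a simple quorum check. A minimal sketch of the idea (the function and location names below are hypothetical; Site24x7 implements this internally):

```python
# Sketch of location-quorum alerting: only page when at least `threshold`
# probe locations report a failure, so one flaky vantage point stays quiet.
# All names here are hypothetical, not Site24x7's API.

def should_alert(location_results: dict, threshold: int = 3) -> bool:
    """location_results maps probe location -> True if the check failed."""
    failing = sum(1 for failed in location_results.values() if failed)
    return failing >= threshold

# A single flaky location does not page:
print(should_alert({"us-east": True, "eu-west": False, "ap-south": False, "sa-east": False}))  # False
# A broad outage does:
print(should_alert({"us-east": True, "eu-west": True, "ap-south": True, "sa-east": False}))  # True
```

This is also why the distinction in the thread matters: if all configured locations fail at once, the quorum gives no extra signal about whether the fault is our cluster or something upstream.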
@davejrt That is great. I got the impression from one of the descriptions (or maybe the linked issue or a chat) that we were fully removing the external test.
@pecigonzalo with regard to your question, my understanding is we're not specifically ruling out having an external monitoring tool, nor saying that there is no value in external testing. The concern thus far has been that site24x7 has reported a failure, which has either been:
The comment around "testing the whole internet", as I understand it, is that when these alerts are triggered, the root cause is not always our service, but rather some upstream issue which we can't pinpoint. Hence, all we are doing is creating noise and saying "there is an issue somewhere on the internet... it isn't us, but we don't know where". My rationale was to improve or add to our existing checks by testing closer to our stack, to narrow down where the issue is. For example, if we are testing
This is clear and makes sense, as do the changes being implemented. What I'm trying to understand is the scope of the problem and if we have some other root problems.
In some cases, this should trigger an alert, but we might need to update our deployments to mute alerts for a period during planned outages.
If our DNS is failing, or routing/internet providers are failing, and some of our users can't use the service, I believe we should get a notification. I agree that we can't action it and it's mostly FYI (maybe we can notify Slack and not page), but I would expect that 3+ locations failing to reach us is not common and should not be creating a lot of noise.
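The "notify Slack and not page" idea above maps naturally onto Alertmanager routing by severity. A hedged sketch with placeholder receiver and channel names (not our actual Alertmanager config):

```yaml
# Sketch: send informational external-reachability alerts to Slack,
# and only page (OpsGenie) on critical alerts. Receiver names and the
# channel are placeholders, not real configuration.
route:
  receiver: opsgenie-page
  routes:
    - match:
        severity: info
      receiver: slack-notify
receivers:
  - name: slack-notify
    slack_configs:
      - channel: '#alerts'
  - name: opsgenie-page
    opsgenie_configs:
      - priority: P1
```

With a split like this, "3+ locations can't reach us" can land in Slack as FYI, while in-cluster failures still wake someone up.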
I agree there is value in external checks. My take is we probably have too many in site24x7, and should just rely on "can the internet reach sourcegraph-frontend on Kubernetes". Then we could rely on something like
Dear all, This is your release captain speaking. 🚂🚂🚂 Branch cut for the 3.18 release is scheduled for tomorrow. Is this issue / PR going to make it in time? Please change the milestone accordingly. Thank you |
Blackbox exporter is now running on sourcegraph.com and can be queried via Grafana. Endpoints and alerts can be configured by updating the prometheus.yaml in our configmap.
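For reference, wiring an endpoint into the blackbox exporter usually follows the standard relabelling pattern below. This is a sketch only: the exporter address, module name, and probed URL are assumptions, not the actual values in the deploy-sourcegraph-dot-com configmap.

```yaml
# Sketch of a prometheus.yaml scrape config that probes an endpoint through
# the blackbox exporter. Exporter address, module, and target are
# placeholders for whatever the dot-com configmap actually uses.
scrape_configs:
  - job_name: blackbox-http
    metrics_path: /probe
    params:
      module: [http_2xx]
    static_configs:
      - targets:
          - https://sourcegraph.com/search
    relabel_configs:
      # Pass the target URL to the exporter as the ?target= parameter...
      - source_labels: [__address__]
        target_label: __param_target
      # ...keep it as the instance label for dashboards and alerts...
      - source_labels: [__param_target]
        target_label: instance
      # ...and actually scrape the exporter, not the target.
      - target_label: __address__
        replacement: blackbox-exporter:9115
```

Alerts can then be written against the exporter's `probe_success` metric, giving the "can the internet reach sourcegraph-frontend" check discussed earlier in the thread.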
Closed by sourcegraph/deploy-sourcegraph-dot-com#2984
We use site24x7 on Sourcegraph.com to ensure it doesn't go down unnoticed (e.g. if both prometheus and grafana get taken down, we want something external to confirm the site is reachable).
This reports directly to OpsGenie (doesn't go through Prometheus or Grafana) and the configuration lives here.
It's OK/ideal that it bypasses Prometheus and Grafana, but site24x7 is a regular source of flaky alerts for us and we'd like to find something less flaky (it seems we're "testing the worldwide internet connection" currently).
This is the first major project I'd like for you to take on @davejrt !