This repository has been archived by the owner on Sep 30, 2024. It is now read-only.

Improve reliability of Sourcegraph.com site24x7 ping alerts #10742

Closed
slimsag opened this issue May 15, 2020 · 16 comments

@slimsag
Member

slimsag commented May 15, 2020

We use site24x7 on Sourcegraph.com to ensure it doesn't go down unnoticed (e.g. if both prometheus and grafana get taken down, we want something external to confirm the site is reachable).

This reports directly to OpsGenie (doesn't go through Prometheus or Grafana) and the configuration lives here.

It's OK/ideal that it bypasses Prometheus and Grafana, but site24x7 is a regular source of flaky alerts for us and we'd like to find something less flaky (it seems we're "testing the worldwide internet connection" currently).

This is the first major project I'd like for you to take on @davejrt !

@unknwon
Member

unknwon commented May 16, 2020

> (it seems we're "testing the worldwide internet connection" currently).

Given our goal of growing awareness of sourcegraph.com, it makes sense (until now)?

@slimsag changed the title from "switch away from site24x7 or find a way to make it more reliable" to "observability: switch away from site24x7 or find a way to make it more reliable" on May 18, 2020
@slimsag
Member Author

slimsag commented May 18, 2020

@unknwon I don't understand your message above. Are you saying you think we should or should not continue testing the worldwide internet connection?

I would say we should not, because if e.g. Sourcegraph.com is inaccessible from a specific country but is otherwise up -- there is no action we can take to resolve the issue

@unknwon
Member

unknwon commented May 20, 2020

> @unknwon I don't understand your message above. Are you saying you think we should or should not continue testing the worldwide internet connection?
>
> I would say we should not, because if e.g. Sourcegraph.com is inaccessible from a specific country but is otherwise up -- there is no action we can take to resolve the issue

Sorry :) Let me rephrase. I think "testing the worldwide internet connection" is valuable given RFC 151 wants to grow awareness of sourcegraph.com. We would have developers rely on/use sourcegraph.com worldwide.

@davejrt changed the title from "observability: switch away from site24x7 or find a way to make it more reliable" to "❓ (stretch) : observability: switch away from site24x7 or find a way to make it more reliable" on May 26, 2020
@slimsag changed the title from "❓ (stretch) : observability: switch away from site24x7 or find a way to make it more reliable" to "observability: switch away from site24x7 or find a way to make it more reliable" on Jun 3, 2020
@slimsag slimsag modified the milestones: 3.17, 3.18 Jun 3, 2020
@slimsag
Member Author

slimsag commented Jun 3, 2020

I agree testing that is valuable, but perhaps not actionable unless it goes on for longer periods of time.

So I guess we need two things:

  1. Something to alert us when Sourcegraph.com goes down because of a problem on our end, which we can address. This should e.g. wake someone up and call them via pagerduty.
  2. Something to alert us if Sourcegraph.com is inaccessible from one part of the world for an extended period of time, e.g. via email or Slack notifications.
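The two tiers above could be sketched roughly as follows. This is only an illustration, not how any existing alerting is configured: the names `Action`, `ProbeSummary` and `decide`, the 30-minute grace period, and the rule "every location failing means the problem is on our end" are all assumptions made for the example.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NONE = "none"      # nothing to do
    NOTIFY = "notify"  # low urgency, e.g. Slack or email
    PAGE = "page"      # high urgency, wake someone up


@dataclass
class ProbeSummary:
    failing_locations: int  # probe locations currently failing
    total_locations: int    # probe locations configured
    failing_minutes: float  # how long the failure has lasted so far


def decide(summary: ProbeSummary, regional_grace_minutes: float = 30.0) -> Action:
    """Map an external probe summary to an alerting action (hypothetical policy)."""
    if summary.failing_locations == 0:
        return Action.NONE
    # Assumption: every location failing at once points at a problem on our end.
    if summary.failing_locations == summary.total_locations:
        return Action.PAGE
    # Partial outage: only worth a notification once it has persisted.
    if summary.failing_minutes >= regional_grace_minutes:
        return Action.NOTIFY
    return Action.NONE
```

With this policy a single region failing for five minutes produces no alert, while the same region failing for 45 minutes produces a low-urgency notification.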

@davejrt
Contributor

davejrt commented Jun 10, 2020

I've just set up an alternative monitor on apex ping to see if any of the events on site24x7 over the next few days line up.

> 1. Something to alert us when Sourcegraph.com goes down because of a problem on our end, which we can address. This should e.g. wake someone up and call them via pagerduty.

My initial thought on this is that we need something closer to our stack to alert us when something is really broken, i.e. via grafana alert manager when the frontend isn't returning 200s or is returning a lot of 500 errors. I think by doing this we take out the site24x7 and cloudflare anomalies and stop people from potentially being woken up in the middle of the night. The trade-off is that if grafana or prometheus go down on the cluster, we don't get those alerts.

> 2. Something to alert us if Sourcegraph.com is inaccessible from one part of the world for an extended period of time, e.g. via email or Slack notifications.

I think this is where we can rely on site24x7.

Given our distributed nature, it's more than likely that someone on our internal team would catch these types of errors too, at least Monday to Friday.

@slimsag
Member Author

slimsag commented Jun 10, 2020

@davejrt we do (or will soon) have monitoring for e.g. 200-status response codes from the frontend, but I think this omits a few things:

  • If our cloudflare configuration is borked
  • If our SSL certificate has expired

I think those properties are worth testing and paging on in the middle of the night, just not from every location around the world. Thoughts on those two points?
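As a rough illustration of the certificate point, the sketch below checks how many days remain before a TLS certificate expires, using only the Python standard library. The function names and the 14-day paging threshold are assumptions for the example, not anything configured for Sourcegraph.com.

```python
import socket
import ssl
from datetime import datetime


def fetch_not_after(host: str, port: int = 443, timeout: float = 5.0) -> str:
    """Return the served certificate's notAfter field, e.g. 'Jun 20 00:00:00 2020 GMT'.

    The default context verifies the certificate, so an already expired
    certificate makes the handshake raise, which is itself a signal worth paging on.
    """
    context = ssl.create_default_context()
    with socket.create_connection((host, port), timeout=timeout) as sock:
        with context.wrap_socket(sock, server_hostname=host) as tls:
            return tls.getpeercert()["notAfter"]


def days_until_expiry(not_after: str, now: datetime) -> float:
    """Days between `now` (timezone-aware) and the certificate's expiry."""
    expires = ssl.cert_time_to_seconds(not_after)
    return (expires - now.timestamp()) / 86400


def should_page(not_after: str, now: datetime, threshold_days: float = 14.0) -> bool:
    # Hypothetical policy: page once fewer than `threshold_days` remain.
    return days_until_expiry(not_after, now) < threshold_days
```

A check like this only needs to run from one place, since certificate expiry does not depend on where in the world the probe runs.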

@slimsag
Member Author

slimsag commented Jun 10, 2020

Also, nice on setting up apex ping! Excited to hear how that goes and your experience with it

Also, if it has a better UX, etc.: I don't think anyone here is particularly tied to site24x7, so long as we can configure whatever service we use via terraform. So if apex ping seems better to you, that's a judgement call you can definitely make here.

@slimsag changed the title from "observability: switch away from site24x7 or find a way to make it more reliable" to "Improve reliability of Sourcegraph.com ping alerts" on Jun 23, 2020
@pecigonzalo
Contributor

pecigonzalo commented Jul 1, 2020

@davejrt Is it possible to update Site24x7 to alert only when a number of locations report problems?
I used Statuscake in the past (which can be configured using Terraform), and alerting that way minimized noise.

@slimsag changed the title from "Improve reliability of Sourcegraph.com ping alerts" to "Improve reliability of Sourcegraph.com site24x7 ping alerts" on Jul 1, 2020
@davejrt
Contributor

davejrt commented Jul 3, 2020

@pecigonzalo we do have multiple locations configured in site24x7, so it will only alert when at least 3 of the locations are failing. That said, in all the past alerts we've seen, every location was failing.
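For readers unfamiliar with the "at least 3 locations" rule, it amounts to a simple quorum over per-location results. The sketch below is illustrative only; the function names and location names are made up, and it does not reflect how site24x7 implements the rule internally.

```python
from typing import List, Mapping


def failing_locations(results: Mapping[str, bool]) -> List[str]:
    """Names of locations whose latest check failed (True means the check passed)."""
    return sorted(name for name, ok in results.items() if not ok)


def quorum_failed(results: Mapping[str, bool], min_failing: int = 3) -> bool:
    """Alert only when at least `min_failing` locations fail at the same time."""
    return len(failing_locations(results)) >= min_failing


# Hypothetical snapshot: two of five locations failing stays below the quorum.
snapshot = {"sydney": False, "london": True, "tokyo": False, "fremont": True, "mumbai": True}
print(quorum_failed(snapshot), failing_locations(snapshot))
```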

My approach thus far has been to monitor internally on the stack as well as testing externally, to determine where along the request chain there might be an issue, whether that be outside of our cluster (i.e. cloudflare) or inside the cluster.

@pecigonzalo
Contributor

pecigonzalo commented Jul 6, 2020

@davejrt That is great. I got the impression from one of the descriptions (or maybe the linked issue or a chat) that we were fully removing the external test.

@davejrt
Contributor

davejrt commented Jul 6, 2020

I'm a bit confused by https://github.com/sourcegraph/sourcegraph/issues/10742 now 😄. If the tests are failing from all locations, and some of the linked issues like #3909 are actual issues, what is telling us the test is flaky? https://sourcegraph.slack.com/archives/CJX299FGE/p1594032949233500

@pecigonzalo with regard to your question, my understanding is we're not specifically ruling out having an external monitoring tool, nor that there is no value in external testing. The concern thus far has been that site24x7 has reported a failure, which has either been:

  • ephemeral due to a deployment
  • an issue with cloudflare
  • an issue with site24x7 itself

The comment around "testing the whole internet", as I understand it, is that when these alerts are triggered, the root cause is not always our service but rather some upstream issue which we can't pinpoint. Hence, all we are doing is creating noise and saying "there is an issue somewhere on the internet...it isn't us but we don't know where".

My rationale was to improve or add to our checks by testing closer to our stack, to narrow down where the issue is. For example, if we are testing http://sourcegraph-frontend-internal as well as https://sourcegraph.com and only one is failing, we can at least remove one area of doubt and start to look elsewhere. By adding other checks with blackbox exporter, such as the ssl-cert check, we can eliminate that as an issue as well.
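The "remove one area of doubt" reasoning can be written down as a small decision table. The sketch below is an assumption-laden illustration: the function name and the wording of each diagnosis are invented for the example, and a real investigation would need more signals than two booleans.

```python
def localize(internal_ok: bool, external_ok: bool) -> str:
    """Guess where a failure sits from two probes of the same service.

    internal_ok: probe of the in-cluster address (e.g. http://sourcegraph-frontend-internal)
    external_ok: probe of the public address (e.g. https://sourcegraph.com)
    """
    if internal_ok and external_ok:
        return "healthy"
    if internal_ok:
        # The frontend answers inside the cluster, so look at what sits in front of it.
        return "outside the cluster (e.g. Cloudflare, DNS, TLS)"
    if external_ok:
        # The public site works, so suspect the internal probe path itself.
        return "internal probe path (public site still reachable)"
    # Both probes fail: the frontend itself is the most likely culprit.
    return "inside the cluster (frontend itself)"
```

The value of running both probes is that three of the four outcomes immediately narrow down where to look first.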

@pecigonzalo
Contributor

> my understanding is we're not specifically ruling out having an external monitoring tool, nor that there is no value in external testing.
> [...]
> My rationale was to improve or add to our checks by testing closer to our stack, to narrow down where the issue is. For example, if we are testing http://sourcegraph-frontend-internal as well as https://sourcegraph.com and only one is failing, we can at least remove one area of doubt and start to look elsewhere. By adding other checks with blackbox exporter, such as the ssl-cert check, we can eliminate that as an issue as well.

This is clear and makes sense, as do the changes being implemented. What I'm trying to understand is the scope of the problem and if we have some other root problems.

> • ephemeral due to a deployment

In some cases, this should trigger an alert, but we might need to update our deployments to mute alerts for a period during planned outages.

> • an issue with cloudflare
>
> The comment around "testing the whole internet", as I understand it, is that when these alerts are triggered, the root cause is not always our service but rather some upstream issue which we can't pinpoint. Hence, all we are doing is creating noise and saying "there is an issue somewhere on the internet...it isn't us but we don't know where".

If our DNS is failing or routing/internet providers are failing, and some of our users can't use the service, I believe we should get a notification. I agree that we can't action it and it's mostly FYI (maybe we can notify Slack and not page), but I would expect that 3+ locations failing to reach us is not common and should not be creating a lot of noise.

@keegancsmith
Member

I agree there is value in external checks. My take is we probably have too many in site24x7, and should just rely on "can the internet reach sourcegraph-frontend on kubernetes". Then we could rely on something like blackbox_exporter in our cluster to test each of the API endpoints we care about. Then we have normal prometheus metrics, and the same alerting infra we use for our services. That alerting infra is much nicer to configure to not be noisy, look at graphs, etc.

@daxmc99
Contributor

daxmc99 commented Jul 13, 2020

Dear all,

This is your release captain speaking. 🚂🚂🚂

Branch cut for the 3.18 release is scheduled for tomorrow.

Is this issue / PR going to make it in time? Please change the milestone accordingly.
When in doubt, reach out!

Thank you

@davejrt
Contributor

davejrt commented Jul 17, 2020

Blackbox exporter is now running on sourcegraph.com and can be queried via grafana using the probe_http_status_code metric to get a picture of when our services/sites are returning status codes other than 200.

Endpoints and alerts can be configured by updating the prometheus.yaml in our configmap.
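Outside of grafana, the same metric can be read through the Prometheus HTTP API (`/api/v1/query`). The sketch below builds an instant-query URL and picks the non-200 targets out of a response. The helper names are made up for illustration, and it assumes the probed URL ends up in the `instance` label, which is a common blackbox exporter relabeling but may not match the actual prometheus.yaml.

```python
from typing import Dict
from urllib.parse import urlencode


def query_url(base_url: str, promql: str) -> str:
    """Build an instant-query URL for the Prometheus HTTP API."""
    return f"{base_url.rstrip('/')}/api/v1/query?{urlencode({'query': promql})}"


def non_200_targets(response: dict) -> Dict[str, int]:
    """Map each probed target to its status code when that code is not 200.

    `response` is the decoded JSON body of an instant query for
    probe_http_status_code (result type "vector").
    """
    if response.get("status") != "success":
        raise ValueError("Prometheus query did not succeed")
    failing: Dict[str, int] = {}
    for sample in response["data"]["result"]:
        _timestamp, raw_value = sample["value"]  # sample values arrive as strings
        code = int(float(raw_value))
        if code != 200:
            # Assumption: relabeling stores the probed URL in `instance`.
            target = sample["metric"].get("instance", "<unknown>")
            failing[target] = code
    return failing
```

Fetching the URL (for example with `urllib.request.urlopen`) and decoding the JSON body is left out so that the parsing logic can be exercised without a running Prometheus.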

@davejrt
Contributor

davejrt commented Jul 17, 2020

Closed by sourcegraph/deploy-sourcegraph-dot-com#2984


6 participants