This repository has been archived by the owner on Feb 8, 2018. It is now read-only.

update policies in light of recent downtime #4124

Closed
arcfide opened this issue Sep 4, 2016 · 37 comments

Comments

@arcfide

arcfide commented Sep 4, 2016

Where would I check to find out the status of the Gratipay website and services?

@rohitpaulk
Contributor

Gratipay was indeed down. I restarted the dynos, and it looks like it is back up now.

[Screenshot: 2016-09-04 at 1:24:18 PM]

@rohitpaulk
Contributor

rohitpaulk commented Sep 4, 2016

@arcfide We've got some stats up on http://inside.gratipay.com/appendices/health. Thanks for notifying us here :)

@rohitpaulk
Contributor

[Screenshot: 2016-09-04 at 1:26:49 PM]

@rohitpaulk
Contributor

Was down for over an hour 😕

@rohitpaulk
Contributor

Heroku shows alerts on its dashboard:

[Screenshot: 2016-09-04 at 1:29:12 PM]

@rohitpaulk
Contributor

According to the graphs, this is the time window we were down for:

4th September 2016
06:10 AM - 07:50 AM
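
For reference, the length of that window can be worked out directly. A minimal sketch in Python, taking the two times above at face value (the time zone of the graphs isn't stated, so none is assumed):

```python
from datetime import datetime

# Times as read off the graphs above; the time zone is not stated, so none is assumed.
start = datetime(2016, 9, 4, 6, 10)
end = datetime(2016, 9, 4, 7, 50)

downtime = end - start
downtime_minutes = int(downtime.total_seconds() // 60)
print(downtime_minutes)  # 100
```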

@rohitpaulk
Contributor

:( We've exhausted our papertrail capacity, so can't view logs

[Screenshot: 2016-09-04 at 1:32:32 PM]

@rohitpaulk
Contributor

No new events on Sentry

@rohitpaulk
Contributor

  • Fix/Re-activate pagerduty alerts
  • Upgrade papertrail plan

@rohitpaulk
Contributor

I don't see any obvious traffic spikes

@rohitpaulk
Contributor

No hackerone reports either

@rohitpaulk
Contributor

Heroku has a button to 'Configure Alerts', but I'm not able to click on it. I guess this is because we're using Hobby dynos, not the Professional ones.

@arcfide
Author

arcfide commented Sep 4, 2016

Glad to see it back up. I can't tell from the above trail whether the Gratipay team is automatically notified of downtime. Anyway, I hope the above issues are easily resolved. Keep up the good work.

@rohitpaulk
Contributor

:) We used to have PagerDuty alerts long ago. I think either our subscription has expired or we haven't set them up properly.

#4124 (comment) is my checklist for how we can better address such situations in the future

@techtonik
Contributor

We need a spec for engineering maximum uptime reliability (an SLA?), of which monitoring is a critical part:

  • notify when Gratipay is down
    • notification list with free subscribe
  • provide manual instructions for troubleshooting downtime
  • deploy status/monitoring server on independent server
    • make sure monitor is monitored

Do I overengineer it? =)
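
A rough sketch of the first bullet (notify when the site is down), in Python. The function names, the subscriber list, and the `send` callback are all made up for illustration; nothing here is Gratipay's actual setup. The HTTP fetch is injectable so the logic can be tried without touching the network.

```python
import urllib.error
import urllib.request


def _http_status(url, timeout):
    # urlopen raises HTTPError for 4xx/5xx; report its code instead of failing.
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status
    except urllib.error.HTTPError as err:
        return err.code


def is_up(url, fetch_status=None, timeout=10):
    """Return True if `url` answers with a 2xx or 3xx status."""
    if fetch_status is None:
        fetch_status = _http_status
    try:
        status = fetch_status(url, timeout)
    except OSError:  # URLError and socket timeouts are OSError subclasses
        return False
    return 200 <= status < 400


def check_and_notify(url, subscribers, send, fetch_status=None):
    """Check `url` once; on failure, call `send(subscriber, message)` for everyone."""
    if is_up(url, fetch_status):
        return []
    message = f"{url} appears to be down"
    for subscriber in subscribers:
        send(subscriber, message)
    return list(subscribers)
```

Running `check_and_notify` from a scheduler on a machine outside the main hosting account would roughly cover the "independent server" bullet as well.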

@techtonik
Contributor

> :( We've exhausted our papertrail capacity, so can't view logs

What a fail. Paying money just to be told "you haven't been paying enough, sorry". =)

I guess there aren't any distributed monitoring solutions where the nodes ping each other and send alerts when only one monitoring node is left. Each node would need to know somehow that it is still connected to the net, so that it won't panic and start collecting deferred alerts in a mailbox.

If that were possible, I could run a monitor on my machine too.
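
The idea above can be sketched as a small per-node decision rule in Python. This is one reading of it, not an existing tool: the function name, the three inputs, and the choice to defer (rather than alert) when the node itself looks offline are all assumptions.

```python
def decide(target_up, peers_reachable, internet_reachable):
    """Return "ok", "alert", or "defer" for a single monitoring node.

    target_up: whether this node could reach the monitored site
    peers_reachable: number of other monitoring nodes this node could reach
    internet_reachable: whether this node could reach an unrelated reference host
    """
    if not internet_reachable:
        # The node itself is probably offline: hold its findings, don't raise alarms.
        return "defer"
    if not target_up:
        return "alert"
    if peers_reachable == 0:
        # Online, but every other monitor is silent: this is the last node standing.
        return "alert"
    return "ok"
```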

@chadwhitacre
Contributor

Blech. 😞

> Fix/Re-activate pagerduty alerts

We do have PagerDuty still, and I did get SMS notifications for this. I didn't see them until after seeing this, however. Maybe a new ticket to sort out a PagerDuty rotation?

@chadwhitacre
Contributor

> Maybe a new ticket to sort out a PagerDuty rotation?

Or we can do it here. I guess we take the payday rotation as our pattern? Should it be the same rotation? cc: @clone1018

@techtonik
Contributor

Why is there no service like http://www.fedmsg.com/en/latest/ where anyone can subscribe?

@rohitpaulk
Contributor

How bout we just fan out messages - and the first person to see them will respond?

@rohitpaulk
Contributor

(rather than rotation)

@techtonik
Contributor

Yea, and ping someone who can ping someone who can ping the servers. =)

@chadwhitacre
Contributor

> How bout we just fan out messages - and the first person to see them will respond?

I like the idea of fanning out messages, but I think we should also have someone on call, or it will only be a matter of time before an incident occurs and no-one sees it for longer than we'd like.
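
One way to combine the two ideas, sketched in Python: fan out to everyone right away, then fall back to whoever is on call if nobody has acknowledged. The 15-minute delay, the channel names, and the example people are assumptions for illustration, not anything configured in PagerDuty.

```python
def notification_plan(team, on_call, escalate_after=15):
    """Return (minute, person, channel) tuples for one unacknowledged incident."""
    # Fan out to everyone the moment the incident opens.
    plan = [(0, person, "sms") for person in team]
    # Fallback: if nobody has acknowledged by then, phone the on-call person.
    plan.append((escalate_after, on_call, "phone"))
    return plan


for step in notification_plan(["alice", "bob"], on_call="alice"):
    print(step)
```

An acknowledgement from anyone would cancel the remaining steps, so the on-call person only gets the phone call when the fan-out was missed by everybody.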

@chadwhitacre
Contributor

> Fix/Re-activate pagerduty alerts

@rohitpaulk You should have a PagerDuty invite. I guess for now adding you to the list doubles our capacity, ya?

> Upgrade papertrail plan

Is the $7 a month plan okay?

$/mo  MB/day  search (days)  archive (days)
   0      10              2               7
   7      50              7             365
  15     100              7             365
  29     200              7             365
  65     500              7             365
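
To sanity-check the choice, here is a small Python sketch that picks the cheapest plan covering a given daily log volume. The plan figures are copied from the table above; the function name and the example volumes are hypothetical, since the actual log volume isn't given in this thread.

```python
# (price in $/month, MB/day, search days, archive days), copied from the table above.
PLANS = [
    (0, 10, 2, 7),
    (7, 50, 7, 365),
    (15, 100, 7, 365),
    (29, 200, 7, 365),
    (65, 500, 7, 365),
]


def cheapest_plan(mb_per_day):
    """Return the cheapest plan whose daily allowance covers `mb_per_day`, or None."""
    candidates = [plan for plan in PLANS if plan[1] >= mb_per_day]
    # Tuples compare by their first element (price) first, and prices are unique.
    return min(candidates, default=None)
```

For example, `cheapest_plan(40)` returns the $7 row, while anything above 50 MB/day would need the $15 plan or higher.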

@chadwhitacre changed the title from "Is Gratipay currently down?" to "update policies in light of recent downtime" on Sep 21, 2016

@chadwhitacre
Contributor

I'm seeing SMS notifications that we were down for about two hours earlier today (ending about six hours ago).

@chadwhitacre
Contributor

@rohitpaulk Looks like you accepted the PagerDuty invite. Did you receive the notifications? Did you take any action?

@chadwhitacre
Contributor

chadwhitacre commented Sep 28, 2016

Hmm ... now it's looking like we got DoS'd by a security researcher. The timeline syncs up with https://hackerone.com/reports/172573.

[Screenshot: 2016-09-28 at 1:20:20 PM]

@chadwhitacre
Contributor

I've upgraded Papertrail to the $7/mo plan.

@rohitpaulk
Contributor

> Did you receive the notifications? Did you take any action?

Yes, I've accepted it - and I'm not sure what to do now...

@chadwhitacre
Contributor

> Do I overengineer it? =)

Maybe a bit. :)

> notify when Gratipay is down

We've got notifications, now to two of us. !m @rohitpaulk :)

> deploy status/monitoring server on independent server

Done. We use Uptime Robot for monitoring, and it emails PagerDuty, which fans out via SMS.

> make sure monitor is monitored

Meh. That's Uptime Robot's problem for now. ;)

> notification list with free subscribe

http://rss.uptimerobot.com/u128001-24a87b86a29b9037c031a7e6691db9fc 💃

> provide manual instructions for troubleshooting downtime

The catch is that this requires access to our Heroku account, which right now resides with @rohitpaulk @clone1018 and myself. Furthermore, debugging downtime is a complex task, not a rote one. I don't see value in maintaining complex documentation for just the three of us.

@chadwhitacre
Contributor

> Yes, I've accepted it - and I'm not sure what to do now...

I've reconfigured our PagerDuty escalation policy to simply notify you and me immediately after an incident is triggered. I've manually created an incident in order to test. Didja get a notification? :-)

@chadwhitacre
Contributor

I've added Papertrail to the budget.

@rohitpaulk
Contributor

Didn't get the notification the first time around as a call (coz I had entered my phone number wrong) - fixed now.

So. Good to close? :)

@chadwhitacre
Contributor

Yeah, I think so! 💃

@chadwhitacre
Contributor

> Didn't get the notification the first time around as a call

Does that mean someone is wondering what the heck the message they got is about? 😁

@rohitpaulk
Contributor

Probably :D

4 participants