update policies in light of recent downtime #4124
Comments
@arcfide We've got some stats up on http://inside.gratipay.com/appendices/health, thanks for notifying here :) |
Was down for over an hour 😕 |
According to the graphs, this is the time window we were down for: 4th September 2016 |
No new events on Sentry |
|
I don't see any obvious traffic spikes |
No hackerone reports either |
Heroku has a button to 'Configure Alerts', but I'm not able to click on it - I guess this is because we're using Hobby dynos, not the Professional ones |
Glad to see it back up. I can't tell from the above trail whether the Gratipay team is automatically notified of downtime or not? Anyways, glad to see that things are up. Hope the above issues are easily resolved. Keep up the good work. |
:) We used to have PagerDuty alerts long ago; I think our subscription has either expired or we haven't set them up properly. #4124 (comment) is my checklist for how we can better address such situations in the future |
We need a spec for engineering maximum uptime reliability (an SLA?), of which monitoring is a critical part:
Am I overengineering it? =) |
What a fail. Paying money just to be told "you haven't been paying enough, sorry". =) I guess there aren't any distributed monitoring solutions where the nodes ping each other and send alerts when only one monitoring node is left. Each node would need to know somehow that it is still connected to the net, so that it doesn't panic and start collecting deferred alerts in a mailbox. If that were possible, I could run a monitor on my machine too. |
Blech. 😞
We do have PagerDuty still, and I did get SMS notifications for this. I didn't see them until after seeing this, however. Maybe a new ticket to sort out a PagerDuty rotation? |
Or we can do it here. I guess we take the payday rotation as our pattern? Should it be the same rotation? cc: @clone1018 |
Why is there no service like http://www.fedmsg.com/en/latest/ that anyone can subscribe to? |
How about we just fan out messages, and the first person to see them responds? |
(rather than rotation) |
Yea, and ping someone who can ping someone who can ping the servers. =) |
I like the idea of fanning out messages, but I think we should also have someone on call, or it will only be a matter of time before an incident occurs and no-one sees it for longer than we'd like. |
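A rough sketch of that combination, in case it helps the discussion: notify everyone at once, then escalate to whoever is on call if nobody has acknowledged. This is not PagerDuty's API; `Incident`, `fan_out`, `escalate_if_unacked`, and the `send` callback are all hypothetical names made up for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional

# Hypothetical types and names; nothing here corresponds to a real paging API.
Send = Callable[[str, str], None]  # (recipient, message) -> None


@dataclass
class Incident:
    summary: str
    acknowledged_by: Optional[str] = None
    notified: List[str] = field(default_factory=list)


def fan_out(incident: Incident, responders: List[str], send: Send) -> None:
    """Notify every responder at once; whoever sees it first can acknowledge."""
    for person in responders:
        send(person, incident.summary)
        incident.notified.append(person)


def escalate_if_unacked(incident: Incident, on_call: str, send: Send) -> bool:
    """If nobody has acknowledged yet, page the on-call person explicitly.

    Returns True when an escalation was sent, False otherwise.
    """
    if incident.acknowledged_by is not None:
        return False
    send(on_call, "ESCALATION: " + incident.summary)
    incident.notified.append(on_call)
    return True
```

In practice the escalation step would run after some timeout, and `send` would wrap whatever channel is in use (SMS, email, phone call).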
@rohitpaulk You should have a PagerDuty invite. I guess for now adding you to the list is doubling our capacity, ya?
Is the $7 a month plan okay?
|
I'm seeing SMS notifications that we were down for about two hours earlier today (ending about six hours ago). |
@rohitpaulk Looks like you accepted the PagerDuty invite. Did you receive the notifications? Did you take any action? |
Hmm ... now it's looking like we got DoS'd by a security researcher. The timeline syncs up with https://hackerone.com/reports/172573. |
I've upgraded Papertrail to the $7/mo plan. |
Yes, I've accepted it - and I'm not sure what to do now... |
Maybe a bit. :)
We've got notifications, now to two of us. !m @rohitpaulk :)
Done. We use Uptime Robot for monitoring, and it emails PagerDuty, which fans out via SMS.
Meh. That's Uptime Robot's problem for now. ;)
http://rss.uptimerobot.com/u128001-24a87b86a29b9037c031a7e6691db9fc 💃
The catch is that this requires access to our Heroku account, which right now resides with @rohitpaulk @clone1018 and myself. Furthermore, debugging downtime is a complex task, not a rote one. I don't see value in maintaining complex documentation for just the three of us. |
I've reconfigured our PagerDuty escalation policy to simply notify you and me immediately after an incident is triggered. I've manually created an incident in order to test. Didja get a notification? :-) |
I've added Papertrail to the budget. |
Didn't get the notification the first time around as a call (coz I had entered my phone number wrong) - fixed now. So. Good to close? :) |
Yeah, I think so! 💃 |
Does that mean someone is wondering what the heck the message they got is about? 😁 |
Probably :D |
Where would I check to find out the status of the Gratipay website and services?