Runbooks

Here is a list of things to do if something goes sideways.

Step One - Get your bearings

Before you do anything else, ACK the page first, then check out the metrics dashboards to get your bearings:

Posthog Cloud

Instance Health

Step Two - Determine urgency

Three categories of failure by impact

  1. Site is down - events are flowing
  2. Events service is down - Site is up
  3. Site and events are down - bad news

Levels of Urgency

  • Level 0 - Wake up anyone who can help. Data loss is occurring.
  • Level 1 - Try your best to figure it out, then wake up anyone who can help
  • Level 2 - If you are on call, you should wake up and try to fix it. If you can't figure it out, email or ping others to loop them in for later.
  • Level 3 - If you are up, look at it; if you are asleep, look at it first thing in the morning.
  • Level 4 - Normal tasks

Site is down - events are flowing. Level 1

If this is the case, it is a huge inconvenience for our users, but the pain is temporary. Work with customers to set expectations that the site will be back online soon (hopefully) and that their events are still being ingested.

Events service is down - Level 0

If events are not being ingested, there is a problem. This is long-term pain for customers because the missing events could be critical to their business learnings. This is very urgent, so do not worry about waking up your coworkers to find the root cause. They will thank you for finding the issue.

Both services are down - Level 0

The same as the events service being down, but hopefully, even if metrics didn't catch it, you can catch it sooner because the symptoms are obvious. Don't worry about waking up a team member for this!

I think about this like I think about Gmail. If Gmail is down, it's annoying. I'll probably stick with it for a bit before moving. If Gmail is not receiving emails that I should be receiving I lose trust in it and will probably move to something else.

Step Three - Debug

This section is dedicated to debugging steps.

Clickhouse CPU

This one has been bothersome recently, usually at night for us in the US. If you notice that the application is not responding, but the response-time (>30s) alert is not paging, you are probably having this issue.

The first thing to do is to confirm on the Instance Health dashboard that the primary node of the clickhouse cluster is indeed pinned by CPU.
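
If you want to double-check from the node itself (assuming you have ssh access to ch.posthog.net, as in the restart step below), top will show whether clickhouse-server is the process pinning the CPU, and a quick query against the system.processes table will show what it is busy with. This is a supplementary check, not a replacement for the dashboard:

top -o %CPU

clickhouse-client --query "SELECT query_id, elapsed, substring(query, 1, 120) AS query FROM system.processes ORDER BY elapsed DESC"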

If it is, you will need to restart clickhouse, which is easier than it sounds. Just ssh onto the clickhouse node at ch.posthog.net and execute:

sudo service clickhouse-server restart

Immediately after that you will want to ensure that the server is up by running

sudo service clickhouse-server status 

This will return a summary that says clickhouse is either Active: active (running) or stopped. If it is stopped, call James or Tim for further debugging.
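
As an extra sanity check (assuming clickhouse-client is installed on the node alongside the server, which is the default), you can also confirm that the server is actually accepting queries and not just reporting as running:

clickhouse-client --query "SELECT 1"

If this hangs or errors even though the service shows active, treat it the same as a stopped server and escalate.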

Clickhouse Table Latency

If you are getting paged about table latency being critical, and you have checked the posthog cloud metrics page and sure enough latency looks like a hockey stick, you'll need to bump a setting on clickhouse. To do this, you'll need to ssh onto the primary clickhouse node at ch.posthog.net.

First check the error logs:

 tail -f /var/log/clickhouse-server/clickhouse-server.err.log

These logs can be a bit overwhelming, so what I generally do is grep them for the table that is running behind.

 tail -f /var/log/clickhouse-server/clickhouse-server.err.log | grep <table name>
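
If you want to confirm the too-many-parts situation directly instead of relying on log lines, you can count active parts per table from the system.parts system table (a sketch; the tables with the highest counts are the ones running behind):

clickhouse-client --query "SELECT database, table, count() AS active_parts FROM system.parts WHERE active GROUP BY database, table ORDER BY active_parts DESC LIMIT 10"

A table whose active_parts is approaching parts_to_throw_insert is the one rejecting inserts with "too many parts" errors.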

If you see a lot of errors saying that ingestion is too slow and that there are too many parts, you will need to update the merge-tree configs for clickhouse. These are the configs for the MergeTree storage engine that we use for all of the tables we have in clickhouse.

sudo vim /etc/clickhouse-server/config.xml

Find the settings for merge trees

<!-- Settings to fine tune MergeTree tables. See documentation in source code, in MergeTreeSettings.h -->
    <merge_tree>
        <max_suspicious_broken_parts>5</max_suspicious_broken_parts>
        <parts_to_throw_insert>20000</parts_to_throw_insert>
    </merge_tree>

Whatever the setting for parts_to_throw_insert is, bump it by about 5k. This will only buy you time to recover reading from Kafka, so raise the issue at the next standup so that the root cause can be addressed. After you update this config, you will need to restart clickhouse-server.
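
If you want to confirm the value that is actually in effect before and after the edit (a minimal check; the config change only takes effect after the restart below), you can read it back from system.merge_tree_settings:

clickhouse-client --query "SELECT name, value FROM system.merge_tree_settings WHERE name = 'parts_to_throw_insert'"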

To restart clickhouse server run this on the same node that you updated the configs on:

sudo service clickhouse-server restart

Immediately after that you will want to ensure that the server is up by running

sudo service clickhouse-server status 
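
Once the server is back up, it is worth keeping an eye on the part count for the table that was running behind to confirm it trends back down as kafka ingestion catches up and merges proceed. Re-running the parts query from above every minute or so works fine:

clickhouse-client --query "SELECT count() AS active_parts FROM system.parts WHERE active AND table = '<table name>'"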

Flapping ECS Service

If you are getting paged repeatedly about the site being down and it automatically recovers, one of the ECS services is flapping. This happens because ECS cannot scale the service fast enough to make up for the slowdown that comes with increased traffic. When the service is hit with a stampede of traffic all at once, it suddenly slows down, which causes the health checks from the load balancers to fail. That reduces the number of running healthy tasks, which slows the service down even more. ECS will attempt to recover the failed tasks, which brings the service back online, but then the cycle repeats as traffic is slowly brought back by the load balancer.
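
You can see the flapping directly in the service's recent events and in the gap between desired and running task counts. A hedged example using the AWS CLI, with placeholder cluster and service names:

aws ecs describe-services --cluster <cluster name> --services <service name> --query "services[0].{desired: desiredCount, running: runningCount, events: events[:5]}"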

The best thing to do in this case is to go to the ECS console on AWS, select the affected service that is flapping, update the service, and significantly increase the desired number of tasks. This will force ECS to scale up and then slowly scale back down to a reasonable number of tasks for the service.
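
If you prefer the CLI to the console, the same scale-up can be done with update-service. This is a sketch with placeholder names and an arbitrary count; pick a desired count well above what the service normally runs:

aws ecs update-service --cluster <cluster name> --service <service name> --desired-count 20

Once the health checks are passing again, ECS will slowly scale back down to a reasonable number of tasks, as described above.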