Add oncall docs (#144)
abtris authored Apr 21, 2024
2 parents f6886c5 + ea9eac6 commit 63956ec
Showing 1 changed file with 91 additions and 0 deletions.
91 changes: 91 additions & 0 deletions content/courses/how-to-make-oncall/chapter10a.md
@@ -0,0 +1,91 @@
---
title: On-Call Documentation
toc: false
type: docs
date: "2024-04-18T00:00:00+01:00"
draft: false
menu:
  how-to-make-oncall:
    parent: Intro
    weight: 10

# Prev/next pager order (if `docs_section_pager` enabled in `params.toml`)
weight: 15
---

Writing documentation effectively can be achieved by following a framework such as [Diátaxis](https://diataxis.fr/). Diátaxis is a way of thinking about and doing documentation. We will apply a similar approach and split the documentation into four parts.

- [Tutorials](#tutorials)
- [How-to Guides](#how-to-guides)
  - [Runbooks / Playbooks (Google SRE naming)](#runbooks--playbooks-google-sre-naming)
  - [How to create an effective runbook](#how-to-create-an-effective-runbook)
- [Glossaries](#glossaries)
- [Explanation](#explanation)

## Tutorials

What You Need for On-Call: a tutorial that helps new on-call members learn the process and tools and gain all the access they need for an on-call shift.

- [PagerDuty Incident Response](https://response.pagerduty.com/)

## How-to Guides

Standard Operating Procedures (SOPs) cover deploying, rotating secrets, adding new regions, scaling down and up, adding capacity, and so on. Runbooks for incidents: every actionable alert needs a runbook, and it should be written and tested regularly to ensure it works and is easy to follow.

You can start with a checklist as mentioned in the SRE Workbook[^1].

- Administering production jobs
- Understanding debugging info
- "Draining" traffic away from a cluster
- Rolling back a bad software push
- Blocking or rate-limiting unwanted traffic
- Bringing up additional serving capacity
- Using the monitoring systems (for alerting and dashboards)
- Describing the architecture, various components, and dependencies of the services


Before going on-call, the team should review precise guidelines about the responsibilities of on-call engineers.

For example:
- At the start of each shift, the on-call engineer reads the handoff from the previous shift.
- The on-call engineer minimizes user impact first, then makes sure the issues are fully addressed.
- At the end of the shift, the on-call engineer sends a handoff email to the next engineer on-call.

### Runbooks / Playbooks (Google SRE naming)

As The Site Reliability Workbook[^1] says, playbooks "reduce stress, the mean time to repair (MTTR), and the risk of human error."

All alerts should be immediately actionable: each page should call for an action that a human must take right away and that the system cannot take itself. The signal-to-noise ratio should be high to ensure few false positives; a low signal-to-noise ratio raises the risk that on-call engineers develop alert fatigue.
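
One way to make the signal-to-noise ratio concrete is to track what fraction of pages actually required a human to act. The sketch below assumes a hypothetical page log; the `Page` record, its fields, and the 0.5 threshold are illustrative choices, not taken from any paging tool.

```python
from dataclasses import dataclass


@dataclass
class Page:
    """One page received by the on-call engineer (hypothetical record)."""
    alert: str
    actionable: bool  # True if a human had to take action


def actionable_ratio(pages: list[Page]) -> float:
    """Return the fraction of pages that required human action (0.0 if empty)."""
    if not pages:
        return 0.0
    return sum(p.actionable for p in pages) / len(pages)


def noisy_alerts(pages: list[Page], threshold: float = 0.5) -> list[str]:
    """Return alert names whose actionable fraction is below the threshold."""
    by_alert: dict[str, list[Page]] = {}
    for page in pages:
        by_alert.setdefault(page.alert, []).append(page)
    return sorted(
        name for name, group in by_alert.items()
        if actionable_ratio(group) < threshold
    )
```

Alerts returned by `noisy_alerts` are candidates for tuning, demoting to a ticket, or removal during the regular alert review.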

Just like new code, new alerts should be thoroughly and thoughtfully reviewed. Each alert should have a corresponding playbook (runbook) entry.

- [Example of runbook template](https://github.com/SkeltonThatcher/run-book-template/blob/master/run-book-template.md)
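
The rule that each alert has a runbook entry can be checked automatically during alert review. The following is a minimal sketch that assumes alert definitions have already been loaded into dictionaries with a `name` and an `annotations` mapping containing a `runbook_url`; that layout and the function name are assumptions made for illustration, so adapt them to however your alerts are actually defined.

```python
def alerts_missing_runbook(alerts: list[dict]) -> list[str]:
    """Return the names of alerts that have no non-empty runbook link.

    Assumes each alert looks like:
        {"name": "...", "annotations": {"runbook_url": "..."}}
    (a hypothetical layout, not a specific tool's schema).
    """
    missing = []
    for alert in alerts:
        annotations = alert.get("annotations") or {}
        url = (annotations.get("runbook_url") or "").strip()
        if not url:
            missing.append(alert.get("name", "<unnamed>"))
    return missing
```

Running a check like this in CI and failing the build when the returned list is non-empty keeps new alerts from shipping without a runbook.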

### How to create an effective runbook

Any good runbook has five attributes, the five As[^3]. It must be:

- **Actionable**. It’s nice to know the big picture and architecture of a system, but when you are looking for a runbook, you’re looking to take action based on a particular situation.
- **Accessible**. If you can’t find the runbook, it doesn’t matter how well it is written.
- **Accurate**. If it doesn’t contain truthful information, it’s worse than nothing at all.
- **Authoritative**. It is confusing to have more than one runbook for any given process.
- **Adaptable**. Systems evolve, and if you can’t change your runbook, the drift will make it unusable.
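
Two of these attributes lend themselves to simple automated checks: "Authoritative" (only one runbook per process) and "Accurate"/"Adaptable" (runbooks are reviewed as systems change). The sketch below assumes a hypothetical runbook index where each entry has a `process`, a `path`, and a `last_reviewed` date; these field names and the 180-day review window are assumptions, not something prescribed by the sources cited here.

```python
from collections import defaultdict
from datetime import date, timedelta


def duplicate_runbooks(runbooks: list[dict]) -> dict[str, list[str]]:
    """Map each process to its runbook paths when it has more than one."""
    by_process: dict[str, list[str]] = defaultdict(list)
    for rb in runbooks:
        by_process[rb["process"]].append(rb["path"])
    return {
        process: sorted(paths)
        for process, paths in by_process.items()
        if len(paths) > 1
    }


def stale_runbooks(
    runbooks: list[dict], today: date, max_age_days: int = 180
) -> list[str]:
    """Return paths of runbooks last reviewed more than max_age_days ago."""
    cutoff = today - timedelta(days=max_age_days)
    return sorted(rb["path"] for rb in runbooks if rb["last_reviewed"] < cutoff)
```

Neither check proves a runbook is correct; they only flag candidates for a human to merge or re-test.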

## Glossaries

Glossaries[^2] can be helpful for a few reasons:

- Glossaries help you avoid repetition. When you can refer to a definition with a linked explanation, you save time and words.
- Glossaries ensure consistency in descriptions. If something is explained in multiple ways, it can become confusing.
- Glossaries enhance the usability of a runbook for engineers at all experience levels. By including a glossary in your runbook, you provide essential explanations for newer engineers and streamline information for more experienced ones.

## Explanation

- [RFC](https://en.wikipedia.org/wiki/Request_for_Comments)
- [RFD](https://rfd.shared.oxide.computer/rfd/0001)
- [ADR](https://adr.github.io/)
- Papers, wiki pages, or anything else that helps explain why things are the way they are and how they work in detail.

[^1]: [Google SRE Workbook](https://sre.google/workbook/on-call/)
[^2]: [Transposit SRE Blog on Writing Runbook Documentation](https://www.transposit.com/devops-blog/sre/2020.01.30-writing-runbook-documentation-when-youre-an-sre/)
[^3]: [Transposit ITSM Blog on Good Runbooks](https://www.transposit.com/devops-blog/itsm/what-makes-a-good-runbook/)
