From d93c9d36af863b61392b72196c2f265bbc6f0247 Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec
Date: Fri, 19 Apr 2024 13:18:15 +0200
Subject: [PATCH 01/11] Add oncall docs

---
 .../courses/how-to-make-oncall/chapter10a.md | 91 +++++++++++++++++++
 1 file changed, 91 insertions(+)
 create mode 100644 content/courses/how-to-make-oncall/chapter10a.md

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
new file mode 100644
index 0000000..b0a9fba
--- /dev/null
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -0,0 +1,91 @@
+---
+title: On-Call Documentation
+toc: false
+type: docs
+date: "2024-04-18T00:00:00+01:00"
+draft: false
+menu:
+  how-to-make-oncall:
+    parent: Intro
+    weight: 10
+
+# Prev/next pager order (if `docs_section_pager` enabled in `params.toml`)
+weight: 15
+---
+
+For writing documentation is good follow framework as [Diátaxis](https://diataxis.fr/). Diátaxis is a way of thinking about and doing documentation. We will apply similar approach and split documentation into four parts.
+
+- [Tutorials](#tutorials)
+- [How-to Guides](#how-to-guides)
+  - [Runbooks / Playbooks (Google SRE naming)](#runbooks--playbooks-google-sre-naming)
+  - [How to create an effective runbook](#how-to-create-an-effective-runbook)
+- [Glossaries](#glossaries)
+- [Explanation](#explanation)
+
+## Tutorials
+
+What you need for On-Call - tutorial that help new on-call member onboard into process, tools and get all access need it for on-call shift.
+
+- [PagerDuty Incident Response](https://response.pagerduty.com/)
+
+## How-to Guides
+
+Standard Operation Procedures (SOP) for deploying, rotating secrets, adding new regions, scale down and up, adding capacity etc. Runbooks for Incident - every actionable alert need runbook and we should write it and test it on regular basic that works and is easy to understand.
+
+You can start with checklist as mention in SRE Workbook[^1]
+
+- Administering production jobs
+- Understanding debugging info
+- "Draining" traffic away from a cluster
+- Rolling back a bad software push
+- Blocking or rate-limiting unwanted traffic
+- Bringing up additional serving capacity
+- Using the monitoring systems (for alerting and dashboards)
+- Describing the architecture, various components, and dependencies of the services
+
+Before going on-call, the team reviewed precise guidelines about the responsibilities of on-call engineers.
+
+For example:
+- At the start of each shift, the on-call engineer reads the handoff from the previous shift.
+- The on-call engineer minimizes user impact first, then makes sure the issues are fully addressed.
+- At the end of the shift, the on-call engineer sends a handoff email to the next engineer on-call.
+
+### Runbooks / Playbooks (Google SRE naming)
+
+As The Site Reliability Workbook[^1] says, playbooks "reduce stress, the mean time to repair (MTTR), and the risk of human error."
+
+All alerts should be immediately actionable. There should be an action we expect a human to take immediately after they receive the page that the system is unable to take itself. The signal-to-noise ratio should be high to ensure few false positives; a low signal-to-noise ratio raises the risk for on-call engineers to develop alert fatigue.
+
+Just like new code, new alerts should be thoroughly and thoughtfully reviewed. Each alert should have a corresponding playbook (runbook) entry.
+
+- [Example of runbook template](https://github.com/SkeltonThatcher/run-book-template/blob/master/run-book-template.md)
+
+### How to create an effective runbook
+
+There are five attributes of any good runbook; the five As.[^3] It must be:
+
+- **actionable**. It’s nice to know the big picture and architecture of a system, but when you are looking for a runbook, you’re looking to take action based on a particular situation.
+- **accessible**. If you can’t find the runbook, it doesn’t matter how well it is written.
+- **accurate**. If it doesn’t contain truthful information, it’s worse than nothing at all.
+- **authoritative**. It is confusing to have more than one runbook for any given process.
+- **adaptable**. Systems evolve, and if you can’t change your runbook, the drift will make it unusable.
+
+## Glossaries
+
+Glossaries[^2] can be helpful for a few reasons:
+
+- Glossaries help you repeat yourself less. When you can refer to a definition with a linked explanation, you just saved yourself time and words.
+- Glossaries make descriptions more consistent. If something is explained in five different ways, it can get confusing.
+- Glossaries allow a runbook to be more easily used by engineers with different levels of experience. By referencing a glossary in your runbook, you allow someone newer to the on-call rotation to get the explanation of concepts or terms they need. For more experienced on-call engineers, you remove extraneous information from the runbook.
+
+## Explanation
+
+- [RFC](https://en.wikipedia.org/wiki/Request_for_Comments)
+- [RFD](https://rfd.shared.oxide.computer/rfd/0001)
+- [ADR](https://adr.github.io/)
+- paper, wiki, anything that helps why things are how they and how works in detail.
+
+
+[^1]: https://sre.google/workbook/on-call/
+[^2]: https://www.transposit.com/devops-blog/sre/2020.01.30-writing-runbook-documentation-when-youre-an-sre/
+[^3]: https://www.transposit.com/devops-blog/itsm/what-makes-a-good-runbook/

From 08045266ffdfa5e60a3789fa23b1dd2f4cccb74d Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 18:34:47 +0200
Subject: [PATCH 02/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 10 +++++-----
 1 file changed, 5 insertions(+), 5 deletions(-)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index b0a9fba..822b754 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -64,11 +64,11 @@ Just like new code, new alerts should be thoroughly and thoughtfully reviewed. E
 
 There are five attributes of any good runbook; the five As.[^3] It must be:
 
-- **actionable**. It’s nice to know the big picture and architecture of a system, but when you are looking for a runbook, you’re looking to take action based on a particular situation.
-- **accessible**. If you can’t find the runbook, it doesn’t matter how well it is written.
-- **accurate**. If it doesn’t contain truthful information, it’s worse than nothing at all.
-- **authoritative**. It is confusing to have more than one runbook for any given process.
-- **adaptable**. Systems evolve, and if you can’t change your runbook, the drift will make it unusable.
+- **Actionable**. It’s nice to know the big picture and architecture of a system, but when you are looking for a runbook, you’re looking to take action based on a particular situation.
+- **Accessible**. If you can’t find the runbook, it doesn’t matter how well it is written.
+- **Accurate**. If it doesn’t contain truthful information, it’s worse than nothing at all.
+- **Authoritative**. It is confusing to have more than one runbook for any given process.
+- **Adaptable**. Systems evolve, and if you can’t change your runbook, the drift will make it unusable.
 
 ## Glossaries
 

From 6e3b8ee75fe94a245e879fe57112962eb8079f56 Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 18:36:25 +0200
Subject: [PATCH 03/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index 822b754..6066321 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -74,9 +74,9 @@ There are five attributes of any good runbook; the five As.[^3] It must be:
 
 Glossaries[^2] can be helpful for a few reasons:
 
-- Glossaries help you repeat yourself less. When you can refer to a definition with a linked explanation, you just saved yourself time and words.
-- Glossaries make descriptions more consistent. If something is explained in five different ways, it can get confusing.
-- Glossaries allow a runbook to be more easily used by engineers with different levels of experience. By referencing a glossary in your runbook, you allow someone newer to the on-call rotation to get the explanation of concepts or terms they need. For more experienced on-call engineers, you remove extraneous information from the runbook.
+- Glossaries help you avoid repetition. When you can refer to a definition with a linked explanation, you save time and words.
+- Glossaries ensure consistency in descriptions. If something is explained in multiple ways, it can become confusing.
+- Glossaries enhance the usability of a runbook for engineers at all experience levels. By including a glossary in your runbook, you provide essential explanations for newer engineers and streamline information for more experienced ones.
 
 ## Explanation
 

From 6e931401210bce10882d37fd5e6db7e36c62a38e Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 18:36:43 +0200
Subject: [PATCH 04/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index 6066321..64b6db2 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -30,7 +30,7 @@ What you need for On-Call - tutorial that help new on-call member onboard into p
 
 ## How-to Guides
 
-Standard Operation Procedures (SOP) for deploying, rotating secrets, adding new regions, scale down and up, adding capacity etc. Runbooks for Incident - every actionable alert need runbook and we should write it and test it on regular basic that works and is easy to understand.
+Standard Operation Procedures (SOP) for deploying, rotating secrets, adding new regions, scale down and up, adding capacity etc. Runbooks for Incidents - every actionable alert needs a runbook, and we should write and test it regularly to ensure it works and is easy to understand.
 
 You can start with checklist as mention in SRE Workbook[^1]
 

From c69a052a895a2374cb55984c0c971648b6de6eba Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 18:37:00 +0200
Subject: [PATCH 05/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index 64b6db2..b549743 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -24,7 +24,7 @@ For writing documentation is good follow framework as [Diátaxis](https://diatax
 
 ## Tutorials
 
-What you need for On-Call - tutorial that help new on-call member onboard into process, tools and get all access need it for on-call shift.
+What You Need for On-Call - A tutorial that helps new on-call members onboard into the process, tools, and gain all necessary access for the on-call shift.
 
 - [PagerDuty Incident Response](https://response.pagerduty.com/)
 
 ## How-to Guides

From 01c57f8dd23dcfe478ffed691d4cad89757db302 Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 18:37:19 +0200
Subject: [PATCH 06/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index b549743..ff40f87 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -13,7 +13,7 @@ menu:
 weight: 15
 ---
 
-For writing documentation is good follow framework as [Diátaxis](https://diataxis.fr/). Diátaxis is a way of thinking about and doing documentation. We will apply similar approach and split documentation into four parts.
+Writing documentation effectively can be achieved by following a framework such as [Diátaxis](https://diataxis.fr/). Diátaxis is a way of thinking about and doing documentation. We will apply similar approach and split documentation into four parts.
 
 - [Tutorials](#tutorials)
 - [How-to Guides](#how-to-guides)

From bbdaf05f22d5c3655a607c7353fe66b52e6d33d6 Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 18:37:54 +0200
Subject: [PATCH 07/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index ff40f87..b00ba76 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -32,7 +32,7 @@ What You Need for On-Call - A tutorial that helps new on-call members onboard in
 
 Standard Operation Procedures (SOP) for deploying, rotating secrets, adding new regions, scale down and up, adding capacity etc. Runbooks for Incidents - every actionable alert needs a runbook, and we should write and test it regularly to ensure it works and is easy to understand.
 
-You can start with checklist as mention in SRE Workbook[^1]
+You can start with a checklist as mentioned in the SRE Workbook[^1].
 
 - Administering production jobs
 - Understanding debugging info

From 3ce2febce9a0286e14809c9406ef1706cd8d598c Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 20:14:38 +0200
Subject: [PATCH 08/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 6 +++---
 1 file changed, 3 insertions(+), 3 deletions(-)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index b00ba76..353c3ee 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -86,6 +86,6 @@ Glossaries[^2] can be helpful for a few reasons:
 - paper, wiki, anything that helps why things are how they and how works in detail.
 
 
-[^1]: https://sre.google/workbook/on-call/
-[^2]: https://www.transposit.com/devops-blog/sre/2020.01.30-writing-runbook-documentation-when-youre-an-sre/
-[^3]: https://www.transposit.com/devops-blog/itsm/what-makes-a-good-runbook/
+[^1]: [Google SRE Workbook](https://sre.google/workbook/on-call/)
+[^2]: [Transposit SRE Blog on Writing Runbook Documentation](https://www.transposit.com/devops-blog/sre/2020.01.30-writing-runbook-documentation-when-youre-an-sre/)
+[^3]: [Transposit ITSM Blog on Good Runbooks](https://www.transposit.com/devops-blog/itsm/what-makes-a-good-runbook/)

From 21c55bd9b7b9b9bf426cd35fe8c13e5bf6bb6731 Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 20:14:48 +0200
Subject: [PATCH 09/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 1 -
 1 file changed, 1 deletion(-)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index 353c3ee..1482d4e 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -85,7 +85,6 @@ Glossaries[^2] can be helpful for a few reasons:
 - [ADR](https://adr.github.io/)
 - paper, wiki, anything that helps why things are how they and how works in detail.
 
-
 [^1]: [Google SRE Workbook](https://sre.google/workbook/on-call/)
 [^2]: [Transposit SRE Blog on Writing Runbook Documentation](https://www.transposit.com/devops-blog/sre/2020.01.30-writing-runbook-documentation-when-youre-an-sre/)
 [^3]: [Transposit ITSM Blog on Good Runbooks](https://www.transposit.com/devops-blog/itsm/what-makes-a-good-runbook/)

From b6b8780b8664ffd3c333410d728970079622b87a Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 20:15:35 +0200
Subject: [PATCH 10/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index 1482d4e..7ac87d2 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -30,7 +30,7 @@ What You Need for On-Call - A tutorial that helps new on-call members onboard in
 
 ## How-to Guides
 
-Standard Operation Procedures (SOP) for deploying, rotating secrets, adding new regions, scale down and up, adding capacity etc. Runbooks for Incidents - every actionable alert needs a runbook, and we should write and test it regularly to ensure it works and is easy to understand.
+Standard Operation Procedures (SOP) for deploying, rotating secrets, adding new regions, scale down and up, adding capacity etc. Runbooks for Incidents - every actionable alert needs a runbook, and we should write and test it regularly to ensure it works and is straightforward to understand.
 
 You can start with a checklist as mentioned in the SRE Workbook[^1].
 

From ea9eac6326079162992373b1d802b0f308d7aa1b Mon Sep 17 00:00:00 2001
From: Ladislav Prskavec <100356+abtris@users.noreply.github.com>
Date: Sun, 21 Apr 2024 20:15:51 +0200
Subject: [PATCH 11/11] Update content/courses/how-to-make-oncall/chapter10a.md

Co-authored-by: coderabbitai[bot] <136622811+coderabbitai[bot]@users.noreply.github.com>
---
 content/courses/how-to-make-oncall/chapter10a.md | 1 +
 1 file changed, 1 insertion(+)

diff --git a/content/courses/how-to-make-oncall/chapter10a.md b/content/courses/how-to-make-oncall/chapter10a.md
index 7ac87d2..5b43f3e 100644
--- a/content/courses/how-to-make-oncall/chapter10a.md
+++ b/content/courses/how-to-make-oncall/chapter10a.md
@@ -43,6 +43,7 @@ You can start with a checklist as mentioned in the SRE Workbook[^1].
 - Using the monitoring systems (for alerting and dashboards)
 - Describing the architecture, various components, and dependencies of the services
 
+
 Before going on-call, the team reviewed precise guidelines about the responsibilities of on-call engineers.
 
 For example: