363 refresh dashboard design principles (#364)
* feat: add new word to dictionary

* refactor: dashboard design principles

* fix: add words to dictionary

* fix: add words to dictionary

* fix: reinstate responding to alerts pages

* fix: add words to dictionary

* fix: table issues
michaelpearsonHO authored Nov 2, 2022
1 parent 59fcae3 commit c012586
Showing 4 changed files with 65 additions and 27 deletions.
Binary file added docs/source/images/dashboard-hierarchy.png
57 changes: 37 additions & 20 deletions docs/source/monitor-your-service/index.html.md.erb
@@ -7,32 +7,49 @@ weight: 40

Following the design principle of "Hierarchical dashboards with drill-downs to the next level" <sup>1</sup>, we have developed a five-tier dashboard structure to fulfil different persona needs as follows: -

![Dashboard Hierarchy](../../images/dashboard-hierarchy.png)

| Dashboard | Description | Persona / User | Dashboard Title |
| ---------------------| ---------------------------------------------------------------------------------------------------------------------|--------------------------|----------------------------------------------------------|
| Overview | Observability of all products and tenants running on a platform. | Service Manager | SRE MaC / Overview |
| Product View | Observability of all the user journeys running on an individual product. | Product Manager and Team | SRE MaC / {Product Name} |
| User Journey View | Observability of all the SLIs in a single user journey. | Product Manager and Team | SRE MaC / {Product Name} / {User Journey Name} |
| Detail View | Observability of all whitebox and blackbox metrics which contribute to SLIs and Service Health. For troubleshooting. | Engineers | SRE MaC / {Product Name} / {User Journey Name} / Detail |

These hierarchical dashboards support a generic troubleshooting workflow: -
| User Journey View | Observability of all the SLIs in a single user journey. | Engineers / Analyst | SRE MaC / {Product Name} / {User Journey Name} |
| Detail View | Observability of all whitebox and blackbox metrics which contribute to SLIs and Service Health. For troubleshooting. | Engineers / Analyst | SRE MaC / {Product Name} / {User Journey Name} / Detail |
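
The dashboard titles in the table follow a single naming convention, so they can be generated rather than hand-typed. The snippet below is a minimal plain-Jsonnet sketch of that convention; the `titles` helper and the example product and journey names are illustrative, not part of the framework's actual API.

```jsonnet
// Hypothetical helper: derive the hierarchical dashboard titles for one
// product / user journey pair. Names and values are illustrative only.
local prefix = 'SRE MaC';

local titles(product, journey) = {
  overview: prefix + ' / Overview',
  product: prefix + ' / ' + product,
  journey: prefix + ' / ' + product + ' / ' + journey,
  detail: prefix + ' / ' + product + ' / ' + journey + ' / Detail',
};

// Example output for a single user journey (run with `jsonnet <file>`).
titles('GRAPI', 'Create Email Address Write Back')
```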

## Dashboard Design Principles

* Methodical dashboards according to an SLI/SLO strategy.
* Hierarchical dashboards with drill-downs to the next level.
* Actively reduce sprawl.
* Regularly review existing dashboards to make sure they are still relevant.
* Only approved dashboards added to master dashboard list.
* Tracking dashboard use.
* Scripting libraries to generate dashboards, ensuring consistency in pattern and style, e.g. Grafonnet (Jsonnet).
* No editing in the browser. Dashboard viewers change views with variables.
* Browsing for dashboards is the exception, not the rule.
* Perform experimentation and testing on a feature branch (consider nonprod environment to be production).
* Expressive charts with meaningful use of colour and normalising axes where you can.
* Example of meaningful colour: Blue means it’s good, red means it’s bad. Thresholds can help with that.
* Example of normalising axes: When comparing CPU usage, measure by percentage rather than raw number, because machines can have a different number of cores. Normalising CPU usage by the number of cores reduces cognitive load because the viewer can trust that at 100% all cores are being used, without having to know the number of CPUs.



### 1.0 Methodology

| ID | Principles |
|-----|------------------------------------------------------------------------------|
| 1.1 | Methodical dashboards according to DDaT SLI/SLO standards. |
|     | - Dashboards focused on symptoms rather than causes. |
|     | - The ability to visualise adherence to SLOs in a dashboard. |
|     | - The ability to visualise Error Budget in a dashboard. |
|     | - The ability to visualise Burn Rate in a dashboard (the underlying arithmetic is sketched after this table). |
| 1.2 | Align SLI/SLO dashboards to standard Google SLI Categories. |
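
Principle 1.1 calls for Error Budget and Burn Rate to be visualised. As a reminder of the arithmetic behind those panels, here is a minimal plain-Jsonnet sketch with made-up numbers: the error budget is the fraction of requests the SLO allows to fail, and the burn rate is how many times faster than that allowance errors are currently arriving.

```jsonnet
// Illustrative arithmetic only, not framework code. All numbers are made up.
local sloTarget = 0.999;                     // 99.9% availability SLO
local errorBudget = 1 - sloTarget;           // fraction of requests allowed to fail

// Observed over some window, e.g. the last hour.
local totalRequests = 100000;
local failedRequests = 250;
local observedErrorRate = failedRequests / totalRequests;

{
  errorBudget: errorBudget,                  // 0.001
  observedErrorRate: observedErrorRate,      // 0.0025
  // A burn rate above 1 means the budget is being consumed faster than allowed.
  burnRate: observedErrorRate / errorBudget, // 2.5
}
```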

### 2.0 Automation

| ID | Principles |
|-----|------------------------------------------------------------------------------|
| 2.1 | Scripting libraries to generate dashboards, ensuring consistency in pattern and style (a minimal sketch follows this table). |
|     | - No editing in the browser. Dashboard viewers change views with variables. |
| 2.2 | Version-controlled dashboards, iterated in line with code management best practices. |
| 2.3 | Reuse dashboards and enforce consistency by using templates and variables. |
| 2.4 | Dashboards should be linked to by alerts. |
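
To illustrate 2.1–2.3, a dashboard can be expressed entirely as code, with a template variable so viewers change context without editing anything in the browser. This is a minimal plain-Jsonnet sketch that emits approximate Grafana dashboard JSON; it is not the framework's actual generator, and the field names are close to, but not guaranteed to match, the full schema.

```jsonnet
// Minimal sketch of a dashboard as code (approximate Grafana dashboard JSON).
local dashboard(product, journey) = {
  title: 'SRE MaC / ' + product + ' / ' + journey,
  uid: std.asciiLower(product) + '-' + std.asciiLower(std.strReplace(journey, ' ', '-')),
  tags: ['sre-mac', product],
  templating: {
    list: [{
      name: 'environment',   // viewers switch views with this variable
      type: 'custom',
      query: 'dev,test,prod',
    }],
  },
  panels: [],                // panels would be appended by the generator
};

dashboard('GRAPI', 'Create Email Address Write Back')
```

Because the whole definition lives in version control, changes are reviewed and promoted like any other code change (2.2), and the same function can be reused for every product and journey (2.3).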

### 3.0 Visualisation

| ID | Principles |
|-----|-----------------------------------------------------------------------------|
| 3.1 | Keep graphs simple and focused on answering one question. |
| 3.2 | Dashboards should reduce cognitive load and be quick to figure out. |
| 3.3 | Expressive charts with meaningful use of colour and normalising axes where you can (a panel sketch follows this table). |
|     | - Example of meaningful colour: Green/Blue means it's good, red means it's bad. |
|     | - Example of normalising axes: When comparing CPU usage, measure by percentage rather than raw number. |
| 3.4 | Use a meaningful name. |
| 3.5 | Browsing should be directed with links. |
| 3.6 | Add documentation to dashboards and panels. |
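
As a concrete illustration of 3.3, the panel sketch below (plain Jsonnet emitting approximate Grafana panel JSON) plots CPU usage as a percentage of all available cores, so machines with different core counts can be compared, and uses green/red thresholds to make "good" and "bad" obvious. The PromQL expression assumes standard node_exporter metric names and is illustrative only.

```jsonnet
// Illustrative panel: colour thresholds plus a CPU axis normalised to the
// fraction of all cores in use (assumes node_exporter metric names).
{
  title: 'CPU usage (% of all cores)',
  type: 'timeseries',
  targets: [{
    // Busy CPU-seconds per second divided by the number of cores per instance.
    expr: 'sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) / count by (instance) (node_cpu_seconds_total{mode="idle"})',
    legendFormat: '{{instance}}',
  }],
  fieldConfig: {
    defaults: {
      unit: 'percentunit',   // 0-1 rendered as 0-100%
      max: 1,
      thresholds: {
        mode: 'absolute',
        steps: [
          { color: 'green', value: null },  // good
          { color: 'red', value: 0.9 },     // bad: sustained > 90%
        ],
      },
    },
  },
}
```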


@@ -5,17 +5,17 @@ weight: 41

# Responding to alerts

Debugging a digital service is never as clean as the idealized model presented below. However, using a combination of alerts, system metrics in dashboards and logs can help understand the current operating state. Google SRE has presented some useful steps below that can make the process less painful and more productive. We have overlayed how our Monitoring-as-Code framework can support those users responding to system problems.
Debugging a digital service is never as clean as the idealised model presented below. However, using a combination of alerts, system metrics in dashboards and logs can help understand the current operating state. Google SRE has presented some useful steps below that can make the process less painful and more productive. We have overlaid how our Monitoring-as-Code framework can support those users responding to system problems.

![responding to alerts](../../images/responding-to-alerts.png)

Diagram courtesy of [Effective Troublshooting - Google SRE Workbook](https://sre.google/sre-book/effective-troubleshooting/)
Diagram courtesy of [Effective Troubleshooting - Google SRE Workbook](https://sre.google/sre-book/effective-troubleshooting/)

## Problem report

Every digital service issue starts with a problem report, which might be an automated alert or one of our users saying, “The system is slow.” Monitoring-as-Code will deliver an automated alert if an SLO is in danger of being breached. How quickly the Error Budget is forecast to breach determines what level of alert is triggered. Below we have detailed the typical makeup of an alert.

ALERT: grapi - severity: 2 - Availability - FLAPI Create Email Address Write Back API
ALERT: grapi - severity: 2 - Availability - GRAPI Create Email Address Write Back API
Alert from HO-Monitoring <dashboard-link> <silence-link> <runbook-link>
• alertname: grapi_writeback_SLI05_ErrorBudgetBurn
• assignment_group: Great Respect API
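
The labels and links in an alert like the one above originate in the alerting rule and its annotations. The sketch below is plain Jsonnet shaped like a Prometheus alerting rule; the expression, the recording-rule name, the label values and the URLs are placeholders, not the rules this framework actually generates.

```jsonnet
// Illustrative shape of a rule behind an alert like the example above.
// The expression, label values and URLs are placeholders only.
{
  alert: 'grapi_writeback_SLI05_ErrorBudgetBurn',
  expr: 'slo:error_budget_burn_rate:1h{journey="writeback"} > 2',  // hypothetical recording rule
  'for': '5m',
  labels: {
    severity: '2',
    assignment_group: 'Great Respect API',
  },
  annotations: {
    summary: 'Availability - GRAPI Create Email Address Write Back API',
    dashboard: 'https://grafana.example/d/grapi-writeback-detail',  // dashboards linked to by alerts
    runbook: 'https://runbooks.example/grapi/SLI05',
  },
}
```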
@@ -46,7 +46,7 @@ On receipt of an alert you should: -

## Triage

On receipt of an alert and in the absence of auto-ticketing. You should create an ServiceNow incident populating the fields using the alert labels provided for **Primary Service Impacted**, **Configuration Item**, **Assignment Group** and **Severity**.
On receipt of an alert, and in the absence of auto-ticketing, you should create a ServiceNow incident, populating the fields using the alert labels provided for **Primary Service Impacted**, **Configuration Item**, **Assignment Group** and **Severity**.
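
A minimal sketch of the mapping just described: alert labels carried into the ServiceNow incident fields. Label and field names here are hypothetical placeholders; the real names depend on the alert payload and the ServiceNow configuration, and this is a data mapping only, not a ServiceNow API call.

```jsonnet
// Hypothetical mapping from alert labels to the incident fields named above.
local incidentFromAlert(labels) = {
  primary_service_impacted: labels.service,
  configuration_item: labels.configuration_item,
  assignment_group: labels.assignment_group,
  severity: labels.severity,
};

incidentFromAlert({
  service: 'GRAPI',
  configuration_item: 'grapi-writeback',
  assignment_group: 'Great Respect API',
  severity: '2',
})
```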

## Examine

@@ -58,7 +58,7 @@ Consult the product-view dashboard (pictured below) to confirm the SLI which is

### journey-view

Use the journey-view dashboard (pictured below) to discover remaining error budget, the error rate spikes that have caused the forecasted breack of SLO before drilling down to the detail view.
Use the journey-view dashboard (pictured below) to discover the remaining error budget and the error rate spikes that have caused the forecasted breach of SLO, before drilling down to the detail view.

![Journey view dashboard](../../images/journey-view.png)

@@ -76,4 +76,4 @@ Use a combination of logs (pictured below) and monitors to find out what it’s

## Test / Treat

## Cure
23 changes: 22 additions & 1 deletion monitoring-as-code/tools/spell-checker/dictionary.txt
@@ -1,9 +1,13 @@
alertmanager
Alertmanager
alertname
alertPayloadConfig
blackbox
ci_type
cloudwatch
Cloudwatch
cmdb_ci_service_auto
CMDB_CI_Service_Auto
codebase
config
contributing.md
@@ -19,19 +19,26 @@ dashboardSelectors
datasources
detailDashboardConfig
detailDashboardElements
DDaT
Dockerfile
eco-system
errorStatus
etcetera
evalInterval
Fong-Jones
forecasted
formatter
GDS
.githooks
grafana
Grafana
Grafonnet
grafonnet
grapi
grapi_writeback_SLI05_ErrorBudgetBurn
GRAPI
http
http-errors
http_server_requests_seconds
http_server_requests_seconds_bucket
http_server_requests_seconds_count
@@ -78,16 +89,22 @@ Readme
README.md
repo
ruleSelectors
runbook-link
runbooks
Runbooks
S3
scribing
selectorLabels
servicenow
ServiceNow
SLI
SLIs
sliSpec
sliTypesConfig
sli-value-libraries
sli05
SLI05
slo
SLO
SLOs
src
@@ -98,6 +115,8 @@ sre-demo-java-app
sre-demo-node-app
sre-monitoring-as-code
standardTemplates
svc
Svc
targetMetrics
TBC
templating
@@ -113,5 +132,7 @@ UKHomeOffice
URI
url
whitebox
writeback
yace
Yet-Another-Cloudwatch-Exporter
2m
