363 refresh dashboard design principles (#364)
* feat: add new word to dictionary

* refactor: dashboard design principles

* fix: add words to dictionary

* fix: add words to dictionary

* fix: reinstate responding to alerts pages

* fix: add words to dictionary

* fix: table issues
michaelpearsonHO authored Nov 2, 2022
1 parent 59fcae3 commit c012586
Showing 4 changed files with 65 additions and 27 deletions.
Binary file added docs/source/images/dashboard-hierarchy.png
57 changes: 37 additions & 20 deletions docs/source/monitor-your-service/index.html.md.erb
@@ -7,32 +7,49 @@ weight: 40

Following the design principle of "Hierarchical dashboards with drill-downs to the next level" <sup>1</sup>, we have developed a five-tier dashboard structure to fulfil different persona needs as follows: -

![Dashboard Hierarchy](../../images/dashboard-hierarchy.png)

| Dashboard | Description | Persona / User | Dashboard Title |
| ---------------------| ---------------------------------------------------------------------------------------------------------------------|--------------------------|----------------------------------------------------------|
| Overview | Observability of all products and tenants running on a platform. | Service Manager | SRE MaC / Overview |
| Product View | Observability of all the user journeys running on an individual product. | Product Manager and Team | SRE MaC / {Product Name} |
| User Journey View | Observability of all the SLIs in a single user journey. | Product Manager and Team | SRE MaC / {Product Name} / {User Journey Name} |
| Detail View | Observability of all whitebox and blackbox metrics which contribute to SLIs and Service Health. For troubleshooting. | Engineers | SRE MaC / {Product Name} / {User Journey Name} / Detail |

These hierarchical dashboards support a generic troubleshooting workflow: -
| User Journey View | Observability of all the SLIs in a single user journey. | Engineers / Analyst | SRE MaC / {Product Name} / {User Journey Name} |
| Detail View | Observability of all whitebox and blackbox metrics which contribute to SLIs and Service Health. For troubleshooting. | Engineers / Analyst | SRE MaC / {Product Name} / {User Journey Name} / Detail |
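
The dashboard titles in the table follow a single naming convention, so they can be generated rather than hand-typed. The snippet below is a minimal plain-Jsonnet sketch of that convention; the `titles` helper and the example product and journey names are illustrative, not part of the framework's actual API.

```jsonnet
// Hypothetical helper: derive the hierarchical dashboard titles for one
// product / user journey pair. Names and values are illustrative only.
local prefix = 'SRE MaC';

local titles(product, journey) = {
  overview: prefix + ' / Overview',
  product: prefix + ' / ' + product,
  journey: prefix + ' / ' + product + ' / ' + journey,
  detail: prefix + ' / ' + product + ' / ' + journey + ' / Detail',
};

// Example output for a single user journey (run with `jsonnet <file>`).
titles('GRAPI', 'Create Email Address Write Back')
```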

## Dashboard Design Principles

* Methodical dashboards according to an SLI/SLO strategy.
* Hierarchical dashboards with drill-downs to the next level.
* Actively reduce sprawl.
* Regularly review existing dashboards to make sure they are still relevant.
* Only approved dashboards added to master dashboard list.
* Tracking dashboard use.
* Scripting libraries to generate dashboards, ensuring consistency in pattern and style, e.g. Grafonnet (Jsonnet).
* No editing in the browser. Dashboard viewers change views with variables.
* Browsing for dashboards is the exception, not the rule.
* Perform experimentation and testing on a feature branch (consider nonprod environment to be production).
* Expressive charts with meaningful use of colour and normalising axes where you can.
* Example of meaningful colour: Blue means it’s good, red means it’s bad. Thresholds can help with that.
* Example of normalising axes: When comparing CPU usage, measure by percentage rather than raw number, because machines can have a different number of cores. Normalising CPU usage by the number of cores reduces cognitive load because the viewer can trust that at 100% all cores are being used, without having to know the number of CPUs.



### 1.0 Methodology

| ID | Principles |
|-----|------------------------------------------------------------------------------|
| 1.1 | Methodical dashboards according to DDaT SLI/SLO standards. |
|     | - Dashboards focused on symptoms rather than causes. |
|     | - The ability to visualise adherence to SLOs in a dashboard. |
|     | - The ability to visualise Error Budget in a dashboard. |
|     | - The ability to visualise Burn Rate in a dashboard (the underlying arithmetic is sketched after this table). |
| 1.2 | Align SLI/SLO dashboards to standard Google SLI Categories. |
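
Principle 1.1 calls for Error Budget and Burn Rate to be visualised. As a reminder of the arithmetic behind those panels, here is a minimal plain-Jsonnet sketch with made-up numbers: the error budget is the fraction of requests the SLO allows to fail, and the burn rate is how many times faster than that allowance errors are currently arriving.

```jsonnet
// Illustrative arithmetic only, not framework code. All numbers are made up.
local sloTarget = 0.999;                     // 99.9% availability SLO
local errorBudget = 1 - sloTarget;           // fraction of requests allowed to fail

// Observed over some window, e.g. the last hour.
local totalRequests = 100000;
local failedRequests = 250;
local observedErrorRate = failedRequests / totalRequests;

{
  errorBudget: errorBudget,                  // 0.001
  observedErrorRate: observedErrorRate,      // 0.0025
  // A burn rate above 1 means the budget is being consumed faster than allowed.
  burnRate: observedErrorRate / errorBudget, // 2.5
}
```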

### 2.0 Automation

| ID | Principles |
|-----|------------------------------------------------------------------------------|
| 2.1 | Scripting libraries to generate dashboards, ensuring consistency in pattern and style (a minimal sketch follows this table). |
|     | - No editing in the browser. Dashboard viewers change views with variables. |
| 2.2 | Version-controlled dashboards, iterated in line with code management best practices. |
| 2.3 | Reuse dashboards and enforce consistency by using templates and variables. |
| 2.4 | Dashboards should be linked to by alerts. |
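
To illustrate 2.1–2.3, a dashboard can be expressed entirely as code, with a template variable so viewers change context without editing anything in the browser. This is a minimal plain-Jsonnet sketch that emits approximate Grafana dashboard JSON; it is not the framework's actual generator, and the field names are close to, but not guaranteed to match, the full schema.

```jsonnet
// Minimal sketch of a dashboard as code (approximate Grafana dashboard JSON).
local dashboard(product, journey) = {
  title: 'SRE MaC / ' + product + ' / ' + journey,
  uid: std.asciiLower(product) + '-' + std.asciiLower(std.strReplace(journey, ' ', '-')),
  tags: ['sre-mac', product],
  templating: {
    list: [{
      name: 'environment',   // viewers switch views with this variable
      type: 'custom',
      query: 'dev,test,prod',
    }],
  },
  panels: [],                // panels would be appended by the generator
};

dashboard('GRAPI', 'Create Email Address Write Back')
```

Because the whole definition lives in version control, changes are reviewed and promoted like any other code change (2.2), and the same function can be reused for every product and journey (2.3).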

### 3.0 Visualisation

| ID | Principles |
|-----|-----------------------------------------------------------------------------|
| 3.1 | Keep graphs simple and focused on answering one question. |
| 3.2 | Dashboards should reduce cognitive load and be quick to figure out. |
| 3.3 | Expressive charts with meaningful use of colour and normalising axes where you can (a panel sketch follows this table). |
|     | - Example of meaningful colour: Green/Blue means it's good, red means it's bad. |
|     | - Example of normalising axes: When comparing CPU usage, measure by percentage rather than raw number. |
| 3.4 | Use a meaningful name. |
| 3.5 | Browsing should be directed with links. |
| 3.6 | Add documentation to dashboards and panels. |
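
As a concrete illustration of 3.3, the panel sketch below (plain Jsonnet emitting approximate Grafana panel JSON) plots CPU usage as a percentage of all available cores, so machines with different core counts can be compared, and uses green/red thresholds to make "good" and "bad" obvious. The PromQL expression assumes standard node_exporter metric names and is illustrative only.

```jsonnet
// Illustrative panel: colour thresholds plus a CPU axis normalised to the
// fraction of all cores in use (assumes node_exporter metric names).
{
  title: 'CPU usage (% of all cores)',
  type: 'timeseries',
  targets: [{
    // Busy CPU-seconds per second divided by the number of cores per instance.
    expr: 'sum by (instance) (rate(node_cpu_seconds_total{mode!="idle"}[5m])) / count by (instance) (node_cpu_seconds_total{mode="idle"})',
    legendFormat: '{{instance}}',
  }],
  fieldConfig: {
    defaults: {
      unit: 'percentunit',   // 0-1 rendered as 0-100%
      max: 1,
      thresholds: {
        mode: 'absolute',
        steps: [
          { color: 'green', value: null },  // good
          { color: 'red', value: 0.9 },     // bad: sustained > 90%
        ],
      },
    },
  },
}
```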


@@ -5,17 +5,17 @@ weight: 41

# Responding to alerts

Debugging a digital service is never as clean as the idealized model presented below. However, using a combination of alerts, system metrics in dashboards and logs can help understand the current operating state. Google SRE has presented some useful steps below that can make the process less painful and more productive. We have overlayed how our Monitoring-as-Code framework can support those users responding to system problems.
Debugging a digital service is never as clean as the idealised model presented below. However, using a combination of alerts, system metrics in dashboards and logs can help understand the current operating state. Google SRE has presented some useful steps below that can make the process less painful and more productive. We have overlaid how our Monitoring-as-Code framework can support those users responding to system problems.

![responding to alerts](../../images/responding-to-alerts.png)

Diagram courtesy of [Effective Troublshooting - Google SRE Workbook](https://sre.google/sre-book/effective-troubleshooting/)
Diagram courtesy of [Effective Troubleshooting - Google SRE Workbook](https://sre.google/sre-book/effective-troubleshooting/)

## Problem report

Every digital service issue starts with a problem report, which might be an automated alert or one of our users saying, “The system is slow.” Monitoring-as-Code will deliver an automated alert if an SLO is in danger of being breached. How quickly the Error Budget is forecast to breach determines what level of alert is triggered. Below we have detailed the typical makeup of an alert.

ALERT: grapi - severity: 2 - Availability - FLAPI Create Email Address Write Back API
ALERT: grapi - severity: 2 - Availability - GRAPI Create Email Address Write Back API
Alert from HO-Monitoring <dashboard-link> <silence-link> <runbook-link>
• alertname: grapi_writeback_SLI05_ErrorBudgetBurn
• assignment_group: Great Respect API
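
The labels and links in an alert like the one above originate in the alerting rule and its annotations. The sketch below is plain Jsonnet shaped like a Prometheus alerting rule; the expression, the recording-rule name, the label values and the URLs are placeholders, not the rules this framework actually generates.

```jsonnet
// Illustrative shape of a rule behind an alert like the example above.
// The expression, label values and URLs are placeholders only.
{
  alert: 'grapi_writeback_SLI05_ErrorBudgetBurn',
  expr: 'slo:error_budget_burn_rate:1h{journey="writeback"} > 2',  // hypothetical recording rule
  'for': '5m',
  labels: {
    severity: '2',
    assignment_group: 'Great Respect API',
  },
  annotations: {
    summary: 'Availability - GRAPI Create Email Address Write Back API',
    dashboard: 'https://grafana.example/d/grapi-writeback-detail',  // dashboards linked to by alerts
    runbook: 'https://runbooks.example/grapi/SLI05',
  },
}
```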
@@ -46,7 +46,7 @@ On receipt of an alert you should: -

## Triage

On receipt of an alert and in the absence of auto-ticketing. You should create an ServiceNow incident populating the fields using the alert labels provided for **Primary Service Impacted**, **Configuration Item**, **Assignment Group** and **Severity**.
On receipt of an alert, and in the absence of auto-ticketing, you should create a ServiceNow incident, populating the fields using the alert labels provided for **Primary Service Impacted**, **Configuration Item**, **Assignment Group** and **Severity**.
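
A minimal sketch of the mapping just described: alert labels carried into the ServiceNow incident fields. Label and field names here are hypothetical placeholders; the real names depend on the alert payload and the ServiceNow configuration, and this is a data mapping only, not a ServiceNow API call.

```jsonnet
// Hypothetical mapping from alert labels to the incident fields named above.
local incidentFromAlert(labels) = {
  primary_service_impacted: labels.service,
  configuration_item: labels.configuration_item,
  assignment_group: labels.assignment_group,
  severity: labels.severity,
};

incidentFromAlert({
  service: 'GRAPI',
  configuration_item: 'grapi-writeback',
  assignment_group: 'Great Respect API',
  severity: '2',
})
```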

## Examine

@@ -58,7 +58,7 @@ Consult the product-view dashboard (pictured below) to confirm the SLI which is

### journey-view

Use the journey-view dashboard (pictured below) to discover remaining error budget, the error rate spikes that have caused the forecasted breack of SLO before drilling down to the detail view.
Use the journey-view dashboard (pictured below) to discover the remaining error budget and the error rate spikes that have caused the forecasted breach of SLO, before drilling down to the detail view.

![Journey view dashboard](../../images/journey-view.png)

@@ -76,4 +76,4 @@ Use a combination of logs (pictured below) and monitors to find out what it’s

## Test / Treat

## Cure
23 changes: 22 additions & 1 deletion monitoring-as-code/tools/spell-checker/dictionary.txt
@@ -1,9 +1,13 @@
alertmanager
Alertmanager
alertname
alertPayloadConfig
blackbox
ci_type
cloudwatch
Cloudwatch
cmdb_ci_service_auto
CMDB_CI_Service_Auto
codebase
config
contributing.md
@@ -19,19 +19,26 @@ dashboardSelectors
datasources
detailDashboardConfig
detailDashboardElements
DDaT
Dockerfile
eco-system
errorStatus
etcetera
evalInterval
Fong-Jones
forecasted
formatter
GDS
.githooks
grafana
Grafana
Grafonnet
grafonnet
grapi
grapi_writeback_SLI05_ErrorBudgetBurn
GRAPI
http
http-errors
http_server_requests_seconds
http_server_requests_seconds_bucket
http_server_requests_seconds_count
@@ -78,16 +89,22 @@ Readme
README.md
repo
ruleSelectors
runbook-link
runbooks
Runbooks
S3
scribing
selectorLabels
servicenow
ServiceNow
SLI
SLIs
sliSpec
sliTypesConfig
sli-value-libraries
sli05
SLI05
slo
SLO
SLOs
src
@@ -98,6 +115,8 @@ sre-demo-java-app
sre-demo-node-app
sre-monitoring-as-code
standardTemplates
svc
Svc
targetMetrics
TBC
templating
@@ -113,5 +132,7 @@ UKHomeOffice
URI
url
whitebox
writeback
yace
Yet-Another-Cloudwatch-Exporter
2m
