Best practices for monitoring
Create a Text panel at the very top of every dashboard on its own (unnamed) row. For example, see ResourceLoader. The purpose of this text panel is to:
- Define in a short statement what the subject of the dashboard is. A dashboard should tell a story or answer a question.
- Summarise in a sentence or two the flow of the data from the source to your screen.
- Pass this test: if you show the dashboard to someone else, how long will it take them to figure out what it is about?
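As an illustration, the introductory text panel could be generated like this. It is a minimal sketch: the field names follow the Grafana dashboard JSON model as we understand it and should be checked against your Grafana version, and `build_intro_panel` and the example texts are hypothetical.

```python
import json


def build_intro_panel(subject: str, data_flow: str) -> dict:
    """Return a text panel that introduces the dashboard (hypothetical helper)."""
    content = (
        f"**What this dashboard shows:** {subject}\n\n"
        f"**Data flow:** {data_flow}"
    )
    return {
        "type": "text",
        "title": "",  # left empty so the panel sits on top without a heading
        "gridPos": {"x": 0, "y": 0, "w": 24, "h": 3},  # full width, first row
        "options": {"mode": "markdown", "content": content},
    }


# Example texts are made up for illustration only.
intro_panel = build_intro_panel(
    subject="Request latency of the example service.",
    data_flow="Application -> metrics store -> Grafana.",
)
print(json.dumps(intro_panel, indent=2))
```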
Avoid putting unrelated panels on a single dashboard; create separate dashboards instead. Periodically review your dashboards and remove the ones you no longer need. Add only as many panels to a dashboard as can be viewed on a single screen without scrolling. We have decided to stick with a screen size of 1440x900.
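The single-screen rule can be checked roughly from the panel layout. The sketch below assumes each panel carries a `gridPos` with `y` and `h` in grid units and that one grid unit is about 30 pixels high; both are assumptions to verify against your Grafana version, and the check ignores the browser and Grafana header, so treat the result as an estimate. `fits_on_screen` is a hypothetical helper.

```python
def fits_on_screen(panels: list[dict], screen_height_px: int = 900,
                   px_per_unit: int = 30) -> bool:
    """Estimate whether all panels fit on one screen without scrolling."""
    if not panels:
        return True
    # The lowest edge of any panel, in grid units.
    bottom = max(p["gridPos"]["y"] + p["gridPos"]["h"] for p in panels)
    return bottom * px_per_unit <= screen_height_px


compact = [
    {"gridPos": {"y": 0, "h": 3}},
    {"gridPos": {"y": 3, "h": 8}},
    {"gridPos": {"y": 11, "h": 8}},
]
too_tall = compact + [{"gridPos": {"y": 25, "h": 8}}]

print(fits_on_screen(compact))
print(fits_on_screen(too_tall))
```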
- Preferred timezone: UTC.
- Preferred range: Last 3 hours for most dashboards.
- Auto-refresh: Provide options for 5min and 15min. If auto-refresh is on by default, use 5min as the interval. Avoid smaller intervals, as they can cause high load.
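These preferences could be expressed in the dashboard JSON roughly as follows. This is a sketch that assumes the `timezone`, `time`, `refresh`, and `timepicker.refresh_intervals` fields of the Grafana dashboard JSON model; `apply_defaults` is a hypothetical helper.

```python
DASHBOARD_DEFAULTS = {
    "timezone": "utc",                        # preferred timezone
    "time": {"from": "now-3h", "to": "now"},  # last 3 hours
    "refresh": "5m",                          # default auto-refresh interval
    "timepicker": {"refresh_intervals": ["5m", "15m"]},  # offered options
}


def apply_defaults(dashboard: dict) -> dict:
    """Return a copy of the dashboard with the preferred settings applied."""
    return {**dashboard, **DASHBOARD_DEFAULTS}


# A dashboard with a too-aggressive refresh interval gets normalised.
dashboard = apply_defaults({"title": "Example", "refresh": "10s"})
print(dashboard["refresh"], dashboard["time"]["from"])
```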
When creating a graph, keep in mind what question you want the graph to answer. If possible, try to focus on a single metric only. More metrics are usually a sign that a graph may be attempting to answer too many questions at once.
To decide which dashboard a new metric belongs on, we can use common observability strategies. They help keep dashboards uniform and make the observability platform easier to scale.
There are two methods: USE (Utilization, Saturation, Errors), for monitoring hardware resources in infrastructure, and RED (Rate, Errors, Duration), for monitoring services. Details on these methods are available here.
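To make the RED method concrete, the sketch below derives its three signals (rate, errors, duration) from a list of request samples. The `Request` type, the sample values, and the choice of the median as the duration statistic are assumptions for illustration; in practice these numbers usually come from queries against your metrics store.

```python
from dataclasses import dataclass
from statistics import median


@dataclass
class Request:
    duration_s: float  # how long the request took
    failed: bool       # whether it ended in an error


def red_summary(requests: list[Request], window_s: float) -> dict:
    """Compute rate, error ratio, and median duration over a time window."""
    total = len(requests)
    errors = sum(1 for r in requests if r.failed)
    return {
        "rate_per_s": total / window_s,
        "error_ratio": errors / total if total else 0.0,
        "median_duration_s": (
            median(r.duration_s for r in requests) if total else 0.0
        ),
    }


samples = [
    Request(0.1, False),
    Request(0.2, False),
    Request(0.3, True),
    Request(0.4, False),
]
summary = red_summary(samples, window_s=60)
print(summary)
```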
To further simplify the observability platform, we can introduce hierarchies in which related panels are linked together using the Panel options -> Panel links feature.
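In the dashboard JSON, a panel link might look like the sketch below. It assumes panel links are stored in a `links` list with `title`, `url`, and `targetBlank` keys, which should be verified against your Grafana version; `add_panel_link` and the URL are hypothetical.

```python
def add_panel_link(panel: dict, title: str, url: str) -> dict:
    """Return a copy of the panel with one more link to a related dashboard."""
    link = {"title": title, "url": url, "targetBlank": True}
    return {**panel, "links": [*panel.get("links", []), link]}


overview_panel = {"type": "timeseries", "title": "Requests per second"}
linked_panel = add_panel_link(
    overview_panel,
    title="Service details",
    url="/d/example-uid/service-details",  # hypothetical dashboard URL
)
print(linked_panel["links"])
```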
- Do all graphs have a left Y axis with a useful and correct unit?
- Is it obvious and easy to understand what a graph represents exactly?
- Do all the graphs have a meaningful description, title, and name?
- Do the alert messages make sense? Do they have the correct corresponding channel?
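Parts of this checklist can be automated. The sketch below flags panels without a title, description, or unit. It assumes the unit lives under `fieldConfig.defaults.unit` in the panel JSON, and `lint_panels` is a hypothetical helper. Whether a title is meaningful or an alert message makes sense still needs a human review.

```python
def lint_panels(panels: list[dict]) -> list[str]:
    """Return a list of human-readable problems found in the panels."""
    problems = []
    for index, panel in enumerate(panels):
        if panel.get("type") == "text":
            continue  # text panels carry no metrics, so nothing to check
        name = panel.get("title") or f"panel #{index}"
        if not panel.get("title"):
            problems.append(f"{name}: missing title")
        if not panel.get("description"):
            problems.append(f"{name}: missing description")
        unit = panel.get("fieldConfig", {}).get("defaults", {}).get("unit")
        if not unit:
            problems.append(f"{name}: missing unit")
    return problems


example_panels = [
    {"type": "text", "title": ""},
    {
        "type": "timeseries",
        "title": "Response time",
        "description": "Median response time of the web frontend.",
        "fieldConfig": {"defaults": {"unit": "s"}},
    },
    {"type": "timeseries", "title": "Queue length"},
]
for problem in lint_panels(example_panels):
    print(problem)
```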
We used the following sources to write this page:
- https://grafana.com/docs/grafana/latest/best-practices/
- https://grafana.com/blog/2019/05/29/grafana-labs-at-kubecon-foolproof-kubernetes-dashboards-for-sleep-deprived-on-calls/
- https://wikitech.wikimedia.org/wiki/Performance/Runbook/Grafana_best_practices
- https://www.datadoghq.com/blog/timeseries-metric-graphs-101/