Site Reliability Engineering (SRE) is a software engineering approach to IT operations. It was pioneered by Google to support large-scale systems by introducing automation and applying software engineering principles. SRE aims for proactive management and monitoring to ensure the reliability and efficiency of systems and services.
SREs understand that 100% availability is not practical for most systems. Therefore, they define a level of acceptable risk and work within that threshold. This level of acceptable risk is expressed as a Service Level Objective (SLO).
SLOs are a fundamental part of SRE practice. They provide a quantitative means to define the level of service a system should provide. SLOs can encompass aspects such as system availability, throughput, latency, and error rates.
The concept of an error budget stems from the SLO. If a system's reliability is exceeding the SLO, the surplus can be considered as an error budget. This budget can be "spent" on pushing new features, which may increase the risk of failure.
SRE promotes a culture where everyone shares ownership of the production environment. Developers and operations staff collaborate closely to ensure the reliability, scalability, and efficiency of systems.
Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly as a service grows. SREs strive to automate this kind of work to focus more on strategic and long-term planning activities.
SRE encourages quick iteration while minimizing the impact of failures. This approach requires robust monitoring, quick and effective incident response, and learning from failures.
Good monitoring provides insights into how systems are behaving and helps SREs identify problems before they affect users. Alerts should be actionable and meaningful. They should represent situations that require human intervention.
SREs need to plan for the system's capacity to ensure it can handle the load and can scale up as demand increases.
Changes in the production environment should be controlled and managed. SREs apply principles like canarying, phased rollouts, and feature flags to minimize the impact of changes.
When incidents occur, SREs should be ready to respond quickly and effectively. Postmortems are done after resolving incidents to learn from them and prevent recurrence.
Blameless postmortem culture is central to learning from failures. It drives the organization to view mistakes, errors, slips, lapses, etc. as a systemic effect, rather than blaming an individual.
To learn more about Site Reliability Engineering, you can refer to the Google SRE book.