Skip to content

Latest commit

 

History

History
57 lines (29 loc) · 3.1 KB

sre.md

File metadata and controls

57 lines (29 loc) · 3.1 KB

Site Reliability Engineering (SRE)

Introduction

Site Reliability Engineering (SRE) is a software engineering approach to IT operations. It was pioneered by Google to support large-scale systems by introducing automation and applying software engineering principles. SRE aims for proactive management and monitoring to ensure the reliability and efficiency of systems and services.

Core Principles of SRE

Embrace Risk

SREs understand that 100% availability is not practical for most systems. Therefore, they define a level of acceptable risk and work within that threshold. This level of acceptable risk is expressed as a Service Level Objective (SLO).

Service Level Objectives (SLOs)

SLOs are a fundamental part of SRE practice. They provide a quantitative means to define the level of service a system should provide. SLOs can encompass aspects such as system availability, throughput, latency, and error rates.

Error Budgets

The concept of an error budget stems from the SLO. If a system's reliability is exceeding the SLO, the surplus can be considered as an error budget. This budget can be "spent" on pushing new features, which may increase the risk of failure.

Reduce Organizational Silos

SRE promotes a culture where everyone shares ownership of the production environment. Developers and operations staff collaborate closely to ensure the reliability, scalability, and efficiency of systems.

Automate Toil

Toil is the kind of work tied to running a production service that tends to be manual, repetitive, automatable, tactical, devoid of enduring value, and scales linearly as a service grows. SREs strive to automate this kind of work to focus more on strategic and long-term planning activities.

Move Fast by Reducing the Cost of Failure

SRE encourages quick iteration while minimizing the impact of failures. This approach requires robust monitoring, quick and effective incident response, and learning from failures.

Key Practices

Monitoring and Alerting

Good monitoring provides insights into how systems are behaving and helps SREs identify problems before they affect users. Alerts should be actionable and meaningful. They should represent situations that require human intervention.

Capacity Planning

SREs need to plan for the system's capacity to ensure it can handle the load and can scale up as demand increases.

Change Management

Changes in the production environment should be controlled and managed. SREs apply principles like canarying, phased rollouts, and feature flags to minimize the impact of changes.

Incident Response

When incidents occur, SREs should be ready to respond quickly and effectively. Postmortems are done after resolving incidents to learn from them and prevent recurrence.

Blameless Postmortem

Blameless postmortem culture is central to learning from failures. It drives the organization to view mistakes, errors, slips, lapses, etc. as a systemic effect, rather than blaming an individual.

Further Resources

To learn more about Site Reliability Engineering, you can refer to the Google SRE book.