SRE is a discipline, not a job title
"Site Reliability Engineering" became a buzzword years ago. The actual practice — committed reliability targets, error budgets, post-incident reviews, on-call discipline — is genuinely valuable and is what separates teams that ship reliably from teams that don't. It's also surprisingly under-implemented; the title has spread faster than the discipline.Below is what we apply, distilled from running a few production-critical systems.
SLOs, not "uptime"
"99.9 % uptime" is meaningless without context. A useful SLO specifies:- The user journey it measures — "user is able to complete checkout end-to-end" not "the checkout service responds to its health check"
- The success criterion — what counts as a "success" event vs a "failure" event
- The target — 99.5 %, 99.9 %, 99.95 % — over a defined window (28 days is typical)
- The measurement methodology — synthetic probes, real user monitoring, server-side instrumentation
A team with a few well-chosen SLOs, measured consistently, is in a different operational class than a team with a "99.99 % everything" aspiration that nobody can verify.
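To make that concrete, here is a minimal sketch of an SLO expressed as data plus a compliance check. The Slo class, the checkout journey string, and the event counts are illustrative assumptions, not a reference to any particular SLO tooling.

```python
# A minimal sketch of an SLO definition and compliance check. The "good" and
# "total" event counts are assumed to come from your own instrumentation.
from dataclasses import dataclass

@dataclass
class Slo:
    journey: str           # the user journey being measured
    target: float          # e.g. 0.999 for 99.9 %
    window_days: int = 28  # rolling measurement window

    def is_met(self, good_events: int, total_events: int) -> bool:
        """An event is 'good' when the journey completed successfully."""
        if total_events == 0:
            return True  # no traffic in the window: nothing to breach
        return good_events / total_events >= self.target

checkout_slo = Slo(journey="user completes checkout end-to-end", target=0.999)
print(checkout_slo.is_met(good_events=998_700, total_events=1_000_000))  # False
```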
Error budgets — the negotiation tool
Error budget = (1 - SLO) × time. A 99.9 % SLO over 28 days = ~40 minutes of "allowed unreliability".
The discipline:
- When the budget is healthy → ship features fast, take controlled risks
- When the budget is exhausted → freeze features, focus on reliability work
- The decision is mechanical, not political
The error budget is the artefact that gives reliability work organisational legitimacy. Without it, "we should be more reliable" is a values statement; with it, it's a budget conversation.
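The arithmetic is worth keeping in executable form, precisely because the freeze decision should be mechanical. A minimal sketch (the 35 minutes of downtime is a made-up figure; in practice it comes from your SLI measurements):

```python
# Error budget = (1 - SLO) x window, and a mechanical freeze decision.
WINDOW_MINUTES = 28 * 24 * 60   # 28-day window = 40,320 minutes
slo = 0.999                     # 99.9 % target

budget_minutes = (1 - slo) * WINDOW_MINUTES   # ~40.3 minutes of allowed unreliability
downtime_minutes = 35                         # made-up figure; from SLI data in practice
remaining = budget_minutes - downtime_minutes

print(f"budget: {budget_minutes:.1f} min, remaining: {remaining:.1f} min")
if remaining <= 0:
    print("budget exhausted: freeze features, focus on reliability work")
else:
    print("budget healthy: keep shipping, take controlled risks")
```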
On-call: the discipline matters more than the rotation
Patterns that work:
- Primary + secondary — primary takes the page, secondary is backup if primary is unreachable
- One-week rotations — long enough to develop intuition, short enough to recover from
- Compensated — on-call is real work, even when no pages happen. Pay for it.
- Time off after a bad on-call — burnout prevention is operational. A primary who's dealt with a 3 AM incident gets the next morning off.
- On-call handoffs — formal, with state of open incidents, ongoing investigations, anything to watch for. Email handoff is fine; verbal is better.
Teams that run on-call as "we'll see what happens" have churn problems.
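The rotation itself is simple enough to sketch. This is illustrative only: the roster names are hypothetical, and a real schedule (swaps, holidays, time zones) needs proper scheduling tooling.

```python
# A minimal sketch of a primary + secondary weekly rotation over a flat roster.
from datetime import date, timedelta

engineers = ["ana", "ben", "chao", "dana"]  # hypothetical roster

def rotation(start: date, weeks: int):
    """Yield (week_start, primary, secondary) for each one-week shift."""
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]  # next in line backs up
        yield start + timedelta(weeks=week), primary, secondary

for week_start, primary, secondary in rotation(date(2024, 1, 1), weeks=4):
    print(f"{week_start}: primary={primary}, secondary={secondary}")
```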
The incident lifecycle
- Detection — alert fires, or user reports an issue.
- Triage — primary on-call assesses severity within 5 minutes. Severity drives response.
- Mobilisation — for high-severity, an incident channel is opened, an Incident Commander is assigned, a comms lead is assigned.
- Mitigation — restore service, even if not fully understood. Mitigation > diagnosis during the incident.
- Communication — to users (if user-impacting), to internal stakeholders, to the on-call channel.
- Post-incident review — within 5 business days. Blameless. Documented. Action items tracked.
The Incident Commander role is the underrated one. Their job is not to fix the incident; it's to coordinate the people fixing it. Without them, three engineers debug in parallel and the diagnosis goes slower.
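One way to keep "severity drives response" from being improvised at 3 AM is a checklist keyed by severity. A minimal sketch (the severity labels and actions are illustrative assumptions, not a standard taxonomy):

```python
# Map a triage severity to the mobilisation steps the primary works through.
MOBILISATION = {
    "sev1": ["open incident channel", "assign Incident Commander", "assign comms lead"],
    "sev2": ["open incident channel", "assign Incident Commander"],
    "sev3": ["primary on-call handles, note in handoff"],
}

def mobilise(severity: str) -> list[str]:
    """Return the checklist for a given severity; unknown severities escalate."""
    return MOBILISATION.get(severity, ["escalate: unknown severity"])

print(mobilise("sev1"))
```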
Post-incident reviews — get them right
A post-incident review (post-mortem) is valuable when:
- It's blameless — the goal is system improvement, not individual accountability
- It produces concrete action items with owners and due dates
- The action items are tracked to completion (most teams' weak point)
- Patterns across incidents are identified and addressed
The single biggest improvement to most teams' reliability is closing the loop on post-incident action items.
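Closing the loop is mostly a matter of making overdue items visible. A minimal sketch, assuming action items live in a structure like the one below; in practice this would query your issue tracker:

```python
# Flag post-incident action items that are past due and still open.
from datetime import date

action_items = [
    {"id": "PIR-12-1", "owner": "ana", "due": date(2024, 2, 1), "done": True},
    {"id": "PIR-12-2", "owner": "ben", "due": date(2024, 2, 1), "done": False},
]

def overdue(items, today=None):
    today = today or date.today()
    return [i for i in items if not i["done"] and i["due"] < today]

for item in overdue(action_items, today=date(2024, 3, 1)):
    print(f"OVERDUE {item['id']} (owner: {item['owner']}, due {item['due']})")
```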
Toil — the work that's secretly killing the team
Toil = manual, repetitive, no-lasting-value work. Examples: provisioning a new environment by hand, manually rolling a deployment, fixing the same recurring data inconsistency.
The SRE discipline tracks toil. When toil exceeds ~50 % of the team's time, the team has stopped doing reliability engineering and has become an operations team. The fix: automate, prioritise the automation work, and accept that the investment means deferring some feature work.
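Tracking toil doesn't need tooling to start; a rough weekly tally is enough to know which side of the threshold the team is on. A minimal sketch with made-up hours:

```python
# A rough toil tally for one week; the categories and hours are illustrative.
week_hours = {
    "toil": 100,        # manual provisioning, hand-rolled deploys, recurring data fixes
    "engineering": 80,  # automation, reliability projects, feature work
}

toil_fraction = week_hours["toil"] / sum(week_hours.values())
print(f"toil this week: {toil_fraction:.0%}")
if toil_fraction > 0.5:
    print("over the ~50 % threshold: prioritise automation over new features")
```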
Chaos engineering — when it's worth it
Deliberately injecting failures into production (in controlled ways) to verify that the systems handle failure as designed. Worth it when:
- The team has solid observability — chaos without observability just produces incidents
- The team has SLOs — chaos has a budget within which it can run
- The system is mature enough that the design assumes failures
Not worth it for early-stage systems where the failure modes are still being learned organically.
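A useful guardrail is to gate each experiment on the remaining error budget. A minimal sketch; the numbers and the inject_failure() hook are hypothetical placeholders for whatever injection tooling you actually use:

```python
# Only run a chaos experiment if its worst-case impact fits in the error budget.
def run_chaos_experiment(remaining_budget_minutes: float,
                         expected_impact_minutes: float) -> bool:
    if expected_impact_minutes >= remaining_budget_minutes:
        print("skipping: experiment could exhaust the error budget")
        return False
    print("running controlled failure injection")
    # inject_failure()  # hypothetical hook into your chaos tooling
    return True

run_chaos_experiment(remaining_budget_minutes=12.0, expected_impact_minutes=2.5)
```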
One pattern we'd warn about
Hiring SREs without giving them authority. The "SRE team" that can't say "no, we're not shipping that until reliability work is done" is a support team, not an SRE function. The error budget gives the authority structure; without it, SRE is advisory.
One pattern that always pays off
The on-call review meeting. Weekly or bi-weekly, the on-call engineer walks the team through what happened on their shift. Pages, interesting incidents, near-misses, things that should have alerted but didn't. The team's collective intuition compounds, and recurring problems get fixed faster.
What's your incident process? And — for SRE leads — how do you handle the political dimension of declaring an SLO breach during a feature push?