SRE is a discipline, not a job title
"Site Reliability Engineering" became a buzzword years ago. The actual practice — committed reliability targets, error budgets, post-incident reviews, on-call discipline — is genuinely valuable and is what separates teams that ship reliably from teams that don't. It's also surprisingly under-implemented; the title has spread faster than the discipline.Below is what we apply, distilled from running a few production-critical systems.
SLOs, not "uptime"
"99.9 % uptime" is meaningless without context. A useful SLO specifies:- The user journey it measures — "user is able to complete checkout end-to-end" not "the checkout service responds to its health check"
- The success criterion — what counts as a "success" event vs a "failure" event
- The target — 99.5 %, 99.9 %, 99.95 % — over a defined window (28 days is typical)
- The measurement methodology — synthetic probes, real user monitoring, server-side instrumentation
A team with a few well-chosen SLOs, measured consistently, is in a different operational class than a team with a "99.99 % everything" aspiration that nobody can verify.
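To make that concrete, here is a minimal sketch of an SLO expressed as data plus a compliance check. The Slo class, the checkout journey string, and the event counts are illustrative assumptions, not a reference to any particular SLO tooling.

```python
# A minimal sketch of an SLO definition and compliance check. The "good" and
# "total" event counts are assumed to come from your own instrumentation.
from dataclasses import dataclass

@dataclass
class Slo:
    journey: str           # the user journey being measured
    target: float          # e.g. 0.999 for 99.9 %
    window_days: int = 28  # rolling measurement window

    def is_met(self, good_events: int, total_events: int) -> bool:
        """An event is 'good' when the journey completed successfully."""
        if total_events == 0:
            return True  # no traffic in the window: nothing to breach
        return good_events / total_events >= self.target

checkout_slo = Slo(journey="user completes checkout end-to-end", target=0.999)
print(checkout_slo.is_met(good_events=998_700, total_events=1_000_000))  # False
```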
Error budgets — the negotiation tool
Error budget = (1 - SLO) × time. A 99.9 % SLO over 28 days = ~40 minutes of "allowed unreliability".
The discipline:
- When the budget is healthy → ship features fast, take controlled risks
- When the budget is exhausted → freeze features, focus on reliability work
- The decision is mechanical, not political
The error budget is the artefact that gives reliability work organisational legitimacy. Without it, "we should be more reliable" is a values statement; with it, it's a budget conversation.
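The arithmetic is worth keeping in executable form, precisely because the freeze decision should be mechanical. A minimal sketch (the 35 minutes of downtime is a made-up figure; in practice it comes from your SLI measurements):

```python
# Error budget = (1 - SLO) x window, and a mechanical freeze decision.
WINDOW_MINUTES = 28 * 24 * 60   # 28-day window = 40,320 minutes
slo = 0.999                     # 99.9 % target

budget_minutes = (1 - slo) * WINDOW_MINUTES   # ~40.3 minutes of allowed unreliability
downtime_minutes = 35                         # made-up figure; from SLI data in practice
remaining = budget_minutes - downtime_minutes

print(f"budget: {budget_minutes:.1f} min, remaining: {remaining:.1f} min")
if remaining <= 0:
    print("budget exhausted: freeze features, focus on reliability work")
else:
    print("budget healthy: keep shipping, take controlled risks")
```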
On-call: the discipline matters more than the rotation
Patterns that work:
- Primary + secondary — primary takes the page, secondary is backup if primary is unreachable
- One-week rotations — long enough to develop intuition, short enough to recover from
- Compensated — on-call is real work, even when no pages happen. Pay for it.
- Time off after a bad on-call — burnout prevention is operational. A primary who's dealt with a 3 AM incident gets the next morning off.
- On-call handoffs — formal, with state of open incidents, ongoing investigations, anything to watch for. Email handoff is fine; verbal is better.
Teams that run on-call as "we'll see what happens" have churn problems.
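The rotation itself is simple enough to sketch. This is illustrative only: the roster names are hypothetical, and a real schedule (swaps, holidays, time zones) needs proper scheduling tooling.

```python
# A minimal sketch of a primary + secondary weekly rotation over a flat roster.
from datetime import date, timedelta

engineers = ["ana", "ben", "chao", "dana"]  # hypothetical roster

def rotation(start: date, weeks: int):
    """Yield (week_start, primary, secondary) for each one-week shift."""
    for week in range(weeks):
        primary = engineers[week % len(engineers)]
        secondary = engineers[(week + 1) % len(engineers)]  # next in line backs up
        yield start + timedelta(weeks=week), primary, secondary

for week_start, primary, secondary in rotation(date(2024, 1, 1), weeks=4):
    print(f"{week_start}: primary={primary}, secondary={secondary}")
```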
The incident lifecycle
- Detection — alert fires, or user reports an issue.
- Triage — primary on-call assesses severity within 5 minutes. Severity drives response.
- Mobilisation — for high-severity, an incident channel is opened, an Incident Commander is assigned, a comms lead is assigned.
- Mitigation — restore service, even if not fully understood. Mitigation > diagnosis during the incident.
- Communication — to users (if user-impacting), to internal stakeholders, to the on-call channel.
- Post-incident review — within 5 business days. Blameless. Documented. Action items tracked.
The Incident Commander role is the underrated one. Their job is not to fix the incident; it's to coordinate the people fixing it. Without them, three engineers debug in parallel and the diagnosis goes slower.
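One way to keep "severity drives response" from being improvised at 3 AM is a checklist keyed by severity. A minimal sketch (the severity labels and actions are illustrative assumptions, not a standard taxonomy):

```python
# Map a triage severity to the mobilisation steps the primary works through.
MOBILISATION = {
    "sev1": ["open incident channel", "assign Incident Commander", "assign comms lead"],
    "sev2": ["open incident channel", "assign Incident Commander"],
    "sev3": ["primary on-call handles, note in handoff"],
}

def mobilise(severity: str) -> list[str]:
    """Return the checklist for a given severity; unknown severities escalate."""
    return MOBILISATION.get(severity, ["escalate: unknown severity"])

print(mobilise("sev1"))
```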
Post-incident reviews — get them right
A post-incident review (post-mortem) is valuable when:
- It's blameless — the goal is system improvement, not individual accountability
- It produces concrete action items with owners and due dates
- The action items are tracked to completion (most teams' weak point)
- Patterns across incidents are identified and addressed
The single biggest improvement to most teams' reliability is closing the loop on post-incident action items.
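Closing the loop is mostly a matter of making overdue items visible. A minimal sketch, assuming action items live in a structure like the one below; in practice this would query your issue tracker:

```python
# Flag post-incident action items that are past due and still open.
from datetime import date

action_items = [
    {"id": "PIR-12-1", "owner": "ana", "due": date(2024, 2, 1), "done": True},
    {"id": "PIR-12-2", "owner": "ben", "due": date(2024, 2, 1), "done": False},
]

def overdue(items, today=None):
    today = today or date.today()
    return [i for i in items if not i["done"] and i["due"] < today]

for item in overdue(action_items, today=date(2024, 3, 1)):
    print(f"OVERDUE {item['id']} (owner: {item['owner']}, due {item['due']})")
```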
Toil — the work that's secretly killing the team
Toil = manual, repetitive, no-lasting-value work. Examples: provisioning a new environment by hand, manually rolling a deployment, fixing the same recurring data inconsistency.
The SRE discipline tracks toil. When toil exceeds ~50 % of the team's time, the team has stopped doing reliability engineering and has become an operations team. The fix: automate, prioritise the automation work, and accept that the investment means deferring some feature work.
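Tracking toil doesn't need tooling to start; a rough weekly tally is enough to know which side of the threshold the team is on. A minimal sketch with made-up hours:

```python
# A rough toil tally for one week; the categories and hours are illustrative.
week_hours = {
    "toil": 100,        # manual provisioning, hand-rolled deploys, recurring data fixes
    "engineering": 80,  # automation, reliability projects, feature work
}

toil_fraction = week_hours["toil"] / sum(week_hours.values())
print(f"toil this week: {toil_fraction:.0%}")
if toil_fraction > 0.5:
    print("over the ~50 % threshold: prioritise automation over new features")
```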
Chaos engineering — when it's worth it
Deliberately injecting failures into production (in controlled ways) to verify that the systems handle failure as designed. Worth it when:
- The team has solid observability — chaos without observability just produces incidents
- The team has SLOs — chaos has a budget within which it can run
- The system is mature enough that the design assumes failures
Not worth it for early-stage systems where the failure modes are still being learned organically.
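A useful guardrail is to gate each experiment on the remaining error budget. A minimal sketch; the numbers and the inject_failure() hook are hypothetical placeholders for whatever injection tooling you actually use:

```python
# Only run a chaos experiment if its worst-case impact fits in the error budget.
def run_chaos_experiment(remaining_budget_minutes: float,
                         expected_impact_minutes: float) -> bool:
    if expected_impact_minutes >= remaining_budget_minutes:
        print("skipping: experiment could exhaust the error budget")
        return False
    print("running controlled failure injection")
    # inject_failure()  # hypothetical hook into your chaos tooling
    return True

run_chaos_experiment(remaining_budget_minutes=12.0, expected_impact_minutes=2.5)
```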
One pattern we'd warn about
Hiring SREs without giving them authority. The "SRE team" that can't say "no, we're not shipping that until reliability work is done" is a support team, not an SRE function. The error budget gives the authority structure; without it, SRE is advisory.
One pattern that always pays off
The on-call review meeting. Weekly or bi-weekly, the on-call engineer walks the team through what happened on their shift. Pages, interesting incidents, near-misses, things that should have alerted but didn't. The team's collective intuition compounds, and recurring problems get fixed faster.
What's your incident process? And — for SRE leads — how do you handle the political dimension of declaring an SLO breach during a feature push?