SRE in practice: incidents, on-call, error budgets, and the discipline that makes them work


Aior



SRE is a discipline, not a job title

"Site Reliability Engineering" became a buzzword years ago. The actual practice — committed reliability targets, error budgets, post-incident reviews, on-call discipline — is genuinely valuable and is what separates teams that ship reliably from teams that don't. It's also surprisingly under-implemented; the title has spread faster than the discipline.

Below is what we apply, distilled from running a few production-critical systems.

SLOs, not "uptime"

"99.9 % uptime" is meaningless without context. A useful SLO specifies:
  • The user journey it measures — "user is able to complete checkout end-to-end" not "the checkout service responds to its health check"
  • The success criterion — what counts as a "success" event vs a "failure" event
  • The target — 99.5 %, 99.9 %, 99.95 % — over a defined window (28 days is typical)
  • The measurement methodology — synthetic probes, real user monitoring, server-side instrumentation

A team with a few well-chosen SLOs running consistently is in a different operational class than a team with a "99.99 % everything" aspiration that nobody can verify.
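A minimal sketch of the four parts above as a data structure, plus the ratio that decides whether the target was met. The field values, the 10-second criterion, and the choice to treat a window with no traffic as fully successful are all assumptions for illustration, not something from a specific monitoring tool.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SLO:
    """One SLO with the four parts listed above (values are illustrative)."""
    journey: str            # the user journey being measured
    success_criterion: str  # what counts as a "success" event
    target: float           # e.g. 0.999 for 99.9 %
    window_days: int        # e.g. 28
    methodology: str        # e.g. "synthetic probes"


def sli(good_events: int, total_events: int) -> float:
    """Fraction of events in the window that met the success criterion."""
    if total_events == 0:
        return 1.0  # assumption: no traffic counts as no failures
    return good_events / total_events


def is_met(slo: SLO, good_events: int, total_events: int) -> bool:
    """True when the measured ratio is at or above the target."""
    return sli(good_events, total_events) >= slo.target


# Hypothetical checkout SLO, mirroring the example journey above.
checkout = SLO(
    journey="user completes checkout end-to-end",
    success_criterion="order confirmed within 10 s, no server error",
    target=0.999,
    window_days=28,
    methodology="server-side instrumentation",
)
```

The point of writing it down this way is that every field has to be filled in; an SLO with an empty journey or methodology is the "99.99 % everything" aspiration in disguise.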

Error budgets — the negotiation tool

Error budget = (1 - SLO) × time. A 99.9 % SLO over 28 days = ~40 minutes of "allowed unreliability".
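The arithmetic is short enough to check by hand, but a small helper makes it easy to compare targets. This is a sketch of the time-based formula above; the function name is ours.

```python
def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Allowed minutes of unreliability: (1 - SLO) x window length."""
    window_minutes = window_days * 24 * 60
    return (1 - slo_target) * window_minutes


# Compare the three targets mentioned earlier over a 28-day window.
for target in (0.995, 0.999, 0.9995):
    minutes = error_budget_minutes(target, 28)
    print(f"{target:.2%} over 28 days: {minutes:.1f} minutes")
```

A 28-day window is 40,320 minutes, so 99.9 % leaves about 40.3 minutes, 99.5 % about 201.6, and 99.95 % about 20.2.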

The discipline:
  • When the budget is healthy → ship features fast, take controlled risks
  • When the budget is exhausted → freeze features, focus on reliability work
  • The decision is mechanical, not political

The error budget is the artefact that gives reliability work organisational legitimacy. Without it, "we should be more reliable" is a values statement; with it, it's a budget conversation.
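To show what "mechanical, not political" can look like, here is a sketch that turns event counts into a remaining-budget fraction and then into a policy. It uses a request-based budget (allowed failed events) instead of minutes, and the names and the two-state policy are assumptions; real teams often add intermediate states.

```python
from enum import Enum


class ReleasePolicy(Enum):
    SHIP = "ship features, take controlled risks"
    FREEZE = "freeze features, focus on reliability work"


def budget_remaining(slo_target: float, good_events: int, total_events: int) -> float:
    """Fraction of the error budget left: 1.0 is untouched, <= 0 is exhausted."""
    allowed_failures = (1 - slo_target) * total_events
    actual_failures = total_events - good_events
    if allowed_failures == 0:
        # No budget at all (no traffic, or a 100 % target).
        return 1.0 if actual_failures == 0 else 0.0
    return 1 - actual_failures / allowed_failures


def release_policy(remaining: float) -> ReleasePolicy:
    """The decision depends only on the number, not on who is asking."""
    return ReleasePolicy.FREEZE if remaining <= 0 else ReleasePolicy.SHIP
```

With a 99.9 % target and 1,000,000 events, 500 failures leaves roughly half the budget and the policy is SHIP; 1,100 failures overspends it and the policy is FREEZE.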

On-call: the discipline matters more than the rotation

Patterns that work:
  • Primary + secondary — primary takes the page, secondary is backup if primary is unreachable
  • One-week rotations — long enough to develop intuition, short enough to recover from
  • Compensated — on-call is real work, even when no pages happen. Pay for it.
  • Time off after a bad on-call — burnout prevention is operational. A primary who's dealt with a 3 AM incident gets the next morning off.
  • On-call handoffs — formal, with state of open incidents, ongoing investigations, anything to watch for. Email handoff is fine; verbal is better.

Teams that run on-call as "we'll see what happens" have churn problems.
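A sketch of the primary + secondary, one-week pattern as a schedule generator. The engineer names are placeholders, and having this week's secondary become next week's primary is our assumption (it gives some continuity across the handoff), not a rule from the list above.

```python
from datetime import date, timedelta
from typing import Iterator, NamedTuple, Sequence


class Shift(NamedTuple):
    start: date      # first day of the shift
    end: date        # exclusive: first day of the next shift
    primary: str
    secondary: str


def weekly_rotation(
    engineers: Sequence[str], first_day: date, weeks: int
) -> Iterator[Shift]:
    """Yield one-week shifts, cycling through the engineers in order."""
    if len(engineers) < 2:
        raise ValueError("need at least two engineers for primary + secondary")
    for week in range(weeks):
        start = first_day + timedelta(weeks=week)
        primary = engineers[week % len(engineers)]
        # Assumption: the secondary is next in line, so they are primary next week.
        secondary = engineers[(week + 1) % len(engineers)]
        yield Shift(start, start + timedelta(weeks=1), primary, secondary)


# Hypothetical three-person team, starting on a Monday.
for shift in weekly_rotation(["ana", "bo", "cy"], date(2024, 1, 1), 4):
    print(shift.start, shift.primary, shift.secondary)
```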

The incident lifecycle

  1. Detection — alert fires, or user reports an issue.
  2. Triage — primary on-call assesses severity within 5 minutes. Severity drives response.
  3. Mobilisation — for high-severity, an incident channel is opened, an Incident Commander is assigned, a comms lead is assigned.
  4. Mitigation — restore service, even if not fully understood. Mitigation > diagnosis during the incident.
  5. Communication — to users (if user-impacting), to internal stakeholders, to the on-call channel.
  6. Post-incident review — within 5 business days. Blameless. Documented. Action items tracked.

The Incident Commander role is the underrated one. Their job is not to fix the incident; it's to coordinate the people fixing it. Without them, three engineers debug in parallel without sharing findings, duplicate each other's work, and the diagnosis goes slower.
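The two deadlines in the lifecycle (triage within 5 minutes, review within 5 business days) are easy to check automatically. A sketch, assuming the review clock starts on the mitigation date and that public holidays are ignored; the class and field names are illustrative.

```python
from dataclasses import dataclass
from datetime import date, datetime, timedelta
from typing import Optional

TRIAGE_DEADLINE = timedelta(minutes=5)  # from step 2
REVIEW_BUSINESS_DAYS = 5                # from step 6


def add_business_days(start: date, days: int) -> date:
    """Advance by `days` weekdays (Mon-Fri); holidays are not considered."""
    current, remaining = start, days
    while remaining > 0:
        current += timedelta(days=1)
        if current.weekday() < 5:  # Monday=0 ... Friday=4
            remaining -= 1
    return current


@dataclass
class Incident:
    detected_at: datetime
    triaged_at: Optional[datetime] = None
    mitigated_at: Optional[datetime] = None

    def triaged_on_time(self) -> bool:
        """True if severity was assessed within the triage deadline."""
        if self.triaged_at is None:
            return False
        return self.triaged_at - self.detected_at <= TRIAGE_DEADLINE

    def review_due(self) -> Optional[date]:
        """Latest date for the post-incident review, once mitigated."""
        # Assumption: the 5 business days are counted from the mitigation date.
        if self.mitigated_at is None:
            return None
        return add_business_days(self.mitigated_at.date(), REVIEW_BUSINESS_DAYS)
```

For an incident mitigated on Thursday 7 March 2024, `review_due()` gives Thursday 14 March 2024.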

Post-incident reviews — get them right

A post-incident review (post-mortem) is valuable when:
  • It's blameless — the goal is system improvement, not individual accountability
  • It produces concrete action items with owners and due dates
  • The action items are tracked to completion (most teams' weak point)
  • Patterns across incidents are identified and addressed

The single biggest improvement to most teams' reliability is closing the loop on post-incident action items.
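Closing the loop mostly means being able to answer "what is open and overdue?" at any time. A sketch of that query, assuming action items live in some list you can load; the record shape and incident IDs are hypothetical, and in practice this would sit on top of whatever issue tracker the team already uses.

```python
from dataclasses import dataclass
from datetime import date
from typing import List, Optional, Sequence


@dataclass
class ActionItem:
    incident_id: str
    description: str
    owner: str                       # every item needs an owner
    due: date                        # and a due date
    closed_on: Optional[date] = None


def overdue(items: Sequence[ActionItem], today: date) -> List[ActionItem]:
    """Open items past their due date, oldest due date first."""
    late = [i for i in items if i.closed_on is None and i.due < today]
    return sorted(late, key=lambda i: i.due)


def closure_rate(items: Sequence[ActionItem]) -> float:
    """Share of action items that have been closed."""
    if not items:
        return 1.0  # assumption: nothing to close counts as fully closed
    return sum(i.closed_on is not None for i in items) / len(items)
```

Reviewing the `overdue` list in a recurring meeting is one way to keep the tracking from quietly lapsing.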

Toil — the work that's secretly killing the team

Toil = manual, repetitive, no-lasting-value work. Examples: provisioning a new environment by hand, manually rolling a deployment, fixing the same recurring data inconsistency.

The SRE discipline tracks toil. When toil exceeds ~50 % of the team's time, the team has stopped doing reliability engineering and has become an operations team. The fix: automate, prioritise the automation work, and accept that investing in it means deferring some feature work.
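Tracking toil can be as simple as tagging time entries and computing a ratio. A sketch, assuming the team logs hours per category; the category names and hours are made up for illustration.

```python
from typing import Iterable, Mapping

TOIL_THRESHOLD = 0.5  # the ~50 % line mentioned above


def toil_fraction(
    hours_by_category: Mapping[str, float], toil_categories: Iterable[str]
) -> float:
    """Share of logged hours spent on categories tagged as toil."""
    total = sum(hours_by_category.values())
    if total == 0:
        return 0.0
    toil = sum(hours_by_category.get(c, 0.0) for c in set(toil_categories))
    return toil / total


# Hypothetical week for one engineer.
week = {
    "manual deploys": 12,
    "environment provisioning": 10,
    "automation work": 8,
    "design and review": 10,
}
fraction = toil_fraction(week, ["manual deploys", "environment provisioning"])
print(f"toil: {fraction:.0%}, over threshold: {fraction > TOIL_THRESHOLD}")
```

In this made-up week, 22 of 40 hours are toil (55 %), so the team is past the threshold.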

Chaos engineering — when it's worth it

Chaos engineering means deliberately injecting failures into production, in controlled ways, to verify that the systems handle failure as designed. It's worth it when:
  • The team has solid observability — chaos without observability just produces incidents
  • The team has SLOs — chaos has a budget within which it can run
  • The system is mature enough that the design assumes failures

Not worth it for early-stage systems where the failure modes are still being learned organically.
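The three preconditions can be written as a gate that runs before any experiment. This is a sketch under our own assumptions: the readiness flags are self-assessed, and the idea of keeping a reserve of error budget untouched (10 minutes here) is an illustrative policy, not something prescribed above.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class Readiness:
    has_observability: bool
    has_slos: bool
    designed_for_failure: bool
    budget_remaining_minutes: float  # error budget left in the current window


def may_run(
    r: Readiness, expected_impact_minutes: float, reserve_minutes: float = 10.0
) -> Tuple[bool, str]:
    """Return (allowed, reason) for a proposed chaos experiment."""
    if not r.has_observability:
        return False, "no observability: the experiment would just be an incident"
    if not r.has_slos:
        return False, "no SLOs: there is no budget to run within"
    if not r.designed_for_failure:
        return False, "system does not yet assume failures in its design"
    # Assumption: keep a reserve so real incidents still have budget left.
    if r.budget_remaining_minutes - expected_impact_minutes < reserve_minutes:
        return False, "not enough error budget left for this experiment"
    return True, "ok"
```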

One pattern we'd warn about

Hiring SREs without giving them authority. The "SRE team" that can't say "no, we're not shipping that until reliability work is done" is a support team, not an SRE function. The error budget gives the authority structure; without it, SRE is advisory.

One pattern that always pays off

The on-call review meeting. Weekly or bi-weekly, the on-call engineer walks the team through what happened on their shift. Pages, interesting incidents, near-misses, things that should have alerted but didn't. The team's collective intuition compounds, and recurring problems get fixed faster.
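One input that helps this meeting is a tally of which alerts paged more than once during the shift. A sketch, assuming the pager history can be exported as a list of alert names; the alert names below are invented.

```python
from collections import Counter
from typing import Iterable, List, Tuple


def recurring_alerts(
    pages: Iterable[str], min_count: int = 2
) -> List[Tuple[str, int]]:
    """Alert names that paged at least `min_count` times, most frequent first."""
    counts = Counter(pages)
    return [(name, n) for name, n in counts.most_common() if n >= min_count]


# Hypothetical pager export for one shift.
shift_pages = [
    "disk-full", "api-5xx", "disk-full",
    "cert-expiry", "disk-full", "api-5xx",
]
for name, n in recurring_alerts(shift_pages):
    print(f"{name}: {n} pages")
```

Near-misses and things that should have alerted but didn't will not show up in a tally like this; those still have to come from the engineer's own notes.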

What's your incident process? And — for SRE leads — how do you handle the political dimension of declaring an SLO breach during a feature push?
 
