Most engineering teams want their systems to be reliable. Very few have a written contract for how reliable, who owns that number, or what happens when reliability and feature velocity disagree. Site Reliability Engineering (SRE) is the discipline that puts those questions on paper and then engineers against the answers.
It is, in Ben Treynor’s original framing at Google, what happens when you ask a software engineer to design an operations function. The output is not a separate ops team — it is a software team that happens to operate production.
A note on how I use “SRE” on this site. I write about SRE inclusively — not the strict Google-2003 definition only. DevOps culture, cloud architecture, platform engineering, internal developer platforms, and modern reliability practice all sit inside the same conversation here. The lines between them blurred years ago in real organizations, and the interesting work happens in the overlap.

Courtesy: Google SRE — official site — the canonical SRE book, free online
Where SRE came from
Google introduced SRE internally in 2003 to run a search infrastructure that was growing faster than any traditional sysadmin team could keep up with. The thesis was simple: if operating a service requires running scripts and clicking buttons, it doesn’t scale. Treat the service like a software product, measure it like one, and pay down operational cost the same way you pay down technical debt.
The discipline went public in 2016 with the Site Reliability Engineering book (the “Google SRE book”), and the playbook has since been adopted — with varying degrees of faithfulness — across the industry.
The core principles
SRE rests on a small set of load-bearing ideas. Everything else is implementation detail.
- Reliability is a product feature, and it has a target. That target is the Service Level Objective (SLO) — a number, a window, and a measurement. “99.9% of requests under 200 ms over a rolling 28 days” is an SLO. “We want it to be fast” is not.
- Perfect reliability is the wrong goal. Pushing reliability from 99.9% to 99.99% costs roughly 10× and eats the velocity budget. Pick the SLO that matches what users will notice — then spend the gap.
- The gap between 100% and your SLO is an error budget. It is currency. You spend it on risky deploys, experiments, and migrations. When it’s empty, you stop shipping until you earn it back. This is the lever that aligns dev and ops without arguments.
- Toil must be capped. Manual, repetitive, automatable work scales linearly with the service; engineering work scales sublinearly. Google’s SRE guideline caps toil at 50% of an SRE’s time — anything above that is a sign the service is operating you.
- Blameless postmortems. Incidents are inevitable; the failure mode is not learning from them. Postmortems focus on the system, not the human who clicked the wrong button.
- Observability before automation. You cannot automate what you cannot measure. Logs, metrics, and traces come before runbooks; runbooks come before agents.

Courtesy: Team Topologies — team interaction modes between stream-aligned, platform, and enabling teams
SRE vs DevOps vs Platform Engineering
These three get conflated constantly. They are not the same thing, but they overlap.
DevOps is a cultural movement — the idea that developers and operators should share goals, tools, and on-call. It does not prescribe a job title, an org chart, or a measurement system. DevOps tells you what kind of team to be; SRE tells you what numbers to hit and how to govern them.
Platform Engineering is the productization of internal infrastructure — building an Internal Developer Platform (IDP) so application teams can self-serve compute, deploys, and observability. Platform Engineering owns the paved road; SRE owns the reliability contract for the services that run on it.
A useful collapse: DevOps is the philosophy, Platform Engineering is the paved road, SRE is the reliability contract. They coexist comfortably and most mature organizations have all three.
What SRE is not
- Not “ops, but with Python.” Scripting your old work doesn’t make it SRE — it still scales linearly. SRE is software engineering that eliminates the work.
- Not a job title you can sprinkle on. An “SRE team” with no SLO, no error budget, and no toil cap is a renamed ops team.
- Not zero-downtime at all costs. An SRE team that refuses to ship is failing the same way an unreliable team is. Burning error budget is part of the job.
What’s changing in 2026
The principles haven’t moved. What’s moving is who does the work. Generative AI and agentic systems are starting to absorb the bottom of the toil curve — first triage, postmortem drafting, runbook synthesis, log summarization — and the autonomy ladder is climbing fast. (For the conceptual frame, see Mental models for applying AI to infrastructure, and for what those autonomous components look like in practice, What is an AI agent?.)
The SLO/error-budget contract becomes more important in that world, not less — because once a service is partly run by a non-deterministic agent, the only way to govern it is to keep measuring user-visible reliability and letting the budget set the leash. The agent is just another contributor to the burn rate.
That’s the thread I follow through most of what’s written on cloudandsre.com — classical SRE principles as the load-bearing layer, AI as the next tool on top, not a replacement for the discipline.
References
- Site Reliability Engineering — Google — the canonical SRE book, free online
- The SRE Workbook — Google — companion volume, more practical
- Ben Treynor on what SRE is — Google Cloud blog
- Atlassian — SRE explained — practitioner-friendly overview
- DevOps vs SRE — DigitalOcean explainer
- Team Topologies — key concepts — how SRE, Platform, and stream-aligned teams fit together
- DORA — State of DevOps research — the empirical evidence base for what works
- CNCF — Platform Engineering whitepaper — where Platform Engineering and SRE intersect