Every operations team has work that nobody wants to do twice but somehow does fifty times a month. Restart the stuck job. Re-run the failed deploy. Add the new tenant to the IAM group. Answer the same support question. Individually, each task takes minutes. Collectively, it consumes the team and produces nothing durable.
That work has a name in SRE: toil. And the discipline has a specific answer for it — measure it, cap it, then engineer it away.
A note on framing. I use “SRE” inclusively here — the toil problem belongs to DevOps engineers, platform teams, cloud architects, and traditional ops just as much as it belongs to people with “SRE” in their title. The 50% rule is a load-bearing idea wherever production work happens.
Courtesy: Google SRE Book — Eliminating Toil, Chapter 5
What toil actually means
Toil is not “work I find boring.” Google’s SRE book defines it precisely: toil is operational work tied to running a production service that tends to have the following properties.
- Manual — a human has to do it.
- Repetitive — it happens again and again, the same way.
- Automatable — a machine could do it, given the design effort.
- Tactical — reactive, interrupt-driven, not strategic.
- Without enduring value — when you’re done, the service is in the same state as before, not improved.
- Scales linearly with the service — twice the traffic, twice the work.
The last point is the killer one. Toil is what makes operating a service get worse as it grows. Engineering work, by contrast, scales sublinearly — a deduplication system you build once handles 10× the load with the same effort.
What is not toil, even though it might feel like it:
- Overhead (meetings, email, HR) — annoying, but not service-tied.
- Engineering work that’s tedious — refactoring a config system is still engineering; the output is durable.
- One-time setup work — painful, but doesn’t repeat.
The line matters because the 50% cap only applies to the toil column.
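One way to make that line concrete is a small checklist classifier. This is only a sketch: the `Task` fields mirror the six properties listed earlier, while the `service_tied` flag, the `classify` helper, and the threshold of 4 are assumptions made for illustration, not cutoffs from the SRE book.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """One unit of work, described by the six toil properties."""
    name: str
    service_tied: bool       # tied to running a production service?
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    no_enduring_value: bool
    scales_linearly: bool


def toil_score(task: Task) -> int:
    """Count how many of the six toil properties the task matches (0 to 6)."""
    return sum([
        task.manual, task.repetitive, task.automatable,
        task.tactical, task.no_enduring_value, task.scales_linearly,
    ])


def classify(task: Task, threshold: int = 4) -> str:
    """Return 'overhead', 'toil', or 'not_toil'.

    The threshold is a hypothetical default; tune it to your own team.
    """
    if not task.service_tied:
        return "overhead"  # meetings, email, HR: annoying but not service-tied
    return "toil" if toil_score(task) >= threshold else "not_toil"


restart = Task("restart stuck job", True, True, True, True, True, True, True)
refactor = Task("refactor config system", True, True, False, False, False, False, False)
print(classify(restart), classify(refactor))
```

Scoring rather than requiring all six properties reflects that real tasks rarely match the definition perfectly; the more boxes a task ticks, the more confidently it belongs in the toil column.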
The 50% rule
Google’s SRE guideline: no SRE should spend more than 50% of their time on toil. At least the other 50% should go to engineering work that reduces future toil or adds service features: automation, tooling, architectural improvements, capacity planning, postmortem follow-ups.
The rule exists because toil has a gravitational pull. If you let the toil percentage drift past 50%, the team has no time to build the things that would bring it back down, and the trend is one-way. The 50% cap is a circuit breaker on that drift.
Teams routinely violate it for short bursts: an incident, a launch, an oncall storm. That’s fine. The discipline is to notice the violation, name it, and route the next quarter’s capacity toward burning it back down.
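A minimal sketch of "noticing the violation", assuming you already have a toil fraction per week (how you collect it is a separate question). The function name `review_quarter` and the sample numbers are hypothetical; the point is only to separate a short burst from a sustained breach of the cap.

```python
from statistics import mean

TOIL_CAP = 0.50  # the 50% guideline


def review_quarter(weekly_toil: list[float], cap: float = TOIL_CAP) -> dict:
    """Summarize weekly toil fractions (0.0 to 1.0) for one quarter.

    Reports which weeks exceeded the cap and whether the average did.
    """
    if not weekly_toil:
        raise ValueError("need at least one week of data")
    burst_weeks = [i + 1 for i, frac in enumerate(weekly_toil) if frac > cap]
    average = mean(weekly_toil)
    return {
        "average": average,
        "burst_weeks": burst_weeks,           # 1-based week numbers over the cap
        "sustained_violation": average > cap,  # True means the quarter itself is over
    }


# Illustrative data: an incident pushes weeks 3 and 4 over the cap.
report = review_quarter([0.40, 0.35, 0.80, 0.70, 0.40, 0.30])
print(report["burst_weeks"], report["sustained_violation"])
```

Here the two burst weeks are flagged but the period average stays under the cap, which is the "that's fine" case; a `sustained_violation` of `True` is the signal to redirect capacity.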
How to measure toil honestly
You can’t manage what you don’t measure, and toil measurement is where most teams get vague. Three approaches that work:
- Time-tracking sample weeks. Pick one week per quarter. Each SRE logs every 30-minute block as toil / engineering / overhead. Cheap, surprisingly accurate, hard to game.
- Ticket categorization. Tag every support ticket and oncall page as toil or not. Roll up monthly. Works if your ticketing discipline is already good.
- Pager-rooted estimate. Count pages and tickets; multiply by an average resolution time. Less accurate but easy to automate.
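The first and third approaches reduce to simple arithmetic. The sketch below assumes a sample week is recorded as a flat list of labels, one per 30-minute block, and that the pager-rooted estimate divides total resolution time by the team's available hours. All function names and sample figures are illustrative, not taken from any tool.

```python
from collections import Counter

CATEGORIES = ("toil", "engineering", "overhead")


def sample_week_breakdown(blocks: list[str]) -> dict[str, float]:
    """Convert per-block labels into the fraction of time spent per category."""
    if not blocks:
        raise ValueError("no blocks logged")
    unknown = set(blocks) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unexpected labels: {sorted(unknown)}")
    counts = Counter(blocks)
    return {label: counts[label] / len(blocks) for label in CATEGORIES}


def pager_rooted_estimate(events: int, avg_resolution_hours: float,
                          engineers: int, hours_per_engineer: float) -> float:
    """Estimate toil fraction as (events x average resolution time) / available hours."""
    available = engineers * hours_per_engineer
    if available <= 0:
        raise ValueError("available hours must be positive")
    return events * avg_resolution_hours / available


# One engineer's 40-hour sample week: 80 half-hour blocks.
week = ["toil"] * 30 + ["engineering"] * 40 + ["overhead"] * 10
print(sample_week_breakdown(week))

# 120 pages and tickets in a month, 45 minutes each, across 4 engineers at 160 hours.
print(pager_rooted_estimate(120, 0.75, engineers=4, hours_per_engineer=160))
```

The pager-rooted number will usually undercount, because it ignores context switching and toil that never produces a ticket, so it is better read as a floor than as a measurement.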
The number to watch is not the absolute value but the trend. A team at 35% toil that’s been climbing for three quarters is in worse shape than a team at 55% that’s been falling.
Courtesy: Google SRE Workbook — Eliminating Toil
Common toil sources (and what to do with them)
| Toil pattern | Typical fix |
|---|---|
| Manual deploys / rollbacks | CI/CD pipeline + progressive delivery |
| Capacity adds for known-growing services | Autoscaling + capacity planning |
| Repeated investigation of the same alert | Better SLO-based alerting; runbook; auto-remediation |
| Tenant / user provisioning | Self-service portal in the internal developer platform (IDP) |
| Certificate / secret rotation | cert-manager, external-secrets, automated rotation |
| Answering the same question in Slack | Internal docs + searchable knowledge base |
| Postmortem write-up after every incident | Templates + drafting assistance (this one used to require a senior SRE; in 2026 a well-prompted agent does the first draft) |
The pattern: the toil itself is the spec for the tool that replaces it.
What’s changing in 2026
The interesting shift this year is which kinds of toil are now economically worth eliminating. Before 2024, automating “summarize this Slack thread into an incident report” or “explain this Prometheus alert in English” was rarely worth a half-quarter of engineering effort: the volume was too low and the variance too high. With production-grade LLMs and the Model Context Protocol (MCP) integration layer underneath them, the marginal cost of building those tools has dropped by roughly an order of magnitude.
That doesn’t change the 50% rule. It changes which toil sources you can attack with a week of work instead of a quarter, and it adds a new failure mode to watch for. An agent that runs once and produces a wrong runbook is a one-time mistake. An agent wired into your remediation loop can burn error budget every time it acts, so it deserves the same scrutiny as any other production component. (“When NOT to use AI in production SRE” is the longer version of that argument.)
The discipline is the same as it’s always been: name the toil, measure it, ship the thing that removes it, then measure again.
References
- Google SRE Book, Chapter 5: Eliminating Toil (the original definition)
- SRE Workbook: Identifying and Tracking Toil (practical measurement)
- Atlassian: Toil and how to measure it
- PagerDuty: Reducing toil to improve reliability
- DORA: How automation reduces operational toil (empirical data from the State of DevOps reports)
- Honeycomb: On toil and the cost of “doing the work” (Charity Majors’s recurring theme)
- USENIX SREcon: community talks on toil reduction