May 23, 2026 · 5 min read SRE Reliability Fundamentals

Toil and the 50% rule — what it is, how to measure it, and how to kill it

A primer on toil — the manual, repetitive work that eats SRE teams. Google's six-part definition, the 50% cap, and how AI agents change the playbook.

Every operations team has work that nobody wants to do twice but somehow does fifty times a month. Restart the stuck job. Re-run the failed deploy. Add the new tenant to the IAM group. Answer the same support question. Individually, each task takes minutes. Collectively, it consumes the team and produces nothing durable.

That work has a name in SRE: toil. And the discipline has a specific answer for it — measure it, cap it, then engineer it away.

A note on framing. I use “SRE” inclusively here — the toil problem belongs to DevOps engineers, platform teams, cloud architects, and traditional ops just as much as it belongs to people with “SRE” in their title. The 50% rule is a load-bearing idea wherever production work happens.

[Figure: Toil vs engineering work, the 50% cap visualized. Courtesy: Google SRE Book, Eliminating Toil, Chapter 5]

What toil actually means

Toil is not “work I find boring.” Google’s SRE book defines it precisely: toil is the kind of work tied to running a production service that tends to have the following properties. The more of them a task has, the more likely it is toil.

Manual — a human has to do it.
Repetitive — it happens again and again, the same way.
Automatable — a machine could do it, given the design effort.
Tactical — reactive, interrupt-driven, not strategic.
Without enduring value — when you’re done, the service is in the same state as before, not improved.
Scales linearly with the service — twice the traffic, twice the work.

The last point is the killer one. Toil is what makes a service more expensive to operate as it grows. Engineering work, by contrast, scales sublinearly: a deduplication system you build once handles 10× the load with little or no extra human effort.
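To make the scaling contrast concrete, here is a small worked sketch. The request counts, minutes per request, and upkeep hours are illustrative assumptions, not figures from any real team, and the one-time cost of building the automation is left out for brevity.

```python
# Illustrative numbers only; none of these figures come from a real team.

def manual_hours_per_month(requests: int, minutes_each: float) -> float:
    """Toil: human time grows in direct proportion to request volume."""
    return requests * minutes_each / 60


def automated_hours_per_month(upkeep_hours: float) -> float:
    """Automation: a flat upkeep cost that does not depend on request volume."""
    return upkeep_hours


for requests in (200, 2000):
    manual = manual_hours_per_month(requests, minutes_each=6)
    automated = automated_hours_per_month(upkeep_hours=2)
    print(f"{requests} requests: manual {manual:.0f} h, automated {automated:.0f} h")
```

Under these assumptions, ten times the requests means ten times the manual hours (20 h becomes 200 h), while the automated path stays at 2 h.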

What is not toil, even though it might feel like it:

Overhead (meetings, email, HR) — annoying, but not service-tied.
Engineering work that’s tedious — refactoring a config system is still engineering; the output is durable.
One-time setup work — painful, but doesn’t repeat.

The line matters because the 50% cap only applies to the toil column.
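The checklist above can be sketched as a small classifier. This is a simplification under stated assumptions: the `WorkItem` fields and the `classify` function are hypothetical names, the strict "all six properties" test is stricter than the "tends to be" wording in the SRE book, and anything service-tied that fails the checklist is booked as engineering purely to keep the example short.

```python
from dataclasses import dataclass
from enum import Enum


class Category(Enum):
    TOIL = "toil"
    ENGINEERING = "engineering"
    OVERHEAD = "overhead"


@dataclass(frozen=True)
class WorkItem:
    """Hypothetical record; each flag mirrors one property from the checklist."""
    name: str
    service_tied: bool
    manual: bool
    repetitive: bool
    automatable: bool
    tactical: bool
    leaves_enduring_value: bool
    scales_with_service: bool


def classify(item: WorkItem) -> Category:
    # Work that is not tied to running a service is overhead, never toil.
    if not item.service_tied:
        return Category.OVERHEAD
    toil_signals = (
        item.manual,
        item.repetitive,
        item.automatable,
        item.tactical,
        not item.leaves_enduring_value,
        item.scales_with_service,
    )
    # Assumption: require all six signals; a real team might use a threshold.
    return Category.TOIL if all(toil_signals) else Category.ENGINEERING


restart = WorkItem("restart stuck job", True, True, True, True, True, False, True)
print(restart.name, "->", classify(restart).value)
```

A team that prefers the looser reading could replace `all(toil_signals)` with a count, for example four or more signals.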

The 50% rule

Google’s SRE guideline: keep toil below 50% of each SRE’s time. At least the other half should go to engineering work that reduces future toil: automation, tooling, architectural improvements, capacity planning, postmortem follow-ups.

The rule exists because toil has a gravitational pull. If you let the toil percentage drift past 50%, the team has no time to build the things that would bring it back down, and the trend is one-way. The 50% cap is a circuit breaker on that drift.

Teams routinely violate it for short bursts: an incident, a launch, an oncall storm. That’s fine. The discipline is to notice the violation, name it, and route capacity in the following quarter toward burning it back down.
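As a rough sketch of what "noticing the violation" can look like in numbers, the helper below computes the toil share for a period and how many hours sit above the cap. The function names and the 480-hour quarter are assumptions chosen for illustration.

```python
TOIL_CAP = 0.50  # the 50% guideline discussed above


def toil_share(toil_hours: float, total_hours: float) -> float:
    """Fraction of working time spent on toil in the period."""
    if total_hours <= 0:
        raise ValueError("total_hours must be positive")
    return toil_hours / total_hours


def hours_over_cap(toil_hours: float, total_hours: float, cap: float = TOIL_CAP) -> float:
    """Hours of toil above the cap; 0.0 when the period is within the cap."""
    return max(0.0, toil_hours - cap * total_hours)


# Assumed example: one engineer, a 480-hour quarter, 290 hours of toil.
print(f"share: {toil_share(290, 480):.0%}, over cap by {hours_over_cap(290, 480):.0f} h")
```

The hours-over-cap figure is one way to size the engineering time that needs to be protected in the following quarter.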

How to measure toil honestly

You can’t manage what you don’t measure, and toil measurement is where most teams get vague. Three approaches that work:

Time-tracking sample weeks. Pick one week per quarter. Each SRE logs every 30-minute block as toil / engineering / overhead. Cheap, surprisingly accurate, hard to game.
Ticket categorization. Tag every support ticket and oncall page as toil or not. Roll up monthly. Works if your ticketing discipline is already good.
Pager-rooted estimate. Count pages and tickets; multiply by an average resolution time. Less accurate but easy to automate.
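Two of these approaches reduce to simple arithmetic, sketched below. The category labels, function names, and sample figures are assumptions for illustration; a real log would come from whatever tracking tool the team already uses.

```python
from collections import Counter

CATEGORIES = ("toil", "engineering", "overhead")


def weekly_breakdown(blocks: list[str]) -> dict[str, float]:
    """Share of logged 30-minute blocks per category for one sample week."""
    unknown = set(blocks) - set(CATEGORIES)
    if unknown:
        raise ValueError(f"unknown categories: {sorted(unknown)}")
    if not blocks:
        raise ValueError("no blocks logged")
    counts = Counter(blocks)
    return {cat: counts[cat] / len(blocks) for cat in CATEGORIES}


def pager_estimate_hours(pages: int, tickets: int, avg_resolution_minutes: float) -> float:
    """Pager-rooted estimate: event count times an average resolution time."""
    return (pages + tickets) * avg_resolution_minutes / 60


# Assumed sample week: 80 half-hour blocks (40 hours) for one engineer.
week = ["toil"] * 32 + ["engineering"] * 40 + ["overhead"] * 8
print(weekly_breakdown(week))
print(pager_estimate_hours(pages=30, tickets=50, avg_resolution_minutes=45))
```

The pager-rooted number is only as good as the assumed average resolution time, which is why it is the least accurate of the three.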

The number to watch is not the absolute value but the trend. A team at 35% toil that’s been climbing for three quarters is in worse shape than a team at 55% that’s been falling.
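One hedged way to put a number on "the trend" is a least-squares slope over the quarterly toil shares. The function name and the two sample series are assumptions; with only four data points the slope is a rough signal, not a statistic to lean on heavily.

```python
def trend_per_quarter(toil_shares: list[float]) -> float:
    """Least-squares slope of toil share against quarter index (0, 1, 2, ...)."""
    n = len(toil_shares)
    if n < 2:
        raise ValueError("need at least two quarters")
    mean_x = (n - 1) / 2
    mean_y = sum(toil_shares) / n
    num = sum((x - mean_x) * (y - mean_y) for x, y in enumerate(toil_shares))
    den = sum((x - mean_x) ** 2 for x in range(n))
    return num / den


# Assumed series: one team climbing to 35%, another falling to 55%.
climbing = [0.26, 0.29, 0.32, 0.35]
falling = [0.64, 0.61, 0.58, 0.55]
print(f"climbing: {trend_per_quarter(climbing):+.3f} per quarter")
print(f"falling:  {trend_per_quarter(falling):+.3f} per quarter")
```

A positive slope on a team still under the cap is the early warning; a negative slope on a team over the cap means the burn-down is working.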

[Figure: Toil trend chart over four quarters, direction matters more than absolute. Courtesy: Google SRE Workbook, Eliminating Toil]

Common toil sources (and what to do with them)

Toil pattern	Typical fix
Manual deploys / rollbacks	CI/CD pipeline + progressive delivery
Capacity adds for known-growing services	Autoscaling + capacity planning
Repeated investigation of the same alert	Better SLO-based alerting; runbook; auto-remediation
Tenant / user provisioning	Self-service portal in the internal developer platform (IDP)
Certificate / secret rotation	cert-manager, external-secrets, automated rotation
Answering the same question in Slack	Internal docs + searchable knowledge base
Postmortem write-up after every incident	Templates + drafting assistance (this one used to require a senior SRE; in 2026 a well-prompted agent does the first draft)

The pattern: the toil itself is the spec for the tool that replaces it.

What’s changing in 2026

The interesting shift this year is which kinds of toil are now economically worth eliminating. Before 2024, automating “summarize this Slack thread into an incident report” or “explain this Prometheus alert in English” was rarely worth a half-quarter of engineering effort: the volume was too low and the variance too high. With production-grade LLMs and the Model Context Protocol (MCP) integration layer underneath them, the marginal cost of building those tools dropped by roughly an order of magnitude.

That doesn’t change the 50% rule. It changes which toil sources you can attack with a week of work instead of a quarter, and it adds a new failure mode to watch for. An agent that runs once and produces a wrong runbook is a one-time mistake. An agent wired into your remediation loop can now burn your error budget, and it deserves the same scrutiny as any other production component. (When NOT to use AI in production SRE is the longer version of that argument.)
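One possible shape for that scrutiny is a guard around the automated action that tracks its own failure rate and stops acting once the rate gets too high. This is a minimal sketch, not a recommended design: `GuardedRemediation`, its thresholds, and the convention that `action` returns `True` when the fix is verified are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class GuardedRemediation:
    """Wraps an automated fix and stops running it if it fails too often."""
    action: Callable[[], bool]      # hypothetical: returns True if the fix verified healthy
    max_failure_rate: float = 0.2   # assumed threshold
    min_attempts: int = 5           # do not judge the action on too few runs
    attempts: int = 0
    failures: int = 0

    @property
    def failure_rate(self) -> float:
        return self.failures / self.attempts if self.attempts else 0.0

    @property
    def tripped(self) -> bool:
        return self.attempts >= self.min_attempts and self.failure_rate > self.max_failure_rate

    def run(self) -> str:
        if self.tripped:
            return "escalate"  # skip the automation and hand the page to a human
        self.attempts += 1
        try:
            ok = self.action()
        except Exception:
            ok = False  # a crashing remediation counts as a failure
        if not ok:
            self.failures += 1
            return "failed"  # callers should still page a human for this event
        return "fixed"


guard = GuardedRemediation(action=lambda: False, min_attempts=3)
print([guard.run() for _ in range(4)])
```

The `attempts` and `failures` counters are the raw material for treating the automation's own success rate as something to monitor.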

The discipline is the same as it’s always been: name the toil, measure it, ship the thing that removes it, then measure again.

References

Eliminating Toil — Google SRE Book, Chapter 5 — the original definition
Identifying and Tracking Toil — SRE Workbook — practical measurement
Atlassian — Toil and how to measure it
PagerDuty — Reducing toil to improve reliability
DORA — How automation reduces operational toil — empirical data from the State of DevOps reports
Honeycomb — On toil and the cost of “doing the work” — Charity Majors’s recurring theme
USENIX SREcon — community talks on toil reduction