Jun 5, 2026 · 6 min read AI Platform Engineering Reliability

Agent sprawl is your next production incident

Teams shipping AI agents are recreating 2015’s microservices sprawl with worse observability. Here is the governance surface that contains it before it pages you.

There’s a graph every platform team will recognize by the end of this year. The number of AI agents running against your production systems, plotted over time. It starts at one — somebody’s incident-investigation copilot. Six months later it’s forty: a cost agent, three teams’ worth of “ask the runbook” bots, a deploy-gate agent, a half-dozen MCP servers wired into Slack, and a long tail of things nobody can quite name an owner for.

I’ve watched this curve before. It’s the microservices adoption curve from 2015 to 2020, and we already know how that story goes. The first ten services felt like liberation. Service one hundred felt like a distributed-systems tax you pay every single on-call shift. Datadog’s State of AI Engineering 2026 put numbers on the AI version of it, and the shape is identical — agent infrastructure complexity is now growing faster than most teams’ ability to measure or govern it.

That gap has a name. Agent sprawl: the condition where the count and reach of your AI agents outpace the reliability surface you have to observe and contain them. And like microservices sprawl, it doesn’t announce itself. It shows up as your next production incident.

Why this is the same problem

Strip the AI vocabulary away and an agent is a thing that makes network calls, holds state, depends on other services, and acts on your infrastructure. That’s a service. Everything we learned about running a lot of services applies — which is exactly why the failure modes rhyme.

No service catalog. In 2017 the question was “how many services do we run and who owns each one.” In 2026 it’s “how many agents have credentials to our cluster and who approved them.” Most teams cannot answer the second one today, the same way they couldn’t answer the first one then.
Unbounded fan-out. A microservice calls three services that call five each. An agent calls a tool that triggers another agent that opens a PR that fires a webhook. The blast radius of one bad decision is no longer one process — it’s a chain you didn’t draw.
Emergent, untested interactions. Two agents that each behave correctly in isolation can deadlock, thrash, or amplify each other in production. This is the distributed-systems lesson re-learned: the interesting failures live between the components, not inside them.
Observability that lags the architecture. We instrumented individual services beautifully and still could not answer “why is this request slow” until distributed tracing caught up. We are at the same lag point with agents right now: rich per-agent logs, almost no cross-agent causal picture.
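To make the fan-out point concrete, here is a minimal sketch that computes everything one agent can reach transitively. The call graph and every name in it are hypothetical; in practice you would derive the edges from tool manifests or observed calls rather than hand-write them.

```python
from collections import deque

# Hypothetical call graph: each key is an agent, tool, or system, and each
# value lists what it can trigger directly. All names are illustrative.
CALL_GRAPH = {
    "triage-agent": ["runbook-tool", "deploy-agent"],
    "deploy-agent": ["github-pr-tool"],
    "github-pr-tool": ["ci-webhook"],
    "ci-webhook": ["deploy-gate-agent"],
    "deploy-gate-agent": [],
    "runbook-tool": [],
}


def blast_radius(graph, start):
    """Return every node reachable from `start`, excluding `start` itself."""
    seen = {start}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for nxt in graph.get(node, []):
            if nxt not in seen:  # also guards against cycles
                seen.add(nxt)
                queue.append(nxt)
    seen.discard(start)
    return seen


reach = blast_radius(CALL_GRAPH, "triage-agent")
print(sorted(reach))
```

The triage agent calls only two things directly, but its reach is the whole chain behind them. That transitive set, not the direct call list, is the number worth reviewing.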

If you lived through the first sprawl, none of this is new. What’s new is the speed.

What’s actually different — and worse

Microservices sprawl took most organizations five years. Agent sprawl is compressing into one, and three properties make it bite harder.

1. Agents are non-deterministic. A microservice given the same input and the same state returns the same output. An agent given the same incident might investigate it two different ways on two different days. You cannot regression-test your way to confidence the way you could with a service contract. Reliability becomes a distribution you monitor, not a property you assert, and most teams are still asserting.
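One way to treat reliability as a distribution is to replay the same scenario many times and put an interval around the pass rate, instead of asserting on a single run. The sketch below uses a Wilson score interval. The 50-run sample and the targets are made-up numbers for illustration, and how you decide that a run “passed” is left as an assumption.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial success rate (z=1.96 is ~95%)."""
    if trials == 0:
        raise ValueError("need at least one trial")
    p = successes / trials
    denom = 1 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


def meets_target(outcomes, target):
    """True only if the lower bound of the interval clears the target rate."""
    low, _ = wilson_interval(sum(outcomes), len(outcomes))
    return low >= target


# Illustrative data: 46 acceptable runs out of 50 replays of one incident.
outcomes = [True] * 46 + [False] * 4
low, high = wilson_interval(sum(outcomes), len(outcomes))
print(f"observed 92%, plausible range {low:.2f} to {high:.2f}")
print("clears a 90% target:", meets_target(outcomes, 0.90))
```

With only 50 runs, an observed 92% does not support a claim of “at least 90% reliable”, because the lower bound sits well below that. A single green regression run would have hidden this.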

2. The dangerous ones take actions. A read-only copilot that drafts a root-cause summary is low-stakes sprawl. But the curve doesn’t stop at read-only. The moment an agent can restart a pod, merge a PR, or scale a node group, sprawl stops being a cost problem and becomes a safety problem. The industry consensus in 2026 is telling: most teams running AI for SRE still keep agents firmly out of production-mutating paths, precisely because the governance to do it safely isn’t standard yet.

3. The failure cause is invisible by default. When a microservice failed, the cause was in a log, a metric, or a trace. When an agent fails, the most common root cause is a context deficiency: it didn’t have the governed, correct information it needed to behave well. That cause lives in the assembled prompt, which almost nobody is capturing as telemetry. You can have perfect infrastructure metrics and still be blind to why your agent did the wrong thing. I’ve argued the build side of this in my writing on context engineering; sprawl is what happens when you scale that out across forty agents with no shared discipline.
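As a rough illustration of capturing the assembled prompt as telemetry, the sketch below records which sources fed a prompt plus a digest of the final text, so a later investigation can tell what the agent was working from. The record shape, field names, and sample values are all assumptions of mine, not a standard schema.

```python
import hashlib
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class ContextRecord:
    """Hypothetical telemetry record for one assembled prompt."""
    agent_id: str
    run_id: str
    sources: tuple       # identifiers of the documents or tools that fed the prompt
    prompt_sha256: str   # digest, so identical contexts can be matched without storing text
    prompt_chars: int


def record_context(agent_id, run_id, sources, assembled_prompt):
    digest = hashlib.sha256(assembled_prompt.encode("utf-8")).hexdigest()
    return ContextRecord(agent_id, run_id, tuple(sources), digest, len(assembled_prompt))


# Illustrative values only.
prompt = "You are an incident investigator.\n[runbook:db-failover]\n[alert:latency-p99]"
rec = record_context("triage-agent", "run-0001", ["runbook:db-failover", "alert:latency-p99"], prompt)
line = json.dumps(asdict(rec), sort_keys=True)  # one JSON line per run, ready for a log pipeline
print(line)
```

Storing a digest and the source list rather than the raw prompt is a deliberate trade: it keeps sensitive text out of logs while still letting you ask “did this run see the runbook or not”. Whether you also retain the full prompt is a retention and privacy decision.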

The containment surface

Here’s the load-bearing claim of this post: you do not fix agent sprawl by having fewer agents. You fix it the same way you fixed microservices sprawl — by building the platform surface that makes a large number of them governable. Three layers, in priority order.

1. A catalog — agents are inventory

You cannot govern what you cannot enumerate. Every agent needs a registered identity, a named owner, a declared set of tools it may call, and a declared blast radius. This is not a spreadsheet; it’s the same instinct as a service catalog, and it’s the precondition for everything else. If a new agent can reach production without showing up in an inventory, you’ve already lost.
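A minimal sketch of what a catalog entry could hold, assuming the four fields named above. The class names, the blast-radius labels, and the sample agents are hypothetical; a real catalog would live in a durable store and hook into provisioning.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentEntry:
    agent_id: str
    owner: str                 # a named team or person, never blank
    allowed_tools: frozenset   # the declared set of tools it may call
    blast_radius: str          # illustrative labels: "read-only", "namespace", "cluster"


class Catalog:
    def __init__(self):
        self._entries = {}

    def register(self, entry):
        if not entry.owner.strip():
            raise ValueError(f"{entry.agent_id}: an agent must have a named owner")
        if entry.agent_id in self._entries:
            raise ValueError(f"{entry.agent_id}: already registered")
        self._entries[entry.agent_id] = entry

    def get(self, agent_id):
        return self._entries.get(agent_id)

    def unregistered(self, observed_ids):
        """Agents seen acting in production that the catalog does not know about."""
        return sorted(set(observed_ids) - set(self._entries))


catalog = Catalog()
catalog.register(AgentEntry("cost-agent", "finops-team", frozenset({"get_spend"}), "read-only"))
print(catalog.unregistered(["cost-agent", "mystery-slack-bot"]))
```

The `unregistered` check is the part that matters operationally: compare what is actually calling your systems against the inventory, and treat any difference as a finding.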

2. A gateway — one governed path to action

The single highest-leverage move is to stop letting agents talk to your infrastructure directly and route every tool call through a governed broker. I’ve written the full version of this as the MCP gateway pattern: one place that authenticates the agent, authorizes the specific tool call against policy, resolves conflicts, and writes an audit record. The gateway is what turns “forty agents with cluster credentials” into “forty agents whose every action is brokered, logged, and revocable from one seat.” Sprawl without a gateway is forty independent attack surfaces; sprawl with one is a single, observable funnel.
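Here is a deliberately small sketch of that broker: authenticate, authorize the specific tool call, write an audit record, then execute. It is not an implementation of MCP or of any real gateway product. The policy table, tokens, and tool functions are invented for illustration, and conflict resolution is left out.

```python
import hmac
import time


class Denied(Exception):
    """Raised when the gateway refuses a tool call."""


# Hypothetical policy table: agent id -> shared token and permitted tools.
POLICY = {
    "cost-agent": {"token": "s3cret-cost", "allowed_tools": {"get_spend"}},
}
AUDIT_LOG = []  # stand-in for durable, append-only storage


def broker_call(agent_id, token, tool, args, tools):
    entry = POLICY.get(agent_id)
    if entry is None or not hmac.compare_digest(entry["token"], token):
        decision = "deny:unauthenticated"
    elif tool not in entry["allowed_tools"] or tool not in tools:
        decision = "deny:unauthorized"
    else:
        decision = "allow"
    # Audit every attempt, including the denied ones.
    AUDIT_LOG.append({"ts": time.time(), "agent": agent_id, "tool": tool,
                      "args": args, "decision": decision})
    if decision != "allow":
        raise Denied(decision)
    return tools[tool](**args)


def revoke(agent_id):
    """Cut off an agent from one place; later calls fail authentication."""
    POLICY.pop(agent_id, None)


# Illustrative tools; a real gateway would proxy to actual systems.
TOOLS = {
    "get_spend": lambda team: {"team": team, "usd": 1200},
    "restart_pod": lambda name: f"restarted {name}",
}

print(broker_call("cost-agent", "s3cret-cost", "get_spend", {"team": "payments"}, TOOLS))
```

Note that the audit write happens before the allow or deny branch, so refused attempts are recorded too. Those denied entries are often the earliest signal that an agent is reaching for something it was never meant to touch.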

3. Cross-agent observability — trace the chain, not the agent

Per-agent dashboards are necessary and insufficient — they’re the equivalent of per-service logs before tracing existed. What you need is the causal chain: which agent triggered which tool that woke which other agent, and what context each one was working from. The 2026 shift toward capturing the assembled context as a first-class span (OpenTelemetry-compatible, exportable to whatever backend you already run) is the agent-era equivalent of distributed tracing finally catching up to microservices. I’ll go deeper on this in a companion post on observability for AI systems — the short version: if you can’t reconstruct why an agent acted from your telemetry, you’re not observing it, you’re just watching it.
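To show what “trace the chain” means mechanically, the sketch below walks parent links from one action back to the event that started it. The span shape is a simplified stand-in I made up for illustration, not the OpenTelemetry data model, and all span contents are invented.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Span:
    span_id: str
    parent_id: Optional[str]   # the span that caused this one, if any
    actor: str                 # agent or tool name
    action: str
    context_sha256: Optional[str] = None  # digest of the assembled context, when the actor is an agent


def causal_chain(spans, span_id):
    """Return the spans from the root cause down to `span_id`, in order."""
    by_id = {s.span_id: s for s in spans}
    chain, seen = [], set()
    current = by_id.get(span_id)
    while current is not None and current.span_id not in seen:  # stop on cycles
        seen.add(current.span_id)
        chain.append(current)
        current = by_id.get(current.parent_id) if current.parent_id else None
    chain.reverse()
    return chain


# Illustrative spans only.
spans = [
    Span("s1", None, "triage-agent", "investigate alert", context_sha256="ab12"),
    Span("s2", "s1", "github-pr-tool", "open PR"),
    Span("s3", "s2", "deploy-gate-agent", "evaluate PR", context_sha256="cd34"),
    Span("s4", "s3", "k8s-tool", "restart pod"),
]
for s in causal_chain(spans, "s4"):
    print(f"{s.actor}: {s.action} (context={s.context_sha256})")
```

Starting from the pod restart, the walk surfaces both agents involved and the context digest each one acted on, which is the question a per-agent dashboard cannot answer.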

The honest version

I’m not anti-agent. I’ve spent the last year building agentic infrastructure for a safety-critical platform, and the leverage is real — the MTTR wins teams are reporting in 2026 are not marketing. But every one of those wins is also a new service you now have to run, and the industry is collectively under-counting that second half of the trade.

The teams that will be fine in twelve months are not the ones who deployed the most agents. They’re the ones who treated the third agent as a signal to build the catalog, stood up a gateway before they had ten, and instrumented the chain before it was load-bearing. Microservices sprawl taught us that the platform investment always feels premature right up until the incident that proves it wasn’t.

Build the containment surface now. The curve is already pointing up.