Jun 21, 2026 · 7 min read AI Reliability MCP

Chaos engineering for MCP: break your tool-call plane before production does

LLM calls fail 1–5% of the time and agent tasks fan out into 10–20 tool calls. Here’s how to fault-inject your MCP layer with mcp-chaos before production breaks it for you.

Every model-comparison post and every agent demo this year answers the same question: can the model reason? It’s the wrong question for production. In production, your agent’s reliability has very little to do with the model and almost everything to do with the dozen tools it calls to get its job done.

Here’s the number that reframes it. LLM API calls fail somewhere between 1% and 5% of the time in production. A multi-step agent makes 10–20 tool calls per task. Run that math and a meaningful fraction of your agent’s tasks will hit at least one failure — and because each step treats the previous step’s output as reliable truth, that one failure doesn’t stay contained. It cascades.
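To make "run that math" concrete, here is a minimal sketch of the compounding, assuming every call fails independently with the same probability (real failures are often correlated, so treat the output as a rough estimate). The function name is illustrative, not part of any tool mentioned here. Under that assumption, a 1% per-call rate over 10 calls comes out near 10% of tasks, and a 5% rate over 20 calls comes out near 64%.

```python
def task_failure_probability(per_call_failure: float, calls_per_task: int) -> float:
    """P(at least one call fails) = 1 - (1 - p)^n, assuming independent failures."""
    return 1.0 - (1.0 - per_call_failure) ** calls_per_task


if __name__ == "__main__":
    # Sweep the ranges quoted in the text: 1-5% per call, 10-20 calls per task.
    for p in (0.01, 0.02, 0.05):
        for n in (10, 20):
            share = task_failure_probability(p, n)
            print(f"p={p:.0%} n={n:>2} -> {share:.1%} of tasks hit at least one failure")
```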

The industry is starting to notice. There are now open-source tools like agent-chaos that “break your agent on purpose before production does,” research benchmarks like ReliabilityBench that apply chaos-engineering principles to LLM agents, and reporting that enterprises are generating agent failures they don’t even classify yet. Good. But almost all of it injects failures at the model call. The layer that actually breaks most often is one tier down — and nobody is testing it.

The MCP layer is a distributed-systems tier with no reliability model

When your agent calls a tool, the call goes through the Model Context Protocol: an MCP server, usually a gateway in front of it, a registry that says which tools exist, and the retries, timeouts, and schemas gluing them together. That is not “a config detail.” That is a new distributed-systems tier (servers, routing, fan-out, partial failure) sitting on the hot path of every agent action.

The research that exists treats this layer almost entirely as a security surface: tool poisoning, prompt injection, registry hijacking. Important work, and I’ve written about the identity and policy side of it. But hardening a layer against attackers is not the same as making it reliable against ordinary operational failure. A tool doesn’t have to be poisoned to ruin your incident response. It just has to be slow, or return a slightly different JSON shape than last week, or have moved while the registry wasn’t looking.

I call this tier the tool-call plane, and the argument I make at length in the companion paper is simple: it should be engineered as a first-class reliability concern, with its own SLIs and its own chaos testing — exactly the way you’d treat a service mesh data plane.

A failure taxonomy worth injecting

Before you can inject failures meaningfully, you need to know what failures this plane actually produces. These are the operational — not adversarial — failure modes that take down agent workflows:

Cold or missing tool. The agent plans to call a tool that isn’t registered, isn’t warm, or isn’t reachable. What does it do — hallucinate the result, or fail cleanly?
Schema drift. The tool returns a shape the agent didn’t expect. A renamed field, a string where a number was, a newly-nullable column. The agent parses garbage and acts on it.
Registry staleness. The gateway routes to a tool endpoint that has moved or been deprecated. The call succeeds against the wrong thing.
Partial-result corruption. The tool returns some of the data — a truncated log, a paginated result the agent treats as complete — and the agent reasons over a half-truth.
Retry storms. A transient failure triggers retries from the agent, the gateway, and the client library simultaneously, amplifying load on an already-degraded server.
Fan-out amplification. One user intent expands into N tool calls across M servers. One slow server now gates the entire task. This is the multiplier that turns a 2% per-call failure rate into a much higher per-task failure rate.
Gateway head-of-line blocking. One slow tool call holds a connection and stalls the calls queued behind it.

Notice that none of these are exotic. They’re the same failures that have haunted distributed systems for thirty years — they’ve just moved onto a new plane that nobody has instrumented.
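The retry-storm entry above is easier to reason about with numbers. The sketch below assumes the worst case: every layer (agent, gateway, client library) retries independently and none of them knows about the others. The retry counts are hypothetical, chosen only to illustrate the multiplication.

```python
from math import prod


def worst_case_attempts(retries_per_layer: list[int]) -> int:
    """Upstream attempts for ONE logical call when every layer retries on its own.

    Each layer makes 1 original attempt plus its retries, and each of those
    attempts triggers the full retry behaviour of the layer beneath it.
    """
    return prod(1 + retries for retries in retries_per_layer)


if __name__ == "__main__":
    # Hypothetical stack: agent, gateway, and client library each retry twice.
    layers = [2, 2, 2]
    print(f"{worst_case_attempts(layers)} attempts hit the degraded server per logical call")
```

Three layers with two retries each turn one logical call into up to 27 attempts against a server that is already struggling, which is why a single shared retry budget tends to be safer than per-layer retries.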

mcp-chaos: fault injection for the tool-call plane

So I built a tool for it. mcp-chaos is a small, runnable reference — not a framework to adopt, but something to read, copy, and outgrow. One fault model and one metrics model sit behind three drivers: a deterministic offline simulator, plus live proxies for both MCP transports (stdio and streamable-HTTP). The fault logic is written once and proven by the offline demo, then runs unchanged against real servers — so the reproducible result can’t drift from what the live proxy actually does.

Start with the headline demo. It needs no API key and no server, and it’s fully deterministic:

mcp-chaos demo

mcp-chaos — tool-call-plane blast radius (fan-out, injected error)

no resilience    TCAF=8.0  TCSR= 0.98  ETA= 0.98  blast_radius= 0.12
with resilience  TCAF=8.0  TCSR= 1.00  ETA= 1.00  blast_radius= 0.00

Read it top to bottom. A workload that fans out to TCAF ≈ 8 tool calls per task, with a 2% per-call injected error rate, fails 12% of its tasks (the blast_radius column in the first row). That is six times the per-call rate, because the failures compound across the fan-out. That’s the multiplier made concrete. Switch on three resilience patterns (a retry budget, graceful degradation, and a circuit breaker driven by ETA, the Effective Tool Availability SLI defined below) and the blast radius collapses to zero on the exact same injected faults. The gap between those two rows is the whole argument for testing this layer, and it’s reproducible on your machine in one command.
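For intuition about where a number like that comes from, here is a rough Monte Carlo sketch. It is not mcp-chaos's simulator: it assumes independent failures, models only simple per-call retries (no shared retry budget, degradation, or circuit breaker), and all names are illustrative. Its output will therefore not match the demo's seeded figures exactly. For reference, the closed-form value for 8 independent calls at 2% is 1 - 0.98^8, about 15%, which is in the same range as the demo's 0.12.

```python
import random


def simulate_blast_radius(tasks: int, fan_out: int, per_call_error: float,
                          max_retries: int, seed: int) -> float:
    """Fraction of tasks in which at least one tool call fails after retries."""
    rng = random.Random(seed)  # seeded so reruns give identical numbers
    failed_tasks = 0
    for _ in range(tasks):
        for _ in range(fan_out):
            # A call fails only if the first attempt and every retry fail.
            attempts = 1 + max_retries
            if all(rng.random() < per_call_error for _ in range(attempts)):
                failed_tasks += 1
                break  # one failed call fails the whole task
    return failed_tasks / tasks


if __name__ == "__main__":
    for label, retries in (("no retries", 0), ("2 retries per call", 2)):
        radius = simulate_blast_radius(20_000, 8, 0.02, retries, seed=42)
        print(f"{label:<20} blast_radius={radius:.3f}")
```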

To point it at a real server, write a declarative, seeded fault plan:

# fault-plan.yaml — inject tool-call-plane failures, reproducibly
seed: 42
faults:
  - type: latency                 # push slow tools toward the step deadline
    added_ms: 4000
    jitter_ms: 1500
    probability: 0.3
  - type: schema_drift            # return a shape the agent didn't expect
    tools: ["get_incident"]
    probability: 0.2
  - type: partial_result          # a truncated result the agent treats as complete
    tools: ["fetch_logs"]
    probability: 0.5
  - type: registry_inconsistency  # advertised tool that errors on call
    tools: ["restart_service"]
  - type: error_injection         # the demo's 2% per-call failure
    rate: 0.02

# wrap a real stdio MCP server and inject into its live JSON-RPC traffic
mcp-chaos proxy stdio --plan fault-plan.yaml -- python my_mcp_server.py

# or front a streamable-HTTP MCP server
mcp-chaos proxy http  --plan fault-plan.yaml --upstream http://localhost:8080

The proxy fails open by default: if mcp-chaos itself errors, it passes the real session through rather than breaking it (pass --strict to fail closed instead). The result that matters is the split among tasks that hit an injected fault: how many failed safely (the agent noticed, degraded, or escalated) versus how many failed silently (it produced a confident, wrong answer and acted on it). The silent failures are the ones that become incidents.
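One way to tally that split from a test run is sketched below. The record fields and category names are assumptions made for illustration; they are not mcp-chaos output. In particular, the "absorbed" bucket (a fault was injected but the answer was still right) is added here so that the silent-failure rate counts only wrong answers.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class TaskOutcome:
    fault_injected: bool   # did this task hit at least one injected fault
    agent_flagged: bool    # did the agent notice, degrade, or escalate
    answer_correct: bool   # was the final answer right (judged by your eval)


def classify(outcome: TaskOutcome) -> str:
    if not outcome.fault_injected:
        return "no_fault"
    if outcome.agent_flagged:
        return "failed_safe"
    if outcome.answer_correct:
        return "absorbed"       # fault had no visible effect on the result
    return "failed_silent"      # confident, wrong, and unflagged


def silent_failure_rate(outcomes: list[TaskOutcome]) -> float:
    """Share of fault-hit tasks that ended in an unflagged wrong answer."""
    counts = Counter(classify(o) for o in outcomes)
    faulted = len(outcomes) - counts["no_fault"]
    return counts["failed_silent"] / faulted if faulted else 0.0
```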

A tool that is available is not the same as a tool that is correct. Most monitoring conflates the two — and the gap between them is exactly where agents go wrong.

Measure the plane, not just the model

Two SLIs are worth standing up before you grant an agent any write access (the kind of access I argue you should ration carefully in bounded autonomy):

Effective Tool Availability (ETA). Decompose tool “uptime” into availability (did the call return) and correctness (was the result usable and well-formed). A tool can be 99.9% available and 92% correct, and your agent feels the 92%.
Tool-Call Amplification Factor (TCAF). How many tool invocations one user intent fans out into, and how failures compound across them. A task with a TCAF of 15 and a 2% per-call failure rate fails roughly a quarter of the time (1 - 0.98^15 ≈ 26%, assuming independent failures). That’s the number that should scare you, and it’s invisible if you only watch the model.
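A minimal sketch of the availability/correctness split, computed from per-call records. The record shape and the "effective" ratio (usable results over all calls, which equals availability times correctness) are assumptions for illustration; the text above does not prescribe an exact formula.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    returned: bool  # the call came back at all (no timeout or transport error)
    usable: bool    # the payload was well-formed and complete


def tool_slis(calls: list[ToolCall]) -> dict[str, float]:
    """Split tool 'uptime' into availability, correctness, and their product."""
    total = len(calls)
    returned = sum(c.returned for c in calls)
    usable = sum(c.returned and c.usable for c in calls)
    return {
        # Share of calls that returned anything.
        "availability": returned / total if total else 0.0,
        # Share of returned calls whose result was usable.
        "correctness": usable / returned if returned else 0.0,
        # Share of all calls that produced a usable result.
        "effective": usable / total if total else 0.0,
    }
```

Tracking the three numbers separately keeps a healthy-looking availability figure from hiding a correctness problem.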

These belong in the same place as the rest of your AI observability — the layer where, as I’ve argued before, context deficiency is so often the real root cause.

Degradation patterns that actually hold

Finding the failures is half of it. The other half is the catalog of patterns that contain them — and they’re the classic SRE patterns, ported to the tool-call plane:

Tool circuit breakers — stop calling a tool that’s failing its correctness SLI; don’t let the agent keep feeding on bad results.
Deadline-aware tool routing — if a tool can’t return inside the agent’s step budget, don’t make the call; route to a degraded path or escalate.
Idempotent tool contracts — so a retry storm replays safely instead of double-acting.
Graceful tool degradation — a tool that can return partial results should say so, and the agent should treat “partial” as a first-class state, not as “complete.”
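As an illustration of the first pattern, here is a small circuit breaker keyed on result correctness rather than on transport errors. It is a sketch under stated assumptions: the class name, thresholds, and window size are hypothetical, and it simplifies the usual half-open state by clearing its history after the cooldown instead of admitting a single probe call.

```python
import time
from collections import deque


class ToolCircuitBreaker:
    """Stops calls to a tool whose recent results fall below a correctness floor."""

    def __init__(self, window=20, min_correct_ratio=0.9, min_samples=5,
                 cooldown_s=30.0, clock=time.monotonic):
        self._results = deque(maxlen=window)  # sliding window of True/False
        self._min_ratio = min_correct_ratio
        self._min_samples = min_samples
        self._cooldown_s = cooldown_s
        self._clock = clock                   # injectable for tests
        self._opened_at = None                # None means the breaker is closed

    def allow(self) -> bool:
        """Return True if the agent may call the tool right now."""
        if self._opened_at is None:
            return True
        if self._clock() - self._opened_at >= self._cooldown_s:
            # Cooldown elapsed: forget old results and let traffic through again.
            self._opened_at = None
            self._results.clear()
            return True
        return False

    def record(self, correct: bool) -> None:
        """Record whether the latest result was usable; open if the ratio drops."""
        self._results.append(correct)
        if len(self._results) >= self._min_samples:
            ratio = sum(self._results) / len(self._results)
            if ratio < self._min_ratio:
                self._opened_at = self._clock()
```

The agent would call `allow()` before each tool invocation and `record()` after validating the result, and would route to a degraded path or escalate whenever `allow()` returns False.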

This is also the antidote to agent sprawl: a fleet of agents on an untested tool-call plane is a fleet of correlated failures waiting for one slow MCP server.

What to do Monday

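Put a number on your TCAF. For your most-used agent task, count the tool calls, then compound the per-call failure rate across them: with per-call failure probability p and N calls, the per-task failure probability is roughly 1 - (1 - p)^N, assuming independent failures. If the per-task failure probability surprises you, you’ve been watching the wrong layer.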
Inject schema drift first. It’s the cheapest fault to inject and the most common in the wild — and it’s the one most likely to produce a confident, wrong answer rather than a clean error.
Split availability from correctness in your tool monitoring this week. The gap is your real reliability number.
Run your task corpus through a fault proxy before you grant write access. An agent that hasn’t been chaos-tested at the tool-call plane hasn’t earned a rung on the autonomy ladder.
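To show why schema drift is cheap to inject, here is a minimal, seeded sketch that mutates one top-level field of a tool result, plus a strict check that catches the mutation. This is not how mcp-chaos implements its schema_drift fault; the function names, the three mutations, and the example fields are assumptions for illustration.

```python
import copy
import random


def inject_schema_drift(payload: dict, rng: random.Random) -> dict:
    """Return a copy of payload with one top-level field renamed, stringified, or nulled."""
    drifted = copy.deepcopy(payload)  # never mutate the real tool result
    if not drifted:
        return drifted
    key = rng.choice(sorted(drifted))  # sorted so the choice depends only on the seed
    value = drifted[key]
    mutations = ["rename"]
    if not isinstance(value, str):
        mutations.append("stringify")  # e.g. the number 7 becomes the string "7"
    if value is not None:
        mutations.append("nullify")    # field becomes newly nullable
    mutation = rng.choice(mutations)
    if mutation == "rename":
        drifted[f"{key}_v2"] = drifted.pop(key)
    elif mutation == "stringify":
        drifted[key] = str(value)
    else:
        drifted[key] = None
    return drifted


def matches_schema(payload: dict, schema: dict[str, type]) -> bool:
    """Strict check: exactly the expected keys, each with the expected type."""
    return payload.keys() == schema.keys() and all(
        isinstance(payload[name], expected) for name, expected in schema.items()
    )


if __name__ == "__main__":
    # Hypothetical result shape for a get_incident-style tool.
    schema = {"id": int, "severity": int}
    result = {"id": 7, "severity": 2}
    drifted = inject_schema_drift(result, random.Random(42))
    print(drifted, "valid" if matches_schema(drifted, schema) else "drift detected")
```

An agent that validates tool results this strictly turns a silent wrong answer into a clean, observable error, which is the behavior the fault is meant to test for.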

The model is the part everyone benchmarks. The tool-call plane is the part that pages you. Break it on purpose, on a Tuesday, with mcp-chaos — so it doesn’t break itself on a Saturday, during the incident, while it’s supposed to be helping.

Frequently asked questions

What is the tool-call plane in an MCP-based agent system?

The tool-call plane is the operational tier that sits between an LLM agent and the tools, data sources, and actuators it invokes through the Model Context Protocol — the MCP servers, the registry, the gateway, and the retries, timeouts, and schemas that connect them. In production it behaves like a distributed-systems data plane: an agent's reliability is increasingly governed not by model quality but by the latency, availability, and correctness of this layer. It deserves its own SLIs and its own chaos testing, the same way a service mesh data plane does.

Why do AI agents need chaos engineering?

Because their dependencies fail constantly and silently. LLM API calls fail roughly 1–5% of the time, and a multi-step agent makes 10–20 tool calls per task, so a meaningful fraction of tasks hit at least one failure. Worse, agents treat a previous step's output as reliable truth, so an early failure propagates downstream as a cascade. Chaos engineering injects those failures on purpose — rate limits, timeouts, schema drift, partial results — so you find the cascade in a test instead of during a 3am incident.

What failures should you inject into an MCP server?

Inject ordinary operational failures, not just security attacks: cold or missing tools, schema drift (the tool returns a shape the agent didn't expect), registry staleness (the gateway routes to a tool that moved), partial-result corruption, retry storms, fan-out amplification across multi-server topologies, and gateway head-of-line blocking. These are the failure modes that the MCP security literature ignores because they aren't adversarial — but they're the ones that actually take down agent workflows in production.

How is mcp-chaos different from generic agent chaos tools?

Generic agent-chaos tools inject failures at the LLM call (rate limits, timeouts, stream interruptions). mcp-chaos targets the MCP tool-call plane specifically — it sits as a fault-injecting proxy in front of your MCP servers and corrupts schemas, delays tool responses past the agent's step deadline, drops tools from the registry, and amplifies fan-out — then measures the blast radius on the agent task. It treats MCP as a reliability surface, not only a security surface.