Jun 26, 2026 · 6 min read AI Platform Engineering Reliability

Agentic AI Patterns: Where They Break in Production (Part 2 of 3)

Every agentic AI pattern looks clean in a demo. Here is where each one breaks in production, the signal that tells you you’re hitting it, and the mitigations that actually work.

Agentic AI Patterns — Part 2 of 3 · Part 1: The Decision Guide · Part 3: The Maturity Model

Every pattern from Part 1 has a demo that works. The demo uses a clean input, a fast tool, and a forgiving evaluation. Production has none of those things.

This post covers the failure mode for each pattern — the specific way it breaks under realistic conditions — plus the operational signal that tells you you’re hitting it, and the mitigation that actually helps.

[Figure: Intelligent agent diagram showing the perception-reasoning-action loop with learning feedback. Credit: Utkarshraj Atmaram / Pduive23 / Wikimedia Commons, Public Domain]

The meta-lesson first

Before the pattern-by-pattern breakdown: agentic systems fail at boundaries. Tool call boundaries, context handoff boundaries, retry loop boundaries, and sub-agent interface boundaries. The reasoning inside the model is often fine. The failure is in the seam between the model and the world.

Keep that frame in mind as you read what follows. The mitigations are almost always about hardening a boundary, not improving the model.

1. ReAct — the infinite reasoning spiral

Failure mode: The agent enters a loop: a thought produces an action, the action’s observation triggers the same thought again, and the cycle repeats. Token consumption grows, cost spikes, and the task never completes.

Production signal: Step count that’s growing instead of converging. If your median ReAct task takes 6 steps but individual runs are hitting 40+, you’re spiraling. The model will often narrate its confusion — “I am not sure yet” or “Let me try again” repeated across steps is diagnostic.

Mitigation:

Hard cap on step count (not configurable by the agent). Terminate and return a partial result when the cap is hit.
Track step count as its own metric, not just total latency. Alert when a single run exceeds 2× the median.
Provide a done tool the agent must explicitly call to terminate. Agents that lack an exit condition drift.
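
The three mitigations above can be sketched together in a few lines. This is a minimal illustration, not a reference implementation: `policy` is a hypothetical stand-in for the model call, `MAX_STEPS = 12` is an assumed value, and `exceeds_alert_threshold` simply encodes the 2× median rule.

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

MAX_STEPS = 12  # assumed cap; owned by the runtime, never exposed to the agent


@dataclass
class RunResult:
    status: str                      # "done" or "step_cap_hit"
    steps: int                       # emit this as a metric for every run
    answer: Optional[str] = None
    observations: list = field(default_factory=list)  # the partial result


def run_react(policy: Callable[[list], tuple],
              tools: dict,
              max_steps: int = MAX_STEPS) -> RunResult:
    """Drive a ReAct-style loop with a hard step cap and an explicit `done` tool.

    `policy` (hypothetical) maps the observations so far to (tool_name, tool_input).
    """
    observations: list = []
    for step in range(1, max_steps + 1):
        tool_name, tool_input = policy(observations)
        if tool_name == "done":
            # The agent terminated explicitly; tool_input carries the answer.
            return RunResult("done", step, answer=tool_input, observations=observations)
        observations.append(tools[tool_name](tool_input))
    # Cap hit: stop and hand back whatever was gathered instead of looping on.
    return RunResult("step_cap_hit", max_steps, observations=observations)


def exceeds_alert_threshold(steps: int, median_steps: float, factor: float = 2.0) -> bool:
    """True when a single run used more than `factor` times the median step count."""
    return steps > factor * median_steps
```

A policy that keeps issuing the same search comes back with `status == "step_cap_hit"` and its observations intact, which gives the caller something to show the user and something to alert on.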

2. Plan-and-Execute — stale plan syndrome

Failure mode: Step 2 produces output that invalidates the assumption step 3 was built on. The execution agent continues with the stale plan, producing nonsensical or incorrect results by step 5. Because planning and execution are separate calls, the executor has no mechanism to surface “the plan is wrong” back to the planner.

Production signal: Final output that’s internally consistent (each step followed the plan) but incorrect relative to the original goal. This is the hardest failure mode to catch automatically — the system looks like it succeeded.

Mitigation:

Add a re-plan trigger: after each execution step, include a brief “does this output still match the plan’s assumptions?” check. If not, re-plan from the current state.
Human-in-the-loop gate after planning and before execution for high-stakes tasks. A 30-second review catches flawed assumptions in the plan before execution compounds them.
Log the original plan alongside the final output. Post-hoc review of failures is much faster when you can compare intent vs. execution.
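
One way to wire a re-plan trigger is to attach each step’s assumption to the step itself and check it against the current state before the step runs, which is equivalent to checking after the previous step finishes. The sketch below assumes a hypothetical `planner(goal, state)` callable that returns a list of `Step` objects; the `Step` shape, the `max_replans` budget, and the log fields are illustrative choices, not a prescribed interface.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class Step:
    name: str
    assumption: Callable[[dict], bool]  # must hold on the current state before the step runs
    run: Callable[[dict], dict]         # returns updates to merge into the state


def execute_with_replan(goal: str, planner: Callable, state: dict, max_replans: int = 2):
    """Execute a plan, re-planning from the current state when an assumption breaks."""
    original_plan = planner(goal, state)
    plan = list(original_plan)
    # Keep the original plan next to what actually ran, for post-hoc review.
    log = {"original_plan": [s.name for s in original_plan], "executed": [], "replans": 0}
    while plan:
        step = plan[0]
        if not step.assumption(state):
            if log["replans"] >= max_replans:
                log["status"] = "replan_budget_exhausted"
                return state, log
            log["replans"] += 1
            plan = list(planner(goal, state))  # re-plan from where we are now
            continue
        plan.pop(0)
        state = {**state, **step.run(state)}
        log["executed"].append(step.name)
    log["status"] = "completed"
    return state, log
```

The re-plan budget matters: without it, a planner that keeps producing plans with broken assumptions turns stale plan syndrome into a loop of its own.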

3. Critic / Reflection Loop — convergence failure

Failure mode: The critic and generator disagree in circles. The generator produces version N; the critic finds a flaw; the generator fixes it but introduces a different flaw; the critic flags the new flaw; repeat indefinitely. Neither side converges.

Production signal: Hitting the max-iteration ceiling on a significant fraction of tasks (>5%). If you’re at the ceiling, you haven’t solved the quality problem — you’ve just stopped trying.

Mitigation:

The critic’s rubric must be enumerable and binary — each criterion is either pass or fail, not “could be improved.” Vague rubrics produce vague feedback that generates unbounded revision cycles.
Raise the severity a flaw must reach to block acceptance with each iteration: the critic should be willing to accept “good enough” by iteration 3 even if iteration 1 was rejected. Hard quality gates with no graduation path guarantee ceiling hits.
Log per-criterion scores across iterations. If the same criterion keeps failing, the generator model doesn’t understand it — fix the criterion definition, not the generator prompt.
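
A rough sketch of what a binary rubric with a graduated acceptance bar might look like. The 1-to-3 severity scale, the `generate(draft, feedback)` signature, and the `blocking_severity` schedule are all assumptions made for illustration; the point is that every criterion is a pass/fail check and that the per-iteration results are kept.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    severity: int                  # assumed scale: 1 = cosmetic, 2 = important, 3 = critical
    passes: Callable[[str], bool]  # binary check: pass or fail, nothing in between


def blocking_severity(iteration: int) -> int:
    """Minimum severity a failed criterion needs in order to block acceptance."""
    return min(iteration, 3)  # iteration 1 blocks on anything; iteration 3+ only on critical


def refine(generate: Callable, rubric: list, max_iterations: int = 4):
    """Generate, critique against the rubric, and revise until nothing blocks."""
    history = []  # one {criterion_name: passed} dict per iteration
    draft, feedback = None, []
    for iteration in range(1, max_iterations + 1):
        draft = generate(draft, feedback)
        results = {c.name: c.passes(draft) for c in rubric}
        history.append(results)
        threshold = blocking_severity(iteration)
        feedback = [c.name for c in rubric
                    if not results[c.name] and c.severity >= threshold]
        if not feedback:
            return draft, "accepted", history
    return draft, "ceiling_hit", history


def stuck_criteria(history: list, min_failures: int = 3) -> list:
    """Criteria that failed in at least `min_failures` iterations."""
    names = list(history[0].keys()) if history else []
    return [n for n in names if sum(not h[n] for h in history) >= min_failures]
```

Feeding `history` into `stuck_criteria` across many runs points at the criteria whose definitions need rewriting.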

4. Parallel Fan-out — reducer bottleneck

Failure mode: The fan-out step completes fast; the reducer (aggregation) step becomes the bottleneck. Sub-agents return outputs in inconsistent formats, the reducer can’t merge them, and either the system throws or it silently drops partial results.

Production signal: High variance in end-to-end latency that correlates with N (the number of sub-agents), even though the parallel work finished at roughly the same time. The reducer is doing O(N) work sequentially.

Mitigation:

Design the reducer before you design the fan-out. If you can’t describe the reducer in one sentence, your subtask decomposition isn’t clean enough.
Enforce a typed output schema from each sub-agent. A sub-agent that returns freeform text cannot be reliably reduced at scale — use tool-call-style structured outputs.
Fan-out produces N partial results; handle the case where only K < N return successfully. A reducer that requires all N to succeed will fail more often as N grows: if each sub-agent succeeds independently with probability p, all N succeed with probability only p^N.
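
Here is a small sketch of a fan-out whose reducer fits in one sentence (“pick the highest-scoring finding”) and tolerates partial failure. The `Finding` schema, the `min_results` floor, and the sub-agent call signature (`async def agent(task) -> dict`) are assumptions for the example; in a real system the schema would come from whatever structured-output mechanism your model API provides.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    source: str
    score: float


def parse_finding(raw) -> Finding:
    """Validate one sub-agent's raw output against the schema; raise on mismatch."""
    if not isinstance(raw, dict):
        raise TypeError("sub-agent output must be a dict, got " + type(raw).__name__)
    return Finding(source=str(raw["source"]), score=float(raw["score"]))


async def fan_out_and_reduce(subagents, task, min_results=1, timeout_s=5.0):
    """Run sub-agents concurrently, keep the valid results, and reduce them."""

    async def run_one(agent):
        return parse_finding(await asyncio.wait_for(agent(task), timeout_s))

    # return_exceptions=True keeps one failing sub-agent from sinking the batch.
    outcomes = await asyncio.gather(*(run_one(a) for a in subagents),
                                    return_exceptions=True)
    findings = [o for o in outcomes if isinstance(o, Finding)]
    failures = [repr(o) for o in outcomes if isinstance(o, BaseException)]
    if len(findings) < min_results:
        raise RuntimeError(f"only {len(findings)} of {len(subagents)} sub-agents succeeded")
    return {
        "best": max(findings, key=lambda f: f.score),  # the one-sentence reducer
        "succeeded": len(findings),
        "failed": failures,  # surfaced to the caller, not silently dropped
    }
```

The reducer reports how many sub-agents failed and why, so a run where K < N is visible in logs rather than disguised as a full success.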

5. Human-in-the-Loop Gate — approval fatigue

Failure mode: The gate fires too frequently. Humans start approving without reading. The safety property of the gate disappears while the latency cost remains.

Production signal: Approval time trending toward zero. If approvals that previously took 30–60 seconds are now averaging under 5 seconds, rubber-stamping has started.

Mitigation:

Gate only on irreversibility, not on risk level. A reversible action with high risk is a candidate for a Critic loop, not a human gate. Human gates are expensive; use them precisely.
Show the human what the agent is about to do in a diff-style view, not a prose explanation. Reviewers evaluate diffs faster and more accurately than prose.
Track approval rate and approval time as operational metrics. A drop in approval time without a drop in task complexity is a signal that the gate has become theater.
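
As a sketch of the metrics side, the function below compares a baseline window of gate decisions with a recent one. The `Decision` record, the 5-second floor (taken from the signal described above), and the output field names are illustrative assumptions; both windows are assumed to be non-empty.

```python
from dataclasses import dataclass
from statistics import median


@dataclass(frozen=True)
class Decision:
    approved: bool
    seconds_to_decide: float


def gate_health(baseline: list, recent: list, floor_seconds: float = 5.0) -> dict:
    """Summarize approval rate and median approval time for two windows."""

    def summarize(decisions):
        return {
            "approval_rate": sum(d.approved for d in decisions) / len(decisions),
            "median_seconds": median(d.seconds_to_decide for d in decisions),
        }

    before, after = summarize(baseline), summarize(recent)
    # Suspect rubber-stamping when review time collapsed below the floor
    # even though it used to sit at or above it.
    suspected = after["median_seconds"] < floor_seconds <= before["median_seconds"]
    return {"baseline": before, "recent": after, "rubber_stamping_suspected": suspected}
```

This only covers the timing half of the signal. Whether task complexity also dropped has to come from elsewhere, so treat a positive flag as a prompt to look, not as proof.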

6. Supervisor / Orchestrator — routing failures and silent sub-agent errors

Failure mode: The supervisor misroutes a task to the wrong sub-agent. The sub-agent either errors (best case, detectable) or produces output that’s plausible but wrong for the original task (worst case, invisible until a human notices downstream).

Production signal: Sub-agent error rate spiking on tasks that were previously stable. The supervisor’s routing logic is the most sensitive surface — a prompt change or a new task type can break routing without breaking individual sub-agents.

Mitigation:

Sub-agents must return structured error types, not just error strings. The supervisor needs to distinguish “I don’t have permission” from “I don’t understand this task” from “the upstream tool is down” — the recovery path is different in each case.
Test routing in isolation: given 50 task descriptions, does the supervisor route each one to the right sub-agent? This is a unit test that most teams skip and later regret skipping.
Implement a fallback: if no sub-agent claims a task, route to a general-purpose agent and flag it for review rather than silently dropping.
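
A compact sketch of the three ideas: typed errors, a routing function you can test on its own, and a flagged fallback. Everything named here is hypothetical. The keyword table stands in for whatever prompt or classifier your supervisor actually uses, and the recovery actions are placeholders for your own handlers.

```python
from dataclasses import dataclass
from enum import Enum


class ErrorKind(Enum):
    PERMISSION_DENIED = "permission_denied"
    TASK_NOT_UNDERSTOOD = "task_not_understood"
    UPSTREAM_UNAVAILABLE = "upstream_unavailable"


@dataclass(frozen=True)
class SubAgentError:
    kind: ErrorKind
    detail: str  # human-readable context; the supervisor branches on `kind`, not on this


# Each error kind maps to a different recovery path (names are placeholders).
RECOVERY = {
    ErrorKind.PERMISSION_DENIED: "escalate_to_human",
    ErrorKind.TASK_NOT_UNDERSTOOD: "reroute",
    ErrorKind.UPSTREAM_UNAVAILABLE: "retry_with_backoff",
}

# Hypothetical keyword router standing in for the supervisor's routing logic.
ROUTES = {"refund": "billing_agent", "invoice": "billing_agent", "password": "account_agent"}
FALLBACK = "general_agent"


def route(task: str) -> tuple:
    """Return (agent_name, needs_review). Unclaimed tasks go to the fallback, flagged."""
    lowered = task.lower()
    for keyword, agent in ROUTES.items():
        if keyword in lowered:
            return agent, False
    return FALLBACK, True


def recovery_for(error: SubAgentError) -> str:
    """Pick the recovery path from the structured error kind."""
    return RECOVERY[error.kind]
```

Because `route` is a plain function of the task description, the routing test becomes a table of `(task, expected_agent)` pairs run in CI, with no sub-agents involved.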

The pre-production checklist

Before you deploy any agentic system, verify these:
