Agentic AI Patterns — Part 3 of 3 · Part 1: The Decision Guide · Part 2: Where They Break in Production
Most teams that say they’re “doing AI agents” are at Level 2. A few know it. Most don’t.
This post maps agentic AI capability onto a five-level maturity model. The levels describe increasing autonomy, trust, and infrastructure complexity — not increasing model capability. You can be at Level 4 with an open-weights model on commodity hardware. Level is determined by how you’ve built and governed the system, not which API you’re calling.
Each level adds autonomy and demands more governance — most teams sit at Level 2; regulated industries should target Level 3.
The five levels
Level 1 — Manual
What it looks like: An engineer copies text into a chat interface, gets a response, and applies the result manually. The LLM is a smart clipboard.
Typical use: Drafting incident summaries, brainstorming runbook steps, explaining unfamiliar error messages.
Limitations: Zero repeatability. Zero auditability. Output quality depends entirely on the individual prompt. No leverage — one engineer, one task, one output.
Level 2 — Assisted
What it looks like: The LLM is embedded in a tool — an IDE plugin, a Slack bot, a CLI wrapper — and produces suggestions that a human reviews and applies. The human is still the agent; the LLM is the copilot.
Typical use: GitHub Copilot for code, an alert-explainer bot in your observability channel, an AI-assisted runbook generator that outputs a draft for human review.
Limitations: Human review bottleneck. Every output requires a human in the loop, which limits throughput. The system is only as fast as the human reviewing it.
Most production “AI” deployments as of mid-2026 are at this level. It’s a legitimate place to be, especially in regulated environments — but calling it “agentic” is a stretch.
Level 3 — Supervised Agent
What it looks like: The system runs an agentic loop autonomously — reads context, selects tools, executes actions — but with human-in-the-loop gates at defined checkpoints before irreversible actions. The agent does the work; a human approves the consequential steps.
Typical use: An incident-response agent that investigates, produces a root-cause analysis, drafts a remediation plan, and then pauses for human approval before executing any infrastructure change.
What the jump requires:
- Working tool integrations (not just LLM calls — real tool use with structured inputs/outputs)
- A gate mechanism: Slack webhook, approval queue, or UI checkpoint
- Observability: what did the agent do, in what order, with what results?
- Error handling: what happens when a tool call fails mid-run?
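The four requirements above can be sketched in a few dozen lines. This is a minimal illustration, not a reference implementation: `Tool`, `SupervisedRunner`, `reject_all`, and the tool names are all hypothetical, and the `approve` callback stands in for whatever gate you actually use (Slack webhook, approval queue, or UI checkpoint).

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Tool:
    name: str
    run: Callable[[dict], str]
    irreversible: bool = False  # irreversible tools must pass the human gate


@dataclass
class SupervisedRunner:
    tools: Dict[str, Tool]
    approve: Callable[[str, dict], bool]  # hypothetical gate callback
    audit_log: List[dict] = field(default_factory=list)

    def call(self, tool_name: str, args: dict) -> str:
        tool = self.tools[tool_name]
        entry = {"tool": tool_name, "args": args, "status": "pending"}
        self.audit_log.append(entry)  # observability: every step, in order

        # Gate: pause for approval before anything irreversible.
        if tool.irreversible and not self.approve(tool_name, args):
            entry["status"] = "rejected"
            return "rejected by human reviewer"

        try:
            result = tool.run(args)
        except Exception as exc:  # error handling: record the failure, keep the run alive
            entry["status"] = "failed"
            entry["error"] = str(exc)
            return f"tool failed: {exc}"

        entry["status"] = "ok"
        return result


def flaky(args: dict) -> str:
    raise TimeoutError("upstream timed out")


def reject_all(tool_name: str, args: dict) -> bool:
    return False  # stand-in reviewer that approves nothing


runner = SupervisedRunner(
    tools={
        "read_logs": Tool("read_logs", lambda a: f"logs for {a['service']}"),
        "restart_pod": Tool("restart_pod", lambda a: f"restarted {a['pod']}", irreversible=True),
        "flaky": Tool("flaky", flaky),
    },
    approve=reject_all,
)
```

The design choice worth noting is that the gate and the audit log live in the runner, not in the prompt, so the model cannot talk its way past either one.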
This is where most teams should be building right now. The autonomy is real; the safety net is present; the trust is being established.
Level 4 — Autonomous Agent
What it looks like: The agent operates end-to-end without a human gate on individual actions. It has defined boundaries (scope, tool access, blast radius) enforced by the platform, not by human review. Humans review outcomes, not individual steps.
Typical use: An auto-remediation agent that detects a known failure pattern, executes a bounded playbook (restart pod, scale up, clear cache), and posts a summary to Slack. No human approval per-action — the boundaries are enforced in the tool layer.
What the jump from Level 3 requires:
- Mature tool access controls: Cedar, OPA, or equivalent policy layer that enforces what the agent can and cannot do at the infrastructure level — not just in the prompt
- Blast radius limits: the agent can restart pods but cannot delete namespaces; the tool surface enforces this, not the model
- Outcome observability: if you can’t see what the agent did and measure whether it worked, you can’t trust it enough to remove the gate
- Incident playbook: when the autonomous agent causes a problem, how do you detect it, stop it, and roll back?
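To make "enforced in the tool layer, not the model" concrete, here is a hand-rolled stand-in for a policy layer. It is not Cedar or OPA and does not use their APIs; `AgentPolicy`, `BoundedToolLayer`, the action names, and the per-run budget are assumptions chosen for illustration.

```python
from dataclasses import dataclass
from typing import FrozenSet


class PolicyViolation(Exception):
    """Raised when the agent requests something outside its boundaries."""


@dataclass(frozen=True)
class AgentPolicy:
    allowed_actions: FrozenSet[str]
    allowed_namespaces: FrozenSet[str]
    max_actions_per_run: int  # crude blast-radius limit (an assumption)


class BoundedToolLayer:
    def __init__(self, policy: AgentPolicy) -> None:
        self.policy = policy
        self.actions_taken = 0

    def execute(self, action: str, namespace: str, target: str) -> str:
        # Checks run before any side effect, regardless of what the model asked for.
        if action not in self.policy.allowed_actions:
            raise PolicyViolation(f"action not permitted: {action}")
        if namespace not in self.policy.allowed_namespaces:
            raise PolicyViolation(f"namespace not permitted: {namespace}")
        if self.actions_taken >= self.policy.max_actions_per_run:
            raise PolicyViolation("action budget exhausted")
        self.actions_taken += 1
        # A real system would call the infrastructure API here.
        return f"{action} {namespace}/{target}"


policy = AgentPolicy(
    allowed_actions=frozenset({"restart_pod", "scale_up", "clear_cache"}),
    allowed_namespaces=frozenset({"payments"}),
    max_actions_per_run=2,
)
layer = BoundedToolLayer(policy)
```

With this policy, `layer.execute("restart_pod", "payments", "api-1")` succeeds, while `layer.execute("delete_namespace", "payments", "payments")` raises `PolicyViolation` no matter how the prompt is worded.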
This level requires trust you earn, not trust you assume. Teams eager for autonomy often skip Level 3 and go straight here. That’s how production incidents happen.
Level 5 — Multi-Agent Mesh
What it looks like: Multiple specialized agents coordinate to accomplish goals that no single agent could handle alone. A supervisor routes tasks to sub-agents, sub-agents may themselves spawn sub-agents, and results flow back up the chain. The system operates continuously, not just on demand.
Typical use: An autonomous SRE mesh where a monitoring agent detects anomalies, a diagnosis agent identifies root cause, a remediation agent executes a fix, and a reporting agent updates the incident record — all without a human step in the loop for known failure classes.
What the jump from Level 4 requires:
- Mature supervisor pattern (see Part 1)
- Sub-agent interface contracts: structured typed inputs and outputs, not freeform text
- Multi-agent observability: tracing that follows a task across agent boundaries, not just within a single agent
- Governance for autonomous-to-autonomous delegation: what can the supervisor authorize a sub-agent to do that the sub-agent couldn’t do on its own?
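A sketch of what typed contracts, cross-agent tracing, and delegation governance might look like together. Every name here (`TaskRequest`, `TaskResult`, `delegate`) is hypothetical. The sketch also takes one conservative answer to the delegation question above: a supervisor can only narrow permissions, never widen them. Your governance model may reasonably differ.

```python
from dataclasses import dataclass
from typing import FrozenSet, Iterable, Mapping


@dataclass(frozen=True)
class TaskRequest:
    trace_id: str                   # shared across agents so one trace spans the task
    task_type: str
    payload: Mapping[str, str]
    granted_actions: FrozenSet[str]


@dataclass(frozen=True)
class TaskResult:
    trace_id: str
    agent: str
    status: str                     # e.g. "ok" or "failed"
    summary: str


def delegate(
    parent: TaskRequest,
    task_type: str,
    payload: Mapping[str, str],
    requested_actions: Iterable[str],
) -> TaskRequest:
    """Build a sub-agent request that inherits the trace ID and at most the parent's grants."""
    granted = parent.granted_actions & frozenset(requested_actions)
    return TaskRequest(
        trace_id=parent.trace_id,
        task_type=task_type,
        payload=payload,
        granted_actions=granted,
    )


root = TaskRequest(
    trace_id="trace-001",
    task_type="incident",
    payload={"service": "api"},
    granted_actions=frozenset({"read_logs", "restart_pod"}),
)
child = delegate(root, "diagnose", {"service": "api"}, {"read_logs", "delete_namespace"})
```

Here `child` carries the same `trace_id` as `root` and is granted only `read_logs`: the request for `delete_namespace` is silently dropped because the parent never held it. A stricter variant could raise an error instead.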
This is the frontier. A handful of teams in cloud-native environments are operating here. Most are not.
Self-assessment: where is your team?
Answer these honestly:
- Can your AI system complete a multi-step task without a human copy-pasting between steps? If no → Level 1 or 2.
- Does your AI system call external tools (APIs, databases, infrastructure) directly? If no → Level 2 at most.
- Do you have observability on individual tool calls — not just final output, but every step? If no → you’re at Level 2 even if you think you’re at Level 3.
- Do you have policy-enforced tool access controls (not just prompt-based restrictions)? If no → you’re at Level 3 at most.
- Do multiple specialized agents coordinate on a single task, with structured interfaces between them? If yes → Level 5 candidate.
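The checklist can be read as a small decision function. The sketch below returns the highest level your answers support; the `Answers` fields and `level_ceiling` are hypothetical names, and treating "yes to the first four, no to the last" as Level 4 is an inference from the list rather than something it states outright.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Answers:
    multi_step_without_copy_paste: bool
    calls_external_tools: bool
    step_level_observability: bool
    policy_enforced_tool_access: bool
    structured_multi_agent_coordination: bool


def level_ceiling(a: Answers) -> int:
    """Return the highest maturity level the answers support."""
    # Any "no" on the first three questions caps the team at Level 2.
    if not (
        a.multi_step_without_copy_paste
        and a.calls_external_tools
        and a.step_level_observability
    ):
        return 2
    # Prompt-only restrictions cap the team at Level 3.
    if not a.policy_enforced_tool_access:
        return 3
    # Coordinated agents with structured interfaces make a Level 5 candidate.
    if a.structured_multi_agent_coordination:
        return 5
    return 4  # inferred: everything in place except multi-agent coordination
```

A ceiling is not a certification. A result of 3, for example, still assumes you have a working approval gate, which the questions do not ask about directly.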
Where regulated and safety-critical industries should draw the line
This question comes up constantly in aviation, finance, and healthcare: how autonomous is too autonomous?
The honest answer: most regulated environments should target Level 3, with a well-defined path to Level 4 for bounded, reversible actions.
Here’s why:
- Auditability requirements in regulated industries often demand a human decision point before consequential actions. Level 4 removes that point. You need explicit regulatory analysis before removing human gates on actions that affect safety or compliance.
- Blast radius asymmetry: in consumer software, a bad autonomous action is annoying. In aviation infrastructure or financial settlement, it can be catastrophic. The cost-benefit of Level 4 changes dramatically.
- Trust is regulatory, not just operational. Even if your Level 4 system works perfectly, you need to demonstrate to regulators that it operates within defined bounds. Policy-enforced tool access (Cedar, OPA) is the mechanism — but the regulatory framework for certifying autonomous AI decision-making in most industries is still immature.
The right posture for regulated industries: build Level 3 deeply, with excellent observability and tight blast-radius controls. Run Level 4 in read-only or low-consequence paths first (log analysis, summarization, reporting) to build the evidence base. Earn Level 4 trust one bounded use case at a time.
Using the model
The decision guide in Part 1 tells you which pattern to use for a specific task. This maturity model tells you which patterns are appropriate given your current level of governance, observability, and trust.
If you’re at Level 2, don’t implement a Supervisor pattern. You don’t have the observability or error handling to operate it safely.
If you’re at Level 3 with strong observability and a working gate mechanism, Plan-and-Execute and Critic loops are appropriate next steps. Parallel fan-out and Supervisor are Level 4+ patterns in most organizations.
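One way to make this guidance hard to ignore is to encode it as a lookup that design reviews can check against. The pattern identifiers and the `allowed_patterns` helper below are hypothetical; the minimum levels come from the two paragraphs above and should be adjusted to your own organization.

```python
# Minimum maturity level at which each pattern is appropriate, per the guidance above.
MIN_LEVEL = {
    "plan_and_execute": 3,
    "critic_loop": 3,
    "parallel_fan_out": 4,
    "supervisor": 4,
}


def allowed_patterns(level: int) -> list:
    """Return the patterns a team at the given level can reasonably adopt."""
    return sorted(name for name, minimum in MIN_LEVEL.items() if level >= minimum)
```

A team at Level 2 gets an empty list back, which matches the advice: build the gate, observability, and error handling first.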
The failure modes in Part 2 are largely failures of teams implementing Level 4+ patterns at Level 2 maturity. The maturity model is the checklist that prevents that.