Mental models for applying AI to infrastructure

Most writing about AI in infrastructure is tutorials. Tutorials answer how. Mental models answer whether. Here are seven I use as the front gate before any LLM goes near a production system — recoverability, reversibility, per-call economics, the autonomy ladder as a risk function, tools-not-chat, context as substrate, and identity that travels with the action.


There is no shortage of tutorials on applying AI to infrastructure. How to wire Claude to your alert manager. How to RAG your runbooks. How to write an MCP server. The tutorials are good. They answer how.

The harder question, and the one almost nobody writes about, is whether — whether a given AI integration belongs in your system at all, and at what blast radius, and behind what guardrails. That’s not a tutorial. That’s a mental model.

A mental model is a cheap, transferable lens that lets a senior practitioner make a decision in thirty seconds that would otherwise take a thirty-page architecture review. SREs already have a closet full of them — cattle not pets, shift left, error budgets are policy not measurement, toil is the enemy. We don’t yet have a shared closet for AI in infrastructure.

Here are the seven I’m now using as the front gate, in the order I apply them. None of these are clever. All of them earn their place by stopping bad ideas before code gets written.


1. The Recoverability Triangle

Before AI gets the wheel, three things must be true: the cost of being wrong is bounded, a deterministic fallback exists, and the decision is auditable post-hoc.

If any leg is missing, the answer is no. Not yet. Not here.

This is the test I run before approving any LLM integration on a path I care about. “Cool tool, demo looks great — what happens when it’s confidently wrong at 3am?” If the answer involves more confident-wrong outputs cascading downstream, you’re missing recoverability. If the answer is “we don’t know what it did,” you’re missing auditability. If the answer is “fall back to the on-call engineer,” good — but only if the engineer can fall back without the AI’s prior decisions blocking them.

A working example: pulling an LLM out of synchronous payment authorization (a P99 of 14 seconds, regional outages, six-figure revenue per minute at stake) and putting it on asynchronous fraud signal generation. Same model, same prompts, but now: cost-of-wrong is bounded (one false signal, not one declined card), the deterministic fallback is the existing rule engine, and every signal is logged with the input that produced it.

The triangle didn’t tell you to do that. It told you the original idea was disqualified. The team’s job was to find the version of the integration that satisfied all three legs — or to walk away.
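
The triangle is small enough to write down as code. A minimal sketch, with the fraud-signal example above encoded as the three legs; the class and field names are illustrative, not from any real system:

    from dataclasses import dataclass
    from typing import Optional

    @dataclass
    class RecoverabilityCheck:
        bounded_cost_of_wrong: Optional[str]   # leg 1: the worst case has a known, bounded cost
        deterministic_fallback: Optional[str]  # leg 2: a non-LLM path exists when the model is wrong or down
        audit_trail: Optional[str]             # leg 3: the decision can be reconstructed post-hoc

        def approve(self) -> bool:
            # Any missing leg means "no, not yet, not here".
            return all([self.bounded_cost_of_wrong,
                        self.deterministic_fallback,
                        self.audit_trail])

    # The asynchronous fraud-signal redesign, expressed against the triangle.
    fraud_signals = RecoverabilityCheck(
        bounded_cost_of_wrong="one false signal, not one declined card",
        deterministic_fallback="existing rule engine",
        audit_trail="every signal logged with the input that produced it",
    )
    assert fraud_signals.approve()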


2. The Reversibility Gate

Two-way doors get AI. One-way doors get humans.

Bezos’s two-way / one-way door framing maps onto agent design with no friction. A two-way door is a decision you can undo cheaply: restart a pod, requeue a job, re-summarize an incident. A one-way door is a decision you cannot: drop a database table, send an email to all customers, terminate an EC2 instance with the only copy of unsaved state.

Agents and LLMs may execute two-way doors. They may propose one-way doors. They may not execute one-way doors.

The implementation of this gate lives in the tool layer, not in the prompt. A prompt that says “please don’t run terraform destroy without confirmation” is a vibe. A tool registry where kubectl rollout restart is registered as a directly-callable tool and terraform destroy is registered as a proposal-only tool that returns a pending_approval token to a human is the gate. One survives a model that’s having a bad day; the other doesn’t.
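
A sketch of what that registry can look like, assuming a simple in-process dispatcher. The two commands are the ones from the paragraph above; ToolMode, the pending-approvals dict, and the function names are illustrative:

    import enum, subprocess, uuid

    class ToolMode(enum.Enum):
        EXECUTE = "execute"    # two-way door: the agent may call this directly
        PROPOSE = "propose"    # one-way door: the agent may only request human approval

    # The gate lives here, in data, not in the prompt.
    TOOL_REGISTRY = {
        "rollout_restart": {"mode": ToolMode.EXECUTE,
                            "cmd": ["kubectl", "rollout", "restart"]},
        "terraform_destroy": {"mode": ToolMode.PROPOSE,
                              "cmd": ["terraform", "destroy"]},
    }

    PENDING_APPROVALS = {}  # stand-in for a real quarantine queue

    def call_tool(name: str, args: list[str]) -> dict:
        tool = TOOL_REGISTRY[name]
        if tool["mode"] is ToolMode.PROPOSE:
            # One-way door: record the proposal and hand a token back to a human.
            token = str(uuid.uuid4())
            PENDING_APPROVALS[token] = {"tool": name, "args": args}
            return {"status": "pending_approval", "token": token}
        # Two-way door: execute directly; cheap to undo if the model was wrong.
        result = subprocess.run(tool["cmd"] + args, capture_output=True, text=True)
        return {"status": "executed", "stdout": result.stdout, "stderr": result.stderr}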

This is why quarantine queues exist in production agentic systems, even if it is not the reason I cite when I describe them. Reversibility is the underlying principle; quarantine queues are the implementation. Get the principle right and the implementation falls out of it.


3. Per-Call Economics, Not Per-Demo Economics

Annual line item = per-call cost × calls per year. Not the impressive demo.

The fastest way to get an LLM project killed in 18 months is to deploy it on every log line, every alert, every Slack message, and discover that what looked like a $4 demo is now a $300K annual cost-center with no measurable lift over a regex. I have watched this happen. So have you.

The mental model isn’t “AI is too expensive.” The mental model is do the volume math first, and design a tier above the LLM that catches everything routine.
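
The math is a single multiplication; the numbers below are assumptions chosen only to show the shape of it:

    # Illustrative numbers, not from any real deployment.
    cost_per_call = 0.004      # dollars per LLM call, prompt plus completion
    calls_per_day = 200_000    # every log line, alert, and Slack message

    annual_cost = cost_per_call * calls_per_day * 365
    print(f"${annual_cost:,.0f} per year")   # -> $292,000 per year: the $4 demo, at volume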

For log-line classification, this looks concrete: a regex plus a small classifier handles 95% of cases at near-zero per-call cost; the LLM is reserved for the 5% that fails a confidence threshold. Same product surface from the user’s point of view. One-twentieth the cost. Better latency. Easier to debug.
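
A sketch of that tier, under assumptions: the regex patterns are examples, classify_with_llm is a placeholder for the only expensive call in the path, and a small local classifier with a confidence threshold would slot in between the two tiers in the same way:

    import re

    # Tier 1: near-zero-cost patterns for the routine ~95%.
    KNOWN_PATTERNS = [
        (re.compile(r"OOMKilled|out of memory", re.I), "oom"),
        (re.compile(r"connection (refused|reset|timed out)", re.I), "network"),
        (re.compile(r"no space left|disk pressure", re.I), "disk"),
    ]

    def classify_with_llm(line: str) -> str:
        # Placeholder: the only place an LLM API would be called, and where
        # nearly all of the per-call cost lives.
        return "needs_llm_triage"

    def classify_log_line(line: str) -> str:
        for pattern, label in KNOWN_PATTERNS:
            if pattern.search(line):
                return label                 # routine path, effectively free
        return classify_with_llm(line)       # the residual ~5% that earns the cost

    print(classify_log_line("OOMKilled: container exceeded memory limit"))  # -> oom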

Per-call economics is also the model that makes batching and caching feel obvious instead of clever. If you’re at scale, the LLM is the most expensive thing in your pipeline by an order of magnitude, and any architectural decision that doesn’t treat it that way is wrong.

The corollary I keep needing to repeat: a 50-line heuristic at 1% of the cost is a perfectly good baseline. “AI” is not the goal. The goal is to do the work; the AI is allowed to lose to a regex on the merits.


4. The Autonomy Ladder Is a Risk Function, Not a Roadmap

There’s a popular framing of agent autonomy as a four-stage maturity ladder: Read-only → Advised → Approved → Autonomous. It’s a useful vocabulary. It’s the wrong shape.

It is not a maturity curve a team climbs. It is a per-action risk decision a team re-makes every quarter.

Same SRE platform, same team, same agent, three different actions:

  • Summarize this incident. Read-only on day one, autonomous within a week. The downside of getting the summary wrong is “the summary is wrong.” Cost-bounded. Reversible. Promote.
  • Triage this alert and propose a remediation. Approved indefinitely. The proposal is cheap; executing on a wrong proposal is expensive. The human-in-the-loop is the work, not the friction.
  • Modify production IAM policy to grant temporary access. Read-only forever. There is no version of this where autonomy is the right answer. The compliance-attestation cost alone makes the question moot.

The mistake teams make is reasoning about their agent’s autonomy. The right unit is (agent, action, environment). Autonomy travels with the tuple, not with the agent.
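
One way to make the tuple concrete is a policy table the platform consults on every call, rather than a level the agent carries around. The entries below mirror the three actions above; the names and structure are illustrative:

    import enum

    class Autonomy(enum.Enum):
        READ_ONLY = 1
        ADVISED = 2
        APPROVED = 3
        AUTONOMOUS = 4

    # Autonomy is keyed by (agent, action, environment), not by agent,
    # and the table is revisited quarterly.
    AUTONOMY_POLICY = {
        ("sre-agent", "summarize_incident", "prod"): Autonomy.AUTONOMOUS,
        ("sre-agent", "propose_remediation", "prod"): Autonomy.APPROVED,
        ("sre-agent", "modify_iam_policy", "prod"): Autonomy.READ_ONLY,
    }

    def autonomy_for(agent: str, action: str, env: str) -> Autonomy:
        # Default to the most restrictive level when the tuple is unknown.
        return AUTONOMY_POLICY.get((agent, action, env), Autonomy.READ_ONLY)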

This is the model that prevents the all-too-common architectural sentence “we’re at level 3 across the board.” No team is at level 3 across the board. Some actions are at level 4 and always will be; some are at level 1 and always should be. The ladder is a risk function, not a roadmap.


5. Tools, Not Chat

The interface is the mental model.

When the next step is known, a button beats a chatbot every time. Restart the pod is a button. Acknowledge the incident is a button. Open the runbook is a button. Wrapping a known-step workflow in a chat window is theater — it adds latency, ambiguity, and a probabilistic failure mode to a path that didn’t have one.

Where chat actually wins is the part of the work where the next step is not known. Diagnosis. Hypothesis generation. “What could explain this set of metrics?” That’s where the agentic loop earns its complexity budget.

The practical rule: in any agent design, list every action you’re tempted to expose. For each, ask whether the next step is known. If yes, it’s a button. If no, it’s a chat affordance. Most agent UIs in the wild get this backwards — they put deterministic actions behind chat (the “ask the bot to restart it” anti-pattern) and probabilistic ones behind buttons (the “auto-resolve” trap).
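
One way to force the question at design time is to make the affordance an explicit property of every exposed action; the action names and the Affordance type below are illustrative:

    import enum

    class Affordance(enum.Enum):
        BUTTON = "button"   # next step is known: deterministic, one click, no model in the path
        CHAT = "chat"       # next step is unknown: diagnosis, hypothesis generation

    # Decided per action when the agent surface is designed, not left to the UI framework.
    ACTIONS = {
        "restart_pod": Affordance.BUTTON,
        "acknowledge_incident": Affordance.BUTTON,
        "open_runbook": Affordance.BUTTON,
        "explain_metric_anomaly": Affordance.CHAT,
    }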


6. Context Is the Substrate, Not the Prompt

Most “prompt engineering” work in infrastructure is optimizing the wrong layer.

The model’s output quality on a real SRE task is dominated, at first order, by what the model can see — the topology of your system, the last 7 days of incident context, who owns the service, what changed in the last hour, what’s already paged. Prompt wording is a rounding error compared to context substrate.

The right architectural question is not “what should the prompt say?” It’s “what graph of context can the agent traverse, and how does it pay the latency and cost of doing so?”

Two SRE agents on the same alert, same prompt: one with a context graph (service topology, ownership, recent deploys, related incidents) plus a tool-use loop, and one with raw kubectl access. The first reaches a correct diagnosis. The second hallucinates plausible-sounding but wrong runbooks at a measurable rate.
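
A sketch of what "context graph plus a tool-use loop" means in practice: assemble the substrate first, then ask. Every helper here (get_topology, get_owners, recent_deploys, similar_incidents, current_pages, llm_call) is a hypothetical stand-in for your own inventory, deploy, incident, and paging APIs:

    def build_context(alert: dict) -> dict:
        # All helpers below are stand-ins for internal APIs.
        service = alert["service"]
        return {
            "alert": alert,
            "topology": get_topology(service),              # upstream/downstream dependencies
            "ownership": get_owners(service),               # team, on-call, escalation path
            "recent_deploys": recent_deploys(service, hours=1),
            "related_incidents": similar_incidents(service, days=7),
            "active_pages": current_pages(service),
        }

    def triage(alert: dict) -> str:
        context = build_context(alert)
        # The prompt stays short and boring; the substrate does the heavy lifting.
        return llm_call(
            system="You are an SRE triage assistant. Use only the provided context.",
            user=f"Context: {context}\nWhat changed, and what is the most likely cause?",
        )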

This is also the answer to “how do we get more out of our LLM tooling?” — the answer is rarely a better prompt. It’s almost always better context. Build the substrate first. Tune the prompt last.


7. Identity Travels With the Action

This is the model nobody else is writing about, and the one your security and compliance teams will fail you on first.

Every tool call an agent makes must carry the human’s delegated identity, not a shared service-account credential.

The naive design — and the design every “AI agent” tutorial ships — is: the agent has its own credential, the agent calls the tools, the audit log says “agent did it.” That’s a backdoor with a friendly name. Six months later, when an auditor or a regulator asks “who, exactly, told the agent to delete that record,” the answer is a service principal that was used by every agent run for everyone in the company. You will not pass that audit.

The right shape borrows from RFC 8693 (OAuth 2.0 Token Exchange) and its delegation pattern, which carries an act claim and is distinct from impersonation. The user authenticates to the agent. The agent presents the user’s token at the gateway. The gateway validates the user’s token, then exchanges it for a short-lived, narrowly scoped token bound to this user, this tool, this call. The downstream system sees the user’s identity, with an act claim identifying the agent acting on the user’s behalf. The audit log says Ajin via agent X, not agent X.
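
A gateway-side sketch of that exchange, using the parameter names RFC 8693 defines; the token endpoint URL, scope format, and audience value are placeholders for your own identity provider:

    import requests

    TOKEN_ENDPOINT = "https://idp.example.com/oauth2/token"   # placeholder IdP

    def exchange_for_tool_token(user_token: str, agent_token: str, tool: str) -> str:
        """Swap the user's token for a short-lived, narrowly scoped token for one tool call."""
        resp = requests.post(TOKEN_ENDPOINT, data={
            # RFC 8693 token exchange grant.
            "grant_type": "urn:ietf:params:oauth:grant-type:token-exchange",
            # The user on whose behalf the call is made (the delegation subject).
            "subject_token": user_token,
            "subject_token_type": "urn:ietf:params:oauth:token-type:access_token",
            # The agent doing the acting; the IdP records it in the act claim.
            "actor_token": agent_token,
            "actor_token_type": "urn:ietf:params:oauth:token-type:access_token",
            # Narrow the result to this tool, this call.
            "audience": tool,
            "scope": f"{tool}:invoke",
        })
        resp.raise_for_status()
        return resp.json()["access_token"]

    # Decoded downstream, the token carries both identities, so the audit log can say
    # "user via agent": {"sub": "ajin", "act": {"sub": "agent-x"}, "scope": "delete_record:invoke"}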

The implementation is straightforward. The discipline is the hard part. Most teams get nine of the ten components of an agent platform right and then leak identity through a single shared-credential tool call, and that one leak becomes the load-bearing fact in every regulatory conversation that follows.

If you are designing or operating any agent that calls back into your own systems, this is the model to internalize first. The other six are about whether and how AI gets the wheel; this one is about who, in the audit log, is actually driving.


Putting it all together

These models stack. Used in order, they look like a checklist:

  1. Does this integration satisfy the recoverability triangle?
  2. Are the actions on the path two-way doors, or are the one-way doors restricted to proposal only?
  3. Have we done the per-call × rate math, and is there a deterministic tier above the LLM?
  4. For each action, what is the autonomy level right now, in this environment?
  5. Where the next step is known, is the affordance a button (not a chat box)?
  6. Has the model been given the context substrate it needs, or are we trying to fix this with a better prompt?
  7. Does the human’s identity travel through every tool call this agent makes?

That’s the front gate. Most “AI in infrastructure” projects I’ve reviewed in the last 18 months would have shipped a different design — or not shipped at all — if someone had walked them through these seven questions before the first commit. None of them are exotic. All of them prevent specific, recurring, recognizable failure modes I have actually watched happen.

The companion post on the MCP gateway pattern is roughly the implementation half of this writeup — what these models look like as an architecture. The prompt engineering post is the in-the-trenches version. This one is the front gate. If you only adopt one of the three, adopt this one.

Tutorials are easy to find. The mental models are not.


If you have a model I missed — especially one that prevents a failure I haven’t seen yet — I want to hear about it. Contact at cloudandsre.com.