There’s a diagram every vendor will hand you this year. It shows an “AI SRE agent” in the middle, arrows fanning out to your metrics, logs, traces, cloud APIs, and Slack, and a confident caption about autonomous incident resolution. It is not wrong, exactly. It is just missing every layer that decides whether the thing is safe to run.
I’ve spent 2026 building and breaking this stack. This is the reference I wish the vendor diagrams were: six layers, what each one is for, what actually fills it, and where the honest answer is “not yet.” If you’re evaluating AI for production SRE, read this before you read a single pricing page.
The stack, bottom to top
An AI-native SRE stack is not one product. It’s six layers, and the failures live at the seams between them — the same lesson distributed systems taught us, re-learned with models in the loop.
- Telemetry & context — the ground truth the agent reasons over.
- Reasoning — the model that forms hypotheses.
- Tool access — how the agent reaches your systems (in 2026, usually MCP).
- Identity & policy — who’s asking, and what they’re allowed to do.
- Remediation — the bounded set of actions the agent may take.
- Evaluation — whether any of the above is actually working.
Skip a layer and you don’t get a simpler stack. You get an unmeasured, over-permissioned one that pages you at 3am with more confidence than it has earned.
Layer 1 — Telemetry & context
This is the layer that decides everything above it, and it’s the one teams most want to skip. An agent reasoning over unstructured logs and an agent reasoning over structured events with topology are not the same product at different quality — they’re different products. The arXiv work on observability for AI systems keeps landing on the same root cause: context deficiency, not model weakness. The model isn’t wrong because it’s dumb; it’s wrong because you handed it half the picture. I went deep on this in observability for AI systems — the short version is that your telemetry quality is the ceiling on your agent quality.
What fills it: your existing Datadog / Grafana / New Relic / OpenTelemetry pipeline, plus a context layer (service topology, ownership, recent deploys, incident history) the agent can query. The context layer is the part most teams haven’t built. It’s also the part that does the most work.
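To make "a context layer the agent can query" concrete, here is a minimal in-memory sketch. Every name in it (`ServiceContext`, `Deploy`, `ContextStore`, the field names) is my own illustration, not any vendor's schema; a real one would sit on top of your service catalog and deploy pipeline.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta


@dataclass
class Deploy:
    service: str
    version: str
    at: datetime


@dataclass
class ServiceContext:
    name: str
    owner: str  # owning team
    depends_on: list[str] = field(default_factory=list)


class ContextStore:
    """Hypothetical stand-in for a queryable context layer."""

    def __init__(self, services: list[ServiceContext], deploys: list[Deploy]):
        self._services = {s.name: s for s in services}
        self._deploys = deploys

    def dependencies_of(self, name: str) -> set[str]:
        """All transitive dependencies of a service (topology walk)."""
        seen: set[str] = set()
        stack = list(self._services[name].depends_on)
        while stack:
            dep = stack.pop()
            if dep in seen:
                continue
            seen.add(dep)
            if dep in self._services:
                stack.extend(self._services[dep].depends_on)
        return seen

    def suspect_deploys(self, name: str, now: datetime, window: timedelta) -> list[Deploy]:
        """Deploys inside the window to the service or its dependencies, newest first."""
        scope = self.dependencies_of(name) | {name}
        recent = [
            d for d in self._deploys
            if d.service in scope and timedelta(0) <= now - d.at <= window
        ]
        return sorted(recent, key=lambda d: d.at, reverse=True)
```

The point of the sketch is the shape of the question: "what changed recently in anything this service depends on" is one call, instead of something the model has to infer from raw log lines.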
Layer 2 — Reasoning
The model. In 2026 this is a frontier LLM — I default to Claude for infrastructure reasoning and keep a second model wired in for comparison, which is its own evaluation discipline. The thing to internalize: the model’s output is a hypothesis, not a fact, and never an authorization. Modern AI SRE agents form and test hypotheses against live telemetry rather than matching known signatures — that’s the genuine step-change over the AIOps correlation engines of 2019. But a hypothesis engine wired directly to your cluster’s write API is a liability, which is why layers 3–5 exist.
Layer 3 — Tool access (MCP)
This is how the agent reaches the world. In 2026 the answer is overwhelmingly the Model Context Protocol — not because it’s elegant, but because it standardizes the agent-to-tool boundary and lets you put one governed gateway in front of every tool instead of N bespoke integrations. I covered the architecture in the MCP gateway pattern. The catch is that this layer is a centralized attack surface — the NSA said as much in its 2026 cybersecurity information sheet on MCP. Convenient for you, convenient for an attacker. Treat it accordingly.
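As a rough sketch of the "one governed gateway" idea: the code below accepts a JSON-RPC request, only forwards `tools/call` requests for registered tools, and records every attempt. The `tools/call` method and the `name` / `arguments` params follow the MCP request shape as I understand it; the `Gateway` class, the audit entry fields, and the choice of error codes are assumptions of mine, and the result payload is simplified.

```python
from typing import Any, Callable

ToolHandler = Callable[[dict[str, Any]], Any]


class Gateway:
    """Hypothetical single choke point in front of every tool."""

    def __init__(self, tools: dict[str, ToolHandler]):
        self._tools = tools
        self.audit_log: list[dict[str, Any]] = []

    @staticmethod
    def _error(req_id: Any, code: int, message: str) -> dict[str, Any]:
        return {"jsonrpc": "2.0", "id": req_id, "error": {"code": code, "message": message}}

    def handle(self, caller: str, request: dict[str, Any]) -> dict[str, Any]:
        req_id = request.get("id")
        method = request.get("method")
        params = request.get("params") or {}
        tool = params.get("name")

        # Record the attempt before deciding, so denials are audited too.
        entry = {"caller": caller, "method": method, "tool": tool, "allowed": False}
        self.audit_log.append(entry)

        if method != "tools/call":
            return self._error(req_id, -32601, "method not found")
        if tool not in self._tools:
            return self._error(req_id, -32602, f"unknown tool: {tool}")

        entry["allowed"] = True
        result = self._tools[tool](params.get("arguments") or {})
        return {"jsonrpc": "2.0", "id": req_id, "result": result}
```

Because every tool call passes through `handle`, this is also the natural place to hang the identity and policy checks from the next layer, and the one component you most need to harden.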
Layer 4 — Identity & policy
The layer the demo never shows. Every action the agent takes must carry a verifiable identity, be authorized for that specific action, and leave an audit record, a rule I’ve called “no anonymous inference endpoints.” The mechanism is token exchange (RFC 8693) plus policy-as-code (Cedar, OPA) at the gateway. If your agent acts through one shared god-mode service account, you don’t have an SRE stack; you have a confused deputy with admin rights and a friendly UI.
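Here is a small sketch of the two halves of that mechanism. The form parameters and URNs are the ones RFC 8693 defines for a token exchange request; I only build the request body, since the token endpoint is specific to your identity provider. The policy half is a plain-Python, default-deny stand-in for what Cedar or OPA would evaluate, and the principal and action names are hypothetical.

```python
from urllib.parse import urlencode

TOKEN_EXCHANGE_GRANT = "urn:ietf:params:oauth:grant-type:token-exchange"
ACCESS_TOKEN_TYPE = "urn:ietf:params:oauth:token-type:access_token"


def token_exchange_body(subject_token: str, actor_token: str, audience: str, scope: str) -> str:
    """Form-encoded RFC 8693 request: the agent (actor) acts for a subject,
    asking for a token narrowed to one audience and scope."""
    return urlencode({
        "grant_type": TOKEN_EXCHANGE_GRANT,
        "subject_token": subject_token,
        "subject_token_type": ACCESS_TOKEN_TYPE,
        "actor_token": actor_token,
        "actor_token_type": ACCESS_TOKEN_TYPE,
        "audience": audience,
        "scope": scope,
    })


# Hypothetical policy: explicit (principal, action) grants, everything else denied.
POLICY: set[tuple[str, str]] = {
    ("sre-agent", "metrics:read"),
    ("sre-agent", "deploy:rollback"),
}


def authorize(principal: str, action: str, audit: list[dict]) -> bool:
    """Default-deny check that always leaves an audit record."""
    allowed = (principal, action) in POLICY
    audit.append({"principal": principal, "action": action, "allowed": allowed})
    return allowed
```

The design choice worth copying is the pairing: the exchanged token says who is acting and on whose behalf, and the policy decides whether that identity may perform this specific action. Neither one substitutes for the other.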
Layer 5 — Remediation
The actions. This is where vendors say “autonomous” and where I say bounded. The mature framing — visible in Azure SRE Agent and the AWS DevOps Agent, and the way Rootly frames its agent — is bounded remediation under governance with human oversight. The agent executes from a small, explicit, reversible action set, and everything outside that set is a proposal a human approves. I make the full case for this in bounded autonomy for AI SRE agents.
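In code, "bounded" can be as unglamorous as an allowlist and a queue. This is a minimal sketch under my own assumptions: the action kinds, the `REVERSIBLE_ACTIONS` mapping, and the `dispatch` function are illustrative and not taken from any of the products named above.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable


class Outcome(Enum):
    EXECUTED = "executed"
    PROPOSED = "proposed"


@dataclass(frozen=True)
class Action:
    kind: str
    target: str


# Hypothetical bounded set: each permitted kind maps to the kind that undoes it.
REVERSIBLE_ACTIONS: dict[str, str] = {
    "scale_up": "scale_down",
    "pause_rollout": "resume_rollout",
    "rollback_deploy": "redeploy_version",
}


def dispatch(action: Action, execute: Callable[[Action], None], approval_queue: list[Action]) -> Outcome:
    """Run the action only if it is in the bounded set; otherwise queue it for a human."""
    if action.kind in REVERSIBLE_ACTIONS:
        execute(action)
        return Outcome.EXECUTED
    approval_queue.append(action)
    return Outcome.PROPOSED
```

Note what the model cannot do here: it cannot widen the set. Adding a new kind to `REVERSIBLE_ACTIONS` is a code change that a human reviews, which is the whole point.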
Layer 6 — Evaluation
The layer that turns the other five from a science-fair project into something you can put on-call. Were the agent’s hypotheses right? Did its actions reduce MTTR or just reduce visibility? Most failed AI-SRE rollouts I’ve seen had a great layer 2 and no layer 6 — they granted autonomy on faith and found out in production.
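The first two questions can be turned into numbers with very little machinery. The sketch below assumes you record, per incident, the agent's top hypothesis (if it was involved), the cause confirmed in the postmortem, and the time to resolve; the `IncidentRecord` fields and both function names are hypothetical.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Optional


@dataclass
class IncidentRecord:
    agent_hypothesis: Optional[str]  # None when the agent was not involved
    confirmed_cause: str             # from the postmortem
    minutes_to_resolve: float


def hypothesis_accuracy(records: list[IncidentRecord]) -> Optional[float]:
    """Share of agent-assisted incidents where the top hypothesis matched the confirmed cause."""
    judged = [r for r in records if r.agent_hypothesis is not None]
    if not judged:
        return None
    return sum(r.agent_hypothesis == r.confirmed_cause for r in judged) / len(judged)


def mttr_delta(records: list[IncidentRecord]) -> Optional[float]:
    """Mean minutes to resolve, agent-assisted minus unassisted. Negative means faster."""
    assisted = [r.minutes_to_resolve for r in records if r.agent_hypothesis is not None]
    baseline = [r.minutes_to_resolve for r in records if r.agent_hypothesis is None]
    if not assisted or not baseline:
        return None
    return mean(assisted) - mean(baseline)
```

Treat the MTTR delta as a signal and not a proof: assisted and unassisted incidents are rarely comparable populations, so it is worth segmenting by severity and service before drawing conclusions.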
What the data actually says
Here’s the part the pricing pages leave out. New Relic’s 2026 AI Impact Report — aggregated across 6.6 million platform users — found AI users hit 2x higher correlation rates and 27% less alert noise than non-AI accounts. That’s real, and it’s significant.
And in the same breath: about half of respondents said AI reduced their toil. The other half reported no change or more work. Datadog’s State of AI Engineering 2026 shows the same split from the agent-sprawl angle — complexity growing faster than the ability to govern it.
The teams in the winning half are not the ones who bought the best agent. They’re the ones who had layers 1, 4, and 6 before they bought any agent at all. The losing half bolted a reasoning engine onto bad telemetry and no policy and called it transformation.
The AI doesn’t fix a broken SRE stack. It inherits it — and runs it faster.
How to use this map
- Auditing a vendor? Make them show you which of the six layers they own and which they assume you already have. “Autonomous” with no layer 4 means unauthenticated. “Intelligent” with no layer 6 means unmeasured.
- Building your own? Go bottom-up: telemetry, then identity, then a single read-only agent, then evaluation, and only then remediation. The next two posts in this drop go deep on the two layers that break most often — what the 2026 MCP release candidate means for layer 3, and bounded autonomy for layer 5.
The AI-native SRE stack isn’t coming. It’s here, and most of it is the boring, un-demoable plumbing under the agent — which is exactly where the reliability lives.