When NOT to Use AI in Production SRE

Most AI-for-SRE writing tells you where AI helps. Here are seven places it actively hurts — and the operational rule of thumb I use to decide.


Most of the AI-for-SRE writing on the internet right now tells you where AI helps. This post is the inverse. After two years of putting AI into the on-call path of platforms operating in regulated industries, I have a list of places where adding AI is the wrong move, not because the technology cannot do the job, but because the operational economics, the failure modes, or the regulatory posture make it the wrong tool.

If you are an SRE, a platform engineer, or a leader being pitched on “AI-powered everything,” this is the contrarian half of the conversation.

The TL;DR rule of thumb at the top, before the list: add AI to a surface only when the cost of being slightly wrong is recoverable, when there is a deterministic fallback, and when a human can audit the decision after the fact. If any of those three are missing, the AI does not belong there yet.

1. The deterministic remediation path

If a runbook step is “restart the pod, then check the health endpoint,” AI does not improve it. The step is already deterministic, already fast, already reliable. Putting an LLM in front of it adds latency, adds a failure mode (the LLM might paraphrase the step incorrectly), and adds a cost line on your monthly bill — all in exchange for nothing.

The temptation is to “wrap” the runbook in a chat surface. Resist it. Wrap it in a button. The chat surface is appropriate when the next step is not obvious; when it is obvious, the button is faster, cheaper, and more reliable.
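
To make the contrast concrete, here is roughly what the button runs: a minimal sketch with made-up deployment and endpoint names rather than anything from a real runbook. Every step is a plain command with a plain exit code, and there is no model call anywhere in it.

```python
import subprocess
import sys
import time
import urllib.request

# Hypothetical values for illustration only.
NAMESPACE = "payments"
DEPLOYMENT = "checkout-api"
HEALTH_URL = "http://checkout-api.payments.svc.cluster.local/healthz"

def restart_and_verify() -> bool:
    """Deterministic runbook step: restart the deployment, then poll the health endpoint."""
    subprocess.run(
        ["kubectl", "rollout", "restart", f"deployment/{DEPLOYMENT}", "-n", NAMESPACE],
        check=True,
    )
    subprocess.run(
        ["kubectl", "rollout", "status", f"deployment/{DEPLOYMENT}", "-n", NAMESPACE, "--timeout=120s"],
        check=True,
    )
    # Poll the health endpoint a few times before declaring success.
    for _ in range(10):
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=2) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(3)
    return False

if __name__ == "__main__":
    sys.exit(0 if restart_and_verify() else 1)
```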

Rule: If the next step is deterministic and consistent, do not put a probabilistic decision-maker in front of it.

2. The hot path of a financial transaction

I have watched a team propose putting an LLM in the synchronous path of a payment authorization. The pitch was that the LLM would “enrich” the request with fraud-relevant context. The pitch sounded reasonable. It was wrong.

Inference latency is long-tailed. The median is fine; the tail is not. A P99 of 14 seconds is normal for a non-trivial prompt. A payment authorization that times out at 10 seconds will fail enough of the time, on the calls that matter most, to materially erode trust in the system. And when the inference provider has a regional outage (and they will), the entire payment surface goes with it.

Take the LLM out of the hot path. Put it on the side — async fraud signal generation, async risk scoring, async post-hoc review. Synchronous AI on a transactional path is a P99 problem dressed up as a feature.
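
Here is the shape I am arguing for, sketched with hypothetical names throughout (the queue, the scorer, and the threshold are all stand-ins): the synchronous handler does only deterministic work inside its latency budget, and the model call lives on a side channel that cannot block the authorization. In production the in-memory queue would be Kafka, SQS, or whatever your platform already runs; the point is the boundary, not the transport.

```python
import queue
import threading

# Hypothetical side channel; in production this would be Kafka, SQS, or similar.
risk_review_queue: "queue.Queue[dict]" = queue.Queue()

def authorize_payment(request: dict) -> dict:
    """Synchronous hot path: deterministic rules only, no model call, tight latency budget."""
    decision = {"request_id": request["id"], "approved": request["amount"] < 10_000}
    risk_review_queue.put(request)   # hand the request to the async path and return immediately
    return decision

def risk_review_worker() -> None:
    """Async side path: the slow, fallible model call lives here, off the latency budget."""
    while True:
        request = risk_review_queue.get()
        score = call_llm_risk_scorer(request)      # can take seconds, can fail, can be retried
        record_risk_signal(request["id"], score)   # feeds later review, not this authorization

def call_llm_risk_scorer(request: dict) -> float:
    return 0.1   # stand-in for the real inference call

def record_risk_signal(request_id: str, score: float) -> None:
    print(f"risk signal for {request_id}: {score}")

threading.Thread(target=risk_review_worker, daemon=True).start()
print(authorize_payment({"id": "req-42", "amount": 1_250}))
```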

Rule: Anything user-blocking on a tight latency budget should not have a synchronous LLM in its path.

3. The compliance attestation

If a regulator asks “who approved this change?” and the answer is “the agent did,” you have a problem. Most regulatory frameworks, from the EU AI Act (whose high-risk obligations become enforceable in August 2026) to the EASA AI Trustworthiness Framework and the FAA AI Safety Assurance Roadmap, require effective human oversight for high-risk decisions. A click-through approval that is rubber-stamped 99.7% of the time is not effective oversight. It is a signature.

If you are in a regulated industry, the AI cannot be the final approver of an action that has compliance implications. It can recommend, it can pre-fill, it can summarize. The human-in-the-loop checkpoint is a regulatory control, not a UX detail. Designing it as theater means it will fail an audit.
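
One way to make that checkpoint real rather than theatrical is to encode it in the change record itself. This is an illustrative sketch, not any specific compliance schema; the field names are invented, but the invariant is the one that matters: the agent can fill in every field except the approver.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional

# Illustrative sketch only: field names are made up, not a real compliance schema.
@dataclass
class ChangeApproval:
    change_id: str
    proposed_by: str                      # the AI may propose, e.g. "agent:remediation-bot"
    summary: str                          # an AI-generated summary is fine here
    approved_by: Optional[str] = None     # must end up as a human identity, never the agent
    approved_at: Optional[datetime] = None

    def approve(self, human_identity: str) -> None:
        """Record the human approver; refuse to let an agent sign off on its own change."""
        if human_identity.startswith("agent:"):
            raise ValueError("an agent cannot be the approver of record")
        self.approved_by = human_identity
        self.approved_at = datetime.now(timezone.utc)

change = ChangeApproval(
    change_id="CHG-1234",
    proposed_by="agent:remediation-bot",
    summary="Scale payments-api from 6 to 9 replicas to absorb the queue backlog.",
)
change.approve("jane.doe@example.com")    # the audit trail now names a human
```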

Rule: AI proposes; humans dispose. In regulated contexts, this is not optional.

4. The single point of failure

If your on-call experience falls back to “sorry, the AI is having a bad afternoon, no runbooks today,” the AI is a single point of failure. That is a worse posture than having no AI at all, because at least the no-AI posture forces you to keep the static runbook current.

For every load-bearing AI surface, the operational question is: what does this surface do when AI is unavailable? If the answer is “it stops working,” you have not designed an AI-enabled surface. You have designed a surface that depends on AI, which is the same kind of single point of failure you would never accept for a database or a payment gateway.
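
The fallback does not need to be clever. A sketch of the pattern, with hypothetical runbook URLs and a stub in place of the model call: try the AI with a hard timeout, and when it fails, serve the static index you were going to maintain anyway.

```python
import concurrent.futures

STATIC_RUNBOOKS = {
    # Deterministic fallback: a plain, versioned index that still works when the AI does not.
    "payments-api-5xx": "https://runbooks.example.com/payments-api-5xx",
    "checkout-latency": "https://runbooks.example.com/checkout-latency",
}

_pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

def ai_suggest_runbook(alert_name: str) -> str:
    """Stand-in for an LLM-backed retrieval call; slow, and allowed to fail."""
    raise TimeoutError("inference provider is having a bad afternoon")

def get_runbook(alert_name: str) -> str:
    """AI when it is available, static index when it is not."""
    try:
        return _pool.submit(ai_suggest_runbook, alert_name).result(timeout=2.0)
    except Exception:
        # Deterministic fallback: the on-call experience degrades to "normal", not to "nothing".
        return STATIC_RUNBOOKS.get(alert_name, "https://runbooks.example.com/index")

print(get_runbook("payments-api-5xx"))
```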

Rule: Every load-bearing AI surface needs a deterministic fallback. If you do not have one, you do not have a production surface.

5. The high-frequency, low-value decision

AI is expensive per call. A call that costs three cents is not a problem. The same call running ten million times a day is $300,000 a day, north of $100 million a year, and the question becomes whether the value generated by the AI on each of those calls (measured against the same call done with a deterministic rule, a regex, a small classifier, or nothing at all) actually justifies the line item.
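
That arithmetic deserves to be written down in the design review, not discovered on the invoice. A back-of-the-envelope version, using the illustrative numbers above:

```python
# Back-of-the-envelope: per-call cost x call rate, annualized.
per_call_usd = 0.03          # three cents per inference call
calls_per_day = 10_000_000   # e.g. log-line classification at full scale

daily_cost = per_call_usd * calls_per_day   # $300,000 per day
annual_cost = daily_cost * 365              # roughly $109.5 million per year

print(f"${daily_cost:,.0f}/day, ${annual_cost:,.0f}/year")
```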

The places I have seen this go wrong: log-line classification at full scale, alert deduplication on a high-cardinality stream, every-message Slack-thread enrichment. In each case, a 50-line classical heuristic got 95% of the way there at 1% of the cost. The remaining 5% of cases should be the AI’s job, not all 100%.
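
The shape that implies is heuristic-first routing: cheap deterministic rules see every call, and the model only sees what they cannot handle. A minimal sketch, with made-up patterns and a stand-in for the model call:

```python
import re

LOG_PATTERNS = [
    # A few lines of classical heuristics handle the overwhelming majority of traffic.
    (re.compile(r"OOMKilled|OutOfMemory"), "memory"),
    (re.compile(r"connection (refused|reset|timed out)", re.I), "network"),
    (re.compile(r"permission denied|403 forbidden", re.I), "auth"),
]

def classify_log_line(line: str) -> str:
    for pattern, label in LOG_PATTERNS:
        if pattern.search(line):
            return label            # cheap, deterministic, covers most of the volume
    return classify_with_llm(line)  # the expensive call handles only the residue

def classify_with_llm(line: str) -> str:
    # Hypothetical stand-in for the model call; this is the 5%, not the 100%.
    return "needs-review"

print(classify_log_line("dial tcp 10.0.0.5:5432: connection refused"))
```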

Rule: Per-call AI cost × call rate = annual line item. Do the math before you ship.

6. The privacy boundary you cannot cross

If your data cannot leave your tenancy, your region, or your trust boundary, hosted AI is not your tool. There are workloads — patient data, classified material, certain financial records, jurisdictions with strict residency requirements — where the answer is self-hosted inference, full stop. Self-hosted is not free; it brings GPU operations, model lifecycle, and a much higher operational burden into your platform team’s job description.
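
If the boundary is real, it is worth enforcing in code rather than in a meeting note. A small guard of this shape, sketched here with a made-up allowlist, sits in front of whatever inference client you use and turns “the data stays inside the tenancy” from a promise into an exception:

```python
from urllib.parse import urlparse

# Hypothetical allowlist: only inference endpoints inside our own tenancy.
APPROVED_INFERENCE_HOSTS = {
    "inference.internal.example.com",
    "llm-gateway.eu-central-1.internal.example.com",
}

def assert_inside_boundary(endpoint_url: str) -> None:
    """Refuse to send a prompt anywhere that is not an approved, in-boundary host."""
    host = urlparse(endpoint_url).hostname or ""
    if host not in APPROVED_INFERENCE_HOSTS:
        raise PermissionError(f"{host} is outside the approved data boundary")

assert_inside_boundary("https://inference.internal.example.com/v1/chat")    # fine
# assert_inside_boundary("https://api.some-hosted-provider.com/v1/chat")    # raises
```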

The mistake I see most often comes from teams that assume their data can cross the boundary because the contract did not explicitly say it could not. By the time legal reviews the prompt logs, the integration is in production, and removing it is a quarter's worth of work.

Rule: Confirm the data boundary in writing before you build the integration, not after.

7. The behavior you cannot test

A non-deterministic interface is testable, but only with discipline: golden-set evaluations, structural assertions, production sampling. If you do not have any of those in place, and if the team is not committed to building them, the AI surface will drift, the drift will be invisible, and the day a model provider rotates weights, you will discover the drift in production.
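
A golden-set evaluation does not need to be elaborate to be useful. Here is a minimal sketch, with illustrative alert text and a stand-in for the model call, that asserts structure rather than exact wording: the output must parse, carry the required fields, use the allowed severity vocabulary, and stay anchored to the input.

```python
import json

# Golden set: a handful of inputs with known-good expectations, run on every prompt change
# and on a schedule. Names and fields are illustrative, not any specific eval framework.
GOLDEN_ALERTS = [
    {"alert": "payments-api-5xx spiking in eu-central-1", "must_mention": "payments-api"},
    {"alert": "checkout latency p99 above 2s", "must_mention": "checkout"},
]

def summarize_alert(alert_text: str) -> str:
    """Stand-in for the LLM call under test; assumed to return JSON."""
    return json.dumps({"service": alert_text.split()[0], "severity": "high", "summary": alert_text})

def test_summaries_are_structurally_valid():
    for case in GOLDEN_ALERTS:
        out = json.loads(summarize_alert(case["alert"]))          # must parse as JSON
        assert {"service", "severity", "summary"} <= out.keys()   # must carry the required fields
        assert out["severity"] in {"low", "medium", "high"}       # must use the allowed vocabulary
        assert case["must_mention"] in out["summary"]             # must stay anchored to the input

if __name__ == "__main__":
    test_summaries_are_structurally_valid()
    print("golden set passed")
```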

This one is less about whether AI is the right tool and more about whether the team is set up to ship AI safely. If the team has never run a model evaluation, has no concept of a regression test for a prompt, and considers “it worked when I demoed it” sufficient validation, the answer is not “no AI.” The answer is “no AI yet — first build the test discipline, then ship.”

Rule: No production AI surface ships without a test harness that can fail. The first failure is the goal.

What this list is not

This is not an argument against AI in production. It is the opposite — an argument for a clear-eyed view of where AI earns its place and where it does not. The places it does earn its place are most of the interesting work in our field right now: alert enrichment, runbook retrieval, postmortem drafting, on-call experience improvement, intelligent log analysis, code review, agentic investigation. I have shipped each of those. I will write about each of those in turn.

The point of this post is the discipline that has to come with all of it. AI is not a magic ingredient that improves any system you sprinkle it on. It is a dependency, with failure modes, with a cost line, with a regulatory profile, and with a testing burden. The teams that treat it that way will ship faster and with fewer incidents than the teams that do not.

If you take one thing from this post, take this: before adding AI to a surface, write down what the surface does when the AI is wrong, when the AI is down, and when the auditor asks who approved the action. If you cannot answer those three questions in writing, you are not ready to add AI to that surface yet. That is not a permanent verdict. It is a punch list.


If you found this useful, the rest of the work I’m doing on AI-enabled cloud infrastructure lives at cloudandsre.com, with the open-source tools at github.com/ajinb.