Why AI is the Next SRE Superpower

After 15 years in cloud infrastructure and SRE — including 8 years building safety-critical aviation systems at Jeppesen ForeFlight — here's why I believe AI is the most significant shift in how we operate systems since Kubernetes.


I’ve been an on-call engineer.

I know what 3am looks like — the Slack pings, the runbook that’s two years out of date, the Prometheus alert with a name that tells you nothing, the slow scroll through 10,000 log lines looking for the one that matters.

I’ve done it at companies small and large. For the past eight years I’ve done it at Jeppesen ForeFlight, building and operating the cloud infrastructure that pilots depend on — where “the system is down” has a different weight than it does in most software shops.

After 15 years in this field, I’ve seen a lot of things get called transformational. Most weren’t.

AI in infrastructure is different. Here’s why I believe it — and what I’m doing about it.


The problem AI actually solves in SRE

The chronic problem in on-call isn’t that engineers aren’t smart enough. It’s that there’s too much context scattered across too many places, and too little time to gather it when something is breaking.

You get paged. You look at the alert. The alert tells you a metric crossed a threshold. To understand why, you need to:

  • Know what the service does
  • Know its upstream and downstream dependencies
  • Know what changed recently
  • Know what this alert has meant the last three times it fired
  • Know the runbook — and whether it’s still accurate

None of that is in the alert. All of it is in your head, or in Confluence, or in a Slack thread from six months ago, or in the commit history, or in tribal knowledge that left when someone quit.

This is a retrieval and synthesis problem. That’s exactly what large language models are good at.


What AI can do right now (with working code)

I’m not talking about AGI. I’m not talking about AI that autonomously manages your infrastructure while you sleep. I’m talking about specific, narrow tasks where LLMs are genuinely useful today.

Explaining alerts in plain English. Give Claude a Prometheus alert payload and get back: what this alert means, what probably caused it, and a triage checklist ordered by probability. This alone can dramatically cut MTTR on unfamiliar alerts — especially in on-call rotations where not everyone knows every service.
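Here's a minimal sketch of that idea — not the toolkit's actual code. The prompt wording, the triage-checklist framing, and the model id are all illustrative assumptions; the only real API used is the Anthropic Python SDK's `messages.create`.

```python
import json

def build_alert_prompt(alert: dict) -> str:
    """Turn an Alertmanager-style alert payload into an LLM prompt."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return (
        "You are an SRE assistant. Explain this Prometheus alert in plain "
        "English: what it means, the most likely causes, and a triage "
        "checklist ordered by probability. If information is missing, say "
        "so instead of guessing.\n\n"
        f"Alert name: {labels.get('alertname', 'unknown')}\n"
        f"Severity: {labels.get('severity', 'unknown')}\n"
        f"Labels: {json.dumps(labels, sort_keys=True)}\n"
        f"Annotations: {json.dumps(annotations, sort_keys=True)}\n"
    )

def explain_alert(alert: dict) -> str:
    """Send the prompt to Claude. Needs ANTHROPIC_API_KEY in the environment."""
    import anthropic  # pip install anthropic
    client = anthropic.Anthropic()
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # substitute whatever model you use
        max_tokens=1024,
        messages=[{"role": "user", "content": build_alert_prompt(alert)}],
    )
    return response.content[0].text
```

Note the "say so instead of guessing" line — with incomplete 3am payloads, telling the model how to handle missing context matters as much as the question itself.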

Converting Slack threads to incident reports. Your post-mortem process probably sucks — not because nobody cares, but because writing a structured report at the end of a stressful incident is the last thing anyone wants to do. Feed the incident Slack thread to an LLM and get a draft report with timeline, root cause, impact, and action items. Then edit it. You'll spend ten minutes instead of two hours.
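The core of that workflow is just two transformations: flatten the thread into a transcript, then wrap it in a prompt that names the report sections you want. A sketch — the message shape mirrors Slack's `ts`/`user`/`text` fields, but the section names and the UNCONFIRMED convention are my own assumptions:

```python
def thread_to_transcript(messages: list[dict]) -> str:
    """Render Slack-style messages as '[ts] user: text' lines, oldest first."""
    ordered = sorted(messages, key=lambda m: float(m["ts"]))
    return "\n".join(
        f"[{m['ts']}] {m.get('user', 'unknown')}: {m.get('text', '')}"
        for m in ordered
    )

def build_report_prompt(messages: list[dict]) -> str:
    """Ask for a structured draft, with noise filtered and inferences flagged."""
    return (
        "Draft an incident report from this Slack thread. Use these sections: "
        "Summary, Timeline, Root Cause, Impact, Action Items. Ignore side "
        "conversations and acknowledgements like 'lgtm'. Mark anything you "
        "infer rather than read directly as UNCONFIRMED.\n\n"
        + thread_to_transcript(messages)
    )
```

Asking the model to flag inferences as UNCONFIRMED is what keeps the draft honest enough to edit rather than rewrite.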

Finding anomalies in log files. Grep finds what you tell it to look for. An LLM can look at 200 lines of logs and tell you what’s unusual — error clusters, cascading failures, timing anomalies you didn’t know to search for.
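In practice the interesting part is the pre-processing, not the prompt: raw logs are mostly repetition, so you collapse exact duplicates (with a repeat count) before handing the window to the model. A sketch under those assumptions — the window size and prompt wording are illustrative:

```python
from collections import Counter

def collapse_duplicates(lines: list[str]) -> list[str]:
    """Deduplicate log lines, annotating repeats, preserving first-seen order."""
    counts = Counter(lines)
    seen: set[str] = set()
    out = []
    for line in lines:
        if line in seen:
            continue
        seen.add(line)
        n = counts[line]
        out.append(line if n == 1 else f"{line}  (x{n})")
    return out

def build_log_prompt(lines: list[str], window: int = 200) -> str:
    """Prompt over the last `window` distinct lines, duplicates collapsed."""
    body = "\n".join(collapse_duplicates(lines)[-window:])
    return (
        "These are application logs. List anything unusual: error clusters, "
        "cascading failures, timing anomalies. Quote the exact lines you are "
        "reacting to, and answer 'nothing unusual' if that is the honest "
        "answer.\n\n" + body
    )
```

Requiring the model to quote the lines it reacted to makes its output checkable — you can grep for the quoted line and confirm it exists before acting on the analysis.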

I’ve built these as standalone, working Python scripts and published them at github.com/ajinb/sre-ai-toolkit. Each one is runnable in under 5 minutes. No frameworks, no magic — just the Anthropic API and some carefully designed prompts.


Why most engineers are doing this wrong

The AI tooling space is full of demos. Things that work on a curated example, in a controlled environment, with a clean dataset.

Production SRE is not that.

Your logs aren’t clean. Your alerts fire at 3am with incomplete context. Your Slack threads are full of noise, side conversations, and people who type “lgtm” without reading anything.

The engineers who are going to get real value from AI in infrastructure are the ones who understand both sides: the messy reality of operating systems at scale, and how to design LLM interactions that are reliable under that messiness.

That means:

  • Prompts that handle partial or missing information gracefully
  • Fallback behavior when the model is uncertain
  • Output formats that are parseable by both humans and downstream systems
  • Not trusting the model blindly — especially for autonomous actions

This is the gap I’m trying to fill with everything I publish at cloudandsre.com. Not AI demos. Production-grade patterns, with working code.


What’s coming

Over the next year, I’m building a series of open-source tools that apply AI to specific, high-value SRE problems:

  • incident-scribe — full incident report generation from Slack threads, with PagerDuty and Confluence output formats
  • alert-explainer — a standalone service that enriches every Prometheus alert with AI-generated context before it hits your on-call engineer
  • k8s-ai-operator — a Kubernetes operator that detects pod failures and applies AI-reasoned remediation, with human-in-the-loop controls
  • llm-log-analyzer — AI log pattern recognition designed for the real world: noisy, high-volume, incomplete

Each one ships with a blog post, working code, and an explanation of the production-grade patterns it implements.

I’m also writing a book — Self-Healing Infrastructure: Building Autonomous Cloud Systems with AI — that pulls all of this together into a reference architecture for AI-native SRE.


Start here

If you want to see what AI-assisted SRE looks like in practice — not in theory — clone sre-ai-toolkit and run alert_explainer.py against one of your real Prometheus alerts.

The first time it correctly identifies that your high-error-rate alert is actually a database connection pool exhaustion problem — before you’ve opened a single log — you’ll understand why I think this matters.


Ajin Baby is a Computing Architect at Jeppesen ForeFlight (Boeing) and the founder of cloudandsre.com, where he publishes production-grade tooling at the intersection of AI and SRE. He is currently pursuing an MS in Artificial Intelligence and is a 15+ year IEEE member.