AI Platform & Cloud
Infrastructure Architect
Designing AI-enabled cloud platforms — from agentic infrastructure and MCP servers to self-healing systems — for safety-critical and regulated industries.
By Ajin Baby · 15+ years architecting cloud systems · 2x founder before architect. Writing and shipping open-source code at the intersection of AI, cloud, and reliability.
Recent Posts
-
Skills for AI agents that do SRE work
Most agent skills are chatbot prompts in disguise. The ones I just published are operator tools — opinionated, output-contracted, with mandatory discipline sections that say what the skill won't do. Three skills, portable across Claude Code, Claude Desktop, Codex CLI, and any markdown-prompt runtime.
-
Alert fatigue? Let AI triage.
How I built alert-explainer — an open-source service that sits between Alertmanager and your on-call routing and turns every Prometheus alert into a plain-English brief in 1–4 seconds for under a cent. Design, reliability patterns, and production tradeoffs.
-
When NOT to Use AI in Production SRE
Most AI-for-SRE writing tells you where AI helps. Here are seven places it actively hurts — and the operational rule of thumb I use to decide.
-
Building incident-scribe: Slack Thread to Incident Report with Claude
How I built an open-source tool that turns messy Slack incident threads into blameless, structured incident reports in under 30 seconds — design, reliability patterns, and production tradeoffs.
-
Why AI is the Next SRE Superpower
After 15 years in cloud infrastructure and SRE — including 8+ years building safety-critical systems at a global aviation-SaaS platform — here's why I believe AI is the most significant shift in how we operate systems since Kubernetes.
-
REST vs. GraphQL APIs – how to choose?
REST APIs has been around for a while now, while GraphQL is relatively new to the game. While they both are used for data transfer - sending HTTP requests and receive HTTP responses, they both have th