Blog
All posts — newest first.
-
Mental models for applying AI to infrastructure
Most writing about AI in infrastructure is tutorials. Tutorials answer how. Mental models answer whether. Here are seven I use as the front gate before any LLM goes near a production system — recoverability, reversibility, per-call economics, the autonomy ladder as a risk function, tools-not-chat, context as substrate, and identity that travels with the action.
-
Prompt engineering for SRE: patterns that actually work in production
Most prompt-engineering advice is written for chatbots. SRE workloads are different — the input is messy, the output has to be machine-readable, and there's no human to gracefully handle a wrong answer. Here are six patterns I've shipped to production for SRE LLM tools, and why each one earned its place.
-
The MCP gateway pattern: five jobs your agent runtime can't skip
Letting agents call MCP servers directly is the same mistake as letting microservices call each other without an API gateway. Here are the five jobs an MCP gateway has to do, and reproducible patterns for each — scope-token exchange, schema firewall, quarantine queue, provenance ledger, and a catalog/broker split.
-
Skills for AI agents that do SRE work
Most agent skills are chatbot prompts in disguise. The ones I just published are operator tools — opinionated, output-contracted, with mandatory discipline sections that say what the skill won't do. Three skills, portable across Claude Code, Claude Desktop, Codex CLI, and any markdown-prompt runtime.
-
Alert fatigue? Let AI triage.
How I built alert-explainer — an open-source service that sits between Alertmanager and your on-call routing and turns every Prometheus alert into a plain-English brief in 1–4 seconds for under a cent. Design, reliability patterns, and production tradeoffs.
-
When NOT to Use AI in Production SRE
Most AI-for-SRE writing tells you where AI helps. Here are seven places it actively hurts — and the operational rule of thumb I use to decide.
-
Building incident-scribe: Slack Thread to Incident Report with Claude
How I built an open-source tool that turns messy Slack incident threads into blameless, structured incident reports in under 30 seconds — design, reliability patterns, and production tradeoffs.
-
Why AI is the Next SRE Superpower
After 15 years in cloud infrastructure and SRE — including 8+ years building safety-critical systems at a global aviation-SaaS platform — here's why I believe AI is the most significant shift in how we operate systems since Kubernetes.
-
REST vs. GraphQL APIs – how to choose?
REST APIs have been around for a while now, while GraphQL is relatively new to the game. While both are used for data transfer - sending HTTP requests and receiving HTTP responses, they both have th
-
GraphQL for API Development
What is it? GraphQL is a query language for APIs and a runtime for fulfilling those queries with your existing data. It was originally created at Facebook in 2012 for describing the capabilities and r
-
What is Address Resolution Protocol (ARP)?
ARP is a communication protocol used for discovering the link-layer address associated with a given internet-layer address. E.g., finding the MAC (media access control) addresses associated with IPv4 addresses
-
The CAP theorem
Let's look at the different system design options (databases particularly) in detail: 1. Consistency and Availability over Partition tolerance: Here we prefer having some data and same data for ev
-
The 7 layers of ISO OSI model
The International Organization for Standardization came up with the Open Systems Interconnection (OSI) conceptual model which provides a standard for diverse computer systems to be able to communicate
-
NAS vs SAN – A brief comparison.
Network Attached Storage [NAS]: NAS is a specialized data storage device connected to a network, providing data access to other machines on the network over Ethernet. Its hardware, software, or sp
-
What is RAID?
RAID [Redundant Array of Independent Disks] is a way of adding redundancy to stored data, protecting it from disk failures. RAID makes it possible to use lower-priced disks in large
-
Amazon Leadership Principles – my thoughts.
Leadership principles are important for Amazon - when they hire, when they do evaluations on the job, etc. If you are preparing for an interview with Amazon, you should be expecting a lot of behavioral
-
Getting started with Ansible.
Ansible is an open-source tool that enables the automation, configuration, and orchestration of infrastructure. It fully embraces the concept of Infrastructure as Code. We can build out our entire sys
-
Introduction to YAML!
YAML Ain't Markup Language! It is designed with a focus on human-readable formatting. The creators of YAML wanted it to be easily readable by humans. It is portable, easily extendable, and suppor
-
Preparing for AWS Solutions Architect (Professional) Certification.
Backstory: I cleared the AWS Solutions Architect - Associate level certification in 2019. It wasn't too hard; I could clear it with a couple of years of on-the-job experience and almost a month of read
-
Hello world!
Hola, So, this is my first post here. I have been thinking about setting up a blog for a while now. I believe I have done enough thinking (around 5-6 years) and I should do som