AI Platform & Cloud Infrastructure Architect
Designing AI-enabled cloud platforms — from agentic infrastructure and MCP servers to self-healing systems — for safety-critical and regulated industries.
By Ajin Baby · 15+ years architecting cloud systems · two-time founder turned architect. Writing and shipping open-source code at the intersection of AI, cloud, and reliability.
Recent Posts
- Context engineering: the window is a budget, not a bucket
The context window is the working memory of every agent you ship, and most teams treat it like a junk drawer. Context engineering as a discipline: four operations (write, select, compress, isolate), a token budget you actually allocate, the failure modes that bite at scale, and a worked SRE example.
- Harness engineering: the third phase of AI maturity
Agent = Model + Harness. In 2026 the model is rarely the bottleneck — the scaffolding around it is. Here's what a production-grade SRE harness actually contains, with a ~40-line reference implementation you can run offline: tool orchestration, verification, memory, guardrails, and observability.
-
Observability and incident response — the SRE basics
A primer on the two operational disciplines every SRE team needs to run: observability (logs, metrics, traces) and incident response (roles, severities, blameless postmortems). Includes the practical shape of an incident and how AI is starting to absorb the lower rungs of both.
-
Toil and the 50% rule — what it is, how to measure it, and how to kill it
A primer on toil — the manual, repetitive, automatable work that quietly eats SRE teams. Covers Google's six-part definition, the 50% cap, how to measure toil honestly, and how the 2026 generation of AI agents changes the toil-elimination playbook.
-
SLI, SLO, SLA, and error budgets — the reliability contract explained
A primer on the four numbers every SRE team needs to agree on: Service Level Indicators, Objectives, Agreements, and the error budget that falls out of them. Includes concrete examples, the math behind 'nines,' and what the contract looks like once AI agents start contributing to the burn rate.
- What is Site Reliability Engineering (SRE)?
A primer on Site Reliability Engineering — what SRE is, where it came from at Google, how it differs from DevOps and Platform Engineering, and the core principles that make it work. Includes a short note on what changes in 2026 as AI moves into the on-call seat.