AI Platform & Cloud Infrastructure Architect
Designing AI-enabled cloud platforms — from agentic infrastructure and MCP servers to self-healing systems — for safety-critical and regulated industries.
By Ajin Baby · 15+ years architecting cloud systems · 2x founder before architect. Writing and shipping open-source code at the intersection of AI, cloud, and reliability.
Recent Posts
- Observability and incident response — the SRE basics
A primer on the two operational disciplines every SRE team needs to run: observability (logs, metrics, traces) and incident response (roles, severities, blameless postmortems). Includes the practical shape of an incident and how AI is starting to absorb the lower rungs of both.
- Toil and the 50% rule — what it is, how to measure it, and how to kill it
A primer on toil — the manual, repetitive, automatable work that quietly eats SRE teams. Covers Google's six-part definition, the 50% cap, how to measure toil honestly, and how the 2026 generation of AI agents changes the toil-elimination playbook.
- SLI, SLO, SLA, and error budgets — the reliability contract explained
A primer on the four numbers every SRE team needs to agree on: Service Level Indicators, Objectives, Agreements, and the error budget that falls out of them. Includes concrete examples, the math behind 'nines,' and what the contract looks like once AI agents start contributing to the burn rate.
- What is Site Reliability Engineering (SRE)?
A primer on Site Reliability Engineering — what SRE is, where it came from at Google, how it differs from DevOps and Platform Engineering, and the core principles that make it work. Includes a short note on what changes in 2026 as AI moves into the on-call seat.
- What are vector embeddings?
A short primer on vector embeddings — the numerical representation that lets a computer treat 'the meaning of this text' as something it can search, cluster, and compare. Covers what an embedding actually is, how similarity works, why model choice matters more than retrieval quality, and the production failure modes you only see in evaluation.
- What is function calling (tool use)?
A short primer on function calling — the mechanism that lets an LLM decide to invoke an external function and let your code do the actual work. Covers the JSON-schema contract, the request/response loop, parallel and forced tool calls, and why every production AI agent in 2026 is built on this primitive.