AI Platform & Cloud Infrastructure Architect
Designing AI-enabled cloud platforms — from agentic infrastructure and MCP servers to self-healing systems — for safety-critical and regulated industries.
By Ajin Baby · 15+ years architecting cloud systems · 2x founder turned architect. Writing and shipping open-source code at the intersection of AI, cloud, and reliability.
Recent Posts
- The CAP theorem in AI-native distributed systems
CAP didn't get repealed when LLMs showed up. But the costs of choosing C, A, or P shift when the datastore behind the system is a vector index, a context graph, or a model-served retrieval layer. A short revisit of the trade-offs, framed for teams building AI-enabled infrastructure.
- NAS vs SAN for GPU workloads — what changed when AI showed up
The classical NAS-vs-SAN decision was about file vs block, Ethernet vs Fibre Channel, and how much you wanted to pay. GPU training and inference rewrote the question. Here's how the calculus shifts when your storage has to keep an A100 or H100 cluster fed.
- What is an AI agent? A primer for cloud engineers
A short primer on AI agents — the perceive-reason-act loop, what separates an agent from a one-shot LLM call, the classical agent types (reflex, model-based, goal-based, utility-based, learning), and how they map onto the agents running in modern SRE and platform tooling.
- What is Model Context Protocol (MCP)?
A short primer on Model Context Protocol — the open standard that lets AI applications talk to tools and data sources through a uniform interface. Covers the host/client/server architecture, the data layer (JSON-RPC) and transport layer split, and why it matters for cloud and platform teams.
- What is Retrieval-Augmented Generation (RAG)?
A short primer on Retrieval-Augmented Generation — the pattern that grounds an LLM's answer in documents you actually trust. Covers the indexing and serving paths, the role of the embedding model and vector index, and the failure modes that catch teams off guard in production.
- Mental models for applying AI to infrastructure
Most writing about AI in infrastructure is tutorials. Tutorials answer how. Mental models answer whether. Here are seven I use as the front gate before any LLM goes near a production system — recoverability, reversibility, per-call economics, the autonomy ladder as a risk function, tools-not-chat, context as substrate, and identity that travels with the action.