Blog

All posts — newest first.

Jul 6, 2026 | AI, Reliability, SRE

Autonomy is a budget, not a toggle: error budgets for AI operators

SRE solved runaway release risk twenty years ago with the error budget. The same mechanism governs AI agents: authority per action class, burned in proportion to blast radius, demoted fast, promoted slow. In simulation it tracks a full-knowledge oracle within ~8%. Here's the model.
Jul 6, 2026 | AI, Reliability, Architecture

The reliability gap: a framework for trusting autonomous SRE agents

In January 2026 an autonomous airline agent rebooked 1,247 passengers onto the wrong flights in one weather event. It worked in the demo. The gap between what agents can do and what we can trust them to do is a reliability problem — and reliability is not model accuracy. Here's how to measure it.
Jul 2, 2026 | AI, MCP, Platform Engineering

Agentic Resource Discovery: I Read the Spec, Then Published a Catalog

Google, Microsoft, and Hugging Face shipped Agentic Resource Discovery — a well-known ai-catalog.json plus a registry search API so agents can find, verify, and connect to tools without scraping the DOM. The real schema, a working catalog, serving config, and the gotchas that break it.
Jun 30, 2026 | AI, Platform Engineering, Architecture

The Five Types of Agentic Memory (and When to Use Each)

Agentic memory isn't one thing — it's five: working, episodic, semantic, procedural, and entity/profile. Each maps to a different storage substrate, eviction policy, and failure mode. A practitioner's decision guide for choosing the right memory for the job.
Jun 26, 2026 | AI, Platform Engineering, Architecture

Agentic AI Patterns: The Decision Guide (Part 1 of 3)

Six named agentic AI patterns — ReAct, Plan-and-Execute, Critic loop, Parallel fan-out, Human-in-the-loop gate, and Supervisor — with a decision flowchart and quick-reference table for picking the right architecture before you build.
Jun 26, 2026 | AI, Platform Engineering, Strategy

Agentic AI Patterns: The Maturity Model (Part 3 of 3)

A five-level maturity model for agentic AI — from manual to multi-agent mesh — with a self-assessment to find where your team sits, what the jump to the next level actually requires, and where regulated industries should draw the line.
Jun 26, 2026 | AI, Platform Engineering, Reliability

Agentic AI Patterns: Where They Break in Production (Part 2 of 3)

Every agentic AI pattern looks clean in a demo. Here's where each one fails in production — the subtle failure modes, the operational signals that you're hitting them, and the mitigations that actually work.
Jun 26, 2026 | AI, Developer Tools, Platform Engineering

Making Claude Code Work with Locally Deployed Models

Claude Code defaults to Anthropic's API, but you can point it at Ollama, vLLM, LM Studio, or any private endpoint. Here's when that's the right call, and exactly how to configure it — including what you give up.
Jun 26, 2026 | AI, Data Engineering, Platform Engineering

OKF: The Missing Context Layer for AI Agents

The Open Knowledge Format gives AI agents a structured vocabulary for what data they're touching, where it came from, and what it means — turning blind data consumption into informed, auditable reasoning.
Jun 21, 2026 | AI, Reliability, MCP

Chaos engineering for MCP: break your tool-call plane before production does

LLM API calls fail 1–5% of the time, and a single agent task fans out into 10–20 tool calls. The MCP layer is now a distributed-systems tier with no reliability model. Here's how to fault-inject it on purpose — with mcp-chaos — before it breaks an incident response for you.
Jun 12, 2026 | AI, Reliability, Architecture

The trust gap: bounded autonomy for AI SRE agents

Vendors call 2026 the year of autonomous incident resolution. But SREs still face 50+ alerts a day at 60% false positives, and the trust frameworks lag the agents. Here's the autonomy-ladder model for what an AI SRE agent should — and should never — do on its own.
Jun 12, 2026 | AI, MCP, Platform Engineering

MCP goes stateless — what the 2026 release candidate means for your SRE tooling

The 2026-07-28 MCP release candidate is the biggest revision since launch: it deletes the session handshake for a stateless HTTP core and hardens OAuth against mix-up attacks. Here's what changes for the agents wired into your production systems — and the migration window you have to act in.
Jun 12, 2026 | AI, SRE, Reliability

The AI-native SRE stack — a 2026 reference guide

A practitioner's map of the AI-native SRE stack in 2026: the six layers from telemetry to bounded remediation, the tools that actually fill each one, and an honest read on where AI pays off — and where the New Relic and Datadog data says it doesn't yet.
Jun 6, 2026 | AI, Platform Engineering, Architecture

Context engineering: the window is a budget, not a bucket

The context window is the working memory of every agent you ship, and most teams treat it like a junk drawer. Context engineering as a discipline: four operations (write, select, compress, isolate), a token budget you actually allocate, the failure modes that bite at scale, and a worked SRE example.
Jun 5, 2026 | AI, Platform Engineering, Reliability

Agent sprawl is your next production incident

Every team shipping AI agents in 2026 is quietly recreating the microservices sprawl of 2015–2020 — faster, and with worse observability. Why agent sprawl is structurally the same failure, what's different this time, and the governance surface that contains it before it pages you at 3am.
Jun 5, 2026 | AI, Security, Platform Engineering

No anonymous inference endpoints — the MCP security principle you're probably violating

In 2026 the NSA and NIST both put MCP and AI agents on notice: the protocol that lets your agents act is also a centralized funnel for prompt injection and privilege abuse. Why 'no anonymous inference endpoints' is the principle most teams break by default, and how token exchange (RFC 8693) plus policy-as-code closes it.
Jun 5, 2026 | AI, Observability, Reliability

Observability for AI systems — what changes when your service calls an LLM

Your golden signals don't cover the failure that will actually page you: the model returned a confident, well-formed, wrong answer. What observability for AI-enabled systems has to add — context as a span, quality as a signal, and the shift from passive monitoring to active investigation — grounded in where the field is in 2026.
May 29, 2026 | AI, SRE, Open Source

Harness engineering: the third phase of AI maturity

Agent = Model + Harness. In 2026 the model is rarely the bottleneck — the scaffolding around it is. Here's what a production-grade SRE harness actually contains, with a ~40-line reference implementation you can run offline: tool orchestration, verification, memory, guardrails, and observability.
May 24, 2026 | SRE, Observability, Incident Response

Observability and incident response — the SRE basics

A primer on the two operational disciplines every SRE team needs to run: observability (logs, metrics, traces) and incident response (roles, severities, blameless postmortems). Includes the practical shape of an incident and how AI is starting to absorb the lower rungs of both.
May 23, 2026 | SRE, Reliability, Fundamentals

Toil and the 50% rule — what it is, how to measure it, and how to kill it

A primer on toil — the manual, repetitive, automatable work that quietly eats SRE teams. Covers Google's six-part definition, the 50% cap, how to measure toil honestly, and how the 2026 generation of AI agents changes the toil-elimination playbook.
May 22, 2026 | SRE, Reliability, Fundamentals

SLI, SLO, SLA, and error budgets — the reliability contract explained

A primer on the four numbers every SRE team needs to agree on: Service Level Indicators, Objectives, Agreements, and the error budget that falls out of them. Includes concrete examples, the math behind 'nines,' and what the contract looks like once AI agents start contributing to the burn rate.
May 21, 2026 | SRE, Platform Engineering, Fundamentals

What is Site Reliability Engineering (SRE)?

A primer on Site Reliability Engineering — what SRE is, where it came from at Google, how it differs from DevOps and Platform Engineering, and the core principles that make it work. Includes a short note on what changes in 2026 as AI moves into the on-call seat.
May 15, 2026 | AI, Architecture, Platform Engineering

What are vector embeddings?

A short primer on vector embeddings — the numerical representation that lets a computer treat 'the meaning of this text' as something it can search, cluster, and compare. Covers what an embedding actually is, how similarity works, why model choice matters more than retrieval quality, and the production failure modes you only see in evaluation.
May 15, 2026 | AI, Architecture, Platform Engineering

What is function calling (tool use)?

A short primer on function calling — the mechanism that lets an LLM decide to invoke an external function and let your code do the actual work. Covers the JSON-schema contract, the request/response loop, parallel and forced tool calls, and why every production AI agent in 2026 is built on this primitive.
May 15, 2026 | AI, Architecture, Platform Engineering

What is prompt caching?

A short primer on prompt caching — the LLM-provider feature that drops the cost of a repeated long prompt by 50–90% and the latency by half. Covers how prefix matching works, the TTL economics across Anthropic / OpenAI / Google, where caching helps and where it quietly does not, and the operational gotchas that determine whether your hit rate is 90% or 9%.
May 12, 2026 | AI, Architecture, Distributed Systems

The CAP theorem in AI-native distributed systems

CAP didn't get repealed when LLMs showed up. But the costs of choosing C, A, or P shift when the datastore behind the system is a vector index, a context graph, or a model-served retrieval layer. A short revisit of the trade-offs, framed for teams building AI-enabled infrastructure.
May 12, 2026 | AI, Storage, Infrastructure

NAS vs SAN for GPU workloads — what changed when AI showed up

The classical NAS-vs-SAN decision was about file vs block, ethernet vs fibre, and how much you wanted to pay. GPU training and inference rewrote the question. Here's how the calculus shifts when your storage has to keep an A100 or H100 cluster fed.
May 11, 2026 | AI, Agents, Platform Engineering

What is an AI agent? A primer for cloud engineers

A short primer on AI agents — the perceive-reason-act loop, what separates an agent from a one-shot LLM call, the classical agent types (reflex, model-based, goal-based, utility-based, learning) and how they map onto the agents running in modern SRE and platform tooling.
May 11, 2026 | AI, MCP, Platform Engineering

What is Model Context Protocol (MCP)?

A short primer on Model Context Protocol — the open standard that lets AI applications talk to tools and data sources through a uniform interface. Covers the host/client/server architecture, the data layer (JSON-RPC) and transport layer split, and why it matters for cloud and platform teams.
May 11, 2026 | AI, Architecture, Platform Engineering

What is Retrieval-Augmented Generation (RAG)?

A short primer on Retrieval-Augmented Generation — the pattern that grounds an LLM's answer in documents you actually trust. Covers the indexing and serving paths, the role of the embedding model and vector index, and the failure modes that catch teams off guard in production.
May 6, 2026 | AI, SRE, Architecture

Mental models for applying AI to infrastructure

Most writing about AI in infrastructure is tutorials. Tutorials answer how. Mental models answer whether. Here are seven I use as the front gate before any LLM goes near a production system — recoverability, reversibility, per-call economics, the autonomy ladder as a risk function, tools-not-chat, context as substrate, and identity that travels with the action.
May 5, 2026 | AI, SRE, Prompt Engineering

Prompt engineering for SRE: patterns that actually work in production

Most prompt-engineering advice is written for chatbots. SRE workloads are different — the input is messy, the output has to be machine-readable, and there's no human to gracefully handle a wrong answer. Here are six patterns I've shipped to production for SRE LLM tools, and why each one earned its place.
May 2, 2026 | AI, MCP, Platform Engineering

The MCP gateway pattern: five jobs your agent runtime can't skip

Letting agents call MCP servers directly is the same mistake as letting microservices call each other without an API gateway. Here are the five jobs an MCP gateway has to do, and reproducible patterns for each — scope-token exchange, schema firewall, quarantine queue, provenance ledger, and a catalog/broker split.
Apr 30, 2026 | AI, SRE, Open Source

Skills for AI agents that do SRE work

Most agent skills are chatbot prompts in disguise. The ones I just published are operator tools — opinionated, output-contracted, with mandatory discipline sections that say what the skill won't do. Three skills, portable across Claude Code, Claude Desktop, Codex CLI, and any markdown-prompt runtime.
Apr 29, 2026 | AI, SRE, Open Source

Alert fatigue? Let AI triage.

How I built alert-explainer — an open-source service that sits between Alertmanager and your on-call routing and turns every Prometheus alert into a plain-English brief in 1–4 seconds for under a cent. Design, reliability patterns, and production tradeoffs.
Apr 25, 2026 | AI, SRE, Reliability

When NOT to Use AI in Production SRE

Most AI-for-SRE writing tells you where AI helps. Here are seven places it actively hurts — and the operational rule of thumb I use to decide.
Apr 21, 2026 | AI, SRE, Open Source

Building incident-scribe: Slack Thread to Incident Report with Claude

How I built an open-source tool that turns messy Slack incident threads into blameless, structured incident reports in under 30 seconds — design, reliability patterns, and production tradeoffs.
Apr 20, 2026 | AI, SRE, Cloud

Why AI is the Next SRE Superpower

After 15 years in cloud infrastructure and SRE — including 8+ years building safety-critical systems at a global aviation-SaaS platform — here's why I believe AI is the most significant shift in how we operate systems since Kubernetes.
Jan 18, 2026 | Fundamentals, Architecture

Queues and Message Brokers: The Shock Absorber of Distributed Systems

A queue decouples producers from consumers, absorbs bursts, and keeps a slow component from taking down a fast one. A refresher on backpressure, at-least-once delivery, idempotency, and dead-letter queues — and why the same buffer now sits in front of every expensive LLM call.
Sep 21, 2025 | Fundamentals, Distributed Systems

ACID, BASE, and Isolation Levels: What 'Consistent' Actually Means

ACID promises correctness; BASE trades it for availability and scale. And even within ACID, the default isolation level is usually weaker than engineers assume. A refresher on transactions, isolation anomalies, and eventual consistency — and why it decides correctness in distributed and AI systems.
Jun 15, 2025 | Fundamentals, Architecture

Graph Traversal: BFS, DFS, and Why GraphRAG Is Just a Walk

Breadth-first and depth-first search are the two ways to walk a graph, and the choice between a queue and a stack decides everything downstream. A refresher on BFS vs DFS, the all-important visited set, and why graph traversal underpins dependency resolution, knowledge graphs, and GraphRAG.
Mar 9, 2025 | Fundamentals, Dev

Recursion and the Call Stack: Elegant Until It Overflows

Recursion lets code mirror the shape of the problem — but every call costs a stack frame, and runaway depth ends in a stack overflow. A refresher on frames, base cases, and iteration-vs-recursion, plus why the pattern (and its failure mode) shows up in agent loops and tree traversal.
Jan 12, 2025 | Fundamentals, Reliability

Garbage Collection: The Convenience That Shows Up in Your p99

Garbage collection frees you from manual memory management — and occasionally freezes your program to do it. A refresher on reachability, generational GC, and stop-the-world pauses, plus why GC tuning is really a tail-latency problem for the services behind your AI systems.
Nov 19, 2024 | Fundamentals, Dev

Compilers, Interpreters, and JIT: Why Python Is Slow and PyTorch Is Fast

The difference between compiling ahead of time, interpreting line by line, and JIT-compiling the hot paths explains your language's performance, its startup time, and its portability. A refresher — and why torch.compile and XLA are the same idea aimed at tensor programs.
Oct 1, 2024 | Fundamentals, Architecture

Floating Point and Numerical Precision: Why 0.1 + 0.2 ≠ 0.3, and Why ML Cares

Floating-point numbers are approximations, and the errors aren't random. A refresher on sign/exponent/mantissa, why 0.1 + 0.2 isn't 0.3, and how the same fundamentals drive the FP32 → BF16 → FP8 march that makes modern LLM training and inference affordable.
Aug 13, 2024 | Fundamentals, Linux

Virtual Memory and Paging: The Illusion Every Program Lives In

Every process thinks it has a private, contiguous memory space. Virtual memory is the beautiful lie that makes it true — and page faults, thrashing, and overcommit are where the lie gets expensive. A refresher, and why paging ideas now manage the KV cache and explain container OOM kills.
Jun 25, 2024 | Fundamentals, Architecture

The Memory Hierarchy: Why Data Locality Beats Clock Speed

From registers to object storage, each level of memory is roughly 10–100× slower than the one above it. A refresher on the hierarchy, cache lines, and locality — and why 'keep the data near the compute' is now the single biggest lever in GPU and LLM-inference performance.
May 6, 2024 | Fundamentals, Reliability

Deadlocks, Locks, and Race Conditions: The Bugs That Only Happen Sometimes

Race conditions and deadlocks are the hardest bugs to reproduce because they depend on timing. A refresher on locks, atomicity, the four Coffman conditions, and lock ordering — and why these decades-old ideas explain hung agents, corrupted state, and 'it only fails under load'.
Mar 17, 2024 | Fundamentals, Architecture

Concurrency vs Parallelism: The Distinction That Fixes Your Throughput

Concurrency is dealing with many things at once; parallelism is doing many things at once. Confusing them is behind a huge share of performance mistakes. A refresher on the difference, the GIL, async vs threads — and why it decides how you scale model calls and GPU work.
Jan 28, 2024 | Fundamentals, Architecture

Caching and Eviction Policies: Why LRU, LFU, and FIFO Aren't the Same Bet

Caching is the oldest performance trick there is, and the eviction policy is the part that decides whether it works. A refresher on LRU vs LFU vs FIFO, hit rates, cache invalidation, and why the same ideas now govern prompt caches, KV caches, and GPU memory.
Dec 14, 2023 | Fundamentals, Data Engineering

Bloom Filters: Saying 'Definitely Not' in a Few Kilobytes

A Bloom filter answers 'have I seen this before?' using a fraction of the memory a real set would need — at the cost of occasional false positives, never false negatives. A refresher on how it works, and why it quietly speeds up databases, caches, and AI training-data dedup.
Oct 29, 2023 | Fundamentals, Distributed Systems

Consistent Hashing: How Distributed Systems Add and Remove Nodes Without Chaos

Naive sharding with hash-mod-N reshuffles almost everything when a node joins or leaves. Consistent hashing moves only a fraction. A refresher on the ring, virtual nodes, and why this 1997 idea underpins your CDN, your distributed cache, and your sharded vector index.
Sep 10, 2023 | Fundamentals, Security

TLS and Public-Key Cryptography, Explained Without the Math

Every 'https' and every model API call rides on TLS. A refresher on how public-key crypto solves key distribution, what a certificate actually proves, and why the symmetric/asymmetric split matters — plus the cert-expiry and mTLS realities that decide uptime in cloud and AI systems.
Jul 23, 2023 | Fundamentals, Networks

How DNS Resolution Works — the Internet's Phone Book, and Its Cache

DNS turns a name into an address, and its caching behavior decides how fast your failovers propagate and how badly a misconfiguration spreads. A refresher on the recursive walk from root to authoritative — and why TTLs quietly govern cloud deploys, service discovery, and AI gateways.
Jun 7, 2023 | Fundamentals, Networks

How TCP/IP Actually Works — and Why the Handshake Still Bites You

Every API call, model request, and database query rides on TCP/IP. A refresher on the three-way handshake, sequence numbers, and flow control — and why round-trips, connection reuse, and TIME_WAIT quietly shape the latency of your cloud and AI systems.
Apr 18, 2023 | Fundamentals, Storage

B-tree vs LSM-tree: The Two Ways Databases Store Your Data

Almost every database you use is built on one of two storage engines: the read-optimized B-tree or the write-optimized LSM-tree. Knowing which one is under your database explains its performance, its failure modes, and why your vector store behaves the way it does.
Mar 5, 2023 | Fundamentals, Architecture

Hash Tables: The Data Structure Behind Almost Everything

The hash table is the quiet workhorse under your cache, your database index, your deduplication, and your vector store's metadata layer. A refresher on how it turns a key into O(1) access — and the collision, load-factor, and hashing subtleties that decide whether it stays fast.
Jan 15, 2023 | Fundamentals, Architecture

Big-O Notation in the Age of Billion-Vector Search

Big-O is the oldest tool in the box and the one that still decides whether your system survives contact with real data. A refresher on algorithmic complexity, and why it quietly governs vector search, context windows, and every 'it worked in the demo' outage.
Dec 8, 2022 | API, Dev

REST vs. GraphQL APIs – how to choose?

REST APIs have been around for a while now, while GraphQL is relatively new to the game. While they both are used for data transfer - sending HTTP requests and receiving HTTP responses, they both have th
Nov 19, 2022 | API, Fundamentals

GraphQL for API Development

What is it? GraphQL is a query language for APIs and a runtime for fulfilling those queries with your existing data. It was originally created at Facebook in 2012 for describing the capabilities and r
Nov 18, 2022 | General, Networks

What is Address Resolution Protocol (ARP)?

ARP is a communication protocol used for discovering the link-layer address associated with a given internet-layer address, e.g. finding the MAC (media access control) addresses associated with IPv4 addresses
Jun 2, 2021 | Fundamentals, General, Storage

The CAP theorem

Let's look at the different system design options (databases in particular) in detail: 1. Consistency and Availability over Partition tolerance: Here we prefer having some data and same data for ev
Jun 1, 2021 | Fundamentals, General, Networks

The 7 layers of ISO OSI model

The International Organization for Standardization came up with the Open Systems Interconnection (OSI) conceptual model which provides a standard for diverse computer systems to be able to communicate
May 30, 2021 | General, Linux, SRE

NAS vs SAN – A brief comparison.

Network Attached Storage [NAS]: NAS is a specialized data storage device connected to a network, providing data access to other machines in the network over Ethernet. Its hardware, software, or sp
May 30, 2021 | Linux, Storage

What is RAID?

RAID [Redundant Array of Independent Disks] is a way of providing redundancy for stored data, protecting it from disk failures. RAID makes it possible to use lower-priced disks are in large
May 29, 2021 | AWS, General, Leadership

Amazon Leadership Principles – my thoughts.

Leadership principles are important for Amazon: when they hire, when they do evaluations on the job, etc. If you are preparing for an interview with Amazon, you should be expecting a lot of behavioral
May 26, 2021 | Ansible, General

Getting started with Ansible.

Ansible is an open-source tool that enables the automation, configuration, and orchestration of infrastructure. It fully embraces the concept of Infrastructure as Code. We can build out our entire sys
May 10, 2021 | General

Introduction to YAML!

YAML Ain't Markup Language! It is designed with a focus on human-readable formatting. The creators of YAML wanted it to be easily readable by humans. It is portable, easily extendable, and suppor
Mar 30, 2021 | AWS, Certification

Preparing for AWS Solution Architect (Professional) Certification.

Backstory: I cleared the AWS Solution Architect - Associate level certification in 2019. It wasn't too hard; I could clear it with a couple of years of on-job experience and almost a month of read
