Jul 23, 2023 · 4 min read · Fundamentals, Networks

How DNS Resolution Works — the Internet's Phone Book, and Its Cache

DNS turns a name into an address, and its caching behavior decides how fast your failovers propagate and how badly a misconfiguration spreads. A refresher on the recursive walk from root to authoritative — and why TTLs quietly govern cloud deploys, service discovery, and AI gateways.

Figure: DNS resolution flow. A stub resolver asks a recursive resolver, which walks root, TLD, and authoritative nameservers, then caches the answer.

The first lookup walks the whole hierarchy. The TTL decides how long every lookup after it is just a cache hit.

“It’s always DNS” is a running joke in operations, and like most running jokes it’s earned. DNS is the system that turns a name a human can remember (cloudandsre.com) into an address a machine can route to (66.85.173.54). It works so seamlessly that people forget it’s a distributed, cached, hierarchical database — right up until a stale cache entry sends traffic to a dead server for an hour after you fixed the problem. This is a refresher on how a name actually gets resolved, and why the caching is the part that bites.

The resolution walk

Follow the diagram. Your application doesn’t know where cloudandsre.com lives, so it asks its stub resolver (built into your OS), which forwards the question to a recursive resolver — your ISP’s, or a public one like 8.8.8.8 or 1.1.1.1. The recursive resolver does the actual legwork:

It asks a root server: “where’s .com?” The root doesn’t know the final answer — it points to the .com TLD servers.
It asks a TLD server for .com: “where’s cloudandsre.com?” The TLD server points to that domain’s authoritative nameserver.
It asks the authoritative nameserver — the one that actually holds the records for cloudandsre.com — which returns the real answer: 66.85.173.54.
The recursive resolver returns that answer to you and caches it.

The hierarchy is the clever part: no single machine knows every name on the internet. Responsibility is delegated down a tree, and each level only knows enough to point you one step closer.
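The walk above can be sketched with a toy, in-memory delegation tree. This is an illustration rather than a real resolver: the server labels (root, tld-com, auth-cloudandsre) and the SERVERS table are made up for the example, and no network traffic is involved.

```python
# Toy delegation tree: each server only knows the next hop, except the
# authoritative server, which holds the actual record.
# All server labels are hypothetical, not real hosts.
SERVERS = {
    "root": {"referrals": {"com.": "tld-com"}},
    "tld-com": {"referrals": {"cloudandsre.com.": "auth-cloudandsre"}},
    "auth-cloudandsre": {"records": {"cloudandsre.com.": "66.85.173.54"}},
}


def resolve(name, servers=SERVERS):
    """Walk root -> TLD -> authoritative; return (address, servers asked)."""
    fqdn = name if name.endswith(".") else name + "."
    server, path = "root", []
    while True:
        path.append(server)
        zone = servers[server]
        # An authoritative server answers directly.
        if fqdn in zone.get("records", {}):
            return zone["records"][fqdn], path
        # Otherwise follow the referral for the longest matching suffix.
        referrals = zone.get("referrals", {})
        matches = [s for s in referrals if fqdn == s or fqdn.endswith("." + s)]
        if not matches:
            raise LookupError(f"{name}: no such name (NXDOMAIN)")
        server = referrals[max(matches, key=len)]


address, path = resolve("cloudandsre.com")
print(address, "via", " -> ".join(path))
```

Notice that no entry in the table knows both the root and the final address. Each hop holds just enough to point one step closer, which is the delegation idea in miniature.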

TTL: the most important number you never look at

Every DNS record carries a TTL (time to live) — how long a resolver is allowed to cache it. This one number is the source of most DNS pain and most DNS power.

The first lookup is slow (it does the full walk above). Every lookup after it, until the TTL expires, is an instant cache hit somewhere along the chain. Caching is what makes DNS survive planet-scale query volume.
But cached answers go stale. If you change where cloudandsre.com points and the old record had a 24-hour TTL, resolvers around the world may keep handing out the old address for up to 24 hours. The change is correct; the propagation is slow. This is why “I updated DNS but half my users still hit the old server” is DNS working as designed, not a malfunction.
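Both behaviours, the instant hit and the stale answer, fall out of a few lines of cache logic. Here is a minimal sketch, assuming a single in-process cache; the TTLCache class and its injectable clock are hypothetical names for illustration, not how any particular resolver is implemented.

```python
import time


class TTLCache:
    """Cache answers until their TTL expires. `clock` is injectable for tests."""

    def __init__(self, clock=time.monotonic):
        self._clock = clock
        self._entries = {}  # name -> (address, expiry timestamp)

    def get(self, name):
        entry = self._entries.get(name)
        if entry is None:
            return None  # never cached: caller must do the full walk
        address, expires_at = entry
        if self._clock() >= expires_at:
            del self._entries[name]  # expired: force a fresh lookup
            return None
        return address  # still fresh as far as the cache knows

    def put(self, name, address, ttl_seconds):
        self._entries[name] = (address, self._clock() + ttl_seconds)
```

The cache has no way to learn that the record changed upstream. It only knows the expiry time it was given, which is why the old TTL, not the time of your change, decides when clients see the new address.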

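The operational lesson every ops engineer eventually learns: lower the TTL before a planned migration, not during it. Drop it to 60 seconds at least one full old TTL ahead of the cutover (a day ahead for a 24-hour TTL), so that every cache has expired the long-lived record and picked up the short one. Then cut over, watch traffic move within a minute or so, and raise the TTL back. If you forget, you wait out the old TTL while traffic dribbles over.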
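The timing is simple arithmetic, and worth writing down before a migration. A rough sketch, assuming caches honor TTLs exactly (some clients and resolvers do not); plan_ttl_migration and its field names are hypothetical.

```python
from datetime import datetime, timedelta


def plan_ttl_migration(cutover, old_ttl_seconds, low_ttl_seconds=60):
    """Return the key timestamps for a TTL-lowering migration."""
    return {
        # Lower the TTL at least one full old TTL before cutover, so every
        # cache holding the long-lived record has expired it by then.
        "lower_ttl_by": cutover - timedelta(seconds=old_ttl_seconds),
        "cutover": cutover,
        # After cutover, TTL-honoring caches converge within the low TTL.
        "caches_converged_by": cutover + timedelta(seconds=low_ttl_seconds),
    }


plan = plan_ttl_migration(datetime(2023, 7, 23, 12, 0), old_ttl_seconds=86400)
for step, when in plan.items():
    print(f"{step}: {when.isoformat()}")
```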

Why this is still front and center in cloud and AI

DNS didn’t get less important when we moved to the cloud — it became the control plane for how traffic finds services.

DNS is how many failovers and blue-green deploys work. Weighted, latency-based, and similar routing policies in managed DNS (Route 53, Cloud DNS) route users to the nearest or healthiest region. A failover is often “flip the DNS record to the standby.” That ties your recovery time to your TTL: with a 300-second TTL, a client that cached the old answer just before the flip can keep using it for up to five more minutes, on top of however long it took to detect the failure. Your disaster-recovery plan and your TTL are the same conversation.
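To put a number on that, here is a back-of-the-envelope model. It assumes failure detection takes a fixed number of consecutive failed health checks and that caches honor the TTL; the function and parameter names are illustrative, not any provider's API.

```python
def worst_case_failover_seconds(ttl_seconds, check_interval_seconds, failure_threshold):
    """Worst-case time from outage to the last cached client moving over."""
    # Time for health checks to declare the primary unhealthy.
    detection = check_interval_seconds * failure_threshold
    # A client may have cached the old answer just before the record flipped.
    return detection + ttl_seconds


# 300-second TTL, checks every 30 seconds, 3 failures to trip: 90 + 300.
print(worst_case_failover_seconds(300, 30, 3))  # 390
```

Shortening the health-check interval helps only so much here. In this model the TTL is the larger term, and it is the one you set long before the incident.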

Service discovery is DNS. Inside Kubernetes, a service like payments.default.svc.cluster.local is resolved by cluster DNS (CoreDNS). When pods can’t reach each other, the cause is frequently DNS — misconfigured search domains, an overloaded CoreDNS, or aggressive negative caching. A startlingly large fraction of “the network is broken” incidents in Kubernetes are actually DNS incidents.
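Search domains are easier to reason about once you see the queries they generate. The sketch below approximates how a stub resolver expands a name using the search list and the ndots option from resolv.conf. The search list and ndots value of 5 are assumed to match what a pod in the default namespace typically gets (real pods may carry extra search domains from the node), and api.example.com is a placeholder external name.

```python
def candidate_names(name, search_domains, ndots=5):
    """List the fully qualified names a stub resolver would try, in order."""
    if name.endswith("."):
        return [name]  # trailing dot: already absolute, search list skipped
    absolute = name + "."
    expanded = [f"{name}.{domain}." for domain in search_domains]
    if name.count(".") >= ndots:
        return [absolute] + expanded  # enough dots: try the name as-is first
    return expanded + [absolute]  # too few dots: walk the search list first


# Assumed search list for a pod in the "default" namespace.
SEARCH = ["default.svc.cluster.local", "svc.cluster.local", "cluster.local"]

for query in candidate_names("api.example.com", SEARCH):
    print(query)
```

Under these assumptions an external name with only two dots is tried against all three cluster suffixes before the real name, so one lookup can turn into several queries against cluster DNS. Appending a trailing dot skips the search list entirely.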

AI gateways and model endpoints hide behind names. When you call an MCP gateway or a model endpoint, you’re resolving a name first. If you route inference traffic across regions or providers for cost and availability, DNS is one of the levers — and DNS caching is why a provider cutover isn’t instantaneous. Agents that make many outbound calls also do many resolutions; a slow or failing resolver becomes a hidden tax on every tool call, showing up as mysterious latency that has nothing to do with the model.
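One way to tell whether resolution is part of that hidden tax is to time it separately from the request. A minimal sketch using the standard library's socket.getaddrinfo, which goes through the OS resolver and so may be answered by a local cache; timed_resolve and its injectable resolver parameter are hypothetical names for the example.

```python
import socket
import time


def timed_resolve(host, resolver=socket.getaddrinfo):
    """Return (sorted unique addresses, elapsed seconds) for one lookup."""
    start = time.perf_counter()
    results = resolver(host, None)
    elapsed = time.perf_counter() - start
    # Each result is (family, type, proto, canonname, sockaddr);
    # sockaddr[0] is the address string for both IPv4 and IPv6.
    addresses = sorted({info[4][0] for info in results})
    return addresses, elapsed


if __name__ == "__main__":
    addresses, elapsed = timed_resolve("cloudandsre.com")
    print(f"{addresses} in {elapsed * 1000:.1f} ms")
```

Logging this number next to the total latency of each tool call makes it clear whether the time went to the resolver or to the model.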

The failure modes worth knowing

Stale cache after a change: the classic. Managed with TTL discipline.
Negative caching: resolvers also cache failures (NXDOMAIN). Fix a missing record and the “does not exist” answer can linger for its own TTL, which is derived from the zone's SOA record rather than from the record you just added.
The resolver as a bottleneck: under high query volume (hello, chatty microservices and agents), an under-provisioned resolver adds latency to everything or drops queries. This is why cluster DNS is something you monitor, not something you assume.
Split-horizon confusion: internal and external views of the same name returning different answers, which is powerful and an excellent way to confuse yourself at 2am.
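The negative-caching lifetime in particular is easy to compute once you know where it comes from. Per RFC 2308 it is the smaller of the SOA record's own TTL and the SOA MINIMUM field. Many resolvers also apply their own upper limit, modeled here as an optional resolver_cap parameter; that name and the sample values are illustrative.

```python
def negative_cache_ttl(soa_record_ttl, soa_minimum, resolver_cap=None):
    """Seconds a 'does not exist' answer may be cached."""
    # RFC 2308: the smaller of the SOA record's TTL and its MINIMUM field.
    ttl = min(soa_record_ttl, soa_minimum)
    # Resolvers commonly enforce their own ceiling on top of that.
    if resolver_cap is not None:
        ttl = min(ttl, resolver_cap)
    return ttl


# SOA record TTL of 3600 with a MINIMUM of 900: NXDOMAIN lingers 15 minutes.
print(negative_cache_ttl(3600, 900))  # 900
```

If you expect to add records that clients may have already queried, check the zone's SOA values before the change, in the same way you would check a record's TTL before a migration.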

The takeaway

DNS is a beautifully simple idea — a hierarchical, delegated, cached lookup — carrying an enormous amount of operational weight. The resolution walk is easy; the caching is where the subtlety lives. If you internalize just one thing, make it this: the TTL is the propagation speed of every change you make, and treating it as an afterthought is why “it’s always DNS” keeps being true. Respect the cache, plan your TTLs, and a huge category of cloud and AI-infrastructure mysteries stops being mysterious.