
Cover image: photo by Alina Grubnyak on Unsplash. The intuition: every node is a point in embedding space; nearness in that space means similarity in meaning.
If you have read the RAG primer, you have already met the embedding. It is the small mystery in the middle of the architecture — the thing that turns a chunk of text into a list of numbers, and the thing whose quality, in production, ends up dominating every retrieval system you build.
This post is about what is actually happening inside that step, and why “pick a model and move on” is the wrong way to treat it.
An embedding is a coordinate
A vector embedding is a fixed-length list of floating-point numbers — typically 384, 768, 1024, 1536, or 3072 dimensions in current production models — produced by an embedding model from a piece of input. The input is usually text, but it can also be an image, an audio clip, a code snippet, or a chunk of structured data. The output is always the same shape for a given model: a point in a high-dimensional space.
The point itself is meaningless. Two embedding vectors are only useful relative to each other. The promise the embedding model makes is: if two pieces of input have similar meaning, their points are near each other; if they have different meaning, their points are far apart. Everything downstream — semantic search, clustering, recommendation, anomaly detection — is just an exploitation of that one promise.
The “near” and “far” are measured by a similarity metric. Two metrics cover the vast majority of production deployments:
- Cosine similarity. The cosine of the angle between the two vectors. Insensitive to vector magnitude. The default for almost every text embedding model shipped in the last five years.
- Dot product. Identical to cosine if the vectors are normalized to unit length, which most models do by default. Slightly faster, since it skips the norm computation and reduces to a single fused multiply-add loop on modern hardware.
A third — Euclidean (L2) distance — shows up in older systems and some image-embedding setups. For new text systems, default to cosine and only revisit if you have a reason.
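A quick sketch of the two metrics in numpy, assuming only that the vectors arrive as 1-D float arrays:

```python
import numpy as np

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Direction only: magnitude is divided out by the two norms.
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

def dot_similarity(a: np.ndarray, b: np.ndarray) -> float:
    # Identical to cosine similarity when both vectors are unit length.
    return float(np.dot(a, b))

a = np.array([0.1, 0.7, 0.2])
b = np.array([0.2, 0.6, 0.3])
a_unit, b_unit = a / np.linalg.norm(a), b / np.linalg.norm(b)
assert abs(cosine_similarity(a, b) - dot_similarity(a_unit, b_unit)) < 1e-9
```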
Why the model choice does most of the work
The embedding model is a frozen artifact: a transformer that has been trained (or fine-tuned) to produce vectors with the property that semantically similar inputs land near each other. Which notion of “similar” the model has internalized depends on what it was trained on, and that choice shapes every downstream result.
A general-purpose model trained on web text will happily collapse domain-specific concepts onto each other. The word “rollback” in a database transaction sense and “rollback” in a deployment sense will land in roughly the same neighborhood, because nothing in the training data taught the model to separate them. A model fine-tuned on infrastructure documentation will keep them apart. A multilingual model may pull translations of the same concept toward each other; a monolingual model will scatter them.
Three properties are worth checking before you commit to a model:
- Dimensionality. Higher is not always better. A 3,072-dim vector is four times the storage and roughly four times the search cost of a 768-dim vector, and the quality gap is often surprisingly small for narrow domains. Most state-of-the-art models from 2024 onward — OpenAI's text-embedding-3-large, Nomic, Snowflake Arctic, BGE-M3 — ship with Matryoshka representation learning: the leading dimensions carry most of the signal, so you can truncate vectors to a smaller dimension at index or query time and renormalize (see the sketch after this list). OpenAI's own benchmark says text-embedding-3-large truncated to 256 dimensions outperforms ada-002 at 1,536 dimensions on MTEB. That is a 6× storage saving with a quality gain, achieved by changing one parameter.
- Maximum input length. Embedding models have a context limit, just like LLMs. A model that truncates at 512 tokens is a poor fit for embedding a long runbook section; embedding only the first 512 tokens of a 4,000-token chunk is the kind of bug that does not surface until your retriever starts returning the wrong document for ambiguous queries.
- Domain match. A medical-domain model will outperform a general one on medical queries. A code-aware model will outperform a general one on code search. Match the model to the corpus you will actually embed, not the corpus the model’s marketing page describes.
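The truncation trick mentioned above is a one-function affair. A minimal sketch, assuming the vectors come from an MRL-trained model and that you renormalize after slicing, which is required for cosine or dot-product scores to stay meaningful:

```python
import numpy as np

def truncate_matryoshka(vectors: np.ndarray, dim: int) -> np.ndarray:
    """Keep the first `dim` dimensions of MRL-trained embeddings, then renormalize."""
    truncated = vectors[:, :dim]
    norms = np.linalg.norm(truncated, axis=1, keepdims=True)
    return truncated / np.clip(norms, 1e-12, None)

# Example: shrink 1,536-dim vectors to 256 dims before indexing (6x less storage).
full = np.random.randn(1000, 1536).astype(np.float32)   # stand-in for real embeddings
full /= np.linalg.norm(full, axis=1, keepdims=True)
small = truncate_matryoshka(full, 256)
```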
The single most common embedding mistake in production: picking the highest-on-the-leaderboard model on day one, then never re-evaluating against your own corpus. The leaderboard is benchmarks. Your corpus is reality.
What the index gives you
Once you have a few hundred vectors, you can find nearest neighbors by brute force — compute cosine similarity against every stored vector, return the top-k. At a few thousand vectors this still runs in milliseconds. At a million vectors it is too slow for an interactive query.
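Brute force is short enough to write from memory. A sketch, assuming the corpus embeddings are stacked into one unit-normalized matrix:

```python
import numpy as np

def brute_force_top_k(query: np.ndarray, corpus: np.ndarray, k: int = 5) -> np.ndarray:
    """Exact nearest neighbors by cosine similarity.

    `corpus` is an (n_docs, dim) matrix; query and corpus rows are assumed
    unit-normalized, so the dot product equals cosine similarity.
    """
    scores = corpus @ query            # one similarity score per stored vector
    return np.argsort(-scores)[:k]     # indices of the k best matches
```

The cost is linear in corpus size, because every query touches every stored vector; that is exactly the cost the ANN index removes.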
The escape is an approximate nearest-neighbor (ANN) index — a data structure that trades a small amount of recall for searching only a fraction of the corpus per query. Three families dominate:
- HNSW (Hierarchical Navigable Small World). A graph where each node is a vector and edges connect near-neighbors at multiple “zoom levels.” Search is a greedy walk down the levels. Default for most production systems; high recall, fast queries, larger memory footprint.
- IVF (Inverted File Index). Cluster the vectors during build, search only the clusters nearest to the query. Cheaper memory, slightly lower recall, slower build. Common in FAISS-based systems.
- DiskANN / SPANN. Disk-resident variants for corpora that do not fit in RAM. Higher latency, much larger scale.
Most of the production vector databases — pgvector, Pinecone, Weaviate, Qdrant, Milvus — wrap one or more of these. The choice of database matters less than the choice of embedding model, by a wide margin.
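For a feel of the index layer in code, a minimal HNSW sketch using the hnswlib library; the M, ef_construction, and ef values here are illustrative defaults, not tuned recommendations:

```python
import hnswlib
import numpy as np

dim, n = 768, 100_000
vectors = np.random.randn(n, dim).astype(np.float32)    # stand-in for real embeddings

index = hnswlib.Index(space="cosine", dim=dim)
index.init_index(max_elements=n, M=16, ef_construction=200)  # graph degree / build effort
index.add_items(vectors, ids=np.arange(n))

index.set_ef(64)                                         # search-time recall/latency knob
labels, distances = index.knn_query(vectors[:1], k=10)   # approximate top-10 neighbors
```

The vector databases typically expose the same knobs under different names; the recall/latency trade-off usually lives in an ef-style search parameter regardless of the wrapper.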
Where it breaks
The failure modes are subtle, and they cluster in the part of the system that is hardest to test.
- Embedding-model drift. You change embedding models — usually because a better one shipped. The new model’s vectors are not comparable to the old model’s vectors. They live in a different space. You must re-embed every document in the corpus before you can serve a query against the new model. Mixing the two silently returns nonsense.
- Normalization mismatch. Some models return normalized vectors; some do not. If your similarity metric assumes unit-length vectors and you store un-normalized ones, your “similarity” scores are off by an arbitrary scaling factor and your top-k rankings are wrong. Normalize at write time and again at read time, or pick one place and document it (see the guard sketched after this list).
- Out-of-distribution input. The model was trained on a distribution. Inputs far outside it — code in an obscure language, deeply structured ASCII tables, numeric strings — tend to collapse into a small, undifferentiated region of the space where everything looks similar to everything else. Symptoms: every retrieval returns the same handful of “centroid” documents.
- Cosine on the wrong axis. Cosine similarity treats vectors as directions, not magnitudes. For some use cases — anomaly detection, where magnitude is the signal — this is the wrong tool. Pick the metric that matches the question, not the metric the library defaults to.
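The normalization guard from the list above, sketched as a single helper you can call at both write and read time; the tolerance is an arbitrary choice:

```python
import numpy as np

def ensure_unit_norm(vectors: np.ndarray, atol: float = 1e-3) -> np.ndarray:
    """Normalize rows to unit length and flag vectors that were far from normalized."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    off = int(np.sum(np.abs(norms - 1.0) > atol))
    if off:
        print(f"note: {off} vectors were not unit length before normalization")
    return vectors / np.clip(norms, 1e-12, None)
```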
What an embedding is not
- Not a hash. A hash is a fixed-size representation that destroys structure. An embedding preserves structure. Two near-identical inputs produce near-identical embeddings. Two near-identical inputs produce wildly different hashes.
- Not a feature vector in the classical ML sense. Classical features are hand-crafted: “word count,” “average sentence length,” “presence of an error code.” Embeddings are learned: each dimension has no human-readable meaning. The space is interpretable only in aggregate.
- Not a search engine. Embeddings are how a search engine finds candidates. A complete search system layers tokenization, hybrid keyword retrieval (BM25), reranking, and result-set assembly on top. A pure vector search is the substrate, not the product.
- Not free. Embedding 100M chunks at 1024 dimensions is roughly 400 GB of storage and several hours of GPU time even with a fast model. Plan the storage and the indexing throughput before you commit to a corpus.
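The arithmetic behind that last estimate, assuming 4-byte float32 values and ignoring index overhead:

```python
chunks = 100_000_000
dims = 1024
bytes_per_float = 4                               # float32
print(chunks * dims * bytes_per_float / 1e9)      # ~410 GB of raw vectors
```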
Where to start
The smallest useful experiment: take 100 of your team’s runbook paragraphs, embed them with a single managed embedding API, store the vectors and source text in a numpy array or a SQLite file with the sqlite-vec extension, and write a 30-line query function that returns the top-3 paragraphs for a question.
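A sketch of that experiment, with a local sentence-transformers model standing in for the managed API so the example is self-contained; the model name and the in-memory numpy storage are illustrative choices, not recommendations:

```python
import numpy as np
from sentence_transformers import SentenceTransformer

# Local stand-in for a managed embedding API; swap in your provider's client.
model = SentenceTransformer("all-MiniLM-L6-v2")

paragraphs = ["...runbook paragraph 1...", "...runbook paragraph 2..."]   # your 100 paragraphs
corpus = model.encode(paragraphs, normalize_embeddings=True)              # (n, 384) unit vectors

def top_3(question: str) -> list[str]:
    query = model.encode([question], normalize_embeddings=True)[0]
    best = np.argsort(-(corpus @ query))[:3]      # dot product == cosine on unit vectors
    return [paragraphs[i] for i in best]
```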
Then — and this is the part most teams skip — write 20 evaluation queries with the right answers labeled. Run them against your retriever. Compute precision-at-3. That number is the first honest measure of how good your embedding is. Every production knob — chunking strategy, model choice, hybrid retrieval, reranking — is a way of moving that number up. Without it, you are tuning blind.
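Continuing the sketch above, the evaluation loop; the labeled-query format (question plus the index of the paragraph that should come back) is an assumption about how you record the answers:

```python
# 20 questions, each labeled with the index of the paragraph that answers it.
eval_set = [
    ("how do we roll back a failed deploy?", 0),
    ("where are database credentials rotated?", 1),
    # ... 18 more
]

hits = 0
for question, expected_idx in eval_set:
    hits += int(paragraphs[expected_idx] in top_3(question))

# With one labeled answer per query this is strictly a hit rate at 3,
# but it serves the same purpose: a single number to move upward.
print(f"precision@3: {hits / len(eval_set):.2f}")
```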