Large language models are brilliant generalists with a fixed, frozen memory. The moment you need them to answer from your documents (a product manual, an internal wiki, last week's tickets), that memory falls short. Retrieval-augmented generation (RAG) closes the gap: instead of asking the model to recall, you retrieve the relevant facts at query time and hand them over as context. Done well, it is one of the highest-leverage patterns in applied AI today.
Chunk before you embed
Retrieval quality is decided long before the model runs, by how you split your source material. Chunk too large and each result is noisy and expensive to pass to the model; chunk too small and you sever the context that makes a passage meaningful. Start with chunks of a few hundred tokens and a small overlap between neighbours, and split on natural boundaries (headings, paragraphs, sections) rather than at a blind character count.
Garbage chunks in, confident nonsense out. Retrieval is the part most teams under-invest in.
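As a rough sketch of boundary-aware chunking, the helper below packs whole paragraphs into chunks and carries a few trailing words into the next chunk as overlap. The function name `chunkByParagraph`, its default sizes, and the use of whitespace-separated words as a stand-in for tokens are all assumptions made for illustration; a real pipeline would count tokens with the embedding model's own tokenizer.

```typescript
// Hypothetical helper: the name and defaults are illustrative, not from any library.
// Words stand in for tokens here; swap in a real tokenizer for accurate budgets.
function chunkByParagraph(text: string, maxWords = 300, overlapWords = 30): string[] {
  // Split on blank lines so chunks follow paragraph boundaries.
  const paragraphs = text
    .split(/\n\s*\n/)
    .map(p => p.trim())
    .filter(p => p.length > 0);

  const chunks: string[] = [];
  let current: string[] = []; // words of the chunk being built

  for (const paragraph of paragraphs) {
    const words = paragraph.split(/\s+/);
    // Close the current chunk if this paragraph would push it over the budget.
    if (current.length > 0 && current.length + words.length > maxWords) {
      chunks.push(current.join(" "));
      // Carry the tail of the finished chunk forward as overlap.
      current = overlapWords > 0 ? current.slice(-overlapWords) : [];
    }
    current.push(...words);
  }
  if (current.length > 0) chunks.push(current.join(" "));
  return chunks;
}
```

A paragraph longer than the budget is kept whole in this sketch, so it becomes one oversized chunk; splitting such paragraphs on sentence boundaries would be a natural next step.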
Embeddings turn meaning into geometry
An embedding model maps each chunk to a vector so that passages with similar meaning land close together in vector space. You store those vectors in a vector database, then embed the user's question with the same model and ask for its nearest neighbours. The crucial discipline: query and documents must be embedded by the same model, because vectors from different models live in different spaces and distances between them mean nothing.
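To make the geometry concrete, here is a minimal brute-force sketch of nearest-neighbour search using cosine similarity. The `StoredChunk` type and both function names are hypothetical, and cosine similarity is only one common choice of distance; a vector database does the same job with an index instead of a full scan.

```typescript
// Hypothetical shape for a stored chunk; a real vector store defines its own.
type StoredChunk = { id: string; vector: number[] };

// Cosine similarity: 1 means same direction, 0 means orthogonal.
function cosineSimilarity(a: number[], b: number[]): number {
  // A dimension mismatch is one visible symptom of mixing embedding models.
  // Matching dimensions alone do not prove the models match.
  if (a.length !== b.length) {
    throw new Error("Vectors have different dimensions");
  }
  let dot = 0;
  let normA = 0;
  let normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  const denom = Math.sqrt(normA) * Math.sqrt(normB);
  return denom === 0 ? 0 : dot / denom;
}

// Score every chunk against the query and keep the k best (a full scan).
function nearestNeighbours(
  query: number[],
  chunks: StoredChunk[],
  k: number
): { id: string; score: number }[] {
  return chunks
    .map(c => ({ id: c.id, score: cosineSimilarity(query, c.vector) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, k);
}
```

With toy vectors `a = [1, 0]`, `b = [0, 1]`, and `c = [1, 1]`, a query of `[1, 0]` ranks `a` first and `c` second, since `c` points partly in the same direction and `b` does not.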
Ground the answer, and prove it
Retrieval gives you the right passages; the prompt decides whether the model actually uses them. Instruct it to answer only from the supplied context and to say when the context is insufficient — then surface citations so users can verify. That honesty is what separates a demo from something people trust.
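async function answerQuestion(question) {
  // Embed the question and fetch the five nearest chunks.
  const ctx = await vectorStore.search(await embed(question), { k: 5 });
  const answer = await llm.complete({
    system: "Answer ONLY from the context. If the context is insufficient, say so.",
    prompt: `Context:\n${ctx.map(c => c.text).join("\n---\n")}\n\nQ: ${question}`,
  });
  // Always cite: return the sources alongside the answer.
  return { answer, sources: ctx.map(c => c.source) };
}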
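One possible way to make citations checkable per claim, rather than returning every retrieved source, is to number the passages in the prompt and map the markers in the answer back to their sources. This is a sketch under assumptions: the `RetrievedChunk` type, both function names, and the `[n]` marker convention are invented here, and a model will only emit such markers if the prompt asks for them.

```typescript
// Hypothetical chunk shape, mirroring the text and source fields used above.
type RetrievedChunk = { text: string; source: string };

// Number each passage so the model can cite it as [1], [2], ...
function buildGroundedPrompt(question: string, chunks: RetrievedChunk[]): string {
  const context = chunks.map((c, i) => `[${i + 1}] ${c.text}`).join("\n---\n");
  return `Context:\n${context}\n\nQ: ${question}\nCite the passages you used as [n].`;
}

// Map the [n] markers in an answer back to sources, in order of first mention.
// Markers that point outside the retrieved set are ignored.
function citedSources(answer: string, chunks: RetrievedChunk[]): string[] {
  const pattern = /\[(\d+)\]/g;
  const sources: string[] = [];
  let match: RegExpExecArray | null;
  while ((match = pattern.exec(answer)) !== null) {
    const chunk = chunks[Number(match[1]) - 1];
    if (chunk !== undefined && sources.indexOf(chunk.source) === -1) {
      sources.push(chunk.source);
    }
  }
  return sources;
}
```

An answer with no markers yields an empty list, which is itself a useful signal: either the model declined to answer or it answered without grounding.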
Evaluate retrieval and generation separately
When a RAG app is wrong, you need to know which half failed. Measure retrieval on its own — did the right chunk make the top-k? — and generation on its own — given perfect context, was the answer faithful? Conflating them is how teams spend a week tuning a prompt when the real problem was a chunking bug.
- Retrieval metrics: recall@k and the rank of the correct passage.
- Faithfulness: does every claim trace back to the context?
- Latency budget: embedding + search + generation, measured end to end.
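The retrieval metrics above can be sketched in a few lines. The version below assumes a labelled test set with exactly one correct passage per question, in which case recall@k is simply the share of questions whose correct passage made the top k. The `EvalCase` type and function names are hypothetical, and mean reciprocal rank is added as one common way to summarise the rank of the correct passage.

```typescript
// Hypothetical test case: the ids returned by retrieval, best first,
// plus the id of the one passage that actually answers the question.
type EvalCase = { retrievedIds: string[]; relevantId: string };

// 1-based rank of the correct passage, or null if it was not retrieved.
function rankOfCorrect(c: EvalCase): number | null {
  const index = c.retrievedIds.indexOf(c.relevantId);
  return index === -1 ? null : index + 1;
}

// Fraction of cases whose correct passage appears in the top k results.
function recallAtK(cases: EvalCase[], k: number): number {
  if (cases.length === 0) return 0;
  const hits = cases.filter(c => {
    const rank = rankOfCorrect(c);
    return rank !== null && rank <= k;
  }).length;
  return hits / cases.length;
}

// Mean reciprocal rank: 1/rank averaged over all cases, with misses counted as 0.
function meanReciprocalRank(cases: EvalCase[]): number {
  if (cases.length === 0) return 0;
  const total = cases.reduce((sum, c) => {
    const rank = rankOfCorrect(c);
    return sum + (rank === null ? 0 : 1 / rank);
  }, 0);
  return total / cases.length;
}
```

Because these functions need only ids, they run without calling the LLM at all, which keeps retrieval evaluation cheap enough to repeat after every chunking change.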
RAG is not a model you buy; it is a pipeline you engineer. Get chunking and retrieval right, ground the model in what you found, and prove every answer — and a general-purpose LLM becomes an expert on your data.