RAG: Retrieval-Augmented Generation

RAG: The Architecture That Grounds AI in Your Data

An LLM’s knowledge ends at its training cutoff and covers nothing that’s specific to your organisation, product, or domain. Fine-tuning is expensive and slow. Stuffing a massive document into the context window works until it doesn’t. Retrieval-augmented generation is the production answer: fetch only what’s relevant, inject it just in time, and let the model reason over your data rather than guess at it.

What RAG Is and Why It Exists

A language model is a frozen snapshot. Everything it knows comes from data it saw before training ended. Ask it about your internal policy document, the product you shipped last month, or the support ticket from this morning — it has no idea. It will either admit ignorance or, worse, hallucinate a plausible-sounding answer.

There are three common responses to this problem, each with different trade-offs:

  • Fine-tuning: retrain the model on your data so the knowledge becomes baked in. Expensive, requires ML expertise, needs re-training whenever the data changes, and still tends to hallucinate on the fine-tuned facts. Better for style and format adaptation than for knowledge injection.
  • Context stuffing: paste the relevant document directly into the context window alongside the user’s question. Works for a single document. Breaks at scale — you can’t fit your entire knowledge base into every request, and long contexts degrade retrieval quality as the model’s attention spreads thin.
  • Retrieval-Augmented Generation (RAG): index your knowledge base, find only the relevant passages at query time, and inject just those. The model answers from your data without needing to hold it all in context.

RAG is the standard production architecture for knowledge-grounded AI applications because it scales, stays current without retraining, and makes hallucinations detectable — you can check whether the model’s answer appears in the retrieved context.

RAG doesn’t eliminate hallucination — it makes it auditable

A well-implemented RAG system doesn’t guarantee accuracy. The model can still synthesise incorrectly from retrieved context. What RAG gives you is a ground truth to check against: every claim in the response should be traceable to a source chunk. That auditability is what makes RAG suitable for high-stakes applications.

What RAG is good for

  • Q&A over company wikis, documentation, legal documents, support tickets
  • Customer-facing bots that need to answer questions about your specific products
  • Research assistants that need to cite sources
  • Any application where the knowledge base changes frequently and fine-tuning would be too slow

What RAG is not good for

  • Tasks that require synthesising across an entire corpus simultaneously (RAG retrieves fragments, not the whole)
  • Knowledge that is better expressed as model behaviour than facts (use fine-tuning for this)
  • Real-time data that changes faster than your indexing pipeline can keep up

The Four-Phase Pipeline

RAG has two distinct workflows: an indexing pipeline that runs once (or on a schedule) to prepare your knowledge base, and a query pipeline that runs on every user request. Understanding which problems belong to which pipeline is important — most RAG performance issues are indexing problems misdiagnosed as retrieval problems.

Phase 01: Ingest & Chunk (indexing pipeline)

Load raw documents (PDFs, HTML, Markdown, code), clean them, and split into chunks. This is where chunk size, overlap, and metadata tagging decisions are made. Runs offline, on a schedule, or when documents change.

Phase 02: Embed & Store (indexing pipeline)

Pass each chunk through an embedding model to produce a dense vector. Store the vector alongside the chunk text and its metadata in a vector store. This is what makes semantic retrieval possible.

Phase 03: Retrieve (query pipeline)

Embed the user’s query using the same model. Run a nearest-neighbour search to find the top-k chunks whose vectors are closest to the query vector. Apply metadata filters if needed. Optionally re-rank the results.

Phase 04: Generate (query pipeline)

Build a prompt that injects the retrieved chunks as context, along with instructions on how to use them. Call the LLM. Return the answer, optionally with citations pointing back to the source chunks.
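
The four phases can be sketched end to end in a few lines. This is a minimal illustration, not a production design: it assumes a toy bag-of-words count vector in place of a real embedding model and an in-memory list in place of a vector store, and it stops at building the prompt rather than calling an LLM. All function names (`toy_embed`, `index_documents`, `retrieve_top_k`, `build_prompt`) and the sample documents are hypothetical.

```python
import math
import re
from collections import Counter

def toy_embed(text: str) -> Counter:
    """Stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def index_documents(docs: dict[str, str]) -> list[dict]:
    """Phases 1-2: split each document into paragraph chunks, embed, store."""
    store = []
    for source, text in docs.items():
        for para in filter(None, (p.strip() for p in text.split("\n\n"))):
            store.append({"source": source, "text": para, "vector": toy_embed(para)})
    return store

def retrieve_top_k(query: str, store: list[dict], k: int = 2) -> list[dict]:
    """Phase 3: embed the query the same way and rank chunks by similarity."""
    q = toy_embed(query)
    return sorted(store, key=lambda c: cosine(q, c["vector"]), reverse=True)[:k]

def build_prompt(query: str, chunks: list[dict]) -> str:
    """Phase 4 (prompt only): inject retrieved chunks; the LLM call is omitted."""
    context = "\n\n".join(
        f"[Source {i}: {c['source']}]\n{c['text']}" for i, c in enumerate(chunks, start=1)
    )
    return f"Answer using ONLY this context.\n\nCONTEXT:\n{context}\n\nQUESTION:\n{query}"

docs = {
    "refunds.md": "Refunds are issued within 14 days of purchase.\n\nGift cards are not refundable.",
    "shipping.md": "Standard shipping takes 3 to 5 business days.",
}
store = index_documents(docs)
question = "How many days does shipping take?"
top = retrieve_top_k(question, store, k=1)
print(build_prompt(question, top))
```

Note that indexing and querying share `toy_embed`; swapping in a real model changes only that one function, which is why the two pipelines must agree on it.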

How Embeddings Work

An embedding is a numerical representation of meaning. An embedding model — itself a neural network — reads a piece of text and outputs a vector: a list of hundreds or thousands of floating-point numbers. The remarkable property is that texts with similar meanings produce vectors that are close together in this high-dimensional space, while texts with different meanings produce vectors that are far apart.

Concretely: “How do I reset my password?” and “I can’t log in to my account” will produce similar vectors even though they share no words. “Best restaurants in Paris” will produce a very different vector from either. This semantic closeness is what makes RAG work: when a user asks a question, you find the stored text that means the same thing as the question — not the text that shares the same keywords.
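
Closeness between vectors is usually measured with cosine similarity. The sketch below shows the calculation on hand-picked three-dimensional vectors; the numbers are invented purely for illustration and do not come from any real embedding model, whose vectors would have hundreds or thousands of dimensions.

```python
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Dot product of the two vectors divided by the product of their lengths."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical "embeddings", chosen so the first two point in a similar direction.
reset_password = [0.9, 0.1, 0.0]
cannot_log_in = [0.8, 0.3, 0.1]
paris_restaurants = [0.0, 0.2, 0.9]

print(cosine_similarity(reset_password, cannot_log_in))      # close to 1: similar meaning
print(cosine_similarity(reset_password, paris_restaurants))  # close to 0: unrelated
```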

Choosing an embedding model

Model | Dimensions | Cost | Best for
OpenAI text-embedding-3-small | 1,536 | $0.02 / M tokens | General use; strong quality/cost ratio
OpenAI text-embedding-3-large | 3,072 | $0.13 / M tokens | Highest accuracy when cost is secondary
Cohere embed-english-v3 | 1,024 | $0.10 / M tokens | Strong on long documents; good filter support
nomic-embed-text (open-source) | 768 | Free / self-hosted | Local deployment, privacy-sensitive workloads
BGE-M3 (open-source) | 1,024 | Free / self-hosted | Multilingual, long-context (8k tokens)

Never mix embedding models

The vectors from different embedding models live in incompatible spaces — cosine similarity across models is meaningless. Pick one model and use it for both indexing and query. If you switch models, you must re-embed your entire corpus. Track your embedding model version as infrastructure config, not a detail.
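
One way to treat the embedding model as infrastructure config is to store its name and dimensionality alongside the index and refuse to query when they differ. The sketch below assumes that approach; `EmbeddingConfig` and `assert_compatible` are hypothetical names, not part of any library.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class EmbeddingConfig:
    model: str
    dimensions: int

def assert_compatible(index_cfg: EmbeddingConfig, query_cfg: EmbeddingConfig) -> None:
    """Raise if the query-time model differs from the one used to build the index."""
    if index_cfg != query_cfg:
        raise ValueError(
            f"Index built with {index_cfg.model} ({index_cfg.dimensions}d) but "
            f"query uses {query_cfg.model} ({query_cfg.dimensions}d); re-embed the corpus."
        )

# Stored next to the index when it was built.
index_cfg = EmbeddingConfig("text-embedding-3-small", 1536)

# Same model at query time: passes silently.
assert_compatible(index_cfg, EmbeddingConfig("text-embedding-3-small", 1536))
```

Running this check once at service start-up is cheap and turns a silent quality failure into a loud configuration error.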

Chunking: The Make-or-Break Decision

If there is one thing that determines whether a RAG system works well or poorly, it is chunking. The chunk is the unit of retrieval: when a user asks a question, the system returns the chunks whose embeddings are closest to the question embedding. If your chunks are the wrong size, contain the wrong content, or straddle ideas that belong apart, your retrieval will return irrelevant or incomplete context no matter how good everything else is.

The chunk size trade-off

Chunk size is a dial between two failure modes:

  • Too small: chunks lack enough context for the embedding to be meaningful. A lone sentence such as “The limit is 500.” embeds poorly because nothing in it says what “the limit” refers to.
  • Too large: chunks contain multiple ideas with competing semantics. The embedding averages across them, becoming a blurred representation of several topics rather than a sharp representation of one. Retrieval returns these chunks for multiple different queries — including ones where only part of the chunk is relevant.

The sweet spot for most prose documents is 300–600 tokens per chunk, with 10–15% overlap between adjacent chunks. Overlap ensures that information sitting at a chunk boundary doesn’t fall through the gap.

[Figure: Document text → chunks with overlap. A sample passage is divided into Chunk A and Chunk B; the overlap region in the middle belongs to both chunks.]

Chunking strategies by document type

Strategy 01 — Fixed-size with overlap

Best for: articles, reports, long-form prose

Split every N tokens with an M-token overlap. Simple, predictable, works well for uniformly-structured text. Start with 512 tokens / 64 overlap. The main weakness: splits can cut across sentences mid-thought. Use a sentence-boundary-aware splitter to avoid this.
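
A minimal sketch of fixed-size chunking with overlap follows. It assumes the text has already been tokenised into a list; here whitespace-separated words stand in for tokens, whereas a real pipeline would use the tokenizer that matches its embedding model. The function name `chunk_fixed` is hypothetical, and the sketch does not attempt sentence-boundary awareness.

```python
def chunk_fixed(tokens: list[str], size: int = 512, overlap: int = 64) -> list[list[str]]:
    """Split a token list into windows of `size` tokens that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks

words = [f"w{i}" for i in range(1000)]  # stand-in for a tokenised document
chunks = chunk_fixed(words, size=512, overlap=64)
print([len(c) for c in chunks])
```

Each window starts `size - overlap` tokens after the previous one, so the last 64 tokens of one chunk are the first 64 of the next.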

Strategy 02 — Structural / section-based

Best for: documentation, manuals, wikis

Split at natural section boundaries (headings, horizontal rules, numbered sections). Each chunk maps to one topic, which preserves semantic coherence much better than fixed-size splitting for structured content. Chunks may vary widely in size, so add a max-size guard that breaks unusually long sections into smaller pieces instead of letting one oversized chunk through.
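
As an illustration of section-based chunking, the sketch below splits on Markdown-style heading lines and applies a max-size guard. It assumes Markdown input and counts whitespace-separated words rather than model tokens; the guard here sub-splits an oversized section so that no text is dropped. `split_by_headings` is a hypothetical name.

```python
import re

def split_by_headings(markdown: str, max_words: int = 400) -> list[dict]:
    """Split Markdown at heading lines; sub-split sections longer than max_words."""
    sections, heading, lines = [], "(preamble)", []
    for line in markdown.splitlines():
        if re.match(r"^#{1,6}\s", line):
            sections.append((heading, lines))
            heading, lines = line.lstrip("#").strip(), []
        else:
            lines.append(line)
    sections.append((heading, lines))

    chunks = []
    for heading, lines in sections:
        words = " ".join(lines).split()
        # Max-size guard: a long section becomes several chunks under the same heading.
        for start in range(0, len(words), max_words):
            chunks.append({"section": heading, "text": " ".join(words[start:start + max_words])})
    return chunks

doc = "# Install\nRun the installer.\n\n# Configure\n" + "Set an option. " * 300
chunks = split_by_headings(doc, max_words=400)
print([(c["section"], len(c["text"].split())) for c in chunks])
```

Keeping the heading on every piece means a sub-split chunk still carries its topic as metadata.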

Strategy 03 — Sentence or paragraph level

Best for: FAQs, support tickets, dense factual content

Each chunk is one paragraph or 2–4 sentences. High precision retrieval at the cost of less surrounding context. Works best when each paragraph is genuinely self-contained. Often paired with a parent-chunk retrieval pattern: retrieve the small chunk, then inject the surrounding larger context.
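
The parent-chunk pattern can be sketched as two steps: index small child chunks that remember their parent, then swap retrieved children for their parents before building the context. The example below assumes one child per sentence and one parent per paragraph, and it fakes the retrieval step by picking hits by hand; `build_child_chunks` and `expand_to_parents` are hypothetical names.

```python
import re

def build_child_chunks(parents: list[str]) -> list[dict]:
    """Index one small chunk per sentence, remembering which parent it came from."""
    children = []
    for parent_id, parent in enumerate(parents):
        for sentence in re.split(r"(?<=[.!?])\s+", parent.strip()):
            if sentence:
                children.append({"text": sentence, "parent_id": parent_id})
    return children

def expand_to_parents(hits: list[dict], parents: list[str]) -> list[str]:
    """Swap each retrieved child for its parent, de-duplicating while keeping rank order."""
    seen, context = set(), []
    for hit in hits:
        if hit["parent_id"] not in seen:
            seen.add(hit["parent_id"])
            context.append(parents[hit["parent_id"]])
    return context

parents = [
    "Refunds take 14 days. Contact billing to start one.",
    "Shipping takes 3 to 5 days. Express shipping is available.",
]
children = build_child_chunks(parents)

# Pretend the retriever matched two sentences from the same paragraph.
hits = [children[2], children[3]]
print(expand_to_parents(hits, parents))
```

De-duplication matters here: two matching sentences from one paragraph should inject that paragraph once, not twice.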

Strategy 04 — Code-aware chunking

Best for: codebases, API references

Split at function, class, or method boundaries — not at arbitrary token counts. A function split mid-body is useless context. Use a parser (tree-sitter for most languages) to identify logical code units. Include the function signature and docstring in every chunk even if the full body is truncated.
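
For Python sources specifically, the standard-library `ast` module can stand in for tree-sitter. The sketch below uses `ast` (not tree-sitter) to emit one chunk per top-level function or class, carrying the signature line and docstring as separate fields; it assumes single-line signatures and ignores nested definitions. `chunk_python_source` is a hypothetical name.

```python
import ast

def chunk_python_source(source: str) -> list[dict]:
    """Return one chunk per top-level function or class, with signature and docstring."""
    tree = ast.parse(source)
    lines = source.splitlines()
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append({
                "name": node.name,
                "kind": type(node).__name__,
                "signature": lines[node.lineno - 1].strip(),  # the `def` / `class` line
                "docstring": ast.get_docstring(node) or "",
                "text": ast.get_source_segment(source, node),
            })
    return chunks

sample = '''
def add(a, b):
    """Return the sum of a and b."""
    return a + b

class Greeter:
    def greet(self, name):
        return f"Hello, {name}"
'''

for chunk in chunk_python_source(sample):
    print(chunk["kind"], chunk["name"], "|", chunk["signature"])
```

Storing the signature and docstring as their own fields makes it easy to keep them even when the body has to be shortened to fit a size limit.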

Always add metadata to every chunk

A chunk without metadata is just text. Metadata enables filtering (retrieve only from certain document types or date ranges), attribution (show users where the answer came from), and debugging (understand which documents are actually being used). At minimum, store: source filename, section heading or page number, creation/update date, and any categorical tags (document type, topic, product area).

Python — Chunk with Metadata
from dataclasses import dataclass
from datetime import datetime

@dataclass
class Chunk:
    text: str
    source_file: str
    section:     str
    page:        int | None
    doc_type:    str           # "policy" | "faq" | "manual" | "code"
    tags:        list[str]
    created_at:  datetime
    char_start:  int           # position in original document
    char_end:    int
    embedding:   list[float] | None = None

# When storing in pgvector:
# INSERT INTO chunks (text, source_file, section, doc_type, tags, embedding)
# VALUES (%s, %s, %s, %s, %s, %s::vector)

Retrieval Quality: Thresholds, Top-K, and Metadata Filtering

Getting retrieval right is about more than running a similarity search and taking the top results. Three parameters control retrieval quality, and all three need to be tuned to your data and use case.

Similarity threshold

A similarity score tells you how close a retrieved chunk is to the query — but raw scores are not directly meaningful without context. Cosine similarity of 0.82 is excellent for some embedding models and mediocre for others. You need to calibrate empirically: sample 20–50 queries you know the answers to, look at the similarity scores of the correct chunks, and set your threshold just below the cluster of correct results.

As a rough starting point for text-embedding-3-small, discard anything below 0.72–0.75, then adjust once you have calibration data, because score distributions shift with the model and the corpus. At that setting a chunk at 0.65 is usually noise, and including it degrades response quality. It is better to return fewer, higher-confidence chunks than to pad the context with tangentially related text.
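
The calibration step can be sketched as follows. This is one possible heuristic, not the only one: it takes the similarity scores of the known-correct chunks, finds the low end of that cluster using the first decile (so a single outlier does not drag the threshold down), and subtracts a small margin. The function name, the margin, and the sample scores are all hypothetical.

```python
import statistics

def calibrate_threshold(correct_scores: list[float], margin: float = 0.02) -> float:
    """Pick a threshold just below the low end of the known-correct scores."""
    if len(correct_scores) < 2:
        raise ValueError("need at least two labelled scores to calibrate")
    # First decile cut point: roughly the score that 90% of correct chunks exceed.
    low_end = statistics.quantiles(correct_scores, n=10)[0]
    return round(low_end - margin, 3)

# Hypothetical similarity scores of the correct chunk for ten labelled test queries.
correct_scores = [0.81, 0.78, 0.85, 0.74, 0.79, 0.88, 0.76, 0.83, 0.71, 0.80]
threshold = calibrate_threshold(correct_scores)
print(f"Suggested min_similarity: {threshold}")
```

Re-run the calibration whenever the embedding model or the chunking strategy changes, since both move the score distribution.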

Top-K: how many chunks to retrieve

Retrieving more chunks gives the model more information to work with — but also more noise, more context window consumption, and more latency on the generation side. The right number depends on your chunk size and context window budget:

Chunk size | Recommended top-K | Approx. context tokens
200–300 tokens | 8–12 | 1,600–3,600
400–600 tokens | 4–6 | 1,600–3,600
800–1,000 tokens | 2–4 | 1,600–4,000

A rule of thumb: budget 2,000–4,000 tokens for retrieved context, stay within it, and prefer fewer high-quality chunks over more lower-quality ones. Test your specific queries at different K values and measure answer accuracy; it usually plateaus well below K=10.
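
A context budget can be enforced with a simple greedy pass over the ranked results. The sketch below assumes each candidate already carries a similarity `score` and a pre-computed `tokens` count; the field names, the `fit_to_budget` function, and the sample values are hypothetical.

```python
def fit_to_budget(chunks: list[dict], budget_tokens: int = 3000) -> list[dict]:
    """Keep the highest-scoring chunks whose combined size stays within the budget."""
    selected, used = [], 0
    for chunk in sorted(chunks, key=lambda c: c["score"], reverse=True):
        if used + chunk["tokens"] > budget_tokens:
            break  # stop at the first chunk that does not fit, preserving rank order
        selected.append(chunk)
        used += chunk["tokens"]
    return selected

candidates = [
    {"id": "a", "score": 0.91, "tokens": 500},
    {"id": "b", "score": 0.84, "tokens": 600},
    {"id": "c", "score": 0.80, "tokens": 550},
    {"id": "d", "score": 0.77, "tokens": 450},
]
print([c["id"] for c in fit_to_budget(candidates, budget_tokens=1800)])  # ['a', 'b', 'c']
```

Stopping at the first chunk that does not fit, rather than skipping ahead to smaller ones, keeps the context strictly in relevance order.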

Metadata filtering: pre-filter before vector search

Searching the entire corpus for every query is wasteful and often wrong. If your user is asking about a specific product, there’s no reason to search documentation for a different product. Metadata filters restrict the search space before vector similarity runs, dramatically improving precision:

Python — Filtered Retrieval (pgvector)
def retrieve(
    query: str,
    doc_type: str | None = None,
    tags: list[str] | None = None,
    top_k: int = 5,
    min_similarity: float = 0.72,
    db = None
) -> list[dict]:
    query_vec = embed(query)

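    # Build WHERE clause from metadata filters
    filters, filter_params = [], []
    if doc_type:
        filters.append("doc_type = %s")
        filter_params.append(doc_type)
    if tags:
        filters.append("tags && %s")  # array overlap: chunk carries any requested tag
        filter_params.append(tags)

    # Parameters must follow the placeholder order in the SQL below:
    # score vector, filter values, ORDER BY vector, LIMIT
    params = [query_vec, *filter_params, query_vec, top_k]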

    where = ("WHERE " + " AND ".join(filters)) if filters else ""

    rows = db.query(f"""
        SELECT text, source_file, section, tags,
               1 - (embedding <=> %s::vector) AS score
        FROM chunks {where}
        ORDER BY embedding <=> %s::vector
        LIMIT %s
    """, params)

    return [
        {"text": r.text, "source": r.source_file,
         "section": r.section, "score": r.score}
        for r in rows if r.score >= min_similarity
    ]

Hybrid retrieval: vector + keyword

Pure vector search can miss exact matches. If a user asks for “the GDPR Article 17 right to erasure clause”, keyword search will find that text precisely and quickly, whereas vector search might drift toward documents that are semantically similar but do not match lexically. The best production systems run both in parallel and merge results:

Hybrid Retrieval: Reciprocal Rank Fusion
def rrf_score(rank: int, k: int = 60) -> float:
    """Reciprocal Rank Fusion — merge ranked lists from multiple retrievers."""
    return 1.0 / (k + rank)

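def hybrid_retrieve(query: str, db, top_k: int = 8) -> list[dict]:
    # 1. Vector search (semantic). Uses retrieve() from the filtered-retrieval
    #    listing; min_similarity=0.0 leaves the final cut to the fusion step.
    vec_results = retrieve(query, top_k=top_k * 2, min_similarity=0.0, db=db)

    # 2. Keyword search (PostgreSQL full-text search ranked by ts_rank,
    #    used here as a stand-in for BM25; ts_rank is not BM25)
    kw_results = db.query(
        "SELECT text, source_file, section FROM chunks"
        " WHERE content_tsv @@ plainto_tsquery(%s)"
        " ORDER BY ts_rank(content_tsv, plainto_tsquery(%s)) DESC LIMIT %s",
        [query, query, top_k * 2]
    )

    # 3. Merge via Reciprocal Rank Fusion (ranks are 1-based)
    scores: dict[str, float] = {}
    chunks: dict[str, dict] = {}
    for rank, r in enumerate(vec_results, start=1):
        scores[r["text"]] = scores.get(r["text"], 0.0) + rrf_score(rank)
        chunks[r["text"]] = {"text": r["text"], "source": r["source"], "section": r["section"]}
    for rank, r in enumerate(kw_results, start=1):
        scores[r.text] = scores.get(r.text, 0.0) + rrf_score(rank)
        chunks.setdefault(r.text, {"text": r.text, "source": r.source_file, "section": r.section})

    # Return full chunk records (not bare text) so sources can still be cited
    merged = sorted(scores.items(), key=lambda x: x[1], reverse=True)
    return [{**chunks[text], "score": score} for text, score in merged[:top_k]]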

Re-Ranking: Why Retrieval Order Matters

Vector search returns results ordered by similarity score. That score is computed fast — it’s a dot product in high-dimensional space — but it is an approximation. The top result by cosine similarity is not always the most relevant result for the user’s actual question. Re-ranking is a second pass that uses a slower but more precise model to reorder the retrieved chunks before they go into the context.

How re-ranking works

A re-ranker (also called a cross-encoder) takes the full query and each candidate chunk together, and scores their relevance jointly. Unlike embedding similarity — which encodes query and document independently — the cross-encoder reads both texts together and scores their interaction directly. This is much more accurate but also much slower, which is why it runs only on the small set of candidates already retrieved by the faster vector search.

Python — Re-ranking with Cohere Rerank
import cohere

co = cohere.Client("YOUR_COHERE_KEY")

def rerank(query: str, candidates: list[dict], top_n: int = 5) -> list[dict]:
    # `candidates` are the 10-20 chunks returned by the initial vector search.
    # The re-ranker scores each one against the query and keeps the best top_n.
    response = co.rerank(
        query=query,
        documents=[c["text"] for c in candidates],
        model="rerank-english-v3.0",
        top_n=top_n
    )
    return [candidates[r.index] for r in response.results]

# Alternative: open-source cross-encoder via sentence-transformers
# from sentence_transformers import CrossEncoder
# model = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
# scores = model.predict([(query, c["text"]) for c in candidates])

Whether re-ranking is worth adding depends on your quality requirements and latency budget. It adds 100–400ms per query (Cohere’s hosted API) or more for local models. The quality improvement is most significant for queries with ambiguous phrasing, typically a 10–20% improvement in answer accuracy on standard benchmarks, although the gain on your own corpus should be measured rather than assumed. For customer-facing applications where answer quality is business-critical, it is almost always worth it.

Retrieve wide, re-rank narrow

The practical pattern: retrieve 15–20 candidates from vector search (recall-optimised, fast), then re-rank to the top 4–6 (precision-optimised, slower). This gives you the recall of a generous initial retrieval and the precision of a careful second pass, at a fraction of the cost of running the re-ranker on the full corpus.
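
The two-stage pattern can be expressed as a small composition of a cheap retriever and a more careful scorer. The sketch below is deliberately generic: `retrieve_fn` and `score_fn` are hypothetical callables you would back with your vector search and your re-ranker, and the toy stand-ins here (return everything, score by word overlap) exist only to make the example runnable.

```python
from typing import Callable

def retrieve_then_rerank(
    query: str,
    retrieve_fn: Callable[[str, int], list[dict]],
    score_fn: Callable[[str, str], float],
    wide_k: int = 20,
    narrow_k: int = 5,
) -> list[dict]:
    """Fetch wide_k candidates cheaply, then keep the narrow_k best by score_fn."""
    candidates = retrieve_fn(query, wide_k)
    ranked = sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return ranked[:narrow_k]

# Toy stand-ins for a vector search and a cross-encoder.
corpus = [
    {"text": "Password policy overview"},
    {"text": "Steps to reset your password"},
    {"text": "Office opening hours"},
]

def toy_retrieve(query: str, k: int) -> list[dict]:
    return corpus[:k]

def toy_score(query: str, text: str) -> float:
    return float(len(set(query.lower().split()) & set(text.lower().split())))

best = retrieve_then_rerank("reset password steps", toy_retrieve, toy_score, wide_k=3, narrow_k=1)
print(best[0]["text"])  # Steps to reset your password
```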

RAG vs a Big Context Window

Modern frontier models have very large context windows — 128k, 200k, even 1M tokens. A reasonable question: why bother with RAG at all if you can just stuff everything into the context?

The honest answer is that for small corpora, you often shouldn’t bother with RAG. If your knowledge base is 50 pages of documentation and fits comfortably in 40,000 tokens, putting it all in the context is simpler, faster to build, and likely works better than a RAG pipeline with suboptimal chunking. But the trade-offs shift quickly as scale increases:

Consideration | Full context stuffing | RAG
Knowledge base size | ≤ ~500 pages | Unlimited (millions of documents)
Cost per query | High: full corpus on every call | Low: only retrieved chunks billed
Latency | Proportional to corpus size | Mostly fixed (retrieval + small context)
Retrieval accuracy | Degrades with long contexts (lost-in-the-middle) | Precise when chunking is good
Implementation complexity | None: just concatenate | Requires indexing pipeline + vector store
Keeping data current | Just update the document | Re-chunk and re-embed changed documents
Multi-user applications | Per-user corpus injection is impractical | Filter by user metadata on every query

Use context stuffing when: corpus is small (<100k tokens), you’re prototyping, or the query always requires reasoning across the whole document (e.g. “summarise everything”).

Use RAG when: corpus is large or grows over time, costs need to stay predictable, you need precise attribution, or you serve multiple users with different data subsets.
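
The cost difference is simple arithmetic on input tokens. The sketch below compares a 100,000-token stuffed context with a 4,000-token RAG context; the price per million tokens is a hypothetical placeholder, so substitute your provider's actual input rate.

```python
def cost_per_query(context_tokens: int, price_per_million_usd: float) -> float:
    """Input-token cost of one request, ignoring output tokens."""
    return context_tokens / 1_000_000 * price_per_million_usd

PRICE = 3.00  # hypothetical input price in USD per million tokens

full_context = cost_per_query(100_000, PRICE)  # whole corpus on every call
rag_context = cost_per_query(4_000, PRICE)     # only the retrieved chunks

print(f"full context: ${full_context:.3f} per query")
print(f"RAG:          ${rag_context:.3f} per query")
print(f"ratio:        {full_context / rag_context:.0f}x")
```

Whatever the actual price, the ratio depends only on the token counts, so it stays the same across providers.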

The “lost in the middle” problem

Research consistently shows that LLMs are better at using information at the start and end of a long context than information buried in the middle. A 100,000-token context with your answer on page 12 will often produce worse results than a 4,000-token RAG context where the answer is in the top 3 retrieved chunks. Long context is a capability, not a guarantee.

The RAG Prompt Template

How you inject retrieved context into the prompt matters as much as what you retrieve. A well-structured RAG prompt tells the model how to use the context, what to do when it’s insufficient, and how to cite sources. Here is a production-ready template:

Core RAG Prompt Template
SYSTEM:
You are a knowledgeable assistant with access to a curated knowledge base.
Answer questions using ONLY the context provided below.

Rules:
1. Base your answer entirely on the provided context.
2. If the answer is not in the context, say:
   "I don't have that information in my knowledge base."
   Do not guess, infer beyond the context, or use outside knowledge.
3. Cite your sources as [Source 1], [Source 2], etc. after each claim.
4. If multiple sources say different things, note the discrepancy.
5. Be concise. Don't repeat the question or pad your answer.

CONTEXT:
[Source 1: {source_1_filename} — {source_1_section}]
{chunk_1_text}

[Source 2: {source_2_filename} — {source_2_section}]
{chunk_2_text}

[Source 3: {source_3_filename} — {source_3_section}]
{chunk_3_text}

USER QUESTION:
{user_question}
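
Filling the CONTEXT block from retrieved chunks can be sketched as below. It assumes each chunk is a dict with `text`, `source`, and `section` keys, matching what the retrieval listing returns; `build_context_string` and `build_user_message` are hypothetical helpers, and the separator between filename and section is a plain hyphen here.

```python
def build_context_string(chunks: list[dict]) -> str:
    """Render retrieved chunks as numbered, labelled sources for the CONTEXT block."""
    return "\n\n".join(
        f"[Source {i}: {c['source']} - {c['section']}]\n{c['text']}"
        for i, c in enumerate(chunks, start=1)
    )

def build_user_message(chunks: list[dict], question: str) -> str:
    """Combine the labelled context and the user's question into one message."""
    return f"CONTEXT:\n{build_context_string(chunks)}\n\nUSER QUESTION:\n{question}"

chunks = [
    {"text": "Refunds are issued within 14 days.", "source": "policy.md",
     "section": "Refunds", "score": 0.83},
    {"text": "Gift cards are not refundable.", "source": "faq.md",
     "section": "Gift cards", "score": 0.79},
]
print(build_user_message(chunks, "Can I refund a gift card?"))
```

Numbering the sources in the same order they are injected is what lets the model's [Source N] citations be mapped back to specific chunks afterwards.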

Prompt variations for different use cases

Conversational RAG (with chat history)
SYSTEM:
You are a support assistant. Use the provided knowledge base context to answer
questions. Maintain a friendly conversational tone. Cite sources inline.
If the context doesn't contain the answer, say so and suggest the user
contact support at [EMAIL].

CONTEXT:
{retrieved_chunks}

CONVERSATION HISTORY:
{last_5_turns}

USER:
{current_message}

Strict Citation RAG (for high-stakes applications)
SYSTEM:
You are a research assistant. Every statement in your answer MUST be directly
supported by a citation from the provided context. Format:

Answer text [Source N, page/section].

If you cannot find direct support for a claim in the context, omit the claim
entirely. Do not paraphrase in a way that changes the meaning of the source.

CONTEXT:
{retrieved_chunks_with_full_metadata}

QUESTION:
{user_question}
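
A prompt can ask for citations, but only a check on the output confirms they are there. The sketch below is a cheap, regex-based audit that assumes the "[Source N, ...]" format requested above: it flags citation numbers that do not match any injected source and sentences that carry no citation at all. It does not verify that the cited chunk actually supports the claim. Function names are hypothetical.

```python
import re

def invalid_citations(answer: str, num_sources: int) -> list[int]:
    """Return cited source numbers that do not correspond to any injected chunk."""
    cited = {int(n) for n in re.findall(r"\[Source (\d+)", answer)}
    return sorted(n for n in cited if not 1 <= n <= num_sources)

def uncited_sentences(answer: str) -> list[str]:
    """Return sentences that carry no [Source N] marker."""
    sentences = re.split(r"(?<=[.!?])\s+", answer.strip())
    return [s for s in sentences if s and "[Source" not in s]

answer = (
    "Refunds take 14 days [Source 1, Refunds]. "
    "Gift cards are excluded [Source 4, Gift cards]. "
    "Shipping is free."
)
print(invalid_citations(answer, num_sources=3))  # [4]
print(uncited_sentences(answer))                 # ['Shipping is free.']
```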

Common Failure Modes

1. Chunks too large — retrieval returns blurred context

A 2,000-token chunk that covers three topics will score adequately for queries about any of those topics — but will pollute the context with the other two. If your answers feel unfocused or the model hedges when it shouldn’t, your chunks are probably too large. Halve the chunk size and re-embed.

2. No similarity threshold — noise leaks into context

Taking the top-K results without a minimum similarity threshold means every query gets K results, even when only 2 are relevant and the rest are tangential. A chunk at 0.60 similarity will mislead the model. Set a threshold and return “I don’t have information on this” when nothing clears it rather than injecting irrelevant context.

3. Stale index — model answers from outdated documents

If documents in your knowledge base are updated but the chunks in your vector store aren’t, the model answers from old information while the correct answer is in the updated document. Build an update pipeline: track document hashes, re-chunk and re-embed any document whose hash has changed, delete orphaned chunks from deleted documents.
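
The hash-tracking step can be sketched as a comparison between the live documents and the hashes recorded at the last indexing run. The example assumes documents are available as a name-to-text mapping and that the stored hashes live in a simple dict; `content_hash` and `plan_index_update` are hypothetical names.

```python
import hashlib

def content_hash(text: str) -> str:
    """Stable fingerprint of a document's content."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def plan_index_update(
    current_docs: dict[str, str], indexed_hashes: dict[str, str]
) -> dict[str, list[str]]:
    """Compare live documents against stored hashes and decide what to re-embed or delete."""
    to_embed = [
        name for name, text in current_docs.items()
        if indexed_hashes.get(name) != content_hash(text)  # new or changed
    ]
    to_delete = [name for name in indexed_hashes if name not in current_docs]  # orphaned
    return {"re_embed": sorted(to_embed), "delete": sorted(to_delete)}

indexed = {
    "a.md": content_hash("old text"),
    "b.md": content_hash("same"),
    "gone.md": content_hash("x"),
}
current = {"a.md": "new text", "b.md": "same", "c.md": "brand new"}

print(plan_index_update(current, indexed))
# {'re_embed': ['a.md', 'c.md'], 'delete': ['gone.md']}
```

When a document is re-embedded, delete its old chunks first so that stale and fresh versions never coexist in the store.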

4. Raw chunks with no source labels

Injecting chunks without source metadata means the model cannot cite its answers, you cannot audit which documents are being used, and users cannot verify claims. Label every chunk with its source before injection. This costs nothing and transforms RAG from a black box into an auditable system.

5. Treating vector similarity as a confidence score

High similarity means “this text is about the same topic as the query” — not “this text contains the answer to the query”. A chunk about password reset policies will score highly for “how do I reset my password?” whether or not it contains the actual steps. Use similarity for ranking and filtering, but validate answer presence with the re-ranker or with output validation.

6. Hallucination on empty retrieval

When no chunks clear the similarity threshold, the model has no grounded context. Without explicit handling, it falls back on its training knowledge and may hallucinate a confident-sounding answer. Always detect the empty-retrieval case and respond with a defined fallback rather than letting the model improvise.

Empty Retrieval Guard
async def rag_response(question: str, user_id: str) -> str:
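    # Assumes retrieve() has been extended with a user_id metadata filter, and that
    # build_context_string(), call_llm(), and RAG_SYSTEM_PROMPT are defined elsewhere.
    chunks = retrieve(question, user_id=user_id, min_similarity=0.72)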

    if not chunks:
        return (
            "I don't have information about that in my knowledge base. "
            "You can contact our support team at support@example.com."
        )

    context = build_context_string(chunks)
    return await call_llm(RAG_SYSTEM_PROMPT, context, question)

Download the Free Templates

The downloadable file contains the complete RAG prompt templates, chunking decision guide, full pgvector schema, retrieval and re-ranking code, and the empty-retrieval guard pattern — all ready to adapt to your stack.
