
Building RAG Systems: From Architecture to Production

RAG gives your agents access to private knowledge without fine-tuning. Learn the complete pipeline — chunking, embedding, indexing, retrieval, and generation — with practical code examples and production patterns.

Last updated: 2026-03-02


Large language models are trained on a fixed snapshot of the world. They do not know about your company's internal documentation, your product changelog from last Tuesday, or the customer conversation that happened ten minutes ago. When they do not have the answer, they do not say so — they fabricate something plausible. This is the hallucination problem, and it is the reason most teams cannot deploy LLMs against private knowledge without an additional layer.

Retrieval-augmented generation is that layer. Instead of hoping the model memorized relevant information during training, you retrieve the actual documents that answer the question and inject them into the prompt. The model generates a response grounded in real data rather than its parametric memory. It is a straightforward idea — look it up, then answer — but the gap between the idea and a system that works reliably in production is where most teams struggle.

Industry surveys consistently suggest that a majority of enterprise AI projects use some form of RAG today. It is the dominant pattern for connecting LLMs to private knowledge, and for good reason: it works without fine-tuning, it updates instantly when your data changes, and it gives you control over exactly what the model sees. This guide covers the full pipeline from document ingestion to production deployment, with the practical details that determine whether your RAG system is useful or just impressive in a demo.


RAG vs. Fine-Tuning

Before committing to RAG, understand how it compares to the alternative: fine-tuning a model on your data.

  • Cost to implement: RAG is low to moderate (embedding and vector DB infrastructure); fine-tuning is high (training compute, dataset curation, evaluation).
  • Data freshness: RAG is real-time (update documents and retrieval reflects it immediately); fine-tuning goes stale and requires retraining to incorporate new data.
  • Latency: RAG adds a retrieval step (50-200 ms for vector search); fine-tuning has no retrieval overhead, but larger models are slower.
  • Accuracy on domain data: RAG is high when retrieval is good, and degrades with poor chunking or retrieval; fine-tuning is high after sufficient training, but can overfit on small datasets.
  • Hallucination control: RAG is strong (the model cites retrieved documents and you can audit the source); fine-tuning is weak (the model internalizes knowledge, so it is hard to trace where answers come from).
  • Maintenance: RAG means updating documents as data changes, with no retraining; fine-tuning means retraining periodically and managing model versions.
  • Best for: RAG suits knowledge bases, documentation, support, and any domain where data changes; fine-tuning suits tone, style, specialized reasoning, and tasks where retrieval latency is unacceptable.

For most production use cases, RAG is the right starting point. Fine-tuning is warranted when you need the model to internalize a specific reasoning pattern or communication style — not when you need it to know facts. Facts change. A retrieval layer handles that. A fine-tuned model does not.


The RAG Pipeline

Every RAG system follows the same core pipeline. The details vary, but the structure does not.

Documents
    |
    v
[ Chunking ]  ---->  Split documents into retrievable units
    |
    v
[ Embedding ]  ---->  Convert chunks to vector representations
    |
    v
[ Vector DB ]  ---->  Index and store vectors for fast search
    |
    v
[ User Query ]
    |
    v
[ Query Embedding ]  ---->  Convert query to same vector space
    |
    v
[ Retrieval ]  ---->  Find most similar chunks
    |
    v
[ Context Assembly ]  ---->  Build prompt with retrieved chunks
    |
    v
[ LLM Generation ]  ---->  Generate grounded response
    |
    v
Response

Each stage has failure modes and design decisions that compound downstream. A bad chunking strategy produces bad embeddings. Bad embeddings produce bad retrieval. Bad retrieval produces hallucinated responses regardless of how good the model is. The pipeline is only as strong as its weakest stage.


Document Chunking Strategies

Chunking is where most RAG systems succeed or fail. The goal is to split documents into units that are small enough to be semantically focused — each chunk should be about one thing — but large enough to carry sufficient context for the model to use them.

Fixed-Size Chunking

The simplest approach: split text into chunks of a fixed token count. Fast, predictable, easy to implement. The problem is that it has no awareness of document structure. A fixed-size chunk might split a paragraph in the middle of a sentence, separating the claim from its evidence.
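The sliding-window arithmetic is simple enough to sketch directly. A minimal version, approximating tokens as characters divided by four (a rough heuristic; a production implementation would count with the model's actual tokenizer):

```typescript
// Fixed-size chunking: take windows of ~maxTokens tokens, stepping
// forward by (maxTokens - overlapTokens) so consecutive chunks share
// an overlap region.
function fixedSizeChunks(
  text: string,
  maxTokens = 512,
  overlapTokens = 50
): string[] {
  const charsPerToken = 4; // rough heuristic, not a real tokenizer
  const windowChars = maxTokens * charsPerToken;
  const stepChars = (maxTokens - overlapTokens) * charsPerToken;

  const chunks: string[] = [];
  for (let start = 0; start < text.length; start += stepChars) {
    chunks.push(text.slice(start, start + windowChars));
    if (start + windowChars >= text.length) break;
  }
  return chunks;
}
```

Note that nothing here looks at sentence or paragraph boundaries, which is exactly the weakness described above.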

Recursive Splitting

Start with large structural boundaries (sections, headings), then recursively split within those boundaries if chunks exceed the target size. This preserves document structure while maintaining size constraints. LangChain's RecursiveCharacterTextSplitter is the most common implementation of this pattern.

Semantic Chunking

Use an embedding model to detect topic boundaries within the document. When the cosine similarity between consecutive sentences drops below a threshold, insert a chunk boundary. This produces chunks that are semantically coherent but variable in size. More expensive to compute but produces better retrieval quality for documents without clear structural markers.
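The boundary-detection loop can be sketched with the embedding call injected, so any model can be plugged in. The 0.75 threshold below is illustrative, not a recommendation; tune it against your own corpus:

```typescript
type Embedder = (sentences: string[]) => Promise<number[][]>;

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Insert a chunk boundary wherever the similarity between consecutive
// sentences drops below the threshold (a likely topic shift).
async function semanticChunks(
  sentences: string[],
  embed: Embedder,
  threshold = 0.75
): Promise<string[][]> {
  const vectors = await embed(sentences);
  const chunks: string[][] = [[sentences[0]]];
  for (let i = 1; i < sentences.length; i++) {
    if (cosine(vectors[i - 1], vectors[i]) < threshold) {
      chunks.push([]); // topic shift detected: start a new chunk
    }
    chunks[chunks.length - 1].push(sentences[i]);
  }
  return chunks;
}
```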

Optimal Chunk Sizes

Research and practice converge on 512 to 1024 tokens as the effective range for most use cases. Smaller chunks (256 tokens) improve retrieval precision but lose context. Larger chunks (2048+ tokens) carry more context but dilute relevance and consume token budget. Start at 512 tokens and adjust based on your retrieval evaluation metrics.

Overlap

Chunks should overlap by 10-20% of their size. A 512-token chunk with 50-100 tokens of overlap ensures that information at chunk boundaries is not lost. Without overlap, a question whose answer spans two chunks may not match either chunk well enough to retrieve it.

Code Example: Recursive Chunking

interface Chunk {
  text: string;
  metadata: {
    source: string;
    chunkIndex: number;
    startChar: number;
    endChar: number;
  };
}

function chunkDocument(
  text: string,
  source: string,
  options: {
    maxTokens?: number;
    overlapTokens?: number;
    separators?: string[];
  } = {}
): Chunk[] {
  const maxTokens = options.maxTokens ?? 512;
  const overlapTokens = options.overlapTokens ?? 50;
  const separators = options.separators ?? ["\n\n", "\n", ". ", " "];

  const chunks: Chunk[] = [];

  function splitRecursively(
    segment: string,
    startChar: number,
    separatorIndex: number
  ): void {
    const estimatedTokens = Math.ceil(segment.length / 4); // rough heuristic: ~4 chars per token

    if (estimatedTokens <= maxTokens) {
      chunks.push({
        text: segment.trim(),
        metadata: {
          source,
          chunkIndex: chunks.length,
          startChar,
          endChar: startChar + segment.length,
        },
      });
      return;
    }

    if (separatorIndex >= separators.length) {
      // No more separators — force split at maxTokens boundary
      const splitPoint = maxTokens * 4; // rough char estimate
      chunks.push({
        text: segment.slice(0, splitPoint).trim(),
        metadata: {
          source,
          chunkIndex: chunks.length,
          startChar,
          endChar: startChar + splitPoint,
        },
      });
      const overlapChars = overlapTokens * 4;
      splitRecursively(
        segment.slice(splitPoint - overlapChars),
        startChar + splitPoint - overlapChars,
        separatorIndex
      );
      return;
    }

    const separator = separators[separatorIndex];
    const parts = segment.split(separator);

    if (parts.length === 1) {
      splitRecursively(segment, startChar, separatorIndex + 1);
      return;
    }

    let current = "";
    let currentStart = startChar;

    for (const part of parts) {
      const candidate = current ? current + separator + part : part;
      if (Math.ceil(candidate.length / 4) > maxTokens && current) {
        splitRecursively(current, currentStart, separatorIndex + 1);
        // Start next chunk with overlap from the end of current.
        // Compute the overlap's start position before reassigning `current`.
        const overlapChars = overlapTokens * 4;
        const overlapText = current.slice(-overlapChars);
        const overlapStart = currentStart + current.length - overlapText.length;
        current = overlapText + separator + part;
        currentStart = overlapStart;
      } else {
        current = candidate;
      }
    }

    if (current.trim()) {
      splitRecursively(current, currentStart, separatorIndex + 1);
    }
  }

  splitRecursively(text, 0, 0);
  return chunks;
}

Embedding Models

Embeddings convert text into dense vector representations where semantic similarity maps to vector proximity. The quality of your embeddings directly determines the quality of your retrieval.

  • text-embedding-3-small (OpenAI): 1536 dimensions, low cost. Good quality-to-cost ratio; sufficient for most use cases.
  • text-embedding-3-large (OpenAI): 3072 dimensions, medium cost. Higher quality on hard retrieval tasks; 2x the dimensions.
  • embed-v3 (Cohere): 1024 dimensions, medium cost. Strong multilingual support; built-in search/classification modes.
  • bge-large-en-v1.5 (BAAI, open-source): 1024 dimensions, free to self-host. Top-tier open-source quality; requires your own GPU infrastructure.
  • e5-large-v2 (Microsoft, open-source): 1024 dimensions, free to self-host. Competitive quality; instruction-tuned variant available.
  • voyage-3 (Voyage AI): 1024 dimensions, medium cost. Strong on code and technical content; good for developer docs.

For most teams, OpenAI's text-embedding-3-small is the pragmatic default. It is cheap, fast, and good enough for the majority of retrieval tasks. Move to text-embedding-3-large or embed-v3 if your evaluation metrics show retrieval quality is the bottleneck. Use open-source models like BGE or E5 when you need to keep data on-premises or want to eliminate per-call embedding costs at high volume.

One critical rule: always use the same embedding model for indexing and querying. Vectors from different models live in incompatible spaces. If you switch models, you must re-embed your entire corpus.

import OpenAI from "openai";

const openai = new OpenAI();

async function embedTexts(texts: string[]): Promise<number[][]> {
  const response = await openai.embeddings.create({
    model: "text-embedding-3-small",
    input: texts,
  });

  return response.data.map((item) => item.embedding);
}

async function embedQuery(query: string): Promise<number[]> {
  const [embedding] = await embedTexts([query]);
  return embedding;
}

Vector Databases

Once chunks are embedded, you need a place to store the vectors and search them efficiently. The vector database market has matured rapidly, and the right choice depends on your scale, hosting preferences, and feature requirements.

  • Pinecone: fully managed, priced per pod or serverless, hybrid search (sparse-dense). Easiest to start; no infrastructure to manage.
  • Weaviate: self-hosted or cloud, open-source with usage-based cloud pricing, hybrid search (BM25 + vector). Flexible schema; strong hybrid search out of the box.
  • Qdrant: self-hosted or cloud, open-source with usage-based cloud pricing, hybrid search (sparse vectors). High performance; excellent filtering capabilities.
  • Chroma: self-hosted (embedded), open-source, vector search only. Simplest local setup; great for prototyping and small datasets.
  • FAISS: in-memory library, free (Meta open-source), vector search only. Fastest raw search speed; no server, just a library.
  • pgvector: self-hosted PostgreSQL extension, free, hybrid search possible with full SQL. Use your existing Postgres; no new infrastructure.

When to use each:

  • Pinecone when you want managed infrastructure and do not want to think about scaling. Good default for teams that want to focus on the application, not the database.
  • Weaviate when you need hybrid search (combining keyword and semantic) and want open-source flexibility. Strong choice for production systems with complex filtering needs.
  • Qdrant when performance matters and you need advanced filtering on metadata alongside vector search. Good for large-scale systems.
  • Chroma when you are prototyping or building a small application. Runs in-process, no server needed.
  • FAISS when you need raw speed and your dataset fits in memory. Common in research and batch processing pipelines.
  • pgvector when you already run PostgreSQL and want to avoid adding another database to your stack. Good enough for datasets under a few million vectors.
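Whichever database you choose, the core operation is the same: store vectors, score them against a query, return the top k. A toy in-memory version makes the mechanics concrete (a brute-force cosine scan; fine for prototypes, not a replacement for a real index):

```typescript
interface StoredVector {
  id: string;
  vector: number[];
  text: string;
}

class InMemoryVectorStore {
  private items: StoredVector[] = [];

  add(item: StoredVector): void {
    this.items.push(item);
  }

  // Brute-force top-k search by cosine similarity. Real vector DBs use
  // approximate indexes (HNSW, IVF) to avoid scanning every vector.
  search(query: number[], k: number): Array<StoredVector & { score: number }> {
    const scored = this.items.map((item) => ({
      ...item,
      score: cosineSim(query, item.vector),
    }));
    scored.sort((a, b) => b.score - a.score);
    return scored.slice(0, k);
  }
}

function cosineSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}
```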

Retrieval Patterns

Getting the right chunks out of the database is where RAG systems differentiate themselves. Naive semantic search — embed the query, find the nearest vectors — works for simple cases but breaks down quickly on real-world queries.

Semantic Search

The baseline. Embed the user's query, compute cosine similarity against all stored chunk embeddings, return the top-k most similar chunks. Fast and effective when the user's query uses similar language to the source documents. Fails when the user asks a question in different terms than the document uses to describe the answer.

Hybrid Search (BM25 + Vector)

Combine traditional keyword search (BM25) with semantic vector search. BM25 catches exact term matches that semantic search might miss. Vector search catches semantic matches that keyword search cannot find. Score fusion — typically reciprocal rank fusion or weighted combination — merges the two result sets.

Hybrid search consistently outperforms either method alone in benchmarks. If your vector database supports it (Weaviate, Pinecone, and Qdrant all do), use it as your default retrieval strategy.
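Reciprocal rank fusion itself is only a few lines. A sketch, using k = 60, the constant proposed in the original RRF paper and a common default:

```typescript
// Merge ranked result lists: each document scores 1/(k + rank) per
// list it appears in, summed across lists. Documents ranked well by
// either method rise to the top.
function reciprocalRankFusion(
  rankings: string[][], // each inner array is doc IDs, best first
  k = 60
): Array<{ id: string; score: number }> {
  const scores = new Map<string, number>();
  for (const ranking of rankings) {
    ranking.forEach((id, rank) => {
      scores.set(id, (scores.get(id) ?? 0) + 1 / (k + rank + 1));
    });
  }
  return [...scores.entries()]
    .map(([id, score]) => ({ id, score }))
    .sort((a, b) => b.score - a.score);
}
```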

Re-Ranking

Retrieve a larger initial set (top 20-50) using fast vector search, then re-rank that set using a more expensive cross-encoder model. Cross-encoders score the query-document pair jointly rather than independently, which produces more accurate relevance judgments.

Cohere Rerank and models like bge-reranker-large are the standard options. Re-ranking adds 100-300ms of latency but significantly improves the quality of the final retrieved set. Use it when retrieval precision matters more than raw speed.

import cohere

co = cohere.Client("your-api-key")

def rerank_results(query: str, documents: list[str], top_n: int = 5) -> list[dict]:
    results = co.rerank(
        model="rerank-english-v3.0",
        query=query,
        documents=documents,
        top_n=top_n,
    )
    return [
        {"index": r.index, "score": r.relevance_score, "text": documents[r.index]}
        for r in results.results
    ]

Multi-Query Retrieval

A single user query may not fully express what they need. Multi-query retrieval generates 3-5 reformulations of the original query using an LLM, runs retrieval for each, and merges the results. This casts a wider net and reduces the chance of missing relevant documents due to query phrasing.
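A sketch of the pattern, with the LLM reformulation step and the search function injected as parameters (both are placeholders for your own implementations):

```typescript
type QueryGenerator = (query: string, n: number) => Promise<string[]>;
type SearchFn = (query: string, k: number) => Promise<string[]>; // chunk IDs

// Generate reformulations, retrieve for each (original included), and
// merge with de-duplication, round-robin so every variant contributes.
async function multiQueryRetrieve(
  query: string,
  generate: QueryGenerator,
  search: SearchFn,
  k = 10
): Promise<string[]> {
  const variants = [query, ...(await generate(query, 3))];
  const resultLists = await Promise.all(variants.map((q) => search(q, k)));

  const merged: string[] = [];
  const seen = new Set<string>();
  for (let rank = 0; rank < k; rank++) {
    for (const list of resultLists) {
      const id = list[rank];
      if (id !== undefined && !seen.has(id)) {
        seen.add(id);
        merged.push(id);
      }
    }
  }
  return merged.slice(0, k);
}
```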

Maximal Marginal Relevance (MMR)

Standard top-k retrieval often returns chunks that are all about the same sub-topic — high relevance but low diversity. MMR balances relevance with diversity by penalizing chunks that are too similar to chunks already selected. This produces a set of retrieved chunks that covers more aspects of the query, which is especially important when assembling context for complex questions.
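A sketch of the MMR selection loop. The lambda parameter trades relevance against diversity; 0.7 below is an illustrative default:

```typescript
interface Candidate {
  id: string;
  vector: number[];
  relevance: number; // similarity to the query
}

// Greedy selection: each step picks the candidate with the best
// trade-off between query relevance and dissimilarity to what has
// already been selected.
function mmrSelect(
  candidates: Candidate[],
  k: number,
  lambda = 0.7
): Candidate[] {
  const selected: Candidate[] = [];
  const remaining = [...candidates];

  while (selected.length < k && remaining.length > 0) {
    let bestIdx = 0;
    let bestScore = -Infinity;
    remaining.forEach((cand, i) => {
      const maxSimToSelected = selected.length
        ? Math.max(...selected.map((s) => cosSim(cand.vector, s.vector)))
        : 0;
      const score = lambda * cand.relevance - (1 - lambda) * maxSimToSelected;
      if (score > bestScore) {
        bestScore = score;
        bestIdx = i;
      }
    });
    selected.push(remaining.splice(bestIdx, 1)[0]);
  }
  return selected;
}

function cosSim(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i]; na += a[i] * a[i]; nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb) || 1);
}
```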


Context Assembly

Retrieval gives you chunks. Context assembly turns those chunks into a prompt the model can use effectively. This step is more important than most teams realize — how you present retrieved information to the model directly affects response quality.

Token Budget Management

RAG can reduce prompt sizes dramatically compared to stuffing entire documents into the context. But you still need to manage your budget. If you retrieve ten 512-token chunks, that is 5,120 tokens of context before the system prompt and user query. With prompt caching, repeated queries against the same retrieved context can save on the order of 75-90% on input costs for subsequent calls, depending on the provider.

Set a hard budget for retrieved context — typically 30-50% of the model's context window. If retrieval returns more content than fits, use relevance scores to cut the lowest-scoring chunks.

Ordering Chunks by Relevance

Place the most relevant chunks first. Models attend more strongly to information at the beginning and end of the context (the "lost in the middle" effect documented by Liu et al.). If you have five retrieved chunks, put the most relevant first, the second-most relevant last, and the rest in between.
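One way to implement that ordering: walk the relevance-sorted list and alternate placements between the front and the back of the final sequence, so the weakest chunks land in the middle:

```typescript
// Reorder relevance-sorted chunks so the strongest land at the edges
// of the context: 1st -> front, 2nd -> back, 3rd -> front, and so on.
// The weakest chunks end up in the middle, where attention is lowest.
function orderForContext<T>(sortedByRelevance: T[]): T[] {
  const front: T[] = [];
  const back: T[] = [];
  sortedByRelevance.forEach((chunk, i) => {
    if (i % 2 === 0) front.push(chunk);
    else back.unshift(chunk);
  });
  return [...front, ...back];
}
```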

Metadata Injection

Attach source metadata to each chunk in the prompt. This serves two purposes: it helps the model attribute its answers to specific sources, and it gives users a citation trail for verification.

interface ChunkMetadata {
  source: string;
  section?: string;
}

function assembleContext(
  retrievedChunks: Array<{ text: string; score: number; metadata: ChunkMetadata }>,
  tokenBudget: number
): string {
  // Sort by relevance score descending
  const sorted = [...retrievedChunks].sort((a, b) => b.score - a.score);

  let assembled = "";
  let tokenCount = 0;

  for (const chunk of sorted) {
    const label = chunk.metadata.section
      ? `[Source: ${chunk.metadata.source}, Section: ${chunk.metadata.section}]`
      : `[Source: ${chunk.metadata.source}]`;
    const chunkText = `${label}\n${chunk.text}\n\n`;
    const chunkTokens = Math.ceil(chunkText.length / 4);

    if (tokenCount + chunkTokens > tokenBudget) break;

    assembled += chunkText;
    tokenCount += chunkTokens;
  }

  return assembled;
}

function buildRAGPrompt(
  query: string,
  context: string,
  systemInstruction: string
): { system: string; userMessage: string } {
  return {
    system: `${systemInstruction}

Use ONLY the provided context to answer questions. If the context does not contain enough information to answer, say so explicitly. Do not fabricate information.

When citing information, reference the source document.`,
    userMessage: `Context:
---
${context}
---

Question: ${query}`,
  };
}

Advanced RAG Patterns

Once the basic pipeline works, these patterns push quality and capability further.

Agentic RAG

Instead of a fixed retrieval pipeline, give an agent tools to search the vector database, query structured databases, and fetch live data. The agent decides what to retrieve, evaluates whether the results are sufficient, and retrieves more if needed. This is the pattern described in our architecture patterns guide applied to retrieval.

Agentic RAG handles multi-step questions that require synthesizing information from multiple sources — something a single retrieval pass cannot do.

Hierarchical Retrieval

Create two levels of index: a summary index that maps document-level summaries to their source documents, and a chunk index that contains the fine-grained chunks. First retrieve relevant documents using the summary index, then retrieve specific chunks only from those documents. This dramatically improves precision for large corpora where naive chunk-level search returns too much noise.
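A sketch of the two-stage flow, with both index searches injected as parameters (the filter shape is an assumption; adapt it to your database's metadata filtering API):

```typescript
type Search = (
  query: string,
  k: number,
  filter?: { docId?: string }
) => Promise<Array<{ id: string; docId: string }>>;

// Two-stage retrieval: find candidate documents via the summary index,
// then search the chunk index restricted to those documents.
async function hierarchicalRetrieve(
  query: string,
  searchSummaries: Search,
  searchChunks: Search,
  docK = 3,
  chunkK = 5
): Promise<Array<{ id: string; docId: string }>> {
  const docs = await searchSummaries(query, docK);
  const perDoc = await Promise.all(
    docs.map((d) => searchChunks(query, chunkK, { docId: d.docId }))
  );
  return perDoc.flat();
}
```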

Query Decomposition

Break complex queries into sub-queries, retrieve for each independently, then synthesize. "Compare our pricing model to Competitor X and explain the tradeoffs for mid-market teams" becomes three sub-queries: retrieve our pricing docs, retrieve Competitor X pricing information, retrieve mid-market segment analysis. Each sub-query gets better retrieval than the compound original.
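A sketch with the decomposition step (an LLM call in practice) and the retriever injected as parameters:

```typescript
type Decompose = (query: string) => Promise<string[]>;
type Retrieve = (query: string) => Promise<string[]>; // chunk texts

// Decompose a compound query, retrieve per sub-query, and return
// labeled context groups for the synthesis prompt.
async function decomposeAndRetrieve(
  query: string,
  decompose: Decompose,
  retrieve: Retrieve
): Promise<Array<{ subQuery: string; chunks: string[] }>> {
  const subQueries = await decompose(query);
  return Promise.all(
    subQueries.map(async (sq) => ({ subQuery: sq, chunks: await retrieve(sq) }))
  );
}
```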

Self-RAG

The model itself decides whether it needs retrieval. For questions the model can answer confidently from its training data ("What is the capital of France?"), it skips retrieval entirely. For questions that require private knowledge or recent information, it triggers the retrieval pipeline. This reduces latency and cost for queries that do not need external context.

The implementation pattern is a routing step: have a fast, cheap model classify whether the query requires retrieval before entering the pipeline. As discussed in the token optimization guide, using model routing to avoid unnecessary work is one of the highest-leverage cost optimizations available.
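A sketch of the routing step, with the classifier, retriever, and answer generation all injected so the routing logic stays testable (in practice the classifier is a fast, cheap model call):

```typescript
type Classifier = (query: string) => Promise<"retrieve" | "direct">;
type Retriever = (query: string) => Promise<string>;
type Answer = (query: string, context?: string) => Promise<string>;

// Route: a cheap classification step decides whether to pay for retrieval.
async function routedAnswer(
  query: string,
  classify: Classifier,
  retrieve: Retriever,
  answer: Answer
): Promise<string> {
  const route = await classify(query);
  if (route === "direct") {
    return answer(query); // the model's own knowledge suffices
  }
  const context = await retrieve(query); // private or recent knowledge needed
  return answer(query, context);
}
```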


RAG Evaluation

You cannot improve what you do not measure. RAG evaluation requires metrics at two levels: retrieval quality (did you find the right chunks?) and generation quality (did the model use them correctly?).

Key Metrics

Retrieval metrics:

  • Recall@k — Of all relevant chunks in your corpus, what fraction appears in the top-k retrieved results? Low recall means your retrieval is missing relevant information.
  • Precision@k — Of the top-k retrieved chunks, what fraction is actually relevant? Low precision means you are wasting context budget on irrelevant chunks.
  • MRR (Mean Reciprocal Rank) — How high does the first relevant result appear in the ranking? A low MRR means the model has to wade through noise before finding useful context.
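Once you have labeled the relevant chunks for each test query, these retrieval metrics are a few lines each. A sketch:

```typescript
// Retrieval metrics over one query: `retrieved` is the ranked result
// list, `relevant` is the set of ground-truth relevant chunk IDs.
function recallAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const hits = retrieved.slice(0, k).filter((id) => relevant.has(id)).length;
  return relevant.size === 0 ? 0 : hits / relevant.size;
}

function precisionAtK(retrieved: string[], relevant: Set<string>, k: number): number {
  const topK = retrieved.slice(0, k);
  const hits = topK.filter((id) => relevant.has(id)).length;
  return topK.length === 0 ? 0 : hits / topK.length;
}

// Reciprocal rank of the first relevant result; average this across
// all test queries to get MRR.
function reciprocalRank(retrieved: string[], relevant: Set<string>): number {
  const idx = retrieved.findIndex((id) => relevant.has(id));
  return idx === -1 ? 0 : 1 / (idx + 1);
}
```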

Generation metrics:

  • Faithfulness — Does the response only contain information present in the retrieved context? A faithfulness failure is a hallucination — the model invented something not in the sources.
  • Answer relevance — Does the response actually address the user's question? High faithfulness but low relevance means the model grounded itself in retrieved context but answered the wrong question.
  • Answer correctness — Compared to a ground-truth answer, how accurate is the response? This requires labeled evaluation data.

The RAGAS Framework

RAGAS (Retrieval Augmented Generation Assessment) is an open-source framework that automates RAG evaluation. It computes faithfulness, answer relevance, context precision, and context recall using LLM-as-judge techniques. No labeled data is required for most metrics, though context recall and answer correctness still need ground-truth answers.

from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision, context_recall
from datasets import Dataset

# Prepare evaluation data
eval_data = {
    "question": ["What is our refund policy?", "How do I reset my password?"],
    "answer": [generated_answer_1, generated_answer_2],
    "contexts": [retrieved_contexts_1, retrieved_contexts_2],
    "ground_truth": [expected_answer_1, expected_answer_2],
}

dataset = Dataset.from_dict(eval_data)

results = evaluate(
    dataset,
    metrics=[faithfulness, answer_relevancy, context_precision, context_recall],
)

print(results)
# {'faithfulness': 0.92, 'answer_relevancy': 0.88, 'context_precision': 0.85, ...}

Run evaluation on a representative sample of queries whenever you change your chunking strategy, embedding model, retrieval parameters, or prompt template. Small changes to the pipeline can produce large shifts in quality that are invisible without measurement.


Common RAG Failures and Fixes

These are the failure modes we see most often. Each one looks different on the surface but has a specific root cause and a specific fix.

  • Retrieved wrong chunks. Symptom: response is confident but factually wrong. Root cause: poor chunking (relevant info split across chunks) or poor embeddings. Fix: increase chunk overlap; try semantic chunking; evaluate embedding model quality.
  • Context window overflow. Symptom: model truncates or ignores retrieved context. Root cause: too many chunks retrieved, or chunks too large. Fix: set a strict token budget; reduce top-k; use re-ranking to keep only the best.
  • Hallucinated despite context. Symptom: response includes facts not in any retrieved chunk. Root cause: model falls back to parametric knowledge when context is ambiguous. Fix: add an explicit instruction ("only use the provided context"); lower the temperature; add faithfulness evaluation.
  • Missed relevant documents. Symptom: response says "I don't have enough information" when the data exists. Root cause: query phrasing does not match document phrasing; low recall. Fix: use hybrid search; add multi-query retrieval; improve chunking to keep related concepts together.
  • Stale or contradictory context. Symptom: response uses outdated information. Root cause: document updates not re-indexed; old and new versions both in the index. Fix: implement an update pipeline that deletes old chunks when documents change; add timestamps to metadata.
  • Low diversity in results. Symptom: all retrieved chunks cover the same narrow sub-topic. Root cause: top-k retrieval without diversity; chunks from the same section dominate. Fix: apply MMR; use metadata filtering to ensure results come from multiple source documents.

The single most impactful fix for most teams is improving chunking. If your chunks are semantically coherent and appropriately sized, retrieval quality improves across the board without changing anything else in the pipeline.


Putting It Together: A Production Checklist

Building a RAG system that works in a demo takes a day. Building one that works in production takes deliberate attention to each stage of the pipeline.

Ingestion:

  • Chunk documents using recursive splitting with 512-token chunks and 50-token overlap.
  • Preserve document metadata (source, section, date, author) as chunk-level metadata.
  • Build an update pipeline that re-indexes changed documents without duplicating content.

Retrieval:

  • Use hybrid search (BM25 + vector) as your default retrieval strategy.
  • Retrieve top 10-20 candidates, then re-rank to top 3-5 for context assembly.
  • Implement MMR or metadata filtering to ensure diversity in results.

Generation:

  • Assemble context with source attributions and a strict token budget.
  • Instruct the model explicitly to only use provided context.
  • Return citations alongside the response so users can verify.

Evaluation:

  • Build a test set of 50-100 representative queries with ground-truth answers.
  • Run RAGAS or equivalent evaluation after every pipeline change.
  • Monitor faithfulness and retrieval recall as your primary health metrics.

Operations:

  • Log every query, retrieved chunks, and generated response for debugging.
  • Alert on retrieval recall drops or faithfulness score degradation.
  • Re-embed your entire corpus when switching embedding models.

RAG is not a model feature — it is a system. The model is one component. The chunking, embedding, indexing, retrieval, and context assembly layers are where the engineering happens, and where the quality is determined. Build each layer deliberately, measure each layer independently, and the system will compound in quality over time. As covered in our memory and context guide, the same principle applies here as it does everywhere in agent systems: the systems that improve are the ones with explicit feedback loops built in from the start.