
Retrieval-Augmented Generation (RAG)


At a Glance

An LLM inference pattern that injects relevant documents retrieved from an external corpus into the model's context at query time, grounding responses in up-to-date or domain-specific information without retraining.

Type: pattern
Pricing: free
License: N/A
Adoption fit: small, medium, enterprise

What It Does

Retrieval-Augmented Generation (RAG) is an inference-time pattern for grounding LLM responses in external documents. When a user submits a query, the system first retrieves the most relevant document chunks from an indexed corpus (using vector similarity, keyword search, or hybrid approaches), injects those chunks into the LLM’s context window, and then generates a response informed by both the retrieved content and the model’s pretrained knowledge.

RAG addresses two core problems: LLMs have a knowledge cutoff, so they cannot answer questions about events or documents outside their training data; and LLMs hallucinate when they lack relevant knowledge. By supplying retrieved source material, RAG constrains the model to ground its response in actual documents, enabling domain-specific, up-to-date answers without the expense of fine-tuning or retraining.
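
The retrieve-then-generate flow can be sketched in a few lines. Everything below is an illustrative stand-in: the corpus, the toy lexical scorer, and the prompt template are hypothetical, and a real system would use an embedding index and an actual LLM call in place of the final step.

```python
# Minimal RAG flow: retrieve relevant chunks, then build a grounded prompt.
CORPUS = [
    {"id": "doc1", "text": "Refunds are processed within 14 days of a return."},
    {"id": "doc2", "text": "Shipping to EU countries takes 3-5 business days."},
    {"id": "doc3", "text": "Warranty claims require the original receipt."},
]

def retrieve(query: str, k: int = 2) -> list[dict]:
    """Toy lexical retriever: rank chunks by query-term overlap."""
    terms = set(query.lower().split())
    scored = [(len(terms & set(c["text"].lower().split())), c) for c in CORPUS]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [c for score, c in scored[:k] if score > 0]

def build_prompt(query: str, chunks: list[dict]) -> str:
    """Inject retrieved chunks into the model's context ahead of the question."""
    context = "\n".join(f"[{c['id']}] {c['text']}" for c in chunks)
    return (
        "Answer using only the context below. Cite chunk ids.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"
    )

chunks = retrieve("How long do refunds take?")
prompt = build_prompt("How long do refunds take?", chunks)
# `prompt` would now be sent to the LLM; generation is out of scope here.
```

The key design point is that grounding happens purely at inference time: swapping the corpus or re-indexing changes the system's knowledge with no change to the model.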

Key Features

  • Vector indexing: Source documents are chunked and embedded into a vector store; retrieval finds semantically similar chunks at query time
  • Hybrid search: Production systems combine BM25 keyword matching with vector similarity for higher recall
  • Re-ranking: A second model (cross-encoder) reorders retrieved chunks by relevance before injection into context
  • GraphRAG: Microsoft’s variant pre-clusters documents into community summaries, enabling higher-level synthesis across large corpora
  • Agentic RAG: Retrieval is orchestrated by an agent that iteratively fetches additional documents based on intermediate reasoning steps
  • Metadata filtering: Retrieval can be constrained by document date, source, author, or other metadata
  • Context stuffing: Retrieved chunks are inserted into the LLM’s context window alongside the query prompt
  • Citation support: Retrieved chunks carry source references, enabling the LLM to cite its sources
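
The hybrid-search feature above is commonly implemented with reciprocal rank fusion (RRF), which combines retrievers using only each document's rank position in each result list. A minimal sketch, with the BM25 and vector result lists assumed as given inputs:

```python
from collections import defaultdict

def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)

# Hypothetical ranked doc ids from a BM25 pass and a vector-similarity pass.
bm25_hits = ["d3", "d1", "d7"]
vector_hits = ["d1", "d5", "d3"]
fused = rrf_fuse([bm25_hits, vector_hits])
```

Because RRF ignores raw scores, it sidesteps the problem that BM25 scores and cosine similarities live on incomparable scales; `k = 60` is a conventional damping constant.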

Use Cases

  • Enterprise document Q&A — building a search assistant over internal policies, runbooks, or support tickets without retraining an LLM
  • Customer support automation — grounding chatbot responses in product documentation and FAQs that change frequently
  • Research assistance — answering questions over a corpus of papers, reports, or code repositories
  • Legal and compliance — querying regulatory documents, contracts, or case law where source citations are required
  • Code search — retrieving relevant code snippets or API documentation to assist LLM-generated code
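
Several of these use cases, legal and compliance in particular, lean on the metadata filtering feature noted above: constraining retrieval before (or alongside) similarity search. A sketch with hypothetical chunk metadata:

```python
from datetime import date

# Hypothetical indexed chunks carrying metadata alongside the text.
CHUNKS = [
    {"text": "GDPR Art. 17: right to erasure.", "source": "gdpr.pdf",
     "jurisdiction": "EU", "effective": date(2018, 5, 25)},
    {"text": "CCPA 1798.105: deletion rights.", "source": "ccpa.pdf",
     "jurisdiction": "US-CA", "effective": date(2020, 1, 1)},
]

def filter_chunks(chunks, *, jurisdiction=None, effective_after=None):
    """Apply metadata constraints; similarity search runs on the survivors."""
    out = chunks
    if jurisdiction is not None:
        out = [c for c in out if c["jurisdiction"] == jurisdiction]
    if effective_after is not None:
        out = [c for c in out if c["effective"] >= effective_after]
    return out

eu_only = filter_chunks(CHUNKS, jurisdiction="EU")
```

Production vector databases expose the same idea as a filter expression pushed down into the index, rather than a post-hoc Python pass.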

Adoption Level Analysis

Small teams (<20 engineers): Fits well. Managed RAG services (AWS Bedrock Knowledge Bases, Azure AI Search, OpenAI Assistants API) eliminate infrastructure overhead. Open-source stacks (LlamaIndex, LangChain) have low setup cost. Works at low traffic with minimal ops.

Medium orgs (20–200 engineers): Core infrastructure. Most teams building LLM applications include some form of RAG. Self-hosted vector databases (Weaviate, Qdrant, Chroma) are manageable at this scale. Re-ranking and hybrid search add complexity but meaningfully improve quality.

Enterprise (200+ engineers): Well-established pattern with vendor support. Enterprise vector databases (Pinecone, Weaviate Cloud, Azure AI Search) offer SLAs, RBAC, and audit logs. GraphRAG and hierarchical indexing address large-corpus limitations. Compliance teams have clear controls over what documents are indexed.

Alternatives

  • LLM Wiki Pattern — the LLM pre-compiles a persistent wiki from sources rather than retrieving at query time. Prefer when the corpus is stable, queries are frequent, and synthesis quality matters more than source freshness.
  • Fine-tuning — bakes domain knowledge into model weights. Prefer when knowledge is stable, large, and well-curated, or latency constraints make context injection impractical.
  • Long-context LLMs (1M+ tokens) — stuff the entire corpus into context. Prefer when the corpus fits in context, cost is acceptable, and retrieval precision matters less than completeness.
  • GraphRAG — pre-generates community summaries over document clusters. Prefer when queries require synthesis across many documents and naive RAG fails on global questions.
  • Full retraining — yields a domain-adapted base model. Prefer when a narrow domain requires different reasoning patterns, not just different knowledge.

Notes & Caveats

  • Retrieval is the dominant failure mode: In production, the LLM’s generation is often correct given its context; the system fails when retrieval returns the wrong chunks. Retrieval failures are silent — the model still produces fluent output grounded in the wrong documents.
  • Chunking strategy matters significantly: How documents are split (size, overlap, semantic boundaries) has a large impact on retrieval quality. There is no universal best practice; it is corpus-dependent.
  • Hallucination is not eliminated: RAG reduces hallucination but does not prevent it. The model can hallucinate around or between retrieved chunks.
  • Scalability degradation: RAG systems can degrade from sub-second to multi-second latency as corpora grow from thousands to millions of documents. Vector search indices require re-indexing infrastructure at scale.
  • Context window costs: Large retrieved chunks consume expensive input tokens, particularly with paid API providers. Cost management requires careful chunk size and retrieval count tuning.
  • GraphRAG cost: Microsoft’s GraphRAG pre-processing (community detection, summary generation) is expensive for large corpora — significant upfront LLM API cost before the system is queryable.
  • Vendor fragmentation: Dozens of vector databases, embedding models, and orchestration frameworks exist, each with different performance characteristics. The ecosystem is fragmented and changing rapidly.
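
One common mitigation for the context-window cost caveat is trimming retrieved chunks to a fixed token budget. A sketch that keeps the highest-ranked chunks that fit, using word count as a rough token proxy (a real system would use the provider's tokenizer):

```python
def fit_to_budget(chunks: list[str], max_tokens: int) -> list[str]:
    """Greedily keep top-ranked chunks until the token budget is exhausted.

    `chunks` is assumed already sorted by descending relevance; token counts
    are approximated as word counts for illustration.
    """
    kept, used = [], 0
    for chunk in chunks:
        cost = len(chunk.split())
        if used + cost > max_tokens:
            break  # stop at the first chunk that would overflow the budget
        kept.append(chunk)
        used += cost
    return kept

ranked = ["short chunk one", "a somewhat longer second chunk here", "third"]
trimmed = fit_to_budget(ranked, max_tokens=9)
```

Tuning `max_tokens` against answer quality is exactly the chunk-size and retrieval-count trade-off the caveat describes: a tighter budget lowers cost and latency but raises the risk of dropping the chunk that actually contains the answer.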
