RAG · LLM · Architecture · Enterprise

Building Production RAG Pipelines: Lessons from 10 Enterprise Deployments

October 20, 2025 · 3 min read · Ravi Iyengar, Principal AI Engineer

The Problem with Naive RAG

Most RAG tutorials show you how to build a system in 50 lines of Python: load some PDFs, split them into chunks, embed them, store them in a vector DB, retrieve on query. Done.

The problem is that this works great on toy datasets and fails spectacularly in production.

After deploying RAG systems for 10+ enterprise clients — handling millions of documents, petabytes of text, and strict accuracy requirements — we've accumulated hard-won lessons that aren't in any tutorial.

The Three Layers of RAG Quality

We think about RAG quality in three distinct layers:

  1. Indexing quality — how well you store and organize information
  2. Retrieval quality — how well you find the right information
  3. Generation quality — how well you synthesize retrieved chunks into answers

Most teams focus only on (3) and wonder why their system underperforms. The real leverage is in (1) and (2).

Chunking Strategy Matters More Than You Think

The naive approach is fixed-size chunking: split every 512 tokens, overlap by 50 tokens, done. This works poorly because:

  • It splits sentences and paragraphs mid-thought
  • It treats all text as equally important
  • It ignores document structure entirely

What we do instead:

import re

def split_into_sentences(text: str) -> list[str]:
    """Naive regex splitter; swap in spaCy or NLTK for production use."""
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]

def semantic_chunk(text: str, max_chunk_size: int = 512) -> list[str]:
    """Split text at sentence boundaries, capping chunks at max_chunk_size characters."""
    sentences = split_into_sentences(text)
    chunks: list[str] = []
    current_chunk: list[str] = []
    current_size = 0

    for sentence in sentences:
        # Start a new chunk when adding this sentence would exceed the cap
        if current_size + len(sentence) > max_chunk_size and current_chunk:
            chunks.append(" ".join(current_chunk))
            current_chunk = [sentence]
            current_size = len(sentence)
        else:
            current_chunk.append(sentence)
            current_size += len(sentence)

    if current_chunk:
        chunks.append(" ".join(current_chunk))

    return chunks

For structured documents (financial reports, legal contracts), we additionally:

  • Parse sections and headings explicitly
  • Keep tables intact as single chunks
  • Attach parent context (section title) to every child chunk
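The last point is cheap to implement and pays off at retrieval time: a chunk that says "revenue grew 12%" is only useful if the retriever knows which section it came from. A minimal sketch (the helper name `attach_parent_context` is ours, not a library API):

```python
def attach_parent_context(section_title: str, chunks: list[str]) -> list[str]:
    """Prefix each child chunk with its parent section title so the
    embedding carries the surrounding document context."""
    return [f"[{section_title}] {chunk}" for chunk in chunks]

contextualized = attach_parent_context(
    "Q3 Revenue", ["Sales grew 12% year over year.", "Margins held steady."]
)
```

In practice you would also carry the title in chunk metadata rather than only inlining it, so it can be shown to the LLM separately.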

The Reranking Layer Is Non-Negotiable

Embedding-based retrieval is fast but imprecise. The cosine similarity between a query vector and a document vector doesn't perfectly capture relevance.

Our standard pipeline now always includes a reranker:

  1. Retrieve top-50 candidates from vector search
  2. Pass them through a cross-encoder reranker (Cohere, Jina, or a fine-tuned model)
  3. Return top-5 to the LLM

This alone improved answer accuracy by 23% in our insurance document RAG system.
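The retrieve-then-rerank step itself is a few lines. Here is a sketch with the cross-encoder abstracted as a `score_fn` callable (in a real pipeline that would be, e.g., a sentence-transformers `CrossEncoder.predict` call; the stand-in below is only for illustration):

```python
from typing import Callable

def rerank(query: str, candidates: list[str],
           score_fn: Callable[[str, str], float], top_k: int = 5) -> list[str]:
    """Score each (query, candidate) pair with a cross-encoder-style
    scorer and return the top_k highest-scoring candidates."""
    ranked = sorted(candidates, key=lambda doc: score_fn(query, doc), reverse=True)
    return ranked[:top_k]

# Stand-in scorer for illustration: word overlap instead of a real model.
def word_overlap(query: str, doc: str) -> float:
    return float(len(set(query.lower().split()) & set(doc.lower().split())))
```

The key point is that the reranker sees the query and candidate together, so it can model interactions a bi-encoder's independent embeddings cannot.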

Hybrid Search Beats Pure Vector Search

Pure vector search misses exact keyword matches. Pure BM25 misses semantic similarity. The winning approach is hybrid:

results = hybrid_search(
    query=user_query,
    vector_weight=0.7,    # semantic similarity
    keyword_weight=0.3,   # BM25 exact match
    top_k=50
)

For legal and financial documents where specific terms matter, we sometimes flip the weights to 0.4 / 0.6.
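One detail the weighted sum hides: vector similarities and BM25 scores live on different scales, so they must be normalized before mixing. A minimal sketch of the fusion step, assuming each retriever returns a doc-id-to-score mapping (min-max normalization shown; reciprocal rank fusion is a common alternative):

```python
def hybrid_scores(vector_scores: dict[str, float],
                  keyword_scores: dict[str, float],
                  vector_weight: float = 0.7,
                  keyword_weight: float = 0.3) -> dict[str, float]:
    """Min-max normalize each score set, then combine as a weighted sum.
    Documents found by only one retriever get 0 for the missing signal."""
    def normalize(scores: dict[str, float]) -> dict[str, float]:
        if not scores:
            return {}
        lo, hi = min(scores.values()), max(scores.values())
        span = (hi - lo) or 1.0
        return {doc: (s - lo) / span for doc, s in scores.items()}

    v, k = normalize(vector_scores), normalize(keyword_scores)
    return {
        doc: vector_weight * v.get(doc, 0.0) + keyword_weight * k.get(doc, 0.0)
        for doc in set(v) | set(k)
    }
```

With the 0.7 / 0.3 split above, a document that ranks top in vector search still beats one that only matches on keywords; flipping the weights inverts that preference.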

Conclusion

RAG is not a solved problem. Every deployment teaches us something new. The key insight: treat your retrieval system like a search engine, not an afterthought. Invest in chunking, indexing, and reranking — the generation layer will take care of itself.