
    Building RAG Pipelines That Actually Work in Production

    Agentic AI · Cloudmess Team · 9 min read · February 12, 2026

    Why Most RAG Demos Do Not Survive Production

    Building a RAG demo that answers questions from a PDF takes an afternoon with LangChain, a vector store, and 50 lines of Python. Building a RAG system that reliably answers questions from 50,000 documents in production takes weeks of engineering. The gap is in retrieval quality, chunking strategy, and evaluation rigor. Most teams get excited by the demo, ship it, and then spend months debugging why the chatbot gives wrong or irrelevant answers to edge cases. The model is almost never the problem. What you feed it is.

    In our experience across 15+ production RAG deployments, retrieval quality accounts for 80% of answer quality issues. If the right context is in the prompt, modern foundation models like Claude 3.5 Sonnet will generate an accurate answer. If the wrong context is in the prompt, even the most capable model will produce a confident, wrong response.

    Chunking Is Where Most Teams Go Wrong

    The default approach in LangChain and LlamaIndex tutorials is to split documents into fixed-size chunks (typically 512 or 1024 tokens) with 10 to 20% overlap using RecursiveCharacterTextSplitter. This works for homogeneous content like blog posts and wiki articles. It fails catastrophically for structured documents like legal contracts, technical manuals, API documentation, and financial reports, where context boundaries do not align with token counts. A 512-token chunk might split a table in half, combine two unrelated sections, separate a clause from its exceptions, or cut off a code example mid-function.

    We use a three-tier chunking strategy instead.

    Tier 1, structural chunking: parse document structure using headings, paragraphs, tables, and lists as natural boundaries. For PDFs, we use Amazon Textract with AnalyzeDocument to extract structured blocks, then group blocks by heading hierarchy.

    Tier 2, semantic chunking: for long sections that exceed our target chunk size (we aim for 256 to 512 tokens), we use a sliding-window approach that splits at sentence boundaries where the cosine similarity between consecutive sentence embeddings drops below 0.75, indicating a topic shift.

    Tier 3, metadata enrichment: each chunk is tagged with its document title, section heading, page number, and a brief summary generated by Claude 3.5 Haiku. This metadata is stored alongside the embedding and used for filtering and re-ranking.

    This approach typically improves retrieval relevance (measured by precision@5) by 30 to 50% compared to naive fixed-size chunking.
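
    To make Tier 2 concrete, here is a minimal sketch of similarity-based splitting. The function name and signature are our own illustration, not a library API; the embeddings would come from whatever model the pipeline uses (BGE-large in our stack), and the token-budget and sliding-window details are omitted for brevity.

```python
import numpy as np

def semantic_chunks(sentences, embeddings, threshold=0.75):
    """Group consecutive sentences into chunks, starting a new chunk
    whenever the cosine similarity between adjacent sentence embeddings
    drops below `threshold` (interpreted as a topic shift)."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = np.asarray(embeddings[i - 1]), np.asarray(embeddings[i])
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if cos < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

    In production this runs per section (after Tier 1), with a cap so no chunk exceeds the 512-token target even when similarity stays high.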

    Hybrid Search Beats Pure Vector Search

    Pure vector similarity search works well for semantic queries like 'how do I handle authentication in the API?' but poorly for keyword-specific queries like 'what is the value of parameter max_retries?' or 'what changed in version 2.3.1?' This is because embedding models compress text into dense vectors that capture semantic meaning but lose specific keywords and identifiers. Production RAG systems need hybrid search: vector similarity for semantic matching plus BM25 or keyword search for exact matches.

    On AWS, OpenSearch 2.11+ supports both k-NN vector search and BM25 text search in a single index using the neural_search plugin. We create a single index with both a vector field (knn_vector with dimension 1024 for BGE-large embeddings) and a text field analyzed with the standard tokenizer. At query time, we run both searches in parallel and combine results using Reciprocal Rank Fusion (RRF) with the formula score = sum(1 / (k + rank_i)) where k = 60. We then re-rank the top 20 combined results using a cross-encoder model (we use ms-marco-MiniLM-L-12-v2 hosted on a SageMaker serverless endpoint) or Bedrock's Cohere Rerank API to select the final 3 to 5 chunks for the LLM context.

    This hybrid approach catches the queries that pure vector search misses. In A/B tests, hybrid search with re-ranking improves answer accuracy by 15 to 25% compared to vector-only retrieval.
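
    The RRF combination step is small enough to show in full. This is a sketch under our own naming (`rrf_fuse` is not an OpenSearch API); it assumes the two searches each return a ranked list of document IDs, best first.

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over result lists of
    1 / (k + rank), with rank starting at 1. Documents appearing in
    multiple lists accumulate score and rise to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

    The constant k = 60 damps the advantage of a single first-place ranking, so a document ranked moderately well by both searches can beat one ranked first by only one of them.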

    Evaluation Is Not Optional

    You cannot improve what you do not measure. We build evaluation pipelines from day one with three components.

    Component 1, golden dataset: a set of 100 to 200 question-answer pairs verified by domain experts. Each pair includes the question, the expected answer, and the specific source document and section the answer should come from. We generate initial candidates using Claude to create questions from document chunks, then have domain experts verify and edit. This dataset is versioned in Git alongside the pipeline code.

    Component 2, automated retrieval metrics: we compute precision@5, recall@5, and Mean Reciprocal Rank (MRR) on every pipeline change. These run as a GitHub Actions workflow triggered on any PR that modifies chunking, embedding, or retrieval logic. If precision@5 drops by more than 5% relative to the baseline, the PR is automatically flagged for review.

    Component 3, LLM-as-judge scoring: we use Claude 3.5 Sonnet as an evaluator with a structured rubric that scores each answer on three dimensions (relevance: 1 to 5, faithfulness to retrieved context: 1 to 5, completeness: 1 to 5). We also run RAGAS (Retrieval Augmented Generation Assessment) metrics, specifically context_precision, context_recall, and answer_relevancy, using the ragas Python library.

    Every change to the chunking strategy, embedding model, or retrieval logic is evaluated against this baseline before deployment. Without this, you are guessing whether changes help or hurt.
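
    The Component 2 metrics are simple to compute once you have, for each golden question, the ranked list of retrieved chunk IDs and the set of expected source IDs. A minimal sketch (function names are ours, following the standard definitions):

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved IDs that are in the relevant set."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mean_reciprocal_rank(queries):
    """queries: list of (retrieved_ids, relevant_ids) pairs.
    For each query, score 1/rank of the first relevant hit (0 if none),
    then average across queries."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

    In the CI workflow these run over the whole golden dataset, and the aggregate numbers are compared against the committed baseline to decide whether a PR gets flagged.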

    The Production Architecture

    Our production RAG stack on AWS follows this architecture.

    Ingestion pipeline: documents land in an S3 bucket (source-documents/), triggering an EventBridge rule that starts a Step Functions state machine. The state machine runs four stages: parsing (Textract for PDFs, custom Lambda parsers for HTML and Markdown), chunking (the three-tier strategy described above, implemented as an ECS Fargate task), embedding (BGE-large-en-v1.5 on a SageMaker serverless endpoint, processing chunks in batches of 64), and indexing (bulk insert into OpenSearch Serverless with both vector and keyword indices). The entire pipeline processes 1,000 documents in approximately 45 minutes at a cost of $2 to $5 depending on document complexity.

    Query pipeline: the user's question hits an API Gateway endpoint backed by a Lambda function. The Lambda embeds the query, runs hybrid search on OpenSearch, re-ranks results via a SageMaker cross-encoder endpoint, constructs the prompt with the top 5 chunks as context, and calls Bedrock (Claude 3.5 Sonnet) with a carefully engineered system prompt that instructs the model to cite specific source documents, say 'I do not know' when the context does not contain the answer, and format responses consistently. The end-to-end latency is typically 2 to 4 seconds, with retrieval taking 200 to 400ms and LLM generation taking 1.5 to 3 seconds.

    The entire pipeline is instrumented with Langfuse for tracing, so we can debug any bad answer by inspecting the exact query embedding, retrieved chunks with similarity scores, re-ranked order, final prompt, and model response.
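
    The prompt-construction step inside the query Lambda can be sketched as below. The chunk field names (doc_title, section, page, text) are illustrative stand-ins for the metadata attached during Tier 3 enrichment, and the system prompt is a condensed version of the behavior described above, not our production prompt verbatim.

```python
def build_rag_prompt(question, chunks):
    """Assemble the system prompt and user message for the generation
    call, labeling each chunk with its source so the model can cite it."""
    context = "\n\n".join(
        f"[Source: {c['doc_title']}, section '{c['section']}', p. {c['page']}]\n"
        f"{c['text']}"
        for c in chunks
    )
    system = (
        "Answer using only the provided context. Cite the source document "
        "and section for every claim. If the context does not contain the "
        "answer, reply exactly: I do not know."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return system, user
```

    The returned pair maps directly onto a Bedrock Converse or Messages API call, with `system` as the system prompt and `user` as the single user turn.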