
    Building RAG Pipelines That Actually Work in Production

    Agentic AI · Cloudmess Team · 9 min read · February 12, 2026

    Why Most RAG Demos Do Not Survive Production

    Building a RAG demo that answers questions from a PDF takes an afternoon with LangChain, a vector store, and 50 lines of Python. Building a RAG system that reliably answers questions from 50,000 documents in production takes weeks of engineering. The gap is in retrieval quality, chunking strategy, and evaluation rigor. Most teams get excited by the demo, ship it, and then spend months debugging why the chatbot gives wrong or irrelevant answers to edge cases. The model is almost never the problem. What you feed it is.

    In our experience across 15+ production RAG deployments, retrieval quality accounts for 80% of answer quality issues. If the right context is in the prompt, modern foundation models like Claude 3.5 Sonnet will generate an accurate answer. If the wrong context is in the prompt, even the most capable model will produce a confident, wrong response.

    Chunking Is Where Most Teams Go Wrong

    The default approach in LangChain and LlamaIndex tutorials is to split documents into fixed-size chunks (typically 512 or 1024 tokens) with 10 to 20% overlap using RecursiveCharacterTextSplitter. This works for homogeneous content like blog posts and wiki articles. It fails catastrophically for structured documents like legal contracts, technical manuals, API documentation, and financial reports, where context boundaries do not align with token counts. A 512-token chunk might split a table in half, combine two unrelated sections, separate a clause from its exceptions, or cut off a code example mid-function.

    We use a three-tier chunking strategy instead.

    Tier 1, structural chunking: parse document structure using headings, paragraphs, tables, and lists as natural boundaries. For PDFs, we use Amazon Textract with AnalyzeDocument to extract structured blocks, then group blocks by heading hierarchy.

    Tier 2, semantic chunking: for long sections that exceed our target chunk size (we aim for 256 to 512 tokens), we use a sliding-window approach that splits at sentence boundaries where the cosine similarity between consecutive sentence embeddings drops below 0.75, indicating a topic shift.

    Tier 3, metadata enrichment: each chunk is tagged with its document title, section heading, page number, and a brief summary generated by Claude 3.5 Haiku. This metadata is stored alongside the embedding and used for filtering and re-ranking.

    This approach typically improves retrieval relevance (measured by precision@5) by 30 to 50% compared to naive fixed-size chunking.
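
    To make Tier 2 concrete, here is a minimal sketch of similarity-based splitting. The function name and signature are our own illustration, not a library API; the embeddings would come from whatever model the pipeline uses (BGE-large in our stack), and the token-budget and sliding-window details are omitted for brevity.

```python
import numpy as np

def semantic_chunks(sentences, embeddings, threshold=0.75):
    """Group consecutive sentences into chunks, starting a new chunk
    whenever the cosine similarity between adjacent sentence embeddings
    drops below `threshold` (interpreted as a topic shift)."""
    chunks, current = [], [sentences[0]]
    for i in range(1, len(sentences)):
        a, b = np.asarray(embeddings[i - 1]), np.asarray(embeddings[i])
        cos = float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
        if cos < threshold:
            chunks.append(" ".join(current))  # topic shift: close the chunk
            current = [sentences[i]]
        else:
            current.append(sentences[i])
    chunks.append(" ".join(current))
    return chunks
```

    In production this runs per section (after Tier 1), with a cap so no chunk exceeds the 512-token target even when similarity stays high.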

    Hybrid Search Beats Pure Vector Search

    Pure vector similarity search works well for semantic queries like 'how do I handle authentication in the API?' but poorly for keyword-specific queries like 'what is the value of parameter max_retries?' or 'what changed in version 2.3.1?' This is because embedding models compress text into dense vectors that capture semantic meaning but lose specific keywords and identifiers. Production RAG systems need hybrid search: vector similarity for semantic matching plus BM25 or keyword search for exact matches.

    On AWS, OpenSearch 2.11+ supports both k-NN vector search and BM25 text search in a single index using the neural_search plugin. We create a single index with both a vector field (knn_vector with dimension 1024 for BGE-large embeddings) and a text field analyzed with the standard tokenizer. At query time, we run both searches in parallel and combine results using Reciprocal Rank Fusion (RRF) with the formula score = sum(1 / (k + rank_i)) where k = 60. We then re-rank the top 20 combined results using a cross-encoder model (we use ms-marco-MiniLM-L-12-v2 hosted on a SageMaker serverless endpoint) or Bedrock's Cohere Rerank API to select the final 3 to 5 chunks for the LLM context.

    This hybrid approach catches the queries that pure vector search misses. In A/B tests, hybrid search with re-ranking improves answer accuracy by 15 to 25% compared to vector-only retrieval.
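
    The RRF combination step is small enough to show in full. This is a sketch under our own naming (`rrf_fuse` is not an OpenSearch API); it assumes the two searches each return a ranked list of document IDs, best first.

```python
def rrf_fuse(result_lists, k=60):
    """Reciprocal Rank Fusion: score(doc) = sum over result lists of
    1 / (k + rank), with rank starting at 1. Documents appearing in
    multiple lists accumulate score and rise to the top."""
    scores = {}
    for results in result_lists:
        for rank, doc_id in enumerate(results, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first
    return sorted(scores, key=scores.get, reverse=True)
```

    The constant k = 60 damps the advantage of a single first-place ranking, so a document ranked moderately well by both searches can beat one ranked first by only one of them.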

    Evaluation Is Not Optional

    You cannot improve what you do not measure. We build evaluation pipelines from day one with three components.

    Component 1, golden dataset: a set of 100 to 200 question-answer pairs verified by domain experts. Each pair includes the question, the expected answer, and the specific source document and section the answer should come from. We generate initial candidates using Claude to create questions from document chunks, then have domain experts verify and edit. This dataset is versioned in Git alongside the pipeline code.

    Component 2, automated retrieval metrics: we compute precision@5, recall@5, and Mean Reciprocal Rank (MRR) on every pipeline change. These run as a GitHub Actions workflow triggered on any PR that modifies chunking, embedding, or retrieval logic. If precision@5 drops by more than 5% relative to the baseline, the PR is automatically flagged for review.

    Component 3, LLM-as-judge scoring: we use Claude 3.5 Sonnet as an evaluator with a structured rubric that scores each answer on three dimensions (relevance: 1 to 5, faithfulness to retrieved context: 1 to 5, completeness: 1 to 5). We also run RAGAS (Retrieval Augmented Generation Assessment) metrics, specifically context_precision, context_recall, and answer_relevancy, using the ragas Python library.

    Every change to the chunking strategy, embedding model, or retrieval logic is evaluated against this baseline before deployment. Without this, you are guessing whether changes help or hurt.
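
    The Component 2 metrics are simple to compute once you have, for each golden question, the ranked list of retrieved chunk IDs and the set of expected source IDs. A minimal sketch (function names are ours, following the standard definitions):

```python
def precision_at_k(retrieved, relevant, k=5):
    """Fraction of the top-k retrieved IDs that are in the relevant set."""
    return sum(1 for doc in retrieved[:k] if doc in relevant) / k

def mean_reciprocal_rank(queries):
    """queries: list of (retrieved_ids, relevant_ids) pairs.
    For each query, score 1/rank of the first relevant hit (0 if none),
    then average across queries."""
    total = 0.0
    for retrieved, relevant in queries:
        for rank, doc in enumerate(retrieved, start=1):
            if doc in relevant:
                total += 1.0 / rank
                break
    return total / len(queries)
```

    In the CI workflow these run over the whole golden dataset, and the aggregate numbers are compared against the committed baseline to decide whether a PR gets flagged.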

    The Production Architecture

    Our production RAG stack on AWS follows this architecture.

    Ingestion pipeline: documents land in an S3 bucket (source-documents/), triggering an EventBridge rule that starts a Step Functions state machine. The state machine runs four stages: parsing (Textract for PDFs, custom Lambda parsers for HTML and Markdown), chunking (the three-tier strategy described above, implemented as an ECS Fargate task), embedding (BGE-large-en-v1.5 on a SageMaker serverless endpoint, processing chunks in batches of 64), and indexing (bulk insert into OpenSearch Serverless with both vector and keyword indices). The entire pipeline processes 1,000 documents in approximately 45 minutes at a cost of $2 to $5 depending on document complexity.

    Query pipeline: the user's question hits an API Gateway endpoint backed by a Lambda function. The Lambda embeds the query, runs hybrid search on OpenSearch, re-ranks results via a SageMaker cross-encoder endpoint, constructs the prompt with the top 5 chunks as context, and calls Bedrock (Claude 3.5 Sonnet) with a carefully engineered system prompt that instructs the model to cite specific source documents, say 'I do not know' when the context does not contain the answer, and format responses consistently. The end-to-end latency is typically 2 to 4 seconds, with retrieval taking 200 to 400ms and LLM generation taking 1.5 to 3 seconds.

    The entire pipeline is instrumented with Langfuse for tracing, so we can debug any bad answer by inspecting the exact query embedding, retrieved chunks with similarity scores, re-ranked order, final prompt, and model response.
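
    The prompt-construction step inside the query Lambda can be sketched as below. The chunk field names (doc_title, section, page, text) are illustrative stand-ins for the metadata attached during Tier 3 enrichment, and the system prompt is a condensed version of the behavior described above, not our production prompt verbatim.

```python
def build_rag_prompt(question, chunks):
    """Assemble the system prompt and user message for the generation
    call, labeling each chunk with its source so the model can cite it."""
    context = "\n\n".join(
        f"[Source: {c['doc_title']}, section '{c['section']}', p. {c['page']}]\n"
        f"{c['text']}"
        for c in chunks
    )
    system = (
        "Answer using only the provided context. Cite the source document "
        "and section for every claim. If the context does not contain the "
        "answer, reply exactly: I do not know."
    )
    user = f"Context:\n{context}\n\nQuestion: {question}"
    return system, user
```

    The returned pair maps directly onto a Bedrock Converse or Messages API call, with `system` as the system prompt and `user` as the single user turn.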