
RAG Beyond Vector Search: Hybrid Retrieval Architecture for Enterprise AI
After deploying RAG into healthcare, legal and financial-services environments, one pattern is consistent: pure vector search makes a great demo and a fragile product. Enterprise users hit a long tail of acronyms, identifiers, dates and exact-string queries that dense embeddings handle poorly. The answer is hybrid retrieval — and a clear separation between retrieval, ranking and grounding.
Why pure vector search breaks at enterprise scale
Dense embeddings excel at semantic similarity but they smear exact tokens. ‘ICD-10 R10.31’, ‘invoice INV-44219’, ‘Section 12(b)(iv)’ — the literal string is the meaning, and nearest-neighbor search will happily return a topical neighbor instead of the right record.
Vector indexes also struggle with negation, freshness and structured filters. A user asking ‘open tickets from Q3 not assigned to me’ is doing structured query work, not similarity matching.
The four layers of a production RAG stack
We architect every enterprise RAG system as four distinct, testable layers. Each layer has its own eval set and its own failure mode.
- Ingestion — chunking, metadata extraction, entity tagging and PII handling
- Retrieval — hybrid BM25 + dense, with structured filters from parsed intent
- Ranking — cross-encoder re-ranker over the top 50–100 candidates
- Grounding — prompt assembly with citations, schemas and guardrails
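One way to keep the query-time layers separately testable is to treat each as a plain function and compose them. The sketch below is illustrative only: the type aliases, `build_pipeline` and the stub functions are hypothetical names, and ingestion is left out because we assume it runs offline rather than in the query path.

```python
from typing import Callable, List

# Hypothetical layer signatures; each layer can be evaluated on its own.
Retriever = Callable[[str], List[str]]            # query -> candidate chunk IDs
Ranker = Callable[[str, List[str]], List[str]]    # query, candidates -> reordered IDs
Grounder = Callable[[str, List[str]], str]        # query, ranked IDs -> prompt text


def build_pipeline(retrieve: Retriever, rank: Ranker, ground: Grounder) -> Callable[[str], str]:
    """Compose the three query-time layers into one callable."""
    def run(query: str) -> str:
        candidates = retrieve(query)
        ranked = rank(query, candidates)
        return ground(query, ranked)
    return run


# Stubs standing in for real layers, so the wiring itself can be tested.
def fake_retrieve(query: str) -> List[str]:
    return ["c3", "c1", "c2"]


def fake_rank(query: str, candidates: List[str]) -> List[str]:
    return sorted(candidates)


def fake_ground(query: str, ranked: List[str]) -> str:
    return f"Q: {query} | context: {', '.join(ranked)}"


pipeline = build_pipeline(fake_retrieve, fake_rank, fake_ground)
prompt = pipeline("open tickets from Q3")
```

Because every layer is injected, a regression in ranking can be isolated by swapping in a fixed retriever, and vice versa.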
Hybrid retrieval: BM25 + dense, fused with RRF
Run BM25 (or SPLADE) and a dense vector retriever in parallel, take the top 50 from each, and fuse with Reciprocal Rank Fusion. RRF is parameter-light, robust, and consistently beats either retriever alone on enterprise corpora.
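A minimal sketch of Reciprocal Rank Fusion over two ranked lists of document IDs. Each document scores the sum of 1 / (k + rank) across the lists it appears in. The function name `rrf_fuse`, the document IDs and the default k = 60 (a commonly used constant) are assumptions for illustration, not values taken from a specific deployment.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60, top_n: int = 50) -> List[Tuple[str, float]]:
    """Fuse ranked lists of doc IDs: score(d) = sum of 1 / (k + rank), rank starting at 1."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by doc ID for deterministic output.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return fused[:top_n]


# Hypothetical top hits from each retriever, best first.
bm25_hits = ["doc_inv_44219", "doc_a", "doc_b"]
dense_hits = ["doc_a", "doc_c", "doc_inv_44219"]

fused = rrf_fuse([bm25_hits, dense_hits])
fused_ids = [doc_id for doc_id, _ in fused]
```

Only ranks are used, never raw scores, which is why BM25 and cosine similarities can be combined without normalizing either one.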
Apply a structured pre-filter (tenant, date range, document type) before either retriever runs, driven by an LLM intent parser. The retriever should never see documents the user is not authorized to see.
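The pre-filter can be sketched as a predicate applied before any ranking happens. Everything here is hypothetical: the `Doc` and `ParsedIntent` shapes, the field names and the sample data. The sketch also assumes that the tenant comes from the authenticated session while date range and document type come from the parser; that split is our assumption for the example, and in a real system the filter would normally be pushed into the index query rather than applied in Python.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Optional


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant: str
    doc_type: str
    updated_at: date


@dataclass(frozen=True)
class ParsedIntent:
    # Hypothetical parser output; None means "no constraint on this field".
    doc_types: Optional[FrozenSet[str]] = None
    date_from: Optional[date] = None
    date_to: Optional[date] = None


def allowed(doc: Doc, session_tenant: str, intent: ParsedIntent) -> bool:
    """True only if the doc belongs to the caller's tenant and matches the parsed filters."""
    if doc.tenant != session_tenant:
        return False
    if intent.doc_types is not None and doc.doc_type not in intent.doc_types:
        return False
    if intent.date_from is not None and doc.updated_at < intent.date_from:
        return False
    if intent.date_to is not None and doc.updated_at > intent.date_to:
        return False
    return True


def prefilter(docs: List[Doc], session_tenant: str, intent: ParsedIntent) -> List[Doc]:
    return [d for d in docs if allowed(d, session_tenant, intent)]


docs = [
    Doc("d1", "acme", "contract", date(2024, 8, 14)),
    Doc("d2", "acme", "invoice", date(2024, 8, 20)),
    Doc("d3", "globex", "contract", date(2024, 9, 1)),
    Doc("d4", "acme", "contract", date(2023, 1, 5)),
]
intent = ParsedIntent(
    doc_types=frozenset({"contract"}),
    date_from=date(2024, 7, 1),
    date_to=date(2024, 9, 30),
)
visible = prefilter(docs, "acme", intent)
```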
Cross-encoder re-ranking earns its compute
A small cross-encoder re-ranker (e.g. bge-reranker-v2-m3) over 50–100 candidates is the single highest-leverage component in most RAG stacks. It typically lifts top-3 precision by 15–30% over fused retrieval alone, at a cost of 50–200 ms of added latency.
Cache aggressively at the (query, candidate) level — a surprising share of enterprise queries repeat verbatim within hours.
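A minimal sketch of caching at the (query, candidate) level. The `CachedReranker` class is hypothetical, and the scorer is injected so the example runs without a model: a toy word-overlap function stands in for a real cross-encoder's batch scoring call. The sketch assumes chunk text never changes under a given chunk ID; a production cache would also need a content version in the key and an expiry policy.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Pair = Tuple[str, str]  # (query, chunk text)


class CachedReranker:
    """Re-rank candidates, scoring only (query, chunk_id) pairs not seen before."""

    def __init__(self, score_fn: Callable[[List[Pair]], Sequence[float]]):
        self._score_fn = score_fn                        # batch scorer, one float per pair
        self._cache: Dict[Tuple[str, str], float] = {}   # (query, chunk_id) -> score

    def rerank(self, query: str, candidates: List[Tuple[str, str]], top_k: int = 3) -> List[str]:
        # candidates are (chunk_id, text) tuples.
        missing = [(cid, text) for cid, text in candidates if (query, cid) not in self._cache]
        if missing:
            scores = self._score_fn([(query, text) for _, text in missing])
            for (cid, _), score in zip(missing, scores):
                self._cache[(query, cid)] = float(score)
        ranked = sorted(candidates, key=lambda c: self._cache[(query, c[0])], reverse=True)
        return [cid for cid, _ in ranked[:top_k]]


calls: List[int] = []  # records the batch size of every scorer invocation


def overlap_score(pairs: List[Pair]) -> List[float]:
    """Toy stand-in for a cross-encoder: counts shared lowercase words."""
    calls.append(len(pairs))
    return [float(len(set(q.lower().split()) & set(t.lower().split()))) for q, t in pairs]


reranker = CachedReranker(overlap_score)
cands = [
    ("c1", "refund policy for invoices"),
    ("c2", "invoice INV-44219 status"),
    ("c3", "holiday schedule"),
]
top = reranker.rerank("status of invoice INV-44219", cands, top_k=2)
again = reranker.rerank("status of invoice INV-44219", cands, top_k=2)  # served from cache
```

The second call never reaches the scorer, which is where the latency saving comes from when queries repeat verbatim.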
Grounding: cite, or do not answer
The model should generate answers strictly from the retrieved context, and every sentence should carry an inline citation back to a chunk ID. If the retriever returns nothing above a confidence threshold, the model says so. ‘I don’t have a source for that’ is a feature, not a bug — it is what makes the system auditable.
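The two rules above, abstain below a threshold and cite every sentence, can be checked mechanically after generation. This is a sketch under stated assumptions: the `[chunk:ID]` citation format, the 0.5 threshold, the function names and the naive sentence splitter (which expects the citation to come before the closing punctuation) are all hypothetical choices for illustration.

```python
import re
from typing import List, Set, Tuple

CITATION = re.compile(r"\[chunk:([A-Za-z0-9_-]+)\]")  # assumed citation format
ABSTAIN = "I don't have a source for that."


def select_context(scored_chunks: List[Tuple[str, str, float]], threshold: float = 0.5) -> List[Tuple[str, str]]:
    """Keep only (chunk_id, text) pairs whose relevance score clears the threshold."""
    return [(cid, text) for cid, text, score in scored_chunks if score >= threshold]


def uncited_sentences(answer: str, allowed_ids: Set[str]) -> List[str]:
    """Return sentences with no citation, or with a citation to a chunk that was not retrieved."""
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer) if s.strip()]
    bad = []
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        if not cited or not cited <= allowed_ids:
            bad.append(sentence)
    return bad


def finalize(answer: str, context: List[Tuple[str, str]]) -> str:
    """Abstain when there is no usable context or any sentence fails the citation check."""
    allowed_ids = {cid for cid, _ in context}
    if not context or uncited_sentences(answer, allowed_ids):
        return ABSTAIN
    return answer


context = select_context([("c12", "Claim filed on 3 March.", 0.82), ("c40", "Office hours.", 0.31)])
draft = "The claim was filed on 3 March [chunk:c12]. It was approved within a week."
problems = uncited_sentences(draft, {cid for cid, _ in context})
```

Rejecting the whole answer on one uncited sentence is deliberately strict; a softer variant could drop the offending sentence or ask the model to regenerate.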
We also constrain output to JSON schemas wherever the downstream consumer is a system, not a person. Schema-constrained generation eliminates whole classes of formatting failures.
Key takeaways
- Vector-only RAG fails on identifiers, exact strings and structured filters
- Architect ingestion, retrieval, ranking and grounding as separate, testable layers
- Hybrid BM25 + dense with RRF fusion is the strong default for enterprise corpora
- A small cross-encoder re-ranker is the highest-ROI add to most pipelines
Frequently asked questions
Which vector database should I use?
For most enterprise workloads, Postgres with pgvector or an OpenSearch hybrid index is easier to operate than a dedicated vector DB. Use a specialized store only when you exceed roughly 50M vectors and have strict latency budgets.
How do you handle document freshness?
The source of truth lives in the system of record. The RAG index is a derived asset, updated via change data capture (CDC) or a scheduled sync. Every chunk carries an updated_at; the prompt template instructs the model to prefer the most recent source when conflicts arise.
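One way to make recency visible to the model is to render each chunk with its updated_at and list the newest first. The sketch below assumes chunks are dictionaries with `chunk_id`, `text` and `updated_at` keys and uses a hypothetical `[chunk:ID]` label; neither is prescribed by any particular framework.

```python
from datetime import datetime
from typing import Dict, List


def format_context(chunks: List[Dict]) -> str:
    """Render chunks newest-first, each prefixed with its ID and updated_at date."""
    ordered = sorted(chunks, key=lambda c: c["updated_at"], reverse=True)
    lines = [
        f"[chunk:{c['chunk_id']}] (updated_at={c['updated_at'].date().isoformat()}) {c['text']}"
        for c in ordered
    ]
    return "\n".join(lines)


# Two hypothetical chunks that disagree; the newer one should be listed first.
chunks = [
    {"chunk_id": "p7", "text": "Refund window is 30 days.", "updated_at": datetime(2023, 5, 1)},
    {"chunk_id": "p9", "text": "Refund window is 14 days.", "updated_at": datetime(2024, 2, 10)},
]
context_block = format_context(chunks)
```

Ordering alone does not resolve a conflict; it only gives the prompt instruction something concrete to refer to.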
How big should chunks be?
300–600 tokens with 10–15% overlap is a good default for prose. For code or tables, chunk by structural boundaries (function, row group). Always store both the chunk and a reference to its parent document for context expansion.
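A sliding-window chunker for prose might look like the sketch below. It assumes the text is already tokenized (a whitespace split stands in for a real tokenizer here), and the defaults of 400 tokens with a 50-token overlap (12.5%) are simply one point inside the range above. The `chunk_tokens` name and the `parent_id#n` ID scheme are hypothetical.

```python
from typing import Dict, List


def chunk_tokens(tokens: List[str], size: int = 400, overlap: int = 50, parent_id: str = "doc") -> List[Dict]:
    """Split tokens into windows of `size` that share `overlap` tokens with the previous window."""
    if size <= 0 or not 0 <= overlap < size:
        raise ValueError("require size > 0 and 0 <= overlap < size")
    step = size - overlap
    chunks: List[Dict] = []
    for start in range(0, len(tokens), step):
        chunks.append({
            "chunk_id": f"{parent_id}#{len(chunks)}",
            "parent_id": parent_id,   # kept so context can be expanded to the parent later
            "start": start,           # token offset within the parent document
            "tokens": tokens[start:start + size],
        })
        if start + size >= len(tokens):
            break  # this window already reaches the end of the document
    return chunks


# Whitespace split as a stand-in tokenizer: 1,000 synthetic tokens.
tokens = [f"t{i}" for i in range(1000)]
chunks = chunk_tokens(tokens, size=400, overlap=50, parent_id="policy-2024")
```

Storing `start` alongside `parent_id` makes it cheap to fetch neighbouring windows when a retrieved chunk needs more surrounding context.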
Build an enterprise-grade RAG system
We design hybrid retrieval stacks with eval harnesses, citations and structured grounding — production-ready in 4–6 weeks.