RAG | 2024-07-30 | 6 min

RAG Beyond Vector Search: Hybrid Retrieval Architecture for Enterprise AI

After deploying RAG into healthcare, legal and financial-services environments, one pattern is consistent: pure vector search makes a great demo and a fragile product. Enterprise users hit a long tail of acronyms, identifiers, dates and exact-string queries that dense embeddings handle poorly. The answer is hybrid retrieval — and a clear separation between retrieval, ranking and grounding.

[ 01 ]

Why pure vector search breaks at enterprise scale

Dense embeddings excel at semantic similarity but they smear exact tokens. ‘ICD-10 R10.31’, ‘invoice INV-44219’, ‘Section 12(b)(iv)’ — the literal string is the meaning, and nearest-neighbor search will happily return a topical neighbor instead of the right record.

Vector indexes also struggle with negation, freshness and structured filters. A user asking ‘open tickets from Q3 not assigned to me’ is doing structured query work, not similarity matching.

[ 02 ]

The four layers of a production RAG stack

We architect every enterprise RAG system as four distinct, testable layers. Each layer has its own eval set and its own failure mode.

Ingestion — chunking, metadata extraction, entity tagging and PII handling
Retrieval — hybrid BM25 + dense, with structured filters from parsed intent
Ranking — cross-encoder re-ranker over the top 50–100 candidates
Grounding — prompt assembly with citations, schemas and guardrails
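
One way to keep the layers separately testable is to give each a narrow interface. The sketch below is illustrative only: `Chunk`, `run_pipeline` and the `toy_*` stand-ins are hypothetical names, and ingestion is assumed to have already produced the chunks.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str


# Each layer is a plain callable, so it can be swapped or evaluated on its own.
Retriever = Callable[[str], List[Chunk]]             # query -> candidates
Ranker = Callable[[str, List[Chunk]], List[Chunk]]   # query, candidates -> ordered
Grounder = Callable[[str, List[Chunk]], str]         # query, context -> prompt


def run_pipeline(query: str, retrieve: Retriever, rank: Ranker,
                 ground: Grounder, top_k: int = 3) -> str:
    candidates = retrieve(query)
    ranked = rank(query, candidates)
    return ground(query, ranked[:top_k])


# Toy stand-ins (assumptions, not real components). Ingestion output:
CORPUS = [
    Chunk("c1", "Invoice INV-44219 was paid in March."),
    Chunk("c2", "Refund policy for enterprise plans."),
]


def _terms(text: str) -> set:
    return set(text.lower().split())


def toy_retrieve(query: str) -> List[Chunk]:
    # Keep any chunk sharing at least one whitespace token with the query.
    return [c for c in CORPUS if _terms(query) & _terms(c.text)]


def toy_rank(query: str, chunks: List[Chunk]) -> List[Chunk]:
    # Order by number of shared tokens, highest first.
    return sorted(chunks, key=lambda c: -len(_terms(query) & _terms(c.text)))


def toy_ground(query: str, chunks: List[Chunk]) -> str:
    context = "\n".join(f"[{c.chunk_id}] {c.text}" for c in chunks)
    return f"Answer using only the sources below.\n{context}\nQuestion: {query}"


prompt = run_pipeline("status of invoice INV-44219",
                      toy_retrieve, toy_rank, toy_ground)
```

Because each layer is just a function, an eval set for ranking can call `toy_rank` (or its real replacement) directly without standing up retrieval or generation.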

[ 03 ]

Hybrid retrieval: BM25 + dense, fused with RRF

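Run BM25 (or a learned sparse retriever such as SPLADE) and a dense vector retriever in parallel, take the top 50 from each, and fuse the two lists with Reciprocal Rank Fusion (RRF). RRF works on ranks rather than raw scores, so BM25 and cosine scores never need to be normalized against each other, and it has a single constant, k, conventionally set to 60. That makes it parameter-light and robust, and in our deployments it consistently beats either retriever alone on enterprise corpora.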
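
As a rough illustration, RRF scores each document as the sum of 1 / (k + rank) over every list it appears in. The sketch below assumes each retriever returns an ordered list of document IDs; the function name `rrf_fuse`, the sample IDs and the default k = 60 (the commonly used value) are choices made for this example.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(rankings: Sequence[Sequence[str]], k: int = 60,
             top_n: int = 50) -> List[str]:
    """Fuse ranked ID lists with Reciprocal Rank Fusion."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; only the position matters, not the raw score.
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))[:top_n]


bm25_hits = ["inv-44219", "inv-44210", "policy-7"]
dense_hits = ["policy-7", "inv-44219", "faq-3"]
fused = rrf_fuse([bm25_hits, dense_hits])
# fused == ["inv-44219", "policy-7", "inv-44210", "faq-3"]
```

Documents that both retrievers agree on rise to the top, while a document found by only one retriever still survives into the candidate set.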

Layer a structured pre-filter on top — tenant, date range, document type — driven by an LLM intent parser. The retriever should never see documents the user is not authorized to see.
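
A minimal sketch of such a pre-filter follows. The `Doc` and `Filters` types and the `prefilter` function are hypothetical. One assumption worth stating: the tenant and group membership here come from the authenticated session, while only the optional filters (document type, date range) come from the intent parser, so a misparsed query cannot widen access. A production system would push these predicates into the index query rather than loop in Python.

```python
from dataclasses import dataclass
from datetime import date
from typing import FrozenSet, List, Optional, Set


@dataclass(frozen=True)
class Doc:
    doc_id: str
    tenant: str
    doc_type: str
    created: date
    allowed_groups: FrozenSet[str]


@dataclass(frozen=True)
class Filters:
    """Optional filters, as an intent parser might emit them."""
    doc_type: Optional[str] = None
    start: Optional[date] = None
    end: Optional[date] = None


def prefilter(docs: List[Doc], tenant: str, user_groups: Set[str],
              f: Filters) -> List[Doc]:
    visible = []
    for d in docs:
        if d.tenant != tenant:
            continue  # hard tenant boundary
        if not (d.allowed_groups & user_groups):
            continue  # user shares no group with the document ACL
        if f.doc_type and d.doc_type != f.doc_type:
            continue
        if f.start and d.created < f.start:
            continue
        if f.end and d.created > f.end:
            continue
        visible.append(d)
    return visible


docs = [
    Doc("a", "acme", "ticket", date(2024, 8, 1), frozenset({"support"})),
    Doc("b", "acme", "ticket", date(2024, 2, 1), frozenset({"support"})),
    Doc("c", "other", "ticket", date(2024, 8, 2), frozenset({"support"})),
    Doc("d", "acme", "ticket", date(2024, 9, 1), frozenset({"legal"})),
]
q3 = Filters(doc_type="ticket", start=date(2024, 7, 1), end=date(2024, 9, 30))
visible = prefilter(docs, tenant="acme", user_groups={"support"}, f=q3)
```

Only document `a` survives: `b` is outside the date range, `c` belongs to another tenant, and `d` is restricted to a group the user is not in.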

[ 04 ]

Cross-encoder re-ranking earns its compute

A small cross-encoder re-ranker (e.g. bge-reranker-v2-m3) over 50–100 candidates is the single highest-leverage component in most RAG stacks. It typically lifts top-3 precision by 15–30% over fused retrieval alone, at a cost of 50–200 ms of added latency.
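
The re-ranking step itself is small. In the sketch below, `rerank` takes any batch scoring function over (query, passage) pairs; the `toy_scorer` is a token-overlap stand-in so the example runs without a model. A real cross-encoder's batch predict method (for instance `CrossEncoder.predict` in sentence-transformers) should fit the same shape, though that wiring is an assumption and is not shown.

```python
from typing import Callable, List, Sequence, Tuple

# Takes (query, passage) pairs, returns one relevance score per pair.
ScoreFn = Callable[[Sequence[Tuple[str, str]]], Sequence[float]]


def rerank(query: str, candidates: List[str], score_pairs: ScoreFn,
           top_k: int = 3) -> List[Tuple[str, float]]:
    pairs = [(query, text) for text in candidates]
    scores = score_pairs(pairs)  # one batched call, not one call per pair
    order = sorted(range(len(candidates)), key=lambda i: scores[i],
                   reverse=True)
    return [(candidates[i], float(scores[i])) for i in order[:top_k]]


def toy_scorer(pairs: Sequence[Tuple[str, str]]) -> List[float]:
    """Stand-in scorer: fraction of query tokens present in the passage."""
    out = []
    for q, passage in pairs:
        q_terms = set(q.lower().split())
        overlap = q_terms & set(passage.lower().split())
        out.append(len(overlap) / max(len(q_terms), 1))
    return out


candidates = [
    "Termination clauses overview",
    "Section 12(b)(iv) covers termination for cause",
    "Holiday schedule",
]
top = rerank("section 12(b)(iv) termination", candidates, toy_scorer, top_k=2)
```

Passing the scorer in as a parameter keeps the ranking layer testable with a deterministic stub and lets the model be swapped without touching the pipeline.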

Cache aggressively at the (query, candidate) level — a surprising share of enterprise queries repeat verbatim within hours.
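
A minimal in-memory sketch of pair-level caching is shown below. `PairScoreCache` is a hypothetical name, and including a chunk version in the key is an assumption added here so that a re-indexed chunk does not serve a stale score. A production cache would also need a TTL or eviction policy and would likely live in a shared store.

```python
import hashlib
from typing import Callable, Dict


class PairScoreCache:
    """Caches re-ranker scores keyed on (query, chunk ID, chunk version)."""

    def __init__(self, score_pair: Callable[[str, str], float]) -> None:
        self._score_pair = score_pair
        self._store: Dict[str, float] = {}
        self.misses = 0

    @staticmethod
    def _key(query: str, chunk_id: str, chunk_version: str) -> str:
        # The query is kept verbatim apart from surrounding whitespace.
        raw = "\x1f".join((query.strip(), chunk_id, chunk_version))
        return hashlib.sha256(raw.encode("utf-8")).hexdigest()

    def score(self, query: str, chunk_id: str, chunk_version: str,
              text: str) -> float:
        key = self._key(query, chunk_id, chunk_version)
        if key not in self._store:
            self.misses += 1  # only a miss pays for the model call
            self._store[key] = self._score_pair(query, text)
        return self._store[key]


# Stand-in scorer so the example runs without a model.
cache = PairScoreCache(lambda query, text: float(len(text)))
first = cache.score("open tickets Q3", "c1", "v1", "abc")
second = cache.score("open tickets Q3", "c1", "v1", "abc")   # cache hit
third = cache.score("open tickets Q3", "c1", "v2", "abcd")   # new version
```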

[ 05 ]

Grounding: cite, or do not answer

The model should generate answers strictly from the retrieved context, and every sentence should carry an inline citation back to a chunk ID. If the retriever returns nothing above a confidence threshold, the model says so. ‘I don’t have a source for that’ is a feature, not a bug — it is what makes the system auditable.
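
The two rules above can be enforced after generation with a small guard. This sketch assumes citations are written inline as `[chunk_id]` before the sentence's final punctuation, uses a deliberately naive sentence splitter, and treats the 0.5 threshold as a placeholder; all function names are hypothetical. Rejecting the whole answer when one sentence is uncited is one possible policy; regenerating is another.

```python
import re
from typing import List, Set, Tuple

CITATION = re.compile(r"\[([A-Za-z0-9_-]+)\]")
NO_SOURCE = "I don't have a source for that."


def select_context(scored_chunks: List[Tuple[str, float]],
                   threshold: float) -> List[str]:
    """Keep only chunk IDs whose retrieval score clears the threshold."""
    return [cid for cid, score in scored_chunks if score >= threshold]


def uncited_sentences(answer: str, allowed_ids: Set[str]) -> List[str]:
    """Return sentences with no citation, or citing an unknown chunk."""
    # Naive split on ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", answer)
                 if s.strip()]
    bad = []
    for sentence in sentences:
        cited = set(CITATION.findall(sentence))
        if not cited or not cited <= allowed_ids:
            bad.append(sentence)
    return bad


def guard(answer: str, scored_chunks: List[Tuple[str, float]],
          threshold: float = 0.5) -> str:
    allowed = set(select_context(scored_chunks, threshold))
    if not allowed or uncited_sentences(answer, allowed):
        return NO_SOURCE
    return answer


retrieved = [("c1", 0.82), ("c2", 0.12)]
ok = guard("The invoice was paid in March [c1].", retrieved)
rejected = guard("The refund is pending [c2].", retrieved)  # c2 below threshold
```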

We also constrain output to JSON schemas wherever the downstream consumer is a system, not a person. Schema-constrained generation eliminates whole classes of formatting failures.
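
Constrained decoding itself happens at the model or serving layer and is not shown here. What the consuming system can do regardless is validate the payload before using it. The sketch below uses only the standard library; the two-field shape (`answer`, `citations`) and the names `GroundedAnswer` and `parse_grounded_answer` are assumptions for illustration.

```python
import json
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class GroundedAnswer:
    answer: str
    citations: List[str]


def parse_grounded_answer(raw: str) -> GroundedAnswer:
    """Parse model output and reject anything outside the expected shape."""
    data = json.loads(raw)  # malformed JSON raises a ValueError subclass
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if set(data) != {"answer", "citations"}:
        raise ValueError(f"unexpected or missing keys: {sorted(data)}")
    if not isinstance(data["answer"], str):
        raise ValueError("answer must be a string")
    citations = data["citations"]
    if not isinstance(citations, list) or not all(
            isinstance(c, str) for c in citations):
        raise ValueError("citations must be a list of strings")
    return GroundedAnswer(data["answer"], citations)


parsed = parse_grounded_answer(
    '{"answer": "Paid in March.", "citations": ["c1"]}'
)
```

Failing loudly at this boundary keeps a malformed response from propagating into downstream systems.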

[ Key takeaways ]

01 Vector-only RAG fails on identifiers, exact strings and structured filters
02 Architect ingestion, retrieval, ranking and grounding as separate, testable layers
03 Hybrid BM25 + dense with RRF fusion is the strong default for enterprise corpora
04 A small cross-encoder re-ranker is the highest-ROI add to most pipelines

[ FAQ ]

Frequently asked questions

Which vector database should I use?

For most enterprise workloads, Postgres with pgvector or an OpenSearch hybrid index is easier to operate than a dedicated vector DB. Consider a specialized store only once you pass 50M vectors and have strict latency budgets.

How do you handle document freshness?

The source of truth lives in the system of record. The RAG index is a derived asset, updated via change data capture (CDC) or scheduled sync. Every chunk carries an updated_at timestamp, and the prompt template instructs the model to prefer the most recent source when sources conflict.
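
A sketch of how the timestamp might surface in prompt assembly is shown below. The `Chunk` type, the `build_context` function and the exact instruction wording are hypothetical; the idea is simply to show the model each source's update date, newest first, alongside the conflict rule.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import List


@dataclass(frozen=True)
class Chunk:
    chunk_id: str
    text: str
    updated_at: datetime


def build_context(chunks: List[Chunk]) -> str:
    # Newest first, with the date visible so the model can apply the rule.
    ordered = sorted(chunks, key=lambda c: c.updated_at, reverse=True)
    lines = [
        f"[{c.chunk_id}] (updated {c.updated_at.date().isoformat()}) {c.text}"
        for c in ordered
    ]
    header = "If sources conflict, prefer the most recently updated one."
    return "\n".join([header, *lines])


chunks = [
    Chunk("c1", "Travel cap is 150 per night.",
          datetime(2023, 1, 10, tzinfo=timezone.utc)),
    Chunk("c2", "Travel cap is 180 per night.",
          datetime(2024, 6, 2, tzinfo=timezone.utc)),
]
context = build_context(chunks)
```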

How big should chunks be?

300–600 tokens with 10–15% overlap is a good default for prose. For code or tables, chunk by structural boundaries (function, row group). Always store both the chunk and a reference to its parent document for context expansion.
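
For prose, a sliding window is enough to get started. The sketch below uses whitespace splitting as a stand-in for a real tokenizer (an assumption; counts will differ from model tokens), with a 400-token window and 12.5% overlap chosen from within the ranges above. `ChunkRecord` and `chunk_tokens` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class ChunkRecord:
    parent_id: str      # reference back to the source document
    start: int          # token offset within the parent
    tokens: List[str]


def chunk_tokens(parent_id: str, tokens: List[str], size: int = 400,
                 overlap_ratio: float = 0.125) -> List[ChunkRecord]:
    if size <= 0 or not 0 <= overlap_ratio < 1:
        raise ValueError("size must be positive and 0 <= overlap_ratio < 1")
    overlap = int(size * overlap_ratio)
    step = size - overlap  # always >= 1 given the checks above
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(ChunkRecord(parent_id, start, tokens[start:start + size]))
        if start + size >= len(tokens):
            break  # the last window already reached the end of the document
    return chunks


text = " ".join(f"w{i}" for i in range(1000))
records = chunk_tokens("doc-1", text.split())
```

Keeping `parent_id` and `start` on every record is what makes context expansion possible later: the grounding layer can pull neighboring windows or the full parent document when a chunk alone is too narrow.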
