Ask a RAG system "what was our revenue last quarter?" and it might retrieve paragraphs mentioning revenue from three different quarters. The embeddings are semantically similar. The relevance is completely different. After two years of building production RAG systems, I see this class of failure everywhere, and it cannot be solved within the retrieval architecture itself.
Where RAG Breaks Down
The most common failure is context loss. RAG retrieves chunks based on embedding similarity, but a chunk is a fragment stripped from its surrounding context. The model has no sense of where the chunk sits in the original document or what surrounding text qualifies its claims.
Then there is the chunking problem. Chunking works for linear text like blog posts or reports. It breaks down for spreadsheets, relational data, nested structures, and anything where the meaning depends on spatial layout rather than sequence. You can work around this with sliding windows, parent-child chunking, or layout-aware parsers, but the results are fragile and require significant tuning per document type.
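Parent-child chunking, one of the workarounds mentioned above, can be sketched in a few lines: match queries against small child chunks, but return the larger parent section so the model sees surrounding context. This is a minimal illustration with made-up names; the `score` function is a keyword-overlap stand-in for real embedding similarity.

```python
# Parent-child chunking sketch: retrieve on small chunks, return the
# enclosing parent section. All names here are illustrative.

def split(text, size):
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]

def build_index(sections, child_size=8):
    # Each child chunk remembers which parent section it came from.
    index = []
    for pid, section in enumerate(sections):
        for child in split(section, child_size):
            index.append({"child": child, "parent_id": pid})
    return index

def score(query, chunk):
    # Placeholder for embedding similarity: keyword overlap.
    return len(set(query.lower().split()) & set(chunk.lower().split()))

def retrieve(query, index, sections):
    best = max(index, key=lambda e: score(query, e["child"]))
    # Match on the small chunk, but hand the model the whole parent section.
    return sections[best["parent_id"]]

sections = [
    "Q3 revenue grew 12 percent year over year driven by enterprise sales.",
    "Headcount remained flat while operating costs declined slightly.",
]
index = build_index(sections)
print(retrieve("revenue growth", index, sections))
```

The same shape underlies the "fragile and requires tuning" caveat: `child_size` and the parent boundaries are exactly the knobs that need per-document-type adjustment.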
Finally, dynamic knowledge is painful. If your content changes over time, you need to keep re-embedding. If a customer updates their documentation weekly, or a website changes daily, the maintenance burden grows fast. RAG works best when the corpus is fixed.
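One way to blunt the re-embedding burden is incremental updates: hash each document and skip the embedding call when the content has not changed. A minimal sketch, assuming a hypothetical `embed` placeholder in place of a real embedding model:

```python
# Incremental re-embedding sketch: only re-embed documents whose
# content hash changed since the last index run.
import hashlib

store = {}  # doc_id -> (content_hash, embedding)

def embed(text):
    return [float(len(text))]  # stand-in for a real embedding model

def upsert(doc_id, text):
    h = hashlib.sha256(text.encode()).hexdigest()
    cached = store.get(doc_id)
    if cached and cached[0] == h:
        return False  # unchanged: skip the expensive embedding call
    store[doc_id] = (h, embed(text))
    return True

assert upsert("faq", "How do refunds work?")            # new doc: embedded
assert not upsert("faq", "How do refunds work?")        # unchanged: skipped
assert upsert("faq", "Refunds take 5 business days.")   # changed: re-embedded
```

This keeps weekly documentation updates cheap, but it only helps with cost, not with the staleness window between index runs.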
Hierarchical RAG
Hierarchical retrieval provides context at multiple levels of abstraction instead of handing the model a raw snippet: the sentence from the source document, a summary of its paragraph, a summary of the page, a summary of the chapter, and a summary of the whole document.
This measurably improves quality on longer, structured documents. But the retrieval step is still disconnected from the reasoning step. RAG does not use the language model's intelligence for the search itself. The model receives fragments and must synthesize an answer from whatever it was given. It cannot ask a follow-up question. It cannot refine its search. It cannot reason about what is missing.
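The hierarchical assembly step can be sketched as a lookup that returns a retrieved sentence together with its chain of summaries. The structure and summaries below are hard-coded for illustration; in a real system an LLM would generate the summaries at index time.

```python
# Hierarchical context sketch: a retrieved sentence is returned with
# summaries of its enclosing paragraph, page, and document.

doc = {
    "summary": "Annual report covering revenue, costs, and outlook.",
    "pages": [{
        "summary": "Q3 financial results.",
        "paragraphs": [{
            "summary": "Revenue performance for the quarter.",
            "sentences": ["Q3 revenue grew 12 percent year over year."],
        }],
    }],
}

def with_hierarchy(doc, page_i, para_i, sent_i):
    page = doc["pages"][page_i]
    para = page["paragraphs"][para_i]
    return {
        "sentence": para["sentences"][sent_i],
        "paragraph_summary": para["summary"],
        "page_summary": page["summary"],
        "document_summary": doc["summary"],
    }

ctx = with_hierarchy(doc, 0, 0, 0)
```

The point of the structure is disambiguation: the same sentence about revenue now arrives labeled with which quarter, which report, and which document it belongs to.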
Agentic Search
Agentic search makes search a tool the model controls. Instead of a single embedding lookup feeding into the prompt, the agent decides what to search for, evaluates the results, and iterates. It might start with a broad query, find that the results are not quite right, and adjust. It can search across multiple sources, each with its own tool: a local file search, a web search, a database query.
The pattern is familiar: a query goes in, results come out. Perplexity built an entire product around this. Anthropic, OpenAI, and Google have all since shipped their own versions for web content. For internal files and knowledge bases, you need more custom setup, but the architecture is the same.
The agent maintains context across multiple search steps and can reason about what it has found and what is still missing. Context loss is reduced because the model drives the search process, though it still operates within its context window. With enough accumulated search results, synthesis quality can degrade for the same reasons RAG struggles: too much information, not enough structure.
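The search-evaluate-refine loop described above can be sketched as follows. Both `web_search` and `llm_decide` are placeholders: the first stands in for any search tool, the second for the model's reasoning step, reduced here to a trivial broadening rule so the loop is runnable.

```python
# Agentic search loop sketch: propose a query, inspect results,
# refine or stop. All functions are illustrative stand-ins.

def web_search(query, corpus):
    # Placeholder search tool: return documents containing every query word.
    terms = query.lower().split()
    return [d for d in corpus if all(t in d.lower() for t in terms)]

def llm_decide(query, results):
    # Stand-in for the model's reasoning: if nothing matched,
    # broaden the query by dropping its last term; otherwise stop.
    if results:
        return "stop", query
    words = query.split()
    return "refine", " ".join(words[:-1])

def agentic_search(query, corpus, max_steps=4):
    gathered = []
    for _ in range(max_steps):
        results = web_search(query, corpus)
        gathered.extend(results)
        action, query = llm_decide(query, results)
        if action == "stop" or not query:
            break
    return gathered

corpus = ["Q3 revenue was $4M.", "Q2 revenue was $3M."]
print(agentic_search("revenue figures last quarter", corpus))
```

Even this toy version shows the tradeoff: the loop can recover from a bad first query, but every iteration adds a model call and accumulates results into the context window.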
Tradeoffs
Agentic search is slower and more expensive than RAG. A single vector lookup takes milliseconds, while an agent running three or four searches with reasoning in between might take several seconds at 5 to 20 times the cost per query. I have built production systems where RAG was the right choice specifically because the use case demanded sub-200-millisecond responses.
RAG works well when the knowledge domain is fixed, the documents are clean and linear, and the questions are factual and specific. Agentic search wins when knowledge changes over time, when questions are ambiguous or require reasoning across sources, and when the cost of a wrong answer is higher than the cost of a few extra seconds of latency.
Choosing the Right Architecture
A practical decision framework:
- Stable corpus, speed is critical, questions are simple. Use RAG. Invest in good chunking and re-ranking.
- Long structured documents, partial context problems. Use hierarchical RAG. Add summaries at multiple levels.
- Dynamic knowledge, ambiguous questions, multiple sources. Use agentic search. Give the model tools and let it reason.
In practice, the most effective production architecture is often a hybrid: route simple factual queries to optimized RAG for speed, and route complex or ambiguous questions to an agent for quality. A lightweight routing layer that classifies incoming queries and picks the right backend gets you the best of both.
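The routing layer can be as simple as a cheap classifier in front of both backends. The heuristic below is a placeholder with made-up signal words; production routers often use a small LLM or a trained classifier instead.

```python
# Routing-layer sketch: send ambiguous or multi-source questions to the
# agent, short factual lookups to RAG. The signals are illustrative.

AGENT_SIGNALS = ("compare", "why", "explain", "latest", "across", "trend")

def route(query):
    q = query.lower()
    if any(s in q for s in AGENT_SIGNALS) or len(q.split()) > 12:
        return "agent"
    return "rag"

assert route("What is the refund window?") == "rag"
assert route("Compare our churn trend across the last three quarters") == "agent"
```

The design choice that matters is asymmetry: misrouting a simple query to the agent costs a few seconds, while misrouting a hard query to RAG costs answer quality, so the thresholds should err toward the agent.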
RAG is not dead. But it is no longer enough on its own.