You don’t need embeddings to do RAG. In fact, off-the-shelf vector search can be actively harmful. It’s the wrong lens for thinking about the problem.
We got into this vector mess because RAG, at first blush, looks like a question answering system. The user asks the AI a question. It understands and responds in natural language. So, naturally, the search behind the AI should take questions and respond with answers.
In this line of thinking, we should be able to “ask” the vector database a free-form question, encoding a question like
What is the capital of France?
into an embedding…
…we find, in our vector index, the most similar vector, corresponding to:
Answer: Paris is the capital and largest city of France, with an estimated city population of 2,048,472 in an area of 105.4 km2 (40.7 sq mi), and a metropolitan population of 13,171,056 as of January 2025…
Thus, with vector search, we retrieve the most similar passage in response. And it seems like a natural fit to the problem.
Embeddings don’t make it easy to manipulate data
While embeddings have become an invaluable ingredient, they’re the wrong abstraction for thinking about search.
RAG applications struggle because they don’t actually help users manipulate their specific corpus. Users want SELECT * FROM your_data WHERE <stuff_that_matters_in_your_data>. Users want the LLM to translate their question into the <stuff_that_matters_in_your_data>. They want to select domain-specific things that an open-domain, off-the-shelf embedding model doesn’t understand.
But, vectors fundamentally don’t “select” for a number of reasons.
Embedding crowding
First, off-the-shelf embedding models tend to be trained on a general corpus, while our dataset tends to be specific. This creates a problem: every passage / question from our users / corpus looks very similar to every other passage / question.
We might see a quarterly earnings report as an entirely different type of financial report from an S1 filing for an IPO.

To the generic embedding model, trained on Web data, though, “S1” and “earnings report” would be very similar. Both are “financial reports”. Suddenly all your data becomes clumped together - every document sits within maybe 0.1 cosine similarity of every other document. This makes it hard to discriminate relevant from irrelevant without additional work to recenter or fine-tune your embeddings.
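If you want to check whether this is happening with your corpus, here’s a rough sketch (assuming sentence-transformers and a generic model like all-MiniLM-L6-v2, with made-up document snippets; substitute your real model and documents):

import numpy as np
from sentence_transformers import SentenceTransformer

# Documents *we* consider very different kinds of financial reports (made up)
docs = [
    "ACME Corp Q3 quarterly earnings: revenue grew 4% year over year...",
    "ACME Corp S1 registration statement for its initial public offering...",
    "ACME Corp annual proxy statement and notice of shareholder meeting...",
]

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic, web-trained model
embs = model.encode(docs, normalize_embeddings=True)  # unit-length vectors

# With normalized vectors, cosine similarity is just a dot product
sims = embs @ embs.T
off_diag = sims[~np.eye(len(docs), dtype=bool)]
print("pairwise similarity range:", off_diag.min(), "to", off_diag.max())
# If that range is squeezed into a narrow band, the model sees your
# "different" documents as near-duplicates -- the crowding problem above.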
Filtering - match / no match
Second, embedding retrieval has no understanding of what does or does not answer the user’s question.
Earlier we proposed a question to a hypothetical RAG system:
What is the capital of France?
Similarity ranges from -1 to 1 with 1 being the most similar. Maybe the similarity between “What is the Capital of France” and “Paris is the capital…” is 0.9. But then, there’s this passage:
Answer: Rouen is a city on the River Seine, in northwestern France. It is the prefecture of the region of Normandy and the department of Seine-Maritime. Formerly one of the largest and most prosperous cities of medieval Europe…
Another French city.
Such a passage can be surprisingly close to the right answer. Maybe here it’s 0.8.
How do you cut it off? One passage is correct, the other is incorrect. Binary. Match / not match.
Without extra work, we don’t have a magical way of saying “this matches / doesn’t match” in vector search.
Some other question like
List all the cities of France
might retrieve the Rouen and Paris snippets, but maybe at ~0.6 similarity to the query. After all, nothing magical says “the right answer has to be 0.8”. So whatever cutoff threshold worked for the previous question doesn’t work here.
Without extra work (building a classifier, etc.), pure vector retrieval doesn’t distinguish between correct and incorrect answers.
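To make that concrete, here’s a toy sketch reusing the made-up similarity numbers above; the only point is that no single cutoff separates correct from incorrect across queries:

# Toy similarity scores from the examples above (entirely made up)
capital_of_france = {"Paris passage": 0.9, "Rouen passage": 0.8}
cities_of_france = {"Paris passage": 0.6, "Rouen passage": 0.6}

def retrieve(scores, cutoff):
    return [doc for doc, sim in scores.items() if sim >= cutoff]

# A cutoff tuned so the first query only keeps the correct passage...
print(retrieve(capital_of_france, cutoff=0.85))  # ['Paris passage']
# ...throws away everything for the second query, where both passages are relevant
print(retrieve(cities_of_france, cutoff=0.85))   # []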
In-domain considerations dominate search performance
Information retrieval can be thought of as two worlds:
- Cross-domain research: techniques that provide modest gains across all domains, on leaderboard metrics.
- In-domain industry: techniques that provide outsized gains at your company on the complex, multidimensional constraints of your engineering problem.
For embedding models, we frequently see that performance at ranking general data doesn’t translate to YourCorpus™️.
Trey and I saw this in our AI Powered Search course. Domain-specific terminology and phrases didn’t always translate neatly into an embedding model’s similarity. In financial data, that might mean misunderstanding weird phrases like
- Companies with high yield and strong fundamentals - high yield might mean junk bonds, not high returns
- How do Chinese walls work within a Hong Kong bank? - doesn’t relate to the wall in China, but rather information barriers needed for compliance
Figure out how embedding similarity understands your domain’s phrases, not just the “What is Paris” style searches.
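A cheap way to do that, sketched here with sentence-transformers, a generic model, and paraphrases I made up for the “high yield” example above:

from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # generic, web-trained model

query = "Companies with high yield and strong fundamentals"
candidates = [
    "Screening junk bonds issued by companies with solid balance sheets",  # the analyst's intent
    "Stocks that delivered the highest returns this year",                 # the open-domain reading
]

q_emb = model.encode(query, normalize_embeddings=True)
c_embs = model.encode(candidates, normalize_embeddings=True)
for candidate, sim in zip(candidates, util.cos_sim(q_emb, c_embs)[0]):
    print(f"{float(sim):.2f}  {candidate}")
# If the "highest returns" reading wins, the off-the-shelf model is losing
# your domain's meaning of "high yield".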
Users want affordances for selecting data
Users want to learn affordances for manipulating data. From Human-Computer Interaction expert Donald Norman:
“…the term affordance refers to the perceived and actual properties of the thing, primarily those fundamental properties that determine just how the thing could possibly be used.” (Norman 1988, p. 9)
Users want to know what they can do to your data, how to select it, with confidence the system will translate their terminology into the right selectors.
For this reason, good search turns into a query/content understanding exercise.
Users search for a certain class of content they want to retrieve. We organize content to help LLMs find it. We translate queries into retrieval selectors based on our content’s organization.
It’s more information architecture + data modeling than semantic similarity.
For example, in a furniture search I search for a style, a room, a material, and so on. For company financial reports, the stock ticker, the market, a general type of report (ie quarterly earnings, S1, etc), maybe a mood (upbeat), maybe the user cares about recent reports vs historical research.
We never have a complete set of the properties (users always surprise us) but you can often have a fairly comprehensive list, a happy path where users might find the most success.
LLMs are a query understanding powerhouse
The mistake we made with RAG was thinking that because users use natural language, we should use a natural language similarity technique (question answering).
What people have realized, through a variety of lenses (traditional RAG, tool usage, etc.), is that LLMs are amazing at query understanding. That’s why I created a whole course about Cheating at Search with LLMs.
They take free text:
Show me a suede, geometric couch
And a schema:
from typing import List, Literal
from pydantic import BaseModel, Field

class Query(BaseModel):
    # styles is a free text visual style
    styles: List[str] = Field(description="The visual style the user wants.")
    # materials is an enum of specific legal values ("..." stands in for the rest of your vocabulary)
    materials: List[Literal["leather", "suede"]] = \
        Field(description="The material the user is looking for.")
    # classification fits into our taxonomy organizing content hierarchically
    classification: str = \
        Field(description="How the item is classified, ie Living Room / Sofas")
And produce a structured query conforming to the schema:
{
  "styles": ["geometric"],
  "materials": ["suede"],
  "classification": "Living Room / Seating / Sofas"
}
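Wiring this up is mostly plumbing. Here’s a minimal sketch, assuming the OpenAI Python SDK’s structured-output support and the Query model above (any client that can emit schema-conforming JSON works the same way):

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

completion = client.beta.chat.completions.parse(
    model="gpt-4o-mini",  # any structured-output-capable model works
    messages=[
        {"role": "system",
         "content": "Translate the shopper's request into the Query schema."},
        {"role": "user", "content": "Show me a suede, geometric couch"},
    ],
    response_format=Query,  # the Pydantic model defined above
)

query = completion.choices[0].message.parsed  # a validated Query instance
# query.styles, query.materials, query.classification now drive retrieval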
The same pattern applies whether you’re doing e-commerce, job search (skills, locations, etc.), or searching a knowledge base (organized by topic, region, company name).
Each of these is a different similarity space
In the schema above, each attribute might select data using a completely different technique:
- Style similarity - solve with some kind of visual / CLIP embedding model mapping captions to furniture images?
- Material similarity - solve with some explicit taxonomic similarity ensuring exact matches of suede rank higher, then related materials (leather, etc). Then everything else is not a match?
- Classification - another taxonomic similarity, so that direct matches rank above sibling matches, which rank above cousin matches, and so on, deciding to cut off at a certain point (sketched below)?
Each can become its own little problem.
Each lets us explain to the LLM + user exactly why the result was a relevant match, allowing them to understand and refine.
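For instance, the classification piece might look something like this hypothetical scorer, assuming slash-delimited taxonomy paths like “Living Room / Seating / Sofas”:

def taxonomy_score(query_path: str, item_path: str) -> float:
    """Rank direct matches above siblings, siblings above cousins, by
    counting how much of the taxonomy path the two classifications share."""
    query_parts = [part.strip() for part in query_path.split("/")]
    item_parts = [part.strip() for part in item_path.split("/")]
    shared = 0
    for q, i in zip(query_parts, item_parts):
        if q != i:
            break
        shared += 1
    return shared / max(len(query_parts), len(item_parts))

wanted = "Living Room / Seating / Sofas"
print(taxonomy_score(wanted, "Living Room / Seating / Sofas"))         # 1.0, direct match
print(taxonomy_score(wanted, "Living Room / Seating / Loveseats"))     # ~0.67, sibling
print(taxonomy_score(wanted, "Living Room / Tables / Coffee Tables"))  # ~0.33, cousin
# Below some cutoff (say 0.3) we might decide the item simply isn't a match.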
We have this incredible ability now to decompose, at query time, the individual signals that apply to our final ranking function. Years ago, this would have been a months-long project for a team of experts. Now we can Cheat at Search with LLMs to translate user needs into a more structured specification of the ideal result.
What about when we can’t understand the query?
A structured, query-understanding style approach becomes your first line of defense. Targeting those most important selectors users care about pays dividends. It shows we “get them” when we can reflect their intent back and describe why results came back.
But there remains a grey area. We can’t neatly organize every way users want to search. We still need to handle cases where we don’t understand the query completely. We may have lower confidence in what the user wants from their query, so we don’t want to filter out potentially relevant items. We instead rank by our confidence that the result will satisfy the user, or by some semantic proximity to what the user wants (as mentioned above).
When we start with pure vector search (or BM25) we ONLY care about building this unbounded ranking, divorced from understanding the properties of our domain. Even when we have done a good job on a subset of properties, we will never fully model every problem the user cares about. We will need a less precise fallback to attempt to help the user, and falling back to good-enough retrieval can be a great strategy.
Indeed Google does this. Google is replete with info boxes of structured information. If Google knows you’re searching for a movie name, it gives you a summary of showtimes, reviews, and other information you’d want to know about movies. Within each info box there’s a small ranking problem to solve. If Google can’t solve your problem at all (and an AI summary has no idea) then they’ll just fall back to traditional search ranking.
Great search attempts high-precision answers when we understand the query, and falls back to simpler forms of retrieval when we do not.
Ranking isn’t just BM25 and vectors
Finally, ranking passage similarity is only a tiny part of this ranking problem. Many other factors could come into play:
- Popularity - Is this item, like Hansel, so hot right now? Or is it old and boring?
- Recency - How recently was the item published? Does your user expect newer items or older items?
- Authority - How well trusted is this information? Classically for Google, this is a statistic like PageRank.
- Proximity - How physically close is the item to the user? Is the restaurant next door or an hour away?
Ranking is not just question-to-passage similarity, it’s an interplay of a number of factors users subconsciously use to decide whether the item will be relevant.
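As a sketch of how those signals might blend into a single score (the signals, weights, and numbers here are entirely made up; in practice you’d tune them against judgments or clicks):

from dataclasses import dataclass

@dataclass
class Candidate:
    similarity: float  # query/passage similarity, 0..1
    popularity: float  # e.g. normalized click or sales volume, 0..1
    recency: float     # e.g. decayed age, 1.0 = published today
    authority: float   # e.g. normalized PageRank-style score, 0..1

def score(c: Candidate) -> float:
    # Hypothetical linear blend -- passage similarity is just one signal
    # among several, not the whole ranking function.
    return 0.5 * c.similarity + 0.2 * c.popularity + 0.2 * c.recency + 0.1 * c.authority

candidates = [
    Candidate(similarity=0.82, popularity=0.1, recency=0.2, authority=0.4),
    Candidate(similarity=0.74, popularity=0.9, recency=0.9, authority=0.6),
]
# The slightly less similar, but popular, fresh, authoritative item wins
print(sorted(candidates, key=score, reverse=True))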
Then diversify…
Ranking itself isn’t enough. We need a diverse, representative set of relevant data across dimensions not specified by the user. If we don’t know some category (like couch material), intentionally show them / the LLM a diverse range. Give LLMs a broad overview of what’s there.
One of the worst search experiences I’ve seen is when a user searched for restaurant jobs and the same Chipotle cook job showed up 10 times (at different locations). Technically all relevant. But not a great experience. “Restaurant job” can mean a lot of different types of jobs, high-end, fast-food, cooking, serving, managing, and so on. Good search doesn’t just give you a few relevant results, it gives users a broad view within the set the user selected.
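One simple way to avoid ten identical Chipotle postings (a sketch, not the only approach; fancier methods like MMR exist) is to cap how many results any one group, such as employer, can contribute:

from collections import defaultdict

def diversify(ranked_results, group_key, max_per_group=2):
    """Walk the ranked list, keeping at most max_per_group results per group."""
    seen = defaultdict(int)
    diverse = []
    for result in ranked_results:
        group = group_key(result)
        if seen[group] < max_per_group:
            diverse.append(result)
            seen[group] += 1
    return diverse

jobs = [  # already ranked by relevance; employer is the grouping dimension
    {"title": "Line Cook", "employer": "Chipotle", "location": "5th Ave"},
    {"title": "Line Cook", "employer": "Chipotle", "location": "Main St"},
    {"title": "Line Cook", "employer": "Chipotle", "location": "Airport"},
    {"title": "Sous Chef", "employer": "Le Bernardin", "location": "Midtown"},
    {"title": "Server", "employer": "Olive Garden", "location": "Mall"},
]
print(diversify(jobs, group_key=lambda j: j["employer"]))
# At most two Chipotle postings survive; the high-end and serving jobs surface.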
Then agentify?
This diversity point becomes particularly important for an agentic search loop. In agentic search, we have a feedback loop in the middle of the RAG system. An agent can reason: search, evaluate, try again, reformulate. The agent won’t be able to learn how to reformulate unless it can see the many ways a query might produce a diverse range of search results.
Has vector search been a distraction all along?
Vector databases were one of the best things to happen to the search industry in decades. They revitalized Information Retrieval and evangelized ideas around “relevance” at VC scale. They innovated, while working to make cutting-edge techniques like SPLADE or late interaction accessible.
But we all know that when VC money gets lit under rocket ships, a few things may go wrong as we ride hype curves.
Unfortunately, I feel like the near-term emphasis on RAG by vector DBs has both created RAG misconceptions and limited the direction of the vector database industry. A vector DB call should probably be in almost every request to a website - personalizing, recommending, helping you search. I’ve always had a gut feeling that over-indexing on RAG would create unfortunate knock-on effects for everyone involved.
Putting the vector search cart before the RAG horse has been one mistake. Not every team has millions to spend scaling out billion-scale vector search. That’s untenable. But they may already have a decent enough keyword search that they manage. They might be able to gradually layer in bits of useful vector retrieval here and there to solve targeted problems incrementally.
RAG isn’t about completely rethinking search around embedding retrieval. It’s about query understanding with an LLM to explore your corpus - and building retrieval LLMs and agents can understand.
Enjoy softwaredoug in training course form!
20% off through December 19, 2025 with code ps5
I hope you join me at Cheat at Search with LLMs to learn how to apply LLMs to search applications. Check out this post for a sneak preview.