RAG is a Fancy, Lying Search Engine

RAG is popular but unfit for many enterprise use cases. In this post I explain both why it's popular and why it's unfit.

RAG has taken the GenAI world by storm. This is a mistake of excess, which will eventually resolve to a sane equilibrium. But there's a lot to say about RAG in the meantime.
In this post I answer some key questions—
  • Why is RAG so popular?
  • What is RAG really?
  • What are its essential defects?
  • What are its alternatives?
But, first, what is RAG?
📌
RAG is a GenAI application design pattern that supplements a user’s LLM prompt with some other information retrieved dynamically from somewhere in order to make the LLM’s response to the user’s prompt better.
A perfectly obvious, typical software development pattern applied to LLM interaction patterns. My objection isn’t to RAG per se. My objection is to its misuse, and that objection is conditioned on a particular context of use and abuse, which is the market that Stardog Voicebox serves, namely, high-stakes use cases in regulated industries.
TLDR—RAG is unfit for high-stakes use cases in regulated industries because RAG lets the LLM speak last and that’s irresponsible and unsafe.
To get started, I offer five surmises to explain RAG's popularity. Of course "popularity" is orthogonal to "value", but I talk about that later, too.

1. RAG Gives Great “Early Demo”

It's pretty easy, all things considered, to hack a RAG system together, and the abundance of open source RAG code makes it even easier. A 17 July Github search for "RAG AI" gets 2,300+ results, approximately zero of which existed two years ago.
Maybe you don't believe me yet? So what, you say, Github searches don't prove anything. True enough! But anything this easy to build is no moat at all, which reminds me of—
A good business is like a strong castle with a deep moat around it. I want sharks in the moat to keep away those who would encroach on the castle.—Warren Buffett

2. Lots of RAG-based Startups get Funded

Next I googled for "RAG startups" and, completely without irony, here's a piece from last fall about five RAG startups funded by the same VC. I can't even muster the courage to search Crunchbase or Dealbook for the real data. Suffice it to say, there are lots of VC-funded RAG startups, which ensures there will soon be…even more.
To a first approximation, there are only three kinds of GenAI software startups:
  1. foundation model and other core infrastructure firms (OpenAI, Anthropic, Together.ai)
  2. domain-specific GenAI apps for accelerating some core business task (JasperAI, Midjourney, etc)
  3. RAG-with-LLM apps for doing something with documents—too many to even choose…
In short, the space is massively over-rotated to RAG-based startups.

3. A16Z is Super Influential and Pretty Smart and Not Shy

A16Z published a GenAI reference architecture in the summer of 2023. It doesn't mention RAG by name, but RAG is the dominant AI app design pattern it strongly implies. My issue here isn't with A16Z. They do about as good market and product analysis as any VC firm I know. I've learned a lot about GenAI and other stuff from their work.
But since they were early and aggressive, and are very influential, their weight behind vector databases and a super RAG-friendly reference architecture—"Emerging LLM App Stack"—changed the trajectory of early GenAI investment toward RAG-based startups.
Source: A16Z

4. Something, something…Science!

To be honest, one of the signals that GenAI is the real deal is the response of the global research community to the arrival of LLMs two years ago: overwhelming acceptance in the form of a massive increase in LLM-based R&D output. Global science can be wrong, but it's a strong signal that GenAI has legs, not least because that work isn't only a signal of inherent quality but also a feedback loop that increases quality.
More cynically, perhaps, consider this. Arxiv.org sees a steady stream of ever-so-slightly different RAG variations, and that gives the new AI experts in the VC world some "science" to hang their hats on. I mean, they aren't wrong exactly!
As of 17 July, RAG is connected to at least 582 bits of science, nearly all of it from the past year.
Source: Arxiv.org which, perhaps ironically, has poor search. The first result of my "RAG" search isn't about RAG!

5. Search Tech is Extraordinarily Stale. Thanks, Google?

This is the best reason of all, frankly. IR-based search engines aren't very good, but at least they're old and dull. And no, of course I'm not saying Google's search engine is bad, but there's been very little innovation in search since Google came to own the Web, so in that sense I don't mind all this RAG stuff at all.
This isn’t Google’s fault, of course, but their dominance in web search has been so absolute for so long that it more or less drove researchers and competitors out of the space. And now it seems—in a strategic blunder of epoch-making proportions—Google may be exiting web search?
Against that backdrop, RAG is interesting since it revivified search research around a promising new direction.
In the end, I can't say for certain why RAG is so popular, but I think it's some combination of these five factors.

What RAG, Essentially & Unavoidably, Is

I was a philosopher once, which means I prefer essential definitions. What is the essence of RAG, such that if that bit changes we're not talking about RAG any more? I focused on that in my definition earlier: RAG supplements a user's prompt with some dynamically retrieved information and then gives the LLM's output back to the user. The many variations of RAG all do that in different ways, but they all do that.
For example, consider this perfectly exemplary RAG diagram from AWS’s SageMaker platform.
Source: AWS SageMaker documentation
The key bit is the direct interaction between the laptop (which stands for a person, oddly) and the goofy brain-graph LLM icon at lower right. The (1) oval is the user's input. Some stuff happens in steps (2) and (3). It doesn't really matter what. Just some stuff. The result of that stuff is added to the user's original input, and that's all sent in (4) to the LLM brain bug.
Source: Starship Troopers’ Brain Bug is a decent metaphor for LLMs actually.
The key to RAG's unfitness comes in (5): raw LLM outputs go from the LLM back to the user. This means the user is exposed directly to LLM hallucinations. The LLM is the only source of hallucination in a RAG app, and trusting its output means exposing users to hallucinations. That's not responsible, safe, or sane in high-stakes use cases in regulated industries. RAG is unfit for those use cases essentially and unavoidably, because if you're not doing that bit at (5), you're not doing RAG! And if you are doing that bit, then you're doing RAG, and you should stop if you're serving high-stakes use cases.
This is why we talk about RAG trusting the LLM or letting the LLM speak last. The LLM brain bug is the source of all hallucinations in GenAI and we can’t expose our users to it directly.
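To make the five steps concrete, here's a minimal sketch of that loop in Python. The `retrieve` and `complete` callables are hypothetical stand-ins for whatever vector store and model API you actually use; the point is the shape of the loop, not any particular vendor's SDK.
```python
from typing import Callable, List

def rag_answer(
    question: str,
    retrieve: Callable[[str], List[str]],  # stand-in for your vector store
    complete: Callable[[str], str],        # stand-in for your LLM client
) -> str:
    # (1) The user's question arrives as a plain prompt.
    # (2) Retrieve: fetch related text, however you like.
    chunks = retrieve(question)

    # (3) Augment: splice the retrieved text into the user's prompt.
    context = "\n\n".join(chunks)
    prompt = (
        "Answer the question using only the context below.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}"
    )

    # (4) Send the augmented prompt to the LLM.
    answer = complete(prompt)

    # (5) Hand the raw LLM output straight back to the user. This is the
    # objectionable bit: the LLM speaks last, hallucinations included.
    return answer

# A toy run, with stubs standing in for the real retriever and model:
print(rag_answer(
    "How many vacation days do I get?",
    retrieve=lambda q: ["Employees accrue three weeks of vacation per year."],
    complete=lambda p: "(whatever the model says, verbatim)",
))
```
Swap in any retriever and any model and the shape stays the same, and so does the problem at step (5).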

What is RAG Good For? Vacation Policy Lookup

Vacation policies. Who can remember that shit? Three weeks based on seniority unless your service isn’t continuous, exceptions for maternity or paternity leave or a valid leave of absence, except in these EU countries where… blah blah oh my god blah blah.
You can see the problem here, right? Combine these factors for a stark usability nightmare:
  • Arcane, complex policy space around regulations, rules, etc.
  • Some basic competitive tensions, i.e., a zero-sum game where the stakes matter.
  • Medium- or low-stakes use case. That is, an important but not absolutely critical part of life; yes, vacation policy is super important, but also it's sorta not, at least not compared to things like birth, death, illness, bankruptcy, and so on.
A lot of enterprise apps are directly in this confluence of factors.
🔥
If you remember one thing only from this post, remember that RAG is a truly terrible idea in the enterprise if you’re building anything more high-stakes than Vacation Policy Lookup.
RAG is a good choice for many use cases of this type since it’s going to get the common cases right, mostly. Occasionally it will just make shit up or go otherwise badly wrong, but in those cases, the user just asks again later and it will probably do better.
RAG is a good choice for Magic Eightball use cases where it’s perfectly acceptable to make the user try again later.

RAG is a Fancy Pants Search Engine that Makes Shit Up

In a stunning irony, RAG is offered as a solution to LLMs' hallucination problem. This boggles my mind, since at best it lowers the frequency of hallucination from very high to merely unacceptable.
But let's be fair: RAG may well reduce hallucinations from 30% or even 50% to, say, 5%. At 5%, one answer in twenty is still fabricated. For a regulatory filing, a strategic or tactical military or political matter, or a linchpin view in drug discovery or clinical trials, reducing hallucinations from 50% to even just 5% is laudable but also an absolute non-starter.

But At Least RAG Doesn’t Understand Databases!

I've talked a lot about RAG being unsafe because it lets LLMs speak last. But there's another problem. There's always another problem. This one is implied by my choice to call RAG a "fancy search engine". It is undoubtedly fancy, and much better than plain information retrieval.
But it's a search engine: it deals exclusively with documents, that is, unstructured data. But documents are only one of three types of enterprise data. RAG is a one-eyed jack that's entirely blind to structured and semi-structured data.
A factual query, the kind where you're hunting a needle in the haystack or tracing a data path, will necessarily come back incomplete from RAG, because lots of needles live in database records, not documents. Your total exposure to a given counterparty, for example, exists as rows in a positions table, not as a sentence in any document.
Suffice it to say, in regulated industries, a lot of crucial facts live exclusively in database records and other structured and semi-structured data sources. RAG can't deal with golden records, and that's a problem in risk and compliance in financial services, or wealth advisory, or drug discovery in pharma, or supply chain 360 in automotive or AgTech.

You Have to Use LLMs, but For God’s Sake Don’t Trust Them!

AI involves a twofold challenge—
  1. algorithmically understanding data
  2. algorithmically understanding human intent
The early tragedy of GenAI is that it’s really best for #2, but consensus so far is that it’s best for #1.

A Lot of GenAI Backlash is RAG Backlash

What do I mean? Is RAG popular or not? Well, it is, clearly. But there's also a growing awareness that GenAI is super easy to prototype—it gives great "early demo"—but hard to get into production—"it's still making shit up". The backlash against getting GenAI into production is a misdirected backlash against RAG. No one comes right out and says "RAG isn't the right thing", because consensus is very powerful, but increasingly people are realizing that RAG isn't the right thing.
We've heard this in every engagement we've started since launching Voicebox in Q1, from banks to pharma to global manufacturers and defense. The problem isn't tech or talent or resources. You can't make a scorpion stop stinging the frog until it's not a scorpion any more. And you can't make GenAI stop lying until you stop letting the LLM speak last. LLM? More like Large Lying Language Model, amirite?

Semantic Parsing is Shovel-ready GenAI

So is all hope lost if RAG is unfit for high-stakes use cases in the enterprise? Not at all. There is always an alternative. In this case, Semantic Parsing, that is, using the LLM to translate a user's question into a formal query over the data, is the better choice. I talk about the tradeoffs between RAG and SP with respect to four key LLM design challenges in another post.
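To show the shape of the idea, here's a minimal sketch of semantic parsing, with hypothetical names throughout; it's the general pattern, not a description of Voicebox's internals. The LLM's only job is to translate the question into SQL; the query is checked before it runs, and the database, not the LLM, speaks last.
```python
import sqlite3
from typing import Callable, List, Tuple

def semantic_parse_answer(
    question: str,
    to_sql: Callable[[str], str],  # the LLM's only job: translate NL -> SQL
    db: sqlite3.Connection,
) -> List[Tuple]:
    # The LLM never talks to the user; it emits a query instead.
    sql = to_sql(question)

    # Guardrail: validate before executing. A real system would check the
    # generated query against the schema and an allowlist; this sketch at
    # least refuses anything that isn't a single read-only SELECT.
    body = sql.strip().rstrip(";")
    if not body.lower().startswith("select") or ";" in body:
        raise ValueError(f"Refusing to run generated query: {sql!r}")

    # The database speaks last: every row returned is a real record, not
    # model-generated text. A bad parse fails loudly with a SQL error or
    # an empty result set; it never fabricates a plausible answer.
    return db.execute(body).fetchall()
```
The failure mode is the point: a wrong parse errors out or returns nothing, where RAG at step (5) would have lied fluently instead.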
 

Voicebox talks for you and your data talks back! Grow your business with faster time to insight.

Stardog Voicebox is a fast, accurate AI Data Assistant that's 100% hallucination-free guaranteed.
