Why can't the big data warehouse and other relational platforms replace a knowledge graph?
To confess my prejudice upfront, I like SQL a lot more than I like the relational data model. Or, rather, I like both; but the problem with modern enterprise data management, as I see it, is that the world has changed, profoundly.
In the new world, the relational model is a leaky abstraction, and it is not as good a fit as it once was, given all the changes to the enterprise data landscape since the mid-1970s. SQL is fine, but it is tied to the relational model; that’s a hard connection to break, and no one’s really trying very hard to decouple them.
I’ll confess my other prejudice. In 2017 I predicted in a Medium post that “knowledge graph is the data model for the next 20+ years”—that prediction looks pretty good at this point.
RAG is King but Text2SQL is Everywhere Now
The main product in enterprise GenAI that isn't RAG—which only works for documents and unstructured data—is Text2SQL, that is, a conversational UX for a relational database.
There’s no shortage of options here, as a simple search reveals, and we should expect to see a conversational UX for just about every database of note: Snowflake Copilot, Microsoft Copilot for Azure, Oracle’s Select AI (a knife fight internally to not name this Oracle Copilot!?), and, I assume, something for IBM in watsonx. Among the startups, we can see text2sql.ai, Waii, and Datachat. And, of course, Databricks has a thing, AI/BI Genie. Interestingly, though not at all surprisingly, of the big platforms it is Databricks that gets it: they know that relational, as the sole data model, is showing its limitations in the GenAI era.
Almost no one wants to write SQL by hand and even fewer people know how to do it. Setting aside what people want or can do, almost no SQL queries are written by hand by anyone, a fact well-known to practitioners. Software engineers write programs and entire categories of systems to avoid writing SQL queries by hand. That’s just good engineering.
And yet, just as clearly, Text2SQL is important since enterprises store important data in relational systems. No one disputes that.
Peak Oil? More like Peak Relational Data!
But times have changed. Much like Peak Oil (the idea that fossil fuels will only ever contribute a smaller percentage to total energy output as time goes by), we could easily have declared Peak Relational Data years ago. While relational data is probably still the biggest category of data in the enterprise, its percentage of the whole gets smaller every year.
All the data growth is in kinds of data that relational databases struggle to manage, namely, semistructured and unstructured data. That means that querying a relational system returns, every year, a smaller percentage of the correct answers that exist across all enterprise data; in other words, recall falls. All else equal, this implies a natural trend toward a diminishing F-score unless we intervene.
This matters for at least two reasons:
- We care about both precision and recall, after all. It doesn’t do much good to say “we returned no incorrect answers” if you don’t return any answers at all.
- Status quo bias toward SQL-and-relational limits the creation of new insights by limiting the places that we look for answers.
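To make the precision and recall point concrete, here is a small arithmetic sketch. The numbers are invented for illustration only: they assume a relational system that never returns a wrong answer, while the share of relevant answers that live in relational data shrinks over time.

```python
def f1_score(precision: float, recall: float) -> float:
    """Balanced F-score: the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical shares of all relevant answers that live in relational data.
# These figures are made up to illustrate the trend, not measured.
relational_share_by_year = {2015: 0.8, 2020: 0.5, 2025: 0.25}

# Assumption: the relational query is perfectly precise (precision = 1.0)
# and finds every answer it can reach, so recall equals the relational share.
f1_by_year = {
    year: f1_score(precision=1.0, recall=share)
    for year, share in relational_share_by_year.items()
}

for year, f1 in f1_by_year.items():
    share = relational_share_by_year[year]
    print(f"{year}: recall={share:.2f}  F1={f1:.2f}")
```

Under these assumptions, precision never moves, yet the F-score still falls as recall does, which is the "no incorrect answers, but not many answers either" situation described above.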
The future, which has arrived almost everywhere recently, favors knowledge graph approaches exactly because of this limitation of relational systems.
AI needs a Better Data Abstraction
But there are other good reasons why the relational data model isn’t sufficient to be the data model of the GenAI era.
- There's no reason to think multi-hop reasoning should respect data silos. Reality doesn't care about your data silos! Truth cares even less. The true answer to a complex question that moves the needle is a mosaic of pieces, and neither the data lake nor the data warehouse contains all the pieces. The relevance of all enterprise data to GenAI precision and recall cannot be overstated, which is ironic since I’ve said it at least three times in this post already.
- Important chunks of reality are not tabular in shape. Here are a few that Stardog pays a lot of attention to specifically:
  - supply chains, both global and regional
  - financial networks, including, critically, risk and compliance
  - research programs, including, critically, life sciences, drug discovery, and space exploration
  - strategic and tactical landscape of the modern warfighter
- All actual and forecasted data growth is in semistructured and unstructured types where the relational model struggles to be even so much as relevant.
- You can't ground LLM outputs to manage hallucinations using only the data warehouse because the data warehouse doesn't know enough about documents, logs, social media, emails, texts, etc. In fact, this point generalizes, and it’s one of the reasons that AI quality depends directly on data management:
A system cannot ground an AI output or part of an AI output in a fact that it doesn't contain or know about.
- Owing to a nice result from Juan Sequeda and Dean Allemang, there’s proof of what practitioners have long suspected: Text2SQL is inherently harder than other kinds of Semantic Parsing. It’s notably harder than what we do in Stardog Voicebox, which is to treat conversational inputs as knowledge graph queries, not as relational queries, i.e., SQL.
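The multi-hop and grounding points in the list above can be sketched with a toy example. Everything here is hypothetical: the facts, the source names, and the helper functions are made up for illustration, and this is plain Python, not Stardog's API or how Voicebox works. It only shows that a multi-hop answer can need facts from several silos, and that a claim can be grounded only in facts the system actually holds.

```python
from typing import NamedTuple


class Fact(NamedTuple):
    subject: str
    predicate: str
    obj: str
    source: str  # hypothetical name of the silo the fact came from


# Made-up facts spread across three made-up silos.
FACTS = [
    Fact("acme", "buys_from", "widgetco", "warehouse"),
    Fact("widgetco", "located_in", "portland", "crm"),
    Fact("portland", "affected_by", "port_strike", "news_feed"),
]


def objects(facts, subject, predicate):
    """Return every o such that (subject, predicate, o) is a known fact."""
    return [f.obj for f in facts if f.subject == subject and f.predicate == predicate]


def multi_hop(facts, start, predicates):
    """Follow a chain of predicates from start, one hop per predicate."""
    frontier = [start]
    for predicate in predicates:
        frontier = [o for s in frontier for o in objects(facts, s, predicate)]
    return frontier


def is_grounded(facts, subject, predicate, obj):
    """A claim is grounded only if the system holds a matching fact."""
    return any((f.subject, f.predicate, f.obj) == (subject, predicate, obj) for f in facts)


# "Which disruptions affect the places where Acme's suppliers are located?"
path = ["buys_from", "located_in", "affected_by"]
warehouse_only = [f for f in FACTS if f.source == "warehouse"]

all_sources_answer = multi_hop(FACTS, "acme", path)
warehouse_answer = multi_hop(warehouse_only, "acme", path)
print("all sources:", all_sources_answer)
print("warehouse only:", warehouse_answer)
```

With all three sources connected, the chain reaches an answer; restricted to the warehouse facts, the same question dead-ends after the first hop, and `is_grounded` has nothing to check a claim about the strike against.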
The big data warehouse vendors will continue to build Text2SQL UX for customers and that's exactly what I would do, too, in their situation.
The Right Abstraction Makes Intractable Problems Tractable
One of the funny things computer science teaches us is the qualitative difference between “it works” and “it works optimally”, or between “you can kinda sorta do it” and “it’s just the right thing”.
The wrong abstraction makes hard things harder!
More Text2SQL is the right thing for big relational vendors to do, given their priors and their installed base, but for the reasons I've mentioned here it's often not the best thing for their customers to do.
At least not if your goal is, as ours is at Stardog, data democratization: We want to live in, and are busy every day creating, a world where anyone can ask any question about any data and get a fast, accurate, hallucination-free answer immediately.