A Puzzle about LLM Performance: Closed versus Open

Closed models outperform open ones. Or does the causal relationship run the other way?

Why do closed “frontier” models perform better than open ones?
Intuitively there’s no obvious or direct connection between performance (or quality) and license terms, as we’ve seen for the past 30 years with Linux, databases, programming languages, hardware, etc.
This has puzzled me for some time, not only as a matter of curiosity but also as a matter of refusing the rentiers who want to freeze innovation in the AI market in a devious three-step:
  1. LLMs are dangerous, on par with the global CBRN threat, in theory.
  2. Open-source models are even more dangerous since they give bad actors access to the CBRN-like mechanism.
  3. Thus, good, that is, “frontier” models have to be closed (we have to do it for the children, for god’s sake!), which also freezes the market by limiting the ability of small innovators to use or innovate with frontier models.
None of this makes any sense now, given that #1 is false, except that there’s a persistent connection of some sort between model performance and license terms, with the best models being consistently closed.
While the number of models is growing rapidly and open access terms are the most common, it’s still the case that closed models dominate in terms of performance.
But there’s a striking asymmetry between performance of closed versus open models.
We might say, first, that this is an artifact of GPT-4, which is the most powerful model, generally, but also closed. However, the correlation goes beyond OpenAI to Google’s models, which also aren’t open. In fact, only the Llama 3 family, the premier version (300B+ parameters) of which is still unavailable as of June 2024, is both open and in the top tier of performance.
Whatever you think of rent-seeking or of this instance of it, even if you don’t think the LLM threat is rent-seeking, it’s a puzzle why there should be any correlation between performance and license terms.

It’s about the Scaling Laws, Or: Chinchilla Doesn’t Generalize

Setting aside why LLMs work, to say nothing of how they do, we’ve known for a minute (since 2022) how to predict how fast they improve.
It’s the energy, stupid. That is, the more compute that goes into training a model, the better it will perform, all else equal. And we can think about this compute input in terms of data inputs since, after all, all that compute needs data to, well, compute upon.
Model builders use the scaling laws specifically to understand the data input size versus eventual model size tradeoff since, for a fixed compute budget of whatever size, you can spend on tokens and parameters and the ratio between the two is your main source of leverage. Any compute spent on tokens can’t be spent on parameters and vice versa. A larger model requires more tokens of input. A smaller model requires fewer tokens of input but will, all else equal, perform worse.
But how many of each is optimal? That’s what the scaling laws tell us, generally.
The data version of these laws was formulated in a paper (Training Compute-Optimal Large Language Models) from Google’s DeepMind, and they are generally called the Chinchilla laws. (There are lots of these and they keep getting revised, but that doesn’t really matter for our purposes here, so let’s just look at Chinchilla.)
[Figure: Chinchilla training-data requirements]
Well no one really knows intuitively how much data that is, so let’s look at something that relativizes it to books—at these scales, it doesn’t much matter which book or how long, etc.
[Figure: the same data requirements expressed as a number of books]
As simply as I can put it, to get a better model you need to train it on more data, thus expending more compute, and while there’s a debate about how many tokens (how data inputs are sized) are optimal per parameter (how models are sized), with estimates ranging from tens to many hundreds, the shape of the scaling laws is clear. More is better. No, even more than that! A little more…
A better model is a function of more parameters which is a function of more data which is a function of more compute.
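
To make the tokens-versus-parameters tradeoff concrete, here is a minimal Python sketch. It rests on two assumptions that are not from this post: the common rule of thumb that training compute is roughly 6 × parameters × tokens FLOPs, and a tokens-per-parameter ratio that you pick yourself (about 20 is the figure usually attributed to Chinchilla, but as noted above, estimates run from tens to many hundreds). The function name split_compute and the budget value are hypothetical, chosen only for illustration.

```python
import math

# Assumption: training compute C is approximated as 6 * N * D FLOPs,
# a widely used rule of thumb (N = parameters, D = training tokens).
FLOPS_PER_PARAM_TOKEN = 6.0


def split_compute(compute_flops: float, tokens_per_param: float) -> tuple[float, float]:
    """Return (parameters, tokens) that spend compute_flops at the given ratio.

    Solves C = 6 * N * D together with D = tokens_per_param * N.
    """
    if compute_flops <= 0 or tokens_per_param <= 0:
        raise ValueError("compute_flops and tokens_per_param must be positive")
    params = math.sqrt(compute_flops / (FLOPS_PER_PARAM_TOKEN * tokens_per_param))
    tokens = tokens_per_param * params
    return params, tokens


if __name__ == "__main__":
    budget = 6e23  # hypothetical compute budget in FLOPs
    for ratio in (20, 100, 500):  # illustrative tokens-per-parameter ratios
        n, d = split_compute(budget, ratio)
        print(f"ratio={ratio:>3}: params={n:.2e}, tokens={d:.2e}")
```

Under these assumptions, raising the ratio at a fixed budget buys more tokens at the cost of a smaller model, which is exactly the lever the scaling laws are meant to set.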

How Scaling Laws Explain Closed vs Open License Terms

Today I read a paper—gzip Predicts Data-dependent Scaling Laws—that seemingly isn’t connected to the puzzle of performance and license terms at all, but appearance once again deceives. The main results of this paper are as follows—
  • Chinchilla is specific to low-quality web data (which had been speculated but is now demonstrated conclusively).
  • Higher-quality datasets take more compute to compress (in an LLM or with SOTA data compression methods) and call for different token-to-parameter ratios; that is, less compressible training data shifts the compute-optimal preference toward input size over model size.
The paper’s money shot for me is this—
We find that as the training data becomes less compressible (more complex), the scaling law’s compute-optimal frontier gradually increases its preference for dataset size over parameter count. We then measure the compressibility of real-world code & natural language datasets, showing that the former is considerably more compressible and thus subject to a predictably different scaling law.
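
For a rough feel of what “compressibility” means here, the sketch below computes a simple gzip ratio (compressed bytes divided by raw bytes) using Python’s standard gzip module. This is only one plausible way to measure it; the paper’s exact procedure may differ, and the function name gzip_compressibility and the two sample inputs are hypothetical.

```python
import gzip
import random


def gzip_compressibility(data: bytes) -> float:
    """Return compressed size / raw size; a lower value means more compressible."""
    if not data:
        raise ValueError("data must be non-empty")
    return len(gzip.compress(data)) / len(data)


if __name__ == "__main__":
    # Highly repetitive text: gzip finds long repeated matches.
    repetitive = b"the cat sat on the mat. " * 2000

    # Pseudo-random bytes of the same length: almost nothing to exploit.
    rng = random.Random(0)
    noisy = bytes(rng.getrandbits(8) for _ in range(len(repetitive)))

    print(f"repetitive: {gzip_compressibility(repetitive):.4f}")
    print(f"noisy:      {gzip_compressibility(noisy):.4f}")
```

The repetitive sample should score far lower than the noisy one. On the paper’s account, data toward the less compressible end of that range is what pushes the compute-optimal frontier toward more tokens per parameter.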
But there’s a hint in this paper that explains why closed models perform better than open ones. Often the suggestion—given the CBRN hysteria around AI being a possible extinction-level threat to humanity—is that closed models are better because they’re closed, i.e., some special sauce yet undisclosed etc. It’s a vague suggestion at best. And it gets run together with the rentier’s claim, namely, for the sake of the kids, all good models (no matter how they got that way) must be closed for safety.
The gzip paper suggests that things may be the other way around: models aren’t good because they’re closed; they’re closed because they’re good, and the reason is data quality, not safety.
We’ve known for some time that, given the scaling laws, there’s not enough data on the Web to keep training bigger models. But, as I said recently on LinkedIn, we aren’t running out of data to train bigger models, we’re running out of data on the Web to train models.
But as my regulated banking, pharma, and manufacturing customers remind me every day, there’s a lot of data that isn’t on the Web or in the cloud and that data tends to be higher quality than the average web page.

A Provisional Summary

This is all quite speculative, and I’m suspicious of explanations—even or especially when it’s my explanation!—that have three equi-contributing reasons rather than just one definitive reason, generally, but my view now is the following.
The answer appears to be that frontier models are trained on higher-quality data than open models. That data is less compressible, thus requires more compute, and shifts the scaling-law frontier into a region that prefers tokens to parameters. This explains their performance advantage and also suggests an explanation for their license terms.
So now my explanation has three parts—
  1. Frontier models remain closed for monetization. This one is the easiest to understand and I have no quibble here since the owners of IP (modulo the issue of copyright violations) set the terms of IP access. I personally don’t have the audacity to call my org “Open” anything when I’m in fact the very opposite of that in nearly every way.
  2. High-quality training data (that is, higher quality than Common Crawl and the like) likely either is obtained via restrictive sublicense terms, which necessitate closed license terms for the eventual model, or has been obtained by legally dubious means, which also suggests a closed license policy.
    1. If I have very high quality data that I want to license to OpenAI, but it’s sensitive or I want to continue monetizing it, I’m not going to give them sublicense or distribution rights since that weakens my control of my IP. But I will take a lot of their money if they keep the eventual model itself closed.
    2. If I have been, uh, “legally aggressive”, thinking I’d rather ask forgiveness than permission, in sourcing some high quality data, then I’m not going to risk disclosure of this decision by releasing the model as open, etc.
    3. In sum, I don’t see any roads to high quality data sourcing that also permit or suggest open license terms for the eventual model.
  3. The rent-seeking strategy requires a closed model since releasing a frontier model as open gives the game away. If a frontier model is CBRN-like, then the only “responsible AI” is to carefully restrict it, i.e., not open source it. Or if it’s released as open, then it’s not a frontier model! Either way, the closed models get to claim a moral high ground that’s dubious at best.

An Observation about Llama’s Open Motivations

I’m not now nor have I ever been a Facebook user. I just don’t see the point. However, Mark has made a lot of money and monetized the Web, along with Google, allowing it to achieve a status something like a global information utility. But he’s never been very popular or likable on a personal level. He lacks a certain rizz, as the kids say. And then he went and did the Metaverse thing, which wasn’t very successful and exposed him to both technical ridicule and shareholder pressure.
But you have to credit him, whatever else is true, with a certain doggedness. He’s out of the doghouse again because of the Llama open move. His re-rise to geek chic reminds me of Microsoft, which was routinely and unironically called the “Evil Empire” when I was a Linux guy in the late 90s, but now is open source-friendly, Linux dominates Azure, etc. I’m typing this on an HP Windows 11 laptop that’s got Ubuntu running in a terminal!
And Llama has catalyzed almost everyone outside of a few frontier model shops (OpenAI, Google, Microsoft, the big Chinese outfits like Tencent, Huawei, etc.) to throw in together on models, tooling, infrastructure, and research. Llama has done this because it’s a good model and because it’s open. You need both to compete with closed frontier models, after all. And it’s not as if being an open model makes it inherently dangerous; Meta takes responsible AI seriously.
Mark’s achievements are his business, but his contributions are everyone’s and Llama may be the greatest of those.
Maybe one reason I don’t like Facebook is that I prefer to consume high quality data like frontier models apparently prefer and that’s not Facebook’s content by my lights. Give me a book on literally any topic imaginable written by a subject matter obsessif and carefully edited by a team of them and then fact-checked, etc. I remain convinced that LLMs will never generate those de novo and equally convinced that LLMs will help humans generate higher quality content faster and faster, cheaper and cheaper, as they improve.
