2026-06-11

Sutton Says Supervised Generative AI Can't Discover. Half of That Holds.

Sutton splits discovery into variation, evaluation, and selective retention, then argues pure generative AI lacks the evaluation step. The core is right, but his own counterexamples dismantle the part of the verdict aimed at the LLM route.

reinforcement-learning llm-limits ai-research

Summary

Richard Sutton recorded a talk titled “AI Creativity and Discovery” for a SAIR workshop on “Science for AI.” He puts forward a claim he himself calls “new and possibly controversial”: that generative AI trained by supervised learning, including every large language model and image and video model, is in a precise sense incapable of making genuine novel discoveries. The post drew 676K views and over a hundred comments on Hacker News.

The core of the argument is solid: discovery equals variation plus evaluation plus selective retention, and a pure generative model is missing the evaluation step. Yet in the same talk he names AlphaProof, AlphaEvolve, and even Claude-Code as systems that are both novel and good. Those examples show that the red line he draws between “supervised-learning AI” and “era-of-experience RL” has already been crossed in engineering practice. My judgment: as a critique of pure pre-trained models, he is right; as a verdict on the LLM route, he is half wrong, and it is his own examples that undo that half.

The debate

Sutton’s side is clear, and it does not rest on disparaging generative AI. He opens with an old joke: a review says “this work is both novel and good. Unfortunately, the parts that are good are not novel, and the parts that are novel are not good.” He says this lands precisely on a large part of today’s AI. Generative models ingest huge numbers of examples and produce a model that behaves like the examples: text like people, images like artists, video like the internet. The process is partly stochastic. Each step can go several ways, so a trajectory is either random (and thus novel) or anchored to the training data (and thus good, because the data is good), but never both at once.

Then he gives the constructive frame. Real discovery is the combination of three steps: variation, evaluation, and selective retention. Evolution works this way, the scientific method works this way, animal learning works this way; psychology calls it operant conditioning, and machine learning calls it reinforcement learning. What generative AI lacks is the middle step. The generator was fixed by pre-training, so at runtime there is no mechanism to evaluate what it just produced. No evaluation means no selective retention, and so no discovery. “The novelty flickers into existence but, if its value is unrecognized, it flickers away and is lost.”
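The three steps can be sketched as a loop. This is a toy illustration, not anything from the talk: the hidden target, the scoring function, and the names `discover`, `sample_only`, `toy_score`, and `toy_vary` are all assumptions made for the example. The first function runs variation, evaluation, and selective retention; the second keeps the same variation source but never evaluates, so nothing it stumbles on is kept.

```python
import random

TARGET = 3.7  # hypothetical "unknown" optimum, visible only to the evaluator


def toy_score(x):
    """Evaluation: an independent judge of how good a candidate is."""
    return -(x - TARGET) ** 2


def toy_vary(x, rng):
    """Variation: a blind random perturbation of a candidate."""
    return x + rng.gauss(0.0, 0.1)


def discover(evaluate, vary, seed, steps=2000, rng=None):
    """Variation + evaluation + selective retention."""
    rng = rng or random.Random(0)
    best, best_score = seed, evaluate(seed)
    for _ in range(steps):
        candidate = vary(best, rng)      # variation around what was retained
        score = evaluate(candidate)      # evaluation
        if score > best_score:           # selective retention
            best, best_score = candidate, score
    return best, best_score


def sample_only(vary, seed, steps=2000, rng=None):
    """Same variation source, no evaluation: each sample is drawn around
    the fixed seed and then discarded, so nothing accumulates."""
    rng = rng or random.Random(0)
    current = seed
    for _ in range(steps):
        current = vary(seed, rng)
    return current


if __name__ == "__main__":
    found, _ = discover(toy_score, toy_vary, seed=0.0)
    drifted = sample_only(toy_vary, seed=0.0)
    print("with evaluation:", found)
    print("without evaluation:", drifted)
```

With the evaluator in the loop the candidate climbs toward the target; without it the output stays within noise of the starting point, however many samples are drawn.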

The opposing side assembled on Hacker News in two tiers of very different strength.

The first tier is misreading, and can be set aside. Some commenters (simianwords, vasco) argue that humans can “evaluate” only because they have access to the real world, so this is not an inherent AI limitation; others (edot, dwd) argue that LLMs plainly do produce things that are novel and good. Most of these commenters missed the second half of the talk: Sutton never says that systems with evaluation fail, and in fact he singles those systems out for praise. dwd does catch a real internal contradiction: Sutton lists Claude-Code, a generative AI, among systems that make discoveries, which collides head-on with his opening line. That hit lands, but it strikes the looseness of Sutton’s phrasing, not the core of his argument.

The second tier is the technically weighty objection, and it cuts the other way: it actually tightens Sutton’s case. porridgeraisin notes that reinforcement learning with verifiable rewards (RLVR) does not expand beyond the base distribution; it only mode-seeks within it. It can push already-present, lower-probability trajectories to the top and improve maj@k and pass@1, but it does little for pass@k at high k. It sharpens the top of the distribution rather than opening new territory. A trajectory with near-zero probability under the original model has to be sampled before it can be rewarded, and RLVR rarely samples it. So the ceiling is real, and the reason AlphaEvolve clears it is that it bolts on an external evolutionary searcher to generate candidates the base policy would never produce.
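The arithmetic behind that claim can be made concrete. If each sample succeeds independently with probability p, then pass@k = 1 - (1 - p)^k. The sketch below uses two hypothetical problems with made-up success rates (none of these numbers come from the thread): one the base model already solves occasionally, and one whose solution has near-zero probability. It assumes sharpening raises p only on the first, because the second is never sampled and so never rewarded.

```python
def pass_at_k(p, k):
    """Chance that at least one of k independent samples succeeds."""
    return 1.0 - (1.0 - p) ** k


# Hypothetical per-sample success rates (illustrative numbers only).
base = {"reachable": 0.20, "near_zero": 1e-6}
# Assumption: RLVR-style sharpening lifts the reachable problem but leaves
# the near-zero one untouched, since it is never sampled and rewarded.
sharpened = {"reachable": 0.90, "near_zero": 1e-6}


def mean_pass_at_k(rates, k):
    """Average pass@k across the problems in `rates`."""
    return sum(pass_at_k(p, k) for p in rates.values()) / len(rates)


for k in (1, 1024):
    print(f"pass@{k}: base={mean_pass_at_k(base, k):.3f} "
          f"sharpened={mean_pass_at_k(sharpened, k):.3f}")
```

Under these assumed numbers, mean pass@1 rises from about 0.10 to 0.45, while mean pass@1024 is about 0.50 both before and after: the reachable problem was already solved at high k, and the near-zero one still is not.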

On the other flank, skybrian, doctoboggan, balazstorok, and musebox35 read the talk most accurately, and most damagingly. Sutton attacks pre-training alone. The moment you put an LLM in an agentic loop, wired to a compiler, a terminal, verifiable rewards, you supply the generate-test-selectively-refine loop, which is the very step Sutton says is missing. musebox35 maps it to the Fisher/Box feedback loop from statistics and points out that the most successful applications today, like coding, were never the product of pure generative modeling; they are the product of closing the loop.
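A minimal sketch of what closing the loop looks like for code, assuming a stub in place of the LLM. Everything here is hypothetical and for illustration only: `make_stub_generator` replays a fixed list of candidate programs instead of calling a model, `run_tests` stands in for a compiler plus test suite, and the task (a function named `solve` that returns the larger of two numbers) is invented. A real harness would sandbox execution rather than call `exec` directly.

```python
def run_tests(source, tests):
    """Evaluation: execute candidate source and count passing test cases."""
    namespace = {}
    try:
        exec(source, namespace)  # stand-in for a compile-and-run step; not sandboxed
        fn = namespace["solve"]
    except Exception:
        return 0  # does not compile or does not define solve
    passed = 0
    for args, expected in tests:
        try:
            if fn(*args) == expected:
                passed += 1
        except Exception:
            pass  # a runtime error counts as a failed test
    return passed


def make_stub_generator(candidates):
    """Hypothetical generator: replays fixed candidates instead of calling an LLM."""
    it = iter(candidates)

    def propose(feedback):
        # A real generator would condition on `feedback`; the stub ignores it.
        return next(it, None)

    return propose


def closed_loop(propose, tests, max_rounds=10):
    """Generate, test, and keep the best candidate seen so far."""
    best_src, best_score, feedback = None, -1, None
    for _ in range(max_rounds):
        src = propose(feedback)              # variation
        if src is None:
            break
        score = run_tests(src, tests)        # evaluation
        if score > best_score:               # selective retention
            best_src, best_score = src, score
        if best_score == len(tests):
            break
        feedback = f"{score}/{len(tests)} tests passed"
    return best_src, best_score


TESTS = [((1, 2), 2), ((5, 3), 5), ((4, 4), 4)]
CANDIDATES = [
    "def solve(a, b): return a",                   # passes 2 of 3
    "def solve(a, b) return b",                    # syntax error, passes 0
    "def solve(a, b): return a if a > b else b",   # passes 3 of 3
]

best_src, best_score = closed_loop(make_stub_generator(CANDIDATES), TESTS)
print(best_score, best_src)
```

The generator never changes in this sketch. What changes is that each proposal meets a judge independent of the generator, and only the proposals that score better survive.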

Who’s right

You have to judge in two parts, because “about pure pre-trained models” and “about the LLM route” are two different claims that Sutton bundled into one sentence.

About pure pre-trained models, Sutton is simply correct, and porridgeraisin’s technical detail is the hardest evidence on his side. A model that only does next-token prediction with no external judge at inference genuinely has no evaluation step and genuinely can only sample within its training distribution. RLVR looks like it supplies evaluation, but it sharpens an existing distribution rather than exploring outward; the real variation, the “blind” component Sutton stresses, has to be carried in by external search. The claim that a model can on its own surface entirely new solutions has no support today. At this layer the “LLMs obviously innovate” rebuttals do not connect, because nearly all of their examples already smuggle in an external evaluator: a human in the loop, a compiler in the loop, or a searcher in the loop.

But when Sutton pushes the conclusion to “the supervised-learning generative AI route cannot discover,” he overreaches, and he does so on the strength of his own examples. His three-step frame is about systems, not about a single component. Once he grants that Claude-Code and AlphaProof count, he has granted that a system assembled from a generator as the variation source, an external judge as evaluation, and memory as retention can discover. Then the LLM is not a dead end that “cannot discover”; it is the extremely powerful variation component inside a discovery system. doctoboggan puts it exactly right: Sutton is not saying AI systems cannot create; he is saying generative AI without a harness cannot. Those two propositions are worlds apart, and the talk’s opening uses the phrasing of the first while the argument supports only the second.

So my judgment is this: the physics of the argument is right (no evaluation, no discovery), but it is packaged as a stronger conclusion than it can carry. The version that holds is that discovery comes from a closed-loop system and a generator alone is not enough. The version that does not hold is that the generative AI route is inherently incapable of discovery. The first is a technical reality; the second is a verdict on a roadmap. Sutton delivers the first and sells the second.

Why it matters

The real problem this debate exposes is that “model” and “system” get conflated, and that is the most common error in reading AI progress right now. If you attribute capability to bare model weights, Sutton sounds like he is writing off the entire LLM direction. If you attribute capability to the “model plus external loop” system, you find that Sutton actually handed you a map of where to invest: in evaluators and searchers, not just in scaling the base model.

That map has direct consequences for roadmap choices. A large share of the last two years’ progress came not from pre-training scale but from wrapping models in steadily better outer loops: verifiable rewards, tool calls, agentic orchestration. Sutton’s frame explains why this works; you are supplying a powerful variation source with the evaluation and retention it lacks. It also explains where the outer loop’s ceiling sits: if your evaluation signal is fuzzy, or your search is still RLVR-style in-distribution mode-seeking, do not expect the system to leap past what the base model already knows. porridgeraisin’s line, that “our planner is still dumb and we need to work on it,” may be the most actionable sentence in the whole thread.

Further out, this bears on how the “AI scientist” goal actually lands. Sutton’s call to arms is to share goals with AI so that it can create, evaluate, and discover, and so participate fully in reaching them. The engineering translation is that the bottleneck for automating discovery is not generation but whether you can define, for a domain, a clear enough goal and a cheap enough evaluation. Math and code fell first precisely because their evaluation is nearly free (proofs check, code runs). Domains where evaluation is expensive or fuzzy, which includes most of the natural sciences, are still stuck at this bottleneck.

What to ignore

Ignore both “he is credentialed so he must be right” and “he is old so he is just doom-mongering,” both of which showed up on HN (someone trotted out Dyson’s barb at Wolfram, others rushed to defend Sutton on grounds of seniority). Sutton is a founder of reinforcement learning, which makes his framework worth taking seriously, but it does not exempt any single conclusion from scrutiny; conversely, dismissing a concrete technical argument by age or stature is its own laziness. The valuable parts of this thread are the specific disputes about pass@k and in-distribution versus out-of-distribution, not the meta-fight over whether to respect authority.

Ignore the over-inference that “he contradicts himself, so the whole thing is wrong.” Listing Claude-Code among the systems that make discoveries does collide with his opening line, and it is a genuine flaw. But the flaw points straight at the part of his argument that is right: he instinctively concedes that closed-loop generative systems can discover. Seizing on the contradiction to reject the whole talk throws out the solid core with it. The correct reading is to tighten the conclusion he left loose, not to call him wholly wrong because he left it loose.

Finally, ignore the reduction of this debate to a binary “are LLMs good enough or not.” The real question was never whether the LLM component is good enough; it is what kind of system you intend to put it in and what evaluation and search you pair it with. Sutton is arguing about system architecture, and the sharpest people on HN are arguing about system architecture. Staying at the “pro-LLM versus anti-LLM” level means you missed what both sides are saying.

FAQ

Why does AlphaGo's move 37 count as discovery while an LLM writing code does not?

In Sutton's framework the difference is not who is smarter but whether there is a hard evaluation independent of the model itself. AlphaGo has the win/loss outcome of Go as an objective judge; the quality of a move is not decided by imitating human game records. A bare LLM at inference has no such judge and can only sample from its training distribution. But the line is blurry: once an LLM's code is wired to a compiler and tests, it gains the same hard evaluation, and Sutton himself lists Claude-Code among the systems that make genuine discoveries.

Can RLVR let a model explore solutions outside its base distribution?

The evidence so far suggests not. Reinforcement learning with verifiable rewards mostly pushes already-present, lower-probability trajectories toward the top of the base policy's distribution, improving maj@k and pass@1 but doing little for pass@k at high k. It sharpens an existing distribution rather than opening new territory. A trajectory with near-zero probability under the original model must be sampled before it can be rewarded, and RLVR rarely samples it. Crossing that ceiling takes external search, such as MCTS or evolutionary search.

Is this the same point as Sutton's famous bitter lesson?

It is an extension of the same thread. The bitter lesson says general methods that scale with compute (search and learning) beat hand-coded human knowledge over time. Here he aims at generative AI: imitating human corpora is a way of baking human knowledge in, missing the step of searching from experience and goals. Both say the same thing, that real progress comes from search through interaction with the world, not from digesting an existing corpus.

Sources

No official primary source available; this analysis is based on reliable secondary reporting (named outlets, cross-confirmed).