May 26, 2026

Noam Brown says AI benchmarks need a compute axis

"I still actually think the significance of inference compute is underestimated." - Noam Brown


Context

Capability becomes an allocation problem when inference-time compute, orchestration, and evaluation determine how much intelligence a buyer can purchase at answer time.

Big Ideas

The benchmark leaderboard is becoming a pricing surface. Brown's argument turns every reasoning-heavy eval into a score-versus-cost curve, which means model capability depends on the inference budget buyers, labs, regulators, and attackers are willing to spend.
Safety policy inherits the same problem. If a lab evaluates a model under a low inference cap but users can scaffold much larger budgets around it, release thresholds may understate the effective capability available in the wild.
Compute access is now institutional leverage. Brown's university comments imply that AI talent, research agendas, and even academic status will flow toward organizations that can allocate GPUs per researcher, not just toward places with prestige or publication culture.

Supporting Context And Sources

The official source is the ARC Prize YouTube episode, "The field is underestimating inference compute | Noam Brown", published by the ARC Prize channel.
Podwise indexes the same episode as "A Single Number Doesn't Make Sense Anymore | Noam Brown", and its outline tracks the same arc: intelligence definitions, inference compute, academia's compute gap, and Brown's poker story.
OpenAI's official "Learning to reason with LLMs" corroborates the core technical premise behind Brown's argument: o1 performance improved with both more reinforcement learning during training and more time spent thinking at test time.
ARC Prize's "OpenAI o3 Breakthrough High Score on ARC-AGI-Pub" provides the clearest outside cost-axis example: o3's ARC-AGI score changed materially between high-efficiency and low-efficiency settings, and ARC says efficiency is now required when reporting performance.
ARC Prize's "Announcing ARC-AGI-2 and ARC Prize 2025" makes the same measurement point from the benchmark designer side: intelligence should include cost and efficiency, not just whether a system can eventually solve a task.
JOLT's NeurIPS 2024 writeup on inference-time compute reads Brown's earlier Math-AI remarks as part of a broader field shift from training-scale-only thinking toward test-time compute, while also warning that inference scaling has different economics because the cost is paid per query rather than amortized once through training.
OfficeChai's commentary, "Reasoning Models Are Making AI Benchmarks Irrelevant: OpenAI's Noam Brown", interprets Brown's benchmark critique as a cost-accounting problem: reasoning models can look much stronger when allowed to spend more, so benchmark results without cost can mislead.

Full Recap

00:03-04:30 - Intelligence starts as a moving target

00:20-01:06 - Brown says he does not have a crisp daily working definition of intelligence, then revisits a 2017 Reddit AMA prediction that AI would not write a thought-provoking novel within ten years.
01:17-04:28 - He still treats a genuinely thought-provoking novel as a useful sign of intelligence, but admits current models may be close if scaffolded well enough.
02:51-04:18 - The host pushes on whether a benchmark should normalize for iterations, prior experience, and joules; Brown says those details matter more now than they did when such capabilities seemed remote.

04:34-08:32 - Benchmarks are partial views, not intelligence itself

04:34-05:31 - The conversation covers validation perplexity per joule, GSM8K-style answer efficiency, benchmark hacking, and ARC as competing proxies for intelligence.
05:37-07:19 - Brown argues that even human intelligence is hard to measure objectively, so model intelligence is best read through a diverse set of evals rather than one universal number.
07:40-07:56 - He frames ARC as measuring adaptation to novel environments, which may matter more or less depending on the job a model is meant to do.

08:32-14:48 - The missing ingredient is inference compute plus verification

09:26-09:59 - Asked about the "Ilyaism" that next-token prediction plus more tokens may be enough, Brown says the phrase misses a key ingredient: models becoming more productive by thinking longer at inference time.
10:01-12:10 - The host tests that claim with the bubble-sort-to-merge-sort objection: if all traces teach bubble sort, why would more thinking discover a better algorithm? Brown answers that discovery depends on diversity of data, exploration, and a verifier that can recognize a better result once it appears.
12:11-12:38 - Brown compares this to human progress: many attempts may be wasteful, but once a useful discovery appears, verification can turn it into progress.
12:41-14:46 - Both speakers converge on the importance of priors: random search is too inefficient, but a sharpened prior helps search focus on promising directions.
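
The search-plus-verifier argument can be pictured with a toy problem where checking an answer is far cheaper than producing one. The sketch below is not from the episode: integer factoring stands in for "discovering a better algorithm", the names `verify`, `search`, `uniform_proposal`, and `sharpened_proposal` are hypothetical, and it assumes a perfect verifier and a fixed attempt budget.

```python
import math
import random


def verify(n: int, candidate: int) -> bool:
    """Cheap check: is `candidate` a nontrivial factor of `n`?"""
    return 1 < candidate < n and n % candidate == 0


def search(n, propose, budget, rng):
    """Draw up to `budget` proposals; return (factor, attempts_used).

    Returns (None, budget) if no proposal passes the verifier.
    """
    for attempt in range(1, budget + 1):
        candidate = propose(n, rng)
        if verify(n, candidate):
            return candidate, attempt
    return None, budget


def uniform_proposal(n, rng):
    # No prior: every integer in [2, n - 1] is equally likely.
    return rng.randrange(2, n)


def sharpened_proposal(n, rng):
    # Sharpened prior: a composite n always has a factor <= sqrt(n),
    # so proposals are drawn only from that much smaller range.
    return rng.randrange(2, math.isqrt(n) + 1)


# Illustrative numbers only (assumptions, not from the source).
N = 101 * 103
rng = random.Random(0)
uniform_factor, uniform_attempts = search(N, uniform_proposal, 2000, rng)
sharp_factor, sharp_attempts = search(N, sharpened_proposal, 2000, rng)
print("uniform:", uniform_factor, "after", uniform_attempts, "attempts")
print("sharpened:", sharp_factor, "after", sharp_attempts, "attempts")
```

In this toy setup the sharpened proposal draws from about 100 candidates instead of about 10,400, so the same attempt budget goes much further. The verifier is identical in both runs; only the prior changes, which is the point both speakers converge on.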

14:48-20:15 - Human priors are still doing real work

14:48-15:55 - Brown says biological plausibility is not the key question; the important fact is that current neural systems are clearly working, even if the brain analogy is incomplete.
15:57-16:19 - The host ties Brown's poker work to test-time compute: a relatively simple CPU-running program could beat heavier on-policy RL approaches by using more thinking at decision time.
17:01-17:54 - Brown contrasts AlphaGo and AlphaZero: removing human data worked well for Go, but similar no-prior approaches did not work as cleanly in StarCraft-like settings with harder action spaces.
18:02-18:59 - He says learning from scratch is theoretically plausible, but for the foreseeable future the evidence favors large-scale internet pretraining and useful human-derived priors.

20:15-23:27 - A single benchmark score breaks under reasoning models

20:28-20:49 - Brown's contrarian view is that the field still underestimates inference compute as models become more capable.
20:52-21:32 - He says publishing one GPQA-style score for a model made sense for GPT-2 and GPT-3, became iffy with GPT-4, and stopped making sense once chain-of-thought prompting and reasoning models made performance depend on how much compute is spent at answer time.
21:39-21:54 - He praises ARC for moving toward x-axes of inference, compute, or cost, and says that is the right way to measure reasoning-heavy benchmarks.
22:03-22:49 - Brown connects this to release governance: preparedness frameworks and responsible scaling policies need to decide how much inference budget is used when judging whether a model crosses a capability or danger threshold.
22:44-23:27 - The operational risk is that a lab might evaluate with a $10 inference cap, while a downstream user scaffolds many calls together and effectively obtains a more capable system by spending $1,000 or $1 million at inference time.
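
A rough way to see why the inference cap matters is to model scaffolding as repeated independent attempts checked by a perfect verifier. That model is an assumption made here for illustration, not something stated in the episode, and the per-attempt cost and success rate below are invented. Real attempts are correlated and real verifiers are imperfect, so this sketch likely overstates the gain; it only shows the direction of the effect.

```python
def attempts_within(budget_usd: float, cost_per_attempt_usd: float) -> int:
    """How many whole attempts fit inside the budget."""
    return int(budget_usd // cost_per_attempt_usd)


def solve_rate(p_single: float, attempts: int) -> float:
    """Chance that at least one of `attempts` independent tries succeeds.

    Assumes each try succeeds with probability `p_single` and that a
    perfect verifier picks out any successful try.
    """
    return 1.0 - (1.0 - p_single) ** attempts


# Hypothetical numbers: $1 per attempt, 1% success per attempt.
COST_PER_ATTEMPT = 1.0
P_SINGLE = 0.01

rates = {}
for budget in (10.0, 1_000.0):
    k = attempts_within(budget, COST_PER_ATTEMPT)
    rates[budget] = solve_rate(P_SINGLE, k)
    print(f"budget ${budget:,.0f}: {k} attempts, solve rate {rates[budget]:.4f}")
```

Under these assumptions the same model looks like a roughly 10% solver at the $10 cap and a near-certain solver at $1,000, which is the gap between the evaluated system and the deployed one that Brown describes.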

23:27-26:10 - Train-time recurrence is plausible but unresolved

23:59-24:52 - The host argues that current transformer training lacks unbounded train-time recurrence, while test-time reasoning can recurse or refine through additional computation.
25:12-26:08 - Brown says spending more compute in training, especially beyond current pretraining practice, is a worthwhile research direction because today's working stack is unlikely to be the final or only way to build capable systems.
26:10-26:22 - He avoids speculating on which parts of the current stack will disappear, noting that he is not primarily a pretraining researcher and may not be able to discuss frontier details anyway.

26:10-36:25 - Compute access is reshaping academia

27:17-27:49 - Brown says academia still has viable work to do, but the most impressive AI capabilities increasingly come from scale, and universities have far fewer GPUs than industry labs.
28:13-28:44 - He proposes that a university could rapidly attract AI talent by spending heavily on a large compute cluster and offering unusually high GPU access per researcher.
28:47-29:44 - The host notes that many universities have meaningfully less than one H100-equivalent per CS student, and Brown says faculty often underestimate the compute gap between academia and industry.
30:14-30:53 - Brown says students should ask professors and current students how much compute they will actually get, and that recruits should treat compute access as a real factor in choosing where to work.
31:16-31:55 - He suggests universities could pool resources for large-scale open-source pretraining, but academic credit systems would need to adapt from small-author papers to large coordinated projects.
32:07-34:16 - Brown emphasizes that high-quality third-party evals are one lower-compute path to influence; he says OpenAI pays attention to good outside evals.
34:23-35:44 - On conferences, Brown expects AI-written papers and AI-assisted reviews to grow, and thinks AI reviewers paired with humans could improve review quality by catching fatal flaws and doing literature checks.

36:25-40:08 - The poker story explains his taste for test-time thinking

37:46-38:37 - Brown closes with the 2017 Libratus poker competition, describing a year of near-nonstop work on a poker bot when his career prospects felt tied to one high-variance match.
38:59-39:19 - He says the hard part was not just beating older bots, but surviving adaptive human opponents who might discover weaknesses during play.
39:48-39:59 - The lesson he drew was blunt: sometimes 90% of the effort can still produce 0% of the reward, but some goals require that level of execution anyway.

Technical Need To Knows

Inference compute / test-time compute: Compute spent while a model is answering, searching, sampling, checking, or reasoning, rather than during initial training. Brown says this is underestimated because longer thinking can make a model more capable at reasoning tasks (09:35-09:59, 20:40-21:14).
Reasoning models: Models designed to spend more computation before responding. Brown treats them as the point where single-number benchmark reporting clearly breaks, because their performance depends heavily on the allowed inference budget (21:19-21:32).
Chain-of-thought prompting: A prompting style that asks a model to reason through intermediate steps. Brown says the single-score convention had already stopped holding once chain-of-thought prompting improved benchmark performance (21:19-21:25).
Score-versus-cost evals: Benchmark reporting that plots performance against inference, compute, or dollar cost instead of publishing only one score. Brown praises ARC for moving in this direction and says it should apply to reasoning-heavy benchmarks (21:39-21:54).
GPQA: A graduate-level science question-answering benchmark often used to summarize a model's scientific reasoning capability. Brown uses it as an example of a benchmark that is usually reduced to one score, even though reasoning-model performance depends on inference budget (21:03-21:07).
ARC / ARC-AGI: A benchmark and prize program focused on adaptation to novel tasks. In this conversation, ARC matters both as an eval of novelty and as an example of reporting performance along a compute or cost axis (07:40-07:56, 21:39-21:54).
Generator-verifier gap: The gap between the difficulty of generating a solution and checking whether it is good. Brown leans on this idea when explaining how search plus verification can discover better algorithms or solutions when good answers are easier to recognize than invent (11:49-12:10).
Scaffolding: External orchestration around a model, such as chaining many calls, adding tools, or running multiple attempts. Brown's release-governance concern is that users can scaffold a released model into a more capable system by spending much more inference than the original eval allowed (22:44-23:27).
Preparedness frameworks / responsible scaling policies: Governance systems that decide when model capability crosses release or safety thresholds. Brown says these policies need to specify inference budget, because capability estimates change when more inference is allowed (22:03-22:49).
Pretraining and human priors: Pretraining on large-scale internet data gives models a broad prior from human-generated information. Brown says the evidence currently favors these useful priors over learning everything from scratch with RL (18:17-18:59).
Reinforcement learning from scratch: Training primarily by trial and feedback rather than human data. Brown says it is theoretically plausible, but not likely to be the right path in the foreseeable future given the effectiveness of pretraining (18:02-18:59).
AlphaGo / AlphaZero: DeepMind game systems used as examples of human-prior tradeoffs. AlphaGo used human data and search; AlphaZero removed the human-data prior in games like Go, but Brown says no-prior approaches have been much less practical in broader action spaces such as StarCraft (17:01-17:54).
Monte Carlo tree search: A search method that explores possible future paths before choosing an action. It appears in the discussion as a key form of test-time computation in game-playing systems and as an analogy for deeper reasoning through program space (17:10-17:14, 13:47-13:58).
H100 / GPU access: High-end AI accelerators are the practical currency of frontier research. Brown and the host use per-researcher GPU access as a way to explain why industry labs can pursue scale-heavy research that many universities cannot (27:17-30:53).
Third-party evals: Outside benchmarks and evaluations created by researchers beyond frontier labs. Brown says they remain a high-leverage academic contribution because labs such as OpenAI can learn from strong external evals without the external team needing frontier-scale training budgets (32:07-34:16).
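
As a concrete picture of score-versus-cost reporting, the sketch below stores each evaluated configuration as a (cost, score) point, then answers two questions: which points are worth reporting, and what score a given budget actually buys. The `EvalPoint` type, the function names, and all numbers are hypothetical and are not ARC Prize's reporting format or results.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass(frozen=True)
class EvalPoint:
    cost_usd: float  # inference spend per task for this configuration
    score: float     # benchmark score in [0, 1]


def efficient_frontier(points: List[EvalPoint]) -> List[EvalPoint]:
    """Keep points that beat every configuration costing the same or less."""
    frontier = []
    best = float("-inf")
    # Sort by cost, breaking ties so the higher score comes first.
    for p in sorted(points, key=lambda p: (p.cost_usd, -p.score)):
        if p.score > best:
            frontier.append(p)
            best = p.score
    return frontier


def score_at_budget(points: List[EvalPoint], budget_usd: float) -> Optional[float]:
    """Best score reachable without exceeding the budget, or None."""
    affordable = [p.score for p in points if p.cost_usd <= budget_usd]
    return max(affordable) if affordable else None


# Invented results for one model run under different inference settings.
runs = [
    EvalPoint(0.1, 0.30),
    EvalPoint(1.0, 0.45),
    EvalPoint(1.0, 0.40),
    EvalPoint(10.0, 0.44),
    EvalPoint(100.0, 0.70),
]

frontier = efficient_frontier(runs)
for p in frontier:
    print(f"${p.cost_usd:>6.2f} per task -> score {p.score:.2f}")
print("score at a $10 budget:", score_at_budget(runs, 10.0))
```

Reporting the frontier instead of one number makes the budget explicit: in this invented data a reader sees 0.45 at a $10 cap and 0.70 at $100, rather than a single headline score with no stated spend.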
