Some kind of quantitative estimate of the effects on the effective pool of cognitive labour would be good.
If an H100 equivalent (part of a cluster running the latest public models) has roughly the cognitive capability of a junior engineer, and the capacity of ten of them (runs 5x as fast, for twice as long per day), then 16M H100e is a workforce increment of 160M CS grads. Next year that could be 1600M. But that "10x a CS grad" figure could be 100x too large or 10x too small, I don't know. (And I know I don't know enough to estimate it.)
(Of course over 90% of capacity is currently used for training next-gen models rather than inference, but the principle holds.)
Riffing off Dario's "country of geniuses in a data center": barring disasters, the early visible effects of AI will come from a continent of college grads in a thousand data centers. How big is that continent? Australia (30M people) or Asia (5,000M people)?
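For what it's worth, the back-of-envelope arithmetic in this comment can be written out as a small Python sketch. The function name `effective_workers` and its parameters are made up for illustration. The inputs (16M H100e, 5x speed, twice the hours, an error band from 100x too large to 10x too small, and the two population yardsticks) are taken from the comment, and writing them as code does not make the estimate any less of a guess.

```python
def effective_workers(h100e, speedup=5, hours_ratio=2, correction=1.0):
    """Junior-engineer equivalents implied by a fleet of H100 equivalents.

    speedup and hours_ratio are the comment's assumptions (5x as fast,
    working twice as long per day); correction rescales the resulting
    "10x a CS grad" figure to explore how wrong it might be.
    """
    per_unit = speedup * hours_ratio * correction
    return h100e * per_unit


H100E = 16_000_000  # fleet size used in the comment

central = effective_workers(H100E)                  # 10x per H100e
low = effective_workers(H100E, correction=1 / 100)  # figure is 100x too large
high = effective_workers(H100E, correction=10)      # figure is 10x too small

# Population yardsticks quoted in the comment, in people.
AUSTRALIA = 30_000_000
ASIA = 5_000_000_000

for label, workers in [("low", low), ("central", central), ("high", high)]:
    print(f"{label:>7}: {workers / 1e6:,.1f}M workers, "
          f"{workers / AUSTRALIA:.2f}x Australia, {workers / ASIA:.3f}x Asia")
```

On these inputs the range runs from about 1.6M to 1,600M workers, i.e. from well under one Australia to roughly a third of Asia, which is mostly a statement about how wide the error bars are.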
Agreed that this is _very_ hard to estimate... and in any case the ratio of human to AI productivity can't be expressed in a single number. Reason #17 (of 99999) that the future is hard to predict.
My understanding is that the percentage of frontier lab compute devoted to R&D is closer to 50% than 90%? Something like (_very_ handwavy): 10% for the training run that produces the next shipping model, 40% for other R&D activities (experiments, failed training runs, etc.), and 50% for serving customers. I don't have a specific source offhand, but to the best of my recollection, the 50/50 split is consistent with occasional reports on finances at OpenAI / Anthropic, and I've seen comments from researchers that most of the R&D compute is used for things other than final training runs. Do you have a source for the R&D share being closer to 90%?
No, only an off-hand comment by Zvi.
Edit: yes, AIUI over half the R&D is not directly training, but it is testing and otherwise making the next gen fit for purpose, so not available for customers to use.
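To see how much the disputed split matters, here is a minimal sketch that applies both versions to the 16M H100e figure from the top-level comment. The names `splits` and `serving_h100e` are hypothetical, the 10/40/50 split is the handwavy one proposed above, and the 90/10 split is a crude reading of the "over 90%" claim; neither is sourced.

```python
FLEET_H100E = 16_000_000  # figure from the top-level comment

# Percent of compute by use. Both splits are unsourced guesses from the thread.
splits = {
    "10/40/50 guess": {"final training run": 10, "other R&D": 40, "serving": 50},
    "90% R&D claim": {"R&D": 90, "serving": 10},
}


def serving_h100e(split, fleet=FLEET_H100E):
    """H100 equivalents left for customer inference under a given split."""
    assert sum(split.values()) == 100, "shares must sum to 100%"
    return fleet * split["serving"] // 100


for name, split in splits.items():
    rd_share = 100 - split["serving"]
    print(f"{name}: R&D {rd_share}%, serving {serving_h100e(split):,} H100e")
```

The two assumptions differ by 5x in inference capacity (8M versus 1.6M H100e), which is small next to the 1000x spread in the per-chip productivity guess.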
Good essay, looking forward to future ones. Three objections to your argument here — curious for your thoughts.
1/ On RSI: you suggest this may be happening, but what leads you to believe this? Section 9.1.3 of GPT 5.5’s system card is remarkably candid that they’re seeing effectively none of this, which is a striking admission if part of their valuation is premised on the inevitability of takeoff. I.e., “Trust me, we have to run in the red because we have to get to RSI first!”
2/ On China: you don’t cover the price and margin compression that “good enough” open-source models create for frontier labs. If I can get Sonnet-level performance out of DeepSeek V4 for pennies on the dollar per 1M output tokens compared to frontier labs, why would I pay extra if all I really need is Sonnet-level performance? Good enough is good enough for what I expect will be the vast majority of consumer and enterprise workloads (basic text summarization and retrieval stuff). There are really diminishing marginal returns to frontier performance unless you believe they’re getting closer to takeoff. China’s play (IMO the smart one) is to just structurally undercut US frontier lab margins, ushering in the next winter and then claiming dominance. But they don’t believe in AGI, so that’s not their goal.
3/ On rates of improvement: a lot of this seems premised on either 1 or 2 not being a structurally persistent issue. But even if continued linear improvement WERE possible, given diminishing marginal returns, why would investors front the capital required to fund increasingly less profitable training runs? Even if there’s more low-hanging fruit to pluck, why should we assume there will be private appetite to pluck it?
Finally, as someone who has led private companies, I’m sure you’re familiar with the importance of messaging discipline on the road to IPO or acquisition. So wouldn’t you agree that it’s a little suspicious for both OpenAI and Anthropic to publicly claim they’ll void any insider share selling in forward contracts or SPVs? Like, what’re they so afraid of here? Why feel compelled to make this announcement at all? Unless insiders and their creditors systematically believe these companies are overvalued and see the unit economics writing on the wall. OpenAI is trading at a discount in secondary markets already and MSFT is almost certainly going to programmatically trim their ~30% stake. How does all this not spell trouble?
Great questions! Here are my thoughts:
1) Agreed that we're not in full-blown RSI yet (and I don't expect that we will be, in any dramatic fashion, for a while yet). I said "AI is indeed starting to accelerate its own development", which is technically RSI but I didn't mean to imply any sort of rapid takeoff. I base this primarily on self-reports from lab staff who say that coding agents are giving them a huge productivity boost, cross-checked against my own more general understanding of what agents can (and can't) do currently. My guess is that many lab employees are experiencing substantial productivity gains (potentially 2x or more), many are not, and that overall this could be adding up to a modest gain in overall capabilities advances, on the rough order of 10% faster progress (bearing in mind that researcher productivity is just one contributor to progress). A 10% speedup would hardly be decisive but it would mean that we're moving out of the zone where RSI is irrelevant.
The rapid pace of feature releases on Claude Code and Codex likely provides a visible example of something that wouldn't be happening without an internal productivity boost. (And the production stability issues at Anthropic might, possibly, suggest a downside to that.)
2) Can you, in fact, get Sonnet-level performance for pennies on the dollar? I've seen mixed reports on this, and have not done a deep dive myself, but I see a lot of stories of low prices turning out to be backed by overly-quantized models that don't really work very well, DeepSeek offering cheap APIs that you can't actually use because they don't have much capacity, etc.
From first principles, I would expect that if there were a large market for a better price/performance tradeoff than currently available from the leading US labs, then those labs would find a way to fill it. I don't see any durable Chinese advantage here; they're just targeting a niche. If that niche were to grow then the US labs could target it as well, and they still have lots more compute for training, and much more capacity for inference. And so long as we're in a compute crunch (which we might be, clear through to the singularity), it's not clear to me that we should expect pricing pressure to emerge.
3) Why will training runs become increasingly less profitable? The current Anthropic revenue ramp doesn't seem to point in that direction. If it did happen, then yes perhaps the pace of investment, and thus the pace of progress, might slow down, but not stop.
Insider share selling: I have no insight into this.
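One way to make the arithmetic in point 1 concrete is an Amdahl's-law-style toy model: if only a fraction of what drives progress gets faster, the overall gain is diluted. This framing is an added assumption rather than something claimed in the reply, and the function names are hypothetical. The 2x and 10% figures are the guesses from point 1.

```python
def overall_speedup(accelerated_fraction, task_speedup):
    """Amdahl's law: overall speedup when only part of the work gets faster."""
    remaining = 1.0 - accelerated_fraction
    return 1.0 / (remaining + accelerated_fraction / task_speedup)


def fraction_needed(target_overall, task_speedup):
    """Invert Amdahl's law: share of work that must speed up to hit a target."""
    return (1.0 - 1.0 / target_overall) / (1.0 - 1.0 / task_speedup)


# Guesses from point 1: some staff see ~2x, overall progress is ~10% faster.
share = fraction_needed(target_overall=1.10, task_speedup=2.0)
print(f"share of progress-limiting work sped up 2x: {share:.1%}")
print(f"check: overall speedup = {overall_speedup(share, 2.0):.2f}x")
```

Under this toy model, a 2x boost applied to a bit under a fifth of the progress-limiting work is enough for a 10% overall speedup, which is at least compatible with "many are seeing big gains, many are not".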
Good answers, and thank you for taking the time to respond.
We’ll just have to wait and see. Once we get S-1s and lockups expire, I expect lots of today’s ambiguity to start falling away.
If we suppose:
Weak RSI: Systems that can accelerate AI researchers
Strong RSI: Systems that can accelerate themselves, recursive-feedback-loop style, even accounting for compute bottlenecks, time taken to run experiments or interact with the real world, intelligence bottlenecks, jaggedness, etc.
I would categorize them as two different things parading under the RSI name. I think the claim of strong RSI is undersupported by current evidence as it stands. Like, on the same level as the existence of God, aliens, time travel, etc. More sci-fi than sci-fi.
I think we need extraordinary evidence to suppose it. But I think it is theoretically possible. We need to start collecting more information and more transparency (the system cards have been bearishly transparent in that regard, as it pertains to the lack of significant leaps in autonomous research).
It also could be somewhere in between, but that is not interesting, I guess.
This piece gives me so much clarity on the AI trajectory. Thanks for writing it, Steve!
I feel like the Bitter Lesson ended, and I'm the only one who noticed. The Bitter Lesson says that all you need for AI to improve is scale, and that cleverness does not matter. If that is true, how did Anthropic ever catch up? Why isn't xAI killing it? I am not saying scale doesn't matter: China's struggles speak to that, especially the need for co-located compute to train larger models in meaningful timeframes. But it has gone from being the only thing that matters to a necessary but not sufficient condition for making the best models.
Interesting question!
I'm not qualified to answer properly, but my understanding is that the Bitter Lesson doesn't deny any role for cleverness; it just says you shouldn't try to build a lot of detailed cleverness into the structure of your model. There's still value in finding higher-quality training data, figuring out what kinds of data are worth including, tuning hyperparameters, etc. etc. etc. You can't just smash a giant pile of compute and data on the floor and expect an AGI to arise (as you note, xAI proved).
More succinctly: spend your cleverness on crafting an environment in which the model can learn, don't spend it trying to instill specific ideas directly into the model.
I feel like the existence of harnesses already makes the distinction about building cleverness into the model obsolete. They strike me as independent artifacts now, and harnesses demand cleverness.
¯\_(ツ)_/¯
Might have to ask Sutton to weigh in...
Really interesting framing. The cost shift actually makes the organizational challenge harder, not easier. When access was cheap, everyone could experiment. As agents become more compute-intensive, the advantage shifts toward organizations that know which workflows are worth operationalizing and have the judgment and systems to support them well. Feels increasingly like a readiness and adoption challenge as much as a technology one.