20 Comments
Saty Chary:

Hi Steve, nice post! I do want to point out the anthropomorphizing throughout: thinking, became alarmed, claimed, became quite irked, threatened, etc., etc. In reality, none of that could have happened, because there was 'no one' called Claudius! Of COURSE you knew that, but ascribing these human terms to math calculations only adds to the overall perception that they -are- intelligent, human-like, capable of dominating us, etc., etc., etc.

Tom Dietterich:

It is great to have these examples laid out in one place -- thank you! All of them reveal an inability to identify and track the relevant state of the world, such as net profit, cost of inventory, and so on. As you point out, it also reveals yet again the absence of meta-cognition: in this case its own time management (as well as its own identity and task). Finally, we see that these systems may know many things (e.g., that discounts are a bad idea), but can't reliably apply that knowledge when taking action.

Abhay Ghatpande:

Hi Steve, I've been reading posts by Gary Marcus, Peter Voss, Srini P., and others, and their opinion AFAIK is that agents would need "world models" to operate. (Not the world models introduced recently as part of gaming/video generation, but an actual "model" of the world.) I've not been able to find more info on exactly what they mean here, because the supposed leader in this space, aigo.ai, has zero info on its site. If this is true, it leads me to believe that agents would need to be highly specialized and narrowly focused on a task, because building a broad, general-purpose model (of concepts and their relationships) is close to impossible.

I would love to see a hybrid agent that combines both LLMs and Cognitive AI. If you are aware of any such efforts, please point them out. Thank you for your (second) thoughtful post and efforts to educate us.

Steve Newman:

It's not clear to me why building a world model would require specialization – after all, people (seem to) build broad models of the world. For many domains, it's not even clear to me that a narrow model is possible, because everything in the world around us connects to everything else.

World models do seem important. I believe there is debate as to the extent to which LLMs might be developing internal world models.

Abhay Ghatpande:

That's such an interesting POV. Thank you.

I was thinking that it's not about our world, but the agent's world -- what it's built for, what its remit is. And as you say, since everything is connected in the physical world, it would be almost impossible to capture all the relationships. So by limiting the "world" (space) that an agent needs to operate in and constraining the entities it would interact with, it could be realistic to define the dependencies between them. That's why I thought that specialized agents would be required for autonomy.

Steve Newman:

It would depend on the task. For an agent that writes Python code, the "world" could perhaps be fairly narrow. For an agent that plans business strategy, the "world" includes customers, employees, competitors, etc. and it seems like you need a pretty broad view. Then the question becomes how many useful tasks fit into the narrow-world vs. broad-world scenarios.

Seth:

Gary Marcus is very committed to a literal, explicit, symbolic world model; but AFAIK there's no reason a world model can't be "implicitly encoded" in the weights or dynamics of a neural network. It's just very, very hard to learn the "correct" world model from observational data.

Allison:

Fascinating rundown. It also highlights the giant gap between the 'AI will replace you' corporate hypesters and the reality of real-world decision making. Do the CEOs really think humans are this ineffective?

Marginal Gains:

I'm thinking more and more about the progress made in the last year, and I believe we are expecting too much from general-purpose models. Most real-world tasks don’t need broad intelligence; they need focused competence. In most cases, it is like bringing a fire hose to water a houseplant: too much pressure, not enough control. General-purpose LLMs often feel like an over-engineered solution to most practical problems. We should build small, specialized models for specific domains and let a general model handle orchestration only when cross-domain reasoning is required.

- Use specialists for perception, parsing, and domain-specific decision-making with clear, structured state and verifiable constraints.

- Wrap them with simple verifiers and uncertainty checks to ensure reliability, and escalate to humans when needed.

- Reserve general models for coordination, open-ended dialogue, and genuinely multi-domain problems.

This systems approach—specialists for depth, a generalist for glue—delivers better performance, lower cost, and higher trust than forcing a single general model to do everything.
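
As a rough illustration of what I mean, here is a minimal Python sketch of that routing pattern: dispatch to a specialist when one fits, check a confidence score from a verifier, escalate to a human below a threshold, and fall back to a general model for everything else. All names (invoice_specialist, chemistry_specialist, CONFIDENCE_FLOOR) are invented and the model calls are stubs, not any real API.

```python
# Illustrative sketch of the "specialists for depth, generalist for glue" idea.
# Every name here is invented and every model call is a stub -- nothing below
# corresponds to a real API or product.

from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Result:
    answer: str
    confidence: float  # 0.0-1.0, reported by the specialist or an external verifier


def invoice_specialist(task: str) -> Result:
    # A narrow model/tool with structured state and verifiable constraints.
    return Result(answer=f"[invoice model] {task}", confidence=0.95)


def chemistry_specialist(task: str) -> Result:
    return Result(answer=f"[chemistry model] {task}", confidence=0.90)


def generalist(task: str) -> Result:
    # Broad model reserved for coordination and genuinely cross-domain problems.
    return Result(answer=f"[general model] {task}", confidence=0.70)


SPECIALISTS: dict[str, Callable[[str], Result]] = {
    "invoice": invoice_specialist,
    "chemistry": chemistry_specialist,
}

CONFIDENCE_FLOOR = 0.80  # below this threshold, escalate to a human


def route(task: str, domain: Optional[str]) -> str:
    # Use a specialist when one matches the domain, otherwise the generalist;
    # escalate to a human whenever confidence is too low.
    handler = SPECIALISTS.get(domain, generalist)
    result = handler(task)
    if result.confidence < CONFIDENCE_FLOOR:
        return f"ESCALATED TO HUMAN: {task!r} (confidence={result.confidence:.2f})"
    return result.answer


if __name__ == "__main__":
    print(route("reconcile March invoices", "invoice"))  # handled by the specialist
    print(route("plan next quarter's strategy", None))   # no specialist fits -> generalist
```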

Steve Newman:

Possibly! I wonder how many tasks are sufficiently "specific" for small, specialized models to work. It's certainly a worthwhile experiment that will get tried, so we'll find out.

Marginal Gains:

When I talk about tasks, I’m referring to them at the domain level. For example, we already have enough coding models. We need models tailored to specific fields like physics, chemistry, biology, or even combinations of closely related domains, when collaboration between fields (e.g., chemistry and biology, or physics and mathematics) makes specialization more efficient than relying on a general-purpose model.

A good metaphor for the inefficiency of generalization is: "Bringing the whole library when you only need one book."

Rather than creating models that do everything, we should focus on specialized models designed for specific domains—or modular systems that allow closely related fields to work together seamlessly. For instance, a biochemistry model that combines chemistry and biology expertise, or a physics-math model for solving advanced physical problems.

While I don’t have concrete proof, I suspect that general-purpose models are more prone to "hallucinations," or providing incorrect answers by filling gaps with unrelated or inaccurate knowledge from other domains. They may also overanalyze or overgeneralize, making them less effective for domain-specific tasks.

By contrast, specialized models are likely to ensure greater efficiency, accuracy, and relevance, avoiding the pitfalls of overgeneralization.

I’ve observed a similar tendency, even among some brilliant people I work with at my day job. Sometimes, they overanalyze straightforward problems by trying to fill any gaps with their extra intelligence or tacit knowledge. While they intend to be thorough, this approach can lead to unnecessary complexity when a simple solution would suffice.

We need to match the tool to the task, which often means choosing just the right book(s) from the library, rather than hauling the entire collection.

Steve Newman:

Perhaps. But much of human knowledge doesn't partition neatly. Politics relates to economics, which relates to transportation, which relates to innovation, which relates to other things. Perhaps a narrow biochemistry model would make sense, but many (most?) parts of many (most?) jobs don't fall into such tidy narrow buckets.

Even for a simple task like "summarize this email", a model with wide and deep world knowledge might do a better job.

Marginal Gains:

Most novel problems or tasks related to significant scientific discovery will likely require a general model, as will individuals working in several domains. However, most operational, day-to-day business activities don’t necessarily need such a general model. A general model better handles tasks like summarizing an email, and it makes sense for people who frequently write or respond to emails to have it. However, email or document summarization is a relatively small part of daily work for many jobs, especially with the widespread adoption of collaboration tools like Slack, Microsoft Teams, and others. The need for a general or specialized model will vary depending on the domain, organizational culture, and individual job roles.

There will always be specific tasks where a general model is ideal, but it may not be necessary for many activities. I understand the appeal of a general model—maintaining a single model instead of managing hundreds is far more efficient and manageable. However, it’s worth remembering that models like GPT-5 likely orchestrate multiple specialized models behind the scenes to address different needs. The same is probably true for other leading models, suggesting that we are already moving towards a level of specialization, even if it isn’t yet explicitly domain-specific (beyond coding models, for instance).

This trend aligns with the insights from the research paper “Survey of Specialized Large Language Models” (https://arxiv.org/pdf/2508.19667), which highlights the rise of specialized large language models (LLMs) tailored to specific domains such as healthcare, finance, law, and engineering. These models demonstrate significant performance improvements on domain-specific benchmarks compared to general-purpose LLMs. Examples include BioGPT for biomedical tasks, BloombergGPT for financial analysis, and Med-PaLM 2 for healthcare applications. Can they do every part of a professional job? The answer is most likely no. I hope these models will become smaller in the long run as model training and other techniques evolve to provide better cost-effectiveness and efficiency.

Marginal Gains:

Here is an article in The Economist: https://www.economist.com/business/2025/09/08/faith-in-god-like-large-language-models-is-waning

As David Cox, head of research on AI models at IBM, a tech company, puts it: “Your HR chatbot doesn’t need to know advanced physics.”

If you do not have access to it, here is the summary:

https://substack.com/@microexcellence/note/c-157136596

Seth:

Coming from a neuroscience background, it does seem like the most obvious major thing that brains have that AI agents do not is highly structured memory. AFAIK, LLM-based agents have basically "short-term" memory in their context window and "long-term" memory embedded in their trained weights, but the interaction between the two is pretty crude. Brains seem to have a small "context window"--from 1 to 6 "items", depending on who you ask--but several different types of long-term memory, and they spend a huge amount of effort deciding exactly what to move from long-term to short-term memory and vice versa.

I say this because this seems incredibly obvious, and there are much smarter neuroscientists than myself working in AI, yet it doesn't seem like there's been much progress on this front. I'm guessing "they" are working on this, but it's just very, very hard. Maybe because it requires new architecture, and not just layering things on top of a transformer?
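
For concreteness, here is a toy Python sketch of the kind of structure I mean: an explicit long-term store plus a deliberate retrieval step that decides what enters a small context window. Everything here (MemoryItem, AgentMemory, the tag matching) is invented for illustration; real systems would use embeddings and far more machinery.

```python
# Toy illustration only -- invented names, not how any real system or product works.
# Long-term memory is an explicit, searchable store, and a deliberate retrieval step
# decides what gets pulled into a small "context window".

from dataclasses import dataclass, field


@dataclass
class MemoryItem:
    text: str
    tags: set[str]


@dataclass
class AgentMemory:
    long_term: list[MemoryItem] = field(default_factory=list)
    context_window: list[str] = field(default_factory=list)
    window_size: int = 6  # roughly the "1 to 6 items" working-memory range

    def remember(self, text: str, tags: set[str]) -> None:
        # Consolidation: move information into long-term storage.
        self.long_term.append(MemoryItem(text, tags))

    def recall(self, query_tags: set[str]) -> None:
        # Retrieval: deliberately decide what enters short-term memory.
        matches = [m.text for m in self.long_term if m.tags & query_tags]
        self.context_window = matches[-self.window_size:]


if __name__ == "__main__":
    mem = AgentMemory()
    mem.remember("Net profit so far is negative", {"finance", "state"})
    mem.remember("Discounts are hurting margins", {"finance", "policy"})
    mem.remember("Customer prefers contact by email", {"crm"})
    mem.recall({"finance"})
    print(mem.context_window)  # only the finance-related items reach the window
```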

Steve Newman:

I do not have a neuroscience background, but I share your understanding/view that current LLM architectures have an impoverished memory hierarchy, that it's not obvious how to get past that within the current paradigm, and that something will have to give at some point.

It's probably relevant that people don't have the clear training vs. inference distinction that is fundamental to current LLM architectures.

Mike Newhall:

"Over the next few years, I’m sure we’ll see continued, impressive progress."

Are you?

The implicit assumption is that AI is a thing, a thing we are working on right now. A path that leads to AGI a bit down the road -- or a long way down the road -- but down the road. The same road. Or track. As in, right track. That we're on, that is, right now.

Alternative theory: AI is a misnomer. We are on the track of developing LLM technology. Or transformer-based tech, whatever you want to label it. Who approved the request to relabel it AI? LLM tech has limits. Like roads do sometimes. The ones we call "dead ends".

LLM tech is a cool new toy. It's cool because it does things that earlier tech couldn't, and those new tricks appear to land in the previously unoccupied territory between the land of "what machines can do" and the land of "what only humans can do". Granted. So, given that observation, is it absolutely obvious that all the other things humans can do, but that LLMs can't, are going to be conquered by LLM tech? Inevitably? Why? From whence cometh this confidence? Points aren't lines, from what I remember of high school geometry. Except when they are degenerate.

So if, as the hypothesis goes, we are going down a dead-end road, then the confidence to make any predictions at all about the rate of near-term progress, or about how close we are to this or that milestone that lies off that road, now has zero basis. The path we need to be on to achieve AGI is an entirely different one that lies undiscovered, somewhere out there in the yet-trackless wilderness. Beyond the limits of what LLM tech can do, the future of "AI" progress becomes entirely opaque. LLM tech is useless as a basis for predicting anything much further than what has already been achieved in 2025.

Abhay Ghatpande:

Came across aui.io, which seems to use neuro-symbolic reasoning. Great to see a commercial model available for testing and spreading awareness of possibilities other than Transformers. Looking forward to your review and opinions on AUI, Steve.

Dang:

Important to note the specific models being tested! Sounds like this experiment used plain GPT-5, not the reasoning variants or GPT-5 Pro. There's a massive difference in capabilities.

Sam:

It’s almost amusing to see mankind’s obsession with finding ways not just to automate some task (which is perfectly fine) but also to give up our ability to control outcomes. As enterprise accelerates its way towards AI and everything agentic, it’s losing sight of what the end game will look like. The bandwagon wants to move at warp speed, but no one’s calling out that it is, after all, a bandwagon. 😅
