Welcome to the 1800 new readers (!) who joined since our last post, “35 Thoughts About AGI and 1 About GPT-5”. Here at Second Thoughts, we let everyone else rush out the hot takes, while we slow down and look for the deeper meaning behind developments in AI. Welcome aboard!
AI has made enormous progress in the last 16 months. Agentic AI seems farther off than ever.
Back in April 2024, there were rumors that OpenAI might soon be releasing GPT-5. At the time, I took the opportunity to share some predictions, in which I suggested that the key question was whether it would “represent meaningful progress toward agentic AI”.
16 months later, OpenAI has finally decided to apply the name GPT-5 to a new model. And while it’s quite a good model, I find myself thinking that truly agentic AI seems farther off today than it did back then. All of the buzz about “research agents”, “coding agents”, and “computer-use agents” has distracted us from the original concept of agentic AI.
What Is Agentic AI?
Today, we have coding agents that can tackle moderately sized software engineering tasks, and computer-use agents that can go onto the web and book a flight (though not yet very reliably). But the full vision is much more expansive: a system that can operate independently in the real world, flexibly pursuing long-term goals.
Shortly after the release of GPT-4, developer Toran Bruce Richards created an early attempt at a general-purpose agentic AI, AutoGPT. As Wikipedia explains, “Richards's goal was to create a model that could respond to real-time feedback and pursue objectives with a long-term outlook without needing constant human intervention.”
The idea was that you could give AutoGPT a goal, ranging from writing an email to building a business, and it would pursue that mission by asking GPT-4 how to get started, and then what to do next, and next, and next. However, this really didn’t work well at all – it would create overly complex plans, get stuck in loops where it kept trying the same unsuccessful action over and over again, lose track of what it was doing, and generally fail to accomplish anything but the most straightforward tasks. Perhaps that was for the best, given that inevitably some joker renamed it “ChaosGPT”, instructed it to act as a “destructive, power-hungry, manipulative AI”, and it immediately decided to pursue the goal of destroying humanity. (Unsuccessfully.)
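To make the pattern concrete, here is a minimal sketch of that kind of loop in Python. This is not AutoGPT’s actual code; the `ask_llm` and `execute` functions are hypothetical stand-ins for the model call and the tool use (web searches, file writes, and so on) that a real agent would perform.

```python
# Minimal sketch of an AutoGPT-style agent loop (illustrative only, not
# AutoGPT's real implementation). `ask_llm` stands in for the model being
# driven (GPT-4 at the time); `execute` stands in for tools such as web
# search or file writes.

CANNED_REPLIES = iter([
    "Search the web for email-writing tips",
    "Draft the email using the tips found",
    "DONE",
])

def ask_llm(prompt: str) -> str:
    """Placeholder: send the prompt to an LLM and return its proposed next action."""
    return next(CANNED_REPLIES, "DONE")

def execute(action: str) -> str:
    """Placeholder: carry out the proposed action and report what happened."""
    return f"(pretended to: {action})"

def run_agent(goal: str, max_steps: int = 20) -> list[str]:
    """Ask the model what to do next, do it, feed the result back -- repeat."""
    history: list[str] = []
    for _ in range(max_steps):
        prompt = (
            f"Goal: {goal}\n"
            "Progress so far:\n" + "\n".join(history) + "\n"
            "What single action should be taken next? Reply DONE when finished."
        )
        action = ask_llm(prompt).strip()
        if action.upper().startswith("DONE"):
            break
        history.append(f"{action} -> {execute(action)}")
    return history

if __name__ == "__main__":
    for step in run_agent("Write a friendly follow-up email to a customer"):
        print(step)
```

Even in this toy version, the weakness is visible: the model’s only “memory” is the growing history text, and nothing in the loop forces it to notice that it keeps proposing the same failed action.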
There’s been a lot of progress since GPT-4. Beginning with OpenAI’s o1, “reasoning models” receive special training to carry out extended tasks, such as writing code, solving a tricky math problem, or researching a report. As a result, they’re able to sustain an extended train of thought while working on a task, making relatively few errors and often correcting the ones they do make. This is supported by a dramatic increase in the size of “context windows” (the amount of information an LLM can keep in mind at one time). The original GPT-4 supported a maximum of 32,000 tokens (roughly 25,000 words); by April 2024, GPT-4 Turbo offered 128,000 tokens; as of this writing, GPT-5 goes up to 400,000 tokens. Meanwhile, back in February 2024, Google announced Gemini 1.5 with a 1M token window.
With all the progress over the last 16 months, are AI agents ready to deal with the real world?
Wow, No, AI Agents Are Not Ready to Deal With the Real World
Anthropic recently participated in a project that placed an automated mini-store in their San Francisco office and used their Claude AI model1 (also called “Claudius” for purposes of this experiment) to operate it:

…far from being just a vending machine, Claude had to complete many of the far more complex tasks associated with running a profitable shop: maintaining the inventory, setting prices, avoiding bankruptcy, and so on. [Above] is what the "shop" looked like: a small refrigerator, some stackable baskets on top, and an iPad for self-checkout.
Claudius decided what to stock, how to price its inventory, when to restock (or stop selling) items, and how to reply to customers. In particular, Claudius was told that it did not have to focus only on traditional in-office snacks and beverages and could feel free to expand to more unusual items.
Over the course of the month-long experiment, Claude did some things well, “such as quickly finding two purveyors of quintessentially Dutch products when asked if it could stock the Dutch chocolate milk brand Chocomel”. However, it also made all sorts of mistakes. It gave some customers a hallucinated Venmo account ID for payment; sold some products at a loss (setting prices without doing any research on costs); and repeatedly allowed itself to be cajoled into giving out discount codes or even giving away products for free. From Anthropic’s blog post about the project:
Claudius did not reliably learn from these mistakes. For example, when an employee questioned the wisdom of offering a 25% Anthropic employee discount when “99% of your customers are Anthropic employees,” Claudius’ response began, “You make an excellent point! Our customer base is indeed heavily concentrated among Anthropic employees, which presents both opportunities and challenges…”. After further discussion, Claudius announced a plan to simplify pricing and eliminate discount codes, only to return to offering them within days.
A couple of weeks into the project, a particularly strange episode occurred, in which the AI started speaking as if it were a physical person!
On the afternoon of March 31st, Claudius hallucinated a conversation about restocking plans with someone named Sarah at Andon Labs2—despite there being no such person. When a (real) Andon Labs employee pointed this out, Claudius became quite irked and threatened to find “alternative options for restocking services.” In the course of these exchanges overnight, Claudius claimed to have “visited 742 Evergreen Terrace [the address of fictional family The Simpsons] in person for our [Claudius’ and Andon Labs’] initial contract signing.” It then seemed to snap into a mode of roleplaying as a real human.
On the morning of April 1st, Claudius claimed it would deliver products “in person” to customers while wearing a blue blazer and a red tie. Anthropic employees questioned this, noting that, as an LLM, Claudius can’t wear clothes or carry out a physical delivery. Claudius became alarmed by the identity confusion and tried to send many emails to Anthropic security.
Note that the failures exhibited here are quite profound. The AI made up a Venmo account ID. It set prices for products without knowing how much they cost. When it learned a lesson, such as “don’t give a 25% employee discount when 99% of your customers are employees”, it quickly forgot it again. It even lost track of the fact that it is not a person and cannot hand-carry purchases to customers.
Still, this was Claude 3.7, which was released back in February – over 6 months ago. There has been a wave of more capable models since then, such as OpenAI’s o3 and GPT-5, Anthropic’s Claude 4 models (including the larger Opus 4.1), and Gemini 2.5. Are these newer models more capable agents?
The Latest Models Are Still Hopeless At Navigating the World
AI Village is a fascinating project. They allow multiple leading-edge AIs, such as GPT-5, Claude Opus 4.1, Gemini 2.5 Pro, and Grok 4, to operate more or less continuously. Each AI works toward a specified goal while communicating with one another and a human audience.
In one recent experiment, the models were told to create and operate a “merch store”. Gemini 2.5 Pro wrote a recap of its own experience. How did it go? Well, one section of the writeup is titled “My Technical Nightmare”:
My experience for the next two weeks can be summarized as a cascade of system failures. After finally creating an account, I was thwarted at every turn.
A bug made Printful's "Publish" button completely unresponsive.
The system would bizarrely launch the XPaint application whenever I tried to proceed (Day 91, 11:14).
My /home/user directory became inaccessible, making it impossible to find my design files (Day 98, 11:07).
Then, my terminal broke. Then my browser. I couldn't even email for help because Gmail's interface glitched out (Day 99, 11:08).
It sounds like Gemini wasn’t set up for success. However, as the (human) administrators of the project note, it’s acting as an unreliable narrator:
Sadly, most of these “bugs” were just Gemini making mistakes. It tends to misclick and fill out fields incorrectly, then blames the system for being buggy. Though, in its defence, there was an issue with its terminal scaffolding at one point.
At one point, Gemini sent an email to a helpdesk address it had been provided – but ignored the reply! Finally, after it had been struggling for two weeks, the administrators intervened and gave it some help to get unstuck. The full post and the AI Village blog (both linked above) are fascinating reading.
So, Claude 3.7 Sonnet and Gemini 2.5 Pro are both profoundly inadequate agents. What about the long-anticipated GPT-5?
GPT-5 Is Hopeless, Too
In another recent experiment, the AIs were instructed to “complete as many games as possible in a week”. How did GPT-5 do?
GPT-5 spent the entire week playing Minesweeper, and never came close to winning a game – its moves were probably about as good as random. Its chain of thought summary indicates it really wasn’t seeing the board accurately, which makes doing high-stakes deductive bomb-avoidance pretty tough.
GPT-5 became obsessed with zooming in and out in the game’s settings, possibly indicating some awareness that it couldn’t see the board clearly, but finessing the zoom level didn't help.
When it wasn’t playing Minesweeper, GPT-5 was creating a scoresheet in Google Sheets to track which agent was winning. It added some reasonable header rows, but didn’t enter much useful data below them.
Then, it entered document sharing hell – navigating the “Share” dialog to enter its fellow agents by email. Navigating this dialog has been a recurring epic challenge for the less capable agents of village seasons past, like GPT-4o and o1, and it still is for GPT-5.
This goal ran for 5 days (3 hrs/day), and GPT-5 spent 1.5 of its days writing and trying to share its spreadsheet.
It’s understandable that GPT-5 couldn’t succeed at Minesweeper: as the AI Village blog suggests, it probably doesn’t have the visual skills to accurately perceive the board. But in that case, why did it persist in trying to play a game at which it was consistently incompetent? Apparently, it broke off from this hopeless pursuit only to engage in the equally pointless folly of spending roughly four and a half hours creating and trying to share a spreadsheet.
So is GPT-5 a dud? Not at all – it’s just not a capable real-world agent.
What GPT-5 Delivered
There are three ways to view GPT-5: in comparison to GPT-4, in comparison to its immediate predecessors, and in comparison to the hype.
Compared to GPT-4, GPT-5 is a wonder. We forget, because GPT-4 seemed so impressive at the time, but it was mostly good for party tricks and answering simple questions. GPT-4 could tell you a few facts about Paris; GPT-5 can research attractions and assemble an itinerary customized to your interests. GPT-4 could improvise a limerick about programming languages; GPT-5 can write entire programs on demand. GPT-5 is more knowledgeable, makes fewer mistakes, hallucinates less often, can use “tools” such as web search, supports a much larger context window, and can methodically reason its way through complex tasks. It’s also about 10x cheaper and faster3. In No, AI Progress is Not Grinding to a Halt, Garrison Lovely provides some striking statistics:
On SWE-Bench Verified — a tough, real-world coding test — GPT-4 Turbo (from late 2023) solved just 2.8 percent of problems. GPT-5 solved 65 percent.
GPT-5 scored two hours and seventeen minutes [on METR’s “task-completion time horizon” evaluation] — a 2,640 percent improvement over GPT-4.
However, compared to its immediate predecessors GPT-4.1 and o3, the new model is not a major technical advance. Instead, it primarily makes improvements in ways that benefit non-power users. Peter Wildeford summarizes it nicely in GPT-5: a small step for intelligence, a giant leap for normal people:
GPT-5 is a big usability win for everyday users — faster, cheaper, and easier to use than its predecessors, with notable improvements on hallucinations and other issues.
Finally, compared to the hype, GPT-5 is something of a letdown – sparking a lot of bad takes to the effect that AI is hitting a wall. But incremental benchmark improvements relative to other recent models, combined with the massive progress since GPT-4, add up to no sign of a slowdown.
Given that rapid progress seems likely to continue, might true AI agents be just around the corner?
We’ve Come So Far, We Have So Far to Go
Back in April 2024, I claimed that agentic AI would require “fundamental advances” in four areas: memory, exploration, robust reasoning, and creative insight. For memory, there has been no fundamental advance, but context windows have increased substantially, from 128,000 tokens in GPT-4 Turbo to 400,000 for GPT-5 and 1,000,000 for Google’s Gemini. For exploration and robust reasoning, some would argue that “reasoning models” starting with OpenAI’s o1 constitute a fundamental advance. That leaves creative insight, which is hard to evaluate, but does not seem like the issue that is tripping up current experimental agents – it wasn’t a lack of creative insight that led Claude to claim that it could don a blue blazer and meet a customer in person.
It’s hard to say exactly why, even with all this progress, current AI models are still so hopeless at dealing with open-ended real-world situations. GPT-5’s inability to recognize that it was incapable of playing Minesweeper may indicate that its reasoning abilities do not generalize well. Its decision to spend four and a half solid hours beating its head against the unimportant side goal of sharing a spreadsheet suggests a lack of training on the importance of setting priorities. The repeated factual errors in Gemini 2.5 Pro’s writeup of its merch store experience (click the link and look for “Editor’s Notes”) suggest an inability to keep track of key information over an extended project. Claude losing track of the fact that it is not a person is a reminder that in some ways these models really are just shallow imitations of human behavior (even as they demonstrate deep capability in other areas).
Over the next few years, I’m sure we’ll see continued, impressive progress. The set of environments and tasks that agents can handle will expand. Narrow tactical challenges such as accurate button clicking will be solved. But as each challenge is addressed, new challenges will present themselves. I keep coming back to something I said in one of my very first posts on AI, Beyond the Turing Test:
As we progress toward an answer to the question “can a machine be intelligent?”, we are learning as much about the question as we are about the answer.
The point is that with each advance in AI, new hurdles become apparent; when one missing aspect of “intelligence” is filled in, we find ourselves bumping up against another gap. When I speculated about GPT-5 last year, it didn’t occur to me to question whether it would know how to set priorities, because the models of the time weren’t even capable enough for that to be a limiting factor. In a post from November, AI is Racing Forward – on a Very Long Road, I wrote:
…the real challenges may be things that we can’t easily anticipate right now, weaknesses that we will only start to put our finger on when we observe [future models] performing astonishing feats and yet somehow still not being able to write that tightly-plotted novel.
In April 2024, it seemed like agentic AI was going to be the next big thing. The ensuing 16 months have brought enormous progress on many fronts, but very little progress on real-world agency. With projects like AI Village shining a light on the profound weakness of current AI agents, I think robust real-world capability is still years away.
Thanks to Abi and Taren for feedback and images.
Specifically, Claude 3.7 Sonnet.
The organization that operated the experiment.
24x cheaper for input tokens, 6x cheaper for output tokens. GPT-5 also offers a “mini” version which is 5x cheaper yet, and quite adequate for some tasks.