It takes about fifteen years for a human to learn to drive a car. We do not let week-old infants get behind the wheel, as we presumably would if the "few hundred hours" figure were accurate.
Fair point!
15 years is only about 131,000 hours, and many of those are spent sleeping, sitting still in a classroom, or otherwise engaged in activities that don't have much to do with learning to drive. That comes to a small fraction of the hours Waymo cars have logged, but quite a bit more than 100, yes.
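A quick back-of-the-envelope check on those figures (the eight hours of sleep per day is my own assumed average, not a number from the comment above):

```python
# Rough back-of-the-envelope check of the figures above.
# The 8 hours/day of sleep is an assumed average, not a measured number.
HOURS_PER_YEAR = 365.25 * 24              # ~8,766

years = 15
total_hours = years * HOURS_PER_YEAR      # ~131,500, matching the ~131,000 figure
sleep_hours = years * 365.25 * 8          # ~43,800 of those spent asleep
waking_hours = total_hours - sleep_hours  # ~87,700 hours of waking experience

print(f"Total hours in {years} years:  {total_hours:,.0f}")
print(f"Waking hours (8h sleep/day): {waking_hours:,.0f}")
```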
On the other hand, I don't think it's currently the case that we could pretrain a model on a few tens of thousands of hours of general world interaction, and then quickly teach it to drive. Those first 15 years that a human spends "getting ready to learn to drive" involve fairly generic, easily-obtained data, and set them up to then efficiently learn a skill with relatively little data specific to that skill.
I think this is because humans encode concepts and AIs don't. A Waymo learns to drive by taking in information about the environment and using the car's controls to produce an optimal outcome, such as 'stay in the lane'. A teenager understands this concept intrinsically, without needing to collect any information about the environment and without needing to test any of the controls.
If there were a country where you had to drive on the line, human drivers would understand how to do that without ever seeing the road, the cars, or setting foot in the country. A Waymo would need to learn all over again.
Until we can describe new rules to a Waymo and it can use those rules to drive successfully in a completely new environment, we haven't come close to a human level of driving intelligence; we've just created a robot car that can successfully navigate a complex, but very specific, task.
Quite so ... humans can be taught rules ... how to play a game, do math, run or repair machinery, etc. ... we do not pattern match against vast amounts of data. LLMs are completely incapable of learning from rules--thus, e.g., the abysmal performance on Towers of Hanoi, or being easily led into making illegal chess moves. The whole LLM approach is barking up the wrong tree.
Sleeping is crucial to how humans learn, even if we don’t yet understand quite how it works.
Re: #3 - needing to manually pick o3 instead of the default.
I'm shocked that only a few percent of people with access were using the o3 model before the GPT-5 switcheroo.
Aside from the better answers I received from o3 vs. 4o, I got an odd satisfaction from waiting while the model reasoned. Having 4o hose me down with 1,000 tokens the instant I hit enter made me feel like I was getting a half-baked answer.
Apparently there was a huge 4o tribe who loved 4o's style. Anyone else on team o3?
I presume that the main factor here was "most people never even noticed the model picker (or didn't know what to do with it)", not "most people, after considered deliberation, preferred 4o for everything".
I agree, that's what happened. The default option rules all.
I feel kind of embarrassed for humanity imagining o3 staring dumbfounded as ~95% of humans march past to 4o, because o3 is hidden under a single submenu :)
The observation that the length of tasks AIs can complete is doubling about every seven months seems woefully incomplete. Reliability is not improving very fast at all. I think most people reading that statistic who are not in the field would assume that AI task completion means a high rate of successful task completion, but in fact it's stuck around 50%.
For tasks of a given size, the success rate is going up, no?
You're correct that people often forget that these reported task sizes are for 50% success rate, but METR has found that a similar curve holds for 80% success rates, just following behind by (a year or so? I forget). They don't have enough data to report figures for higher success rates; the general assumption is that they will also follow a similar curve, though personally I am not entirely confident of that.
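To make the two curves concrete, here is a toy model of a 7-month doubling with a lagging 80% curve; the starting horizon and the exact lag are illustrative assumptions of mine, not METR's published numbers:

```python
# Toy model of METR-style time horizons: the task length (in minutes) a model
# completes at a given success rate, assuming exponential growth.
# The starting horizon, doubling time, and lag are illustrative assumptions.

DOUBLING_MONTHS = 7        # assumed doubling time for the 50% horizon
LAG_MONTHS_80 = 12         # assumed lag of the 80% curve behind the 50% curve
H0_50 = 60.0               # assumed 50% horizon today, in minutes

def horizon_50(months_from_now: float) -> float:
    """Task length (minutes) completed with ~50% success."""
    return H0_50 * 2 ** (months_from_now / DOUBLING_MONTHS)

def horizon_80(months_from_now: float) -> float:
    """Task length completed with ~80% success: same curve, shifted by the lag."""
    return horizon_50(months_from_now - LAG_MONTHS_80)

for m in (0, 7, 14, 21):
    print(f"month {m:2d}: 50% horizon ~{horizon_50(m):6.0f} min, "
          f"80% horizon ~{horizon_80(m):6.0f} min")
```

The point of the sketch is just that a lagged curve with the same doubling time eventually covers any fixed task length, which is why people expect the 80% (and higher) thresholds to follow the 50% one.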
Yes, your intuition is correct. Model releases seem to swing between spiky, crazy-good, slightly feral models (think of o3's lies or Claude 3.7's reward-hacking behavior) and polished, incremental, reliable follow-ups (GPT-5, Claude 4). It's interesting that the spiky releases move the 50% threshold a lot, while the safer ones (such as GPT-5 with reduced hallucinations) move the 80% frontier a lot. If test-time scaling hits a wall, there seems to be more juice to squeeze on the reliability front.
"Now imagine if every single morning, you discover that you have that much catching up to do, because the AI team you’re attempting to supervise has done the equivalent of two weeks of work while you slept."
I think there's a big demand-side problem with this kind of scenario. Like what are all these AI agents trying to accomplish? Presumably they are trying to do useful work for some human being somewhere, right? And so the team of AI agents needs to at least make their work comprehensible to that human customer. There are two ways this could play out:
* Customer feedback becomes a major bottleneck. If you have a team of AI tax prep assistants, they're going to grind away at a customer's tax return for an hour or two, generate a tax return, and then spend days, if not weeks, waiting for the customer to review their work and possibly request changes.
* We develop abstractions that allow the customer to review the high-level results of AI agents' work without understanding all the details. In this scenario, I expect company employees will use many of those same abstractions to monitor the AI agents' progress and make necessary changes.
Either way, the idea that we'll have huge teams of AI agents doing stuff that no human being understands does not seem very plausible to me.
This was my attempt to describe a scenario where AIs have become sufficiently capable, reliable, general, robust, etc. that they are doing the great majority of the work, and are at least somewhat self-directed. (At least, in some fields of work and some teams / organizations.)
The key question is whether AIs will in fact become "sufficiently capable, reliable, general, robust, etc." that they're capable of working independently for tens of hours with a reasonable expectation that the work will turn out to have been useful. I should have called this out as an assumption.
If so, then here's an example of how I imagine that playing out. Imagine a software firm where AIs are doing most of the day-to-day work: fielding customer calls, handling sales for smaller clients, tracking market news, analyzing all of this to provide input to the product management team, designing and implementing features, and troubleshooting production issues. There is a small human staff keeping an eye on things, setting high-level strategy, overseeing product design and technical architecture, and handling relations with the largest customers. Basically, only the most senior staff across all functions are human.
In this scenario, the people will mostly be working a 40-hour week, while the AIs can go 24/7. I think it's unlikely that software development and other major functions would shut down overnight and on the weekends just because the human staff are off duty. And so the human staff will constantly be playing catch-up.
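To put some purely hypothetical numbers on the catch-up problem (the speed multiplier below is an assumption for illustration, not a figure from the essay):

```python
# Hypothetical back-of-the-envelope: how much review backlog piles up while
# the human staff are off duty. Both constants are assumptions.

speed_multiplier = 5      # assumed: the AI side does 5 human-hours of work per clock hour
overnight_hours = 16      # roughly 5pm to 9am the next morning

overnight_backlog = speed_multiplier * overnight_hours
print(f"Overnight: ~{overnight_backlog} human-equivalent hours "
      f"(~{overnight_backlog / 8:.0f} person-days) waiting for review")

weekend_hours = 64        # roughly Friday 5pm to Monday 9am
weekend_backlog = speed_multiplier * weekend_hours
print(f"Weekend:   ~{weekend_backlog} human-equivalent hours "
      f"(~{weekend_backlog / 8:.0f} person-days)")
```

Under those assumed numbers, a single night already produces about two work-weeks' worth of one person's output, which is the flavor of the "catching up every morning" scenario in the essay.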
Thank you for outlining your thoughts and following up with further elaboration!
Let me try to build on your example: in this scenario, when and how can senior staff trust that the AI has faithfully implemented the architectural design, and that it has fully understood and respected all the constraints that product management and customer input impose on the architecture, even those not explicitly formulated in the design? At what point will it be able to suggest changes to the architectural design, and also implement them, so as to better accommodate possible changes in requirements?
If AI can do that, then I can just feed it a set of requirements (even incomplete ones, which is the norm anyway), sit back, and wait for a fully functional software package to be delivered to me - provided that functions such as AI-driven test automation, to prove to me that the resulting software is actually capable of doing its job, are also included. And if a problem occurs later, it should not only identify the cause(s) but also provide the necessary fixes.
All of that is what my team could do, especially our chief architect collaborating with other key people (I am now retired). If AI could do that, and not only in a narrow field of applications, then I would consider it domain-specific AGI for software development. It would also solve the problem you describe of human staff needing to catch up on too much AI output - but it would eliminate 99% of the humans in software development along the way.
> We score an AI’s output on a benchmark problem as “correct” or “incorrect”. In real life, each task is part of a complex web of ongoing processes. Subtle details of how the task is performed can affect how those processes play out over time. Consider software development, and imagine that an AI writes a piece of code that produces the desired outputs. This code might be “correct”, but is it needlessly verbose? Does it replicate functionality that exists elsewhere in the codebase? Does it introduce unnecessary complications?
Did you just predict these follow-up results by METR? https://metr.org/blog/2025-08-12-research-update-towards-reconciling-slowdown-with-time-horizons/
Kinda I guess, but the idea has been in the air. I think I saw someone suggesting a while back that in many cases where an AI "correctly" solves a coding task, the result would not in fact meet quality standards to be mergeable.
Great list of questions. I agree that the distance to AGI is large. But a major point for people who believe in short timelines is that they expect (or at least take into account the possibility of) recursive self-improvement. That would be a good topic for additional questions :)
No, if you set out to TRAIN a human to drive a car, you could do it in maybe 5 years, certainly by 7. It only becomes legal at 16 or 18.
I think a lot of the commentary I am seeing ignores how most of us advanced users are taking advantage of the major tools. We aren't really interested in AGI, just in the general increases in productivity we can get from the tools. We don't spend all day trying to trick the tools into making mistakes; we spend our time making the tools do great stuff for us. The typical strategic use is top-down automation of answering complex problems: have the tool analyze the problem, suggest solutions, break solutions down into tasks and sub-tasks, execute code for the tasks, check the code for errors and security, test the code multiple times, document the code, check progress and process with an expert human in the loop, and then produce an executable product. Many of us have automated this entire process except the necessary human in the loop, which is the only bottleneck. Believe me, this works and is actually very significant to a lot of jobs and industries. I'm not even sure I would care much about AGI in this context.
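Here is a minimal sketch of that loop. Every helper below is a hypothetical stub I've made up for illustration, not any real tool's API; in practice each step would be a call to an LLM or to build/test tooling.

```python
# Minimal sketch of the top-down automation loop described above.
# All helpers are hypothetical stubs, not a real tool's API.

def analyze_problem(problem):         return f"plan for: {problem}"
def suggest_solutions(plan):          return [f"solution based on {plan}"]
def break_into_tasks(solution):       return [f"{solution} / task {i}" for i in range(3)]
def generate_code(task):              return f"# code for {task}"
def check_errors_and_security(code):  pass   # stand-in for static analysis / security review
def run_tests(code, repeats=3):       pass   # stand-in for running the test suite several times
def document(code):                   return code + "\n# docs"
def build_executable(artifacts):      return "\n".join(artifacts)

def automate(problem, human_review):
    plan = analyze_problem(problem)                        # analyze the problem
    tasks = break_into_tasks(suggest_solutions(plan)[0])   # break into tasks and sub-tasks

    artifacts = []
    for task in tasks:
        code = generate_code(task)            # implement the task
        check_errors_and_security(code)       # check for errors and security issues
        run_tests(code, repeats=3)            # test the code multiple times
        artifacts.append(document(code))      # document the code

    human_review(artifacts)                   # expert human in the loop: the one bottleneck
    return build_executable(artifacts)        # produce the executable product

# Example run, with the human checkpoint reduced to a print:
product = automate("build a reporting dashboard",
                   human_review=lambda a: print(f"{len(a)} artifacts to review"))
```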
I've long believed we'll need more technological leaps to achieve AGI, though I'm aware that could definitely be wrong, so I envision that belief as extending the right end of my confidence interval for when AGI is achieved out a lot, while only shifting the left end out a bit. One metaphor I like when describing this, amusingly, is golf!
For someone who is generally coordinated, it takes little work to go from never having played to scoring double par (a score of ~140). Depending on the person, it takes only a bit more investment than that first bit to get down into the 100-120 range. From there, the investment required to continue improving is at least exponential, maybe hyper! The same amount of work that got you from 100 to 90 might get you from 90 to 88. If you've gotten all the way down to 80, that same amount of work might get you to 79.95! Point being, combined with the fact that benchmarks inflate model capabilities, I wouldn't consider that task-length doubling to be particularly meaningful in terms of predicting when AI no longer needs steering and supervision.
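A toy way to put numbers on that shape (the per-stroke cost growth rate is invented purely to show the curve, not fitted to anything):

```python
# Toy model of the golf metaphor: the effort to shave each additional stroke
# grows geometrically. The constants are invented for illustration only.

start_score = 140        # roughly double par for a beginner
cost_first_stroke = 1.0  # arbitrary effort units for the first stroke of improvement
growth = 1.15            # assumed: each further stroke costs 15% more than the last

def effort_to_reach(target_score):
    """Total effort to improve from start_score down to target_score."""
    strokes = start_score - target_score
    return sum(cost_first_stroke * growth ** i for i in range(strokes))

for target in (120, 100, 90, 85, 80):
    print(f"reach {target}: ~{effort_to_reach(target):9.1f} effort units")
```

Getting from 140 to 120 costs a pittance compared with getting from 85 to 80, which is the shape the metaphor is pointing at for the last stretch toward unsupervised AI.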
This is such a great essay, so clear and digestible. I’ve been obsessed with the question you raise: why isn’t this AGI yet? Terms like “memory,” “judgment,” or “insight” are fuzzy, and AI has partial versions of them … retrieval-augmented memory, fine-tuning (a weak form of continuous learning), and emergent pattern recognition that sometimes looks like “insight.”
But the gap feels less functional than qualitative. We can't quite name it, but we sense something missing. Could it be because these capacities are deeply biological … kind of emerging from bodies, hormones, and nervous systems interacting with a messy environment in ways we don't yet know how to abstract into code?
Maybe that’s why we’re still circling AGI … wondering what it is because we are not speaking of the same biological capacity…
Definitely these things have a lot to do with human development!
My guess is that we will eventually find ways to address shortcomings of AI cognition that don't rely on giving them physical bodies, we're just not there yet. It may indeed be necessary to let them interact with a "messy environment" as part of their development / training process, but I would guess that virtual environments will be enough to develop cognitive skills. (Virtual environments might not be enough to develop some specific physical-world skills, so I wouldn't necessarily expect to see the emergence of robots that can, for example, make a soufflé without some real eggs getting broken somewhere in the development process.)
Great piece! I do think you misread the chart at the end. It's the "telecom" boom, not the dot-com boom. I assume it refers to the intense rollout of 5G in 2020.
Oops, you're correct. The relevant paragraph in the source (https://paulkedrosky.com/honey-ai-capex-ate-the-economy/) references both "peak telecom spending ... around the 5G/fiber frenzy" and "the decades ago peak in telecom spending during the dot-com bubble", and I got mixed up and thought the graph was referring to the latter. Will fix. Thanks!
Cool—I hadn’t ever heard that period referred to as a boom either so understandable
Re: the chart on the railroad bubble - in the 19th century much more of the economy was informal. Railroads were a big part of the formal economy, but that did not mean as much as it would mean today.
“Current AIs aren’t AGI. But I don’t know why.” A couple of reasons: (1) there isn’t broad consensus among researchers on how to define AGI, and (2) there isn’t even consensus on what constitutes intelligence, period.
Fascinating stuff, especially around the impact of using AI over a longer period of time. Will AI push us to think in specific ways/thought patterns through prolonged use? Probably very hard to capture, as I don't even think that current studies of how we are using AI in the short-term accurately capture how it's changing our work.
This is a really interesting point.
I think that much of those 15 years (including the classroom and the sleeping) goes into building: (1) a model of the world; and (2) reinforcement about 'good' and 'bad' outcomes, which gives us a loose framework of desired outcomes.
So our world model includes vehicles as a class and cars as an object (of class vehicle), and we understand how cars move. We then also understand that cars need to follow the rules of the road (a concept?), that breaking the rules of the road is bad, that cars colliding with other cars, trees, or people is even worse, and that arriving at our destination on time and without incident is good!
Right from when we are born, we then supplement the model we have built (as it is at any given stage of development) and the framework of outcomes with data we get from the environment we are in - from our eyes and ears as we drive down a street.
That's why we learn to drive more quickly. Waymo doesn't have a world model or a complete framework of outcomes.
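To riff on the class/object framing above, here is a playful toy sketch; it is not a claim about how Waymo or a human brain actually represents any of this:

```python
# Playful sketch of the "world model + outcome framework" idea above.
# Nothing here reflects how Waymo or a human actually encodes concepts.

class Vehicle:
    """Generic concept: a thing that moves under control."""
    def move(self): ...

class Car(Vehicle):
    """A specific object of class Vehicle, with road-specific rules attached."""
    rules = ["stay in lane", "obey signals", "don't hit cars, trees, or people"]

def evaluate_outcome(event: str) -> int:
    """Loose framework of good/bad outcomes, built up over years of reinforcement."""
    scores = {
        "arrived on time, no incident": +1,
        "broke a rule of the road": -1,
        "collision": -10,
    }
    return scores.get(event, 0)

# A new rule ("drive on the line") slots into the existing concepts without
# re-learning what roads, cars, or collisions are:
Car.rules.append("drive on the line")
print(Car.rules, evaluate_outcome("collision"))
```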
I'm not sure how FrontierMath benchmarks show "superhuman" performance by AI.
For reference, checking the latest FrontierMath results for Tier 1-3 problems (so undergraduate to master's level), no AI is even at 25% completion - clearly worse than humans.
Even Tier 1 FrontierMath problems are quite difficult; "undergraduate" is a description of the minimum depth of background knowledge required to solve the problems in principle, it is not a description of the difficulty of the problems. Elliot Glazer (lead author of the FrontierMath paper) has described Tier 1 as "near top-tier undergrad/IMO" (https://x.com/ElliotGlazer/status/1870613328474853548).
See https://epochai.substack.com/p/is-ai-already-superhuman-on-frontiermath: Epoch (the organization that created FrontierMath) "organized a competition at MIT, with around forty exceptional math undergrads and subject matter experts taking part. The participants were split into eight teams of four or five people, and given 4.5 hours to solve 23 questions with internet access". The average human team scored 19%. https://epoch.ai/frontiermath shows that GPT-5 scores around 24.8% (you need to click to "Tier 1-3 results" to see a score that corresponds to the MIT competition – as I gather you're aware since you mentioned AIs are below 25%).
There are error bars on all of these numbers, and also if you pool the work of all 8 teams at the MIT competition then collectively they would have solved enough problems to score 35%, which is why I only said "arguably" superhuman. Some models are clearly scoring higher than almost any one individual person, possibly higher than any individual (unknowable I think, unless someone like Terence Tao sits down to take the test), but not yet as high as a team of people.
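For concreteness, the "pooling" arithmetic works out roughly like this; the per-team solve sets below are invented for illustration, and only the 23-question total, the ~19% team average, and the ~35% pooled figure echo Epoch's writeup:

```python
# Rough illustration of "average team score" vs. "pooled score" on 23 questions.
# The individual solve sets are invented; only the totals echo Epoch's figures.

n_questions = 23

# Hypothetical: which questions each of the 8 teams solved (indices into 0..22).
team_solves = [
    {0, 1, 2, 3, 4},  {0, 1, 2, 5},        {0, 1, 3},     {0, 2, 4, 6},
    {0, 1, 2, 3},     {0, 2, 3, 5, 6},     {0, 1, 4, 7},  {0, 1, 2, 4, 5, 7},
]

avg_score = sum(len(s) for s in team_solves) / len(team_solves) / n_questions
pooled_score = len(set.union(*team_solves)) / n_questions

print(f"average team score: {avg_score:.0%}")    # ~19% in the actual competition
print(f"pooled score:       {pooled_score:.0%}") # ~35% when all teams' solves are combined
```

The gap between the two numbers is just the overlap in what the teams solved: pooling only helps to the extent that different teams crack different problems.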