It takes about fifteen years for a human to learn to drive a car. We do not let week-old infants get behind the wheel, as we would if "a few hundred hours" were all it took.
Fair point!
15 years is only about 131,000 hours, and many of those are spent sleeping, sitting still in a classroom, or otherwise engaged in activities that don't have much to do with learning to drive. That comes to a small fraction of the hours Waymo cars have logged, but quite a bit more than 100, yes.
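For concreteness, a quick back-of-envelope sketch of that arithmetic (the 9-hours-of-sleep figure is an assumption, purely for illustration):

```python
# Rough check of the "15 years ≈ 131,000 hours" figure.
years = 15
total_hours = years * 365 * 24        # 131,400 hours
sleep_hours = years * 365 * 9         # assumed ~9 hours/day asleep
waking_hours = total_hours - sleep_hours

print(f"total hours in {years} years: {total_hours:,}")   # 131,400
print(f"waking hours (rough):        {waking_hours:,}")   # 82,125
```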
On the other hand, I don't think it's currently the case that we could pretrain a model on a few tens of thousands of hours of general world interaction, and then quickly teach it to drive. Those first 15 years that a human spends "getting ready to learn to drive" involve fairly generic, easily-obtained data, and set them up to then efficiently learn a skill with relatively little data specific to that skill.
Re: #3 - needing to manually pick o3 instead of the default.
I'm shocked that only a few percent of people with access were using the o3 model before the GPT-5 switcheroo.
Aside from the better answers I received from o3 vs. 4o, I got an odd satisfaction from waiting while the model reasoned. Having 4o hose me down with 1,000 tokens the instant I hit enter made me feel like I was getting a half-baked answer.
Apparently there was a huge 4o tribe who loved 4o's style. Anyone else on team o3?
I presume that the main factor here was "most people never even noticed the model picker (or didn't know what to do with it)", not "most people, after considered deliberation, preferred 4o for everything".
I agree, that's what happened. The default option rules all.
I feel kind of embarrassed for humanity imagining o3 staring dumbfounded as ~95% of humans march past to 4o, because o3 is hidden under a single submenu :)
The observation that the length of tasks AIs can complete is doubling about every seven months seems woefully incomplete. Reliability is not improving very fast at all. I think most people reading that statistic who are not in the field would assume that "AI task completion" means a high rate of successful task completion. But in fact, it's stuck around 50%.
For tasks of a given size, the success rate is going up, no?
You're correct that people often forget that these reported task sizes are for 50% success rate, but METR has found that a similar curve holds for 80% success rates, just following behind by (a year or so? I forget). They don't have enough data to report figures for higher success rates; the general assumption is that they will also follow a similar curve, though personally I am not entirely confident of that.
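To make the shape of that trend concrete, here is a minimal sketch of the implied exponential, horizon(t) = horizon_0 · 2^(t / 7 months); the 1-hour starting horizon and the time span are assumed placeholders for illustration, not METR figures:

```python
# Hypothetical extrapolation of a METR-style task-horizon trend.
# Assumes a 7-month doubling time; the starting horizon is a
# placeholder, not a number taken from METR's data.
DOUBLING_TIME_MONTHS = 7
start_horizon_minutes = 60  # assumed: ~1-hour tasks at 50% success

for months in range(0, 43, 7):
    horizon = start_horizon_minutes * 2 ** (months / DOUBLING_TIME_MONTHS)
    print(f"after {months:2d} months: ~{horizon:6.0f} minutes")
```

If the 80%-success curve really does follow the same shape, it would just be this exponential shifted right by whatever lag METR observes.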