Discussion about this post

tasdourian:

Great post; I was looking for a balanced view of what o3 achieved on those benchmarks. Even with some effort and research, I find it confusing to assess how impressive each new model is; glad to see that even you find it hard.

Venkateshan K:

Very informative post, providing much-needed context for a better perspective on the model's performance on these benchmarks.

One thing I am very curious to find out is what made o3 so much better at the FrontierMath problems (even assuming the ones it solved correctly are no more difficult than average IMO-level problems). We know, for example, that AlphaProof, which solved 4 of the 6 IMO problems this (last?) year, used the RL framework that had been developed in the context of AlphaGo and relied on the formal language Lean to check its candidate solutions (thousands of them, I'd guess). OpenAI has mentioned using reasoning chains, but beyond that, is there anything we know? A toy sketch of what that Lean verification step looks like follows below.
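To make concrete what "using Lean to check candidate solutions" means, here is a minimal sketch (my own toy example, not anything from AlphaProof): a generated candidate proof either type-checks or it doesn't, so thousands of candidates can be filtered automatically with no human in the loop.

```lean
-- Toy illustration in Lean 4 (hypothetical; not AlphaProof's actual code).
-- The kernel either accepts a candidate proof or rejects it at compile
-- time, so candidate filtering reduces to "does this file compile?".

-- A valid candidate: the kernel accepts it.
theorem sum_comm (a b : Nat) : a + b = b + a :=
  Nat.add_comm a b

-- An invalid candidate would simply fail to type-check, e.g.:
-- theorem bad (a b : Nat) : a + b = a := Nat.add_comm a b  -- rejected
```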
