Discussion about this post

User's avatar
Greg Burnham's avatar

This is a terrific post!! One question for you. This is the third problem on the first day of the 2024 IMO (P3). AlphaProof didn't solve this one, but did solve the third problem on the second day (P6). These are both very hard problems for humans: each was only solved by only about 2% of contestants. So, it's just one data point, but what do you think might account for the difference in AlphaProof's performance?

My hunch, FWIW, is that the re-representation of the problem in the grid format makes it trickier for AlphaProof. I've looked up a few solutions to P3, and they all make use of a similar technique at some point. P6 requires a similarly long and intricate build-up of observations and intermediate results, but I would say it doesn't really require seeing the problem in a new way. In other words, both problems are deep, but P3 requires more creativity. Then again, it's also possible that the reason is more technical: maybe combinatorics is just clunkier to represent in Lean, making the solution necessarily longer, and thus yielding a bigger search space; I don't know anything about that.

Curious for your take, though!

Expand full comment
tasdourian's avatar

Great Post! I was actually hoping you'd write a post like this, detailing what strategies high-level problem solvers use and how they use them. (I know you said all of us do, and that's true to a degree, but c'mon now...)

I also am very eager to see your analysis of the strengths and limitations of o1 mini and o1 preview-- I've found o1 mini to be much more capable and sophisticated at math/physics problems, and o1 preview has been able to solve the last 3 NYTimes Connections puzzles I've given it, whereas the previous 4o model couldn't solve these at all. Especially because Connections was the kind of puzzle where it seemed to require too much creativity/originality to just solve it using previous training data.

Anyway, I'd be interested in a hype-free analysis of the current new models, if that seems useful to you.

Expand full comment
4 more comments...

No posts