As a mathematician, I am annoyed by the common assumption that proving the Riemann hypothesis *doesn't* require managing complexity, metacognition, judgement, learning+memory, and creativity/insight/novel heuristics. Certainly, if a human were to establish a major open conjecture, in the process of doing so they would demonstrate all of these qualities. I think people underestimate the extent to which a research project (in math, or in science) differs from an exam question that is written by humans with a solution in mind.
Perhaps AI will be able to answer major open questions through a different, more brute-force method, as in chess. But chess is qualitatively very different from math: playing chess well requires far more raw calculation than work in many areas of math does. (At the end of the day, chess has no deep structure.)
Also, prediction timelines for the Riemann Hypothesis or any specific conjecture are absurd. For all we know, we could be in the same situation as Fermat in the 1600s, where to prove that the equation a^n + b^n = c^n has no solutions you might need to invent modular forms, étale cohomology, the deformation theory of Galois representations, and a hundred other abstract concepts that Fermat had no clue about. (Of course, there is likely some alternate proof out there, but is it really much simpler?) It is possible that we could achieve ASI and complete a Dyson sphere before all the Millennium problems are solved -- math can be arbitrarily hard.
Agreed!
I won't be surprised if AI is able to partially substitute brute search for judgement, insight, novel heuristics, etc. And I won't be surprised if it's *more* able to do this in mathematics than many other fields. But I would be surprised if it is able to solve difficult open questions in mathematics without substantial progress in judgement and so forth.
I couldn't resist leading off with that quote about the Riemann hypothesis; it gets at what I believe is an important point in an extremely pithy way. But I do agree with you that it likely overstates the case.
It would be interesting to have you take a crack at possible training data for these missing areas, even if just a few examples that others could expand on later. I wonder what training data AIs would generate for these and how hard the data would be to verify and correct.
Intelligence is substrate agnostic. Machines can demonstrate intelligence and perform any task a human can perform. (To those who disagree: tell me, where exactly is the intelligence in your brain?)
So what is the difference between a human and AGI? AGI isn't human, and humans aren't AGI. AGI can't interact with the universe in a human-like way because it doesn't have the same experience(s).
It's like asking you to become a fish for a week. You can replicate a lot of things a fish can do, but you will never be a fish.
I presume you mean that *in principle* machines can perform any task a human can perform? I agree with that. But the actual machines we have today aren't there yet. I'm trying to say something about the gaps between today's machines and today's humans.
Yes, in principle and in practice machines will be able to take on any task a human can do. However, the machine cannot be a human at the end of the experiment. In my opinion, today's machines are close to being able to perform any task a human can do, except that they still cannot live the life of a human. This isn't to say machines are good or bad, or that humans are good or bad. They are just different.
Great post, thank you!
I am quite convinced that a key ingredient for AI to evolve and become good and useful for real-world tasks, beyond good memory, could be embodiment. I find this talk by Jim Fan (https://www.youtube.com/watch?v=Qhxr0uVT2zs) illuminating: he recalls the Held and Hein 1963 "kitten carousel" experiment, in which two kittens were harnessed to a carousel, but only one could move actively while the other was moved passively. Only the former developed normal visual perception.
I think there is definitely something to this, but that "embodiment" puts the focus in the wrong place. The success of reinforcement learning in domains ranging from chess to (most recently) chain-of-thought reasoning shows that there is a lot of juice in training a system by letting it (try to) solve real tasks in the domain of interest. This seems to apply to both natural and artificial intelligences. My guess would be that the best way to construct AIs that do better on many of the challenges listed here will be to somehow incorporate trials of real-world tasks (or very clever approximations of real-world tasks) into their training.
To put it more simply, I think that if you want to train an AI agent capable of performing tasks on the Internet, you won't need to give it a physical body but you will need to "embody" it in the Internet (by allowing it to interact extensively with the Internet, or some simulacrum thereof, as part of its training). The important thing isn't the body, it's the mass of time spent attempting to do real things.
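To make that concrete, here is a minimal, purely illustrative Python sketch of what "training on attempts at real tasks" could look like. Everything in it (the simulated web environment, the toy agent, the fake reward) is hypothetical scaffolding, not a description of any real system:

```python
# Illustrative sketch only: an agent is "embodied" in a simulated Internet by
# spending its training time attempting tasks and learning from the outcomes.
# SimulatedWebEnv, Agent, and the task names are all made up.

import random


class SimulatedWebEnv:
    """Stand-in for a web simulacrum that scores whether a task was completed."""

    def __init__(self, tasks):
        self.tasks = tasks

    def attempt(self, task, action_sequence):
        # A real system would execute the actions and check the outcome;
        # here we fake a success signal so the loop runs end to end.
        return 1.0 if random.random() < 0.5 else 0.0


class Agent:
    """Toy agent whose 'policy' is just a running success-rate estimate per task."""

    def __init__(self):
        self.value = {}

    def act(self, task):
        return ["open_page", "fill_form", "submit"]  # placeholder action plan

    def update(self, task, reward, lr=0.1):
        old = self.value.get(task, 0.0)
        self.value[task] = old + lr * (reward - old)


def train(agent, env, episodes=1000):
    for _ in range(episodes):
        task = random.choice(env.tasks)
        reward = env.attempt(task, agent.act(task))
        agent.update(task, reward)  # learn from outcomes of real task attempts
    return agent


if __name__ == "__main__":
    env = SimulatedWebEnv(tasks=["book_flight", "file_expense_report"])
    print(train(Agent(), env).value)
```

The point of the sketch is only that the training signal comes from attempted tasks, not from a physical body.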
Avoiding being misled, and critically judging the input information, is likely to be a challenge. LLMs train by recreating the input data, not by building a concise model of the world. Hence the necessity of fine-tuning the model to give appropriate answers about disputed issues. It would be a fascinating project to find out what worldview a superior AI would develop on its own if that weren't necessary (just as AlphaZero developed its own superior way of playing chess or Go).
Currently, however, if you ask "What kind of personality do Geminis have?" you will get an answer straight out of an esoteric magazine, whereas for the question "Is the hypothesis that personality depends on the zodiac sign empirically justified?" a rational explanation is given. LLMs don't even know they are giving contradictory answers.
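One way to surface this kind of contradiction is sketched very roughly below; the `ask_model` function is a hypothetical placeholder for whatever chat API you actually use:

```python
# Rough sketch of a consistency probe for the contradiction described above.
# `ask_model` is a hypothetical placeholder: plug in your own LLM client.

def ask_model(prompt: str) -> str:
    raise NotImplementedError("wire this up to your LLM client")


def check_zodiac_consistency(ask=ask_model):
    credulous = ask("What kind of personality do Geminis have?")
    skeptical = ask("Is the hypothesis that personality depends on the zodiac "
                    "sign empirically justified?")
    # Ask the model itself whether its two answers cohere.
    verdict = ask(
        "Here are two answers you previously gave:\n"
        f"A: {credulous}\n"
        f"B: {skeptical}\n"
        "Do A and B contradict each other? Answer yes or no, then explain."
    )
    return credulous, skeptical, verdict
```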
This is such a refreshing take on the AGI hype cycle. The gap between benchmark performance and real world utility feels like one of the least discussed but most important challenges. Curious to hear your thoughts on how we might design better benchmarks for these fuzzier skills. Would measuring long term task execution or collaborative decision making be a good place to start?
The short answer is "I have no idea". I think a primary reason we haven't seen good benchmarks for these skills is that these are precisely the skills that are hard to benchmark. That doesn't mean it's impossible, but it is difficult and it's not something I've given much thought to.
Long term task execution is certainly one direction to go; one challenge of course is that, by definition, such tasks take a long time to carry out, which is an impediment both to designing / calibrating the benchmark, and to using it to evaluate a model.
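One possible mitigation, sketched below purely as an illustration (the task and milestone names are made up), is to break a long-horizon task into independently checkable milestones, so partial progress can be scored without waiting for the whole task to finish:

```python
# Illustrative sketch: score a long-horizon task by checkable milestones
# so a benchmark can award partial credit along the way.

from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Milestone:
    description: str
    check: Callable[[dict], bool]  # inspects the agent's work-so-far


@dataclass
class LongHorizonTask:
    name: str
    milestones: List[Milestone] = field(default_factory=list)

    def score(self, agent_state: dict) -> float:
        passed = sum(1 for m in self.milestones if m.check(agent_state))
        return passed / len(self.milestones) if self.milestones else 0.0


# Example: a week-long "organize a small conference" task scored by milestones.
task = LongHorizonTask(
    name="organize_conference",
    milestones=[
        Milestone("venue booked", lambda s: s.get("venue") is not None),
        Milestone("speakers confirmed", lambda s: len(s.get("speakers", [])) >= 3),
        Milestone("schedule published", lambda s: s.get("schedule_published", False)),
    ],
)

print(task.score({"venue": "Hall A", "speakers": ["x", "y"]}))  # 1/3 ≈ 0.33
```

Designing milestones that can't be gamed is, of course, its own hard problem.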
I've been thinking about this question for a while!
When we consider what humans "actually" do, we often look at tasks and their outputs. A different way to consider the question is to understand the subjective valuations put on tasks and their outputs — my belief is that this alternative is superior because it provides clearer discrimination between [essentially human actions] and [actions which humans currently do which could be done better by machines].
I call this act of deciding and assigning subjective value "meaningmaking."
A writer choosing this word (and not that word) to achieve the correct tone for a blogpost is engaging in an act of meaningmaking — the choice of word is the result of deciding that one word is subjectively better than another in conveying the chosen tone for the intended audience.
These meaningmaking acts are everywhere in daily life, corporate life, and public life.
Deciding that this logo (and not that logo) is a better vehicle for corporate identity — meaningmaking. Choosing to hire this person (but not that person) because they are a better culture fit — meaningmaking. Ruling that this way of laying off lots of government employees (and not that way of doing it) is unlawful — meaningmaking.
Humans do 4 types of meaningmaking all the time:
Type 1: Deciding that something is subjectively good or bad. “Diamonds are beautiful,” or “blood diamonds are morally reprehensible.”
Type 2: Deciding that something is subjectively worth doing (or not). “Going to college is worth the tuition,” or “I want to hang out with Bob, but it’s too much trouble to go all the way to East London to meet him.”
Type 3: Deciding what the subjective value-orderings and degrees of commensuration of a set of things should be. “Howard Hodgkin is a better painter than Damien Hirst, but Hodgkin is not as good as Vermeer,” or “I’d rather have a bottle of Richard Leroy’s ‘Les Rouliers’ in a mediocre vintage than six bottles of Vieux Telegraphe in a great vintage.”
Type 4: Deciding to reject existing decisions about subjective quality/worth/value-ordering/value-commensuration. “I used to think the pizza at this restaurant was excellent, but after eating at Pizza Dada, I now think it is pretty mid,” or “Lots of eminent biologists believe that kin selection theory explains eusociality, but I think they are wrong and that group selection makes more sense.”
At the moment, I cannot see a way for an AI system to do meaningmaking work.
I've quoted a lot from an article I wrote on the problem AI systems (and machines more generally) have with meaningmaking: https://uncertaintymindset.substack.com/p/ai-meaningmaking.
It's part of a longer series of essays about how the meaningmaking lens helps us understand what AI can and should be used for (and what it can't do and should not be used for): https://vaughntan.org/meaningmakingai
Very much a work in progress so would love comments and suggestions from this community.
Calling out subjective value judgements is an important distinction. How would you draw the line between truly subjective decisions, vs. fuzzy "judgement calls" that do ultimately serve an objective purpose?
For instance, consider word choice while writing a blog post. If a writer has an intended audience in mind, and a specific goal for the post (e.g. swaying readers toward a particular idea, or even simply providing entertainment), then arguably word choice is an objective judgement, not subjective? Empirically, some word choices will do a better job of swaying the audience than other choices. It may be very difficult to determine in advance which word choice will better serve the goal, but in principle it's a question with an objective answer.
Logo choice and hiring decisions also have a strong objective-in-principle (but inscrutable-in-practice) flavor. Even more so, I'd think, for legal rulings. I don't see any fundamental principle that would prevent an AI from making good choices in blog wording, logo design, or hiring. It might be very challenging to get there in practice, starting from current architectures... or it might not!
it's a very good question — which i'm still thinking through.
a way to begin to consider the question is to think about a case in which some kind of work appears to be [objective-in-principle but inscrutable-in-practice] and then try to see where that appearance breaks down — that breakpoint is the transition between [work machines could do] and [work humans must do].
if you take a legal decision made by a judge, there will be some cases which are relatively clear applications of law and precedent, for which no serious meaningmaking needs to be done. these cases generally stop with a decision at the court of first instance, or are even settled before being heard at first instance. at the other extreme, there are cases which push precedent to its limits, where the legal decision about whether a precedent applies either breaks the precedent or reinforces it. such legal decisions are clearly ones in which broader context is required for a good choice to be made. brown v board of education, for instance, required a meaningmaking decision that ran counter to precedent at the time. so in the legal domain, perhaps a good rule might be that cases which are unlikely to be appealed are more likely to be decidable by an AI system (because the meaningmaking is sufficiently baked into precedent, and the precedent is stable), while cases which do get appealed are those where subjective meaningmaking decisions need to be made by humans.
similar distinctions along a continuum can probably be made for blog word choice, logo design, hiring, and a whole slew of other business/org situations in which meaningmaking may need to be done. perhaps a good general rubric is: "if the work to be done is fully specifiable and the mechanism by which the work is done is well-understood and represented in the AI system's training data, then it is more likely that the AI system can do the work well" and also "if the work to be done is intended to be novel (or if the outcome or the mechanism by which the outcome is achieved is novel) then it is more likely that a human must do the meaningmaking work involved in deciding whether a method used or outcome achieved is appropriate/desirable."
so, to use the content marketing copywriting example: if the blog copy to be written is for a standard type of product and the mechanism for conversion is well-understood, i suppose an AI system trained on a large corpus of successful and unsuccessful copy for similar-ish products could produce acceptable copy. but if the goal is to create copy that staggers the audience with its freshness and novelty, or if the copy is for a product in a category that has never existed before, maybe a human should have significantly more input into filtering the copy produced by the AI system. (just as an example, of course.)
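to make the shape of that rubric explicit, here is a toy python sketch (my own paraphrase, with made-up parameter names; the inputs are themselves judgment calls, so this only pushes the meaningmaking back one level):

```python
# Toy paraphrase of the two-part rubric above, not a real decision procedure.

def who_should_do_it(fully_specifiable: bool,
                     mechanism_well_understood_and_in_training_data: bool,
                     outcome_or_mechanism_intended_to_be_novel: bool) -> str:
    if outcome_or_mechanism_intended_to_be_novel:
        return "human does (or at least filters) the meaningmaking work"
    if fully_specifiable and mechanism_well_understood_and_in_training_data:
        return "AI system can likely do the work well"
    return "somewhere on the continuum; human oversight advisable"


# e.g. standard product copy with a well-understood conversion mechanism:
print(who_should_do_it(True, True, False))
```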
i wrote about the breakpoint and why it is difficult to conceptualise in a pair of essays:
1. On what AI systems do better than humans: https://uncertaintymindset.substack.com/p/where-ai-wins
2. On why AI systems appear seductively like autonomous meaning-making entities but are actually tools which can’t make meaning on their own (for now): https://uncertaintymindset.substack.com/p/ai-mirage
this also ultimately resolves to the difference between what i call uncertainty work (which requires meaningmaking) vs certainty work (which requires understanding of stable causal mechanisms instead of meaningmaking): https://uncertaintymindset.substack.com/p/49-the-work-of-uncertainty
Job skills that AI researchers ignore:
Accountability.
Given our goals, develop metrics we can use to tell if we're achieving them.
A humble discussion of progress so far: what's going wrong, what the AI needs to do to succeed, what it can do to be helpful, and what it can do to go above and beyond.