5 Comments
gregvp's avatar

Thanks for this, Steve. I hope the labs read it and think hard about it.

Paraphrasing brutally: to use current AI to produce sound work, you have to have a skeptical and persistent nature, high literacy, a logical mind, and excellent organising and critical thinking skills.

We are doomed.

Carsten Bergenholtz's avatar

This is a fantastic post that could be teaching material in many educational settings. Benchmarks and model capability don't tell the full, real story. In messy, open-ended, real-life tasks, GenAI can help you think, but only if you are careful and always in cognitive control. A number of studies support the example and line of thinking presented here:

- this study https://dl.acm.org/doi/pdf/10.1145/3772318.3791796 shows that GenAI contributes positively to solving an open-ended critical thinking challenge only if there is little or no time pressure. Under time pressure, using LLMs led to worse results. Likely because of the same reasons outlined by Steve in this post: LLMs are too eager to present conflicting arguments.

- the following two articles both show that the impact of using GenAI can depend on one's expertise: GenAI helped lower performers, while high performers did not benefit. Mechanism: if you don't know much, then getting some info (e.g. on data centers in space) is better than nothing. Yet if you do already know something (on data centers), then getting pages and pages of info that looks smooth and plausible on the surface, but is in fact somewhat incoherent / not quite right, can lead to performance not improving. Qualitative interviews showed that the challenge of monitoring, filtering and evaluating the plausible information disrupted the higher performers' thinking. See a study (co-authored by me) on a business school case here: https://journals.aom.org/doi/10.5465/amle.2025.0029 and an example involving legal reasoning here: https://papers.ssrn.com/sol3/papers.cfm?abstract_id=6525800.

If one has time and puts in extensive effort, then LLMs can really be helpful.

MD's avatar

Interesting article, thanks for providing something like "raw data" and not just yet another polemic piece!

Two small errors: Footnote 6 apparently got overwritten with footnote 7, and footnote 12 refers to "previous footnote" but presumably should refer to footnote 9.

One disagreement: you talk about how "finding places to host a massive number of rocket launches would pose its own challenges", but I don't see how that is a new problem? SpaceX already has several launch sites that it can use to deliver Starlinks at a huge pace, why could it not use them for this purpose as well?

And one takeaway: I think your footnote 11 is the most interesting thing here. In my experience, when testing an AI for some task (e.g. transcribing handwriting), either I succeed in five minutes or nothing ever works, until perhaps a later model one-shots the task. These rounds of "refinement" or "self-improvement" or such never seem to work. I think it isn't an inability to *find* errors (if anything, an AI prompted to look for errors tends to give you a huge list that contains the relevant stuff and a lot of non-issues), but an inability to react to them effectively. It seems that AI lacks something like self-awareness or flexibility, and when it sees a problem, it doesn't have any way to change its approach.

It would be interesting to dig deeper into why this is the case, but I don't know how to even pose this question more rigorously. One test would be to simplify your approach and just produce 300 pages of *anything* with no corrections, then summarize it into a table of contents. My guess is that it would be about as (in)effective as your results, and that the only difference between the two approaches you show is the size of the output.

Steve Newman's avatar

Thanks for flagging the footnote errors! (Fixed.) You receive my inaugural Thorough Reader Award. (Footnotes 5 and 6 were both wrong; the text in #5 belonged in #6, and #5 should have been something else entirely; fixed. I draft these in Google Docs and the footnotes don't come across in select-all-copy-paste and need to be carried across manually. I've hesitated to automate this as it requires some precise clicking, but here's a reminder that the human baseline is not 100%.)

Launch sites: there is talk of scaling data center deployments to 100 GW/year or more. Handwave: if a single Starship launch can loft a 1 MW satellite, that would be 100,000 launches per year. If each tower can handle one launch per day – and I believe SpaceX has talked about higher cadences, but it's certainly not demonstrated so let's just stick with the round number – you'd need 274 launch towers, over some number of sites. If they can turn launch sites around more rapidly, then fewer towers, but still many, and finding locations that are willing to tolerate a continual cadence of massive launches – with the attendant noise, closure of nearby airspace, etc. – is not obviously easier than siting terrestrial data centers. SpaceX today is indeed lofting Starlinks at a huge pace by any historical standard, but tiny in comparison to projected AI data center buildout. ChatGPT Pro estimates the aggregate raw solar panel capacity of all currently deployed Starlink satellites at 216 MW, or 1/463 of the hypothetical 100 GW AI deployment (https://chatgpt.com/c/69e4351f-b374-83e8-aa3c-006879de5e44). (And of course it would take several years at the current pace to deploy that many Starlinks.)
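
For anyone who wants to poke at the handwave, here is a minimal Python sketch of the same arithmetic. The inputs are the round numbers from the paragraph above (100 GW/year, 1 MW per launch, one launch per tower per day, 216 MW for the Starlink fleet); the function and variable names are made up for illustration, and none of the figures are measured values.

```python
import math

# Round-number assumptions taken from the comment above, not measured data.
TARGET_GW_PER_YEAR = 100   # hypothetical AI data center deployment rate
SATELLITE_MW = 1           # handwave: one 1 MW satellite per Starship launch
STARLINK_FLEET_MW = 216    # ChatGPT Pro estimate quoted in the comment


def launches_per_year(target_gw: float, satellite_mw: float) -> float:
    """Launches needed per year to loft target_gw of capacity."""
    return target_gw * 1000 / satellite_mw


def towers_needed(launches: float, launches_per_tower_per_day: float) -> int:
    """Launch towers needed, rounded up, for a given per-tower daily cadence."""
    return math.ceil(launches / (365 * launches_per_tower_per_day))


launches = launches_per_year(TARGET_GW_PER_YEAR, SATELLITE_MW)
towers = towers_needed(launches, launches_per_tower_per_day=1)
starlink_ratio = round(TARGET_GW_PER_YEAR * 1000 / STARLINK_FLEET_MW)

print(f"{launches:,.0f} launches/year, {towers} towers, "
      f"Starlink fleet is about 1/{starlink_ratio} of the target")
```

Changing `launches_per_tower_per_day` shows how sensitive the tower count is to cadence: at a purely hypothetical three launches per tower per day, the same inputs give 92 towers, which is fewer but still many.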

AIs failing to effectively address errors in their own work: I'm working on a new iteration of the research agent that is intended to make it harder for the agent to ignore, or wiggle out of addressing, errors. If this yields anything useful, I'll write about it.

MD's avatar

Please also write something (at least a Note linked here) if it does not yield anything useful! It's important to avoid publication bias.