Not So Fast: AI Coding Tools Can Actually Reduce Productivity
Study Shows That Even Experienced Developers Dramatically Overestimate Gains
UPDATE: on Wednesday July 16, I’ll be holding a fireside chat in SF with the primary authors of the paper – register here to attend!
The buzz about AI coding tools is unrelenting. To listen to the reports, startups are launching with tiny engineering teams, non-programmers are “vibe-coding” entire apps, and the job market for entry-level programmers is crashing. But according to a METR experiment conducted in the spring of 2025, there’s at least one cohort that AI tools still aren’t serving.
METR performed a rigorous study (blog post, full paper) to measure the productivity gain provided by AI tools for experienced developers working on mature projects. The results surprised everyone: a 19 percent decrease in productivity. Even the study participants themselves were surprised: they estimated that AI had increased their productivity by 20 percent. If you take away just one thing from this study, it should probably be this: when people report that AI has accelerated their work, they might be wrong!
This result seems “too bad to be true” – so astonishing that it almost has to be spurious. However, the study was carefully designed, and I believe the findings are real. At the same time, I believe that at least some of the anecdotal reports of huge productivity boosts are real. This study doesn’t expose AI coding tools as a fraud, but it does remind us that they have important limitations (for now, at least) – confirming some things my colleague Taren wrote about in a previous post, First, They Came for the Software Engineers….
To begin with, I’ll explain how the study was done, and why I believe its results.
Finally, A Proper Scientific Trial of AI Coding Productivity
The study was carried out in pretty much the most rigorous fashion possible: an honest-to-goodness randomized controlled trial under real-world conditions. The subjects were experienced developers carrying out their everyday work.
The methodology was as follows:
METR recruited 16 developers from major open-source projects.
Each developer selected a set of coding tasks from their own to-do list, breaking up large projects into tasks that they could complete in an hour or two. In all, 246 tasks were included in the study.
The developers estimated how long it would take them to complete each task (a) under normal conditions, and (b) without using any AI tools. The percentage difference between these figures yields the predicted speedup – the degree to which the developer expected that AI tools would boost their productivity.
Each task was randomly assigned to one of two categories: “AI Allowed” (the developer can use any tools they like) or “AI Disallowed” (the developer cannot use AI coding tools or features).
The developers went about their work, while recording their screens for later analysis. After each task, they reported the time spent1. For AI Allowed tasks, they also estimated how much time AI tools had saved them – the retrodicted speedup.
To compute the actual speedup – or, rather, slowdown! – provided by AI tools, the researchers compared the developers’ predictions of how long each task would take to the measured completion time. They found that the difference between predicted and actual times was 19% larger for AI Allowed tasks than for AI Disallowed tasks2. Remember that when the developers estimate the task time, they don’t yet know whether they’ll be using AI for that task, so their estimates are unbiased.
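For concreteness, here is a minimal sketch of the shape of that comparison. The task numbers below are invented, and the paper's actual analysis is more careful than a simple mean of ratios; this is purely illustrative:

```python
# Hypothetical data, not from the study:
# (condition, up-front estimated hours without AI, actual hours taken)
tasks = [
    ("ai_allowed",    2.0, 2.6),
    ("ai_allowed",    1.0, 1.1),
    ("ai_disallowed", 1.5, 1.6),
    ("ai_disallowed", 1.0, 0.9),
]

def mean_overrun(condition):
    """Average of (actual time) / (no-AI estimate) for tasks in one condition."""
    ratios = [actual / est for cond, est, actual in tasks if cond == condition]
    return sum(ratios) / len(ratios)

overrun_ai = mean_overrun("ai_allowed")
overrun_no_ai = mean_overrun("ai_disallowed")

# In the study, this kind of comparison came out to roughly 1.19, i.e. a ~19%
# slowdown for AI Allowed tasks; here it simply reflects the toy numbers above.
print(overrun_ai / overrun_no_ai)
```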
The only significant way in which the study design falls short of the scientific gold standard is that it was not blinded: once work began, both the participants and the researchers knew whether AI tools were being used. This is of course unavoidable; there is no reasonable way of providing a “placebo” coding assistant. However, the researchers have specifically looked for, and found evidence against, a long list of alternative explanations – including the possibility of bias due to the non-blinded nature of the study. It really does seem that for the type of work carried out in this study, allowing developers to use AI tools slowed them down.
Addressing Every Objection You Thought Of, And Some You Didn’t
As I read through the study, I thought of half a dozen ways the results could have been confounded or otherwise invalid. It turns out that I’m not very imaginative; the authors address many more possible explanations.
The John Henry Effect: perhaps the developers were motivated to “beat the machine”, working extra-hard on AI Disallowed tasks. If this were the case, you might expect to see the effect taper off over the course of the study, as the excitement and novelty wear off – recall that subjects performed an average of 15 tasks of 1-2 hours each. No such tapering was observed.
Underuse of AI. Perhaps developers weren’t using AI tools even when allowed? However, this could only explain a lack of productivity gain; it can’t explain a loss. And exit interviews and analysis of screen recordings both showed substantial use of AI (84% of screen recordings for AI Allowed tasks showed at least some use of AI).
Cheating. Perhaps developers used AI tools for AI Disallowed tasks? But again, this could at most explain a neutral result, not a productivity loss. And exit interviews and screen recordings found only rare instances of cheating.
Overly optimistic time estimates. Developers are notorious for underestimating the amount of time it will take them to complete a task. However, because developers estimate time for a task before knowing whether they will have access to AI tools, any misestimation effects should cancel out.
Skewed task definitions. If developers were defining AI Allowed tasks differently than AI Disallowed tasks, that could throw off the results. However, this is ruled out, because developers specified their tasks and broke them down into chunks requiring less than two hours before the tasks were randomly assigned to the AI Allowed or AI Disallowed categories.
Dropout effects. Suppose that some AI Disallowed tasks were annoying to complete without AI, and so developers didn’t finish these tasks during the study and their results were not recorded. This could throw off the results. However, only 13 tasks were left uncompleted, against a total of 246 tasks completed, and the incomplete tasks were evenly divided between AI Allowed and AI Disallowed.
Out-of-date tools. AI capabilities are progressing rapidly; perhaps the results suffered from use of out-of-date tools? But developers primarily used Cursor Pro with Claude 3.5/3.7 Sonnet, which were considered state of the art during the time period of the study (February – June 2025).
Inaccurate time reporting. Statistics are based on developers’ reports of how long it took them to complete each task, both before and after PR review. However, an analysis using only before-review times, or using the lengths of screen recordings, yields similar results.
None of the issues mentioned so far seem to have thrown off the results. Here are a few factors that may have contributed to the observed decrease in productivity.
Overuse of AI. The paper states that “developers were instructed to use AI to whatever degree they thought would make them most productive”. However, some subjects seem to have gotten carried away, and this may have contributed to the observed slowdown. In exit interviews, subjects were roughly evenly divided in describing their AI usage as “using AI as normal” (70 tasks), “experimenting with AI” (119 tasks) or “using as much AI as possible” (57 tasks). Those who self-reported as being in the first category were less slowed down by AI (though they still were not sped up).
Lack of experience with AI tools. If developers weren’t experienced with AI tools, they might make poor use of them. As it turned out, 93% of participants reported past experience with tools like ChatGPT, and 44% had experience with Cursor (currently the most popular dedicated AI coding tool). This suggests a range of experience levels, with a significant number of participants having relatively little experience with AI tools, so it might be that developers would see more benefit from these tools once they have more experience. However, all developers were gaining experience with AI tools over the course of the study, and this did not result in observable improvements from the beginning of the study to the end. (Also, all participants received “live basic training” for Cursor at the outset of the study.)
A related potential issue stems from the fact that some study participants switched from their normal development environment to use Cursor for AI Allowed tasks during the study. However, “broadly, developers reported that they were not significantly inconvenienced or affected by these differences compared to their normal workflows”.
It doesn’t seem that inexperience was a major problem, but it may be that developers with more expertise in making the best use of AI tools would see better results.
Difference in thoroughness. Perhaps developers using AI tools expanded the scope of the task: for instance, writing code to handle more edge cases, adding additional features, or testing or documenting code more thoroughly. As potential evidence in this direction, developers added 47% more lines of code (per forecasted task size3) for AI Allowed tasks than for AI Disallowed tasks. However, the study authors believe that this is at best weak evidence for scope creep. In private communication, they cited a number of reasons for this belief:
Line counts vary wildly from task to task (sometimes due to large auto-generated files), so there is a lot of “noise” in this measurement. Dividing two noisy numbers (lines of code in AI Allowed tasks vs. AI Disallowed tasks) yields a very noisy result; a rough simulation below illustrates the point. The observed difference is “not statistically significant”4.
The study authors examined many different metrics; it’s to be expected that at least one will show a spurious result (this XKCD comic hits the nail on the head).
In manual review, the authors saw little evidence of material difference in the nature of the AI Allowed work; they noticed a slight tendency toward more tests and comments, which could again have been spurious.
Perhaps the strongest evidence that scope creep did not contribute to slowdown is that increased time for AI Allowed tasks was greater on tasks where developers did not report scope creep:

Even if the difference in line counts is a real effect, there are potential bad explanations (bloated code, more duplication, unnecessary bells and whistles) as well as good ones.
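As for the first point above – that dividing noisy per-task numbers yields a noisy comparison – here is a rough simulation to build intuition. Every distribution and parameter below is invented for illustration; nothing comes from the study’s data:

```python
import numpy as np

rng = np.random.default_rng(0)

def simulated_gap(n_tasks=123, n_studies=2000):
    """Apparent percentage gap in lines-of-code per forecasted hour between two
    groups of tasks whose underlying distributions are IDENTICAL (hypothetical)."""
    gaps = []
    for _ in range(n_studies):
        # Heavy-tailed line counts (occasional huge auto-generated files)
        loc_a = rng.lognormal(mean=4.0, sigma=1.5, size=n_tasks)
        loc_b = rng.lognormal(mean=4.0, sigma=1.5, size=n_tasks)
        # Forecasted task sizes of roughly half an hour to two hours
        hrs_a = rng.uniform(0.5, 2.0, size=n_tasks)
        hrs_b = rng.uniform(0.5, 2.0, size=n_tasks)
        gap = np.mean(loc_a / hrs_a) / np.mean(loc_b / hrs_b) - 1.0
        gaps.append(gap)
    return np.percentile(gaps, [5, 50, 95])

print(simulated_gap())  # wide spread around zero despite no true difference
```

With these made-up parameters, apparent gaps of several tens of percent show up routinely even though the two groups are drawn from the same distribution – which is the authors’ point about dividing noisy numbers.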
More time might not mean more effort. Even if developers are spending more time when using AI tools, they might be expending less energy. Reviewing / correcting code is often (though not always!) easier than writing from scratch, and time spent waiting for AI can be used to relax or do other things.
Overall, it seems possible that the impact of AI tools was not quite as bad as it seems; some of the measured 19% productivity decrease could be paying off in more thorough work and reduced energy drain on developers, and some could be explained by overuse of AI by subjects who were overly focused on their participation in the study. But these tools don’t seem to be helping much, and they might really be making productivity worse. How could that be?
Some Kind of Help is the Kind of Help We All Can Do Without
Based on exit interviews and analysis of screen recordings, the study authors identified several key sources of reduced productivity. The biggest issue is that the code generated by AI tools was generally not up to the high standards of these open-source projects. Developers spent substantial amounts of time reviewing the AI’s output, which often led to multiple rounds of prompting the AI, waiting for it to generate code, reviewing the code, discarding it as fatally flawed, and prompting the AI again. (The paper notes that only 39% of code generations from Cursor5 were accepted; bear in mind that developers might have to rework even code that they “accept”.) In many cases, the developers would eventually throw up their hands and write the code themselves.
Based on the screen recordings, here is where developers spent their time:
Note that this graph reflects percentages of overall task completion time, and AI Allowed tasks took longer on average to complete, so a green bar of a given height represents more time than a purple bar of that same height6. You can see that for AI Allowed tasks, developers spent less time researching and writing code (though, because of this scale issue, the difference is smaller than it visually appears). Adjusting for scale, they spent roughly the same amount of time on “testing & debugging” and “git & environment”, and considerably more time idle – perhaps because waiting for AI tools causes people to lose flow. In any case, the moderate savings on researching and writing code were more than offset by the time spent prompting the AI, waiting for it to generate code, and then reviewing its output.
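To make the scale adjustment concrete, here is a tiny illustration with made-up numbers (the study’s actual task durations and percentages are not reproduced here):

```python
# Purely hypothetical: suppose the average AI Disallowed task took 100 minutes
# and the average AI Allowed task took 119 minutes (roughly 19% longer).
avg_minutes = {"AI Disallowed": 100, "AI Allowed": 119}
share_of_time = 0.20  # a bar of the same height in both conditions, e.g. 20%

for condition, minutes in avg_minutes.items():
    print(condition, round(minutes * share_of_time, 1), "minutes at that bar height")
# Equal percentages, but the AI Allowed bar corresponds to ~19% more clock time.
```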
How can we reconcile these results with the constant reports of AI coding tools working miracles?
This Is The Unevenly Distributed Future I Was Telling You About
Back in December, I wrote about how current AI tools are very good at some things, very bad at others, and the dividing line is extremely jagged. That jagged dividing line meanders right through the constellation of work that we call “software development”.
In First, They Came for the Software Engineers…, Taren wrote:
Typically, large productivity boosts occur for small, well-defined, greenfield projects, or when an engineer is first learning a new language or API. For other work, gains from using current AI tools are often far more modest – and potentially entirely offset by increased time needed for review, debugging, integration, and managing AI quirks.
Several aspects of the study play to the weaknesses of current tools. First, it was conducted on mature projects with extensive codebases. The average project in the study is over 10 years old and contains over 1 million lines of code – the opposite of “greenfield”. Carrying out a task may require understanding large portions of the codebase, something that current AI tools struggle with. (This may be less a fundamental weakness of AI models, and more a design choice in some coding tools to limit the amount of “context” sent to the model, in order to control costs and get quicker responses.) It also involved editing large files, which may be “out of distribution” for most AI models (i.e. they may not get much training on large files). The paper includes some anecdotal reports from developers which support this idea:
In software development, developers often rely on their own undocumented knowledge of the codebase to assist design and implementation decisions. In our study, developers often note that AIs lack this tacit codebase knowledge, resulting in less useful AI outputs. One developer notes that AI often acts like a new contributor to the repository, and that “AI doesn’t pick the right location to make the edits.” Another developer notes that while “we [..] know the data that will interact with the code, but the model doesn’t know the data. It doesn’t know we need to take care of this weird case of backwards compatibility and [thus] keep this specific line. And this is very hard to give as [context to the model].”.
We hypothesize that the size and maturity of the included repositories increases the amount of tacit knowledge that experienced developers rely on when completing their work—because AI systems may have less access to this knowledge, it may be more difficult for them to assist experienced developers on these issues.
Second, most of these open-source projects have strict style guidelines. The experienced developers in the study were accustomed to coding according to their project’s guidelines, but the AI tools were not – thus requiring developers to review and fix the AI’s output.
Third, the developers in the study had years of experience working on their projects, meaning that they were able to work very efficiently – posing a high standard for AI to compete with.
There have been other studies on the productivity impact of AI tools in real-world settings. One 2024 study found a 26% “increase in the number of completed tasks”, even though the subjects were using older tools – AI coding tools have improved dramatically in the last year. The methodology was less rigorous7, but perhaps more important is that this study involved less-experienced developers working on a wider range of projects. The study notes that “less experienced developers showed higher adoption rates and greater productivity gains”, consistent with the idea that current AI tools are less useful for the experienced developers in the new METR study.
Judging LLMs by Competitive Programming?
The observed 19% productivity decrease stands in particularly sharp contrast to AI scores on coding benchmarks, where models are often found to rank at a human-elite level in coding competitions (though a recent study has called this into question). Coding competition problems are exceptionally small, well-defined, isolated, and greenfield (starting from scratch rather than working in an existing codebase), thus playing directly to AI’s strengths.
Enthusiastic anecdotes about time savings from AI tools often originate from within the big AI labs themselves. This may partly reflect misplaced enthusiasm – remember that the participants in this study believed that AI tools were speeding them up, even as they slowed them down. But it’s also the case that some of the work that goes on in those labs is better suited to AI tools, from coding up a small training experiment to adding a UI element to a chatbot. (It’s worth noting that not all lab insiders report significant value from AI coding tools; this might reflect the range of work.) And it’s possible that engineers at the AI labs are more experienced at using their own tools and benefit from sharing tips with their coworkers.
Here are a few references to recent papers with additional data points; I have not read the papers:
Zvi Mowshowitz (source):
New paper introduces LiveCodeBench Pro, which suggests that AIs are not as good at competitive programming as we have been led to believe. Some top models look like they weren’t tested, but these scores for the same model are lower across the board and all were 0% on hard problems, so the extrapolation is clear. [This benchmark is noteworthy because some of the problems are not published to the Internet, thus avoiding “memorization” issues which plague many coding benchmarks.]
Derek Thompson (source, including a link to the paper):
New paper: When a company goes from 0 → 30% of its code written by AI, a key measure of productivity only increases by 2.4%
Ethan Mollick (source):
Individuals keep self-reporting huge gains in productivity from AI & controlled experiments in many industries keep finding these boosts are real, yet most firms are not seeing big effects. Why? Because gaining from AI requires organizational innovation.
What To Take Away
This study was conducted in early to mid 2025. AI models are only going to get better from here. The coding applications built on those models, like Cursor, are going to keep improving to make better use of the models8. And developers are going to get better at using those applications efficiently and effectively – posing the right kinds of tasks, and providing enough context for the tool to do what they want. Things might change rapidly; the modern wave of LLM-based coding tools is only a couple of years old!
AI tools are also going to expand to address more of the software developer’s job, including reviewing code written by other developers, testing, and even reviewing and testing code written by other AIs.
The study’s finding of a 19% productivity decrease may seem discouraging at first glance, but it applies to a difficult scenario for AI tools (experienced developers working in complex codebases with high quality standards), and may be partially explained by developers choosing a more relaxed pace to conserve energy, or leveraging AI to do a more thorough job. And of course results will improve over time. The paper should not be read as “debunking” the idea of an AI 2027-style software explosion, but it may indicate that significant feedback loops in AI progress are farther away than anticipated – even if some aspects of AI research involve small throwaway projects that may be a better fit for AI coding tools. Meanwhile, it remains to be seen whether AI is generating bloated or otherwise problematic code that will cause compounding problems as more and more code is written by AI.
But perhaps the most important takeaway is that even as developers were completing tasks 19% more slowly when using AI, they thought they were going 20% faster9. Many assessments of AI impact are based on surveys or anecdotal reports, and here we have hard data showing that such results can be remarkably misleading.
Thanks to Nate Rush and Joel Becker for providing us with early access to the study, answering my incessant questions, and providing feedback on this post, and to Daniel Kokotajlo, Joyce Er, and Taren Stinebrickner-Kauffman for additional perspective and feedback.
1. Time measurements were self-reported by the developers. However, later analysis showed that alternate time measurements, such as the length of the video recording or the wall clock time until a pull request is created, yield similar results.
2. That is, the ratio of (actual time taken) to (up-front estimated time if no AI tools were allowed) was 19% longer for AI Allowed tasks than for No-AI tasks.
3. That is, the number of lines added divided by the number of hours the developer predicted a task would take to complete, was 47% more for AI Allowed tasks.
4. The p-value used for statistical significance is not stated in the paper, but an earlier draft made reference to p=0.05.
5. Note that this includes agent mode as well as autocomplete.
6. The authors note that this graph represents only some of the tasks completed in the study, because it was too much work to review and label all of the screen recordings. Many of the tasks where the largest slowdown occurred are not reflected in the graph, “to some extent because it's cheaper to pay labelers to [review] shorter videos”.
7. For one thing, in the 2024 study, developers defined tasks to complete after learning whether they would be using AI tools or not.
8. “Agentic” coding tools like Claude Code and OpenAI Codex have been getting strong reviews, but are so new that none of the developers in this study had adopted them.
9. This despite the fact that, as the paper notes, these developers may be better calibrated than average, since the structure of the study encouraged them to pay attention to how they were spending their time.
Very interesting study and great post!
I was surprised by the headline result, but the explanation does make sense and tracks with my own experience. I find ChatGPT pretty useful for coding, but: 1) I’m not a professional software engineer, my coding is for scientific research; and 2) it’s most useful when I’m trying to learn something new. I’ve definitely wasted time trying to get ChatGPT to do something that I could’ve done myself. (I’d say it’s analogous in those situations to just kind of brute forcing various changes in the code and seeing what works.)
First, thank you, Steve, as Abhay Ghatpande suggested, for this high-quality post, which provides a balanced and objective review of the study.
One of the striking elements is that developers consistently overestimate AI's impact on their productivity by nearly 40 percentage points (from a -19% actual to a +20% perceived increase), highlighting that subjective productivity assessments in the AI era may be fundamentally unreliable without objective measurements. With all the possible biases at play, this is not surprising and reminded me of some of the insights from https://en.wikipedia.org/wiki/Thinking,_Fast_and_Slow by https://en.wikipedia.org/wiki/Daniel_Kahneman
It also helps reinforce the importance of measuring ROI for both objective and subjective metrics to understand the benefit and impact of AI that organizations leverage.