The Gas-Powered Wagon

There's a genre of AI blog post that I skip as soon as I recognize it: the company retrospective where they handed everyone Copilot or Claude Code, waited a couple of quarters, and came away disappointed that nothing shipped any faster.

What bothers me about it is that it reaches for a causal story when all it has is a correlation, and it never goes looking for the mechanism underneath. We adopted the thing, the number didn't move, therefore the thing doesn't work. It's the reasoning of someone who straps a gasoline engine onto the back of a wagon, watches the extra weight slow the horses down, and concludes that engines make you slower.

The thing is, you can't actually do that. A wagon is built around the pull of an animal: the wheels only have to roll and bear weight, the axle is a dead beam that holds the wheels apart and carries the load straight down, and the whole thing is tuned for the speed of a walking horse. An engine delivers power in a completely different way, as torque that has to be transmitted to the wheels, which means you need a gearbox to trade the engine's fast spin for usable force, a clutch to engage it, a driveshaft to carry the power back, and a differential so the wheels can turn at different rates through a corner. The dead wagon axle can't take any of that. It would twist and shear the first time the engine bit, and even if it somehow held, the wooden wheels and the plain greased bearings would tear themselves apart at any speed the engine could reach, the brake meant for walking pace wouldn't stop the thing, and the frame would shake itself to splinters. By the time you've added a real drivetrain, a live axle, proper wheels and tires, brakes that work at speed, and a frame stiff enough to survive the vibration, you haven't built a faster wagon. You've built a car, and it only resembles a wagon from a distance.

That is the move every disappointed AI rollout actually made, dropping a new source of power onto a structure that was only ever designed to be pulled along slowly. You can't bolt more horsepower onto every engineer and expect the organization to move faster if the parts that carry the load were built for walking pace, and the research that's accumulated over the last couple of years keeps pointing at the same conclusion, which is that the thing slowing the organization down was never the typing. The coordination problems were already there: the slow reviews, the relitigated decisions, the handoffs that stall for a day. We'd just learned to live with them because the cost of building was high enough to hide them. AI didn't create any of that. It made the rest of the work fast enough that the old friction finally shows up against the new baseline, and the friction is the same friction it always was. It isn't AI's fault that our organizations never kept pace with the work they were already doing; AI only shone a light on it.

The number that goes the wrong way

In the 2024 DORA report, three quarters of the people surveyed said they were leaning on AI for part of their job, and three quarters reported feeling more productive. Then the system-level numbers went the other way: the report estimated that a 25% increase in AI adoption corresponded to a 1.5% drop in delivery throughput and a 7.2% drop in delivery stability, and the 2025 follow-up still found AI adoption tracking with less stable delivery. The easy reading is that the engineers are kidding themselves, that the felt productivity is a sugar high and the dashboard is the truth. I don't buy it, because the feeling and the dashboard aren't measuring the same thing. What an engineer feels when she uses a good model is the speed of implementation, the rate at which intent turns into working code under her hands, and that genuinely has gotten faster. What DORA measures is delivered value, the throughput and stability at the far end of a pipeline that every change still has to crawl through, the review and the testing and the approval and the deploy. Jez Humble has spent fifteen years pointing out that none of the value is real until the change is safely in production, that "done" means released rather than written, and that a fast delivery capability pays off only when the organization wrapped around the pipeline is built to use it. Mary Poppendieck has the sharper version: if you map the value stream from idea to production, the development work isn't where the time goes. It goes to the queues and the approvals, which she puts at somewhere between half and ninety percent of the elapsed time. Speed up the small slice that is implementation, leave the rest of the pipeline untouched, and the delivered number was never going to move much.
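To put rough arithmetic behind that last sentence, here is a minimal sketch. It assumes lead time splits cleanly into an implementation slice and a waiting slice, and that only the implementation slice gets faster. The function name and the 5x speedup are mine, picked for illustration; the two shares just bracket the half-to-ninety-percent range quoted above.

```python
def delivered_speedup(implementation_share: float, implementation_speedup: float) -> float:
    """Overall lead-time speedup when only the implementation slice gets faster.

    implementation_share: fraction of idea-to-production time spent implementing (0 to 1).
    implementation_speedup: how many times faster implementation becomes (> 0).
    """
    if not 0.0 <= implementation_share <= 1.0:
        raise ValueError("implementation_share must be between 0 and 1")
    if implementation_speedup <= 0:
        raise ValueError("implementation_speedup must be positive")
    # Queues and approvals are untouched; only the implementation slice shrinks.
    waiting_share = 1.0 - implementation_share
    new_lead_time = waiting_share + implementation_share / implementation_speedup
    return 1.0 / new_lead_time


# Illustrative inputs only: waiting at 50% and at 90% of elapsed time,
# with an assumed 5x faster implementation.
for share in (0.5, 0.1):
    print(f"implementation = {share:.0%} of lead time -> "
          f"{delivered_speedup(share, 5.0):.2f}x faster delivery")
```

Under those assumptions a 5x faster keyboard buys about 1.67x at the generous end and about 1.09x at the other, which is small enough to vanish into the noise of a quarterly dashboard.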

Every engineer who has used AI to build something for herself over a weekend, or thrown together a tool for a friend, already knows the implementation speedup is real, because at home there is no pipeline in the way, just her and the keyboard and a thing that exists by Sunday night that wouldn't have existed otherwise. The reason she can't reproduce that feeling at work isn't that she imagined it the first time. It's that at work the value has to survive the trip through the pipeline before it counts for anything, and the pipeline is exactly where it stalls, which makes the whole thing a delivery failure that we keep mislabeling as a productivity one.

There's a study people reach for to argue the keyboard didn't even get faster, and it deserves to be handled carefully rather than waved around. METR ran a randomized trial with sixteen experienced developers working real issues on their own mature repositories, codebases averaging over a million lines that they'd maintained for years, and found that with AI allowed they were 19% slower, even though they expected to be faster and believed afterward that they had been. It would be easy to stop there and announce that AI doesn't work, which is the lazy move I opened this post complaining about. METR didn't stop there, they ran a factor analysis, and the parts worth dwelling on are about the quality bar, because the study counted a task as done only when the author was satisfied the code would pass review, style and testing and documentation included, and the mature repositories carried a lot of implicit standards that the model didn't know and the developer had to supply by hand. The time didn't disappear into typing, it went into making the output trustworthy enough to survive a demanding codebase, which is the cost this whole post is about showing up at the scale of a single change. They're careful to note that this says nothing about less experienced developers or unfamiliar codebases, the cases where the quality bar is lower and the model has less hidden context to trip over. Whichever way the keyboard number breaks, what cost them time was trusting the code rather than generating it.

Rachel Stephens at RedMonk had the foresight to see this same thing back in 2024 and her writeup is well worth a read. Her move is to dust off the theory of constraints, the old idea that a system only moves as fast as its tightest bottleneck, and Gene Kim's blunt version of it: that any improvement you make somewhere other than the bottleneck is an illusion. If you pour all your new capacity into writing code faster, and writing code was never the constraint, you've optimized the wrong station on the line. You've fitted a stronger engine and left the axle exactly as it was.
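The constraint argument fits in a few lines. This is a toy model rather than anything from Stephens or Kim: it treats the pipeline as a series of stations, each with a weekly capacity, and treats the throughput of the whole line as the smallest of them. The station names and numbers are invented for illustration.

```python
def pipeline_throughput(capacities: dict[str, float]) -> tuple[str, float]:
    """Return the bottleneck station and the throughput it imposes on the line."""
    if not capacities:
        raise ValueError("need at least one station")
    # A serial line can only move as fast as its slowest station.
    bottleneck = min(capacities, key=capacities.get)
    return bottleneck, capacities[bottleneck]


# Hypothetical capacities, in changes per week.
before = {"write": 10, "review": 4, "test": 6, "deploy": 5}
# Assume AI triples how fast code gets written and nothing else changes.
after = {**before, "write": 30}

print(pipeline_throughput(before))  # ('review', 4)
print(pipeline_throughput(after))   # ('review', 4)
```

Tripling the writing station leaves the answer where it was, because review was the limit before and it is still the limit after.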

Coordination was the heavy part

So what is the constraint, if not the code? My answer is coordination. I don't mean it in the vague "we should communicate better" sense. I mean coordination in the specific, measurable sense the productivity researchers have been pointing at for years.

When Forsgren and Storey and the rest of the SPACE authors laid out their five dimensions of developer productivity back in 2021, one of the five was Communication and Collaboration, sitting right next to the activity counts everyone actually tracks. The DevEx framework that followed put feedback loops at the front, the speed of getting an answer back from a person or a system, the code-review turnaround, the wait for an approval, the handoff between teams. Those are coordination costs wearing different names, and they were already the expensive part long before anyone could generate a thousand lines of plausible code in a minute.

The part I think gets underplayed is the economics, and Nate Jones has put numbers to it in his newsletter better than I'm about to. Coordination has always carried a cost, with the communication paths between people growing roughly with the square of the team size, the n(n-1)/2 that Brooks wrote about half a century ago. What AI changes isn't that arithmetic. What it changes is the price of every hour that arithmetic eats. If an engineer with good tooling can now implement in an afternoon what used to take a week, then every hour that engineer spends in a status meeting, or waiting on a review, or relitigating a decision that was already made, is an hour priced against the much larger amount they could have built instead. The opportunity cost of coordinating went up because the opportunity itself got bigger. Some of the estimates floating around put coordination overhead at well over half of all knowledge-worker hours (which I'd treat as directional rather than gospel), but even the conservative version of that claim means the bottleneck has been sitting in plain sight on everyone's calendar the whole time.
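Both halves of that argument are easy to sketch. The first function is Brooks's n(n-1)/2. The second is my own crude stand-in for the opportunity cost: how many hours of old-pace output one coordination hour now displaces. The 40-hour week and 4-hour afternoon are assumptions I picked to match the example above, not measurements.

```python
def communication_paths(team_size: int) -> int:
    """Pairwise communication paths in a team of the given size: n(n-1)/2."""
    if team_size < 0:
        raise ValueError("team_size must be non-negative")
    return team_size * (team_size - 1) // 2


def coordination_hour_cost(old_hours_per_task: float, new_hours_per_task: float) -> float:
    """Hours of old-pace output forgone for each hour spent coordinating."""
    if old_hours_per_task <= 0 or new_hours_per_task <= 0:
        raise ValueError("task durations must be positive")
    return old_hours_per_task / new_hours_per_task


for n in (5, 10, 20):
    print(f"{n} people -> {communication_paths(n)} paths")

# Assumed: a task that took a 40-hour week now takes a 4-hour afternoon.
print(coordination_hour_cost(40, 4))
```

Going from 5 to 20 people takes the paths from 10 to 190, and at the assumed speedup an hour in a status meeting displaces what used to be ten hours of building.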

What I'd actually do about it

What I'd take from all of this is less a fix than a reorientation. The instinct, once the engine is bolted on, is to keep tuning the engine, to get better at prompting, to adopt the next coding agent, to measure acceptance rates. The more useful instinct is to figure out which parts of the wagon are going to break when you try to go 0 to 60 in 1.66 seconds.

In practice that has started with the step everyone points at first, review, because the agent produces changes faster than anyone can read them and the pull requests stack up. The first tech talk I gave at Bit Complete argued that review was the bottleneck, and I've since come around to thinking that was only half right, because review is mostly a proxy for trust. Nobody really minds that there are more changes to review; what they mind is being asked to vouch for code they have no other way to trust, and a human reading the diff is the slowest and least reliable way to earn that trust. Review is drowning because we let the cheaper sources of trust atrophy, the automated tests and the feature flags and the staged rollouts and the observability and alerting that let a team ship something it isn't yet sure of without betting production on it. Investing heavily in these is what stops review from being the place everything queues, because a change you can roll out to one percent of traffic, watch, and switch off in seconds doesn't need a human to have read every line before it goes out. The DORA authors point at a version of this from another angle, with the throughput and stability drops tracking the arrival of larger, messier change sets, which is to say the old advice about small batch sizes matters more now rather than less, precisely because the tools make it so easy to produce a large batch.
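For anyone who hasn't built one, the one-percent rollout with an off switch is less machinery than it sounds. What follows is a minimal sketch and not any particular flag service's API: `RolloutFlag`, its fields, and the flag name are all hypothetical. It hashes the flag name and user ID into a stable bucket so that the same user always gets the same answer.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class RolloutFlag:
    """Hypothetical percentage-based feature flag with a kill switch."""
    name: str
    percent: float = 0.0  # share of users who see the change, 0 to 100

    def is_enabled(self, user_id: str) -> bool:
        # Hash flag name + user ID so bucketing is stable per user and
        # independent across flags.
        digest = hashlib.sha256(f"{self.name}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % 10_000  # 0..9999
        return bucket < self.percent * 100

    def kill(self) -> None:
        # Switching off is a single assignment, not a redeploy.
        self.percent = 0.0


flag = RolloutFlag("new-checkout", percent=1.0)  # start at one percent
users = [f"user-{i}" for i in range(10_000)]
exposed = sum(flag.is_enabled(u) for u in users)
print(f"{exposed} of {len(users)} users see the change")

flag.kill()
print(any(flag.is_enabled(u) for u in users))  # False once the flag is killed
```

A real system would keep the percentage somewhere every server can read it and pair it with the alerting that tells you when to hit the switch, but the core of the trust mechanism is about this small.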

It has also meant taking the team-shape questions seriously, the ones Team Topologies and Conway's law were already asking. If coordination cost grows with the number of people who have to agree, then the lever is to need fewer of them in the room, with smaller teams holding clearer ownership, decisions written down once and referred back to instead of relitigated, and dependencies designed out of the architecture rather than negotiated across it every sprint. None of that is new advice. What's new is how much it's now worth, and how far the rebuild has to go, because you don't get to bolt the engine on and keep the wagon. The structure that survives the new power is the one you've reinforced and re-geared and rebuilt until it isn't really a wagon anymore.

The thing that actually scares me

The thing that actually scares me in all of this isn't the failed rollouts, it's cognitive offloading, because it's a cost the org chart doesn't capture and it lands on the individual engineer rather than the pipeline. Cognitive offloading, the transfer of mental work from a person to an external system, is exactly what makes agents worth reaching for, and a lot of what gets offloaded is no loss at all, because the boilerplate and the routine lookups and the mechanical refactors were never building anything in the first place. But some of the work we're handing over was the work that kept us sharp, the deep reading of unfamiliar code, the debugging intuition that only comes from hours spent tracking down problems, the design judgment that accumulates from making decisions and watching them play out, and the persistence to stay with a hard problem until the answer is right rather than merely plausible. Those are skills maintained through the act of doing them, and they quietly atrophy when the act is always delegated. The persistence one is the best evidenced: Liu et al. (2025) ran three randomized experiments and found that people who solved problems with AI did measurably worse once it was taken away. The effect was concentrated in the majority who reached for it to get the answer rather than a hint, with the option of a direct solution quietly removing the part of the work that builds the habit of staying.

This matters for the rebuild specifically. The structure I've been describing puts the human in a more supervisory seat, auditing and specifying and orchestrating rather than typing, and auditing requires the very skill the agent is replacing, because you can't evaluate a design proposal you no longer have the judgment to evaluate. So the offloading turns out to be self-limiting: let those skills go and you can only supervise the easy work, which is the opposite of what the rebuild was for. The defense isn't to refuse the tools, it's a bit of deliberate practice, reading some code without the agent, debugging some problems by hand, taking a position on the trade-offs before asking for its analysis, writing specs at the precision a human reviewer would need even when the reader is a machine, and keeping ownership of a few small things end to end. The cost is a little short-term throughput. The return is staying able to do the supervising the whole rebuild depends on.

The companies writing the disappointed retrospectives aren't wrong that the number didn't move. They bolted the most powerful engine anyone has ever handed them onto an axle built for a walking horse, and then blamed the engine when nothing went faster. The engine was never the problem, the wagon always was.