For the last few months I've been running an earlier version of Leroy — my AI engineer — on my own projects. It was genuinely good at the thing everyone promises: I'd describe a fix, and minutes later there was a working PR.
And then the PRs piled up.
That's the part nobody warns you about. When generating a fix costs minutes, you generate a lot of them. But every one of those PRs still had to be reviewed, tested, and trusted before it could ship — by me. The backlog didn't shrink. It grew. I had a queue of "probably fine" changes I hadn't verified, and the faster the agent worked, the longer that queue got.
For a while I blamed the pile-up itself. That was wrong. The pile-up was a symptom. The actual problem was underneath it:
Verifying each fix never got cheaper.
Writing code got 10x faster. Confirming a fix was safe to ship got… exactly as slow as it always was. Sure, a frontend-only change on a project with a Vercel preview pipeline already configured was easy: I'd get a live preview for free and could eyeball it in a minute. But not every ticket fits that path. The moment a fix needed the backend, or fresh data to actually exercise it, or touched both sides at once, I fell off the paved road: spin up an environment, get the frontend and backend running together, seed the data, click through it myself. For a one-line backend tweak, fine. For anything cross-stack, that setup could cost more than the fix did. So I didn't do it. The PRs waited.
Multiply "expensive to verify" by "lots of PRs" and you get the wall I hit.
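A toy model makes that multiplication concrete. This is only a sketch: the function name and every number in it are illustrative assumptions, not measurements. PRs arrive at a fixed daily rate, and only as many get verified as a fixed daily review budget allows.

```python
def backlog_after(days, prs_per_day, minutes_per_verify, review_minutes_per_day):
    """Return how many unverified PRs are still waiting after `days` days.

    All inputs are hypothetical knobs for illustration.
    """
    backlog = 0
    # How many PRs fit into one day's review budget.
    capacity = review_minutes_per_day // minutes_per_verify
    for _ in range(days):
        backlog += prs_per_day             # the agent opens new PRs
        backlog -= min(backlog, capacity)  # verify as many as the budget allows
    return backlog


# Expensive verification: 45 minutes per PR, 2 hours of review time a day.
slow = backlog_after(days=10, prs_per_day=6, minutes_per_verify=45,
                     review_minutes_per_day=120)

# Cheap verification: same PR rate, same review time, 5 minutes per PR.
fast = backlog_after(days=10, prs_per_day=6, minutes_per_verify=5,
                     review_minutes_per_day=120)

print(slow, fast)  # 40 0
```

With these made-up inputs, the only thing that changes between the two runs is the cost of verifying one PR. The rate of generating fixes and the time available for review stay the same, yet the queue either grows by four PRs a day or never forms at all.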
The popular answer to this is "remove the human" — full autonomy, let it merge itself. I don't buy it, and not because I'm precious about control. I want to make the call on what ships. What I don't want is to pay a huge tax every time I make it. The bottleneck was never my judgment. It was the cost of exercising it.
So that's the problem Leroy v3 is built around: make verification cheap, and the pile-up drains on its own. Here's what that looks like so far:
- A review agent reads the code changes and builds a red/green test plan — the things that should pass, the things that should fail — and then runs it. You don't get "trust me," you get checks.
- One click puts you in a real environment with the change live. It provisions the frontend and backend together and seeds the data needed to test: the exact setup that used to eat my afternoon, for exactly the tickets that never fit the easy preview path. No config. You click, and you're standing in front of the running change, ready to eyeball it.
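One way to picture a red/green plan is as a list of checks, each tagged with whether it is expected to pass (green) or fail (red), plus a runner that flags any check whose outcome doesn't match the expectation. The following is a minimal sketch under that assumption, not Leroy's actual implementation; `Check`, `run_plan`, and the `apply_discount` change under review are all hypothetical names invented for illustration.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Check:
    name: str
    run: Callable[[], bool]  # returns True if the behavior holds
    expect_pass: bool        # True = green (should pass), False = red (should fail)


def run_plan(checks: List[Check]) -> List[str]:
    """Run every check and return the names whose outcome was unexpected."""
    surprises = []
    for check in checks:
        try:
            passed = bool(check.run())
        except Exception:
            passed = False  # an exception counts as a failed check
        if passed != check.expect_pass:
            surprises.append(check.name)
    return surprises


# Hypothetical change under review: reject discounts outside 0..100 percent.
def apply_discount(price_cents: int, percent: int) -> int:
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100")
    return price_cents * (100 - percent) // 100


plan = [
    # Green: normal behavior must keep working.
    Check("10% off 200 is 180", lambda: apply_discount(200, 10) == 180, expect_pass=True),
    # Red: the bad input the fix is meant to block must now fail.
    Check("150% discount is accepted", lambda: apply_discount(200, 150) >= 0, expect_pass=False),
]

unexpected = run_plan(plan)
print(unexpected)  # [] means every check behaved as planned
```

An empty list is the "you get checks" outcome: the green checks passed and the red ones failed as intended. Any name in the list is a specific thing to look at before signing off.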
None of this takes me out of the loop. It does the opposite — it hands me everything I need to sign off in seconds instead of assembling it myself. The judgment stays mine. The grunt work goes away.
I'm still building it, and I'll write about the parts that break (there will be parts that break). But the core bet feels right: the next unlock in AI engineering isn't writing code faster. We're already drowning in code we can write. The unlock is being able to trust that code fast enough to actually ship it.