We had four AI code reviewers running on every pull request: Cursor bot, Sentry bot, GitHub Copilot's bugbot, and Claude's code reviewer. Each one would unleash a wall of comments, sometimes 150+ on a single 1,000-line PR.

So we built a skill to triage them. It would read every unresolved comment, determine if the finding was valid, and respond accordingly. Roughly 25% were real. The other 75% were false positives. Noise.
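For a sense of the shape of that triage loop, here is a minimal sketch. It is not our actual skill: the `ReviewComment` fields and the `is_valid` judge are hypothetical stand-ins, and in practice the judge would be a model call rather than a one-line function.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass
class ReviewComment:
    reviewer: str   # which bot left the comment (hypothetical field)
    body: str       # the comment text
    resolved: bool  # resolved threads are skipped


def triage(comments: Iterable[ReviewComment],
           is_valid: Callable[[ReviewComment], bool]) -> dict:
    """Split unresolved comments into valid findings and false positives."""
    valid, noise = [], []
    for comment in comments:
        if comment.resolved:
            continue
        # The judge decides; everything it rejects is treated as noise.
        (valid if is_valid(comment) else noise).append(comment)
    total = len(valid) + len(noise)
    return {
        "valid": valid,
        "noise": noise,
        "valid_rate": len(valid) / total if total else 0.0,
    }
```

The `valid_rate` field is the number that told us how much of the wall was noise.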

But the real insight came from the postmortem. We'd built another skill that runs post-merge: it goes back through everything that was caught across all those review passes and asks: what patterns are we actually seeing? What should we learn from this?

Turns out, 67% of the valid issues were nuanced logic errors. Not style violations. Not best-practice gaps. Actual bugs — a function consuming data it shouldn't trust, a state update that doesn't account for a race condition, a mutation that silently breaks a downstream consumer. The kind of thing you only catch by tracing inputs upstream and outputs downstream, like a skeptical staff engineer tugging on every seam.

None of our reviewers were built for that. They were built to know everything — and they were rigorous about nothing.

That was the moment I realized we had the wrong model entirely. Not the language model — the mental model.


The speed is real. So is the cost.

Everyone's chasing 10x. We got there. Our team — a small engineering group at Duro — was shipping features that previously took weeks in a matter of hours. Agent teams running in parallel. TDD-first pipelines. Automated spec compliance. The velocity was genuinely transformative.

But here's what nobody puts in the LinkedIn post (weird, right?): Google's 2025 DORA Report confirmed that AI adoption has a negative relationship with software delivery stability. And Faros AI's telemetry across 10,000+ developers put hard numbers on the problem.

+9% increase in bug rates
+91% increase in review time
+154% increase in PR size

More code, faster, with more bugs, that takes longer to review. That's not 10x. That's 10x output with 0.5x quality — and the math on that is ugly.
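One rough way to read those figures together, as a back-of-the-envelope sketch: if review time per PR grew 91% while PR size grew 154%, each unit of PR size is getting less review attention than before. This assumes both percentages describe the same average PR, which I haven't verified against the underlying telemetry.

```python
# Multipliers relative to the pre-AI baseline, taken from the figures above.
review_time = 1 + 0.91   # +91% review time
pr_size = 1 + 1.54       # +154% PR size

# Assumption: both figures apply to the same average PR.
review_time_per_unit_size = review_time / pr_size
print(f"{review_time_per_unit_size:.2f}")  # prints 0.75
```

Under that assumption, every line of a PR gets roughly a quarter less review time, while bug rates are going up.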

You can ship 10x faster and 10x worse. That's not a win. That's a liability wearing a velocity metric as a disguise.


The spec is the product

When one developer gets a vague requirement, they make one interpretation. That interpretation might be wrong — but at least it's consistently wrong. You can fix it in one place.

When five parallel agents get a vague requirement, they make five different interpretations. Each one is internally coherent. Together, they're chaos.

"A precise spec multiplies into precise implementations everywhere. A vague spec multiplies errors across parallel runs."

— Addy Osmani, The Code Agent Orchestra

The spec isn't overhead. The spec is the product. The code is a byproduct — a deterministic output of a well-defined input. Get the input right, and you can run it through as many agents as you want. Get the input wrong, and you've just multiplied your mistakes at the speed of inference.

This is why "just assign it to AI" doesn't work. Not because the AI is incapable, but because the input to the AI is sloppy. Garbage in, garbage out, at five times the throughput. It's also why vibe coding only works for prototypes (which I love too, when appropriate).


Specialists beat generalists (we learned this the hard way)

Back to our postmortem findings. We knew that 67% of real issues were logic errors — things only caught by tracing data flow, not checking style guides. So we stopped trying to build one reviewer that catches everything. Instead, we built reviewers that each catch one thing exceptionally well.

Five specialists (security, dead-code, error-handling, async-safety, performance), each running in isolation, each with a narrow scope and deep focus. And a logic reviewer built to do what our postmortem told us mattered most: trace the inputs upstream, trace the outputs downstream, and ask "who calls this? who consumes this? what breaks if this is wrong?" We may even take this further and split a specialist if its scope grows too large (or not, if the models get smarter. Who knows).
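The structure is simple enough to sketch. This is an illustration, not our implementation: the two toy reviewers below are string checks standing in for model-backed specialists, and every name here is hypothetical. The point is the shape, where each specialist sees only the diff and never another specialist's output.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Finding:
    specialist: str
    message: str


# A reviewer takes a diff and returns messages about its one concern.
Reviewer = Callable[[str], list[str]]


def run_specialists(diff: str, specialists: dict[str, Reviewer]) -> list[Finding]:
    """Run each narrow reviewer in isolation and collect its findings."""
    findings = []
    for name, review in specialists.items():
        for message in review(diff):
            findings.append(Finding(name, message))
    return findings


# Toy stand-ins; real specialists would be prompts with a narrow scope.
def security_reviewer(diff: str) -> list[str]:
    return ["eval() on possibly untrusted input"] if "eval(" in diff else []


def error_handling_reviewer(diff: str) -> list[str]:
    return ["bare except swallows errors"] if "except:" in diff else []


SPECIALISTS = {
    "security": security_reviewer,
    "error-handling": error_handling_reviewer,
}
```

Adding a specialist means adding one entry to the dictionary, which keeps each reviewer's scope from creeping.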

The improvement was immediate. Not incremental — categorical.

Turns out Stripe arrived at the same conclusion independently. Their Minions system exposes roughly 500 tools via MCP through an internal server called Toolshed — but each agent gets a curated subset of about 15. Their engineering team put it simply: "Agents perform best with a carefully curated subset relevant to their task."
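I don't know how Stripe actually selects each agent's subset, so treat the following as a guess at the general idea rather than a description of Toolshed. It assumes every tool carries a set of tags and picks the tools whose tags overlap the task the most, capped at 15.

```python
def curate_tools(registry: dict[str, set[str]],
                 task_tags: set[str],
                 limit: int = 15) -> list[str]:
    """Pick at most `limit` tools whose tags overlap the task, best overlap first."""
    scored = [(len(tags & task_tags), name) for name, tags in registry.items()]
    relevant = [(score, name) for score, name in scored if score > 0]
    # Highest overlap first; ties broken alphabetically for stable output.
    relevant.sort(key=lambda pair: (-pair[0], pair[1]))
    return [name for _, name in relevant[:limit]]


# Hypothetical registry of 500 tools, half tagged "db" and half tagged "ui".
registry = {f"tool_{i}": ({"db"} if i % 2 == 0 else {"ui"}) for i in range(500)}
subset = curate_tools(registry, {"db"})
print(len(subset))  # prints 15
```

Tag overlap is the crudest possible relevance score. The part worth copying is the hard cap.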

Focused context dramatically outperforms comprehensive context. Every time.


3–5 agents, not 50

The "AI swarm" narrative sounds incredible in a conference talk. Fifty agents descending on your codebase like a cloud of productivity. Ship everything. Ship it now.

In practice — it's a mess.

The research is converging on a different number: 3–5 specialized agents per team. That's it. Frontend, backend, testing. Or API, database, integration. Small, focused, domain-specific teams where each agent owns a clear slice of the problem.

Stripe's 1,300 PRs per week don't come from one massive swarm. They come from many small, focused agent runs — each scoped to a single task in an isolated cloud environment. The impressive scale comes from concurrency, not swarm size.

The "thousands of PRs" headline is real. The "massive swarm" mental model behind it is not.


Humans aren't going anywhere. They're getting promoted.

Here's the part where I'm supposed to tell you that AI is replacing developers. I won't — because it's not true, and I'm tired of reading that take from people who don't ship software for a living.

What is happening is a role shift. The value moved upstream. Way upstream.

At Stripe, human review didn't disappear when Minions started shipping 1,300 PRs a week. It shifted — from writing code to evaluating code. From generation to verification. The humans are still there. They're just doing different work.

Think about what that means for your day. You're not the bricklayer anymore. You're the architect, the inspector, and the one who decides what gets built in the first place. That's not a demotion. That's a promotion — and it requires a different muscle.

Writing great specs. Making architectural judgments under ambiguity. Recognizing when agent output is almost right — which is the most dangerous kind of wrong, because it passes a casual glance. These are senior engineering skills, and they've never been more valuable than they are right now.


A bag of skills is not a workflow

I've watched teams assemble impressive skill collections. A skill for Jira integration. A skill for code review. A skill for generating tests. A skill for writing PR descriptions. They share them in Slack like recipe cards. They demo beautifully.

And they fall apart at the seams.

Because a bag of skills is not a workflow. Without pipeline discipline — without enforced handoffs, quality gates between stages, and a framework that has opinions about how work flows from idea to production — you're just vibe coding with extra steps.

(Yeah. I've been avoiding that comparison for a while now. But it fits.)

The difference between a framework and a skill collection is the difference between a manufacturing line and a workbench full of nice tools. Both can produce things. Only one produces them consistently, at quality, at scale.

A framework encodes your team's engineering standards into something enforceable. It decides when specs get reviewed, when code gets tested, when PRs get specialist scrutiny — and it does this every single time, not just when someone remembers. The framework is the nervous system. The skills are organs. Without the nervous system, the organs don't coordinate. They just twitch independently.
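In code terms, the difference is that a framework refuses to hand work to the next stage until a gate passes. Here is a minimal sketch of that idea, with hypothetical names throughout. It is not how Jig is implemented, just the smallest thing that enforces a handoff.

```python
from typing import Callable


class GateFailed(Exception):
    """Raised when a stage's output does not pass its quality gate."""


# (stage name, step that transforms the work, gate that approves the result)
Stage = tuple[str, Callable[[dict], dict], Callable[[dict], bool]]


def run_pipeline(work: dict, stages: list[Stage]) -> dict:
    """Run stages in order; stop at the first gate that rejects a handoff."""
    for name, step, gate in stages:
        work = step(work)
        if not gate(work):
            raise GateFailed(f"{name} gate rejected the handoff")
    return work
```

A skill collection is the list of `step` functions. The framework is the loop and the `gate` checks, which run every time whether or not anyone remembers.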


Where this goes

The next leap isn't faster models. It's three things:

Cloud execution. Agents shouldn't need your laptop. Isolated environments that spin up in seconds, work autonomously, and deliver finished PRs. Stripe's already there with pre-warmed devboxes. Claude Code now supports remote sessions. The infrastructure is arriving — and when it does, your local machine stops being the bottleneck.

Continuous autonomy. For pre-approved work types — migrations, dependency bumps, well-scoped bug fixes — the pipeline should run without a human trigger. A planner agent watches your backlog and pre-decomposes issues. When you approve, workers execute. The human becomes the approver, not the initiator.

Trust tiers. Not all work needs the same level of oversight. A dependency bump auto-merges after CI passes. A dead code cleanup needs a quick scan. A new API endpoint needs human architecture review. The system should know the difference — and calibrate its quality gates accordingly.
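As a sketch, trust tiers can start as a lookup table. The work-type names below are hypothetical labels for the three examples above, and defaulting unknown work to the strictest tier is my assumption about what a safe default looks like.

```python
from enum import Enum


class Oversight(Enum):
    AUTO_MERGE_AFTER_CI = "auto-merge after CI passes"
    QUICK_SCAN = "quick human scan"
    ARCHITECTURE_REVIEW = "human architecture review"


TRUST_TIERS = {
    "dependency_bump": Oversight.AUTO_MERGE_AFTER_CI,
    "dead_code_cleanup": Oversight.QUICK_SCAN,
    "new_api_endpoint": Oversight.ARCHITECTURE_REVIEW,
}


def required_oversight(work_type: str) -> Oversight:
    # Assumption: anything unclassified gets the strictest tier.
    return TRUST_TIERS.get(work_type, Oversight.ARCHITECTURE_REVIEW)
```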

But here's the thing about all three — none of them work without quality infrastructure underneath. You don't scale speed. You scale trust. You build the quality gates first. Then you open the floodgates.


We built Jig because we needed it. Our team was moving fast — genuinely fast — and nothing in the ecosystem was keeping up with the standard we held ourselves to. Not because the AI wasn't capable. Because nothing was holding it accountable.

Every stage has a skill. Every handoff has a quality gate. Every shortcut has a guardrail.

10x was never the goal. Shipping with confidence — at any speed — was always the goal.

p.s. Stripe's secret wasn't their AI model. It was the devbox infrastructure they built for human engineers years before LLMs existed. Same pattern everywhere — the companies winning the AI transition aren't the ones bolting on the cleverest features. They're the ones who invested in boring fundamentals: great APIs, structured workflows, comprehensive test suites. The AI just plugs in. Back to basics is the dark horse.