The merge test is the only eval that matters.

Here's a question I've started asking every time I evaluate an AI coding tool: would a senior engineer on my team actually merge this output?

Not "does it compile?" Not "does the test suite pass?" Not "does it solve the stated problem?" Those bars are real, but they're not the whole job. The whole job is: would a thoughtful engineer, seeing this code in a pull request, hit approve?

Most benchmarks don't ask that question. And I think that's costing us.

The benchmark arms race and what it's actually measuring

SWE-bench became the de facto leaderboard for AI coding assistants. The premise was solid: give the model a real GitHub issue, let it write a fix, run the existing test suite, see if the tests pass. Concrete, reproducible, competitive. Labs raced up the leaderboard.

Then Cognition looked more closely at the benchmark itself. In June 2025, they published an audit of SWE-bench Verified — the human-validated subset that was supposed to be the gold standard — and found that roughly 35% of its samples had at least one meaningful flaw: tasks with marginal real-world value, problem specifications that were ambiguous or wrong, or test suites that didn't actually validate the solution correctly.

If that audit holds, you're measuring model performance against a ruler with roughly a third of its markings in doubt. The scores aren't wrong, exactly; they're measuring something slightly different from what the leaderboard implies.

Cognition released SWE-bench Prime to address this: a filtered version that removes the flawed samples and adds harder, multi-file tasks sourced from real engineering work. It's a meaningful improvement. But it still sits in the same paradigm: pass the tests, score the point.

Passing tests is table stakes. It's not the finish line.

The question behind the question

When a senior engineer reviews a PR, they're not running a checklist. They're running a mental simulation of what happens after this code ships. They're asking: is this readable six months from now? Does this pattern fit how the rest of the codebase is organized? Does this add a dependency that will haunt us? Is this the obvious solution, or the clever one — and is clever what we want here?

Those questions don't show up in any test suite. They can't. They're the accumulated judgment that takes years to build, and they're exactly what separates code that works from code that belongs.

Cognition has been thinking about this gap. Their December 2024 Cognition Evals framework introduced a category called Enterprise Code Quality, with an eval called CodeReview. The setup is deliberately different from SWE-bench: an evaluator LLM is asked to role-play as a senior engineer doing a code review, then make a binary decision — accept or reject, as if deciding whether to merge the pull request. No test harness. Just judgment.
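
To make that setup concrete, here is a minimal sketch of a merge-test harness. It is not Cognition's implementation, which this post doesn't have access to. The prompt wording and the names `Submission`, `parse_verdict`, `merge_rate`, and `keyword_judge` are all hypothetical. The judge is any function that takes a prompt string and returns a reply string, so a real LLM client could be wrapped behind it. The stub below stands in for one so the example runs on its own.

```python
from dataclasses import dataclass
from typing import Callable, Sequence

# Hypothetical prompt; the real CodeReview prompt is not public in this post.
REVIEW_PROMPT = (
    "You are a senior engineer reviewing a pull request.\n"
    "Decide whether you would merge it as-is.\n"
    "Reply with exactly one word: ACCEPT or REJECT.\n\n"
    "Task description:\n{task}\n\nDiff:\n{diff}\n"
)


@dataclass
class Submission:
    task: str  # the issue or task the model was asked to solve
    diff: str  # the model's proposed change


def parse_verdict(reply: str) -> bool:
    """Map the judge's reply to True (merge) or False (reject)."""
    word = reply.strip().upper()
    if word == "ACCEPT":
        return True
    if word == "REJECT":
        return False
    # Fail loudly rather than silently counting a garbled reply as a reject.
    raise ValueError(f"unparseable verdict: {reply!r}")


def merge_rate(submissions: Sequence[Submission],
               judge: Callable[[str], str]) -> float:
    """Fraction of submissions the judge would merge. No test harness involved."""
    if not submissions:
        raise ValueError("no submissions to review")
    verdicts = [
        parse_verdict(judge(REVIEW_PROMPT.format(task=s.task, diff=s.diff)))
        for s in submissions
    ]
    return sum(verdicts) / len(verdicts)


def keyword_judge(prompt: str) -> str:
    """Stand-in for an LLM call: rejects any diff that still contains a TODO."""
    return "REJECT" if "TODO" in prompt else "ACCEPT"


submissions = [
    Submission(task="Fix off-by-one in pager", diff="- i <= n\n+ i < n"),
    Submission(task="Add retry logic", diff="+ # TODO: handle timeouts\n+ retry()"),
]
print(merge_rate(submissions, keyword_judge))  # 0.5
```

The interesting design choice is the forced binary. A 1-to-10 quality score lets the judge hedge; accept or reject is the decision a reviewer actually has to make.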

It's a small framing shift. It's also a more honest representation of what production deployment actually demands.

Why this framing shift matters for tool selection

Right now, most teams pick their AI coding assistant based on some combination of benchmark position, word of mouth, and a vibe from the demo. That's not entirely irrational — benchmarks track something, and demos reveal rough capability — but it's incomplete in a way that's going to show up at the worst possible time.

The gap appears in code review. It appears in the refactor that technically works but leaves a mess the next engineer has to untangle. It appears when the AI assistant writes code that passes CI and then introduces a subtle ordering dependency that breaks production six weeks later.

Evals that ask "does this code behave like code a senior engineer would approve?" are getting at a different signal than pass/fail on tests. They're asking about style consistency, architectural fit, over-engineering, under-documentation, security smell, the whole texture of professional judgment.

That signal is harder to automate and noisier to measure. It's also the one a test suite can't give you, which is why it's worth the extra effort.

The uncomfortable implication

If you're deploying AI coding assistants and measuring ROI by lines of code generated or tasks completed, you may be building up a debt that won't show up on any dashboard until it does.

Technically-correct code that your team can't maintain isn't an asset. It's a mortgage.

The teams I respect most add a layer of scrutiny that looks less like automated scoring and more like ordinary review: a senior engineer reads the AI-generated output before it lands in main. Not for every commit, since that doesn't scale, but often enough to build intuition about where the model's judgment actually fails.

Evals like CodeReview are a step toward automating that scrutiny. They're not perfect. An LLM playing the role of a code reviewer is still an approximation. But the approximation is asking the right question.

What to do with this

The practical takeaway isn't "wait for better benchmarks." It's: be deliberate about what you're actually measuring when you choose a coding tool, and be honest about what the current benchmarks leave out.

  • When evaluating tools internally, add a code quality review pass to your assessment — even just a few hours of senior engineer time looking at AI outputs with fresh eyes. What you find will tell you more than any leaderboard position.
  • When reading benchmark numbers, ask what the eval is actually rewarding. Task completion and code quality are correlated, but they're not the same thing.
  • When setting team expectations, frame AI coding assistants as first drafts, not final answers. The engineer who reviews the output isn't slowing down the process. They're the part of the process that makes the output usable.
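
If you do run the review pass from the first bullet, it helps to record the reviewer's call next to the test result for each output, so the gap between "passes" and "mergeable" becomes a number. The sketch below is one way to tally that. The record fields and the `summarize` function are my own hypothetical names, and the sample data is made up purely to show the shape of the output.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ReviewedOutput:
    task_id: str
    tests_passed: bool          # what a pass/fail benchmark would score
    reviewer_would_merge: bool  # the senior engineer's call


def summarize(outputs: Sequence[ReviewedOutput]) -> dict:
    """Cross-tabulate test results against reviewer verdicts."""
    counts = Counter((o.tests_passed, o.reviewer_would_merge) for o in outputs)
    total = len(outputs)
    passed = counts[(True, True)] + counts[(True, False)]
    merge_ready = counts[(True, True)]  # passes tests AND reviewer approves
    return {
        "total": total,
        "tests_passed": passed,
        "merge_ready": merge_ready,
        "passed_but_rejected": counts[(True, False)],
        "pass_rate": passed / total if total else 0.0,
        "merge_rate": merge_ready / total if total else 0.0,
    }


# Illustrative, invented data: four AI-generated changes after review.
sample = [
    ReviewedOutput("task-1", tests_passed=True, reviewer_would_merge=True),
    ReviewedOutput("task-2", tests_passed=True, reviewer_would_merge=False),
    ReviewedOutput("task-3", tests_passed=True, reviewer_would_merge=True),
    ReviewedOutput("task-4", tests_passed=False, reviewer_would_merge=False),
]
report = summarize(sample)
print(report["pass_rate"], report["merge_rate"])  # 0.75 0.5
```

The `passed_but_rejected` bucket is the one worth reading line by line. Those are the outputs a leaderboard would count as wins.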

The merge test isn't exotic. It's what good engineers have always done. Making it the center of how we evaluate AI coding tools is overdue.

The question was always "would you ship this?" We just forgot to ask it.
