Agents · Evals

The agent reliability gap.

It's hard to scroll any AI feed right now without hitting an agent demo. A model books a flight. Another files an expense report end-to-end. A third spins up a working SaaS dashboard from a single sentence. The clips are stunning, and the takeaway is almost always the same: the agents are here.

They're not. Not yet. What's here is a generation of agent demos — single-take performances that hide the long tail of failure modes that show up the second you point one at real work. The gap between an agent that wows in a conference talk and an agent that survives a Monday morning is wider than the demos suggest, and closing it is going to be the actual story of AI in 2026.

Why agents fail differently than software

When a traditional system fails, it usually fails in a way you can name. A null reference. A 500 from an upstream API. A bad migration. You write a test, you fix the bug, you move on. The failure mode is local, the fix is local.

Agents don't work like that. A few things break the old playbook:

  • They're non-deterministic. The same prompt against the same model produces different outputs from one minute to the next. "Worked when I tried it" is not evidence anymore.
  • Errors compound. A 5% error rate at one step becomes a 23% error rate across five steps (the arithmetic is spelled out just below this list). Multi-step plans amplify everything wrong with any single step.
  • They act on the world. A traditional bug returns the wrong number. An agent bug sends the wrong email, books the wrong meeting, fires off the wrong job. Failures aren't recoverable in the same way.
  • The long tail is the job. Demos optimize the happy path. Real work is mostly edge cases — the customer with the weird address format, the invoice with three different totals, the request that contradicts itself halfway through.
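
That compounding claim isn't hand-waving. Assuming each step fails independently at the same rate, the check is a couple of lines:

    # If each of five steps independently goes wrong 5% of the time,
    # the chance that at least one of them goes wrong is:
    p_step, steps = 0.05, 5
    p_any_failure = 1 - (1 - p_step) ** steps
    print(f"{p_any_failure:.1%}")   # 22.6%, roughly the 23% quoted above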

Building an agent that works in a demo is an engineering problem. Building one that works in production is a different engineering problem, and most teams are still solving the first one.

The three moats that close the gap

After enough agent pilots, the teams that ship and the teams that stall start to look different in pretty consistent ways. Three habits separate them.

1. Aggressive scoping

The first thing that goes wrong is almost always over-scoping. A capable model and a generous tool list will technically attempt anything. That's not a feature — it's the whole problem.

Production-grade agents look much narrower than the marketing implies. They have a sharply defined job, a tightly bounded toolset, and explicit refusal paths for anything outside their lane. "Handles support tickets" is a demo. "Handles password resets, account merges, and tier-one billing questions, and escalates everything else with a structured handoff" is a system.

If you can't write down your agent's job in one paragraph, your agent doesn't have a job. It has a vibe.
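
Writing the job down can be as literal as the sketch below. The intent names, the tool list, and the classify_intent / run_agent / escalate callables are all placeholders for whatever your stack actually provides:

    # Aggressive scoping, made explicit: a named job, a bounded toolset,
    # and a refusal path for everything else. All names are illustrative.
    IN_SCOPE_INTENTS = {"password_reset", "account_merge", "tier1_billing"}
    ALLOWED_TOOLS = {"lookup_account", "reset_password", "merge_accounts", "open_billing_case"}

    def route(ticket: dict, classify_intent, run_agent, escalate):
        """Send in-scope work to the agent; everything else gets a structured handoff."""
        intent = classify_intent(ticket)    # a model call, rules, or both
        if intent not in IN_SCOPE_INTENTS:
            return escalate(ticket, reason=f"out_of_scope:{intent}")
        return run_agent(ticket, intent=intent, tools=ALLOWED_TOOLS)

The point isn't the code; it's that the refusal path is a first-class branch, not an afterthought.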

2. Evals as the source of truth

You don't have an agent. You have a hypothesis about an agent. Until you've run that hypothesis against a meaningful evaluation suite — and watched it fail in ways you didn't predict — you don't actually know what you've built.

The teams that ship treat evals the way good engineering teams treat tests: a non-negotiable, version-controlled, CI-gated discipline. Concretely:

  • A growing corpus of real (or realistic) inputs, including the weird ones.
  • Both rubric-based grades and outcome-based checks where the task allows.
  • Regression suites that gate deploys. "The new prompt is better on average" is not the bar. "It is at least as good on every existing case" is (a sketch of that gate follows this list).
  • Human review on a sample of production traffic — every week, forever.
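
A minimal version of that regression gate might look like this; the JSONL case format and the run_agent and grade callables are stand-ins for whatever your harness actually uses:

    # Per-case regression gate: the candidate must score at least as well as
    # the recorded baseline on every existing case, not just better on average.
    import json

    def regression_gate(cases_path: str, baseline: dict, run_agent, grade) -> bool:
        with open(cases_path) as f:
            cases = [json.loads(line) for line in f]        # one eval case per line
        regressions = []
        for case in cases:
            score = grade(run_agent(case["input"]), case)   # rubric- or outcome-based
            if score < baseline.get(case["id"], 0.0):
                regressions.append((case["id"], score))
        if regressions:
            print(f"deploy blocked: {len(regressions)} case(s) regressed")
            return False
        return True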

Evals aren't glamorous. They're the moat.

3. Observability and replay

When — not if — something goes wrong in production, you need to be able to ask: what exactly did the agent do, with what context, and why? That means structured traces of every prompt, every tool call, every retrieval, every plan revision, searchable and replayable.

The teams that do this well can take a real failure, replay it against a tweaked prompt or a different model, and either fix the root cause or add it to the eval suite. The teams that don't are stuck trusting the agent and hoping. Hope is not a deploy strategy.
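
None of this requires exotic tooling. A sketch, with the event shape and the replay_step callable as assumptions rather than any particular product's API:

    # Structured traces: every prompt, tool call, and retrieval becomes a
    # searchable event, and a recorded run can be fed back through a tweaked
    # prompt or a different model via replay_step().
    import json, time

    def log_event(trace_file, run_id: str, kind: str, payload: dict) -> None:
        trace_file.write(json.dumps(
            {"run_id": run_id, "ts": time.time(), "kind": kind, "payload": payload}) + "\n")

    def replay(trace_path: str, run_id: str, replay_step) -> list:
        with open(trace_path) as f:
            events = [json.loads(line) for line in f]
        steps = [e for e in events
                 if e["run_id"] == run_id and e["kind"] in ("prompt", "tool_call", "retrieval")]
        return [replay_step(e) for e in steps]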

The boring conclusion

Most of the agent capabilities people are chasing right now are going to commoditize. Tool use is becoming a protocol, not a moat. Reasoning models are everywhere. Frameworks have largely converged.

What isn't commoditizing is the discipline of making one of these things actually work for a specific job at a specific company on a specific Monday morning. Scoping, evals, observability — the unglamorous middle of the stack — are where agent reliability will be won, and where most pilots are quietly failing.

If you're building an agent for real, the question to keep asking isn't "what else can it do?" It's "how do I know it still works tomorrow?"