How to evaluate an agent: metrics that actually predict success

November 05, 2025 · 11 min read

Agent evaluation often collapses into a search for one number. Teams want a benchmark score, a leaderboard placement, or a simple pass rate they can quote in a planning meeting. The problem is that real agent systems do not fail in one way. They fail by misunderstanding intent, selecting the wrong tool, producing malformed arguments, timing out across dependencies, or creating a poor user experience even when the backend technically succeeds. A single aggregate metric tends to hide these distinctions rather than clarify them.

A more useful evaluation approach starts by asking what success looks like for the workflow. Did the user complete the task they cared about? How many retries or interventions were needed? How often did tools fail? How long did the work take? Did the product leave users with confidence in the result? Good metrics predict those outcomes. Weak metrics mostly predict demo performance. The gap between the two is where many teams lose months.

Evaluation becomes meaningful when the metric maps to a user-visible outcome or an operational cost the business actually feels.

Start with a strict definition of task success

“Task success” sounds obvious until you try to define it precisely. Did the agent open a support ticket, or open it in the right queue with the right priority and a usable summary? Did it draft an email, or draft one the user accepted without heavy editing? Strict definitions are useful because they align evaluation with real product value. Loose definitions inflate confidence and make improvements look better than they are.
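To make the contrast concrete, here is a minimal sketch of a loose versus a strict success check for the ticket example. The field names, queue values, and the word-count threshold are all illustrative assumptions, not part of any real system:

```python
from dataclasses import dataclass

@dataclass
class TicketRun:
    ticket_created: bool
    queue: str
    priority: str
    summary: str

def loose_success(run: TicketRun) -> bool:
    # Loose definition: a ticket exists at all.
    return run.ticket_created

def strict_success(run: TicketRun, expected_queue: str, expected_priority: str) -> bool:
    # Strict definition: right queue, right priority, and a summary
    # long enough that a human likely would not have to rewrite it.
    # The 10-word threshold is a crude, illustrative proxy for "usable".
    return (
        run.ticket_created
        and run.queue == expected_queue
        and run.priority == expected_priority
        and len(run.summary.split()) >= 10
    )
```

A run that opens a ticket in the wrong queue passes the loose check and fails the strict one, which is exactly the inflation the paragraph above describes.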

A helpful pattern is to define success at the workflow level and then record why unsuccessful runs failed. Maybe the model misunderstood the request. Maybe it asked a needless clarification question. Maybe the right tool was chosen but a provider error blocked completion. That taxonomy turns task success into a dashboard you can act on, rather than a vanity metric you can only admire or dispute.

Questions for a strict success definition

  • Would a user consider the workflow complete without hidden cleanup?
  • Did the final state match the intent, not just the requested verb?
  • Were approval and policy steps followed correctly?
  • Did the system avoid unnecessary retries, loops, or clarification churn?
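The taxonomy described above can be as simple as one tag per unsuccessful run. A sketch, with invented category names and toy data:

```python
from collections import Counter

# Each run records whether it succeeded and, if not, one failure category.
# Categories here are illustrative; use whatever taxonomy fits your workflows.
runs = [
    {"success": True,  "failure": None},
    {"success": False, "failure": "misunderstood_request"},
    {"success": False, "failure": "provider_error"},
    {"success": False, "failure": "needless_clarification"},
    {"success": True,  "failure": None},
    {"success": False, "failure": "provider_error"},
]

success_rate = sum(r["success"] for r in runs) / len(runs)
failure_counts = Counter(r["failure"] for r in runs if not r["success"])
```

The `failure_counts` breakdown is what turns the success rate into something actionable: here the dominant category is a provider error, which points at the tool layer rather than the prompt.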

Track tool failure as a first-class metric

Many agent systems depend on external APIs, internal services, and workflow engines. That means tool performance directly influences user-perceived quality. If the model is excellent but the tool layer is flaky, users still experience an unreliable product. Teams should therefore monitor tool-call failure rate, timeout rate, and retry frequency as core evaluation metrics rather than as background infrastructure stats.

This is valuable for roadmap prioritization too. If one workflow underperforms because a downstream provider rate-limits aggressively, prompt tuning may not be the right intervention. If malformed arguments are the dominant issue, schema design or validation may be the real lever. Tool metrics help separate model quality from systems quality so teams stop solving the wrong problem.
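A per-tool aggregation of the three metrics named above can be computed from ordinary call logs. The log schema here is an assumption for illustration:

```python
def tool_metrics(calls):
    """Aggregate per-tool failure, timeout, and retry stats from call logs.

    Each call is assumed to be a dict like:
      {"tool": "create_ticket", "status": "ok" | "error" | "timeout", "retries": 0}
    """
    by_tool = {}
    for c in calls:
        m = by_tool.setdefault(
            c["tool"], {"calls": 0, "errors": 0, "timeouts": 0, "retries": 0}
        )
        m["calls"] += 1
        m["errors"] += c["status"] == "error"      # bool adds as 0/1
        m["timeouts"] += c["status"] == "timeout"
        m["retries"] += c["retries"]
    for m in by_tool.values():
        m["failure_rate"] = (m["errors"] + m["timeouts"]) / m["calls"]
    return by_tool
```

Splitting the numbers per tool is the point: a single blended failure rate hides the one flaky provider that is dragging down an otherwise healthy workflow.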

Measure time-to-completion, not just end states

A workflow that eventually succeeds after several retries, multiple confirmation loops, and long waits may look fine in a binary success metric while still feeling poor to users. Time-to-completion helps expose that gap. Distribution matters more than average. A median may look healthy while the tail is full of frustrating sessions. Evaluating the full distribution reveals where latency, ambiguity, or approvals create drag that users actually notice.
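The median-versus-tail gap is easy to see with the standard library. The durations below are fabricated to show a healthy median hiding a painful tail:

```python
import statistics

# Illustrative completion times in seconds; most runs are fast,
# two sessions stall on retries or slow external calls.
durations = [4.2, 5.0, 4.8, 5.5, 6.1, 4.9, 5.2, 48.0, 5.1, 61.5]

median = statistics.median(durations)
p95 = statistics.quantiles(durations, n=20)[-1]  # last of 19 cut points = p95
```

A dashboard that reports only `median` here looks fine; `p95` is where the frustrating sessions the paragraph describes actually show up.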

Time metrics also create useful pressure to simplify workflows. If a task regularly succeeds only after two follow-up questions and a slow external provider call, the team can ask whether the product should collect one of those fields earlier, improve previews, or defer part of the action into a draft stage. These are product improvements informed by evaluation, not just diagnostics about the model.

A “successful” run that takes too long or requires too much rescue is often a workflow failure wearing a green badge.

Use error budgets for operational discipline

Error budgets are a useful way to keep agent quality grounded in operational reality. If you decide that a workflow can tolerate only a certain rate of tool failures, malformed writes, or policy bypass incidents before changes slow down, you create a healthier relationship between experimentation and reliability. This is especially important for agent products, where teams can otherwise add new capabilities faster than the operational surface can absorb them safely.

An error budget also supports decision-making across functions. Product can understand why a new high-risk workflow might wait. Engineering can justify work on validation or retries. Leadership can see why apparent feature velocity without reliability guardrails is not real progress. The specific threshold matters less than the discipline of having one and responding to it seriously.
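The budget check itself can be tiny. This is a sketch of one way to frame it, with the 2% threshold in the test chosen purely for illustration:

```python
def budget_remaining(window_runs, failure_predicate, budget_rate):
    """Return the remaining error budget for a rolling window of runs.

    budget_rate is the tolerated failure fraction, e.g. 0.02 for 2%.
    A negative result means the budget is exhausted: by the discipline
    described above, risky changes pause until reliability work lands.
    """
    failures = sum(failure_predicate(r) for r in window_runs)
    allowed = budget_rate * len(window_runs)
    return allowed - failures
```

The predicate argument matters: separate budgets for tool failures, malformed writes, and policy bypasses fail at different rates and deserve different responses.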

The best evaluation stack is small and actionable

Good evaluation frameworks are usually narrower than teams expect. A strict task success rate, a tool failure rate, and a time-to-completion distribution may already tell you most of what you need if you also tag failures with useful categories. Add qualitative review where workflows are high stakes or outputs are subjective, but resist metric sprawl. Too many numbers produce the same confusion as too few when nobody can tell which ones matter.
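The whole stack described above fits in one small report. A sketch, with an assumed per-run record shape and a deliberately crude median:

```python
def eval_snapshot(runs):
    """Roll the three core metrics plus failure tags into one report.

    Each run is assumed to look like:
      {"success": bool, "tool_failures": int, "tool_calls": int,
       "duration_s": float, "failure_tag": str | None}
    """
    n = len(runs)
    total_calls = sum(r["tool_calls"] for r in runs)
    durations = sorted(r["duration_s"] for r in runs)
    return {
        "task_success_rate": sum(r["success"] for r in runs) / n,
        "tool_failure_rate": sum(r["tool_failures"] for r in runs) / total_calls,
        "p50_duration_s": durations[n // 2],  # rough median; fine for a snapshot
        "failure_tags": sorted({r["failure_tag"] for r in runs if r["failure_tag"]}),
    }
```

Four fields is roughly the right size: small enough that everyone knows which number moved, tagged well enough to say why.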

What predicts success is not the flashiest benchmark. It is the set of measures that tell you whether the workflow completed correctly, efficiently, and safely. If your metrics help you choose the next fix with confidence, they are doing their job. If they mostly help you win arguments, they are not.
