AI agent evals are the tests that tell you whether an agent is doing its job well, and they exist because the testing methods software teams already trust do not work on a system that cannot be relied on to produce the same output twice. A unit test asserts one correct answer. An agent has no single correct answer to assert against, so the question becomes how you judge quality without a fixed target. That question has a discipline of its own, and it is the one most teams skip on the way to production.

The gap shows up the first time someone tries to write a test for an agent the way they would for a function. They capture a known-good output, assert that the agent reproduces it, and watch the test fail on the next run, not because the agent got worse, but because it phrased the same correct answer differently. The instinct to pin behaviour to a fixed string is the right instinct for deterministic code and exactly the wrong one here.

The Test That No Longer Applies

Deterministic software has a clean definition of correct. Given an input, there is a specific output the code should produce, and a test asserts that it does. The assertion is binary and stable: it passes today, it passes tomorrow, and when it fails it tells you something broke.

A language-model agent has none of that stability. The same input can produce different outputs across runs, shaped by sampling, by context, and by whatever version of the model is behind the API that day. There is rarely one correct response; there is a space of acceptable ones and a space of unacceptable ones, and the boundary between them is a matter of judgement rather than string equality. You cannot assert your way to that boundary. The unit test does not get harder. It stops applying.

You cannot test an agent by asserting one correct answer, because there is no single correct answer. You can only judge its output against a standard, which is a different kind of test entirely.
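To make the contrast concrete, here is a minimal sketch of the same agent reply checked two ways. The refund scenario, the strings, and the function names are all invented for illustration, and the substring checks are a deliberately crude stand-in for a real standard.

```python
# Hypothetical example: a known-good reply captured from one run.
GOLDEN = "Your refund of $42.00 will arrive within 5 business days."


def exact_match(output: str) -> bool:
    """The unit-test instinct: pin behaviour to one fixed string."""
    return output == GOLDEN


def meets_standard(output: str) -> bool:
    """Judge against a standard: the facts that must be present, in any phrasing.

    Substring checks are crude (assumed here only to keep the sketch short).
    """
    text = output.lower()
    return all(fact in text for fact in ("refund", "$42.00", "5 business days"))


run_1 = GOLDEN
run_2 = "We have issued a refund of $42.00; expect it in 5 business days."
wrong = "Your refund of $24.00 will arrive within 5 business days."

for name, reply in [("run_1", run_1), ("run_2", run_2), ("wrong", wrong)]:
    print(name, "exact:", exact_match(reply), "standard:", meets_standard(reply))
```

The second run is just as correct as the first, yet only the standard-based check accepts it, while both checks reject the reply with the wrong amount.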

What an Eval Actually Is

An eval is a judgement of quality against a defined standard, rather than a check against a fixed result. The shift is from "did it return exactly this?" to "was the output acceptable on the dimensions we care about?", which forces a question most teams have not answered explicitly: what does a good output actually look like here? Faithful to the source? Within policy? Correctly formatted? Arrived at safely? An eval cannot be written until those dimensions are named, which is why writing evals so often exposes that the team never agreed what good meant.

The output of an eval is not a single pass or fail. It is a measurement you can track over time, a baseline you hold a probabilistic system to as the model, the prompt, and the surrounding code all change underneath you. Without that baseline, "we improved the agent" is an assertion. With it, the claim is something you can show.
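One way to hold a system to a baseline, sketched under assumptions: per-dimension pass rates are computed from individual case results and compared against stored figures, with a small tolerance for run-to-run noise. The dimension names, the tolerance of 0.02, and the numbers are illustrative, not recommendations.

```python
def pass_rates(results: dict[str, list[bool]]) -> dict[str, float]:
    """Turn per-case pass/fail results into one pass rate per dimension."""
    return {
        dim: sum(outcomes) / len(outcomes)
        for dim, outcomes in results.items()
        if outcomes  # skip dimensions with no graded cases
    }


def regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,  # assumed allowance for sampling noise
) -> dict[str, tuple[float, float]]:
    """Return the dimensions that fell below baseline by more than the tolerance."""
    slipped = {}
    for dim, base in baseline.items():
        now = current.get(dim, 0.0)  # a missing dimension counts as a regression
        if now < base - tolerance:
            slipped[dim] = (base, now)
    return slipped


# Hypothetical stored baseline and a fresh run of ten cases per dimension.
baseline = {"faithfulness": 0.90, "format": 1.0}
current = pass_rates({
    "faithfulness": [True] * 8 + [False] * 2,
    "format": [True] * 10,
})
print(regressions(baseline, current))
```

Run on every change to the model, the prompt, or the surrounding code, a comparison like this is what lets "we improved the agent" be checked rather than asserted.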

The Three Ways to Judge

In practice, evaluating agent output draws on three complementary methods, layered by what they can see and what they cost.

Deterministic checks cover the structured surface of an output: did it parse as valid JSON, did it call the permitted tool rather than a forbidden one, did it stay within the schema and the policy. These are cheap, fast, and repeatable, but they are limited to the parts of behaviour that can be expressed as a rule. They tell you the output was well-formed, not whether it was right.
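A minimal sketch of such checks, assuming the agent emits a JSON object with a `tool` name and an `arguments` mapping. That output shape, the tool names, and the allowlist are hypothetical, chosen only to illustrate the three rules named above.

```python
import json

# Hypothetical allowlist and schema for this sketch.
ALLOWED_TOOLS = {"search_orders", "issue_refund"}
REQUIRED_FIELDS = {"tool": str, "arguments": dict}


def deterministic_checks(raw_output: str) -> dict[str, bool]:
    """Rule-based checks: valid JSON, expected schema, permitted tool."""
    results = {"valid_json": False, "schema_ok": False, "tool_permitted": False}
    try:
        payload = json.loads(raw_output)
    except json.JSONDecodeError:
        return results  # nothing else can be checked if it does not parse
    results["valid_json"] = True
    if not isinstance(payload, dict):
        return results
    results["schema_ok"] = all(
        isinstance(payload.get(name), expected)
        for name, expected in REQUIRED_FIELDS.items()
    )
    tool = payload.get("tool")
    results["tool_permitted"] = isinstance(tool, str) and tool in ALLOWED_TOOLS
    return results


print(deterministic_checks('{"tool": "search_orders", "arguments": {"order_id": "A1"}}'))
print(deterministic_checks('{"tool": "delete_account", "arguments": {}}'))
print(deterministic_checks("Sure, I will look that up for you."))
```

Note what the checks cannot say: a well-formed call to `search_orders` with the wrong order ID passes all three.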

Model-graded evals use a language model to score the parts that cannot be reduced to a rule: whether a summary was faithful to its source, whether a response answered the question, whether the tone was appropriate. This is powerful because it scales judgement to volumes no human could review, but it carries its own failure modes, because the judge is itself a probabilistic system that can be inconsistent or gamed. Treating a model-graded score as ground truth is one of the most common mistakes in the field. The reasons it goes wrong, known collectively as the LLM-as-judge problem, are worth understanding before you rely on it.
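As a rough sketch of a model-graded check, the function below asks a judge for a one-word verdict several times and reports how often the samples agree, which surfaces inconsistency instead of hiding it. The `judge` argument is a placeholder for whatever model call you use; no particular provider or API is assumed, and the rubric wording and sample count are illustrative.

```python
from collections import Counter
from typing import Callable

# Hypothetical interface: takes a grading prompt, returns the judge's raw reply.
Judge = Callable[[str], str]

RUBRIC = (
    "You are grading a summary for faithfulness to its source.\n"
    "Reply with exactly one word: PASS or FAIL.\n\n"
    "Source:\n{source}\n\nSummary:\n{summary}\n"
)


def grade_faithfulness(
    judge: Judge, source: str, summary: str, samples: int = 3
) -> dict[str, object]:
    """Sample the judge several times; return the majority verdict and agreement."""
    prompt = RUBRIC.format(source=source, summary=summary)
    votes = []
    for _ in range(samples):
        reply = judge(prompt).strip().upper()
        # Anything outside the rubric's vocabulary is recorded, not guessed at.
        votes.append(reply if reply in {"PASS", "FAIL"} else "INVALID")
    verdict, count = Counter(votes).most_common(1)[0]
    return {"verdict": verdict, "agreement": count / samples, "votes": votes}


def stub_judge(prompt: str) -> str:
    """Stand-in for a real model call so the sketch runs offline."""
    return "pass\n"


print(grade_faithfulness(stub_judge, "The invoice is due on 1 May.", "Payment is due 1 May."))
```

A low agreement figure is itself a signal: a case the judge cannot score consistently is a good candidate for human review.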

Human annotation is the anchor. A small set of examples graded carefully by a person is what calibrates everything else: it tells you whether your deterministic checks cover the cases that matter and whether your model judge agrees with human judgement often enough to be trusted. It does not scale, and it is not supposed to. Its job is to be the ground truth the cheaper methods are measured against.
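Calibrating a judge against human labels can start as simply as the sketch below, which compares the two on the cases both have graded. The case IDs and labels are made up, and the choice to single out "false passes" (judge passes, human fails) is an assumption about which disagreement is most costly.

```python
def judge_agreement(
    human: dict[str, bool], judge: dict[str, bool]
) -> dict[str, object]:
    """Compare judge verdicts with human labels on the cases both have graded."""
    shared = sorted(human.keys() & judge.keys())
    if not shared:
        raise ValueError("no cases graded by both the human and the judge")
    agreeing = [cid for cid in shared if human[cid] == judge[cid]]
    # The costly direction: the judge passes something a person rejected.
    false_passes = [cid for cid in shared if judge[cid] and not human[cid]]
    return {
        "n": len(shared),
        "agreement": len(agreeing) / len(shared),
        "false_passes": false_passes,
    }


# Hypothetical golden set of four human-graded cases (True means acceptable).
human_labels = {"c1": True, "c2": False, "c3": True, "c4": False}
judge_labels = {"c1": True, "c2": True, "c3": True, "c4": False}
print(judge_agreement(human_labels, judge_labels))
```

What level of agreement counts as "often enough to be trusted" is a decision for the team; the code only makes the number visible.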

For Agents, Judge the Path, Not Just the Destination

This is where evaluating an agent diverges from evaluating a single model call. A one-shot LLM eval looks at an input and an output and judges the output. An agent does not produce one output; it takes a sequence of steps, calling tools, reading results, and deciding what to do next, and the final answer is only the last of those steps. Judging the answer alone misses everything that happened to produce it.

An agent can arrive at the right answer through a broken path: by looping through the same lookup a dozen times, by calling a tool it should never have had access to, by violating a policy in the middle of a run that happened not to affect the final text. Score only the destination and all of those pass. The failure is in the trajectory, and evaluating the trajectory requires being able to reconstruct it, which is why evaluation depends on goal-level observability. You cannot grade a path you could not see. The trace gives you the object to judge; the eval supplies the judgement.
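A minimal sketch of grading the path rather than the destination, assuming the trace has already been reconstructed as an ordered list of tool calls. The `Step` shape, the tool names, and the thresholds are hypothetical.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Step:
    """One tool call from a reconstructed trace (assumed shape)."""
    tool: str
    arguments: str  # serialised arguments, so identical calls compare equal


def evaluate_trajectory(
    steps: list[Step],
    allowed_tools: set[str],
    max_repeats: int = 3,   # assumed threshold for "looping"
    max_steps: int = 20,    # assumed budget for a single run
) -> dict[str, object]:
    """Flag forbidden tools, repeated identical calls, and over-long runs."""
    calls = Counter((step.tool, step.arguments) for step in steps)
    findings: dict[str, object] = {
        "forbidden_tools": sorted({s.tool for s in steps if s.tool not in allowed_tools}),
        "repeated_calls": sorted(
            f"{tool}({args})" for (tool, args), n in calls.items() if n > max_repeats
        ),
        "too_many_steps": len(steps) > max_steps,
    }
    findings["passed"] = not (
        findings["forbidden_tools"] or findings["repeated_calls"] or findings["too_many_steps"]
    )
    return findings


# A run whose final answer might be right but whose path is broken.
trace = [Step("lookup_order", "A1")] * 12 + [Step("delete_account", "u9")]
print(evaluate_trajectory(trace, allowed_tools={"lookup_order", "issue_refund"}))
```

An output-only eval would never see either finding, because neither one has to change the final text.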

Where Eval Programmes Go Wrong

The failures are consistent enough to name. The most common is evaluating only the final output, which makes unsafe and inefficient runs invisible as long as the answer lands. Close behind is the single metric, a one-number "quality score" that collapses faithfulness, safety, and format into a figure that cannot tell you which of them slipped. Then there is the unanchored judge, a model-graded eval trusted as truth with no human-graded set behind it to confirm it tracks reality. And underneath all of them is a sequencing problem: teams that build their evals after deployment, in response to an incident, rather than defining what they would measure before the system went live.
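The single-metric failure is easy to demonstrate with invented numbers. In the sketch below, two weeks of scores collapse to the same average even though faithfulness has dropped by ten points; the figures and dimension names are illustrative only.

```python
def collapsed_score(scores: dict[str, float]) -> float:
    """The one-number 'quality score': a plain average across dimensions."""
    return sum(scores.values()) / len(scores)


def slipped(before: dict[str, float], after: dict[str, float]) -> list[str]:
    """Dimensions whose score went down between two runs."""
    return sorted(dim for dim in before if after.get(dim, 0.0) < before[dim])


# Hypothetical per-dimension scores for two consecutive weeks.
last_week = {"faithfulness": 0.95, "safety": 0.99, "format": 0.85}
this_week = {"faithfulness": 0.85, "safety": 0.99, "format": 0.95}

print(round(collapsed_score(last_week), 3), round(collapsed_score(this_week), 3))
print(slipped(last_week, this_week))
```

The averages match, so a dashboard showing only the collapsed figure would report no change while the agent became less faithful to its sources.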

None of these is a tooling problem, and none is fixed by buying an eval platform. They are failures to decide what good looks like and to measure against it deliberately. The platform can run the evals; it cannot tell you what to evaluate.

A Practical Starting Point

The useful first move is not to instrument anything. It is to write down, for one real task your agent performs, what a complete judgement of a good run would require: the dimensions that matter, the failures you most need to catch, and a handful of real cases, including the ones that have already gone wrong. That description is the specification for your eval set, and it is worth more than any amount of generic metric collection, because it is grounded in how your system actually fails rather than in what is easy to measure.
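If it helps to give that written description a shape, one possibility is a small data structure that records the task, the dimensions, the failures to catch, and the cases, and that reports what is still missing. Everything here, including the class names, the fields, and the example task, is a hypothetical sketch rather than a prescribed format.

```python
from dataclasses import dataclass, field


@dataclass
class EvalCase:
    case_id: str
    task_input: str
    known_failure: bool = False  # True for cases that have already gone wrong
    note: str = ""


@dataclass
class EvalSpec:
    task: str
    dimensions: list[str]   # what "good" is judged on
    must_catch: list[str]   # the failures you most need to catch
    cases: list[EvalCase] = field(default_factory=list)

    def gaps(self) -> list[str]:
        """List the parts of the specification that are still empty."""
        problems = []
        if not self.dimensions:
            problems.append("no dimensions named")
        if not self.must_catch:
            problems.append("no priority failures listed")
        if not self.cases:
            problems.append("no real cases collected")
        elif not any(case.known_failure for case in self.cases):
            problems.append("no past failures among the cases")
        return problems


spec = EvalSpec(
    task="Answer refund-status questions",
    dimensions=["faithful to order record", "within refund policy", "valid format"],
    must_catch=["quotes the wrong amount", "promises a refund outside policy"],
    cases=[
        EvalCase("c1", "Where is my refund for order A1?"),
        EvalCase("c2", "Refund order B7 now", known_failure=True, note="approved outside policy"),
    ],
)
print(spec.gaps())
```

An empty list from `gaps()` does not mean the spec is good, only that none of the parts named above has been left blank.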

From there, the build is real engineering: anchoring a human-graded golden set, layering deterministic checks under model-graded scoring, and wiring the whole thing into a baseline you can run on every change. That construction, the evaluation rig that holds a non-deterministic system to a standard, is the work we take apart in our AI Agent Observability and Reliability course. This piece is the case for doing it; the course is how it is built. The teams that keep agents reliable in production are the ones that decided what good meant and measured against it from the start, not the ones that waited for a wrong answer to tell them they had never defined it.