AI agent observability is the practice of monitoring whether an autonomous agent is actually doing its job, not just whether the service running it is up. The distinction matters because an agent can return a clean response, pass every infrastructure check, and still be wrong. Standard monitoring is structurally unable to see that failure, and most teams discover the gap only after a bad output has already reached a user.
A traditional monitoring stack is built on a contract that language models break. In deterministic software, uptime and correctness are bound together: the code defines the rules, and when something violates them the application fails loudly with an exception or an error code. If the system is up, it is doing what it was told. An agent built on a probabilistic model has no such guarantee. The output is a prediction weighted by likelihood, not the result of rules you can read line by line, so the system can run flawlessly and fail you in the same breath.
The 200 OK That Hides a Failure
Picture an automated tool that reconciles vendor invoices. A request arrives, a series of model calls execute, a fully structured response comes back, and the gateway records a 200. CPU was efficient. Memory stayed flat. To every layer of the infrastructure, the transaction succeeded.
Inside that response, the model met an invoice in an unfamiliar format, misread a row, and returned an instruction to pay ten times the amount due. No alarm fired because no rule was broken. The model processed the context and returned its best prediction. Mechanically, everything worked. The system drifted completely off course, and nothing in the stack was positioned to notice.
This is not a failure of the monitoring tools. HTTP status records whether a response came back within the protocol; it does not inspect the content. Latency measures the time to return a response; a wrong answer takes the same time as a correct one. Memory tracks allocation; the model uses the same memory whether its output is right or wrong. Each metric is working exactly as designed. The problem is that "working" here means observing the transport layer, and a healthy transport layer is a separate property from a correct output.
A language model can process a request flawlessly and fail you at the same time. Uptime tells you the system ran. It tells you nothing about whether the system was right.
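A minimal sketch of the gap, using the invoice example above. The response shape, the field names `expected_total` and `amount_to_pay`, and the 1% tolerance are all hypothetical, and the sketch assumes an independent figure (a purchase order, say) is available to compare against.

```python
from dataclasses import dataclass
from decimal import Decimal


@dataclass
class PaymentInstruction:
    """Hypothetical shape of the agent's structured output."""
    http_status: int
    expected_total: Decimal  # from an independent source, e.g. the purchase order (assumed)
    amount_to_pay: Decimal   # the amount the agent instructs the system to pay


def transport_ok(resp: PaymentInstruction) -> bool:
    # What the gateway sees: a response came back within the protocol.
    return resp.http_status == 200


def content_ok(resp: PaymentInstruction, tolerance: Decimal = Decimal("0.01")) -> bool:
    # What the gateway cannot see: does the instruction match the amount due?
    if resp.expected_total <= 0:
        return False
    drift = abs(resp.amount_to_pay - resp.expected_total) / resp.expected_total
    return drift <= tolerance


# The ten-times overpayment: clean at the transport layer, wrong in content.
bad = PaymentInstruction(200, Decimal("1250.00"), Decimal("12500.00"))
good = PaymentInstruction(200, Decimal("1250.00"), Decimal("1250.00"))

print(transport_ok(bad), content_ok(bad))    # True False
print(transport_ok(good), content_ok(good))  # True True
```

Both responses look identical to the gateway. Only the second function, which reads the payload, tells them apart.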
Why Correctness Stops Rolling Up
The deeper problem appears once an agent takes more than one step. In ordinary software, correctness rolls up: if every step of a request succeeds, the request succeeded. An agent run breaks that arithmetic. Every tool call can return a clean 200, every step can be locally valid, and the run as a whole can still loop, stall, or drift off the goal. The fault is in how the steps combine, not in any single one of them.
A clean step is not a clean run. You cannot certify an agent's behaviour by inspecting its steps one at a time, because the failure lives in the trajectory, the path the agent took through its available actions. This is the difference between two kinds of monitoring. Session-level telemetry reads a signal at a single step: this call returned, this query ran, this latency was acceptable. Goal-level tracing reads the whole task at once: did this sequence of steps actually advance the objective that prompted them, or did it wander?
Flat, step-level logs only give you the first. A redundant reasoning loop does not show up as an error on any individual line; it only exists when you look at the entire trajectory and see the same lookup repeated with no progress between iterations. The signal is real, but it lives at the level of the run, and step-level logging cannot represent it.
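To make the point concrete, here is a rough sketch of a run-level check. The `Step` record and the tool names are hypothetical, and counting identical calls is only a crude proxy for "no progress between iterations"; a real detector would also need to compare what each iteration returned.

```python
from collections import Counter
from typing import NamedTuple


class Step(NamedTuple):
    tool: str
    args: str    # serialised arguments, e.g. a JSON string
    status: int  # per-step status code


def repeated_calls(trajectory: list[Step], max_repeats: int = 2) -> list[tuple[str, str]]:
    """Return (tool, args) pairs that occur more than max_repeats times in one run."""
    counts = Counter((step.tool, step.args) for step in trajectory)
    return [call for call, n in counts.items() if n > max_repeats]


run = [
    Step("lookup_vendor", '{"id": 42}', 200),
    Step("lookup_vendor", '{"id": 42}', 200),
    Step("lookup_vendor", '{"id": 42}', 200),
    Step("summarise", "{}", 200),
]

# Step-level view: every line is clean.
every_step_ok = all(step.status == 200 for step in run)
# Run-level view: the same lookup was repeated three times.
loops = repeated_calls(run)

print(every_step_ok)  # True
print(loops)          # [('lookup_vendor', '{"id": 42}')]
```

The first check can be computed from any single log line. The second only exists once the steps are grouped into a run.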
Tracing Reasoning, Not Just Requests
In a traditional web application, tracing a transaction is a solved problem. A request arrives, the gateway assigns an identifier, and that identifier travels through every downstream call as a header. When the response returns to the user, the lifecycle closes, and every event from the transaction shares one ID. Query that ID and you get the complete picture.
Autonomous agent tasks do not close like that. A finance assistant asked to resolve a disputed contract might call a file-search tool, read a clause, check it against an external database, find a mismatch, recalculate, and produce a summary, across several minutes and several services. Read the raw logs from a run like that and you get a collection of unrelated entries. The file-search tool logged a retrieval. The database recorded a query. The model provider logged several inference calls. If the final summary is wrong, nothing stitches those events back into the journey the agent actually took. The telemetry captured the actions and lost the intent.
Goal-level tracing is the fix in principle: bind every model call and tool execution back to the single objective that prompted them, so a pile of disconnected logs becomes a graph you can follow when something drifts. The same discipline that makes an agent's reasoning traceable is what makes its failures diagnosable, and it is closely related to the work of constraining what an agent is allowed to do in the first place.
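One way to sketch that binding inside a single process is shown below, using only the Python standard library. The names `goal`, `record`, `trajectory` and the in-memory `EVENTS` list are illustrative stand-ins, not a real tracing API; across services the identifier would also have to be propagated explicitly, which is the role a tracing framework's context propagation usually plays.

```python
import contextvars
import uuid
from contextlib import contextmanager

_current_goal = contextvars.ContextVar("current_goal", default=None)
EVENTS: list[dict] = []  # stand-in for a real log sink


@contextmanager
def goal(description: str):
    """Open a goal scope; every event recorded inside it carries the same goal_id."""
    goal_id = uuid.uuid4().hex
    token = _current_goal.set(goal_id)
    EVENTS.append({"goal_id": goal_id, "kind": "goal", "detail": description})
    try:
        yield goal_id
    finally:
        _current_goal.reset(token)


def record(kind: str, detail: str) -> None:
    # Each tool or model call logs through here, so it inherits the active goal_id.
    EVENTS.append({"goal_id": _current_goal.get(), "kind": kind, "detail": detail})


def trajectory(goal_id: str) -> list[dict]:
    # One filter reconstructs the run in the order it happened.
    return [event for event in EVENTS if event["goal_id"] == goal_id]


with goal("resolve disputed contract") as gid:
    record("tool", "file_search: contract.pdf")
    record("model", "read clause")
    record("tool", "db_query: payment terms")

for event in trajectory(gid):
    print(event["kind"], event["detail"])
```

The design choice worth noticing is that the individual call sites never pass the identifier around. They record events as usual, and the surrounding scope decides which objective those events belong to.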
The Vocabulary Problem Underneath
There is a quieter obstacle beneath all of this. An agentic system is rarely one service. It is a reconciliation API, a sync worker, a tooling layer, each touching the same model, each written by a different team at a different time. Left alone, each one names what it logs however its own framework happens to name things. One service records a prompt's token count under one field name, another records it under a different name, and a third never captures it at all.
None of those choices is wrong in isolation. The damage is structural and only appears in aggregate: when you try to investigate a cost spike or trace a failure across the whole platform, there is no single query that touches all three services, because the field you need is recorded under different names in some services and is missing from others. The repository you built to answer questions cannot answer them.
This is the problem that shared telemetry standards exist to solve. OpenTelemetry's generative-AI semantic conventions are an agreed, maintained set of attribute names for model telemetry, so that every service emits the same keys and the log repository behaves like one queryable dataset rather than a heap of incompatible records. Adopting a shared vocabulary is the unglamorous precondition for everything else: you cannot trace a goal across services that cannot agree on what to call a token.
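As an illustration only, the sketch below maps three services' ad hoc field names onto one shared key. The service names and source field names are hypothetical. The target keys follow the `gen_ai.*` naming style of OpenTelemetry's generative-AI semantic conventions as an assumption; those conventions are still evolving, so check the exact attribute names against the current specification before relying on them.

```python
# Hypothetical per-service field names mapped to one shared vocabulary.
ALIASES = {
    "prompt_tokens": "gen_ai.usage.input_tokens",
    "tokens_in": "gen_ai.usage.input_tokens",
    "completion_tokens": "gen_ai.usage.output_tokens",
    "tokens_out": "gen_ai.usage.output_tokens",
    "model": "gen_ai.request.model",
}


def normalise(record: dict) -> dict:
    """Rename known aliases to the shared key; leave every other field untouched."""
    return {ALIASES.get(key, key): value for key, value in record.items()}


records = [
    {"service": "reconciliation-api", "prompt_tokens": 812, "model": "example-model"},
    {"service": "sync-worker", "tokens_in": 455},
    {"service": "tooling-layer", "latency_ms": 120},  # never captured token counts
]

normalised = [normalise(r) for r in records]

# One query now spans every service that recorded the field at all.
total_input = sum(r.get("gen_ai.usage.input_tokens", 0) for r in normalised)
missing = [r["service"] for r in normalised if "gen_ai.usage.input_tokens" not in r]

print(total_input)  # 1267
print(missing)      # ['tooling-layer']
```

Renaming after the fact is a stopgap. The cleaner fix is for each service to emit the shared keys at the source, which also exposes the services that record nothing.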
From Watching to Judging: Evaluating an Agent
Observability tells you what an agent did. It does not, on its own, tell you whether what it did was good. That second question is the job of evaluation, and it is where most teams meet the same wall they hit with monitoring: the methods they trust were built for deterministic software. A unit test asserts that a function returns an exact expected value. An agent that never produces the same output twice cannot be checked that way, because there is no single correct string to assert against.
Evaluating an AI agent, the practice usually shortened to "evals", means judging the quality of a non-deterministic system against a standard rather than checking it against a fixed answer. In practice that draws on a few complementary approaches: deterministic checks for the parts of an output that are structured, such as whether it parsed, called the right tool, and stayed within policy; model-graded scoring for the parts that are not, such as whether a summary was faithful to its source; and human review on the cases that anchor the rest. The point of an eval is not a single pass or fail. It is a baseline you can hold a probabilistic system to as it changes underneath you.
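The deterministic part of that mix can be sketched in a few lines. Everything specific here is an assumption made for illustration: the JSON output shape, the allowed-tool list, the payment limit, and the baseline thresholds. Model-graded scoring and human review are outside the sketch.

```python
import json

ALLOWED_TOOLS = {"file_search", "db_query"}  # hypothetical policy
PAYMENT_LIMIT = 10_000                       # hypothetical policy


def load(output: str):
    """Parse an agent output; return a dict, or None if it is not a JSON object."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return None
    return data if isinstance(data, dict) else None


CHECKS = {
    "parsed": lambda d: d is not None,
    "allowed_tool": lambda d: d is not None and d.get("tool") in ALLOWED_TOOLS,
    "within_limit": lambda d: (
        d is not None
        and isinstance(d.get("amount"), (int, float))
        and d["amount"] <= PAYMENT_LIMIT
    ),
}


def pass_rates(outputs: list[str]) -> dict[str, float]:
    """Score a batch of outputs: the result is a rate per check, not one pass/fail."""
    parsed = [load(o) for o in outputs]
    return {name: sum(check(d) for d in parsed) / len(parsed) for name, check in CHECKS.items()}


samples = [
    '{"tool": "db_query", "amount": 1250}',
    '{"tool": "db_query", "amount": 12500}',
    '{"tool": "send_email", "amount": 1250}',
    "not json",
]

BASELINE = {"parsed": 0.95, "allowed_tool": 0.95, "within_limit": 0.99}  # illustrative

rates = pass_rates(samples)
regressions = sorted(name for name, rate in rates.items() if rate < BASELINE[name])

print(rates)        # {'parsed': 0.75, 'allowed_tool': 0.5, 'within_limit': 0.5}
print(regressions)  # ['allowed_tool', 'parsed', 'within_limit']
```

Scoring a batch rather than a single output is what turns the checks into a baseline: the same samples can be rerun after a prompt or model change and the rates compared.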
Evaluation depends on observability, which is why the two belong in one conversation. You cannot grade a trajectory you were unable to reconstruct. The trace gives you the object to judge; the eval supplies the judgement. A team with neither is flying blind, and a team with tracing but no evals can see exactly what happened without knowing whether it was acceptable.
What Agent Observability Actually Requires
Pulling the threads together: observing an autonomous agent is not a harder version of observing a web service. It is a different problem. It requires reading correctness at the level of the whole task rather than the individual step, binding every step to the goal that prompted it so the trajectory can be reconstructed, and standardising the vocabulary of what gets recorded so the trace holds together across services. None of that is delivered by the infrastructure metrics most teams currently rely on, which is why a system can look healthy on every dashboard while quietly producing wrong answers.
Naming the gap is the first step. Building the instrumentation that closes it is engineering work in its own right: the goal-level traces, the evaluation rigs that establish a quality baseline for non-deterministic output, and the architectural constraints that suppress failure before it cascades. That build is the subject of our AI Agent Observability and Reliability course. This piece is the map of the territory; the course is how you cross it. The organisations that handle autonomous AI in production most reliably are the ones that treated observability as a first-class design concern, not a dashboard they bolted on after the first wrong answer reached a customer.
Frequently Asked Questions
What is AI agent observability?
AI agent observability is the practice of monitoring whether an autonomous agent is achieving its objective, not just whether the service running it is available. It differs from standard application monitoring because a language-model agent can return a well-formed response, pass every infrastructure check and still produce a wrong or off-goal result. Observability for agents therefore has to read correctness at the level of the whole task, not the individual request.
What is the difference between session-level and goal-level tracing?
Session-level telemetry reads a signal at a single step: a call returned, a query ran, a latency was acceptable. Goal-level tracing reads the entire task at once and asks whether the sequence of steps advanced the objective that prompted them. The distinction matters because an agent run can have every step succeed individually and still loop, stall or drift off the goal, and that failure is only visible across the whole trajectory, which step-level logs cannot represent.
Why don't standard monitoring tools catch AI agent failures?
Standard tools measure the transport layer, not the content of a response. HTTP status records that a response returned within the protocol, latency measures how long it took, and memory tracks allocation; none of them inspects whether the output was correct. A model can misread its input and return a confidently wrong answer while every one of those metrics stays green, because no rule was broken. The tools are working as designed; correctness simply is not something they were built to observe.
Why use OpenTelemetry's generative-AI semantic conventions for agent tracing?
An agentic system is usually several services, each potentially logging the same model telemetry under different field names, so no single query can span the whole platform. OpenTelemetry's generative-AI semantic conventions are an agreed, maintained set of attribute names for model telemetry, so every service emits the same keys and the logs behave like one queryable dataset. Agreeing the vocabulary is the precondition for tracing a goal across services, because you cannot correlate steps that cannot agree on what to call a token.
What are AI agent evals?
AI agent evals are structured tests that judge the quality of an agent's behaviour, which is harder than testing conventional software because an agent rarely produces the same output twice. Rather than asserting one correct answer, evals combine deterministic checks for the structured parts of an output (did it parse, did it call the right tool, did it stay within policy), model-graded scoring for the parts that are not (was a summary faithful to its source), and human review to anchor the rest. The goal is a quality baseline you can hold a probabilistic system to as it changes, not a single pass-or-fail result.
How do you evaluate an AI agent in production?
Start from the trace, not the output. Because an agent's failures live in the trajectory rather than any single step, evaluating one means reconstructing the whole run and judging it against a standard: whether it advanced the goal, stayed within its allowed actions, and produced output appropriate to the input. That requires observability first, since you cannot grade a run you cannot reconstruct, and then a defined baseline of what a good run looks like. Evaluation and observability are two halves of the same reliability problem: the trace gives you something to judge, and the eval supplies the judgement.