Progress Observability Platform
Explainer series · No. 1 · Agent Observability

The Agent Observability Gap

Why agents that look healthy in a playground go sideways the moment they meet real users, and what a useful observability stack for agent-shaped systems has to cover.

Progress Observability Platform · 4 min read · 5 interactive infographics · Updated Apr 22, 2026

A single prompt-and-response in a notebook hides almost everything that will actually decide whether an agent ships. In production, the same agent fans out into retrieval, tools, retries, spend, and latency, and most of that happens under the waterline of a traditional APM. This piece walks through five ideas about why the gap exists and what it takes to close it.

Context · What a trace-native view looks like

What a full agent trace looks like

Before the five sections below, here's a short walkthrough of a trace-native view: every step, tool call, retrieval hop, and token spent on a single agent run. It grounds the rest of the article in something concrete.

telerik.com/ai-observability-platform
Trace-native interface · product walkthrough · youtube/7DyZbg5hzw4
01 · Playground vs Production

The same agent lives in two very different universes.

In a playground, an agent looks like a single prompt and a single response. In production, that same agent fans out into retrieval, tool calls, retries, external APIs, and whatever guardrails sit around them. Each hop has its own latency, its own cost, and its own way of silently going wrong. Most teams inherited observability built for request-and-response web apps, a shape their agent stopped having a long time ago.

So what does that actually look like when you flip the same agent from a playground demo into a live production run?
What you build for · 1 call
Playground

One prompt. One response.
All green.

User prompt
LLM call
Response
1.2s · $0.003 · OK

The happy path. One call, one outcome, exactly what you built for.

Production trace
trace_id · 9a1f…c4e0

Seven spans. Four silent failures. Almost none raise an exception.

Inspector · Retrieval
520ms · span #2

APM sees a successful vector query. It can't judge whether the retrieved context was actually right.

Timeline · spans · 2.3s total
p99 8.4s · Cost 140× playground · Drift detected · Retrieval relevance 0.31
Takeaway · If you only test an agent through the playground, you're measuring the happy path. The graph underneath is what decides whether it survives real traffic.
02 · Failure Modes

Agentic failures don’t look like bugs.

Agent failures rarely raise exceptions. Forrester catalogues a long list of agentic failure modes, and almost none of them surface in unit tests: a retrieval returns a plausible-but-wrong chunk, a tool call succeeds with an HTTP 200 on the wrong action, a reasoning loop burns tens of thousands of tokens before quitting. The five below are the ones most teams meet first, and the five that traditional logging is least equipped to spot.

Which five, specifically, and what would actually catch each one before it reaches a user?

Retrieval Failure

What it looks like

Your RAG pipeline returns stale, wrong, or irrelevant chunks, and the model answers confidently anyway.

Why traditional tools miss it

APM sees a successful vector query. It can't judge whether the retrieved context was actually right.

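The "successful query, wrong context" case is catchable at the span level: score each retrieved chunk against the query and flag low scorers on the trace. The sketch below is illustrative only; the lexical bag-of-words similarity and the 0.35 threshold are stand-ins for the embedding similarity and cutoff your own retriever would use.

```python
import math
from collections import Counter

def cosine_sim(a: str, b: str) -> float:
    """Crude lexical cosine similarity between two texts (bag-of-words).
    A real pipeline would compare embedding vectors instead."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[t] * vb[t] for t in va)
    na = math.sqrt(sum(c * c for c in va.values()))
    nb = math.sqrt(sum(c * c for c in vb.values()))
    return dot / (na * nb) if na and nb else 0.0

def flag_low_relevance(query: str, chunks: list[str], threshold: float = 0.35):
    """Score every retrieved chunk and return the ones below threshold --
    the chunks the vector store returned 'successfully' but wrongly."""
    scored = [(cosine_sim(query, c), c) for c in chunks]
    return [(s, c) for s, c in scored if s < threshold]
```

Emitting the score as a span attribute (like the `Retrieval relevance 0.31` in the trace above) is what lets an alert fire on bad context even though the query itself returned 200.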

Tool-Call Failure

What it looks like

The agent picks the wrong tool, passes bad parameters, or the tool fails silently mid-chain.

Why traditional tools miss it

HTTP 200 hides semantic errors. Logs don't know which tool the agent should have called.

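One way to surface this class of failure is to pair every tool call with an explicit semantic postcondition and record the verdict on the span. The wrapper below is a hypothetical sketch, not any specific library's API: `call` and `verify` stand in for your tool invocation and its read-back check.

```python
def checked_tool_call(call, verify, span_attrs: dict):
    """Run a tool call, then a semantic postcondition check.
    Transport success (HTTP 200) and semantic success are recorded as
    separate span attributes, so the trace can show a call that
    'succeeded' but did the wrong thing."""
    result = call()
    ok = verify(result)
    span_attrs["tool.http_ok"] = True
    span_attrs["tool.semantic_ok"] = ok
    if not ok:
        # Surface the failure on the span instead of trusting status codes.
        span_attrs["tool.error"] = "postcondition failed"
    return result, ok
```

For an `update_record` tool, `verify` might re-read the record and confirm the field actually changed; the point is that correctness is asserted, not inferred from a 200.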

Latency Spikes

What it looks like

p50 looks fine. p99 is 12 seconds on the exact flows your power users hit most.

Why traditional tools miss it

Request-level metrics average away the tail. Per-span latency across an agent chain is invisible.

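Catching the tail means computing percentiles per span name, not per request. A stdlib-only sketch, assuming span durations arrive as `(name, duration_ms)` pairs from whatever tracer you run:

```python
from collections import defaultdict
from statistics import quantiles

def span_percentiles(spans):
    """spans: iterable of (span_name, duration_ms).
    Returns p50/p95/p99 per span name, so one slow step in the chain
    can't be averaged away by a request-level mean."""
    by_name = defaultdict(list)
    for name, ms in spans:
        by_name[name].append(ms)
    out = {}
    for name, xs in by_name.items():
        # needs >= 2 samples per span name; q[i] ~ the (i+1)th percentile
        q = quantiles(xs, n=100, method="inclusive")
        out[name] = {"p50": q[49], "p95": q[94], "p99": q[98]}
    return out
```

A per-request p50 of 1.2s and a `tool.search_docs` p99 of 12s can coexist; only the per-span breakdown shows which hop owns the tail.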

Cost Blowouts

What it looks like

A reasoning loop burns 40k tokens on a single request. One user racks up $200 before lunch.

Why traditional tools miss it

Infra dashboards track CPU, not tokens. Cost per trace, per user, per model is not a native concept.

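Making tokens a first-class metric is mostly bookkeeping: attach token counts to every LLM span and fold them into a per-trace dollar figure. The model names and per-1K rates below are placeholders, not real list prices; substitute your provider's current pricing.

```python
# Placeholder rates -- not real list prices.
PRICE_PER_1K = {
    "gpt-large": {"input": 0.010, "output": 0.030},
    "gpt-small": {"input": 0.001, "output": 0.002},
}

def trace_cost(llm_spans):
    """llm_spans: [(model, prompt_tokens, completion_tokens), ...] for one
    trace. Summing here is what turns a 40k-token reasoning loop into a
    dollar figure you can alert on, instead of a CPU graph that looks fine."""
    total = 0.0
    for model, p_tok, c_tok in llm_spans:
        rates = PRICE_PER_1K[model]
        total += (p_tok / 1000) * rates["input"] + (c_tok / 1000) * rates["output"]
    return round(total, 4)
```

Rolling the same sum up by user or tenant tag is how "one user racks up $200 before lunch" becomes an alert instead of an invoice surprise.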

Output Drift

What it looks like

Same input, different output. Quality silently degrades after a model upgrade or prompt tweak.

Why traditional tools miss it

There's no 'error' to log. You need evals and historical replay to catch behavioral regressions.

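Because there is no error to log, drift has to be detected statistically. A minimal sketch, assuming you already collect a numeric eval score per run: compare the current window's mean against a baseline window with a two-sample z-test and alert only on drops.

```python
import math
from statistics import mean, stdev

def drift_alert(baseline: list[float], current: list[float],
                z_threshold: float = 3.0) -> bool:
    """Flag a statistically meaningful drop in eval scores between a
    baseline window and the current window (two-sample z-test on means).
    Both windows need at least two samples."""
    mb, mc = mean(baseline), mean(current)
    se = math.sqrt(stdev(baseline) ** 2 / len(baseline)
                   + stdev(current) ** 2 / len(current))
    if se == 0:
        return mc < mb  # constant scores: any drop at all counts
    z = (mb - mc) / se
    return z > z_threshold  # alert on drops only, never on improvements
```

Run it after every deploy against the golden-set scores and the "same input, different output" regression fires an alert rather than waiting for a user report.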
Takeaway · A green status code is not a signal of correctness. Catching these five requires per-step visibility, not just per-request.
03 · Logging

APM was designed for a shape agents no longer have.

Traditional APM and log aggregation were built around a deterministic request-response contract: a URL comes in, code runs once, a response goes out, a span closes. Agent execution breaks every one of those assumptions. It's non-deterministic (same input, different path), stateful (tool calls feed the next reasoning step), and multi-step (one inbound request can fan out into dozens of sub-calls). Teams end up debugging blind not because APM disappeared, but because it was never the right abstraction for this shape.

How much of a single agent run does each tier of tooling actually see?
One agent run · span depth
Each lane is a span. Shaded = visible to Traditional APM.
POST /v1/agent
retrieval.vector_search
tool.search_docs
llm.reasoning (plan)
tool.update_record (200, silent err)
llm.reasoning (retry loop)
eval.llm_as_judge
0ms · 575 · 1150 · 1725 · 2300ms
Captures
  • HTTP request & response
  • Top-level status codes
  • Process-level CPU & memory
Blind to
  • Per-step reasoning
  • Tool inputs / outputs / decisions
  • Silent semantic failures on HTTP 200

200 OK means the server responded. It does not mean the agent was right.

Takeaway · Stack traces answer “what crashed?” Agents need an answer to “what did it decide, with what context, and why?”
04 · The Six Pillars

A working observability stack for agents covers six layers, not one or two.

Most “LLM observability” stories collapse into a single feature: pretty traces, or eval dashboards, or token counters. A useful stack for agent-shaped systems has to do all six at once (traces, evals, cost, latency, drift, and replay), because the failure modes interact. A retrieval regression shows up as a drop in eval score, as a jump in retries, and as a cost spike in the same trace. Miss any one of the six and you're debugging blindfolded on the others.

What does each of those six layers actually do, and why do they only work together?
LLM / Agent Observability
Pillar 1 / 6 · Traces

Why it matters

Full reasoning path and tool chain for every agent run.

What this looks like in practice

OTLP-native spans across every step. Inspect prompts, completions, tool I/O, and decisions in one view.

Takeaway · Traces without evals are vibes. Evals without cost and latency are research. All six together is what lets a team run agents in production with confidence.
05 · Self-Assessment

Is your LLM app production-ready?

Use this as a gut check for your team, not a lead magnet. There's no submit and no signup. Five yes/no questions, each tied to one of the pillars above. The score at the bottom is a rough rubric for where your observability coverage actually sits today, and which layer is likely to be the next thing to bite.

So where does your team sit today, and which layer is the next one to bite?
Can you trace every agent step?

Without a full trace, debugging is guesswork and fixes are gambles.

Can you see exactly where it fails?

Retrieval, tool calls, and prompts all fail differently. You need to see which one broke.

Can you inspect latency and cost per step?

One slow tool or one runaway loop can tank UX and burn budget. Per-span visibility prevents both.

Can you compare runs over time?

Prompts and models change. Evals and replay tell you whether quality held or slipped.

Can you catch drift before your users do?

Silent degradation is the most expensive bug. Drift detection turns it into an alert, not a postmortem.

Your score · __ / 5

Answer the five checks above, then find your band in the rubric below.

Maturity rubric
  1. 5 / 5 · Mature coverage

     Traces, evals, cost, latency, drift, and replay are all in place. You can debug on evidence.

  2. 3–4 · Clear gaps

     You can debug most failures, but one or two classes still rely on guesswork.

  3. 0–2 · Building from scratch

     Expect regressions, silent cost leaks, and long debugging loops until the basics land.

Takeaway · If you answered “no” to three or more, the gap isn't a tooling preference anymore, it's a reliability risk you're already paying for.
06 · Where to Start

You don’t have to land six pillars at once.

You've taken the gut check. Wherever you landed, the useful next step is the same: instrument spans first, attribute per-step cost and latency second, then layer evals and drift. Ordered by dependency, not preference. The first step closes more blind spots than any other, and every later layer needs the one before it to mean anything.

What does that actually look like across a week, a month, a quarter?
  1. 01 · Week 1

    Make the graph visible

    Close the biggest Section 02 blind spots in a week.

    • Emit OTLP spans around every LLM call and every tool call.
    • Capture prompt, response, tool name, arguments, and status on each span.
    • Ship one dashboard that lists the slowest ten traces this hour.
  2. 02 · Month 1

    Attribute latency and cost

    Turn spans into per-step numbers you can budget.

    • Tag spans with model, tenant, and app for dimensional breakdowns.
    • Track token cost on every LLM span; alert on spikes, not totals.
    • Break latency into p50, p95, and p99 per span, not per request.
    builds on Week 1 spans
  3. 03 · Month 3

    Catch quality regressions

    Make drift the alert, not the postmortem.

    • Build a small golden dataset and run LLM-as-judge on every deploy.
    • Track score distributions over time; flag statistically meaningful drops.
    • Wire replay: pick any failed trace and re-run a step with edits.
    builds on Month 1 attribution

The order matters more than the timeline. Week 1 can take an afternoon on a small agent or two sprints on a multi-agent stack. What stays fixed is the dependency chain: you can't budget cost per step without per-step spans, and you can't baseline drift without a quality signal to baseline.
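The Week 1 step is small enough to sketch. In practice you would emit real OTLP spans via opentelemetry-api's `tracer.start_as_current_span` and an exporter; the stdlib-only stand-in below just shows the shape it captures: one shared trace id, one span per step, attributes for tools and tokens, and a duration per span.

```python
import time
import uuid
from contextlib import contextmanager

# Stdlib stand-in for a tracer: collected spans would normally go to an
# OTLP exporter instead of a module-level list.
SPANS: list[dict] = []
TRACE_ID = uuid.uuid4().hex

@contextmanager
def span(name: str, **attrs):
    """Record one step of the agent run as a span on the shared trace."""
    record = {"trace_id": TRACE_ID, "name": name, "attrs": attrs}
    start = time.perf_counter()
    try:
        yield record  # callers can attach more attributes mid-step
    finally:
        record["duration_ms"] = (time.perf_counter() - start) * 1000
        SPANS.append(record)

# One agent run: the nesting mirrors the lane diagram in Section 03.
with span("agent.run"):
    with span("retrieval.vector_search", query="reset password"):
        pass  # call the vector store here
    with span("llm.reasoning", model="gpt-large") as s:
        s["attrs"]["completion_tokens"] = 212  # record token usage on the span
```

Inner spans close (and are recorded) before the outer one, so the slowest-traces dashboard from the Week 1 checklist is just a sort over `SPANS` grouped by `trace_id`.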

Takeaway · Pick one span you don’t have today and add it on Monday. That single change moves a team more than any amount of whiteboard debate about which tool to evaluate next.