{
  "id": "HRN-006",
  "slug": "observability-for-agentic-systems",
  "title": "Observability for Agentic Systems",
  "category": "Observability",
  "status": "Draft",
  "summary": "How to make a non-deterministic, multi-step agent inspectable — traces and spans, token and cost accounting, evaluation hooks, and deterministic replay — so the system can be debugged, measured, and trusted.",
  "updated": "Sun Jun 21 2026 00:00:00 GMT+0000 (Coordinated Universal Time)",
  "url": "https://santismm.com/en/handbook/observability-for-agentic-systems",
  "evidence": {
    "evidenceLevel": "theoretical",
    "confidenceLevel": "medium",
    "sourceType": [
      "industry_observation",
      "personal_experience"
    ]
  },
  "related": [
    "HRN-003",
    "HRN-007"
  ],
  "tags": [
    "observability",
    "tracing",
    "spans",
    "replay",
    "cost-accounting"
  ],
  "headings": [
    "Executive Summary",
    "Key Concepts",
    "Definition",
    "Architecture Diagram",
    "Detailed Explanation",
    "Production Evidence",
    "Observed Failure Modes",
    "KPIs",
    "Cost Metrics",
    "Scaling Characteristics",
    "Related Content",
    "References",
    "FAQs"
  ],
  "markdown": "# Observability for Agentic Systems\n\n## Executive Summary\nObservability is the harness component that turns an opaque, non-deterministic agent run into an inspectable, replayable artifact. You cannot debug, evaluate, govern, or trust a multi-step stochastic system you cannot see — which is why observability is a precondition for nearly every other harness capability, not a phase-two add-on. This chapter covers traces and spans adapted for agents, token and cost accounting as first-class telemetry, evaluation hooks, and deterministic replay.\n\n## Key Concepts\n- **Trace:** The complete record of a single agent run — every step from goal to outcome.\n- **Span:** A single unit of work within a trace (a model call, a tool invocation, a retrieval, a decision) with inputs, outputs, timing, and metadata.\n- **Token/cost accounting:** Per-span and per-trace tracking of tokens in/out and resulting cost.\n- **Evaluation hook:** An instrumentation point where evaluation logic can score a span or trace, online or offline.\n- **Replay:** Re-executing a recorded trace deterministically to reproduce and debug behavior.\n- **Cardinality:** The dimensionality of telemetry tags; high cardinality aids analysis but raises storage cost.\n\n## Definition\n**Observability for agentic systems** is the harness subsystem that captures, structures, and stores a complete, queryable record of every agent run — its spans, inputs, outputs, model calls, tool calls, costs, and decisions — such that any run can be understood after the fact, compared across versions, scored by evaluation, and replayed deterministically. 
It answers the question \"what, exactly, happened, and why?\"\n\n## Architecture Diagram\n```mermaid\nflowchart TB\n  RUN[Agent Run] --> TRACE[Trace]\n  subgraph TRACE[Trace: one run]\n    direction TB\n    S1[Span: Plan]\n    S2[Span: Model Call]\n    S3[Span: Tool Call]\n    S4[Span: Retrieval]\n    S5[Span: Decision]\n  end\n  S2 --> TOK[Token / Cost Accounting]\n  TRACE --> STORE[(Trace Store)]\n  STORE --> QUERY[Query &amp; Dashboards]\n  STORE --> REPLAY[Deterministic Replay]\n  STORE --> EVALH[Evaluation Hooks]\n  EVALH --> EVAL[Evaluation HRN-007]\n  QUERY --> ALERT[Alerting / Monitors]\n```\n\n## Detailed Explanation\n\n### Why classic observability is not enough\nTraditional APM assumes deterministic services: a request, a few synchronous calls, a response. Agentic systems break those assumptions. A single run may take a *different path each time*, fan out across many model and tool calls, loop an unknown number of times, and produce *natural-language* inputs and outputs that ordinary metrics cannot summarize. Observability for agents must therefore capture not just latency and errors but the *semantic content* of each step — the prompt sent, the completion returned, the tool arguments chosen, the reasoning. Without that content, a trace tells you *that* the agent failed but never *why*.\n\n### Traces and spans, adapted for agents\nThe trace/span model from distributed tracing is the right backbone, with agent-specific span types:\n- **Model-call spans** record the assembled prompt (or a reference to it), the completion, the model and parameters, token counts, and latency.\n- **Tool-call spans** record the tool, the (validated) arguments, the result or error, and retries.\n- **Retrieval spans** record the query, the items returned, and their scores — essential for diagnosing memory misses.\n- **Decision/plan spans** record the agent's choice of next action and, where available, its rationale.\n\nSpans nest to form the full causal tree of a run. 
The richer the captured content, the more debuggable the system — at the cost of storage and privacy exposure, which must be managed (redaction, sampling, retention).\n\n### Token and cost accounting as first-class telemetry\nIn agentic systems, *cost is a behavior*, not just a bill. A regression that causes an extra reasoning loop or a bloated context shows up first as a token spike. Observability must therefore treat token counts and derived cost as first-class metrics, attributed per span, per trace, per user, and per agent version. This makes cost regressions detectable, runaway loops alertable, and per-task economics measurable — closing the loop with the cost-metrics discipline that recurs across the handbook.\n\n### Evaluation hooks\nObservability and evaluation (HRN-007) are co-dependent. Evaluation needs the traces; observability is most valuable when its data feeds scoring. The harness should expose *evaluation hooks* — instrumentation points where a scorer (a rule, a classifier, or an LLM-as-judge) can attach to a span or trace, either *online* (scoring live traffic for monitoring) or *offline* (replaying stored traces against a new model or prompt). Designing these hooks into the trace format from day one is what makes continuous evaluation cheap later.\n\n### Deterministic replay\nThe most powerful agent-specific capability is replay: re-running a recorded trace to reproduce its behavior. Because the model is non-deterministic, true replay requires capturing enough to *pin* the run — recorded model outputs (to replay without re-calling the model), tool results, retrieved context, and random seeds where applicable. Replay enables three things that are otherwise nearly impossible: reproducing a production failure locally, regression-testing a prompt or model change against real historical traffic, and A/B comparing two harness versions on identical inputs. 
A harness without replay debugs by guesswork.\n\n### Privacy, redaction, and retention\nCapturing full prompts and completions means capturing potentially sensitive data. Observability must integrate redaction (PII scrubbing), access controls on the trace store, and retention policies — these are governance (HRN-008) and security (HRN-011) concerns that the observability layer enforces in practice.\n\n## Production Evidence\n> **Evidence level:** theoretical · **Confidence:** medium · **Source:** industry_observation\n>\n> _Illustrative, representative scenario — not a verified single deployment._\n\n- **Context:** Teams operating multi-step agents in production who initially shipped with only basic logging.\n- **Scenario:** An intermittent failure (the agent occasionally takes a wrong action) is undiagnosable from logs; after adding full trace/span capture with replay, the failing run is reproduced locally and traced to a retrieval miss that fed the model a misleading document.\n- **Technology:** Tracing backend with agent-aware span types, trace store, replay tooling, token/cost telemetry.\n- **Load:** Production traffic with long-tail, hard-to-reproduce failures.\n- **Results:** Representative experience is that mean-time-to-diagnosis drops sharply once runs are fully traced and replayable, and that cost regressions become visible the moment they occur.\n\n## Observed Failure Modes\n- **Logs without structure:** Free-text logs that record *that* something happened but not the span tree, inputs, and outputs needed to understand it.\n- **No content capture:** Capturing latency and errors but not prompts/completions, leaving failures undiagnosable.\n- **Unbounded cardinality/storage:** Capturing everything at full fidelity for every run, exploding storage cost; needs sampling and retention policy.\n- **No replay:** Inability to reproduce non-deterministic failures, forcing debug-by-guesswork.\n- **Privacy leakage:** Capturing sensitive prompt content without 
redaction or access control.\n\n## KPIs\n| Metric | Target | Notes |\n|--------|--------|-------|\n| Trace coverage | ~100% of runs traced | Every production run produces a trace |\n| Mean time to diagnosis | Minimized | Time from failure report to root cause via traces/replay |\n| Cost attribution coverage | Per span/trace/version | Enables cost-regression detection |\n| Replay fidelity | High | Share of recorded traces that replay deterministically |\n\n## Cost Metrics\nObservability adds storage cost (proportional to traces × spans × captured content) and a small runtime overhead per span. Sampling, redaction, tiered retention, and storing references to large payloads control this. The cost is repaid by faster incident resolution and by *making token/cost itself observable*, which can surface inference savings that exceed the observability spend.\n\n## Scaling Characteristics\nTrace volume scales with traffic × steps-per-run, so deep agentic workflows generate disproportionately more telemetry than shallow services. Storage and query cost are the scaling bottlenecks; head-based and tail-based sampling, aggregation, and retention tiers keep it bounded. Replay storage scales with the fidelity captured, trading storage for reproducibility.\n\n## Related Content\n- HRN-003 — The Harness Taxonomy\n- HRN-007 — Evaluation of Agentic Systems\n\n## References\n- Distributed tracing concepts (spans, traces) adapted to agentic workloads.\n- Practitioner literature on LLM observability and tracing tooling.\n- Santa María, S. — Working notes on agent observability and replay.\n\n## FAQs\n**Q:** Isn't logging enough?\n**A:** No. Unstructured logs cannot reconstruct the causal span tree of a multi-step, branching run, and they rarely capture the semantic content (prompts, completions, retrieved context) needed to explain a failure. 
Structured traces with replay are required.\n\n**Q:** Why track cost in the observability layer?\n**A:** Because in agentic systems cost is a *behavior*: extra loops and bloated context show up as token spikes before they show up anywhere else. Cost telemetry is how you catch those regressions.\n\n**Q:** What is the single most valuable capability?\n**A:** Deterministic replay. It turns \"we can't reproduce it\" into a routine local debug session and enables regression-testing changes against real historical traffic."
}