Why traditional observability fails for agents
APM was built for request/response services. It tracks latency percentiles, error codes, and throughput. Those metrics tell you whether a service is healthy. They do not tell you whether an agent made the right decision.
Agents select tools, evaluate context, invoke external systems, and sometimes wait for human approval before proceeding. A 200 OK on the HTTP layer means the request completed. It says nothing about whether the agent chose the correct tool, whether the right policy was applied, or whether the action should have been allowed at all.
When an agent deploys a bad configuration to production, traditional monitoring tells you the deployment API responded in 340ms. What you actually needed to know: the agent selected kubectl.apply on a production namespace, the policy engine evaluated rule approval-prod-write, and the approval was granted by an on-call engineer who had 90 seconds of context. That chain of decisions is invisible to standard APM.
The core blind spot
A service can report perfect uptime while agents consistently make governed but suboptimal decisions. Fast, healthy infrastructure does not mean correct agent behavior. You need a different layer of visibility.
Agent incidents are decision incidents. The question after something goes wrong is never "was the API available?" It is "why did the agent choose this path, what policy applied, and did the behavior match what we intended?" Traditional observability cannot answer those questions.
Three pillars of agent observability
If traditional observability rests on metrics, logs, and traces, agent observability needs its own foundation. The three pillars that matter for autonomous systems are decision tracing, behavioral drift detection, and governance audit trails.
- Decision tracing: Record every tool selection, policy evaluation, and execution outcome with a shared trace_id. Connect intent to action to result in a single queryable path.
- Behavioral drift detection: Compute rolling baselines for denial rates, approval patterns, and tool usage. Alert when current behavior diverges from the established norm. Agents adapt, and adaptation can be silent regression.
- Governance audit trails: Log every governed decision with policy version, actor identity, timestamp, and evidence pointers. Make the record immutable and queryable for compliance review and incident reconstruction.
These three pillars are complementary. Tracing gives you the "what happened" for a single run. Drift detection gives you the "is something changing" across many runs. Audit trails give you the "prove it was governed" for compliance and incident review.
What top posts miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Salesforce: Agent Observability | Clear distinction between monitoring and observability, with strong focus on reasoning spans and metadata. | No concrete drift detection patterns or governance audit trail schemas for compliance review. |
| IBM: Why observability is essential for AI agents | Strong MELT plus AI-specific telemetry framing, including token usage and tool-call diagnostics. | Limited treatment of behavioral drift alerting and no policy-version tracking for audit lineage. |
| Rubrik: Agent Observability | Useful enterprise checklist around structured logs, staging tests, and centralized telemetry. | No decision-level tracing examples or runnable queries for incident reconstruction. |
The common gap: strong conceptual framing of agent telemetry, but limited operational guidance on drift detection, decision-level tracing with policy lineage, and audit-ready evidence schemas.
Decision tracing
A decision trace records the full path from agent intent to execution outcome. Unlike a standard distributed trace that tracks service calls, a decision trace captures the reasoning layer: which tool was selected, what policy rule matched, what the decision was, and how long each step took.
The trace_id is the connective tissue. It links the initial job submission to the policy evaluation, the approval (if one was required), the tool execution, and the final status. When an incident occurs, you pull one trace_id and reconstruct the entire decision chain without searching across disconnected log streams.
Example: decision trace record
This is a single span from a production deployment trace. The agent selected kubectl.apply, the policy engine matched rule approval-prod-write, and an on-call SRE approved the action. Every field is queryable.
```json
{
  "trace_id": "trc_8f2a4b6c",
  "span_id": "spn_policy_eval_01",
  "agent_id": "deploy-agent-prod-3",
  "timestamp": "2026-04-09T09:31:47Z",
  "tool": "kubectl.apply",
  "decision": "REQUIRE_APPROVAL",
  "rule_id": "approval-prod-write",
  "policy_version": "pol_2026_04_07",
  "latency_ms": 12,
  "context": {
    "namespace": "production",
    "resource": "deployment/api-gateway",
    "side_effect": true
  },
  "outcome": {
    "approved_by": "oncall_sre",
    "approved_at": "2026-04-09T09:33:12Z",
    "execution_status": "SUCCESS"
  }
}
```

Key design principle: every field that appears in post-incident questions should be a first-class attribute in the trace, not buried in unstructured log text. If you find yourself grepping logs to answer "which policy version was active?", the trace schema is incomplete.
Decision tracing also surfaces latency at the governance layer. In the example above, policy evaluation took 12ms, but approval took nearly 90 seconds. That distinction matters: governance overhead is often human-approval wait time, not compute cost.
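To make that split concrete, here is a minimal Python sketch that derives both numbers from the span above. The record shape follows the example; the helper function is illustrative, not part of any particular tracing library.

```python
from datetime import datetime

def governance_latency(span: dict) -> dict:
    """Split governance overhead into policy-eval time and approval wait."""
    evaluated_at = datetime.fromisoformat(span["timestamp"].replace("Z", "+00:00"))
    approved_at_raw = span.get("outcome", {}).get("approved_at")
    approval_wait_s = None
    if approved_at_raw:
        approved_at = datetime.fromisoformat(approved_at_raw.replace("Z", "+00:00"))
        approval_wait_s = (approved_at - evaluated_at).total_seconds()
    return {
        "policy_eval_ms": span["latency_ms"],  # compute cost of the policy check
        "approval_wait_s": approval_wait_s,    # human-in-the-loop wait, if any
    }

span = {
    "timestamp": "2026-04-09T09:31:47Z",
    "latency_ms": 12,
    "outcome": {"approved_at": "2026-04-09T09:33:12Z"},
}
print(governance_latency(span))  # {'policy_eval_ms': 12, 'approval_wait_s': 85.0}
```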
Behavioral drift detection
Agents adapt. New model versions change tool selection preferences. Prompt updates shift decision boundaries. Policy changes alter what gets denied or approved. Over days and weeks, the aggregate behavior of an agent population can shift meaningfully without any single run looking obviously wrong.
Drift detection works by computing rolling baselines for key behavioral signals and alerting when current values diverge beyond a configured threshold. The signals that matter most:
- Deny rate shift: A sudden spike in policy denials can indicate attack probes, prompt injection attempts, or a misconfigured policy update.
- Approval pattern changes: If approval rates drop or approval latency changes significantly, the governance flow may have a bottleneck or the risk profile of incoming requests has shifted.
- Tool selection entropy: Agents using tools in new combinations or frequencies that diverge from the training baseline may be exhibiting capability drift.
- Output safety rate: A gradual increase in quarantined or flagged outputs suggests prompt quality degradation or data distribution shift.
Example: deny-rate drift query
This query computes a z-score comparing the current 10-minute deny rate against a 7-day baseline. The baseline window is offset by a week (days 7 through 14) so that drift already in progress does not contaminate it. A z-score above 2 triggers investigation; above 3 pages on-call.
```sql
-- Detect deny-rate drift over a 7-day rolling window
WITH baseline AS (
  SELECT
    date_trunc('hour', ts) AS hour,
    count(*) FILTER (WHERE decision = 'DENY')::float
      / nullif(count(*), 0) AS deny_rate
  FROM policy_decisions
  WHERE ts > now() - interval '14 days'
    AND ts <= now() - interval '7 days'
  GROUP BY 1
),
current_window AS (
  SELECT
    count(*) FILTER (WHERE decision = 'DENY')::float
      / nullif(count(*), 0) AS deny_rate
  FROM policy_decisions
  WHERE ts > now() - interval '10 minutes'
)
SELECT
  c.deny_rate AS current_rate,
  avg(b.deny_rate) AS baseline_avg,
  stddev(b.deny_rate) AS baseline_stddev,
  (c.deny_rate - avg(b.deny_rate)) / nullif(stddev(b.deny_rate), 0) AS z_score
FROM current_window c, baseline b
GROUP BY c.deny_rate;
```

The key insight with drift detection is that individual runs can all look normal while the population-level pattern shifts. A deny rate moving from 2% to 8% over two weeks does not trigger any single-run alert. Only a baseline comparison catches it.
After policy changes, reset baselines
A deliberate policy update will naturally shift deny rates and approval patterns. If you do not reset the drift baseline after an intentional change, the system will alert on expected behavior. Tag policy deployments and auto-reset rolling windows.
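As a minimal sketch of that monitor in Python, assuming deny-rate samples arrive hourly; the class name and the on_policy_deploy reset hook are illustrative, not any particular library's API.

```python
import statistics
from collections import deque

class DenyRateDriftMonitor:
    """Rolling z-score drift check with baseline reset on policy deploys."""

    def __init__(self, window_size: int = 7 * 24):
        # One deny-rate sample per hour; 7 days of hourly samples by default.
        self.baseline = deque(maxlen=window_size)

    def record_hour(self, denies: int, total: int) -> None:
        if total > 0:
            self.baseline.append(denies / total)

    def on_policy_deploy(self) -> None:
        # Intentional policy changes shift deny rates; reset the window
        # so the monitor does not alert on expected behavior.
        self.baseline.clear()

    def z_score(self, current_rate: float) -> float | None:
        if len(self.baseline) < 2:
            return None  # not enough history to compare against
        mean = statistics.fmean(self.baseline)
        stdev = statistics.stdev(self.baseline)
        if stdev == 0:
            return None
        return (current_rate - mean) / stdev

monitor = DenyRateDriftMonitor()
for _ in range(48):
    monitor.record_hour(denies=2, total=100)  # stable 2% deny rate
monitor.record_hour(denies=3, total=100)
z = monitor.z_score(current_rate=0.08)        # 8% in the current window
if z is not None and z > 2:
    print(f"investigate: deny-rate z-score {z:.1f}")
```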
Governance audit trails
An audit trail is the compliance-ready record of every governed decision. It answers: who acted, what policy applied (and which version), who approved, what happened, and when. It must be immutable, timestamped, and queryable.
The difference between an audit trail and a log is structure and integrity. Logs are append-only text. Audit trails are structured records with hash chains, policy version pointers, and approval evidence that can be independently verified.
Example: audit record
```json
{
  "event_id": "evt_0195f2c8",
  "run_id": "run_8bce4a",
  "trace_id": "trc_8f2a4b6c",
  "tenant": "prod-a",
  "actor": {
    "type": "agent",
    "id": "deploy-agent-prod-3"
  },
  "policy": {
    "decision": "REQUIRE_APPROVAL",
    "matched_rule": "approval-prod-write",
    "policy_version": "pol_2026_04_07",
    "policy_hash": "sha256:3f91b6a9..."
  },
  "approval": {
    "required": true,
    "approver": "oncall_sre",
    "approved_at": "2026-04-09T09:33:12Z",
    "method": "slack_button"
  },
  "execution": {
    "tool": "kubectl.apply",
    "status": "SUCCESS",
    "duration_ms": 2340
  },
  "integrity": {
    "prev_hash": "a0f965...2b1e",
    "hash": "0d8d6e...ee0a",
    "sig_alg": "ed25519"
  },
  "ts": "2026-04-09T09:33:15Z"
}
```

Design principles for production audit trails:
- Policy version pinning: Every decision record includes the exact policy version and hash that was active at evaluation time. If a rule changes tomorrow, yesterday's decisions still reference the rule that applied.
- Approval evidence: When a decision requires human approval, the record captures the approver identity, method (Slack button, CLI, dashboard), and timestamp. Not just "approved: true".
- Integrity hashing: Each record includes a hash of its contents and a pointer to the previous record's hash. This creates a tamper-evident chain. If a record is modified after the fact, the hash chain breaks. A minimal verification sketch follows this list.
- Queryable storage: Audit records should support queries like "show me all DENY decisions for agent X in the last 24 hours" or "show me all approvals by user Y during the incident window." If you need a developer to write a script, the trail is not operationally useful.
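Here is a rough sketch of what chain verification looks like, assuming records are canonicalized as sorted JSON and hashed with SHA-256. The field names follow the example record; the canonicalization choice is ours, the example's hashes are truncated for display while the sketch assumes full digests, and signature verification (sig_alg) is omitted.

```python
import hashlib
import json

def record_hash(record: dict) -> str:
    """Hash the record contents, excluding the integrity block itself."""
    body = {k: v for k, v in record.items() if k != "integrity"}
    canonical = json.dumps(body, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()

def verify_chain(records: list[dict]) -> bool:
    """Walk the chain; any edited record breaks its own hash or the link."""
    prev_hash = None
    for record in records:
        integrity = record["integrity"]
        if record_hash(record) != integrity["hash"]:
            return False  # record contents were modified after writing
        if prev_hash is not None and integrity["prev_hash"] != prev_hash:
            return False  # link broken: a record was removed or reordered
        prev_hash = integrity["hash"]
    return True
```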
For a deeper treatment of compliance-specific audit trail design, see the AI agent audit trails compliance guide.
Key metrics to track
These are the metrics that surface agent-specific problems traditional monitoring misses. Each one answers a question that matters during an incident or a compliance review.
| Metric | What it tells you | Alert threshold |
|---|---|---|
| Deny rate | Policy is blocking more actions than baseline. Could indicate misconfiguration, attack probes, or prompt drift. | > 3x 7-day baseline for 10 minutes |
| Approval latency (P50) | Governance is becoming a bottleneck. Agents are waiting too long for human sign-off. | > 15 minutes for high-risk class |
| Fail-open count | Safety kernel was unavailable and the system continued without policy enforcement. | > 0 in any 5-minute window |
| Drift score | Agent behavior distribution has shifted from the last stable baseline period. | > 2 standard deviations from 14-day rolling mean |
| Tool selection entropy | Agent is using tools in unexpected combinations or frequencies compared to training baseline. | > 1.5x weekly entropy baseline |
| Trace coverage ratio | Percentage of policy decisions that have complete end-to-end traces. Gaps mean blind spots. | < 99% sustained for 15 minutes |
Start with deny rate and fail-open count. Those two metrics alone catch the highest-severity agent incidents: policy violations and unprotected execution. Add drift score and trace coverage as the system matures.
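Of the metrics above, tool selection entropy is the least standard, so here is a small sketch: Shannon entropy over tool-call frequencies in a window, compared against a baseline. The 1.5x threshold mirrors the table; the sample data and the choice of bits as the unit are ours.

```python
import math
from collections import Counter

def tool_entropy(tool_calls: list[str]) -> float:
    """Shannon entropy (bits) of the tool-usage distribution in a window."""
    counts = Counter(tool_calls)
    total = sum(counts.values())
    return -sum((n / total) * math.log2(n / total) for n in counts.values())

baseline = tool_entropy(["kubectl.apply"] * 80 + ["kubectl.get"] * 20)
current = tool_entropy(["kubectl.apply"] * 40 + ["kubectl.get"] * 20
                       + ["kubectl.delete"] * 40)
if current > 1.5 * baseline:
    print(f"entropy drift: {current:.2f} bits vs baseline {baseline:.2f}")
```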
Build vs buy
You can build agent observability from scratch. Structured logging, trace correlation, policy version tracking, drift alerting, and queryable audit storage are all implementable with existing tools. The question is whether that is where your engineering time should go.
| Capability | Build yourself | What Cordum provides |
|---|---|---|
| Structured decision logging | Custom middleware on every tool call and policy check. 2-4 weeks for a basic implementation. | Built-in decision records with trace_id propagation on every policy evaluation. |
| Trace correlation | Integrate OpenTelemetry spans across agent framework, policy engine, and execution layer. | Automatic trace context from job submission through policy check, approval, and dispatch. |
| Policy version tracking | Version-control policies, snapshot on evaluation, store hash with each decision record. | Policy snapshots attached to every decision. Query any historical evaluation by version. |
| Drift alerting | Compute rolling baselines, z-scores, and anomaly thresholds. Maintain a statistics pipeline. | Configurable drift monitors on deny rate, approval patterns, and tool selection entropy. |
| Audit export and query | Build queryable storage with retention policies, legal hold support, and export tooling. | Queryable audit log with configurable retention, compliance export, and incident replay. |
The build path makes sense if you have a small number of agents with simple governance needs and existing observability infrastructure. The buy path makes sense when you need production-grade tracing, drift detection, and audit trails without spending a quarter building plumbing.
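To make the first row of the build column concrete, structured decision logging reduces to a small piece of middleware around every tool call. This is a sketch, not a definitive implementation: the decorator name, the policy_engine.evaluate call, and the verdict fields are all illustrative stand-ins, and the REQUIRE_APPROVAL path is omitted for brevity.

```python
import functools
import json
import sys
import time
import uuid

def governed(tool_name: str, policy_engine, sink=sys.stdout):
    """Wrap a tool call: evaluate policy, log the decision, then execute."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, trace_id: str, **kwargs):
            verdict = policy_engine.evaluate(tool_name, kwargs)  # illustrative API
            sink.write(json.dumps({
                "trace_id": trace_id,
                "span_id": f"spn_{uuid.uuid4().hex[:12]}",
                "ts": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
                "tool": tool_name,
                "decision": verdict.decision,  # e.g. ALLOW / DENY / REQUIRE_APPROVAL
                "rule_id": verdict.rule_id,
                "policy_version": verdict.policy_version,
            }) + "\n")
            if verdict.decision == "DENY":
                raise PermissionError(f"{tool_name} denied by {verdict.rule_id}")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

# Usage sketch (policy_engine is any object whose evaluate() returns
# .decision/.rule_id/.policy_version — a stand-in, not a real library):
# @governed("kubectl.apply", policy_engine)
# def apply_manifest(manifest: str): ...
# apply_manifest(manifest, trace_id="trc_8f2a4b6c")
```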
Getting started
You do not need all three pillars on day one. Start with the one that has the highest immediate return: structured decision logging.
1. Log every policy decision with a trace_id. This is the single highest-leverage change. When something goes wrong, you can pull one ID and see the full decision chain. Without it, incident investigation is log archaeology.
2. Add deny-rate monitoring with a 7-day rolling baseline. A simple z-score threshold catches policy drift, attack probes, and misconfiguration faster than any dashboard review.
3. Store audit records with policy version and integrity hashes. Even before you need compliance exports, immutable decision records shorten every incident review.
4. Instrument approval latency. The governance bottleneck in most agent systems is not policy evaluation time. It is the time a human takes to approve a high-risk action. Measure it, alert on it, and use it to tune risk-tier routing.
5. Add trace coverage monitoring. If only 80% of policy decisions have complete traces, the other 20% are blind spots where incident reconstruction will fail. A small coverage sketch follows this list.
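For step 5, the coverage check can stay very small. A sketch, assuming decision and span records shaped like the examples earlier in this piece; the function name and record access are ours.

```python
def trace_coverage(decisions: list[dict], spans: list[dict]) -> float:
    """Fraction of policy decisions whose trace_id has at least one
    execution-outcome span, i.e. a complete end-to-end trace."""
    completed = {s["trace_id"] for s in spans
                 if s.get("outcome", {}).get("execution_status")}
    if not decisions:
        return 1.0
    covered = sum(1 for d in decisions if d["trace_id"] in completed)
    return covered / len(decisions)

# Alert when coverage dips below the 99% threshold from the metrics table.
coverage = trace_coverage(
    decisions=[{"trace_id": "trc_8f2a4b6c"}],
    spans=[{"trace_id": "trc_8f2a4b6c",
            "outcome": {"execution_status": "SUCCESS"}}])
assert coverage == 1.0
```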
The goal is not perfect observability on launch day. The goal is that every incident teaches you which signal was missing, and you add it before the next one.
Frequently Asked Questions
How is AI agent observability different from traditional APM?
APM tracks service-level health: latency percentiles, error codes, throughput. Agent observability tracks decisions: which tool was selected, which policy rule matched, who approved, and whether the behavior matched intent. A healthy service can still host an agent making the wrong decisions.

What is decision tracing for AI agents?
Decision tracing records the full path from agent intent to execution outcome: tool selection, policy evaluation, approval, and result, linked by a shared trace_id so one query reconstructs the entire decision chain.

How do you detect behavioral drift in autonomous agents?
Compute rolling baselines for signals like deny rate, approval patterns, tool selection entropy, and output safety rate, then alert when current values diverge beyond a threshold, such as a z-score above 2. Reset baselines after intentional policy changes.

What should an AI agent audit trail include?
Actor identity, the exact policy version and hash active at evaluation time, the matched rule and decision, approval evidence (approver, method, timestamp), execution outcome, and integrity hashes that chain records into a tamper-evident sequence.

Can I use OpenTelemetry for agent observability?
Yes. The build-yourself path integrates OpenTelemetry spans across the agent framework, policy engine, and execution layer. The essential part is propagating one trace context from job submission through policy check, approval, and dispatch.

What is a fail-open event and why does it matter?
A fail-open event means the safety kernel was unavailable and the system continued without policy enforcement. Any count above zero warrants an immediate alert, because actions executed unprotected.

How often should drift baselines be recalibrated?
After every intentional policy change. Tag policy deployments and auto-reset rolling windows; otherwise the system alerts on expected behavior. Between deliberate changes, the rolling window (7 to 14 days in the examples above) recalibrates itself.

What is the minimum observability setup for a new agent deployment?
Structured decision logging with a trace_id on every policy decision, plus deny-rate monitoring against a 7-day baseline. Add audit records with policy versions, approval latency instrumentation, and trace coverage monitoring as the system matures.
Next step
Pick one agent workflow running in production today. Add a trace_id to every policy decision it generates. Set up a deny-rate baseline and a single alert. Then trace one real incident end to end using the decision records. If you can reconstruct the full decision chain in under five minutes, your observability foundation is solid. If not, you know exactly where the gaps are.
Continue with AI Agent Security Best Practices and AI Agent Audit Trails Compliance Guide.