The production problem
Most teams monitor AI systems with latency and error dashboards. Those metrics can look healthy while the agent keeps making risky decisions.
During audits, the first hard question is simple: who approved this action and under which rule version? If your answer is a screenshot and a Slack thread, you do not have an audit trail.
What top ranking sources cover vs miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| IBM compliance auditability article | Clear accountability framing, framework differences (SOC 2, GDPR, ISO 27001), and practical auditability layers. | No concrete API-level contract for linking policy decisions to workflow execution state in a control plane. |
| AWS Bedrock model invocation logging guide | Strong baseline for collecting model request and response payloads, metadata, and destination configuration. | Invocation logs alone do not represent approval lineage or tool execution governance outcomes. |
| Braintrust LLM observability guide | Useful production framing for tracing, evals, and quality monitoring beyond uptime and latency metrics. | Limited focus on audit-grade evidence artifacts tied to policy snapshots and approval controls. |
Audit record schema
Good evidence models are boring on purpose. They are small, explicit, and easy to join. You want fields that survive model upgrades, workflow edits, and staff turnover.
| Field | Why required | Source endpoint |
|---|---|---|
| run_id + workflow_step_id | Connects every decision to a workflow context | GET /api/v1/workflow-runs/{id}, /timeline |
| job_id + trace_id | Anchors events to a unique execution path | GET /api/v1/jobs/{id} |
| decision + policy_rule_id + policy_snapshot | Shows why a decision happened and under which policy version | GET /api/v1/jobs/{id}/decisions, GET /api/v1/approvals |
| approval_required + resolution + resolved_by | Proves human gate behavior and accountability | GET /api/v1/approvals |
| job_hash | Binds approval to the exact request payload | GET /api/v1/approvals |
| context_ptr + result_ptr | Provides pointer-level reproducibility for evidence review | GET /api/v1/jobs/{id} |
| output_safety.decision | Captures post-execution release, redact, or quarantine outcome | GET /api/v1/jobs/{id} |
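A record is only evidence if the required fields are actually present. A minimal structural check, sketched in Python (field names follow the table above; the helper itself is hypothetical, not part of any control-plane SDK):

```python
# Fields every audit record must carry, per the schema table above.
REQUIRED = [
    "run_id", "workflow_step_id", "job_id", "trace_id",
    "decision", "policy_rule_id", "policy_snapshot",
    "approval_required", "job_hash",
    "context_ptr", "result_ptr", "timestamp",
]

def missing_fields(record: dict) -> list:
    """Return the required audit fields absent from a record."""
    missing = [f for f in REQUIRED if f not in record]
    # Approval lineage fields are only mandatory when a gate actually fired.
    if record.get("approval_required"):
        missing += [f for f in ("resolution", "resolved_by") if f not in record]
    return missing
```

Running this against every record at ingest time is cheap, and it catches the most common evidence gap: an approval gate that fired without a recorded resolver.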
```json
{
  "run_id": "run-01",
  "workflow_step_id": "approve-change",
  "job_id": "job-123",
  "trace_id": "trace-456",
  "decision": "REQUIRE_APPROVAL",
  "policy_rule_id": "prod-write-needs-approval",
  "policy_reason": "Production writes need manager approval",
  "policy_snapshot": "cfg:system:policy#sha256:7f3d...9c2b",
  "job_hash": "b3b5...8f1a",
  "approval_required": true,
  "resolution": "approved",
  "resolved_by": "manager-2",
  "resolved_comment": "ticket INC-123",
  "context_ptr": "ctx:job-123",
  "result_ptr": "res:job-123",
  "output_safety": {
    "decision": "ALLOW",
    "rule_id": "out-safe-1"
  },
  "timestamp": "2026-04-01T14:10:32Z"
}
```

Query patterns that survive audits
The fastest path in an incident is one command list that joins execution, policy, and approval evidence. Build this once and run it for every critical workflow.
```shell
# 1) Job state + pointers + output safety metadata
curl -sS http://localhost:8081/api/v1/jobs/job-123

# 2) Policy decisions for the job
curl -sS http://localhost:8081/api/v1/jobs/job-123/decisions

# 3) Approval queue with decision summary and job hash linkage
curl -sS "http://localhost:8081/api/v1/approvals?include_resolved=false"

# 4) Append-only workflow run timeline
curl -sS "http://localhost:8081/api/v1/workflow-runs/run-01/timeline?limit=200"

# 5) Policy publish/rollback audit stream
curl -sS http://localhost:8081/api/v1/policy/audit
```
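Once the raw payloads are fetched, the join itself is mechanical. A sketch in Python of the join step, assuming response shapes like those above (field names such as `id` and the exact payload structure are illustrative assumptions, not a documented API contract):

```python
def join_evidence(job: dict, decisions: list, approvals: list) -> dict:
    """Join execution state, policy decisions, and approval outcomes
    into one evidence record for a single job."""
    record = {
        "job_id": job["id"],
        "trace_id": job.get("trace_id"),
        "context_ptr": job.get("context_ptr"),
        "result_ptr": job.get("result_ptr"),
        "output_safety": job.get("output_safety"),
        "decisions": [
            {"decision": d["decision"], "policy_rule_id": d.get("policy_rule_id")}
            for d in decisions
        ],
    }
    # Bind approvals via the job hash, not just the id, so a replayed or
    # edited payload cannot silently reuse an earlier approval.
    record["approvals"] = [
        a for a in approvals
        if a.get("job_id") == job["id"] and a.get("job_hash") == job.get("job_hash")
    ]
    return record
```

The hash check is the important design choice: joining on `job_id` alone would let an approval granted for one payload vouch for a different one.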
Operational defaults
Defaults decide whether logs become evidence or noise. Make these explicit in your runbooks.
| Control | Default | Why it exists |
|---|---|---|
| Approvals query filter | include_resolved=false for pending queue | Separates active risk from historical records |
| Workflow timeline endpoint | GET /api/v1/workflow-runs/{id}/timeline | Gives event chronology without replaying raw bus traffic |
| Decision endpoint | GET /api/v1/jobs/{id}/decisions | Makes policy rationale reviewable per execution |
| Policy audit endpoint | GET /api/v1/policy/audit | Tracks publish and rollback events for change governance |
| Output safety metadata | output_safety.* in job payload | Prevents blind spots between execution and release |
| Model invocation logs | disabled by default in Bedrock | Teams must explicitly enable retention of prompt/response evidence |
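On the last row: in Amazon Bedrock, invocation logging stays off until you publish a logging configuration through the `put-model-invocation-logging-configuration` API. A sketch of the `loggingConfig` payload (bucket, log group, and role names are placeholders; verify field names against the current Bedrock API reference):

```json
{
  "cloudWatchConfig": {
    "logGroupName": "/bedrock/invocation-logs",
    "roleArn": "arn:aws:iam::123456789012:role/bedrock-logging"
  },
  "s3Config": {
    "bucketName": "my-bedrock-audit-bucket",
    "keyPrefix": "invocation-logs/"
  },
  "textDataDeliveryEnabled": true,
  "imageDataDeliveryEnabled": false,
  "embeddingDataDeliveryEnabled": false
}
```

Because these logs capture full prompt and response payloads, they inherit the same access-boundary concerns as any other sensitive audit artifact.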
Limitations and tradeoffs
- Rich evidence retention increases storage and indexing load over time.
- Audit payloads can include sensitive context and need strong access boundaries.
- Good APIs still fail if teams skip approval notes or use inconsistent run identifiers.