The production problem
Agent teams usually start with a confirm button and call it governance. That approach breaks once actions involve production writes, payment flows, or access changes.
Approval quality drops fast when reviewers see low-context requests. In production, a rushed yes can be as dangerous as a bad model output.
What top-ranking sources cover vs. miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Microsoft function tool approvals | Solid runtime approval loop: propose call, request human input, then continue session with explicit approve or reject. | No control-plane model for tenant policy snapshots, decision summaries, and queue-wide operational governance. |
| StackAI HITL approval workflow guide | Practical risk-based approval design, idempotency emphasis, and reviewer evidence-pack recommendations. | Limited API-level detail on approval binding fields and execution-state correlation endpoints. |
| Before the Tool Call (arXiv 2603.20953) | Strong argument and quantitative data for deterministic pre-action authorization before tool execution. | Focuses on the authorization layer; less detail on enterprise queue operations and workflow gate UX. |
Approval architecture
Safe approval design separates proposal from execution. The agent proposes. Policy decides. Humans approve only where risk requires it.
| Layer | Role | Failure mode if missing |
|---|---|---|
| Policy decision point | Return REQUIRE_APPROVAL before side effects occur | Agent executes action immediately with no gate |
| Approval queue | Route pending actions by risk and approver role | Reviewer overload and rubber-stamp behavior |
| Execution bind | Resume only if approval record matches job hash and policy snapshot | Approval replay on mismatched request payloads |
| Decision summary | Expose what, why, and next effect in reviewer context | Blind approvals and low-quality decisions |
| Audit trail | Persist resolution, actor, and policy linkage | No defensible evidence during audits or incidents |
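The execution bind row is the easiest layer to get subtly wrong, so a minimal sketch may help. It assumes the job hash is a SHA-256 digest over a canonical JSON encoding of the request payload; the `ApprovalRecord` shape and the function names `compute_job_hash` and `may_execute` are hypothetical, chosen only for illustration.

```python
import hashlib
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class ApprovalRecord:
    # Hypothetical record shape; field names follow the binding described above.
    job_hash: str
    policy_snapshot: str
    resolution: str  # "approved" or "rejected"


def compute_job_hash(payload: dict) -> str:
    """Hash a canonical JSON encoding so key order cannot change the digest."""
    canonical = json.dumps(payload, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def may_execute(payload: dict, active_snapshot: str, record: ApprovalRecord) -> bool:
    """Resume only if the approval covers this exact payload and policy version."""
    if record.resolution != "approved":
        return False
    same_job = compute_job_hash(payload) == record.job_hash
    same_policy = active_snapshot == record.policy_snapshot
    return same_job and same_policy
```

With this check in place, an approval granted for one payload cannot be replayed against an edited payload or after the policy has changed, because either difference breaks the match.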
Risk-tiered routing
Route approvals by blast radius. Do not send every action to humans. That only creates reviewer fatigue and slows operations.
| Action class | Policy decision | Approver | SLA |
|---|---|---|---|
| Read-only search/classification | ALLOW | None | Immediate |
| Internal ticket update | ALLOW_WITH_CONSTRAINTS | Sampled or exception-only | Near real-time |
| Production write or config change | REQUIRE_APPROVAL | Service owner or on-call engineer | 15-60 minutes |
| Payments, access grants, destructive deletes | REQUIRE_APPROVAL | Dual approval | Policy-defined cutoffs |
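The routing table above can be sketched as an ordered lookup from risk tags to a route. This is an illustration under assumptions: the tag names, the `Route` type, and the fail-closed default for unknown action classes are hypothetical and not part of the table itself.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Route:
    decision: str
    approver: Optional[str]
    approvals_needed: int


# Ordered from highest to lowest blast radius; the first matching tier wins.
# Tag names are illustrative assumptions.
TIERS = [
    ({"payment", "access_grant", "destructive_delete"},
     Route("REQUIRE_APPROVAL", "dual approval", 2)),
    ({"prod_write", "config_change"},
     Route("REQUIRE_APPROVAL", "service owner or on-call engineer", 1)),
    ({"internal_ticket_update"},
     Route("ALLOW_WITH_CONSTRAINTS", "sampled or exception-only", 0)),
    ({"read_only"},
     Route("ALLOW", None, 0)),
]


def route(risk_tags: set) -> Route:
    """Pick the strictest tier that shares at least one tag with the action."""
    for tier_tags, tier_route in TIERS:
        if risk_tags & tier_tags:
            return tier_route
    # Assumption: unknown action classes fail closed and go to a human.
    return Route("REQUIRE_APPROVAL", "service owner or on-call engineer", 1)
```

Checking tiers from most to least dangerous means an action tagged both `read_only` and `payment` still lands in the dual-approval tier.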
Implementation examples
```yaml
version: v1
rules:
  - id: allow-read
    match:
      topics: ["job.*.read", "job.*.list", "job.*.get"]
      risk_tags: []
    decision: allow
  - id: prod-write-gate
    match:
      risk_tags: ["prod", "write"]
      topics: ["job.deploy.*", "job.db.*", "job.config.*"]
    decision: require_approval
    reason: "Production writes require human approval"
  - id: payment-gate
    match:
      risk_tags: ["payment"]
    decision: require_approval
    reason: "Financial operations require dual approval"
  - id: egress-bounded
    match:
      risk_tags: ["egress"]
    decision: allow_with_constraints
    constraints:
      max_runtime_sec: 60
      network_allowlist: ["api.github.com", "api.slack.com"]
```

```http
GET /api/v1/approvals?include_resolved=false

200 OK
{
  "items": [
    {
      "job": { "id": "job-1", "state": "APPROVAL" },
      "decision": "REQUIRE_APPROVAL",
      "policy_snapshot": "cfg:system:policy#sha256:7f3d...9c2b",
      "policy_rule_id": "prod-write-gate",
      "policy_reason": "Production writes require human approval",
      "job_hash": "b3b5...8f1a",
      "approval_required": true,
      "decision_summary": {
        "title": "Review production config change",
        "why": "Production writes require human approval",
        "next_effect": "Approve to continue deployment workflow"
      }
    }
  ]
}
```

```shell
# Pending approval queue
curl -sS "http://localhost:8081/api/v1/approvals?include_resolved=false"

# Approve action
curl -sS -X POST http://localhost:8081/api/v1/approvals/job-1/approve \
  -H 'Content-Type: application/json' \
  -d '{"reason":"approved by on-call","note":"ticket INC-123"}'

# Reject action
curl -sS -X POST http://localhost:8081/api/v1/approvals/job-1/reject \
  -H 'Content-Type: application/json' \
  -d '{"reason":"policy violation","note":"missing rollback plan"}'

# Verify decisions attached to job
curl -sS http://localhost:8081/api/v1/jobs/job-1/decisions
```

Operational defaults
| Control | Default | Why it exists |
|---|---|---|
| Queue endpoint | GET /api/v1/approvals | Single queue for workflow and policy approvals |
| Pending filter | include_resolved=false | Prevents operational dashboards from mixing active and historical items |
| Decision linkage | policy_snapshot + job_hash | Binds approval to exact request and policy version |
| Decision summary | title + why + next_effect | Gives reviewer enough context to make a fast, informed decision |
| Resolution metadata | resolved_by + resolved_comment + resolution | Preserves accountability for audit and forensics |
| Route coverage | No GET /api/v1/approvals/{id} | Teams should design queue operations around list and resolution routes |
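One way to enforce the decision linkage and decision summary defaults is a small check on each queue item before it reaches a reviewer. The sketch below assumes the queue response is a dictionary with an `items` list carrying `policy_snapshot`, `job_hash`, and `decision_summary` fields; the helper names `missing_fields` and `reviewable` are hypothetical.

```python
REQUIRED_BINDING = ("policy_snapshot", "job_hash")
REQUIRED_SUMMARY = ("title", "why", "next_effect")


def missing_fields(item: dict) -> list:
    """Return the names of binding or summary fields that are absent or empty."""
    missing = [name for name in REQUIRED_BINDING if not item.get(name)]
    summary = item.get("decision_summary") or {}
    missing += [
        f"decision_summary.{name}"
        for name in REQUIRED_SUMMARY
        if not summary.get(name)
    ]
    return missing


def reviewable(queue_response: dict) -> list:
    """Keep only queue items that carry everything a reviewer needs."""
    return [
        item
        for item in queue_response.get("items", [])
        if not missing_fields(item)
    ]
```

Items that fail the check are better surfaced as a data-quality alert than shown to a reviewer, since an approval without binding fields or context is exactly the blind approval the defaults are meant to prevent.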
Limitations and tradeoffs
Approval queues can bottleneck if routing and action tiers are poorly tuned.
Too much evidence in the UI slows decisions and encourages superficial reviews.
Approval routing degrades if policy rules are not reviewed as workflows evolve.