The problem
Many teams stop at reviewer UX. They build a clean approval screen, route notifications, and feel done. Then an outage hits. The approval endpoint retries, policy snapshots change mid-flight, or a worker goes offline after approval.
Approval workflows fail in production when they do not verify integrity before publish and when approve endpoints are not idempotent under concurrency.
What top sources miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Microsoft agents-humanoversight | Decorator pattern (`@approval_gate`), timeout handling, and orchestrator-agnostic integration. | No snapshot/hash integrity check before publish, so policy drift between request and approval is not fully addressed. |
| n8n human review tools docs | Practical reviewer channels (Slack, Teams, email, chat) and review payload context via `$tool` variables. | Limited guidance for distributed idempotency and requeue semantics after approval under infrastructure failures. |
| StackAI HITL implementation guide | Good patterns for escalation, evidence packs, and audit-ready reviewer UX. | Does not define concrete control-plane checks like approval lock contention handling or snapshot mismatch conflicts. |
Approval model
- Step 1: Evaluate policy at submit-time and block publish for `require_human` decisions.
- Step 2: Persist approval context, including policy snapshot and deterministic job hash.
- Step 3: At approve time, re-validate snapshot and hash under a distributed lock.
- Step 4: Publish once, mark approval metadata, and make repeated approve calls idempotent.
- Step 5: Replay approved but stuck jobs to survive transient worker availability failures.
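Steps 2 and 3 depend on a hash that stays stable for equivalent requests. The sketch below shows one possible way to compute it, assuming a SHA-256 digest over a key-sorted JSON serialization. The `JobRequest` shape and the names `canonicalize`, `computeJobHash`, `buildApprovalRecord`, and `hasDrifted` are illustrative assumptions, not part of any specific product API.

```typescript
import { createHash } from "node:crypto";

// Hypothetical shape of a submitted job; field names are illustrative.
type JobRequest = { topic: string; priority?: string; payload?: unknown };

// Serialize with object keys sorted at every level, so that requests that
// differ only in key order produce the same string.
function canonicalize(value: unknown): string {
  if (value === undefined) return "null"; // mirrors JSON's handling inside arrays
  if (Array.isArray(value)) {
    return "[" + value.map(canonicalize).join(",") + "]";
  }
  if (value !== null && typeof value === "object") {
    const obj = value as Record<string, unknown>;
    const parts = Object.keys(obj)
      .filter((k) => obj[k] !== undefined) // drop undefined fields, as JSON does
      .sort()
      .map((k) => JSON.stringify(k) + ":" + canonicalize(obj[k]));
    return "{" + parts.join(",") + "}";
  }
  return JSON.stringify(value);
}

// Deterministic job hash: SHA-256 over the canonical form (an assumed choice).
function computeJobHash(job: JobRequest): string {
  return createHash("sha256").update(canonicalize(job)).digest("hex");
}

type ApprovalRecord = { jobHash: string; policySnapshot: string };

// Step 2: capture the hash and policy snapshot at submit time.
function buildApprovalRecord(job: JobRequest, policySnapshot: string): ApprovalRecord {
  return { jobHash: computeJobHash(job), policySnapshot };
}

// Step 3: at approve time, recompute and compare against the stored record.
function hasDrifted(record: ApprovalRecord, job: JobRequest, currentSnapshot: string): boolean {
  return record.jobHash !== computeJobHash(job) || record.policySnapshot !== currentSnapshot;
}
```

Sorting keys before hashing matters because two services may serialize the same request with different key order; without canonicalization, a harmless re-serialization would look like tampering.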
Runtime behavior with real values
| Control point | Behavior | Observed values | Outcome |
|---|---|---|---|
| Submit-time policy | Policy is evaluated before state persistence and queue publish. | 5s eval timeout; deny=403, throttle=429, require_human=APPROVAL | Unsafe actions never enter dispatch path. |
| Approval initialization | Approval-required jobs persist request + safety decision + job hash. | Stores `PolicySnapshot`, `JobHash`, `ApprovalRef` | Approval endpoint can verify drift before releasing job. |
| Approve endpoint | Distributed lock and strict validation before publish. | 409 on snapshot changed/job request changed; idempotent already_approved | Prevents stale or tampered approvals from shipping. |
| Post-approval dispatch | State transitions to `PENDING`, then job is published. | `approval_granted=true` label added before publish | Scheduler sees approval context and resumes normal routing. |
| Recovery path | Pending replayer rescans approval state for approved stuck jobs. | Replays jobs in approval state with `approval_granted=true` | Approved work is not lost during worker availability gaps. |
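The approve-endpoint row above can be sketched as a single function. This is a minimal in-memory illustration under stated assumptions, not the actual implementation: `InMemoryLock` stands in for a real distributed lock, the `StoredJob` shape and the `approveJob` name are hypothetical, and returning 409 for a busy lock is an assumption (the table only specifies 409 for snapshot and job-request changes).

```typescript
type StoredJob = {
  id: string;
  state: "APPROVAL" | "PENDING";
  jobHash: string;        // hash persisted at submit time
  policySnapshot: string; // snapshot persisted at submit time
  labels: Record<string, string>;
};

type ApproveResult =
  | { status: 200; outcome: "approved" | "already_approved" }
  | { status: 404; outcome: "not_found" }
  | { status: 409; outcome: "snapshot_changed" | "job_changed" | "lock_busy" };

// In-memory stand-in for a distributed lock; a real deployment would use a
// shared lock service instead.
class InMemoryLock {
  private held = new Set<string>();
  tryAcquire(key: string): boolean {
    if (this.held.has(key)) return false;
    this.held.add(key);
    return true;
  }
  release(key: string): void {
    this.held.delete(key);
  }
}

function approveJob(
  jobs: Map<string, StoredJob>,
  lock: InMemoryLock,
  jobId: string,
  currentSnapshot: string, // policy snapshot in effect right now
  currentJobHash: string,  // hash recomputed from the stored request
  publish: (job: StoredJob) => void,
): ApproveResult {
  // Assumption: a busy lock is reported as a retryable 409.
  if (!lock.tryAcquire(jobId)) return { status: 409, outcome: "lock_busy" };
  try {
    const job = jobs.get(jobId);
    if (!job) return { status: 404, outcome: "not_found" };
    // Repeated approve calls return early and never publish twice.
    if (job.labels["approval_granted"] === "true") {
      return { status: 200, outcome: "already_approved" };
    }
    if (job.policySnapshot !== currentSnapshot) return { status: 409, outcome: "snapshot_changed" };
    if (job.jobHash !== currentJobHash) return { status: 409, outcome: "job_changed" };
    // Label first, then transition to PENDING, then publish.
    job.labels["approval_granted"] = "true";
    job.state = "PENDING";
    publish(job);
    return { status: 200, outcome: "approved" };
  } finally {
    lock.release(jobId);
  }
}
```

Checking the `approval_granted` label before the drift checks is a design choice in this sketch: a retried approve call still reports `already_approved` even if the policy changed after the first successful approval.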
Implementation code
1) Policy rules for approval-required actions
version: v1
rules:
  - id: allow-readonly
    when:
      topic: knowledge.read
    decision: allow
  - id: require-approval-prod-finance
    when:
      topic: finance.payment.execute
      env: production
    decision: require_human
  - id: require-approval-external-notify
    when:
      topic: customer.notify
      channel: external
    decision: require_human
2) API flow with idempotent approve calls
# 1) Submit action (may return approval_required)
curl -s -X POST "$CORDUM_API/api/v1/jobs" \
  -H "Content-Type: application/json" \
  -d '{"topic":"finance.payment.execute","priority":"high"}'
# 2) Approve after reviewer decision
curl -s -X POST "$CORDUM_API/api/v1/approvals/$JOB_ID/approve" \
  -H "Content-Type: application/json" \
  -d '{"reason":"validated invoice and policy context"}'
# 3) Repeated approve is idempotent (already_approved)
curl -s -X POST "$CORDUM_API/api/v1/approvals/$JOB_ID/approve"
3) Side-effect idempotency after approval
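// `db`, `verifySnapshotAndHash`, and `runAction` are assumed to be provided by the application.
type ApprovalContext = {
  approvalId: string;
  policySnapshot: string;
  jobHash: string;
};

export async function executeAfterApproval(ctx: ApprovalContext, actionId: string) {
  // One idempotency key per (approval, action) pair.
  const key = ctx.approvalId + ":" + actionId;
  // If the side effect already ran, return the recorded result instead of repeating it.
  const existing = await db.sideEffects.findUnique({ where: { key } });
  if (existing) return existing.result;
  // Keep request immutable between approval and side effect.
  await verifySnapshotAndHash(ctx.policySnapshot, ctx.jobHash);
  const result = await runAction(actionId);
  // A unique constraint on `key` is advisable so concurrent writers cannot both record a result.
  await db.sideEffects.create({ data: { key, result } });
  return result;
}
Limitations and tradeoffs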
- Strict snapshot checks prevent stale approvals but can increase 409 conflicts during rapid policy updates.
- Distributed approval locks prevent double-publish but add contention during bursts.
- Idempotent approve paths reduce incidents but require extra storage and careful state modeling.
- Rich audit payloads help compliance but can expose sensitive data if log redaction is weak.
- Replaying approved jobs improves resilience but can hide deeper routing capacity problems if not monitored.
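The lock contention tradeoff above is commonly softened by retrying lock-busy responses on the caller side. The sketch below assumes capped exponential backoff with full jitter; the `lock_busy` outcome name, the default delays, and the function names `backoffDelayMs` and `approveWithRetry` are illustrative assumptions rather than documented behavior.

```typescript
// Delay before the next retry: a random value in [0, min(cap, base * 2^attempt)).
// `random` is injectable so the schedule can be checked deterministically.
function backoffDelayMs(
  attempt: number,
  baseMs: number = 50,
  capMs: number = 2000,
  random: () => number = Math.random,
): number {
  const ceiling = Math.min(capMs, baseMs * Math.pow(2, attempt));
  return Math.floor(random() * ceiling);
}

type ApproveOutcome = "approved" | "already_approved" | "lock_busy";

// Retries only while the approve call reports lock contention. Any other
// outcome is returned immediately; after maxAttempts the caller sees "lock_busy".
async function approveWithRetry(
  approveOnce: () => Promise<ApproveOutcome>,
  maxAttempts: number = 5,
  sleep: (ms: number) => Promise<void> = (ms) =>
    new Promise<void>((resolve) => {
      setTimeout(resolve, ms);
    }),
): Promise<ApproveOutcome> {
  for (let attempt = 0; attempt < maxAttempts; attempt++) {
    const outcome = await approveOnce();
    if (outcome !== "lock_busy") return outcome;
    // Skip the sleep after the final attempt.
    if (attempt < maxAttempts - 1) await sleep(backoffDelayMs(attempt));
  }
  return "lock_busy";
}
```

Retrying is only safe here because the approve call itself is idempotent: a retry that lands after another caller has already approved simply returns `already_approved`.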
Next step
Run this rollout in one sprint:
1. Define exactly which topics require approval in production.
2. Bind every approval to policy snapshot + job hash and reject mismatches.
3. Make approve/reject endpoints idempotent and add lock-busy retry behavior.
4. Add metrics for already-approved, snapshot-mismatch, and replayed-approved jobs.
5. Test worker outage during approval release and verify replay behavior end-to-end.
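For the last two rollout items, a replay pass with a counter might look like the sketch below. It assumes jobs are held in memory and that a stuck job is one still in the approval state with `approval_granted=true`; the `ReplayJob` and `ApprovalMetrics` shapes and the `replayApprovedJobs` name are hypothetical.

```typescript
type ReplayJob = {
  id: string;
  state: "APPROVAL" | "PENDING";
  labels: Record<string, string>;
};

// Hypothetical counters for the three metrics in the rollout list. Only
// replayedApproved is updated here; the other two would be incremented by
// the approve handler.
type ApprovalMetrics = {
  alreadyApproved: number;
  snapshotMismatch: number;
  replayedApproved: number;
};

// Scans for approved-but-stuck jobs and republishes them.
// Returns the ids that were replayed in this pass.
function replayApprovedJobs(
  jobs: ReplayJob[],
  publish: (job: ReplayJob) => void,
  metrics: ApprovalMetrics,
): string[] {
  const replayed: string[] = [];
  for (const job of jobs) {
    const stuck = job.state === "APPROVAL" && job.labels["approval_granted"] === "true";
    if (!stuck) continue;
    try {
      job.state = "PENDING";
      publish(job);
      metrics.replayedApproved += 1;
      replayed.push(job.id);
    } catch {
      // Publish failed: restore the state so the next pass retries this job.
      job.state = "APPROVAL";
    }
  }
  return replayed;
}
```

A steadily rising `replayedApproved` count is worth alerting on, since it suggests that the normal dispatch path is failing often and the replayer is masking it.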
Continue with approvals for autonomous workflows and audit trails for AI agents.