The production problem
Approval queues age. Policy evolves. Requests mutate.
If your system accepts old approvals after those changes, an operator signs one thing and executes another.
That is not a UX bug. It is a governance failure with audit fallout.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google Secret Manager: ETags for optimistic concurrency | ETag checks prevent one writer from overwriting another writer's newer intent. | No approval-workflow pattern that validates both policy version and request payload before execution. |
| Google Cloud setIamPolicy docs (`etag` guidance) | Read-modify-write with etag to avoid racing policy updates. | No human-approval queue semantics where an approval can expire because the policy snapshot changed. |
| Twilio: Mutation and conflict resolution | Mutation preconditions with ETag and If-Match to detect stale writes. | No pre-dispatch governance flow that combines snapshot drift and job-hash drift checks. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Policy snapshot guard | For non-workflow-gate approvals, Cordum compares the base of the current Safety Kernel snapshot against the base of the stored `policy_snapshot`. | If the policy changed, approval is rejected with `409 policy snapshot changed; re-evaluate before approving`. |
| Request hash guard | Cordum recomputes `scheduler.HashJobRequest(req)` and compares the result to the stored `safetyRecord.JobHash`. | If the request mutated, approval is rejected with `409 job request changed; approval rejected`. |
| Workflow gate branch | Workflow-gate approvals default an empty stored snapshot to `workflow-gate` and skip the Safety Kernel snapshot listing. | Workflow gates prioritize workflow-state checks and context over strict snapshot-base equality. |
| Service dependency | If `s.safetyClient` is unavailable for non-workflow approvals, approve returns `503 safety kernel unavailable`. | Availability of Safety Kernel affects approval throughput for policy approvals. |
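The rows above can be read as a single decision function. The following is a minimal sketch, not the gateway's actual handler: the `approvalCheck` type, its field names, the `decide` helper, and the `200 approved` result are all hypothetical. It also assumes that the snapshot guard runs before the hash guard and that the hash guard applies to workflow-gate approvals too; the table does not confirm either.

```go
package main

import (
	"fmt"
	"net/http"
	"strings"
)

// approvalCheck bundles the inputs to the guards.
// Every field name here is hypothetical and chosen for illustration.
type approvalCheck struct {
	WorkflowGate    bool
	KernelAvailable bool
	StoredSnapshot  string // snapshot recorded when approval was requested
	CurrentSnapshot string // snapshot active at approve time
	StoredJobHash   string
	CurrentJobHash  string
}

// base strips a "|cfg:hash" overlay suffix, mirroring the snapshotBase idea.
func base(snap string) string {
	if i := strings.Index(snap, "|"); i >= 0 {
		return snap[:i]
	}
	return snap
}

// decide returns the status and message that the table rows describe.
// Assumption: snapshot guard first, then hash guard for every approval type.
func decide(c approvalCheck) (int, string) {
	if !c.WorkflowGate {
		if !c.KernelAvailable {
			return http.StatusServiceUnavailable, "safety kernel unavailable"
		}
		if c.CurrentSnapshot == "" || base(c.CurrentSnapshot) != base(c.StoredSnapshot) {
			return http.StatusConflict, "policy snapshot changed; re-evaluate before approving"
		}
	}
	if c.CurrentJobHash != c.StoredJobHash {
		return http.StatusConflict, "job request changed; approval rejected"
	}
	return http.StatusOK, "approved" // hypothetical success message
}

func main() {
	// Snapshot and hash values below are made up for the example.
	cases := []approvalCheck{
		{KernelAvailable: true, StoredSnapshot: "v1|cfg:a", CurrentSnapshot: "v1|cfg:b", StoredJobHash: "h1", CurrentJobHash: "h1"},
		{KernelAvailable: true, StoredSnapshot: "v1", CurrentSnapshot: "v2", StoredJobHash: "h1", CurrentJobHash: "h1"},
		{KernelAvailable: true, StoredSnapshot: "v1", CurrentSnapshot: "v1", StoredJobHash: "h1", CurrentJobHash: "h2"},
		{WorkflowGate: true, StoredJobHash: "h1", CurrentJobHash: "h1"},
	}
	for _, c := range cases {
		status, msg := decide(c)
		fmt.Println(status, msg)
	}
}
```

Modeling the guards as a pure function like this makes each rejection path easy to exercise in a unit test without a running Safety Kernel.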
Snapshot and hash checks in code
Policy snapshot drift guard
// core/controlplane/gateway/handlers_approvals.go (excerpt)
policySnapshot := strings.TrimSpace(safetyRecord.PolicySnapshot)
if isWorkflowGate {
	if policySnapshot == "" {
		policySnapshot = "workflow-gate"
	}
} else {
	if policySnapshot == "" {
		result = handlerResult{http.StatusConflict, "approval policy snapshot unavailable"}
		return nil
	}
	snapResp, err := s.safetyClient.ListSnapshots(ctx, &pb.ListSnapshotsRequest{})
	if err != nil {
		result = handlerResult{http.StatusBadGateway, "list safety snapshots failed"}
		return nil
	}
	currentSnapshot := ""
	if snapResp != nil && len(snapResp.Snapshots) > 0 {
		currentSnapshot = strings.TrimSpace(snapResp.Snapshots[0])
	}
	if currentSnapshot == "" || snapshotBase(currentSnapshot) != snapshotBase(policySnapshot) {
		result = handlerResult{http.StatusConflict, "policy snapshot changed; re-evaluate before approving"}
		return nil
	}
}

Job request hash drift guard
// core/controlplane/gateway/handlers_approvals.go (excerpt)
hash, err := scheduler.HashJobRequest(req)
if err != nil {
	result = handlerResult{http.StatusInternalServerError, "failed to hash job request"}
	return nil
}
if hash != safetyRecord.JobHash {
	result = handlerResult{http.StatusConflict, "job request changed; approval rejected"}
	return nil
}

Overlay-tolerant snapshot base comparison
// core/controlplane/gateway/handlers_approvals.go (excerpt)
// Combined snapshots are "base|cfg:hash".
// snapshotBase strips config overlay hash so overlay-only changes do not invalidate approvals.
func snapshotBase(snap string) string {
	if i := strings.Index(snap, "|"); i >= 0 {
		return snap[:i]
	}
	return snap
}

Validation runbook
Run this in staging before changing approval semantics.
# 1) Create an approval-required job and capture /api/v1/approvals item fields:
# policy_snapshot, job_hash, job.id
# 2) Change active policy snapshot (publish new snapshot)
# 3) Call POST /api/v1/approvals/{job_id}/approve
# 4) Verify 409 "policy snapshot changed; re-evaluate before approving"
# 5) Re-run policy evaluation to create a fresh approval item
# 6) Modify job request labels/body for the old item and retry approve
# 7) Verify 409 "job request changed; approval rejected"

Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Snapshot + hash checks (current) | Catches both policy drift and request mutation before dispatch. | More conflict paths to handle in clients and operator runbooks. |
| Snapshot-only check | Lower compute overhead and simpler reasoning. | Request payload mutations can slip through stale approvals. |
| Hash-only check | Protects against payload tampering after approval queueing. | Policy drift can still approve actions under outdated policy assumptions. |
- Workflow-gate approvals intentionally use a different snapshot behavior than non-workflow policy approvals.
- I did not find dedicated gateway tests that directly assert the exact 409 messages for snapshot/hash drift branches.
- Hash checks depend on stable request canonicalization; field-order surprises in custom tooling can produce false conflicts.
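The field-order risk in the last bullet can be illustrated with a small sketch. It does not reproduce `scheduler.HashJobRequest`, whose canonicalization is not shown here; `rawHash` and `canonicalHash` are hypothetical helpers and the payloads are made up. The sketch relies on Go's `encoding/json` writing map keys in sorted order when it marshals a map.

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"encoding/json"
	"fmt"
)

// rawHash hashes the payload bytes exactly as received,
// so any change in key order changes the digest.
func rawHash(payload []byte) string {
	sum := sha256.Sum256(payload)
	return hex.EncodeToString(sum[:])
}

// canonicalHash decodes the JSON payload and re-encodes it before hashing.
// encoding/json emits map keys in sorted order, so the key order of the
// input no longer affects the digest.
func canonicalHash(payload []byte) (string, error) {
	var v interface{}
	if err := json.Unmarshal(payload, &v); err != nil {
		return "", err
	}
	canon, err := json.Marshal(v)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(canon)
	return hex.EncodeToString(sum[:]), nil
}

func main() {
	// Same logical request, different key order (illustrative payloads).
	a := []byte(`{"topic":"job.deploy","labels":{"env":"prod","team":"infra"}}`)
	b := []byte(`{"labels":{"team":"infra","env":"prod"},"topic":"job.deploy"}`)

	ha, errA := canonicalHash(a)
	hb, errB := canonicalHash(b)
	if errA != nil || errB != nil {
		fmt.Println("invalid JSON payload")
		return
	}
	fmt.Println("raw hashes equal:", rawHash(a) == rawHash(b))
	fmt.Println("canonical hashes equal:", ha == hb)
}
```

Because the two payloads differ only in key order, the raw digests differ while the canonical ones match. Decoding into `interface{}` turns every JSON number into a `float64`, so tooling that carries large integers may want `json.Decoder.UseNumber` instead.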
Next step
Do this next:
1. Add explicit tests for `policy snapshot changed` and `job request changed` conflict branches.
2. Expose machine-readable conflict codes so SDKs can route retries vs re-evaluation flows.
3. Document workflow-gate vs non-workflow-gate approval semantics in the public API docs.
4. Track drift-conflict rate as an SLI to catch policy rollout regressions early.
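Items 2 and 4 could look roughly like the sketch below. Nothing here exists in the gateway today: the `ConflictCode` values, `codeFor`, `nextStep`, and `driftStats` are hypothetical names, and matching on message strings is only a stopgap until the API returns real codes. The suggested client actions are assumptions about a reasonable SDK policy, not documented behavior.

```go
package main

import (
	"fmt"
	"net/http"
)

// ConflictCode is a hypothetical machine-readable code for approve responses.
type ConflictCode string

const (
	CodeNone                  ConflictCode = ""
	CodePolicySnapshotChanged ConflictCode = "policy_snapshot_changed"
	CodeJobRequestChanged     ConflictCode = "job_request_changed"
	CodeUnknown               ConflictCode = "unknown_conflict"
)

// codeFor derives a code from the status and message the gateway returns now.
func codeFor(status int, message string) ConflictCode {
	if status != http.StatusConflict {
		return CodeNone
	}
	switch message {
	case "policy snapshot changed; re-evaluate before approving":
		return CodePolicySnapshotChanged
	case "job request changed; approval rejected":
		return CodeJobRequestChanged
	}
	return CodeUnknown
}

// nextStep suggests a client action per code (an assumed policy).
func nextStep(code ConflictCode) string {
	switch code {
	case CodePolicySnapshotChanged:
		return "re-evaluate"
	case CodeJobRequestChanged:
		return "resubmit"
	case CodeNone:
		return "continue"
	default:
		return "escalate"
	}
}

// driftStats counts approve attempts and drift conflicts for a simple SLI.
type driftStats struct {
	attempts  int
	conflicts int
}

// record counts one approve attempt and notes whether it hit a drift guard.
func (d *driftStats) record(code ConflictCode) {
	d.attempts++
	if code == CodePolicySnapshotChanged || code == CodeJobRequestChanged {
		d.conflicts++
	}
}

// rate returns the fraction of attempts rejected for drift.
func (d *driftStats) rate() float64 {
	if d.attempts == 0 {
		return 0
	}
	return float64(d.conflicts) / float64(d.attempts)
}

func main() {
	var stats driftStats
	responses := []struct {
		status int
		msg    string
	}{
		{http.StatusOK, ""},
		{http.StatusConflict, "policy snapshot changed; re-evaluate before approving"},
		{http.StatusConflict, "job request changed; approval rejected"},
	}
	for _, r := range responses {
		code := codeFor(r.status, r.msg)
		stats.record(code)
		fmt.Printf("%d -> %q -> %s\n", r.status, code, nextStep(code))
	}
	fmt.Printf("drift-conflict rate: %.2f\n", stats.rate())
}
```

Keeping the code-to-action mapping in one function means an SDK can change its retry policy without touching the parsing logic, and the same codes can label the SLI counter.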
Continue with AI Agent Policy Decision Cache Invalidation and AI Agent Approval Lock Contention.