## The production problem
Approval systems live at the boundary of humans and unreliable networks.
Humans double-submit. Browsers retry. Proxies retry. Your endpoint gets called again after the decision already happened.
If your handler only returns conflicts, clients cannot tell duplicate success from real failure.
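When the handler does replay a success, a client can branch on both the HTTP status and the body's `status` field to tell the three cases apart. A minimal sketch of that client-side classification; the function and outcome names are illustrative, not part of the Cordum codebase:

```go
package main

import "fmt"

// Outcome is what a retry-aware client derives from an approve-endpoint response.
type Outcome string

const (
	OutcomeApproved Outcome = "approved" // first successful approval
	OutcomeReplay   Outcome = "replay"   // duplicate of an earlier success
	OutcomeConflict Outcome = "conflict" // genuine state mismatch
	OutcomeError    Outcome = "error"    // transport or server failure
)

// classifyApproveResponse maps (HTTP status, body "status" field) to an outcome.
// With a conflict-only handler, the replay case collapses into conflict and
// retries become ambiguous for the client.
func classifyApproveResponse(httpStatus int, bodyStatus string) Outcome {
	switch {
	case httpStatus == 200 && bodyStatus == "already_approved":
		return OutcomeReplay
	case httpStatus == 200:
		return OutcomeApproved
	case httpStatus == 409:
		return OutcomeConflict
	default:
		return OutcomeError
	}
}

func main() {
	fmt.Println(classifyApproveResponse(200, "approved"))         // approved
	fmt.Println(classifyApproveResponse(200, "already_approved")) // replay
	fmt.Println(classifyApproveResponse(409, ""))                 // conflict
}
```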
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders' Library: Making retries safe with idempotent APIs | Client request identity and server-side dedup contracts for safe retries. | No human approval endpoint pattern where retries must distinguish already-approved vs still-pending states. |
| Stripe API docs: Idempotent requests | Idempotency key behavior for API retries and deterministic response replay expectations. | No treatment of approval queues where actor state and workflow state can diverge during retries. |
| PayPal docs: Idempotency | Retry-safe request replay and duplicate request handling with idempotency headers. | No dual-endpoint approve/reject flow where conflict and replay status need separate semantics. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Approve replay path | If job state moved beyond APPROVAL and request labels contain `approval_granted=true`, handler returns `200 already_approved`. | Safe client retries do not create approval duplicates or noisy conflict errors. |
| Reject replay path | If state is DENIED, reject handler returns `200 already_rejected`. | Retrying a successful reject remains deterministic for operators and bots. |
| Message dedup key | Approve path sets `req.Labels[cordum.bus_msg_id] = approval:<job_id>` before republishing. | NATS dedup can collapse repeated publish attempts for the same approved job. |
| Conflict scope | If state and labels do not match replay conditions, handler still returns conflict (`job not awaiting approval`). | Idempotency does not mask true state mismatches. |
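Concretely, a duplicate approve call against an already-approved job gets a body shaped like this (field names taken from the handler's response map; the values are illustrative):

```json
{
  "job_id": "job-7f3a",
  "status": "already_approved",
  "approved_by": "alice@example.com",
  "approved_at": "2025-01-07T14:02:11Z"
}
```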
## Idempotency paths in code

### Approve replay branch
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
if state != model.JobStateApproval {
	if state == model.JobStatePending || state == model.JobStateSucceeded ||
		state == model.JobStateScheduled || state == model.JobStateDispatched ||
		state == model.JobStateRunning {
		req, _ := s.jobStore.GetJobRequest(ctx, jobID)
		if req != nil && req.Labels != nil && req.Labels["approval_granted"] == "true" {
			rec, _ := s.jobStore.GetApprovalRecord(ctx, jobID)
			result = handlerResult{http.StatusOK, map[string]any{
				"job_id":      jobID,
				"status":      "already_approved",
				"approved_by": rec.ApprovedBy,
				"approved_at": rec.ApprovedAt,
			}}
			return nil
		}
	}
	result = handlerResult{http.StatusConflict, "job not awaiting approval"}
	return nil
}
```

### Reject replay branch
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
if state != model.JobStateApproval {
	if state == model.JobStateDenied {
		rec, _ := s.jobStore.GetApprovalRecord(ctx, jobID)
		result = handlerResult{http.StatusOK, map[string]any{
			"job_id":      jobID,
			"status":      "already_rejected",
			"rejected_by": rec.ApprovedBy,
			"rejected_at": rec.ApprovedAt,
		}}
		return nil
	}
	result = handlerResult{http.StatusConflict, "job not awaiting approval"}
	return nil
}
```

### Dedup key for publish retries
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
// Stable idempotency key per job so NATS dedup works on retries.
req.Labels[bus.LabelBusMsgID] = "approval:" + jobID
if err := s.jobStore.SetJobRequest(ctx, req); err != nil {
	if strings.Contains(err.Error(), "transaction failed") {
		result = handlerResult{http.StatusConflict, "concurrent approval conflict; retry"}
		return nil
	}
}
```

### Existing idempotency tests
```go
// core/controlplane/gateway/handlers_approvals_test.go (excerpt)
func TestApproveJobIdempotent(t *testing.T) {
	// first approval returns 200
	// second approval returns 200 with status=already_approved
}

func TestRejectJobIdempotent(t *testing.T) {
	// first rejection returns 200
	// second rejection returns 200 with status=already_rejected
}
```

## Validation runbook
Validate this on staging before changing approval-client retry behavior.
```sh
# 1) Create approval-required job_id J
# 2) POST /api/v1/approvals/J/approve (expect 200)
# 3) Retry same approve call 5 times in parallel
# 4) Verify all retries return 200 and include status=already_approved
# 5) Repeat flow with reject path (expect already_rejected)
# 6) Inspect bus dedup label in stored request: cordum.bus_msg_id=approval:J
```
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Always return 409 after first approval | Simple state machine exposure to clients. | Retries become noisy and require extra client-side interpretation. |
| Idempotent replay responses (current) | Deterministic outcomes for safe retries and better operator UX. | Requires stricter replay condition checks to avoid false positives. |
| Replay everything without state checks | Lowest client complexity. | Can hide real conflicts and weaken audit confidence. |
- Replay logic depends on specific state and label conditions, so custom integrations must preserve label integrity.
- Replay semantics do not replace conflict handling for genuine concurrent state transitions.
- I found idempotency tests for the success replay paths, but not exhaustive tests for every conflict branch under high concurrency.
## Next steps

Implement these next:
1. Document replay contracts explicitly in API docs (`already_approved`, `already_rejected`).
2. Add machine-readable error/replay codes for SDK-level retry routing.
3. Add concurrency tests that mix duplicate retries with true state conflicts.
4. Track replay rate vs conflict rate per endpoint to catch client retry regressions.
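The machine-readable codes in step 2 could start from a small shared enum that SDKs switch on instead of parsing message strings. The codes and routing below are a hypothetical contract, not something that exists in Cordum today:

```go
package main

import "fmt"

// ReplayCode is a machine-readable code an SDK routes on without parsing
// human-oriented message strings. (Hypothetical contract.)
type ReplayCode string

const (
	CodeAlreadyApproved ReplayCode = "already_approved"
	CodeAlreadyRejected ReplayCode = "already_rejected"
	CodeNotAwaiting     ReplayCode = "not_awaiting_approval"
	CodeConcurrentRetry ReplayCode = "concurrent_conflict_retry"
)

// Retryable reports whether an SDK should transparently retry the request.
func Retryable(c ReplayCode) bool {
	switch c {
	case CodeConcurrentRetry:
		return true // transient lock/transaction conflict; safe to retry
	case CodeAlreadyApproved, CodeAlreadyRejected:
		return false // terminal success replay; surface as success
	default:
		return false // genuine conflict; surface to the caller
	}
}

func main() {
	fmt.Println(Retryable(CodeConcurrentRetry))  // true
	fmt.Println(Retryable(CodeAlreadyApproved))  // false
}
```

Keeping the replay codes distinct from conflict codes is what lets step 4's replay-rate vs conflict-rate metrics stay honest.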
Continue with AI Agent NATS Msg-Id Strategy and AI Agent Approval Lock Contention.