## The production problem
Approval endpoints look simple until two actors click approve and reject at nearly the same time.
You need a lock to prevent split-brain state transitions. You also need an HTTP contract that tells clients what happened and how to retry.
If that contract is vague, operators see conflict noise, bots retry too aggressively, and incident timelines get harder to read.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| RFC 9110 (`409 Conflict`) | `409` for requests that conflict with current resource state and may be retried after conflict resolution. | No concrete mapping for distributed approval locks with fixed lock-acquire budgets in control planes. |
| RFC 4918 (`423 Locked`) | `423` as a lock-specific signal when the source or destination resource is locked. | No guidance for modern JSON APIs deciding between `409` and `423` for non-WebDAV lock contention. |
| Google Cloud IAM retry strategy | Truncated exponential backoff with jitter and explicit retry classes for contention and service failures. | No API contract pattern for approval endpoints that return conflict errors without `Retry-After` metadata. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Distributed lock key | Approval handlers use `cordum:scheduler:job:<job_id>` to serialize approve/reject transitions. | Concurrent mutations on one job are prevented, avoiding state races in approval flow. |
| Lock TTL | `approvalLockTTL` is `10 * time.Second`. | If release fails, lock eventually expires, but clients can still see temporary contention windows. |
| Acquire wait budget | Handler retries lock acquisition for up to 2 seconds with a 25ms sleep between attempts. | Busy lock path triggers quickly under concurrent clicks or bot retries. |
| Busy-lock response | Lock-busy error maps to HTTP `409` (`approval in progress; retry`, `rejection in progress; retry`). | Clients get a conflict signal but no retry interval hint in headers. |
## Lock and status map in code
### Approval lock acquisition window

```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
const approvalLockTTL = 10 * time.Second

func (s *server) withApprovalLock(ctx context.Context, jobID string, fn func(ctx context.Context) error) error {
	key := "cordum:scheduler:job:" + jobID
	lockStart := time.Now()
	deadline := lockStart.Add(2 * time.Second) // total acquire budget
	for {
		lockCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		token, err := s.jobStore.TryAcquireLock(lockCtx, key, approvalLockTTL)
		cancel()
		if err != nil {
			return fmt.Errorf("lock acquire: %w", err)
		}
		if token != "" {
			break // lock held; release is elided from this excerpt, TTL bounds leakage
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("approval lock busy")
		}
		time.Sleep(25 * time.Millisecond) // fixed pause between acquire attempts
	}
	return fn(ctx)
}
```

### Busy-lock mapping to HTTP 409
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)

// Approve handler: lock-busy maps to 409 with a retry hint in the body only.
if lockErr != nil {
	if strings.Contains(lockErr.Error(), "lock busy") {
		writeErrorJSON(w, http.StatusConflict, "approval in progress; retry") // 409
		return
	}
	writeInternalError(w, r, "approval lock", lockErr)
	return
}

// Reject handler: same mapping with a rejection-specific message.
if lockErr != nil {
	if strings.Contains(lockErr.Error(), "lock busy") {
		writeErrorJSON(w, http.StatusConflict, "rejection in progress; retry") // 409
		return
	}
	writeInternalError(w, r, "rejection lock", lockErr)
	return
}
```

### Store lock primitive is immediate `SetNX`
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) TryAcquireLock(ctx context.Context, key string, ttl time.Duration) (string, error) {
	if ttl <= 0 {
		ttl = 30 * time.Second // defensive default when callers pass no TTL
	}
	token := uuid.NewString()
	acquired, err := s.client.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return "", err
	}
	if !acquired {
		return "", nil // lock held elsewhere; empty token signals "busy"
	}
	return token, nil
}
```

## Validation runbook
Run this in staging before changing status codes or retry defaults.
1. Pick one approval `job_id` currently in APPROVAL state.
2. Fire 20 concurrent approve requests for the same `job_id`.
3. Record the HTTP histogram (200/409/5xx) and p95 response time.
4. Repeat with reject requests and with mixed approve+reject bursts.
5. Verify the eventual state is deterministic (PENDING or DENIED, never both).
6. Tune client retries: jittered backoff + max attempts + deadline.
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Keep `409` for lock busy (current) | Aligned with generic state-conflict semantics and already supported by most clients. | Does not explicitly communicate lock ownership contention as clearly as `423`. |
| Move to `423 Locked` for lock busy | Makes lock contention explicit at protocol level for approval endpoints. | Some client SDKs and monitoring pipelines treat `423` as uncommon and need updates. |
| Add `Retry-After` header on busy responses | Provides concrete pacing hint and reduces guesswork in retry loops. | Static values can be wrong under fluctuating load; dynamic estimates add complexity. |
- A status-code change without client retry tuning usually shifts noise rather than removing it.
- Returning lock-specific codes helps operators, but client libraries may need explicit allowlists.
- No `Retry-After` means each client team may invent different retry pacing unless you document one policy.
## Next step
Implement this in one sprint:
1. Decide the contract: keep `409` or adopt `423` for approval lock contention.
2. Add explicit tests for the lock-busy HTTP mapping on both approve and reject handlers.
3. Publish one retry policy: jittered backoff, max attempts, and request deadline.
4. Consider `Retry-After` if clients cannot be upgraded quickly.
Continue with Approval Workflows for Autonomous AI Agents and AI Agent Workflow Admission 429 vs 503.