## The production problem
Approval endpoints look simple until two actors click approve and reject at nearly the same time.
You need a lock to prevent split-brain state transitions. You also need an HTTP contract that tells clients what happened and how to retry.
If that contract is vague, operators see conflict noise, bots retry too aggressively, and incident timelines get harder to read.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| RFC 9110 (`409 Conflict`) | `409` for requests that conflict with current resource state and may be retried after conflict resolution. | No concrete mapping for distributed approval locks with fixed lock-acquire budgets in control planes. |
| RFC 4918 (`423 Locked`) | `423` as a lock-specific signal when the source or destination resource is locked. | No guidance for modern JSON APIs deciding between `409` and `423` for non-WebDAV lock contention. |
| Google Cloud IAM retry strategy | Truncated exponential backoff with jitter and explicit retry classes for contention and service failures. | No API contract pattern for approval endpoints that return conflict errors without `Retry-After` metadata. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Distributed lock key | Approval handlers use `cordum:scheduler:job:<job_id>` to serialize approve/reject transitions. | Concurrent mutations on one job are prevented, avoiding state races in approval flow. |
| Lock TTL | `approvalLockTTL` is `10 * time.Second`. | If release fails, lock eventually expires, but clients can still see temporary contention windows. |
| Acquire wait budget | Handler retries lock acquisition for up to 2 seconds with a 25ms sleep between attempts. | Busy lock path triggers quickly under concurrent clicks or bot retries. |
| Busy-lock response | Lock-busy error maps to HTTP `409` (`approval in progress; retry`, `rejection in progress; retry`). | Clients get a conflict signal but no retry interval hint in headers. |
## Lock and status map in code
### Approval lock acquisition window

```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
const approvalLockTTL = 10 * time.Second

func (s *server) withApprovalLock(ctx context.Context, jobID string, fn func(ctx context.Context) error) error {
	key := "cordum:scheduler:job:" + jobID
	lockStart := time.Now()
	deadline := lockStart.Add(2 * time.Second) // total acquire budget
	for {
		lockCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		token, err := s.jobStore.TryAcquireLock(lockCtx, key, approvalLockTTL)
		cancel()
		if err != nil {
			return fmt.Errorf("lock acquire: %w", err)
		}
		if token != "" {
			break // lock held; release is elided from this excerpt, TTL bounds leakage
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("approval lock busy")
		}
		time.Sleep(25 * time.Millisecond) // fixed pause between acquire attempts
	}
	return fn(ctx)
}
```

### Busy-lock mapping to HTTP 409
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)

// Approve handler: lock-busy maps to 409 with a retry hint in the body only.
if lockErr != nil {
	if strings.Contains(lockErr.Error(), "lock busy") {
		writeErrorJSON(w, http.StatusConflict, "approval in progress; retry") // 409
		return
	}
	writeInternalError(w, r, "approval lock", lockErr)
	return
}

// Reject handler: same mapping with a rejection-specific message.
if lockErr != nil {
	if strings.Contains(lockErr.Error(), "lock busy") {
		writeErrorJSON(w, http.StatusConflict, "rejection in progress; retry") // 409
		return
	}
	writeInternalError(w, r, "rejection lock", lockErr)
	return
}
```

### Store lock primitive is immediate `SetNX`
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) TryAcquireLock(ctx context.Context, key string, ttl time.Duration) (string, error) {
	if ttl <= 0 {
		ttl = 30 * time.Second // defensive default when callers pass no TTL
	}
	token := uuid.NewString()
	acquired, err := s.client.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return "", err
	}
	if !acquired {
		return "", nil // lock held elsewhere; empty token signals "busy"
	}
	return token, nil
}
```

## Validation runbook
Run this in staging before changing status codes or retry defaults.
1. Pick one approval `job_id` currently in APPROVAL state.
2. Fire 20 concurrent approve requests for the same `job_id`.
3. Record the HTTP histogram (200/409/5xx) and p95 response time.
4. Repeat with reject requests and with mixed approve+reject bursts.
5. Verify the eventual state is deterministic (PENDING or DENIED, never both).
6. Tune client retries: jittered backoff + max attempts + deadline.
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Keep `409` for lock busy (current) | Aligned with generic state-conflict semantics and already supported by most clients. | Does not explicitly communicate lock ownership contention as clearly as `423`. |
| Move to `423 Locked` for lock busy | Makes lock contention explicit at protocol level for approval endpoints. | Some client SDKs and monitoring pipelines treat `423` as uncommon and need updates. |
| Add `Retry-After` header on busy responses | Provides concrete pacing hint and reduces guesswork in retry loops. | Static values can be wrong under fluctuating load; dynamic estimates add complexity. |
- A status-code change without client retry tuning usually shifts noise rather than removing it.
- Returning lock-specific codes helps operators, but client libraries may need explicit allowlists.
- No `Retry-After` means each client team may invent different retry pacing unless you document one policy.
## Next step
Implement this in one sprint:
1. Decide the contract: keep `409` or adopt `423` for approval lock contention.
2. Add explicit tests for the lock-busy HTTP mapping on both approve and reject handlers.
3. Publish one retry policy: jittered backoff, max attempts, and request deadline.
4. Consider `Retry-After` if clients cannot be upgraded quickly.
Continue with Approval Workflows for Autonomous AI Agents and AI Agent Workflow Admission 429 vs 503.