Deep Dive

AI Agent Approval Lock Contention

If approve and reject race on one job, status semantics decide whether clients recover cleanly.

Deep Dive · 10 min read · Mar 2026
TL;DR
- Cordum approval paths take a per-job distributed lock with a 10s TTL and a 2s acquire budget.
- If the lock is still busy after that 2s window, the approve and reject handlers currently return HTTP 409.
- 409 is valid for state conflict, but 423 can be a clearer lock-specific signal for some clients.
- Retry behavior matters more than the code choice: jittered backoff and bounded deadlines reduce herd retries.
Failure mode

Concurrent approve/reject calls can pile up on the same job-level lock and generate avoidable retry storms.

Current behavior

Busy approval lock maps to `409` with a retry message and no `Retry-After` hint.

Operational payoff

Explicit lock-aware retry policy reduces duplicate operator clicks and noisy conflict alerts.

Scope

This guide targets approval endpoint contention behavior (`/api/v1/approvals/{job_id}/approve|reject`) and client retry contracts.

The production problem

Approval endpoints look simple until two actors click approve and reject at nearly the same time.

You need a lock to prevent split-brain state transitions. You also need an HTTP contract that tells clients what happened and how to retry.

If that contract is vague, operators see conflict noise, bots retry too aggressively, and incident timelines get harder to read.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| RFC 9110 (`409 Conflict`) | `409` for requests that conflict with current resource state and may be retried after conflict resolution. | No concrete mapping for distributed approval locks with fixed lock-acquire budgets in control planes. |
| RFC 4918 (`423 Locked`) | `423` as a lock-specific signal when the source or destination resource is locked. | No guidance for modern JSON APIs deciding between `409` and `423` for non-WebDAV lock contention. |
| Google Cloud IAM retry strategy | Truncated exponential backoff with jitter and explicit retry classes for contention and service failures. | No API contract pattern for approval endpoints that return conflict errors without `Retry-After` metadata. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Distributed lock key | Approval handlers use `cordum:scheduler:job:<job_id>` to serialize approve/reject transitions. | Concurrent mutations on one job are prevented, avoiding state races in the approval flow. |
| Lock TTL | `approvalLockTTL` is `10 * time.Second`. | If release fails, the lock eventually expires, but clients can still see temporary contention windows. |
| Acquire wait budget | The handler retries lock acquisition for up to 2 seconds with a 25ms sleep between attempts. | The busy-lock path triggers quickly under concurrent clicks or bot retries. |
| Busy-lock response | A lock-busy error maps to HTTP `409` (`approval in progress; retry`, `rejection in progress; retry`). | Clients get a conflict signal but no retry interval hint in headers. |

Lock and status map in code

Approval lock acquisition window

core/controlplane/gateway/handlers_approvals.go
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
const approvalLockTTL = 10 * time.Second

func (s *server) withApprovalLock(ctx context.Context, jobID string, fn func(ctx context.Context) error) error {
  key := "cordum:scheduler:job:" + jobID
  lockStart := time.Now()
  deadline := lockStart.Add(2 * time.Second) // total acquire budget

  for {
    lockCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
    token, err := s.jobStore.TryAcquireLock(lockCtx, key, approvalLockTTL)
    cancel()
    if err != nil {
      return fmt.Errorf("lock acquire: %w", err)
    }
    if token != "" {
      break // lock acquired; the release path is not shown in this excerpt
    }
    if time.Now().After(deadline) {
      return fmt.Errorf("approval lock busy") // handlers map this to 409
    }
    time.Sleep(25 * time.Millisecond)
  }
  return fn(ctx)
}
```

Busy-lock mapping to HTTP 409

core/controlplane/gateway/handlers_approvals.go
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)

// Approve handler: busy lock maps to 409 Conflict.
if lockErr != nil {
  if strings.Contains(lockErr.Error(), "lock busy") {
    writeErrorJSON(w, http.StatusConflict, "approval in progress; retry") // 409
    return
  }
  writeInternalError(w, r, "approval lock", lockErr)
  return
}

// Reject handler: the same mapping with a different message.
if lockErr != nil {
  if strings.Contains(lockErr.Error(), "lock busy") {
    writeErrorJSON(w, http.StatusConflict, "rejection in progress; retry") // 409
    return
  }
  writeInternalError(w, r, "rejection lock", lockErr)
  return
}
```

Store lock primitive is immediate `SetNX`

core/infra/store/job_store.go
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) TryAcquireLock(ctx context.Context, key string, ttl time.Duration) (string, error) {
  if ttl <= 0 {
    ttl = 30 * time.Second // defensive default TTL
  }
  token := uuid.NewString()
  acquired, err := s.client.SetNX(ctx, key, token, ttl).Result()
  if err != nil {
    return "", err
  }
  if !acquired {
    return "", nil // busy: no error, empty token signals contention
  }
  return token, nil
}
```
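The excerpt shows acquisition only. A `SetNX` lock with a random token is usually paired with a token-checked release, so a holder whose lock already expired and was reacquired cannot delete the new owner's lock. Here is a minimal in-memory sketch of that semantic; the names are illustrative, not Cordum's store API.

```go
package main

import "sync"

// memLock is an in-memory stand-in for the Redis SetNX lock, used only to
// illustrate token-checked release. In Redis the check-and-delete must be
// one atomic step, e.g. a Lua script:
//   if redis.call('GET', KEYS[1]) == ARGV[1] then return redis.call('DEL', KEYS[1]) end
type memLock struct {
  mu   sync.Mutex
  vals map[string]string
}

func newMemLock() *memLock { return &memLock{vals: map[string]string{}} }

// TryAcquire mirrors SetNX: it succeeds only if no one holds the key.
func (m *memLock) TryAcquire(key, token string) bool {
  m.mu.Lock()
  defer m.mu.Unlock()
  if _, held := m.vals[key]; held {
    return false
  }
  m.vals[key] = token
  return true
}

// Release deletes the key only if the caller still owns it, so a stale
// holder cannot release the new owner's lock.
func (m *memLock) Release(key, token string) bool {
  m.mu.Lock()
  defer m.mu.Unlock()
  if m.vals[key] != token {
    return false
  }
  delete(m.vals, key)
  return true
}
```

Without the token check, the 10s TTL becomes a correctness hazard instead of a safety net: a slow handler could release a lock that a concurrent approve already re-won.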

Validation runbook

Run this in staging before changing status codes or retry defaults.

runbook.sh
```bash
# 1) Pick one approval job_id currently in APPROVAL state
# 2) Fire 20 concurrent approve requests for the same job_id
# 3) Record the HTTP histogram (200/409/5xx) and p95 response time
# 4) Repeat with reject requests and mixed approve+reject bursts
# 5) Verify the eventual state is deterministic (PENDING or DENIED, never both)
# 6) Tune client retries: jittered backoff + max attempts + deadline
```
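Steps 2 and 3 can be scripted with a small burst driver. This is a sketch under stated assumptions: `burst` and `postStatus` are illustrative helpers, not existing Cordum tooling, and the empty POST body is a placeholder.

```go
package main

import (
  "net/http"
  "sync"
)

// burst fires n concurrent calls to do and returns a status-code histogram,
// mirroring runbook steps 2-3; transport errors are tallied under code 0.
func burst(do func() (int, error), n int) map[int]int {
  var (
    mu   sync.Mutex
    wg   sync.WaitGroup
    hist = map[int]int{}
  )
  for i := 0; i < n; i++ {
    wg.Add(1)
    go func() {
      defer wg.Done()
      code, err := do()
      if err != nil {
        code = 0
      }
      mu.Lock()
      hist[code]++
      mu.Unlock()
    }()
  }
  wg.Wait()
  return hist
}

// postStatus adapts one HTTP POST to the shape burst expects; the approval
// URL shape comes from the Scope section above.
func postStatus(url string) func() (int, error) {
  return func() (int, error) {
    resp, err := http.Post(url, "application/json", nil)
    if err != nil {
      return 0, err
    }
    resp.Body.Close()
    return resp.StatusCode, nil
  }
}
```

Usage against staging would look like `burst(postStatus(base+"/api/v1/approvals/"+jobID+"/approve"), 20)`; the resulting histogram is exactly the 200/409/5xx breakdown the runbook asks you to record.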

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Keep `409` for lock busy (current) | Aligned with generic state-conflict semantics and already supported by most clients. | Does not communicate lock contention as explicitly as `423`. |
| Move to `423 Locked` for lock busy | Makes lock contention explicit at the protocol level for approval endpoints. | Some client SDKs and monitoring pipelines treat `423` as uncommon and need updates. |
| Add `Retry-After` header on busy responses | Provides a concrete pacing hint and reduces guesswork in retry loops. | Static values can be wrong under fluctuating load; dynamic estimates add complexity. |
- A status-code change without client retry tuning usually shifts noise rather than removing it.
- Returning lock-specific codes helps operators, but client libraries may need explicit allowlists.
- No `Retry-After` means each client team may invent its own retry pacing unless you document one policy.

Next step

Implement this in one sprint:

1. Decide the contract: keep `409` or adopt `423` for approval lock contention.
2. Add explicit tests for the lock-busy HTTP mapping on both approve and reject handlers.
3. Publish one retry policy: jittered backoff, max attempts, and a request deadline.
4. Consider `Retry-After` if clients cannot be upgraded quickly.

Continue with Approval Workflows for Autonomous AI Agents and AI Agent Workflow Admission 429 vs 503.

Concurrency is an API contract

A lock prevents races in storage. A precise status code prevents races in client behavior.