Deep Dive

AI Agent Workflow Admission Lock

A lock that protects concurrency can still hurt throughput if every retry arrives at the same millisecond.

10 min read · Mar 2026
TL;DR
- Workflow run admission is serialized per org with a Redis lock when max concurrency is enabled.
- Cordum currently retries lock acquisition at a fixed 10ms interval, with a 2s wait budget and a 10s lock TTL.
- Fixed retry intervals are simple but can synchronize high-cardinality callers under load.
- Jitter guidance from AWS, Google Cloud, and Redis does not map 1:1 to run admission gates, but it highlights the same herd-risk pattern.
Failure mode

Burst traffic from one tenant can pile up on a hot admission key and retry in sync.

Current guardrail

Cordum bounds lock waiting to 2 seconds and fails fast with 503 if the gate backend is unavailable.

Operational payoff

With proper retry policy, callers avoid duplicate run storms while respecting tenant run limits.

Scope

This guide focuses on run-admission lock behavior in the gateway, not every lock in the Cordum runtime.

The production problem

Max-concurrency controls are useless if 100 concurrent start requests from the same tenant race the gate and all slip through.

Admission locking fixes that race. It also introduces a new hotspot: one lock key per org.

With fixed 10ms retries, parallel clients can wake together and collide again. The lock is safe, but throughput looks jagged and noisy.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Redis Docs: Distributed Locks | Random lock value ownership, compare-delete release script, and random retry delay to avoid split-brain pressure. | No workflow-run admission context with org-scoped concurrency caps and idempotency reservations. |
| AWS Builders’ Library: Timeouts, retries and backoff with jitter | Why correlated retries create overload loops and how jitter spreads retry traffic. | No lock-key level guidance for one-tenant admission hotspots inside an AI control plane. |
| Google Cloud IAM: Retry strategy | Truncated exponential backoff with jitter and concurrency-aware retries for `409 ABORTED` flows. | No per-tenant lock budget design where requests race for run slots under strict admission windows. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Admission key scope | Lock key format is `cordum:wf:run:admission:<orgID>`. | One tenant cannot block run admission for other tenants. |
| Retry and timeout constants | `workflowAdmissionLockTTL=10s`, `workflowAdmissionLockRetryDelay=10ms`, `workflowAdmissionLockMaxWait=2s`. | Admission is bounded, but the fixed retry cadence can align under heavy parallel callers. |
| Lock semantics | `TryAcquireLock` uses `SetNX` with a UUID token and TTL; `ReleaseLock` uses token-matching Lua compare-delete. | Prevents accidental release by a non-owner code path. |
| Failure surface | If lock acquire fails, the gateway returns `503 workflow concurrency gate unavailable`. | Client retry strategy becomes part of availability behavior. |

Lock lifecycle in code

Admission lock loop and constants

core/controlplane/gateway/handlers_workflows.go
```go
// core/controlplane/gateway/handlers_workflows.go (excerpt)
const (
	workflowAdmissionLockTTL        = 10 * time.Second
	workflowAdmissionLockRetryDelay = 10 * time.Millisecond
	workflowAdmissionLockMaxWait    = 2 * time.Second
)

func (s *server) acquireWorkflowAdmissionLock(ctx context.Context, orgID string) (func(), error) {
	waitCtx, cancel := context.WithTimeout(ctx, workflowAdmissionLockMaxWait)
	defer cancel()

	lockKey := "cordum:wf:run:admission:" + strings.TrimSpace(orgID)
	for {
		token, err := s.jobStore.TryAcquireLock(waitCtx, lockKey, workflowAdmissionLockTTL)
		if err != nil {
			return nil, err
		}
		if token != "" {
			return func() {
				releaseCtx, releaseCancel := context.WithTimeout(context.Background(), time.Second)
				defer releaseCancel()
				_ = s.jobStore.ReleaseLock(releaseCtx, lockKey, token)
			}, nil
		}
		timer := time.NewTimer(workflowAdmissionLockRetryDelay) // fixed 10ms
		select {
		case <-waitCtx.Done():
			timer.Stop()
			return nil, waitCtx.Err()
		case <-timer.C:
		}
	}
}
```
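One mitigation is to replace the fixed delay with full jitter, so concurrent waiters spread out instead of waking on the same tick. This is a hypothetical sketch, not Cordum code; `jitteredDelay` and its range are illustrative:

```go
// Hypothetical replacement for the fixed retry cadence: full jitter in
// [base, 2*base), so parallel waiters de-synchronize across retries.
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const baseRetryDelay = 10 * time.Millisecond

// jitteredDelay returns a random duration in [base, 2*base).
func jitteredDelay(base time.Duration) time.Duration {
	return base + time.Duration(rand.Int63n(int64(base)))
}

func main() {
	// Three sample delays; each run differs, which is the point.
	for i := 0; i < 3; i++ {
		fmt.Println(jitteredDelay(baseRetryDelay))
	}
}
```

Swapping this into the loop above only changes the `time.NewTimer` argument; the wait budget and TTL semantics stay intact.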

Redis token lock and safe release script

core/infra/store/job_store.go
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) TryAcquireLock(ctx context.Context, key string, ttl time.Duration) (string, error) {
	token := uuid.NewString()
	acquired, err := s.client.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return "", err
	}
	if !acquired {
		return "", nil
	}
	return token, nil
}

var releaseLockScript = redis.NewScript(`
if redis.call('get', KEYS[1]) == ARGV[1] then
  return redis.call('del', KEYS[1])
end
return 0
`)
```
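The compare-delete script matters because a lock can expire and be re-acquired by another caller before the first owner releases it; an unconditional delete would steal the new owner's lock. The in-memory sketch below is illustrative only (not the Redis path) and shows the semantics the Lua script enforces:

```go
// In-memory analogue of the token compare-and-delete in releaseLockScript.
// Illustration only; the real gate goes through Redis SetNX plus the Lua script.
package main

import (
	"fmt"
	"sync"
)

type tokenLock struct {
	mu     sync.Mutex
	owners map[string]string // key -> owner token
}

func newTokenLock() *tokenLock {
	return &tokenLock{owners: make(map[string]string)}
}

// tryAcquire mirrors SetNX: succeeds only if the key is unheld.
func (l *tokenLock) tryAcquire(key, token string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, held := l.owners[key]; held {
		return false
	}
	l.owners[key] = token
	return true
}

// release mirrors the Lua script: delete only if the token still matches.
func (l *tokenLock) release(key, token string) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if l.owners[key] != token {
		return false // not the owner anymore; leave the lock alone
	}
	delete(l.owners, key)
	return true
}

func main() {
	l := newTokenLock()
	l.tryAcquire("cordum:wf:run:admission:org-1", "token-A")
	fmt.Println(l.release("cordum:wf:run:admission:org-1", "token-B")) // → false: wrong token
	fmt.Println(l.release("cordum:wf:run:admission:org-1", "token-A")) // → true
}
```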

Concurrency test with concrete numbers

core/controlplane/gateway/workflow_runs_test.go
```go
// core/controlplane/gateway/workflow_runs_test.go (excerpt)
func TestHandleStartRunConcurrentRequestsRespectMaxConcurrentLimit(t *testing.T) {
	// max_concurrent_runs = 2
	// workers = 10 concurrent start-run calls
	// expected: exactly 2 x 200 OK, 8 x 429 Too Many Requests
}
```
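The full test body is elided above; the invariant it asserts can be sketched standalone with a channel semaphore standing in for the admission gate. Names here are illustrative, not Cordum code:

```go
// Standalone sketch of the 2-of-10 invariant: a buffered channel models the
// tenant's run slots, and nobody releases, so exactly maxRuns calls succeed.
package main

import (
	"fmt"
	"sync"
	"sync/atomic"
)

// simulateAdmission races `workers` goroutines against `maxRuns` slots.
func simulateAdmission(maxRuns, workers int) (admitted, rejected int64) {
	slots := make(chan struct{}, maxRuns)
	var ok, no atomic.Int64
	var wg sync.WaitGroup
	for i := 0; i < workers; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			select {
			case slots <- struct{}{}: // free run slot: 200 OK
				ok.Add(1)
			default: // at the tenant cap: 429 Too Many Requests
				no.Add(1)
			}
		}()
	}
	wg.Wait()
	return ok.Load(), no.Load()
}

func main() {
	admitted, rejected := simulateAdmission(2, 10)
	fmt.Printf("admitted=%d rejected=%d\n", admitted, rejected)
	// expected: admitted=2 rejected=8
}
```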

Validation runbook

Measure contention behavior before tuning lock constants. Gut feeling is a poor lock strategy.

runbook.sh
```bash
# 1) Configure tenant max_concurrent_runs=2
# 2) Fire 100 concurrent run-start requests for the same org
# 3) Measure P50/P95 admission latency and 429/503 ratios
# 4) Repeat with client-side fixed-delay retries
# 5) Repeat with client-side jittered exponential retries
# 6) Compare lock-key request burst shape and success spread
```

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Fixed retry delay (current 10ms) | Predictable and easy to reason about in tests. | High chance of synchronized retries under contention. |
| Jittered retry delay | Reduces thundering herd behavior on a hot lock key. | Harder to reproduce exact contention timing in local tests. |
| Longer max wait budget | More callers eventually acquire lock during transient spikes. | Higher request latency and more blocked goroutines under sustained overload. |
- A 2-second lock wait budget can reject bursts that might have succeeded with a slightly wider window.
- A 10-second TTL is a compromise. Too short risks premature expiry; too long increases contention recovery lag.
- Jitter helps herd risk but can reduce deterministic reproducibility in tests and incident replay.
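On the client side, the AWS-recommended shape for the herd risk above is truncated exponential backoff with full jitter when the gate answers 503. A minimal sketch, with illustrative base and cap values rather than Cordum defaults:

```go
// Truncated exponential backoff with full jitter: the delay ceiling doubles
// per attempt up to a cap, and the actual sleep is uniform in [0, ceiling).
package main

import (
	"fmt"
	"math/rand"
	"time"
)

const (
	retryBase = 50 * time.Millisecond // illustrative, not a Cordum constant
	retryCap  = 2 * time.Second       // illustrative client retry ceiling
)

// backoffDelay picks a random delay in [0, min(cap, base*2^attempt)).
func backoffDelay(attempt int) time.Duration {
	ceiling := retryCap
	if d := retryBase << uint(attempt); d > 0 && d < ceiling {
		ceiling = d
	}
	return time.Duration(rand.Int63n(int64(ceiling)))
}

func main() {
	for attempt := 0; attempt < 5; attempt++ {
		fmt.Printf("attempt %d: sleeping %v before retrying the 503\n",
			attempt, backoffDelay(attempt))
	}
}
```

Full jitter trades per-attempt predictability for spread: two clients that fail together almost never retry together.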

Next step

Do this in one sprint:

1. Add admission-lock wait histograms by tenant and endpoint.
2. Load test fixed-delay versus jittered client retry behavior at 10x normal burst.
3. Tune `max_wait` and client retry budget together, not in isolation.
4. Keep a rollback knob for lock-delay policy changes.

Continue with AI Agent Workflow Idempotency Reservation and AI Agent Distributed Locking.

Control-plane reality

Locks solve correctness first. Throughput comes next. You need both if your agents run at scale.