The production problem
Max-concurrency controls are useless if 100 concurrent start requests race for the same tenant's run slots and slip past the cap.
Admission locking fixes that race, but it introduces a new hotspot: one lock key per org.
With fixed 10ms retries, parallel clients can wake together and collide again. The lock is safe, but throughput looks jagged and noisy.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis Docs: Distributed Locks | Random lock value ownership, compare-delete release script, and random retry delay to avoid split-brain pressure. | No workflow-run admission context with org-scoped concurrency caps and idempotency reservations. |
| AWS Builders’ Library: Timeouts, retries and backoff with jitter | Why correlated retries create overload loops and how jitter spreads retry traffic. | No lock-key level guidance for one-tenant admission hotspots inside an AI control plane. |
| Google Cloud IAM: Retry strategy | Truncated exponential backoff with jitter and concurrency-aware retries for `409 ABORTED` flows. | No per-tenant lock budget design where requests race for run slots under strict admission windows. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Admission key scope | Lock key format is `cordum:wf:run:admission:<orgID>`. | One tenant cannot block run admission for other tenants. |
| Retry and timeout constants | `workflowAdmissionLockTTL=10s`, `workflowAdmissionLockRetryDelay=10ms`, `workflowAdmissionLockMaxWait=2s`. | Admission is bounded, but fixed retry cadence can align under heavy parallel callers. |
| Lock semantics | `TryAcquireLock` uses `SetNX` with UUID token and TTL; `ReleaseLock` uses token-matching Lua compare-delete. | Prevents accidental release by non-owner code path. |
| Failure surface | If lock acquire fails, gateway returns `503 workflow concurrency gate unavailable`. | Client retry strategy becomes part of availability behavior. |
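The "lock semantics" row is worth internalizing before reading the real code below. Here is an in-memory stand-in for the same SetNX-plus-compare-delete contract; it mirrors the described behavior but is not the real `RedisJobStore`:

```go
package main

import (
	"fmt"
	"sync"
)

// memLockStore mimics the SetNX + token-matching-delete contract in process
// memory: acquire only if unheld, release only if you still own the token.
type memLockStore struct {
	mu   sync.Mutex
	vals map[string]string
}

func newMemLockStore() *memLockStore {
	return &memLockStore{vals: make(map[string]string)}
}

// TryAcquire succeeds only if the key is unheld, like SET key token NX.
func (s *memLockStore) TryAcquire(key, token string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, held := s.vals[key]; held {
		return false
	}
	s.vals[key] = token
	return true
}

// Release deletes the key only when the stored token matches, like the
// GET/DEL Lua script: a stale caller cannot free someone else's lock.
func (s *memLockStore) Release(key, token string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.vals[key] != token {
		return false
	}
	delete(s.vals, key)
	return true
}

func main() {
	s := newMemLockStore()
	fmt.Println(s.TryAcquire("cordum:wf:run:admission:org-1", "token-A")) // true
	fmt.Println(s.TryAcquire("cordum:wf:run:admission:org-1", "token-B")) // false: already held
	fmt.Println(s.Release("cordum:wf:run:admission:org-1", "token-B"))    // false: wrong token
	fmt.Println(s.Release("cordum:wf:run:admission:org-1", "token-A"))    // true: owner released
}
```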
Lock lifecycle in code
Admission lock loop and constants
```go
// core/controlplane/gateway/handlers_workflows.go (excerpt)
const (
	workflowAdmissionLockTTL        = 10 * time.Second
	workflowAdmissionLockRetryDelay = 10 * time.Millisecond
	workflowAdmissionLockMaxWait    = 2 * time.Second
)

func (s *server) acquireWorkflowAdmissionLock(ctx context.Context, orgID string) (func(), error) {
	waitCtx, cancel := context.WithTimeout(ctx, workflowAdmissionLockMaxWait)
	defer cancel()
	lockKey := "cordum:wf:run:admission:" + strings.TrimSpace(orgID)
	for {
		token, err := s.jobStore.TryAcquireLock(waitCtx, lockKey, workflowAdmissionLockTTL)
		if err != nil {
			return nil, err
		}
		if token != "" {
			return func() {
				releaseCtx, releaseCancel := context.WithTimeout(context.Background(), time.Second)
				defer releaseCancel()
				_ = s.jobStore.ReleaseLock(releaseCtx, lockKey, token)
			}, nil
		}
		timer := time.NewTimer(workflowAdmissionLockRetryDelay) // fixed 10ms
		select {
		case <-waitCtx.Done():
			timer.Stop()
			return nil, waitCtx.Err()
		case <-timer.C:
		}
	}
}
```

Redis token lock and safe release script
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) TryAcquireLock(ctx context.Context, key string, ttl time.Duration) (string, error) {
	token := uuid.NewString()
	acquired, err := s.client.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return "", err
	}
	if !acquired {
		return "", nil
	}
	return token, nil
}

var releaseLockScript = redis.NewScript(`
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
`)
```

Concurrency test with concrete numbers
```go
// core/controlplane/gateway/workflow_runs_test.go (excerpt)
func TestHandleStartRunConcurrentRequestsRespectMaxConcurrentLimit(t *testing.T) {
	// max_concurrent_runs = 2
	// workers = 10 concurrent start-run calls
	// expected: exactly 2 x 200 OK, 8 x 429 Too Many Requests
}
```

Validation runbook
Measure contention behavior before tuning lock constants. Gut feeling is a poor lock strategy.
```shell
# 1) Configure tenant max_concurrent_runs=2
# 2) Fire 100 concurrent run-start requests for the same org
# 3) Measure P50/P95 admission latency and 429/503 ratios
# 4) Repeat with client-side fixed-delay retries
# 5) Repeat with client-side jittered exponential retries
# 6) Compare lock-key request burst shape and success spread
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fixed retry delay (current 10ms) | Predictable and easy to reason about in tests. | High chance of synchronized retries under contention. |
| Jittered retry delay | Reduces thundering herd behavior on a hot lock key. | Harder to reproduce exact contention timing in local tests. |
| Longer max wait budget | More callers eventually acquire lock during transient spikes. | Higher request latency and more blocked goroutines under sustained overload. |
- A 2-second lock wait budget can reject bursts that might have succeeded with a slightly wider window.
- A 10-second TTL is a compromise: too short risks premature expiry mid-admission; too long delays recovery when a lock holder dies.
- Jitter reduces thundering-herd risk but makes exact contention timing harder to reproduce in tests and incident replay.
Next step
Do this in one sprint:
1. Add admission-lock wait histograms by tenant and endpoint.
2. Load test fixed-delay versus jittered client retry behavior at 10x normal burst.
3. Tune `max_wait` and client retry budget together, not in isolation.
4. Keep a rollback knob for lock-delay policy changes.
Continue with AI Agent Workflow Idempotency Reservation and AI Agent Distributed Locking.