The production problem
Max-concurrency controls are useless if 100 concurrent start requests race for the same tenant's run slots and slip past the cap.
Admission locking fixes that race, but it introduces a new hotspot: one lock key per org.
With fixed 10ms retries, parallel clients can wake together and collide again. The lock is safe, but throughput looks jagged and noisy.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis Docs: Distributed Locks | Random lock value ownership, compare-delete release script, and random retry delay to avoid split-brain pressure. | No workflow-run admission context with org-scoped concurrency caps and idempotency reservations. |
| AWS Builders’ Library: Timeouts, retries and backoff with jitter | Why correlated retries create overload loops and how jitter spreads retry traffic. | No lock-key level guidance for one-tenant admission hotspots inside an AI control plane. |
| Google Cloud IAM: Retry strategy | Truncated exponential backoff with jitter and concurrency-aware retries for `409 ABORTED` flows. | No per-tenant lock budget design where requests race for run slots under strict admission windows. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Admission key scope | Lock key format is `cordum:wf:run:admission:<orgID>`. | One tenant cannot block run admission for other tenants. |
| Retry and timeout constants | `workflowAdmissionLockTTL=10s`, `workflowAdmissionLockRetryDelay=10ms`, `workflowAdmissionLockMaxWait=2s`. | Admission is bounded, but fixed retry cadence can align under heavy parallel callers. |
| Lock semantics | `TryAcquireLock` uses `SetNX` with UUID token and TTL; `ReleaseLock` uses token-matching Lua compare-delete. | Prevents accidental release by non-owner code path. |
| Failure surface | If lock acquire fails, gateway returns `503 workflow concurrency gate unavailable`. | Client retry strategy becomes part of availability behavior. |
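The "lock semantics" row is worth internalizing before reading the real code below. Here is an in-memory stand-in for the same SetNX-plus-compare-delete contract; it mirrors the described behavior but is not the real `RedisJobStore`:

```go
package main

import (
	"fmt"
	"sync"
)

// memLockStore mimics the SetNX + token-matching-delete contract in process
// memory: acquire only if unheld, release only if you still own the token.
type memLockStore struct {
	mu   sync.Mutex
	vals map[string]string
}

func newMemLockStore() *memLockStore {
	return &memLockStore{vals: make(map[string]string)}
}

// TryAcquire succeeds only if the key is unheld, like SET key token NX.
func (s *memLockStore) TryAcquire(key, token string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if _, held := s.vals[key]; held {
		return false
	}
	s.vals[key] = token
	return true
}

// Release deletes the key only when the stored token matches, like the
// GET/DEL Lua script: a stale caller cannot free someone else's lock.
func (s *memLockStore) Release(key, token string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if s.vals[key] != token {
		return false
	}
	delete(s.vals, key)
	return true
}

func main() {
	s := newMemLockStore()
	fmt.Println(s.TryAcquire("cordum:wf:run:admission:org-1", "token-A")) // true
	fmt.Println(s.TryAcquire("cordum:wf:run:admission:org-1", "token-B")) // false: already held
	fmt.Println(s.Release("cordum:wf:run:admission:org-1", "token-B"))    // false: wrong token
	fmt.Println(s.Release("cordum:wf:run:admission:org-1", "token-A"))    // true: owner released
}
```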
Lock lifecycle in code
Admission lock loop and constants
```go
// core/controlplane/gateway/handlers_workflows.go (excerpt)
const (
	workflowAdmissionLockTTL        = 10 * time.Second
	workflowAdmissionLockRetryDelay = 10 * time.Millisecond
	workflowAdmissionLockMaxWait    = 2 * time.Second
)

func (s *server) acquireWorkflowAdmissionLock(ctx context.Context, orgID string) (func(), error) {
	waitCtx, cancel := context.WithTimeout(ctx, workflowAdmissionLockMaxWait)
	defer cancel()
	lockKey := "cordum:wf:run:admission:" + strings.TrimSpace(orgID)
	for {
		token, err := s.jobStore.TryAcquireLock(waitCtx, lockKey, workflowAdmissionLockTTL)
		if err != nil {
			return nil, err
		}
		if token != "" {
			return func() {
				releaseCtx, releaseCancel := context.WithTimeout(context.Background(), time.Second)
				defer releaseCancel()
				_ = s.jobStore.ReleaseLock(releaseCtx, lockKey, token)
			}, nil
		}
		timer := time.NewTimer(workflowAdmissionLockRetryDelay) // fixed 10ms
		select {
		case <-waitCtx.Done():
			timer.Stop()
			return nil, waitCtx.Err()
		case <-timer.C:
		}
	}
}
```

Redis token lock and safe release script
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) TryAcquireLock(ctx context.Context, key string, ttl time.Duration) (string, error) {
	token := uuid.NewString()
	acquired, err := s.client.SetNX(ctx, key, token, ttl).Result()
	if err != nil {
		return "", err
	}
	if !acquired {
		return "", nil
	}
	return token, nil
}

var releaseLockScript = redis.NewScript(`
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
`)
```

Concurrency test with concrete numbers
```go
// core/controlplane/gateway/workflow_runs_test.go (excerpt)
func TestHandleStartRunConcurrentRequestsRespectMaxConcurrentLimit(t *testing.T) {
	// max_concurrent_runs = 2
	// workers = 10 concurrent start-run calls
	// expected: exactly 2 x 200 OK, 8 x 429 Too Many Requests
}
```

Validation runbook
Measure contention behavior before tuning lock constants. Gut feeling is a poor lock strategy.
```shell
# 1) Configure tenant max_concurrent_runs=2
# 2) Fire 100 concurrent run-start requests for the same org
# 3) Measure P50/P95 admission latency and 429/503 ratios
# 4) Repeat with client-side fixed-delay retries
# 5) Repeat with client-side jittered exponential retries
# 6) Compare lock-key request burst shape and success spread
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fixed retry delay (current 10ms) | Predictable and easy to reason about in tests. | High chance of synchronized retries under contention. |
| Jittered retry delay | Reduces thundering herd behavior on a hot lock key. | Harder to reproduce exact contention timing in local tests. |
| Longer max wait budget | More callers eventually acquire lock during transient spikes. | Higher request latency and more blocked goroutines under sustained overload. |
- A 2-second lock wait budget can reject bursts that might have succeeded with a slightly wider window.
- A 10-second TTL is a compromise: too short risks premature expiry mid-admission; too long delays recovery when a lock holder dies.
- Jitter reduces thundering-herd risk but makes exact contention timing harder to reproduce in tests and incident replay.
Next step
Do this in one sprint:
1. Add admission-lock wait histograms by tenant and endpoint.
2. Load test fixed-delay versus jittered client retry behavior at 10x normal burst.
3. Tune `max_wait` and client retry budget together, not in isolation.
4. Keep a rollback knob for lock-delay policy changes.
Continue with AI Agent Workflow Idempotency Reservation and AI Agent Distributed Locking.