
AI Agent Run Lock Busy Retries (2026): Why Fixed 500ms Delays Create Contention Waves

Fixed delays are easy to reason about, but they can keep contenders synchronized under pressure. This guide maps the exact Cordum code paths involved and shows a safe jitter rollout model.

Concurrency · 13 min read · Updated Apr 2026

TL;DR

  • Cordum currently returns `RetryAfter(..., 500ms)` for run-lock contention in key workflow result paths.
  • Fixed intervals are deterministic but can align contenders into repeated collision waves.
  • Bounded jitter lowers synchronized retries without changing lock correctness.
  • Treat jitter rollout as a reliability change with explicit before/after metrics.

The production problem

Lock contention bugs often look like throughput bugs. Workers are healthy. Queue is healthy. Yet progress is slow because contenders wake at the same interval, collide, and repeat.

A fixed `RetryAfter(..., 500ms)` delay creates a deterministic retry rhythm. Under high parallelism, that rhythm can produce collision waves every half second.

You still get correctness. You lose efficiency. More retries. More queue churn. Longer tails for job result handling.

What top-ranking sources cover vs. miss

| Source | What it covers | What it misses |
|---|---|---|
| AWS Architecture Blog: Exponential Backoff and Jitter | Shows why synchronized retries waste work under contention and why jittered backoff reduces collision spikes. | Does not map jitter decisions to lock-ownership code paths in workflow engines handling job results. |
| gRPC Retry Guide | Defines retry policy controls and documents jitter around backoff delays to avoid client stampedes. | Focuses on RPC retries; does not cover queue redelivery plus distributed run-lock contention loops. |
| Redis Distributed Locks | Recommends retrying lock acquisition after a random delay to desynchronize competing clients. | No implementation guidance for mixed paths where one handler spin-waits and another immediately returns retryable errors. |

Cordum runtime paths

There are two primary lock-busy result paths today. Both converge on the same fixed 500ms delayed retry, even though one path spin-waits before giving up.

| Path | Location | Current behavior | Consequence |
|---|---|---|---|
| Workflow reconciler | `core/workflow/reconciler.go` | Returns `RetryAfter("run lock busy", 500ms)` when the run lock token is empty. | Fast retries; high chance of synchronized retries under load. |
| Gateway workflow result handler | `core/controlplane/gateway/handlers_stream.go` | Spin-waits up to 3s for the lock, then returns `RetryAfter("run lock busy", 500ms)` if still contended. | Lower immediate NATS bounce rate, but the same fixed retry period after timeout. |
| Bus retry mapping | `core/infra/bus/nats.go` + `core/infra/bus/retry.go` | `RetryDelay()` extraction feeds a delayed NAK (`NakWithDelay(delay)`). | Transport honors the 500ms value exactly unless the caller changes it. |
| Lock-busy test coverage | `core/workflow/runner_test.go` | `TestReconcilerHandleJobResultLockBusy` asserts the 500ms delay. | The current fixed delay is intentional and test-protected. |

Failure modes

| Fault | Observed symptom | Operational effect |
|---|---|---|
| Fixed retry period shared by many workers | Retry bursts every 500ms (2 waves/sec) | Low success rate per wave during heavy lock contention |
| Spin-wait then fixed retry | 3s local wait followed by synchronized delayed redelivery | Latency tails get wide and noisy |
| No jitter telemetry | Hard to prove contention improvements | Rollout debates based on opinion, not data |
| Over-wide jitter band | Queue latency swings too much | Improved contention but degraded response-time predictability |

Implementation examples

current-lock-busy-path.go

```go
// core/workflow/reconciler.go
if token == "" {
  return bus.RetryAfter(fmt.Errorf("run lock busy"), 500*time.Millisecond)
}

// core/controlplane/gateway/handlers_stream.go
if time.Now().After(lockDeadline) {
  return bus.RetryAfter(fmt.Errorf("run lock busy: %s", runID), 500*time.Millisecond)
}
```
bounded-jitter-design.go

```go
// Suggested bounded jitter helper for lock-busy retries.
func lockBusyDelay(base time.Duration, jitterPct float64, rnd *rand.Rand) time.Duration {
  // Example: base=500ms, jitterPct=0.30 -> range [350ms, 650ms]
  lo := float64(base) * (1.0 - jitterPct)
  hi := float64(base) * (1.0 + jitterPct)
  if lo < 0 {
    lo = 0
  }
  if hi < lo {
    hi = lo
  }
  span := hi - lo
  if span == 0 {
    return time.Duration(lo)
  }
  return time.Duration(lo + rnd.Float64()*span)
}

// Use only on lock-busy classification paths, not all retries.
// Keep the base delay visible in config so operators can tune safely.
```
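
A sketch of how the helper could slot into the reconciler path in place of the fixed constant. The package-level `retryRand` and the exact call-site shape are assumptions, not Cordum's actual wiring:

```go
// Sketch only: replaces the fixed 500ms constant on the lock-busy path.
// retryRand is an assumed process-level source; *rand.Rand is not safe for
// concurrent use, so real wiring would need a per-worker source or a mutex.
var retryRand = rand.New(rand.NewSource(time.Now().UnixNano()))

// core/workflow/reconciler.go (hypothetical revision)
if token == "" {
  return bus.RetryAfter(fmt.Errorf("run lock busy"),
    lockBusyDelay(500*time.Millisecond, 0.30, retryRand))
}
```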
validation-runbook.sh

```bash
# Confirm fixed-delay lock-busy behavior in code
rg --line-number 'run lock busy|RetryAfter\(.*500\*time\.Millisecond' core/workflow/reconciler.go core/controlplane/gateway/handlers_stream.go

# Confirm queue mapping preserves delay intent
rg --line-number 'RetryDelay\(|msgActionNakDelay|NakWithDelay' core/infra/bus/nats.go core/infra/bus/retry.go

# Confirm lock-busy delay assertion exists in tests
rg --line-number 'TestReconcilerHandleJobResultLockBusy|500\*time\.Millisecond' core/workflow/runner_test.go

# Recommended rollout metrics (example names)
# lock_busy_retry_count, lock_busy_retry_delay_ms, run_lock_wait_ms_p95, job_result_latency_ms_p95
```
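
The metric names in the runbook are suggestions, not existing Cordum series. If the Prometheus Go client is in the stack, a minimal registration sketch could look like this; names and bucket boundaries are assumptions:

```go
package metrics

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Hypothetical metrics matching the suggested names; if Cordum already has
// its own metrics layer, map these onto it instead of registering directly.
var (
	LockBusyRetryCount = promauto.NewCounter(prometheus.CounterOpts{
		Name: "lock_busy_retry_count",
		Help: "Delayed retries issued for run-lock contention.",
	})
	LockBusyRetryDelay = promauto.NewHistogram(prometheus.HistogramOpts{
		Name:    "lock_busy_retry_delay_ms",
		Help:    "Delay assigned to each lock-busy retry, in milliseconds.",
		Buckets: prometheus.LinearBuckets(250, 50, 10), // 250ms..700ms spans a 500ms ±30% band
	})
)
```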

Operational defaults

Wave math

A fixed 500ms delay means two retry waves per second. If N contenders enter a wave together, at most one acquires the lock per wave, so draining the backlog takes on the order of N waves while every other attempt fails and requeues. The sketch below models that rhythm.
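
To make the wave math concrete, here is a self-contained toy model under deliberately simplified assumptions: one lock, instantaneous hold-and-release, all contenders starting together, millisecond resolution. It models the retry rhythm only, not Cordum's actual scheduler:

```go
package main

import (
	"fmt"
	"math/rand"
)

// drainStats runs a toy model of n contenders competing for one run lock.
// Time is in milliseconds. At each instant, one contender due at that
// instant wins and leaves; every other contender due then reschedules
// after 500ms plus an optional +/-jitterPct band. Returns total failed
// attempts and the time at which the last contender acquired the lock.
func drainStats(n int, jitterPct float64, rnd *rand.Rand) (failures, doneAt int) {
	due := make([]int, n) // next attempt time per contender; all start at t=0
	for remaining := n; remaining > 0; {
		// Advance to the earliest pending attempt.
		t := due[0]
		for _, d := range due[:remaining] {
			if d < t {
				t = d
			}
		}
		won := false
		for i := 0; i < remaining; i++ {
			if due[i] != t {
				continue
			}
			if !won { // first arrival at this instant takes the lock and leaves
				won = true
				due[i], due[remaining-1] = due[remaining-1], due[i]
				remaining--
				i--
				continue
			}
			failures++ // loser: reschedule with optional jitter
			jitter := int(500.0 * jitterPct * (2*rnd.Float64() - 1))
			due[i] = t + 500 + jitter
		}
		doneAt = t
	}
	return failures, doneAt
}

func main() {
	rnd := rand.New(rand.NewSource(1))
	f, d := drainStats(50, 0.0, rnd)
	fmt.Printf("fixed:    %d failed attempts, drained at %dms\n", f, d)
	f, d = drainStats(50, 0.3, rnd)
	fmt.Printf("jittered: %d failed attempts, drained at %dms\n", f, d)
}
```

In this toy model, the fixed delay returns every loser in the same wave, so failures grow quadratically with contender count (n(n-1)/2), while a 30% band spreads the same retries across roughly 300 distinct millisecond slots. Real hold times soften the contrast, but the synchronization mechanism is the same.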

Rollout guardrails

Roll out jitter behind a flag. Compare p95 lock wait, retry count, and job-result latency before and after. Roll back if latency tails grow beyond policy limits.

| Control | Default | Why it exists |
|---|---|---|
| Lock-busy delay (current) | 500ms | Quick retry cadence after lock contention |
| Gateway lock wait window | 3s spin-wait before retry | Avoids immediate redelivery bounce during short lock holds |
| Run lock TTL | 30s | Prevents permanent lock ownership on worker or process failure |
| Jitter candidate range | 500ms ± 30% (350ms..650ms) | Breaks retry synchronization while keeping latency bounded |
| Queue redelivery cap | 100 (JetStream max deliver) | Stops infinite redelivery loops on persistent failures |
| Rollout gate | Observe p95/p99 lock wait before full rollout | Avoids trading contention improvements for unacceptable tail latency |
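
These controls could surface as one operator-visible config block. A sketch only; the struct, field names, and flag gating are illustrative assumptions, not Cordum's existing configuration surface:

```go
// Hypothetical config surface mirroring the defaults above.
type LockRetryConfig struct {
	BaseDelay       time.Duration // current fixed lock-busy delay
	JitterPct       float64       // 0 disables jitter; 0.30 -> 350ms..650ms band
	GatewayLockWait time.Duration // spin-wait window before returning RetryAfter
	RunLockTTL      time.Duration // lock expiry on worker or process failure
	JitterEnabled   bool          // rollout flag: flip per environment
}

func defaultLockRetryConfig() LockRetryConfig {
	return LockRetryConfig{
		BaseDelay:       500 * time.Millisecond,
		JitterPct:       0.30,
		GatewayLockWait: 3 * time.Second,
		RunLockTTL:      30 * time.Second,
		JitterEnabled:   false, // off until canary metrics justify enabling
	}
}
```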

Limitations and tradeoffs

Determinism vs smoothing

Fixed delays are predictable for debugging. Jitter improves smoothing but adds distribution complexity.

Latency budget pressure

A wide jitter band can reduce collisions but hurt tail latency. Keep the jitter window bounded and measured.

Scope control

Start with lock-busy errors only. Global jitter across all retries can introduce noisy behavior in unrelated paths.
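
One way to enforce that scope is to classify the error before choosing a delay. A sketch using a sentinel error; `ErrRunLockBusy`, `retryDelayFor`, and the `LockRetryConfig` type from the earlier sketch are illustrative names, not existing Cordum identifiers:

```go
// Hypothetical sentinel for lock contention; call sites would wrap it
// instead of constructing a bare fmt.Errorf string.
var ErrRunLockBusy = errors.New("run lock busy")

// retryDelayFor jitters only lock-busy retries, leaving every other
// retryable error on its existing deterministic delay.
func retryDelayFor(err error, cfg LockRetryConfig, rnd *rand.Rand) time.Duration {
	if cfg.JitterEnabled && errors.Is(err, ErrRunLockBusy) {
		return lockBusyDelay(cfg.BaseDelay, cfg.JitterPct, rnd)
	}
	return cfg.BaseDelay
}
```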

Frequently Asked Questions

Why not randomize every retry path?
Because not all retries have the same objective. Lock contention retries need desynchronization. Some infrastructure retries may prefer deterministic delays for predictability.

Does jitter weaken lock safety?
No. Safety comes from lock ownership and token checks. Jitter only changes when contenders retry, not who can hold the lock.

How much jitter is enough?
Start with 20-30% around the current base delay. Measure lock collision rate and p95 job-result latency before increasing the band.

What is the fastest validation signal after rollout?
Watch lock-busy retries per second and successful lock acquisitions after the first retry. If both improve without latency regressions, the rollout is healthy.

Next step

Run a 7-day canary with bounded lock-busy jitter and compare two metrics first: lock-busy retries per minute and p95 job-result latency. Keep the change only if both move in the right direction.

Sources