
AI Agent Run Lock Busy Retries

Fixed retry intervals are clean on paper and noisy under contention. This matters when many workers chase the same lock.

Deep Dive · 10 min read · Mar 2026
TL;DR
  • The Cordum reconciler returns `RetryAfter(..., 500ms)` for `run lock busy` in `HandleJobResult`.
  • The retry delay stays fixed end-to-end: `RetryAfter` stores a static delay, and bus processing uses it directly as the NAK delay.
  • Fixed retry intervals can synchronize workers into repeated contention waves under high parallelism.
  • Adding bounded jitter to lock-busy retries can reduce burst contention without changing correctness semantics.
Failure mode

Many workers hit the same lock, all sleep 500ms, then all wake together and collide again.

Current behavior

Lock-busy retries are deterministic and easy to reason about, but can create synchronized retry pulses.

Operational payoff

Jitter can lower collision spikes and smooth queue pressure during lock contention events.

Scope

This guide focuses on retry timing for lock-busy outcomes in workflow reconciliation paths.

The production problem

Lock contention by itself is normal. Synchronized retries are the multiplier.

If many workers use the same fixed delay, they can reattempt acquisition in phase. That creates repeated bursts instead of smooth spreading.

The result is counterintuitive. You get more retries, longer queue residence, and noisy lock-busy logs even when lock ownership is behaving correctly.

Retry timing is a throughput control, not only an error-handling detail.

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| gRPC retry guide | Backoff with jitter (± variance) to avoid synchronized retries. | No lock-busy-specific pattern tied to per-run distributed lock ownership in workflow reconciler paths. |
| AWS EventBridge retries and jitter | Default retry model uses exponential backoff and jitter. | No concrete mapping for lock-busy retry decisions where low latency and contention smoothing must be balanced. |
| AWS Builders' Library: avoiding fallback | Fallback/retry modes need real testing; rarely used paths become outage multipliers. | No concrete code-level guidance for fixed-delay lock retries in agent orchestration control loops. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Lock busy in `HandleJobResult` | Reconciler returns `RetryAfter(fmt.Errorf("run lock busy"), 500*time.Millisecond)` when the lock token is empty. | Fast, predictable retries, but all contenders can retry in sync. |
| Lock backend error | Reconciler returns a retryable error with a 1s delay when `TryAcquireLock` errors. | Different delay class for infrastructure errors versus contention. |
| Retry transport | `RetryAfter` stores a static delay; bus processing reads it and issues a delayed NAK directly. | No jitter is injected in this path by default. |
| Test coverage | `TestReconcilerHandleJobResultLockBusy` asserts an exact 500ms retry delay. | The behavior is intentional and locked in by tests. |

Retry paths in code

Reconciler lock-busy retry

core/workflow/reconciler.go
go
// core/workflow/reconciler.go (excerpt)
if r.jobStore != nil {
  lockKey := runLockKey(runID)
  token, err := r.jobStore.TryAcquireLock(ctx, lockKey, 30*time.Second)
  if err != nil {
    return bus.RetryAfter(err, 1*time.Second)
  }
  if token == "" {
    return bus.RetryAfter(fmt.Errorf("run lock busy"), 500*time.Millisecond)
  }
  defer func() { _ = r.jobStore.ReleaseLock(context.Background(), lockKey, token) }()
}

Static delay transport path

core/infra/bus/retry.go + nats.go
go
// core/infra/bus/retry.go + nats.go (excerpt)
func RetryAfter(err error, delay time.Duration) error {
  return &RetryableError{Err: err, Delay: delay}
}

if err := handler(&packet); err != nil {
  if delay, ok := RetryDelay(err); ok {
    if delay > 0 {
      return msgActionNakDelay, delay
    }
    return msgActionNak, 0
  }
}

Existing test locking the 500ms behavior

core/workflow/runner_test.go
go
// core/workflow/runner_test.go (excerpt)
func TestReconcilerHandleJobResultLockBusy(t *testing.T) {
  // pre-acquire run lock
  err := rec.HandleJobResult(ctx, &pb.JobResult{JobId: "run-1:step@1"})
  if delay, ok := bus.RetryDelay(err); !ok || delay != 500*time.Millisecond {
    t.Fatalf("expected retry delay 500ms")
  }
}

Bounded jitter option

suggested-jitter.go
go
// Suggested bounded jitter for lock-busy retries (rand is math/rand).
base := 500 * time.Millisecond
maxJitter := 200 * time.Millisecond
// Uniform offset in [-maxJitter, +maxJitter), so the delay lands in [300ms, 700ms).
j := time.Duration(rand.Int63n(int64(maxJitter)*2)) - maxJitter
return bus.RetryAfter(fmt.Errorf("run lock busy"), base+j)

Validation runbook

Evaluate retry wave behavior under realistic parallel load before changing delay strategy.

lock-busy-retry-runbook.sh
bash
# 1) Measure lock-busy retry burstiness
# - count retries per 500ms bucket
# - look for synchronized spikes at exact multiples of 500ms

# 2) Correlate with lock contention and queue depth
# - run lock acquisition misses
# - pending message growth

# 3) Validate transport behavior
# - confirm retry delay is passed through unchanged by bus layer

# 4) Staging experiment
# - run fixed 500ms policy for baseline
# - run bounded jitter policy (+/- 200ms) under same load
# - compare p95 lock-acquire wait and duplicate retry bursts

# 5) Rollout guardrail
# - keep max jitter below lock TTL safety budget
# - keep retry delay > median critical-section duration to avoid immediate re-collision

Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Fixed 500ms retry | Simple, deterministic, easy to test and reason about. | High chance of synchronized retries under parallel contention. |
| Bounded jitter around 500ms | Reduces retry synchronization and smooths contention bursts. | Harder to reason about exact retry timing during debugging. |
| Full exponential backoff + jitter | Strongest herd protection for severe contention events. | Can increase recovery latency for short-lived lock contention. |
  • Fixed-delay behavior is simple and testable. It is not inherently wrong, but it can underperform during synchronized contention.
  • Jitter should stay bounded and must respect the lock TTL and correctness boundaries.
  • I found explicit tests for the current fixed 500ms delay, but no test suite yet for jitter variance bounds.

Next step

Implement this next:

  1. Add optional bounded jitter for `run lock busy` retries behind a config flag.
  2. Add tests validating the jitter range and preserving retryability semantics.
  3. Track a lock-busy burstiness metric before and after rollout.
  4. Keep a fast rollback to the fixed delay if the latency distribution regresses.

Continue with AI Agent Lock Release Failure and AI Agent Workflow Admission Lock Jitter.

Retry rhythm is architecture

If all contenders wake at the same time, your lock path becomes a metronome for contention. Small jitter can break that rhythm.