The production problem
Lock contention by itself is normal. Synchronized retries are the multiplier.
If many workers use the same fixed delay, they can reattempt acquisition in phase. That creates repeated bursts instead of smooth spreading.
The result is counterintuitive. You get more retries, longer queue residence, and noisy lock-busy logs even when lock ownership is behaving correctly.
Retry timing is a throughput control, not only an error-handling detail.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC retry guide | Backoff with jitter (plus/minus variance) to avoid synchronized retries. | No lock-busy specific pattern tied to per-run distributed lock ownership in workflow reconciler paths. |
| AWS EventBridge retries and jitter | Default retry model uses exponential backoff and jitter. | No concrete mapping for lock-busy retry decisions where low latency and contention smoothing must be balanced. |
| AWS Builders' Library: avoiding fallback | Fallback/retry modes need real testing; rarely used paths become outage multipliers. | No concrete code-level guidance for fixed-delay lock retries in agent orchestration control loops. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Lock busy in HandleJobResult | Reconciler returns `RetryAfter(fmt.Errorf("run lock busy"), 500*time.Millisecond)` when lock token is empty. | Fast, predictable retries, but all contenders can retry in sync. |
| Lock backend error | Reconciler returns retryable error with 1s delay when `TryAcquireLock` errors. | Different delay class for infrastructure errors versus contention. |
| Retry transport | `RetryAfter` stores static delay; bus processing reads it and uses delayed NAK directly. | No jitter is injected in this path by default. |
| Test coverage | `TestReconcilerHandleJobResultLockBusy` asserts exact 500ms retry delay. | Behavior is intentional and locked by tests. |
Retry paths in code
Reconciler lock-busy retry
```go
// core/workflow/reconciler.go (excerpt)
if r.jobStore != nil {
	lockKey := runLockKey(runID)
	token, err := r.jobStore.TryAcquireLock(ctx, lockKey, 30*time.Second)
	if err != nil {
		return bus.RetryAfter(err, 1*time.Second)
	}
	if token == "" {
		return bus.RetryAfter(fmt.Errorf("run lock busy"), 500*time.Millisecond)
	}
	defer func() { _ = r.jobStore.ReleaseLock(context.Background(), lockKey, token) }()
}
```

Static delay transport path
```go
// core/infra/bus/retry.go + nats.go (excerpt)
func RetryAfter(err error, delay time.Duration) error {
	return &RetryableError{Err: err, Delay: delay}
}

if err := handler(&packet); err != nil {
	if delay, ok := RetryDelay(err); ok {
		if delay > 0 {
			return msgActionNakDelay, delay
		}
		return msgActionNak, 0
	}
}
```

Existing test locking the 500ms behavior
```go
// core/workflow/runner_test.go (excerpt)
func TestReconcilerHandleJobResultLockBusy(t *testing.T) {
	// pre-acquire run lock
	err := rec.HandleJobResult(ctx, &pb.JobResult{JobId: "run-1:step@1"})
	if delay, ok := bus.RetryDelay(err); !ok || delay != 500*time.Millisecond {
		t.Fatalf("expected retry delay 500ms")
	}
}
```

Bounded jitter option
```go
// Suggested bounded jitter for lock-busy retries: delay is drawn
// uniformly from [base - maxJitter, base + maxJitter).
base := 500 * time.Millisecond
maxJitter := 200 * time.Millisecond
j := time.Duration(rand.Int63n(int64(maxJitter)*2)) - maxJitter
return bus.RetryAfter(fmt.Errorf("run lock busy"), base+j)
```

Validation runbook
Evaluate retry wave behavior under realistic parallel load before changing delay strategy.
```text
# 1) Measure lock-busy retry burstiness
#    - count retries per 500ms bucket
#    - look for synchronized spikes at exact multiples of 500ms
# 2) Correlate with lock contention and queue depth
#    - run lock acquisition misses
#    - pending message growth
# 3) Validate transport behavior
#    - confirm retry delay is passed through unchanged by bus layer
# 4) Staging experiment
#    - run fixed 500ms policy for baseline
#    - run bounded jitter policy (+/- 200ms) under same load
#    - compare p95 lock-acquire wait and duplicate retry bursts
# 5) Rollout guardrail
#    - keep max jitter below lock TTL safety budget
#    - keep retry delay > median critical-section duration to avoid immediate re-collision
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fixed 500ms retry | Simple, deterministic, easy to test and reason about. | High chance of synchronized retries under parallel contention. |
| Bounded jitter around 500ms | Reduces retry synchronization and smooths contention bursts. | Harder to reason about exact retry timing during debugging. |
| Full exponential backoff + jitter | Strongest herd protection for severe contention events. | Can increase recovery latency for short-lived lock contention. |
- Fixed delay behavior is simple and testable. It is not inherently wrong, but it can underperform during synchronized contention.
- Jitter should stay bounded and must respect the lock TTL and correctness boundaries.
- I found explicit tests for the current fixed 500ms delay, but no test suite yet for jitter variance bounds.
Next step
Implement this next:
1. Add optional bounded jitter for `run lock busy` retries behind a config flag.
2. Add tests validating jitter range and preserving retryability semantics.
3. Track lock-busy burstiness metric before and after rollout.
4. Keep a fast rollback to fixed delay if latency distribution regresses.
Continue with AI Agent Lock Release Failure and AI Agent Workflow Admission Lock Jitter.