The production problem
Lock contention by itself is normal. Synchronized retries are the multiplier.
If many workers use the same fixed delay, they can reattempt acquisition in phase. That creates repeated bursts instead of smooth spreading.
The result is counterintuitive. You get more retries, longer queue residence, and noisy lock-busy logs even when lock ownership is behaving correctly.
Retry timing is a throughput control, not only an error-handling detail.
What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC retry guide | Backoff with jitter (plus/minus variance) to avoid synchronized retries. | No lock-busy specific pattern tied to per-run distributed lock ownership in workflow reconciler paths. |
| AWS EventBridge retries and jitter | Default retry model uses exponential backoff and jitter. | No concrete mapping for lock-busy retry decisions where low latency and contention smoothing must be balanced. |
| AWS Builders' Library: avoiding fallback | Fallback/retry modes need real testing; rarely used paths become outage multipliers. | No concrete code-level guidance for fixed-delay lock retries in agent orchestration control loops. |
Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Lock busy in HandleJobResult | Reconciler returns `RetryAfter(fmt.Errorf("run lock busy"), 500*time.Millisecond)` when lock token is empty. | Fast, predictable retries, but all contenders can retry in sync. |
| Lock backend error | Reconciler returns retryable error with 1s delay when `TryAcquireLock` errors. | Different delay class for infrastructure errors versus contention. |
| Retry transport | `RetryAfter` stores static delay; bus processing reads it and uses delayed NAK directly. | No jitter is injected in this path by default. |
| Test coverage | `TestReconcilerHandleJobResultLockBusy` asserts exact 500ms retry delay. | Behavior is intentional and locked by tests. |
Retry paths in code
Reconciler lock-busy retry
```go
// core/workflow/reconciler.go (excerpt)
if r.jobStore != nil {
	lockKey := runLockKey(runID)
	token, err := r.jobStore.TryAcquireLock(ctx, lockKey, 30*time.Second)
	if err != nil {
		return bus.RetryAfter(err, 1*time.Second)
	}
	if token == "" {
		return bus.RetryAfter(fmt.Errorf("run lock busy"), 500*time.Millisecond)
	}
	defer func() { _ = r.jobStore.ReleaseLock(context.Background(), lockKey, token) }()
}
```

Static delay transport path
```go
// core/infra/bus/retry.go + nats.go (excerpt)
func RetryAfter(err error, delay time.Duration) error {
	return &RetryableError{Err: err, Delay: delay}
}

if err := handler(&packet); err != nil {
	if delay, ok := RetryDelay(err); ok {
		if delay > 0 {
			return msgActionNakDelay, delay
		}
		return msgActionNak, 0
	}
}
```

Existing test locking the 500ms behavior
```go
// core/workflow/runner_test.go (excerpt)
func TestReconcilerHandleJobResultLockBusy(t *testing.T) {
	// pre-acquire run lock
	err := rec.HandleJobResult(ctx, &pb.JobResult{JobId: "run-1:step@1"})
	if delay, ok := bus.RetryDelay(err); !ok || delay != 500*time.Millisecond {
		t.Fatalf("expected retry delay 500ms")
	}
}
```

Bounded jitter option
```go
// Suggested bounded jitter for lock-busy retries: delay is drawn
// uniformly from [base - maxJitter, base + maxJitter).
base := 500 * time.Millisecond
maxJitter := 200 * time.Millisecond
j := time.Duration(rand.Int63n(int64(maxJitter)*2)) - maxJitter
return bus.RetryAfter(fmt.Errorf("run lock busy"), base+j)
```

Validation runbook
Evaluate retry wave behavior under realistic parallel load before changing delay strategy.
```text
# 1) Measure lock-busy retry burstiness
#    - count retries per 500ms bucket
#    - look for synchronized spikes at exact multiples of 500ms
# 2) Correlate with lock contention and queue depth
#    - run lock acquisition misses
#    - pending message growth
# 3) Validate transport behavior
#    - confirm retry delay is passed through unchanged by bus layer
# 4) Staging experiment
#    - run fixed 500ms policy for baseline
#    - run bounded jitter policy (+/- 200ms) under same load
#    - compare p95 lock-acquire wait and duplicate retry bursts
# 5) Rollout guardrail
#    - keep max jitter below lock TTL safety budget
#    - keep retry delay > median critical-section duration to avoid immediate re-collision
```
Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fixed 500ms retry | Simple, deterministic, easy to test and reason about. | High chance of synchronized retries under parallel contention. |
| Bounded jitter around 500ms | Reduces retry synchronization and smooths contention bursts. | Harder to reason about exact retry timing during debugging. |
| Full exponential backoff + jitter | Strongest herd protection for severe contention events. | Can increase recovery latency for short-lived lock contention. |
- Fixed delay behavior is simple and testable. It is not inherently wrong, but it can underperform during synchronized contention.
- Jitter should stay bounded and must respect the lock TTL and correctness boundaries.
- I found explicit tests for the current fixed 500ms delay, but no test suite yet for jitter variance bounds.
Next step
Implement this next:
1. Add optional bounded jitter for `run lock busy` retries behind a config flag.
2. Add tests validating jitter range and preserving retryability semantics.
3. Track lock-busy burstiness metric before and after rollout.
4. Keep a fast rollback to fixed delay if latency distribution regresses.
Continue with AI Agent Lock Release Failure and AI Agent Workflow Admission Lock Jitter.