## TL;DR
- Cordum currently returns `RetryAfter(..., 500ms)` for run-lock contention in key workflow result paths.
- Fixed intervals are deterministic but can align contenders into repeated collision waves.
- Bounded jitter lowers synchronized retries without changing lock correctness.
- Treat jitter rollout as a reliability change with explicit before/after metrics.
## The production problem
Lock contention bugs often look like throughput bugs. Workers are healthy. Queue is healthy. Yet progress is slow because contenders wake at the same interval, collide, and repeat.
A fixed `RetryAfter(..., 500ms)` delay creates a deterministic retry rhythm. Under high parallelism, that rhythm can produce collision waves every half second.
You still get correctness. You lose efficiency. More retries. More queue churn. Longer tails for job result handling.
## What top-ranking sources cover vs. miss
| Source | What it covers | What it misses |
|---|---|---|
| AWS Architecture Blog: Exponential Backoff and Jitter | Shows why synchronized retries waste work under contention and why jittered backoff reduces collision spikes. | Does not map jitter decisions to lock-ownership code paths in workflow engines handling job results. |
| gRPC Retry Guide | Defines retry policy controls and documents jitter around backoff delays to avoid client stampedes. | Focuses on RPC retries; does not cover queue redelivery plus distributed run-lock contention loops. |
| Redis Distributed Locks | Recommends retrying lock acquisition after random delay to desynchronize competing clients. | No implementation guidance for mixed paths where one handler spin-waits and another immediately returns retryable errors. |
## Cordum runtime paths
There are two primary lock-busy result paths today. Both converge on the same fixed 500ms retry delay, even though one path spin-waits before giving up.
| Path | Location | Current behavior | Consequence |
|---|---|---|---|
| Workflow reconciler | `core/workflow/reconciler.go` | Returns `RetryAfter("run lock busy", 500ms)` when run lock token is empty. | Fast retries; high chance of synchronized retries under load. |
| Gateway workflow result handler | `core/controlplane/gateway/handlers_stream.go` | Spin-waits up to 3s for lock, then returns `RetryAfter("run lock busy", 500ms)` if still contended. | Lower immediate NATS bounce rate, but same fixed retry period after timeout. |
| Bus retry mapping | `core/infra/bus/nats.go` + `core/infra/bus/retry.go` | `RetryDelay()` extraction feeds delayed NAK (`NakWithDelay(delay)`). | Transport honors the 500ms value exactly unless caller changes it. |
| Lock-busy test coverage | `core/workflow/runner_test.go` | `TestReconcilerHandleJobResultLockBusy` asserts 500ms delay. | Current fixed delay is intentional and test-protected. |
## Failure modes
| Fault | Observed symptom | Operational effect |
|---|---|---|
| Fixed retry period shared by many workers | Retry bursts every 500ms (2 waves/sec) | Low success rate per wave during heavy lock contention |
| Spin-wait then fixed retry | 3s local wait followed by synchronized delayed redelivery | Latency tails get wide and noisy |
| No jitter telemetry | Hard to prove contention improvements | Rollout debates based on opinion, not data |
| Over-wide jitter band | Queue latency swings too much | Improved contention but degraded response-time predictability |
## Implementation examples
The two handler paths share the fixed delay:

```go
// core/workflow/reconciler.go
if token == "" {
	return bus.RetryAfter(fmt.Errorf("run lock busy"), 500*time.Millisecond)
}
```

```go
// core/controlplane/gateway/handlers_stream.go
if time.Now().After(lockDeadline) {
	return bus.RetryAfter(fmt.Errorf("run lock busy: %s", runID), 500*time.Millisecond)
}
```

A bounded jitter helper for lock-busy retries could look like this:

```go
// Suggested bounded jitter helper for lock-busy retries.
// Use only on lock-busy classification paths, not all retries.
// Keep the base delay visible in config so operators can tune safely.
func lockBusyDelay(base time.Duration, jitterPct float64, rnd *rand.Rand) time.Duration {
	// Example: base=500ms, jitterPct=0.30 -> range [350ms, 650ms]
	min := float64(base) * (1.0 - jitterPct)
	max := float64(base) * (1.0 + jitterPct)
	if min < 0 {
		min = 0
	}
	if max < min {
		max = min
	}
	span := max - min
	if span == 0 {
		return time.Duration(min)
	}
	return time.Duration(min + rnd.Float64()*span)
}
```

Verification commands:

```sh
# Confirm fixed-delay lock-busy behavior in code
rg --line-number "run lock busy|RetryAfter\(.*500\*time\.Millisecond" core/workflow/reconciler.go core/controlplane/gateway/handlers_stream.go

# Confirm queue mapping preserves delay intent
rg --line-number "RetryDelay\(|msgActionNakDelay|NakWithDelay" core/infra/bus/nats.go core/infra/bus/retry.go

# Confirm the lock-busy delay assertion exists in tests
rg --line-number "TestReconcilerHandleJobResultLockBusy|500\*time\.Millisecond" core/workflow/runner_test.go

# Recommended rollout metrics (example names)
# lock_busy_retry_count, lock_busy_retry_delay_ms, run_lock_wait_ms_p95, job_result_latency_ms_p95
```
## Operational defaults
Fixed 500ms delay means two retry waves per second. If many contenders enter the wave together, lock acquisition success per wave can stay low for long periods.
Roll jitter behind a flag. Compare p95 lock wait, retry count, and job-result latency before and after. Roll back if latency tails grow beyond policy limits.
| Control | Default | Why it exists |
|---|---|---|
| Lock-busy delay (current) | 500ms | Quick retry cadence after lock contention |
| Gateway lock wait window | 3s spin-wait before retry | Avoid immediate redelivery bounce during short lock holds |
| Run lock TTL | 30s | Prevents permanent lock ownership on worker or process failure |
| Jitter candidate range | 500ms ± 30% (350ms..650ms) | Break retry synchronization while keeping latency bounded |
| Queue redelivery cap | 100 (JetStream max deliver) | Stops infinite redelivery loops on persistent failures |
| Rollout gate | Observe p95/p99 lock wait before full rollout | Avoid trading contention improvements for unacceptable tail latency |
## Limitations and tradeoffs
Fixed delays are predictable, which helps debugging. Jitter smooths contention but turns one delay value into a distribution, which is harder to reason about in traces.
A wide jitter band can reduce collisions but hurt tail latency. Keep the jitter window bounded and measured.
Start with lock-busy errors only. Global jitter across all retries can introduce noisy behavior in unrelated paths.