## The production problem
Most lock incidents are not caused by lock contention. They are caused by lock infrastructure uncertainty.
If Redis is briefly unreachable, your lock manager must choose: stop processing and preserve stronger exclusion, or continue processing and accept race risk.
That choice is a policy decision, not a Redis command decision.
If the choice is implicit, you discover your policy during the incident call, which is expensive and usually loud.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders' Library: Avoiding fallback in distributed systems | Why fallback logic can amplify outages and how latent fallback bugs escape testing for long periods. | No concrete lock-manager policy where contention, lock errors, and release failures require different handling paths. |
| Redis docs: Distributed locks | Token-safe lock acquire/release semantics and safety/liveness framing. | No runbook for deciding fail-open versus fail-closed when lock infrastructure is temporarily unavailable. |
| Kleppmann: How to do distributed locking | Correctness risks under pauses/partitions and why lock assumptions can fail in production timing. | No implementation guidance for mixed policy systems that must trade correctness against service continuity under outages. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Distributed lock success | `TryAcquireLock` returns token; engine executes with distributed exclusion and best-effort renewal. | Normal path with cross-replica mutual exclusion. |
| Distributed lock contention | `TryAcquireLock` returns empty token; lock manager returns `(nil, false)` and skips processing for that run. | Avoids duplicate execution when another replica already owns the lock. |
| Distributed lock error | If `TryAcquireLock` returns an error, the engine increments `lock_fallback_total`, logs the risk, and continues with the local lock only. | Improves availability but allows cross-replica races during lock outage windows. |
| Renew cadence | Run lock TTL is 30s; renew ticker fires every TTL/3 (~10s) with 2s renew timeout. | Keeps long-running sections alive while still expiring stale owners. |
| Release behavior | Distributed release failure is logged as warn; execution path does not roll back completed work. | Operational signal exists, but correctness depends on TTL expiry and idempotent downstream transitions. |
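The "idempotent downstream transitions" that the release row depends on can be sketched as a compare-and-set completion guard: a (run, step) pair applies at most once, so a duplicate execution during a lock outage becomes a no-op instead of a double effect. Everything below is illustrative; `stateStore` and `completeStep` are hypothetical names, not the actual Cordum store.

```go
package main

import (
	"fmt"
	"sync"
)

// stateStore is a hypothetical in-memory stand-in for a run-state store
// with idempotent step transitions.
type stateStore struct {
	mu   sync.Mutex
	done map[string]bool
}

// completeStep returns true only for the first completion of a given
// (runID, step) pair; later attempts are rejected as duplicates.
func (s *stateStore) completeStep(runID, step string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	key := runID + "/" + step
	if s.done[key] {
		return false // duplicate: another replica already applied this step
	}
	s.done[key] = true
	return true
}

func main() {
	s := &stateStore{done: map[string]bool{}}
	fmt.Println(s.completeStep("run-1", "publish")) // true: first replica wins
	fmt.Println(s.completeStep("run-1", "publish")) // false: duplicate suppressed
}
```

With a guard like this downstream, the fail-open window degrades to wasted work rather than double effects.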
## Fallback behavior in code

### Fail-open on lock acquisition error

```go
// core/workflow/engine.go (excerpt)
var lockFallbackTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "cordum",
	Subsystem: "workflow",
	Name:      "lock_fallback_total",
	Help:      "Number of times distributed lock acquisition failed and fell back to local-only locking.",
})

func (lm *lockManager) acquire(runID string) (func(), bool) {
	// local mutex acquired first
	token, err := lm.locker.TryAcquireLock(ctx, key, runLockTTL)
	if err != nil {
		// fail open: count the fallback and continue with the local lock only
		lockFallbackTotal.Inc()
		slog.Error("distributed lock failed — using local-only lock, cross-replica race possible",
			"run_id", runID, "error", err)
	} else if token == "" {
		// contention path: skip this run
		return nil, false
	}
	// ... continue execution
}
```

### TTL renewal behavior
```go
// core/workflow/engine.go (excerpt)
const runLockTTL = 30 * time.Second

if renewer, ok := lm.locker.(RunLockRenewer); ok {
	ticker := time.NewTicker(runLockTTL / 3)
	defer ticker.Stop()
	for {
		select {
		case <-renewCtx.Done():
			return // run finished or engine shutting down; stop renewing
		case <-ticker.C:
			rCtx, cancel := context.WithTimeout(renewCtx, 2*time.Second)
			if err := renewer.RenewLock(rCtx, key, token, runLockTTL); err != nil {
				slog.Warn("run lock renewal failed", "run_id", runID, "error", err)
			}
			cancel()
		}
	}
}
```

### Existing distributed lock tests
```go
// core/workflow/lock_test.go (excerpt)

func TestDistributedRunLock_MutualExclusion(t *testing.T) {
	// two lock managers share Redis; only one enters the critical section at a time
}

func TestDistributedRunLock_TTLExpiry(t *testing.T) {
	// lock expires after runLockTTL; another manager can then acquire
}

func TestDistributedRunLock_LocalFallback(t *testing.T) {
	// no RunLocker configured => local-only lock path still functions
}

func TestDistributedRunLock_Renewal(t *testing.T) {
	// key still exists after fast-forward beyond the initial TTL
}
```

### Coverage gap to close
```go
// coverage gap to consider
// There is no dedicated test that forces TryAcquireLock to return an error
// and then asserts:
//   1) lockFallbackTotal increments
//   2) local-only execution proceeds
//   3) duplicate cross-replica effects are bounded by downstream idempotency
```
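A minimal sketch of that missing test path, under stated assumptions: `failingLocker` is a hypothetical stub for the real `RunLocker` interface, and a plain integer stands in for the Prometheus counter so the example stays self-contained.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// failingLocker is a hypothetical stand-in for the real RunLocker whose
// TryAcquireLock always errors, simulating a Redis outage window.
type failingLocker struct{}

func (failingLocker) TryAcquireLock(ctx context.Context, key string, ttl time.Duration) (string, error) {
	return "", errors.New("redis: connection refused")
}

// lockFallbackTotal is a plain int here; the engine uses a Prometheus counter.
var lockFallbackTotal int

// acquire mirrors the engine's fail-open policy: on lock error, count the
// fallback and proceed local-only; on contention (empty token), skip the run.
func acquire(l failingLocker, runID string) bool {
	token, err := l.TryAcquireLock(context.Background(), "run:"+runID, 30*time.Second)
	if err != nil {
		lockFallbackTotal++
		return true // fail open: local-only execution proceeds
	}
	if token == "" {
		return false // contention: another replica owns the lock
	}
	return true
}

func main() {
	ok := acquire(failingLocker{}, "run-1")
	fmt.Println(ok, lockFallbackTotal) // true 1
}
```

A real version would use the actual `lockManager` with a stubbed backend and assert the counter via `prometheus/testutil`, plus a downstream idempotency check for point 3.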
## Validation runbook
If you keep fail-open behavior, track it with SLO-level signals. Otherwise it becomes invisible technical debt.
```bash
# 1) Watch the fallback metric
#    sum(rate(cordum_workflow_lock_fallback_total[5m])) by (pod)

# 2) Correlate with duplicate processing signals
#    - repeated step completions for the same run_id
#    - repeated publish attempts for the same job_id

# 3) Check the lock backend health window
redis-cli INFO replication
redis-cli PING

# 4) Inspect logs for the explicit fallback signal
#    "distributed lock failed — using local-only lock"

# 5) Escalation policy example
#    - if fallback rate > 0.1/s for 5m and duplicate-run rate rises,
#      switch to fail-closed mode for high-risk workflows.

# 6) Recovery validation
#    after lock backend recovery, the fallback metric should return to near-zero
```
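The escalation rule in step 5 reduces to a small predicate. This is a sketch of the runbook's example thresholds only, not engine code; `shouldFailClosed` is a hypothetical helper.

```go
package main

import "fmt"

// shouldFailClosed encodes the runbook's example escalation rule: a
// sustained fallback rate above 0.1/s combined with rising duplicate-run
// indicators. Thresholds are example values, not engine constants.
func shouldFailClosed(fallbackRatePerSec float64, duplicateRunsRising bool) bool {
	return fallbackRatePerSec > 0.1 && duplicateRunsRising
}

func main() {
	fmt.Println(shouldFailClosed(0.25, true))  // true: switch high-risk flows to fail-closed
	fmt.Println(shouldFailClosed(0.25, false)) // false: fallback alone is not enough
}
```

Requiring both signals avoids flapping to fail-closed on a brief backend blip that never produced a duplicate effect.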
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Fail open on lock error (current workflow path) | Higher continuity during transient lock backend outages. | Cross-replica races are possible; correctness burden moves to idempotency and conflict handling. |
| Fail closed on lock error | Stronger correctness posture under lock failures. | Reduced availability; stuck workflows until lock service recovers. |
| Policy split by workflow risk | Keep low-risk flows moving while high-risk flows pause on lock errors. | More policy complexity and testing overhead. |
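The policy-split row could take a shape like the following. A sketch only: `LockErrorPolicy` and `policyFor` are hypothetical names, the risk classes are illustrative, and today the engine hard-codes fail-open on lock errors.

```go
package main

import "fmt"

// LockErrorPolicy is a hypothetical per-workflow setting for what the
// engine does when TryAcquireLock returns an error.
type LockErrorPolicy int

const (
	FailOpen   LockErrorPolicy = iota // continue with the local-only lock
	FailClosed                        // skip the run until the backend recovers
)

// policyFor maps a workflow risk class to a lock-error policy.
// Risk-class names here are illustrative, not part of the engine.
func policyFor(riskClass string) LockErrorPolicy {
	if riskClass == "high" { // e.g. state-mutating or externally visible workflows
		return FailClosed
	}
	return FailOpen
}

func main() {
	fmt.Println(policyFor("high") == FailClosed) // true
	fmt.Println(policyFor("low") == FailOpen)    // true
}
```

The testing overhead the table warns about follows directly: every lock-error test now needs to run once per policy.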
- I found solid lock-path tests for success, TTL expiry, and renewal, but no direct test for lock-service error fallback and metric increments.
- Fail-open policies need idempotent side effects. Without that, retries and races write different truths.
- If fallback is rare and unexercised, it can become the least reliable code path in the system.
## Next step
Implement this next:
1. Add a runtime flag for fail-open versus fail-closed lock-error policy per workflow risk class.
2. Add an integration test that forces `TryAcquireLock` errors and verifies fallback metric behavior.
3. Define an on-call threshold for fallback rate and duplicate-run indicators.
4. Run controlled chaos drills that drop lock backend connectivity for 30 to 120 seconds.
Continue with AI Agent Distributed Locking and AI Agent Lock Token Ownership.