
AI Agent Distributed Lock Fallback

The hard question is not how to acquire a lock. The hard question is what to do when lock acquisition infrastructure is unhealthy.

Deep Dive · 11 min read · Mar 2026
TL;DR
  • Cordum workflow locking falls back to local-only execution when distributed lock acquisition returns an error.
  • This behavior preserves progress but explicitly accepts cross-replica race risk during lock backend incidents.
  • Lock contention is treated differently from lock errors: contention skips work (`ok=false`), while lock errors still execute locally.
  • You need metrics and policy thresholds around fallback frequency, or a transient lock outage can become a correctness incident.
Failure mode

Redis lock path blips, replicas keep executing locally, and duplicate workflow updates appear minutes later.

Current behavior

Cordum increments `cordum_workflow_lock_fallback_total` and logs a high-signal error when distributed locking fails.

Operator payoff

You can choose where to sit on the availability-correctness line with explicit policy instead of accidental defaults.

Scope

This guide covers workflow run lock policy under lock-backend failure, not generic lock algorithm proofs.

The production problem

Most lock incidents are not caused by lock contention. They are caused by lock infrastructure uncertainty.

If Redis is briefly unreachable, your lock manager must choose: stop processing and preserve stronger exclusion, or continue processing and accept race risk.

That choice is a policy decision, not a Redis command decision.

If the choice is implicit, you discover your policy during the incident call, which is expensive and usually loud.

What top results cover and miss

| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders' Library: Avoiding fallback in distributed systems | Why fallback logic can amplify outages and how latent fallback bugs escape testing for long periods. | No concrete lock-manager policy where contention, lock errors, and release failures require different handling paths. |
| Redis docs: Distributed locks | Token-safe lock acquire/release semantics and safety/liveness framing. | No runbook for deciding fail-open versus fail-closed when lock infrastructure is temporarily unavailable. |
| Kleppmann: How to do distributed locking | Correctness risks under pauses/partitions and why lock assumptions can fail in production timing. | No implementation guidance for mixed-policy systems that must trade correctness against service continuity under outages. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
|---|---|---|
| Distributed lock success | `TryAcquireLock` returns a token; engine executes with distributed exclusion and best-effort renewal. | Normal path with cross-replica mutual exclusion. |
| Distributed lock contention | `TryAcquireLock` returns an empty token; lock manager returns `(nil, false)` and skips processing for that run. | Avoids duplicate execution when another replica already owns the lock. |
| Distributed lock error | If `TryAcquireLock` returns an error, engine increments `lock_fallback_total`, logs the risk, and continues with the local lock only. | Improves availability but allows cross-replica races during lock outage windows. |
| Renew cadence | Run lock TTL is 30s; renew ticker fires every TTL/3 (~10s) with a 2s renew timeout. | Keeps long-running sections alive while still expiring stale owners. |
| Release behavior | Distributed release failure is logged at warn level; the execution path does not roll back completed work. | Operational signal exists, but correctness depends on TTL expiry and idempotent downstream transitions. |

Fallback behavior in code

Fail-open on lock acquisition error

core/workflow/engine.go
```go
// core/workflow/engine.go (excerpt)
var lockFallbackTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "cordum",
	Subsystem: "workflow",
	Name:      "lock_fallback_total",
	Help:      "Number of times distributed lock acquisition failed and fell back to local-only locking.",
})

func (lm *lockManager) acquire(ctx context.Context, runID string) (release func(), ok bool) {
	// local mutex acquired first
	token, err := lm.locker.TryAcquireLock(ctx, key, runLockTTL)
	if err != nil {
		// fail-open: count the fallback, log loudly, and continue with the local lock only
		lockFallbackTotal.Inc()
		slog.Error("distributed lock failed — using local-only lock, cross-replica race possible",
			"run_id", runID, "error", err)
	} else if token == "" {
		// contention path: another replica owns the lock, skip this run
		return nil, false
	}
	// ... continue execution; caller receives a release func and ok=true
}
```

TTL renewal behavior

core/workflow/engine.go
```go
// core/workflow/engine.go (excerpt)
const runLockTTL = 30 * time.Second

if renewer, ok := lm.locker.(RunLockRenewer); ok {
	ticker := time.NewTicker(runLockTTL / 3) // renew well before expiry (~10s)
	defer ticker.Stop()
	for {
		select {
		case <-renewCtx.Done():
			return // run finished or was cancelled; stop renewing
		case <-ticker.C:
			rCtx, cancel := context.WithTimeout(renewCtx, 2*time.Second)
			if err := renewer.RenewLock(rCtx, key, token, runLockTTL); err != nil {
				slog.Warn("run lock renewal failed", "run_id", runID, "error", err)
			}
			cancel()
		}
	}
}
```

Existing distributed lock tests

core/workflow/lock_test.go
```go
// core/workflow/lock_test.go (excerpt)
func TestDistributedRunLock_MutualExclusion(t *testing.T) {
	// two lock managers share Redis; only one enters the critical section at a time
}

func TestDistributedRunLock_TTLExpiry(t *testing.T) {
	// lock expires after runLockTTL; another manager can then acquire
}

func TestDistributedRunLock_LocalFallback(t *testing.T) {
	// no RunLocker configured => local-only lock path still functions
}

func TestDistributedRunLock_Renewal(t *testing.T) {
	// key still exists after fast-forward beyond the initial TTL
}
```

Coverage gap to close

test-gap.md
```text
// coverage gap to consider
// There is no dedicated test that forces TryAcquireLock to return an error
// and then asserts:
//   1) lockFallbackTotal increments
//   2) local-only execution proceeds
//   3) duplicate cross-replica effects are bounded by downstream idempotency
```

Validation runbook

If you keep fail-open behavior, track it with SLO-level signals. Otherwise it becomes invisible technical debt.

lock-fallback-runbook.sh
```bash
# 1) Watch the fallback metric
# sum(rate(cordum_workflow_lock_fallback_total[5m])) by (pod)

# 2) Correlate with duplicate-processing signals
# - repeated step completions for the same run_id
# - repeated publish attempts for the same job_id

# 3) Check the lock backend health window
# redis-cli INFO replication
# redis-cli PING

# 4) Inspect logs for the explicit fallback signal
# "distributed lock failed — using local-only lock"

# 5) Escalation policy example
# - if fallback rate > 0.1/s for 5m and duplicate-run rate rises,
#   switch to fail-closed mode for high-risk workflows.

# 6) Recovery validation
# after lock backend recovery, the fallback metric should return to near-zero
```

Limitations and tradeoffs

| Approach | Upside | Downside |
|---|---|---|
| Fail open on lock error (current workflow path) | Higher continuity during transient lock backend outages. | Cross-replica races are possible; correctness burden moves to idempotency and conflict handling. |
| Fail closed on lock error | Stronger correctness posture under lock failures. | Reduced availability; stuck workflows until the lock service recovers. |
| Policy split by workflow risk | Keep low-risk flows moving while high-risk flows pause on lock errors. | More policy complexity and testing overhead. |
  • I found solid lock-path tests for success, TTL expiry, and renewal, but no direct test for lock-service error fallback and metric increments.
  • Fail-open policies need idempotent side effects. Without them, retries and races write different truths.
  • If fallback is rare and unexercised, it can become the least reliable code path in the system.

Next step

Implement this next:

  1. Add a runtime flag for fail-open versus fail-closed lock-error policy per workflow risk class.
  2. Add an integration test that forces `TryAcquireLock` errors and verifies fallback metric behavior.
  3. Define an on-call threshold for fallback rate and duplicate-run indicators.
  4. Run controlled chaos drills that drop lock backend connectivity for 30 to 120 seconds.

Continue with AI Agent Distributed Locking and AI Agent Lock Token Ownership.

Policy before incident

If your lock outage policy is undocumented, it still exists. It is just being decided in real time by whichever code path fires first.