
AI Agent Distributed Lock Fallback

The hard question is not how to acquire a lock. The hard question is what to do when lock acquisition infrastructure is unhealthy.

Deep Dive · 11 min read · Mar 2026
TL;DR
  • Cordum workflow locking falls back to local-only execution when distributed lock acquisition returns an error.
  • This behavior preserves progress but explicitly accepts cross-replica race risk during lock backend incidents.
  • Lock contention is treated differently from lock errors: contention skips work (`ok=false`), while lock errors still execute locally.
  • You need metrics and policy thresholds around fallback frequency, or a transient lock outage can become a correctness incident.
Failure mode

Redis lock path blips, replicas keep executing locally, and duplicate workflow updates appear minutes later.

Current behavior

Cordum increments `cordum_workflow_lock_fallback_total` and logs a high-signal error when distributed locking fails.

Operator payoff

You can choose where to sit on the availability-correctness line with explicit policy instead of accidental defaults.

Scope

This guide covers workflow run lock policy under lock-backend failure, not generic lock algorithm proofs.

The production problem

Most lock incidents are not caused by lock contention. They are caused by lock infrastructure uncertainty.

If Redis is briefly unreachable, your lock manager must choose: stop processing and preserve stronger exclusion, or continue processing and accept race risk.

That choice is a policy decision, not a Redis command decision.

If the choice is implicit, you discover your policy during the incident call, which is expensive and usually loud.

What top results cover and miss

| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders' Library: Avoiding fallback in distributed systems | Why fallback logic can amplify outages and how latent fallback bugs escape testing for long periods. | No concrete lock-manager policy where contention, lock errors, and release failures require different handling paths. |
| Redis docs: Distributed locks | Token-safe lock acquire/release semantics and safety/liveness framing. | No runbook for deciding fail-open versus fail-closed when lock infrastructure is temporarily unavailable. |
| Kleppmann: How to do distributed locking | Correctness risks under pauses/partitions and why lock assumptions can fail in production timing. | No implementation guidance for mixed-policy systems that must trade correctness against service continuity under outages. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
|---|---|---|
| Distributed lock success | `TryAcquireLock` returns a token; engine executes with distributed exclusion and best-effort renewal. | Normal path with cross-replica mutual exclusion. |
| Distributed lock contention | `TryAcquireLock` returns an empty token; lock manager returns `(nil, false)` and skips processing for that run. | Avoids duplicate execution when another replica already owns the lock. |
| Distributed lock error | If `TryAcquireLock` returns an error, engine increments `lock_fallback_total`, logs the risk, and continues with the local lock only. | Improves availability but allows cross-replica races during lock outage windows. |
| Renew cadence | Run lock TTL is 30s; renew ticker fires every TTL/3 (~10s) with a 2s renew timeout. | Keeps long-running sections alive while still expiring stale owners. |
| Release behavior | Distributed release failure is logged at warn level; the execution path does not roll back completed work. | Operational signal exists, but correctness depends on TTL expiry and idempotent downstream transitions. |

Fallback behavior in code

Fail-open on lock acquisition error

core/workflow/engine.go
```go
// core/workflow/engine.go (excerpt)
var lockFallbackTotal = prometheus.NewCounter(prometheus.CounterOpts{
	Namespace: "cordum",
	Subsystem: "workflow",
	Name:      "lock_fallback_total",
	Help:      "Number of times distributed lock acquisition failed and fell back to local-only locking.",
})

func (lm *lockManager) acquire(ctx context.Context, runID string) (release func(), ok bool) {
	// local mutex acquired first
	token, err := lm.locker.TryAcquireLock(ctx, key, runLockTTL)
	if err != nil {
		// fail-open: count the fallback, log loudly, and continue with the local lock only
		lockFallbackTotal.Inc()
		slog.Error("distributed lock failed — using local-only lock, cross-replica race possible",
			"run_id", runID, "error", err)
	} else if token == "" {
		// contention path: another replica owns the lock, skip this run
		return nil, false
	}
	// ... continue execution; caller receives a release func and ok=true
}
```

TTL renewal behavior

core/workflow/engine.go
```go
// core/workflow/engine.go (excerpt)
const runLockTTL = 30 * time.Second

if renewer, ok := lm.locker.(RunLockRenewer); ok {
	ticker := time.NewTicker(runLockTTL / 3) // renew well before expiry (~10s)
	defer ticker.Stop()
	for {
		select {
		case <-renewCtx.Done():
			return // run finished or was cancelled; stop renewing
		case <-ticker.C:
			rCtx, cancel := context.WithTimeout(renewCtx, 2*time.Second)
			if err := renewer.RenewLock(rCtx, key, token, runLockTTL); err != nil {
				slog.Warn("run lock renewal failed", "run_id", runID, "error", err)
			}
			cancel()
		}
	}
}
```

Existing distributed lock tests

core/workflow/lock_test.go
```go
// core/workflow/lock_test.go (excerpt)
func TestDistributedRunLock_MutualExclusion(t *testing.T) {
	// two lock managers share Redis; only one enters the critical section at a time
}

func TestDistributedRunLock_TTLExpiry(t *testing.T) {
	// lock expires after runLockTTL; another manager can then acquire
}

func TestDistributedRunLock_LocalFallback(t *testing.T) {
	// no RunLocker configured => local-only lock path still functions
}

func TestDistributedRunLock_Renewal(t *testing.T) {
	// key still exists after fast-forward beyond the initial TTL
}
```

Coverage gap to close

test-gap.md
```text
// coverage gap to consider
// There is no dedicated test that forces TryAcquireLock to return an error
// and then asserts:
//   1) lockFallbackTotal increments
//   2) local-only execution proceeds
//   3) duplicate cross-replica effects are bounded by downstream idempotency
```

Validation runbook

If you keep fail-open behavior, track it with SLO-level signals. Otherwise it becomes invisible technical debt.

lock-fallback-runbook.sh
```bash
# 1) Watch the fallback metric
# sum(rate(cordum_workflow_lock_fallback_total[5m])) by (pod)

# 2) Correlate with duplicate-processing signals
# - repeated step completions for the same run_id
# - repeated publish attempts for the same job_id

# 3) Check the lock backend health window
# redis-cli INFO replication
# redis-cli PING

# 4) Inspect logs for the explicit fallback signal
# "distributed lock failed — using local-only lock"

# 5) Escalation policy example
# - if fallback rate > 0.1/s for 5m and duplicate-run rate rises,
#   switch to fail-closed mode for high-risk workflows.

# 6) Recovery validation
# after lock backend recovery, the fallback metric should return to near-zero
```

Limitations and tradeoffs

| Approach | Upside | Downside |
|---|---|---|
| Fail open on lock error (current workflow path) | Higher continuity during transient lock backend outages. | Cross-replica races are possible; correctness burden moves to idempotency and conflict handling. |
| Fail closed on lock error | Stronger correctness posture under lock failures. | Reduced availability; stuck workflows until the lock service recovers. |
| Policy split by workflow risk | Keep low-risk flows moving while high-risk flows pause on lock errors. | More policy complexity and testing overhead. |
  • I found solid lock-path tests for success, TTL expiry, and renewal, but no direct test for lock-service error fallback and metric increments.
  • Fail-open policies need idempotent side effects. Without them, retries and races write different truths.
  • If fallback is rare and unexercised, it can become the least reliable code path in the system.

Next step

Implement this next:

  1. Add a runtime flag for fail-open versus fail-closed lock-error policy per workflow risk class.
  2. Add an integration test that forces `TryAcquireLock` errors and verifies fallback metric behavior.
  3. Define an on-call threshold for fallback rate and duplicate-run indicators.
  4. Run controlled chaos drills that drop lock backend connectivity for 30 to 120 seconds.

Continue with AI Agent Distributed Locking and AI Agent Lock Token Ownership.

Policy before incident

If your lock outage policy is undocumented, it still exists. It is just being decided in real time by whichever code path fires first.