## The production problem
Distributed locks usually fail after the design review, not during it. Everything looks correct until one worker pauses, a network packet arrives late, or Redis latency spikes at 2:17 AM.
A lock is only part of the safety story. You still need stale-writer defense and clear operational checks when replicas disagree about ownership.
If the lock strategy is hand-wavy, incident response becomes archaeology: engineers end up reading source code while production queues keep growing.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis distributed lock patterns | Lease basics (`SET ... NX PX`), mutual exclusion goal, and Redlock algorithm shape. | No operator playbook for diagnosing lock debt in AI job schedulers with multi-tenant pressure. |
| How to do distributed locking (Kleppmann) | Why lock leases expire under pauses and why fencing tokens are needed for correctness-critical paths. | No concrete lock-key runbook for Redis-backed orchestration systems in day-2 operations. |
| etcd concurrency lock API | Lease-tied lock ownership and explicit concurrency API for distributed coordination. | No mixed-pattern guidance for per-item explicit release plus per-loop TTL hold used by agent schedulers. |
The gap is not lock acquisition syntax. The gap is combining lock correctness with operator-facing runbooks and measurable reliability signals in autonomous workloads.
## Lock model that survives incidents
Pick lock pattern by failure blast radius. If duplicate work is cheap, lease-only locking may be enough. If duplicate work corrupts state, add fencing semantics at write boundaries.
| Pattern | Best fit | Common failure | Mitigation |
|---|---|---|---|
| Per-item lock + explicit release | Single state transitions (job state change, one workflow step) | Lock expires mid-operation without renewal | Renew every ttl/3 and cap consecutive renewal failures. |
| Per-loop lock + TTL hold | Periodic reconcilers and cleanup loops in multi-replica deployments | Double-processing in the same poll cycle if released too fast | Hold lock for full TTL window, let expiry rotate ownership. |
| Lock + fencing at write boundary | Correctness-critical writes where stale workers must be rejected | Paused worker writes after lease expiry | Monotonic token checked by storage service on every write. |
| Local mutex + distributed lock | High-contention workflow runs across goroutines and replicas | Extra Redis round trips and contention spikes | Gate intra-process work first with local mutex, then cross-replica lock. |
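The last row of the table can be sketched end to end. This is a minimal illustration, not Cordum's actual code: `RunGuard`, `DistLock`, and the in-memory `memLock` are invented names, and a production `DistLock` implementation would wrap a Redis `SET ... NX PX` lease.

```go
package main

import (
	"errors"
	"sync"
)

var ErrBusy = errors.New("lock busy")

// DistLock abstracts the cross-replica lock (in production, a Redis
// SET NX PX lease keyed per workflow run).
type DistLock interface {
	TryAcquire(key string) (release func(), err error)
}

// memLock is an in-memory stand-in so the pattern can be exercised locally.
type memLock struct {
	mu   sync.Mutex
	held map[string]bool
}

func newMemLock() *memLock { return &memLock{held: map[string]bool{}} }

func (m *memLock) TryAcquire(key string) (func(), error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.held[key] {
		return nil, ErrBusy
	}
	m.held[key] = true
	return func() {
		m.mu.Lock()
		defer m.mu.Unlock()
		delete(m.held, key)
	}, nil
}

// RunGuard applies the two-layer pattern: a per-run local mutex absorbs
// goroutine contention before any cross-replica round trip is attempted.
type RunGuard struct {
	locals sync.Map // runID -> *sync.Mutex
	dist   DistLock
}

func (g *RunGuard) With(runID string, fn func() error) error {
	v, _ := g.locals.LoadOrStore(runID, &sync.Mutex{})
	local := v.(*sync.Mutex)
	local.Lock() // layer 1: exclude goroutines in this replica
	defer local.Unlock()

	release, err := g.dist.TryAcquire("wf:run:lock:" + runID)
	if err != nil {
		return err // layer 2 busy: another replica owns this run
	}
	defer release()
	return fn()
}
```

The design point is ordering: the local mutex is taken first, so only one goroutine per replica ever reaches Redis, which is what keeps round trips and contention spikes down.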
## Cordum runtime lock behavior
Cordum uses different lock behaviors per code path. That is intentional. A one-size lock policy would either reduce throughput or increase duplicate processing risk.
| Lock key | Current behavior | Why it matters |
|---|---|---|
| `cordum:scheduler:job:<id>` | 60s TTL, explicit release, renewal every 20s, abandon renewal after 3 consecutive failures. | Protects per-job transitions while avoiding permanent lock ownership on crash. |
| `cordum:reconciler:default` | TTL = 2x poll interval (default poll 30s), no explicit release after tick. | Prevents two replicas from running reconciler work inside the same window. |
| `cordum:wf:run:lock:<runID>` | 30s TTL with 10s renewal, paired with local mutex for two-layer exclusion. | Coordinates workflow run updates across goroutines and across replicas. |
| `cordum:scheduler:snapshot:writer` | 10s TTL, writer runs every 5s, crash failover window around 15s. | Avoids competing snapshot writes and limits stale worker registry risk. |
The less glamorous detail is critical: reconciler-style loops hold lock ownership until TTL expiry to avoid same-cycle duplicate ticks. Per-item transitions still release explicitly.
## Working implementation examples
### Acquire lease lock and attach fence token (Go)
```go
import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// ErrLockBusy signals that another owner currently holds the lease.
var ErrLockBusy = errors.New("lock busy")

type LeaseLock struct {
	Key        string
	OwnerToken string
	FenceToken int64
	TTL        time.Duration
}

func AcquireLeaseLock(ctx context.Context, rdb *redis.Client, key string, ttl time.Duration) (*LeaseLock, error) {
	owner := uuid.NewString()
	ok, err := rdb.SetNX(ctx, key, owner, ttl).Result()
	if err != nil {
		return nil, fmt.Errorf("acquire lock: %w", err)
	}
	if !ok {
		return nil, ErrLockBusy
	}
	// Monotonic fence token in the same store. For stricter correctness,
	// generate this in a consensus-backed store.
	fence, err := rdb.Incr(ctx, "fence:"+key).Result()
	if err != nil {
		_ = rdb.Del(ctx, key).Err()
		return nil, fmt.Errorf("fence token: %w", err)
	}
	return &LeaseLock{
		Key: key, OwnerToken: owner, FenceToken: fence, TTL: ttl,
	}, nil
}
```

### Reject stale writers at storage boundary (SQL)
```sql
-- Reject stale writer attempts using fencing token monotonicity
UPDATE workflow_runs
SET payload = $2,
    fence_token = $3,
    updated_at = NOW()
WHERE run_id = $1
  AND $3 > fence_token;
-- rows_affected = 0 means stale lock holder or concurrent newer writer
```

### Incident diagnostics for lock ownership (Bash)
```bash
# Duplicate dispatch checks
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# Workflow run contention checks
redis-cli GET "cordum:wf:run:lock:RUN_ID"
redis-cli GET "cordum:workflow-engine:reconciler:default"

# Snapshot writer leadership checks
redis-cli GET "cordum:scheduler:snapshot:writer"
redis-cli OBJECT IDLETIME "sys:workers:snapshot"
```
### Metrics to alert on lock stress (PromQL)
```promql
# Lock contention pressure
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Recovery debt
stale_jobs
rate(orphan_replayed_total[5m])

# Fail-open path pressure
rate(cordum_scheduler_input_fail_open_total[5m])
```
## Limitations and tradeoffs
- Lease locking alone cannot prevent stale writers after long pauses; fencing checks are still needed.
- Fence token generation in non-consensus stores can become a correctness bottleneck in extreme failure modes.
- TTL-hold patterns reduce duplicate loops but can delay failover by one TTL window.
- High lock cardinality increases Redis load and operator noise during outages.
If the lock protects correctness-critical writes, treat lock service design as a data-consistency concern, not a queueing concern. The pager does not care which team owns the architecture diagram.
## Next step
Do this in one sprint:
1. Inventory every distributed lock key and classify it as per-item explicit release or per-loop TTL hold.
2. Add stale-writer protection for the top 3 correctness-critical write paths.
3. Add alerts on `job_lock_wait`, `stale_jobs`, and `orphan_replayed` trend acceleration.
4. Run one chaos drill: pause a worker for longer than the lock TTL and verify stale writes are rejected.
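The chaos drill's rejection path can be rehearsed without touching Redis by modeling the storage boundary in memory. `FencedStore` is a hypothetical stand-in that applies the same strictly-greater fence check as the SQL example above:

```go
package main

import "sync"

// FencedStore is an in-memory stand-in for the storage boundary:
// a write lands only if its fence token is strictly newer than the
// last accepted one for that run.
type FencedStore struct {
	mu    sync.Mutex
	fence map[string]int64
	data  map[string]string
}

func NewFencedStore() *FencedStore {
	return &FencedStore{fence: map[string]int64{}, data: map[string]string{}}
}

// Write reports whether the payload was accepted; false means the caller
// is a stale lock holder or a concurrent newer writer already won.
func (s *FencedStore) Write(runID, payload string, token int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token <= s.fence[runID] {
		return false
	}
	s.fence[runID] = token
	s.data[runID] = payload
	return true
}

func (s *FencedStore) Read(runID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[runID]
}
```

The drill then reads: worker A writes with token 1, pauses past its TTL, worker B acquires the lock and writes with token 2, and A's late retry with token 1 must come back rejected.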
Continue with AI Agent Idempotency Keys and AI Agent Incident Response Runbook.