## The production problem
Distributed locks usually fail after the design review, not during it. Everything looks correct until one worker pauses, a network packet arrives late, or Redis latency spikes at 2:17 AM.
A lock is only part of the safety story. You still need stale-writer defense and clear operational checks when replicas disagree about ownership.
If the lock strategy is hand-wavy, incident response becomes archaeology: engineers end up reading source code while production queues keep growing.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis distributed lock patterns | Lease basics (`SET ... NX PX`), mutual exclusion goal, and Redlock algorithm shape. | No operator playbook for diagnosing lock debt in AI job schedulers with multi-tenant pressure. |
| How to do distributed locking (Kleppmann) | Why lock leases expire under pauses and why fencing tokens are needed for correctness-critical paths. | No concrete lock-key runbook for Redis-backed orchestration systems in day-2 operations. |
| etcd concurrency lock API | Lease-tied lock ownership and explicit concurrency API for distributed coordination. | No mixed-pattern guidance for per-item explicit release plus per-loop TTL hold used by agent schedulers. |
The gap is not lock acquisition syntax. The gap is combining lock correctness with operator-facing runbooks and measurable reliability signals in autonomous workloads.
## Lock model that survives incidents
Pick lock pattern by failure blast radius. If duplicate work is cheap, lease-only locking may be enough. If duplicate work corrupts state, add fencing semantics at write boundaries.
| Pattern | Best fit | Common failure | Mitigation |
|---|---|---|---|
| Per-item lock + explicit release | Single state transitions (job state change, one workflow step) | Lock expires mid-operation without renewal | Renew every ttl/3 and cap consecutive renewal failures. |
| Per-loop lock + TTL hold | Periodic reconcilers and cleanup loops in multi-replica deployments | Double-processing in the same poll cycle if released too fast | Hold lock for full TTL window, let expiry rotate ownership. |
| Lock + fencing at write boundary | Correctness-critical writes where stale workers must be rejected | Paused worker writes after lease expiry | Monotonic token checked by storage service on every write. |
| Local mutex + distributed lock | High-contention workflow runs across goroutines and replicas | Extra Redis round trips and contention spikes | Gate intra-process work first with local mutex, then cross-replica lock. |
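The last row of the table can be sketched end to end. This is a minimal illustration, not Cordum's actual code: `RunGuard`, `DistLock`, and the in-memory `memLock` are invented names, and a production `DistLock` implementation would wrap a Redis `SET ... NX PX` lease.

```go
package main

import (
	"errors"
	"sync"
)

var ErrBusy = errors.New("lock busy")

// DistLock abstracts the cross-replica lock (in production, a Redis
// SET NX PX lease keyed per workflow run).
type DistLock interface {
	TryAcquire(key string) (release func(), err error)
}

// memLock is an in-memory stand-in so the pattern can be exercised locally.
type memLock struct {
	mu   sync.Mutex
	held map[string]bool
}

func newMemLock() *memLock { return &memLock{held: map[string]bool{}} }

func (m *memLock) TryAcquire(key string) (func(), error) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if m.held[key] {
		return nil, ErrBusy
	}
	m.held[key] = true
	return func() {
		m.mu.Lock()
		defer m.mu.Unlock()
		delete(m.held, key)
	}, nil
}

// RunGuard applies the two-layer pattern: a per-run local mutex absorbs
// goroutine contention before any cross-replica round trip is attempted.
type RunGuard struct {
	locals sync.Map // runID -> *sync.Mutex
	dist   DistLock
}

func (g *RunGuard) With(runID string, fn func() error) error {
	v, _ := g.locals.LoadOrStore(runID, &sync.Mutex{})
	local := v.(*sync.Mutex)
	local.Lock() // layer 1: exclude goroutines in this replica
	defer local.Unlock()

	release, err := g.dist.TryAcquire("wf:run:lock:" + runID)
	if err != nil {
		return err // layer 2 busy: another replica owns this run
	}
	defer release()
	return fn()
}
```

The design point is ordering: the local mutex is taken first, so only one goroutine per replica ever reaches Redis, which is what keeps round trips and contention spikes down.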
## Cordum runtime lock behavior
Cordum uses different lock behaviors per code path. That is intentional. A one-size lock policy would either reduce throughput or increase duplicate processing risk.
| Lock key | Current behavior | Why it matters |
|---|---|---|
| `cordum:scheduler:job:<id>` | 60s TTL, explicit release, renewal every 20s, abandon renewal after 3 consecutive failures. | Protects per-job transitions while avoiding permanent lock ownership on crash. |
| `cordum:reconciler:default` | TTL = 2x poll interval (default poll 30s), no explicit release after tick. | Prevents two replicas from running reconciler work inside the same window. |
| `cordum:wf:run:lock:<runID>` | 30s TTL with 10s renewal, paired with local mutex for two-layer exclusion. | Coordinates workflow run updates across goroutines and across replicas. |
| `cordum:scheduler:snapshot:writer` | 10s TTL, writer runs every 5s, crash failover window around 15s. | Avoids competing snapshot writes and limits stale worker registry risk. |
The less glamorous detail is critical: reconciler-style loops hold lock ownership until TTL expiry to avoid same-cycle duplicate ticks. Per-item transitions still release explicitly.
## Working implementation examples
### Acquire lease lock and attach fence token (Go)
```go
import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// ErrLockBusy signals that another owner currently holds the lease.
var ErrLockBusy = errors.New("lock busy")

type LeaseLock struct {
	Key        string
	OwnerToken string
	FenceToken int64
	TTL        time.Duration
}

func AcquireLeaseLock(ctx context.Context, rdb *redis.Client, key string, ttl time.Duration) (*LeaseLock, error) {
	owner := uuid.NewString()
	ok, err := rdb.SetNX(ctx, key, owner, ttl).Result()
	if err != nil {
		return nil, fmt.Errorf("acquire lock: %w", err)
	}
	if !ok {
		return nil, ErrLockBusy
	}
	// Monotonic fence token in the same store. For stricter correctness,
	// generate this in a consensus-backed store.
	fence, err := rdb.Incr(ctx, "fence:"+key).Result()
	if err != nil {
		_ = rdb.Del(ctx, key).Err()
		return nil, fmt.Errorf("fence token: %w", err)
	}
	return &LeaseLock{
		Key: key, OwnerToken: owner, FenceToken: fence, TTL: ttl,
	}, nil
}
```

### Reject stale writers at storage boundary (SQL)
```sql
-- Reject stale writer attempts using fencing token monotonicity
UPDATE workflow_runs
SET payload = $2,
    fence_token = $3,
    updated_at = NOW()
WHERE run_id = $1
  AND $3 > fence_token;
-- rows_affected = 0 means stale lock holder or concurrent newer writer
```

### Incident diagnostics for lock ownership (Bash)
```bash
# Duplicate dispatch checks
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# Workflow run contention checks
redis-cli GET "cordum:wf:run:lock:RUN_ID"
redis-cli GET "cordum:workflow-engine:reconciler:default"

# Snapshot writer leadership checks
redis-cli GET "cordum:scheduler:snapshot:writer"
redis-cli OBJECT IDLETIME "sys:workers:snapshot"
```
### Metrics to alert on lock stress (PromQL)
```promql
# Lock contention pressure
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Recovery debt
stale_jobs
rate(orphan_replayed_total[5m])

# Fail-open path pressure
rate(cordum_scheduler_input_fail_open_total[5m])
```
## Limitations and tradeoffs
- Lease locking alone cannot prevent stale writers after long pauses; fencing checks are still needed.
- Fence token generation in non-consensus stores can become a correctness bottleneck in extreme failure modes.
- TTL-hold patterns reduce duplicate loops but can delay failover by one TTL window.
- High lock cardinality increases Redis load and operator noise during outages.
If the lock protects correctness-critical writes, treat lock service design as a data-consistency concern, not a queueing concern. The pager does not care which team owns the architecture diagram.
## Next step
Do this in one sprint:
1. Inventory every distributed lock key and classify it as per-item explicit release or per-loop TTL hold.
2. Add stale-writer protection for the top 3 correctness-critical write paths.
3. Add alerts on `job_lock_wait`, `stale_jobs`, and `orphan_replayed` trend acceleration.
4. Run one chaos drill: pause a worker for longer than the lock TTL and verify stale writes are rejected.
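The chaos drill's rejection path can be rehearsed without touching Redis by modeling the storage boundary in memory. `FencedStore` is a hypothetical stand-in that applies the same strictly-greater fence check as the SQL example above:

```go
package main

import "sync"

// FencedStore is an in-memory stand-in for the storage boundary:
// a write lands only if its fence token is strictly newer than the
// last accepted one for that run.
type FencedStore struct {
	mu    sync.Mutex
	fence map[string]int64
	data  map[string]string
}

func NewFencedStore() *FencedStore {
	return &FencedStore{fence: map[string]int64{}, data: map[string]string{}}
}

// Write reports whether the payload was accepted; false means the caller
// is a stale lock holder or a concurrent newer writer already won.
func (s *FencedStore) Write(runID, payload string, token int64) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	if token <= s.fence[runID] {
		return false
	}
	s.fence[runID] = token
	s.data[runID] = payload
	return true
}

func (s *FencedStore) Read(runID string) string {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.data[runID]
}
```

The drill then reads: worker A writes with token 1, pauses past its TTL, worker B acquires the lock and writes with token 2, and A's late retry with token 1 must come back rejected.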
Continue with AI Agent Idempotency Keys and AI Agent Incident Response Runbook.