
AI Agent Distributed Locking

Lease locks fail in boring ways. Build for paused workers, delayed packets, and operator fatigue.

Guide · 12 min read · Mar 2026
TL;DR
- A lease lock without fencing can still allow stale writers after long pauses or delayed packets.
- Cordum uses both explicit-release locks and TTL-hold locks, depending on whether the work is per-item or per-loop.
- Monitor lock wait and stale-job debt, not just lock acquire success.
- Prepare runbook commands for lock keys before incidents, not during them.
Lease discipline

Set TTL and renewal cadence from real operation duration, not hopeful averages.

Stale-writer defense

Use fencing semantics at write boundaries when correctness is critical.

Operator diagnostics

Keep lock key checks and recovery thresholds codified in your runbook.

Scope

This guide targets autonomous AI agent systems that run across multiple replicas and need deterministic coordination under retries, failover, and partial outages.

The production problem

Distributed locks usually fail after the design review, not during it. Everything looks correct until one worker pauses, a network packet arrives late, or Redis latency spikes at 2:17 AM.

A lock is only part of the safety story. You still need stale-writer defense and clear operational checks when replicas disagree about ownership.

If lock strategy is hand-wavy, incident response becomes archaeology. Engineers end up reading source code while production queues keep growing.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Redis distributed lock patterns | Lease basics (`SET ... NX PX`), mutual exclusion goal, and Redlock algorithm shape. | No operator playbook for diagnosing lock debt in AI job schedulers with multi-tenant pressure. |
| How to do distributed locking (Kleppmann) | Why lock leases expire under pauses and why fencing tokens are needed for correctness-critical paths. | No concrete lock-key runbook for Redis-backed orchestration systems in day-2 operations. |
| etcd concurrency lock API | Lease-tied lock ownership and explicit concurrency API for distributed coordination. | No mixed-pattern guidance for per-item explicit release plus per-loop TTL hold used by agent schedulers. |

The gap is not lock acquisition syntax. The gap is combining lock correctness with operator-facing runbooks and measurable reliability signals in autonomous workloads.

Lock model that survives incidents

Pick lock pattern by failure blast radius. If duplicate work is cheap, lease-only locking may be enough. If duplicate work corrupts state, add fencing semantics at write boundaries.

| Pattern | Best fit | Common failure | Mitigation |
| --- | --- | --- | --- |
| Per-item lock + explicit release | Single state transitions (job state change, one workflow step) | Lock expires mid-operation without renewal | Renew every ttl/3 and cap consecutive renewal failures. |
| Per-loop lock + TTL hold | Periodic reconcilers and cleanup loops in multi-replica deployments | Double-processing in the same poll cycle if released too fast | Hold the lock for the full TTL window; let expiry rotate ownership. |
| Lock + fencing at write boundary | Correctness-critical writes where stale workers must be rejected | Paused worker writes after lease expiry | Monotonic token checked by the storage service on every write. |
| Local mutex + distributed lock | High-contention workflow runs across goroutines and replicas | Extra Redis round trips and contention spikes | Gate intra-process work first with a local mutex, then take the cross-replica lock. |

Cordum runtime lock behavior

Cordum uses different lock behaviors per code path. That is intentional. A one-size lock policy would either reduce throughput or increase duplicate processing risk.

| Lock key | Current behavior | Why it matters |
| --- | --- | --- |
| `cordum:scheduler:job:<id>` | 60s TTL, explicit release, renewal every 20s, abandon renewal after 3 consecutive failures. | Protects per-job transitions while avoiding permanent lock ownership on crash. |
| `cordum:reconciler:default` | TTL = 2x poll interval (default poll 30s), no explicit release after tick. | Prevents two replicas from running reconciler work inside the same window. |
| `cordum:wf:run:lock:<runID>` | 30s TTL with 10s renewal, paired with local mutex for two-layer exclusion. | Coordinates workflow run updates across goroutines and across replicas. |
| `cordum:scheduler:snapshot:writer` | 10s TTL, writer runs every 5s, crash failover window around 15s. | Avoids competing snapshot writes and limits stale worker registry risk. |

The less glamorous detail is critical: reconciler-style loops hold lock ownership until TTL expiry to avoid same-cycle duplicate ticks. Per-item transitions still release explicitly.

Working implementation examples

Acquire lease lock and attach fence token (Go)

lease_lock.go
Go
package lock

import (
	"context"
	"errors"
	"fmt"
	"time"

	"github.com/google/uuid"
	"github.com/redis/go-redis/v9"
)

// ErrLockBusy signals that another owner currently holds the key.
var ErrLockBusy = errors.New("lease lock: busy")

type LeaseLock struct {
  Key         string
  OwnerToken  string
  FenceToken  int64
  TTL         time.Duration
}

func AcquireLeaseLock(ctx context.Context, rdb *redis.Client, key string, ttl time.Duration) (*LeaseLock, error) {
  owner := uuid.NewString()
  ok, err := rdb.SetNX(ctx, key, owner, ttl).Result()
  if err != nil {
    return nil, fmt.Errorf("acquire lock: %w", err)
  }
  if !ok {
    return nil, ErrLockBusy
  }

  // Monotonic fence token in the same store. For stricter correctness,
  // generate this in a consensus-backed store.
  fence, err := rdb.Incr(ctx, "fence:"+key).Result()
  if err != nil {
    _ = rdb.Del(ctx, key).Err()
    return nil, fmt.Errorf("fence token: %w", err)
  }

  return &LeaseLock{
    Key: key, OwnerToken: owner, FenceToken: fence, TTL: ttl,
  }, nil
}

Reject stale writers at storage boundary (SQL)

fence_guard.sql
SQL
-- Reject stale writer attempts using fencing token monotonicity
UPDATE workflow_runs
SET payload = $2,
    fence_token = $3,
    updated_at = NOW()
WHERE run_id = $1
  AND $3 > fence_token;

-- rows_affected = 0 means stale lock holder or concurrent newer writer

Incident diagnostics for lock ownership (Bash)

lock_runbook.sh
Bash
# Duplicate dispatch checks
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# Workflow run contention checks
redis-cli GET "cordum:wf:run:lock:RUN_ID"
redis-cli GET "cordum:workflow-engine:reconciler:default"

# Snapshot writer leadership checks
redis-cli GET "cordum:scheduler:snapshot:writer"
redis-cli OBJECT IDLETIME "sys:workers:snapshot"

Metrics to alert on lock stress (PromQL)

lock_health.promql
PromQL
# Lock contention pressure
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Recovery debt
stale_jobs
rate(orphan_replayed_total[5m])

# Fail-open path pressure
rate(cordum_scheduler_input_fail_open_total[5m])

Limitations and tradeoffs

- Lease locking alone cannot prevent stale writers after long pauses; fencing checks are still needed.
- Fence token generation in non-consensus stores can become a correctness bottleneck in extreme failure modes.
- TTL-hold patterns reduce duplicate loops but can delay failover by one TTL window.
- High lock cardinality increases Redis load and operator noise during outages.

If the lock protects correctness-critical writes, treat lock service design as a data-consistency concern, not a queueing concern. The pager does not care which team owns the architecture diagram.

Next step

Do this in one sprint:

  1. Inventory every distributed lock key and classify it as per-item explicit release or per-loop TTL hold.
  2. Add stale-writer protection for the top 3 correctness-critical write paths.
  3. Add alerts on `job_lock_wait`, `stale_jobs`, and `orphan_replayed` trend acceleration.
  4. Run one chaos drill: pause a worker for longer than lock TTL and verify stale writes are rejected.

Continue with AI Agent Idempotency Keys and AI Agent Incident Response Runbook.

Locking without runbooks is optimism

Add lock-key diagnostics and stale-writer protections now, while the system is healthy and your coffee is still warm.