## The production problem
Lock incidents often look harmless in logs and destructive in outcomes.
A slow worker times out, another worker acquires the lock, then the slow worker finally calls release. If release is plain `DEL key`, the new owner just lost its lock.
The damage is not theoretical. It appears as duplicate dispatch, out-of-order state transitions, and occasional cannot-reproduce outage reports.
Redis does not care that your pod had a long GC pause. The key either matches your token or it does not.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis docs: Distributed locks with random values | SET NX PX with per-lock random value and compare-on-release script guidance. | No concrete incident runbook for repeated `lock not owned` bursts across workflow and approval paths. |
| Kleppmann: How to do distributed locking | Why lease expiry and process pauses break naive lock assumptions, and why fencing exists. | No practical mapping from theory to lock API responses, metrics, and retry behavior in control planes. |
| etcd docs: Why lease is not mutual exclusion | Explicitly states lease alone does not guarantee exclusion; revision validation is required. | Does not show how a Redis-based platform can enforce token ownership and surface failure semantics to operators. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Lock acquisition token | `TryAcquireLock` uses `SetNX` and returns a new UUID token only when lock acquisition succeeds. | Each owner has a unique token, so a prior holder cannot safely release by key alone. |
| Release ownership check | Release path runs Lua: delete only if current value equals caller token; mismatch returns `lock not owned`. | Prevents stale workers from releasing a new owner's lock after TTL expiry and ownership transfer. |
| Renew ownership check | Renew path runs Lua: extend TTL only when key still contains caller token. | Avoids extending someone else's lock when a stale owner retries renew. |
| Approval lock envelope | Approval lock TTL is 10s; acquire attempts stop after ~2s with 25ms backoff; release uses bounded 2s context. | Limits request tail latency and avoids hanging handlers on lock infrastructure issues. |
| Workflow run lock renewal | Workflow run locks use 30s TTL and renew every TTL/3 (~10s) with 2s renew timeout. | Keeps long handlers alive while still expiring stale holders if renew keeps failing. |
## Ownership checks in code

### Compare-and-release scripts
```go
// core/infra/store/job_store.go (excerpt)
var releaseLockScript = redis.NewScript(`
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
`)

var renewLockScript = redis.NewScript(`
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('pexpire', KEYS[1], ARGV[2])
end
return 0
`)
```

### Release and renew error semantics
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) ReleaseLock(ctx context.Context, key, token string) error {
	if token == "" {
		return fmt.Errorf("lock token required")
	}
	result, err := releaseLockScript.Run(ctx, s.client, []string{key}, token).Int()
	if err != nil {
		return fmt.Errorf("job store release lock %s: %w", key, err)
	}
	if result == 0 {
		return fmt.Errorf("lock not owned")
	}
	return nil
}

func (s *RedisJobStore) RenewLock(ctx context.Context, key, token string, ttl time.Duration) error {
	result, err := renewLockScript.Run(ctx, s.client, []string{key}, token, ttl.Milliseconds()).Int()
	if err != nil {
		return fmt.Errorf("job store renew lock %s: %w", key, err)
	}
	if result == 0 {
		return fmt.Errorf("lock not owned")
	}
	return nil
}
```

### Approval lock envelope
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
const approvalLockTTL = 10 * time.Second

func (s *server) withApprovalLock(ctx context.Context, jobID string, fn func(ctx context.Context) error) error {
	key := approvalLockPrefix + jobID
	deadline := time.Now().Add(2 * time.Second)
	for {
		lockCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		token, err := s.jobStore.TryAcquireLock(lockCtx, key, approvalLockTTL)
		cancel()
		if err != nil {
			return fmt.Errorf("lock acquire: %w", err)
		}
		if token != "" {
			defer func() {
				releaseCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
				defer cancel()
				if rErr := s.jobStore.ReleaseLock(releaseCtx, key, token); rErr != nil {
					slog.Warn("approval lock release failed", "job_id", jobID, "error", rErr)
				}
			}()
			return fn(ctx)
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("approval lock busy")
		}
		time.Sleep(25 * time.Millisecond)
	}
}
```

### Workflow lock renewal cadence
```go
// core/workflow/engine.go (excerpt)
const runLockTTL = 30 * time.Second

if renewer, ok := lm.locker.(RunLockRenewer); ok {
	ticker := time.NewTicker(runLockTTL / 3) // 10s cadence
	defer ticker.Stop()
	for {
		select {
		case <-renewCtx.Done():
			return
		case <-ticker.C:
			rCtx, cancel := context.WithTimeout(renewCtx, 2*time.Second)
			if err := renewer.RenewLock(rCtx, key, token, runLockTTL); err != nil {
				slog.Warn("run lock renewal failed", "run_id", runID, "error", err)
			}
			cancel()
		}
	}
}
```

## Tests that pin behavior
```go
// core/infra/store/consistency_test.go + job_store_test.go (excerpt)
func TestReleaseLockTokenMismatch(t *testing.T) {
	_, _ = store.TryAcquireLock(ctx, key, 5*time.Second)
	err := store.ReleaseLock(ctx, key, "wrong-token")
	assert.Error(t, err)
	assert.Contains(t, err.Error(), "not owned")
}

func TestRenewLockAfterExpiry(t *testing.T) {
	token, _ := store.TryAcquireLock(ctx, key, 2*time.Second)
	srv.FastForward(3 * time.Second)
	err := store.RenewLock(ctx, key, token, 5*time.Second)
	assert.Error(t, err)
}

func TestRedisJobStoreLockRejectsWrongOwner(t *testing.T) {
	// owner A expires, owner B acquires, owner A release must fail
}
```

## Validation runbook
Run this before changing lock TTL defaults or retry behavior in worker code.
```sh
# 1) Trigger contention in staging for a single lock key
#    (example: cordum:scheduler:job:<job_id> or cordum:wf:run:lock:<run_id>)

# 2) Inspect lock owner value and TTL
redis-cli GET "cordum:wf:run:lock:RUN_ID"
redis-cli PTTL "cordum:wf:run:lock:RUN_ID"

# 3) Confirm stale release attempts are rejected
#    Search logs for: "lock not owned" and "lock release skipped: token mismatch"

# 4) Verify the system still preserves exclusion
#    TryAcquireLock should return an empty token while the current owner holds the lock.

# 5) If the lock-not-owned rate spikes, check for:
#    - GC pauses or long stop-the-world events
#    - network delay larger than the lock TTL
#    - handlers exceeding the TTL without a successful renew
```
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Delete by key without token check | Very simple implementation. | A stale worker can delete a newer owner's lock. Silent correctness bug. |
| Token ownership checks (current) | Wrong-owner release/renew is rejected deterministically. | You need operational handling for frequent `lock not owned` signals. |
| Token checks + fenced writes | Covers both lock ownership and stale writer hazards on external systems. | Requires storage APIs that validate monotonic fence/version on every write. |
- Token ownership protects the lock lifecycle, but stale side effects still need fenced writes or version checks in downstream systems.
- I found strong tests for token mismatch and expiry, but no single stress test that injects long GC pauses and network delay at once.
- Frequent `lock not owned` is usually a timing symptom, not just a lock-library bug. Treat it as an SRE signal.
## Next step

Implement this next:
1. Add a dedicated metric for `lock not owned` grouped by lock namespace (`approval`, `workflow`, `reconciler`).
2. Alert only on sustained ratio increases, not single events, to avoid pager noise during normal failover.
3. For correctness-critical external writes, add fencing/version validation at the storage boundary.
4. Rehearse a staged failure drill: TTL expiry + stale release + renew failure in the same scenario.
Continue with AI Agent Distributed Locking and AI Agent Approval Lock Contention.