Deep Dive

AI Agent Lock Token Ownership

If release only checks the key and not the token, stale workers can break your mutual exclusion guarantees.

11 min read · Mar 2026
TL;DR
  • A lock key by itself is not ownership. Ownership is the key plus a token that must match at release and renew time.
  • Cordum uses Lua compare-and-delete and compare-and-pexpire scripts to reject stale lock holders with `lock not owned`.
  • Approval paths use a 10s lock TTL and bounded 2s acquire/release contexts, which limits dead waits under contention.
  • Token ownership prevents wrong lock releases, but it does not prevent stale writes to external systems without fencing checks.
Failure mode

Worker A times out, Worker B takes lock, Worker A finally wakes up and tries to release with stale context.

Current behavior

Cordum returns `lock not owned` on release and renew mismatch instead of deleting another owner's lock.

Operational payoff

You keep mutual exclusion intact, and incidents become diagnosis problems instead of silent data corruption.

Scope

This guide focuses on lock ownership semantics in Redis-backed control planes. It is not a generic consensus-lock proof.

The production problem

Lock incidents often look harmless in logs and destructive in outcomes.

A slow worker times out, another worker acquires the lock, then the slow worker finally calls release. If release is plain `DEL key`, the new owner just lost its lock.

The damage is not theoretical. It appears as duplicate dispatch, out-of-order state transitions, and occasional cannot-reproduce outage reports.

Redis does not care that your pod had a long GC pause. The key either matches your token or it does not.
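The failure can be reproduced without Redis at all. The sketch below is an in-memory stand-in, not Cordum's actual store; it exists only to contrast delete-by-key with compare-and-delete:

```go
package main

import "fmt"

// lockStore is a minimal in-memory stand-in for the Redis key;
// it only exists to illustrate the two release strategies.
type lockStore struct{ owner map[string]string }

// releaseByKey mimics a plain DEL: any caller can remove the lock.
func (s *lockStore) releaseByKey(key string) { delete(s.owner, key) }

// releaseByToken mimics compare-and-delete: only the matching owner succeeds.
func (s *lockStore) releaseByToken(key, token string) bool {
	if s.owner[key] != token {
		return false // stale holder: the lock stays with the current owner
	}
	delete(s.owner, key)
	return true
}

func main() {
	s := &lockStore{owner: map[string]string{}}

	// Worker A's lease has expired; Worker B is now the owner.
	s.owner["job:42"] = "token-B"

	// Plain DEL: stale Worker A silently evicts Worker B.
	s.releaseByKey("job:42")
	fmt.Println("after DEL-by-key, owner:", s.owner["job:42"]) // empty: B lost its lock

	// Compare-and-delete: stale Worker A is rejected.
	s.owner["job:42"] = "token-B"
	ok := s.releaseByToken("job:42", "token-A")
	fmt.Println("stale release accepted:", ok, "owner:", s.owner["job:42"])
}
```

The second print shows `false` and the surviving `token-B`: exactly the behavior the Lua scripts below enforce against Redis.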

What top results cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Redis docs: Distributed locks with random values | SET NX PX with a per-lock random value and compare-on-release script guidance. | No concrete incident runbook for repeated `lock not owned` bursts across workflow and approval paths. |
| Kleppmann: How to do distributed locking | Why lease expiry and process pauses break naive lock assumptions, and why fencing exists. | No practical mapping from theory to lock API responses, metrics, and retry behavior in control planes. |
| etcd docs: Why lease is not mutual exclusion | Explicitly states that a lease alone does not guarantee exclusion; revision validation is required. | Does not show how a Redis-based platform can enforce token ownership and surface failure semantics to operators. |

Cordum runtime mechanics

| Boundary | Current behavior | Why it matters |
| --- | --- | --- |
| Lock acquisition token | `TryAcquireLock` uses `SetNX` and returns a new UUID token only when lock acquisition succeeds. | Each owner has a unique token, so a prior holder cannot safely release by key alone. |
| Release ownership check | Release path runs Lua: delete only if the current value equals the caller token; a mismatch returns `lock not owned`. | Prevents stale workers from releasing a new owner's lock after TTL expiry and ownership transfer. |
| Renew ownership check | Renew path runs Lua: extend the TTL only when the key still contains the caller token. | Avoids extending someone else's lock when a stale owner retries renew. |
| Approval lock envelope | Approval lock TTL is 10s; acquire attempts stop after ~2s with 25ms backoff; release uses a bounded 2s context. | Limits request tail latency and avoids hanging handlers on lock infrastructure issues. |
| Workflow run lock renewal | Workflow run locks use a 30s TTL and renew every TTL/3 (~10s) with a 2s renew timeout. | Keeps long handlers alive while still expiring stale holders if renew keeps failing. |

Ownership checks in code

Compare-and-release scripts

core/infra/store/job_store.go
```go
// core/infra/store/job_store.go (excerpt)
var releaseLockScript = redis.NewScript(`
if redis.call('get', KEYS[1]) == ARGV[1] then
  return redis.call('del', KEYS[1])
end
return 0
`)

var renewLockScript = redis.NewScript(`
if redis.call('get', KEYS[1]) == ARGV[1] then
  return redis.call('pexpire', KEYS[1], ARGV[2])
end
return 0
`)
```

Release and renew error semantics

core/infra/store/job_store.go
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) ReleaseLock(ctx context.Context, key, token string) error {
  if token == "" {
    return fmt.Errorf("lock token required")
  }
  result, err := releaseLockScript.Run(ctx, s.client, []string{key}, token).Int()
  if err != nil {
    return fmt.Errorf("job store release lock %s: %w", key, err)
  }
  if result == 0 {
    return fmt.Errorf("lock not owned")
  }
  return nil
}

func (s *RedisJobStore) RenewLock(ctx context.Context, key, token string, ttl time.Duration) error {
  result, err := renewLockScript.Run(ctx, s.client, []string{key}, token, ttl.Milliseconds()).Int()
  if err != nil {
    return fmt.Errorf("job store renew lock %s: %w", key, err)
  }
  if result == 0 {
    return fmt.Errorf("lock not owned")
  }
  return nil
}
```

Approval lock envelope

core/controlplane/gateway/handlers_approvals.go
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
const approvalLockTTL = 10 * time.Second

func (s *server) withApprovalLock(ctx context.Context, jobID string, fn func(ctx context.Context) error) error {
  key := approvalLockPrefix + jobID
  deadline := time.Now().Add(2 * time.Second)

  for {
    lockCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
    token, err := s.jobStore.TryAcquireLock(lockCtx, key, approvalLockTTL)
    cancel()
    if err != nil {
      return fmt.Errorf("lock acquire: %w", err)
    }
    if token != "" {
      defer func() {
        releaseCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
        defer cancel()
        if rErr := s.jobStore.ReleaseLock(releaseCtx, key, token); rErr != nil {
          slog.Warn("approval lock release failed", "job_id", jobID, "error", rErr)
        }
      }()
      return fn(ctx)
    }
    if time.Now().After(deadline) {
      return fmt.Errorf("approval lock busy")
    }
    time.Sleep(25 * time.Millisecond)
  }
}
```

Workflow lock renewal cadence

core/workflow/engine.go
```go
// core/workflow/engine.go (excerpt)
const runLockTTL = 30 * time.Second

if renewer, ok := lm.locker.(RunLockRenewer); ok {
  ticker := time.NewTicker(runLockTTL / 3) // 10s cadence
  defer ticker.Stop()
  for {
    select {
    case <-renewCtx.Done():
      return // handler finished or engine shutting down; stop renewing
    case <-ticker.C:
      rCtx, cancel := context.WithTimeout(renewCtx, 2*time.Second)
      if err := renewer.RenewLock(rCtx, key, token, runLockTTL); err != nil {
        slog.Warn("run lock renewal failed", "run_id", runID, "error", err)
      }
      cancel()
    }
  }
}
```

Tests that pin behavior

core/infra/store/*_test.go
```go
// core/infra/store/consistency_test.go + job_store_test.go (excerpt)
func TestReleaseLockTokenMismatch(t *testing.T) {
  _, _ = store.TryAcquireLock(ctx, key, 5*time.Second)
  err := store.ReleaseLock(ctx, key, "wrong-token")
  assert.Error(t, err)
  assert.Contains(t, err.Error(), "not owned")
}

func TestRenewLockAfterExpiry(t *testing.T) {
  token, _ := store.TryAcquireLock(ctx, key, 2*time.Second)
  srv.FastForward(3 * time.Second)
  err := store.RenewLock(ctx, key, token, 5*time.Second)
  assert.Error(t, err)
}

func TestRedisJobStoreLockRejectsWrongOwner(t *testing.T) {
  // owner A expires, owner B acquires, owner A release must fail
  tokenA, _ := store.TryAcquireLock(ctx, key, 2*time.Second)
  srv.FastForward(3 * time.Second)
  tokenB, _ := store.TryAcquireLock(ctx, key, 5*time.Second)
  assert.NotEmpty(t, tokenB)
  err := store.ReleaseLock(ctx, key, tokenA)
  assert.Error(t, err)
}
```

Validation runbook

Run this before changing lock TTL defaults or retry behavior in worker code.

lock-ownership-runbook.sh
```bash
# 1) Trigger contention in staging for a single lock key
#    (example: cordum:scheduler:job:<job_id> or cordum:wf:run:lock:<run_id>)

# 2) Inspect lock owner value and TTL
redis-cli GET "cordum:wf:run:lock:RUN_ID"
redis-cli PTTL "cordum:wf:run:lock:RUN_ID"

# 3) Confirm stale release attempts are rejected
#    Search logs for: "lock not owned" and "lock release skipped: token mismatch"

# 4) Verify system still preserves exclusion
#    TryAcquireLock should return empty token while current owner holds lock.

# 5) If lock-not-owned rate spikes, check for:
#    - GC pauses or long stop-the-world events
#    - network delay larger than lock TTL
#    - handlers exceeding TTL without successful renew
```
Limitations and tradeoffs

| Approach | Upside | Downside |
| --- | --- | --- |
| Delete by key without token check | Very simple implementation. | A stale worker can delete a newer owner's lock. Silent correctness bug. |
| Token ownership checks (current) | Wrong-owner release/renew is rejected deterministically. | You need operational handling for frequent `lock not owned` signals. |
| Token checks + fenced writes | Covers both lock ownership and stale writer hazards on external systems. | Requires storage APIs that validate a monotonic fence/version on every write. |
  • Token ownership protects the lock lifecycle, but stale side effects still need fenced writes or version checks in downstream systems.
  • I found strong tests for token mismatch and expiry, but no single stress test that injects long GC pauses and network delay at once.
  • Frequent `lock not owned` is usually a timing symptom, not just a lock-library bug. Treat it as an SRE signal.

Next step

Implement this next:

  1. Add a dedicated metric for `lock not owned` grouped by lock namespace (`approval`, `workflow`, `reconciler`).
  2. Alert only on sustained ratio increases, not single events, to avoid pager noise during normal failover.
  3. For correctness-critical external writes, add fencing/version validation at the storage boundary.
  4. Rehearse a staged failure drill: TTL expiry + stale release + renew failure in the same scenario.

Continue with AI Agent Distributed Locking and AI Agent Approval Lock Contention.

Lock safety is a product decision

If stale workers can remove active locks, your system is only correct when timing is perfect, which is the least available environment in production.