## The production problem
Lock incidents often look harmless in logs and destructive in outcomes.
A slow worker times out, another worker acquires the lock, then the slow worker finally calls release. If release is plain `DEL key`, the new owner just lost its lock.
The damage is not theoretical. It appears as duplicate dispatch, out-of-order state transitions, and occasional cannot-reproduce outage reports.
Redis does not care that your pod had a long GC pause. The key either matches your token or it does not.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis docs: Distributed locks with random values | SET NX PX with per-lock random value and compare-on-release script guidance. | No concrete incident runbook for repeated `lock not owned` bursts across workflow and approval paths. |
| Kleppmann: How to do distributed locking | Why lease expiry and process pauses break naive lock assumptions, and why fencing exists. | No practical mapping from theory to lock API responses, metrics, and retry behavior in control planes. |
| etcd docs: Why lease is not mutual exclusion | Explicitly states lease alone does not guarantee exclusion; revision validation is required. | Does not show how a Redis-based platform can enforce token ownership and surface failure semantics to operators. |
## Cordum runtime mechanics
| Boundary | Current behavior | Why it matters |
|---|---|---|
| Lock acquisition token | `TryAcquireLock` uses `SetNX` and returns a new UUID token only when lock acquisition succeeds. | Each owner has a unique token, so a prior holder cannot safely release by key alone. |
| Release ownership check | Release path runs Lua: delete only if current value equals caller token; mismatch returns `lock not owned`. | Prevents stale workers from releasing a new owner's lock after TTL expiry and ownership transfer. |
| Renew ownership check | Renew path runs Lua: extend TTL only when key still contains caller token. | Avoids extending someone else's lock when a stale owner retries renew. |
| Approval lock envelope | Approval lock TTL is 10s; acquire attempts stop after ~2s with 25ms backoff; release uses bounded 2s context. | Limits request tail latency and avoids hanging handlers on lock infrastructure issues. |
| Workflow run lock renewal | Workflow run locks use 30s TTL and renew every TTL/3 (~10s) with 2s renew timeout. | Keeps long handlers alive while still expiring stale holders if renew keeps failing. |
## Ownership checks in code

### Compare-and-release scripts
```go
// core/infra/store/job_store.go (excerpt)
var releaseLockScript = redis.NewScript(`
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('del', KEYS[1])
end
return 0
`)

var renewLockScript = redis.NewScript(`
if redis.call('get', KEYS[1]) == ARGV[1] then
    return redis.call('pexpire', KEYS[1], ARGV[2])
end
return 0
`)
```

### Release and renew error semantics
```go
// core/infra/store/job_store.go (excerpt)
func (s *RedisJobStore) ReleaseLock(ctx context.Context, key, token string) error {
	if token == "" {
		return fmt.Errorf("lock token required")
	}
	result, err := releaseLockScript.Run(ctx, s.client, []string{key}, token).Int()
	if err != nil {
		return fmt.Errorf("job store release lock %s: %w", key, err)
	}
	if result == 0 {
		return fmt.Errorf("lock not owned")
	}
	return nil
}

func (s *RedisJobStore) RenewLock(ctx context.Context, key, token string, ttl time.Duration) error {
	result, err := renewLockScript.Run(ctx, s.client, []string{key}, token, ttl.Milliseconds()).Int()
	if err != nil {
		return fmt.Errorf("job store renew lock %s: %w", key, err)
	}
	if result == 0 {
		return fmt.Errorf("lock not owned")
	}
	return nil
}
```

### Approval lock envelope
```go
// core/controlplane/gateway/handlers_approvals.go (excerpt)
const approvalLockTTL = 10 * time.Second

func (s *server) withApprovalLock(ctx context.Context, jobID string, fn func(ctx context.Context) error) error {
	key := approvalLockPrefix + jobID
	deadline := time.Now().Add(2 * time.Second)
	for {
		lockCtx, cancel := context.WithTimeout(ctx, 2*time.Second)
		token, err := s.jobStore.TryAcquireLock(lockCtx, key, approvalLockTTL)
		cancel()
		if err != nil {
			return fmt.Errorf("lock acquire: %w", err)
		}
		if token != "" {
			defer func() {
				releaseCtx, cancel := context.WithTimeout(context.Background(), 2*time.Second)
				defer cancel()
				if rErr := s.jobStore.ReleaseLock(releaseCtx, key, token); rErr != nil {
					slog.Warn("approval lock release failed", "job_id", jobID, "error", rErr)
				}
			}()
			return fn(ctx)
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("approval lock busy")
		}
		time.Sleep(25 * time.Millisecond)
	}
}
```

### Workflow lock renewal cadence
```go
// core/workflow/engine.go (excerpt)
const runLockTTL = 30 * time.Second

if renewer, ok := lm.locker.(RunLockRenewer); ok {
	ticker := time.NewTicker(runLockTTL / 3) // 10s cadence
	defer ticker.Stop()
	for {
		select {
		case <-renewCtx.Done():
			return
		case <-ticker.C:
			rCtx, cancel := context.WithTimeout(renewCtx, 2*time.Second)
			if err := renewer.RenewLock(rCtx, key, token, runLockTTL); err != nil {
				slog.Warn("run lock renewal failed", "run_id", runID, "error", err)
			}
			cancel()
		}
	}
}
```

## Tests that pin behavior
```go
// core/infra/store/consistency_test.go + job_store_test.go (excerpt)
func TestReleaseLockTokenMismatch(t *testing.T) {
	_, _ = store.TryAcquireLock(ctx, key, 5*time.Second)
	err := store.ReleaseLock(ctx, key, "wrong-token")
	assert.Error(t, err)
	assert.Contains(t, err.Error(), "not owned")
}

func TestRenewLockAfterExpiry(t *testing.T) {
	token, _ := store.TryAcquireLock(ctx, key, 2*time.Second)
	srv.FastForward(3 * time.Second)
	err := store.RenewLock(ctx, key, token, 5*time.Second)
	assert.Error(t, err)
}

func TestRedisJobStoreLockRejectsWrongOwner(t *testing.T) {
	// owner A expires, owner B acquires, owner A release must fail
}
```

## Validation runbook
Run this before changing lock TTL defaults or retry behavior in worker code.
```sh
# 1) Trigger contention in staging for a single lock key
#    (example: cordum:scheduler:job:<job_id> or cordum:wf:run:lock:<run_id>)

# 2) Inspect lock owner value and TTL
redis-cli GET "cordum:wf:run:lock:RUN_ID"
redis-cli PTTL "cordum:wf:run:lock:RUN_ID"

# 3) Confirm stale release attempts are rejected
#    Search logs for: "lock not owned" and "lock release skipped: token mismatch"

# 4) Verify the system still preserves exclusion
#    TryAcquireLock should return an empty token while the current owner holds the lock.

# 5) If the lock-not-owned rate spikes, check for:
#    - GC pauses or long stop-the-world events
#    - network delay larger than the lock TTL
#    - handlers exceeding the TTL without a successful renew
```
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Delete by key without token check | Very simple implementation. | A stale worker can delete a newer owner's lock. Silent correctness bug. |
| Token ownership checks (current) | Wrong-owner release/renew is rejected deterministically. | You need operational handling for frequent `lock not owned` signals. |
| Token checks + fenced writes | Covers both lock ownership and stale writer hazards on external systems. | Requires storage APIs that validate monotonic fence/version on every write. |
- Token ownership protects the lock lifecycle, but stale side effects still need fenced writes or version checks in downstream systems.
- I found strong tests for token mismatch and expiry, but no single stress test that injects long GC pauses and network delay at once.
- Frequent `lock not owned` is usually a timing symptom, not just a lock-library bug. Treat it as an SRE signal.
## Next step

Implement this next:
1. Add a dedicated metric for `lock not owned` grouped by lock namespace (`approval`, `workflow`, `reconciler`).
2. Alert only on sustained ratio increases, not single events, to avoid pager noise during normal failover.
3. For correctness-critical external writes, add fencing/version validation at the storage boundary.
4. Rehearse a staged failure drill: TTL expiry + stale release + renew failure in the same scenario.
Continue with AI Agent Distributed Locking and AI Agent Approval Lock Contention.