The production problem
Lock bugs are usually timing bugs. They hide in happy-path load tests and appear when latency spikes or a node dies during critical sections.
If TTL is too short, locks expire before work completes. If TTL is too long, failover becomes sluggish and queues stall while everyone waits for expired ownership.
This is not theoretical. It is one of the fastest ways to create duplicate side effects in autonomous agent platforms.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis distributed locks | TTL-based lock validity, unique lock token, and safe release semantics. | No concrete takeover SLO process for queue-driven AI schedulers. |
| Kubernetes Leases | Lease-based coordination model used for leader election and heartbeats. | No guidance for business-level job lock tuning in application schedulers. |
| kube-scheduler leader election flags | Concrete lease/renew/retry timing defaults (15s lease, 10s renew, 2s retry). | No mapping from timing defaults to duplicate-dispatch risk and backlog recovery. |
The missing piece is application-level calibration: translating lock theory into a takeover SLO and a duplicate-dispatch risk bound for your workload profile.
TTL and renewal math
TTL must be computed from real execution data, not guessed from round numbers like 30s or 60s.
# Inputs
T_exec_p99 = 18s   # p99 lock-held critical section
T_jitter   = 4s    # network + Redis tail jitter budget
T_guard    = 2s    # extra buffer

# Candidate TTL
TTL = T_exec_p99 + T_jitter + T_guard = 24s

# Renewal interval baseline
renew_interval = TTL / 3 = 8s

# Recovery bound after hard crash
takeover_max ~= TTL + detection_jitter

# If takeover SLO target is <= 30s, TTL cannot be 60s unless you
# use a separate fast-fail mechanism.
| Component | Rule | If too low | If too high |
|---|---|---|---|
| Lock TTL | Set TTL > worst-case critical section time + renewal jitter + store tail latency. | Owner loses lock mid-operation; concurrent owner can process same job. | Takeover after crash is delayed and queue latency spikes. |
| Renew interval | Start with TTL/3 cadence and tighten if storage/network jitter is high. | Renew traffic increases and can amplify store pressure. | Small hiccups can skip renewal window and abandon the lock. |
| Renewal failure budget | Cap consecutive renewal failures, then fence critical section mutations. | Transient blips trigger unnecessary abandonment. | Owner keeps writing after lock ownership may have moved. |
| Takeover SLO | Define a hard upper bound from crash time to next valid owner action. | SLO is impossible during normal jitter and causes false paging. | Incidents hide behind broad SLO and degrade user-facing latency. |
Cordum lock behavior
The current scheduler implementation includes an explicit renewal cadence, abandonment fencing, and release safeguards that are directly relevant to TTL tuning.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Scheduler lock TTL | `jobLockTTL = 60s` in scheduler engine lock path. | Bounds crash takeover delay to about one minute worst case. |
| Renew cadence | Renew ticker runs at `ttl/3` (20s for 60s TTL). | Provides multiple renewal attempts before lock expiry. |
| Failure fence | After 3 consecutive renewal failures, lock is marked abandoned and critical section is fenced. | Reduces risk of state mutation after ownership uncertainty. |
| Release safety | Lock release is skipped after abandonment to avoid deleting a new owner's lock. | Prevents cross-owner lock deletion race. |
| Operational guide | Troubleshooting docs describe 60s TTL and 20s renewal interval for scheduler job locks. | Keeps docs and runtime aligned for incident debugging. |
Implementation examples
Redis-safe lock acquire/release primitives
# Acquire lock with unique token and TTL
SET cordum:scheduler:job:JOB_ID <token> NX PX 60000
# Safe release: compare token, then delete (Lua script run via EVAL)
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end

Renewal loop with abandonment fence (Go)
const (
jobLockTTL = 60 * time.Second
maxRenewalFailures = 3
)
renewTicker := time.NewTicker(jobLockTTL / 3) // 20s cadence for a 60s TTL
defer renewTicker.Stop()
consecutiveFailures := 0
for {
select {
case <-renewTicker.C:
if err := store.RenewLock(ctx, key, token, jobLockTTL); err != nil {
consecutiveFailures++
if consecutiveFailures >= maxRenewalFailures {
fenceCancel(errLockAbandoned) // stop state mutation path
return
}
continue
}
consecutiveFailures = 0
case <-ctx.Done():
return
}
}

Operational lock-debug runbook (Bash)
# 1) Inspect lock ownership
redis-cli GET "cordum:scheduler:job:JOB_ID"

# 2) Check scheduler coordination locks
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# 3) Simulate abrupt scheduler death and measure takeover time
kubectl delete pod -l app=cordum-scheduler -n cordum --grace-period=0 --force

# 4) Verify no duplicate terminal states for the same job
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "job lock|lock renewal|abandoned"
Limitations and tradeoffs
- Smaller TTL improves failover speed but narrows margin for latency spikes.
- Larger TTL reduces accidental expiry but extends crash-recovery delay.
- Aggressive renewal cadence improves safety but increases store write load.
- Renewal-failure fencing protects consistency but can temporarily reduce throughput.
If you cannot state your takeover SLO in seconds, your lock TTL is likely a guess dressed as a configuration value.
Next step
Run this in one sprint:
1. Export p95/p99 lock-held duration per job class.
2. Recompute TTL and renewal interval from live latency data.
3. Add a hard alert for lock abandonment and duplicate terminal states.
4. Run one forced-kill game day and confirm takeover stays within target SLO.
Continue with AI Agent Distributed Locking and AI Agent Rolling Restart Playbook.