
AI Agent Lock TTL Tuning

Most duplicate-dispatch incidents start as lock timing mistakes, not queue bugs.

Guide · 11 min read · Mar 2026
TL;DR
  • A lock TTL is both a safety boundary and a recovery timer.
  • Too short: duplicate work. Too long: slow takeover after a crash.
  • Renewal cadence must be derived from TTL and tail latency, not copied from examples.
  • The Cordum scheduler uses a 60s lock TTL with renewal every 20s and explicit abandonment fencing.
Bounded recovery

TTL defines the upper bound for takeover when a lock owner dies abruptly.

Renew safely

Renew too late and the lock expires. Renew forever and liveness can break.

Fence on failure

When renewal fails repeatedly, stop mutating state before another owner takes over.

Scope

This guide focuses on Redis-backed distributed locks for AI control-plane schedulers where each job must be processed by one owner at a time.

The production problem

Lock bugs are usually timing bugs. They hide in happy-path load tests and appear when latency spikes or a node dies during critical sections.

If TTL is too short, locks expire before work completes. If TTL is too long, failover becomes sluggish and queues stall while everyone waits for expired ownership.

This is not theoretical. It is one of the fastest ways to create duplicate side effects in autonomous agent platforms.

What top results miss

Source | Strong coverage | Missing piece
Redis distributed locks | TTL-based lock validity, unique lock token, and safe release semantics. | No concrete takeover SLO process for queue-driven AI schedulers.
Kubernetes Leases | Lease-based coordination model used for leader election and heartbeats. | No guidance for business-level job lock tuning in application schedulers.
kube-scheduler leader election flags | Concrete lease/renew/retry timing defaults (15s lease, 10s renew, 2s retry). | No mapping from timing defaults to duplicate-dispatch risk and backlog recovery.

The missing part is application-level calibration: translate lock theory into takeover SLO and duplicate dispatch risk for your workload profile.

TTL and renewal math

TTL must be computed from real execution data, not guessed from round numbers like 30s or 60s.

lock_ttl_math.txt
Text
# Inputs
T_exec_p99 = 18s          # p99 lock-held critical section
T_jitter = 4s             # network + Redis tail jitter budget
T_guard = 2s              # extra buffer

# Candidate TTL
TTL = T_exec_p99 + T_jitter + T_guard = 24s

# Renewal interval baseline
renew_interval = TTL / 3 = 8s

# Recovery bound after hard crash
takeover_max ~= TTL + detection_jitter

# If takeover SLO target is <= 30s, TTL cannot be 60s unless you
# use a separate fast-fail mechanism.
Component | Rule | If too low | If too high
Lock TTL | Set TTL > worst-case critical section time + renewal jitter + store tail latency. | Owner loses the lock mid-operation; a concurrent owner can process the same job. | Takeover after a crash is delayed and queue latency spikes.
Renew interval | Start with a TTL/3 cadence and tighten if storage/network jitter is high. | Renew traffic increases and can amplify store pressure. | Small hiccups can miss the renewal window and abandon the lock.
Renewal failure budget | Cap consecutive renewal failures, then fence critical-section mutations. | Transient blips trigger unnecessary abandonment. | Owner keeps writing after lock ownership may have moved.
Takeover SLO | Define a hard upper bound from crash time to the next valid owner action. | SLO is impossible to meet during normal jitter and causes false paging. | Incidents hide behind a broad SLO and degrade user-facing latency.

Cordum lock behavior

Current scheduler implementation includes explicit renewal cadence, abandonment fencing, and release safeguards that are directly relevant to TTL tuning.

Boundary | Current behavior | Operational impact
Scheduler lock TTL | `jobLockTTL = 60s` in the scheduler engine lock path. | Bounds crash takeover delay to about one minute worst case.
Renew cadence | Renew ticker runs at `ttl/3` (20s for a 60s TTL). | Provides multiple renewal attempts before lock expiry.
Failure fence | After 3 consecutive renewal failures, the lock is marked abandoned and the critical section is fenced. | Reduces risk of state mutation after ownership uncertainty.
Release safety | Lock release is skipped after abandonment to avoid deleting a new owner's lock. | Prevents a cross-owner lock deletion race.
Operational guide | Troubleshooting docs describe the 60s TTL and 20s renewal interval for scheduler job locks. | Keeps docs and runtime aligned for incident debugging.

Implementation examples

Redis-safe lock acquire/release primitives

redis_lock_primitives.lua
Lua
-- Acquire: issued from the client, not inside Lua.
-- NX ensures a single owner; PX sets the TTL in milliseconds (60s here).
--   SET cordum:scheduler:job:JOB_ID <token> NX PX 60000

-- Safe release (Redis <= 8.2 style): delete only if the stored token
-- still matches ours, so we never delete a lock a new owner has since
-- acquired after our TTL lapsed.
if redis.call("get", KEYS[1]) == ARGV[1] then
  return redis.call("del", KEYS[1])
else
  return 0
end

Renewal loop with abandonment fence (Go)

lock_renewal.go
Go
const (
  jobLockTTL         = 60 * time.Second
  maxRenewalFailures = 3
)

// Runs in the lock-holding goroutine after a successful acquire.
renewTicker := time.NewTicker(jobLockTTL / 3) // 20s cadence for a 60s TTL
defer renewTicker.Stop()
consecutiveFailures := 0

for {
  select {
  case <-renewTicker.C:
    if err := store.RenewLock(ctx, key, token, jobLockTTL); err != nil {
      consecutiveFailures++
      if consecutiveFailures >= maxRenewalFailures {
        // Fence: stop the state-mutation path; another owner may take
        // over once the TTL lapses.
        fenceCancel(errLockAbandoned)
        return
      }
      continue
    }
    consecutiveFailures = 0 // any successful renewal resets the budget
  case <-ctx.Done():
    return
  }
}

Operational lock-debug runbook (Bash)

lock_ttl_runbook.sh
Bash
# 1) Inspect lock ownership
redis-cli GET "cordum:scheduler:job:JOB_ID"

# 2) Check scheduler coordination locks
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# 3) Simulate abrupt scheduler death and measure takeover time
kubectl delete pod -l app=cordum-scheduler -n cordum --grace-period=0 --force

# 4) Verify no duplicate terminal states for the same job
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "job lock|lock renewal|abandoned"

Limitations and tradeoffs

  • Smaller TTL improves failover speed but narrows the margin for latency spikes.
  • Larger TTL reduces accidental expiry but extends crash-recovery delay.
  • Aggressive renewal cadence improves safety but increases store write load.
  • Renewal-failure fencing protects consistency but can temporarily reduce throughput.

If you cannot state your takeover SLO in seconds, your lock TTL is likely a guess dressed as a configuration value.

Next step

Run this in one sprint:

  1. Export p95/p99 lock-held duration per job class.
  2. Recompute TTL and renewal interval from live latency data.
  3. Add a hard alert for lock abandonment and duplicate terminal states.
  4. Run one forced-kill game day and confirm takeover stays within the target SLO.

Continue with AI Agent Distributed Locking and AI Agent Rolling Restart Playbook.

A lock is a time contract

Treat lock TTL like an SLO-backed contract between correctness and recovery speed.