The production problem
Lock bugs are usually timing bugs. They hide in happy-path load tests and appear when latency spikes or a node dies during critical sections.
If TTL is too short, locks expire before work completes. If TTL is too long, failover becomes sluggish and queues stall while everyone waits for expired ownership.
This is not theoretical. It is one of the fastest ways to create duplicate side effects in autonomous agent platforms.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Redis distributed locks | TTL-based lock validity, unique lock token, and safe release semantics. | No concrete takeover SLO process for queue-driven AI schedulers. |
| Kubernetes Leases | Lease-based coordination model used for leader election and heartbeats. | No guidance for business-level job lock tuning in application schedulers. |
| kube-scheduler leader election flags | Concrete lease/renew/retry timing defaults (15s lease, 10s renew, 2s retry). | No mapping from timing defaults to duplicate-dispatch risk and backlog recovery. |
The missing piece is application-level calibration: translating lock theory into a takeover SLO and a duplicate-dispatch risk bound for your workload profile.
TTL and renewal math
TTL must be computed from real execution data, not guessed from round numbers like 30s or 60s.
# Inputs
T_exec_p99 = 18s   # p99 lock-held critical section
T_jitter   = 4s    # network + Redis tail jitter budget
T_guard    = 2s    # extra buffer

# Candidate TTL
TTL = T_exec_p99 + T_jitter + T_guard = 24s

# Renewal interval baseline
renew_interval = TTL / 3 = 8s

# Recovery bound after hard crash
takeover_max ~= TTL + detection_jitter

# If takeover SLO target is <= 30s, TTL cannot be 60s unless you
# use a separate fast-fail mechanism.
| Component | Rule | If too low | If too high |
|---|---|---|---|
| Lock TTL | Set TTL > worst-case critical section time + renewal jitter + store tail latency. | Owner loses lock mid-operation; concurrent owner can process same job. | Takeover after crash is delayed and queue latency spikes. |
| Renew interval | Start with TTL/3 cadence and tighten if storage/network jitter is high. | Renew traffic increases and can amplify store pressure. | Small hiccups can skip renewal window and abandon the lock. |
| Renewal failure budget | Cap consecutive renewal failures, then fence critical section mutations. | Transient blips trigger unnecessary abandonment. | Owner keeps writing after lock ownership may have moved. |
| Takeover SLO | Define a hard upper bound from crash time to next valid owner action. | SLO is impossible during normal jitter and causes false paging. | Incidents hide behind broad SLO and degrade user-facing latency. |
Cordum lock behavior
The current scheduler implementation includes an explicit renewal cadence, abandonment fencing, and release safeguards that are directly relevant to TTL tuning.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Scheduler lock TTL | `jobLockTTL = 60s` in scheduler engine lock path. | Bounds crash takeover delay to about one minute worst case. |
| Renew cadence | Renew ticker runs at `ttl/3` (20s for 60s TTL). | Provides multiple renewal attempts before lock expiry. |
| Failure fence | After 3 consecutive renewal failures, lock is marked abandoned and critical section is fenced. | Reduces risk of state mutation after ownership uncertainty. |
| Release safety | Lock release is skipped after abandonment to avoid deleting a new owner's lock. | Prevents cross-owner lock deletion race. |
| Operational guide | Troubleshooting docs describe 60s TTL and 20s renewal interval for scheduler job locks. | Keeps docs and runtime aligned for incident debugging. |
Implementation examples
Redis-safe lock acquire/release primitives
# Acquire lock with unique token and TTL
SET cordum:scheduler:job:JOB_ID <token> NX PX 60000
# Safe release: compare token, then delete (Lua script run via EVAL)
if redis.call("get", KEYS[1]) == ARGV[1] then
return redis.call("del", KEYS[1])
else
return 0
end

Renewal loop with abandonment fence (Go)
const (
jobLockTTL = 60 * time.Second
maxRenewalFailures = 3
)
renewTicker := time.NewTicker(jobLockTTL / 3) // 20s cadence for a 60s TTL
defer renewTicker.Stop()
consecutiveFailures := 0
for {
select {
case <-renewTicker.C:
if err := store.RenewLock(ctx, key, token, jobLockTTL); err != nil {
consecutiveFailures++
if consecutiveFailures >= maxRenewalFailures {
fenceCancel(errLockAbandoned) // stop state mutation path
return
}
continue
}
consecutiveFailures = 0
case <-ctx.Done():
return
}
}

Operational lock-debug runbook (Bash)
# 1) Inspect lock ownership
redis-cli GET "cordum:scheduler:job:JOB_ID"

# 2) Check scheduler coordination locks
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# 3) Simulate abrupt scheduler death and measure takeover time
kubectl delete pod -l app=cordum-scheduler -n cordum --grace-period=0 --force

# 4) Verify no duplicate terminal states for the same job
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "job lock|lock renewal|abandoned"
Limitations and tradeoffs
- Smaller TTL improves failover speed but narrows margin for latency spikes.
- Larger TTL reduces accidental expiry but extends crash-recovery delay.
- Aggressive renewal cadence improves safety but increases store write load.
- Renewal-failure fencing protects consistency but can temporarily reduce throughput.
If you cannot state your takeover SLO in seconds, your lock TTL is likely a guess dressed as a configuration value.
Next step
Run this in one sprint:
1. Export p95/p99 lock-held duration per job class.
2. Recompute TTL and renewal interval from live latency data.
3. Add a hard alert for lock abandonment and duplicate terminal states.
4. Run one forced-kill game day and confirm takeover stays within target SLO.
Continue with AI Agent Distributed Locking and AI Agent Rolling Restart Playbook.