The production problem
Two leaders running the same control loop can corrupt state fast. Zero leaders can let stale jobs accumulate silently until customers notice before dashboards do.
Most teams focus on lock acquisition and stop there. The hard part is setting lease timing so leadership remains stable in normal operation but hands over quickly on failure.
Election logic that is not tied to clear failover math becomes folklore. Folklore does not help at 03:00.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Leases | Lease objects for leader selection and heartbeat semantics in control-plane systems. | No practical guidance for Redis-based election loops in agent runtimes with mixed lock-release patterns. |
| kube-scheduler leader election flags | Concrete defaults for `lease-duration`, `renew-deadline`, and `retry-period`. | No mapping to queue-loop failover math (`ttl + poll`) for autonomous workflow orchestrators. |
| etcd Election API | Campaign/Observe/Resign flow with lease-bound leadership and ownership checks. | No day-2 operator runbook for debugging lock contention and stale leadership keys. |
The practical gap is operating multiple election loops with different timing and blast radius requirements inside one AI control plane.
Leader-election timing model
Treat lease tuning as an SLO decision. Start with the stale-work window you can tolerate, then back into TTL and poll values. Do not do the reverse.
| Parameter | Rule of thumb | What breaks if wrong |
|---|---|---|
| Lease duration (TTL) | Must exceed normal work tick + network jitter + one retry window | Too small: frequent leader churn. Too large: slow takeover after crash. |
| Renew cadence | Common baseline: renew every TTL/3 | Too sparse: expiry during transient latency. Too aggressive: avoidable Redis/API load. |
| Poll / tick interval | Poll frequency should reflect your stale-work detection objective | Too slow: high recovery lag. Too fast: noisy lock contention and extra compute spend. |
| Failover budget | Approximate upper bound: `TTL + pollInterval` for TTL-held loops | False SLO assumptions and broken incident expectations. |
Cordum single-writer loops
These values were verified from current source code paths in `cmd/cordum-scheduler` and `core/workflow`. They are not guessed from slides or stale snippets.
| Loop | Leader lock behavior | Lock key | Typical takeover window |
|---|---|---|---|
| Scheduler reconciler | Default scan interval 30s, lock TTL 60s, renewal loop at TTL/3, no explicit release. | `cordum:reconciler:default` | ~60-90s (TTL + next poll tick). |
| Scheduler pending replayer | Default poll 30s, lock TTL 60s, no explicit release. | `cordum:replayer:pending` | ~60-90s in crash scenarios. |
| Workflow reconciler | Default poll 5s, lock TTL 10s, no explicit release. | `cordum:workflow-engine:reconciler:default` | ~10-15s. |
| Workflow delay poller | Poll 5s, lock TTL 10s, no explicit release. | `cordum:wf:delay:poller` | ~10-15s. |
| Worker snapshot writer | Tick every 5s, lock TTL 30s, explicit release after write. | `cordum:scheduler:snapshot:writer` | ~35s worst case after leader crash. |
Implementation examples
Lease-based leader loop (Go)
```go
// LockStore abstracts the Redis-backed lock primitives used below.
// RenewLock is an assumed extension so renewLoop can extend the
// lease; substitute your store's equivalent.
type LockStore interface {
	TryAcquireLock(ctx context.Context, key string, ttl time.Duration) (token string, err error)
	ReleaseLock(ctx context.Context, key, token string) error
	RenewLock(ctx context.Context, key, token string, ttl time.Duration) error
}

type LeaderLoopConfig struct {
	PollInterval time.Duration
	LockTTL      time.Duration
	RenewEvery   time.Duration // 0 means no renewal
	HoldUntilTTL bool          // true for single-writer loop windows
}

func RunLeaderLoop(
	ctx context.Context,
	store LockStore,
	lockKey string,
	cfg LeaderLoopConfig,
	tick func(context.Context) error,
) {
	ticker := time.NewTicker(cfg.PollInterval)
	defer ticker.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-ticker.C:
			token, err := store.TryAcquireLock(ctx, lockKey, cfg.LockTTL)
			if err != nil || token == "" {
				continue // another replica is leading this tick
			}
			stopRenew := make(chan struct{})
			if cfg.RenewEvery > 0 {
				go renewLoop(ctx, store, lockKey, token, cfg.LockTTL, cfg.RenewEvery, stopRenew)
			}
			_ = tick(ctx)
			close(stopRenew)
			if !cfg.HoldUntilTTL {
				// Fresh context so release still runs if ctx was cancelled mid-tick.
				_ = store.ReleaseLock(context.Background(), lockKey, token)
			}
		}
	}
}

// renewLoop extends the lease at the configured cadence until the
// tick completes (stop closes) or the process shuts down.
func renewLoop(ctx context.Context, store LockStore, key, token string, ttl, every time.Duration, stop <-chan struct{}) {
	t := time.NewTicker(every)
	defer t.Stop()
	for {
		select {
		case <-ctx.Done():
			return
		case <-stop:
			return
		case <-t.C:
			_ = store.RenewLock(ctx, key, token, ttl)
		}
	}
}
```

Election policy sheet (YAML)
```yaml
leader_election:
  scheduler_reconciler:
    poll_interval: 30s
    lock_key: cordum:reconciler:default
    lock_ttl: 60s
    renew_every: 20s
    release_mode: ttl_hold
  scheduler_pending_replayer:
    poll_interval: 30s
    lock_key: cordum:replayer:pending
    lock_ttl: 60s
    renew_every: 0s
    release_mode: ttl_hold
  workflow_reconciler:
    poll_interval: 5s
    lock_key: cordum:workflow-engine:reconciler:default
    lock_ttl: 10s
    renew_every: 0s
    release_mode: ttl_hold
  snapshot_writer:
    poll_interval: 5s
    lock_key: cordum:scheduler:snapshot:writer
    lock_ttl: 30s
    renew_every: 0s
    release_mode: explicit
```

Runbook commands for leadership state (Bash)
```bash
# Scheduler leadership checks
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# Workflow leadership checks
redis-cli GET "cordum:workflow-engine:reconciler:default"
redis-cli GET "cordum:wf:delay:poller"

# Snapshot writer checks
redis-cli GET "cordum:scheduler:snapshot:writer"
redis-cli OBJECT IDLETIME "sys:workers:snapshot"

# Lock pressure signal
curl -s http://localhost:9090/metrics | grep job_lock_wait
```
Election health alerts (PromQL)
```promql
# Lock contention (p99 wait)
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Stale job debt trend
stale_jobs

# Orphan replay pressure
rate(orphan_replayed_total[5m])
```
Limitations and tradeoffs
- TTL-held loops reduce duplicate ticks but increase crash failover time by design.
- Aggressive renew cadence improves stability but increases backend load and log noise.
- A single Redis dependency simplifies architecture but creates shared failure domains across election loops.
- Different loops need different takeover windows, so one global election profile is rarely correct.
If your documented failover target is 15 seconds and one loop is configured for 90-second takeover, the document is wrong, not the incident timeline.
Next step
Run this in one sprint:
1. Build an inventory of every leader-elected loop and lock key in your control plane.
2. Define a takeover SLO per loop and verify timing with `TTL + poll` math.
3. Add alerting for lock wait and stale backlog growth, not only hard failures.
4. Kill the current leader replica during peak load and validate takeover within target windows.
Continue with AI Agent Distributed Locking and AI Agent Incident Response Runbook.