
AI Agent Leader Election

Most election bugs are timing bugs wearing lock-shaped costumes.

Guide · 12 min read · Mar 2026
TL;DR
  • Leader election is a timing problem first, not a lock API problem.
  • If lease duration and tick cadence are misaligned, failover stalls even when the system is healthy.
  • Cordum uses multiple independent single-writer loops with different TTL and interval budgets.
  • You need lock-key diagnostics and failover SLOs before the first multi-replica incident.
Lease budgets

Set `lease`, `renew`, and `retry` from explicit outage budgets and work duration.

Single-writer loops

Different loops need different leadership behavior. One lock policy for all loops creates regressions.

Failover math

Estimate worst-case takeover time with TTL + poll interval, then test it in drills.

Scope

This guide is for teams running autonomous AI agents across multiple replicas and needing predictable single-writer behavior for reconciliation, replay, and background maintenance loops.

The production problem

Two leaders running the same control loop can corrupt state fast. Zero leaders let stale jobs accumulate silently until customers notice before your dashboards do.

Most teams focus on lock acquisition and stop there. The hard part is setting lease timing so leadership remains stable in normal operation but hands over quickly on failure.

Election logic that is not tied to clear failover math becomes folklore. Folklore does not help at 03:00.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Kubernetes Leases | Lease objects for leader selection and heartbeat semantics in control-plane systems. | No practical guidance for Redis-based election loops in agent runtimes with mixed lock-release patterns. |
| kube-scheduler leader election flags | Concrete defaults for `lease-duration`, `renew-deadline`, and `retry-period`. | No mapping to queue-loop failover math (`ttl + poll`) for autonomous workflow orchestrators. |
| etcd Election API | Campaign/Observe/Resign flow with lease-bound leadership and ownership checks. | No day-2 operator runbook for debugging lock contention and stale leadership keys. |

The practical gap is operating multiple election loops with different timing and blast radius requirements inside one AI control plane.

Leader-election timing model

Treat lease tuning as an SLO decision. Start with tolerated stale-work window, then back into TTL and poll values. Do not do the reverse.

| Parameter | Rule of thumb | What breaks if wrong |
| --- | --- | --- |
| Lease duration (TTL) | Must exceed normal work tick + network jitter + one retry window | Too small: frequent leader churn. Too large: slow takeover after crash. |
| Renew cadence | Common baseline: renew every TTL/3 | Too sparse: expiry during transient latency. Too aggressive: avoidable Redis/API load. |
| Poll / tick interval | Poll frequency should reflect your stale-work detection objective | Too slow: high recovery lag. Too fast: noisy lock contention and extra compute spend. |
| Failover budget | Approximate upper bound: `TTL + pollInterval` for TTL-held loops | False SLO assumptions and broken incident expectations. |

Cordum single-writer loops

These values were verified from current source code paths in `cmd/cordum-scheduler` and `core/workflow`. They are not guessed from slides or stale snippets.

| Loop | Leader lock behavior | Lock key | Typical takeover window |
| --- | --- | --- | --- |
| Scheduler reconciler | Default scan interval 30s, lock TTL 60s, renewal loop at TTL/3, no explicit release. | `cordum:reconciler:default` | ~60-90s (TTL + next poll tick). |
| Scheduler pending replayer | Default poll 30s, lock TTL 60s, no explicit release. | `cordum:replayer:pending` | ~60-90s in crash scenarios. |
| Workflow reconciler | Default poll 5s, lock TTL 10s, no explicit release. | `cordum:workflow-engine:reconciler:default` | ~10-15s. |
| Workflow delay poller | Poll 5s, lock TTL 10s, no explicit release. | `cordum:wf:delay:poller` | ~10-15s. |
| Worker snapshot writer | Tick every 5s, lock TTL 30s, explicit release after write. | `cordum:scheduler:snapshot:writer` | ~35s worst case after leader crash. |

Implementation examples

Lease-based leader loop (Go)

leader_loop.go
Go
package election

import (
  "context"
  "time"
)

// LockStore abstracts a TTL-based lock backend (e.g. Redis SET NX PX).
type LockStore interface {
  // TryAcquireLock returns a fencing token, or "" if the lock is held.
  TryAcquireLock(ctx context.Context, key string, ttl time.Duration) (string, error)
  // RenewLock extends the TTL while token still owns the lock.
  RenewLock(ctx context.Context, key, token string, ttl time.Duration) error
  // ReleaseLock deletes the lock only if token still owns it.
  ReleaseLock(ctx context.Context, key, token string) error
}

type LeaderLoopConfig struct {
  PollInterval time.Duration
  LockTTL      time.Duration
  RenewEvery   time.Duration // 0 means no renewal
  HoldUntilTTL bool          // true for single-writer loop windows
}

func RunLeaderLoop(
  ctx context.Context,
  store LockStore,
  lockKey string,
  cfg LeaderLoopConfig,
  tick func(context.Context) error,
) {
  ticker := time.NewTicker(cfg.PollInterval)
  defer ticker.Stop()

  for {
    select {
    case <-ctx.Done():
      return
    case <-ticker.C:
      token, err := store.TryAcquireLock(ctx, lockKey, cfg.LockTTL)
      if err != nil || token == "" {
        continue // another replica holds the lock this round
      }

      // Keep the lease alive for ticks that outlast the TTL.
      stopRenew := make(chan struct{})
      if cfg.RenewEvery > 0 {
        go renewLoop(ctx, store, lockKey, token, cfg.LockTTL, cfg.RenewEvery, stopRenew)
      }

      _ = tick(ctx)
      close(stopRenew)

      // TTL-hold mode keeps the lock until expiry to suppress duplicate
      // ticks; explicit release trades that for faster takeover.
      if !cfg.HoldUntilTTL {
        // Fresh context so a canceled ctx cannot leak the lock.
        _ = store.ReleaseLock(context.Background(), lockKey, token)
      }
    }
  }
}

// renewLoop extends the lock TTL until stop closes or ctx ends.
func renewLoop(
  ctx context.Context,
  store LockStore,
  key, token string,
  ttl, every time.Duration,
  stop <-chan struct{},
) {
  t := time.NewTicker(every)
  defer t.Stop()
  for {
    select {
    case <-ctx.Done():
      return
    case <-stop:
      return
    case <-t.C:
      _ = store.RenewLock(ctx, key, token, ttl)
    }
  }
}

Election policy sheet (YAML)

leader_election.yaml
YAML
leader_election:
  scheduler_reconciler:
    poll_interval: 30s
    lock_key: cordum:reconciler:default
    lock_ttl: 60s
    renew_every: 20s
    release_mode: ttl_hold
  scheduler_pending_replayer:
    poll_interval: 30s
    lock_key: cordum:replayer:pending
    lock_ttl: 60s
    renew_every: 0s
    release_mode: ttl_hold
  workflow_reconciler:
    poll_interval: 5s
    lock_key: cordum:workflow-engine:reconciler:default
    lock_ttl: 10s
    renew_every: 0s
    release_mode: ttl_hold
  snapshot_writer:
    poll_interval: 5s
    lock_key: cordum:scheduler:snapshot:writer
    lock_ttl: 30s
    renew_every: 0s
    release_mode: explicit

Runbook commands for leadership state (Bash)

leader_election_runbook.sh
Bash
# Scheduler leadership checks (GET shows the holder token)
redis-cli GET "cordum:reconciler:default"
redis-cli PTTL "cordum:reconciler:default"   # ms until lease expiry
redis-cli GET "cordum:replayer:pending"

# Workflow leadership checks
redis-cli GET "cordum:workflow-engine:reconciler:default"
redis-cli GET "cordum:wf:delay:poller"

# Snapshot writer checks
redis-cli GET "cordum:scheduler:snapshot:writer"
redis-cli OBJECT IDLETIME "sys:workers:snapshot"

# Lock pressure signal
curl -s http://localhost:9090/metrics | grep job_lock_wait

Election health alerts (PromQL)

leader_election_health.promql
PromQL
# Lock contention (p99 wait)
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Stale job backlog (alert on sustained growth, not a single sample)
stale_jobs

# Orphan replay pressure
rate(orphan_replayed_total[5m])

Limitations and tradeoffs

  • TTL-held loops reduce duplicate ticks but increase crash failover time by design.
  • Aggressive renew cadence improves stability but increases backend load and log noise.
  • A single Redis dependency simplifies architecture but creates shared failure domains across election loops.
  • Different loops need different takeover windows, so one global election profile is rarely correct.

If your documented failover target is 15 seconds and one loop is configured for 90-second takeover, the document is wrong, not the incident timeline.

Next step

Run this in one sprint:

  1. Build an inventory of every leader-elected loop and lock key in your control plane.
  2. Define a takeover SLO per loop and verify timing with `TTL + poll` math.
  3. Add alerting for lock wait and stale backlog growth, not only hard failures.
  4. Kill the current leader replica during peak load and validate takeover within target windows.

Continue with AI Agent Distributed Locking and AI Agent Incident Response Runbook.

Election drift is silent debt

Document lock keys, timing budgets, and takeover SLOs in the same place your on-call team actually reads.