
AI Agent Worker Heartbeat Warm-Start

Heartbeats are necessary. They are not enough if your scheduler boots with an empty registry.

Deep Dive · 10 min read · Mar 2026
TL;DR
- A heartbeat-only registry can create a startup blind spot where schedulers think no workers exist.
- Cordum warm-start hydrates the worker registry from `sys:workers:snapshot` before waiting for fresh heartbeats.
- Default timing in the current code path: heartbeat TTL `30s`, snapshot write interval `5s`, warm-start read timeout `5s`.
- Hydrating with `lastSeen=now` reduces cold-start errors but can preserve stale workers for up to one TTL window.
Cold-start fix

Registry hydration avoids the initial 0-30s empty-worker window on new scheduler replicas.

Single writer

Snapshot writes are serialized with lock key `cordum:scheduler:snapshot:writer`.

TTL tradeoff

Hydrated entries get fresh timestamps, so stale workers can survive until TTL expiry if no heartbeat arrives.

Scope

This guide focuses on scheduler worker-registry startup behavior and heartbeat visibility. It does not cover worker handler logic or job retry policy.

The production problem

A scheduler replica can start healthy and still reject dispatch because it has not seen any worker heartbeats yet. The cluster is fine. The local registry is empty.

In systems with ~30 second heartbeat TTL windows, that startup blind spot can be long enough to trigger needless retries, noisy alerts, and tired people.

The fix is straightforward: bootstrap the registry from a recent persisted snapshot, then let live heartbeats overwrite it.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Kubernetes probes guide | Readiness vs liveness roles and startup probe protection for slow starts. | No control-plane worker registry warm-start mechanism across scheduler replicas. |
| etcd Lease API docs | TTL leases, keepalive semantics, and liveness expiration behavior. | No guidance on snapshot-hydrated in-memory registries used by schedulers. |
| RabbitMQ heartbeat docs | Heartbeat timeout negotiation, false-positive ranges, and missed-heartbeat detection. | No scheduler-side warm-start path to prevent post-restart empty routing tables. |

Public guidance explains heartbeat and liveness mechanics well. It usually stops short of registry warm-start behavior inside a distributed scheduler process.

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Registry TTL | In-memory worker registry default TTL is `30s` (`defaultWorkerTTL`). | Workers missing heartbeats are evicted after the TTL window. |
| Warm-start read | Scheduler reads snapshot key `sys:workers:snapshot` with a `5s` timeout during startup. | New replica can route quickly without waiting for the first live heartbeat cycle. |
| Hydration timestamp | Hydrated workers are inserted with `lastSeen = now`. | Stale snapshot entries can appear alive for up to one TTL if no heartbeat follows. |
| Snapshot writer cadence | Default write interval is `5s` (`WORKER_SNAPSHOT_INTERVAL` override supported). | Snapshot freshness is bounded by the write interval plus write latency. |
| Snapshot writer lock | Writer acquires `cordum:scheduler:snapshot:writer` with TTL `30s` before writing. | Only one replica writes the snapshot at a time in HA deployments. |
| Failure fallback | If snapshot read/hydrate fails, the scheduler logs a warning and continues cold start. | Startup is resilient but may temporarily return no-worker dispatch errors. |

Code-level mechanics

Registry TTL and hydrate behavior (Go)

registry_memory.go
Go
const defaultWorkerTTL = 30 * time.Second

func NewMemoryRegistry() *MemoryRegistry {
  return NewMemoryRegistryWithTTL(defaultWorkerTTL)
}

// HydrateFromSnapshot ...
// Workers are inserted with lastSeen=time.Now() so normal TTL expiry applies.
func (r *MemoryRegistry) HydrateFromSnapshot(data []byte) error {
  ...
  now := time.Now()
  for _, w := range snap.Workers {
    r.workers[w.WorkerID] = &workerEntry{hb: hb, lastSeen: now}
  }
  return nil
}
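A minimal, self-contained version of the hydrate path might look like this. The snapshot JSON schema and field names are assumed for illustration; only the `lastSeen = now` behavior is taken from the snippet above:

```go
package main

import (
	"encoding/json"
	"fmt"
	"time"
)

// snapshotWorker mirrors one persisted registry entry (schema assumed).
type snapshotWorker struct {
	WorkerID string   `json:"worker_id"`
	Topics   []string `json:"topics"`
}

// snapshot is the assumed shape of the persisted registry snapshot.
type snapshot struct {
	Workers []snapshotWorker `json:"workers"`
}

type workerEntry struct {
	topics   []string
	lastSeen time.Time
}

type memoryRegistry struct {
	workers map[string]*workerEntry
}

// hydrateFromSnapshot inserts snapshot workers with lastSeen=now so the
// normal TTL clock starts at hydrate time, as described above.
func (r *memoryRegistry) hydrateFromSnapshot(data []byte) error {
	var snap snapshot
	if err := json.Unmarshal(data, &snap); err != nil {
		return fmt.Errorf("decode snapshot: %w", err)
	}
	now := time.Now()
	for _, w := range snap.Workers {
		r.workers[w.WorkerID] = &workerEntry{topics: w.Topics, lastSeen: now}
	}
	return nil
}

func main() {
	r := &memoryRegistry{workers: map[string]*workerEntry{}}
	data := []byte(`{"workers":[{"worker_id":"w1","topics":["jobs"]}]}`)
	if err := r.hydrateFromSnapshot(data); err != nil {
		panic(err)
	}
	fmt.Println(len(r.workers))
}
```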

Startup warm-start path (Go)

scheduler_main_warmstart.go
Go
// Warm-start: hydrate registry from last-written snapshot
hydrateCtx, hydrateCancel := context.WithTimeout(ctx, 5*time.Second)
snapData, snapErr := snapshotStore.GetResult(hydrateCtx, agentregistry.SnapshotKey)
hydrateCancel()

// Hydrate only on a successful, non-empty read; errors fall through to cold start.
if snapErr == nil && len(snapData) > 0 {
  _ = registry.HydrateFromSnapshot(snapData)
}

snapshotInterval := 5 * time.Second
const snapshotLockKey = "cordum:scheduler:snapshot:writer"
const snapshotLockTTL = 30 * time.Second

This is the practical point: the snapshot read is bounded by a timeout and never blocks forever. If it fails, the scheduler starts cold and keeps moving.

Snapshot writer lock path (Go)

snapshot_writer_loop.go
Go
token, err := jobStore.TryAcquireLock(lockCtx, snapshotLockKey, snapshotLockTTL)
if err != nil || token == "" {
  return // another replica owns the writer role this cycle
}

snap := agentregistry.BuildSnapshot(registry.Snapshot(), current.TopicToPool())
data, _ := json.Marshal(snap)
_ = snapshotStore.PutResult(writeCtx, agentregistry.SnapshotKey, data)
_ = jobStore.ReleaseLock(releaseCtx, snapshotLockKey, token)

Operator runbook

heartbeat_warmstart_runbook.sh
Bash
# 1) Check snapshot key exists
redis-cli GET "sys:workers:snapshot" | jq '.captured_at, .writer_id, (.workers | length)'

# 2) Restart one scheduler replica
kubectl rollout restart deploy/cordum-scheduler -n cordum

# 3) Confirm warm-start log path
# expected: "registry hydrated from snapshot" or "no snapshot found, starting cold"

# 4) Verify worker availability right after restart
curl -H "X-API-Key: $CORDUM_API_KEY" "$CORDUM_URL/api/v1/workers" | jq '.items | length'

# 5) If stale workers appear after restart, track TTL expiry window (30s default)

Limitations and tradeoffs

- Warm-start improves availability, but snapshot freshness is bounded by the writer interval.
- `lastSeen=now` on hydrate can temporarily hide dead workers until TTL eviction.
- The writer lock protects consistency, but lock contention can skip writes on some replicas.
- If Redis is unavailable at startup, warm-start is bypassed and cold-start behavior returns.

Warm-start snapshots reduce false no-worker errors. They do not replace live heartbeats or proper worker liveness checks.

Next step

Run one restart drill with measurement:

1. Capture dispatch errors in the first minute after a scheduler restart.
2. Compare cold-start behavior (snapshot disabled) with warm-start behavior.
3. Track stale-worker lifetime after hydrate to validate TTL assumptions.
4. Tune `WORKER_SNAPSHOT_INTERVAL` only after measuring Redis write impact.

Continue with Health Check Strategy and Cold Start Recovery.

Startup reliability matters

Most outages do not start with total failure. They start with one process coming up slightly wrong.