The production problem
A scheduler replica can start healthy and still reject dispatch because it has not seen any worker heartbeats yet. The cluster is fine. The local registry is empty.
In systems with ~30 second heartbeat TTL windows, that startup blind spot can be long enough to trigger needless retries, noisy alerts, and tired people.
The fix is straightforward: bootstrap the registry from a recent persisted snapshot, then let live heartbeats overwrite it.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes probes guide | Readiness vs liveness roles and startup probe protection for slow starts. | No control-plane worker registry warm-start mechanism across scheduler replicas. |
| etcd Lease API docs | TTL leases, keepalive semantics, and liveness expiration behavior. | No guidance on snapshot-hydrated in-memory registries used by schedulers. |
| RabbitMQ heartbeat docs | Heartbeat timeout negotiation, false-positive ranges, and missed-heartbeat detection. | No scheduler-side warm-start path to prevent post-restart empty routing tables. |
Public guidance explains heartbeat and liveness mechanics well. It usually stops short of registry warm-start behavior inside a distributed scheduler process.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Registry TTL | In-memory worker registry default TTL is `30s` (`defaultWorkerTTL`). | Workers missing heartbeats are evicted after TTL window. |
| Warm-start read | Scheduler reads snapshot key `sys:workers:snapshot` with `5s` timeout during startup. | New replica can route quickly without waiting for first live heartbeat cycle. |
| Hydration timestamp | Hydrated workers are inserted with `lastSeen = now`. | Stale snapshot entries can appear alive for up to one TTL if no heartbeat follows. |
| Snapshot writer cadence | Default write interval is `5s` (`WORKER_SNAPSHOT_INTERVAL` override supported). | Snapshot freshness is bounded by write interval plus write latency. |
| Snapshot writer lock | Writer acquires `cordum:scheduler:snapshot:writer` with TTL `30s` before writing. | Only one replica writes snapshot at a time in HA deployments. |
| Failure fallback | If snapshot read/hydrate fails, scheduler logs warning and continues cold start. | Startup is resilient but may temporarily return no-worker dispatch errors. |
Code-level mechanics
Registry TTL and hydrate behavior (Go)
```go
const defaultWorkerTTL = 30 * time.Second

func NewMemoryRegistry() *MemoryRegistry {
	return NewMemoryRegistryWithTTL(defaultWorkerTTL)
}

// HydrateFromSnapshot ...
// Workers are inserted with lastSeen=time.Now() so normal TTL expiry applies.
func (r *MemoryRegistry) HydrateFromSnapshot(data []byte) error {
	...
	now := time.Now()
	for _, w := range snap.Workers {
		r.workers[w.WorkerID] = &workerEntry{hb: hb, lastSeen: now}
	}
	return nil
}
```
Startup warm-start path (Go)
```go
// Warm-start: hydrate registry from last-written snapshot
hydrateCtx, hydrateCancel := context.WithTimeout(ctx, 5*time.Second)
snapData, snapErr := snapshotStore.GetResult(hydrateCtx, agentregistry.SnapshotKey)
hydrateCancel()
if len(snapData) > 0 {
	_ = registry.HydrateFromSnapshot(snapData)
}

snapshotInterval := 5 * time.Second
const snapshotLockKey = "cordum:scheduler:snapshot:writer"
const snapshotLockTTL = 30 * time.Second
```
This is the practical point: the snapshot read is bounded by a timeout and never blocks forever. If it fails, the scheduler starts cold and keeps moving.
Snapshot writer lock path (Go)
```go
token, err := jobStore.TryAcquireLock(lockCtx, snapshotLockKey, snapshotLockTTL)
if token == "" {
	return // another replica owns the writer role right now
}
snap := agentregistry.BuildSnapshot(registry.Snapshot(), current.TopicToPool())
data, _ := json.Marshal(snap)
_ = snapshotStore.PutResult(writeCtx, agentregistry.SnapshotKey, data)
_ = jobStore.ReleaseLock(releaseCtx, snapshotLockKey, token)
```
Operator runbook
```shell
# 1) Check snapshot key exists
redis-cli GET "sys:workers:snapshot" | jq '.captured_at, .writer_id, (.workers | length)'

# 2) Restart one scheduler replica
kubectl rollout restart deploy/cordum-scheduler -n cordum

# 3) Confirm warm-start log path
# expected: "registry hydrated from snapshot" or "no snapshot found, starting cold"

# 4) Verify worker availability right after restart
curl -H "X-API-Key: $CORDUM_API_KEY" "$CORDUM_URL/api/v1/workers" | jq '.items | length'

# 5) If stale workers appear after restart, track TTL expiry window (30s default)
```
Limitations and tradeoffs
- Warm-start improves availability, but snapshot freshness is bounded by the writer interval.
- `lastSeen=now` on hydrate can temporarily hide dead workers until TTL eviction.
- The writer lock protects consistency, but lock contention can skip writes on some replicas.
- If Redis is unavailable at startup, warm-start is bypassed and cold-start behavior returns.
Warm-start snapshots reduce false no-worker errors. They do not replace live heartbeats or proper worker liveness checks.
Next step
Run one restart drill with measurement:
1. Capture dispatch errors in the first minute after scheduler restart.
2. Compare cold-start behavior (snapshot disabled) vs warm-start behavior.
3. Track stale-worker lifetime after hydrate to validate TTL assumptions.
4. Tune `WORKER_SNAPSHOT_INTERVAL` only after measuring Redis write impact.
Continue with Health Check Strategy and Cold Start Recovery.