The production problem
A scheduler replica can start healthy and still reject dispatch because it has not seen any worker heartbeats yet. The cluster is fine. The local registry is empty.
In systems with ~30 second heartbeat TTL windows, that startup blind spot can be long enough to trigger needless retries, noisy alerts, and tired people.
The fix is straightforward: bootstrap the registry from a recent persisted snapshot, then let live heartbeats overwrite it.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes probes guide | Readiness vs liveness roles and startup probe protection for slow starts. | No control-plane worker registry warm-start mechanism across scheduler replicas. |
| etcd Lease API docs | TTL leases, keepalive semantics, and liveness expiration behavior. | No guidance on snapshot-hydrated in-memory registries used by schedulers. |
| RabbitMQ heartbeat docs | Heartbeat timeout negotiation, false-positive ranges, and missed-heartbeat detection. | No scheduler-side warm-start path to prevent post-restart empty routing tables. |
Public guidance explains heartbeat and liveness mechanics well. It usually stops short of registry warm-start behavior inside a distributed scheduler process.
Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Registry TTL | In-memory worker registry default TTL is `30s` (`defaultWorkerTTL`). | Workers missing heartbeats are evicted after TTL window. |
| Warm-start read | Scheduler reads snapshot key `sys:workers:snapshot` with `5s` timeout during startup. | New replica can route quickly without waiting for first live heartbeat cycle. |
| Hydration timestamp | Hydrated workers are inserted with `lastSeen = now`. | Stale snapshot entries can appear alive for up to one TTL if no heartbeat follows. |
| Snapshot writer cadence | Default write interval is `5s` (`WORKER_SNAPSHOT_INTERVAL` override supported). | Snapshot freshness is bounded by write interval plus write latency. |
| Snapshot writer lock | Writer acquires `cordum:scheduler:snapshot:writer` with TTL `30s` before writing. | Only one replica writes snapshot at a time in HA deployments. |
| Failure fallback | If snapshot read/hydrate fails, scheduler logs warning and continues cold start. | Startup is resilient but may temporarily return no-worker dispatch errors. |
Code-level mechanics
Registry TTL and hydrate behavior (Go)
```go
const defaultWorkerTTL = 30 * time.Second

func NewMemoryRegistry() *MemoryRegistry {
	return NewMemoryRegistryWithTTL(defaultWorkerTTL)
}

// HydrateFromSnapshot ...
// Workers are inserted with lastSeen=time.Now() so normal TTL expiry applies.
func (r *MemoryRegistry) HydrateFromSnapshot(data []byte) error {
	...
	now := time.Now()
	for _, w := range snap.Workers {
		r.workers[w.WorkerID] = &workerEntry{hb: hb, lastSeen: now}
	}
	return nil
}
```
Startup warm-start path (Go)
```go
// Warm-start: hydrate registry from last-written snapshot
hydrateCtx, hydrateCancel := context.WithTimeout(ctx, 5*time.Second)
snapData, snapErr := snapshotStore.GetResult(hydrateCtx, agentregistry.SnapshotKey)
hydrateCancel()
if len(snapData) > 0 {
	_ = registry.HydrateFromSnapshot(snapData)
}

snapshotInterval := 5 * time.Second
const snapshotLockKey = "cordum:scheduler:snapshot:writer"
const snapshotLockTTL = 30 * time.Second
```
This is the practical point: the snapshot read is bounded by a timeout and never blocks forever. If it fails, the scheduler starts cold and keeps moving.
Snapshot writer lock path (Go)
```go
token, err := jobStore.TryAcquireLock(lockCtx, snapshotLockKey, snapshotLockTTL)
if token == "" {
	return // another replica owns the writer role right now
}
snap := agentregistry.BuildSnapshot(registry.Snapshot(), current.TopicToPool())
data, _ := json.Marshal(snap)
_ = snapshotStore.PutResult(writeCtx, agentregistry.SnapshotKey, data)
_ = jobStore.ReleaseLock(releaseCtx, snapshotLockKey, token)
```
Operator runbook
```shell
# 1) Check snapshot key exists
redis-cli GET "sys:workers:snapshot" | jq '.captured_at, .writer_id, (.workers | length)'

# 2) Restart one scheduler replica
kubectl rollout restart deploy/cordum-scheduler -n cordum

# 3) Confirm warm-start log path
# expected: "registry hydrated from snapshot" or "no snapshot found, starting cold"

# 4) Verify worker availability right after restart
curl -H "X-API-Key: $CORDUM_API_KEY" "$CORDUM_URL/api/v1/workers" | jq '.items | length'

# 5) If stale workers appear after restart, track TTL expiry window (30s default)
```
Limitations and tradeoffs
- Warm-start improves availability, but snapshot freshness is bounded by the writer interval.
- `lastSeen=now` on hydrate can temporarily hide dead workers until TTL eviction.
- The writer lock protects consistency, but lock contention can skip writes on some replicas.
- If Redis is unavailable at startup, warm-start is bypassed and cold-start behavior returns.
Warm-start snapshots reduce false no-worker errors. They do not replace live heartbeats or proper worker liveness checks.
Next step
Run one restart drill with measurement:
1. Capture dispatch errors in the first minute after scheduler restart.
2. Compare cold-start behavior (snapshot disabled) vs warm-start behavior.
3. Track stale-worker lifetime after hydrate to validate TTL assumptions.
4. Tune `WORKER_SNAPSHOT_INTERVAL` only after measuring Redis write impact.
Continue with Health Check Strategy and Cold Start Recovery.