The production problem
Cold starts hurt most when stateful control loops wake up empty. A scheduler that starts instantly but has no worker view behaves like a healthy process producing unhealthy outcomes.
Teams often optimize container startup and ignore state hydration. Then every restart creates a short blind period where routing decisions degrade or stall.
Recovery design needs startup budget math, warm-start artifacts, and strict stale-state expiration.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes startup probes | How startup probes shield slow initialization and how `failureThreshold * periodSeconds` defines startup budget. | No guidance for restoring distributed scheduler state before workloads are admitted. |
| AWS Lambda runtime lifecycle | Init/invoke lifecycle, cold vs warm starts, and initialization latency factors. | No control-plane replica convergence model for shared routing state. |
| Azure Functions warmup trigger | Warmup hooks for scale-out and preloading dependencies before traffic. | No multi-replica snapshot freshness strategy or cross-node consistency checks. |
The gap is distributed state convergence after restart, not process boot mechanics alone.
Cold-start recovery model
| Phase | Objective | Common failure | Mitigation |
|---|---|---|---|
| Bootstrap | Bring process up and connect to required control-plane dependencies. | Service is technically up but functionally blind to worker/routing state. | Separate process start from readiness admission. |
| Hydrate | Load prior state snapshot with timeout-bound read path. | Slow or failed reads block startup indefinitely. | Use strict read timeout and deterministic fallback to cold path. |
| Validate freshness | Apply TTL/freshness checks so stale state expires quickly. | Warm-started stale entries route jobs to non-existent workers. | Reset `lastSeen` with bounded TTL and prefer live heartbeats. |
| Converge | Refresh state continuously from live signals and periodic snapshots. | Replica behavior diverges after restart and traffic spikes. | Single-writer snapshot updates and restart verification checks. |
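The Bootstrap mitigation — separating process start from readiness admission — can be sketched as a readiness gate that only flips after hydration (or the cold-path fallback) completes. The `/readyz` and `/healthz` paths and the `readinessGate` type are illustrative, not Cordum's actual API.

```go
package main

import (
	"fmt"
	"net/http"
	"net/http/httptest"
	"sync/atomic"
)

// readinessGate keeps the scheduler out of rotation until state
// hydration has finished, even though the process itself is alive.
type readinessGate struct {
	ready atomic.Bool
}

// MarkReady is called once hydration succeeds or the cold-path
// fallback is chosen; until then the readiness probe reports 503.
func (g *readinessGate) MarkReady() { g.ready.Store(true) }

func (g *readinessGate) Handler(w http.ResponseWriter, _ *http.Request) {
	if g.ready.Load() {
		w.WriteHeader(http.StatusOK)
		return
	}
	w.WriteHeader(http.StatusServiceUnavailable)
}

func main() {
	gate := &readinessGate{}
	srv := httptest.NewServer(http.HandlerFunc(gate.Handler))
	defer srv.Close()

	// Process is up, but routing state is not hydrated yet: not ready.
	before, _ := http.Get(srv.URL)
	before.Body.Close()

	// Hydration (or cold fallback) finishes; now admit traffic.
	gate.MarkReady()
	after, _ := http.Get(srv.URL)
	after.Body.Close()

	fmt.Println(before.StatusCode, after.StatusCode) // 503 200
}
```

Liveness stays a separate, always-passing check; only readiness is tied to hydration, so the orchestrator never kills a process that is merely warming up.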
Cordum startup behavior
These values come from current scheduler code paths and supporting docs, with conflicts resolved by checking source code.
| Behavior | Current implementation | Why it matters |
|---|---|---|
| Worker registry TTL | Default in-memory worker TTL is `30s`. | Warm-started workers age out if fresh heartbeats do not arrive. |
| Warm-start read timeout | Scheduler snapshot hydration uses a `5s` timeout on startup read. | Prevents blocked startup when Redis or snapshot retrieval is slow. |
| Snapshot write cadence | Snapshot writer runs every `5s` behind `cordum:scheduler:snapshot:writer` lock. | Limits stale registry data while avoiding concurrent writer races. |
| Snapshot writer lock TTL | Lock TTL is `30s` with explicit release and TTL fallback on release failure. | Crash recovery for writer leadership is bounded to about `35s` worst case. |
| Cold-path fallback | If hydration fails or snapshot is absent, scheduler starts cold and waits for heartbeats. | Keeps startup reliable even when warm-start artifacts are unavailable. |
Implementation examples
Warm-start hydration + snapshot writer settings (Go)
// Startup path in scheduler main
hydrateCtx, hydrateCancel := context.WithTimeout(ctx, 5*time.Second)
snapData, snapErr := snapshotStore.GetResult(hydrateCtx, agentregistry.SnapshotKey)
hydrateCancel()
if snapErr != nil {
slog.Warn("warm-start read failed", "error", snapErr)
} else if len(snapData) == 0 {
slog.Info("no snapshot found, starting cold")
} else if err := registry.HydrateFromSnapshot(snapData); err != nil {
slog.Warn("warm-start hydrate failed", "error", err)
}
// Snapshot writer loop
const snapshotInterval = 5 * time.Second
const snapshotLockKey = "cordum:scheduler:snapshot:writer"
const snapshotLockTTL = 30 * time.Second

Cold-start policy sheet (YAML)
cold_start_policy:
  startup_probe_budget:
    period_seconds: 10
    failure_threshold: 30
    max_startup_window: 300s
  warm_start:
    snapshot_key: sys:workers:snapshot
    read_timeout: 5s
    writer_interval: 5s
    writer_lock:
      key: cordum:scheduler:snapshot:writer
      ttl: 30s
  stale_state_controls:
    worker_ttl: 30s
    prefer_live_heartbeats: true
    fallback_mode: cold_start

Post-restart diagnostics (Bash)
# Warm-start artifact checks
redis-cli GET "sys:workers:snapshot"
redis-cli GET "cordum:scheduler:snapshot:writer"
redis-cli OBJECT IDLETIME "sys:workers:snapshot"

# Worker availability checks after restart
curl -s http://localhost:8080/api/v1/workers | jq '.workers | length'

# Side effects in scheduler metrics
curl -s http://localhost:9090/metrics | grep stale_jobs
curl -s http://localhost:9090/metrics | grep orphan_replayed
Recovery health checks (PromQL)
# Dispatch latency after restart
histogram_quantile(0.99, rate(dispatch_latency_bucket[5m]))

# Stale work debt
stale_jobs
rate(orphan_replayed_total[5m])

# Lock contention during convergence
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))
Limitations and tradeoffs
- Warm-start snapshots improve startup but can temporarily surface stale worker entries.
- Shorter TTL improves stale-state cleanup but can increase churn during transient heartbeat delays.
- Snapshot writing adds Redis activity; high-frequency writes need capacity planning.
- Cold-path fallback is safe but can reintroduce startup blind windows if snapshots are repeatedly unavailable.
If restart recovery relies on manual restarts to fix stale state, you have not solved cold starts. You have added ritual.
Next step
Run this in one sprint:
1. Define startup budgets for each control-plane component and add startup probes accordingly.
2. Implement warm-start hydration for the top state object needed for routing correctness.
3. Add stale-state TTL checks and verify live heartbeats override hydrated data.
4. Chaos test: restart a scheduler replica during load and measure convergence to full routing capacity.
Continue with AI Agent Config Drift Detection and AI Agent Leader Election.