
AI Agent Cold Start Recovery

Fast process boot is irrelevant if control-plane state is still missing when traffic arrives.

Guide · 11 min read · Mar 2026
TL;DR
  • Cold-start recovery is a state-restoration problem, not just a container boot problem.
  • Startup checks should protect slow initialization without masking deadlocks forever.
  • Cordum warm-starts worker registry state from Redis snapshots to avoid 0-30s blind startup windows.
  • Every warm-start path needs stale-state controls, or fast startup becomes fast confusion.
Startup budget

Define explicit startup windows and prove they hold under failure and restart conditions.

State hydration

Preload critical routing state before accepting production traffic.

Stale-state guardrails

Warm starts need TTL and freshness checks so old state ages out safely.

Scope

This guide focuses on cold-start recovery for autonomous AI control planes where routing and worker liveness state must converge quickly after replica restarts.

The production problem

Cold starts hurt most when stateful control loops wake up empty. A scheduler that starts instantly but has no worker view behaves like a healthy process producing unhealthy outcomes.

Teams often optimize container startup and ignore state hydration. Then every restart creates a short blind period where routing decisions degrade or stall.

Recovery design needs startup budget math, warm-start artifacts, and strict stale-state expiration.

What top results miss

Source | Strong coverage | Missing piece
Kubernetes startup probes | How startup probes shield slow initialization and how `failureThreshold * periodSeconds` defines the startup budget. | No guidance for restoring distributed scheduler state before workloads are admitted.
AWS Lambda runtime lifecycle | Init/invoke lifecycle, cold vs warm starts, and initialization latency factors. | No control-plane replica convergence model for shared routing state.
Azure Functions warmup trigger | Warmup hooks for scale-out and preloading dependencies before traffic. | No multi-replica snapshot freshness strategy or cross-node consistency checks.

The gap is distributed state convergence after restart, not process boot mechanics alone.

Cold-start recovery model

Phase | Objective | Common failure | Mitigation
Bootstrap | Bring process up and connect to required control-plane dependencies. | Service is technically up but functionally blind to worker/routing state. | Separate process start from readiness admission.
Hydrate | Load prior state snapshot with timeout-bound read path. | Slow or failed reads block startup indefinitely. | Use strict read timeout and deterministic fallback to cold path.
Validate freshness | Apply TTL/freshness checks so stale state expires quickly. | Warm-started stale entries route jobs to non-existent workers. | Reset `lastSeen` with bounded TTL and prefer live heartbeats.
Converge | Refresh state continuously from live signals and periodic snapshots. | Replica behavior diverges after restart and traffic spikes. | Single-writer snapshot updates and restart verification checks.

Cordum startup behavior

These values come from the current scheduler code paths and supporting docs; where the two conflict, the source code wins.

Behavior | Current implementation | Why it matters
Worker registry TTL | Default in-memory worker TTL is `30s`. | Warm-started workers age out if fresh heartbeats do not arrive.
Warm-start read timeout | Scheduler snapshot hydration uses a `5s` timeout on startup read. | Prevents blocked startup when Redis or snapshot retrieval is slow.
Snapshot write cadence | Snapshot writer runs every `5s` behind `cordum:scheduler:snapshot:writer` lock. | Limits stale registry data while avoiding concurrent writer races.
Snapshot writer lock TTL | Lock TTL is `30s` with explicit release and TTL fallback on release failure. | Crash recovery for writer leadership is bounded to about `35s` worst case.
Cold-path fallback | If hydration fails or snapshot is absent, scheduler starts cold and waits for heartbeats. | Keeps startup reliable even when warm-start artifacts are unavailable.

Implementation examples

Warm-start hydration + snapshot writer settings (Go)

startup_warmstart.go
Go
// Startup path in scheduler main
hydrateCtx, hydrateCancel := context.WithTimeout(ctx, 5*time.Second)
snapData, snapErr := snapshotStore.GetResult(hydrateCtx, agentregistry.SnapshotKey)
hydrateCancel()

if snapErr != nil {
  slog.Warn("warm-start read failed", "error", snapErr)
} else if len(snapData) == 0 {
  slog.Info("no snapshot found, starting cold")
} else if err := registry.HydrateFromSnapshot(snapData); err != nil {
  slog.Warn("warm-start hydrate failed", "error", err)
}
// Every failure branch above degrades to the cold path: the registry
// stays empty and fills from live heartbeats.

// Snapshot writer loop
const snapshotInterval = 5 * time.Second
const snapshotLockKey = "cordum:scheduler:snapshot:writer"
const snapshotLockTTL = 30 * time.Second

Cold-start policy sheet (YAML)

cold_start_policy.yaml
YAML
cold_start_policy:
  startup_probe_budget:
    period_seconds: 10
    failure_threshold: 30
    max_startup_window: 300s
  warm_start:
    snapshot_key: sys:workers:snapshot
    read_timeout: 5s
    writer_interval: 5s
    writer_lock:
      key: cordum:scheduler:snapshot:writer
      ttl: 30s
  stale_state_controls:
    worker_ttl: 30s
    prefer_live_heartbeats: true
    fallback_mode: cold_start

Post-restart diagnostics (Bash)

cold_start_runbook.sh
Bash
# Warm-start artifact checks
redis-cli GET "sys:workers:snapshot"
redis-cli GET "cordum:scheduler:snapshot:writer"
redis-cli OBJECT IDLETIME "sys:workers:snapshot"

# Worker availability checks after restart
curl -s http://localhost:8080/api/v1/workers | jq '.workers | length'

# Side effects in scheduler metrics
curl -s http://localhost:9090/metrics | grep stale_jobs
curl -s http://localhost:9090/metrics | grep orphan_replayed

Recovery health checks (PromQL)

cold_start_health.promql
PromQL
# Dispatch latency after restart
histogram_quantile(0.99, rate(dispatch_latency_bucket[5m]))

# Stale work debt
stale_jobs
rate(orphan_replayed_total[5m])

# Lock contention during convergence
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

Limitations and tradeoffs

  • Warm-start snapshots improve startup but can temporarily surface stale worker entries.
  • Shorter TTL improves stale-state cleanup but can increase churn during transient heartbeat delays.
  • Snapshot writing adds Redis activity; high-frequency writes need capacity planning.
  • Cold-path fallback is safe but can reintroduce startup blind windows if snapshots are repeatedly unavailable.

If restart recovery relies on manual restarts to fix stale state, you have not solved cold starts. You have added ritual.

Next step

Run this in one sprint:

  1. Define startup budgets for each control-plane component and add startup probes accordingly.
  2. Implement warm-start hydration for the top state object needed for routing correctness.
  3. Add stale-state TTL checks and verify live heartbeats override hydrated data.
  4. Chaos test: restart a scheduler replica during load and measure convergence to full routing capacity.

Continue with AI Agent Config Drift Detection and AI Agent Leader Election.

Recovery speed needs state correctness

Startup time alone is a vanity metric. Measure time-to-correct-routing after restart.