## The production problem
Rolling restarts are regular operations. They happen on deploys, node drains, security patching, and cluster autoscaling.
If restart behavior is only modeled at the pod level, you miss the real failure mode: state handoff between replicas that own locks, queue offsets, and in-flight RPC chains.
The result is predictable: duplicate dispatch, delayed takeover, and retry storms that start five minutes after the rollout looked “green.”
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| GKE cluster upgrade best practices | Upgrade sequencing, surge strategy, and graceful termination considerations during node/pod replacement. | No scheduler lock-handoff or queue-drain guidance for control-plane workloads. |
| Kubernetes PodDisruptionBudget docs | How disruption budgets limit voluntary evictions during maintenance events. | No coupling between PDB settings and application-level drain order across HTTP, gRPC, and queue handlers. |
| gRPC graceful shutdown | Drain in-flight RPCs and force-stop fallback after timeout. | No multi-service orchestration with message bus drains and Redis lock expiry. |
The missing layer is restart choreography across orchestrator policy, transport drain, and lock ownership. Autonomous agents need all three in one rollout contract.
## Rollout budget math
Teams set `maxUnavailable` without calculating rollout duration. That is backwards. Compute the budget first, then set rollout knobs.
```
# Conservative rollout budget
#   R          = replicas
#   U          = maxUnavailable
#   T_shutdown = graceful shutdown timeout
#   T_start    = time until new pod is Ready
#   T_minReady = minReadySeconds

waves        = ceil(R / U)
wave_time    = T_shutdown + T_start + T_minReady
rollout_time ~= waves * wave_time

# Example: R=6, U=1, shutdown=15s, startup=20s, minReady=10s
# waves=6, wave_time=45s => ~270s total (~4.5 min)
```
For a 6-replica service at conservative settings, rollout is roughly 4.5 minutes. If your change window is two minutes, you need parallelism and stronger readiness controls, not hope.
| Phase | Action | Failure if skipped | Mitigation |
|---|---|---|---|
| Pre-flight | Validate PDB, replica count, and maxUnavailable before starting rollout. | Controller evicts too many pods and quorum-dependent paths stall. | Enforce pre-rollout checklist and block deploy if guardrails fail. |
| Drain window | Mark pod unready, stop new ingress, then drain HTTP/gRPC and queue handlers. | New work lands on a pod that is already terminating. | Readiness-first drain sequence plus bounded shutdown timeout. |
| Handoff | Release locks or rely on bounded TTL takeover if release is missed. | Orphaned lock delays work resumption and inflates backlog. | Keep lock renewal healthy and validate takeover within defined SLO. |
| Post-rollout | Verify no duplicate dispatch, stale backlog, or elevated retry rates. | Hidden data-plane regressions surface minutes after rollout completes. | Run mandatory post-rollout query pack before closing deployment. |
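The pre-flight row above is the easiest to automate. A sketch of the guardrail check, assuming the PDB uses `minAvailable` (function and parameter names are illustrative):

```go
package main

import "fmt"

// preflightOK verifies that evicting maxUnavailable pods cannot push the
// ready count below the PDB's minAvailable floor. Block the deploy if it
// returns an error.
func preflightOK(replicas, minAvailable, maxUnavailable int) error {
	if replicas-maxUnavailable < minAvailable {
		return fmt.Errorf("rollout would leave %d ready pods, PDB requires %d",
			replicas-maxUnavailable, minAvailable)
	}
	return nil
}

func main() {
	fmt.Println(preflightOK(6, 5, 1)) // passes: 5 pods remain ready
	fmt.Println(preflightOK(6, 5, 2)) // fails: only 4 would remain ready
}
```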
## Cordum restart behavior
These values are verified against current documentation and runtime code paths in the scheduler, gateway, workflow, context-engine, and safety-kernel services.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Service shutdown timeout | Cordum services target a 15s graceful shutdown path on SIGTERM. | Predictable upper bound for drain time in rolling restarts. |
| Kubernetes envelope | Docs recommend `terminationGracePeriodSeconds: 30`, leaving headroom over 15s service timeout. | Reduces forced SIGKILL risk during node drains or rollout pressure. |
| API Gateway | Stops bus taps, drains HTTP, drains gRPC with forced fallback, then shuts metrics. | Prevents late request acceptance while transport shutdown is in progress. |
| Scheduler lock ownership | Job lock TTL is 60s with renewal every 20s; surviving replica takes ownership after expiry if a pod dies abruptly. | Caps worst-case takeover delay but creates temporary throughput dip after hard kills. |
| Broadcast convergence | During rolling restarts, ephemeral broadcast consumers can miss messages while a replica is replaced. | State converges via self-healing loops, but short-lived UI drift is expected. |
## Implementation examples
### Deployment strategy and termination envelope (YAML)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cordum-scheduler
  namespace: cordum
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  minReadySeconds: 10
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: scheduler
          image: ghcr.io/cordum-io/cordum-scheduler:latest
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 2"]
```

### Readiness-first drain pattern (Go)
```go
// Readiness flips to 503 first, so the pod stops receiving new traffic
// before transport shutdown begins. httpServer, engine, and
// metricsServer come from the service's setup code.
var draining atomic.Bool

mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
	if draining.Load() {
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
})

sigCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

<-sigCtx.Done()
draining.Store(true) // stop new traffic via readiness

shutdownCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()

_ = httpServer.Shutdown(shutdownCtx) // drain in-flight HTTP
engine.Stop()                        // bounded in-flight wait
_ = metricsServer.Shutdown(shutdownCtx)
```

### Rolling restart verification runbook (Bash)
```bash
# 1) Verify disruption safety before rollout
kubectl get pdb -n cordum
kubectl get deploy -n cordum

# 2) Start rollout and watch progress
kubectl rollout restart deployment/cordum-scheduler -n cordum
kubectl rollout status deployment/cordum-scheduler -n cordum

# 3) Confirm graceful drain logs
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "shutting down gracefully|gRPC server drained"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "scheduler shutting down gracefully|graceful shutdown deadline"

# 4) Verify lock handoff after forced pod kill drill
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"
```
### Post-rollout regression checks (PromQL)

```promql
# Rollout should not create prolonged unavailable replicas
max_over_time(kube_deployment_status_replicas_unavailable{namespace="cordum"}[10m])

# Restart storms after rollout are a red flag
increase(kube_pod_container_status_restarts_total{namespace="cordum"}[10m])

# P99 request latency should recover quickly after each wave
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```

## Limitations and tradeoffs
- Lower `maxUnavailable` protects availability but increases total rollout time.
- Higher surge reduces deployment duration but increases temporary resource pressure.
- Long lock TTL lowers duplicate execution risk but delays takeover after hard kills.
- Fast force-stop fallback guarantees termination but can cut off long-running in-flight RPCs.
A green `kubectl rollout status` means controller convergence. It does not prove queue-state convergence.
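One way to enforce that distinction is to gate deployment closure on a separate data-plane convergence check. A sketch, assuming backlog-depth samples scraped at a fixed interval after rollout (in practice these would come from your metrics backend):

```go
package main

import "fmt"

// convergedWithin reports whether backlog samples taken after rollout
// return to at most `baseline` and stay there for the rest of the window.
func convergedWithin(samples []int, baseline int) bool {
	converged := false
	for _, s := range samples {
		if s <= baseline {
			converged = true
		} else if converged {
			return false // regressed after appearing to converge
		}
	}
	return converged
}

func main() {
	post := []int{120, 80, 30, 8, 5, 4} // backlog depth per scrape after rollout
	fmt.Println(convergedWithin(post, 10))                 // true: drained and stayed drained
	fmt.Println(convergedWithin([]int{120, 8, 90, 7}, 10)) // false: backlog bounced back
}
```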
## Next step
Run this in one sprint:
1. Add a restart game day that force-kills one scheduler pod per rollout wave.
2. Record takeover lag from kill event to next successful dispatch and set an SLO.
3. Gate production rollout completion on lock-handoff and retry-rate checks.
4. Keep a rollback command block in the release template, not in someone's memory.
Continue with AI Agent Graceful Shutdown and AI Agent Leader Election.