## The production problem
Rolling restarts are regular operations. They happen on deploys, node drains, security patching, and cluster autoscaling.
If restart behavior is only modeled at the pod level, you miss the real failure mode: state handoff between replicas that own locks, queue offsets, and in-flight RPC chains.
The result is predictable: duplicate dispatch, delayed takeover, and retry storms that start five minutes after the rollout looked “green.”
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| GKE cluster upgrade best practices | Upgrade sequencing, surge strategy, and graceful termination considerations during node/pod replacement. | No scheduler lock-handoff or queue-drain guidance for control-plane workloads. |
| Kubernetes PodDisruptionBudget docs | How disruption budgets limit voluntary evictions during maintenance events. | No coupling between PDB settings and application-level drain order across HTTP, gRPC, and queue handlers. |
| gRPC graceful shutdown | Drain in-flight RPCs and force-stop fallback after timeout. | No multi-service orchestration with message bus drains and Redis lock expiry. |
The missing layer is restart choreography across orchestrator policy, transport drain, and lock ownership. Autonomous agents need all three in one rollout contract.
## Rollout budget math
Teams set `maxUnavailable` without calculating rollout duration. That is backwards. Compute the budget first, then set rollout knobs.
```
# Conservative rollout budget
#   R          = replicas
#   U          = maxUnavailable
#   T_shutdown = graceful shutdown timeout
#   T_start    = time until new pod is Ready
#   T_minReady = minReadySeconds

waves        = ceil(R / U)
wave_time    = T_shutdown + T_start + T_minReady
rollout_time ~= waves * wave_time

# Example: R=6, U=1, shutdown=15s, startup=20s, minReady=10s
# waves=6, wave_time=45s => ~270s total (~4.5 min)
```
For a 6-replica service at conservative settings, rollout is roughly 4.5 minutes. If your change window is two minutes, you need parallelism and stronger readiness controls, not hope.
| Phase | Action | Failure if skipped | Mitigation |
|---|---|---|---|
| Pre-flight | Validate PDB, replica count, and maxUnavailable before starting rollout. | Controller evicts too many pods and quorum-dependent paths stall. | Enforce pre-rollout checklist and block deploy if guardrails fail. |
| Drain window | Mark pod unready, stop new ingress, then drain HTTP/gRPC and queue handlers. | New work lands on a pod that is already terminating. | Readiness-first drain sequence plus bounded shutdown timeout. |
| Handoff | Release locks or rely on bounded TTL takeover if release is missed. | Orphaned lock delays work resumption and inflates backlog. | Keep lock renewal healthy and validate takeover within defined SLO. |
| Post-rollout | Verify no duplicate dispatch, stale backlog, or elevated retry rates. | Hidden data-plane regressions surface minutes after rollout completes. | Run mandatory post-rollout query pack before closing deployment. |
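The pre-flight row above is the easiest to automate. A sketch of the guardrail check, assuming the PDB uses `minAvailable` (function and parameter names are illustrative):

```go
package main

import "fmt"

// preflightOK verifies that evicting maxUnavailable pods cannot push the
// ready count below the PDB's minAvailable floor. Block the deploy if it
// returns an error.
func preflightOK(replicas, minAvailable, maxUnavailable int) error {
	if replicas-maxUnavailable < minAvailable {
		return fmt.Errorf("rollout would leave %d ready pods, PDB requires %d",
			replicas-maxUnavailable, minAvailable)
	}
	return nil
}

func main() {
	fmt.Println(preflightOK(6, 5, 1)) // passes: 5 pods remain ready
	fmt.Println(preflightOK(6, 5, 2)) // fails: only 4 would remain ready
}
```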
## Cordum restart behavior
These values are verified against current documentation and runtime code paths in the scheduler, gateway, workflow, context-engine, and safety-kernel services.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Service shutdown timeout | Cordum services target a 15s graceful shutdown path on SIGTERM. | Predictable upper bound for drain time in rolling restarts. |
| Kubernetes envelope | Docs recommend `terminationGracePeriodSeconds: 30`, leaving headroom over 15s service timeout. | Reduces forced SIGKILL risk during node drains or rollout pressure. |
| API Gateway | Stops bus taps, drains HTTP, drains gRPC with forced fallback, then shuts metrics. | Prevents late request acceptance while transport shutdown is in progress. |
| Scheduler lock ownership | Job lock TTL is 60s with renewal every 20s; surviving replica takes ownership after expiry if a pod dies abruptly. | Caps worst-case takeover delay but creates temporary throughput dip after hard kills. |
| Broadcast convergence | During rolling restarts, ephemeral broadcast consumers can miss messages while a replica is replaced. | State converges via self-healing loops, but short-lived UI drift is expected. |
## Implementation examples
### Deployment strategy and termination envelope (YAML)

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cordum-scheduler
  namespace: cordum
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  minReadySeconds: 10
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: scheduler
          image: ghcr.io/cordum-io/cordum-scheduler:latest
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 2"]
```

### Readiness-first drain pattern (Go)
```go
// Readiness flips to 503 first, so the pod stops receiving new traffic
// before transport shutdown begins. httpServer, engine, and
// metricsServer come from the service's setup code.
var draining atomic.Bool

mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
	if draining.Load() {
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
})

sigCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

<-sigCtx.Done()
draining.Store(true) // stop new traffic via readiness

shutdownCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
defer cancel()

_ = httpServer.Shutdown(shutdownCtx) // drain in-flight HTTP
engine.Stop()                        // bounded in-flight wait
_ = metricsServer.Shutdown(shutdownCtx)
```

### Rolling restart verification runbook (Bash)
```bash
# 1) Verify disruption safety before rollout
kubectl get pdb -n cordum
kubectl get deploy -n cordum

# 2) Start rollout and watch progress
kubectl rollout restart deployment/cordum-scheduler -n cordum
kubectl rollout status deployment/cordum-scheduler -n cordum

# 3) Confirm graceful drain logs
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "shutting down gracefully|gRPC server drained"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "scheduler shutting down gracefully|graceful shutdown deadline"

# 4) Verify lock handoff after forced pod kill drill
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"
```
### Post-rollout regression checks (PromQL)

```promql
# Rollout should not create prolonged unavailable replicas
max_over_time(kube_deployment_status_replicas_unavailable{namespace="cordum"}[10m])

# Restart storms after rollout are a red flag
increase(kube_pod_container_status_restarts_total{namespace="cordum"}[10m])

# P99 request latency should recover quickly after each wave
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))
```

## Limitations and tradeoffs
- Lower `maxUnavailable` protects availability but increases total rollout time.
- Higher surge reduces deployment duration but increases temporary resource pressure.
- Long lock TTL lowers duplicate execution risk but delays takeover after hard kills.
- Fast force-stop fallback guarantees termination but can cut off long-running in-flight RPCs.
A green `kubectl rollout status` means controller convergence. It does not prove queue-state convergence.
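One way to enforce that distinction is to gate deployment closure on a separate data-plane convergence check. A sketch, assuming backlog-depth samples scraped at a fixed interval after rollout (in practice these would come from your metrics backend):

```go
package main

import "fmt"

// convergedWithin reports whether backlog samples taken after rollout
// return to at most `baseline` and stay there for the rest of the window.
func convergedWithin(samples []int, baseline int) bool {
	converged := false
	for _, s := range samples {
		if s <= baseline {
			converged = true
		} else if converged {
			return false // regressed after appearing to converge
		}
	}
	return converged
}

func main() {
	post := []int{120, 80, 30, 8, 5, 4} // backlog depth per scrape after rollout
	fmt.Println(convergedWithin(post, 10))                 // true: drained and stayed drained
	fmt.Println(convergedWithin([]int{120, 8, 90, 7}, 10)) // false: backlog bounced back
}
```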
## Next step
Run this in one sprint:
1. Add a restart game day that force-kills one scheduler pod per rollout wave.
2. Record takeover lag from kill event to next successful dispatch and set an SLO.
3. Gate production rollout completion on lock-handoff and retry-rate checks.
4. Keep a rollback command block in the release template, not in someone's memory.
Continue with AI Agent Graceful Shutdown and AI Agent Leader Election.