
AI Agent Rolling Restart Playbook

Most failed rollouts are state handoff failures disguised as deployment settings.

Guide · 12 min read · Apr 2026
TL;DR
  - Rolling restarts fail when teams treat them as pure Kubernetes settings instead of distributed consistency events.
  - Set the rollout budget from real shutdown and startup numbers, not defaults.
  - Cordum services target a 15s graceful shutdown; keep Kubernetes termination grace at 30s for headroom.
  - Scheduler lock TTL and takeover lag define your worst-case recovery delay during forced pod kills.
Budgeted rollout

Use explicit time math so deploy speed does not violate durability guarantees.

Disruption control

PDB and maxUnavailable settings prevent too many replicas from draining at once.

State takeover

Lock TTL and replay behavior decide if restart is smooth or incident-worthy.

Scope

This guide covers rolling restarts for autonomous AI control-plane services that run on Kubernetes and depend on gRPC, queue consumption, and Redis-backed distributed coordination.

The production problem

Rolling restarts are regular operations. They happen on deploys, node drains, security patching, and cluster autoscaling.

If restart behavior is only modeled at the pod level, you miss the real failure mode: state handoff between replicas that own locks, queue offsets, and in-flight RPC chains.

The result is predictable: duplicate dispatch, delayed takeover, and retry storms that start five minutes after the rollout looked "green."

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| GKE cluster upgrade best practices | Upgrade sequencing, surge strategy, and graceful termination considerations during node/pod replacement. | No scheduler lock-handoff or queue-drain guidance for control-plane workloads. |
| Kubernetes PodDisruptionBudget docs | How disruption budgets limit voluntary evictions during maintenance events. | No coupling between PDB settings and application-level drain order across HTTP, gRPC, and queue handlers. |
| gRPC graceful shutdown | Drain in-flight RPCs and force-stop fallback after timeout. | No multi-service orchestration with message bus drains and Redis lock expiry. |

The missing layer is restart choreography across orchestrator policy, transport drain, and lock ownership. Autonomous agents need all three in one rollout contract.

Rollout budget math

Teams set `maxUnavailable` without calculating rollout duration. That is backwards. Compute the budget first, then set rollout knobs.

rollout_budget.txt
Text
# Conservative rollout budget
# R = replicas, U = maxUnavailable
# T_shutdown = graceful shutdown timeout
# T_start = time until new pod is Ready
# T_minReady = minReadySeconds

waves = ceil(R / U)
wave_time = T_shutdown + T_start + T_minReady
rollout_time ~= waves * wave_time

# Example: R=6, U=1, shutdown=15s, startup=20s, minReady=10s
# waves=6, wave_time=45s => ~270s total (~4.5 min)

For a 6-replica service at conservative settings, rollout is roughly 4.5 minutes. If your change window is two minutes, you need parallelism and stronger readiness controls, not hope.

| Phase | Action | Failure if skipped | Mitigation |
| --- | --- | --- | --- |
| Pre-flight | Validate PDB, replica count, and maxUnavailable before starting rollout. | Controller evicts too many pods and quorum-dependent paths stall. | Enforce pre-rollout checklist and block deploy if guardrails fail. |
| Drain window | Mark pod unready, stop new ingress, then drain HTTP/gRPC and queue handlers. | New work lands on a pod that is already terminating. | Readiness-first drain sequence plus bounded shutdown timeout. |
| Handoff | Release locks or rely on bounded TTL takeover if release is missed. | Orphaned lock delays work resumption and inflates backlog. | Keep lock renewal healthy and validate takeover within defined SLO. |
| Post-rollout | Verify no duplicate dispatch, stale backlog, or elevated retry rates. | Hidden data-plane regressions surface minutes after rollout completes. | Run mandatory post-rollout query pack before closing deployment. |

Cordum restart behavior

These values are verified from current documentation and runtime paths in scheduler, gateway, workflow, context-engine, and safety-kernel services.

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Service shutdown timeout | Cordum services target a 15s graceful shutdown path on SIGTERM. | Predictable upper bound for drain time in rolling restarts. |
| Kubernetes envelope | Docs recommend `terminationGracePeriodSeconds: 30`, leaving headroom over the 15s service timeout. | Reduces forced SIGKILL risk during node drains or rollout pressure. |
| API Gateway | Stops bus taps, drains HTTP, drains gRPC with forced fallback, then shuts metrics. | Prevents late request acceptance while transport shutdown is in progress. |
| Scheduler lock ownership | Job lock TTL is 60s with renewal every 20s; surviving replica takes ownership after expiry if a pod dies abruptly. | Caps worst-case takeover delay but creates temporary throughput dip after hard kills. |
| Broadcast convergence | During rolling restarts, ephemeral broadcast consumers can miss messages while a replica is replaced. | State converges via self-healing loops, but short-lived UI drift is expected. |

Implementation examples

Deployment strategy and termination envelope (YAML)

scheduler_rollout.yaml
YAML
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cordum-scheduler
  namespace: cordum
spec:
  replicas: 3
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  minReadySeconds: 10
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: scheduler
          # Pin a version tag or digest in production; :latest makes rollback ambiguous.
          image: ghcr.io/cordum-io/cordum-scheduler:latest
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            periodSeconds: 5
            failureThreshold: 2
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 2"]

Readiness-first drain pattern (Go)

drain_readiness.go
Go
// Shutdown-path excerpt: readiness flips first, then transports drain.
// Assumes httpServer and metricsServer (*http.Server) and engine (a
// service-specific worker pool with a bounded Stop) exist elsewhere.
import (
	"context"
	"net/http"
	"os"
	"os/signal"
	"sync/atomic"
	"syscall"
	"time"
)

var draining atomic.Bool

// Readiness endpoint; its path must match the Deployment's readinessProbe.
mux.HandleFunc("/readyz", func(w http.ResponseWriter, r *http.Request) {
	if draining.Load() {
		http.Error(w, "draining", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
})

sigCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

<-sigCtx.Done()
draining.Store(true) // readiness now fails; endpoints controller stops routing new traffic
shutdownCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second) // matches the 15s service budget
defer cancel()

_ = httpServer.Shutdown(shutdownCtx)    // drain in-flight HTTP within the deadline
engine.Stop()                           // bounded wait for in-flight work
_ = metricsServer.Shutdown(shutdownCtx) // metrics last, so the drain itself stays observable

Rolling restart verification runbook (Bash)

rolling_restart_check.sh
Bash
# 1) Verify disruption safety before rollout
kubectl get pdb -n cordum
kubectl get deploy -n cordum

# 2) Start rollout and watch progress
kubectl rollout restart deployment/cordum-scheduler -n cordum
kubectl rollout status deployment/cordum-scheduler -n cordum

# 3) Confirm graceful drain logs
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "shutting down gracefully|gRPC server drained"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "scheduler shutting down gracefully|graceful shutdown deadline"

# 4) Verify lock handoff after forced pod kill drill
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

Post-rollout regression checks (PromQL)

rollout_regression.promql
PromQL
# Rollout should not create prolonged unavailable replicas
max_over_time(kube_deployment_status_replicas_unavailable{namespace="cordum"}[10m])

# Restart storms after rollout are a red flag
increase(kube_pod_container_status_restarts_total{namespace="cordum"}[10m])

# P99 request latency should recover quickly after each wave
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

Limitations and tradeoffs

  - Lower `maxUnavailable` protects availability but increases total rollout time.
  - Higher surge reduces deployment duration but increases temporary resource pressure.
  - Long lock TTL lowers duplicate execution risk but delays takeover after hard kills.
  - Fast force-stop fallback guarantees termination but can cut off long-running in-flight RPCs.

A green `kubectl rollout status` means controller convergence. It does not prove queue-state convergence.

Next step

Run this in one sprint:

  1. Add a restart game day that force-kills one scheduler pod per rollout wave.
  2. Record takeover lag from kill event to next successful dispatch and set an SLO.
  3. Gate production rollout completion on lock-handoff and retry-rate checks.
  4. Keep a rollback command block in the release template, not in someone's memory.

Continue with AI Agent Graceful Shutdown and AI Agent Leader Election.

Restart safety is release safety

A rollout policy is only complete when it specifies disruption limits, drain order, and ownership handoff verification.