
AI Agent Graceful Shutdown

Most rollout incidents are shutdown incidents that were not modeled as such.

Guide · 11 min read · Mar 2026
TL;DR
- Graceful shutdown is a consistency operation, not just a process-exit operation.
- Drain order matters: stop ingress first, finish in-flight work second, then release resources.
- Cordum service shutdown targets 15s, with Kubernetes termination grace recommended at 30s.
- If locks or queues are left half-processed, restarts convert planned maintenance into incident response.
Drain order

A deterministic sequence prevents dropped messages and half-completed workflows.

Timeout budgets

Service-level shutdown windows must fit inside orchestrator termination windows.

State safety

Lock expiration and replay paths must be validated for forced-termination cases.

Scope

This guide covers graceful shutdown for multi-replica AI control planes that use HTTP/gRPC ingress, NATS messaging, and Redis-backed distributed coordination.

The production problem

Shutdown paths are usually the least-tested paths in production services. That is unfortunate, because every deploy, autoscale event, and node drain runs them.

If you stop a process without draining ingress and in-flight work, you create partial side effects and delayed retries. Then the next replica inherits an inconsistent queue.

Graceful shutdown should be treated as reliability logic with explicit ordering and budgets.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Kubernetes Pod Lifecycle | Termination semantics, signals, and pod-level lifecycle expectations. | No app-level sequencing for lock-backed AI control loops and queue drains. |
| NATS drain behavior | Connection/subscription drain flow and in-flight message handling before close. | No multi-service shutdown choreography with HTTP/gRPC servers plus distributed locks. |
| Go `net/http` `Server.Shutdown` | Graceful HTTP server drain with context deadlines. | No guidance for coordinating shutdown across message bus, lock store, and worker scheduler state. |

The gap is cross-layer choreography: request ingress, message transport, lock state, and replay logic must terminate in one coordinated sequence.

Shutdown sequencing model

| Step | Action | What fails if skipped | Mitigation |
| --- | --- | --- | --- |
| Block new ingress | Stop accepting new HTTP/gRPC requests and new queue pulls. | New work arrives while old work is draining, extending shutdown indefinitely. | Make ingress-stop the first shutdown action. |
| Drain in-flight work | Allow active handlers to complete within a strict timeout budget. | A hard kill interrupts idempotency boundaries and leaves partial side effects. | Use context deadlines and explicit drain APIs for transport clients. |
| Finalize shared state | Release locks and flush critical state writes where possible. | Orphaned locks delay takeover or trigger duplicate work. | Bound lock TTL and verify replay/reconciler takeover paths. |
| Close dependencies | Close message bus, stores, and metrics server last. | Premature connection close silently drops final in-flight operations. | Make transport close the final stage, after handler drain. |

Cordum shutdown behavior

These values are taken from current docs and core runtime code paths, including service-level shutdown handlers and engine stop semantics.

| Service / boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Scheduler | Main process shutdown window is 15s; `Engine.Stop()` waits up to 10s for in-flight handlers. | Keeps controlled drain bounded while avoiding indefinite termination hangs. |
| API Gateway | On SIGTERM: stop bus taps, drain HTTP, drain gRPC (`GracefulStop` with forced fallback), then metrics shutdown. | Prevents request loss during rolling restarts and ensures controlled transport teardown. |
| Workflow Engine | Signal-driven context cancellation stops background loops; HTTP health server shutdown uses a 15s timeout. | Stops reconciler/poller activity without abrupt in-flight termination. |
| Kubernetes envelope | Recommended `terminationGracePeriodSeconds: 30`, while the service-level graceful shutdown target is 15s. | Leaves headroom for signal-delivery delay and final cleanup. |
| Unclean termination fallback | Job lock TTL is 60s; a surviving replica takes ownership after lock expiry if a node is killed mid-work. | Caps recovery delay but may increase temporary queue latency after abrupt kills. |

Implementation examples

Signal-aware shutdown orchestrator (Go)

graceful_shutdown.go

```go
sigCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

go func() {
	<-sigCtx.Done()
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// 1) Stop ingress first
	_ = httpServer.Shutdown(shutdownCtx)

	// 2) Drain message transport
	if err := natsConn.Drain(); err != nil {
		slog.Warn("nats drain failed", "error", err)
	}

	// 3) Stop background workers with bounded wait
	schedulerEngine.Stop() // internal max wait: 10s

	// 4) Final closes
	_ = metricsServer.Shutdown(shutdownCtx)
}()
```

Kubernetes termination envelope (YAML)

termination_budget.yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cordum-scheduler
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: scheduler
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 2"] # optional ingress drain buffer
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
```

Rollout shutdown verification runbook (Bash)

shutdown_runbook.sh

```bash
# During rollout, verify pods are terminating gracefully
kubectl get pods -n cordum -w

# Check shutdown logs for drain sequence
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "shutting down|drain|gRPC server drained"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "shutting down|graceful|deadline exceeded"

# Check lock state if a pod was killed abruptly
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"

# Confirm service recovered ownership after lock TTL window
curl -s http://localhost:9090/metrics | grep -E "stale_jobs|orphan_replayed"
```

Post-shutdown regression signals (PromQL)

shutdown_regression.promql

```promql
# Request latency spikes during rollout
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Scheduler lock pressure
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Recovery debt after restart
stale_jobs
rate(orphan_replayed_total[5m])
```

Limitations and tradeoffs

- Short shutdown windows reduce rollout time but increase the risk of forced termination under heavy load.
- Longer lock TTLs reduce duplicate-processing risk but delay recovery after an abrupt pod kill.
- Draining everything can increase rollout latency during peak traffic periods.
- The forced gRPC stop fallback preserves termination guarantees but can interrupt long in-flight RPCs.

If shutdown tests only run on idle environments, they do not test shutdown. They test process exit.

Next step

Run this in one sprint:

1. Document a single shutdown order per service and enforce it in integration tests.
2. Verify the service timeout budget fits under `terminationGracePeriodSeconds` with headroom.
3. Add a game day that rolls all replicas during synthetic peak load.
4. Measure takeover lag and stale backlog after forced kill vs graceful termination.
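Enforcing a shutdown order in integration tests can be as simple as having each shutdown hook record its stage and asserting on the recorded order. This is a minimal, hypothetical sketch; the stage names and recorder are illustrative, not part of any real test suite.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder captures the order in which shutdown hooks fire so a test
// can assert that ingress stops before transports close.
type recorder struct {
	mu    sync.Mutex
	order []string
}

func (r *recorder) mark(stage string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.order = append(r.order, stage)
}

// indexOf returns the position of a stage, or -1 if it never ran.
func (r *recorder) indexOf(stage string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	for i, s := range r.order {
		if s == stage {
			return i
		}
	}
	return -1
}

func main() {
	var rec recorder
	// In a real test, these marks are emitted by the service's own hooks.
	rec.mark("ingress-stopped")
	rec.mark("handlers-drained")
	rec.mark("transport-closed")

	if i, j := rec.indexOf("ingress-stopped"), rec.indexOf("transport-closed"); i < 0 || j < 0 || i > j {
		fmt.Println("FAIL: transport closed before ingress stopped")
		return
	}
	fmt.Println("OK: shutdown order enforced")
}
```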

Continue with AI Agent Incident Response Runbook and AI Agent Cold Start Recovery.

Restart safety is part of uptime

Treat every deploy like a controlled failure scenario and verify the recovery math in production-like load.