## The production problem
Shutdown paths are usually the least-tested paths in production services. That is unfortunate, because every deploy, autoscale event, and node drain runs them.
If you stop a process without draining ingress and in-flight work, you create partial side effects and delayed retries. Then the next replica inherits an inconsistent queue.
Graceful shutdown should be treated as reliability logic with explicit ordering and budgets.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Kubernetes Pod Lifecycle | Termination semantics, signals, and pod-level lifecycle expectations. | No app-level sequencing for lock-backed AI control loops and queue drains. |
| NATS drain behavior | Connection/subscription drain flow and in-flight message handling before close. | No multi-service shutdown choreography with HTTP/gRPC servers plus distributed locks. |
| Go `net/http` Server.Shutdown | Graceful HTTP server drain with context deadlines. | No guidance for coordinating shutdown across message bus, lock store, and worker scheduler state. |
The gap is cross-layer choreography: request ingress, message transport, lock state, and replay logic must terminate in one coordinated sequence.
## Shutdown sequencing model
| Step | Action | What fails if skipped | Mitigation |
|---|---|---|---|
| Block new ingress | Stop accepting new HTTP/gRPC requests and new queue pulls. | New work arrives while old work is draining, extending shutdown indefinitely. | Set ingress-stop as the first shutdown action. |
| Drain in-flight work | Allow active handlers to complete with strict timeout budget. | Hard kill interrupts idempotency boundaries and leaves partial side effects. | Use context deadlines and explicit drain APIs for transport clients. |
| Finalize shared state | Release locks and flush critical state writes where possible. | Orphaned locks delay takeover or trigger duplicate work. | Bound lock TTL and verify replay/reconciler takeover paths. |
| Close dependencies | Close message bus, stores, and metrics server last. | Premature connection close causes silent drop of final in-flight operations. | Make transport close the final stage after handler drain. |
## Cordum shutdown behavior
These values are taken from current docs and core runtime code paths, including service-level shutdown handlers and engine stop semantics.
| Service / boundary | Current behavior | Operational impact |
|---|---|---|
| Scheduler | Main process shutdown window is 15s; `Engine.Stop()` waits up to 10s for in-flight handlers. | Keeps controlled drain bounded while avoiding indefinite termination hangs. |
| API Gateway | On SIGTERM: stop bus taps, drain HTTP, drain gRPC (`GracefulStop` with forced fallback), then metrics shutdown. | Prevents request loss during rolling restart and ensures controlled transport teardown. |
| Workflow Engine | Signal-driven context cancellation stops background loops; HTTP health server shutdown uses 15s timeout. | Stops reconciler/poller activity without abrupt in-flight termination. |
| Kubernetes envelope | Recommended `terminationGracePeriodSeconds: 30` while service-level graceful shutdown target is 15s. | Leaves headroom for signal delivery delay and final cleanup. |
| Unclean termination fallback | Job lock TTL is 60s; surviving replica takes ownership after lock expiry if a node is killed mid-work. | Caps recovery delay but may increase temporary queue latency after abrupt kills. |
## Implementation examples
### Signal-aware shutdown orchestrator (Go)
```go
sigCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

go func() {
	<-sigCtx.Done()
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// 1) Stop ingress first
	_ = httpServer.Shutdown(shutdownCtx)

	// 2) Drain message transport
	if err := natsConn.Drain(); err != nil {
		slog.Warn("nats drain failed", "error", err)
	}

	// 3) Stop background workers with bounded wait
	schedulerEngine.Stop() // internal max wait: 10s

	// 4) Final closes
	_ = metricsServer.Shutdown(shutdownCtx)
}()
```

### Kubernetes termination envelope (YAML)
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cordum-scheduler
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: scheduler
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 2"] # optional ingress drain buffer
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
```

### Rollout shutdown verification runbook (Bash)
```bash
# During rollout, verify pods are terminating gracefully
kubectl get pods -n cordum -w

# Check shutdown logs for the drain sequence
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "shutting down|drain|gRPC server drained"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "shutting down|graceful|deadline exceeded"

# Check lock state if a pod was killed abruptly
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"

# Confirm the service recovered ownership after the lock TTL window
curl -s http://localhost:9090/metrics | grep -E "stale_jobs|orphan_replayed"
```
### Post-shutdown regression signals (PromQL)
```promql
# Request latency spikes during rollout
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Scheduler lock pressure
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Recovery debt after restart
stale_jobs
rate(orphan_replayed_total[5m])
```
## Limitations and tradeoffs
- Short shutdown windows reduce rollout time but increase risk of forced termination under heavy load.
- Longer lock TTLs reduce duplicate processing risk but delay recovery after abrupt pod kill.
- Draining everything can increase rollout latency during peak traffic periods.
- Forced gRPC stop fallback preserves termination guarantees but can interrupt long in-flight RPCs.
If shutdown tests only run on idle environments, they do not test shutdown. They test process exit.
## Next step
Run this in one sprint:
1. Document a single shutdown order per service and enforce it in integration tests.
2. Verify the service timeout budget fits under `terminationGracePeriodSeconds` with headroom.
3. Add a game day that rolls all replicas during synthetic peak load.
4. Measure takeover lag and stale backlog after forced kill vs graceful termination.
Continue with AI Agent Incident Response Runbook and AI Agent Cold Start Recovery.