
AI Agent Graceful Shutdown

Most rollout incidents are shutdown incidents that were not modeled as such.

Guide · 11 min read · Mar 2026
TL;DR
- Graceful shutdown is a consistency operation, not just a process-exit operation.
- Drain order matters: stop ingress first, finish in-flight work second, then release resources.
- Cordum service shutdown targets 15s, with Kubernetes termination grace recommended at 30s.
- If locks or queues are left half-processed, restarts convert planned maintenance into incident response.
Drain order

A deterministic sequence prevents dropped messages and half-completed workflows.

Timeout budgets

Service-level shutdown windows must fit inside orchestrator termination windows.

State safety

Lock expiration and replay paths must be validated for forced-termination cases.

Scope

This guide covers graceful shutdown for multi-replica AI control planes that use HTTP/gRPC ingress, NATS messaging, and Redis-backed distributed coordination.

The production problem

Shutdown paths are usually the least-tested paths in production services. That is unfortunate, because every deploy, autoscale event, and node drain runs them.

If you stop a process without draining ingress and in-flight work, you create partial side effects and delayed retries. Then the next replica inherits an inconsistent queue.

Graceful shutdown should be treated as reliability logic with explicit ordering and budgets.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Kubernetes Pod Lifecycle | Termination semantics, signals, and pod-level lifecycle expectations. | No app-level sequencing for lock-backed AI control loops and queue drains. |
| NATS drain behavior | Connection/subscription drain flow and in-flight message handling before close. | No multi-service shutdown choreography with HTTP/gRPC servers plus distributed locks. |
| Go `net/http` `Server.Shutdown` | Graceful HTTP server drain with context deadlines. | No guidance for coordinating shutdown across message bus, lock store, and worker scheduler state. |

The gap is cross-layer choreography: request ingress, message transport, lock state, and replay logic must terminate in one coordinated sequence.

Shutdown sequencing model

| Step | Action | What fails if skipped | Mitigation |
| --- | --- | --- | --- |
| Block new ingress | Stop accepting new HTTP/gRPC requests and new queue pulls. | New work arrives while old work is draining, extending shutdown indefinitely. | Make ingress-stop the first shutdown action. |
| Drain in-flight work | Allow active handlers to complete within a strict timeout budget. | A hard kill interrupts idempotency boundaries and leaves partial side effects. | Use context deadlines and explicit drain APIs for transport clients. |
| Finalize shared state | Release locks and flush critical state writes where possible. | Orphaned locks delay takeover or trigger duplicate work. | Bound lock TTL and verify replay/reconciler takeover paths. |
| Close dependencies | Close message bus, stores, and metrics server last. | Premature connection close silently drops final in-flight operations. | Make transport close the final stage, after handler drain. |

Cordum shutdown behavior

These values are taken from current docs and core runtime code paths, including service-level shutdown handlers and engine stop semantics.

| Service / boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Scheduler | Main process shutdown window is 15s; `Engine.Stop()` waits up to 10s for in-flight handlers. | Keeps controlled drain bounded while avoiding indefinite termination hangs. |
| API Gateway | On SIGTERM: stop bus taps, drain HTTP, drain gRPC (`GracefulStop` with forced fallback), then metrics shutdown. | Prevents request loss during rolling restarts and ensures controlled transport teardown. |
| Workflow Engine | Signal-driven context cancellation stops background loops; HTTP health server shutdown uses a 15s timeout. | Stops reconciler/poller activity without abrupt in-flight termination. |
| Kubernetes envelope | Recommended `terminationGracePeriodSeconds: 30`, while the service-level graceful shutdown target is 15s. | Leaves headroom for signal-delivery delay and final cleanup. |
| Unclean termination fallback | Job lock TTL is 60s; a surviving replica takes ownership after lock expiry if a node is killed mid-work. | Caps recovery delay but may increase temporary queue latency after abrupt kills. |

Implementation examples

Signal-aware shutdown orchestrator (Go)

graceful_shutdown.go

```go
sigCtx, stop := signal.NotifyContext(context.Background(), os.Interrupt, syscall.SIGTERM)
defer stop()

go func() {
	<-sigCtx.Done()
	shutdownCtx, cancel := context.WithTimeout(context.Background(), 15*time.Second)
	defer cancel()

	// 1) Stop ingress first
	_ = httpServer.Shutdown(shutdownCtx)

	// 2) Drain message transport
	if err := natsConn.Drain(); err != nil {
		slog.Warn("nats drain failed", "error", err)
	}

	// 3) Stop background workers with bounded wait
	schedulerEngine.Stop() // internal max wait: 10s

	// 4) Final closes
	_ = metricsServer.Shutdown(shutdownCtx)
}()
```

Kubernetes termination envelope (YAML)

termination_budget.yaml

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cordum-scheduler
spec:
  template:
    spec:
      terminationGracePeriodSeconds: 30
      containers:
        - name: scheduler
          lifecycle:
            preStop:
              exec:
                command: ["/bin/sh", "-c", "sleep 2"] # optional ingress drain buffer
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
```

Rollout shutdown verification runbook (Bash)

shutdown_runbook.sh

```bash
# During rollout, verify pods are terminating gracefully
kubectl get pods -n cordum -w

# Check shutdown logs for drain sequence
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "shutting down|drain|gRPC server drained"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "shutting down|graceful|deadline exceeded"

# Check lock state if a pod was killed abruptly
redis-cli GET "cordum:scheduler:job:JOB_ID"
redis-cli GET "cordum:reconciler:default"

# Confirm service recovered ownership after lock TTL window
curl -s http://localhost:9090/metrics | grep -E "stale_jobs|orphan_replayed"
```

Post-shutdown regression signals (PromQL)

shutdown_regression.promql

```promql
# Request latency spikes during rollout
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m]))

# Scheduler lock pressure
histogram_quantile(0.99, rate(job_lock_wait_bucket[5m]))

# Recovery debt after restart
stale_jobs
rate(orphan_replayed_total[5m])
```

Limitations and tradeoffs

- Short shutdown windows reduce rollout time but increase the risk of forced termination under heavy load.
- Longer lock TTLs reduce duplicate-processing risk but delay recovery after an abrupt pod kill.
- Draining everything can increase rollout latency during peak traffic periods.
- The forced gRPC stop fallback preserves termination guarantees but can interrupt long in-flight RPCs.

If shutdown tests only run on idle environments, they do not test shutdown. They test process exit.

Next step

Run this in one sprint:

1. Document a single shutdown order per service and enforce it in integration tests.
2. Verify the service timeout budget fits under `terminationGracePeriodSeconds` with headroom.
3. Add a game day that rolls all replicas during synthetic peak load.
4. Measure takeover lag and stale backlog after forced kill vs graceful termination.
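Enforcing a shutdown order in integration tests can be as simple as having each shutdown hook record its stage and asserting on the recorded order. This is a minimal, hypothetical sketch; the stage names and recorder are illustrative, not part of any real test suite.

```go
package main

import (
	"fmt"
	"sync"
)

// recorder captures the order in which shutdown hooks fire so a test
// can assert that ingress stops before transports close.
type recorder struct {
	mu    sync.Mutex
	order []string
}

func (r *recorder) mark(stage string) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.order = append(r.order, stage)
}

// indexOf returns the position of a stage, or -1 if it never ran.
func (r *recorder) indexOf(stage string) int {
	r.mu.Lock()
	defer r.mu.Unlock()
	for i, s := range r.order {
		if s == stage {
			return i
		}
	}
	return -1
}

func main() {
	var rec recorder
	// In a real test, these marks are emitted by the service's own hooks.
	rec.mark("ingress-stopped")
	rec.mark("handlers-drained")
	rec.mark("transport-closed")

	if i, j := rec.indexOf("ingress-stopped"), rec.indexOf("transport-closed"); i < 0 || j < 0 || i > j {
		fmt.Println("FAIL: transport closed before ingress stopped")
		return
	}
	fmt.Println("OK: shutdown order enforced")
}
```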

Continue with AI Agent Incident Response Runbook and AI Agent Cold Start Recovery.

Restart safety is part of uptime

Treat every deploy like a controlled failure scenario and verify the recovery math in production-like load.