## The production problem
A deploy sends SIGTERM. Your service calls `GracefulStop()`.
One RPC handler blocks on downstream I/O. The shutdown never completes.
Kubernetes eventually sends SIGKILL. The pod dies hard, and you lose control of drain behavior.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC Docs: Graceful Shutdown | Graceful stop semantics and recommendation to pair with a forceful shutdown timer. | No concrete multi-server shutdown ordering for HTTP + gRPC + metrics inside one control-plane process. |
| grpc-go API Docs (`Server.GracefulStop`, `Server.Stop`) | Exact method behavior: graceful blocks for pending RPCs, forceful stop cancels active RPCs. | No end-to-end pattern for signal handling, timeout wiring, and shutdown synchronization in production services. |
| Kubernetes Pod Lifecycle | SIGTERM flow, grace periods, and eventual SIGKILL if workloads exceed allowed termination time. | No gRPC-specific guidance for how application servers should bound graceful drain time to fit pod lifecycle constraints. |
## Cordum runtime mechanics
Cordum uses the same bounded graceful-stop shape in the API gateway and context-engine services.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Shutdown budget | Gateway and context-engine both use a 15-second shutdown timeout. | Termination has a hard upper bound that aligns with container lifecycle planning. |
| gRPC drain strategy | `GracefulStop()` runs in a goroutine; timeout path calls `Stop()`. | No indefinite block if a handler is stuck or slow. |
| Ordering | Gateway drains HTTP first, then gRPC, then metrics. | No new API work enters while in-flight requests are draining. |
| Shutdown coordination | `shutdownDone` channel is awaited before returning from server close path. | Resource closes happen after drain completion rather than mid-shutdown. |
```go
// core/controlplane/gateway/gateway.go (excerpt)
const shutdownTimeout = 15 * time.Second

shutdownCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
defer cancel()

// Stop accepting new HTTP work and drain in-flight requests first.
if err := srv.Shutdown(shutdownCtx); err != nil {
	slog.Error("http shutdown error", "error", err)
}

// Drain gRPC in a goroutine so the timeout branch can still run.
grpcDone := make(chan struct{})
go func() {
	grpcServer.GracefulStop()
	close(grpcDone)
}()

select {
case <-grpcDone:
	slog.Info("gRPC server drained")
case <-shutdownCtx.Done():
	slog.Warn("gRPC graceful stop timed out, forcing")
	grpcServer.Stop()
}
```

```go
// core/controlplane/gateway/gateway.go (excerpt)
shutdownDone := make(chan struct{})
go func() {
	defer close(shutdownDone)
	<-sigCtx.Done()
	// drain servers...
}()

if errors.Is(err, http.ErrServerClosed) {
	<-shutdownDone // wait for drain goroutine to finish before returning
	return nil
}
```

```go
// cmd/cordum-context-engine/main.go (excerpt)
const shutdownTimeout = 15 * time.Second

shutdownCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
defer cancel()

grpcDone := make(chan struct{})
go func() {
	server.GracefulStop()
	close(grpcDone)
}()

select {
case <-grpcDone:
	slog.Info("context-engine gRPC server drained")
case <-shutdownCtx.Done():
	slog.Warn("context-engine gRPC graceful stop timed out, forcing")
	server.Stop()
}
```

## Shutdown sequence details
Sequence matters. Cordum first stops accepting new HTTP work, then drains gRPC, then closes metrics.
The pattern reduces race windows where late requests enter while the process is already in teardown mode.
Unit tests verify the timeout branch and assert that forced stop releases a stuck graceful drain.
```go
// core/controlplane/gateway/shutdown_test.go (excerpt)
shutdownCtx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
defer cancel()

go func() {
	grpcServer.GracefulStop()
	close(grpcDone)
}()

select {
case <-grpcDone:
case <-shutdownCtx.Done():
	grpcServer.Stop()
	select {
	case <-grpcDone:
	case <-time.After(5 * time.Second):
		t.Fatal("grpcServer.Stop() did not unblock GracefulStop")
	}
}
```

## Validation runbook
Test this with synthetic slow handlers before relying on it during real rollout pressure.
```sh
# 1) Start staging instance and open a long-running gRPC call
# 2) Send SIGTERM to the process
# 3) Confirm log shows graceful shutdown start with 15s timeout
# 4) If call exceeds budget, confirm forced Stop log appears
# 5) Verify process exits before pod grace period and rollout proceeds
```
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Only GracefulStop with no timeout | Best chance for in-flight RPC completion. | Can hang shutdown forever if handlers never finish. |
| GracefulStop + forced Stop fallback (Cordum pattern) | Bounded shutdown time with graceful path first. | Some in-flight RPCs can be cancelled when timeout is hit. |
| Immediate Stop only | Fastest process termination. | Higher client failure rate during deploys and less graceful behavior. |
## Next step
Add a CI shutdown test that opens a long-running streaming RPC, sends SIGTERM, and asserts process exit under your deployment grace period with explicit verification of the forced-stop path.