## The production problem
A deploy sends SIGTERM. Your service calls `GracefulStop()`.
One RPC handler blocks on downstream I/O. The shutdown never completes.
Kubernetes eventually sends SIGKILL. The pod dies hard, and you lose control of drain behavior.
## What top results cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC Docs: Graceful Shutdown | Graceful stop semantics and recommendation to pair with a forceful shutdown timer. | No concrete multi-server shutdown ordering for HTTP + gRPC + metrics inside one control-plane process. |
| grpc-go API Docs (`Server.GracefulStop`, `Server.Stop`) | Exact method behavior: graceful blocks for pending RPCs, forceful stop cancels active RPCs. | No end-to-end pattern for signal handling, timeout wiring, and shutdown synchronization in production services. |
| Kubernetes Pod Lifecycle | SIGTERM flow, grace periods, and eventual SIGKILL if workloads exceed allowed termination time. | No gRPC-specific guidance for how application servers should bound graceful drain time to fit pod lifecycle constraints. |
## Cordum runtime mechanics
Cordum uses the same bounded graceful-stop shape in the API gateway and context-engine services.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Shutdown budget | Gateway and context-engine both use a 15-second shutdown timeout. | Termination has a hard upper bound that aligns with container lifecycle planning. |
| gRPC drain strategy | `GracefulStop()` runs in a goroutine; timeout path calls `Stop()`. | No indefinite block if a handler is stuck or slow. |
| Ordering | Gateway drains HTTP first, then gRPC, then metrics. | No new API work enters while in-flight requests are draining. |
| Shutdown coordination | `shutdownDone` channel is awaited before returning from server close path. | Resource closes happen after drain completion rather than mid-shutdown. |
```go
// core/controlplane/gateway/gateway.go (excerpt)
const shutdownTimeout = 15 * time.Second

shutdownCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
defer cancel()

// Stop accepting new HTTP work and drain in-flight requests first.
if err := srv.Shutdown(shutdownCtx); err != nil {
	slog.Error("http shutdown error", "error", err)
}

// Drain gRPC in a goroutine so the timeout branch can still run.
grpcDone := make(chan struct{})
go func() {
	grpcServer.GracefulStop()
	close(grpcDone)
}()

select {
case <-grpcDone:
	slog.Info("gRPC server drained")
case <-shutdownCtx.Done():
	slog.Warn("gRPC graceful stop timed out, forcing")
	grpcServer.Stop()
}
```

```go
// core/controlplane/gateway/gateway.go (excerpt)
shutdownDone := make(chan struct{})
go func() {
	defer close(shutdownDone)
	<-sigCtx.Done()
	// drain servers...
}()

if errors.Is(err, http.ErrServerClosed) {
	<-shutdownDone // wait for drain goroutine to finish before returning
	return nil
}
```

```go
// cmd/cordum-context-engine/main.go (excerpt)
const shutdownTimeout = 15 * time.Second

shutdownCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
defer cancel()

grpcDone := make(chan struct{})
go func() {
	server.GracefulStop()
	close(grpcDone)
}()

select {
case <-grpcDone:
	slog.Info("context-engine gRPC server drained")
case <-shutdownCtx.Done():
	slog.Warn("context-engine gRPC graceful stop timed out, forcing")
	server.Stop()
}
```

## Shutdown sequence details
Sequence matters. Cordum first stops accepting new HTTP work, then drains gRPC, then closes metrics.
The pattern reduces race windows where late requests enter while the process is already in teardown mode.
Unit tests verify the timeout branch and assert that forced stop releases a stuck graceful drain.
```go
// core/controlplane/gateway/shutdown_test.go (excerpt)
shutdownCtx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
defer cancel()

go func() {
	grpcServer.GracefulStop()
	close(grpcDone)
}()

select {
case <-grpcDone:
case <-shutdownCtx.Done():
	grpcServer.Stop()
	select {
	case <-grpcDone:
	case <-time.After(5 * time.Second):
		t.Fatal("grpcServer.Stop() did not unblock GracefulStop")
	}
}
```

## Validation runbook
Test this with synthetic slow handlers before relying on it during real rollout pressure.
```sh
# 1) Start staging instance and open a long-running gRPC call
# 2) Send SIGTERM to the process
# 3) Confirm log shows graceful shutdown start with 15s timeout
# 4) If call exceeds budget, confirm forced Stop log appears
# 5) Verify process exits before pod grace period and rollout proceeds
```
## Limitations and tradeoffs
| Approach | Upside | Downside |
|---|---|---|
| Only GracefulStop with no timeout | Best chance for in-flight RPC completion. | Can hang shutdown forever if handlers never finish. |
| GracefulStop + forced Stop fallback (Cordum pattern) | Bounded shutdown time with graceful path first. | Some in-flight RPCs can be cancelled when timeout is hit. |
| Immediate Stop only | Fastest process termination. | Higher client failure rate during deploys and less graceful behavior. |
## Next step
Add a CI shutdown test that opens a long-running streaming RPC, sends SIGTERM, and asserts process exit under your deployment grace period with explicit verification of the forced-stop path.