Deep Dive

AI Agent gRPC GracefulStop Timeout

Graceful drain is necessary. Bounded graceful drain is what keeps deploys predictable.

Deep Dive · 10 min read · Mar 2026
TL;DR
- `GracefulStop()` blocks until active RPCs finish, which can hang if one handler never returns.
- Cordum applies a 15-second shutdown deadline and forces `Stop()` when that deadline expires.
- Gateway shutdown drains HTTP before gRPC and uses `shutdownDone` coordination to avoid early process teardown.
- Cordum includes a unit test that verifies forced stop unblocks graceful stop under timeout pressure.
Failure mode

A single stuck RPC can keep process shutdown blocked and delay rollouts.

Code guardrail

Cordum uses a bounded graceful window, then calls `Stop()` as a safety fallback.

Operational payoff

Pods terminate predictably within Kubernetes grace periods instead of hanging indefinitely.

Scope

This guide focuses on server shutdown behavior for gRPC services in Cordum runtime processes. It does not cover client retry policy tuning in depth.

The production problem

A deploy sends SIGTERM. Your service calls `GracefulStop()`.

One RPC handler blocks on downstream I/O. The shutdown never completes.

Kubernetes eventually sends SIGKILL. The pod dies hard, and you lose control of drain behavior.

What top results cover and miss

| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC Docs: Graceful Shutdown | Graceful stop semantics and the recommendation to pair with a forceful shutdown timer. | No concrete multi-server shutdown ordering for HTTP + gRPC + metrics inside one control-plane process. |
| grpc-go API Docs (`Server.GracefulStop`, `Server.Stop`) | Exact method behavior: graceful blocks for pending RPCs, forceful stop cancels active RPCs. | No end-to-end pattern for signal handling, timeout wiring, and shutdown synchronization in production services. |
| Kubernetes Pod Lifecycle | SIGTERM flow, grace periods, and eventual SIGKILL if workloads exceed allowed termination time. | No gRPC-specific guidance for how application servers should bound graceful drain time to fit pod lifecycle constraints. |

Cordum runtime mechanics

Cordum uses the same bounded graceful-stop shape in the API gateway and context-engine services.

| Boundary | Current behavior | Operational impact |
|---|---|---|
| Shutdown budget | Gateway and context-engine both use a 15-second shutdown timeout. | Termination has a hard upper bound that aligns with container lifecycle planning. |
| gRPC drain strategy | `GracefulStop()` runs in a goroutine; timeout path calls `Stop()`. | No indefinite block if a handler is stuck or slow. |
| Ordering | Gateway drains HTTP first, then gRPC, then metrics. | No new API work enters while in-flight requests are draining. |
| Shutdown coordination | `shutdownDone` channel is awaited before returning from the server close path. | Resource closes happen after drain completion rather than mid-shutdown. |
Gateway graceful-stop fallback
```go
// core/controlplane/gateway/gateway.go (excerpt)
const shutdownTimeout = 15 * time.Second

shutdownCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
defer cancel()

if err := srv.Shutdown(shutdownCtx); err != nil {
  slog.Error("http shutdown error", "error", err)
}

grpcDone := make(chan struct{})
go func() {
  grpcServer.GracefulStop()
  close(grpcDone)
}()

select {
case <-grpcDone:
  slog.Info("gRPC server drained")
case <-shutdownCtx.Done():
  slog.Warn("gRPC graceful stop timed out, forcing")
  grpcServer.Stop()
}
```
Shutdown synchronization channel
```go
// core/controlplane/gateway/gateway.go (excerpt)
shutdownDone := make(chan struct{})
go func() {
  defer close(shutdownDone)
  <-sigCtx.Done()
  // drain servers...
}()

if errors.Is(err, http.ErrServerClosed) {
  <-shutdownDone // wait for drain goroutine to finish before returning
  return nil
}
```
Context-engine graceful-stop fallback
```go
// cmd/cordum-context-engine/main.go (excerpt)
const shutdownTimeout = 15 * time.Second
shutdownCtx, cancel := context.WithTimeout(context.Background(), shutdownTimeout)
defer cancel()

grpcDone := make(chan struct{})
go func() {
  server.GracefulStop()
  close(grpcDone)
}()

select {
case <-grpcDone:
  slog.Info("context-engine gRPC server drained")
case <-shutdownCtx.Done():
  slog.Warn("context-engine gRPC graceful stop timed out, forcing")
  server.Stop()
}
```

Shutdown sequence details

Sequence matters. Cordum first stops accepting new HTTP work, then drains gRPC, then closes metrics.

The pattern reduces race windows where late requests enter while the process is already in teardown mode.

Unit tests verify the timeout branch and assert that forced stop releases a stuck graceful drain.

Timeout branch unit test
```go
// core/controlplane/gateway/shutdown_test.go (excerpt)
shutdownCtx, cancel := context.WithTimeout(context.Background(), 50*time.Millisecond)
defer cancel()

grpcDone := make(chan struct{})
go func() {
  grpcServer.GracefulStop()
  close(grpcDone)
}()

select {
case <-grpcDone:
case <-shutdownCtx.Done():
  grpcServer.Stop()
  select {
  case <-grpcDone:
  case <-time.After(5 * time.Second):
    t.Fatal("grpcServer.Stop() did not unblock GracefulStop")
  }
}
```

Validation runbook

Test this with synthetic slow handlers before relying on it during real rollout pressure.

Staging drill
```bash
# 1) Start staging instance and open a long-running gRPC call
# 2) Send SIGTERM to the process
# 3) Confirm log shows graceful shutdown start with 15s timeout
# 4) If call exceeds budget, confirm forced Stop log appears
# 5) Verify process exits before pod grace period and rollout proceeds
```

Limitations and tradeoffs

| Approach | Upside | Downside |
|---|---|---|
| Only `GracefulStop` with no timeout | Best chance for in-flight RPC completion. | Can hang shutdown forever if handlers never finish. |
| `GracefulStop` + forced `Stop` fallback (Cordum pattern) | Bounded shutdown time with graceful path first. | Some in-flight RPCs can be cancelled when the timeout is hit. |
| Immediate `Stop` only | Fastest process termination. | Higher client failure rate during deploys and less graceful behavior. |

Next step

Add a CI shutdown test that opens a long-running streaming RPC, sends SIGTERM, and asserts process exit under your deployment grace period with explicit verification of the forced-stop path.

