The production problem
A rolling restart succeeds at the deployment layer, but callers still see transient gRPC failures. Teams typically respond with either no retries or unlimited retries, and both are wrong.
Without clear status-code handling, a single transient disconnect can become a retry flood; worse, a single dropped call can silently skip a safety-critical decision path.
Correct handling requires code-aware retry rules tied to idempotency boundaries.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC status codes | Semantics of `CANCELLED`, `UNAVAILABLE`, and other status outcomes. | No restart-phase retry policy for lock-backed control-plane workflows. |
| gRPC graceful shutdown | How servers stop accepting new RPCs and drain in-flight calls. | No client strategy for mixed in-flight cancellation + pod replacement windows. |
| Kubernetes Pod lifecycle | Termination behavior and pod state transitions under shutdown. | No app-level handling for gRPC retry/idempotency boundaries. |
The gap is decision logic for real production flows: what to retry, how long, and when to fail fast.
Status code handling matrix
| gRPC code | Likely cause | Retry rule | Guardrail |
|---|---|---|---|
| CANCELLED | Client-side context cancellation or server shutdown race during in-flight call. | Retry only if operation is idempotent and caller deadline still valid. | Attach idempotency key and short jittered retry budget. |
| UNAVAILABLE | Connection dropped, pod terminating, endpoint not yet ready, or transient transport outage. | Retry with exponential backoff and max-attempt cap. | Hedge only for read-like calls; avoid fan-out retry amplification. |
| DEADLINE_EXCEEDED | Caller timeout too short for operation latency profile. | Retry only after checking timeout budget and service latency trend. | Do not stack retries under an already exhausted deadline. |
| FAILED_PRECONDITION | Business state violation, not transport instability. | Do not retry blindly. | Surface to caller and require state correction. |
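For methods where retry is safe, part of this matrix can be enforced by gRPC's built-in retry support via a channel service config, passed to `grpc.WithDefaultServiceConfig` at dial time. A minimal sketch (the `cordum.v1.SafetyKernel` service name is illustrative; built-in retries cover codes listed in `retryableStatusCodes`, such as `UNAVAILABLE`, and will not blindly retry `CANCELLED` or `FAILED_PRECONDITION`):

```json
{
  "methodConfig": [{
    "name": [{ "service": "cordum.v1.SafetyKernel" }],
    "retryPolicy": {
      "maxAttempts": 3,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}
```

Idempotency-sensitive codes like `CANCELLED` still belong in an application-level wrapper, where the idempotency key can actually be checked.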
Cordum restart behavior
Cordum's shutdown and troubleshooting docs already describe the expected transport outcomes during rolling restarts. Client policy should align with this behavior.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Documented restart behavior | Cordum troubleshooting states in-flight gRPC calls can receive `CANCELLED` or `UNAVAILABLE` during rolling restarts. | These codes should be treated as planned transient signals, not always incidents. |
| Gateway shutdown | Gateway drains gRPC with `GracefulStop()` and force-stops if timeout expires. | Late in-flight calls can still fail; callers need bounded retry policy. |
| Context Engine / Safety Kernel | Both services use graceful gRPC drain with force-stop fallback after timeout. | Transient transport errors during rollout are expected and recoverable. |
| Shutdown envelope | Service shutdown target is 15s, inside the default 30s Kubernetes termination grace period. | Well-tuned clients usually recover within one retry window. |
Implementation examples
Bounded retry wrapper (Go)

```go
// Bounded retry for transient codes only. Assumes pb is the generated
// protobuf package for the Safety Kernel API.
import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func callWithRetry(ctx context.Context, req *pb.PolicyCheckRequest, c pb.SafetyKernelClient) (*pb.PolicyCheckResponse, error) {
	backoff := []time.Duration{100 * time.Millisecond, 250 * time.Millisecond, 500 * time.Millisecond}
	for i, wait := range backoff {
		resp, err := c.Check(ctx, req)
		if err == nil {
			return resp, nil
		}
		st, ok := status.FromError(err)
		if !ok {
			return nil, err // not a gRPC status error: do not retry
		}
		switch st.Code() {
		case codes.Canceled, codes.Unavailable:
			// Stop when the retry budget is spent or the caller context is done.
			if i == len(backoff)-1 || ctx.Err() != nil {
				return nil, err
			}
			jitter := time.Duration(rand.Int63n(int64(wait / 2)))
			time.Sleep(wait + jitter)
			continue
		default:
			return nil, err // non-transient code: fail fast
		}
	}
	return nil, status.Error(codes.Unavailable, "retry budget exhausted")
}
```

Retry decision policy
```
# Retry policy decision rules
# Input: grpc_code, idempotent, deadline_remaining_ms
if grpc_code in [CANCELLED, UNAVAILABLE] and idempotent and deadline_remaining_ms > 300:
    retry_with_jitter(max_attempts=3)
else:
    return_error_immediately()
```
Rollout verification runbook
```
# Watch shutdown + gRPC drain logs
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "shutting down gracefully|gRPC server drained|timed out"
kubectl logs deploy/cordum-context-engine -n cordum | grep -E "shutting down gracefully|gRPC"
kubectl logs deploy/cordum-safety-kernel -n cordum | grep -E "shutting down gracefully|gRPC"

# During rollout, track retry pressure and failures
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# Verify no duplicate side-effects for retried operations
redis-cli GET "cordum:scheduler:job:JOB_ID"
```
Limitations and tradeoffs
- More retries improve transient recovery but can amplify load under broad outages.
- Fewer retries reduce blast radius but can increase the visible error rate during rollouts.
- Tight caller deadlines reduce tail latency but increase `DEADLINE_EXCEEDED` volume.
- Idempotency enforcement adds storage and metadata overhead.
Retrying non-idempotent calls blindly is how one transient transport error becomes permanent data corruption.
Next step
Run this in one sprint:
1. Build a status-code retry matrix per internal gRPC method.
2. Require an idempotency key on every retryable mutating call.
3. Add a rollout canary that tracks `CANCELLED`/`UNAVAILABLE` rates and auto-halts on threshold.
4. Run forced termination drills and confirm the retry budget prevents storms.
Continue with AI Agent Health Checks and AI Agent Rolling Restart Playbook.