The production problem
A rolling restart succeeds at the deployment layer, but callers still see transient gRPC failures. Teams typically respond with either no retries or unlimited retries, and both are wrong.
Without clear status-code handling, a single transient disconnect can become a retry flood; worse, a single dropped call can silently skip a safety-critical decision path.
Correct handling requires code-aware retry rules tied to idempotency boundaries.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| gRPC status codes | Semantics of `CANCELLED`, `UNAVAILABLE`, and other status outcomes. | No restart-phase retry policy for lock-backed control-plane workflows. |
| gRPC graceful shutdown | How servers stop accepting new RPCs and drain in-flight calls. | No client strategy for mixed in-flight cancellation + pod replacement windows. |
| Kubernetes Pod lifecycle | Termination behavior and pod state transitions under shutdown. | No app-level handling for gRPC retry/idempotency boundaries. |
The gap is decision logic for real production flows: what to retry, how long, and when to fail fast.
Status code handling matrix
| gRPC code | Likely cause | Retry rule | Guardrail |
|---|---|---|---|
| CANCELLED | Client-side context cancellation or server shutdown race during in-flight call. | Retry only if operation is idempotent and caller deadline still valid. | Attach idempotency key and short jittered retry budget. |
| UNAVAILABLE | Connection dropped, pod terminating, endpoint not yet ready, or transient transport outage. | Retry with exponential backoff and max-attempt cap. | Hedge only for read-like calls; avoid fan-out retry amplification. |
| DEADLINE_EXCEEDED | Caller timeout too short for operation latency profile. | Retry only after checking timeout budget and service latency trend. | Do not stack retries under an already exhausted deadline. |
| FAILED_PRECONDITION | Business state violation, not transport instability. | Do not retry blindly. | Surface to caller and require state correction. |
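For methods where retry is safe, part of this matrix can be enforced by gRPC's built-in retry support via a channel service config, passed to `grpc.WithDefaultServiceConfig` at dial time. A minimal sketch (the `cordum.v1.SafetyKernel` service name is illustrative; built-in retries cover codes listed in `retryableStatusCodes`, such as `UNAVAILABLE`, and will not blindly retry `CANCELLED` or `FAILED_PRECONDITION`):

```json
{
  "methodConfig": [{
    "name": [{ "service": "cordum.v1.SafetyKernel" }],
    "retryPolicy": {
      "maxAttempts": 3,
      "initialBackoff": "0.1s",
      "maxBackoff": "1s",
      "backoffMultiplier": 2.0,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}
```

Idempotency-sensitive codes like `CANCELLED` still belong in an application-level wrapper, where the idempotency key can actually be checked.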
Cordum restart behavior
Cordum's shutdown and troubleshooting docs already describe the expected transport outcomes during rolling restarts. Client policy should align with this behavior.
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Documented restart behavior | Cordum troubleshooting states in-flight gRPC calls can receive `CANCELLED` or `UNAVAILABLE` during rolling restarts. | These codes should be treated as planned transient signals, not always incidents. |
| Gateway shutdown | Gateway drains gRPC with `GracefulStop()` and force-stops if timeout expires. | Late in-flight calls can still fail; callers need bounded retry policy. |
| Context Engine / Safety Kernel | Both services use graceful gRPC drain with force-stop fallback after timeout. | Transient transport errors during rollout are expected and recoverable. |
| Shutdown envelope | Service shutdown target is 15s, inside the default 30s Kubernetes termination grace period. | Well-tuned clients usually recover within one retry window. |
Implementation examples
Bounded retry wrapper (Go)

```go
// Bounded retry for transient codes only. Assumes pb is the generated
// protobuf package for the Safety Kernel API.
import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func callWithRetry(ctx context.Context, req *pb.PolicyCheckRequest, c pb.SafetyKernelClient) (*pb.PolicyCheckResponse, error) {
	backoff := []time.Duration{100 * time.Millisecond, 250 * time.Millisecond, 500 * time.Millisecond}
	for i, wait := range backoff {
		resp, err := c.Check(ctx, req)
		if err == nil {
			return resp, nil
		}
		st, ok := status.FromError(err)
		if !ok {
			return nil, err // not a gRPC status error: do not retry
		}
		switch st.Code() {
		case codes.Canceled, codes.Unavailable:
			// Stop when the retry budget is spent or the caller context is done.
			if i == len(backoff)-1 || ctx.Err() != nil {
				return nil, err
			}
			jitter := time.Duration(rand.Int63n(int64(wait / 2)))
			time.Sleep(wait + jitter)
			continue
		default:
			return nil, err // non-transient code: fail fast
		}
	}
	return nil, status.Error(codes.Unavailable, "retry budget exhausted")
}
```

Retry decision policy
```
# Retry policy decision rules
# Input: grpc_code, idempotent, deadline_remaining_ms
if grpc_code in [CANCELLED, UNAVAILABLE] and idempotent and deadline_remaining_ms > 300:
    retry_with_jitter(max_attempts=3)
else:
    return_error_immediately()
```
Rollout verification runbook
```
# Watch shutdown + gRPC drain logs
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "shutting down gracefully|gRPC server drained|timed out"
kubectl logs deploy/cordum-context-engine -n cordum | grep -E "shutting down gracefully|gRPC"
kubectl logs deploy/cordum-safety-kernel -n cordum | grep -E "shutting down gracefully|gRPC"

# During rollout, track retry pressure and failures
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# Verify no duplicate side-effects for retried operations
redis-cli GET "cordum:scheduler:job:JOB_ID"
```
Limitations and tradeoffs
- More retries improve transient recovery but can amplify load under broad outages.
- Fewer retries reduce blast radius but can increase the visible error rate during rollouts.
- Tight caller deadlines reduce tail latency but increase `DEADLINE_EXCEEDED` volume.
- Idempotency enforcement adds storage and metadata overhead.
Retrying non-idempotent calls blindly is how one transient transport error becomes permanent data corruption.
Next step
Run this in one sprint:
1. Build a status-code retry matrix per internal gRPC method.
2. Require an idempotency key on every retryable mutating call.
3. Add a rollout canary that tracks `CANCELLED`/`UNAVAILABLE` rates and auto-halts on threshold.
4. Run forced termination drills and confirm the retry budget prevents storms.
Continue with AI Agent Health Checks and AI Agent Rolling Restart Playbook.