
AI Agent gRPC CANCELLED and UNAVAILABLE

During rollouts, these are often transient transport signals, not business failures.

Guide · 10 min read · Mar 2026
TL;DR
  • `CANCELLED` and `UNAVAILABLE` are expected during controlled shutdown windows.
  • Blind retries create retry storms unless bounded by idempotency and deadlines.
  • Retry policy should key off status code + call phase + side-effect boundary.
  • Cordum docs explicitly call out these two status codes during rolling restarts.
Code-aware retries

Retry behavior must be different for `CANCELLED` vs `UNAVAILABLE`.

Idempotency first

Retries are safe only when duplicate side effects are blocked.

Bounded backoff

Short, jittered retries recover quickly without flooding control services.

Scope

This guide focuses on internal gRPC calls between AI control-plane services during restart and rollout events where transient transport failures are expected.

The production problem

Rolling restart succeeds at the deployment layer, but callers still see transient gRPC failures. Teams then respond with either no retry or unlimited retry. Both are wrong.

Without clear status-code handling, one transient disconnect can become a retry flood. Or worse, one dropped call can silently skip a safety-critical decision path.

Correct handling requires code-aware retry rules tied to idempotency boundaries.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| gRPC status codes | Semantics of `CANCELLED`, `UNAVAILABLE`, and other status outcomes. | No restart-phase retry policy for lock-backed control-plane workflows. |
| gRPC graceful shutdown | How servers stop accepting new RPCs and drain in-flight calls. | No client strategy for mixed in-flight cancellation + pod replacement windows. |
| Kubernetes Pod lifecycle | Termination behavior and pod state transitions under shutdown. | No app-level handling for gRPC retry/idempotency boundaries. |

The gap is decision logic for real production flows: what to retry, how long, and when to fail fast.

Status code handling matrix

| gRPC code | Likely cause | Retry rule | Guardrail |
| --- | --- | --- | --- |
| `CANCELLED` | Client-side context cancellation or server shutdown race during in-flight call. | Retry only if operation is idempotent and caller deadline still valid. | Attach idempotency key and short jittered retry budget. |
| `UNAVAILABLE` | Connection dropped, pod terminating, endpoint not yet ready, or transient transport outage. | Retry with exponential backoff and max-attempt cap. | Hedge only for read-like calls; avoid fan-out retry amplification. |
| `DEADLINE_EXCEEDED` | Caller timeout too short for operation latency profile. | Retry only after checking timeout budget and service latency trend. | Do not stack retries under an already exhausted deadline. |
| `FAILED_PRECONDITION` | Business state violation, not transport instability. | Do not retry blindly. | Surface to caller and require state correction. |
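The `UNAVAILABLE` row can also be enforced at the channel level with gRPC's built-in retry support via service config. A minimal sketch of such a config — the service name `cordum.SafetyKernel` is a placeholder, not a confirmed Cordum API name:

```json
{
  "methodConfig": [{
    "name": [{ "service": "cordum.SafetyKernel" }],
    "retryPolicy": {
      "maxAttempts": 3,
      "initialBackoff": "0.1s",
      "maxBackoff": "0.5s",
      "backoffMultiplier": 2,
      "retryableStatusCodes": ["UNAVAILABLE"]
    }
  }]
}
```

Note that `CANCELLED` is deliberately absent from `retryableStatusCodes`: per the matrix above it is only safe to retry under an idempotency check, which channel-level config cannot express, so it belongs in application-level retry code.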

Cordum restart behavior

Cordum's shutdown and troubleshooting docs already describe the expected transport outcomes during rolling restarts. Client policy should align with this behavior.

BoundaryCurrent behaviorOperational impact
Documented restart behaviorCordum troubleshooting states in-flight gRPC calls can receive `CANCELLED` or `UNAVAILABLE` during rolling restarts.These codes should be treated as planned transient signals, not always incidents.
Gateway shutdownGateway drains gRPC with `GracefulStop()` and force-stops if timeout expires.Late in-flight calls can still fail; callers need bounded retry policy.
Context Engine / Safety KernelBoth services use graceful gRPC drain with force-stop fallback after timeout.Transient transport errors during rollout are expected and recoverable.
Shutdown envelopeService shutdown target is 15s, under default 30s Kubernetes termination grace.Well-tuned clients usually recover within one retry window.

Implementation examples

Bounded retry interceptor (Go)

grpc_retry.go
Go
package client

import (
  "context"
  "math/rand"
  "time"

  "google.golang.org/grpc/codes"
  "google.golang.org/grpc/status"
)

// pb is the project's generated protobuf package; its import path is omitted here.

func callWithRetry(ctx context.Context, req *pb.PolicyCheckRequest, c pb.SafetyKernelClient) (*pb.PolicyCheckResponse, error) {
  backoff := []time.Duration{100 * time.Millisecond, 250 * time.Millisecond, 500 * time.Millisecond}

  for i, wait := range backoff {
    resp, err := c.Check(ctx, req)
    if err == nil {
      return resp, nil
    }

    st, ok := status.FromError(err)
    if !ok {
      // Not a gRPC status error: do not retry.
      return nil, err
    }

    switch st.Code() {
    case codes.Canceled, codes.Unavailable:
      if i == len(backoff)-1 || ctx.Err() != nil {
        return nil, err
      }
      // Jittered backoff; abort the wait early if the caller's deadline expires.
      jitter := time.Duration(rand.Int63n(int64(wait / 2)))
      select {
      case <-time.After(wait + jitter):
      case <-ctx.Done():
        return nil, ctx.Err()
      }
    default:
      // Non-transient codes surface immediately.
      return nil, err
    }
  }
  return nil, status.Error(codes.Unavailable, "retry budget exhausted")
}

Retry decision policy

retry_policy.txt
Text
# Retry policy decision rules
# Input: grpc_code, idempotent, deadline_remaining_ms

if grpc_code in [CANCELLED, UNAVAILABLE] and idempotent and deadline_remaining_ms > 300:
  retry_with_jitter(max_attempts=3)
else:
  return_error_immediately()

Rollout verification runbook

grpc_restart_runbook.sh
Bash
# Watch shutdown + gRPC drain logs
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "shutting down gracefully|gRPC server drained|timed out"
kubectl logs deploy/cordum-context-engine -n cordum | grep -E "shutting down gracefully|gRPC"
kubectl logs deploy/cordum-safety-kernel -n cordum | grep -E "shutting down gracefully|gRPC"

# During rollout, track retry pressure and failures
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# Verify no duplicate side-effects for retried operations
redis-cli GET "cordum:scheduler:job:JOB_ID"

Limitations and tradeoffs

  • More retries improve transient recovery but can amplify load under broad outages.
  • Fewer retries reduce blast radius but can increase visible error rate during rollouts.
  • Tight caller deadlines reduce tail latency but increase `DEADLINE_EXCEEDED` volume.
  • Idempotency enforcement adds storage and metadata overhead.

Retrying non-idempotent calls blindly is how one transient transport error becomes permanent data corruption.

Next step

Run this in one sprint:

  1. Build a status-code retry matrix per internal gRPC method.
  2. Require an idempotency key on every retryable mutating call.
  3. Add a rollout canary that tracks `CANCELLED`/`UNAVAILABLE` rate and auto-halts on threshold.
  4. Run forced termination drills and confirm the retry budget prevents storms.

Continue with AI Agent Health Checks and AI Agent Rolling Restart Playbook.

Transport failures need policy, not panic

Make retry behavior explicit per code path before your next maintenance window.