
AI Agent gRPC Deadline Budgeting

Timeouts are budget math. If you skip the math, you inherit cascading failure.

Guide · 10 min read · Mar 2026
TL;DR
  • A deadline is a shared budget across all downstream hops, not a per-hop timeout.
  • Retrying after `DEADLINE_EXCEEDED` without budget checks multiplies failure load.
  • Each hop needs explicit reserve time for serialization, queueing, and retry jitter.
  • Cordum services use short internal timeouts (2s store ops, 3s safety checks, 15s shutdown envelope).
Budget per hop

Allocate time intentionally for each call layer instead of using one global timeout.

Retry discipline

Retries need remaining-deadline checks before they can be considered safe.

Fail fast on debt

When budget is exhausted, fail early and surface cause instead of compounding load.

Scope

This guide focuses on internal control-plane gRPC calls where one request fans out across policy, storage, and orchestration dependencies.

The production problem

One caller sets a 1-second deadline. Downstream services each use their own default timeout. Retries kick in without checking remaining budget. That is how a single slow hop becomes a full-chain timeout event.

Most teams notice this only after rollout or traffic spikes because median latency looks fine. Tail latency quietly consumes the budget.

Deadline strategy has to be explicit across the call graph.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| gRPC deadlines guide | Deadline propagation concepts and why callers should set deadlines. | No concrete budget split strategy for multi-hop AI control planes. |
| gRPC retry guide | Retry primitives and policy controls. | No production method for retry gating on remaining deadline budget. |
| gRPC status codes | Meaning of `DEADLINE_EXCEEDED`, `UNAVAILABLE`, and related outcomes. | No operation-level decision matrix combining status code + budget state. |

The gap is budget allocation discipline: how much time each hop gets and when retries are still legal.

Deadline budget math

deadline_budget.txt
Text
# End-to-end deadline budget example
# Caller deadline: 1200ms

ingress_budget = 480ms     # 40%
core_rpc_budget = 540ms    # 45%
retry_reserve = 120ms      # 10%
response_tail = 60ms       # 5%

# Retry rule:
# Only retry if remaining_deadline > core_rpc_min + jitter_reserve
# Example threshold: remaining_deadline > 220ms
| Stage | Budget share | Purpose | Failure if mis-sized |
| --- | --- | --- | --- |
| Ingress handler | 40% | Auth, validation, routing, initial policy checks. | Slow preflight burns downstream budget before core work starts. |
| Core dependency RPC | 45% | Main business call (for example, policy check, context fetch, write path). | Insufficient budget forces immediate `DEADLINE_EXCEEDED` under a normal latency tail. |
| Retry reserve | 10% | One bounded retry with jitter for transient transport failures. | No reserve means retries violate the caller deadline and worsen load. |
| Response/cleanup | 5% | Marshal response, flush metrics, final state transition. | No tail reserve causes success-path calls to time out at the response edge. |

Cordum timeout baseline

These values are live in the current codebase and are useful anchors for budget design.

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Scheduler store operations | `storeOpTimeout = 2s` for many lock/store interactions in the scheduler engine. | Keeps internal lock/store calls bounded under contention. |
| Scheduler safety checks | `safetyCheckTimeout = 3s` for the pre-dispatch policy evaluation path. | Prevents long policy stalls from blocking scheduler worker loops. |
| Workflow handler budget | Workflow result handling uses a 30s handler context timeout. | Longer path budget for workflow step completion and state updates. |
| Service shutdown envelope | Core services drain gracefully within 15s during SIGTERM windows. | Deadlines longer than the shutdown window need caller-side retry/continuation logic. |

Implementation examples

Remaining-budget-aware retry (Go)

deadline_retry.go
Go
package policy // package name illustrative

import (
	"context"
	"math/rand"
	"time"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
	// pb is the generated gRPC package for the safety-kernel API (path elided).
)

// withDeadlineRetry performs at most one retry, and only when the caller's
// remaining deadline still covers a minimum attempt plus jitter.
func withDeadlineRetry(ctx context.Context, req *pb.PolicyCheckRequest, c pb.SafetyKernelClient) (*pb.PolicyCheckResponse, error) {
	attempts := 0
	for {
		attempts++
		resp, err := c.Check(ctx, req)
		if err == nil {
			return resp, nil
		}

		st, ok := status.FromError(err)
		if !ok {
			return nil, err // not a gRPC status: do not retry
		}

		// Only transient transport-level outcomes are retry candidates.
		if st.Code() != codes.Unavailable && st.Code() != codes.DeadlineExceeded {
			return nil, err
		}

		// No explicit deadline means no budget to reason about; cap at one retry.
		dl, hasDL := ctx.Deadline()
		if !hasDL || attempts >= 2 {
			return nil, err
		}

		// Budget floor for one safe retry: core-RPC minimum + jitter reserve.
		if remaining := time.Until(dl); remaining < 220*time.Millisecond {
			return nil, err
		}

		// Jittered backoff keeps retries from synchronizing under load.
		jitter := time.Duration(rand.Int63n(int64(40 * time.Millisecond)))
		time.Sleep(80*time.Millisecond + jitter)
	}
}

Deadline drift runbook

deadline_runbook.sh
Bash
# Check timeout-related errors during rollout
kubectl logs deploy/cordum-api-gateway -n cordum | grep -E "DEADLINE_EXCEEDED|UNAVAILABLE|CANCELLED"
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety check timeout|storeOpTimeout|retry"

# Trigger controlled rollout
kubectl rollout restart deployment/cordum-api-gateway -n cordum
kubectl rollout status deployment/cordum-api-gateway -n cordum

# Confirm caller budgets and retry counts via metrics/logs
curl -s http://localhost:9092/metrics | grep -E "grpc|timeout|retry"

Limitations and tradeoffs

  • Larger deadlines improve success under tail latency but can hide slow dependency regressions.
  • Smaller deadlines protect upstream latency SLOs but increase timeout error rate during spikes.
  • Retry reserves improve resilience but consume budget that could be used by primary execution.
  • Strict fail-fast logic lowers blast radius but may reject recoverable transient calls.

If you cannot explain where each millisecond goes, your deadline policy is probably guessing in production.

Next step

Run this in one sprint:

  1. Trace one critical request path and list every hop with p95 and p99 latency.
  2. Assign deadline shares explicitly and codify them in code comments and config.
  3. Add a guard that blocks retries when the remaining deadline drops below the safety floor.
  4. Run a rollout drill and confirm timeout/retry counters stay inside the error budget.

Continue with "AI Agent gRPC CANCELLED and UNAVAILABLE" and "AI Agent Lock TTL Tuning".

Deadlines are architecture

Put deadline math into design reviews before the next incident does it for you.