
AI Agent Timeouts, Retries, and Backoff in Production

Retry loops do not create reliability by magic. Budget math does.

Guide · 11 min read · Apr 2026
TL;DR
- Retry policy is budget policy. Every extra attempt spends time, queue capacity, and dependency headroom.
- Cordum currently applies two safety timeout layers: a 2s gRPC timeout in `SafetyClient` and a 3s outer guard in the scheduler engine.
- A 50-attempt scheduling cap with 1s-30s exponential backoff can stretch failure realization to roughly 25 minutes.
Deadline budget

Split time intentionally across safety checks, dispatch, and result handling.

Retry boundaries

Classify retryable errors and cap attempts per class.

Jitter behavior

Spread retries to avoid synchronized herd effects.
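
The sketch below shows one way to turn a single run deadline into per-phase deadlines for safety checks, dispatch, and result handling. It is a minimal sketch: `runAgentAction`, `checkPolicy`, `dispatch`, and `handleResult` are hypothetical names, and the 20/60/20 split is an example allocation, not a Cordum recommendation.

Go
import (
  "context"
  "time"
)

// Hypothetical stand-ins for the real phases; replace with actual logic.
var (
  checkPolicy  = func(ctx context.Context) error { return nil }
  dispatch     = func(ctx context.Context) error { return nil }
  handleResult = func(ctx context.Context) error { return nil }
)

// runAgentAction splits one hard run deadline across the three phases so no
// single phase can silently consume the whole budget.
func runAgentAction(ctx context.Context, runBudget time.Duration) error {
  ctx, cancel := context.WithTimeout(ctx, runBudget)
  defer cancel()

  // Safety checks get a small, explicit slice of the budget.
  safetyCtx, cancelSafety := context.WithTimeout(ctx, runBudget*20/100)
  defer cancelSafety()
  if err := checkPolicy(safetyCtx); err != nil {
    return err
  }

  // Dispatch, including its internal retries, gets the largest slice.
  dispatchCtx, cancelDispatch := context.WithTimeout(ctx, runBudget*60/100)
  defer cancelDispatch()
  if err := dispatch(dispatchCtx); err != nil {
    return err
  }

  // Result handling runs against whatever remains of the parent deadline.
  return handleResult(ctx)
}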

Scope

This guide targets autonomous control planes that schedule work, call policy services, and publish to a queue under at-least-once delivery.

The production problem

Partial outages are where retry policy quality shows up. One slow dependency can trigger retries faster than it can recover.

In autonomous systems, retry loops are automatic and parallel. If budgets are loose, you amplify the outage instead of riding through it.

Teams usually tune retries first and deadlines later. That order is backwards. Deadline budget defines what retries are allowed to spend.

What top sources cover and miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Builders Library: timeouts, retries, and backoff with jitter | Excellent explanation of retry amplification and why jitter reduces correlated retry spikes. Clear warning: retries are selfish under overload. | No concrete control-plane profile for autonomous agents with policy checks, approval gates, and dead-letter fallback. |
| gRPC deadlines guide | Strong semantics for deadlines, `DEADLINE_EXCEEDED`, automatic cancellation, and propagation with timeout conversion for clock-skew safety. | No budgeting model for multi-step orchestration where one logical action includes several internal RPC layers. |
| Google Cloud retry strategy | Concrete backoff+jitter defaults and a useful idempotency split: always idempotent, conditionally idempotent, non-idempotent. | No governance-aware retry matrix for autonomous actions where safety-unavailable and no-worker errors need different treatment. |

Timeout and retry budget model

| Layer | Required rule | Failure if missing |
| --- | --- | --- |
| Run budget | Set a hard upper deadline for the entire run. | Retries can outlive user and operator intent. |
| Safety budget | Use explicit timeout layers for policy checks and track them separately. | Policy service latency silently dominates run time. |
| Scheduling retries | Use jittered exponential backoff with a hard max-attempt cap. | Queue churn hides root cause and delays terminal handling. |
| Error-class policy | Map each error class to retry, delay, or terminal path. | Non-retryable failure loops consume capacity with no chance of recovery. |
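
The first row is the one most teams skip: a hard upper deadline that retries cannot outlive. The sketch below checks, before every requeue, whether the next delay still fits inside the remaining run budget. It is an in-process simplification (the real scheduler requeues tasks rather than sleeping), and `scheduleWithBudget` is an illustrative name; `backoffDelay` and `maxSchedulingRetries` are the values shown later in this guide.

Go
import (
  "context"
  "errors"
  "time"
)

var errBudgetExhausted = errors.New("run budget exhausted before scheduling succeeded")

// scheduleWithBudget retries one scheduling step until it succeeds, the
// attempt cap is hit, or the next delay no longer fits inside the run
// deadline carried by ctx.
func scheduleWithBudget(ctx context.Context, step func(context.Context) error) error {
  for attempt := 0; attempt < maxSchedulingRetries; attempt++ {
    if err := step(ctx); err == nil {
      return nil
    }
    delay := backoffDelay(attempt)
    if deadline, ok := ctx.Deadline(); ok && time.Until(deadline) <= delay {
      // Stop here and hand off to the terminal path (FAILED + DLQ) instead
      // of letting retries outlive operator intent.
      return errBudgetExhausted
    }
    select {
    case <-time.After(delay):
    case <-ctx.Done():
      return ctx.Err()
    }
  }
  return errBudgetExhausted
}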

Cordum runtime behavior

| Control | Current behavior | Why it matters |
| --- | --- | --- |
| Safety check timeout (inner) | 2s (`SafetyClient` gRPC timeout) | Bounds direct policy RPC latency per attempt. |
| Safety check timeout (outer) | 3s (`engine` defense-in-depth timeout) | Guards against handler stalls around the policy path. |
| Safety unavailable delay | 5s requeue delay in fail-closed mode | Applies pressure control during policy outages. |
| Scheduling backoff | Exponential from 1s to 30s + crypto jitter up to 500ms | Reduces synchronized retry spikes across workers. |
| Max scheduling retries | 50 attempts before FAILED + DLQ (~25 minutes worst case) | Creates a deterministic handoff from automatic retries to operator triage. |
| Doc drift to track | `scheduler-internals.md` still lists `safetyTimeout=2s` | Code currently has dual layers (2s inner, 3s outer); docs need explicit clarification. |

Code check detail: the scheduler docs list `safetyTimeout=2s`, but the runtime uses two layers, a 2s timeout in `SafetyClient` and a 3s outer timeout in the scheduler engine.
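
The two layers can be hard to picture from the table alone. Below is a minimal sketch of how an outer context guard wraps a client that carries its own per-RPC timeout; `safetyChecker` and `checkSafety` are illustrative names rather than Cordum's actual types, and `safetyCheckTimeout` is the 3s constant shown in the next section.

Go
import (
  "context"
)

// safetyChecker is a stand-in for the policy client; the real SafetyClient
// applies its own 2s gRPC timeout inside Check.
type safetyChecker interface {
  Check(ctx context.Context, action string) (allowed bool, err error)
}

// checkSafety adds the outer defense-in-depth guard: even if the client or
// anything around it stalls, the whole safety path is bounded by 3s.
func checkSafety(ctx context.Context, c safetyChecker, action string) (bool, error) {
  ctx, cancel := context.WithTimeout(ctx, safetyCheckTimeout)
  defer cancel()
  return c.Check(ctx, action)
}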

Implementation examples

Scheduler retry constants (Go)

engine.go
Go
const (
  // Per-class retry delays for transient scheduling failures.
  retryDelayBusy      = 500 * time.Millisecond
  retryDelayStore     = 1 * time.Second
  retryDelayPublish   = 2 * time.Second
  retryDelayNoWorkers = 2 * time.Second

  // Requeue delay applied in fail-closed mode while the policy service is unavailable.
  safetyThrottleDelay = 5 * time.Second
  // Outer defense-in-depth guard around the safety path; SafetyClient applies
  // its own 2s timeout per RPC.
  safetyCheckTimeout = 3 * time.Second

  // Attempts before a task is marked FAILED and routed to the DLQ.
  maxSchedulingRetries = 50
)
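
These constants only matter once every error class is mapped onto one of them, which is the "error-class policy" row from the budget model above. The sketch below shows one such mapping; the sentinel errors are hypothetical placeholders for however the scheduler actually classifies failures.

Go
import (
  "errors"
  "time"
)

// Hypothetical sentinel errors standing in for the scheduler's real
// failure classification.
var (
  errWorkerBusy        = errors.New("worker busy")
  errStoreUnavailable  = errors.New("store unavailable")
  errPublishFailed     = errors.New("publish failed")
  errNoWorkers         = errors.New("no workers available")
  errSafetyUnavailable = errors.New("safety service unavailable")
)

// classifyRetryDelay maps one failed scheduling attempt to a delay, or
// reports that the error is terminal and should go straight to FAILED + DLQ.
func classifyRetryDelay(err error) (delay time.Duration, retryable bool) {
  switch {
  case errors.Is(err, errWorkerBusy):
    return retryDelayBusy, true
  case errors.Is(err, errStoreUnavailable):
    return retryDelayStore, true
  case errors.Is(err, errPublishFailed):
    return retryDelayPublish, true
  case errors.Is(err, errNoWorkers):
    return retryDelayNoWorkers, true
  case errors.Is(err, errSafetyUnavailable):
    // Fail-closed: requeue with the safety throttle delay during policy outages.
    return safetyThrottleDelay, true
  default:
    return 0, false
  }
}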

Backoff formula with jitter (Go)

backoff.go
Go
const (
  backoffBase      = 1 * time.Second
  backoffMax       = 30 * time.Second
  backoffJitterMax = 500 * time.Millisecond
)

// backoffDelay returns the jittered exponential delay for a scheduling
// attempt, capped at backoffMax.
func backoffDelay(attempt int) time.Duration {
  // Clamp before shifting: backoffBase<<attempt overflows int64 well below
  // the 50-attempt cap, and 1s<<5 already exceeds the 30s ceiling.
  delay := backoffMax
  if attempt < 5 {
    delay = backoffBase << attempt
  }
  jitter := cryptoJitter(backoffJitterMax)
  return min(delay+jitter, backoffMax)
}
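
`backoffDelay` depends on a `cryptoJitter` helper that the snippet above does not show. The sketch below is one plausible implementation using `crypto/rand`; the name, signature, and fall-back-to-zero behavior are assumptions, not Cordum's actual helper.

Go
import (
  "crypto/rand"
  "math/big"
  "time"
)

// cryptoJitter returns a uniformly random duration in [0, max), drawn from
// crypto/rand so retry timing stays decorrelated even across processes that
// start at the same instant.
func cryptoJitter(max time.Duration) time.Duration {
  if max <= 0 {
    return 0
  }
  n, err := rand.Int(rand.Reader, big.NewInt(int64(max)))
  if err != nil {
    // Assumption: on an entropy read failure, prefer no jitter over failing
    // the retry path.
    return 0
  }
  return time.Duration(n.Int64())
}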

Retry window estimation (Go)

retry_window.go
Go
func ApproxWorstCaseRetryWindow(attempts int) time.Duration {
  // Cumulative backoff for 1s doubling capped at 30s. Jitter is excluded; it
  // adds at most attempts * backoffJitterMax on top.
  total := time.Duration(0)
  for i := 0; i < attempts; i++ {
    // Clamp before shifting so the exponential term cannot overflow int64
    // at high attempt counts.
    d := backoffMax
    if i < 5 {
      d = backoffBase << i
    }
    total += d
  }
  return total
}

// ApproxWorstCaseRetryWindow(50) ~= 23 minutes of pure backoff; the guide
// rounds this to roughly 25 minutes of worst-case delay.

This number belongs in your on-call docs. If retries can run for ~25 minutes, incident handling should expect delayed terminal signals.

Limitations and tradeoffs

- A high retry cap improves transient resilience but delays hard failure visibility for operators.
- Fail-closed safety mode protects risky actions but can increase queue pressure during policy outages.
- Jitter spreads load, but it also makes per-request completion time less predictable.
- Two timeout layers improve defense but can confuse telemetry unless you emit both in traces.

Next step

Run this in one sprint:

  1. Define the run deadline, per-step deadline, and max retry window for your top five workflows.
  2. Classify retryable errors and assign an explicit delay policy per class.
  3. Add trace fields for remaining budget and selected retry delay on each attempt (see the sketch after this list).
  4. Run one chaos drill: take the policy kernel offline for 10 minutes and measure queue and backlog behavior.
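
Step 3 is the easiest one to leave vague. The sketch below emits the two per-attempt fields using the standard library's `log/slog` as a stand-in for whatever tracing or logging stack you run; the function and field names are illustrative.

Go
import (
  "context"
  "log/slog"
  "time"
)

// logRetryAttempt records the two fields that make retry storms debuggable:
// how much of the run budget remains, and which delay was just selected.
func logRetryAttempt(ctx context.Context, attempt int, delay time.Duration) {
  remaining := time.Duration(0)
  if deadline, ok := ctx.Deadline(); ok {
    remaining = time.Until(deadline)
  }
  slog.InfoContext(ctx, "scheduling retry",
    slog.Int("attempt", attempt),
    slog.Duration("selected_delay", delay),
    slog.Duration("remaining_budget", remaining),
  )
}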

Continue with AI Agent Circuit Breaker Pattern and AI Agent DLQ and Replay Patterns.

Retry budget first, retry loop second

If your team cannot state the worst-case retry window in minutes, your retry policy is not production-ready.