
AI Agent Timeouts, Retries, and Backoff in Production

Set reliability budgets before your retry loops spend them for you.

Guide · 10 min read · Mar 2026
TL;DR
  - Retries are not free. Every retry consumes budget and queue capacity, and can repeat external side effects.
  - Deadline propagation must be explicit across agent hops, or one slow dependency blocks the entire run.
  - Use jittered exponential backoff with hard stop conditions, not open-ended loops.
  - Deadline budget: allocate time per step instead of one giant timeout.
  - Retry boundaries: hard caps prevent retry storms.
  - Jittered backoff: desynchronize fleets under partial outages.

Scope

This guide focuses on autonomous agent jobs that call external tools and internal control-plane services under real production latency and failure constraints.

The production problem

Most incident timelines include this sentence: “Retrying eventually made it worse.” Autonomous agents write that sentence faster.

Without explicit timeout budgets, retry boundaries, and jitter, a partial outage can cascade into queue growth, duplicate actions, and exhausted operator attention.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| AWS Builders Library: timeouts, retries, backoff with jitter | Strong failure-mode analysis and retry amplification risks across layered services. | No agent-control-plane guidance for policy-aware retries on side-effecting actions. |
| gRPC Deadlines guide | Clear deadline semantics, `DEADLINE_EXCEEDED`, and propagation behavior. | Limited advice on budget splitting across autonomous multi-step agent workflows. |
| Google Cloud IAM retry strategy | Concrete truncated exponential backoff algorithm with jitter and deadline stop. | No mapping from error classes to governance outcomes for autonomous actions. |

Timeout budget model

| Layer | Required rule | Failure if missing |
| --- | --- | --- |
| Run budget | Set an upper bound for the full workflow (for example, 120s). | Infinite waiting across chained retries. |
| Step budget | Split the run budget by step criticality and historical latency. | One step consumes all remaining time and starves downstream checks. |
| Retry budget | Cap attempts and backoff ceilings per error class. | Retry storms and queue bloat during a partial outage. |
| Policy budget | For high-risk actions, prefer fail-closed when the policy service is unavailable. | Unsafe actions execute because retries degraded into fail-open behavior. |

Cordum runtime defaults

| Control | Default | Why this matters |
| --- | --- | --- |
| Safety RPC timeout | 2s per check | Bounds control-plane latency so policy checks do not stall scheduling. |
| Scheduling backoff | Exponential, 1s to 30s, with jitter | Spreads retries and avoids synchronized thundering herds. |
| Retry limit | 50 attempts before terminal failure + DLQ | Creates a deterministic handoff from automatic retries to operator triage. |
| Safety unavailable path | Requeue with 5s delay when circuit is open | Provides bounded pressure during dependency degradation. |
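The defaults above imply a simple per-failure scheduling decision: requeue while the safety circuit is open, dead-letter after the retry limit, otherwise back off and retry. A sketch of that decision table; the names are hypothetical, not the Cordum runtime itself:

```go
package main

import "fmt"

// Disposition is what the scheduler does with a failed attempt.
type Disposition string

const (
	Retry   Disposition = "retry_with_backoff" // jittered, 1s to 30s
	Requeue Disposition = "requeue_5s"         // safety service circuit open
	DLQ     Disposition = "dead_letter"        // terminal; operator triage
)

// decide applies the defaults from the table above.
func decide(attempt int, safetyCircuitOpen bool) Disposition {
	if safetyCircuitOpen {
		return Requeue // bounded pressure while the dependency recovers
	}
	if attempt >= 50 {
		return DLQ // deterministic handoff after the retry limit
	}
	return Retry
}

func main() {
	fmt.Println(decide(3, false))  // retry_with_backoff
	fmt.Println(decide(50, false)) // dead_letter
	fmt.Println(decide(3, true))   // requeue_5s
}
```

Keeping this decision in one pure function makes the retry-to-DLQ handoff auditable: a trace only needs the attempt count and circuit state to explain why a run stopped retrying.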

Implementation examples

Budget + jitter helper (Go)

budget.go
Go
package budget

import (
	"math/rand"
	"time"
)

// Budget bounds one agent run: a hard run deadline, a per-step
// ceiling, and a cap on attempts before terminal failure.
type Budget struct {
	RunDeadline time.Time
	StepTimeout time.Duration
	MaxAttempts int
}

// NextAttemptDelay returns jittered exponential backoff for the
// nth attempt, capped at 30s. The shift is clamped so a large n
// cannot overflow time.Duration.
func NextAttemptDelay(n int) time.Duration {
	base := time.Second
	ceiling := 30 * time.Second
	if n > 5 { // 1s << 5 = 32s already exceeds the ceiling
		n = 5
	}
	d := base << n
	if d > ceiling {
		d = ceiling
	}
	jitter := time.Duration(rand.Intn(500)) * time.Millisecond
	return d + jitter
}

Retry and deadline policy (YAML)

retries.yaml
YAML
retry_policy:
  transient_errors:
    attempts: 5
    backoff:
      base: 1s
      max: 30s
      jitter: true
  policy_unavailable:
    attempts: 3
    delay: 5s
    fail_mode: closed
deadline:
  run_timeout: 120s
  per_step_timeout: 15s

Per-step retry trace (JSON)

retry-trace.json
JSON
{
  "run_id": "run_31ab",
  "step_id": "step_fetch_pr",
  "attempt": 4,
  "step_timeout_ms": 15000,
  "backoff_ms": 8400,
  "remaining_run_budget_ms": 43200,
  "status": "retry_scheduled"
}

Limitations and tradeoffs

  - Tight deadlines reduce resource waste but can increase false timeout rates during load spikes.
  - Large backoff caps protect dependencies but may delay user-visible recovery.
  - Fail-closed policy behavior improves safety but can reduce availability during control-plane outages.
  - Budget tuning needs real latency distributions; static guesses drift as workloads change.

Next step

Run this in one sprint:

  1. Define run-level and step-level timeout budgets for the top five workflows.
  2. Classify retryable errors and assign attempt caps per class.
  3. Enforce jittered backoff and block unbounded retry loops in review checks.
  4. Replay one incident trace and verify where the budget was actually spent.

Continue with AI Agent Circuit Breaker Pattern and AI Agent DLQ and Replay Patterns.

Latency is a budget

Every retry spends time somewhere. Choose where that time goes before production chooses for you.