The production problem
Partial outages are where retry policy quality shows up. One slow dependency can trigger retries faster than it can recover.
In autonomous systems, retry loops are automatic and parallel. If budgets are loose, you amplify the outage instead of riding through it.
Teams usually tune retries first and deadlines later. That order is backwards: the deadline budget defines what retries are allowed to spend.
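A minimal sketch of that ordering in Go, assuming a run-level context deadline; perAttemptTimeout and stepLimit are illustrative names, not Cordum APIs:

import (
    "context"
    "errors"
    "time"
)

// perAttemptTimeout caps a single attempt by whichever is smaller: the step's
// own limit or whatever remains of the run deadline carried by the context.
func perAttemptTimeout(ctx context.Context, stepLimit time.Duration) (time.Duration, error) {
    deadline, ok := ctx.Deadline()
    if !ok {
        return stepLimit, nil // no run budget set; fall back to the step limit
    }
    remaining := time.Until(deadline)
    if remaining <= 0 {
        return 0, errors.New("run budget exhausted; stop retrying")
    }
    if remaining < stepLimit {
        return remaining, nil
    }
    return stepLimit, nil
}

A retry loop that calls this before every attempt asks the deadline budget first, and retries become whatever that budget still allows.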
What top sources cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders Library: timeouts, retries, and backoff with jitter | Excellent explanation of retry amplification and why jitter reduces correlated retry spikes. Clear warning: retries are selfish under overload. | No concrete control-plane profile for autonomous agents with policy checks, approval gates, and dead-letter fallback. |
| gRPC deadlines guide | Strong semantics for deadlines, `DEADLINE_EXCEEDED`, automatic cancellation, and propagation with timeout conversion for clock-skew safety. | No budgeting model for multi-step orchestration where one logical action includes several internal RPC layers. |
| Google Cloud retry strategy | Concrete backoff+jitter defaults and a useful idempotency split: always idempotent, conditionally idempotent, non-idempotent. | No governance-aware retry matrix for autonomous actions where safety-unavailable and no-worker errors need different treatment. |
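To make the gRPC row concrete, here is a hedged sketch of a per-attempt deadline with DEADLINE_EXCEEDED mapped to its own error class; checkWithDeadline and the check callback are illustrative stand-ins, not an API from any of the sources:

import (
    "context"
    "time"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// checkWithDeadline runs one RPC under a per-attempt deadline and classifies
// the outcome for the retry matrix.
func checkWithDeadline(ctx context.Context, timeout time.Duration, check func(context.Context) error) (retryable bool, err error) {
    attemptCtx, cancel := context.WithTimeout(ctx, timeout)
    defer cancel()

    err = check(attemptCtx)
    switch status.Code(err) {
    case codes.OK:
        return false, nil
    case codes.DeadlineExceeded, codes.Unavailable:
        // Deadline expiry and transient transport failures are retry candidates.
        return true, err
    default:
        // Everything else goes to the terminal path.
        return false, err
    }
}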
Timeout and retry budget model
| Layer | Required rule | Failure if missing |
|---|---|---|
| Run budget | Set a hard upper deadline for the entire run. | Retries can outlive user and operator intent. |
| Safety budget | Use explicit timeout layers for policy checks and track them separately. | Policy service latency silently dominates run time. |
| Scheduling retries | Use jittered exponential backoff with a hard max-attempt cap. | Queue churn hides root cause and delays terminal handling. |
| Error-class policy | Map each error class to retry, delay, or terminal path. | Non-retryable failure loops consume capacity with no chance of recovery. |
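The error-class row is the one most often left implicit; a small policy table makes it explicit. The class names below are assumptions for illustration, not the runtime's identifiers:

import "time"

type action int

const (
    actionRetryBackoff action = iota // jittered exponential backoff
    actionRetryFixed                 // fixed requeue delay
    actionTerminal                   // stop retrying, route to DLQ
)

type policy struct {
    act   action
    delay time.Duration // only meaningful for actionRetryFixed
}

// errorPolicies maps an error class to its retry treatment.
var errorPolicies = map[string]policy{
    "worker_busy":        {act: actionRetryBackoff},
    "no_workers":         {act: actionRetryFixed, delay: 2 * time.Second},
    "safety_unavailable": {act: actionRetryFixed, delay: 5 * time.Second},
    "policy_denied":      {act: actionTerminal},
    "invalid_request":    {act: actionTerminal},
}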
Cordum runtime behavior
| Control | Current behavior | Why it matters |
|---|---|---|
| Safety check timeout (inner) | 2s (`SafetyClient` gRPC timeout) | Bounds direct policy RPC latency per attempt. |
| Safety check timeout (outer) | 3s (`engine` defense-in-depth timeout) | Guards against handler stalls around policy path. |
| Safety unavailable delay | 5s requeue delay in fail-closed mode | Applies pressure control during policy outages. |
| Scheduling backoff | Exponential from 1s to 30s + crypto jitter up to 500ms | Reduces synchronized retry spikes across workers. |
| Max scheduling retries | 50 attempts before FAILED + DLQ (~25 minutes worst case) | Creates deterministic handoff from automatic retries to operator triage. |
| Doc drift to track | `scheduler-internals.md` still lists `safetyTimeout=2s` | Code currently has dual layers (2s inner, 3s outer); docs need explicit clarification. |
Code check detail: scheduler docs list `safetyTimeout=2s`, but runtime behavior uses two layers: 2s in `SafetyClient` and 3s outer timeout in scheduler engine.
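A sketch of that two-layer shape, assuming the inner 2s timeout lives in the safety client's RPC context and the outer 3s guard wraps the whole check path; the function names are illustrative:

import (
    "context"
    "time"
)

const (
    innerSafetyTimeout = 2 * time.Second // per-RPC timeout inside the safety client
    outerSafetyTimeout = 3 * time.Second // defense-in-depth guard around the whole check path
)

// checkSafety bounds the policy check with the outer timeout so that a stall
// anywhere around the RPC (middleware, serialization, handler logic) is still capped.
func checkSafety(ctx context.Context, check func(context.Context) error) error {
    outerCtx, cancel := context.WithTimeout(ctx, outerSafetyTimeout)
    defer cancel()

    done := make(chan error, 1)
    go func() {
        // The inner context mirrors the safety client's own per-attempt timeout.
        innerCtx, innerCancel := context.WithTimeout(outerCtx, innerSafetyTimeout)
        defer innerCancel()
        done <- check(innerCtx)
    }()

    select {
    case err := <-done:
        return err
    case <-outerCtx.Done():
        return outerCtx.Err() // outer guard fired; treat as safety-unavailable
    }
}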
Implementation examples
Scheduler retry constants (Go)
const (
    retryDelayBusy       = 500 * time.Millisecond
    retryDelayStore      = 1 * time.Second
    retryDelayPublish    = 2 * time.Second
    retryDelayNoWorkers  = 2 * time.Second
    safetyThrottleDelay  = 5 * time.Second
    safetyCheckTimeout   = 3 * time.Second
    maxSchedulingRetries = 50
)
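The cap is what turns an endless requeue loop into an operator signal. A minimal sketch of that handoff, assuming it sits alongside the constants above and the backoffDelay helper below; the fail and requeue callbacks stand in for the runtime's own FAILED + DLQ and requeue transitions:

// requeueOrFail decides whether a failed scheduling attempt is requeued with
// backoff or terminates once maxSchedulingRetries is spent.
func requeueOrFail(attempt int, cause error,
    fail func(error) error, // assumed hook: FAILED transition + DLQ handoff
    requeue func(delay time.Duration) error, // assumed hook: delayed requeue
) error {
    if attempt >= maxSchedulingRetries {
        // Deterministic handoff from automatic retries to operator triage.
        return fail(cause)
    }
    // backoffDelay is the jittered exponential backoff defined below.
    return requeue(backoffDelay(attempt))
}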
Backoff formula with jitter (Go)
const (
    backoffBase      = 1 * time.Second
    backoffMax       = 30 * time.Second
    backoffJitterMax = 500 * time.Millisecond
)

func backoffDelay(attempt int) time.Duration {
    // Clamp the exponent: the delay saturates at backoffMax after five doublings,
    // and an unbounded shift would overflow time.Duration at high attempt counts.
    if attempt > 5 {
        attempt = 5
    }
    delay := min(backoffBase<<attempt, backoffMax)
    // cryptoJitter (sketched later in this section) is assumed to return a
    // crypto/rand-sourced duration in [0, backoffJitterMax).
    jitter := cryptoJitter(backoffJitterMax)
    // Jitter rides on top of the cap so workers stay de-synchronized even at 30s.
    return delay + jitter
}
Retry window estimation (Go)
func ApproxWorstCaseRetryWindow(attempts int) time.Duration {
    // Rough estimate for backoff(1s->30s) plus the per-attempt jitter cap.
    total := time.Duration(0)
    for i := 0; i < attempts; i++ {
        // Clamp the exponent so the shift cannot overflow on high attempt counts.
        exp := i
        if exp > 5 {
            exp = 5
        }
        d := backoffBase << exp
        if d > backoffMax {
            d = backoffMax
        }
        total += d + backoffJitterMax
    }
    return total
}
// ApproxWorstCaseRetryWindow(50) ~= 25 minutes
This number belongs in your on-call docs. If retries can run for ~25 minutes, incident handling should expect delayed terminal signals.
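The backoff formula above references a cryptoJitter helper that is not shown in these excerpts; a minimal sketch, assuming it should return a crypto/rand-sourced duration in [0, max):

import (
    "crypto/rand"
    "math/big"
    "time"
)

// cryptoJitter returns a uniformly random duration in [0, max), sourced from
// crypto/rand so workers cannot drift into lockstep retry timing.
func cryptoJitter(max time.Duration) time.Duration {
    if max <= 0 {
        return 0
    }
    n, err := rand.Int(rand.Reader, big.NewInt(int64(max)))
    if err != nil {
        // Fall back to no jitter rather than failing the retry path.
        return 0
    }
    return time.Duration(n.Int64())
}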
Limitations and tradeoffs
- A high retry cap improves transient resilience but delays hard-failure visibility for operators.
- Fail-closed safety mode protects risky actions but can increase queue pressure during policy outages.
- Jitter spreads load, but it also makes per-request completion time less predictable.
- Two timeout layers improve defense in depth but can confuse telemetry unless you emit both in traces.
Next step
Run this in one sprint:
1. Define the run deadline, per-step deadline, and max retry window for your top five workflows.
2. Classify retryable errors and assign an explicit delay policy per class.
3. Add trace fields for remaining budget and selected retry delay on each attempt (see the sketch after this list).
4. Run one chaos drill: make the policy kernel unavailable for 10 minutes and measure queue/backlog behavior.
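A minimal sketch of step 3 using Go's stdlib structured logger (slog); the field names and helper are assumptions, not existing Cordum telemetry:

import (
    "context"
    "log/slog"
    "time"
)

// logRetryAttempt emits the remaining run budget and the chosen retry delay
// so logs and traces can be correlated per attempt.
func logRetryAttempt(ctx context.Context, taskID string, attempt int, delay time.Duration) {
    remaining := time.Duration(-1) // -1 signals "no run deadline set"
    if deadline, ok := ctx.Deadline(); ok {
        remaining = time.Until(deadline)
    }
    slog.InfoContext(ctx, "scheduling retry",
        slog.String("task_id", taskID),
        slog.Int("attempt", attempt),
        slog.Duration("retry_delay", delay),
        slog.Duration("budget_remaining", remaining),
    )
}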
Continue with AI Agent Circuit Breaker Pattern and AI Agent DLQ and Replay Patterns.