## The production problem
Most incident timelines include this sentence: “Retrying eventually made it worse.” Autonomous agents reach that sentence faster.
Without explicit timeout budgets, retry boundaries, and jitter, a partial outage can cascade into queue growth, duplicate actions, and exhausted operator attention.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders Library: timeouts, retries, backoff with jitter | Strong failure-mode analysis and retry amplification risks across layered services. | No agent-control-plane guidance for policy-aware retries on side-effecting actions. |
| gRPC Deadlines guide | Clear deadline semantics, `DEADLINE_EXCEEDED`, and propagation behavior. | Limited advice on budget splitting across autonomous multi-step agent workflows. |
| Google Cloud IAM retry strategy | Concrete truncated exponential backoff algorithm with jitter and deadline stop. | No mapping from error classes to governance outcomes for autonomous actions. |
## Timeout budget model
| Layer | Required rule | Failure if missing |
|---|---|---|
| Run budget | Set an upper bound for the full workflow (for example, 120s). | Infinite waiting across chained retries. |
| Step budget | Split run budget by step criticality and historical latency. | One step consumes all remaining time and starves downstream checks. |
| Retry budget | Cap attempts and backoff ceilings per error class. | Retry storms and queue bloat during partial outage. |
| Policy budget | For high-risk actions, prefer fail-closed when policy service is unavailable. | Unsafe actions execute because retries degraded into fail-open behavior. |
## Cordum runtime defaults
| Control | Default | Why this matters |
|---|---|---|
| Safety RPC timeout | 2s per check | Bounds control-plane latency so policy checks do not stall scheduling. |
| Scheduling backoff | Exponential 1s to 30s with jitter | Spreads retries and avoids synchronized thundering herds. |
| Retry limit | 50 attempts before terminal failure + DLQ | Creates deterministic handoff from automatic retries to operator triage. |
| Safety unavailable path | Requeue with 5s delay when circuit is open | Provides bounded pressure during dependency degradation. |
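The last two rows compose into a single scheduling decision: requeue while the safety circuit is open, dead-letter after the attempt cap, otherwise retry with backoff. The sketch below shows that decision table; `NextAction` and the `Action` values are hypothetical names, not Cordum's real scheduler interface.

```go
package main

import "fmt"

// Action is what the scheduler does with a step after a failed attempt.
type Action string

const (
	Retry   Action = "retry"       // back off and retry in place
	Requeue Action = "requeue_5s"  // safety circuit open: bounded requeue
	DLQ     Action = "dead_letter" // terminal: hand off to operator triage
)

// NextAction applies the defaults in the table above: requeue while the
// safety circuit is open, DLQ after 50 attempts, otherwise retry.
func NextAction(attempt int, circuitOpen bool) Action {
	switch {
	case attempt >= 50:
		return DLQ
	case circuitOpen:
		return Requeue
	default:
		return Retry
	}
}

func main() {
	fmt.Println(NextAction(3, false), NextAction(10, true), NextAction(50, true))
}
```

Checking the attempt cap before the circuit state matters: a step that has already burned 50 attempts should reach operator triage even if the dependency is still degraded.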
## Implementation examples
### Budget + jitter helper (Go)
```go
package retrybudget

import (
	"math/rand"
	"time"
)

// Budget bounds a whole run, each step, and the retry count.
type Budget struct {
	RunDeadline time.Time     // hard stop for the entire workflow
	StepTimeout time.Duration // per-step execution cap
	MaxAttempts int           // attempts before terminal failure
}

// NextAttemptDelay returns truncated exponential backoff with jitter:
// 1s base, 30s ceiling, up to 500ms of random jitter per attempt.
func NextAttemptDelay(n int) time.Duration {
	base := time.Second
	ceiling := 30 * time.Second
	if n > 5 {
		n = 5 // cap the shift so base<<n cannot overflow on high attempt counts
	}
	d := base << n
	if d > ceiling {
		d = ceiling
	}
	jitter := time.Duration(rand.Intn(500)) * time.Millisecond
	return d + jitter
}
```

### Retry and deadline policy (YAML)
```yaml
retry_policy:
  transient_errors:
    attempts: 5
    backoff:
      base: 1s
      max: 30s
      jitter: true
  policy_unavailable:
    attempts: 3
    delay: 5s
    fail_mode: closed
deadline:
  run_timeout: 120s
  per_step_timeout: 15s
```

### Per-step retry trace (JSON)
```json
{
  "run_id": "run_31ab",
  "step_id": "step_fetch_pr",
  "attempt": 4,
  "step_timeout_ms": 15000,
  "backoff_ms": 8400,
  "remaining_run_budget_ms": 43200,
  "status": "retry_scheduled"
}
```

## Limitations and tradeoffs
- Tight deadlines reduce resource waste but can increase false timeout rates during load spikes.
- Large backoff caps protect dependencies but may delay user-visible recovery.
- Fail-closed policy behavior improves safety but can reduce availability during control-plane outages.
- Budget tuning needs real latency distributions; static guesses drift as workloads change.
## Next step
Run this in one sprint:
1. Define run-level and step-level timeout budgets for your top five workflows.
2. Classify retryable errors and assign attempt caps per class.
3. Enforce jittered backoff and block unbounded retry loops in review checks.
4. Replay one incident trace and verify where budget was actually spent.
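For the replay step, a small script over the per-step retry trace is enough to see where budget went. This sketch assumes a trace exported as a JSON array of events in the shape shown earlier; `BackoffSpend` is a hypothetical helper, not a Cordum API.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// TraceEvent mirrors the per-step retry trace shape shown above.
type TraceEvent struct {
	StepID    string `json:"step_id"`
	Attempt   int    `json:"attempt"`
	BackoffMs int    `json:"backoff_ms"`
}

// BackoffSpend sums backoff time per step, showing how much of the
// run budget went to waiting rather than doing work.
func BackoffSpend(raw []byte) (map[string]int, error) {
	var events []TraceEvent
	if err := json.Unmarshal(raw, &events); err != nil {
		return nil, err
	}
	spend := map[string]int{}
	for _, e := range events {
		spend[e.StepID] += e.BackoffMs
	}
	return spend, nil
}

func main() {
	trace := []byte(`[
	  {"step_id":"step_fetch_pr","attempt":3,"backoff_ms":4200},
	  {"step_id":"step_fetch_pr","attempt":4,"backoff_ms":8400},
	  {"step_id":"step_policy_check","attempt":1,"backoff_ms":1000}
	]`)
	spend, _ := BackoffSpend(trace)
	fmt.Println(spend["step_fetch_pr"]) // 12600
}
```

If one step dominates the backoff total, that is usually the step whose error class needs a tighter attempt cap or an earlier DLQ handoff.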
Continue with AI Agent Circuit Breaker Pattern and AI Agent DLQ and Replay Patterns.