The production problem
Partial outages are where retry policy quality shows up. One slow dependency can trigger retries faster than it can recover.
In autonomous systems, retry loops are automatic and parallel. If budgets are loose, you amplify the outage instead of riding through it.
Teams usually tune retries first and deadlines later. That order is backwards: the deadline budget defines what retries are allowed to spend.
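A minimal sketch of that ordering in Go, assuming a run-level context deadline; perAttemptTimeout and stepLimit are illustrative names, not Cordum APIs:

import (
    "context"
    "errors"
    "time"
)

// perAttemptTimeout caps a single attempt by whichever is smaller: the step's
// own limit or whatever remains of the run deadline carried by the context.
func perAttemptTimeout(ctx context.Context, stepLimit time.Duration) (time.Duration, error) {
    deadline, ok := ctx.Deadline()
    if !ok {
        return stepLimit, nil // no run budget set; fall back to the step limit
    }
    remaining := time.Until(deadline)
    if remaining <= 0 {
        return 0, errors.New("run budget exhausted; stop retrying")
    }
    if remaining < stepLimit {
        return remaining, nil
    }
    return stepLimit, nil
}

A retry loop that calls this before every attempt asks the deadline budget first, and retries become whatever that budget still allows.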
What top sources cover and miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders Library: timeouts, retries, and backoff with jitter | Excellent explanation of retry amplification and why jitter reduces correlated retry spikes. Clear warning: retries are selfish under overload. | No concrete control-plane profile for autonomous agents with policy checks, approval gates, and dead-letter fallback. |
| gRPC deadlines guide | Strong semantics for deadlines, `DEADLINE_EXCEEDED`, automatic cancellation, and propagation with timeout conversion for clock-skew safety. | No budgeting model for multi-step orchestration where one logical action includes several internal RPC layers. |
| Google Cloud retry strategy | Concrete backoff+jitter defaults and a useful idempotency split: always idempotent, conditionally idempotent, non-idempotent. | No governance-aware retry matrix for autonomous actions where safety-unavailable and no-worker errors need different treatment. |
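To make the gRPC row concrete, here is a hedged sketch of a per-attempt deadline with DEADLINE_EXCEEDED mapped to its own error class; checkWithDeadline and the check callback are illustrative stand-ins, not an API from any of the sources:

import (
    "context"
    "time"

    "google.golang.org/grpc/codes"
    "google.golang.org/grpc/status"
)

// checkWithDeadline runs one RPC under a per-attempt deadline and classifies
// the outcome for the retry matrix.
func checkWithDeadline(ctx context.Context, timeout time.Duration, check func(context.Context) error) (retryable bool, err error) {
    attemptCtx, cancel := context.WithTimeout(ctx, timeout)
    defer cancel()

    err = check(attemptCtx)
    switch status.Code(err) {
    case codes.OK:
        return false, nil
    case codes.DeadlineExceeded, codes.Unavailable:
        // Deadline expiry and transient transport failures are retry candidates.
        return true, err
    default:
        // Everything else goes to the terminal path.
        return false, err
    }
}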
Timeout and retry budget model
| Layer | Required rule | Failure if missing |
|---|---|---|
| Run budget | Set a hard upper deadline for the entire run. | Retries can outlive user and operator intent. |
| Safety budget | Use explicit timeout layers for policy checks and track them separately. | Policy service latency silently dominates run time. |
| Scheduling retries | Use jittered exponential backoff with a hard max-attempt cap. | Queue churn hides root cause and delays terminal handling. |
| Error-class policy | Map each error class to retry, delay, or terminal path. | Non-retryable failure loops consume capacity with no chance of recovery. |
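The error-class row is the one most often left implicit; a small policy table makes it explicit. The class names below are assumptions for illustration, not the runtime's identifiers:

import "time"

type action int

const (
    actionRetryBackoff action = iota // jittered exponential backoff
    actionRetryFixed                 // fixed requeue delay
    actionTerminal                   // stop retrying, route to DLQ
)

type policy struct {
    act   action
    delay time.Duration // only meaningful for actionRetryFixed
}

// errorPolicies maps an error class to its retry treatment.
var errorPolicies = map[string]policy{
    "worker_busy":        {act: actionRetryBackoff},
    "no_workers":         {act: actionRetryFixed, delay: 2 * time.Second},
    "safety_unavailable": {act: actionRetryFixed, delay: 5 * time.Second},
    "policy_denied":      {act: actionTerminal},
    "invalid_request":    {act: actionTerminal},
}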
Cordum runtime behavior
| Control | Current behavior | Why it matters |
|---|---|---|
| Safety check timeout (inner) | 2s (`SafetyClient` gRPC timeout) | Bounds direct policy RPC latency per attempt. |
| Safety check timeout (outer) | 3s (`engine` defense-in-depth timeout) | Guards against handler stalls around policy path. |
| Safety unavailable delay | 5s requeue delay in fail-closed mode | Applies pressure control during policy outages. |
| Scheduling backoff | Exponential from 1s to 30s + crypto jitter up to 500ms | Reduces synchronized retry spikes across workers. |
| Max scheduling retries | 50 attempts before FAILED + DLQ (~25 minutes worst case) | Creates deterministic handoff from automatic retries to operator triage. |
| Doc drift to track | `scheduler-internals.md` still lists `safetyTimeout=2s` | Code currently has dual layers (2s inner, 3s outer); docs need explicit clarification. |
Code check detail: scheduler docs list `safetyTimeout=2s`, but runtime behavior uses two layers: 2s in `SafetyClient` and 3s outer timeout in scheduler engine.
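A sketch of that two-layer shape, assuming the inner 2s timeout lives in the safety client's RPC context and the outer 3s guard wraps the whole check path; the function names are illustrative:

import (
    "context"
    "time"
)

const (
    innerSafetyTimeout = 2 * time.Second // per-RPC timeout inside the safety client
    outerSafetyTimeout = 3 * time.Second // defense-in-depth guard around the whole check path
)

// checkSafety bounds the policy check with the outer timeout so that a stall
// anywhere around the RPC (middleware, serialization, handler logic) is still capped.
func checkSafety(ctx context.Context, check func(context.Context) error) error {
    outerCtx, cancel := context.WithTimeout(ctx, outerSafetyTimeout)
    defer cancel()

    done := make(chan error, 1)
    go func() {
        // The inner context mirrors the safety client's own per-attempt timeout.
        innerCtx, innerCancel := context.WithTimeout(outerCtx, innerSafetyTimeout)
        defer innerCancel()
        done <- check(innerCtx)
    }()

    select {
    case err := <-done:
        return err
    case <-outerCtx.Done():
        return outerCtx.Err() // outer guard fired; treat as safety-unavailable
    }
}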
Implementation examples
Scheduler retry constants (Go)
const (
    retryDelayBusy       = 500 * time.Millisecond
    retryDelayStore      = 1 * time.Second
    retryDelayPublish    = 2 * time.Second
    retryDelayNoWorkers  = 2 * time.Second
    safetyThrottleDelay  = 5 * time.Second
    safetyCheckTimeout   = 3 * time.Second
    maxSchedulingRetries = 50
)
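The cap is what turns an endless requeue loop into an operator signal. A minimal sketch of that handoff, assuming it sits alongside the constants above and the backoffDelay helper below; the fail and requeue callbacks stand in for the runtime's own FAILED + DLQ and requeue transitions:

// requeueOrFail decides whether a failed scheduling attempt is requeued with
// backoff or terminates once maxSchedulingRetries is spent.
func requeueOrFail(attempt int, cause error,
    fail func(error) error, // assumed hook: FAILED transition + DLQ handoff
    requeue func(delay time.Duration) error, // assumed hook: delayed requeue
) error {
    if attempt >= maxSchedulingRetries {
        // Deterministic handoff from automatic retries to operator triage.
        return fail(cause)
    }
    // backoffDelay is the jittered exponential backoff defined below.
    return requeue(backoffDelay(attempt))
}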
Backoff formula with jitter (Go)
const (
    backoffBase      = 1 * time.Second
    backoffMax       = 30 * time.Second
    backoffJitterMax = 500 * time.Millisecond
)

func backoffDelay(attempt int) time.Duration {
    // Clamp the exponent: the delay saturates at backoffMax after five doublings,
    // and an unbounded shift would overflow time.Duration at high attempt counts.
    if attempt > 5 {
        attempt = 5
    }
    delay := min(backoffBase<<attempt, backoffMax)
    // cryptoJitter (sketched later in this section) is assumed to return a
    // crypto/rand-sourced duration in [0, backoffJitterMax).
    jitter := cryptoJitter(backoffJitterMax)
    // Jitter rides on top of the cap so workers stay de-synchronized even at 30s.
    return delay + jitter
}
Retry window estimation (Go)
func ApproxWorstCaseRetryWindow(attempts int) time.Duration {
    // Rough estimate for backoff(1s->30s) plus the per-attempt jitter cap.
    total := time.Duration(0)
    for i := 0; i < attempts; i++ {
        // Clamp the exponent so the shift cannot overflow on high attempt counts.
        exp := i
        if exp > 5 {
            exp = 5
        }
        d := backoffBase << exp
        if d > backoffMax {
            d = backoffMax
        }
        total += d + backoffJitterMax
    }
    return total
}
// ApproxWorstCaseRetryWindow(50) ~= 25 minutes
This number belongs in your on-call docs. If retries can run for ~25 minutes, incident handling should expect delayed terminal signals.
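The backoff formula above references a cryptoJitter helper that is not shown in these excerpts; a minimal sketch, assuming it should return a crypto/rand-sourced duration in [0, max):

import (
    "crypto/rand"
    "math/big"
    "time"
)

// cryptoJitter returns a uniformly random duration in [0, max), sourced from
// crypto/rand so workers cannot drift into lockstep retry timing.
func cryptoJitter(max time.Duration) time.Duration {
    if max <= 0 {
        return 0
    }
    n, err := rand.Int(rand.Reader, big.NewInt(int64(max)))
    if err != nil {
        // Fall back to no jitter rather than failing the retry path.
        return 0
    }
    return time.Duration(n.Int64())
}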
Limitations and tradeoffs
- A high retry cap improves transient resilience but delays hard-failure visibility for operators.
- Fail-closed safety mode protects risky actions but can increase queue pressure during policy outages.
- Jitter spreads load, but it also makes per-request completion time less predictable.
- Two timeout layers improve defense in depth but can confuse telemetry unless you emit both in traces.
Next step
Run this in one sprint:
1. Define the run deadline, per-step deadline, and max retry window for your top five workflows.
2. Classify retryable errors and assign an explicit delay policy per class.
3. Add trace fields for remaining budget and selected retry delay on each attempt (see the sketch after this list).
4. Run one chaos drill: make the policy kernel unavailable for 10 minutes and measure queue/backlog behavior.
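A minimal sketch of step 3 using Go's stdlib structured logger (slog); the field names and helper are assumptions, not existing Cordum telemetry:

import (
    "context"
    "log/slog"
    "time"
)

// logRetryAttempt emits the remaining run budget and the chosen retry delay
// so logs and traces can be correlated per attempt.
func logRetryAttempt(ctx context.Context, taskID string, attempt int, delay time.Duration) {
    remaining := time.Duration(-1) // -1 signals "no run deadline set"
    if deadline, ok := ctx.Deadline(); ok {
        remaining = time.Until(deadline)
    }
    slog.InfoContext(ctx, "scheduling retry",
        slog.String("task_id", taskID),
        slog.Int("attempt", attempt),
        slog.Duration("retry_delay", delay),
        slog.Duration("budget_remaining", remaining),
    )
}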
Continue with AI Agent Circuit Breaker Pattern and AI Agent DLQ and Replay Patterns.