## The production problem
Autonomous agents are fast at repeating mistakes. One slow dependency can trigger retries, queue growth, and operator pages in minutes.
Retry logic alone is useful for transient faults. It is not a strategy for sustained outages. On incident day, retries are caffeine, not reliability.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Azure Circuit Breaker pattern | Strong state-machine explanation and operational considerations. | No agent-specific guidance for pre-dispatch policy decisions. |
| AWS Circuit breaker pattern | Good implementation example with Step Functions and status store. | Limited treatment of governance decisions when the breaker is open. |
| Martin Fowler Circuit Breaker | Canonical rationale, thresholds, and half-open behavior. | Not focused on autonomous agents that trigger real-world side effects. |
## Policy-aware breaker model
For AI agents, breaker state is only half the decision. You also need to define what happens to blocked actions: deny, requeue, degrade, or fail-open with an audit signal.
| Mode | Behavior | Operational risk |
|---|---|---|
| Retry only | Keeps sending calls during outage | Queue growth, token spend, duplicate actions |
| Breaker only | Stops remote calls after threshold | Still unclear what to do with blocked high-risk actions |
| Breaker + policy mode | Breaker state plus explicit fail-closed/open decision | More configuration complexity, better incident outcomes |
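The decision logic behind the third row can be sketched in Go. This is a minimal illustration, not Cordum's actual API: the `Decide` function, the `Decision` values, and the `highRisk` flag are all hypothetical names chosen for this example.

```go
package main

import "fmt"

// Decision is what the scheduler does with an action it cannot verify.
type Decision string

const (
	Dispatch          Decision = "dispatch"
	Requeue           Decision = "requeue"
	Degrade           Decision = "degrade"        // e.g. run a read-only variant
	DispatchWithAudit Decision = "dispatch_audit" // fail-open, flagged for review
)

// Decide combines breaker state with an explicit per-topic fail mode.
// failMode is "closed" (default), "open", or "degrade"; highRisk marks
// irreversible actions that must never fail open.
func Decide(breakerOpen bool, failMode string, highRisk bool) Decision {
	if !breakerOpen {
		return Dispatch
	}
	if highRisk {
		return Requeue // irreversible work waits for a real policy decision
	}
	switch failMode {
	case "open":
		return DispatchWithAudit
	case "degrade":
		return Degrade
	default: // "closed"
		return Requeue
	}
}

func main() {
	fmt.Println(Decide(true, "open", false)) // dispatch_audit
	fmt.Println(Decide(true, "open", true))  // requeue
}
```

The point of the extra branch is that fail mode is a per-topic policy decision, made before dispatch, rather than something improvised inside retry code during an incident.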
## Cordum runtime defaults
Cordum shares circuit-breaker failure counters across scheduler replicas through Redis keys. This avoids one replica opening while others continue hammering a degraded dependency.
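The sharing mechanism can be sketched as follows. This is an assumption-heavy illustration: in production the counter would live in Redis (an atomic `INCR` plus a TTL on a per-topic key), and the key scheme, interface, and in-memory stand-in here are invented for the example, not taken from Cordum's code.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// Counter abstracts the shared store. In production this would be Redis
// (INCR + EXPIRE on a per-topic key); the in-memory fake below stands in.
type Counter interface {
	Incr(key string, ttl time.Duration) int64
	Reset(key string)
}

type memCounter struct {
	mu sync.Mutex
	m  map[string]int64
}

func (c *memCounter) Incr(key string, _ time.Duration) int64 {
	c.mu.Lock()
	defer c.mu.Unlock()
	c.m[key]++
	return c.m[key]
}

func (c *memCounter) Reset(key string) {
	c.mu.Lock()
	defer c.mu.Unlock()
	delete(c.m, key)
}

// RecordFailure bumps the shared counter; every replica sees the same
// count, so the breaker opens cluster-wide once the threshold is reached.
func RecordFailure(c Counter, topic string, threshold int64) (open bool) {
	n := c.Incr("breaker:failures:"+topic, 30*time.Second)
	return n >= threshold
}

func main() {
	c := &memCounter{m: map[string]int64{}}
	for i := 0; i < 3; i++ {
		fmt.Println(RecordFailure(c, "tool.github.pr.create", 3))
	}
}
```

The design choice to note: because the count is shared, three failures spread across three replicas still trip the breaker, instead of each replica independently burning its own budget against the degraded dependency.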
| Control | Default | Why this number matters |
|---|---|---|
| Safety call timeout | 2s per safety check | Prevents long tail latency from blocking scheduler workers. |
| Open threshold | 3 consecutive failures | Trips fast enough to stop cascading retries. |
| Open duration | 30s before half-open probes | Gives dependencies time to recover before new pressure. |
| Half-open probes | Max 3 probe requests; close after 2 successes | Avoids flooding a recovering dependency. |
| Policy fail mode | `POLICY_CHECK_FAIL_MODE=closed` by default | Safe default: no dispatch without a policy decision. |
## Implementation examples

### Breaker + policy decision loop (Go)
```go
// Sketch of the dispatch path: check breaker state before calling the
// dependency, and apply the configured policy fail mode when it is open.
type BreakerState string

const (
	Closed   BreakerState = "CLOSED"
	Open     BreakerState = "OPEN"
	HalfOpen BreakerState = "HALF_OPEN"
)

func HandleAction(ctx context.Context, req Action) error {
	state := breaker.State(req.Topic)
	if state == Open {
		if policyFailMode() == "open" {
			// Fail-open: dispatch anyway, but leave an audit signal.
			return dispatchWithWarning(ctx, req, "breaker_open_fail_open")
		}
		// Fail-closed (default): park the action and retry later.
		return requeue(req, 5*time.Second)
	}
	err := callDependency(ctx, req)
	breaker.Record(req.Topic, err) // feed the result back into the breaker
	return err
}
```

### Threshold configuration (YAML)
```yaml
input_policy:
  fail_mode: closed
circuit_breaker:
  timeout: 2s
  open_after_failures: 3
  open_for: 30s
  half_open_max_requests: 3
  close_after_successes: 2
```

### State transition audit event (JSON)
```json
{
  "ts": "2026-03-31T14:11:49Z",
  "topic": "tool.github.pr.create",
  "breaker_state": "OPEN",
  "failure_count": 3,
  "policy_fail_mode": "closed",
  "action": "requeue",
  "delay_seconds": 5
}
```

## Limitations and tradeoffs
- Aggressive thresholds can trip on noise and reduce useful throughput.
- Fail-open mode preserves availability but can bypass governance guarantees during outages.
- Shared breaker state adds Redis dependence; local fallback reduces coordination.
- Breakers protect dependencies, not business correctness. You still need idempotency and compensation.
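The last point deserves one concrete sketch: a breaker that requeues an action can cause the same action to be attempted twice, so high-risk dispatches need an idempotency key. The `Dedup` type and key format below are hypothetical, for illustration only; a production store must be shared and persistent, not an in-process map.

```go
package main

import (
	"fmt"
	"sync"
)

// Dedup runs fn at most once per idempotency key, so a requeued action
// that already dispatched before the breaker opened is not performed twice.
type Dedup struct {
	mu   sync.Mutex
	seen map[string]bool
}

func NewDedup() *Dedup { return &Dedup{seen: map[string]bool{}} }

// Do reports whether fn actually ran for this key.
func (d *Dedup) Do(key string, fn func() error) (ran bool, err error) {
	d.mu.Lock()
	if d.seen[key] {
		d.mu.Unlock()
		return false, nil // duplicate: skip the side effect
	}
	d.seen[key] = true
	d.mu.Unlock()
	return true, fn()
}

func main() {
	d := NewDedup()
	create := func() error { fmt.Println("PR created"); return nil }
	d.Do("tool.github.pr.create:request-42", create)
	d.Do("tool.github.pr.create:request-42", create) // no-op on requeue replay
}
```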
## Next step
Run this in one sprint:
1. Set breaker thresholds per high-risk topic and store them in config.
2. Keep policy fail mode `closed` for irreversible actions.
3. Add dashboards for breaker open rate, half-open probes, and requeue delays.
4. Execute one game-day where the safety service is intentionally degraded.
Continue with *AI Agent Rollback and Compensation* and *AI Agent Incident Report*.