The AI agent circuit breaker pattern trips a flaky tool or safety dependency into an open state after a failure threshold so agents stop hammering it, then probes for recovery before closing again. For autonomous agents you also need an explicit fail mode that decides what happens to in-flight actions while the breaker is open. In Cordum the scheduler's safety client opens after 3 failures, holds open for 30 seconds, closes after 2 successful probes, and shares breaker state across replicas through Redis.
The production problem
Retry loops can turn a short dependency outage into platform-wide pressure. At 500 jobs per minute, three extra attempts per job add 500 × 3 × 10 = 15,000 avoidable calls over a 10-minute outage.
Circuit breakers cut this feedback loop. For agent systems, that is still not enough: you also need an explicit governance decision when safety services are unavailable.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Azure Architecture: Circuit Breaker pattern | Clear state-machine behavior and operational concerns for distributed systems. | No pre-dispatch governance model for autonomous actions during breaker-open periods. |
| AWS Prescriptive Guidance: Circuit breaker | Implementation framing and when to prefer breaker over pure retry. | No guidance on safety-policy bypass risk when fail-open is enabled. |
| Martin Fowler: CircuitBreaker | Foundational design, thresholds, half-open behavior, and monitoring value. | No agent-control-plane context where replay/approval semantics must survive outages. |
Policy-aware breaker model
For autonomous agents, breaker state and policy fail mode must be configured together.
| Mode | Behavior | Operational risk |
|---|---|---|
| Retry only | Keeps hitting degraded dependency during outage | Queue growth, waste, and cascading retries |
| Classic breaker | Trips after threshold and blocks traffic temporarily | Unclear policy outcome for blocked high-risk actions |
| Breaker + policy fail mode | Breaker state plus explicit closed/open safety behavior | More knobs, but predictable incident behavior |
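The third row can be sketched as a single decision function. This is an illustration only, not Cordum's engine code: the `Action` type and the `decide` function are hypothetical names, and treating any unrecognized mode as closed is an assumption made here so that a misconfigured value cannot enable bypass. The `safety_bypassed` label and the `closed`/`open` values come from the article.

```go
package main

import "fmt"

// FailMode mirrors the two POLICY_CHECK_FAIL_MODE values described in the article.
type FailMode string

const (
	FailClosed FailMode = "closed"
	FailOpen   FailMode = "open"
)

// Action is a hypothetical result type used only in this sketch.
type Action string

const (
	RunSafetyCheck Action = "run_safety_check" // breaker closed: evaluate policy as usual
	Requeue        Action = "requeue"          // breaker open, fail-closed: hold the job
	Bypass         Action = "bypass"           // breaker open, fail-open: dispatch and tag
)

// decide combines breaker state and fail mode into one explicit outcome.
// Any mode other than "open" is treated as closed (an assumption of this sketch).
func decide(breakerOpen bool, mode FailMode, labels map[string]string) Action {
	if !breakerOpen {
		return RunSafetyCheck
	}
	if mode == FailOpen {
		labels["safety_bypassed"] = "true" // label name taken from the article
		return Bypass
	}
	return Requeue
}

func main() {
	labels := map[string]string{}
	fmt.Println(decide(true, FailClosed, labels), labels)
	fmt.Println(decide(true, FailOpen, labels), labels)
}
```

Keeping both inputs in one function makes the incident behavior reviewable in one place instead of being spread across retry and dispatch paths.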
Cordum runtime evidence
The controls below are verified against current source and docs.
| Control | Current behavior | Evidence | Why it matters |
|---|---|---|---|
| Input safety timeout stack | Safety client uses 2s gRPC timeout; scheduler wraps with 3s defense timeout. | safety_client.go + engine.go | Caps tail latency before job-level retry/requeue paths trigger. |
| Trip threshold and open window | Breaker opens at 3 failures and uses 30s TTL for reopen window. | safety_client.go + circuit_breaker.go | Fast enough to stop retry storms, short enough for rapid recheck. |
| Shared multi-replica state | Redis keys `cordum:cb:safety:failures` and `cordum:cb:safety:output:failures`. | circuit_breaker.go + safety-kernel.md | Prevents one replica from tripping while others keep hammering. |
| Safety unavailable behavior | `POLICY_CHECK_FAIL_MODE=closed` requeues (default). `open` allows and tags bypass labels. | engine.go + configuration-reference.md | Makes availability-vs-safety tradeoff explicit instead of accidental. |
| Redis outage fallback | Fallback is local in-memory breaker; local state machine uses half-open probes and close-after-success. | circuit_breaker.go | Service keeps running, but decisions become per-replica until Redis recovers. |
Operator caveat
Redis-backed reopen is TTL based. It does not enforce a strict global probe budget. If you need tighter reopen control, add external rate limits around recovered dependencies.
Implementation examples
Safety breaker defaults (Go)
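const (
	safetyTimeout            = 2 * time.Second  // gRPC timeout on the safety call
	safetyCircuitOpenFor     = 30 * time.Second // how long the breaker stays open
	safetyCircuitFailBudget  = 3                // failures before the breaker opens
	safetyCircuitHalfOpenMax = 3                // probes allowed while half-open
	safetyCircuitCloseAfter  = 2                // successful probes needed to close
)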
Fail-open vs fail-closed branch (Go)
case SafetyUnavailable:
	if e.isInputFailOpen() {
		// allow through + mark bypass
		req.Labels["safety_bypassed"] = "true"
	} else {
		// default closed mode: requeue with backoff
		return RetryAfter(fmt.Errorf("safety unavailable"), safetyThrottleDelay)
	}
Runtime status payload (JSON)
{
  "circuit_breakers": {
    "input": {
      "state": "OPEN",
      "failures": 4,
      "fail_threshold": 3,
      "cooldown_remaining_ms": 21400
    },
    "output": {
      "state": "CLOSED",
      "failures": 0,
      "fail_threshold": 3,
      "cooldown_remaining_ms": 0
    }
  },
  "input_fail_open_total": 42
}
Limitations and tradeoffs
- Distributed Redis path is TTL/counter based; reopen is time-driven rather than strict probe-budget driven.
- Fail-open mode protects availability, but it can bypass deny/approval decisions during safety outages.
- Redis outage fallback is per-replica, so breaker behavior can diverge temporarily across scheduler instances.
- Aggressive thresholds reduce blast radius but can over-trip during short noisy spikes.
Frequently asked questions
What is the circuit breaker pattern for AI agents?
The AI agent circuit breaker pattern trips a degraded dependency (a flaky tool, model endpoint, or safety service) into an OPEN state after a failure threshold so agents stop hammering it, then probes for recovery via a HALF_OPEN state before closing again. Unlike a generic API breaker, an agent control plane must also decide what happens to in-flight autonomous actions while the breaker is open. In Cordum the scheduler's safety client opens after 3 consecutive failures, stays open for 30 seconds, and closes after 2 successful probes. Breaker state is shared across replicas through a Redis counter so one replica tripping protects the whole fleet.
Should an AI agent circuit breaker fail open or fail closed?
It depends on the action's reversibility. Fail-closed (the default in Cordum, POLICY_CHECK_FAIL_MODE=closed) requeues jobs with exponential backoff when the Safety Kernel is unreachable, so no unevaluated job dispatches — the right choice for irreversible or production actions. Fail-open (POLICY_CHECK_FAIL_MODE=open) allows jobs through with a warning, increments cordum_scheduler_input_fail_open_total, and tags the job with a safety_bypassed label plus a dedicated audit event so SIEM can detect the bypass. Use fail-open only in staging or where downstream compensating controls exist.
How does Cordum stop cascading tool failures for autonomous agents?
Cordum gates every dispatch through the Safety Kernel before a worker runs. A Redis-backed circuit breaker on the scheduler's safety client (key cordum:cb:safety:failures) caps retry storms when the kernel degrades, and a separate breaker (cordum:cb:safety:output:failures) protects the output-policy path. When the input breaker is open the scheduler receives a SafetyUnavailable decision instead of blocking on the RPC, and the configured fail mode determines requeue-versus-bypass. Every decision — including a bypass — is written to the audit trail.
What are good circuit breaker thresholds for agent systems?
Cordum's shipped defaults are a 3-failure budget to open, a 30-second open window, 3 half-open probes, and 2 successes to close, with a 2-second gRPC timeout on the safety call (100ms metadata / 30s content for output safety). Aggressive thresholds shrink blast radius but can over-trip on short noisy spikes, so tune the open window to your dependency's typical recovery time and alert on both breaker-open rate and the fail-open counter on the same dashboard.
Next step
Run this in one sprint:
1. Set per-topic breaker thresholds and document fail-mode ownership.
2. Keep `POLICY_CHECK_FAIL_MODE=closed` for irreversible actions.
3. Alert on `input_fail_open_total` and breaker-open rate in the same dashboard.
4. Run one game day with Safety Kernel outage and verify expected requeue/bypass behavior.
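For the alerting step, a check over the runtime status payload could look like the sketch below. The JSON field names follow the payload shown earlier in this article; the `findings` function, the struct names, and the idea of comparing against a previously sampled counter value are assumptions for illustration. How the payload is fetched is left out because the article does not specify it.

```go
package main

import (
	"encoding/json"
	"fmt"
)

type breakerStatus struct {
	State               string `json:"state"`
	Failures            int    `json:"failures"`
	FailThreshold       int    `json:"fail_threshold"`
	CooldownRemainingMs int64  `json:"cooldown_remaining_ms"`
}

type status struct {
	CircuitBreakers    map[string]breakerStatus `json:"circuit_breakers"`
	InputFailOpenTotal int64                    `json:"input_fail_open_total"`
}

// sample reproduces the status payload from the article.
const sample = `{
  "circuit_breakers": {
    "input":  {"state": "OPEN",   "failures": 4, "fail_threshold": 3, "cooldown_remaining_ms": 21400},
    "output": {"state": "CLOSED", "failures": 0, "fail_threshold": 3, "cooldown_remaining_ms": 0}
  },
  "input_fail_open_total": 42
}`

// findings returns one message per open breaker, plus one message if the
// fail-open counter grew since the previous sample.
func findings(raw []byte, prevFailOpenTotal int64) ([]string, error) {
	var s status
	if err := json.Unmarshal(raw, &s); err != nil {
		return nil, err
	}
	var out []string
	for _, name := range []string{"input", "output"} {
		if cb, ok := s.CircuitBreakers[name]; ok && cb.State == "OPEN" {
			out = append(out, fmt.Sprintf("%s breaker open (%d ms cooldown left)", name, cb.CooldownRemainingMs))
		}
	}
	if delta := s.InputFailOpenTotal - prevFailOpenTotal; delta > 0 {
		out = append(out, fmt.Sprintf("fail-open bypasses increased by %d", delta))
	}
	return out, nil
}

func main() {
	msgs, err := findings([]byte(sample), 40)
	if err != nil {
		panic(err)
	}
	for _, m := range msgs {
		fmt.Println(m)
	}
}
```

Reporting both signals from one check keeps an open breaker and a rising bypass count visible together, which is the point of putting them on the same dashboard.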
Breaker behavior is one layer of Cordum's governed dispatch. See how it fits an end-to-end workflow on the automated incident response solution page, then continue with AI Agent Rollback and Compensation and AI Agent Incident Report.