
AI Agent Circuit Breaker Pattern for Production Reliability

A broken dependency should trip a breaker, not your entire agent fleet.

Guide · 10 min read · Mar 2026
TL;DR
  - Retries without a breaker turn dependency outages into queue storms.
  - Agent systems need circuit state plus a policy fail mode, not only CLOSED/OPEN/HALF_OPEN.
  - Define thresholds as code, audit state transitions, and rehearse breaker-open incidents.
  - **Failure isolation:** Trip early before downstream outages spread.
  - **Policy mode:** Choose fail-closed or fail-open explicitly.
  - **Measured recovery:** Probe with bounded half-open traffic.

Scope

This guide focuses on circuit breakers around agent tool calls and safety checks in production control planes, not generic API clients.

The production problem

Autonomous agents are fast at repeating mistakes. One slow dependency can trigger retries, queue growth, and operator pages in minutes.

Retry logic alone is useful for transient faults. It is not a strategy for sustained outages. On incident day, retries are caffeine, not reliability.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Azure Circuit Breaker pattern | Strong state-machine explanation and operational considerations. | No agent-specific guidance for pre-dispatch policy decisions. |
| AWS Circuit breaker pattern | Good implementation example with Step Functions and status store. | Limited treatment of governance decisions when the breaker is open. |
| Martin Fowler Circuit Breaker | Canonical rationale, thresholds, and half-open behavior. | Not focused on autonomous agents that trigger real-world side effects. |

Policy-aware breaker model

For AI agents, breaker state is only half the decision. You also need to define what happens to blocked actions: deny, requeue, degrade, or fail-open with an audit signal.

| Mode | Behavior | Operational risk |
| --- | --- | --- |
| Retry only | Keeps sending calls during outage | Queue growth, token spend, duplicate actions |
| Breaker only | Stops remote calls after threshold | Still unclear what to do with blocked high-risk actions |
| Breaker + policy mode | Breaker state plus explicit fail-closed/open decision | More configuration complexity, better incident outcomes |

Cordum runtime defaults

Cordum shares circuit-breaker failure counters across scheduler replicas through Redis keys. This avoids one replica opening while others continue hammering a degraded dependency.

| Control | Default | Why this number matters |
| --- | --- | --- |
| Safety call timeout | 2s per safety check | Prevents long tail latency from blocking scheduler workers. |
| Open threshold | 3 consecutive failures | Trips fast enough to stop cascading retries. |
| Open duration | 30s before half-open probes | Gives dependencies time to recover before new pressure. |
| Half-open probes | Max 3 probe requests; close after 2 successes | Avoids flooding a recovering dependency. |
| Policy fail mode | `POLICY_CHECK_FAIL_MODE=closed` by default | Safe default: no dispatch without a policy decision. |

Implementation examples

Breaker + policy decision loop (Go)

breaker.go
Go
type BreakerState string

const (
  Closed   BreakerState = "CLOSED"
  Open     BreakerState = "OPEN"
  HalfOpen BreakerState = "HALF_OPEN"
)

func HandleAction(ctx context.Context, req Action) error {
  state := breaker.State(req.Topic)
  if state == Open {
    // Blocked action: the policy fail mode decides what happens next.
    if policyFailMode() == "open" {
      // Fail-open: dispatch anyway, but leave an audit trail.
      return dispatchWithWarning(ctx, req, "breaker_open_fail_open")
    }
    // Fail-closed (default): requeue and retry after a short delay.
    return requeue(req, 5*time.Second)
  }

  // CLOSED or HALF_OPEN: attempt the call and feed the result back
  // into the breaker so it can count failures or probe successes.
  err := callDependency(ctx, req)
  breaker.Record(req.Topic, err)
  return err
}

Threshold configuration (YAML)

safety.yaml
YAML
input_policy:
  fail_mode: closed
  circuit_breaker:
    timeout: 2s
    open_after_failures: 3
    open_for: 30s
    half_open_max_requests: 3
    close_after_successes: 2

State transition audit event (JSON)

breaker-event.json
JSON
{
  "ts": "2026-03-31T14:11:49Z",
  "topic": "tool.github.pr.create",
  "breaker_state": "OPEN",
  "failure_count": 3,
  "policy_fail_mode": "closed",
  "action": "requeue",
  "delay_seconds": 5
}

Limitations and tradeoffs

  - Aggressive thresholds can trip on noise and reduce useful throughput.
  - Fail-open mode preserves availability but can bypass governance guarantees during outages.
  - Shared breaker state adds a Redis dependency; a local fallback reduces coordination.
  - Breakers protect dependencies, not business correctness. You still need idempotency and compensation.

Next step

Run this in one sprint:

  1. Set breaker thresholds per high-risk topic and store them in config.
  2. Keep policy fail mode `closed` for irreversible actions.
  3. Add dashboards for breaker open rate, half-open probes, and requeue delays.
  4. Execute one game-day where the safety service is intentionally degraded.

Continue with AI Agent Rollback and Compensation and AI Agent Incident Report.

Fail fast, recover faster

If every failure path is a retry loop, your outage response is just waiting longer.