AI Agent Circuit Breaker Pattern for Production Reliability

A degraded dependency should trip fast, not drag your whole queue into failure.

Guide · 10 min read · Updated June 2026
TL;DR
  • Retries alone amplify outages; breakers cap damage.
  • In Cordum, safety-client breaker opens at 3 failures for 30s and shares state via Redis.
  • When safety is unavailable, `POLICY_CHECK_FAIL_MODE=closed` (default) requeues; `open` allows through with bypass signals.
Failure isolation

Trip early before dependency outages spread to every worker

Policy mode

Choose fail-closed or fail-open explicitly per risk posture

Measured recovery

Use bounded reopen logic and observability instead of hope-driven retries

Scope

This guide covers pre-dispatch safety and output-policy breaker behavior in autonomous agent control planes, not generic API-client retry wrappers.

The AI agent circuit breaker pattern trips a flaky tool or safety dependency into an open state after a failure threshold so agents stop hammering it, then probes for recovery before closing again. For autonomous agents you also need an explicit fail mode that decides what happens to in-flight actions while the breaker is open. In Cordum the scheduler's safety client opens after 3 failures, holds open for 30 seconds, closes after 2 successful probes, and shares breaker state across replicas through Redis.

The production problem

Retry loops can turn a short dependency outage into platform-wide pressure. At 500 jobs per minute, three extra attempts per job add 15,000 avoidable calls over a 10-minute outage.
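The arithmetic behind that figure can be checked with a small sketch. The function name `extraCalls` is hypothetical, and the model assumes every job uses all of its extra attempts for the whole outage.

```go
package main

import "fmt"

// extraCalls returns the additional dependency calls generated by retries
// during an outage: jobs per minute x extra attempts per job x minutes.
func extraCalls(jobsPerMinute, extraAttempts, outageMinutes int) int {
	return jobsPerMinute * extraAttempts * outageMinutes
}

func main() {
	// Figures from the paragraph above: 500 jobs/min, 3 extra attempts, 10 minutes.
	fmt.Println(extraCalls(500, 3, 10)) // 15000
}
```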

Circuit breakers cut this feedback loop. For agent systems, that is still not enough: you also need an explicit governance decision when safety services are unavailable.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Azure Architecture: Circuit Breaker pattern | Clear state-machine behavior and operational concerns for distributed systems. | No pre-dispatch governance model for autonomous actions during breaker-open periods. |
| AWS Prescriptive Guidance: Circuit breaker | Implementation framing and when to prefer breaker over pure retry. | No guidance on safety-policy bypass risk when fail-open is enabled. |
| Martin Fowler: CircuitBreaker | Foundational design, thresholds, half-open behavior, and monitoring value. | No agent-control-plane context where replay/approval semantics must survive outages. |

Policy-aware breaker model

For autonomous agents, breaker state and policy fail mode must be configured together.

| Mode | Behavior | Operational risk |
| --- | --- | --- |
| Retry only | Keeps hitting degraded dependency during outage | Queue growth, waste, and cascading retries |
| Classic breaker | Trips after threshold and blocks traffic temporarily | Unclear policy outcome for blocked high-risk actions |
| Breaker + policy fail mode | Breaker state plus explicit closed/open safety behavior | More knobs, but predictable incident behavior |

Cordum runtime evidence

The controls below are verified against current source and docs.

| Control | Current behavior | Evidence | Why it matters |
| --- | --- | --- | --- |
| Input safety timeout stack | Safety client uses 2s gRPC timeout; scheduler wraps with 3s defense timeout. | safety_client.go + engine.go | Caps tail latency before job-level retry/requeue paths trigger. |
| Trip threshold and open window | Breaker opens at 3 failures and uses 30s TTL for reopen window. | safety_client.go + circuit_breaker.go | Fast enough to stop retry storms, short enough for rapid recheck. |
| Shared multi-replica state | Redis keys `cordum:cb:safety:failures` and `cordum:cb:safety:output:failures`. | circuit_breaker.go + safety-kernel.md | Prevents one replica from tripping while others keep hammering. |
| Safety unavailable behavior | `POLICY_CHECK_FAIL_MODE=closed` requeues (default). `open` allows and tags bypass labels. | engine.go + configuration-reference.md | Makes availability-vs-safety tradeoff explicit instead of accidental. |
| Redis outage fallback | Fallback is local in-memory breaker; local state machine uses half-open probes and close-after-success. | circuit_breaker.go | Service keeps running, but decisions become per-replica until Redis recovers. |

Operator caveat

Redis-backed reopen is TTL based. It does not enforce a strict global probe budget. If you need tighter reopen control, add external rate limits around recovered dependencies.

Implementation examples

Safety breaker defaults (Go)

safety_client.go
Go
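const (
  safetyTimeout            = 2 * time.Second  // gRPC timeout for each safety call
  safetyCircuitOpenFor     = 30 * time.Second // how long the breaker stays open after tripping
  safetyCircuitFailBudget  = 3                // failures that trip the breaker
  safetyCircuitHalfOpenMax = 3                // probes allowed while half-open
  safetyCircuitCloseAfter  = 2                // successful probes needed to close
)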

Fail-open vs fail-closed branch (Go)

engine.go
Go
case SafetyUnavailable:
  if e.isInputFailOpen() {
    // allow through + mark bypass
    req.Labels["safety_bypassed"] = "true"
  } else {
    // default closed mode: requeue with backoff
    return RetryAfter(fmt.Errorf("safety unavailable"), safetyThrottleDelay)
  }
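The excerpt above depends on scheduler internals, so here is a self-contained sketch of the same decision. `failOpen`, `onSafetyUnavailable`, and `errSafetyUnavailable` are hypothetical names; only the `POLICY_CHECK_FAIL_MODE` variable, its `closed` default, and the `safety_bypassed` label come from this guide. Treating any value other than the exact string `open` as closed is an assumption.

```go
package main

import (
	"errors"
	"fmt"
	"os"
)

var errSafetyUnavailable = errors.New("safety unavailable")

// failOpen reports whether the mode value selects fail-open.
// Assumption: anything other than "open", including unset, means closed.
func failOpen(mode string) bool {
	return mode == "open"
}

// onSafetyUnavailable returns whether the job may dispatch. In fail-open mode
// the job is tagged and allowed; in fail-closed mode the returned error tells
// the caller to requeue.
func onSafetyUnavailable(mode string, labels map[string]string) (bool, error) {
	if failOpen(mode) {
		labels["safety_bypassed"] = "true"
		return true, nil
	}
	return false, errSafetyUnavailable
}

func main() {
	labels := map[string]string{}
	dispatch, err := onSafetyUnavailable(os.Getenv("POLICY_CHECK_FAIL_MODE"), labels)
	fmt.Println(dispatch, err, labels)
}
```

Keeping the default branch as the closed one means a missing or mistyped setting never silently enables bypass.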

Runtime status payload (JSON)

status-ha.json
JSON
{
  "circuit_breakers": {
    "input": {
      "state": "OPEN",
      "failures": 4,
      "fail_threshold": 3,
      "cooldown_remaining_ms": 21400
    },
    "output": {
      "state": "CLOSED",
      "failures": 0,
      "fail_threshold": 3,
      "cooldown_remaining_ms": 0
    }
  },
  "input_fail_open_total": 42
}
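A payload shaped like this can be decoded and checked in a few lines. The sketch below is illustrative only: the struct and function names are hypothetical, the field names are taken from the JSON above, and how the payload is fetched is left out.

```go
package main

import (
	"encoding/json"
	"fmt"
	"sort"
)

type breakerStatus struct {
	State               string `json:"state"`
	Failures            int    `json:"failures"`
	FailThreshold       int    `json:"fail_threshold"`
	CooldownRemainingMS int    `json:"cooldown_remaining_ms"`
}

type status struct {
	CircuitBreakers    map[string]breakerStatus `json:"circuit_breakers"`
	InputFailOpenTotal int                      `json:"input_fail_open_total"`
}

// parseStatus decodes a status payload.
func parseStatus(payload []byte) (status, error) {
	var s status
	err := json.Unmarshal(payload, &s)
	return s, err
}

// openBreakers returns the sorted names of breakers reported as OPEN.
func openBreakers(s status) []string {
	var names []string
	for name, b := range s.CircuitBreakers {
		if b.State == "OPEN" {
			names = append(names, name)
		}
	}
	sort.Strings(names)
	return names
}

func main() {
	payload := []byte(`{
	  "circuit_breakers": {
	    "input":  {"state": "OPEN",   "failures": 4, "fail_threshold": 3, "cooldown_remaining_ms": 21400},
	    "output": {"state": "CLOSED", "failures": 0, "fail_threshold": 3, "cooldown_remaining_ms": 0}
	  },
	  "input_fail_open_total": 42
	}`)
	s, err := parseStatus(payload)
	if err != nil {
		panic(err)
	}
	fmt.Println(openBreakers(s), s.InputFailOpenTotal) // [input] 42
}
```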

Limitations and tradeoffs

  • Distributed Redis path is TTL/counter based; reopen is time-driven rather than strict probe-budget driven.
  • Fail-open mode protects availability, but it can bypass deny/approval decisions during safety outages.
  • Redis outage fallback is per-replica, so breaker behavior can diverge temporarily across scheduler instances.
  • Aggressive thresholds reduce blast radius but can over-trip during short noisy spikes.

Frequently asked questions

What is the circuit breaker pattern for AI agents?

The AI agent circuit breaker pattern trips a degraded dependency (a flaky tool, model endpoint, or safety service) into an OPEN state after a failure threshold so agents stop hammering it, then probes for recovery via a HALF_OPEN state before closing again. Unlike a generic API breaker, an agent control plane must also decide what happens to in-flight autonomous actions while the breaker is open. In Cordum the scheduler's safety client opens after 3 consecutive failures, stays open for 30 seconds, and closes after 2 successful probes. Breaker state is shared across replicas through a Redis counter so one replica tripping protects the whole fleet.

Should an AI agent circuit breaker fail open or fail closed?

It depends on the action's reversibility. Fail-closed (the default in Cordum, POLICY_CHECK_FAIL_MODE=closed) requeues jobs with exponential backoff when the Safety Kernel is unreachable, so no unevaluated job dispatches. That makes it the right choice for irreversible or production actions. Fail-open (POLICY_CHECK_FAIL_MODE=open) allows jobs through with a warning, increments cordum_scheduler_input_fail_open_total, and tags the job with a safety_bypassed label plus a dedicated audit event so SIEM can detect the bypass. Use fail-open only in staging or where downstream compensating controls exist.

How does Cordum stop cascading tool failures for autonomous agents?

Cordum gates every dispatch through the Safety Kernel before a worker runs. A Redis-backed circuit breaker on the scheduler's safety client (key cordum:cb:safety:failures) caps retry storms when the kernel degrades, and a separate breaker (cordum:cb:safety:output:failures) protects the output-policy path. When the input breaker is open, the scheduler receives a SafetyUnavailable decision instead of blocking on the RPC, and the configured fail mode determines whether the job is requeued or bypassed. Every decision, including a bypass, is written to the audit trail.

What are good circuit breaker thresholds for agent systems?

Cordum's shipped defaults are a 3-failure budget to open, a 30-second open window, 3 half-open probes, and 2 successes to close, with a 2-second gRPC timeout on the safety call (100ms metadata / 30s content for output safety). Aggressive thresholds shrink blast radius but can over-trip on short noisy spikes, so tune the open window to your dependency's typical recovery time and alert on both breaker-open rate and the fail-open counter on the same dashboard.

Next step

Run this in one sprint:

  1. Set per-topic breaker thresholds and document fail-mode ownership.
  2. Keep `POLICY_CHECK_FAIL_MODE=closed` for irreversible actions.
  3. Alert on `input_fail_open_total` and breaker-open rate in the same dashboard.
  4. Run one game day with Safety Kernel outage and verify expected requeue/bypass behavior.

Breaker behavior is one layer of Cordum's governed dispatch. See how it fits an end-to-end workflow on the automated incident response solution page, then continue with AI Agent Rollback and Compensation and AI Agent Incident Report.

Fail fast, recover faster

A breaker is not a silver bullet. It is the line between controlled degradation and unbounded retry chaos.