
AI Agent Circuit Breaker Pattern for Production Reliability

A degraded dependency should trip fast, not drag your whole queue into failure.

Guide · 10 min read · Apr 2026
TL;DR
  • Retries alone amplify outages; breakers cap damage.
  • In Cordum, the safety-client breaker opens at 3 failures for 30s and shares state via Redis.
  • When safety is unavailable, `POLICY_CHECK_FAIL_MODE=closed` (default) requeues; `open` allows through with bypass signals.
Failure isolation

Trip early before dependency outages spread to every worker

Policy mode

Choose fail-closed or fail-open explicitly per risk posture

Measured recovery

Use bounded reopen logic and observability instead of hope-driven retries

Scope

This guide covers pre-dispatch safety and output-policy breaker behavior in autonomous agent control planes, not generic API-client retry wrappers.

The production problem

Retry loops can turn a short dependency outage into platform-wide pressure. At 500 jobs per minute, three extra attempts per job add 15,000 avoidable calls over 10 minutes (500 × 3 × 10).

Circuit breakers cut this feedback loop. For agent systems, that is still not enough: you also need an explicit governance decision when safety services are unavailable.

What top results miss

Source | Strong coverage | Missing piece
------ | --------------- | -------------
Azure Architecture: Circuit Breaker pattern | Clear state-machine behavior and operational concerns for distributed systems. | No pre-dispatch governance model for autonomous actions during breaker-open periods.
AWS Prescriptive Guidance: Circuit breaker | Implementation framing and when to prefer a breaker over pure retry. | No guidance on safety-policy bypass risk when fail-open is enabled.
Martin Fowler: CircuitBreaker | Foundational design, thresholds, half-open behavior, and monitoring value. | No agent-control-plane context where replay/approval semantics must survive outages.

Policy-aware breaker model

For autonomous agents, breaker state and policy fail mode must be configured together; a sketch of the combined decision follows the comparison table below.

Mode | Behavior | Operational risk
---- | -------- | ----------------
Retry only | Keeps hitting the degraded dependency during the outage | Queue growth, waste, and cascading retries
Classic breaker | Trips after a threshold and blocks traffic temporarily | Unclear policy outcome for blocked high-risk actions
Breaker + policy fail mode | Breaker state plus explicit closed/open safety behavior | More knobs, but predictable incident behavior
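
The sketch below shows how the two decisions compose. It is illustrative only, not Cordum source: `Job`, `Breaker`, and `dispatch` are stand-in names, and only `POLICY_CHECK_FAIL_MODE` comes from the real configuration.

Go
package sketch

import (
	"context"
	"errors"
	"fmt"
	"os"
)

// Job stands in for a queued agent action; fields are illustrative.
type Job struct {
	ID     string
	Labels map[string]string
}

// Breaker is a minimal stand-in for the shared circuit breaker.
type Breaker struct{ failures, budget int }

func (b *Breaker) Open() bool     { return b.failures >= b.budget }
func (b *Breaker) RecordFailure() { b.failures++ }
func (b *Breaker) RecordSuccess() { b.failures = 0 }

// dispatch composes the two decisions: the breaker says whether safety is
// reachable, and POLICY_CHECK_FAIL_MODE says what happens when it is not.
func dispatch(ctx context.Context, b *Breaker, job Job,
	checkSafety func(context.Context, Job) error) error {
	if b.Open() {
		if os.Getenv("POLICY_CHECK_FAIL_MODE") == "open" {
			job.Labels["safety_bypassed"] = "true" // audit trail for the bypass
			return nil                             // allow the action through
		}
		return errors.New("requeue: safety unavailable, failing closed") // default
	}
	if err := checkSafety(ctx, job); err != nil {
		b.RecordFailure()
		return fmt.Errorf("requeue: %w", err)
	}
	b.RecordSuccess()
	return nil
}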

Cordum runtime evidence

The controls below are verified against current source and docs.

Control | Current behavior | Evidence | Why it matters
------- | ---------------- | -------- | --------------
Input safety timeout stack | Safety client uses a 2s gRPC timeout; the scheduler wraps it with a 3s defense timeout. | safety_client.go + engine.go | Caps tail latency before job-level retry/requeue paths trigger.
Trip threshold and open window | Breaker opens at 3 failures and uses a 30s TTL for the reopen window. | safety_client.go + circuit_breaker.go | Fast enough to stop retry storms, short enough for rapid recheck.
Shared multi-replica state | Redis keys `cordum:cb:safety:failures` and `cordum:cb:safety:output:failures`. | circuit_breaker.go + safety-kernel.md | Prevents one replica from tripping while others keep hammering.
Safety unavailable behavior | `POLICY_CHECK_FAIL_MODE=closed` requeues (default); `open` allows and tags bypass labels. | engine.go + configuration-reference.md | Makes the availability-vs-safety tradeoff explicit instead of accidental.
Redis outage fallback | Falls back to a local in-memory breaker; the local state machine uses half-open probes and close-after-success. | circuit_breaker.go | The service keeps running, but decisions become per-replica until Redis recovers.
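
The core of the shared state is a failure counter whose TTL doubles as the reopen window. The sketch below assumes the go-redis client and reuses the documented key name; Cordum's actual circuit_breaker.go may structure this differently.

Go
package sketch

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

const (
	failKey    = "cordum:cb:safety:failures" // documented shared-state key
	failBudget = 3                           // failures that trip the breaker
	openFor    = 30 * time.Second            // TTL doubles as the reopen window
)

// recordFailure bumps the shared counter so every scheduler replica sees the
// same failure count; when the key's TTL expires, the count resets and the
// dependency gets probed again.
func recordFailure(ctx context.Context, rdb *redis.Client) (open bool, err error) {
	n, err := rdb.Incr(ctx, failKey).Result()
	if err != nil {
		return false, err // caller should fall back to the local breaker
	}
	if n == 1 {
		// First failure in this window starts the 30s clock.
		if err := rdb.Expire(ctx, failKey, openFor).Err(); err != nil {
			return false, err
		}
	}
	return n >= failBudget, nil
}

// isOpen checks breaker state without recording a failure.
func isOpen(ctx context.Context, rdb *redis.Client) (bool, error) {
	n, err := rdb.Get(ctx, failKey).Int64()
	if err == redis.Nil {
		return false, nil // no failures in the current window
	}
	if err != nil {
		return false, err
	}
	return n >= failBudget, nil
}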

Operator caveat

Redis-backed reopen is TTL based. It does not enforce a strict global probe budget. If you need tighter reopen control, add external rate limits around recovered dependencies.
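
One way to add that control is a token-bucket limiter in front of the recovered dependency. The sketch assumes golang.org/x/time/rate; the one-probe-per-second budget is an illustrative number, not a Cordum default.

Go
package sketch

import "golang.org/x/time/rate"

// probeLimiter caps how fast traffic ramps back onto a dependency whose
// breaker just reopened: one probe per second with a burst of 3.
var probeLimiter = rate.NewLimiter(rate.Limit(1), 3)

// shouldProbe gates calls to the recovered dependency; denied calls should
// stay on the requeue path instead of stampeding it.
func shouldProbe() bool {
	return probeLimiter.Allow()
}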

Implementation examples

Safety breaker defaults (Go)

safety_client.go
Go
const (
	safetyTimeout            = 2 * time.Second  // per-call gRPC timeout to the Safety Kernel
	safetyCircuitOpenFor     = 30 * time.Second // how long the breaker stays open before rechecking
	safetyCircuitFailBudget  = 3                // consecutive failures that trip the breaker
	safetyCircuitHalfOpenMax = 3                // max probe requests while half-open
	safetyCircuitCloseAfter  = 2                // successful probes needed to close again
)

Fail-open vs fail-closed branch (Go)

engine.go
Go
case SafetyUnavailable:
	if e.isInputFailOpen() {
		// fail-open: allow the job through, but label it so the bypass is auditable
		req.Labels["safety_bypassed"] = "true"
	} else {
		// default fail-closed: requeue with backoff and retry once safety recovers
		return RetryAfter(fmt.Errorf("safety unavailable"), safetyThrottleDelay)
	}
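
Defense timeout layering (Go)

The timeout stack from the evidence table layers the scheduler's 3s defense timeout over the client's 2s gRPC timeout. The sketch below shows one possible layering; `checkWithDefense` and `call` are assumed names, and the goroutine-and-select shape is an illustration, not necessarily how engine.go wires it.

Go
package sketch

import (
	"context"
	"time"
)

// checkWithDefense bounds the whole safety check with an outer 3s defense
// timeout, even if the inner 2s gRPC call ignores its own deadline.
func checkWithDefense(parent context.Context, call func(context.Context) error) error {
	defenseCtx, cancel := context.WithTimeout(parent, 3*time.Second)
	defer cancel()

	grpcCtx, cancelCall := context.WithTimeout(defenseCtx, 2*time.Second)
	defer cancelCall()

	done := make(chan error, 1)
	go func() { done <- call(grpcCtx) }()

	select {
	case err := <-done:
		return err
	case <-defenseCtx.Done():
		// Defense layer fired: surface this as SafetyUnavailable upstream.
		return defenseCtx.Err()
	}
}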

Runtime status payload (JSON)

status-ha.json
JSON
{
  "circuit_breakers": {
    "input":  { "state": "OPEN" },
    "output": { "state": "CLOSED" }
  },
  "input_fail_open_total": 42
}
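
To act on that payload, an alerting job can watch for the risky combination of an OPEN breaker and a rising fail-open counter. The decoder below assumes only the fields shown above; `flagBypassRisk` and its paging logic are illustrative.

Go
package sketch

import (
	"encoding/json"
	"fmt"
)

// statusHA mirrors the payload shape above; only the fields shown are assumed.
type statusHA struct {
	CircuitBreakers map[string]struct {
		State string `json:"state"`
	} `json:"circuit_breakers"`
	InputFailOpenTotal int `json:"input_fail_open_total"`
}

// flagBypassRisk pages when any breaker is OPEN while the fail-open counter
// is still climbing: that combination means actions are bypassing safety
// checks right now, not just that a dependency is degraded.
func flagBypassRisk(raw []byte, lastFailOpen int) (bool, error) {
	var s statusHA
	if err := json.Unmarshal(raw, &s); err != nil {
		return false, err
	}
	for name, cb := range s.CircuitBreakers {
		if cb.State == "OPEN" && s.InputFailOpenTotal > lastFailOpen {
			fmt.Printf("ALERT: %s breaker OPEN with fail-open count rising\n", name)
			return true, nil
		}
	}
	return false, nil
}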

Limitations and tradeoffs

  • Distributed Redis path is TTL/counter based; reopen is time-driven rather than strict probe-budget driven.
  • Fail-open mode protects availability, but it can bypass deny/approval decisions during safety outages.
  • Redis outage fallback is per-replica, so breaker behavior can diverge temporarily across scheduler instances.
  • Aggressive thresholds reduce blast radius but can over-trip during short noisy spikes.

Next step

Run this in one sprint:

  1. Set per-topic breaker thresholds and document fail-mode ownership.
  2. Keep `POLICY_CHECK_FAIL_MODE=closed` for irreversible actions.
  3. Alert on `input_fail_open_total` and breaker-open rate in the same dashboard.
  4. Run one game day with a Safety Kernel outage and verify expected requeue/bypass behavior.

Continue with AI Agent Rollback and Compensation, then AI Agent Incident Report.

Fail fast, recover faster

A breaker is not a silver bullet. It is the line between controlled degradation and unbounded retry chaos.