The production problem
Retry loops can turn a short dependency outage into platform-wide pressure. At 500 jobs per minute, three extra attempts per job add 15,000 avoidable calls over 10 minutes (500 × 3 × 10).
Circuit breakers cut this feedback loop. For agent systems, that is still not enough: you also need an explicit governance decision when safety services are unavailable.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Azure Architecture: Circuit Breaker pattern | Clear state-machine behavior and operational concerns for distributed systems. | No pre-dispatch governance model for autonomous actions during breaker-open periods. |
| AWS Prescriptive Guidance: Circuit breaker | Implementation framing and when to prefer breaker over pure retry. | No guidance on safety-policy bypass risk when fail-open is enabled. |
| Martin Fowler: CircuitBreaker | Foundational design, thresholds, half-open behavior, and monitoring value. | No agent-control-plane context where replay/approval semantics must survive outages. |
Policy-aware breaker model
For autonomous agents, breaker state and policy fail mode must be configured together, as sketched after the table below.
| Mode | Behavior | Operational risk |
|---|---|---|
| Retry only | Keeps hitting degraded dependency during outage | Queue growth, waste, and cascading retries |
| Classic breaker | Trips after threshold and blocks traffic temporarily | Unclear policy outcome for blocked high-risk actions |
| Breaker + policy fail mode | Breaker state plus explicit closed/open safety behavior | More knobs, but predictable incident behavior |
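A minimal sketch of the third row, with hypothetical names (`BreakerConfig`, `Allow`) rather than Cordum's actual types: the point is that trip settings and fail mode live in one value, so neither gets tuned in isolation.

```go
package safety

import "time"

// FailMode is the explicit policy decision for breaker-open periods.
type FailMode string

const (
	FailClosed FailMode = "closed" // requeue while safety is unavailable
	FailOpen   FailMode = "open"   // allow through, but label the bypass
)

// BreakerConfig keeps trip tuning and fail mode in one place.
type BreakerConfig struct {
	FailBudget int           // consecutive failures before the breaker opens
	OpenFor    time.Duration // how long the breaker stays open before rechecking
	Mode       FailMode      // what blocked traffic does while the breaker is open
}

// Allow decides what happens to a job while the breaker is open.
// Fail-open must stay auditable, hence the label.
func (c BreakerConfig) Allow(labels map[string]string) bool {
	if c.Mode == FailOpen {
		labels["safety_bypassed"] = "true"
		return true
	}
	return false // fail-closed: caller requeues with backoff
}
```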
Cordum runtime evidence
The controls below are verified against current source and docs.
| Control | Current behavior | Evidence | Why it matters |
|---|---|---|---|
| Input safety timeout stack | Safety client uses 2s gRPC timeout; scheduler wraps with 3s defense timeout. | safety_client.go + engine.go | Caps tail latency before job-level retry/requeue paths trigger (timeout sketch below). |
| Trip threshold and open window | Breaker opens at 3 failures and uses 30s TTL for reopen window. | safety_client.go + circuit_breaker.go | Fast enough to stop retry storms, short enough for rapid recheck. |
| Shared multi-replica state | Redis keys `cordum:cb:safety:failures` and `cordum:cb:safety:output:failures`. | circuit_breaker.go + safety-kernel.md | Prevents one replica from tripping while others keep hammering (counter sketch below). |
| Safety unavailable behavior | `POLICY_CHECK_FAIL_MODE=closed` requeues (default). `open` allows and tags bypass labels. | engine.go + configuration-reference.md | Makes availability-vs-safety tradeoff explicit instead of accidental. |
| Redis outage fallback | Fallback is local in-memory breaker; local state machine uses half-open probes and close-after-success. | circuit_breaker.go | Service keeps running, but decisions become per-replica until Redis recovers. |
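The timeout stack in the first row is plain nested contexts. A minimal sketch, assuming a hypothetical `checkInput` gRPC call; only the 2s/3s values come from the table:

```go
package safety

import (
	"context"
	"time"
)

// checkInput stands in for the real gRPC safety call (hypothetical).
func checkInput(ctx context.Context) error { return nil }

// callSafety applies the client-level 2s timeout per gRPC call.
func callSafety(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return checkInput(ctx)
}

// checkWithDefense wraps the whole check in the scheduler's 3s defense
// timeout, so a stuck client can never stall the job pipeline.
func checkWithDefense(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	return callSafety(ctx)
}
```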
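The shared-state row reduces to a counter with a TTL. A sketch of that pattern with `github.com/redis/go-redis/v9`; the key name mirrors the table, everything else is an illustration, not Cordum's code:

```go
package safety

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

const (
	failuresKey = "cordum:cb:safety:failures"
	failBudget  = 3                // trips the breaker, per the table
	openFor     = 30 * time.Second // TTL doubles as the reopen window
)

// recordFailure bumps the shared counter and reports whether the breaker
// is now open for every replica. When the key expires, the count resets.
func recordFailure(ctx context.Context, rdb *redis.Client) (bool, error) {
	n, err := rdb.Incr(ctx, failuresKey).Result()
	if err != nil {
		return false, err
	}
	if n == 1 {
		// First failure in this window: start the 30s TTL.
		if err := rdb.Expire(ctx, failuresKey, openFor).Err(); err != nil {
			return false, err
		}
	}
	return n >= failBudget, nil
}
```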
Operator caveat
Redis-backed reopen is TTL-based: it does not enforce a strict global probe budget. If you need tighter reopen control, add external rate limits around recovered dependencies, as in the sketch below.
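One way to add that limit, sketched with `golang.org/x/time/rate`: a strict one-probe-per-second budget on the recovery path. Note this limiter is per-process; a truly global budget needs shared state. The wrapper and its names are illustrative:

```go
package safety

import (
	"context"
	"errors"

	"golang.org/x/time/rate"
)

var errProbeBudget = errors.New("probe budget exhausted")

// One probe per second, no burst: a hard local cap on traffic aimed
// at a dependency that has just recovered.
var probeLimiter = rate.NewLimiter(rate.Limit(1), 1)

// probeRecovered runs fn only if the limiter grants a slot, keeping
// reopen traffic bounded even though the Redis TTL is time-driven.
func probeRecovered(ctx context.Context, fn func(context.Context) error) error {
	if !probeLimiter.Allow() {
		return errProbeBudget
	}
	return fn(ctx)
}
```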
Implementation examples
Safety breaker defaults (Go)
```go
const (
	safetyTimeout            = 2 * time.Second
	safetyCircuitOpenFor     = 30 * time.Second
	safetyCircuitFailBudget  = 3
	safetyCircuitHalfOpenMax = 3
	safetyCircuitCloseAfter  = 2
)
```
Fail-open vs fail-closed branch (Go)
```go
case SafetyUnavailable:
	if e.isInputFailOpen() {
		// fail-open: allow through, but mark the bypass for audit
		req.Labels["safety_bypassed"] = "true"
	} else {
		// default fail-closed: requeue with backoff
		return RetryAfter(fmt.Errorf("safety unavailable"), safetyThrottleDelay)
	}
```
Runtime status payload (JSON)
```json
{
  "circuit_breakers": {
    "input": { "state": "OPEN" },
    "output": { "state": "CLOSED" }
  },
  "input_fail_open_total": 42
}
```
Limitations and tradeoffs
- Distributed Redis path is TTL/counter-based; reopen is time-driven rather than strictly probe-budget-driven.
- Fail-open mode protects availability, but it can bypass deny/approval decisions during safety outages.
- Redis outage fallback is per-replica, so breaker behavior can diverge temporarily across scheduler instances.
- Aggressive thresholds reduce blast radius but can over-trip during short noisy spikes.
Next steps
Run this in one sprint:
1. Set per-topic breaker thresholds and document fail-mode ownership.
2. Keep `POLICY_CHECK_FAIL_MODE=closed` for irreversible actions.
3. Alert on `input_fail_open_total` and breaker-open rate in the same dashboard (metric sketch below).
4. Run one game day with a Safety Kernel outage and verify the expected requeue/bypass behavior.
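For step 3, a sketch of both signals as Prometheus metrics with `client_golang`. The counter name mirrors the status payload above; the gauge name and its 1-means-open convention are assumptions:

```go
package safety

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Mirrors input_fail_open_total in the runtime status payload.
var inputFailOpenTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "input_fail_open_total",
	Help: "Jobs allowed through in fail-open mode during safety outages.",
})

// Breaker state per direction (1 = open, 0 = closed), so fail-open
// volume and breaker state land on the same dashboard.
var breakerOpen = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "safety_circuit_breaker_open",
	Help: "Whether the safety circuit breaker is open, by direction.",
}, []string{"direction"})

// onFailOpenBypass is called on each bypassed job (hypothetical hook).
func onFailOpenBypass() {
	inputFailOpenTotal.Inc()
	breakerOpen.WithLabelValues("input").Set(1)
}
```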
Continue with AI Agent Rollback and Compensation, then AI Agent Incident Report.