The production problem
Retry loops can turn a short dependency outage into platform-wide pressure. At 500 jobs per minute, three extra attempts per job add 15,000 avoidable calls over 10 minutes (500 × 3 × 10).
Circuit breakers cut this feedback loop. For agent systems, that is still not enough: you also need an explicit governance decision when safety services are unavailable.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Azure Architecture: Circuit Breaker pattern | Clear state-machine behavior and operational concerns for distributed systems. | No pre-dispatch governance model for autonomous actions during breaker-open periods. |
| AWS Prescriptive Guidance: Circuit breaker | Implementation framing and when to prefer breaker over pure retry. | No guidance on safety-policy bypass risk when fail-open is enabled. |
| Martin Fowler: CircuitBreaker | Foundational design, thresholds, half-open behavior, and monitoring value. | No agent-control-plane context where replay/approval semantics must survive outages. |
Policy-aware breaker model
For autonomous agents, breaker state and policy fail mode must be configured together, as sketched after the table below.
| Mode | Behavior | Operational risk |
|---|---|---|
| Retry only | Keeps hitting degraded dependency during outage | Queue growth, waste, and cascading retries |
| Classic breaker | Trips after threshold and blocks traffic temporarily | Unclear policy outcome for blocked high-risk actions |
| Breaker + policy fail mode | Breaker state plus explicit closed/open safety behavior | More knobs, but predictable incident behavior |
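A minimal sketch of the third row, with hypothetical names (`BreakerConfig`, `Allow`) rather than Cordum's actual types: the point is that trip settings and fail mode live in one value, so neither gets tuned in isolation.

```go
package safety

import "time"

// FailMode is the explicit policy decision for breaker-open periods.
type FailMode string

const (
	FailClosed FailMode = "closed" // requeue while safety is unavailable
	FailOpen   FailMode = "open"   // allow through, but label the bypass
)

// BreakerConfig keeps trip tuning and fail mode in one place.
type BreakerConfig struct {
	FailBudget int           // consecutive failures before the breaker opens
	OpenFor    time.Duration // how long the breaker stays open before rechecking
	Mode       FailMode      // what blocked traffic does while the breaker is open
}

// Allow decides what happens to a job while the breaker is open.
// Fail-open must stay auditable, hence the label.
func (c BreakerConfig) Allow(labels map[string]string) bool {
	if c.Mode == FailOpen {
		labels["safety_bypassed"] = "true"
		return true
	}
	return false // fail-closed: caller requeues with backoff
}
```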
Cordum runtime evidence
The controls below are verified against current source and docs.
| Control | Current behavior | Evidence | Why it matters |
|---|---|---|---|
| Input safety timeout stack | Safety client uses 2s gRPC timeout; scheduler wraps with 3s defense timeout. | safety_client.go + engine.go | Caps tail latency before job-level retry/requeue paths trigger (timeout sketch below). |
| Trip threshold and open window | Breaker opens at 3 failures and uses 30s TTL for reopen window. | safety_client.go + circuit_breaker.go | Fast enough to stop retry storms, short enough for rapid recheck. |
| Shared multi-replica state | Redis keys `cordum:cb:safety:failures` and `cordum:cb:safety:output:failures`. | circuit_breaker.go + safety-kernel.md | Prevents one replica from tripping while others keep hammering (counter sketch below). |
| Safety unavailable behavior | `POLICY_CHECK_FAIL_MODE=closed` requeues (default). `open` allows and tags bypass labels. | engine.go + configuration-reference.md | Makes availability-vs-safety tradeoff explicit instead of accidental. |
| Redis outage fallback | Fallback is local in-memory breaker; local state machine uses half-open probes and close-after-success. | circuit_breaker.go | Service keeps running, but decisions become per-replica until Redis recovers. |
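The timeout stack in the first row is plain nested contexts. A minimal sketch, assuming a hypothetical `checkInput` gRPC call; only the 2s/3s values come from the table:

```go
package safety

import (
	"context"
	"time"
)

// checkInput stands in for the real gRPC safety call (hypothetical).
func checkInput(ctx context.Context) error { return nil }

// callSafety applies the client-level 2s timeout per gRPC call.
func callSafety(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 2*time.Second)
	defer cancel()
	return checkInput(ctx)
}

// checkWithDefense wraps the whole check in the scheduler's 3s defense
// timeout, so a stuck client can never stall the job pipeline.
func checkWithDefense(ctx context.Context) error {
	ctx, cancel := context.WithTimeout(ctx, 3*time.Second)
	defer cancel()
	return callSafety(ctx)
}
```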
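The shared-state row reduces to a counter with a TTL. A sketch of that pattern with `github.com/redis/go-redis/v9`; the key name mirrors the table, everything else is an illustration, not Cordum's code:

```go
package safety

import (
	"context"
	"time"

	"github.com/redis/go-redis/v9"
)

const (
	failuresKey = "cordum:cb:safety:failures"
	failBudget  = 3                // trips the breaker, per the table
	openFor     = 30 * time.Second // TTL doubles as the reopen window
)

// recordFailure bumps the shared counter and reports whether the breaker
// is now open for every replica. When the key expires, the count resets.
func recordFailure(ctx context.Context, rdb *redis.Client) (bool, error) {
	n, err := rdb.Incr(ctx, failuresKey).Result()
	if err != nil {
		return false, err
	}
	if n == 1 {
		// First failure in this window: start the 30s TTL.
		if err := rdb.Expire(ctx, failuresKey, openFor).Err(); err != nil {
			return false, err
		}
	}
	return n >= failBudget, nil
}
```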
Operator caveat
Redis-backed reopen is TTL-based: it does not enforce a strict global probe budget. If you need tighter reopen control, add external rate limits around recovered dependencies, as in the sketch below.
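One way to add that limit, sketched with `golang.org/x/time/rate`: a strict one-probe-per-second budget on the recovery path. Note this limiter is per-process; a truly global budget needs shared state. The wrapper and its names are illustrative:

```go
package safety

import (
	"context"
	"errors"

	"golang.org/x/time/rate"
)

var errProbeBudget = errors.New("probe budget exhausted")

// One probe per second, no burst: a hard local cap on traffic aimed
// at a dependency that has just recovered.
var probeLimiter = rate.NewLimiter(rate.Limit(1), 1)

// probeRecovered runs fn only if the limiter grants a slot, keeping
// reopen traffic bounded even though the Redis TTL is time-driven.
func probeRecovered(ctx context.Context, fn func(context.Context) error) error {
	if !probeLimiter.Allow() {
		return errProbeBudget
	}
	return fn(ctx)
}
```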
Implementation examples
Safety breaker defaults (Go)
```go
const (
	safetyTimeout            = 2 * time.Second
	safetyCircuitOpenFor     = 30 * time.Second
	safetyCircuitFailBudget  = 3
	safetyCircuitHalfOpenMax = 3
	safetyCircuitCloseAfter  = 2
)
```
Fail-open vs fail-closed branch (Go)
```go
case SafetyUnavailable:
	if e.isInputFailOpen() {
		// fail-open: allow through, but mark the bypass for audit
		req.Labels["safety_bypassed"] = "true"
	} else {
		// default fail-closed: requeue with backoff
		return RetryAfter(fmt.Errorf("safety unavailable"), safetyThrottleDelay)
	}
```
Runtime status payload (JSON)
```json
{
  "circuit_breakers": {
    "input": { "state": "OPEN" },
    "output": { "state": "CLOSED" }
  },
  "input_fail_open_total": 42
}
```
Limitations and tradeoffs
- Distributed Redis path is TTL/counter-based; reopen is time-driven rather than strictly probe-budget-driven.
- Fail-open mode protects availability, but it can bypass deny/approval decisions during safety outages.
- Redis outage fallback is per-replica, so breaker behavior can diverge temporarily across scheduler instances.
- Aggressive thresholds reduce blast radius but can over-trip during short noisy spikes.
Next steps
Run this in one sprint:
1. Set per-topic breaker thresholds and document fail-mode ownership.
2. Keep `POLICY_CHECK_FAIL_MODE=closed` for irreversible actions.
3. Alert on `input_fail_open_total` and breaker-open rate in the same dashboard (metric sketch below).
4. Run one game day with a Safety Kernel outage and verify the expected requeue/bypass behavior.
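For step 3, a sketch of both signals as Prometheus metrics with `client_golang`. The counter name mirrors the status payload above; the gauge name and its 1-means-open convention are assumptions:

```go
package safety

import (
	"github.com/prometheus/client_golang/prometheus"
	"github.com/prometheus/client_golang/prometheus/promauto"
)

// Mirrors input_fail_open_total in the runtime status payload.
var inputFailOpenTotal = promauto.NewCounter(prometheus.CounterOpts{
	Name: "input_fail_open_total",
	Help: "Jobs allowed through in fail-open mode during safety outages.",
})

// Breaker state per direction (1 = open, 0 = closed), so fail-open
// volume and breaker state land on the same dashboard.
var breakerOpen = promauto.NewGaugeVec(prometheus.GaugeOpts{
	Name: "safety_circuit_breaker_open",
	Help: "Whether the safety circuit breaker is open, by direction.",
}, []string{"direction"})

// onFailOpenBypass is called on each bypassed job (hypothetical hook).
func onFailOpenBypass() {
	inputFailOpenTotal.Inc()
	breakerOpen.WithLabelValues("input").Set(1)
}
```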
Continue with AI Agent Rollback and Compensation, then AI Agent Incident Report.