## The production problem
A safety service outage can trigger a second outage: every scheduler replica keeps retrying, queues swell, and critical paths starve.
Circuit breakers exist to prevent that. But in governance systems, breaker tuning has a side effect: it changes when jobs are blocked, retried, or allowed through during safety downtime.
This makes threshold tuning a policy decision, not only a resilience tweak.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Martin Fowler: Circuit Breaker | Core state concepts: closed, open, half-open, timeout and threshold rationale. | No distributed Redis-shared breaker behavior across scheduler replicas. |
| Microsoft Circuit Breaker pattern | State transitions and retry interaction in production microservices. | No policy-engine fail-open/fail-closed decision path for autonomous job dispatch. |
| Resilience4j CircuitBreaker docs | Finite-state machine tuning, windowing, thresholds, and half-open probes. | No mapping to pre-dispatch safety decisions and bypass labeling requirements. |
The missing layer is control-plane semantics: how `SafetyUnavailable` combines with fail mode and what evidence operators need during temporary fail-open windows.
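That combination can be sketched as a tiny decision function. This is an illustrative sketch of the semantics only; `decideOnSafetyStatus` and the string constants are hypothetical names, not Cordum's actual API:

```go
package main

import "fmt"

// decideOnSafetyStatus sketches the control-plane rule: a
// SafetyUnavailable status combined with the configured fail mode
// determines whether a job is allowed (with bypass evidence attached)
// or requeued until the safety service recovers.
func decideOnSafetyStatus(status string, failOpen bool) (action string, labels map[string]string) {
	if status != "SafetyUnavailable" {
		return "evaluate", nil
	}
	if failOpen {
		// Fail-open: allow the job, but leave evidence for operators.
		return "allow", map[string]string{
			"safety_bypassed":      "true",
			"safety_bypass_reason": "fail-open: safety unavailable",
		}
	}
	// Fail-closed: requeue with backoff; nothing dispatches unchecked.
	return "requeue", nil
}

func main() {
	action, labels := decideOnSafetyStatus("SafetyUnavailable", true)
	fmt.Println(action, labels["safety_bypassed"]) // allow true
}
```

The point of the sketch is that the bypass labels are not optional decoration: they are the only evidence a fail-open window leaves behind.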
## State model and thresholds
| State | Trigger | Cordum behavior |
|---|---|---|
| CLOSED | Normal operation; failures recorded | Opens after 3 failures (input/output safety clients) |
| OPEN | Fail budget exceeded | Redis key TTL is set to 30s; requests short-circuit to `SafetyUnavailable` |
| HALF_OPEN | Open TTL expires and probe traffic resumes | Up to 3 probe requests; closes after 2 successes, reopens on failure |
| LOCAL_FALLBACK | Redis unavailable | Per-replica in-memory breaker mirrors thresholds; cross-replica sharing is lost |
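The HALF_OPEN row is the subtlest part of the table. A minimal in-memory sketch of that probe accounting, with the thresholds from the table hard-coded (the `halfOpenProbe` type and method names are illustrative, not the Cordum implementation):

```go
package main

import "fmt"

// halfOpenProbe tracks probe outcomes after the open TTL expires:
// up to 3 probe requests are admitted; 2 successes close the breaker,
// and any single failure reopens it immediately.
type halfOpenProbe struct {
	admitted  int
	successes int
}

// Admit reports whether another probe request may pass through.
func (p *halfOpenProbe) Admit() bool {
	if p.admitted >= 3 {
		return false // probe budget exhausted
	}
	p.admitted++
	return true
}

// Record returns the next breaker state given a probe outcome.
func (p *halfOpenProbe) Record(success bool) string {
	if !success {
		return "OPEN" // one failed probe reopens the breaker
	}
	p.successes++
	if p.successes >= 2 {
		return "CLOSED"
	}
	return "HALF_OPEN"
}

func main() {
	p := &halfOpenProbe{}
	p.Admit()
	fmt.Println(p.Record(true)) // HALF_OPEN
	p.Admit()
	fmt.Println(p.Record(true)) // CLOSED
}
```

In the real system this accounting would live behind the shared Redis key; the sketch only shows the transition rules.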
## Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Input safety timeout | `SafetyClient` policy checks use 2s request timeout. | Bounds wait time before failure accounting and breaker updates. |
| Output safety timeout | Output checks use 100ms for metadata and 30s for content path. | Separates fast-path moderation from deep content evaluation latency. |
| Failure recording | Lua script performs `INCR` + `EXPIRE` atomically for failure key. | Avoids race conditions when multiple replicas fail at once. |
| Open detection | `IsOpen()` checks Redis failure counter against threshold. | One unhealthy replica can trip shared protection for all replicas quickly. |
| Safety unavailable handling | Engine requeues in fail-closed mode; allows with bypass labels in fail-open mode. | Fail-mode controls availability vs governance strictness during outages. |
| Backoff behavior | Requeue path uses `safetyThrottleDelay = 5s` for `SafetyUnavailable`. | Prevents tight-loop retry storms while kernel recovers. |
## Implementation examples
### Atomic distributed failure recording (Go + Lua)
```go
var recordFailureLua = redis.NewScript(`
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
`)

// RecordFailure atomically increments the shared failure counter and
// sets its TTL on the first increment, so all replicas draw from one
// failure budget without a check-then-set race.
func (cb *RedisCircuitBreaker) RecordFailure(ctx context.Context) {
	ttlSec := int64(cb.openDuration.Seconds())
	if ttlSec <= 0 {
		ttlSec = 30 // default open window
	}
	count, err := recordFailureLua.Run(ctx, cb.rdb, []string{cb.failuresKey}, ttlSec).Int64()
	if err != nil {
		slog.Warn("circuit-breaker: failure recording error", "error", err)
		return
	}
	if count >= cb.failThreshold {
		slog.Warn("circuit-breaker: circuit opened", "failures", count)
	}
}
```

### Fail-open vs fail-closed decision path (Go)
```go
case SafetyUnavailable:
	if e.isInputFailOpen() {
		if e.counterClient != nil {
			// Count every fail-open bypass so operators can alert on it.
			e.counterClient.Incr(lockCtx, "cordum:scheduler:input_fail_open_total")
		}
		record.Decision = SafetyAllow
		record.Reason = "fail-open: safety unavailable -- " + record.Reason
		req.Labels["safety_bypassed"] = "true"
		req.Labels["safety_bypass_reason"] = record.Reason
	} else {
		// Fail-closed: requeue with backoff while the Safety Kernel recovers.
		return RetryAfter(fmt.Errorf("safety unavailable: %s", record.Reason), 5*time.Second)
	}
```

### Operator runbook baseline
```shell
# 1) Verify breaker keys and counters
redis-cli GET cordum:cb:safety:failures
redis-cli TTL cordum:cb:safety:failures

# 2) Alert on fail-open bypass increments
#    metric: cordum_scheduler_input_fail_open_total

# 3) Keep fail mode closed in production by default
export POLICY_CHECK_FAIL_MODE=closed

# 4) If temporary fail-open is required, time-box it and monitor bypass labels
#    label: safety_bypassed=true
```
## Limitations and tradeoffs
- Lower fail threshold reacts faster, but can open on short-lived noise spikes.
- Longer open duration reduces retry pressure, but extends degraded-mode windows.
- Shared Redis state gives global coordination, but Redis outages force local fallback behavior.
- Fail-open keeps throughput during outages, but can bypass deny/approval rules temporarily.
If you enable fail-open in production without alerting on `input_fail_open_total`, you have created a silent governance bypass channel.
## Next step
Run this tuning drill this week:
1. Inject Safety Kernel errors until the breaker opens and verify shared Redis key behavior across replicas.
2. Measure the open-window recovery path at 30s and confirm half-open probe outcomes in logs.
3. Keep `POLICY_CHECK_FAIL_MODE=closed` in production unless a temporary exception is approved.
4. If fail-open is used, enforce a time-box and alert on bypass metrics and labels.
Continue with Safety Kernel Outage Playbook and Fail-Open Alerting.