Deep Dive

AI Agent Safety Circuit Breaker Tuning

Bad thresholds either melt capacity or silently bypass policy checks.

11 min read · Apr 2026
TL;DR
  • Circuit breakers protect control-plane capacity, but threshold choices directly affect governance behavior.
  • Cordum uses Redis-backed shared failure counters, so one replica can open the breaker for all replicas.
  • Redis-shared mode uses fail budget `3` and open window `30s`; the half-open caps (max `3`, close-after `2`) apply only in local fallback mode.
  • After the Redis TTL expires, probe traffic can arrive in bursts across replicas unless explicitly bounded at a higher layer.
  • When safety is unavailable, `POLICY_CHECK_FAIL_MODE` decides between requeue (`closed`) and allow-with-bypass-label (`open`).
Shared state

Input and output safety clients share breaker state through Redis key counters.

Deterministic timing

Open window is TTL-driven (30s), so half-open probe timing is predictable.

Governance impact

Fail-open mode can bypass pre-dispatch checks; use with explicit alerting and scope limits.

Scope

This guide covers pre-dispatch input safety and post-execution output safety breakers in Cordum scheduler, with focus on distributed behavior and outage-mode decisions.

The production problem

A safety service outage can trigger a second outage: every scheduler replica keeps retrying, queues swell, and critical paths starve.

Circuit breakers exist to prevent that. But in governance systems, breaker tuning has a side effect: it changes when jobs are blocked, retried, or allowed through during safety downtime.

This makes threshold tuning a policy decision, not only a resilience tweak.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Martin Fowler: Circuit Breaker | Core state concepts: closed, open, half-open, timeout and threshold rationale. | No TTL-key shared breaker semantics where half-open probe volume is not globally rate-limited. |
| Microsoft Circuit Breaker pattern | State transitions and retry interaction in production microservices. | No policy-engine fail-open/fail-closed decision path or Redis key-expiry probe burst model. |
| Resilience4j CircuitBreaker docs | Finite-state machine tuning, windowing, thresholds, and half-open probes. | No mapping to shared Redis counters where `permittedNumberOfCallsInHalfOpenState` has no global equivalent. |

The missing layer is control-plane semantics: how `SafetyUnavailable` combines with fail mode and what evidence operators need during temporary fail-open windows.

State model and thresholds

| State | Trigger | Cordum behavior |
| --- | --- | --- |
| CLOSED | Normal operation; failures recorded | Opens after 3 failures (input/output safety clients) |
| OPEN | Fail budget exceeded | Redis key TTL is set to 30s; requests short-circuit to `SafetyUnavailable` |
| HALF_OPEN | Open TTL expires and next requests resume | Redis mode allows probes after key expiry; no distributed permitted-call quota is enforced |
| LOCAL_FALLBACK | Redis unavailable | Per-replica in-memory breaker uses half-open max 3 and close-after 2; cross-replica sharing is lost |
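
The local fallback row can be made concrete with a small state machine. The following is a minimal sketch, not Cordum's actual implementation: the type `LocalBreaker`, its method names, and the `fakeClock` helper are hypothetical, and it assumes that "close-after 2" means two successful probes and that any success while closed resets the failure count. Only the numbers (fail budget 3, open window 30s, half-open max 3, close-after 2) come from the table above.

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

type state int

const (
	stateClosed state = iota
	stateOpen
	stateHalfOpen
)

// fakeClock is a manually advanced clock so the example runs instantly.
type fakeClock struct{ t time.Time }

func (c *fakeClock) Now() time.Time       { return c.t }
func (c *fakeClock) AdvanceSeconds(s int) { c.t = c.t.Add(time.Duration(s) * time.Second) }

// LocalBreaker is a hypothetical per-replica, in-memory breaker.
type LocalBreaker struct {
	mu        sync.Mutex
	st        state
	failures  int // consecutive failures while closed
	probes    int // probes admitted in the current half-open phase
	successes int // probe successes in the current half-open phase
	openedAt  time.Time
	now       func() time.Time

	failThreshold, halfOpenMax, closeAfter int
	openFor                                time.Duration
}

func NewLocalBreaker(now func() time.Time) *LocalBreaker {
	return &LocalBreaker{now: now, failThreshold: 3, halfOpenMax: 3, closeAfter: 2, openFor: 30 * time.Second}
}

// Allow reports whether a safety check may be sent right now.
func (b *LocalBreaker) Allow() bool {
	b.mu.Lock()
	defer b.mu.Unlock()
	if b.st == stateOpen && b.now().Sub(b.openedAt) >= b.openFor {
		b.st, b.probes, b.successes = stateHalfOpen, 0, 0
	}
	switch b.st {
	case stateClosed:
		return true
	case stateHalfOpen:
		if b.probes < b.halfOpenMax {
			b.probes++
			return true
		}
	}
	return false
}

// RecordSuccess closes the breaker after closeAfter successful probes.
func (b *LocalBreaker) RecordSuccess() {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.st {
	case stateClosed:
		b.failures = 0
	case stateHalfOpen:
		b.successes++
		if b.successes >= b.closeAfter {
			b.st, b.failures = stateClosed, 0
		}
	}
}

// RecordFailure opens the breaker at the threshold, or on any failed probe.
func (b *LocalBreaker) RecordFailure() {
	b.mu.Lock()
	defer b.mu.Unlock()
	switch b.st {
	case stateClosed:
		b.failures++
		if b.failures >= b.failThreshold {
			b.st, b.failures, b.openedAt = stateOpen, 0, b.now()
		}
	case stateHalfOpen:
		b.st, b.failures, b.openedAt = stateOpen, 0, b.now()
	}
}

func main() {
	clock := &fakeClock{}
	b := NewLocalBreaker(clock.Now)
	for i := 0; i < 3; i++ {
		b.RecordFailure()
	}
	fmt.Println("allowed while open:", b.Allow()) // false
	clock.AdvanceSeconds(30)
	admitted := 0
	for i := 0; i < 5; i++ {
		if b.Allow() {
			admitted++
		}
	}
	fmt.Println("half-open probes admitted:", admitted) // 3
	b.RecordSuccess()
	b.RecordSuccess()
	fmt.Println("allowed after two successes:", b.Allow()) // true
}
```

The half-open cap here is enforced per replica only, which is why the table notes that cross-replica sharing is lost in this mode.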

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Input safety timeout | `SafetyClient` policy checks use 2s request timeout. | Bounds wait time before failure accounting and breaker updates. |
| Output safety timeout | Output checks use 100ms for metadata and 30s for content path. | Separates fast-path moderation from deep content evaluation latency. |
| Failure recording | Lua script performs `INCR` + `EXPIRE` atomically for failure key. | Avoids race conditions when multiple replicas fail at once. |
| Open detection | `IsOpen()` checks Redis failure counter against threshold. | One unhealthy replica can trip shared protection for all replicas quickly. |
| Safety unavailable handling | Engine requeues in fail-closed mode; allows with bypass labels in fail-open mode. | Fail-mode controls availability vs governance strictness during outages. |
| Backoff behavior | Requeue path uses `safetyThrottleDelay = 5s` for `SafetyUnavailable`. | Prevents tight-loop retry storms while kernel recovers. |
| Redis half-open gating | Redis path opens on threshold and closes on key expiry or success delete; no global half-open permit counter. | Replica fleets can send synchronized probe bursts immediately after TTL expiry. |
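
The timeout rows describe a deadline that bounds each check before the result feeds failure accounting. A minimal sketch of that pattern with `context.WithTimeout` follows. The function names `checkWithTimeout`, `slowCheck`, and `countsAsFailure` are hypothetical, the safety service is simulated, and treating every error (including a deadline) as a breaker failure is an assumption rather than confirmed Cordum behavior. The constants mirror the values in the table; the demo scales them down so it finishes quickly.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"time"
)

// Timeouts quoted in the table above (listed for reference).
const (
	inputSafetyTimeout    = 2 * time.Second
	outputMetadataTimeout = 100 * time.Millisecond
	outputContentTimeout  = 30 * time.Second
)

// checkFunc stands in for a call to the safety service.
type checkFunc func(ctx context.Context) error

// checkWithTimeout runs check under a deadline and reports whether the
// outcome should be recorded as a breaker failure (assumption: any error counts).
func checkWithTimeout(parent context.Context, timeout time.Duration, check checkFunc) (failed bool, err error) {
	ctx, cancel := context.WithTimeout(parent, timeout)
	defer cancel()
	err = check(ctx)
	return err != nil, err
}

// slowCheck simulates a safety service that needs d to answer.
func slowCheck(d time.Duration) checkFunc {
	return func(ctx context.Context) error {
		select {
		case <-time.After(d):
			return nil
		case <-ctx.Done():
			return ctx.Err()
		}
	}
}

// countsAsFailure runs a simulated check with millisecond-scale values.
func countsAsFailure(timeoutMs, latencyMs int) bool {
	failed, _ := checkWithTimeout(context.Background(),
		time.Duration(timeoutMs)*time.Millisecond,
		slowCheck(time.Duration(latencyMs)*time.Millisecond))
	return failed
}

func main() {
	// Scaled-down stand-in for "2s timeout, service takes longer".
	failed, err := checkWithTimeout(context.Background(), 20*time.Millisecond, slowCheck(200*time.Millisecond))
	fmt.Println("slow service counted as failure:", failed, "deadline exceeded:", errors.Is(err, context.DeadlineExceeded))

	// A fast answer inside the deadline is not a failure.
	fmt.Println("fast service counted as failure:", countsAsFailure(200, 1))
}
```

The timeout value sets how long each failing call holds a worker before the breaker sees it, so with a fail budget of 3 the time to open is roughly bounded by a few timeout periods under a hard outage.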

Implementation examples

Atomic distributed failure recording (Go + Lua)

redis_circuit_breaker_failure.lua.go
Go
var recordFailureLua = redis.NewScript(`
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
`)

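// RecordFailure atomically increments the shared failure counter.
// The script sets the TTL only on the first increment, so the window
// is measured from the first recorded failure.
func (cb *RedisCircuitBreaker) RecordFailure(ctx context.Context) {
  ttlSec := int64(cb.openDuration.Seconds())
  if ttlSec <= 0 {
    ttlSec = 30 // default open window in seconds
  }
  count, err := recordFailureLua.Run(ctx, cb.rdb, []string{cb.failuresKey}, ttlSec).Int64()
  if err != nil {
    // Do not mistake a zero count from a failed script run for a healthy breaker.
    slog.Warn("circuit-breaker: failed to record failure", "error", err)
    return
  }
  if count >= cb.failThreshold {
    slog.Warn("circuit-breaker: circuit opened", "failures", count)
  }
}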
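
To see the shared-counter behavior without a Redis instance, the counter can be hidden behind a small interface and backed by an in-memory fake. This is a sketch under stated assumptions: `counterStore`, `fakeStore`, `sharedBreaker`, and `newReplica` are hypothetical names, the fake is not safe for concurrent use, and it only mimics the `INCR` plus `EXPIRE`-on-first-increment semantics of the script above. The key name, threshold 3, and 30s TTL are taken from this guide.

```go
package main

import "fmt"

// counterStore abstracts the Redis operations the breaker needs.
type counterStore interface {
	IncrWithTTL(key string, ttlSec int64) int64 // INCR, plus EXPIRE when the count becomes 1
	Get(key string) int64                       // 0 when the key is missing or expired
	Del(key string)
}

// fakeStore is a single-goroutine, in-memory stand-in with a manual clock.
type fakeStore struct {
	nowSec  int64
	counts  map[string]int64
	expires map[string]int64 // absolute expiry in fake seconds
}

func newFakeStore() *fakeStore {
	return &fakeStore{counts: map[string]int64{}, expires: map[string]int64{}}
}

func (s *fakeStore) Advance(sec int64) { s.nowSec += sec }

func (s *fakeStore) evictIfExpired(key string) {
	if exp, ok := s.expires[key]; ok && s.nowSec >= exp {
		s.Del(key)
	}
}

func (s *fakeStore) IncrWithTTL(key string, ttlSec int64) int64 {
	s.evictIfExpired(key)
	s.counts[key]++
	if s.counts[key] == 1 {
		s.expires[key] = s.nowSec + ttlSec
	}
	return s.counts[key]
}

func (s *fakeStore) Get(key string) int64 {
	s.evictIfExpired(key)
	return s.counts[key]
}

func (s *fakeStore) Del(key string) {
	delete(s.counts, key)
	delete(s.expires, key)
}

// sharedBreaker is one replica's view of the shared counter.
type sharedBreaker struct {
	store     counterStore
	key       string
	threshold int64
	ttlSec    int64
}

func newReplica(store counterStore) *sharedBreaker {
	return &sharedBreaker{store: store, key: "cordum:cb:safety:failures", threshold: 3, ttlSec: 30}
}

func (b *sharedBreaker) RecordFailure() int64 { return b.store.IncrWithTTL(b.key, b.ttlSec) }
func (b *sharedBreaker) RecordSuccess()       { b.store.Del(b.key) }
func (b *sharedBreaker) IsOpen() bool         { return b.store.Get(b.key) >= b.threshold }

func main() {
	store := newFakeStore()
	replicaA, replicaB := newReplica(store), newReplica(store)

	for i := 0; i < 3; i++ {
		replicaA.RecordFailure() // only replica A observes failures
	}
	fmt.Println("replica B sees open:", replicaB.IsOpen()) // true

	store.Advance(30) // TTL expires; no permit counter gates what happens next
	fmt.Println("replica B after TTL:", replicaB.IsOpen()) // false
}
```

One consequence worth checking in a drill: because the TTL starts at the first failure, failures spread over several seconds leave less than the full 30s of open time once the threshold is reached.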

Fail-open vs fail-closed decision path (Go)

safety_unavailable_fail_mode.go
Go
case SafetyUnavailable:
  if e.isInputFailOpen() {
    // Fail-open: count the bypass so operators can alert on it.
    if e.counterClient != nil {
      e.counterClient.Incr(lockCtx, "cordum:scheduler:input_fail_open_total")
    }
    // Allow the job, but record why and label the request as bypassed.
    record.Decision = SafetyAllow
    record.Reason = "fail-open: safety unavailable -- " + record.Reason
    req.Labels["safety_bypassed"] = "true"
    req.Labels["safety_bypass_reason"] = record.Reason
  } else {
    // Fail-closed: requeue after a 5s delay instead of dispatching unchecked.
    return RetryAfter(fmt.Errorf("safety unavailable: %s", record.Reason), 5*time.Second)
  }
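
The snippet above depends on how the fail mode is resolved from `POLICY_CHECK_FAIL_MODE`, which this guide does not show. The sketch below is one conservative way to do it, assuming that only an explicit `open` value enables fail-open and that anything else (unset, empty, or misspelled) falls back to `closed`. `FailMode`, `parseFailMode`, `Outcome`, and `onSafetyUnavailable` are hypothetical names; the label keys and reason prefix are copied from the snippet above.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// FailMode is the outage-mode policy for unavailable safety checks.
type FailMode string

const (
	FailClosed FailMode = "closed"
	FailOpen   FailMode = "open"
)

// parseFailMode treats only an explicit "open" as fail-open (assumption:
// unknown or empty values should take the stricter path).
func parseFailMode(raw string) FailMode {
	if strings.EqualFold(strings.TrimSpace(raw), "open") {
		return FailOpen
	}
	return FailClosed
}

// Outcome summarizes what the scheduler would do with the job.
type Outcome struct {
	Allow   bool
	Requeue bool
	Labels  map[string]string
}

// onSafetyUnavailable mirrors the two branches of the decision path.
func onSafetyUnavailable(mode FailMode, reason string) Outcome {
	if mode == FailOpen {
		return Outcome{Allow: true, Labels: map[string]string{
			"safety_bypassed":      "true",
			"safety_bypass_reason": "fail-open: safety unavailable -- " + reason,
		}}
	}
	return Outcome{Requeue: true}
}

func main() {
	mode := parseFailMode(os.Getenv("POLICY_CHECK_FAIL_MODE"))
	fmt.Println("resolved fail mode:", mode)
	fmt.Printf("%+v\n", onSafetyUnavailable(mode, "breaker open"))
}
```

Defaulting to `closed` on a typo means a misconfigured deployment loses availability rather than governance, which matches the runbook's advice to keep production fail-closed.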

Operator runbook baseline

safety_breaker_runbook.sh
Bash
# 1) Verify breaker keys and counters
redis-cli GET cordum:cb:safety:failures
redis-cli TTL cordum:cb:safety:failures

# 2) Alert on fail-open bypass increments
# metric: cordum_scheduler_input_fail_open_total

# 3) Estimate post-expiry probe surge envelope before tests
# probe_burst_upper_bound ~= replicas * in_flight_checks_per_replica

# 4) Keep fail mode closed in production by default
export POLICY_CHECK_FAIL_MODE=closed

# 5) If temporary fail-open is required, time-box it and monitor bypass labels
# label: safety_bypassed=true

Probe burst envelope (Ops)

shared_breaker_probe_burst.txt
Bash
# Upper-bound envelope after breaker TTL expiry
# probe_burst_upper_bound ~= scheduler_replicas * in_flight_safety_checks_per_replica
#
# Redis shared mode does not enforce global half-open permits.
# If 8 replicas each release 20 checks immediately, first wave ~= 160 probes.
#
# Practical guardrail:
# keep first-wave probes below Safety Kernel recoverable QPS
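
The envelope arithmetic is simple enough to keep as a helper next to load-test tooling. The sketch below computes the upper bound from the formula above and, going the other way, the largest per-replica in-flight cap that keeps the first wave within a budget. Both function names are hypothetical, and the budget of 100 probes is an illustrative number, not a measured Safety Kernel limit.

```go
package main

import "fmt"

// probeBurstUpperBound is replicas * in-flight checks per replica.
func probeBurstUpperBound(replicas, inFlightPerReplica int) int {
	return replicas * inFlightPerReplica
}

// maxInFlightPerReplica returns the largest per-replica cap whose fleet-wide
// first wave stays at or below recoverableFirstWave.
func maxInFlightPerReplica(replicas, recoverableFirstWave int) int {
	if replicas <= 0 {
		return 0
	}
	return recoverableFirstWave / replicas
}

func main() {
	// The worked example from the envelope above: 8 replicas, 20 checks each.
	fmt.Println("first wave upper bound:", probeBurstUpperBound(8, 20)) // 160

	// Hypothetical budget: the kernel can absorb about 100 probes at once.
	perReplicaCap := maxInFlightPerReplica(8, 100)
	fmt.Println("per-replica cap:", perReplicaCap)                          // 12
	fmt.Println("capped first wave:", probeBurstUpperBound(8, perReplicaCap)) // 96
}
```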

Limitations and tradeoffs

  • Lower fail threshold reacts faster, but can open on short-lived noise spikes.
  • Longer open duration reduces retry pressure, but extends degraded-mode windows.
  • Shared Redis state gives global coordination, but Redis outages force local fallback behavior.
  • Redis TTL expiry can cause probe bursts unless concurrency is bounded at dispatch or transport layers.
  • Fail-open keeps throughput during outages, but can bypass deny/approval rules temporarily.
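
One way to bound concurrency at the dispatch layer is a per-replica semaphore around safety checks. The sketch below uses a buffered channel for that purpose. It is an illustration, not a Cordum feature: `probeLimiter`, `TryAcquire`, `Release`, and `admitBurst` are hypothetical names. It does not create a global half-open quota; it only shrinks the fleet-wide envelope to replicas multiplied by the per-replica cap.

```go
package main

import (
	"fmt"
	"sync"
)

// probeLimiter caps concurrent safety checks on one replica.
type probeLimiter struct{ slots chan struct{} }

func newProbeLimiter(max int) *probeLimiter {
	return &probeLimiter{slots: make(chan struct{}, max)}
}

// TryAcquire takes a slot, or returns false immediately when the cap is reached.
func (l *probeLimiter) TryAcquire() bool {
	select {
	case l.slots <- struct{}{}:
		return true
	default:
		return false
	}
}

// Release frees a slot; it is a no-op if no slot is held.
func (l *probeLimiter) Release() {
	select {
	case <-l.slots:
	default:
	}
}

// admitBurst simulates n checks released at the same moment and returns how
// many got a slot. No slot is released during the burst.
func admitBurst(l *probeLimiter, n int) int {
	var (
		wg       sync.WaitGroup
		mu       sync.Mutex
		admitted int
	)
	for i := 0; i < n; i++ {
		wg.Add(1)
		go func() {
			defer wg.Done()
			if l.TryAcquire() {
				mu.Lock()
				admitted++
				mu.Unlock()
			}
		}()
	}
	wg.Wait()
	return admitted
}

func main() {
	limiter := newProbeLimiter(5)
	admitted := admitBurst(limiter, 20) // 20 checks wake up right after TTL expiry
	fmt.Println("admitted on this replica:", admitted)          // 5
	fmt.Println("fleet first wave with 8 replicas:", 8*admitted) // 40
}
```

Checks that fail to get a slot would need the same treatment as any other deferred check, for example a delayed requeue, so the cap does not turn into a tight retry loop.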

If you enable fail-open in production without alerting on `input_fail_open_total`, you have created a silent governance bypass channel.

Next step

Run this tuning drill this week:

  1. Inject Safety Kernel errors until the breaker opens and verify shared Redis key behavior across replicas.
  2. Measure the first 10s after TTL expiry and compare the observed probe surge with your envelope estimate.
  3. Keep `POLICY_CHECK_FAIL_MODE=closed` in production unless a temporary exception is approved.
  4. If fail-open is used, enforce a time-box and alert on bypass metrics and labels.

Continue with Safety Kernel Outage Playbook and Fail-Open Alerting.

Breaker tuning is policy tuning

Capacity protection and governance guarantees are linked. Tune both on purpose.