Deep Dive

AI Agent Safety Circuit Breaker Tuning

Bad thresholds either melt capacity or silently bypass policy checks.

11 min read · Mar 2026
TL;DR
  - Circuit breakers protect control-plane capacity, but threshold choices directly affect governance behavior.
  - Cordum uses Redis-backed shared failure counters, so one replica can open the breaker for all replicas.
  - Default constants are explicit: fail budget 3, open duration 30s, half-open probes 3, close-after-successes 2.
  - When safety is unavailable, `POLICY_CHECK_FAIL_MODE` decides requeue (`closed`) vs allow-with-bypass-label (`open`).
Shared state

Input and output safety clients share breaker state through Redis key counters.

Deterministic timing

Open window is TTL-driven (30s), so half-open probe timing is predictable.

Governance impact

Fail-open mode can bypass pre-dispatch checks; use with explicit alerting and scope limits.

Scope

This guide covers the pre-dispatch input safety and post-execution output safety breakers in the Cordum scheduler, with a focus on distributed behavior and outage-mode decisions.

The production problem

A safety service outage can trigger a second outage: every scheduler replica keeps retrying, queues swell, and critical paths starve.

Circuit breakers exist to prevent that. But in governance systems, breaker tuning has a side effect: it changes when jobs are blocked, retried, or allowed through during safety downtime.

This makes threshold tuning a policy decision, not only a resilience tweak.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Martin Fowler: Circuit Breaker | Core state concepts: closed, open, half-open, timeout and threshold rationale. | No distributed Redis-shared breaker behavior across scheduler replicas. |
| Microsoft Circuit Breaker pattern | State transitions and retry interaction in production microservices. | No policy-engine fail-open/fail-closed decision path for autonomous job dispatch. |
| Resilience4j CircuitBreaker docs | Finite-state machine tuning, windowing, thresholds, and half-open probes. | No mapping to pre-dispatch safety decisions and bypass labeling requirements. |

The missing layer is control-plane semantics: how `SafetyUnavailable` combines with fail mode and what evidence operators need during temporary fail-open windows.

State model and thresholds

| State | Trigger | Cordum behavior |
| --- | --- | --- |
| CLOSED | Normal operation; failures recorded | Opens after 3 failures (input/output safety clients) |
| OPEN | Fail budget exceeded | Redis key TTL is set to 30s; requests short-circuit to `SafetyUnavailable` |
| HALF_OPEN | Open TTL expires and probe traffic resumes | Up to 3 probe requests; closes after 2 successes, reopens on failure |
| LOCAL_FALLBACK | Redis unavailable | Per-replica in-memory breaker mirrors thresholds; cross-replica sharing is lost |

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
| --- | --- | --- |
| Input safety timeout | `SafetyClient` policy checks use 2s request timeout. | Bounds wait time before failure accounting and breaker updates. |
| Output safety timeout | Output checks use 100ms for metadata and 30s for content path. | Separates fast-path moderation from deep content evaluation latency. |
| Failure recording | Lua script performs `INCR` + `EXPIRE` atomically for failure key. | Avoids race conditions when multiple replicas fail at once. |
| Open detection | `IsOpen()` checks Redis failure counter against threshold. | One unhealthy replica can trip shared protection for all replicas quickly. |
| Safety unavailable handling | Engine requeues in fail-closed mode; allows with bypass labels in fail-open mode. | Fail mode controls availability vs governance strictness during outages. |
| Backoff behavior | Requeue path uses `safetyThrottleDelay = 5s` for `SafetyUnavailable`. | Prevents tight-loop retry storms while kernel recovers. |

Implementation examples

Atomic distributed failure recording (Go + Lua)

redis_circuit_breaker_failure.lua.go

```go
// Atomically increment the failure counter and set its TTL only when the
// key is first created, anchoring the open window to the first failure.
var recordFailureLua = redis.NewScript(`
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
`)

func (cb *RedisCircuitBreaker) RecordFailure(ctx context.Context) {
  ttlSec := int64(cb.openDuration.Seconds())
  if ttlSec <= 0 {
    ttlSec = 30
  }
  count, err := recordFailureLua.Run(ctx, cb.rdb, []string{cb.failuresKey}, ttlSec).Int64()
  if err != nil {
    // Don't swallow Redis errors: the local fallback path depends on them.
    slog.Warn("circuit-breaker: failed to record failure", "err", err)
    return
  }
  if count >= cb.failThreshold {
    slog.Warn("circuit-breaker: circuit opened", "failures", count)
  }
}
```

Fail-open vs fail-closed decision path (Go)

safety_unavailable_fail_mode.go

```go
case SafetyUnavailable:
  if e.isInputFailOpen() {
    if e.counterClient != nil {
      e.counterClient.Incr(lockCtx, "cordum:scheduler:input_fail_open_total")
    }
    record.Decision = SafetyAllow
    record.Reason = "fail-open: safety unavailable -- " + record.Reason
    req.Labels["safety_bypassed"] = "true"
    req.Labels["safety_bypass_reason"] = record.Reason
  } else {
    return RetryAfter(fmt.Errorf("safety unavailable: %s", record.Reason), 5*time.Second)
  }
```

Operator runbook baseline

safety_breaker_runbook.sh

```bash
# 1) Verify breaker keys and counters
redis-cli GET cordum:cb:safety:failures
redis-cli TTL cordum:cb:safety:failures

# 2) Alert on fail-open bypass increments
# metric: cordum_scheduler_input_fail_open_total

# 3) Keep fail mode closed in production by default
export POLICY_CHECK_FAIL_MODE=closed

# 4) If temporary fail-open is required, time-box it and monitor bypass labels
# label: safety_bypassed=true
```

Limitations and tradeoffs

  - Lower fail threshold reacts faster, but can open on short-lived noise spikes.
  - Longer open duration reduces retry pressure, but extends degraded-mode windows.
  - Shared Redis state gives global coordination, but Redis outages force local fallback behavior.
  - Fail-open keeps throughput during outages, but can bypass deny/approval rules temporarily.

If you enable fail-open in production without alerting on `input_fail_open_total`, you have created a silent governance bypass channel.

Next step

Run this tuning drill this week:

  1. Inject Safety Kernel errors until the breaker opens and verify shared Redis key behavior across replicas.
  2. Measure the open-window recovery path at 30s and confirm half-open probe outcomes in logs.
  3. Keep `POLICY_CHECK_FAIL_MODE=closed` in production unless a temporary exception is approved.
  4. If fail-open is used, enforce a time-box and alert on bypass metrics and labels.

Continue with Safety Kernel Outage Playbook and Fail-Open Alerting.

Breaker tuning is policy tuning

Capacity protection and governance guarantees are linked. Tune both on purpose.