## The production problem
A safety service outage can trigger a second outage: every scheduler replica keeps retrying, queues swell, and critical paths starve.
Circuit breakers exist to prevent that. But in governance systems, breaker tuning has a side effect: it changes when jobs are blocked, retried, or allowed through during safety downtime.
This makes threshold tuning a policy decision, not only a resilience tweak.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Martin Fowler: Circuit Breaker | Core state concepts: closed, open, half-open, timeout and threshold rationale. | No distributed Redis-shared breaker behavior across scheduler replicas. |
| Microsoft Circuit Breaker pattern | State transitions and retry interaction in production microservices. | No policy-engine fail-open/fail-closed decision path for autonomous job dispatch. |
| Resilience4j CircuitBreaker docs | Finite-state machine tuning, windowing, thresholds, and half-open probes. | No mapping to pre-dispatch safety decisions and bypass labeling requirements. |
The missing layer is control-plane semantics: how `SafetyUnavailable` combines with fail mode and what evidence operators need during temporary fail-open windows.
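That combination can be sketched as a tiny decision function. This is an illustrative sketch of the semantics only; `decideOnSafetyStatus` and the string constants are hypothetical names, not Cordum's actual API:

```go
package main

import "fmt"

// decideOnSafetyStatus sketches the control-plane rule: a
// SafetyUnavailable status combined with the configured fail mode
// determines whether a job is allowed (with bypass evidence attached)
// or requeued until the safety service recovers.
func decideOnSafetyStatus(status string, failOpen bool) (action string, labels map[string]string) {
	if status != "SafetyUnavailable" {
		return "evaluate", nil
	}
	if failOpen {
		// Fail-open: allow the job, but leave evidence for operators.
		return "allow", map[string]string{
			"safety_bypassed":      "true",
			"safety_bypass_reason": "fail-open: safety unavailable",
		}
	}
	// Fail-closed: requeue with backoff; nothing dispatches unchecked.
	return "requeue", nil
}

func main() {
	action, labels := decideOnSafetyStatus("SafetyUnavailable", true)
	fmt.Println(action, labels["safety_bypassed"]) // allow true
}
```

The point of the sketch is that the bypass labels are not optional decoration: they are the only evidence a fail-open window leaves behind.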
## State model and thresholds
| State | Trigger | Cordum behavior |
|---|---|---|
| CLOSED | Normal operation; failures recorded | Opens after 3 failures (input/output safety clients) |
| OPEN | Fail budget exceeded | Redis key TTL is set to 30s; requests short-circuit to `SafetyUnavailable` |
| HALF_OPEN | Open TTL expires and probe traffic resumes | Up to 3 probe requests; closes after 2 successes, reopens on failure |
| LOCAL_FALLBACK | Redis unavailable | Per-replica in-memory breaker mirrors thresholds; cross-replica sharing is lost |
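The HALF_OPEN row is the subtlest part of the table. A minimal in-memory sketch of that probe accounting, with the thresholds from the table hard-coded (the `halfOpenProbe` type and method names are illustrative, not the Cordum implementation):

```go
package main

import "fmt"

// halfOpenProbe tracks probe outcomes after the open TTL expires:
// up to 3 probe requests are admitted; 2 successes close the breaker,
// and any single failure reopens it immediately.
type halfOpenProbe struct {
	admitted  int
	successes int
}

// Admit reports whether another probe request may pass through.
func (p *halfOpenProbe) Admit() bool {
	if p.admitted >= 3 {
		return false // probe budget exhausted
	}
	p.admitted++
	return true
}

// Record returns the next breaker state given a probe outcome.
func (p *halfOpenProbe) Record(success bool) string {
	if !success {
		return "OPEN" // one failed probe reopens the breaker
	}
	p.successes++
	if p.successes >= 2 {
		return "CLOSED"
	}
	return "HALF_OPEN"
}

func main() {
	p := &halfOpenProbe{}
	p.Admit()
	fmt.Println(p.Record(true)) // HALF_OPEN
	p.Admit()
	fmt.Println(p.Record(true)) // CLOSED
}
```

In the real system this accounting would live behind the shared Redis key; the sketch only shows the transition rules.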
## Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Input safety timeout | `SafetyClient` policy checks use 2s request timeout. | Bounds wait time before failure accounting and breaker updates. |
| Output safety timeout | Output checks use 100ms for metadata and 30s for content path. | Separates fast-path moderation from deep content evaluation latency. |
| Failure recording | Lua script performs `INCR` + `EXPIRE` atomically for failure key. | Avoids race conditions when multiple replicas fail at once. |
| Open detection | `IsOpen()` checks Redis failure counter against threshold. | One unhealthy replica can trip shared protection for all replicas quickly. |
| Safety unavailable handling | Engine requeues in fail-closed mode; allows with bypass labels in fail-open mode. | Fail-mode controls availability vs governance strictness during outages. |
| Backoff behavior | Requeue path uses `safetyThrottleDelay = 5s` for `SafetyUnavailable`. | Prevents tight-loop retry storms while kernel recovers. |
## Implementation examples
### Atomic distributed failure recording (Go + Lua)
```go
var recordFailureLua = redis.NewScript(`
local count = redis.call('INCR', KEYS[1])
if count == 1 then
  redis.call('EXPIRE', KEYS[1], ARGV[1])
end
return count
`)

// RecordFailure atomically increments the shared failure counter and
// sets its TTL on the first increment, so all replicas draw from one
// failure budget without a check-then-set race.
func (cb *RedisCircuitBreaker) RecordFailure(ctx context.Context) {
	ttlSec := int64(cb.openDuration.Seconds())
	if ttlSec <= 0 {
		ttlSec = 30 // default open window
	}
	count, err := recordFailureLua.Run(ctx, cb.rdb, []string{cb.failuresKey}, ttlSec).Int64()
	if err != nil {
		slog.Warn("circuit-breaker: failure recording error", "error", err)
		return
	}
	if count >= cb.failThreshold {
		slog.Warn("circuit-breaker: circuit opened", "failures", count)
	}
}
```

### Fail-open vs fail-closed decision path (Go)
```go
case SafetyUnavailable:
	if e.isInputFailOpen() {
		if e.counterClient != nil {
			// Count every fail-open bypass so operators can alert on it.
			e.counterClient.Incr(lockCtx, "cordum:scheduler:input_fail_open_total")
		}
		record.Decision = SafetyAllow
		record.Reason = "fail-open: safety unavailable -- " + record.Reason
		req.Labels["safety_bypassed"] = "true"
		req.Labels["safety_bypass_reason"] = record.Reason
	} else {
		// Fail-closed: requeue with backoff while the Safety Kernel recovers.
		return RetryAfter(fmt.Errorf("safety unavailable: %s", record.Reason), 5*time.Second)
	}
```

### Operator runbook baseline
```shell
# 1) Verify breaker keys and counters
redis-cli GET cordum:cb:safety:failures
redis-cli TTL cordum:cb:safety:failures

# 2) Alert on fail-open bypass increments
#    metric: cordum_scheduler_input_fail_open_total

# 3) Keep fail mode closed in production by default
export POLICY_CHECK_FAIL_MODE=closed

# 4) If temporary fail-open is required, time-box it and monitor bypass labels
#    label: safety_bypassed=true
```
## Limitations and tradeoffs
- Lower fail threshold reacts faster, but can open on short-lived noise spikes.
- Longer open duration reduces retry pressure, but extends degraded-mode windows.
- Shared Redis state gives global coordination, but Redis outages force local fallback behavior.
- Fail-open keeps throughput during outages, but can bypass deny/approval rules temporarily.
If you enable fail-open in production without alerting on `input_fail_open_total`, you have created a silent governance bypass channel.
## Next step
Run this tuning drill this week:
1. Inject Safety Kernel errors until the breaker opens and verify shared Redis key behavior across replicas.
2. Measure the open-window recovery path at 30s and confirm half-open probe outcomes in logs.
3. Keep `POLICY_CHECK_FAIL_MODE=closed` in production unless a temporary exception is approved.
4. If fail-open is used, enforce a time-box and alert on bypass metrics and labels.
Continue with Safety Kernel Outage Playbook and Fail-Open Alerting.