
AI Agent Fail-Open vs Fail-Closed

Pick fail mode by risk and blast radius, not by instinct.

Guide · 10 min read · Mar 2026
TL;DR
  • Fail-open preserves availability but can bypass controls when dependencies are degraded.
  • Fail-closed protects safety boundaries but can block high-volume workflows during outages.
  • Use operation risk tiers to choose fail mode per path instead of one global default.
  • Alert on every fail-open decision so outages cannot hide behind successful throughput.
Risk-tiered defaults

Paths with critical side effects should fail closed; low-risk reads can fail open within limits.

Observable bypass

Every fail-open path needs a counter, alert, and incident response threshold.

Policy first

Never let dependency outages silently remove governance for high-risk actions.

Scope

This guide focuses on policy-check failures in autonomous systems where queue dispatch can trigger side-effecting operations.

The production problem

Safety dependencies fail in real deployments. The question is not whether they fail, but what your agent system does when they fail.

A global fail-open default keeps requests flowing. It can also bypass policy checks exactly when your control path is degraded.

A global fail-closed default protects boundaries. It can also halt useful low-risk workloads during transient outages.
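
The two defaults differ only in how an unavailable check is resolved. The following is a minimal sketch of that branch, assuming a hypothetical `resolve` helper and illustrative `Outcome` names; none of these identifiers come from a real API.

```go
package main

import "errors"

// FailMode and Outcome are illustrative names, not part of any real API.
type FailMode string

const (
	ModeOpen   FailMode = "open"
	ModeClosed FailMode = "closed"
)

type Outcome string

const (
	Allowed  Outcome = "allowed"  // checker answered: allow
	Denied   Outcome = "denied"   // checker answered: deny
	Blocked  Outcome = "blocked"  // checker unavailable, fail-closed
	Bypassed Outcome = "bypassed" // checker unavailable, fail-open
)

// ErrCheckerUnavailable stands in for a timeout or connection failure.
var ErrCheckerUnavailable = errors.New("safety checker unavailable")

// resolve maps a policy-check result to an outcome. The fail mode only
// matters when the checker returned an error; a real verdict always wins.
func resolve(allow bool, err error, mode FailMode) Outcome {
	if err == nil {
		if allow {
			return Allowed
		}
		return Denied
	}
	if mode == ModeOpen {
		return Bypassed
	}
	return Blocked
}
```

Keeping `Bypassed` distinct from `Allowed` is the point of the sketch: a bypass is not an approval, and downstream logging should be able to tell them apart.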

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Azure Circuit Breaker pattern | Strong state-machine design (Closed/Open/Half-Open), thresholds, and graceful degradation guidance. | No concrete governance model for autonomous agent side effects. |
| AWS Circuit breaker pattern | Clear timeout/retry/open-circuit flow and practical implementation patterns. | No policy-safety decision framework for AI dispatch pipelines. |
| Fail-Closed Alignment for LLMs (arXiv 2602.16977) | Strong fail-open vs fail-closed framing for LLM refusal robustness under jailbreak pressure. | No runtime control-plane blueprint for queueing, approvals, and replay decisions. |

Fail-mode decision matrix

One fail mode for all paths is usually wrong. Classify operations by impact and set fail behavior per class.

| Execution path | Recommended mode | Rationale | Required guardrail |
| --- | --- | --- | --- |
| Read-only assistance (low impact) | Fail-open with strict telemetry | Availability has higher value than strict blocking when no external mutation occurs. | Rate limits + explicit fail-open metric alert |
| Internal write (medium impact) | Fail-closed by default | Incorrect writes cause hidden data drift that is hard to reverse. | Manual override during incident with expiry |
| External side effects (high impact) | Fail-closed mandatory | Unsafe bypass can create irreversible actions in third-party systems. | Approval gate + idempotency key + audit event |
| Safety checker outage branch | Closed in production; open only with temporary incident policy | Dependency outage must not silently become policy bypass. | Time-boxed override + paging on every fail-open decision |
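
One way to keep the matrix enforceable is to hold it as data and look it up per path. The sketch below assumes hypothetical path labels (`read_only`, `internal_write`, `external_effect`) and a made-up `PathPolicy` type. It covers the first three rows, and the outage branch is represented only by the `Overridable` flag.

```go
package main

type FailMode string

const (
	ModeOpen   FailMode = "open"
	ModeClosed FailMode = "closed"
)

// PathPolicy pairs a fail mode with the guardrails the matrix requires.
type PathPolicy struct {
	Mode        FailMode
	Overridable bool     // may a time-boxed incident override open this path?
	Guardrails  []string // controls that must exist before the mode is used
}

// matrix mirrors the decision matrix; the keys are hypothetical labels.
var matrix = map[string]PathPolicy{
	"read_only": {
		Mode:       ModeOpen,
		Guardrails: []string{"rate limit", "fail-open metric alert"},
	},
	"internal_write": {
		Mode:        ModeClosed,
		Overridable: true,
		Guardrails:  []string{"manual override with expiry"},
	},
	"external_effect": {
		Mode:       ModeClosed,
		Guardrails: []string{"approval gate", "idempotency key", "audit event"},
	},
}

// policyFor returns the policy for a path class. Unclassified paths get the
// strictest treatment so a missing entry can never become a bypass.
func policyFor(class string) PathPolicy {
	if p, ok := matrix[class]; ok {
		return p
	}
	return PathPolicy{Mode: ModeClosed}
}
```

Defaulting unknown classes to closed and non-overridable means a newly added action path stays blocked during an outage until someone classifies it.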

Cordum runtime implications

| Implication | Current behavior | Why it matters |
| --- | --- | --- |
| Gateway submit-time policy fallback | `GATEWAY_POLICY_FAIL_MODE=closed` by default; `open` allows through on safety-unavailable branch | Submit path can reject or allow before state persistence and bus publish. |
| Scheduler pre-dispatch fallback | `POLICY_CHECK_FAIL_MODE=closed` by default; `open` allows dispatch with warning and metric | Dispatch path controls whether unavailable safety checks block or pass jobs. |
| Input circuit breaker thresholds | 2s safety timeout, fail budget 3, open duration 30s, half-open probe cap 3 | Outage handling is deterministic; mode choice determines whether traffic is blocked or bypassed. |
| Fail-open observability | `cordum_scheduler_input_fail_open_total` counter increments on each bypassed check | Operations can detect and cap governance bypass during dependency incidents. |
| Policy default stance | `default_decision: deny` in safety policy (fail-closed for unmatched jobs) | Unmatched requests are rejected unless explicitly allowed by policy rules. |
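
The breaker thresholds in the table describe a conventional three-state circuit breaker. The following is a rough sketch of how such a breaker could behave, not Cordum's implementation: the `Breaker` type, its field names, and the exact transition rules (consecutive failures, any half-open failure reopens) are assumptions. The 2s timeout is left to the caller, which reports a timed-out check as a failure.

```go
package main

import "time"

type state int

const (
	stateClosed state = iota
	stateOpen
	stateHalfOpen
)

// Breaker is a single-goroutine sketch; add a mutex before sharing it.
type Breaker struct {
	FailBudget int           // consecutive failures before opening
	OpenFor    time.Duration // how long to stay open before probing
	ProbeCap   int           // max probes admitted while half-open

	st       state
	failures int
	probes   int
	openedAt time.Time
}

// NewBreaker uses the thresholds listed in the table.
func NewBreaker() *Breaker {
	return &Breaker{FailBudget: 3, OpenFor: 30 * time.Second, ProbeCap: 3}
}

// Allow reports whether the safety checker should be called at time now.
// When it returns false, the caller applies its fail mode (block or bypass).
func (b *Breaker) Allow(now time.Time) bool {
	switch b.st {
	case stateOpen:
		if now.Sub(b.openedAt) < b.OpenFor {
			return false
		}
		// Open duration elapsed: start admitting a limited number of probes.
		b.st, b.probes = stateHalfOpen, 0
		fallthrough
	case stateHalfOpen:
		if b.probes >= b.ProbeCap {
			return false
		}
		b.probes++
		return true
	}
	return true
}

// Record feeds back the result of a call that Allow admitted.
func (b *Breaker) Record(ok bool, now time.Time) {
	if ok {
		b.st, b.failures = stateClosed, 0
		return
	}
	b.failures++
	if b.st == stateHalfOpen || b.failures >= b.FailBudget {
		b.st, b.openedAt = stateOpen, now
	}
}
```

The breaker only decides whether the checker is called. What happens to the job when `Allow` returns false is still governed by the configured fail mode, which is why the two settings need to be reviewed together.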

Implementation examples

Risk-tier fail-mode selector (Go)

failmode.go
Go
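package failmode

// RiskTier classifies an operation by the impact of running it unchecked.
type RiskTier string

// FailMode is the behavior applied when the policy check is unavailable.
type FailMode string

const (
	TierLow  RiskTier = "low"
	TierHigh RiskTier = "high"

	ModeOpen   FailMode = "open"
	ModeClosed FailMode = "closed"
)

// chooseFailMode returns the fail mode for a tier. High-risk paths always
// fail closed. Every other tier also fails closed unless an incident
// override is active.
func chooseFailMode(tier RiskTier, emergencyOverride bool) FailMode {
	if tier == TierHigh {
		return ModeClosed // never overridable
	}

	if emergencyOverride {
		return ModeOpen
	}

	return ModeClosed
}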
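
The decision matrix asks for overrides that expire, so the `emergencyOverride` flag is safer when derived from a record with a deadline than from a static boolean. This is a minimal sketch assuming a hypothetical `Override` type; its fields are illustrative and do not come from Cordum.

```go
package main

import "time"

// Override is a hypothetical incident record; the fields are illustrative.
type Override struct {
	Reason    string    // incident reference or justification
	GrantedBy string    // operator who approved the override
	ExpiresAt time.Time // hard deadline after which the override is void
}

// Active reports whether the override may still be honored at time now.
// A nil or incomplete override is never active.
func (o *Override) Active(now time.Time) bool {
	if o == nil || o.Reason == "" || o.GrantedBy == "" {
		return false
	}
	return now.Before(o.ExpiresAt)
}
```

A caller would then pass `override.Active(time.Now())` as the second argument to `chooseFailMode`, so an override nobody remembered to remove lapses on its own.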

Safety policy defaults (YAML)

safety.yaml
YAML
# config/safety.yaml
default_decision: deny

input_policy:
  fail_mode: closed

# temporary incident override (time-boxed)
# input_policy:
#   fail_mode: open
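
Whether the value arrives from `input_policy.fail_mode` or from an environment variable such as `POLICY_CHECK_FAIL_MODE`, the parsing step should itself fail closed. How Cordum parses these values is not stated here, so the sketch below is an assumption: `parseFailMode` and `schedulerFailMode` are hypothetical helpers.

```go
package main

import (
	"os"
	"strings"
)

type FailMode string

const (
	ModeOpen   FailMode = "open"
	ModeClosed FailMode = "closed"
)

// parseFailMode normalizes a configured value. Anything other than an
// explicit "open" resolves to closed, so a typo or an empty setting cannot
// disable enforcement.
func parseFailMode(raw string) FailMode {
	if strings.EqualFold(strings.TrimSpace(raw), string(ModeOpen)) {
		return ModeOpen
	}
	return ModeClosed
}

// schedulerFailMode reads the scheduler's fail-mode environment variable.
func schedulerFailMode() FailMode {
	return parseFailMode(os.Getenv("POLICY_CHECK_FAIL_MODE"))
}
```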

Fail-open alert query (PromQL)

alerts.promql
PromQL
# Alert when policy checks are bypassed under fail-open mode
sum(rate(cordum_scheduler_input_fail_open_total[5m])) > 0
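
An alert detects bypass after the fact. An incident response threshold can also be enforced in-process by capping how many fail-open decisions are tolerated. The following is a sketch of that idea, assuming a hypothetical `BypassBudget` type that is not part of Cordum; a production version would also reset the budget per time window.

```go
package main

import "sync/atomic"

// BypassBudget caps how many fail-open decisions are tolerated before the
// path reverts to fail-closed. The type and its limit are illustrative.
type BypassBudget struct {
	limit int64
	used  int64 // accessed only through sync/atomic
}

// NewBypassBudget returns a budget that tolerates at most limit bypasses.
func NewBypassBudget(limit int64) *BypassBudget {
	return &BypassBudget{limit: limit}
}

// TryBypass records one fail-open attempt and reports whether it is still
// within budget. Once the budget is spent, every call returns false and the
// caller should block the job instead of bypassing the check.
func (b *BypassBudget) TryBypass() bool {
	return atomic.AddInt64(&b.used, 1) <= b.limit
}

// Used returns the number of bypass attempts, including rejected ones.
func (b *BypassBudget) Used() int64 {
	return atomic.LoadInt64(&b.used)
}
```

Counting rejected attempts as well keeps the number useful during an incident: it shows how much traffic would have bypassed policy if the cap had not been in place.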

Limitations and tradeoffs

  • Fail-closed can reduce incident blast radius but increase operational interruption.
  • Fail-open can protect availability but introduces governance bypass risk.
  • Mixed-mode design needs stronger runbooks and clearer operator ownership.
  • Metrics alone are not enough; alerts need enforced response thresholds.

Next step

Run this in one sprint:

  1. Classify every agent action path into low, medium, or high impact.
  2. Set fail mode per path and document justified exceptions.
  3. Add an alert for any non-zero fail-open bypass in production.
  4. Run one safety-kernel outage game day and verify that each path behaves as configured.

Continue with AI Agent Circuit Breaker Pattern and Pre-Dispatch Governance for AI Agents.

Availability and safety are both requirements

Good systems survive outages without silently dropping their safety boundaries.