Name: Cordum
Author: Cordum

The production problem

Safety dependencies fail in real deployments. The question is not whether they fail, but what your agent system does when they fail.

A global fail-open default keeps requests flowing. It can also bypass policy checks exactly when your control path is degraded.

A global fail-closed default protects boundaries. It can also halt useful low-risk workloads during transient outages.

What top results miss

Source	Strong coverage	Missing piece
Azure Circuit Breaker pattern	Strong state-machine design (Closed/Open/Half-Open), thresholds, and graceful degradation guidance.	No concrete governance model for autonomous agent side effects.
AWS Circuit breaker pattern	Clear timeout/retry/open-circuit flow and practical implementation patterns.	No policy-safety decision framework for AI dispatch pipelines.
Fail-Closed Alignment for LLMs (arXiv 2602.16977)	Strong fail-open vs fail-closed framing for LLM refusal robustness under jailbreak pressure.	No runtime control-plane blueprint for queueing, approvals, and replay decisions.

Fail-mode decision matrix

One fail mode for all paths is usually wrong. Classify operations by impact and set fail behavior per class.

Execution path	Recommended mode	Rationale	Required guardrail
Read-only assistance (low impact)	Fail-open with strict telemetry	Availability has higher value than strict blocking when no external mutation occurs.	Rate limits + explicit fail-open metric alert
Internal write (medium impact)	Fail-closed by default	Incorrect writes cause hidden data drift that is hard to reverse.	Manual override during incident with expiry
External side effects (high impact)	Fail-closed mandatory	Unsafe bypass can create irreversible actions in third-party systems.	Approval gate + idempotency key + audit event
Safety checker outage branch	Closed in production; open only with temporary incident policy	Dependency outage must not silently become policy bypass.	Time-boxed override + paging on every fail-open decision
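
The first three rows of the matrix can be kept as data rather than scattered conditionals, so a reviewer can audit fail behavior in one place. The following is a minimal sketch, not Cordum's implementation: `PathClass`, `Rule`, `rules`, and `ruleFor` are hypothetical names invented for illustration, and the fourth row (the time-boxed incident override) is deliberately left out.

```go
package main

// PathClass, Rule, rules, and ruleFor are hypothetical names used for illustration.
type PathClass string

const (
	ReadOnly       PathClass = "read_only"       // low impact
	InternalWrite  PathClass = "internal_write"  // medium impact
	ExternalEffect PathClass = "external_effect" // high impact
)

// Rule captures one row of the decision matrix.
type Rule struct {
	FailOpen      bool   // allow the operation when the safety check is unavailable
	AllowOverride bool   // an incident override may temporarily open the path
	Guardrail     string // control that must accompany the mode
}

var rules = map[PathClass]Rule{
	ReadOnly:       {FailOpen: true, Guardrail: "rate limits + fail-open metric alert"},
	InternalWrite:  {AllowOverride: true, Guardrail: "manual override with expiry"},
	ExternalEffect: {Guardrail: "approval gate + idempotency key + audit event"},
}

// ruleFor returns the rule for a class. An unclassified path gets the zero
// Rule, which is fail-closed with no override allowed.
func ruleFor(c PathClass) Rule {
	return rules[c]
}
```

Relying on the zero value means a path that nobody classified is blocked during an outage instead of silently bypassing policy.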

Cordum runtime implications

Implication	Current behavior	Why it matters
Gateway submit-time policy fallback	`GATEWAY_POLICY_FAIL_MODE=closed` by default; `open` lets requests through when the safety check is unavailable	The submit path can reject or allow a request before state persistence and bus publish.
Scheduler pre-dispatch fallback	`POLICY_CHECK_FAIL_MODE=closed` by default; `open` allows dispatch and emits a warning and a metric	The dispatch path controls whether unavailable safety checks block or pass jobs.
Input circuit breaker thresholds	2s safety timeout, fail budget 3, open duration 30s, half-open probe cap 3	Outage handling is deterministic; mode choice determines whether traffic is blocked or bypassed.
Fail-open observability	`cordum_scheduler_input_fail_open_total` counter increments on each bypassed check	Operations can detect and cap governance bypass during dependency incidents.
Policy default stance	`default_decision: deny` in safety policy (fail-closed for unmatched jobs)	Unmatched requests are rejected unless explicitly allowed by policy rules.
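
To make the threshold row concrete, here is a generic Closed/Open/Half-Open breaker wired with the numbers from the table (fail budget 3, open duration 30s, half-open probe cap 3). It is a sketch under assumptions: the state machine is the textbook pattern, not Cordum's code, and names such as `Breaker`, `Allow`, and `Record` are hypothetical. The 2s safety timeout is assumed to be applied by the caller around each check. Note that a breaker in the open state only says the check is skipped; whether the job is then blocked or passed is the fail mode's decision.

```go
package main

import "time"

// Thresholds from the table above. Everything else here is an assumed,
// generic breaker design with hypothetical names.
const (
	failBudget   = 3                // consecutive failures before opening
	openDuration = 30 * time.Second // how long the breaker stays open
	probeCap     = 3                // max probes admitted while half-open
)

type breakerState int

const (
	stateClosed breakerState = iota
	stateOpen
	stateHalfOpen
)

// Breaker is not safe for concurrent use; guard it with a mutex if shared.
type Breaker struct {
	st       breakerState
	failures int
	probes   int
	openedAt time.Time
}

// Allow reports whether a safety check may be attempted at time now.
func (b *Breaker) Allow(now time.Time) bool {
	switch b.st {
	case stateOpen:
		if now.Sub(b.openedAt) < openDuration {
			return false
		}
		// Open period elapsed: start admitting a limited number of probes.
		b.st, b.probes = stateHalfOpen, 0
		fallthrough
	case stateHalfOpen:
		if b.probes >= probeCap {
			return false
		}
		b.probes++
		return true
	default:
		return true
	}
}

// Record stores the outcome of an attempted check.
func (b *Breaker) Record(now time.Time, ok bool) {
	if ok {
		b.st, b.failures = stateClosed, 0
		return
	}
	b.failures++
	// A failed probe reopens immediately; otherwise open once the budget is spent.
	if b.st == stateHalfOpen || b.failures >= failBudget {
		b.st, b.openedAt = stateOpen, now
	}
}
```

Passing `now` explicitly keeps the timing logic deterministic, which makes it easy to replay an outage timeline in a test.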

Implementation examples

Risk-tier fail-mode selector (Go)

failmode.go

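package failmode

// RiskTier classifies an operation by the impact of running it unchecked.
type RiskTier string

// FailMode is the behavior applied when the safety check is unavailable.
type FailMode string

const (
	TierLow    RiskTier = "low"
	TierMedium RiskTier = "medium"
	TierHigh   RiskTier = "high"

	ModeOpen   FailMode = "open"
	ModeClosed FailMode = "closed"
)

// chooseFailMode returns the fail mode for a tier.
// High-impact work is always closed, even under an emergency override.
// Every other tier stays closed unless an emergency override is active.
func chooseFailMode(tier RiskTier, emergencyOverride bool) FailMode {
	if tier == TierHigh {
		return ModeClosed
	}

	if emergencyOverride {
		return ModeOpen
	}

	return ModeClosed
}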

Safety policy defaults (YAML)

safety.yaml

# config/safety.yaml
default_decision: deny

input_policy:
  fail_mode: closed

# temporary incident override (time-boxed)
# input_policy:
#   fail_mode: open

Fail-open alert query (PromQL)

alerts.promql

# Alert when policy checks are bypassed under fail-open mode
sum(rate(cordum_scheduler_input_fail_open_total[5m])) > 0

Limitations and tradeoffs

- Fail-closed can reduce incident blast radius but increase operational interruption.
- Fail-open can protect availability but introduces governance bypass risk.
- Mixed-mode design needs stronger runbooks and clearer operator ownership.
- Metrics alone are not enough; alerts need enforced response thresholds.

Next step

Run this in one sprint:

1. Classify every agent action path into low, medium, or high impact.
2. Set fail mode per path and document justified exceptions.
3. Add an alert for any non-zero fail-open bypass in production.
4. Run one safety-kernel outage game day and verify that each path blocks or bypasses as configured.

Continue with AI Agent Circuit Breaker Pattern and Pre-Dispatch Governance for AI Agents.

AI Agent Fail-Open vs Fail-Closed