
AI Agent Fail-Open Alerting

If bypass can happen, paging on bypass is mandatory.

Guide · 10 min read · Mar 2026
TL;DR
  • Fail-open is not the incident. Undetected fail-open is the incident.
  • Cordum emits `cordum_scheduler_input_fail_open_total` exactly for this failure mode.
  • Use multi-window alerts so short spikes do not page, but real bypass sustained for minutes does.
  • Pair bypass alerts with `cordum_safety_unavailable_total` to separate policy outage from normal deny behavior.
Bypass as the first signal

Page on safety-bypass rate, not only on generic job failure counters.

Multi-window rules

Use paired windows to cut noise and still detect real drift quickly.

Fast containment

When bypass rises in production, switch to fail-closed and drain safely.

Scope

This guide covers alerting policy for pre-dispatch safety bypass in autonomous AI control planes where fail-open is possible by design.

The production problem

A safety dependency outage happens. The team switches to fail-open to preserve throughput. Jobs keep flowing. Nobody gets paged. That is how policy bypass becomes invisible.

Looking only at `failed` jobs misses this class of risk because fail-open is counted as successful processing. You need dedicated bypass telemetry and dedicated alerting.

The goal is not zero fail-open forever. The goal is zero unobserved fail-open.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Google SRE Workbook: Alerting on SLOs | Multi-window and burn-rate concepts for high-signal alerting. | No metric strategy for safety-bypass signals in AI policy enforcement paths. |
| AWS CloudWatch SLO alarms | Composite alarms combining short and long burn-rate windows. | No direct mapping from fail-open counters to governance risk in autonomous dispatch. |
| Prometheus alerting rules | `expr`, `for`, and `keep_firing_for` mechanics for stable alert behavior. | No opinionated policy for fail-open response phases (page, ticket, freeze). |

The missing pattern is governance-aware alerting: bypass-specific burn rules and predefined containment actions.

Alert policy matrix

| Alert level | Condition | Objective | Action |
| --- | --- | --- | --- |
| Page (P1) | Bypass ratio > 0.1% on 1h and 5m windows | Detect unsafe sustained bypass in under 5-10 minutes | Page on-call + incident commander |
| Ticket (P3) | Bypass ratio > 0.01% on 6h and 30m windows | Catch slow safety drift before it becomes outage-level risk | Create reliability debt item within the same business day |
| Deploy freeze | Any bypass in production for > 15m while fail-open is enabled | Stop making the risk state worse during live safety degradation | Halt rollouts until bypass returns to zero |
| Informational | Non-prod bypass > 0 with explicit experiment label | Keep observability complete in staging/chaos drills | Log and dashboard only |
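
The deploy-freeze row can also be written as a Prometheus rule. A sketch, with the alert name and `severity: freeze` label illustrative (they would need to be wired to whatever gates your rollouts); because the counter only increments in open mode, any increase already implies fail-open was enabled:

```yaml
- alert: CordumDeployFreeze
  # cordum_scheduler_input_fail_open_total only moves in open mode,
  # so any increase over 15m means production bypass under fail-open.
  # The 15m lookback window itself enforces the "for > 15m" condition.
  expr: sum(increase(cordum_scheduler_input_fail_open_total[15m])) > 0
  labels:
    severity: freeze
  annotations:
    summary: "Production bypass under fail-open: halt rollouts"
```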

Cordum signal mapping

| Signal | Metric / config | Operational meaning |
| --- | --- | --- |
| Fail-open counter | `cordum_scheduler_input_fail_open_total` | Incremented each time the scheduler allows a job through while safety is unavailable in open mode. |
| Safety unavailable counter | `cordum_safety_unavailable_total` | Tracks jobs deferred because the safety kernel was unreachable. |
| Scheduler fail mode | `POLICY_CHECK_FAIL_MODE` (`closed` default, `open` optional) | Controls whether unavailable safety checks requeue or bypass at dispatch time. |
| Gateway fail mode | `GATEWAY_POLICY_FAIL_MODE` (`closed` default, `open` optional) | Controls submit-time behavior when the safety kernel is unreachable. |
| Admin health context | `input_fail_open_total` in admin health payload | Gives quick control-plane context without digging into raw Prometheus output. |
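
The two fail-mode variables in the table are plain environment settings. A minimal Kubernetes sketch of the safe defaults (the manifest fragments are illustrative; each variable lives on its own deployment):

```yaml
# Scheduler deployment: dispatch-time behavior
env:
  - name: POLICY_CHECK_FAIL_MODE
    value: "closed"   # requeue jobs when the safety kernel is unreachable

# Gateway deployment: submit-time behavior
env:
  - name: GATEWAY_POLICY_FAIL_MODE
    value: "closed"   # fail closed at submit time when the safety kernel is unreachable
```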

Implementation examples

PromQL queries

fail_open_queries.promql
PromQL
# Bypass ratio (all topics)
sum(rate(cordum_scheduler_input_fail_open_total[5m]))
/
clamp_min(sum(rate(cordum_jobs_received_total[5m])), 0.001)

# Bypass ratio by topic
sum by (topic) (rate(cordum_scheduler_input_fail_open_total[5m]))
/
clamp_min(sum by (topic) (rate(cordum_jobs_received_total[5m])), 0.001)

# Correlate with safety outage pressure
sum(rate(cordum_safety_unavailable_total[5m]))

Prometheus alert rules

fail_open_alerts.yml
YAML
groups:
  - name: cordum-fail-open
    rules:
      - alert: CordumFailOpenBypassPage
        expr: |
          (
            (
              sum(rate(cordum_scheduler_input_fail_open_total[1h]))
              / clamp_min(sum(rate(cordum_jobs_received_total[1h])), 0.001)
            ) > 0.001
            and
            (
              sum(rate(cordum_scheduler_input_fail_open_total[5m]))
              / clamp_min(sum(rate(cordum_jobs_received_total[5m])), 0.001)
            ) > 0.001
          )
        for: 5m
        keep_firing_for: 10m
        labels:
          severity: page
        annotations:
          summary: "Safety bypass ratio is above 0.1%"
          description: "Fail-open safety bypass is sustained on short and long windows."

      - alert: CordumFailOpenBypassTicket
        expr: |
          (
            (
              sum(rate(cordum_scheduler_input_fail_open_total[6h]))
              / clamp_min(sum(rate(cordum_jobs_received_total[6h])), 0.001)
            ) > 0.0001
            and
            (
              sum(rate(cordum_scheduler_input_fail_open_total[30m]))
              / clamp_min(sum(rate(cordum_jobs_received_total[30m])), 0.001)
            ) > 0.0001
          )
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Safety bypass trend is rising"
          description: "Open reliability debt issue and review policy dependency health."
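
The paired-window expressions repeat the same ratio four times. Recording rules can precompute it so the alert expressions stay short; a sketch using the same metric names (the `cordum:bypass_ratio:*` rule names are illustrative):

```yaml
groups:
  - name: cordum-fail-open-recording
    rules:
      # Precompute the bypass ratio per window; alerts can then use
      # e.g. cordum:bypass_ratio:rate1h > 0.001 and cordum:bypass_ratio:rate5m > 0.001
      - record: cordum:bypass_ratio:rate5m
        expr: |
          sum(rate(cordum_scheduler_input_fail_open_total[5m]))
          / clamp_min(sum(rate(cordum_jobs_received_total[5m])), 0.001)
      - record: cordum:bypass_ratio:rate1h
        expr: |
          sum(rate(cordum_scheduler_input_fail_open_total[1h]))
          / clamp_min(sum(rate(cordum_jobs_received_total[1h])), 0.001)
```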

Incident runbook snippet

fail_open_runbook.sh
Bash
# 1) Confirm bypass and outage signals
curl -s http://localhost:9090/metrics | grep -E "cordum_scheduler_input_fail_open_total|cordum_safety_unavailable_total"

# 2) Check runtime mode on scheduler and gateway
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# 3) Contain: switch scheduler to fail-closed in production
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed

# 4) Verify bypass stops
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "fail-open|safety kernel unavailable"
curl -s http://localhost:9090/metrics | grep cordum_scheduler_input_fail_open_total
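
When eyeballing raw counters in step 4, it can help to turn two counter deltas into a ratio at the shell. A minimal sketch mirroring the `clamp_min` guard from the PromQL queries (the sample values are illustrative):

```shell
# bypass_ratio BYPASS_DELTA RECEIVED_DELTA -> prints the bypass ratio,
# guarded against divide-by-zero the same way the PromQL queries are.
bypass_ratio() {
  awk -v b="$1" -v r="$2" 'BEGIN {
    if (r < 0.001) r = 0.001   # same clamp_min guard as the queries
    printf "%.6f\n", b / r
  }'
}

# Example: 3 bypassed jobs out of 12000 received over the sampling interval
bypass_ratio 3 12000   # prints 0.000250
```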

Limitations and tradeoffs

  • Very low thresholds can create noise during small controlled experiments.
  • High thresholds reduce noise but can delay detection of real policy bypass.
  • Topic-level alerts add precision but increase rule count and maintenance cost.
  • Global aggregate alerts are simpler but can hide one high-risk topic.

A fail-open counter that never alerts is equivalent to not having the counter.

Next step

Execute this sequence this week:

  1. Add the two bypass burn-rate alerts to production Prometheus.
  2. Run a controlled safety-kernel outage drill in staging and validate page/ticket timing.
  3. Document who can switch `POLICY_CHECK_FAIL_MODE` and for how long.
  4. Add deploy-freeze automation when the page-level bypass alert is firing.

Continue with AI Agent Safety Check Timeout Tuning and AI Agent Fail-Open vs Fail-Closed.

Page on bypass, not on hope

Safety bypass is a governance signal. Treat it with first-class on-call response.