The production problem
A safety dependency outage happens. The team switches to fail-open to preserve throughput. Jobs keep flowing. Nobody gets paged. That is how policy bypass becomes invisible.
Looking only at `failed` jobs misses this class of risk because fail-open is counted as successful processing. You need dedicated bypass telemetry and dedicated alerting.
The goal is not zero fail-open forever. The goal is zero unobserved fail-open.
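A minimal sketch of why failure-based views hide this risk, using hypothetical job records: a job dispatched through fail-open still reports success, so a `failed`-only query returns zero while bypass is nonzero.

```python
# Hypothetical job records: fail-open jobs complete "successfully",
# so counting failures alone reports zero risk during a bypass.
jobs = [
    {"status": "succeeded", "safety_checked": True},
    {"status": "succeeded", "safety_checked": False},  # fail-open bypass
    {"status": "failed",    "safety_checked": True},
]

failed = sum(1 for j in jobs if j["status"] == "failed")
bypassed = sum(
    1 for j in jobs
    if j["status"] == "succeeded" and not j["safety_checked"]
)

print(f"failed={failed} bypassed={bypassed}")
```

The failure count and the bypass count are independent signals; only a dedicated bypass counter surfaces the second one.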
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google SRE Workbook: Alerting on SLOs | Multi-window and burn-rate concepts for high-signal alerting. | No metric strategy for safety-bypass signals in AI policy enforcement paths. |
| AWS CloudWatch SLO alarms | Composite alarms combining short and long burn-rate windows. | No direct mapping from fail-open counters to governance risk in autonomous dispatch. |
| Prometheus alerting rules | `expr`, `for`, and `keep_firing_for` mechanics for stable alert behavior. | No opinionated policy for fail-open response phases (page, ticket, freeze). |
The missing pattern is governance-aware alerting: bypass-specific burn rules and predefined containment actions.
Alert policy matrix
| Alert level | Condition | Objective | Action |
|---|---|---|---|
| Page (P1) | Bypass ratio > 0.1% on 1h and 5m windows | Detect sustained unsafe bypass within 5 to 10 minutes | Page on-call + incident commander |
| Ticket (P3) | Bypass ratio > 0.01% on 6h and 30m windows | Catch slow safety drift before it becomes outage-level risk | Create reliability debt item within same business day |
| Deploy freeze | Any bypass in production for > 15m while fail-open is enabled | Stop making risk state worse during live safety degradation | Halt rollouts until bypass returns to zero |
| Informational | Non-prod bypass > 0 with explicit experiment label | Keep observability complete in staging/chaos drills | Log and dashboard only |
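The matrix above can be sketched as a small decision function. This is an illustrative sketch, not Cordum code; the function name and signature are assumptions, and the thresholds are the dual-window ratios from the table (0.1% = 0.001, 0.01% = 0.0001).

```python
def alert_level(ratio_1h: float, ratio_5m: float,
                ratio_6h: float, ratio_30m: float) -> str:
    """Map dual-window bypass ratios to an alert level.

    A level fires only when both its long and short windows breach,
    which filters out brief blips and long-decayed spikes alike.
    """
    if ratio_1h > 0.001 and ratio_5m > 0.001:
        return "page"      # sustained bypass above 0.1%
    if ratio_6h > 0.0001 and ratio_30m > 0.0001:
        return "ticket"    # slow drift above 0.01%
    return "ok"

# A short 5m spike with a clean 1h window does not page:
print(alert_level(0.0002, 0.005, 0.0, 0.0))
```

The deploy-freeze and informational rows key off different inputs (fail mode, environment labels) and are handled separately.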
Cordum signal mapping
| Signal | Metric / config | Operational meaning |
|---|---|---|
| Fail-open counter | `cordum_scheduler_input_fail_open_total` | Incremented each time scheduler allows a job through while safety is unavailable in open mode. |
| Safety unavailable counter | `cordum_safety_unavailable_total` | Tracks jobs deferred because safety kernel was unreachable. |
| Scheduler fail mode | `POLICY_CHECK_FAIL_MODE` (`closed` default, `open` optional) | Controls whether unavailable safety checks requeue or bypass at dispatch time. |
| Gateway fail mode | `GATEWAY_POLICY_FAIL_MODE` (`closed` default, `open` optional) | Controls submit-time behavior when safety kernel is unreachable. |
| Admin health context | `input_fail_open_total` in admin health payload | Gives quick control-plane context without digging into raw Prometheus output. |
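For ad-hoc checks outside Prometheus, the two counters can be pulled straight from the text exposition endpoint. A minimal sketch: the parser below handles the standard Prometheus text format, and the sample payload (label set and values) is made up for illustration.

```python
def parse_counters(exposition: str, names: set[str]) -> dict[str, float]:
    """Sum counter values for the given metric names from Prometheus
    text exposition output, ignoring HELP/TYPE comments and labels."""
    out: dict[str, float] = {}
    for line in exposition.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        metric, _, value = line.rpartition(" ")
        base = metric.split("{", 1)[0]  # strip any {label="..."} set
        if base in names:
            out[base] = out.get(base, 0.0) + float(value)
    return out

sample = """\
# TYPE cordum_scheduler_input_fail_open_total counter
cordum_scheduler_input_fail_open_total{topic="dispatch"} 7
cordum_safety_unavailable_total 42
"""

print(parse_counters(sample, {
    "cordum_scheduler_input_fail_open_total",
    "cordum_safety_unavailable_total",
}))
```

Any nonzero fail-open count in a fail-closed environment is itself a finding worth investigating.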
Implementation examples
PromQL queries
```promql
# Bypass ratio (all topics)
sum(rate(cordum_scheduler_input_fail_open_total[5m]))
  / clamp_min(sum(rate(cordum_jobs_received_total[5m])), 0.001)

# Bypass ratio by topic
sum by (topic) (rate(cordum_scheduler_input_fail_open_total[5m]))
  / clamp_min(sum by (topic) (rate(cordum_jobs_received_total[5m])), 0.001)

# Correlate with safety outage pressure
sum(rate(cordum_safety_unavailable_total[5m]))
```
Prometheus alert rules
```yaml
groups:
  - name: cordum-fail-open
    rules:
      - alert: CordumFailOpenBypassPage
        expr: |
          (
            (
              sum(rate(cordum_scheduler_input_fail_open_total[1h]))
              / clamp_min(sum(rate(cordum_jobs_received_total[1h])), 0.001)
            ) > 0.001
            and
            (
              sum(rate(cordum_scheduler_input_fail_open_total[5m]))
              / clamp_min(sum(rate(cordum_jobs_received_total[5m])), 0.001)
            ) > 0.001
          )
        for: 5m
        keep_firing_for: 10m
        labels:
          severity: page
        annotations:
          summary: "Safety bypass ratio is above 0.1%"
          description: "Fail-open safety bypass is sustained on short and long windows."
      - alert: CordumFailOpenBypassTicket
        expr: |
          (
            (
              sum(rate(cordum_scheduler_input_fail_open_total[6h]))
              / clamp_min(sum(rate(cordum_jobs_received_total[6h])), 0.001)
            ) > 0.0001
            and
            (
              sum(rate(cordum_scheduler_input_fail_open_total[30m]))
              / clamp_min(sum(rate(cordum_jobs_received_total[30m])), 0.001)
            ) > 0.0001
          )
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Safety bypass trend is rising"
          description: "Open a reliability debt issue and review policy dependency health."
```
Incident runbook snippet
```shell
# 1) Confirm bypass and outage signals
curl -s http://localhost:9090/metrics \
  | grep -E "cordum_scheduler_input_fail_open_total|cordum_safety_unavailable_total"

# 2) Check runtime mode on scheduler and gateway
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# 3) Contain: switch scheduler to fail-closed in production
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed

# 4) Verify bypass stops
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "fail-open|safety kernel unavailable"
curl -s http://localhost:9090/metrics | grep cordum_scheduler_input_fail_open_total
```
Limitations and tradeoffs
- Very low thresholds can create noise during small controlled experiments.
- High thresholds reduce noise but can delay detection of real policy bypass.
- Topic-level alerts add precision but increase rule count and maintenance cost.
- Global aggregate alerts are simpler but can hide one high-risk topic.
A fail-open counter that never alerts is equivalent to not having the counter.
Next step
Execute this sequence this week:
1. Add the two bypass burn-rate alerts to production Prometheus.
2. Run a controlled safety-kernel outage drill in staging and validate page/ticket timing.
3. Document who can switch `POLICY_CHECK_FAIL_MODE` and for how long.
4. Add deploy-freeze automation when the page-level bypass alert is firing.
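The freeze decision in step 4 can start as a small gate in the deploy pipeline. A sketch under stated assumptions: the input is the JSON body of Alertmanager's `GET /api/v2/alerts`, and the alert name matches the page-level rule defined earlier; the function name and wiring are hypothetical.

```python
# Alert names that should block rollouts (from the Prometheus rules above).
FREEZE_ALERTS = {"CordumFailOpenBypassPage"}

def should_freeze(alerts: list[dict]) -> bool:
    """Return True if any active alert warrants a deploy freeze.

    `alerts` is the parsed JSON list from Alertmanager GET /api/v2/alerts,
    where each entry carries labels.alertname and status.state.
    """
    return any(
        a.get("labels", {}).get("alertname") in FREEZE_ALERTS
        and a.get("status", {}).get("state") == "active"
        for a in alerts
    )

firing = [{
    "labels": {"alertname": "CordumFailOpenBypassPage", "severity": "page"},
    "status": {"state": "active"},
}]
print(should_freeze(firing))
```

A CI step that calls this before rollout (and a `kubectl rollout pause` hook on the alert webhook) closes the loop between the bypass signal and the containment action.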
Continue with AI Agent Safety Check Timeout Tuning and AI Agent Fail-Open vs Fail-Closed.