The production problem
A safety dependency outage happens. The team switches to fail-open to preserve throughput. Jobs keep flowing. Nobody gets paged. That is how policy bypass becomes invisible.
Looking only at `failed` jobs misses this class of risk because fail-open is counted as successful processing. You need dedicated bypass telemetry and dedicated alerting.
The goal is not zero fail-open forever. The goal is zero unobserved fail-open.
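A minimal sketch of why failure-based views hide this risk, using hypothetical job records: a job dispatched through fail-open still reports success, so a `failed`-only query returns zero while bypass is nonzero.

```python
# Hypothetical job records: fail-open jobs complete "successfully",
# so counting failures alone reports zero risk during a bypass.
jobs = [
    {"status": "succeeded", "safety_checked": True},
    {"status": "succeeded", "safety_checked": False},  # fail-open bypass
    {"status": "failed",    "safety_checked": True},
]

failed = sum(1 for j in jobs if j["status"] == "failed")
bypassed = sum(
    1 for j in jobs
    if j["status"] == "succeeded" and not j["safety_checked"]
)

print(f"failed={failed} bypassed={bypassed}")
```

The failure count and the bypass count are independent signals; only a dedicated bypass counter surfaces the second one.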
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google SRE Workbook: Alerting on SLOs | Multi-window and burn-rate concepts for high-signal alerting. | No metric strategy for safety-bypass signals in AI policy enforcement paths. |
| AWS CloudWatch SLO alarms | Composite alarms combining short and long burn-rate windows. | No direct mapping from fail-open counters to governance risk in autonomous dispatch. |
| Prometheus alerting rules | `expr`, `for`, and `keep_firing_for` mechanics for stable alert behavior. | No opinionated policy for fail-open response phases (page, ticket, freeze). |
The missing pattern is governance-aware alerting: bypass-specific burn rules and predefined containment actions.
Alert policy matrix
| Alert level | Condition | Objective | Action |
|---|---|---|---|
| Page (P1) | Bypass ratio > 0.1% on 1h and 5m windows | Detect sustained unsafe bypass within 5 to 10 minutes | Page on-call + incident commander |
| Ticket (P3) | Bypass ratio > 0.01% on 6h and 30m windows | Catch slow safety drift before it becomes outage-level risk | Create reliability debt item within same business day |
| Deploy freeze | Any bypass in production for > 15m while fail-open is enabled | Stop making risk state worse during live safety degradation | Halt rollouts until bypass returns to zero |
| Informational | Non-prod bypass > 0 with explicit experiment label | Keep observability complete in staging/chaos drills | Log and dashboard only |
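The matrix above can be sketched as a small decision function. This is an illustrative sketch, not Cordum code; the function name and signature are assumptions, and the thresholds are the dual-window ratios from the table (0.1% = 0.001, 0.01% = 0.0001).

```python
def alert_level(ratio_1h: float, ratio_5m: float,
                ratio_6h: float, ratio_30m: float) -> str:
    """Map dual-window bypass ratios to an alert level.

    A level fires only when both its long and short windows breach,
    which filters out brief blips and long-decayed spikes alike.
    """
    if ratio_1h > 0.001 and ratio_5m > 0.001:
        return "page"      # sustained bypass above 0.1%
    if ratio_6h > 0.0001 and ratio_30m > 0.0001:
        return "ticket"    # slow drift above 0.01%
    return "ok"

# A short 5m spike with a clean 1h window does not page:
print(alert_level(0.0002, 0.005, 0.0, 0.0))
```

The deploy-freeze and informational rows key off different inputs (fail mode, environment labels) and are handled separately.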
Cordum signal mapping
| Signal | Metric / config | Operational meaning |
|---|---|---|
| Fail-open counter | `cordum_scheduler_input_fail_open_total` | Incremented each time scheduler allows a job through while safety is unavailable in open mode. |
| Safety unavailable counter | `cordum_safety_unavailable_total` | Tracks jobs deferred because safety kernel was unreachable. |
| Scheduler fail mode | `POLICY_CHECK_FAIL_MODE` (`closed` default, `open` optional) | Controls whether unavailable safety checks requeue or bypass at dispatch time. |
| Gateway fail mode | `GATEWAY_POLICY_FAIL_MODE` (`closed` default, `open` optional) | Controls submit-time behavior when safety kernel is unreachable. |
| Admin health context | `input_fail_open_total` in admin health payload | Gives quick control-plane context without digging into raw Prometheus output. |
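For ad-hoc checks outside Prometheus, the two counters can be pulled straight from the text exposition endpoint. A minimal sketch: the parser below handles the standard Prometheus text format, and the sample payload (label set and values) is made up for illustration.

```python
def parse_counters(exposition: str, names: set[str]) -> dict[str, float]:
    """Sum counter values for the given metric names from Prometheus
    text exposition output, ignoring HELP/TYPE comments and labels."""
    out: dict[str, float] = {}
    for line in exposition.splitlines():
        if line.startswith("#") or not line.strip():
            continue
        metric, _, value = line.rpartition(" ")
        base = metric.split("{", 1)[0]  # strip any {label="..."} set
        if base in names:
            out[base] = out.get(base, 0.0) + float(value)
    return out

sample = """\
# TYPE cordum_scheduler_input_fail_open_total counter
cordum_scheduler_input_fail_open_total{topic="dispatch"} 7
cordum_safety_unavailable_total 42
"""

print(parse_counters(sample, {
    "cordum_scheduler_input_fail_open_total",
    "cordum_safety_unavailable_total",
}))
```

Any nonzero fail-open count in a fail-closed environment is itself a finding worth investigating.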
Implementation examples
PromQL queries
```promql
# Bypass ratio (all topics)
sum(rate(cordum_scheduler_input_fail_open_total[5m]))
  / clamp_min(sum(rate(cordum_jobs_received_total[5m])), 0.001)

# Bypass ratio by topic
sum by (topic) (rate(cordum_scheduler_input_fail_open_total[5m]))
  / clamp_min(sum by (topic) (rate(cordum_jobs_received_total[5m])), 0.001)

# Correlate with safety outage pressure
sum(rate(cordum_safety_unavailable_total[5m]))
```
Prometheus alert rules
```yaml
groups:
  - name: cordum-fail-open
    rules:
      - alert: CordumFailOpenBypassPage
        expr: |
          (
            (
              sum(rate(cordum_scheduler_input_fail_open_total[1h]))
              / clamp_min(sum(rate(cordum_jobs_received_total[1h])), 0.001)
            ) > 0.001
            and
            (
              sum(rate(cordum_scheduler_input_fail_open_total[5m]))
              / clamp_min(sum(rate(cordum_jobs_received_total[5m])), 0.001)
            ) > 0.001
          )
        for: 5m
        keep_firing_for: 10m
        labels:
          severity: page
        annotations:
          summary: "Safety bypass ratio is above 0.1%"
          description: "Fail-open safety bypass is sustained on short and long windows."
      - alert: CordumFailOpenBypassTicket
        expr: |
          (
            (
              sum(rate(cordum_scheduler_input_fail_open_total[6h]))
              / clamp_min(sum(rate(cordum_jobs_received_total[6h])), 0.001)
            ) > 0.0001
            and
            (
              sum(rate(cordum_scheduler_input_fail_open_total[30m]))
              / clamp_min(sum(rate(cordum_jobs_received_total[30m])), 0.001)
            ) > 0.0001
          )
        for: 15m
        labels:
          severity: ticket
        annotations:
          summary: "Safety bypass trend is rising"
          description: "Open a reliability debt issue and review policy dependency health."
```
Incident runbook snippet
```shell
# 1) Confirm bypass and outage signals
curl -s http://localhost:9090/metrics \
  | grep -E "cordum_scheduler_input_fail_open_total|cordum_safety_unavailable_total"

# 2) Check runtime mode on scheduler and gateway
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# 3) Contain: switch scheduler to fail-closed in production
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed

# 4) Verify bypass stops
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "fail-open|safety kernel unavailable"
curl -s http://localhost:9090/metrics | grep cordum_scheduler_input_fail_open_total
```
Limitations and tradeoffs
- Very low thresholds can create noise during small controlled experiments.
- High thresholds reduce noise but can delay detection of real policy bypass.
- Topic-level alerts add precision but increase rule count and maintenance cost.
- Global aggregate alerts are simpler but can hide one high-risk topic.
A fail-open counter that never alerts is equivalent to not having the counter.
Next step
Execute this sequence this week:
1. Add the two bypass burn-rate alerts to production Prometheus.
2. Run a controlled safety-kernel outage drill in staging and validate page/ticket timing.
3. Document who can switch `POLICY_CHECK_FAIL_MODE` and for how long.
4. Add deploy-freeze automation when the page-level bypass alert is firing.
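The freeze decision in step 4 can start as a small gate in the deploy pipeline. A sketch under stated assumptions: the input is the JSON body of Alertmanager's `GET /api/v2/alerts`, and the alert name matches the page-level rule defined earlier; the function name and wiring are hypothetical.

```python
# Alert names that should block rollouts (from the Prometheus rules above).
FREEZE_ALERTS = {"CordumFailOpenBypassPage"}

def should_freeze(alerts: list[dict]) -> bool:
    """Return True if any active alert warrants a deploy freeze.

    `alerts` is the parsed JSON list from Alertmanager GET /api/v2/alerts,
    where each entry carries labels.alertname and status.state.
    """
    return any(
        a.get("labels", {}).get("alertname") in FREEZE_ALERTS
        and a.get("status", {}).get("state") == "active"
        for a in alerts
    )

firing = [{
    "labels": {"alertname": "CordumFailOpenBypassPage", "severity": "page"},
    "status": {"state": "active"},
}]
print(should_freeze(firing))
```

A CI step that calls this before rollout (and a `kubectl rollout pause` hook on the alert webhook) closes the loop between the bypass signal and the containment action.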
Continue with AI Agent Safety Check Timeout Tuning and AI Agent Fail-Open vs Fail-Closed.