
AI Agent Safety Kernel Outage Playbook

Use queue math and explicit gates before touching fail-open.

Playbook · 11 min read · Mar 2026
TL;DR
  • During safety outages, fail-closed protects policy integrity but can build backlog fast.
  • Backlog drain time is predictable with simple queue math if you measure ingress and post-recovery throughput.
  • The Cordum scheduler records safety unavailability and, in closed mode, requeues with a 5s safety throttle delay.
  • Temporary fail-open can preserve throughput, but it must be time-boxed and monitored as an explicit policy bypass.
Math before mode-switch

Estimate backlog drain first. Do not flip to fail-open based on stress.

Closed-mode recovery

Keep policy guarantees while restoring service if your drain window is acceptable.

Controlled containment

Use runbook gates and concrete thresholds for any emergency fail-open decision.

Scope

This playbook targets pre-dispatch safety outages where scheduler policy checks degrade and queue lag grows while production remains in fail-closed mode.

The production problem

Safety dependency outages create pressure to flip fail-open immediately. That keeps throughput but can silently bypass deny and approval decisions.

Staying fail-closed protects policy integrity, but queue backlog can spike and trigger operational panic. Most teams pick a mode first and do the math later.

The correct order is the reverse: estimate drain time first, then decide whether a temporary bypass is justified.

What top results miss

| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders Library: Avoiding insurmountable queue backlogs | Queue backlog behavior, lag tracking, and compensating controls under sustained load. | No governance decision model for policy-check outages in autonomous dispatch pipelines. |
| Google Pub/Sub monitoring docs | Backlog size and `oldest_unacked_message_age` as critical health indicators. | No fail-open versus fail-closed control guidance during policy dependency outages. |
| RabbitMQ monitoring guide | Queue depth, unacked/ready message metrics, and practical scrape interval guidance. | No recovery playbook that combines queue lag math with AI safety enforcement tradeoffs. |

The gap is a governance-first outage model: combine queue lag math with an explicit accounting of policy bypass risk.

Backlog recovery math

| Variable | Meaning | Example |
|---|---|---|
| Ingress rate (`lambda`) | Average jobs per second arriving during the outage | 120 jobs/s |
| Outage duration (`D`) | Minutes safety checks are unavailable | 8 minutes |
| Post-recovery service rate (`mu`) | Jobs per second processed after recovery and scale-up | 180 jobs/s |
| Backlog size (`B = lambda * D`) | Jobs accumulated while the safety service is degraded | 57,600 jobs |
| Drain time (`T = B / (mu - lambda)`) | Time to clear the backlog, valid only when `mu > lambda` | 16 minutes |

Recovery ETA calculator

backlog_eta.py
Python
# Inputs
ingress_rate = 120.0   # jobs/sec arriving during the outage
                       # ("lambda" is a reserved word in Python, so spell it out)
outage_seconds = 8 * 60
service_rate = 180.0   # jobs/sec after recovery and scale-up ("mu")

# Backlog built during the outage
backlog = ingress_rate * outage_seconds          # 57,600 jobs

# Net drain capacity after recovery
net_drain = service_rate - ingress_rate
if net_drain <= 0:
    raise ValueError("Backlog cannot drain. Scale workers or reduce ingress.")

drain_seconds = backlog / net_drain              # 960 s => 16 minutes
print(f"Drain ETA: {drain_seconds / 60:.1f} minutes")
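The calculator extends naturally into a mode-decision gate: compare the drain ETA against an agreed latency budget before anyone argues for fail-open. The function name, return strings, and the 20-minute budget below are illustrative assumptions, not Cordum settings.

```python
# Hypothetical decision gate: the 20-minute budget is an assumed SLO, not a Cordum default.
def recommend_mode(ingress_rate: float, service_rate: float,
                   outage_seconds: float, budget_seconds: float = 20 * 60) -> str:
    """Return a mode recommendation based on queue math, not stress."""
    net_drain = service_rate - ingress_rate
    if net_drain <= 0:
        return "escalate: backlog cannot drain; scale workers or shed ingress"
    drain_seconds = (ingress_rate * outage_seconds) / net_drain
    if drain_seconds <= budget_seconds:
        return "stay-closed"       # drain window is acceptable
    return "review-fail-open"      # time-boxed bypass may be justified; apply runbook gates

print(recommend_mode(120.0, 180.0, 8 * 60))  # 960 s drain vs. 1200 s budget
```

With the worked example above (120 jobs/s ingress, 180 jobs/s recovery rate, 8-minute outage), the 16-minute drain fits a 20-minute budget, so the gate recommends staying closed.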

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
|---|---|---|
| Safety unavailable metric | Scheduler increments `cordum_safety_unavailable_total` when the safety check is unavailable. | Primary outage signal for policy dependency degradation. |
| Closed-mode unavailable branch | In closed mode, the scheduler requeues with `safetyThrottleDelay = 5s` on safety unavailable. | Throughput drops while policy guarantees stay intact. |
| Open-mode unavailable branch | In open mode, jobs continue and `cordum_scheduler_input_fail_open_total` increments. | Availability improves while safety bypass risk increases. |
| Retry ceiling | `maxSchedulingRetries = 50`; at the cap, the job is moved to FAILED and sent to the DLQ. | Prevents infinite retry loops during prolonged outages. |
| Timeout guardrail | Safety check uses `safetyCheckTimeout = 3s` as a defense-in-depth timeout. | Avoids worker starvation on hung safety requests. |
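The branch logic in the table can be sketched roughly as follows. This is a simplified illustration of the documented behavior, not Cordum source code: the function name, job fields, and metrics dict are assumptions, while the 5s throttle and 50-retry cap come from the table.

```python
# Simplified sketch of the scheduler's safety-unavailable branches (illustrative).
SAFETY_THROTTLE_DELAY_S = 5     # mirrors safetyThrottleDelay
MAX_SCHEDULING_RETRIES = 50     # mirrors maxSchedulingRetries

metrics = {"cordum_safety_unavailable_total": 0,
           "cordum_scheduler_input_fail_open_total": 0}

def handle_safety_unavailable(job: dict, fail_mode: str) -> str:
    """Decide what happens to a job when the safety check is unavailable."""
    metrics["cordum_safety_unavailable_total"] += 1
    if fail_mode == "open":
        # Availability wins: the job continues, and the bypass is counted for auditing.
        metrics["cordum_scheduler_input_fail_open_total"] += 1
        return "dispatch"
    # Closed mode: policy wins; requeue with a throttle delay, DLQ at the retry cap.
    job["retries"] = job.get("retries", 0) + 1
    if job["retries"] >= MAX_SCHEDULING_RETRIES:
        return "dlq"  # moved to FAILED and sent to the dead-letter queue
    job["not_before_delay_s"] = SAFETY_THROTTLE_DELAY_S
    return "requeue"
```

The key asymmetry: closed mode converts an outage into latency (requeues), while open mode converts it into audit findings (bypass counters).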

Implementation examples

PromQL outage checks

safety_outage_queries.promql
PromQL
# 1) Safety outage pressure
sum(rate(cordum_safety_unavailable_total[5m]))

# 2) Safety bypass signal (should be zero in closed-mode production)
sum(rate(cordum_scheduler_input_fail_open_total[5m]))

# 3) Ingress estimate
sum(rate(cordum_jobs_received_total[5m]))

# 4) Completion throughput
sum(rate(cordum_jobs_completed_total[5m]))

# 5) Failure ratio
sum(rate(cordum_jobs_completed_total{status="failed"}[5m]))
/
clamp_min(sum(rate(cordum_jobs_completed_total[5m])), 0.001)
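These queries pair naturally with alerts on the bypass and backlog signals. The rule file below is a hedged sketch of Prometheus alerting rules: the metric names come from this playbook, while the group name, `for` durations, labels, and annotations are assumptions to adapt.

```yaml
groups:
  - name: cordum-safety-outage   # group name is an assumption
    rules:
      - alert: SafetyBypassActive
        # Any fail-open dispatch in closed-mode production is policy debt.
        expr: sum(rate(cordum_scheduler_input_fail_open_total[5m])) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Scheduler is dispatching jobs without safety checks"
      - alert: SafetyOutageBacklogRisk
        # Fires when ingress exceeds completion throughput during an outage.
        expr: |
          sum(rate(cordum_safety_unavailable_total[5m])) > 0
          and sum(rate(cordum_jobs_received_total[5m]))
              > sum(rate(cordum_jobs_completed_total[5m]))
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Safety outage with growing backlog; compute drain ETA"
```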

Closed-mode recovery runbook

safety_outage_runbook.sh
Bash
# Check scheduler signals
curl -s http://localhost:9090/metrics | grep -E "cordum_safety_unavailable_total|cordum_scheduler_input_fail_open_total|cordum_jobs_received_total|cordum_jobs_completed_total"

# Verify current fail mode
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# Closed-mode containment: keep policy guarantees
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed

# Capacity boost to reduce drain time after recovery
kubectl scale deploy/cordum-scheduler -n cordum --replicas=3

# Post-recovery verification
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety kernel unavailable|max scheduling retries exceeded"
curl -s http://localhost:9090/metrics | grep -E "cordum_safety_unavailable_total|cordum_jobs_completed_total"

Limitations and tradeoffs

  • Closed mode protects policy but may create customer-visible latency during long outages.
  • Open mode preserves throughput but bypasses deny/approval checks until rollback.
  • Backlog ETA is only accurate if ingress and throughput rates are measured in near real time.
  • Scaling schedulers helps drain backlog only if downstream dependencies can absorb the load.

If you switch to fail-open without a time limit and alerting, you are choosing invisible policy debt.

Next step

Run this exercise in staging this week:

  1. Simulate a 10-minute safety outage and record backlog growth metrics.
  2. Compute drain ETA with real ingress/throughput data.
  3. Practice closed-mode recovery with temporary scheduler scale-up.
  4. Define written criteria for emergency fail-open and a mandatory rollback deadline.
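Steps 1 and 2 can be rehearsed offline before the staging run with a small discrete simulation. The rates reuse this playbook's worked example; the function itself is an illustrative sketch that assumes nothing dispatches while the outage lasts.

```python
# Minute-by-minute backlog simulation for a fail-closed safety outage.
def simulate_outage(ingress_per_s: float, outage_minutes: int) -> list[float]:
    """Return the backlog size at the end of each outage minute."""
    backlog, samples = 0.0, []
    for _ in range(outage_minutes):
        backlog += ingress_per_s * 60  # closed mode: ingress accumulates, nothing drains
        samples.append(backlog)
    return samples

growth = simulate_outage(120.0, 10)                  # 10-minute outage at 120 jobs/s
drain_minutes = growth[-1] / ((180.0 - 120.0) * 60)  # recovery at 180 jobs/s
print(f"Backlog after 10 min: {growth[-1]:,.0f} jobs; drain ETA {drain_minutes:.0f} min")
```

A 10-minute outage at 120 jobs/s builds 72,000 jobs; at a 60 jobs/s net drain that clears in 20 minutes, which is the number to compare against your written fail-open criteria.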

Continue with AI Agent Fail-Open Alerting and AI Agent Incident Response Runbook.

Backlog is measurable, panic is optional

Treat safety outages like controlled queueing incidents with explicit governance gates.