The production problem
Safety dependency outages create pressure to flip fail-open immediately. That keeps throughput but can silently bypass deny and approval decisions.
Staying fail-closed protects policy integrity, but the queue backlog can spike and trigger operational panic. Most teams pick the failure mode first and do the math later.
The correct order is the reverse: first estimate drain time, then decide whether a temporary bypass is justified.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders Library: Avoiding insurmountable queue backlogs | Queue backlog behavior, lag tracking, and compensating controls under sustained load. | No governance decision model for policy-check outages in autonomous dispatch pipelines. |
| Google Pub/Sub monitoring docs | Backlog size and `oldest_unacked_message_age` as critical health indicators. | No fail-open versus fail-closed control guidance during policy dependency outages. |
| RabbitMQ monitoring guide | Queue depth, unacked/ready message metrics, and practical scrape interval guidance. | No recovery playbook that combines queue lag math with AI safety enforcement tradeoffs. |
The gap is a governance-first outage model: combine queue lag math with explicit policy bypass risk.
Backlog recovery math
| Variable | Meaning | Example |
|---|---|---|
| Ingress rate (`lambda`) | Average jobs per second arriving during outage | 120 jobs/s |
| Outage duration (`D`) | Minutes safety checks are unavailable | 8 minutes |
| Post-recovery service rate (`mu`) | Jobs per second processed after recovery and scale-up | 180 jobs/s |
| Backlog size (`B = lambda * D`) | Jobs accumulated while safety service is degraded | 57,600 jobs |
| Drain time (`T = B / (mu - lambda)`) | Time to clear backlog if `mu > lambda` | 16 minutes |
Recovery ETA calculator
```python
# Inputs (`lambda` is a reserved word in Python, so the table's symbols are spelled out)
ingress_rate = 120.0       # jobs/sec arriving during the outage (lambda)
outage_seconds = 8 * 60    # outage duration D
service_rate = 180.0       # jobs/sec after recovery and scale-up (mu)

# Backlog built during the outage
backlog = ingress_rate * outage_seconds  # 57,600 jobs

# Net drain capacity after recovery
net_drain = service_rate - ingress_rate
if net_drain <= 0:
    raise RuntimeError("Backlog cannot drain. Scale workers or reduce ingress.")

drain_seconds = backlog / net_drain  # 960 s => 16 minutes
# Sensitivity note: ETA blows up as mu approaches lambda. At service_rate = 130.0,
# net_drain = 10 and the drain time balloons to 5,760 s (96 minutes).
print(drain_seconds)
```

Cordum runtime behavior
| Boundary | Current behavior | Operational impact |
|---|---|---|
| Safety unavailable metric | Scheduler increments `cordum_safety_unavailable_total` when safety check is unavailable. | Primary outage signal for policy dependency degradation. |
| Closed-mode unavailable branch | In closed mode, scheduler requeues with `safetyThrottleDelay = 5s` on safety unavailable. | Throughput drops while policy guarantees stay intact. |
| Open-mode unavailable branch | In open mode, jobs continue and `cordum_scheduler_input_fail_open_total` increments. | Availability improves while safety bypass risk increases. |
| Retry ceiling | `maxSchedulingRetries = 50`; at cap, job is moved to FAILED and sent to DLQ. | Prevents infinite retry loops during prolonged outages. |
| Timeout guardrail | Safety check uses `safetyCheckTimeout = 3s` defense-in-depth timeout. | Avoids worker starvation on hung safety requests. |
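The branch behavior in the table above can be condensed into a small decision function. This is a minimal sketch for illustration only: the function name and return values are mine, and only the constants (`5s` throttle delay, `50` retry ceiling) come from the table, spelled here in Python style rather than as Cordum's actual identifiers.

```python
# Illustrative sketch of the scheduler's safety-unavailable branches.
# Constants mirror the runtime-behavior table; everything else is hypothetical.
SAFETY_THROTTLE_DELAY_S = 5     # closed-mode requeue delay
MAX_SCHEDULING_RETRIES = 50     # retry ceiling before FAILED + DLQ

def on_safety_unavailable(fail_mode: str, retries: int) -> str:
    """Decide what happens to a job when the safety check is unavailable."""
    if fail_mode == "open":
        # Open mode: job proceeds; the fail-open counter increments.
        return "dispatch_and_count_fail_open"
    if retries >= MAX_SCHEDULING_RETRIES:
        # Retry ceiling hit: mark FAILED and route to the DLQ.
        return "fail_to_dlq"
    # Closed mode: hold the policy line, requeue with a throttle delay.
    return f"requeue_after_{SAFETY_THROTTLE_DELAY_S}s"

print(on_safety_unavailable("closed", 3))   # -> requeue_after_5s
```

The key property to preserve in any real implementation is that the retry-ceiling check sits inside the closed-mode path, so open mode never silently parks jobs in the DLQ.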
Implementation examples
PromQL outage checks
```promql
# 1) Safety outage pressure
sum(rate(cordum_safety_unavailable_total[5m]))

# 2) Safety bypass signal (should be zero in closed-mode production)
sum(rate(cordum_scheduler_input_fail_open_total[5m]))

# 3) Ingress estimate
sum(rate(cordum_jobs_received_total[5m]))

# 4) Completion throughput
sum(rate(cordum_jobs_completed_total[5m]))

# 5) Failure ratio
sum(rate(cordum_jobs_completed_total{status="failed"}[5m]))
/
clamp_min(sum(rate(cordum_jobs_completed_total[5m])), 0.001)
```

Closed-mode recovery runbook
```bash
# Check scheduler signals
curl -s http://localhost:9090/metrics | grep -E "cordum_safety_unavailable_total|cordum_scheduler_input_fail_open_total|cordum_jobs_received_total|cordum_jobs_completed_total"

# Verify current fail mode
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# Closed-mode containment: keep policy guarantees
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed

# Capacity boost to reduce drain time after recovery
kubectl scale deploy/cordum-scheduler -n cordum --replicas=3

# Post-recovery verification
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety kernel unavailable|max scheduling retries exceeded"
curl -s http://localhost:9090/metrics | grep -E "cordum_safety_unavailable_total|cordum_jobs_completed_total"
```
Limitations and tradeoffs
- Closed mode protects policy but may create customer-visible latency during long outages.
- Open mode preserves throughput but bypasses deny/approval checks until rollback.
- Backlog ETA is only accurate if ingress and throughput rates are measured in near real time.
- Scaling schedulers helps drain backlog only if downstream dependencies can absorb load.
If you switch to fail-open without a time limit and alerting, you are choosing invisible policy debt.
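That time limit and alerting can be made concrete as a Prometheus alerting rule. A hedged sketch: the metric name comes from the runtime-behavior table, but the group name, `for` duration, and severity label are illustrative choices, not an existing Cordum configuration.

```yaml
groups:
  - name: cordum-fail-open
    rules:
      - alert: SafetyFailOpenActive
        # Any sustained fail-open dispatch should page in closed-mode production.
        expr: sum(rate(cordum_scheduler_input_fail_open_total[5m])) > 0
        for: 2m
        labels:
          severity: page
        annotations:
          summary: "Scheduler is bypassing safety checks (fail-open active)"
```

Pair the alert with a written rollback deadline so the page has a defined resolution path, not just an acknowledgement.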
Next step
Run this exercise in staging this week:
1. Simulate a 10-minute safety outage and record backlog growth metrics.
2. Compute drain ETA with real ingress/throughput data.
3. Practice closed-mode recovery with temporary scheduler scale-up.
4. Define written criteria for emergency fail-open and a mandatory rollback deadline.
Continue with AI Agent Fail-Open Alerting and AI Agent Incident Response Runbook.