
AI Agent Safety Kernel Outage Playbook

Use queue math and explicit gates before touching fail-open.

Playbook · 11 min read · Mar 2026
TL;DR
  • During safety outages, fail-closed protects policy integrity but can build backlog fast.
  • Backlog drain time is predictable with simple queue math if you measure ingress and post-recovery throughput.
  • The Cordum scheduler records safety unavailability and, in closed mode, requeues with a 5s safety throttle delay.
  • Temporary fail-open can preserve throughput, but it must be time-boxed and monitored as an explicit policy bypass.
Math before mode-switch

Estimate backlog drain first. Do not flip to fail-open based on stress.

Closed-mode recovery

Keep policy guarantees while restoring service if your drain window is acceptable.

Controlled containment

Use runbook gates and concrete thresholds for any emergency fail-open decision.

Scope

This playbook targets pre-dispatch safety outages where scheduler policy checks degrade and queue lag grows while production remains in fail-closed mode.

The production problem

Safety dependency outages create pressure to flip fail-open immediately. That keeps throughput but can silently bypass deny and approval decisions.

Staying fail-closed protects policy integrity, but queue backlog can spike and trigger operational panic. Most teams pick a mode first and do the math later.

The correct order is the reverse: estimate drain time first, then decide whether a temporary bypass is justified.

What top results miss

| Source | Strong coverage | Missing piece |
|---|---|---|
| AWS Builders Library: Avoiding insurmountable queue backlogs | Queue backlog behavior, lag tracking, and compensating controls under sustained load. | No governance decision model for policy-check outages in autonomous dispatch pipelines. |
| Google Pub/Sub monitoring docs | Backlog size and `oldest_unacked_message_age` as critical health indicators. | No fail-open versus fail-closed control guidance during policy dependency outages. |
| RabbitMQ monitoring guide | Queue depth, unacked/ready message metrics, and practical scrape interval guidance. | No recovery playbook that combines queue lag math with AI safety enforcement tradeoffs. |

The gap is a governance-first outage model: combine queue lag math with an explicit accounting of policy bypass risk.

Backlog recovery math

| Variable | Meaning | Example |
|---|---|---|
| Ingress rate (`lambda`) | Average jobs per second arriving during the outage | 120 jobs/s |
| Outage duration (`D`) | Minutes safety checks are unavailable | 8 minutes |
| Post-recovery service rate (`mu`) | Jobs per second processed after recovery and scale-up | 180 jobs/s |
| Backlog size (`B = lambda * D`) | Jobs accumulated while the safety service is degraded | 57,600 jobs |
| Drain time (`T = B / (mu - lambda)`) | Time to clear the backlog, valid only when `mu > lambda` | 16 minutes |

Recovery ETA calculator

backlog_eta.py
Python
# Inputs
ingress_rate = 120.0   # jobs/sec arriving during the outage
                       # ("lambda" is a reserved word in Python, so spell it out)
outage_seconds = 8 * 60
service_rate = 180.0   # jobs/sec after recovery and scale-up ("mu")

# Backlog built during the outage
backlog = ingress_rate * outage_seconds          # 57,600 jobs

# Net drain capacity after recovery
net_drain = service_rate - ingress_rate
if net_drain <= 0:
    raise ValueError("Backlog cannot drain. Scale workers or reduce ingress.")

drain_seconds = backlog / net_drain              # 960 s => 16 minutes
print(f"Drain ETA: {drain_seconds / 60:.1f} minutes")
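The calculator extends naturally into a mode-decision gate: compare the drain ETA against an agreed latency budget before anyone argues for fail-open. The function name, return strings, and the 20-minute budget below are illustrative assumptions, not Cordum settings.

```python
# Hypothetical decision gate: the 20-minute budget is an assumed SLO, not a Cordum default.
def recommend_mode(ingress_rate: float, service_rate: float,
                   outage_seconds: float, budget_seconds: float = 20 * 60) -> str:
    """Return a mode recommendation based on queue math, not stress."""
    net_drain = service_rate - ingress_rate
    if net_drain <= 0:
        return "escalate: backlog cannot drain; scale workers or shed ingress"
    drain_seconds = (ingress_rate * outage_seconds) / net_drain
    if drain_seconds <= budget_seconds:
        return "stay-closed"       # drain window is acceptable
    return "review-fail-open"      # time-boxed bypass may be justified; apply runbook gates

print(recommend_mode(120.0, 180.0, 8 * 60))  # 960 s drain vs. 1200 s budget
```

With the worked example above (120 jobs/s ingress, 180 jobs/s recovery rate, 8-minute outage), the 16-minute drain fits a 20-minute budget, so the gate recommends staying closed.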

Cordum runtime behavior

| Boundary | Current behavior | Operational impact |
|---|---|---|
| Safety unavailable metric | Scheduler increments `cordum_safety_unavailable_total` when the safety check is unavailable. | Primary outage signal for policy dependency degradation. |
| Closed-mode unavailable branch | In closed mode, the scheduler requeues with `safetyThrottleDelay = 5s` on safety unavailable. | Throughput drops while policy guarantees stay intact. |
| Open-mode unavailable branch | In open mode, jobs continue and `cordum_scheduler_input_fail_open_total` increments. | Availability improves while safety bypass risk increases. |
| Retry ceiling | `maxSchedulingRetries = 50`; at the cap, the job is moved to FAILED and sent to the DLQ. | Prevents infinite retry loops during prolonged outages. |
| Timeout guardrail | Safety check uses `safetyCheckTimeout = 3s` as a defense-in-depth timeout. | Avoids worker starvation on hung safety requests. |
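The branch logic in the table can be sketched roughly as follows. This is a simplified illustration of the documented behavior, not Cordum source code: the function name, job fields, and metrics dict are assumptions, while the 5s throttle and 50-retry cap come from the table.

```python
# Simplified sketch of the scheduler's safety-unavailable branches (illustrative).
SAFETY_THROTTLE_DELAY_S = 5     # mirrors safetyThrottleDelay
MAX_SCHEDULING_RETRIES = 50     # mirrors maxSchedulingRetries

metrics = {"cordum_safety_unavailable_total": 0,
           "cordum_scheduler_input_fail_open_total": 0}

def handle_safety_unavailable(job: dict, fail_mode: str) -> str:
    """Decide what happens to a job when the safety check is unavailable."""
    metrics["cordum_safety_unavailable_total"] += 1
    if fail_mode == "open":
        # Availability wins: the job continues, and the bypass is counted for auditing.
        metrics["cordum_scheduler_input_fail_open_total"] += 1
        return "dispatch"
    # Closed mode: policy wins; requeue with a throttle delay, DLQ at the retry cap.
    job["retries"] = job.get("retries", 0) + 1
    if job["retries"] >= MAX_SCHEDULING_RETRIES:
        return "dlq"  # moved to FAILED and sent to the dead-letter queue
    job["not_before_delay_s"] = SAFETY_THROTTLE_DELAY_S
    return "requeue"
```

The key asymmetry: closed mode converts an outage into latency (requeues), while open mode converts it into audit findings (bypass counters).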

Implementation examples

PromQL outage checks

safety_outage_queries.promql
PromQL
# 1) Safety outage pressure
sum(rate(cordum_safety_unavailable_total[5m]))

# 2) Safety bypass signal (should be zero in closed-mode production)
sum(rate(cordum_scheduler_input_fail_open_total[5m]))

# 3) Ingress estimate
sum(rate(cordum_jobs_received_total[5m]))

# 4) Completion throughput
sum(rate(cordum_jobs_completed_total[5m]))

# 5) Failure ratio
sum(rate(cordum_jobs_completed_total{status="failed"}[5m]))
/
clamp_min(sum(rate(cordum_jobs_completed_total[5m])), 0.001)
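These queries pair naturally with alerts on the bypass and backlog signals. The rule file below is a hedged sketch of Prometheus alerting rules: the metric names come from this playbook, while the group name, `for` durations, labels, and annotations are assumptions to adapt.

```yaml
groups:
  - name: cordum-safety-outage   # group name is an assumption
    rules:
      - alert: SafetyBypassActive
        # Any fail-open dispatch in closed-mode production is policy debt.
        expr: sum(rate(cordum_scheduler_input_fail_open_total[5m])) > 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Scheduler is dispatching jobs without safety checks"
      - alert: SafetyOutageBacklogRisk
        # Fires when ingress exceeds completion throughput during an outage.
        expr: |
          sum(rate(cordum_safety_unavailable_total[5m])) > 0
          and sum(rate(cordum_jobs_received_total[5m]))
              > sum(rate(cordum_jobs_completed_total[5m]))
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "Safety outage with growing backlog; compute drain ETA"
```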

Closed-mode recovery runbook

safety_outage_runbook.sh
Bash
# Check scheduler signals
curl -s http://localhost:9090/metrics | grep -E "cordum_safety_unavailable_total|cordum_scheduler_input_fail_open_total|cordum_jobs_received_total|cordum_jobs_completed_total"

# Verify current fail mode
kubectl exec -n cordum deploy/cordum-scheduler -- printenv POLICY_CHECK_FAIL_MODE
kubectl exec -n cordum deploy/cordum-api-gateway -- printenv GATEWAY_POLICY_FAIL_MODE

# Closed-mode containment: keep policy guarantees
kubectl set env deployment/cordum-scheduler -n cordum POLICY_CHECK_FAIL_MODE=closed

# Capacity boost to reduce drain time after recovery
kubectl scale deploy/cordum-scheduler -n cordum --replicas=3

# Post-recovery verification
kubectl logs deploy/cordum-scheduler -n cordum | grep -E "safety kernel unavailable|max scheduling retries exceeded"
curl -s http://localhost:9090/metrics | grep -E "cordum_safety_unavailable_total|cordum_jobs_completed_total"

Limitations and tradeoffs

  • Closed mode protects policy but may create customer-visible latency during long outages.
  • Open mode preserves throughput but bypasses deny/approval checks until rollback.
  • Backlog ETA is only accurate if ingress and throughput rates are measured in near real time.
  • Scaling schedulers helps drain backlog only if downstream dependencies can absorb the load.

If you switch to fail-open without a time limit and alerting, you are choosing invisible policy debt.

Next step

Run this exercise in staging this week:

  1. Simulate a 10-minute safety outage and record backlog growth metrics.
  2. Compute drain ETA with real ingress/throughput data.
  3. Practice closed-mode recovery with temporary scheduler scale-up.
  4. Define written criteria for emergency fail-open and a mandatory rollback deadline.
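Steps 1 and 2 can be rehearsed offline before the staging run with a small discrete simulation. The rates reuse this playbook's worked example; the function itself is an illustrative sketch that assumes nothing dispatches while the outage lasts.

```python
# Minute-by-minute backlog simulation for a fail-closed safety outage.
def simulate_outage(ingress_per_s: float, outage_minutes: int) -> list[float]:
    """Return the backlog size at the end of each outage minute."""
    backlog, samples = 0.0, []
    for _ in range(outage_minutes):
        backlog += ingress_per_s * 60  # closed mode: ingress accumulates, nothing drains
        samples.append(backlog)
    return samples

growth = simulate_outage(120.0, 10)                  # 10-minute outage at 120 jobs/s
drain_minutes = growth[-1] / ((180.0 - 120.0) * 60)  # recovery at 180 jobs/s
print(f"Backlog after 10 min: {growth[-1]:,.0f} jobs; drain ETA {drain_minutes:.0f} min")
```

A 10-minute outage at 120 jobs/s builds 72,000 jobs; at a 60 jobs/s net drain that clears in 20 minutes, which is the number to compare against your written fail-open criteria.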

Continue with AI Agent Fail-Open Alerting and AI Agent Incident Response Runbook.

Backlog is measurable, panic is optional

Treat safety outages like controlled queueing incidents with explicit governance gates.