The production problem
Most incident docs look good until the first real autonomous-agent outage. Then every path looks critical, every team thinks another team owns the issue, and somebody suggests restarting everything.
That approach is expensive. A blanket restart may clear symptoms while creating duplicate side effects if replay and lock state are not verified first.
A runbook should answer three questions fast: what severity is this, what component is the likely blast center, and what safe recovery path should start now.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google SRE Book: Managing Incidents | Strong incident-command structure and communication discipline. | No concrete guidance for policy-gated AI dispatch paths and replay-safe remediation. |
| Atlassian: How to create an incident response playbook | Solid playbook framework, ownership, and simulation emphasis. | No metric-level trigger design for autonomous agent control planes. |
| PagerDuty Response Docs: Getting Started | Practical role setup, severity levels, and practice loops. | No operational decision tree for retries, quarantines, and distributed lock checks. |
Severity model and ownership
Severity should come from objective trigger conditions first. Root cause can evolve during an incident; the severity call should stay deterministic.
| Severity | Trigger condition | Primary signal | Owner on point |
|---|---|---|---|
| SEV-1 | Safety kernel unavailable trend + dispatch backlog growth | `cordum_safety_unavailable_total` rising and jobs stuck | Incident Commander + platform on-call |
| SEV-2 | Dispatch latency p99 sustained above threshold | `histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 1` | Scheduler owner |
| SEV-2 | Failed completion ratio sustained above baseline | `rate(cordum_jobs_completed_total{status="failed"}[5m]) / rate(cordum_jobs_completed_total[5m]) > 0.1` | Workflow/runtime owner |
| SEV-3 | Output quarantine spike without user-facing impact | `rate(cordum_output_policy_quarantined_total[5m]) > 1` | Safety/policy owner |
| SEV-2 | Stale jobs exceed normal drift window | `cordum_scheduler_stale_jobs > 50` | Scheduler on-call |
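The trigger conditions above can be sketched as a small shell helper that maps a metric snapshot to a severity. The argument order, the collapse of the SEV-1 compound trigger to the safety-unavailable rate alone, and the `no-incident` fallback are illustrative assumptions, not the actual dispatch logic:

```shell
#!/usr/bin/env bash
# Hypothetical severity picker for the table above.
# Args: safety_unavailable_rate dispatch_p99_s failed_ratio stale_jobs quarantine_rate
severity() {
  local safety="$1" p99="$2" ratio="$3" stale="$4" quarantine="$5"
  # SEV-1: safety kernel unavailability dominates everything else
  if awk "BEGIN{exit !($safety > 0)}"; then echo "SEV-1"; return; fi
  # SEV-2: latency, failure ratio, or stale-job drift past threshold
  if awk "BEGIN{exit !($p99 > 1 || $ratio > 0.1 || $stale > 50)}"; then echo "SEV-2"; return; fi
  # SEV-3: quarantine spike without user-facing impact
  if awk "BEGIN{exit !($quarantine > 1)}"; then echo "SEV-3"; return; fi
  echo "no-incident"
}

severity 0.2 0.4 0.02 12 0.1  # prints SEV-1
severity 0   1.8 0.02 12 0.1  # prints SEV-2
severity 0   0.4 0.02 12 2.5  # prints SEV-3
```

Encoding the table this way keeps "set severity from metrics" a mechanical step rather than a judgment call under pressure.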
Cordum signal map
| Implication | Current behavior | Why it matters |
|---|---|---|
| Safety dependency outage check | Track `cordum_safety_unavailable_total` and validate safety-kernel gRPC health | Distinguishes policy-service outage from worker capacity issues. |
| Output quarantine investigation | Inspect `cordum_output_policy_quarantined_total` and job `failure_reason` | Avoids disabling policy controls when the issue is narrow rule tuning. |
| Distributed lock integrity | Check Redis lock keys like `cordum:reconciler:default` | Multiple active reconcilers or missing locks can create duplicate or stuck processing. |
| Replay progress confirmation | Monitor `cordum_scheduler_orphan_replayed_total` trend after recovery | Confirms stuck pending jobs are being recovered without unsafe manual replay. |
| Policy fail mode awareness | `POLICY_CHECK_FAIL_MODE=closed` is default; unavailable policy path requeues with backoff | Explains why throughput drops during policy outages without silently bypassing controls. |
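For the quarantine row, the fastest way to distinguish narrow rule tuning from a broad policy failure is a reason tally. A sketch, assuming jobs can be exported as a JSON array carrying the `failure_reason` field mentioned above (the export shape is an assumption):

```shell
# Tally failure reasons from an exported jobs JSON array (shape assumed).
jobs='[{"failure_reason":"output_policy_quarantined"},
       {"failure_reason":"timeout"},
       {"failure_reason":"output_policy_quarantined"}]'
echo "$jobs" | jq -r '.[].failure_reason' | sort | uniq -c | sort -rn
```

A single dominant reason suggests rule tuning; many distinct reasons alongside a rising `cordum_safety_unavailable_total` points at the policy path itself.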
Existing production alerts already include useful starting thresholds: failed ratio above 10%, dispatch p99 above 1s, stale jobs above 50, and quarantine rate above 1 per second.
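Those thresholds translate directly into alerting rules. A Prometheus rules sketch follows; the alert names, `for` durations, and severity labels are illustrative assumptions, not an existing config:

```yaml
groups:
  - name: cordum-incident-triggers
    rules:
      - alert: CordumDispatchLatencyHigh
        expr: histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: sev2
      - alert: CordumFailedRatioHigh
        expr: >
          rate(cordum_jobs_completed_total{status="failed"}[5m])
          / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001) > 0.1
        for: 10m
        labels:
          severity: sev2
      - alert: CordumStaleJobs
        expr: cordum_scheduler_stale_jobs > 50
        for: 5m
        labels:
          severity: sev2
      - alert: CordumQuarantineSpike
        expr: rate(cordum_output_policy_quarantined_total[5m]) > 1
        for: 5m
        labels:
          severity: sev3
```

The `clamp_min` in the failed-ratio expression avoids division-by-zero alerts during quiet periods.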
Implementation examples
First-15-minute incident checklist (YAML)
```yaml
incident:
  t_plus_0_to_5:
    - assign_incident_commander
    - set_severity_from_metrics
    - freeze_nonessential_deployments
  t_plus_5_to_10:
    - check_safety_kernel_health
    - check_scheduler_dispatch_p99
    - check_stale_jobs_and_reconciler_lock
  t_plus_10_to_15:
    - choose_recovery_path:
        - safety_kernel_restore
        - worker_capacity_rebalance
        - output_policy_rule_tuning
    - publish_customer_status_update
    - open_timeline_doc
```
Metric triage snapshot script (Bash)
```bash
#!/usr/bin/env bash
set -euo pipefail
BASE_URL="${BASE_URL:-http://localhost:9090}"

echo "=== Dispatch p99 ==="
curl -sG "$BASE_URL/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))' \
  | jq -r '.data.result[0].value[1]'

echo "=== Failed ratio (5m) ==="
curl -sG "$BASE_URL/api/v1/query" \
  --data-urlencode 'query=rate(cordum_jobs_completed_total{status="failed"}[5m]) / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)' \
  | jq -r '.data.result[0].value[1]'

echo "=== Safety unavailable (5m) ==="
curl -sG "$BASE_URL/api/v1/query" \
  --data-urlencode 'query=rate(cordum_safety_unavailable_total[5m])' \
  | jq -r '.data.result[0].value[1]'
```
Lock and policy-path checks (Bash)
```bash
# Redis lock checks (single-writer components)
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# Job lock sample
redis-cli GET "cordum:scheduler:job:JOB_ID"

# Quarantine metric quick check
curl -s http://localhost:9090/metrics | grep output_policy_quarantined

# Safety kernel env inspection
env | grep SAFETY_KERNEL
```
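Interpreting the lock `GET` results consistently matters more than running them. A small helper sketch follows; the healthy/unhealthy mapping assumes standard single-writer semantics (exactly one live holder per lock key) and is not taken from the actual reconciler:

```shell
# Hypothetical helper: classify the result of a GET on a single-writer lock key.
# An empty value means no active holder (component down, or lock expired);
# a non-empty value should match exactly one live pod or process.
classify_lock() {
  local holder="$1"
  if [ -z "$holder" ]; then
    echo "missing: verify the owning component is running before any manual replay"
  else
    echo "held by: $holder"
  fi
}

# Redirect stderr and tolerate a missing redis-cli so the sketch degrades safely.
classify_lock "$(redis-cli GET cordum:reconciler:default 2>/dev/null || true)"
```

The point of the "missing" branch: never start manual replay while the lock holder's status is unknown, since that is exactly how duplicate processing starts.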
Limitations and tradeoffs
- Tight severity thresholds reduce detection latency but increase paging noise during bursty traffic.
- One shared runbook can hide service-specific recovery nuances if it is not versioned per workflow class.
- Automatic replay after incidents improves recovery speed but requires strong idempotency discipline.
- Manual overrides are sometimes necessary, but every override should be auditable.
Next step
Run this in one sprint:
1. Adopt a 3-level severity model with metric-based entry criteria.
2. Add the first-15-minute checklist to your on-call template and status-page process.
3. Drill one safety-kernel outage scenario and one stale-jobs scenario.
4. After each drill, measure time-to-severity and time-to-safe-recovery.
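The drill measurements reduce to timestamp arithmetic on the incident timeline. A minimal sketch, assuming GNU `date` (`date -d`; on macOS use `gdate` from coreutils) and example timestamps:

```shell
# Compute time-to-severity for a drill from two timeline timestamps.
start="2024-05-01T10:00:00Z"         # incident declared
sev_assigned="2024-05-01T10:04:30Z"  # severity set from metrics
tts=$(( $(date -d "$sev_assigned" +%s) - $(date -d "$start" +%s) ))
echo "time-to-severity: ${tts}s"     # compare against the 5-minute checklist window
```

Computing time-to-safe-recovery works the same way, using the timestamp of the chosen recovery path completing instead of the severity assignment.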
Continue with AI Agent SLOs and Error Budgets and AI Agent Poison Message Handling.