The production problem
Most incident docs look good until the first real autonomous-agent outage. Then every path looks critical, every team thinks another team owns the issue, and somebody suggests restarting everything.
That approach is expensive. A blanket restart may clear symptoms while creating duplicate side effects if replay and lock state are not verified first.
A runbook should answer three questions fast: what severity is this, what component is the likely blast center, and what safe recovery path should start now.
What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google SRE Book: Managing Incidents | Strong incident-command structure and communication discipline. | No concrete guidance for policy-gated AI dispatch paths and replay-safe remediation. |
| Atlassian: How to create an incident response playbook | Solid playbook framework, ownership, and simulation emphasis. | No metric-level trigger design for autonomous agent control planes. |
| PagerDuty Response Docs: Getting Started | Practical role setup, severity levels, and practice loops. | No operational decision tree for retries, quarantines, and distributed lock checks. |
Severity model and ownership
Severity should come from objective trigger conditions first. Root cause can evolve during an incident; the severity call should stay deterministic.
| Severity | Trigger condition | Primary signal | Owner on point |
|---|---|---|---|
| SEV-1 | Safety kernel unavailable trend + dispatch backlog growth | `cordum_safety_unavailable_total` rising and jobs stuck | Incident Commander + platform on-call |
| SEV-2 | Dispatch latency p99 sustained above threshold | `histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 1` | Scheduler owner |
| SEV-2 | Failed completion ratio sustained above baseline | `rate(cordum_jobs_completed_total{status="failed"}[5m]) / rate(cordum_jobs_completed_total[5m]) > 0.1` | Workflow/runtime owner |
| SEV-3 | Output quarantine spike without user-facing impact | `rate(cordum_output_policy_quarantined_total[5m]) > 1` | Safety/policy owner |
| SEV-2 | Stale jobs exceed normal drift window | `cordum_scheduler_stale_jobs > 50` | Scheduler on-call |
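The trigger conditions above can be sketched as a small shell helper that maps a metric snapshot to a severity. The argument order, the collapse of the SEV-1 compound trigger to the safety-unavailable rate alone, and the `no-incident` fallback are illustrative assumptions, not the actual dispatch logic:

```shell
#!/usr/bin/env bash
# Hypothetical severity picker for the table above.
# Args: safety_unavailable_rate dispatch_p99_s failed_ratio stale_jobs quarantine_rate
severity() {
  local safety="$1" p99="$2" ratio="$3" stale="$4" quarantine="$5"
  # SEV-1: safety kernel unavailability dominates everything else
  if awk "BEGIN{exit !($safety > 0)}"; then echo "SEV-1"; return; fi
  # SEV-2: latency, failure ratio, or stale-job drift past threshold
  if awk "BEGIN{exit !($p99 > 1 || $ratio > 0.1 || $stale > 50)}"; then echo "SEV-2"; return; fi
  # SEV-3: quarantine spike without user-facing impact
  if awk "BEGIN{exit !($quarantine > 1)}"; then echo "SEV-3"; return; fi
  echo "no-incident"
}

severity 0.2 0.4 0.02 12 0.1  # prints SEV-1
severity 0   1.8 0.02 12 0.1  # prints SEV-2
severity 0   0.4 0.02 12 2.5  # prints SEV-3
```

Encoding the table this way keeps "set severity from metrics" a mechanical step rather than a judgment call under pressure.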
Cordum signal map
| Implication | Current behavior | Why it matters |
|---|---|---|
| Safety dependency outage check | Track `cordum_safety_unavailable_total` and validate safety-kernel gRPC health | Distinguishes policy-service outage from worker capacity issues. |
| Output quarantine investigation | Inspect `cordum_output_policy_quarantined_total` and job `failure_reason` | Avoids disabling policy controls when the issue is narrow rule tuning. |
| Distributed lock integrity | Check Redis lock keys like `cordum:reconciler:default` | Multiple active reconcilers or missing locks can create duplicate or stuck processing. |
| Replay progress confirmation | Monitor `cordum_scheduler_orphan_replayed_total` trend after recovery | Confirms stuck pending jobs are being recovered without unsafe manual replay. |
| Policy fail mode awareness | `POLICY_CHECK_FAIL_MODE=closed` is default; unavailable policy path requeues with backoff | Explains why throughput drops during policy outages without silently bypassing controls. |
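For the quarantine row, the fastest way to distinguish narrow rule tuning from a broad policy failure is a reason tally. A sketch, assuming jobs can be exported as a JSON array carrying the `failure_reason` field mentioned above (the export shape is an assumption):

```shell
# Tally failure reasons from an exported jobs JSON array (shape assumed).
jobs='[{"failure_reason":"output_policy_quarantined"},
       {"failure_reason":"timeout"},
       {"failure_reason":"output_policy_quarantined"}]'
echo "$jobs" | jq -r '.[].failure_reason' | sort | uniq -c | sort -rn
```

A single dominant reason suggests rule tuning; many distinct reasons alongside a rising `cordum_safety_unavailable_total` points at the policy path itself.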
Existing production alerts already include useful starting thresholds: failed ratio above 10%, dispatch p99 above 1s, stale jobs above 50, and quarantine rate above 1 per second.
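Those thresholds translate directly into alerting rules. A Prometheus rules sketch follows; the alert names, `for` durations, and severity labels are illustrative assumptions, not an existing config:

```yaml
groups:
  - name: cordum-incident-triggers
    rules:
      - alert: CordumDispatchLatencyHigh
        expr: histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 1
        for: 10m
        labels:
          severity: sev2
      - alert: CordumFailedRatioHigh
        expr: >
          rate(cordum_jobs_completed_total{status="failed"}[5m])
          / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001) > 0.1
        for: 10m
        labels:
          severity: sev2
      - alert: CordumStaleJobs
        expr: cordum_scheduler_stale_jobs > 50
        for: 5m
        labels:
          severity: sev2
      - alert: CordumQuarantineSpike
        expr: rate(cordum_output_policy_quarantined_total[5m]) > 1
        for: 5m
        labels:
          severity: sev3
```

The `clamp_min` in the failed-ratio expression avoids division-by-zero alerts during quiet periods.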
Implementation examples
First-15-minute incident checklist (YAML)
```yaml
incident:
  t_plus_0_to_5:
    - assign_incident_commander
    - set_severity_from_metrics
    - freeze_nonessential_deployments
  t_plus_5_to_10:
    - check_safety_kernel_health
    - check_scheduler_dispatch_p99
    - check_stale_jobs_and_reconciler_lock
  t_plus_10_to_15:
    - choose_recovery_path:
        - safety_kernel_restore
        - worker_capacity_rebalance
        - output_policy_rule_tuning
    - publish_customer_status_update
    - open_timeline_doc
```
Metric triage snapshot script (Bash)
```bash
#!/usr/bin/env bash
set -euo pipefail
BASE_URL="${BASE_URL:-http://localhost:9090}"

echo "=== Dispatch p99 ==="
curl -sG "$BASE_URL/api/v1/query" \
  --data-urlencode 'query=histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))' \
  | jq -r '.data.result[0].value[1]'

echo "=== Failed ratio (5m) ==="
curl -sG "$BASE_URL/api/v1/query" \
  --data-urlencode 'query=rate(cordum_jobs_completed_total{status="failed"}[5m]) / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)' \
  | jq -r '.data.result[0].value[1]'

echo "=== Safety unavailable (5m) ==="
curl -sG "$BASE_URL/api/v1/query" \
  --data-urlencode 'query=rate(cordum_safety_unavailable_total[5m])' \
  | jq -r '.data.result[0].value[1]'
```
Lock and policy-path checks (Bash)
```bash
# Redis lock checks (single-writer components)
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# Job lock sample
redis-cli GET "cordum:scheduler:job:JOB_ID"

# Quarantine metric quick check
curl -s http://localhost:9090/metrics | grep output_policy_quarantined

# Safety kernel env inspection
env | grep SAFETY_KERNEL
```
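Interpreting the lock `GET` results consistently matters more than running them. A small helper sketch follows; the healthy/unhealthy mapping assumes standard single-writer semantics (exactly one live holder per lock key) and is not taken from the actual reconciler:

```shell
# Hypothetical helper: classify the result of a GET on a single-writer lock key.
# An empty value means no active holder (component down, or lock expired);
# a non-empty value should match exactly one live pod or process.
classify_lock() {
  local holder="$1"
  if [ -z "$holder" ]; then
    echo "missing: verify the owning component is running before any manual replay"
  else
    echo "held by: $holder"
  fi
}

# Redirect stderr and tolerate a missing redis-cli so the sketch degrades safely.
classify_lock "$(redis-cli GET cordum:reconciler:default 2>/dev/null || true)"
```

The point of the "missing" branch: never start manual replay while the lock holder's status is unknown, since that is exactly how duplicate processing starts.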
Limitations and tradeoffs
- Tight severity thresholds reduce detection latency but increase paging noise during bursty traffic.
- One shared runbook can hide service-specific recovery nuances if it is not versioned per workflow class.
- Automatic replay after incidents improves recovery speed but requires strong idempotency discipline.
- Manual overrides are sometimes necessary, but every override should be auditable.
Next step
Run this in one sprint:
1. Adopt a 3-level severity model with metric-based entry criteria.
2. Add the first-15-minute checklist to your on-call template and status-page process.
3. Drill one safety-kernel outage scenario and one stale-jobs scenario.
4. After each drill, measure time-to-severity and time-to-safe-recovery.
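The drill measurements reduce to timestamp arithmetic on the incident timeline. A minimal sketch, assuming GNU `date` (`date -d`; on macOS use `gdate` from coreutils) and example timestamps:

```shell
# Compute time-to-severity for a drill from two timeline timestamps.
start="2024-05-01T10:00:00Z"         # incident declared
sev_assigned="2024-05-01T10:04:30Z"  # severity set from metrics
tts=$(( $(date -d "$sev_assigned" +%s) - $(date -d "$start" +%s) ))
echo "time-to-severity: ${tts}s"     # compare against the 5-minute checklist window
```

Computing time-to-safe-recovery works the same way, using the timestamp of the chosen recovery path completing instead of the severity assignment.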
Continue with AI Agent SLOs and Error Budgets and AI Agent Poison Message Handling.