
AI Agent Incident Response Runbook

Use repeatable severity gates and recovery steps when autonomous systems misbehave.

Guide · 12 min read · Apr 2026
TL;DR
  • Runbooks fail when triggers are vague and ownership is fuzzy.
  • Autonomous agent incidents need policy-path checks, not only uptime checks.
  • The first 15 minutes should be scripted. Improvisation can wait.
  • Recovery without replay safety can create duplicate side effects.
Fast triage

Use metric thresholds to assign severity before discussing root cause.

Policy-aware

Check safety-kernel and output-policy signals in the first pass.

Deterministic recovery

Use lock/state checks and replay controls before restarting components.

Scope

This runbook is for AI agent control planes running policy checks, queued dispatch, and asynchronous job completion handling.

The production problem

Most incident docs look good until the first real autonomous-agent outage. Then every path looks critical, every team thinks another team owns the issue, and somebody suggests restarting everything.

That approach is expensive. It can clear symptoms and create duplicate side effects if replay and lock state are not verified.

A runbook should answer three questions fast: what severity is this, which component sits at the center of the blast radius, and which safe recovery path should start now.

What top results miss

| Source | Strong coverage | Missing piece |
| --- | --- | --- |
| Google SRE Book: Managing Incidents | Strong incident-command structure and communication discipline. | No concrete guidance for policy-gated AI dispatch paths and replay-safe remediation. |
| Atlassian: How to create an incident response playbook | Solid playbook framework, ownership, and simulation emphasis. | No metric-level trigger design for autonomous agent control planes. |
| PagerDuty Response Docs: Getting Started | Practical role setup, severity levels, and practice loops. | No operational decision tree for retries, quarantines, and distributed lock checks. |

Severity model and ownership

Incident status should come from objective trigger conditions first. Root cause can evolve; severity should still be deterministic.

| Severity | Trigger condition | Primary signal | Owner on point |
| --- | --- | --- | --- |
| SEV-1 | Safety kernel unavailable trend + dispatch backlog growth | `cordum_safety_unavailable_total` rising and jobs stuck | Incident Commander + platform on-call |
| SEV-2 | Dispatch latency p99 sustained above threshold | `histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m])) > 1` | Scheduler owner |
| SEV-2 | Failed completion ratio sustained above baseline | `rate(cordum_jobs_completed_total{status="failed"}[5m]) / rate(cordum_jobs_completed_total[5m]) > 0.1` | Workflow/runtime owner |
| SEV-3 | Output quarantine spike without user-facing impact | `rate(cordum_output_policy_quarantined_total[5m]) > 1` | Safety/policy owner |
| SEV-2 | Stale jobs exceed normal drift window | `cordum_scheduler_stale_jobs > 50` | Scheduler on-call |
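As a sketch, the trigger table can be encoded as a deterministic classifier so responders do not debate severity in the moment. The function below is illustrative, not part of Cordum: it assumes the metric values have already been scraped, and its thresholds mirror the table above.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Deterministic severity from the trigger table above.
# Arguments: safety_unavailable_rate, stale_jobs, dispatch_p99_seconds,
#            failed_ratio, quarantine_rate (all pre-scraped metric values).
classify_severity() {
  local safety="$1" stale="$2" p99="$3" failed="$4" quarantine="$5"
  # SEV-1: safety kernel unavailable while work is backing up
  if awk -v v="$safety" 'BEGIN{exit !(v > 0)}' && [ "$stale" -gt 0 ]; then
    echo "SEV-1"; return
  fi
  # SEV-2: latency, failure ratio, or stale-job thresholds breached
  if awk -v v="$p99" 'BEGIN{exit !(v > 1)}' \
      || awk -v v="$failed" 'BEGIN{exit !(v > 0.1)}' \
      || [ "$stale" -gt 50 ]; then
    echo "SEV-2"; return
  fi
  # SEV-3: quarantine spike without user-facing impact
  if awk -v v="$quarantine" 'BEGIN{exit !(v > 1)}'; then
    echo "SEV-3"; return
  fi
  echo "no-sev"
}

classify_severity 0.2 12 0.4 0.02 0   # → SEV-1 (safety failing + backlog)
```

Because the checks run top-down, a safety outage with backlog always wins over lesser signals, which keeps severity assignment order-independent of who reads the dashboard first.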

Cordum signal map

| Check | Action | Why it matters |
| --- | --- | --- |
| Safety dependency outage | Track `cordum_safety_unavailable_total` and validate safety-kernel gRPC health | Distinguishes a policy-service outage from worker capacity issues. |
| Output quarantine investigation | Inspect `cordum_output_policy_quarantined_total` and job `failure_reason` | Avoids disabling policy controls when the issue is narrow rule tuning. |
| Distributed lock integrity | Check Redis lock keys like `cordum:reconciler:default` | Multiple active reconcilers or missing locks can create duplicate or stuck processing. |
| Replay progress confirmation | Monitor the `cordum_scheduler_orphan_replayed_total` trend after recovery | Confirms stuck pending jobs are being recovered without unsafe manual replay. |
| Policy fail mode awareness | `POLICY_CHECK_FAIL_MODE=closed` is the default; an unavailable policy path requeues with backoff | Explains why throughput drops during policy outages without silently bypassing controls. |
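The fail-closed behavior in the last row can be sketched as a dispatch decision. `handle_job` and its three outcomes are illustrative stand-ins for the real policy-check path, not actual Cordum calls:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Illustrative fail-closed dispatch decision. The second argument is the
# policy check outcome: ok, deny, or unavailable.
handle_job() {
  local job_id="$1" policy_result="$2"
  case "$policy_result" in
    ok)          echo "dispatch $job_id" ;;
    deny)        echo "quarantine $job_id" ;;
    unavailable) echo "requeue $job_id with backoff" ;;  # fail closed: never bypass the check
    *)           echo "unknown policy result" >&2; return 1 ;;
  esac
}

handle_job job-7 unavailable   # → requeue job-7 with backoff
```

The key property is that an unreachable policy service maps to requeue-with-backoff, never to dispatch, which is why throughput drops instead of controls being bypassed.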

Existing production alerts already include useful starting thresholds: failed ratio above 10%, dispatch p99 above 1s, stale jobs above 50, and quarantine rate above 1 per second.

Implementation examples

First-15-minute incident checklist (YAML)

incident-first-15m.yaml
YAML
incident:
  t_plus_0_to_5:
    - assign_incident_commander
    - set_severity_from_metrics
    - freeze_nonessential_deployments
  t_plus_5_to_10:
    - check_safety_kernel_health
    - check_scheduler_dispatch_p99
    - check_stale_jobs_and_reconciler_lock
  t_plus_10_to_15:
    - choose_recovery_path:
        - safety_kernel_restore
        - worker_capacity_rebalance
        - output_policy_rule_tuning
    - publish_customer_status_update
    - open_timeline_doc

Metric triage snapshot script (Bash)

incident-triage.sh
Bash
#!/usr/bin/env bash
set -euo pipefail

BASE_URL="${BASE_URL:-http://localhost:9090}"

# Query the Prometheus HTTP API and print the scalar value only.
# Instant-query results are [timestamp, value] pairs, hence .value[1].
query() {
  curl -sG "$BASE_URL/api/v1/query" --data-urlencode "query=$1" \
    | jq -r '.data.result[0].value[1] // "no data"'
}

echo "=== Dispatch p99 ==="
query 'histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))'

echo "=== Failed ratio (5m) ==="
query 'rate(cordum_jobs_completed_total{status="failed"}[5m]) / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)'

echo "=== Safety unavailable (5m) ==="
query 'rate(cordum_safety_unavailable_total[5m])'

Lock and policy-path checks (Bash)

incident-lock-checks.sh
Bash
#!/usr/bin/env bash

# Redis lock checks (single-writer components)
redis-cli GET "cordum:reconciler:default"
redis-cli GET "cordum:replayer:pending"

# Job lock sample (JOB_ID is a placeholder; substitute a real job identifier)
redis-cli GET "cordum:scheduler:job:JOB_ID"

# Quarantine metric quick check (point at the scheduler's metrics endpoint)
curl -s http://localhost:9090/metrics | grep output_policy_quarantined

# Safety kernel env inspection
env | grep SAFETY_KERNEL
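Before restarting a single-writer component, the lock's TTL tells you whether anyone still holds it. The helper below is a sketch around the output of `redis-cli TTL cordum:reconciler:default`; the decision strings are illustrative:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Decide whether a reconciler restart is safe from observed lock state.
# lock_ttl is the output of `redis-cli TTL <lock-key>`:
#   -2 = key absent, -1 = key exists with no expiry, >0 = seconds remaining.
restart_decision() {
  local lock_ttl="$1"
  case "$lock_ttl" in
    -2) echo "no holder: safe to start exactly one reconciler" ;;
    -1) echo "lock held with no TTL: investigate possible stale holder first" ;;
    *)  echo "active holder (${lock_ttl}s left): do not start a second reconciler" ;;
  esac
}

restart_decision -2   # → no holder: safe to start exactly one reconciler
```

A lock with no TTL is the suspicious case: it usually means a holder died without cleanup, and blindly restarting on top of it is how duplicate reconcilers appear.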

Limitations and tradeoffs

  • Tight severity thresholds reduce detection latency but increase paging noise during bursty traffic.
  • One shared runbook can hide service-specific recovery nuances if not versioned per workflow class.
  • Automatic replay after incidents improves recovery speed but requires strong idempotency discipline.
  • Manual overrides are sometimes necessary, but every override should be auditable.
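The idempotency point above can be sketched as a guard around each side effect. This example uses an atomic local marker file purely for illustration; a real deployment would use a shared store (for instance a Redis `SET key value NX EX ttl`) keyed by job ID:

```shell
#!/usr/bin/env bash
set -euo pipefail

# File-based idempotency guard (illustration only; production would use a
# shared store such as Redis SET ... NX EX, keyed by job ID).
guard_dir="$(mktemp -d)"

perform_once() {
  local job_id="$1"
  # noclobber makes marker creation atomic: the first caller wins.
  if ( set -o noclobber; : > "$guard_dir/$job_id" ) 2>/dev/null; then
    echo "executed $job_id"
  else
    echo "skipped duplicate $job_id"
  fi
}

perform_once job-42   # → executed job-42
perform_once job-42   # → skipped duplicate job-42
```

With a guard like this in front of every external side effect, replaying a job after an incident degrades to a no-op instead of a duplicate action.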

Next step

Run this in one sprint:

  1. Adopt a 3-level severity model with metric-based entry criteria.
  2. Add the first-15-minute checklist to your on-call template and status page process.
  3. Drill one safety-kernel outage scenario and one stale-jobs scenario.
  4. After each drill, measure time-to-severity and time-to-safe-recovery.

Continue with the companion guides "AI Agent SLOs and Error Budgets" and "AI Agent Poison Message Handling".

Runbooks should remove guessing, not add pages

If responders still ask who owns what in minute ten, the playbook is unfinished.