The incident response problem
Teams now have enough telemetry to detect failures quickly. They still lose time in triage and coordination. The hard part is not seeing an alert. The hard part is deciding what to do next, safely.
Microsoft's AutoTSG study analyzed more than 4,000 troubleshooting guides mapped to thousands of incidents. It found real value in guide-driven mitigation and showed automation can extract executable steps with strong quality metrics. The gap is operational control, not signal availability.
Failure mode to avoid
Autonomous remediation without policy boundaries turns MTTR gains into change-failure spikes. Speed without control is expensive.
What top articles cover vs miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| PagerDuty AIOps Quickstart Guide | Solid triage context features (outlier incidents, probable origin, related incidents, change correlation) with low setup overhead. | No concrete risk-tier policy matrix for when remediation should auto-run, require approval, or be denied. |
| Datadog Incident Automations Docs | Strong trigger model, execution history, incident context variables, and service-account guidance for auditability. | No default rollback thresholds or governance pattern for high-risk write actions across heterogeneous systems. |
| Splunk Mission Control Automation Docs | Clear playbook/action flow with prompt checkpoints and incident-type based auto triggering. | Limited guidance on numeric promotion and rollback gates for production remediations. |
This guide focuses on that missing layer: risk-tiered remediation decisions, dual policy checks, approval routing, and rollback gates.
Two-lane architecture
Analysis lane (propose, never execute):
- Classify severity and impacted services
- Correlate with deploy and config timeline
- Generate ranked remediation options
- Write findings into incident timeline

Control lane (decide and execute under policy):
- Policy check before queueing request
- Policy check again before dispatch
- Approval for high-risk write path actions
- Verification and rollback trigger after execution
In Cordum terms, this maps naturally to policy decisions (`ALLOW`, `DENY`, `REQUIRE_APPROVAL`, `ALLOW_WITH_CONSTRAINTS`) and approval records bound to policy snapshot plus job hash.
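The snapshot-plus-hash binding is worth making concrete: an approval granted against one policy version and one job payload should not carry over if either changes. A minimal Python sketch of that check; field names are illustrative, not Cordum's actual schema.

```python
import hashlib
import json
from dataclasses import dataclass


def digest(payload: dict) -> str:
    """Stable hash of a JSON-serializable payload (sorted keys for determinism)."""
    return hashlib.sha256(json.dumps(payload, sort_keys=True).encode()).hexdigest()


@dataclass(frozen=True)
class ApprovalRecord:
    """An approval bound to one policy snapshot and one concrete job."""
    approver: str
    policy_snapshot_hash: str
    job_hash: str


def approval_still_valid(record: ApprovalRecord, current_policy: dict, job: dict) -> bool:
    # If either the policy or the job payload changed since approval, the grant is stale.
    return (record.policy_snapshot_hash == digest(current_policy)
            and record.job_hash == digest(job))
```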
Risk-tier remediation matrix
| Tier | Example actions | Policy | Approval | Notes |
|---|---|---|---|---|
| Low | Enrich alert, fetch traces, annotate timeline, open follow-up ticket | ALLOW_WITH_CONSTRAINTS | No | Read-only operations, no side effects |
| Medium | Restart non-critical pods, clear cache, scale stateless worker pool | ALLOW_WITH_CONSTRAINTS | Conditional | Bounded blast radius and timeout required |
| High | Prod rollout rollback, feature flag off for high-traffic path | REQUIRE_APPROVAL | Yes | Approval bound to policy snapshot and job hash |
| Critical | Schema changes, credential rotations, destructive shell commands | DENY by default | Two-person exception | Execute only through explicit break-glass workflow |
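To keep the matrix enforceable rather than aspirational, the propose-remediation step can resolve each action's tier to a decision before anything is queued. A sketch of that lookup, assuming each proposed action already carries a `risk_tier` label; names and the "conditional approval means prod" simplification are illustrative.

```python
from typing import NamedTuple


class TierOutcome(NamedTuple):
    decision: str
    needs_approval: bool
    note: str


# Mirrors the risk-tier matrix above.
TIER_POLICY = {
    "low": TierOutcome("ALLOW_WITH_CONSTRAINTS", False, "read-only, no side effects"),
    "medium": TierOutcome("ALLOW_WITH_CONSTRAINTS", False, "bounded blast radius and timeout required"),
    "high": TierOutcome("REQUIRE_APPROVAL", True, "approval bound to policy snapshot and job hash"),
    "critical": TierOutcome("DENY", True, "break-glass workflow only"),
}


def resolve_tier(risk_tier: str, env: str = "prod") -> TierOutcome:
    outcome = TIER_POLICY.get(risk_tier.lower())
    if outcome is None:
        # Unknown tiers fail closed, not open.
        return TIER_POLICY["critical"]
    if risk_tier.lower() == "medium" and env == "prod":
        # Conditional approval: medium-tier writes in prod still need a human.
        return outcome._replace(needs_approval=True)
    return outcome
```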
Workflow and policy examples
Start with one deterministic workflow. Keep branches explicit so reviewers can reason about every transition under pressure.
```yaml
name: incident-triage-remediation
trigger: alert.production.fired
steps:
  triage:
    type: action
    topic: job.incident.triage
    timeout_sec: 60
  propose_remediation:
    type: action
    topic: job.incident.propose_remediation
    depends_on: [triage]
    timeout_sec: 120
  policy_precheck:
    type: action
    topic: job.governance.policy_check
    depends_on: [propose_remediation]
  approval_gate:
    type: approval
    depends_on: [policy_precheck]
    when: "{{ steps.policy_precheck.output.decision == 'REQUIRE_APPROVAL' }}"
    timeout_sec: 1800
    on_timeout: escalate
  remediate:
    type: action
    topic: job.incident.remediate
    depends_on: [policy_precheck, approval_gate]
    timeout_sec: 300
  verify:
    type: action
    topic: job.incident.verify
    depends_on: [remediate]
    timeout_sec: 120
  rollback:
    type: action
    topic: job.incident.rollback
    depends_on: [verify]
    when: "{{ steps.verify.output.passed == false }}"
```
The policy checks at queue time and again at dispatch evaluate a declarative rules file that encodes the risk-tier matrix:

```yaml
version: v1
rules:
  - id: deny-destructive-prod-shell
    match:
      topic: "job.exec.shell"
      labels:
        env: prod
        command_class: destructive
    decision: DENY
    reason: "destructive shell command blocked"
  - id: approval-prod-remediation
    match:
      topic: "job.incident.remediate"
      labels:
        env: prod
        risk_tier: high
    decision: REQUIRE_APPROVAL
    reason: "high-risk production remediation"
  - id: constrained-medium-risk
    match:
      topic: "job.incident.remediate"
      labels:
        risk_tier: medium
    decision: ALLOW_WITH_CONSTRAINTS
    constraints:
      max_runtime_sec: 180
      max_restarts: 2
      allowed_namespaces: ["staging", "prod-edge"]
  - id: allow-readonly-triage
    match:
      topic: "job.incident.triage"
    decision: ALLOW
```

A small promotion-gate script decides whether the automation canary may advance, using the operational gates defined below:

```bash
#!/usr/bin/env bash
# promotion-gate.sh
set -euo pipefail

UNAPPROVED_HIGH_RISK=$(curl -s "$API/metrics/unapproved-high-risk?window=10m")
P95_TRIAGE_MS=$(curl -s "$API/metrics/triage-p95-ms?window=10m")
VERIFY_PASS_RATE=$(curl -s "$API/metrics/remediation-verify-pass-rate?window=24h")

if [ "$UNAPPROVED_HIGH_RISK" -gt 0 ]; then
  echo "BLOCK: unapproved high-risk remediation detected"
  exit 1
fi

if [ "$P95_TRIAGE_MS" -gt 180000 ]; then
  echo "BLOCK: triage latency breach"
  exit 1
fi

if (( $(echo "$VERIFY_PASS_RATE < 0.90" | bc -l) )); then
  echo "BLOCK: verification pass rate below threshold"
  exit 1
fi

echo "PASS: canary promotion allowed"
```
Operational gates
| Gate | Target | Rollback trigger | Owner |
|---|---|---|---|
| P95 triage latency | <= 90s | > 180s for 10m | Incident Platform |
| Incorrect triage classification | < 10% weekly | > 20% weekly | SRE + Service Owners |
| Remediation verification pass rate | >= 95% | < 90% daily | Automation On-call |
| Unapproved high-risk executions | 0 | > 0 immediate stop | Governance |
| Approval queue median wait | <= 10m | > 20m for 30m | Incident Command |
| Post-remediation regression | < 2% | > 5% daily | Service Team |
If your team cannot name an owner per gate, the gate does not exist in practice.
Limitations and tradeoffs
Approval bottlenecks
Too many high-risk classifications can create long queues. Calibrate risk tiers with incident review data every week during rollout.
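One concrete calibration signal is the approval grant rate per action type: a high-tier action that is always approved unchanged is a candidate for demotion to medium. A sketch of that analysis, assuming an exported list of approval events with illustrative field names:

```python
from collections import Counter


def demotion_candidates(approval_events, min_requests=5, grant_rate_floor=0.95):
    """Suggest high-tier action types that are (almost) always approved unchanged.

    `approval_events` is an iterable of dicts with "action" and "outcome"
    ("granted", "granted_with_changes", or "denied") -- field names are illustrative.
    """
    requests = Counter(e["action"] for e in approval_events)
    granted = Counter(e["action"] for e in approval_events if e["outcome"] == "granted")
    return sorted(
        action
        for action, total in requests.items()
        if total >= min_requests and granted[action] / total >= grant_rate_floor
    )
```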
Model drift in triage quality
Triage labels degrade when service topology changes. Require periodic replay tests on recent incidents to keep confidence scores honest.
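A lightweight replay test runs recent, human-reviewed incidents back through the current triage path and compares labels. A sketch under the assumption that reviewed incidents are available with a ground-truth severity; function and field names are illustrative.

```python
def replay_accuracy(incidents, triage_fn):
    """Fraction of recent incidents where automated triage matches the reviewed label.

    `incidents` is an iterable of dicts with "signal" (input to triage) and
    "reviewed_severity" (ground truth from incident review); `triage_fn` is the
    same classifier the live workflow calls.
    """
    scored = [triage_fn(i["signal"]) == i["reviewed_severity"] for i in incidents]
    return sum(scored) / len(scored) if scored else 0.0


def replay_gate(incidents, triage_fn, floor=0.90) -> bool:
    # Block rollout of a new triage model or prompt if replay accuracy drops below the floor.
    return replay_accuracy(incidents, triage_fn) >= floor
```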
Rollback side effects
Some rollback actions are not perfectly reversible. Keep compensating procedures explicit in runbooks before enabling automation.