## The production problem
Teams often complete incident reviews quickly and still repeat the same class of outage. The missing piece is usually not effort. It is evidence quality.
Generic postmortem templates capture timeline and impact, but autonomous-agent systems need extra detail: policy path, replay integrity, and lock behavior under load.
If these fields are missing, corrective actions become broad refactors instead of targeted reliability fixes.
## What top results miss
| Source | Strong coverage | Missing piece |
|---|---|---|
| Google SRE: Postmortem Culture | Excellent blameless philosophy and criteria for when postmortems should be written. | No concrete template fields for policy-gated autonomous execution paths. |
| Atlassian: Incident Postmortem Template | Strong section-by-section template with Five Whys and corrective actions. | No distributed-lock or replay-integrity checks for queue-driven AI control planes. |
| PagerDuty: The Blameless Postmortem | Great guidance on blame-aware language and cognitive bias countermeasures. | No metric-backed evidence checklist for modern agent orchestration stacks. |
## Template structure
A useful postmortem template reduces ambiguity during writing and increases cross-incident comparability. Keep sections stable across incidents.
| Section | Must include | Weak pattern to avoid |
|---|---|---|
| Incident summary | Customer impact, duration, severity, affected workflows | Only technical symptoms and no user/business impact |
| Timeline | Detection, response, recovery, and decision timestamps | Retroactive timeline built from memory only |
| Contributing factors | Load conditions, config drift, dependency health, governance mode | Single-cause storyline for complex failure chains |
| Policy path analysis | Denied, deferred, quarantined, and fail-mode decisions | Treating all blocked jobs as generic platform failure |
| Corrective actions | Owner, due date, priority, and objective verification metric | Action list with no ownership or closure signal |
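Keeping sections stable also makes drafts lintable. A minimal Python sketch, assuming postmortems are drafted in Markdown; the function name and the exact section strings are illustrative, not a fixed tooling API:

```python
# Sketch: lint a postmortem draft for the stable sections listed above.
# REQUIRED_SECTIONS mirrors the template table; adjust to your own headings.

REQUIRED_SECTIONS = (
    "Incident summary",
    "Timeline",
    "Contributing factors",
    "Policy path analysis",
    "Corrective actions",
)

def missing_sections(draft: str) -> list[str]:
    """Return required section titles absent from the draft (case-insensitive)."""
    lowered = draft.lower()
    return [s for s in REQUIRED_SECTIONS if s.lower() not in lowered]
```

A check like this can run in CI against the postmortem repository, so an incomplete draft fails review before it reaches sign-off.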
## Cordum evidence map
| Implication | Current behavior | Why it matters |
|---|---|---|
| User-impact severity signals | Dispatch p99 > 1s and failed ratio > 10% are existing production thresholds | You can map severity directly to known operational alert boundaries. |
| Policy dependency evidence | `cordum_safety_unavailable_total` plus safety-kernel health checks | Shows whether the incident was dependency outage versus worker/executor pressure. |
| Output governance evidence | `cordum_output_policy_quarantined_total` and job `failure_reason` samples | Distinguishes valid policy intervention from false-positive scanner behavior. |
| Replay integrity evidence | `cordum_scheduler_orphan_replayed_total` trend and stale-jobs gauge | Verifies backlog recovery and highlights latent incident debt after service restore. |
| Distributed lock evidence | Redis lock keys such as `cordum:reconciler:default` and per-job lock keys | Confirms single-writer assumptions during and after multi-replica incidents. |
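The first row's mapping from alert boundaries to severity can be made explicit in code. A minimal Python sketch, assuming the thresholds above (dispatch p99 > 1s, failed ratio > 10%); the function name and the SEV tiering are illustrative, not part of any Cordum API:

```python
# Sketch: suggest a severity label from known production alert boundaries.
# Thresholds come from the evidence map above; the tiering is illustrative.

def suggest_severity(dispatch_p99_seconds: float,
                     failed_ratio: float,
                     safety_unavailable_rate: float) -> str:
    """Return a suggested severity from how many alert boundaries were breached."""
    breaches = 0
    if dispatch_p99_seconds > 1.0:      # existing dispatch p99 boundary
        breaches += 1
    if failed_ratio > 0.10:             # existing failed-completion boundary
        breaches += 1
    if safety_unavailable_rate > 0.0:   # any policy-dependency degradation
        breaches += 1
    if breaches >= 2:
        return "SEV-1"
    if breaches == 1:
        return "SEV-2"
    return "SEV-3"
```

Encoding the mapping this way keeps severity arguments out of the postmortem review itself: the label follows from the same boundaries that paged the responder.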
## Implementation examples
### Blameless postmortem template (Markdown)

```markdown
# Incident Postmortem - Blameless Template

## 1) Incident Summary
- Incident ID:
- Severity:
- Start time / End time:
- Customer impact summary:
- Affected workflows/topics:

## 2) Detection and Timeline
- Detection source:
- First alert timestamp:
- Response milestones:
- Recovery timestamp:

## 3) Technical Impact Evidence
- Dispatch p99 during incident:
- Failed completion ratio:
- Stale jobs peak:
- Quarantine rate:

## 4) Policy Path Analysis
- safety_unavailable events:
- denied/quarantined event counts:
- POLICY_CHECK_FAIL_MODE at incident time:
- Any temporary policy overrides:

## 5) Contributing Factors (What/How framing)
- What conditions existed?
- How did they interact?
- What signals were missing or noisy?

## 6) Replay and Consistency Checks
- Orphan replay trend:
- Reconciler lock evidence:
- Duplicate side-effect check result:

## 7) Corrective Actions

| Action | Owner | Priority | Due Date | Verification Metric |
|--------|-------|----------|----------|---------------------|
|        |       |          |          |                     |

## 8) Lessons Learned
- What went well:
- What slowed response:
- What to change in runbooks/alerts:
```
### Evidence query pack (PromQL)

```promql
# Dispatch p99
histogram_quantile(0.99, rate(cordum_scheduler_dispatch_latency_seconds_bucket[5m]))

# Failed completion ratio
rate(cordum_jobs_completed_total{status="failed"}[5m])
  / clamp_min(rate(cordum_jobs_completed_total[5m]), 0.001)

# Safety dependency degradation
rate(cordum_safety_unavailable_total[5m])

# Output quarantine trend
rate(cordum_output_policy_quarantined_total[5m])
```

### Corrective action schema (YAML)

```yaml
postmortem_actions:
  - id: PM-001
    title: Add scheduler lock-state alert
    owner: platform-oncall
    priority: p1
    due_date: 2026-04-14
    verification:
      metric: cordum_scheduler_stale_jobs
      target: "<= 10 peak during weekly load test"
    evidence_link: "https://internal/wiki/load-test-2026-04-14"
  - id: PM-002
    title: Harden safety-kernel TLS health probes
    owner: security-platform
    priority: p1
    due_date: 2026-04-10
    verification:
      metric: cordum_safety_unavailable_total
      target: "0 sustained spikes in 14 days"
    evidence_link: "https://internal/wiki/safety-kernel-probe-rollout"
```

## Limitations and tradeoffs
- Rich templates increase writing time; poor tooling makes this friction obvious.
- Overly strict template enforcement can suppress incident-specific nuance.
- Blameless language can be misused to avoid direct ownership if action tracking is weak.
- Metrics alone are insufficient; responder decision context still matters for learning.
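The weak-action-tracking risk is easiest to counter mechanically. A minimal Python sketch that validates corrective-action entries against the required fields (owner, due date, priority, verification metric); the dict shape mirrors the YAML schema, but the function and field handling are illustrative:

```python
# Sketch: check corrective actions for the fields the template requires.
# Returns (action_id, problem) tuples; an empty list means all actions pass.

REQUIRED_FIELDS = ("id", "title", "owner", "priority", "due_date", "verification")
REQUIRED_VERIFICATION = ("metric", "target")

def validate_actions(actions: list[dict]) -> list[tuple[str, str]]:
    """Flag corrective actions missing ownership or verification fields."""
    problems = []
    for action in actions:
        action_id = action.get("id", "<missing id>")
        for field in REQUIRED_FIELDS:
            if not action.get(field):
                problems.append((action_id, f"missing {field}"))
        verification = action.get("verification") or {}
        for field in REQUIRED_VERIFICATION:
            if not verification.get(field):
                problems.append((action_id, f"missing verification.{field}"))
    return problems
```

Run against the parsed `postmortem_actions` list, this turns "owner + due date + verification metric" from a review-time convention into a hard gate.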
## Next step
Run this in one sprint:
1. Replace your current template with the structure above for all SEV-1 and SEV-2 incidents.
2. Require one policy-path evidence item and one replay-integrity evidence item per postmortem.
3. Enforce owner + due date + verification metric on every corrective action.
4. Review action closure rate after 30 days and adjust template fields where completion lags.
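The closure-rate review in the last step reduces to one number once actions carry a completion status. A minimal sketch, assuming an illustrative `status` key set to `closed` when an action's verification metric has been met (the key is not part of the schema above):

```python
# Sketch: 30-day closure rate over corrective actions.
# Assumes an illustrative "status" key marked "closed" once the
# action's verification metric has been met.

def closure_rate(actions: list[dict]) -> float:
    """Fraction of corrective actions closed; 0.0 for an empty list."""
    if not actions:
        return 0.0
    closed = sum(1 for a in actions if a.get("status") == "closed")
    return closed / len(actions)
```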
Continue with *AI Agent Incident Response Runbook* and *AI Agent SLOs and Error Budgets*.